Changes between Version 6 and Version 7 of Public/User_Guide/TAMPI


Timestamp: Jun 19, 2019, 4:10:41 PM
Author: Pedro Martinez-Ferror

= Quick Overview =

The **Task-Aware MPI** (TAMPI) library ensures **deadlock-free** execution of hybrid applications by implementing a cooperation mechanism between the MPI library and a parallel task-based runtime system.

TAMPI extends the functionality of standard MPI libraries by providing new mechanisms for improving the interoperability between parallel task-based programming models, such as **OpenMP** or **!OmpSs-2**, and both **blocking** and **non-blocking** MPI operations.
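
As a rough illustration of what this cooperation enables, the minimal sketch below (not taken from this guide) calls a blocking `MPI_Recv` from inside an **!OmpSs-2** task. It assumes TAMPI's blocking mode, requested through the `MPI_TASK_MULTIPLE` threading level provided by `TAMPI.h`:

{{{
#include <mpi.h>
#include <TAMPI.h>

int main(int argc, char **argv)
{
    // TAMPI's blocking mode is requested with the MPI_TASK_MULTIPLE
    // threading level, which TAMPI.h defines on top of the standard levels.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf = 0.0;
    if (rank == 0) {
        buf = 42.0;
        #pragma oss task in(buf)    // task sending the value to rank 1
        MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Blocking receive inside a task: TAMPI pauses the task instead of
        // blocking the core, so other ready tasks can run and no deadlock occurs.
        #pragma oss task out(buf)
        MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    #pragma oss taskwait

    MPI_Finalize();
    return 0;
}
}}}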
     
We highly recommend logging in interactively to a **cluster module (CM) node** to begin using TAMPI.

In most cases a truly hybrid application should simply execute two MPI ranks, each one on a different NUMA socket, in order to mitigate suboptimal memory accesses. Such an application will then use all the cores/threads available on each NUMA socket to run a shared-memory parallel instance of the same binary.

The command below requests an entire CM node for an interactive session with 2 MPI ranks (1 MPI rank per NUMA socket), each rank using the 12 **physical cores** available on its socket (i.e. multi-threading ignored):

`srun -p dp-cn -N 1 -n 2 -c 12 --pty /bin/bash -i`
     
Once you have entered a CM node, you can check the system affinity via the **NUMA command** `srun numactl --show`:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 12 --pty /bin/bash -i
$ srun numactl --show
policy: bind
preferred node: 1
}}}

Note that loading the TAMPI module will automatically load the **!OmpSs-2** and **Parastation MPI** modules (this MPI library has been compiled with multi-threading support enabled).

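Once these modules are loaded, a hybrid source file can be compiled and linked against TAMPI. The sketch below is only an illustration: the module name, the `TAMPI_HOME` variable and the compiler/linker flags are assumptions, so adapt them to the module tree of your system (e.g. check `module avail`):

{{{
# Module name, TAMPI_HOME and flags below are assumptions; adapt to your system.
module load TAMPI            # also pulls in the OmpSs-2 and ParaStation MPI modules

# Compile a hybrid OmpSs-2 + MPI source with the Mercurium compiler through the
# MPI wrapper (ParaStation MPI is MPICH-based, so MPICH_CC selects the backend
# compiler) and link it against the TAMPI library.
MPICH_CC=mcc mpicc --ompss-2 app.c -o app \
    -I"$TAMPI_HOME/include" -L"$TAMPI_HOME/lib" -ltampi
}}}
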
You might want to request more MPI ranks per socket depending on your particular application. See the examples below together with the corresponding system affinity report:

`srun -p dp-cn -N 1 -n 4 -c 6 --pty /bin/bash -i`

{{{
$ srun -p dp-cn -N 1 -n 4 -c 6 --pty /bin/bash -i
$ srun numactl --show
policy: bind
     
}}}

Finally, if you would like to take advantage of multi-threading, we recommend running your application as a regular job via the `srun` command rather than inside an ''interactive'' session. For example, requesting 1 CM node with 2 MPI ranks (1 MPI rank per socket) and 24 threads per socket (multi-threading enabled) via a regular job submission:

`srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show`

yields the following system affinity:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show
}}}

which indicates that each MPI rank is bound to a single NUMA socket.

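If you also want to verify the binding from inside the application itself, the minimal C sketch below (not part of this guide) prints the CPUs each MPI rank is allowed to run on using `sched_getaffinity`, which can be cross-checked against the `numactl` report:

{{{
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Query the affinity mask of the calling process (pid 0 = self).
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    printf("rank %d bound to %d CPUs:", rank, CPU_COUNT(&mask));
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");

    MPI_Finalize();
    return 0;
}
}}}
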
On the other hand, when allocating an interactive session for the same purpose:

`srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i`

one realises that the binding remains interleaved across the two NUMA sockets, thus yielding **suboptimal performance**:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i