Changes between Version 5 and Version 6 of Public/User_Guide/TAMPI


Timestamp: Jun 19, 2019, 3:58:26 PM (5 years ago)
Author: Pedro Martinez-Ferror

  • Public/User_Guide/TAMPI

    v5 v6  
We highly recommend interactively logging in to a **cluster module (CM) node** to begin using TAMPI.

Removed: **Presently, it seems that system affinity is not correctly set up for hybrid applications using multi-threading, therefore multi-threading will be ignored from now on.**

A truly hybrid application should simply execute two MPI ranks, one on each NUMA socket, to mitigate suboptimal memory accesses. Such an application will then use all the cores/threads available on each NUMA socket to run a shared-memory parallel application.
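As a concrete illustration of this layout, below is a minimal sketch (in C, with OmpSs-2 task annotations) of such a hybrid application: each MPI rank creates tasks that the OmpSs-2 runtime spreads over the cores of its socket. The kernel, block count and block size are purely illustrative and not part of TAMPI; TAMPI itself becomes relevant once tasks of this kind start issuing MPI calls. Building it requires an OmpSs-2-capable compiler and the MPI wrappers provided by the modules described below.

{{{
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative shared-memory kernel: fill one block of a rank-local array. */
static void init_block(double *block, int n, double value)
{
    for (int i = 0; i < n; ++i)
        block[i] = value;
}

int main(int argc, char *argv[])
{
    /* Multi-threaded MPI support is requested because the OmpSs-2 tasks
     * of a rank run on several threads (TAMPI builds on top of this). */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not granted\n");

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* One rank per NUMA socket; the tasks below use the cores/threads
     * of that socket (block count and size are made-up values). */
    const int nblocks = 24;
    const int bsize   = 1 << 16;
    double *data = malloc((size_t)nblocks * bsize * sizeof(double));

    for (int b = 0; b < nblocks; ++b) {
        double *block = data + (size_t)b * bsize;
        #pragma oss task out(block[0;bsize])
        init_block(block, bsize, (double)rank);
    }
    #pragma oss taskwait

    printf("rank %d of %d: %d tasks completed\n", rank, nranks, nblocks);

    free(data);
    MPI_Finalize();
    return 0;
}
}}}

Launched with two ranks (one per socket), each rank then runs its tasks on the cores/threads of its own NUMA domain, as shown by the affinity reports further below.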
     
`module load TAMPI`

Removed: Note that loading the TAMPI module will automatically load the **!OmpSs-2** and **Parastation MPI** modules (notice that this MPI library has been compiled with multi-threading support enabled).

Removed: You might want to request more MPI ranks per socket depending on your particular application. See the examples below and the system affinity report (note that all of them ignore multi-threading):

Added: Note that loading the TAMPI module will automatically load the **!OmpSs-2** and **Parastation MPI** modules (this MPI library has been compiled with multi-threading support enabled).

Added: You might want to request more MPI ranks per socket depending on your particular application. See the examples below and the system affinity report:

`srun -p dp-cn -N 1 -n 4 -c 6 --pty /bin/bash -i`
     
membind: 1
}}}

Finally, if you would like to use 1 CM node with 2 MPI ranks (one MPI rank per socket) and 24 threads per socket (taking advantage of multi-threading), it is recommended to invoke the `srun` command as a regular job instead of running an ''interactive'' session. Indeed, when running `srun` as a regular job:

`srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show`

the reported system affinity is:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show
policy: bind
preferred node: 0
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
cpubind: 0
nodebind: 0
membind: 0
policy: bind
preferred node: 1
physcpubind: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 1
nodebind: 1
membind: 1
}}}
which indicates that each MPI rank is bound to a single NUMA socket.

On the other hand, when running inside an interactive session:

`srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i`

the binding remains interleaved between the two NUMA sockets, which yields **suboptimal performance**:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i
$ srun numactl --show
policy: bind
preferred node: 0
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1
policy: bind
preferred node: 0
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1
nodebind: 0 1
membind: 0 1
}}}
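Besides `numactl --show`, you may also want to verify the binding from within the application itself. The following is a small, self-contained sketch (not part of TAMPI; it assumes a Linux system with glibc's `sched_getaffinity`) that makes every MPI rank report the CPUs it is allowed to run on:

{{{
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

/* Print, for every MPI rank, the CPUs this process may run on
 * (the same information as the "physcpubind" line of numactl --show). */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char buf[4096];
    size_t used = 0;
    buf[0] = '\0';
    for (int cpu = 0; cpu < CPU_SETSIZE && used + 8 < sizeof(buf); ++cpu)
        if (CPU_ISSET(cpu, &mask))
            used += (size_t)snprintf(buf + used, sizeof(buf) - used, " %d", cpu);

    printf("rank %d physcpubind:%s\n", rank, buf);

    MPI_Finalize();
    return 0;
}
}}}

Built with the MPI compiler wrapper of the loaded Parastation MPI module (typically `mpicc`) and launched with the non-interactive `srun` line shown above, each rank should list only the hardware threads of its own socket, matching the first `numactl` report; inside an interactive session it would show the interleaved binding instead.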
----