Changes between Version 5 and Version 6 of Public/User_Guide/TAMPI


Timestamp: Jun 19, 2019, 3:58:26 PM (5 years ago)
Author: Pedro Martinez-Ferror

  • Public/User_Guide/TAMPI

    v5 v6  
We highly recommend interactively logging in to a **cluster module (CM) node** to begin using TAMPI.

Removed: **Presently, it seems that system affinity is not correctly set up for hybrid applications using multi-threading, therefore multi-threading will be ignored from now on.**

A truly hybrid application should simply execute two MPI ranks, one on each NUMA socket, to mitigate suboptimal memory accesses. Such an application will then use all the cores/threads available on each NUMA socket to run a shared-memory parallel application.
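As a concrete illustration of this layout, below is a minimal sketch (in C, with OmpSs-2 task annotations) of such a hybrid application: each MPI rank creates tasks that the OmpSs-2 runtime spreads over the cores of its socket. The kernel, block count and block size are purely illustrative and not part of TAMPI; TAMPI itself becomes relevant once tasks of this kind start issuing MPI calls. Building it requires an OmpSs-2-capable compiler and the MPI wrappers provided by the modules described below.

{{{
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative shared-memory kernel: fill one block of a rank-local array. */
static void init_block(double *block, int n, double value)
{
    for (int i = 0; i < n; ++i)
        block[i] = value;
}

int main(int argc, char *argv[])
{
    /* Multi-threaded MPI support is requested because the OmpSs-2 tasks
     * of a rank run on several threads (TAMPI builds on top of this). */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not granted\n");

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* One rank per NUMA socket; the tasks below use the cores/threads
     * of that socket (block count and size are made-up values). */
    const int nblocks = 24;
    const int bsize   = 1 << 16;
    double *data = malloc((size_t)nblocks * bsize * sizeof(double));

    for (int b = 0; b < nblocks; ++b) {
        double *block = data + (size_t)b * bsize;
        #pragma oss task out(block[0;bsize])
        init_block(block, bsize, (double)rank);
    }
    #pragma oss taskwait

    printf("rank %d of %d: %d tasks completed\n", rank, nranks, nblocks);

    free(data);
    MPI_Finalize();
    return 0;
}
}}}

Launched with two ranks (one per socket), each rank then runs its tasks on the cores/threads of its own NUMA domain, as shown by the affinity reports further below.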
     
`module load TAMPI`

Removed: Note that loading the TAMPI module will automatically load the **!OmpSs-2** and **Parastation MPI** modules (notice that this MPI library has been compiled with multi-threading support enabled).

Removed: You might want to request more MPI ranks per socket depending on your particular application. See the examples below and the system affinity report (note that all of them ignore multi-threading):

Added: Note that loading the TAMPI module will automatically load the **!OmpSs-2** and **Parastation MPI** modules (this MPI library has been compiled with multi-threading support enabled).

Added: You might want to request more MPI ranks per socket depending on your particular application. See the examples below and the system affinity report:

`srun -p dp-cn -N 1 -n 4 -c 6 --pty /bin/bash -i`
     
membind: 1
}}}

Finally, if you would like to use 1 CM node with 2 MPI ranks (one MPI rank per socket) and 24 threads per socket (taking advantage of multi-threading), it is recommended to invoke the `srun` command as a regular job instead of running an ''interactive'' session. Indeed, when running `srun` as a regular job:

`srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show`

the reported system affinity is:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show
policy: bind
preferred node: 0
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
cpubind: 0
nodebind: 0
membind: 0
policy: bind
preferred node: 1
physcpubind: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 1
nodebind: 1
membind: 1
}}}
which indicates that each MPI rank is bound to a single NUMA socket.

On the other hand, when running inside an interactive session:

`srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i`

the binding remains interleaved between the two NUMA sockets, which yields **suboptimal performance**:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i
$ srun numactl --show
policy: bind
preferred node: 0
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1
policy: bind
preferred node: 0
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1
nodebind: 0 1
membind: 0 1
}}}
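Besides `numactl --show`, you may also want to verify the binding from within the application itself. The following is a small, self-contained sketch (not part of TAMPI; it assumes a Linux system with glibc's `sched_getaffinity`) that makes every MPI rank report the CPUs it is allowed to run on:

{{{
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

/* Print, for every MPI rank, the CPUs this process may run on
 * (the same information as the "physcpubind" line of numactl --show). */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char buf[4096];
    size_t used = 0;
    buf[0] = '\0';
    for (int cpu = 0; cpu < CPU_SETSIZE && used + 8 < sizeof(buf); ++cpu)
        if (CPU_ISSET(cpu, &mask))
            used += (size_t)snprintf(buf + used, sizeof(buf) - used, " %d", cpu);

    printf("rank %d physcpubind:%s\n", rank, buf);

    MPI_Finalize();
    return 0;
}
}}}

Built with the MPI compiler wrapper of the loaded Parastation MPI module (typically `mpicc`) and launched with the non-interactive `srun` line shown above, each rank should list only the hardware threads of its own socket, matching the first `numactl` report; inside an interactive session it would show the interleaved binding instead.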
----