Changes between Version 6 and Version 7 of Public/User_Guide/TAMPI
Timestamp: Jun 19, 2019, 4:10:41 PM
= Quick Overview =

The **Task-Aware MPI** or TAMPI library ensures a **deadlock-free** execution of hybrid applications by implementing a cooperation mechanism between the MPI library and a parallel task-based runtime system.

TAMPI extends the functionality of standard MPI libraries by providing new mechanisms for improving the interoperability between parallel task-based programming models, such as **OpenMP** or **!OmpSs-2**, and both **blocking** and **non-blocking** MPI operations.
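For illustration, the sketch below shows the kind of hybrid code TAMPI enables. It assumes TAMPI's blocking mode, i.e. the `TAMPI.h` header and the `MPI_TASK_MULTIPLE` threading level the library defines, with blocking MPI calls issued from inside !OmpSs-2 tasks. It is a minimal sketch rather than a complete application; consult the TAMPI documentation for the exact build and link flags of your installation.

{{{
#include <mpi.h>
#include <TAMPI.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    /* MPI_TASK_MULTIPLE is the threading level added by TAMPI: when it is
       granted, a blocking MPI call inside a task suspends only that task,
       so the core can keep executing other ready tasks. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
    if (provided != MPI_TASK_MULTIPLE) {
        fprintf(stderr, "TAMPI task awareness not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[N];

    if (rank == 0) {
        for (int i = 0; i < N; ++i)
            buf[i] = (double) i;
        /* Blocking send wrapped in an OmpSs-2 task. */
        #pragma oss task in(buf)
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive in a task; the in(buf) task below only starts
           once the receive has completed. */
        #pragma oss task out(buf)
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        #pragma oss task in(buf)
        printf("rank 1 received buf[0] = %.1f\n", buf[0]);
    }

    /* All communication tasks must finish before MPI_Finalize. */
    #pragma oss taskwait
    MPI_Finalize();
    return 0;
}
}}}

With a plain `MPI_THREAD_MULTIPLE` library, tasks blocked in `MPI_Send`/`MPI_Recv` would pin their host threads and could deadlock once all cores are occupied; with `MPI_TASK_MULTIPLE` granted, TAMPI suspends the communicating task and hands the core back to the runtime.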
…

We highly recommend interactively logging in to a **cluster module (CM) node** to begin using TAMPI.

In most cases a truly hybrid application should simply execute two MPI ranks, each one on a different NUMA socket, in order to mitigate suboptimal memory accesses. Such an application will then use all the cores/threads available on each NUMA socket to run a shared-memory parallel instance of the same binary.

The command below requests an entire CM node for an interactive session with 2 MPI ranks (1 MPI rank per NUMA socket) and each rank using the 12 **physical cores** available on each socket (i.e. multi-threading ignored):

`srun -p dp-cn -N 1 -n 2 -c 12 --pty /bin/bash -i`

Once you have entered a CM node, you can check the system affinity via the **NUMA command** `srun numactl --show`:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 12 --pty /bin/bash -i
$ srun numactl --show
policy: bind
preferred node: 1
...
}}}

…

Note that loading the TAMPI module will automatically load the **!OmpSs-2** and **Parastation MPI** modules (this MPI library has been compiled with multi-threading support enabled).

You might want to request more MPI ranks per socket depending on your particular application. See the examples below together with the corresponding system affinity reports:

`srun -p dp-cn -N 1 -n 4 -c 6 --pty /bin/bash -i`

{{{
$ srun -p dp-cn -N 1 -n 4 -c 6 --pty /bin/bash -i
$ srun numactl --show
policy: bind
...
}}}

…

Finally, if you would like to take advantage of multi-threading, we recommend running your application as a regular job via the `srun` command rather than inside an interactive session.
For example, requesting 1 CM node with 2 MPI ranks (1 MPI rank per socket) and 24 threads per socket (multi-threading enabled) via a regular job submission:

`srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show`

yields the following system affinity:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show
...
}}}
which indicates that each MPI rank is bound to a single NUMA socket.

On the other hand, when allocating an interactive session for the same purpose:

`srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i`

the binding remains interleaved across the two NUMA sockets, thus yielding **suboptimal performance**:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i
...
}}}
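As a complement to `numactl --show`, you can also verify the binding from inside the application itself. The sketch below is our illustration, not part of the guide: every rank prints the CPUs it is allowed to run on, using the standard Linux `sched_getaffinity` call. Compile it with `mpicc` and launch it with the same `srun` lines shown above.

{{{
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* Affinity mask of the calling process (pid 0 = "myself"). */
    cpu_set_t mask;
    sched_getaffinity(0, sizeof(mask), &mask);

    /* Collect the allowed CPU ids into a printable list. */
    char cpus[4096] = "";
    int pos = 0;
    for (int c = 0; c < CPU_SETSIZE && pos < (int) sizeof(cpus) - 8; ++c)
        if (CPU_ISSET(c, &mask))
            pos += snprintf(cpus + pos, sizeof(cpus) - pos, "%d ", c);

    printf("rank %d on %s: %d CPUs allowed { %s}\n",
           rank, host, CPU_COUNT(&mask), cpus);

    MPI_Finalize();
    return 0;
}
}}}

Run as a regular job with the `--ntasks-per-socket=1` line above, each rank should report only the hardware threads of its own socket; run from the interactive session instead, the interleaved masks make the suboptimal binding directly visible.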