Changes between Version 6 and Version 7 of Public/User_Guide/TAMPI
Timestamp: Jun 19, 2019, 4:10:41 PM
= Quick Overview =

The **Task-Aware MPI** or TAMPI library ensures a **deadlock-free** execution of hybrid applications by implementing a cooperation mechanism between the MPI library and a parallel task-based runtime system.

TAMPI extends the functionality of standard MPI libraries by providing new mechanisms for improving the interoperability between parallel task-based programming models, such as **OpenMP** or **!OmpSs-2**, and both **blocking** and **non-blocking** MPI operations.
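For illustration, the sketch below shows the kind of hybrid code TAMPI enables. It assumes TAMPI's blocking mode, i.e. the `TAMPI.h` header and the `MPI_TASK_MULTIPLE` threading level the library defines, with blocking MPI calls issued from inside !OmpSs-2 tasks. It is a minimal sketch rather than a complete application; consult the TAMPI documentation for the exact build and link flags of your installation.

{{{
#include <mpi.h>
#include <TAMPI.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    /* MPI_TASK_MULTIPLE is the threading level added by TAMPI: when it is
       granted, a blocking MPI call inside a task suspends only that task,
       so the core can keep executing other ready tasks. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
    if (provided != MPI_TASK_MULTIPLE) {
        fprintf(stderr, "TAMPI task awareness not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[N];

    if (rank == 0) {
        for (int i = 0; i < N; ++i)
            buf[i] = (double) i;
        /* Blocking send wrapped in an OmpSs-2 task. */
        #pragma oss task in(buf)
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive in a task; the in(buf) task below only starts
           once the receive has completed. */
        #pragma oss task out(buf)
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        #pragma oss task in(buf)
        printf("rank 1 received buf[0] = %.1f\n", buf[0]);
    }

    /* All communication tasks must finish before MPI_Finalize. */
    #pragma oss taskwait
    MPI_Finalize();
    return 0;
}
}}}

With a plain `MPI_THREAD_MULTIPLE` library, tasks blocked in `MPI_Send`/`MPI_Recv` would pin their host threads and could deadlock once all cores are occupied; with `MPI_TASK_MULTIPLE` granted, TAMPI suspends the communicating task and hands the core back to the runtime.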
…

We highly recommend interactively logging in to a **cluster module (CM) node** to begin using TAMPI.

In most cases a truly hybrid application should simply execute two MPI ranks, each one on a different NUMA socket, in order to mitigate suboptimal memory accesses. Such an application will then use all the cores/threads available on each NUMA socket to run a shared-memory parallel instance of the same binary.

The command below requests an entire CM node for an interactive session with 2 MPI ranks (1 MPI rank per NUMA socket) and each rank using the 12 **physical cores** available on each socket (i.e. multi-threading ignored):

`srun -p dp-cn -N 1 -n 2 -c 12 --pty /bin/bash -i`

Once you have entered a CM node, you can check the system affinity via the **NUMA command** `srun numactl --show`:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 12 --pty /bin/bash -i
$ srun numactl --show
policy: bind
preferred node: 1
...
}}}

…

Note that loading the TAMPI module will automatically load the **!OmpSs-2** and **Parastation MPI** modules (this MPI library has been compiled with multi-threading support enabled).

You might want to request more MPI ranks per socket depending on your particular application. See the examples below together with the corresponding system affinity reports:

`srun -p dp-cn -N 1 -n 4 -c 6 --pty /bin/bash -i`

{{{
$ srun -p dp-cn -N 1 -n 4 -c 6 --pty /bin/bash -i
$ srun numactl --show
policy: bind
...
}}}

…

Finally, if you would like to take advantage of multi-threading, we recommend running your application as a regular job via the `srun` command rather than inside an interactive session.
For example, requesting 1 CM node with 2 MPI ranks (1 MPI rank per socket) and 24 threads per socket (multi-threading enabled) via a regular job submission:

`srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show`

yields the following system affinity:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show
...
}}}
which indicates that each MPI rank is bound to a single NUMA socket.

On the other hand, when allocating an interactive session for the same purpose:

`srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i`

the binding remains interleaved across the two NUMA sockets, thus yielding **suboptimal performance**:
{{{
$ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i
...
}}}
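As a complement to `numactl --show`, you can also verify the binding from inside the application itself. The sketch below is our illustration, not part of the guide: every rank prints the CPUs it is allowed to run on, using the standard Linux `sched_getaffinity` call. Compile it with `mpicc` and launch it with the same `srun` lines shown above.

{{{
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* Affinity mask of the calling process (pid 0 = "myself"). */
    cpu_set_t mask;
    sched_getaffinity(0, sizeof(mask), &mask);

    /* Collect the allowed CPU ids into a printable list. */
    char cpus[4096] = "";
    int pos = 0;
    for (int c = 0; c < CPU_SETSIZE && pos < (int) sizeof(cpus) - 8; ++c)
        if (CPU_ISSET(c, &mask))
            pos += snprintf(cpus + pos, sizeof(cpus) - pos, "%d ", c);

    printf("rank %d on %s: %d CPUs allowed { %s}\n",
           rank, host, CPU_COUNT(&mask), cpus);

    MPI_Finalize();
    return 0;
}
}}}

Run as a regular job with the `--ntasks-per-socket=1` line above, each rank should report only the hardware threads of its own socket; run from the interactive session instead, the interleaved masks make the suboptimal binding directly visible.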