Changes between Version 5 and Version 6 of Public/User_Guide/TAMPI
- Timestamp: Jun 19, 2019, 3:58:26 PM
Legend:
- Unmodified (no marker)
- Added (marked with `+`)
- Removed (marked with `-`)
- Modified (shown as the removed `-` line followed by its added `+` replacement)
Public/User_Guide/TAMPI
v5 → v6

**Lines 31-36 (v5) / 31-34 (v6):**

We highly recommend logging in interactively to a **cluster module (CM) node** to begin using TAMPI.

- **Presently, it seems that system affinity is not correctly set up for hybrid applications using multi-threading; therefore multi-threading will be ignored from now on.**

A truly hybrid application should simply execute two MPI ranks, one on each NUMA socket, to mitigate suboptimal memory accesses. Such an application will then use all the cores/threads available on each NUMA socket to run a shared-memory parallel application.

**Lines 69-75 (v5) / 67-73 (v6):**

`module load TAMPI`

- Note that loading the TAMPI module will automatically load the **!OmpSs-2** and **Parastation MPI** modules (notice that this MPI library has been compiled with multi-threading support enabled).
+ Note that loading the TAMPI module will automatically load the **!OmpSs-2** and **Parastation MPI** modules (this MPI library has been compiled with multi-threading support enabled).

- You might want to request more MPI ranks per socket depending on your particular application. See the examples below and the system affinity report (note that all of them ignore multi-threading):
+ You might want to request more MPI ranks per socket depending on your particular application. See the examples below and the system affinity report:

`srun -p dp-cn -N 1 -n 4 -c 6 --pty /bin/bash -i`

**Lines 180-183 (v5) / 178-226 (v6):**

membind: 1
}}}

+ Finally, if you would like to use 1 CM node with 2 MPI ranks (one MPI rank per socket) and 24 threads per socket (taking advantage of multi-threading), it is recommended to invoke the `srun` command as a regular job instead of running an ''interactive'' session. Indeed, when running `srun` as a regular job:
+
+ `srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show`
+
+ the reported system affinity is:
+ {{{
+ $ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 numactl --show
+ policy: bind
+ preferred node: 0
+ physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
+ cpubind: 0
+ nodebind: 0
+ membind: 0
+ policy: bind
+ preferred node: 1
+ physcpubind: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
+ cpubind: 1
+ nodebind: 1
+ membind: 1
+ }}}
+ which indicates that each MPI rank is bound to a single NUMA socket.
+
+ On the other hand, when running inside an interactive session:
+
+ `srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i`
+
+ the binding remains interleaved across the two NUMA sockets for no apparent reason, thus yielding **suboptimal performance**:
+ {{{
+ $ srun -p dp-cn -N 1 -n 2 -c 24 --ntasks-per-node=2 --ntasks-per-socket=1 --pty /bin/bash -i
+ $ srun numactl --show
+ policy: bind
+ preferred node: 0
+ physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
+ cpubind: 0 1
+ nodebind: 0 1
+ membind: 0 1
+ policy: bind
+ preferred node: 0
+ physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
+ cpubind: 0 1
+ nodebind: 0 1
+ membind: 0 1
+ }}}

----
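For readers who want to apply the launch recommendation added in v6, the following is a minimal batch-script sketch, not part of the original page: the partition (`dp-cn`), the module name (`TAMPI`), and the `srun` geometry are taken from the text above, while the job name, wall time, and application binary (`./app`) are placeholders.

{{{
#!/bin/bash -l
#SBATCH --job-name=tampi-hybrid        # placeholder job name
#SBATCH --partition=dp-cn              # CM partition used throughout this page
#SBATCH --nodes=1
#SBATCH --ntasks=2                     # one MPI rank per NUMA socket
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=24             # all hardware threads of one socket
#SBATCH --time=00:10:00                # placeholder wall time

# Loading TAMPI also pulls in the OmpSs-2 and ParaStation MPI modules (see above).
module load TAMPI

# Sanity check: each rank should report binding to a single NUMA node.
srun numactl --show

# Launch the hybrid MPI + OmpSs-2 application (placeholder binary name).
srun ./app
}}}

Submitting this script with `sbatch` (instead of typing the same `srun` line inside an interactive `--pty` shell) should reproduce the per-socket binding shown in the first `numactl --show` report above.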