!ParaStation MPI provides support for inter-module communication in federated high-speed networks.
To this end, so-called Gateway (GW) daemons bridge the MPI traffic between the modules.
This mechanism is transparent to the MPI application, i.e., the MPI ranks see a common `MPI_COMM_WORLD` across all modules within the job.
However, the user has to account for these additional Gateway resources during the job submission.
The following `srun` command line with so-called ''colon notation'' illustrates the submission of a heterogeneous pack job including the allocation of Gateway resources:

{{{
srun --gw_num=1 --partition dp-cn -N8 -n64 ./mpi_hello : --partition dp-esb -N16 -n256 ./mpi_hello
}}}

An MPI job started with this colon notation via `srun` will run in a single `MPI_COMM_WORLD`.

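A minimal sketch of what such an `mpi_hello` program might look like (the name and contents are illustrative, not a binary provided by the system; build with `mpicc`):

{{{
/* mpi_hello.c -- illustrative sketch; compile with: mpicc mpi_hello.c -o mpi_hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* With the colon notation above, 'size' spans the ranks of both modules. */
    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
}}}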
However, workflows across modules may demand multiple `MPI_COMM_WORLD`s that connect (and later disconnect) with each other during runtime.
The following simple job script is an example that supports such a case:

{{{
#!/bin/bash
#SBATCH --gw_num=1
#SBATCH --nodes=8 --partition=dp-cn
#SBATCH hetjob
#SBATCH --nodes=16 --partition=dp-esb

srun -n64 --het-group 0 ./mpi_hello_accept &
srun -n256 --het-group 1 ./mpi_hello_connect &
wait
}}}

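The programs `mpi_hello_accept` and `mpi_hello_connect` stand for the two sides of such a runtime connection between separate `MPI_COMM_WORLD`s. A minimal sketch of this pattern, using the standard `MPI_Open_port` / `MPI_Comm_accept` / `MPI_Comm_connect` calls, could look as follows; the exchange of the port name via a file on a shared file system is an assumption for illustration (error handling omitted):

{{{
/* connect_accept.c -- sketch of two MPI_COMM_WORLDs connecting at runtime.
   Run one job step with argument "accept" and the other with "connect". */
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME] = {0};
    MPI_Comm inter;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (argc > 1 && strcmp(argv[1], "accept") == 0) {
        if (rank == 0) {                      /* root opens the port ...      */
            MPI_Open_port(MPI_INFO_NULL, port);
            FILE *f = fopen("port.txt", "w"); /* ... and publishes it (shared FS
                                                 path is an assumption)       */
            fprintf(f, "%s\n", port);
            fclose(f);
        }
        /* collective over the whole accepting MPI_COMM_WORLD */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    } else {
        if (rank == 0) {
            FILE *f;
            while ((f = fopen("port.txt", "r")) == NULL)
                sleep(1);                     /* wait for the accept side */
            fgets(port, MPI_MAX_PORT_NAME, f);
            port[strcspn(port, "\n")] = '\0';
            fclose(f);
        }
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    /* 'inter' is now an inter-communicator between the two jobs. */
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
}}}

In the job script above, `mpi_hello_accept` would then correspond to running this program with the `accept` argument in het group 0, and `mpi_hello_connect` to running it with `connect` in het group 1.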
Further examples of Slurm batch scripts illustrating the allocation of heterogeneous resources can be found [wiki:User_Guide/Batch_system#HeterogeneousjobswithMPIcommunicationacrossmodules here].
=== Feature usage in environments without MSA support ===

On the DEEP-EST prototype, the Module ID is determined automatically and the environment variable `PSP_MSA_MODULE_ID` is then set accordingly.
However, on systems without this support, and/or on systems with a !ParaStation MPI ''before'' version 5.4.6, the user has to set and pass this variable explicitly, for example, via a bash script:

{{{
> cat msa.sh
#!/bin/bash
APP=$1
ID=$2
shift 2
PSP_MSA_MODULE_ID="${ID}" "${APP}" "$@"

> srun ./msa.sh ./IMB-MPI1 0 Bcast : ./msa.sh ./IMB-MPI1 1 Bcast
}}}

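On the application side, the Module ID passed this way can be inspected at runtime via the environment, e.g., for a sanity check. The helper below is only an illustration; the fallback value of `-1` for an unset variable is an assumption, not ParaStation MPI behavior:

{{{
#include <stdio.h>
#include <stdlib.h>

/* Returns the MSA module ID from the environment, or -1 if it is not set. */
static int msa_module_id(void)
{
    const char *s = getenv("PSP_MSA_MODULE_ID");
    return s ? atoi(s) : -1;
}

int main(void)
{
    printf("PSP_MSA_MODULE_ID = %d\n", msa_module_id());
    return 0;
}
}}}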
In addition, this script approach can always be useful if one wants to set the Module IDs ''explicitly'', e.g., for debugging or emulation purposes.


----

=== Usage on the DEEP-EST system ===

On the DEEP-EST system, the CUDA awareness can be enabled by loading a module that links to a dedicated !ParaStation MPI library providing CUDA support:
{{{
module load ParaStationMPI/5.4.2-1-CUDA
}}}

Please note that CUDA awareness might impact the MPI performance on system parts where CUDA is not used.
Therefore, it might be useful (and in some cases even necessary) to enable or disable the CUDA awareness explicitly.
Furthermore, additional optimizations such as GPUDirect, i.e., direct RMA transfers to/from CUDA device memory, are available with certain pscom plugins depending on the underlying hardware.

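With a CUDA-aware build, device pointers can be passed directly as send and receive buffers to MPI calls. The following sketch illustrates this under the assumption that the CUDA-enabled !ParaStation MPI module above is loaded; without CUDA awareness, an explicit `cudaMemcpy` to a host staging buffer would be required before the `MPI_Send`:

{{{
/* cuda_send.c -- sketch: communicate directly from GPU memory.
   Requires a CUDA-aware MPI build and at least two ranks. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 1024;
    double *dev_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&dev_buf, n * sizeof(double));

    /* The device pointer is handed to MPI as-is; the CUDA-aware library
       takes care of the staging (or uses GPUDirect where available). */
    if (rank == 0)
        MPI_Send(dev_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dev_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}
}}}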