!ParaStation MPI supports inter-module communication in federated high-speed networks. To this end, so-called gateway (GW) daemons bridge the MPI traffic between the modules. This mechanism is transparent to the MPI application, i.e., the MPI ranks see a common `MPI_COMM_WORLD` across all modules within the job. However, the user has to account for these additional GW resources during job submission. An example Slurm batch script illustrating the submission of heterogeneous pack jobs, including the allocation of GW resources, can be found [wiki:User_Guide/Batch_system#HeterogeneousjobswithMPIcommunicationacrossmodules here].

== Modularity-aware Collectives ==

=== Feature Description ===
In the context of DEEP-EST and MSA, !ParaStation MPI has been extended with modularity awareness for collective MPI operations.
An MSA-aware collective operation is conducted in a hierarchical manner, with the intra- and inter-module phases strictly separated:
 1. First, perform all module-internal gathering and/or reduction operations, if required.
 2. Then, perform the inter-module operation with only one process per module involved.
 3. Finally, distribute the data within each module in a strictly module-local manner.
This approach is illustrated in the following figure for a Broadcast operation with nine processes and three modules:

[[Image(ParaStationMPI_MSA_Bcast.jpg)]]

Besides Broadcast, the following collective operations currently feature this awareness (a minimal usage sketch follows the list):
 * `MPI_Bcast` / `MPI_Ibcast`
 * `MPI_Reduce` / `MPI_Ireduce`
 * `MPI_Allreduce` / `MPI_Iallreduce`
 * `MPI_Scan` / `MPI_Iscan`
 * `MPI_Barrier`
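
Since the MSA awareness is applied inside the MPI library, no changes to the application code are required; a plain call to a standard collective is sufficient. The following minimal C sketch (plain MPI, nothing ParaStation-specific) issues a broadcast that, with the environment variables described below, is executed hierarchically:
{{{
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    int buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        buf = 42;   /* root provides the data to be broadcast */

    /* With PSP_MSA_AWARENESS=1 (and PSP_MSA_AWARE_COLLOPS=1), this call is
     * executed hierarchically: one process per module receives the data in
     * the inter-module step and then distributes it module-locally. */
    MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d received %d\n", rank, buf);

    MPI_Finalize();
    return 0;
}
}}}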

=== Feature Usage ===
To use this feature, the following environment variables have to be set or considered:
{{{
- PSP_MSA_AWARENESS=1        # NOT enabled by default
- PSP_MSA_AWARE_COLLOPS=1    # Enabled by default if PSP_MSA_AWARENESS=1 is set
- PSP_MSA_MODULE_ID=xyz      # Pass the respective module ID (integer) explicitly to the processes
}}}

**Attention:** Please note that the environment variable for the respective module ID (`PSP_MSA_MODULE_ID`) is currently ''not'' set automatically!
This means that the user has to set and pass this variable explicitly, for example, via a bash script:
{{{
#!/bin/bash
# Script (script0.sh) for Module 0 (e.g., Cluster)
APP="./IMB-MPI1 Bcast"
MODULE_ID=0   # <- set an arbitrary ID for this module!
export PSP_MSA_AWARENESS=1
export PSP_MSA_MODULE_ID="${MODULE_ID}"
${APP}
}}}
{{{
#!/bin/bash
# Script (script1.sh) for Module 1 (e.g., ESB)
APP="./IMB-MPI1 Bcast"
MODULE_ID=1   # <- set a different ID for this module!
export PSP_MSA_AWARENESS=1
export PSP_MSA_MODULE_ID="${MODULE_ID}"
${APP}
}}}
{{{
> srun ./script0.sh : ./script1.sh
}}}
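
For reference, such a run could also be wrapped into a heterogeneous batch job along the following lines. This is only a sketch under assumptions: the partition names (`dp-cn`, `dp-esb`) and task counts are placeholders to be adapted to your site, older Slurm versions use `packjob` instead of `hetjob`, and the gateway resources mentioned above are not requested here (see the batch system page linked at the top for the complete submission scheme):
{{{
#!/bin/bash
#SBATCH --job-name=msa-bcast
#SBATCH --partition=dp-cn    # component 0: Cluster module (placeholder)
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH hetjob
#SBATCH --partition=dp-esb   # component 1: ESB module (placeholder)
#SBATCH --nodes=1
#SBATCH --ntasks=4

# One wrapper script per het-job component; each sets its own PSP_MSA_MODULE_ID.
srun ./script0.sh : ./script1.sh
}}}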