Changes between Version 28 and Version 29 of Public/ParaStationMPI


Timestamp: May 28, 2021, 5:26:26 PM
Author: Carsten Clauß
  • Public/ParaStationMPI

----

== Modular MPI Jobs ==

=== Inter-module MPI Communication ===

!ParaStation MPI provides support for inter-module communication in federated high-speed networks.
For this purpose, so-called Gateway (GW) daemons bridge the MPI traffic between the modules.

…

An MPI job started with this colon notation via srun will run in a single `MPI_COMM_WORLD`.

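For illustration, such a job could be submitted with srun as sketched below; the partition names, node counts, and the binary are placeholders only and have to be adapted to the respective modules:

{{{
# Hypothetical sketch: adapt partitions, node counts, and the binary to your system.
srun -N2 -p <partition of module A> ./mpi_prog : -N4 -p <partition of module B> ./mpi_prog
}}}
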
However, workflows across modules may demand multiple `MPI_COMM_WORLD` sessions that may connect (and later disconnect) with each other during runtime.
The following simple job script is an example that supports such a case:

…

=== Application-dependent Tuning ===

The Gateway protocol supports the fragmentation of larger messages into smaller chunks of a given length, i.e., the Maximum Transfer Unit (MTU).
This way, the Gateway daemon may benefit from a pipelining effect, resulting in an overlap of the message transfer from the source to the Gateway daemon with the transfer from the Gateway daemon to the destination.
The chunk size can be adjusted by setting the following environment variable:
{{{
PSP_GW_MTU=<chunk size in bytes>
}}}
The optimal chunk size is highly dependent on the communication pattern and therefore has to be chosen for each application individually.
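
For example, the variable could be exported in the job environment before the application is started; the value below is merely an illustrative assumption, not a recommendation:

{{{
export PSP_GW_MTU=65536   # e.g., 64 KiB chunks; tune per application
}}}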


=== API Extensions for MSA awareness ===

Besides transparent MSA support, there is also the possibility for the application to adapt to modularity explicitly.

To do so, on the one hand, !ParaStation MPI provides a portable API addition for retrieving topology information by querying a ''Module ID'' via the `MPI_INFO_ENV` object:

{{{
  int my_module_id;
  int flag;
  char value[MPI_MAX_INFO_VAL];

  MPI_Info_get(MPI_INFO_ENV, "msa_module_id", MPI_MAX_INFO_VAL, value, &flag);

  if (flag) { /* This MPI environment is modularity-aware! */

    my_module_id = atoi(value); /* Determine the module affinity of this process. */

  } else { /* This MPI environment is NOT modularity-aware! */

    my_module_id = 0; /* Assume a flat topology for all processes. */
  }
}}}

On the other hand, a newly added ''split type'' for the standardized `MPI_Comm_split_type()` function can be used to create MPI communicators according to the modular topology of an MSA system:

{{{
  MPI_Comm module_local_comm;
  int my_module_local_rank;

  MPI_Comm_split(MPI_COMM_WORLD, my_module_id, 0, &module_local_comm);

  /* After the split call, module_local_comm contains, from the view of each
   * process, all the other processes that belong to the same local MSA module.
   */

  MPI_Comm_rank(module_local_comm, &my_module_local_rank);

  printf("My module ID is %d and my module-local rank is %d\n", my_module_id, my_module_local_rank);
}}}
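
A corresponding `MPI_Comm_split_type()` call might look as sketched below; note that the name of the module-related split type (assumed here to be `MPIX_COMM_TYPE_MODULE`) should be verified against the !ParaStation MPI release at hand:

{{{
  MPI_Comm module_local_comm;
  int my_module_local_rank;

  /* Assumed split-type constant -- please verify the exact name: */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPIX_COMM_TYPE_MODULE, 0,
                      MPI_INFO_NULL, &module_local_comm);

  MPI_Comm_rank(module_local_comm, &my_module_local_rank);
}}}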


…
PSP_UCP=1    # support GPUDirect via UCX in InfiniBand networks (e.g., this is currently true for the ESB nodes)
}}}
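
For illustration, these settings would typically be exported in the batch environment before launching the application; the following excerpt is only a hypothetical sketch (the binary name is a placeholder):

{{{
export PSP_CUDA=1   # activate CUDA awareness in ParaStation MPI
export PSP_UCP=1    # use UCX, e.g., for GPUDirect in InfiniBand networks
srun ./my_cuda_mpi_prog
}}}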

=== Testing for CUDA awareness ===

!ParaStation MPI features three API extensions for querying whether the MPI library at hand is CUDA-aware.

The first targets compile time:

{{{
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
printf("The MPI library is CUDA-aware\n");
#endif
}}}

...and the other two also target the runtime:

{{{
if (MPIX_Query_cuda_support())
    printf("The CUDA awareness is activated\n");
}}}

or alternatively:

{{{
MPI_Info_get(MPI_INFO_ENV, "cuda_aware", ..., value, &flag);
/*
 * If flag is set, then the library was built with CUDA support.
 * If, in addition, value points to the string "true", then the
 * CUDA awareness is also activated (i.e., PSP_CUDA=1 is set).
 */
}}}

Please note that the first two API extensions are similar to those that Open MPI provides with respect to CUDA awareness, whereas the third one is specific to !ParaStation MPI but is still quite portable due to the use of the generic `MPI_INFO_ENV` object.
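
For reference, a complete runtime check along the lines of the snippets above might look as follows (a minimal, self-contained sketch; error handling is omitted):

{{{
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char value[MPI_MAX_INFO_VAL + 1];
    int flag;

    MPI_Init(&argc, &argv);

    /* Query the generic MPI_INFO_ENV object for the "cuda_aware" key. */
    MPI_Info_get(MPI_INFO_ENV, "cuda_aware", MPI_MAX_INFO_VAL, value, &flag);

    if (flag && strcmp(value, "true") == 0) {
        printf("The MPI library is CUDA-aware and CUDA awareness is activated\n");
    } else if (flag) {
        printf("The MPI library is CUDA-aware, but CUDA awareness is not activated\n");
    } else {
        printf("The MPI library is not CUDA-aware\n");
    }

    MPI_Finalize();
    return 0;
}
}}}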