Changes between Version 29 and Version 30 of Public/ParaStationMPI


Timestamp: May 28, 2021, 6:02:31 PM
Author: Carsten Clauß

----

== CUDA Support by !ParaStation MPI ==

=== CUDA awareness for MPI ===

In brief, [https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/ CUDA awareness] in an MPI library means that a mixed CUDA + MPI application is allowed to pass pointers to CUDA buffers (these are memory regions located on the GPU, the so-called ''Device'' memory) directly to MPI functions such as `MPI_Send()` or `MPI_Recv()`. A non-CUDA-aware MPI library would fail in such a case because the CUDA memory cannot be accessed directly, e.g., via load/store or `memcpy()`, but has to be transferred in advance to the host memory via special routines such as `cudaMemcpy()`. In contrast, a CUDA-aware MPI library recognizes that a pointer is associated with a buffer within the device memory and can then copy this buffer into a temporary host buffer prior to the communication -- the so-called ''staging'' of this buffer. Additionally, a CUDA-aware MPI library may also apply further optimizations, e.g., by exploiting so-called ''GPUDirect'' capabilities that allow for direct RDMA transfers from and to the device memory.
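
For illustration, here is a minimal sketch (not taken from the !ParaStation MPI sources; it assumes a CUDA-aware MPI library and at least two ranks) of the code path described above: the device pointer returned by `cudaMalloc()` is handed directly to `MPI_Send()`/`MPI_Recv()`, without any explicit staging via `cudaMemcpy()` in the application:

{{{
/* cuda_aware_sketch.cu -- hypothetical example, not part of ParaStation MPI */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
        int rank;
        double *dev_buf;                    /* buffer in GPU (device) memory */
        const int count = 1024;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Allocate the communication buffer directly on the GPU: */
        cudaMalloc((void **)&dev_buf, count * sizeof(double));
        cudaMemset(dev_buf, 0, count * sizeof(double));

        /* The device pointer is passed directly to MPI -- a CUDA-aware
         * library detects this and performs the staging (or a GPUDirect
         * transfer) internally: */
        if (rank == 0)
                MPI_Send(dev_buf, count, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
        else if (rank == 1)
                MPI_Recv(dev_buf, count, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);

        cudaFree(dev_buf);
        MPI_Finalize();
        return 0;
}
}}}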

=== Some external Resources ===
 * [http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#axzz44ZswsbEt Getting started with CUDA] (by NVIDIA)
 * [https://developer.nvidia.com/gpudirect NVIDIA GPUDirect Overview] (by NVIDIA)
 * [https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/ Introduction to CUDA-Aware MPI] (by NVIDIA)

----

== Using Network Attached Memory with !ParaStation MPI ==

=== Documentation ===

 * [https://deeptrac.zam.kfa-juelich.de:8443/trac/attachment/wiki/Public/ParaStationMPI/DEEP-EST_Task_6.1_MPI-NAM-Proposal.pdf Proposal for accessing the NAM via MPI]
 * [https://deeptrac.zam.kfa-juelich.de:8443/trac/attachment/wiki/Public/ParaStationMPI/DEEP-EST_Task_6.1_MPI-NAM-Manual.pdf Manual for using the NAM as a PDF]
 * [#APIPrototypeImplementation API Prototype Implementation]
 * [#UsageExampleontheDEEP-ERSDV Usage Example on the DEEP-ER SDV]

=== API Prototype Implementation ===

For evaluating the proposed semantics and API extensions, we have already developed a shared-memory-based prototype implementation where the persistent NAM is (more or less) “emulated” by persistent shared-memory (with ''deep_mem_kind=deep_mem_persistent'').

**Advice to users**

Please note that this prototype is not intended to actually emulate the NAM but shall rather offer a possibility for later users and programmers to evaluate the proposed semantics from the MPI application’s point of view.
Therefore, the focus here is not put on the question of how remote memory is managed at its location (currently by MPI processes running local to the memory -- later by the NAM manager or the NAM itself), but on the question of how process-foreign memory regions can be exposed locally.
That means that (at least currently), for accessing a persistent RMA window, it has to be made sure that there is at least one MPI process running locally to each of the window’s memory regions.

=== Extensions to MPI ===

The API proposal strives to stick as closely as possible to the current MPI standard and to avoid the addition of new API functions and other symbols.
However, in order to make the usage of the prototype a little more convenient for the user, we have added at least a small set of new symbols (denoted with an MPIX prefix) that may be used by applications:

{{{
extern int MPIX_WIN_DISP_UNITS;
#define MPIX_WIN_FLAVOR_INTERCOMM        (MPI_WIN_FLAVOR_CREATE +       \
                                          MPI_WIN_FLAVOR_ALLOCATE +     \
                                          MPI_WIN_FLAVOR_DYNAMIC +      \
                                          MPI_WIN_FLAVOR_SHARED + 0)
#define MPIX_WIN_FLAVOR_INTERCOMM_SHARED (MPI_WIN_FLAVOR_CREATE +       \
                                          MPI_WIN_FLAVOR_ALLOCATE +     \
                                          MPI_WIN_FLAVOR_DYNAMIC +      \
                                          MPI_WIN_FLAVOR_SHARED + 1)
}}}
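
As a usage hint, the following minimal sketch (not taken from the prototype itself; it assumes a window handle `win` created as shown in `world.c` below and that the MPIX_* symbols are provided by the MPI header) checks the create flavor of a window against the new constants:

{{{
/* Hypothetical helper: does this window expose remote (inter-communicator)
 * memory regions, i.e., does it carry one of the MPIX_* create flavors? */
static int win_has_intercomm_flavor(MPI_Win win)
{
        void *attr_val = NULL;
        int   flag     = 0;

        MPI_Win_get_attr(win, MPI_WIN_CREATE_FLAVOR, &attr_val, &flag);
        if (!flag)
                return 0;

        return (*(int *)attr_val == MPIX_WIN_FLAVOR_INTERCOMM) ||
               (*(int *)attr_val == MPIX_WIN_FLAVOR_INTERCOMM_SHARED);
}
}}}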

=== Code Example for a “Hello World” workflow ===

The following two C code excerpts demonstrate how intermediate data can be passed between two subsequent steps of a workflow (Step 1: hello / Step 2: world) via the persistent memory of the NAM (currently emulated by persistent shared-memory):
    426 
    427 {{{
    428 /** hello.c **/
    429 
    430      /* Create persistent MPI RMA window: */
    431         MPI_Info_create(&win_info);
    432         MPI_Info_set(win_info, "deep_mem_kind", "deep_mem_persistent");
    433         MPI_Win_allocate(sizeof(char) * HELLO_STR_LEN, sizeof(char), win_info, MPI_COMM_WORLD, 
    434                                           &win_base, &win);
    435 
    436         /* Put some content into the local region of the window: */
    437         if(argc > 1) {
    438                 snprintf(win_base, HELLO_STR_LEN, "Hello World from rank %d! %s", world_rank, argv[1]);
    439         } else {
    440                 snprintf(win_base, HELLO_STR_LEN, "Hello World from rank %d!", world_rank);
    441         }
    442         MPI_Win_fence(0, win);
    443 
    444         /* Retrieve port name of window: */
    445         MPI_Info_free(&win_info);
    446         MPI_Win_get_info(win, &win_info);
    447         MPI_Info_get(win_info, "deep_win_port_name", INFO_VALUE_LEN, info_value, &flag);
    448 
    449         if(flag) {
    450                 strcpy(port_name, info_value);
    451                 if(world_rank == root) printf("(%d) The Window's port name is: %s\n", world_rank, port_name);
    452         } else {
    453                 if(world_rank == root) printf("(%d) No port name found!\n", world_rank);
    454         }
    455 }}}
{{{
/** world.c **/

        /* Check for port name: (to be passed as a command line argument) */
        if(argc == 1) {
                if(world_rank == root) printf("[%d] No port name found!\n", world_rank);
                goto finalize;
        } else {
                strcpy(port_name, argv[1]);
                if(world_rank == root) printf("[%d] The Window's port name is: %s\n", world_rank, port_name);
        }

        /* Try to connect to the persistent window: */
        MPI_Info_create(&win_info);
        MPI_Info_set(win_info, "deep_win_connect", "true");
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
        errcode = MPI_Comm_connect(port_name, win_info, root, MPI_COMM_WORLD, &inter_comm);
        printf("[%d] Connection to persistent memory region established!\n", world_rank);

        /* Retrieve the number of remote regions: (= former number of ranks) */
        MPI_Comm_remote_size(inter_comm, &remote_size);
        if(world_rank == root) printf("[%d] Number of remote regions: %d\n", world_rank, remote_size);

        /* Create window object for accessing the remote regions: */
        MPI_Win_create_dynamic(MPI_INFO_NULL, inter_comm, &win);
        MPI_Win_get_attr(win, MPI_WIN_CREATE_FLAVOR, &create_flavor, &flag);
        assert(*create_flavor == MPIX_WIN_FLAVOR_INTERCOMM);
        MPI_Win_fence(0, win);

        /* Check the accessibility and the content of the remote regions: */
        for(i=0; i<remote_size; i++) {
                char hello_string[HELLO_STR_LEN];
                MPI_Get(hello_string, HELLO_STR_LEN, MPI_CHAR, i, 0, HELLO_STR_LEN, MPI_CHAR, win);
                MPI_Win_fence(0, win);
                printf("[%d] Get from %d: %s\n", world_rank, i, hello_string);
        }
}}}

=== Usage Example on the DEEP-ER SDV ===

On the DEEP-ER SDV, there is already a special version of !ParaStation MPI installed that features all the introduced API extensions. It is accessible via the module system:
{{{
> module load parastation/5.2.1-1-mt-wp6
}}}

When allocating a session with N nodes, one can run an MPI session (let’s say with n processes distributed across the N nodes) where each of the processes contributes its local and persistent memory region to an MPI window:
{{{
> salloc --partition=sdv --nodes=4 --time=01:00:00
salloc: Granted job allocation 2514
> srun -n4 -N4 ./hello 'Have fun!'
(0) Running on deeper-sdv13
(1) Running on deeper-sdv14
(2) Running on deeper-sdv15
(3) Running on deeper-sdv16
(0) The Window's port name is: shmid:347897856:92010569
(0) Calling finalize...
(1) Calling finalize...
(2) Calling finalize...
(3) Calling finalize...
(0) Calling finalize...
(0) Finalize done!
(1) Finalize done!
(2) Finalize done!
(3) Finalize done!
}}}

Afterwards, one persistent memory region has been created by each of the MPI processes on all the nodes involved (and later on the NAM).
In this example, the “port name” for accessing the persistent window again is:
{{{
shmid:347897856:92010569
}}}

By means of this port name (here to be passed as a command line argument), all the processes of a subsequent MPI session can access the persistent window -- provided that there is again at least one MPI process running locally to each of the persistent but distributed regions:
{{{
> srun -n4 -N4 ./world shmid:347897856:92010569
[0] Running on deeper-sdv13
[1] Running on deeper-sdv14
[2] Running on deeper-sdv15
[3] Running on deeper-sdv16
[0] The Window's port name is: shmid:347897856:92010569
[1] Connection to persistent memory region established!
[3] Connection to persistent memory region established!
[0] Connection to persistent memory region established!
[2] Connection to persistent memory region established!
[0] Number of remote regions: 4
[0] Get from 0: Hello World from rank 0! Have fun!
[1] Get from 0: Hello World from rank 0! Have fun!
[2] Get from 0: Hello World from rank 0! Have fun!
[3] Get from 0: Hello World from rank 0! Have fun!
[0] Get from 1: Hello World from rank 1! Have fun!
[1] Get from 1: Hello World from rank 1! Have fun!
[2] Get from 1: Hello World from rank 1! Have fun!
[3] Get from 1: Hello World from rank 1! Have fun!
[0] Get from 2: Hello World from rank 2! Have fun!
[1] Get from 2: Hello World from rank 2! Have fun!
[2] Get from 2: Hello World from rank 2! Have fun!
[3] Get from 2: Hello World from rank 2! Have fun!
[0] Get from 3: Hello World from rank 3! Have fun!
[1] Get from 3: Hello World from rank 3! Have fun!
[2] Get from 3: Hello World from rank 3! Have fun!
[3] Get from 3: Hello World from rank 3! Have fun!
[0] Calling finalize...
[1] Calling finalize...
[2] Calling finalize...
[3] Calling finalize...
[0] Finalize done!
[1] Finalize done!
[2] Finalize done!
[3] Finalize done!
}}}

**Advice to users**

Please note that if not all persistent memory regions are covered by the subsequent session, the “connection establishment” to the remote RMA window fails:

{{{
> srun -n4 -N2 ./world shmid:347897856:92010569
[3] Running on deeper-sdv14
[1] Running on deeper-sdv13
[2] Running on deeper-sdv14
[0] Running on deeper-sdv13
[0] The Window's port name is: shmid:347930624:92010569
[0] ERROR: Could not connect to persistent memory region!
application called MPI_Abort(MPI_COMM_WORLD, -1)
}}}

=== Cleaning up of persistent memory regions ===

If the connection to a persistent memory region succeeds, the window and all of its memory regions will eventually be removed by the MPI_Win_free call of the subsequent MPI session (here by world.c) -- at least if deep_mem_persistent is not passed again as an Info argument.
However, if a connection attempt fails, the persistent memory regions still persist.
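
If the data shall survive the second session as well, a minimal sketch of the latter case (this is an assumption about the prototype, not verbatim from the proposal) would be to pass the Info key again to the window creation of the connecting session, so that its `MPI_Win_free()` does not remove the regions:

{{{
/* Sketch (assumption): keep the regions persistent beyond world.c by
 * passing the Info key again when creating the dynamic window: */
MPI_Info_create(&win_info);
MPI_Info_set(win_info, "deep_mem_kind", "deep_mem_persistent");
MPI_Win_create_dynamic(win_info, inter_comm, &win);

/* ... RMA accesses as shown above ... */

MPI_Win_free(&win);  /* assumed not to remove the persistent regions here */
}}}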

For explicitly cleaning up those artefacts, one can use a simple shell script:
{{{
#!/bin/bash
# Collect the IDs of all shared-memory segments with permission mask 777
# (i.e., the persistent regions left behind) and remove them:
keys=`ipcs | grep 777 | cut -d' ' -f2`
for key in $keys ; do
        ipcrm -m $key
done
}}}

**Advice to administrators**

Obviously, a good idea would be to integrate such an automated cleanup procedure as a default into the epilogue scripts for the jobs.

More to come...