

=== Pre-Allocated Memory and Segments ===

Without further info parameters beyond those described so far, `MPI_Win_allocate()` will always try to allocate NAM memory itself and "on demand".
However, a common use case is that the NAM memory required by an application has already been allocated beforehand via the batch system, and the question is how such pre-allocated memory can be handled at the MPI level.
In fact, using an existing NAM allocation during an `MPI_Win_allocate()` call instead of allocating new space is quite straightforward: `psnam_libnam_allocation_id` is applied as a further info key, with the respective NAM allocation ID as the related info value.


==== Usage of Segments ====

However, a NAM-based MPI window may still consist of multiple regions, and it should also still be possible to build multiple MPI windows from the space of a single NAM (pre-)allocation.
Therefore, a means for subdividing NAM allocations needs to be provided, and that is exactly what segments are intended for:
a segment is a "meta-manifestation" that maintains size and offset information for a sub-region within a larger allocation.
This offset can either be set explicitly via `psnam_segment_offset` (e.g., for splitting an allocation among multiple processes), or it can be managed dynamically and implicitly by the PSNAM layer (e.g., for using the allocated memory across multiple MPI windows).


==== Recursive Use of Segments ====

The concept of segments can also be applied recursively.
For doing so, PSNAM windows of the "raw and flat" structure feature the info key `psnam_allocation_id` (plus a respective value), which in turn can be used to pass a reference to an already existing allocation to a subsequent `MPI_Win_allocate()` call with `psnam_manifestation_segment` as the region manifestation.
That way, existing allocations can be divided into segments, which could then be subdivided even further into sub-sections, and so forth.


==== Example ====

{{{
MPI_Info_create(&info_set);
MPI_Info_set(info_set, "psnam_manifestation", "psnam_manifestation_libnam");
MPI_Info_set(info_set, "psnam_libnam_allocation_id", getenv("SLURM_NAM_ALLOC_ID"));
MPI_Info_set(info_set, "psnam_structure", "psnam_structure_raw_and_flat");

MPI_Win_allocate(allocation_size, 1, info_set, MPI_COMM_WORLD, NULL, &raw_nam_win);
MPI_Win_get_info(raw_nam_win, &info_get);
MPI_Info_get(info_get, "psnam_allocation_id", MPI_MAX_INFO_VAL, segment_name, &flag);

MPI_Info_set(info_set, "psnam_manifestation", "psnam_manifestation_segment");
MPI_Info_set(info_set, "psnam_segment_allocation_id", segment_name);
sprintf(offset_value_str, "%d", (allocation_size / num_ranks) * my_rank);
MPI_Info_set(info_set, "psnam_segment_offset", offset_value_str);

MPI_Info_set(info_set, "psnam_structure", "psnam_structure_managed_contiguous");
MPI_Win_allocate(num_int_elements * sizeof(int), sizeof(int), info_set, MPI_COMM_WORLD, NULL, &win);
}}}


=== Accessing Data in NAM Memory ===

Accesses to the NAM memory must always be performed via `MPI_Put()` and `MPI_Get()` calls.
Direct load/store accesses are (of course) not possible, and `MPI_Accumulate()` is currently not supported either, since the NAM is, at least so far, just a passive memory device.
Note that after an epoch of accessing the NAM, the respective origin buffers must not be reused or read until a synchronization has been performed.
Currently, only the `MPI_Win_fence()` mechanism is supported for this purpose.
According to this loosely-synchronous model, computation phases alternate with NAM access phases, each completed by a call to `MPI_Win_fence()`, which acts as a memory barrier and process synchronization point.


==== Example ====

{{{
/* Fill the local origin buffer and issue RMA accesses to the NAM: */
for (pos = 0; pos < region_size; pos++) put_buf[pos] = put_rank + pos;
MPI_Put(put_buf, region_size, MPI_INT, target_region_rank, 0, region_size, MPI_INT, win);
MPI_Get(get_buf, region_size, MPI_INT, target_region_rank, 0, region_size, MPI_INT, win);

/* Complete the access epoch; only now may the buffers be read again: */
MPI_Win_fence(0, win);

for (pos = 0; pos < region_size - WIN_DISP; pos++) {
    if (get_buf[pos] != put_rank + pos) {
        fprintf(stderr, "ERROR at %d: %d vs. %d\n", pos, get_buf[pos], put_rank + pos);
    }
}
}}}


=== Alternative Interface ===

The extensions presented so far are all of a semantic nature, i.e. they do not introduce new API functions.
However, this changed usage of standard MPI functions may be a bit confusing, which is why a set of macros is additionally provided that encapsulates the MPI functions used for the NAM handling.
That way, the readability of application code employing the NAM can be improved.

These encapsulating macros are the following:

* `MPIX_Win_allocate_intercomm(size, disp_unit, info_set, comm, intercomm, win)` ...as an alias for `MPI_Win_allocate()`.

* `MPIX_Win_connect_intercomm(window_name, info, root, comm, intercomm)` ...as an alias for `MPI_Comm_connect()`.

* `MPIX_Win_create_intercomm(info, comm, win)` ...as an alias for `MPI_Win_create_dynamic()`.

* `MPIX_Win_intercomm_query(win, rank, size, disp_unit)` ...as an alias for `MPI_Win_shared_query()`.