Changes between Version 31 and Version 32 of Public/ParaStationMPI


Timestamp:
May 31, 2021, 11:33:40 AM (3 years ago)
Author:
Carsten Clauß
Comment:

  • Public/ParaStationMPI

    v31 v32  
    367367* [https://deeptrac.zam.kfa-juelich.de:8443/trac/attachment/wiki/Public/ParaStationMPI/DEEP-EST_Task_6.1_MPI-NAM-Manual.pdf This Manual for using the NAM as a PDF]
    368368
    369 More to come...
     369
     370=== Introduction ===
     371
     372One distinct feature of the DEEP-EST prototype is the Network Attached Memory (NAM):
     373Special memory regions that can directly be accessed via !Put/Get-operations from any node within the Extoll network.
     374For executing such RMA operations on the NAM, a new version of the libNAM is available to the users that features a corresponding low-level API for this purpose.
     375However, to make this programming more convenient—and in particular to also support parallel access to shared NAM data by multiple processes—an MPI extension with corresponding wrapper functionalities to the libNAM API has also been developed in the DEEP-EST project.
     376
     377This extension, which is called PSNAM, is a complementary part of the !ParaStation MPI—which is the MPI library of choice in the DEEP-EST project—and is as such also available to users on the DEEP-EST prototype system.
     378In this way, application programmers shall be enabled to use known MPI functions (especially those of the MPI RMA interface) for accessing NAM regions in a standardized (or at least harmonized) way under the familiar roof of an MPI world.
     379In doing so, the PSNAM extensions try to stick to the current MPI standard as closely as possible and to avoid the introduction of new API functions wherever possible.
     380
     381----
     382**Attention:** ''Currently (as of May 2021), changes to the DEEP-EST system are foreseeable, which will also affect the availability of libNAM and the SW-NAM mockup.
     383The following text still reflects the current state and will soon be adapted accordingly.''
     384----
     385
     386=== Acquiring NAM Memory ===
     387
     388==== General Semantics ====
     389
     390The main issue when mapping the MPI RMA interface onto the libNAM API is that MPI assumes every target memory region for RMA operations is associated with an MPI process that owns this memory. In an MPI world, remote memory regions are therefore always addressed by a process rank (plus a handle, which is the respective window object, plus an offset), whereas the libNAM API merely requires an opaque handle for addressing the respective NAM region (plus an offset). Hence, a mapping between remote MPI ranks and the remote NAM memory has to be established. In PSNAM, this is achieved by sticking to the notion of ownership, in the sense that definite regions of the NAM memory space are logically assigned to particular MPI ranks. However, it has to be emphasised that this is a purely software-based mapping conducted by the PSNAM wrapper layer. That means that the related MPI window regions (though globally accessible and located within the NAM) have to be addressed by the rank of the process to which the NAM region is assigned.
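
The rank-based addressing described above can be sketched in plain C. This is only an illustration of the idea, not the PSNAM implementation: the type `nam_region_t`, the function `nam_translate()` and the integer stand-in for the opaque libNAM handle are all hypothetical names introduced here.

```c
#include <stdint.h>

/* Hypothetical bookkeeping entry: one NAM region logically owned by one rank. */
typedef struct {
    int      nam_handle; /* stand-in for the opaque libNAM handle (assumption) */
    uint64_t base;       /* start of the region within the NAM memory space */
    uint64_t size;       /* region size in bytes */
} nam_region_t;

/* Translate an MPI-style target (rank, displacement) into a NAM-style
 * target (handle, byte offset). `regions` holds one entry per rank.
 * Returns 0 on success, -1 if the rank or the access range is invalid. */
int nam_translate(const nam_region_t *regions, int nranks,
                  int target_rank, uint64_t target_disp, uint64_t disp_unit,
                  uint64_t len, int *handle_out, uint64_t *offset_out)
{
    if (target_rank < 0 || target_rank >= nranks)
        return -1;

    const nam_region_t *r = &regions[target_rank];
    uint64_t byte_off = target_disp * disp_unit;

    /* The access [byte_off, byte_off + len) must lie inside the region. */
    if (byte_off > r->size || len > r->size - byte_off)
        return -1;

    *handle_out = r->nam_handle;
    *offset_out = r->base + byte_off;
    return 0;
}
```

With two regions of 64 and 32 bytes placed back to back, an access to rank 1 at displacement 2 with a displacement unit of 8 bytes resolves to byte offset 80 behind the same handle.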
     391
     392==== Semantic Terms ====
     393
     394At this point, the semantic terms of memory ''allocation'', memory ''region'' and memory ''segment'' are to be determined for their use within this proposal.
     395The reason for this is that, for example, the term "allocation" is commonly used for both: a resource, as granted by the job scheduler, and a memory region, as returned e.g. by malloc.
     396Therefore, we need a stricter nomenclature here:
     397
     398"''NAM Memory Allocation''":
     399A certain amount of contiguous NAM memory space that has been requested from the NAM Manager (and possibly granted through the job scheduler) for an MPI session.
     400
     401"''NAM Memory Segment''":
     402A certain amount of contiguous NAM memory space that is part of a NAM allocation.
     403According to this, a NAM allocation can logically be subdivided by the PSNAM wrapper layer into multiple memory segments, which can then again be assigned to MPI RMA windows.
     404
     405"''NAM Memory Region''":
     406A certain amount of contiguous NAM memory space that is associated with a certain MPI rank in the context of an MPI RMA window.
     407
     408For performance and also for management reasons, allocation requests towards the NAM and/or the resource manager should preferably occur rarely, for instance only once at the beginning of an MPI session.
     409In order to provide MPI applications with the ability to handle multiple MPI RMA windows within such an allocation, PSNAM implements a further layer of memory management that allows for a logical acquiring and releasing of NAM segments within the limits of the granted allocation.
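
To make the idea of acquiring and releasing segments within one granted allocation more concrete, the following is a minimal first-fit sketch in plain C. It is an assumption about how such a layer could work, not a description of the actual PSNAM memory management; `seg_pool_t`, `pool_init()`, `pool_acquire()` and `pool_release()` are hypothetical names.

```c
#include <stdint.h>

#define MAX_SEGMENTS 16

typedef struct { uint64_t offset, size; int used; } segment_t;

typedef struct {
    uint64_t  alloc_size;        /* size of the granted NAM allocation */
    segment_t seg[MAX_SEGMENTS]; /* slots for currently acquired segments */
} seg_pool_t;

void pool_init(seg_pool_t *p, uint64_t alloc_size)
{
    p->alloc_size = alloc_size;
    for (int i = 0; i < MAX_SEGMENTS; i++)
        p->seg[i].used = 0;
}

/* Returns 1 if [off, off + size) overlaps no acquired segment. */
static int range_is_free(const seg_pool_t *p, uint64_t off, uint64_t size)
{
    for (int i = 0; i < MAX_SEGMENTS; i++) {
        const segment_t *s = &p->seg[i];
        if (s->used && off < s->offset + s->size && s->offset < off + size)
            return 0;
    }
    return 1;
}

/* Acquire a segment at the lowest fitting offset. Candidate offsets are 0
 * and the end of every acquired segment. Returns the slot index, or -1. */
int pool_acquire(seg_pool_t *p, uint64_t size, uint64_t *offset_out)
{
    if (size == 0 || size > p->alloc_size)
        return -1;

    int slot = -1;
    for (int i = 0; i < MAX_SEGMENTS; i++)
        if (!p->seg[i].used) { slot = i; break; }
    if (slot < 0)
        return -1;

    int found = 0;
    uint64_t best = 0;
    for (int i = -1; i < MAX_SEGMENTS; i++) {
        uint64_t cand;
        if (i < 0)
            cand = 0;
        else if (p->seg[i].used)
            cand = p->seg[i].offset + p->seg[i].size;
        else
            continue;
        if (cand > p->alloc_size - size)   /* would exceed the allocation */
            continue;
        if (!range_is_free(p, cand, size))
            continue;
        if (!found || cand < best) { best = cand; found = 1; }
    }
    if (!found)
        return -1;

    p->seg[slot].offset = best;
    p->seg[slot].size   = size;
    p->seg[slot].used   = 1;
    *offset_out = best;
    return slot;
}

/* Release a segment; its space becomes available for later requests. */
void pool_release(seg_pool_t *p, int slot)
{
    if (slot >= 0 && slot < MAX_SEGMENTS)
        p->seg[slot].used = 0;
}
```

The point of the sketch is that only the initial allocation involves the NAM or the resource manager; all later acquire and release steps are pure bookkeeping within its limits.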
     410
     411
     412==== Interface Specification ====
     413
     414For associating memory regions on the NAM with MPI RMA windows, a semantic extension to the well-known `MPI_Win_allocate()` function via its MPI info parameter can be used:
     415
     416{{{
     417MPI_Win_allocate(size, disp_unit, info, comm, baseptr, win)
     418IN   size      size of memory region in bytes (non-negative integer, may differ between processes)
     419IN   disp_unit local unit size for displacements, in bytes (positive integer)
     420IN   info      info argument (handle) with psnam info keys and values
     421IN   comm      intra-communicator (handle)
     422OUT  baseptr   always NULL in case of PSNAM windows
     423OUT  win       window object returned by the call (handle)
     424}}}
     425
     426`MPI_Win_allocate()` is a collective call to be executed by all processes in the group of `comm`.
     427This in turn enables the PSNAM wrapper layer to treat the set of allocated memory regions as an entity and logically link the regions to a shared RMA window.
     428
     429The semantic extension compared to the MPI standard is the evaluation of the following keys within the given MPI info object:
     430 * `psnam_manifestation`
     431 * `psnam_consistency`
     432 * `psnam_structure`
     433
     434The `psnam_manifestation` key specifies which memory type shall be used for a region.
     435The value for using the NAM is `psnam_manifestation_libnam`, but node-local persistent shared memory (`psnam_manifestation_pershm`) can also be chosen here as another supported manifestation.
     436In fact, each process in `comm` can even select a different manifestation of these two for the composition of the window.
     437
     438The `psnam_consistency` key specifies whether the memory regions of an RMA window shall be persistent (`psnam_consistency_persistent`) or whether they shall be released during the respective `MPI_Win_free()` call (`psnam_consistency_volatile`).
     439This key must be selected equally among all processes in `comm`.
     440
     441The `psnam_structure` key specifies the memory layout as formed by the multiple regions of an MPI window.
     442Currently, the following three different memory layouts are supported:
     443 * `psnam_structure_raw_and_flat`
     444 * `psnam_structure_managed_contiguous`
     445 * `psnam_structure_managed_distributed`
     446
     447The chosen memory layout also decides whether and how the PSNAM layer stores further meta data in the NAM regions to allow a later recreation of the structure while reconnecting to a persistent RMA window by another MPI session.
     448The chosen structure must be the same for all processes in `comm`.
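
The rules for the three info keys can be summarised in a small plain-C check. The key and value strings are those named above; everything else (the type `psnam_hints_t`, the function `psnam_hints_compatible()`, and the idea of having all per-rank values available in one array) is a hypothetical sketch. In an MPI program, these strings would be set on the info object, for example with `MPI_Info_set()`, before calling `MPI_Win_allocate()`.

```c
#include <string.h>

/* Hypothetical container for the values one rank passes for the three keys. */
typedef struct {
    const char *manifestation; /* value for key psnam_manifestation */
    const char *consistency;   /* value for key psnam_consistency */
    const char *structure;     /* value for key psnam_structure */
} psnam_hints_t;

static int one_of(const char *v, const char *const *allowed, int n)
{
    for (int i = 0; i < n; i++)
        if (v != NULL && strcmp(v, allowed[i]) == 0)
            return 1;
    return 0;
}

/* Returns 1 if the per-rank values follow the rules stated in the text:
 * the manifestation may differ between ranks, whereas consistency and
 * structure must be the same for all ranks. Returns 0 otherwise. */
int psnam_hints_compatible(const psnam_hints_t *h, int nranks)
{
    static const char *const manif[] = {
        "psnam_manifestation_libnam", "psnam_manifestation_pershm" };
    static const char *const cons[] = {
        "psnam_consistency_persistent", "psnam_consistency_volatile" };
    static const char *const strct[] = {
        "psnam_structure_raw_and_flat",
        "psnam_structure_managed_contiguous",
        "psnam_structure_managed_distributed" };

    if (nranks <= 0)
        return 0;

    for (int r = 0; r < nranks; r++) {
        if (!one_of(h[r].manifestation, manif, 2)) return 0;
        if (!one_of(h[r].consistency, cons, 2))    return 0;
        if (!one_of(h[r].structure, strct, 3))     return 0;
        if (strcmp(h[r].consistency, h[0].consistency) != 0) return 0;
        if (strcmp(h[r].structure, h[0].structure) != 0)     return 0;
    }
    return 1;
}
```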
     449
     450"''Raw and Flat''":
     451The `psnam_structure_raw_and_flat` layout is intended to store raw data (i.e. untyped data) in the NAM without adding meta information.
     452According to this layout, only rank 0 of `comm` is allowed to pass a size parameter greater than zero during the `MPI_Win_allocate()` call.
     453Hence, only rank 0 allocates one (contiguous) NAM region forming the window and all RMA operations on such a flat window have therefore to be addressed to target rank = 0.
     454
     455"''Managed Contiguous''":
     456In the `psnam_structure_managed_contiguous` case, also only rank 0 allocates (contiguous) NAM space, but this space is then subdivided according to the size parameters as passed by all processes in `comm`.
     457That means that here also processes with rank > 0 can pass a size > 0 and hence acquire a rank-addressable (sub-)region within this window.
     458Furthermore, the information about the number of processes and the respective region sizes forming that window is stored as meta data within the NAM.
     459That way, a subsequent MPI session re-connecting to this window can retrieve this information and hence recreate the former structure of the window.
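
As a rough illustration of the managed contiguous case, the subdivision of the single contiguous space can be thought of as a prefix sum over the sizes passed by all ranks. The function `contiguous_layout()` below is a hypothetical sketch; it ignores where and how PSNAM actually stores its meta data, which is not specified here.

```c
#include <stdint.h>

/* Given the size each rank passed, compute the start offset of each
 * rank's (sub-)region inside the contiguous space allocated by rank 0.
 * Returns the total number of bytes needed for all regions.
 * Space for meta data is deliberately left out (assumption). */
uint64_t contiguous_layout(const uint64_t *sizes, int nranks, uint64_t *bases)
{
    uint64_t total = 0;
    for (int r = 0; r < nranks; r++) {
        bases[r] = total;   /* region of rank r starts after all lower ranks */
        total += sizes[r];
    }
    return total;
}
```

A session that re-connects to a persistent window only needs the number of processes and the region sizes to recompute the same offsets, which matches the kind of information described above as meta data.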
     460
     461"''Managed Distributed''":
     462In a `psnam_structure_managed_distributed` window, each process that passes a size > 0 also allocates NAM memory explicitly and on its own.
     463It then contributes this memory as a NAM region to the RMA window so that the corresponding NAM allocation becomes directly addressable by the respective process rank.
     464The following figure illustrates the differences between these three structure layouts.
     465
     466