369 | | More to come... |
| 369 | |
| 370 | === Introduction === |
| 371 | |
| 372 | One distinct feature of the DEEP-EST prototype is the Network Attached Memory (NAM): |
| 373 | special memory regions that can be accessed directly via !Put/Get operations from any node within the Extoll network. |
| 374 | To execute such RMA operations on the NAM, a new version of libNAM is available to users; it provides a corresponding low-level API. |
| 375 | However, to make this programming more convenient, and in particular to support parallel access to shared NAM data by multiple processes, an MPI extension with corresponding wrapper functionality for the libNAM API has also been developed in the DEEP-EST project. |
| 376 | |
| 377 | This extension, called PSNAM, is a complementary part of !ParaStation MPI (the MPI library of choice in the DEEP-EST project) and is therefore also available to users on the DEEP-EST prototype system. |
| 378 | It allows application programmers to use familiar MPI functions (especially those of the MPI RMA interface) to access NAM regions in a standardized (or at least harmonized) way within the familiar MPI environment. |
| 379 | In doing so, the PSNAM extensions try to stick to the current MPI standard as closely as possible and to avoid introducing new API functions wherever possible. |
| 380 | |
| 381 | ---- |
| 382 | **Attention:** ''Currently (as of May 2021), changes to the DEEP-EST system are foreseeable, which will also affect the availability of libNAM and the SW-NAM mockup. |
| 383 | The following text still reflects the current state and will soon be adapted accordingly.'' |
| 384 | ---- |
| 385 | |
| 386 | === Acquiring NAM Memory === |
| 387 | |
| 388 | ==== General Semantics ==== |
| 389 | |
| 390 | The main issue when mapping the MPI RMA interface onto the libNAM API is that MPI assumes every target memory region of an RMA operation to be associated with an MPI process that owns that memory. In an MPI world, remote memory regions are therefore always addressed by a process rank (plus a handle, which is the respective window object, plus an offset), whereas the libNAM API merely requires an opaque handle for the respective NAM region (plus an offset). Consequently, a mapping between remote MPI ranks and remote NAM memory has to be established. PSNAM achieves this by retaining the notion of ownership: specific regions of the NAM memory space are logically assigned to particular MPI ranks. It has to be emphasised, however, that this is a purely software-based mapping performed by the PSNAM wrapper layer. The related MPI window regions, although globally accessible and located within the NAM, must then be addressed by the rank of the process to which the NAM region is assigned. |
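The rank-based addressing described above can be pictured as a small lookup step. The following is only a sketch of the idea, not the PSNAM implementation: `nam_handle_t`, `region_t`, `nam_addr_t` and `translate()` are hypothetical names introduced for illustration, and the real libNAM handle type is not shown in this text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the opaque libNAM handle (not the real type). */
typedef int nam_handle_t;

/* One window region: the NAM space logically assigned to one MPI rank. */
typedef struct {
    nam_handle_t handle;    /* NAM allocation backing this region */
    size_t       base;      /* byte offset of the region in that allocation */
    size_t       size;      /* region size in bytes */
    int          disp_unit; /* displacement unit of the owning rank, in bytes */
} region_t;

/* Address form that a handle-plus-offset API needs. */
typedef struct {
    nam_handle_t handle;
    size_t       offset;
} nam_addr_t;

/* Translate an MPI-style target (rank, displacement) into (handle, offset).
 * Returns 0 on success, -1 if the rank is invalid or the access of `len`
 * bytes would leave the region assigned to that rank. */
static int translate(const region_t *regions, int nranks, int target_rank,
                     size_t target_disp, size_t len, nam_addr_t *out)
{
    assert(regions != NULL && out != NULL);
    if (target_rank < 0 || target_rank >= nranks)
        return -1;

    const region_t *r = &regions[target_rank];
    size_t byte_disp = target_disp * (size_t)r->disp_unit;
    if (byte_disp > r->size || len > r->size - byte_disp)
        return -1;

    out->handle = r->handle;
    out->offset = r->base + byte_disp;
    return 0;
}
```

The point of the sketch is that the rank acts purely as an index into a software-side table; the NAM itself only ever sees the handle and the byte offset.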
| 391 | |
| 392 | ==== Semantic Terms ==== |
| 393 | |
| 394 | At this point, the terms memory ''allocation'', memory ''region'' and memory ''segment'' need to be defined for their use within this proposal. |
| 395 | The reason is that, for example, the term "allocation" is commonly used both for a resource granted by the job scheduler and for a memory region returned, e.g., by `malloc`. |
| 396 | Therefore, we need a stricter nomenclature here: |
| 397 | |
| 398 | "''NAM Memory Allocation''": |
| 399 | A certain amount of contiguous NAM memory space that has been requested from the NAM Manager (and possibly granted through the job scheduler) for an MPI session. |
| 400 | |
| 401 | "''NAM Memory Segment''": |
| 402 | A certain amount of contiguous NAM memory space that is part of a NAM allocation. |
| 403 | Accordingly, the PSNAM wrapper layer can logically subdivide a NAM allocation into multiple memory segments, which can in turn be assigned to MPI RMA windows. |
| 404 | |
| 405 | "''NAM Memory Region''": |
| 406 | A certain amount of contiguous NAM memory space that is associated with a certain MPI rank in the context of an MPI RMA window. |
| 407 | |
| 408 | For performance and management reasons, allocation requests towards the NAM and/or the resource manager should occur rarely, for instance only once at the beginning of an MPI session. |
| 409 | To let MPI applications handle multiple MPI RMA windows within such an allocation, PSNAM implements a further layer of memory management that allows NAM segments to be logically acquired and released within the limits of the granted allocation. |
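To make the idea of acquiring and releasing segments inside a single granted allocation more concrete, here is a minimal first-fit sketch. It is an illustration under stated assumptions only: `allocation_t`, `segment_t`, `segment_acquire()` and `segment_release()` are hypothetical names, and the actual PSNAM bookkeeping may work quite differently.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SEGMENTS 16

typedef struct {
    size_t offset;  /* start of the segment within the allocation */
    size_t size;    /* segment length in bytes */
    int    in_use;  /* non-zero while the segment is acquired */
} segment_t;

typedef struct {
    size_t    total;              /* size of the granted NAM allocation */
    segment_t seg[MAX_SEGMENTS];  /* bookkeeping slots, in no particular order */
} allocation_t;

/* Does [off, off + size) intersect any segment that is currently in use? */
static int overlaps(const allocation_t *a, size_t off, size_t size)
{
    for (int i = 0; i < MAX_SEGMENTS; i++) {
        const segment_t *s = &a->seg[i];
        if (s->in_use && off < s->offset + s->size && s->offset < off + size)
            return 1;
    }
    return 0;
}

/* Acquire `size` bytes at the lowest free offset; returns a slot or -1. */
static int segment_acquire(allocation_t *a, size_t size)
{
    assert(a != NULL);
    if (size == 0 || size > a->total)
        return -1;

    int slot = -1;
    for (int i = 0; i < MAX_SEGMENTS && slot < 0; i++)
        if (!a->seg[i].in_use)
            slot = i;
    if (slot < 0)
        return -1;

    /* Candidate start offsets: 0 and the end of every segment in use. */
    size_t best = 0;
    int found = 0;
    for (int i = -1; i < MAX_SEGMENTS; i++) {
        size_t cand;
        if (i < 0)
            cand = 0;
        else if (a->seg[i].in_use)
            cand = a->seg[i].offset + a->seg[i].size;
        else
            continue;
        if (cand > a->total - size || overlaps(a, cand, size))
            continue;
        if (!found || cand < best) {
            best = cand;
            found = 1;
        }
    }
    if (!found)
        return -1;

    a->seg[slot] = (segment_t){ best, size, 1 };
    return slot;
}

/* Release a segment; no request is sent to the NAM or resource manager. */
static void segment_release(allocation_t *a, int slot)
{
    assert(a != NULL && slot >= 0 && slot < MAX_SEGMENTS);
    a->seg[slot].in_use = 0;
}
```

Acquiring and releasing here only touches local bookkeeping, which is the property the text asks for: the expensive allocation request happens once, and windows come and go within it.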
| 410 | |
| 411 | |
| 412 | ==== Interface Specification ==== |
| 413 | |
| 414 | To associate memory regions on the NAM with MPI RMA windows, a semantic extension to the well-known `MPI_Win_allocate()` function can be used via its MPI info parameter: |
| 415 | |
| 416 | {{{ |
| 417 | MPI_Win_allocate(size, disp_unit, info, comm, baseptr, win) |
| 418 |   IN   size        size of memory region in bytes (non-negative integer, may differ between processes) |
| 419 |   IN   disp_unit   local unit size for displacements, in bytes (positive integer) |
| 420 |   IN   info        info argument (handle) with psnam info keys and values |
| 421 |   IN   comm        intra-communicator (handle) |
| 422 |   OUT  baseptr     always NULL in case of PSNAM windows |
| 423 |   OUT  win         window object returned by the call (handle) |
| 424 | }}} |
| 425 | |
| 426 | `MPI_Win_allocate()` is a collective call to be executed by all processes in the group of `comm`. |
| 427 | This in turn enables the PSNAM wrapper layer to treat the set of allocated memory regions as an entity and logically link the regions to a shared RMA window. |
| 428 | |
| 429 | The semantic extension compared to the MPI standard is the evaluation of the following keys within the given MPI info object: |
| 430 | * `psnam_manifestation` |
| 431 | * `psnam_consistency` |
| 432 | * `psnam_structure` |
| 433 | |
| 434 | The `psnam_manifestation` key specifies which memory type shall be used for a region. |
| 435 | The value for using the NAM is `psnam_manifestation_libnam`; node-local persistent shared memory (`psnam_manifestation_pershm`) is also supported as an alternative manifestation. |
| 436 | In fact, each process in `comm` can even select a different manifestation of these two for the composition of the window. |
| 437 | |
| 438 | The `psnam_consistency` key specifies whether the memory regions of an RMA window shall be persistent (`psnam_consistency_persistent`) or whether they shall be released during the respective `MPI_Win_free()` call (`psnam_consistency_volatile`). |
| 439 | The value of this key must be the same for all processes in `comm`. |
| 440 | |
| 441 | The `psnam_structure` key specifies the memory layout as formed by the multiple regions of an MPI window. |
| 442 | Currently, the following three different memory layouts are supported: |
| 443 | * `psnam_structure_raw_and_flat` |
| 444 | * `psnam_structure_managed_contiguous` |
| 445 | * `psnam_structure_managed_distributed` |
| 446 | |
| 447 | The chosen memory layout also determines whether and how the PSNAM layer stores additional metadata in the NAM regions, so that the structure can be recreated later when another MPI session reconnects to a persistent RMA window. |
| 448 | The chosen structure must be the same for all processes in `comm`. |
| 449 | |
| 450 | "''Raw and Flat''": |
| 451 | The `psnam_structure_raw_and_flat` layout is intended to store raw data (i.e. untyped data) in the NAM without adding meta information. |
| 452 | With this layout, only rank 0 of `comm` is allowed to pass a size parameter greater than zero in the `MPI_Win_allocate()` call. |
| 453 | Hence, only rank 0 allocates one (contiguous) NAM region forming the window, and all RMA operations on such a flat window must therefore be addressed to target rank 0. |
| 454 | |
| 455 | "''Managed Contiguous''": |
| 456 | In the `psnam_structure_managed_contiguous` case, rank 0 is again the only process that allocates (contiguous) NAM space, but this space is then subdivided according to the size parameters passed by all processes in `comm`. |
| 457 | This means that processes with rank > 0 can also pass a size > 0 and thereby acquire a rank-addressable (sub-)region within this window. |
| 458 | Furthermore, the number of processes and the respective region sizes forming the window are stored as metadata within the NAM. |
| 459 | That way, a subsequent MPI session re-connecting to this window can retrieve this information and hence recreate the former structure of the window. |
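The subdivision of one contiguous NAM space by per-rank sizes amounts to a running sum. The sketch below shows that arithmetic only; `layout_contiguous()` is a hypothetical helper, and reserving `meta_bytes` at the start of the space for the metadata is an assumption made for illustration, since the text does not say where PSNAM places it.

```c
#include <assert.h>
#include <stddef.h>

/* Compute, for each rank, the byte offset of its sub-region inside the
 * single contiguous NAM space allocated by rank 0.
 * `meta_bytes` is a hypothetical amount reserved up front for metadata.
 * Returns the total number of bytes that would have to be allocated. */
static size_t layout_contiguous(const size_t *sizes, int nranks,
                                size_t meta_bytes, size_t *offsets)
{
    assert(sizes != NULL && offsets != NULL && nranks > 0);

    size_t next = meta_bytes;
    for (int r = 0; r < nranks; r++) {
        offsets[r] = next;   /* region of rank r starts here */
        next += sizes[r];    /* ranks with size 0 occupy no space */
    }
    return next;
}
```

Because the offsets follow deterministically from the process count and the region sizes, storing just those two pieces of information is enough for a later session to rebuild the same layout.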
| 460 | |
| 461 | "''Managed Distributed''": |
| 462 | In a `psnam_structure_managed_distributed` window, each process that passes a size > 0 also allocates NAM memory explicitly and on its own. |
| 463 | It then contributes this memory as a NAM region to the RMA window so that the corresponding NAM allocation becomes directly addressable by the respective process rank. |
| 464 | The following figure illustrates the differences between these three structure layouts. |
| 465 | |
| 466 | |