Changes between Version 11 and Version 12 of Public/User_Guide/TAMPI_NAM


Timestamp: Mar 3, 2021, 11:12:08 PM
Author: Kevin Sala

  • Public/User_Guide/TAMPI_NAM

    v11 v12  
    25 25  === Using NAM in Heat benchmark ===
    26 26
    27 (v11)  In this benchmark, we use the NAM memory to periodically save the computed matrix. The idea is to save the different states (snapshots) of the matrix during the execution in a persistent NAM memory region. Then, another program could retrieve all the matrix states, process them and produce a GIF animation showing the evolution of the heat during the whole execution. Notice that we cannot use simple RAM memory for that since the matrix could be huge and we may want to store tens of matrix snapshots. We also want the possibility of storing it in a persistent way, so other programs can process the stored data. Moreover, the memory should be easily accessible by the multiple MPI ranks or their tasks in parallel. The NAM memory fulfills all these conditions and ParaStationMPI allows accessing NAM regions through standard MPI RMA operations.
    27 (v12)  In this benchmark, we use the NAM memory to save the computed matrix periodically. The idea is to save different states (snapshots) of the matrix during the execution in a persistent NAM memory region. Then, another program could retrieve all the matrix snapshots, process them, and produce a GIF animation showing the heat's evolution throughout the execution. Notice that we cannot use regular RAM for this purpose because the matrix could be huge and we may want to store tens of matrix snapshots. We also want to keep the data persistent so that other programs can process it. Moreover, the memory should be easily accessible by multiple MPI ranks or their tasks in parallel. The NAM memory satisfies all these requirements, and as stated previously, ParaStationMPI allows accessing NAM allocations through standard MPI RMA operations.
    28 28
    29 (v11)  During the execution of the application and every few timesteps (specified by the user), the benchmark saves the whole matrix into a specific NAM subregion. Each timestep saving a matrix snapshot uses a distinct NAM subregion. These subregions are placed one after the other, consecutively, but without overlapping. Thus, the total size of the NAM region is the size of the whole matrix multiplied by the number of times the matrix will be saved. However, the NAM memory region is allocated using the Managed Contiguous layout (`psnam_structure_managed_contiguous`). This means that the rank 0 allocates the whole region but each rank acquires a consecutive memory subregion where it will store its blocks' data for all the spanshots. For instance, the NAM allocation will first have all the space for storing all snapshots of the blocks from rank 0, followed by the space for all snapshots of blocks from rank 1, and so on. Notice that the NAM subregions are addressed by the rank it belongs to, simplifying the task of saving and retrieving the snapshots.
    29 (v12)  The Heat benchmark allocates a single MPI window that holds the whole NAM region, shared by all ranks (via the `MPI_COMM_WORLD` communicator) throughout the execution. Every few timesteps (specified by the user), it saves the whole matrix into a specific NAM subregion. Each timestep that saves a matrix snapshot employs a distinct NAM subregion. These subregions are placed consecutively, one after the other, without overlapping. Thus, the entire NAM region's size is the full matrix size multiplied by the number of times the matrix will be saved. We allocate this NAM memory region using the Managed Contiguous layout (`psnam_structure_managed_contiguous`). This means that rank 0 allocates the whole region, but each rank acquires a contiguous memory subregion where it stores its blocks' data for every snapshot. For instance, the NAM allocation first holds the space for all snapshots of the blocks from rank 0, followed by the space for all snapshots of the blocks from rank 1, and so on. With this layout, each NAM subregion is addressed by the rank it belongs to, which simplifies saving and retrieving the snapshots.
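As a rough sketch of what such an allocation could look like (not the benchmark's actual code): only the `psnam_structure_managed_contiguous` layout name comes from this guide, while the `psnam_structure` info key, the helper name `allocateNamWindow`, and the per-rank size computation are assumptions for illustration; the exact keys and semantics are defined by ParaStationMPI.

{{{#!c
#include <mpi.h>

// Illustrative sketch: each rank requests room for all snapshots of its own
// blocks; the per-rank subregions end up placed consecutively, ordered by rank.
// 'rankBlockBytes' and 'numSnapshots' are hypothetical names.
MPI_Win allocateNamWindow(MPI_Aint rankBlockBytes, int numSnapshots)
{
    MPI_Info info;
    MPI_Info_create(&info);

    // Request the Managed Contiguous layout described above. The info key
    // name is an assumption; the layout value is the one named in this guide.
    MPI_Info_set(info, "psnam_structure", "psnam_structure_managed_contiguous");

    MPI_Aint mySize = rankBlockBytes * numSnapshots;

    void *base = NULL;
    MPI_Win namWin;
    MPI_Win_allocate(mySize, 1 /* displacement unit: bytes */, info,
                     MPI_COMM_WORLD, &base, &namWin);

    // The window is accessed later with standard RMA operations (MPI_Put),
    // not through the local pointer returned in 'base'.
    MPI_Info_free(&info);
    return namWin;
}
}}}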
    30 30
    31 (v11)  When there is a timestep that requires a snapshot, the application instantiates multiple tasks that save the matrix data into the corresponding NAM subregion. Each MPI rank creates a task for saving the data of each matrix block into the NAM subregion. These communication tasks do not have any data dependency between them, so they can run in parallel writing data to the NAM region using regular `MPI_Put`. Ranks only write to the subregions that belong to themselves, never in other ranks' subregions. Even so, all `MPI_Put` calls must be done inside an RMA access epoch, so there must be one fence call before all the `MPI_Put` calls and another one after them to close the epoch for each of the timesteps with snapshot. Here is where we use the new function `MPI_Win_ifence` together with the TAMPI non-blocking support. In this way, we can fully taskify both synchronization and writing of the NAM window, keeping the data-flow model, and without having to stop the parallelism (e.g., with a `taskwait`) to perform the snapshots. Thanks to the task data dependencies and TAMPI, we cleanly include the snapshots in the application's data-flow execution, as regular communication tasks with dependencies.
    31 (v12)  Then, when a timestep requires a snapshot, the application instantiates multiple tasks that save the matrix data into the corresponding NAM subregion. Each MPI rank creates a task for writing (copying) the data of each of its matrix blocks into the NAM subregion. These communication tasks do not have any data dependency between them, so they can write data to the NAM in parallel using regular `MPI_Put`. Ranks only write to their own subregions, never to other ranks' subregions. Nevertheless, all `MPI_Put` calls must happen inside an MPI RMA access epoch, so there must be a window fence call before all the `MPI_Put` calls and another one after them to close the epoch, for each snapshot timestep. Here is where we leverage the new `MPI_Win_ifence` function along with the TAMPI non-blocking support. In this way, we can fully taskify both the synchronization and the writing of the NAM window, keeping the data-flow model and without cutting the parallelism (e.g., with a `taskwait`) to perform the snapshots. Thanks to the task data dependencies and TAMPI, we cleanly include the snapshots in the application's data-flow execution as regular communication tasks with dependencies.
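A minimal sketch of one taskified snapshot epoch could look as follows. The dependency scheme, the sentinel variable, all helper and variable names, and the `MPI_Win_ifence` signature (taking an output `MPI_Request`) are assumptions for illustration; `TAMPI_Iwait` is the TAMPI non-blocking primitive that ties the completion of an MPI request to the completion of the calling task.

{{{#!c
#include <mpi.h>
#include <TAMPI.h>

// Illustrative sketch only; names and dependency scheme are assumptions,
// not the benchmark's actual code.
static MPI_Win namWin;    // NAM window shared by all snapshot tasks
static char namSentinel;  // dummy variable used only for task dependencies

// Taskified fence: TAMPI binds the fence request to the completion of the
// task, so successor tasks run only once the fence has actually completed.
static void namFenceAsync(void)
{
    #pragma oss task inout(namSentinel) label("nam fence")
    {
        MPI_Request req;
        MPI_Win_ifence(0, namWin, &req);  // assumed non-blocking fence extension
        TAMPI_Iwait(&req, MPI_STATUS_IGNORE);
    }
}

void namSaveSnapshot(double **blocks, int numBlocks, int blockElems,
                     MPI_Aint snapshotOffset, int myRank)
{
    namFenceAsync();  // open the RMA access epoch

    for (int b = 0; b < numBlocks; ++b) {
        // One task per block: read the block (input dependency) and copy it
        // into this rank's subregion of the NAM window with a regular MPI_Put.
        #pragma oss task in(blocks[b][0;blockElems]) in(namSentinel) label("nam put")
        {
            MPI_Aint disp = snapshotOffset
                          + (MPI_Aint) b * blockElems * sizeof(double);
            MPI_Put(blocks[b], blockElems, MPI_DOUBLE,
                    myRank /* rank-addressed subregion */, disp,
                    blockElems, MPI_DOUBLE, namWin);
        }
    }

    namFenceAsync();  // close the epoch after all puts have been issued
}
}}}

A window fence is collective over the window's communicator, so every rank instantiates the same pair of fence tasks for each snapshot timestep.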
    32 32
    33 (v11)  The following pseudo-code shows how the saving of snapshots work in `02.heat_itampi_ompss2_tasks.bin`:
    33 (v12)  The following pseudo-code shows how the saving of snapshots works in `02.heat_itampi_ompss2_tasks.bin`:
    34 34
    35 35  {{{#!c
           /* ... unmodified pseudo-code (lines 36-51) collapsed in this diff view ... */
    52 52  }}}
    53 53
    54 (v11)  The function above is the main procedure that executes all Heat application's timesteps applying the Gauss-Seidel method. This function is executed by all MPI ranks and each one works with their corresponding blocks from the matrix. In each timestep, the `gaussSeidelSolver` function instantiates all the computation and communication tasks that process the rank's blocks and exchanges the halo rows with the neighboring ranks. These tasks declare the proper input/output dependencies on the blocks they are reading/writing. Every some timesteps, the algorithm calls `namSaveMatrix` in order to perform a snapshot of the data computed after computing that timestep. Notice that `namSaveMatrix` will have to instantiate tasks with input dependencies on the blocks in order to perform the snapshot in the correct moment of the execution. Notice also that each snapshot is identified by the `namSnapshotId`, which will be used to know where the snapshot data should stored inside the NAM region. After all tasks from all timesteps have been instantiated, the application calls a taskwait to wait for the completion of all computation, communication and snapshot tasks.
    54 (v12)  The function above is the main procedure that executes the Heat benchmark's timesteps applying the Gauss-Seidel method. All MPI ranks run this function concurrently, each one working on its own blocks of the matrix. In each timestep, the `gaussSeidelSolver` function instantiates all the computation and communication tasks that process the rank's blocks and exchange the halo rows with the neighboring ranks. These tasks declare the proper input/output dependencies on the blocks they read/write. Every few timesteps, the algorithm calls `namSaveMatrix`, which issues tasks that take a snapshot of the data computed in that timestep. Notice that `namSaveMatrix` has to instantiate tasks with input dependencies on the blocks in order to perform the snapshot at the right moment of the execution, i.e., after the `gaussSeidelSolver` tasks from that timestep. Notice also that we identify each snapshot with the `namSnapshotId`, which tells us where to store the snapshot data inside the NAM window. The taskwait at the end of the algorithm is the only one in the whole execution; we do not close the parallelism anywhere else.
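As a tiny, hypothetical illustration of that mapping (continuing the sketches above; `rankBlockBytes` and the offset formula are assumptions), `namSnapshotId` could translate into a byte displacement inside the rank's NAM subregion as follows:

{{{#!c
/* Hypothetical sketch: with the Managed Contiguous layout, each rank's
 * subregion stores its snapshots back to back, so snapshot 'namSnapshotId'
 * could start at the following byte displacement inside that subregion. */
MPI_Aint snapshotOffset = (MPI_Aint) namSnapshotId * rankBlockBytes;

/* ...which the snapshot tasks would then use as the base displacement for
 * their MPI_Put calls (see the namSaveSnapshot() sketch above). */
}}}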
    55 55
    56 56  {{{#!c