Changes between Version 9 and Version 10 of Public/User_Guide/TAMPI_NAM


Timestamp: Mar 3, 2021, 8:29:53 PM
Author: Kevin Sala

During the execution of the application, every few timesteps (specified by the user), the benchmark saves the whole matrix into a specific NAM subregion. Each timestep that saves a matrix snapshot uses a distinct NAM subregion. These subregions are placed consecutively, one after the other, without overlapping. Thus, the total size of the NAM region is the size of the whole matrix multiplied by the number of times the matrix will be saved. However, the NAM memory region is allocated using the Managed Contiguous layout (`psnam_structure_managed_contiguous`). This means that rank 0 allocates the whole region, but each rank acquires a consecutive memory subregion where it stores its blocks' data for all the snapshots. For instance, the NAM allocation first holds the space for all snapshots of the blocks from rank 0, followed by the space for all snapshots of the blocks from rank 1, and so on. Notice that the NAM subregions are addressed by the rank they belong to, which simplifies saving and retrieving the snapshots.
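
As a rough illustration of this layout, the following helper computes where a block of the current rank lands inside its own NAM subregion; the function name and parameters are illustrative and are not taken from the benchmark source.

{{{#!c
#include <stddef.h>

// Each rank owns one contiguous NAM subregion that stores all of its
// snapshots back to back. Within that subregion, snapshot s starts at
// s * (bytes the rank saves per snapshot), and block b of that snapshot
// follows at b * blockBytes.
static size_t namBlockOffset(size_t snapshotId, size_t localBlockId,
                             size_t numLocalBlocks, size_t blockBytes)
{
    size_t rankBytesPerSnapshot = numLocalBlocks * blockBytes;
    return snapshotId * rankBytesPerSnapshot + localBlockId * blockBytes;
}
}}}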

When there is a timestep that requires a snapshot, the application instantiates multiple tasks that save the matrix data into the corresponding NAM subregion. Each MPI rank creates a task for saving the data of each matrix block into the NAM subregion. These communication tasks do not have any data dependency between them, so they can run in parallel, writing data to the NAM region using regular `MPI_Put`. Ranks only write to the subregions that belong to themselves, never to other ranks' subregions. Even so, all `MPI_Put` calls must be issued inside an RMA access epoch, so there must be one fence call before all the `MPI_Put` calls and another one after them to close the epoch for each of the timesteps with a snapshot. Here is where we use the new function `MPI_Win_ifence` together with the TAMPI non-blocking support. In this way, we can fully taskify both the synchronization and the writing of the NAM window, keeping the data-flow model and without having to stop the parallelism (e.g., with a `taskwait`) to perform the snapshots. Thanks to the task data dependencies and TAMPI, we cleanly include the snapshots in the application's data-flow execution, as regular communication tasks with dependencies.

The following pseudo-code shows how the saving of snapshots works in `02.heat_itampi_ompss2_tasks.bin`:
     
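The listing below is a condensed sketch of that structure rather than the verbatim benchmark source: `gaussSeidelSolver`, `namSaveMatrix`, `namSnapshotId` and the final `taskwait` follow the description that accompanies it, while names such as `solve`, `HeatConfiguration`, `snapshotFreq` and the exact argument lists are illustrative.

{{{#!c
// Main solver loop (sketch). All MPI ranks execute it, each one working on
// its own blocks of the matrix.
void solve(HeatConfiguration *conf, int timesteps, int snapshotFreq)
{
    int namSnapshotId = 0;

    for (int t = 0; t < timesteps; ++t) {
        // Instantiates the computation tasks for the rank's blocks and the
        // communication tasks that exchange halo rows with the neighboring
        // ranks. All tasks declare dependencies on the blocks they read/write.
        gaussSeidelSolver(conf, t);

        // Every few timesteps, instantiate the tasks that save a snapshot of
        // the rank's blocks into the corresponding NAM subregion.
        if ((t + 1) % snapshotFreq == 0) {
            namSaveMatrix(conf, namSnapshotId);
            namSnapshotId++;
        }
    }

    // Wait for all computation, communication and snapshot tasks.
    #pragma oss taskwait
}
}}}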

The function above is the main procedure that executes all the timesteps of the Heat application, applying the Gauss-Seidel method. This function is executed by all MPI ranks, and each one works with its corresponding blocks of the matrix. In each timestep, the `gaussSeidelSolver` function instantiates all the computation and communication tasks that process the rank's blocks and exchange the halo rows with the neighboring ranks. These tasks declare the proper input/output dependencies on the blocks they read/write. Every few timesteps, the algorithm calls `namSaveMatrix` to take a snapshot of the data computed in that timestep. Notice that `namSaveMatrix` has to instantiate tasks with input dependencies on the blocks in order to perform the snapshot at the correct moment of the execution. Notice also that each snapshot is identified by `namSnapshotId`, which is used to determine where the snapshot data should be stored inside the NAM region. After all tasks from all timesteps have been instantiated, the application calls a `taskwait` to wait for the completion of all computation, communication and snapshot tasks.

{{{#!c
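// Sketch of the snapshot routine described below; it is not the verbatim code
// of 02.heat_itampi_ompss2_tasks.bin. MPI_Win_ifence is assumed to take the
// same arguments as MPI_Win_fence plus an output MPI_Request, and the NAM
// window is assumed to use a displacement unit of sizeof(double). Names such
// as M, namWin, numBlocks and blockElems are illustrative.
#include <mpi.h>
#include <TAMPI.h>

void namSaveMatrix(double *M, int numBlocks, int blockElems,
                   MPI_Win *namWin, int rank, int namSnapshotId)
{
    // Displacement of this snapshot inside the rank's NAM subregion
    const MPI_Aint snapshotDisp =
        (MPI_Aint) namSnapshotId * numBlocks * blockElems;

    // Open the RMA access epoch once all the rank's blocks are ready to be read
    #pragma oss task in(M[0;numBlocks*blockElems]) inout(*namWin)
    {
        MPI_Request request;
        MPI_Win_ifence(0, *namWin, &request);
        // Bind the task's completion to the fence; returns immediately
        TAMPI_Iwait(&request, MPI_STATUS_IGNORE);
    }

    // One writer task per block; they can run in parallel inside the epoch
    for (int b = 0; b < numBlocks; ++b) {
        #pragma oss task in(M[b*blockElems;blockElems]) in(*namWin)
        {
            const MPI_Aint targetDisp = snapshotDisp + (MPI_Aint) b * blockElems;
            MPI_Put(&M[b * blockElems], blockElems, MPI_DOUBLE,
                    rank, targetDisp, blockElems, MPI_DOUBLE, *namWin);
        }
    }

    // Close the epoch after all writer tasks have finished
    #pragma oss task inout(*namWin)
    {
        MPI_Request request;
        MPI_Win_ifence(0, *namWin, &request);
        TAMPI_Iwait(&request, MPI_STATUS_IGNORE);
    }
}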
     
}}}

The function above is the one called periodically from the main function. It instantiates the tasks that perform the snapshot of the current rank's blocks into the corresponding NAM memory subregion. The first step is to compute the offset of the current snapshot inside the NAM region using the snapshot identifier. Before writing to the NAM window, the application must ensure that an MPI RMA access epoch has been opened on that window; that is what the first task does. After all blocks are ready to be read (see the task dependencies), the task can run: it executes an `MPI_Win_ifence` to start opening the epoch, which generates an MPI request, and then binds its own completion to the finalization of that request by calling `TAMPI_Iwait`. This last call is non-blocking and asynchronous, so the fence operation may not be completed when it returns. The task can finish its execution, but it will not complete until the fence operation finishes. Once it finishes, TAMPI automatically completes the task and makes its successor tasks ready. The successors of the fence task are the ones that perform the actual writing of data to the NAM memory by calling `MPI_Put`. All blocks can be saved to the NAM memory in parallel through different tasks. The source of each `MPI_Put` is the block itself (in regular RAM) and the destination is the place where the block must be written in the NAM memory. After all writer tasks have finished, the task responsible for closing the MPI RMA access epoch on the NAM window can start; it behaves similarly to the opening task.

Notice that all tasks declare the proper dependencies on the matrix blocks and the NAM window to guarantee the correct order of execution. Thanks to these data dependencies and the TAMPI non-blocking feature, we can cleanly add the execution of the snapshots to the task graph, so that they are executed asynchronously and naturally interleaved with the other computation and communication tasks. Finally, it is worth noting that the blocks are written to the NAM memory in parallel, which helps to efficiently utilize the CPU and network resources of the machine.

=== Requirements ===