Changes between Version 20 and Version 21 of Public/User_Guide/Offloading_hybrid_apps


Timestamp: Sep 18, 2019, 12:05:17 AM
Author: Kevin Sala

  • Public/User_Guide/Offloading_hybrid_apps

    v20 → v21

 Moreover, some of the most modern GPU devices, such as NVIDIA Tesla V100,
-support the '''Unified Memory''' which facilitates the task of the users. With those
-devices, users do '''not have to move or copy the data to/from GPUs''', and also,
-'''pointers at the host are the same at the device'''.
-
-For these reasons, parallel applications should take benefit from these GPU
-resources, and they should try to '''offload''' the most compute-intensive parts of
-to the available GPUs. In this page, we briefly explain the approaches proposed by
+support the '''Unified Memory''', which facilitates the task of application
+developers. With those devices, users do '''not have to move or copy the data
+to/from GPUs''', and also, '''pointers at the host are the same at the device'''.
+
+For these reasons, developers should take advantage of these GPU resources by trying
+to '''offload''' the most compute-intensive parts of the applications to the available
+GPUs. In this page, we briefly explain the approaches proposed by
 the '''OpenMP''' and the '''!OmpSs-2''' programming models to facilitate the offloading
-of computation tasks to Unified Memory GPUs. Then, we show an hybrid application with
-'''MPI+OpenMP''' and '''MPI+!OmpSs-2''' variants that offloads some of the computation
-tasks, and that can be executed on the DEEP system.
-
-On the one hand, '''OpenMP''' provides the `target` directive which is the one used for
-offloading computational parts of OpenMP programs to the GPUs. It provides multiple
+of computation tasks to Unified Memory GPUs. Then, we show a hybrid application that
+has '''MPI+OpenMP''' and '''MPI+!OmpSs-2''' variants which offload some computation tasks.
+
+On the one hand, '''OpenMP''' provides the `target` directive, which is the one used for
+offloading computational parts of OpenMP programs to the GPUs. It provides multiple
 clauses to specify the '''copy directionality''' of the data, how the computational workload
 is '''distributed''', the '''data dependencies''', etc. The user can annotate a part of the
-program inside using the `target` directive and '''without having to program it with
+program using the `target` directive and '''without having to program it with
 a special programming language'''. For instance, when offloading a part of the program to
-NVIDIA GPUs, the user is not required to provide any implementation in CUDA. That part
+NVIDIA GPUs, the user is not required to provide any CUDA kernel. That part
 of the program is handled transparently by the compiler.

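As a minimal illustration of the `target` directive (the function and sizes are hypothetical, not taken from the page), a plain C loop can be offloaded as follows; the `map(...)` clauses state the copy directionality and the `teams distribute parallel for` construct distributes the workload, with no CUDA code written by the user:

{{{
/* Hypothetical example: a SAXPY loop offloaded with the OpenMP target
   directive. The compiler generates the device code transparently. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
}}}

On a Unified Memory device, the explicit `map(...)` clauses may even become unnecessary (e.g., under the OpenMP 5.0 `requires unified_shared_memory` declaration).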
     
 On the other hand, '''!OmpSs-2''' supports the offloading of tasks to CUDA
 devices. CUDA kernels can be annotated as '''regular tasks''', and they can declare the
 corresponding '''data dependencies''' on the data buffers. When all the dependencies of a
-CUDA task are satisfied, the CUDA kernel associated to the task is '''automatically''' and
+CUDA task are satisfied, the CUDA kernel associated with the task is '''automatically''' and
 '''asynchronously offloaded''' to one of the available GPUs. To use that functionality, the
 user only has to allocate the buffers that CUDA kernels will access as Unified Memory buffers
-(i.e., using the `cudaMallocManaged()` function).
+(i.e., using the `cudaMallocManaged()` function). Additionally, users must annotate the CUDA
+tasks with the `device(cuda)` and `ndrange(...)` clauses.

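For illustration, a minimal, hypothetical sketch of an !OmpSs-2 CUDA task following the rules above (names and sizes are assumptions, not from the page): the CUDA kernel is declared as a regular task with its data dependencies plus the `device(cuda)` and `ndrange(...)` clauses, and its buffers are allocated with `cudaMallocManaged()`:

{{{
/* saxpy.h: hypothetical task declaration visible to the host code.
   ndrange(1, n, 128): a 1-D kernel with n total threads in blocks of 128. */
#pragma oss task in(x[0;n]) inout(y[0;n]) device(cuda) ndrange(1, n, 128)
__global__ void saxpy_kernel(int n, float a, const float *x, float *y);

/* saxpy.cu: the kernel itself is plain CUDA. */
__global__ void saxpy_kernel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* Host side: the kernel is invoked like a regular function; the runtime
   offloads it asynchronously once its dependencies are satisfied. */
void saxpy(int n, float a)
{
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  /* Unified Memory buffers */
    cudaMallocManaged(&y, n * sizeof(float));
    /* ... initialize x and y on the host ... */

    saxpy_kernel(n, a, x, y);

    #pragma oss taskwait  /* wait for the offloaded task to complete */
}
}}}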
 == N-Body Benchmark ==
…
     transforming the high-level directives into a parallelized version of the application. The Nanos6
     runtime system provides the services to manage all the parallelism in the application (e.g., task
-    creation, synchronization, scheduling, etc). Downloads at [https://github.com/bsc-pm].
+    creation, synchronization, scheduling, etc.). Downloads at [https://github.com/bsc-pm].

   * A derivative '''Clang + LLVM OpenMP''' that supports the non-blocking mode of TAMPI. Not released yet.

   * The '''CUDA''' tools and NVIDIA '''Unified Memory''' devices for enabling the CUDA variants, in which some of
     the N-body kernels are executed on the available GPU devices.

 === Versions ===
…

   * `nbody.mpi.ompss2.bin`: Parallel version using '''MPI + !OmpSs-2 tasks'''. Both '''computation'''
-    and '''communication''' phases are '''taskified''', however, communication tasks (each one sending
+    and '''communication''' phases are '''taskified'''. However, communication tasks (each one sending
     or receiving a block) are serialized by an artificial dependency on a sentinel variable. This
-    is to prevent deadlocks between processes, since communication tasks perform '''blocking MPI'''
+    is to prevent deadlocks between processes since communication tasks perform '''blocking MPI'''
     calls.

   * `nbody.mpi.ompss2.cuda.bin`: The same as the previous version but '''offloading''' the tasks that
-    compute the forces between particles blocks to the available GPUs. Those computation tasks are
-    offloaded by the '''GPU processes''' and they are the most compute-intensive parts of the program.
+    compute the forces between particle blocks to the available GPUs. The GPU processes offload those
+    computation tasks, which are the most compute-intensive parts of the program.
     The `calculate_forces_block_cuda` task is annotated as a regular task (e.g., with its
     dependencies) but implemented in '''CUDA'''. However, since it is Unified Memory, the user '''does
     not have to''' copy the data to/from the device.

   * `nbody.tampi.ompss2.bin`: Parallel version using '''MPI + !OmpSs-2 tasks + TAMPI''' library. This
-    version disables the artificial dependencies on the sentinel variable, so communication tasks can
-    run in parallel and overlap computations. The TAMPI library is in charge of managing the '''blocking
+    version disables the artificial dependencies on the sentinel variable so that communication tasks can
+    run in parallel and overlap with computations. The TAMPI library is in charge of managing the '''blocking
     MPI''' calls to avoid the blocking of the underlying execution resources.

…

   * `nbody.mpi.omp.bin`: Parallel version using '''MPI + OpenMP tasks'''. Both '''computation''' and
-    '''communication''' phases are '''taskified''', however, communication tasks (each one sending
+    '''communication''' phases are '''taskified'''. However, communication tasks (each one sending
     or receiving a block) are serialized by an artificial dependency on a sentinel variable. This
-    is to prevent deadlocks between processes, since communication tasks perform '''blocking MPI'''
+    is to prevent deadlocks between processes since communication tasks perform '''blocking MPI'''
     calls.

   * `nbody.mpi.omptarget.bin`: The same as the previous version but '''offloading''' the tasks that
-    compute the forces between particles blocks to the available GPUs. Those computation tasks are
-    offloaded by the '''GPU processes''' and they are the most compute-intensive parts of the program.
+    compute the forces between particle blocks to the available GPUs. The GPU processes offload those
+    computation tasks, which are the most compute-intensive parts of the program.
     This is done through the `omp target` directive, declaring the corresponding dependencies, and
     specifying the `target` as `nowait` (i.e., asynchronous offload). Additionally, the target directive
…

   * `nbody.tampi.omp.bin`: Parallel version using '''MPI + OpenMP tasks + TAMPI''' library. This
-    version disables the artificial dependencies on the sentinel variable, so communication tasks can
+    version disables the artificial dependencies on the sentinel variable so that communication tasks can
     run in parallel and overlap computations. Since OpenMP only supports the non-blocking mechanism of
     TAMPI, this version leverages non-blocking primitive calls. In this way, the TAMPI library is in charge
-    of managing the '''non-blocking MPI''' operations to efficiently overlap communication and computation
-    tasks.
+    of managing the '''non-blocking MPI''' operations to overlap communication and computation tasks
+    efficiently.

   * `nbody.tampi.omptarget.bin`: A mix of the previous two variants where '''TAMPI''' is leveraged for