allowing the concurrent execution of communication tasks, and GPU processes '''offload''' the
compute-intensive tasks to the GPUs.

* `nbody.mpi.omp.bin`: Parallel version using MPI + OpenMP tasks. Both the '''computation''' and
  '''communication''' phases are '''taskified'''; however, the communication tasks (each one sending
  or receiving a block) are serialized by an artificial dependency on a sentinel variable. This
  prevents deadlocks between processes, since communication tasks perform '''blocking MPI''' calls.

* `nbody.mpi.omptarget.bin`: The same as the previous version, but '''offloading''' the tasks that
  compute the forces between particle blocks to the available GPUs. These computation tasks are
  offloaded by the '''GPU processes''', and they are the most compute-intensive parts of the program.
  This is done through the `omp target` directive, declaring the corresponding dependencies and
  adding the `nowait` clause (i.e., asynchronous offload). Additionally, the `target` directive does
  not require the user to provide a CUDA implementation of the offloaded task. Finally, since we are
  using the Unified Memory feature, we do not need to specify any data-movement clause. We only have
  to state that the memory buffers are already device pointers (i.e., with the `is_device_ptr`
  clause). '''Note:''' This version is not compiled by default since it is still in a ''Work in
  Progress'' state.