* The '''GNU''' or '''Intel®''' Compiler Collection.

* A '''Message Passing Interface (MPI)''' implementation supporting the '''multi-threading''' level
of thread support (see the initialization sketch after this list).

* The '''Task-Aware MPI (TAMPI)''' library, which defines a clean '''interoperability''' mechanism
for MPI and OpenMP/!OmpSs-2 tasks. It supports both blocking and non-blocking MPI operations
by providing two different interoperability mechanisms. Downloads and more information at
[https://github.com/bsc-pm/tampi].

* The '''!OmpSs-2''' model, which is the second generation of the '''!OmpSs''' programming model. It is
a '''task-based''' programming model that originated from the ideas of the OpenMP and !StarSs programming
models. The specification and user-guide are available at [https://pm.bsc.es/ompss-2-docs/spec/] and
[https://pm.bsc.es/ompss-2-docs/user-guide/], respectively. The '''Nanos6''' runtime system provides
the services to manage all the parallelism in the application (e.g., task creation, synchronization,
scheduling, etc.). Downloads at [https://github.com/bsc-pm].

* A derivative '''Clang + LLVM OpenMP''' that supports the non-blocking mode of TAMPI. Not released yet.

* The '''CUDA''' tools and NVIDIA '''Unified Memory''' devices for enabling the CUDA variants, in which
some of the computation tasks are offloaded to the available GPUs.
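
As a quick reference, below is a minimal sketch (illustrative, not part of the benchmark) of how an
application can request and verify the multi-threading level of thread support at initialization:

{{{#!c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* Request the multi-threading level so that several tasks/threads
     * may call MPI concurrently. Note: according to the TAMPI
     * documentation, its blocking mode is enabled by requesting an
     * extended level (MPI_TASK_MULTIPLE, declared in TAMPI.h) instead. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "Error: MPI_THREAD_MULTIPLE not supported!\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... application code ... */

    MPI_Finalize();
    return 0;
}
}}}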
The N-Body application has several versions, which are built into different binaries. All
of them divide the particle space into smaller blocks. MPI processes are divided into
two groups: GPU processes and CPU processes. GPU processes are responsible for computing
the forces between each pair of particle blocks; these forces are then sent to the
CPU processes, where each process updates its particle blocks using the received forces.
The particle and force blocks are equally distributed among the MPI processes of each
group. Thus, each MPI process is in charge of computing the forces or updating the particles
of a consecutive chunk of blocks.
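
For illustration only, here is a sketch (with hypothetical names, not the benchmark's actual code)
of how each process could determine its consecutive chunk of blocks within its group, assuming the
number of blocks is a multiple of the group size:

{{{#!c
/* Hypothetical sketch: 'group_rank' and 'group_size' are the rank of
 * this process within its group (GPU or CPU) and the group's size.
 * Assumes 'num_blocks' is a multiple of 'group_size'. The owned chunk
 * is the half-open range [*first, *last). */
void block_chunk(int group_rank, int group_size, int num_blocks,
                 int *first, int *last)
{
    int per_proc = num_blocks / group_size;
    *first = group_rank * per_proc;
    *last = *first + per_proc;
}
}}}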
* `nbody.mpi.bin`: Simple parallel version using '''blocking MPI''' primitives for sending and
receiving each block of particles/forces.

* `nbody.mpi.ompss2.bin`: Parallel version using MPI + !OmpSs-2 tasks. Both '''computation''' and
'''communication''' phases are '''taskified'''; however, communication tasks (each one sending
or receiving a block) are serialized by an artificial dependency on a sentinel variable (see the
first sketch after this list). This prevents deadlocks between processes, since communication
tasks perform '''blocking MPI''' calls.

* `nbody.mpi.ompss2.cuda.bin`: The same as the previous version but '''offloading''' the tasks that
compute the forces between particle blocks to the available GPUs. Those computation tasks are
offloaded by the '''GPU processes''' and are the most compute-intensive parts of the program.
The `calculate_forces_block_cuda` task is annotated as a regular task (e.g., with its
dependencies) but implemented in '''CUDA'''. However, since the application uses Unified Memory,
the user '''does not need to move the data to/from the GPU''' device.

* `nbody.tampi.ompss2.bin`: Parallel version using MPI + !OmpSs-2 tasks + '''TAMPI''' library. This
version removes the artificial dependency on the sentinel variable, so that communication tasks can
run in parallel and overlap with computations (see the second sketch after this list). The TAMPI
library is in charge of managing the '''blocking MPI''' calls to avoid blocking the underlying
execution resources.

* `nbody.tampi.ompss2.cuda.bin`: A mix of the previous two variants, where '''TAMPI''' is leveraged
to allow the concurrent execution of communication tasks, and the GPU processes offload the
compute-intensive tasks to the GPUs.
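
First sketch: a hedged illustration (hypothetical names and signatures, not the benchmark's code)
of how an artificial `inout` dependency on a sentinel variable serializes communication tasks that
perform blocking MPI calls:

{{{#!c
#include <mpi.h>

/* Hypothetical sketch: each task sends one block of forces. The
 * artificial inout dependency on the 'serial' sentinel forces the
 * tasks to execute one after another, so their blocking MPI calls
 * cannot deadlock with the peer process. */
void send_blocks(const float *forces, int block_size, int nblocks,
                 int peer, char *serial)
{
    for (int b = 0; b < nblocks; ++b) {
        #pragma oss task in(forces[b*block_size;block_size]) inout(*serial)
        MPI_Send(&forces[b*block_size], block_size, MPI_FLOAT,
                 peer, b, MPI_COMM_WORLD);
    }
}
}}}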
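
Second sketch: a hedged illustration of the same communication phase under TAMPI's blocking mode,
assuming MPI was initialized with the extended MPI_TASK_MULTIPLE level and the `TAMPI.h` header from
the TAMPI distribution (again hypothetical names, not the benchmark's code). The sentinel dependency
is gone, so the tasks may run concurrently:

{{{#!c
#include <mpi.h>
#include <TAMPI.h>

/* Hypothetical sketch: with TAMPI's blocking mode enabled, the sentinel
 * dependency is not needed and receive tasks may run concurrently.
 * When MPI_Recv cannot complete immediately, TAMPI pauses the calling
 * task and lets the core run other ready tasks, resuming the paused
 * task once the data arrives. */
void recv_blocks(float *forces, int block_size, int nblocks, int peer)
{
    for (int b = 0; b < nblocks; ++b) {
        #pragma oss task out(forces[b*block_size;block_size])
        MPI_Recv(&forces[b*block_size], block_size, MPI_FLOAT,
                 peer, b, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
}}}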