Changes between Version 11 and Version 12 of Public/User_Guide/Offloading_hybrid_apps


Timestamp: Sep 17, 2019, 4:22:02 PM
Author: Kevin Sala

Table of contents:

* [#QuickOverview Quick Overview]
* Examples:

The requirements of this application are listed below. The main requirements are:

  * The '''GNU''' or '''Intel®''' Compiler Collection.

  * A '''Message Passing Interface (MPI)''' implementation supporting the '''multi-threading''' level
    of thread support (i.e., `MPI_THREAD_MULTIPLE`).

  * The '''Task-Aware MPI (TAMPI)''' library, which defines a clean '''interoperability''' mechanism
    for MPI and OpenMP/!OmpSs-2 tasks. It supports both blocking and non-blocking MPI operations
    by providing two different interoperability mechanisms (see the initialization sketch after this
    list). Downloads and more information at [https://github.com/bsc-pm/tampi].

  * The '''!OmpSs-2''' model, which is the second generation of the '''!OmpSs''' programming model. It is
    a '''task-based''' programming model that originated from the ideas of the OpenMP and !StarSs programming
    models. The specification and user guide are available at [https://pm.bsc.es/ompss-2-docs/spec/] and
    [https://pm.bsc.es/ompss-2-docs/user-guide/], respectively. !OmpSs-2 requires both the '''Mercurium''' and
    '''Nanos6''' tools. Mercurium is a source-to-source compiler which provides the necessary support for
    transforming the high-level directives into a parallelized version of the application. The Nanos6
    runtime system provides the services to manage all the parallelism in the application (e.g., task
    creation, synchronization, scheduling, etc.). Downloads at [https://github.com/bsc-pm].

  * A derivative '''Clang + LLVM OpenMP''' that supports the non-blocking mode of TAMPI. Not released yet.

  * The '''CUDA''' tools and NVIDIA '''Unified Memory''' devices for enabling the CUDA variants, in which some of
    the N-body kernels are executed on the available GPU devices.

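To give a rough idea of how the '''blocking mode''' of TAMPI is enabled, the following minimal sketch
initializes MPI with the task-aware threading level described in the TAMPI documentation. The
`MPI_TASK_MULTIPLE` constant and the `TAMPI.h` header are assumptions based on that documentation and
may vary between TAMPI releases.

{{{#!c
#include <mpi.h>
#include <TAMPI.h>

int main(int argc, char **argv)
{
    // Request the task-aware threading level so that blocking MPI calls made
    // inside tasks suspend the calling task instead of blocking the core.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
    if (provided != MPI_TASK_MULTIPLE) {
        // TAMPI blocking mode not available: abort (or fall back to serialized communications).
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // ... taskified computation and communication phases ...

    MPI_Finalize();
    return 0;
}
}}}
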
=== Versions ===

The N-Body application has several versions, which are built as different binaries. All
of them divide the particle space into smaller blocks. MPI processes are divided into
two groups: GPU processes and CPU processes. GPU processes are responsible for computing
the forces between each pair of particle blocks; these forces are then sent to the CPU
processes, where each process updates its particle blocks using the received forces. The
particle and force blocks are equally distributed among the MPI processes of each group.
Thus, each MPI process is in charge of computing the forces or updating the particles of
a consecutive chunk of blocks.
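
The sketch below outlines this exchange pattern for one time step. The kernel names
(`compute_forces_block`, `update_particles_block`), the flat `double` blocks and the one-to-one
pairing of GPU and CPU processes are illustrative assumptions, not the actual code of the benchmark.

{{{#!c
#include <mpi.h>

#define BLOCK_SIZE 2048   /* illustrative block size */

/* Hypothetical kernels standing in for the real N-body routines. */
void compute_forces_block(double *forces, const double *particles);
void update_particles_block(double *particles, const double *forces);

/* One time step: a GPU process computes the forces of its blocks and sends
 * them to its peer CPU process, which updates the particles with them. */
void solve_step(int is_gpu_process, int peer_rank, int num_blocks,
                double *particles, double *forces)
{
    for (int b = 0; b < num_blocks; ++b) {
        double *fblock = &forces[b * BLOCK_SIZE];
        double *pblock = &particles[b * BLOCK_SIZE];
        if (is_gpu_process) {
            compute_forces_block(fblock, pblock);
            MPI_Send(fblock, BLOCK_SIZE, MPI_DOUBLE, peer_rank, b, MPI_COMM_WORLD);
        } else {
            MPI_Recv(fblock, BLOCK_SIZE, MPI_DOUBLE, peer_rank, b,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            update_particles_block(pblock, fblock);
        }
    }
}
}}}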

The available versions are:

  * `nbody.mpi.bin`: Simple parallel version using '''blocking MPI''' primitives for sending and
    receiving each block of particles/forces.

  * `nbody.mpi.ompss2.bin`: Parallel version using MPI + !OmpSs-2 tasks. Both '''computation''' and
    '''communication''' phases are '''taskified'''; however, communication tasks (each one sending
    or receiving a block) are serialized by an artificial dependency on a sentinel variable. This
    is to prevent deadlocks between processes, since communication tasks perform '''blocking MPI'''
    calls (see the sketch after this list).

  * `nbody.mpi.ompss2.cuda.bin`: The same as the previous version but '''offloading''' the computation
    tasks of forces between particle blocks to the available GPUs. Those computation tasks are
    offloaded by the '''GPU processes''' and they are the most compute-intensive parts of the program.
    The `calculate_forces_block_cuda` task is annotated as a regular task (e.g., with its
    dependencies) but implemented in '''CUDA'''. However, since it uses Unified Memory, the user '''does
    not need to move the data to/from the GPU''' device.

  * `nbody.tampi.ompss2.bin`: Parallel version using MPI + !OmpSs-2 tasks + the '''TAMPI''' library. This
    version disables the artificial dependencies on the sentinel variable, so communication tasks can
    run in parallel and overlap with computations. The TAMPI library is in charge of managing the '''blocking
    MPI''' calls to avoid blocking the underlying execution resources (see the sketch after this list).

  * `nbody.tampi.ompss2.cuda.bin`: A mix of the previous two variants, where '''TAMPI''' is leveraged to
    allow the concurrent execution of communication tasks, and GPU processes offload the compute-intensive
    tasks to the GPUs.

  * `nbody.mpi.omp.bin`:

  * `nbody.mpi.omptarget.bin`:

  * `nbody.tampi.omp.bin`:

  * `nbody.tampi.omptarget.bin`:

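The following sketch contrasts the two communication schemes used by the !OmpSs-2 variants above. The
block layout, the `double` buffers and the function names are illustrative assumptions; only the
sentinel dependency and the TAMPI behavior follow the descriptions in this list.

{{{#!c
#include <mpi.h>

#define BLOCK_SIZE 2048   /* illustrative block size */

/* nbody.mpi.ompss2.bin style: communication tasks perform blocking MPI calls,
 * so they are serialized through an artificial dependency on a sentinel. */
void send_forces_serialized(double *forces, int num_blocks, int dst)
{
    static int sentinel;
    for (int b = 0; b < num_blocks; ++b) {
        #pragma oss task in(forces[b*BLOCK_SIZE;BLOCK_SIZE]) inout(sentinel)
        MPI_Send(&forces[b * BLOCK_SIZE], BLOCK_SIZE, MPI_DOUBLE, dst, b, MPI_COMM_WORLD);
    }
}

/* nbody.tampi.ompss2.bin style: with TAMPI's blocking mode, the sentinel is
 * not needed; a task that blocks inside MPI is paused and its core is reused,
 * so communication tasks can run concurrently and overlap with computation. */
void send_forces_tampi(double *forces, int num_blocks, int dst)
{
    for (int b = 0; b < num_blocks; ++b) {
        #pragma oss task in(forces[b*BLOCK_SIZE;BLOCK_SIZE])
        MPI_Send(&forces[b * BLOCK_SIZE], BLOCK_SIZE, MPI_DOUBLE, dst, b, MPI_COMM_WORLD);
    }
}
}}}

Note that the TAMPI variant only behaves as described when MPI has been initialized with the task-aware
threading level shown in the requirements section; otherwise, the blocking calls would still stall the
underlying worker threads.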