Changes between Version 20 and Version 21 of Public/User_Guide/Offloading_hybrid_apps


Timestamp: Sep 18, 2019, 12:05:17 AM
Author: Kevin Sala

  • Public/User_Guide/Offloading_hybrid_apps

    v20 → v21

 Moreover, some of the most modern GPU devices, such as NVIDIA Tesla V100,
-support the '''Unified Memory''' which facilitates the task of the users. With those
-devices, users do '''not have to move or copy the data to/from GPUs''', and also,
-'''pointers at the host are the same at the device'''.
-
-For these reasons, parallel applications should take benefit from these GPU
-resources, and they should try to '''offload''' the most compute-intensive parts of
-to the available GPUs. In this page, we briefly explain the approaches proposed by
+support the '''Unified Memory''', which facilitates the task of application
+developers. With those devices, users do '''not have to move or copy the data
+to/from GPUs''', and also, '''pointers at the host are the same at the device'''.
+
+For these reasons, developers should take advantage of these GPU resources by trying
+to '''offload''' the most compute-intensive parts of the applications to the available
+GPUs. In this page, we briefly explain the approaches proposed by
 the '''OpenMP''' and the '''!OmpSs-2''' programming models to facilitate the offloading
-of computation tasks to Unified Memory GPUs. Then, we show an hybrid application with
-'''MPI+OpenMP''' and '''MPI+!OmpSs-2''' variants that offloads some of the computation
-tasks, and that can be executed on the DEEP system.
-
-On the one hand, '''OpenMP''' provides the `target` directive which is the one used for
-offloading computational parts of OpenMP programs to the GPUs. It provides multiple
+of computation tasks to Unified Memory GPUs. Then, we show a hybrid application that
+has '''MPI+OpenMP''' and '''MPI+!OmpSs-2''' variants which offload some computation tasks.
+
+On the one hand, '''OpenMP''' provides the `target` directive, which is the one used for
+offloading computational parts of OpenMP programs to the GPUs. It provides multiple
 clauses to specify the '''copy directionality''' of the data, how the computational workload
 is '''distributed''', the '''data dependencies''', etc. The user can annotate a part of the
-program inside using the `target` directive and '''without having to program it with
+program using the `target` directive and '''without having to program it with
 a special programming language'''. For instance, when offloading a part of the program to
-NVIDIA GPUs, the user is not required to provide any implementation in CUDA. That part
+NVIDIA GPUs, the user is not required to provide any CUDA kernel. That part
 of the program is handled transparently by the compiler.

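As a minimal illustration of the `target` directive (the function and sizes are hypothetical, not taken from the page), a plain C loop can be offloaded as follows; the `map(...)` clauses state the copy directionality and the `teams distribute parallel for` construct distributes the workload, with no CUDA code written by the user:

{{{
/* Hypothetical example: a SAXPY loop offloaded with the OpenMP target
   directive. The compiler generates the device code transparently. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
}}}

On a Unified Memory device, the explicit `map(...)` clauses may even become unnecessary (e.g., under the OpenMP 5.0 `requires unified_shared_memory` declaration).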
     
 On the other hand, '''!OmpSs-2''' supports the offloading of tasks to CUDA
 devices. CUDA kernels can be annotated as '''regular tasks''', and they can declare the
 corresponding '''data dependencies''' on the data buffers. When all the dependencies of a
-CUDA task are satisfied, the CUDA kernel associated to the task is '''automatically''' and
+CUDA task are satisfied, the CUDA kernel associated with the task is '''automatically''' and
 '''asynchronously offloaded''' to one of the available GPUs. To use that functionality, the
 user only has to allocate the buffers that CUDA kernels will access as Unified Memory buffers
-(i.e., using the `cudaMallocManaged()` function).
+(i.e., using the `cudaMallocManaged()` function). Additionally, users must annotate the CUDA
+tasks with the `device(cuda)` and `ndrange(...)` clauses.

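For illustration, a minimal, hypothetical sketch of an !OmpSs-2 CUDA task following the rules above (names and sizes are assumptions, not from the page): the CUDA kernel is declared as a regular task with its data dependencies plus the `device(cuda)` and `ndrange(...)` clauses, and its buffers are allocated with `cudaMallocManaged()`:

{{{
/* saxpy.h: hypothetical task declaration visible to the host code.
   ndrange(1, n, 128): a 1-D kernel with n total threads in blocks of 128. */
#pragma oss task in(x[0;n]) inout(y[0;n]) device(cuda) ndrange(1, n, 128)
__global__ void saxpy_kernel(int n, float a, const float *x, float *y);

/* saxpy.cu: the kernel itself is plain CUDA. */
__global__ void saxpy_kernel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* Host side: the kernel is invoked like a regular function; the runtime
   offloads it asynchronously once its dependencies are satisfied. */
void saxpy(int n, float a)
{
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  /* Unified Memory buffers */
    cudaMallocManaged(&y, n * sizeof(float));
    /* ... initialize x and y on the host ... */

    saxpy_kernel(n, a, x, y);

    #pragma oss taskwait  /* wait for the offloaded task to complete */
}
}}}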
 == N-Body Benchmark ==
…
     transforming the high-level directives into a parallelized version of the application. The Nanos6
     runtime system provides the services to manage all the parallelism in the application (e.g., task
-    creation, synchronization, scheduling, etc). Downloads at [https://github.com/bsc-pm].
+    creation, synchronization, scheduling, etc.). Downloads at [https://github.com/bsc-pm].

   * A derivative '''Clang + LLVM OpenMP''' that supports the non-blocking mode of TAMPI. Not released yet.

   * The '''CUDA''' tools and NVIDIA '''Unified Memory''' devices for enabling the CUDA variants, in which some of
     the N-body kernels are executed on the available GPU devices.

 === Versions ===
…

   * `nbody.mpi.ompss2.bin`: Parallel version using '''MPI + !OmpSs-2 tasks'''. Both '''computation'''
-    and '''communication''' phases are '''taskified''', however, communication tasks (each one sending
+    and '''communication''' phases are '''taskified'''. However, communication tasks (each one sending
     or receiving a block) are serialized by an artificial dependency on a sentinel variable. This
-    is to prevent deadlocks between processes, since communication tasks perform '''blocking MPI'''
+    is to prevent deadlocks between processes since communication tasks perform '''blocking MPI'''
     calls.

   * `nbody.mpi.ompss2.cuda.bin`: The same as the previous version but '''offloading''' the tasks that
-    compute the forces between particles blocks to the available GPUs. Those computation tasks are
-    offloaded by the '''GPU processes''' and they are the most compute-intensive parts of the program.
+    compute the forces between particle blocks to the available GPUs. The GPU processes offload those
+    computation tasks, which are the most compute-intensive parts of the program.
     The `calculate_forces_block_cuda` task is annotated as a regular task (e.g., with its
     dependencies) but implemented in '''CUDA'''. However, since it is Unified Memory, the user '''does
     not have to''' copy the data to/from the device.

   * `nbody.tampi.ompss2.bin`: Parallel version using '''MPI + !OmpSs-2 tasks + TAMPI''' library. This
-    version disables the artificial dependencies on the sentinel variable, so communication tasks can
-    run in parallel and overlap computations. The TAMPI library is in charge of managing the '''blocking
+    version disables the artificial dependencies on the sentinel variable so that communication tasks can
+    run in parallel and overlap with computations. The TAMPI library is in charge of managing the '''blocking
     MPI''' calls to avoid the blocking of the underlying execution resources.

…

   * `nbody.mpi.omp.bin`: Parallel version using '''MPI + OpenMP tasks'''. Both '''computation''' and
-    '''communication''' phases are '''taskified''', however, communication tasks (each one sending
+    '''communication''' phases are '''taskified'''. However, communication tasks (each one sending
     or receiving a block) are serialized by an artificial dependency on a sentinel variable. This
-    is to prevent deadlocks between processes, since communication tasks perform '''blocking MPI'''
+    is to prevent deadlocks between processes since communication tasks perform '''blocking MPI'''
     calls.

   * `nbody.mpi.omptarget.bin`: The same as the previous version but '''offloading''' the tasks that
-    compute the forces between particles blocks to the available GPUs. Those computation tasks are
-    offloaded by the '''GPU processes''' and they are the most compute-intensive parts of the program.
+    compute the forces between particle blocks to the available GPUs. The GPU processes offload those
+    computation tasks, which are the most compute-intensive parts of the program.
     This is done through the `omp target` directive, declaring the corresponding dependencies, and
     specifying the `target` as `nowait` (i.e., asynchronous offload). Additionally, the target directive
…

   * `nbody.tampi.omp.bin`: Parallel version using '''MPI + OpenMP tasks + TAMPI''' library. This
-    version disables the artificial dependencies on the sentinel variable, so communication tasks can
+    version disables the artificial dependencies on the sentinel variable so that communication tasks can
     run in parallel and overlap computations. Since OpenMP only supports the non-blocking mechanism of
     TAMPI, this version leverages non-blocking primitive calls. In this way, the TAMPI library is in charge
-    of managing the '''non-blocking MPI''' operations to efficiently overlap communication and computation
-    tasks.
+    of managing the '''non-blocking MPI''' operations to overlap communication and computation tasks
+    efficiently.

   * `nbody.tampi.omptarget.bin`: A mix of the previous two variants where '''TAMPI''' is leveraged for