Changes between Version 18 and Version 19 of Public/User_Guide/Offloading_hybrid_apps


Timestamp: Sep 17, 2019, 6:49:35 PM
Author: Kevin Sala

thousands of parallel computing nodes, connected by high-bandwidth network
interconnections, and in most cases, each node leveraging one or more
'''GPU devices'''.

Moreover, some of the most modern GPU devices, such as the NVIDIA Tesla V100,
support '''Unified Memory''', which simplifies the work of users. With those
devices, users do '''not have to move or copy the data to/from GPUs''', and
'''pointers on the host are the same as on the device'''.
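
As a minimal sketch of this behavior (illustrative code, not taken from the
benchmark on this page), a buffer allocated with `cudaMallocManaged()` can be
accessed through the same pointer on the host and inside a kernel:

{{{#!c
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1024;
    double *data;

    /* Allocate a Unified Memory buffer: the returned pointer is valid on
     * both the host and the device, and pages are migrated on demand. */
    cudaMallocManaged((void **) &data, n * sizeof(double), cudaMemAttachGlobal);

    /* Initialize directly from the host; no explicit copies are needed. */
    for (int i = 0; i < n; ++i)
        data[i] = 1.0;

    /* ... kernels launched here can dereference the same 'data' pointer ... */

    cudaFree(data);
    return 0;
}
}}}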

For these reasons, parallel applications should take advantage of these GPU
resources, and they should try to '''offload''' the most compute-intensive parts
to the available GPUs. In this page, we briefly explain the approaches proposed by
the '''OpenMP''' and the '''!OmpSs-2''' programming models to facilitate the offloading
of computation tasks to Unified Memory GPUs. Then, we show a hybrid application with
'''MPI+OpenMP''' and '''MPI+!OmpSs-2''' variants that offloads some of the computation
tasks, and that can be executed on the DEEP system.

On the one hand, '''OpenMP''' provides the `target` directive, which is used for
offloading computational parts of OpenMP programs to the GPUs. It provides multiple
clauses to specify the '''copy directionality''' of the data, how the computational workload
is '''distributed''', the '''data dependencies''', etc. The user can annotate a part of the
program using the `target` directive, '''without having to program it with
a special programming language'''. For instance, when offloading a part of the program to
NVIDIA GPUs, the user is not required to provide any implementation in CUDA. That part
of the program is handled transparently by the compiler.
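
For illustration, a minimal sketch of an offloaded loop (hypothetical code, not
part of the benchmark) could look as follows, where the `map` clauses specify the
copy directionality and the combined construct distributes the iterations on the
device:

{{{#!c
/* Sketch of an OpenMP offloaded SAXPY; plain C code, no CUDA involved. */
void saxpy(int n, float a, const float *x, float *y)
{
    /* 'map(to:...)' copies x to the device, 'map(tofrom:...)' copies y in
     * and back; the iterations are distributed among the GPU teams. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
}}}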

On the other hand, '''!OmpSs-2''' proposes another approach targeting NVIDIA Unified Memory
devices. CUDA kernels can be annotated as '''regular tasks''', and they can declare the
corresponding '''data dependencies''' on the data buffers. When all the dependencies of a
CUDA task are satisfied, the CUDA kernel associated with the task is '''automatically''' and
'''asynchronously offloaded''' to one of the available GPUs. To use this functionality, the
user only has to allocate the buffers that CUDA kernels will access as Unified Memory buffers
(i.e., using the `cudaMallocManaged()` function).
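
As a hedged sketch of this approach (illustrative code following the OmpSs-2
documentation for CUDA tasks; names and clause values are assumptions), a CUDA
kernel can be declared as a task and then invoked as a regular function:

{{{#!c
/* Annotated declaration (e.g., in a header included by both the C code and
 * the CUDA code): the kernel becomes a task with its own dependencies. The
 * 'ndrange' clause tells the runtime the grid to launch the kernel with. */
#pragma oss task device(cuda) ndrange(1, n, 128) in(x[0;n]) inout(y[0;n])
__global__ void saxpy_kernel(int n, float a, const float *x, float *y);

void saxpy(int n, float a, const float *x, float *y)
{
    /* Called like a regular task: when its dependencies are satisfied, the
     * runtime launches it asynchronously on one of the available GPUs. The
     * x and y buffers must come from cudaMallocManaged(). */
    saxpy_kernel(n, a, x, y);
}
}}}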

== N-Body Benchmark ==
     

At this point, we are ready to execute the benchmark with multiple MPI processes.
The benchmark accepts several options. The most relevant options are the total
number of '''particles''' with `-p`, the number of '''timesteps''' with `-t`, and the
maximum number of '''GPU processes''' with `-g`. More options can be listed by
passing the `-h` option. An example of an execution is:

{{{#!bash
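# Illustrative invocation only: the binary name, launcher, and timestep count
# are assumptions, since the original command is not shown in this changeset.
$ srun -n 8 ./nbody -p 16384 -t 100 -g 4
}}}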
     
In this example, the maximum number of GPU processes is 4, so there will be 4 CPU
processes and 4 GPU processes (all processes have access to GPU devices). Since the
total number of particles is 16384 and the block size is 2048, each process will be
in charge of computing/updating 4096 forces/particles, which are 2 blocks.

In the CUDA variants, a process can belong to the GPU processes group if it has
     
}}}
Finally, the `submit.job` script can be used to submit a non-interactive job to
the job scheduler. Feel free to modify the script with other parameters or
job configurations. The script can be submitted with:

{{{#!bash
$ sbatch submit.job
}}}

== References ==