support the '''Unified Memory''' feature, which facilitates the work of the users. With those
devices, users do '''not have to move or copy the data to/from the GPUs''', and
'''pointers on the host are the same on the device'''.
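
As a minimal sketch of this behavior (illustrative code, not part of this page's benchmark),
the following CUDA program allocates a managed buffer with `cudaMallocManaged()` and accesses
it through the same pointer from both the host and a device kernel:

{{{#!c
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(double *data, int n, double factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    double *data;

    // One allocation, visible from both host and device
    cudaMallocManaged(&data, n * sizeof(double));

    for (int i = 0; i < n; i++)  // the host writes the buffer directly
        data[i] = (double) i;

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0);  // same pointer on the device
    cudaDeviceSynchronize();

    printf("data[10] = %f\n", data[10]);  // the host reads the updated values
    cudaFree(data);
    return 0;
}
}}}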

For these reasons, parallel applications should benefit from these GPU resources, and
they should try to '''offload''' the most compute-intensive parts of the application to
the available GPUs. In this page, we briefly explain the approaches proposed by the
'''OpenMP''' and the '''!OmpSs-2''' programming models to facilitate the offloading of
computation tasks to Unified Memory GPUs. Then, we show a hybrid application with
'''MPI+OpenMP''' and '''MPI+!OmpSs-2''' variants that offloads some of the computation
tasks, and that can be executed on the DEEP system.

On the one hand, '''OpenMP''' provides the `target` directive, which is the one used for
offloading computations to the available devices. This directive provides several clauses
to specify the '''copy directionality''' of the data, how the computational workload is
'''distributed''', the '''data dependencies''', etc. The user annotates the parts of the
program to be offloaded with the `target` directive, '''without having to program them in
a special programming language'''. For instance, when offloading a part of the program to
NVIDIA GPUs, the user is not required to provide any implementation in CUDA. That part
of the program is handled transparently by the compiler.
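
For illustration only (this snippet is not part of the benchmark), a SAXPY-like loop could
be offloaded with the `target` directive as follows. It is a minimal sketch, assuming an
offload-capable compiler (e.g., `clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda`):

{{{#!c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;
    const float a = 2.0f;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));

    for (int i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Offload the loop to a device; the map clauses declare the copy
    // directionality, and the combined construct distributes the
    // iterations across the device threads
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);  // expected: 4.0
    free(x);
    free(y);
    return 0;
}
}}}

On systems with Unified Memory, OpenMP 5.0 also offers the
`#pragma omp requires unified_shared_memory` directive, under which such explicit
mappings can be relaxed.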

On the other hand, '''!OmpSs-2''' proposes another approach targeting NVIDIA Unified Memory
devices. CUDA kernels can be annotated as '''regular tasks''' that declare the corresponding
'''data dependencies''' on the data buffers. When all the dependencies of a CUDA task are
satisfied, the CUDA kernel associated with the task is '''automatically''' and
'''asynchronously offloaded''' to one of the available GPUs. To use this functionality, the
user only has to allocate the buffers that the CUDA kernels will access as Unified Memory
buffers (i.e., using the `cudaMallocManaged()` function).
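
A minimal sketch of how this looks (illustrative names, following the syntax described in
the !OmpSs-2 documentation): the kernel is a regular CUDA kernel, and its declaration is
annotated as a task with `device(cuda)`, an `ndrange` clause that sets the launch
configuration, and the data dependencies:

{{{#!c
// kernel.cu -- a regular CUDA kernel, compiled with nvcc
__global__ void scale_kernel(long n, double factor, double *data)
{
    long i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// main.c -- the kernel declaration annotated as an OmpSs-2 CUDA task
#include <cuda_runtime.h>
#include <stdio.h>

#pragma oss task device(cuda) ndrange(1, n, 128) inout(data[0;n])
__global__ void scale_kernel(long n, double factor, double *data);

int main(void)
{
    const long n = 1 << 20;
    double *data;

    // Buffers accessed by CUDA tasks must be Unified Memory buffers
    cudaMallocManaged(&data, n * sizeof(double));
    for (long i = 0; i < n; i++)
        data[i] = 1.0;

    // Invoked like a regular function: the runtime launches the kernel
    // asynchronously on a GPU once the inout dependency is satisfied
    scale_kernel(n, 2.0, data);

    #pragma oss taskwait

    printf("data[0] = %f\n", data[0]);  // expected: 2.0
    cudaFree(data);
    return 0;
}
}}}

Note that the task invocation does not use the `<<<...>>>` launch syntax; the runtime
derives the launch configuration from the `ndrange` clause.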

The benchmark accepts several options. The most relevant options are the total
number of '''particles''' with `-p`, the number of '''timesteps''' with `-t`, and the
maximum number of '''GPU processes''' with `-g`. The full list of options can be shown
by passing the `-h` option. An example of an execution is: