
Offloading Hybrid Applications' Computation Tasks to GPUs

With MPI + OpenMP / OmpSs-2

Quick Overview

Current and near-future High Performance Computing (HPC) systems consist of thousands of parallel computing nodes connected by high-bandwidth interconnects, and in most cases each node features one or more GPU devices.

Moreover, some of the most modern GPU devices, such as the NVIDIA Tesla V100, support Unified Memory, which simplifies the work of application developers. With those devices, users do not have to move or copy data to or from the GPUs explicitly, and a pointer on the host is also valid on the device.

For these reasons, developers should take advantage of these GPU resources by offloading the most compute-intensive parts of their applications to the available GPUs. On this page, we briefly explain the approaches proposed by the OpenMP and OmpSs-2 programming models to facilitate the offloading of computation tasks to Unified Memory GPUs. Then, we show a hybrid application with MPI+OpenMP and MPI+OmpSs-2 variants that offload some computation tasks.

On the one hand, OpenMP provides the target directive, which is used to offload computation parts of OpenMP programs to the GPUs. It provides multiple clauses to specify the copy directionality of the data, how the computational workload is distributed, the data dependencies, etc. The user annotates a part of the program with the target directive and does not have to write it in a special programming language. For instance, when offloading a part of the program to NVIDIA GPUs, the user is not required to provide any CUDA kernel; the compiler generates the device code transparently.
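As a minimal sketch of such an annotation (not taken from the N-Body sources), the C function below offloads a simple vector update with the target directive. The function name scale_add and its parameters are hypothetical. The map clauses express the copy directionality mentioned above, and the teams/distribute/parallel for constructs describe how the iterations are spread over the device.

```c
/* Hypothetical example: y[i] = a * x[i] + y[i], offloaded with OpenMP.
 * map(to: ...) copies x to the device; map(tofrom: ...) copies y to the
 * device and back. If the compiler has no offloading support, the loop
 * simply runs on the host and produces the same result. */
void scale_add(int n, double a, const double *x, double *y)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Note that the loop body is plain C: no CUDA kernel is written by hand, which is the point made above.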

On the other hand, OmpSs-2 proposes another approach targeting NVIDIA Unified Memory devices. CUDA kernels can be annotated as regular tasks, and they can declare the corresponding data dependencies on the data buffers. When all the dependencies of a CUDA task are satisfied, the CUDA kernel associated with the task is automatically and asynchronously offloaded to one of the available GPUs. To use that functionality, the user only has to allocate the buffers that CUDA kernels will access as Unified Memory buffers (i.e., using the cudaMallocManaged() function). Additionally, users must annotate the CUDA tasks with the device(cuda) and ndrange(...) clauses.

N-Body Benchmark

An N-Body simulation numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body. A familiar example is an astrophysical simulation in which each body represents a galaxy or an individual star, and the bodies attract each other through the gravitational force.

N-Body simulations arise in many other computational science problems as well. For example, protein folding is studied using N-Body simulation to calculate electrostatic and van der Waals forces. Turbulent fluid flow simulation and global illumination computation in computer graphics are other examples of problems that use N-Body simulation.
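To make the "every body interacts with every other body" structure concrete, here is a toy all-pairs force computation in C. It is a one-dimensional sketch under our own assumptions (the function name, the 1D layout and the absence of softening are illustrative choices), not the kernel used by the benchmark.

```c
/* Toy all-pairs force computation in one dimension (illustrative only).
 * force[i] accumulates g * mass[i] * mass[j] * dx / |dx|^3 over every
 * j != i, i.e. the inverse-square law written with a signed distance dx.
 * Assumes all positions are distinct (no softening term is applied). */
void all_pairs_forces_1d(int n, double g, const double *pos,
                         const double *mass, double *force)
{
    for (int i = 0; i < n; ++i) {
        double f = 0.0;
        for (int j = 0; j < n; ++j) {
            if (j == i) {
                continue; /* a body exerts no force on itself */
            }
            double dx = pos[j] - pos[i];
            double dist = dx < 0.0 ? -dx : dx;
            f += g * mass[i] * mass[j] * dx / (dist * dist * dist);
        }
        force[i] = f;
    }
}
```

The two nested loops give the quadratic cost in the number of bodies that makes this computation a natural candidate for GPU offloading.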

Users can clone or download this example from the https://pm.bsc.es/gitlab/DEEP-EST/apps/NBody repository and transfer it to a DEEP working directory.

Requirements

The requirements of this application are shown in the following lists. The main requirements are:

Versions

The N-Body application has several versions, which are built as different binaries. All of them divide the particle space into smaller blocks. MPI processes are divided into two groups: GPU processes and CPU processes. GPU processes are responsible for computing the forces between each pair of particle blocks. These forces are then sent to the CPU processes, where each process updates its particle blocks using the received forces. The particle and force blocks are equally distributed among the MPI processes of each group. Thus, each MPI process is in charge of computing the forces or updating the particles of a consecutive chunk of blocks.
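The consecutive-chunk distribution described above can be sketched as follows. The helper name block_chunk and its signature are hypothetical, and the sketch assumes that the number of blocks divides evenly among the processes of a group, as the equal distribution implies.

```c
/* Hypothetical helper: the process with index `rank` inside a group of
 * `group_size` processes owns the blocks [*first, *first + *count).
 * Assumes num_particles is a multiple of block_size and that the number
 * of blocks is a multiple of group_size (equal distribution). */
void block_chunk(int rank, int group_size, int num_particles,
                 int block_size, int *first, int *count)
{
    int num_blocks = num_particles / block_size;
    *count = num_blocks / group_size;
    *first = rank * (*count);
}
```

For instance, with 16384 particles, a block size of 2048 and 4 processes in a group, the process with index 3 would own 2 blocks starting at block 6.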

The available versions are:

Building & Executing on DEEP

The simplest way to compile this application on the DEEP system is:

# Clone the benchmark's repository
$ git clone https://pm.bsc.es/gitlab/DEEP-EST/apps/NBody.git
$ cd NBody

# Load the required environment (MPI, CUDA, OmpSs-2, OpenMP, etc.)
# Needed only once per session
$ source ./setenv_deep.sh

# Compile all N-Body variants
$ make

The benchmark versions are built with a specific block size, which is decided at compilation time (i.e., the binary names contain the block size). The default block size is 2048, but we can indicate a different block size when compiling by doing:

$ make BS=1024

The next step is the execution of the benchmark on the DEEP system. Since this application targets the offloading of computation tasks to Unified Memory GPU devices, we must execute it in a DEEP partition that features this kind of device. A good example is the dp-dam partition, where each node features:

In this case, we are going to request an interactive job in a dp-dam node. All we need to do is:

$ srun -p dp-dam -N 1 -n 8 -c 12 -t 01:00:00 --pty /bin/bash -i

With that command, we will be redirected to an interactive session in a dp-dam node, exclusively for us. Furthermore, by indicating that configuration (i.e., the -N, -n and -c options), we are setting the default configuration for future srun executions in that session. Thus, when executing an MPI binary via the srun command, it is going to launch 8 processes with 12 CPUs per process by default. However, we can change the configuration by overriding those parameters with new ones, as long as we do not exceed the resources initially requested.

At this point, we are ready to execute the benchmark with multiple MPI processes. The benchmark accepts several options. The most relevant ones are the total number of particles (-p), the number of timesteps (-t), and the maximum number of GPU processes (-g). More options can be listed by passing the -h option. An example of execution is:

$ srun -n 8 -c 12 ./nbody.tampi.ompss2.cuda.2048bs.bin -t 100 -p 16384 -g 4

in which the application will perform 100 timesteps in 8 MPI processes with 12 cores per process (used by the OmpSs-2 runtime system). The maximum number of GPU processes is 4, so there will be 4 GPU processes and 4 CPU processes (all processes have access to GPU devices). Since the total number of particles is 16384 and they are split among the 4 processes of each group, each process will be in charge of computing/updating 4096 forces/particles, which with a block size of 2048 corresponds to 2 blocks.

In the CUDA variants, a process can belong to the GPU processes group only if it has access to at least one GPU device. However, in the non-CUDA versions, all processes can belong to the GPU processes group (i.e., we simulate the GPU processes). For this reason, the application provides the -g option to control the maximum number of GPU processes. By default, the number of GPU processes is half of the total number of processes. Also, note that the non-CUDA variants cannot compute kernels on the GPU. In these cases, we have kept the structure of the application, but we have replaced the CUDA tasks with regular CPU tasks, as a simulation.
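The rules above can be summarized in a small C sketch. This is only our reading of the description, not code from the benchmark: the function name, the convention that a non-positive value means "-g not given", and the gpu_capable parameter (the number of processes with access to a GPU, or all processes in the non-CUDA variants) are assumptions made for illustration.

```c
/* Hypothetical sketch of how the size of the GPU processes group could
 * be derived from the rules described in the text.
 *   total_procs : total number of MPI processes
 *   max_gpu     : value passed with -g; <= 0 means the option was omitted
 *   gpu_capable : processes eligible for the GPU group (assumption:
 *                 equal to total_procs in the non-CUDA variants) */
int gpu_group_size(int total_procs, int max_gpu, int gpu_capable)
{
    /* Default limit: half of the total number of processes. */
    int limit = (max_gpu > 0) ? max_gpu : total_procs / 2;

    /* -g is a maximum, so eligibility can lower the final count. */
    return (gpu_capable < limit) ? gpu_capable : limit;
}
```

With 8 processes that all have access to a GPU and -g 4, this yields 4 GPU processes, which matches the execution example above; the remaining 4 processes form the CPU group.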

Similarly, the OpenMP variants can be executed following the same steps, but setting the OMP_NUM_THREADS environment variable to the corresponding number of CPUs per process. As an example, we could execute the following command:

$ OMP_NUM_THREADS=24 srun -n 4 -c 24 ./nbody.tampi.omp.2048bs.bin -t 100 -p 8192 -g 2

Finally, the submit.job script can be used to submit a non-interactive job to the job scheduler. Feel free to modify the script with other parameters or job configurations. We can submit the script by doing:

$ sbatch submit.job
