support the '''Unified Memory''' feature, which facilitates the work of the users. With those
devices, users do '''not have to move or copy the data to/from the GPUs''', and
'''pointers on the host are the same on the device'''.
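
As a minimal sketch of this behavior (illustrative code, not part of this page's benchmark),
the following CUDA program allocates a managed buffer with `cudaMallocManaged()` and accesses
it through the same pointer from both the host and a device kernel:

{{{#!c
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(double *data, int n, double factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    double *data;

    // One allocation, visible from both host and device
    cudaMallocManaged(&data, n * sizeof(double));

    for (int i = 0; i < n; i++)  // the host writes the buffer directly
        data[i] = (double) i;

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0);  // same pointer on the device
    cudaDeviceSynchronize();

    printf("data[10] = %f\n", data[10]);  // the host reads the updated values
    cudaFree(data);
    return 0;
}
}}}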

For these reasons, parallel applications should benefit from these GPU resources, and
they should try to '''offload''' the most compute-intensive parts of the application to
the available GPUs. In this page, we briefly explain the approaches proposed by the
'''OpenMP''' and the '''!OmpSs-2''' programming models to facilitate the offloading of
computation tasks to Unified Memory GPUs. Then, we show a hybrid application with
'''MPI+OpenMP''' and '''MPI+!OmpSs-2''' variants that offloads some of the computation
tasks, and that can be executed on the DEEP system.

On the one hand, '''OpenMP''' provides the `target` directive, which is the one used for
offloading computations to the available devices. This directive provides several clauses
to specify the '''copy directionality''' of the data, how the computational workload is
'''distributed''', the '''data dependencies''', etc. The user annotates the parts of the
program to be offloaded with the `target` directive, '''without having to program them in
a special programming language'''. For instance, when offloading a part of the program to
NVIDIA GPUs, the user is not required to provide any implementation in CUDA. That part
of the program is handled transparently by the compiler.
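
For illustration only (this snippet is not part of the benchmark), a SAXPY-like loop could
be offloaded with the `target` directive as follows. It is a minimal sketch, assuming an
offload-capable compiler (e.g., `clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda`):

{{{#!c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;
    const float a = 2.0f;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));

    for (int i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Offload the loop to a device; the map clauses declare the copy
    // directionality, and the combined construct distributes the
    // iterations across the device threads
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);  // expected: 4.0
    free(x);
    free(y);
    return 0;
}
}}}

On systems with Unified Memory, OpenMP 5.0 also offers the
`#pragma omp requires unified_shared_memory` directive, under which such explicit
mappings can be relaxed.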

On the other hand, '''!OmpSs-2''' proposes another approach targeting NVIDIA Unified Memory
devices. CUDA kernels can be annotated as '''regular tasks''' that declare the corresponding
'''data dependencies''' on the data buffers. When all the dependencies of a CUDA task are
satisfied, the CUDA kernel associated with the task is '''automatically''' and
'''asynchronously offloaded''' to one of the available GPUs. To use this functionality, the
user only has to allocate the buffers that the CUDA kernels will access as Unified Memory
buffers (i.e., using the `cudaMallocManaged()` function).
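
A minimal sketch of how this looks (illustrative names, following the syntax described in
the !OmpSs-2 documentation): the kernel is a regular CUDA kernel, and its declaration is
annotated as a task with `device(cuda)`, an `ndrange` clause that sets the launch
configuration, and the data dependencies:

{{{#!c
// kernel.cu -- a regular CUDA kernel, compiled with nvcc
__global__ void scale_kernel(long n, double factor, double *data)
{
    long i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// main.c -- the kernel declaration annotated as an OmpSs-2 CUDA task
#include <cuda_runtime.h>
#include <stdio.h>

#pragma oss task device(cuda) ndrange(1, n, 128) inout(data[0;n])
__global__ void scale_kernel(long n, double factor, double *data);

int main(void)
{
    const long n = 1 << 20;
    double *data;

    // Buffers accessed by CUDA tasks must be Unified Memory buffers
    cudaMallocManaged(&data, n * sizeof(double));
    for (long i = 0; i < n; i++)
        data[i] = 1.0;

    // Invoked like a regular function: the runtime launches the kernel
    // asynchronously on a GPU once the inout dependency is satisfied
    scale_kernel(n, 2.0, data);

    #pragma oss taskwait

    printf("data[0] = %f\n", data[0]);  // expected: 2.0
    cudaFree(data);
    return 0;
}
}}}

Note that the task invocation does not use the `<<<...>>>` launch syntax; the runtime
derives the launch configuration from the `ndrange` clause.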

The benchmark accepts several options. The most relevant options are the total
number of '''particles''' with `-p`, the number of '''timesteps''' with `-t`, and the
maximum number of '''GPU processes''' with `-g`. The full list of options can be shown
by passing the `-h` option. An example of an execution is: