Changes between Version 20 and Version 21 of Public/User_Guide/Offloading_hybrid_apps
Timestamp: Sep 18, 2019, 12:05:17 AM
…

Moreover, some of the most modern GPU devices, such as the NVIDIA Tesla V100, support '''Unified Memory''', which facilitates the task of application developers. With those devices, users do '''not have to move or copy the data to/from GPUs''', and also, '''pointers at the host are the same at the device'''.

For these reasons, developers should take advantage of these GPU resources by trying to '''offload''' the most compute-intensive parts of their applications to the available GPUs. On this page, we briefly explain the approaches proposed by the '''OpenMP''' and the '''!OmpSs-2''' programming models to facilitate the offloading of computation tasks to Unified Memory GPUs. Then, we show a hybrid application with '''MPI+OpenMP''' and '''MPI+!OmpSs-2''' variants that offload some of the computation tasks.

On the one hand, '''OpenMP''' provides the `target` directive, which is used for offloading computational parts of OpenMP programs to the GPUs. It provides multiple clauses to specify the '''copy directionality''' of the data, how the computational workload is '''distributed''', the '''data dependencies''', etc. The user can annotate a part of the program using the `target` directive '''without having to program it with a special programming language'''. For instance, when offloading a part of the program to NVIDIA GPUs, the user is not required to provide any CUDA kernel; that part of the program is handled transparently by the compiler.

…

devices. CUDA kernels can be annotated as '''regular tasks''', and they can declare the corresponding '''data dependencies''' on the data buffers. When all the dependencies of a CUDA task are satisfied, the CUDA kernel associated with the task is '''automatically''' and '''asynchronously offloaded''' to one of the available GPUs. To use that functionality, the user only has to allocate the buffers that CUDA kernels will access as Unified Memory buffers (i.e., using the `cudaMallocManaged()` function). Additionally, users must annotate the CUDA tasks with the `device(cuda)` and `ndrange(...)` clauses.
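To make the two approaches more concrete, below are two minimal sketches. They are '''not''' taken from the benchmark sources: the kernel names, array sizes, and block sizes are illustrative assumptions.

First, a sketch of the OpenMP `target` approach, offloading a simple loop and stating the copy directionality with `map()` clauses:

{{{#!c
#include <stdio.h>

#define N 4096

int main(void)
{
    static double x[N], y[N];
    const double a = 2.0;

    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* Offload the loop to a device: map() states the copy directionality,
     * and teams/distribute/parallel for spread the iterations over the GPU. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] += a * x[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
}}}

Second, a sketch of an !OmpSs-2 CUDA task. The hypothetical `scale_block()` kernel would be implemented in a separate `.cu` file compiled with `nvcc`, while the host code only allocates the buffer with `cudaMallocManaged()` and annotates the kernel declaration with the `device(cuda)` and `ndrange(...)` clauses:

{{{#!c
#include <cuda_runtime.h>

/* Hypothetical CUDA kernel, implemented in a separate .cu file. The OmpSs-2
 * annotation turns every call into a task: once its dependencies are ready,
 * the runtime launches the kernel on a GPU with the geometry given by
 * ndrange (1 dimension, n threads, blocks of 128 threads). */
#pragma oss task device(cuda) ndrange(1, n, 128) inout(data[0;n])
void scale_block(int n, float factor, float *data);

int main(void)
{
    const int n = 1 << 20;
    float *data;

    /* Unified Memory buffer: the same pointer is valid on host and device,
     * so no explicit copies are needed around the offloaded task. */
    cudaMallocManaged((void **) &data, n * sizeof(float), cudaMemAttachGlobal);

    for (int i = 0; i < n; i++)
        data[i] = 1.0f;

    scale_block(n, 2.0f, data);   /* spawns the CUDA task asynchronously */

    #pragma oss taskwait          /* wait for the offloaded kernel to finish */

    cudaFree(data);
    return 0;
}
}}}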
== N-Body Benchmark ==

…

transforming the high-level directives into a parallelized version of the application. The Nanos6 runtime system provides the services to manage all the parallelism in the application (e.g., task creation, synchronization, scheduling, etc.). Downloads at [https://github.com/bsc-pm].

 * A derivative '''Clang + LLVM OpenMP''' that supports the non-blocking mode of TAMPI. Not released yet.

 * The '''CUDA''' tools and NVIDIA '''Unified Memory''' devices for enabling the CUDA variants, in which some of the N-body kernels are executed on the available GPU devices.

=== Versions ===

…

 * `nbody.mpi.ompss2.bin`: Parallel version using '''MPI + !OmpSs-2 tasks'''. Both '''computation''' and '''communication''' phases are '''taskified'''. However, communication tasks (each one sending or receiving a block) are serialized by an artificial dependency on a sentinel variable. This is to prevent deadlocks between processes, since communication tasks perform '''blocking MPI''' calls.

 * `nbody.mpi.ompss2.cuda.bin`: The same as the previous version but '''offloading''' the tasks that compute the forces between particle blocks to the available GPUs. The GPU processes offload those computation tasks, which are the most compute-intensive parts of the program. The `calculate_forces_block_cuda` task is annotated as a regular task (e.g., with its dependencies) but implemented in '''CUDA'''. However, since it is Unified Memory, the user does …

 * `nbody.tampi.ompss2.bin`: Parallel version using '''MPI + !OmpSs-2 tasks + TAMPI''' library. This version disables the artificial dependencies on the sentinel variable so that communication tasks can run in parallel and overlap with computations. The TAMPI library is in charge of managing the '''blocking MPI''' calls to avoid the blocking of the underlying execution resources.

…

 * `nbody.mpi.omp.bin`: Parallel version using '''MPI + OpenMP tasks'''. Both '''computation''' and '''communication''' phases are '''taskified'''. However, communication tasks (each one sending or receiving a block) are serialized by an artificial dependency on a sentinel variable. This is to prevent deadlocks between processes, since communication tasks perform '''blocking MPI''' calls.
 * `nbody.mpi.omptarget.bin`: The same as the previous version but '''offloading''' the tasks that compute the forces between particle blocks to the available GPUs. The GPU processes offload those computation tasks, which are the most compute-intensive parts of the program. This is done through the `omp target` directive, declaring the corresponding dependencies, and specifying the `target` as `nowait` (i.e., asynchronous offload); a minimal sketch of this pattern appears after this list. Additionally, the target directive …

 * `nbody.tampi.omp.bin`: Parallel version using '''MPI + OpenMP tasks + TAMPI''' library. This version disables the artificial dependencies on the sentinel variable so that communication tasks can run in parallel and overlap with computations. Since OpenMP only supports the non-blocking mechanism of TAMPI, this version leverages non-blocking primitive calls. In this way, the TAMPI library is in charge of managing the '''non-blocking MPI''' operations to overlap communication and computation tasks efficiently.

 * `nbody.tampi.omptarget.bin`: A mix of the previous two variants where '''TAMPI''' is leveraged for
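As referenced in the `nbody.mpi.omptarget.bin` description above, the OpenMP variants offload the force computation through an asynchronous `omp target` region that declares task dependencies. Below is a minimal sketch of that pattern; it is '''not''' taken from the benchmark sources, and the buffer names, sizes, and `map()` clauses are illustrative assumptions.

{{{#!c
#include <stdio.h>

#define N 8192

int main(void)
{
    static float positions[N], forces[N];

    for (int i = 0; i < N; i++) {
        positions[i] = (float) i;
        forces[i] = 0.0f;
    }

    #pragma omp parallel
    #pragma omp single
    {
        /* Asynchronous offload: "nowait" turns the target region into a
         * deferred target task, and the depend clauses order it against any
         * other tasks that read or write the same buffers. */
        #pragma omp target teams distribute parallel for nowait \
                depend(in: positions[0:N]) depend(inout: forces[0:N]) \
                map(to: positions[0:N]) map(tofrom: forces[0:N])
        for (int i = 0; i < N; i++)
            forces[i] += 0.5f * positions[i];

        /* Wait for the offloaded region (and any sibling tasks) to finish. */
        #pragma omp taskwait
    }

    printf("forces[1] = %f\n", forces[1]);
    return 0;
}
}}}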