allowing the concurrent execution of communication tasks, and GPU processes '''offload''' the
compute-intensive tasks to the GPUs.

* `nbody.mpi.omp.bin`: Parallel version using MPI + OpenMP tasks. Both the '''computation''' and
  '''communication''' phases are '''taskified'''; however, the communication tasks (each one sending
  or receiving a block) are serialized by an artificial dependency on a sentinel variable. This
  prevents deadlocks between processes, since communication tasks perform '''blocking MPI''' calls.

* `nbody.mpi.omptarget.bin`: The same as the previous version, but '''offloading''' the tasks that
  compute the forces between particle blocks to the available GPUs. These computation tasks are
  offloaded by the '''GPU processes''', and they are the most compute-intensive parts of the program.
  This is done through the `omp target` directive, declaring the corresponding dependencies and
  adding the `nowait` clause (i.e., asynchronous offload). Additionally, the `target` directive does
  not require the user to provide a CUDA implementation of the offloaded task. Finally, since we are
  using the Unified Memory feature, we do not need to specify any data-movement clause. We only have
  to state that the memory buffers are already device pointers (i.e., with the `is_device_ptr`
  clause). '''Note:''' This version is not compiled by default since it is still in a ''Work in
  Progress'' state.