* The '''GNU''' or '''Intel®''' Compiler Collection.

* A '''Message Passing Interface (MPI)''' implementation supporting the '''multi-threading''' level
of thread support (see the initialization sketch after this list).

* The '''Task-Aware MPI (TAMPI)''' library, which defines a clean '''interoperability''' mechanism
for MPI and OpenMP/!OmpSs-2 tasks. It supports both blocking and non-blocking MPI operations
by providing two different interoperability mechanisms. Downloads and more information at
[https://github.com/bsc-pm/tampi].

* The '''!OmpSs-2''' model, which is the second generation of the '''!OmpSs''' programming model. It is
a '''task-based''' programming model that originated from the ideas of the OpenMP and !StarSs programming
models. The specification and user-guide are available at [https://pm.bsc.es/ompss-2-docs/spec/] and
[https://pm.bsc.es/ompss-2-docs/user-guide/], respectively. The '''Nanos6''' runtime system provides
the services to manage all the parallelism in the application (e.g., task creation, synchronization,
scheduling, etc.). Downloads at [https://github.com/bsc-pm].

* A derivative '''Clang + LLVM OpenMP''' that supports the non-blocking mode of TAMPI. Not released yet.

* The '''CUDA''' tools and NVIDIA '''Unified Memory''' devices for enabling the CUDA variants, in which
some of the computation tasks are offloaded to the available GPUs.
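
As a quick reference, below is a minimal sketch (illustrative, not part of the benchmark) of how an
application can request and verify the multi-threading level of thread support at initialization:

{{{#!c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* Request the multi-threading level so that several tasks/threads
     * may call MPI concurrently. Note: according to the TAMPI
     * documentation, its blocking mode is enabled by requesting an
     * extended level (MPI_TASK_MULTIPLE, declared in TAMPI.h) instead. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "Error: MPI_THREAD_MULTIPLE not supported!\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... application code ... */

    MPI_Finalize();
    return 0;
}
}}}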
The N-Body application has several versions, which are built into different binaries. All
of them divide the particle space into smaller blocks. MPI processes are divided into
two groups: GPU processes and CPU processes. GPU processes are responsible for computing
the forces between each pair of particle blocks; these forces are then sent to the
CPU processes, where each process updates its particle blocks using the received forces.
The particle and force blocks are equally distributed among the MPI processes of each
group. Thus, each MPI process is in charge of computing the forces or updating the particles
of a consecutive chunk of blocks.
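
For illustration only, here is a sketch (with hypothetical names, not the benchmark's actual code)
of how each process could determine its consecutive chunk of blocks within its group, assuming the
number of blocks is a multiple of the group size:

{{{#!c
/* Hypothetical sketch: 'group_rank' and 'group_size' are the rank of
 * this process within its group (GPU or CPU) and the group's size.
 * Assumes 'num_blocks' is a multiple of 'group_size'. The owned chunk
 * is the half-open range [*first, *last). */
void block_chunk(int group_rank, int group_size, int num_blocks,
                 int *first, int *last)
{
    int per_proc = num_blocks / group_size;
    *first = group_rank * per_proc;
    *last = *first + per_proc;
}
}}}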
* `nbody.mpi.bin`: Simple parallel version using '''blocking MPI''' primitives for sending and
receiving each block of particles/forces.

* `nbody.mpi.ompss2.bin`: Parallel version using MPI + !OmpSs-2 tasks. Both '''computation''' and
'''communication''' phases are '''taskified'''; however, communication tasks (each one sending
or receiving a block) are serialized by an artificial dependency on a sentinel variable (see the
first sketch after this list). This prevents deadlocks between processes, since communication
tasks perform '''blocking MPI''' calls.

* `nbody.mpi.ompss2.cuda.bin`: The same as the previous version but '''offloading''' the tasks that
compute the forces between particle blocks to the available GPUs. Those computation tasks are
offloaded by the '''GPU processes''' and are the most compute-intensive parts of the program.
The `calculate_forces_block_cuda` task is annotated as a regular task (e.g., with its
dependencies) but implemented in '''CUDA'''. However, since the application uses Unified Memory,
the user '''does not need to move the data to/from the GPU''' device.

* `nbody.tampi.ompss2.bin`: Parallel version using MPI + !OmpSs-2 tasks + '''TAMPI''' library. This
version removes the artificial dependency on the sentinel variable, so that communication tasks can
run in parallel and overlap with computations (see the second sketch after this list). The TAMPI
library is in charge of managing the '''blocking MPI''' calls to avoid blocking the underlying
execution resources.

* `nbody.tampi.ompss2.cuda.bin`: A mix of the previous two variants, where '''TAMPI''' is leveraged
to allow the concurrent execution of communication tasks, and the GPU processes offload the
compute-intensive tasks to the GPUs.
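
First sketch: a hedged illustration (hypothetical names and signatures, not the benchmark's code)
of how an artificial `inout` dependency on a sentinel variable serializes communication tasks that
perform blocking MPI calls:

{{{#!c
#include <mpi.h>

/* Hypothetical sketch: each task sends one block of forces. The
 * artificial inout dependency on the 'serial' sentinel forces the
 * tasks to execute one after another, so their blocking MPI calls
 * cannot deadlock with the peer process. */
void send_blocks(const float *forces, int block_size, int nblocks,
                 int peer, char *serial)
{
    for (int b = 0; b < nblocks; ++b) {
        #pragma oss task in(forces[b*block_size;block_size]) inout(*serial)
        MPI_Send(&forces[b*block_size], block_size, MPI_FLOAT,
                 peer, b, MPI_COMM_WORLD);
    }
}
}}}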
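
Second sketch: a hedged illustration of the same communication phase under TAMPI's blocking mode,
assuming MPI was initialized with the extended MPI_TASK_MULTIPLE level and the `TAMPI.h` header from
the TAMPI distribution (again hypothetical names, not the benchmark's code). The sentinel dependency
is gone, so the tasks may run concurrently:

{{{#!c
#include <mpi.h>
#include <TAMPI.h>

/* Hypothetical sketch: with TAMPI's blocking mode enabled, the sentinel
 * dependency is not needed and receive tasks may run concurrently.
 * When MPI_Recv cannot complete immediately, TAMPI pauses the calling
 * task and lets the core run other ready tasks, resuming the paused
 * task once the data arrives. */
void recv_blocks(float *forces, int block_size, int nblocks, int peer)
{
    for (int b = 0; b < nblocks; ++b) {
        #pragma oss task out(forces[b*block_size;block_size])
        MPI_Recv(&forces[b*block_size], block_size, MPI_FLOAT,
                 peer, b, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
}}}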