 * `nbody.mpi.omp.${BS}bs.bin`:
 * `nbody.mpi.omptarget.${BS}bs.bin`:
 * `nbody.tampi.omp.${BS}bs.bin`:
 * `nbody.tampi.omptarget.${BS}bs.bin`:

=== Building & Executing on DEEP ===

The simplest way to compile this application is:

{{{#!bash
# Clone the benchmark's repository
$ git clone https://pm.bsc.es/gitlab/DEEP-EST/apps/NBody.git
$ cd NBody

# Load the required environment (MPI, CUDA, OmpSs-2, OpenMP, etc.)
# Needed only once per session
$ source ./setenv_deep.sh

# Compile the code
$ make
}}}

The benchmark versions are built with a specific block size, which is
decided at compilation time (i.e., the binary names contain the block
size). The default block size of the benchmark is `2048`. Optionally,
you can indicate a different block size when compiling by doing:

{{{#!bash
$ make BS=1024
}}}

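As a quick check, the generated binaries carry the chosen block size in their names. The exact set of variants depends on the toolchains available when running `make`; the listing below is only an illustrative sketch:

{{{#!bash
# Illustrative output; the actual variants depend on the available toolchains
$ ls *.bin
nbody.mpi.omp.1024bs.bin
nbody.mpi.omptarget.1024bs.bin
nbody.tampi.omp.1024bs.bin
nbody.tampi.omptarget.1024bs.bin
}}}
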
The next step is the execution of the benchmark on the DEEP system. Since
this benchmark targets the offloading of computational tasks to the
GPUs, we must execute it on a DEEP partition that features this kind of
device. A good example is the `dp-dam` partition, where each node features:

 * 2x Intel® Xeon® Platinum 8260M CPU @ 2.40GHz (24 cores/socket, 2 threads/core), '''96 CPUs/node'''
 * 1x NVIDIA Tesla V100 (Volta)
 * Extoll network interconnection

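Before requesting the job, the partition and node resources can be double-checked with standard Slurm and system tools (a quick sketch; the exact output depends on the system configuration):

{{{#!bash
# Show the availability and node count of the dp-dam partition
$ sinfo -p dp-dam

# Once inside an allocated node, inspect the CPUs and the GPU
$ lscpu | grep -E "Socket|Core|Thread"
$ nvidia-smi
}}}
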
In this case, we are going to request an interactive job on a `dp-dam` node.
All we need to do is:

{{{#!bash
$ srun -p dp-dam -N 1 -n 8 -c 12 -t 01:00:00 --pty /bin/bash -i
}}}

With that command, we obtain an interactive session on an exclusive `dp-dam`
node. We have indicated that we are going to create 8 processes with 12 CPUs
per process when executing binaries with `srun` from within the node. However,
you can change this configuration (without exceeding the initially requested
resources) by passing different options to the `srun` command when executing
the binaries, as sketched below.

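For instance, within the same allocation we could use fewer processes with more CPUs per process (an illustrative sketch; any layout is valid as long as it does not exceed the 96 CPUs of the node):

{{{#!bash
# Same allocation, different layout: 4 processes with 24 CPUs each (4 x 24 = 96 CPUs)
$ srun -n 4 -c 24 ./nbody.tampi.omp.2048bs.bin <benchmark options>
}}}
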
At this point, we are ready to execute the benchmark with multiple MPI processes.
The benchmark accepts several options. The most relevant ones are the total number
of particles (`-p`), the number of timesteps (`-t`), and the maximum number of GPU
processes (`-g`). More options can be seen by passing the `-h` option. An example
of an execution is:

{{{#!bash
$ srun -n 8 -c 12 ./nbody.tampi.ompss2.cuda.2048bs.bin -t 100 -p 16384 -g 4
}}}

in which the application performs 100 timesteps using 8 MPI processes with 12
cores per process (used by the !OmpSs-2 runtime system). The maximum number of
GPU processes is 4, so there will be 4 CPU processes and 4 GPU processes (all
processes have access to GPU devices). Since the total number of particles is
16384, each process is in charge of computing/updating 2048 forces/particles,
which corresponds to one block with the block size of 2048.

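That distribution can be derived directly from the arguments (a small arithmetic sketch, using the block size of 2048 encoded in the binary name):

{{{#!bash
# Particles per process and blocks per process for the example above
$ PARTICLES=16384; PROCS=8; BS=2048
$ echo $(( PARTICLES / PROCS ))       # 2048 particles per process
$ echo $(( PARTICLES / PROCS / BS ))  # 1 block per process
}}}
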
In the CUDA variants, a process can belong to the GPU processes group if it has
access to at least one GPU device. However, in the case of the non-CUDA versions,
all processes can belong to the GPU processes group (i.e., the GPU processes are
simulated). For this reason, the application provides the `-g` option to control
the maximum number of GPU processes. By default, the number of GPU processes is
half of the total number of processes, as illustrated below.

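For example, the default behavior and an explicit `-g` setting could look as follows (an illustrative sketch using one of the non-CUDA variants):

{{{#!bash
# Default: with 8 processes, 4 of them act as (simulated) GPU processes
$ srun -n 8 -c 12 ./nbody.tampi.omp.2048bs.bin -t 100 -p 16384

# Explicitly limit the GPU processes group to 2 processes
$ srun -n 8 -c 12 ./nbody.tampi.omp.2048bs.bin -t 100 -p 16384 -g 2
}}}
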
Also note that the non-CUDA variants cannot compute kernels on the GPU. In these
cases, the structure of the application is kept, but the CUDA tasks are replaced
by regular CPU tasks.

== References ==