Changes between Version 21 and Version 22 of Public/User_Guide/Offloading_hybrid_apps
Timestamp: Sep 18, 2019, 9:06:16 AM
Public/User_Guide/Offloading_hybrid_apps
=== Building & Executing on DEEP ===

The simplest way to compile this application on the DEEP system is:

{{{#!bash
[...]
}}}

The benchmark versions are built with a specific block size, which is decided at
compilation time (i.e., the binary names contain the block size). The default
block size is `2048`, but we can indicate a different block size when compiling
by doing:

[...]

The next step is the execution of the benchmark on the DEEP system. Since this
application targets the offloading of computation tasks to Unified Memory GPU
devices, we must execute it on a DEEP partition that features this kind of
device. A good example is the [wiki:Public/User_Guide/DEEP-EST_DAM dp-dam]
partition, where each node features:

 * 2x Intel® Xeon® Platinum 8260M CPU @ 2.40GHz (24 cores/socket, 2 threads/core), '''96 CPUs/node'''
 * [...]

{{{#!bash
[...]
}}}

With that command, we will be placed in an interactive session on a `dp-dam`
node that is allocated exclusively to us. Furthermore, by indicating that
configuration (i.e., the `-N`, `-n` and `-c` options), we set the default
configuration for future `srun` executions in that session. Thus, when executing
an MPI binary via `srun`, it will launch 8 processes with 12 CPUs per process by
default. However, we can still change the configuration (without exceeding the
initially allocated resources) by overriding those parameters with new values.

At this point, we are ready to execute the benchmark with multiple MPI processes.
[...] number of '''particles''' with `-p`, the number of '''timesteps''' with
`-t`, and the maximum number of '''GPU processes''' with `-g`. More options can
be listed by passing the `-h` option. An example of execution is:

{{{#!bash
[...]
}}}

in which the application will perform 100 timesteps in 8 MPI processes with 12
cores per process (used by the !OmpSs-2 runtime system). The maximum number of
GPU processes is 4, so there will be 4 GPU processes and 4 CPU processes (all
processes have access to GPU devices). Since the total number of particles is
16384 and the block size is 2048, each process will be in charge of
computing/updating 4096 forces/particles, which are 2 blocks.
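As a rough illustration of such an execution, and of how the default `srun`
configuration can be overridden within the same allocation, a sketch could look
as follows (the binary name is only a placeholder; as noted above, the real
binary names encode the variant and the block size):

{{{#!bash
# Run with the defaults of the interactive allocation
# (8 MPI processes, 12 CPUs per process):
srun ./<nbody-binary> -p 16384 -t 100 -g 4

# Override the defaults for a single run (e.g., 4 processes with 24 CPUs each),
# staying within the resources of the allocation:
srun -n 4 -c 24 ./<nbody-binary> -p 16384 -t 100 -g 2
}}}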
In the CUDA variants, a process can belong to the GPU processes group if it has
access to at least one GPU device. However, in the non-CUDA versions, all
processes can belong to the GPU processes group (i.e., the GPU processes are
simulated). For this reason, the application provides the `-g` option to control
the maximum number of GPU processes. By default, the number of GPU processes is
half of the total number of processes. Also note that the non-CUDA variants
cannot compute kernels on the GPU; in these cases, the structure of the
application is kept, but the CUDA tasks are replaced by regular CPU tasks that
simulate them.

Similarly, the OpenMP variants can be executed following the same steps, but
setting [...]

Finally, the `submit.job` script can be used to submit a non-interactive job to
the job scheduler. Feel free to modify the script with other parameters or job
configurations. We can submit the script by doing:

{{{#!bash
[...]
}}}
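As a minimal sketch, assuming the standard Slurm commands and the script located
in the current directory, the submission and a quick status check could look like:

{{{#!bash
# Submit the job script to the batch scheduler:
sbatch ./submit.job

# Check the state of the submitted job:
squeue -u $USER
}}}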