Changes between Version 21 and Version 22 of Public/User_Guide/Offloading_hybrid_apps
Timestamp: Sep 18, 2019, 9:06:16 AM
Public/User_Guide/Offloading_hybrid_apps
=== Building & Executing on DEEP ===

The simplest way to compile this application on the DEEP system is:

{{{#!bash
[...]
}}}

The benchmark versions are built with a specific block size, which is decided at
compilation time (i.e., the binary names contain the block size). The default
block size is `2048`, but we can indicate a different block size when compiling
by doing:

[...]

The next step is the execution of the benchmark on the DEEP system. Since this
application targets the offloading of computation tasks to Unified Memory GPU
devices, we must execute it on a DEEP partition that features this kind of
device. A good example is the [wiki:Public/User_Guide/DEEP-EST_DAM dp-dam]
partition, where each node features:

 * 2x Intel® Xeon® Platinum 8260M CPU @ 2.40GHz (24 cores/socket, 2 threads/core), '''96 CPUs/node'''
 * [...]

{{{#!bash
[...]
}}}

With that command, we will be placed in an interactive session on a `dp-dam`
node that is allocated exclusively to us. Furthermore, by indicating that
configuration (i.e., the `-N`, `-n` and `-c` options), we set the default
configuration for future `srun` executions in that session. Thus, when executing
an MPI binary via `srun`, it will launch 8 processes with 12 CPUs per process by
default. However, we can still change the configuration (without exceeding the
initially allocated resources) by overriding those parameters with new values.

At this point, we are ready to execute the benchmark with multiple MPI processes.
[...] number of '''particles''' with `-p`, the number of '''timesteps''' with
`-t`, and the maximum number of '''GPU processes''' with `-g`. More options can
be listed by passing the `-h` option. An example of execution is:

{{{#!bash
[...]
}}}

in which the application will perform 100 timesteps in 8 MPI processes with 12
cores per process (used by the !OmpSs-2 runtime system). The maximum number of
GPU processes is 4, so there will be 4 GPU processes and 4 CPU processes (all
processes have access to GPU devices). Since the total number of particles is
16384 and the block size is 2048, each process will be in charge of
computing/updating 4096 forces/particles, which are 2 blocks.
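As a rough illustration of such an execution, and of how the default `srun`
configuration can be overridden within the same allocation, a sketch could look
as follows (the binary name is only a placeholder; as noted above, the real
binary names encode the variant and the block size):

{{{#!bash
# Run with the defaults of the interactive allocation
# (8 MPI processes, 12 CPUs per process):
srun ./<nbody-binary> -p 16384 -t 100 -g 4

# Override the defaults for a single run (e.g., 4 processes with 24 CPUs each),
# staying within the resources of the allocation:
srun -n 4 -c 24 ./<nbody-binary> -p 16384 -t 100 -g 2
}}}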
In the CUDA variants, a process can belong to the GPU processes group if it has
access to at least one GPU device. However, in the non-CUDA versions, all
processes can belong to the GPU processes group (i.e., the GPU processes are
simulated). For this reason, the application provides the `-g` option to control
the maximum number of GPU processes. By default, the number of GPU processes is
half of the total number of processes. Also note that the non-CUDA variants
cannot compute kernels on the GPU; in these cases, the structure of the
application is kept, but the CUDA tasks are replaced by regular CPU tasks that
simulate them.

Similarly, the OpenMP variants can be executed following the same steps, but
setting [...]

Finally, the `submit.job` script can be used to submit a non-interactive job to
the job scheduler. Feel free to modify the script with other parameters or job
configurations. We can submit the script by doing:

{{{#!bash
[...]
}}}
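As a minimal sketch, assuming the standard Slurm commands and the script located
in the current directory, the submission and a quick status check could look like:

{{{#!bash
# Submit the job script to the batch scheduler:
sbatch ./submit.job

# Check the state of the submitted job:
squeue -u $USER
}}}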