Changes between Version 21 and Version 22 of Public/User_Guide/Offloading_hybrid_apps


Timestamp: Sep 18, 2019, 9:06:16 AM
Author: Kevin Sala
Comment: (none)

Legend: unchanged lines show both version line numbers; lines removed in v22 are marked '-'; lines added in v22 are marked '+'; elided unchanged regions are shown as '…'.

  • Public/User_Guide/Offloading_hybrid_apps (v21 → v22)

157 157   === Building & Executing on DEEP ===
158 158   
159     - The simplest way to compile this application is:
    159 + The simplest way to compile this application on the DEEP system is:
160 160   
161 161   {{{#!bash
      …
174 174   The benchmark versions are built with a specific block size, which is
175 175   decided at compilation time (i.e., the binary names contain the block
176     - size). The default block size is `2048`. Optionally, we can indicate
    176 + size). The default block size is `2048`, but we can indicate
177 177   a different block size when compiling by doing:
178 178   
      …
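The actual build commands are elided in this changeset. As a hedged sketch only, assuming the benchmark ships a Makefile that accepts a block-size variable (the variable name `BS` is an assumption, not taken from the page), building with the default and with a custom block size could look like:

{{{#!bash
# Hedged sketch: the real build lines are elided in this changeset.
make            # build the benchmark variants with the default block size (2048)
make BS=1024    # 'BS' is an assumed Makefile variable selecting another block size
}}}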
182 182   
183 183   The next step is the execution of the benchmark on the DEEP system. Since
184     - this benchmarks targets the offloading of computational tasks to the Unified
    184 + this application targets the offloading of computation tasks to Unified
185 185   Memory GPU devices, we must execute it in a DEEP partition that features this
186 186   kind of devices. A good example is the [wiki:Public/User_Guide/DEEP-EST_DAM dp-dam]
187     - partition, where each nodes features:
    187 + partition, where each node features:
188 188   
189 189   * 2x Intel® Xeon® Platinum 8260M CPU @ 2.40GHz (24 cores/socket, 2 threads/core), '''96 CPUs/node'''
      …
198 198   }}}
199 199   
200     - With that command, we will be prompted to an interactive session in an exclusive
201     - `dp-dam` node. We indicated that we want to launch 8 processes with 12 CPUs per
202     - process at the moment of launching a binary in the allocated node through `srun`.
203     - However, we should be able to change the configuration (without overtaking the
204     - initial number of resources) when executing the binaries passing a different
205     - configuration to the `srun` command.
    200 + With that command, we obtain an interactive session on a `dp-dam` node allocated
    201 + exclusively for us. Furthermore, by indicating that configuration (i.e., the `-N`,
    202 + `-n` and `-c` options), we set the default configuration for future `srun`
    203 + executions in that session. Thus, when executing an MPI binary via `srun`, it will
    204 + launch 8 processes with 12 CPUs per process by default. However, we can still
    205 + change the configuration (without exceeding the initially allocated resources) by
    206 + overriding those parameters with new ones.
206 207   
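The allocation command itself is elided above. As a hedged sketch, assuming the standard Slurm `srun` interface available on DEEP (account and time-limit options omitted), the interactive allocation and a later override of its defaults could look like:

{{{#!bash
# Hedged sketch, not the exact command from the page: assumes Slurm's srun.
# Request an interactive session on one dp-dam node with 8 tasks and 12 CPUs
# per task; these values become the defaults for later srun calls in the session.
srun -p dp-dam -N 1 -n 8 -c 12 --pty /bin/bash -i

# Later srun calls may override the defaults, e.g. 4 processes with 24 CPUs each,
# as long as the initial allocation is not exceeded ('app' is a placeholder name).
srun -n 4 -c 24 ./app
}}}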
207 208   At this point, we are ready to execute the benchmark with multiple MPI processes.
      …
209 210   number of '''particles''' with `-p`, the number of '''timesteps''' with `-t`, and the
210 211   maximum number of '''GPU processes''' with `-g`. More options can be seen passing
211     - the `-h` option. An example of an execution is:
    212 + the `-h` option. An example of execution is:
212 213   
213 214   {{{#!bash
      …
217 218   in which the application will perform 100 timesteps in 8 MPI processes with 12
218 219   cores per process (used by the !OmpSs-2's runtime system). The maximum number of
219     - GPU processes is 4, so there will be 4 CPU processes and 4 GPU processes (all
    220 + GPU processes is 4, so there will be 4 GPU processes and 4 CPU processes (all
220 221   processes have access to GPU devices). Since the total number of particles is
221     - 16384 and the blocks size is 2048, each process will be in charge of computing
222     - / updating 4096 forces / particles, which are 2 blocks.
    222 + 16384 and the block size is 2048, each process will be in charge of computing/
    223 + updating 4096 forces/particles, which are 2 blocks.
223 224   
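The example command itself is elided in this changeset. A hedged sketch of such an invocation, using a placeholder binary name (the real binary names encode the variant and the block size), is:

{{{#!bash
# Hedged sketch with a placeholder binary name; only the application options
# (-p particles, -t timesteps, -g maximum GPU processes) come from the text.
# The 8 processes with 12 CPUs each are inherited from the srun defaults above.
srun ./benchmark_placeholder.2048.bin -p 16384 -t 100 -g 4
}}}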
224 225   In the CUDA variants, a process can belong to the GPU processes group if it has
225 226   access to at least one GPU device. However, in the case of the non-CUDA versions,
226     - all processes can belong to the GPU processes group (i.e., the GPU processes are
227     - simulated). For this reason, the application provides `-g` option in order to
    227 + all processes can belong to the GPU processes group (i.e., we simulate the GPU
    228 + processes). For this reason, the application provides the `-g` option in order to
228 229   control the maximum number of GPU processes. By default, the number of GPU processes
229     - will be half of the total number of processes. Also note that the non-CUDA variants
230     - cannot compute kernels on the GPU. In these cases, the structure of the application
231     - is kept but the CUDA tasks are replaced by regular CPU tasks.
    230 + will be half of the total number of processes. Also, note that the non-CUDA variants
    231 + cannot compute kernels on the GPU. In these cases, we have kept the structure of the
    232 + application, but we have replaced the CUDA tasks by regular CPU tasks, as a simulation.
232 233   
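As a hedged illustration of the `-g` default described above, again with a placeholder binary name:

{{{#!bash
# Placeholder binary name; with 8 MPI processes and -g omitted, half of them
# (4) act as GPU processes by default, while -g 2 limits them to 2.
srun ./benchmark_placeholder.2048.bin -p 16384 -t 100
srun ./benchmark_placeholder.2048.bin -p 16384 -t 100 -g 2
}}}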
233 234   Similarly, the OpenMP variants can be executed following the same steps but setting
      …
241 242   Finally, the `submit.job` script can be used to submit a non-interactive job into
242 243   the job scheduler system. Feel free to modify the script with other parameters or
243     - job configurations. The script can be submitted by:
    244 + job configurations. We can submit the script by doing:
244 245   
245 246   {{{#!bash
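# Hedged sketch: the remainder of this block is not shown in the changeset.
# On a Slurm-based system such as DEEP, the submission would typically be:
sbatch submit.job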