Changes between Version 5 and Version 6 of Public/User_Guide/Offloading_hybrid_apps


Timestamp: Sep 17, 2019, 2:47:17 PM
Author: Kevin Sala

 * [#QuickOverview Quick Overview]
 * Examples:
   * [#N-BodyBenchmark N-Body Benchmark]
 * [#References References]

----

== Quick Overview ==

== N-Body Benchmark ==

Users can clone or download this example from https://pm.bsc.es/gitlab/DEEP-EST/apps/NBody.

=== Description ===
An N-Body simulation numerically approximates the evolution of a system of
bodies in which each body continuously interacts with every other body. A
familiar example is an astrophysical simulation in which each body represents a
galaxy or an individual star, and the bodies attract each other through the
gravitational force. N-Body simulation arises in many other computational
science problems as well. For example, protein folding is studied using N-Body
simulation to calculate electrostatic and ''Van der Waals'' forces. Turbulent fluid flow simulation and
global illumination computation in computer graphics are other examples of
problems that use N-Body simulation.

=== Requirements ===

=== Versions ===

The N-Body application has several versions, which are compiled into different binaries
by executing the `make` command. All of them divide the particle space into smaller
blocks. MPI processes are divided into two groups: GPU processes and CPU processes.
     
The available versions are:

 * `nbody.mpi.bin`: Parallel version using MPI.
 * `nbody.mpi.ompss2.bin`: Parallel version using MPI + !OmpSs-2 tasks. Both computation and
   communication phases are taskified; however, communication tasks are serialized by declaring an
   artificial dependency on a sentinel variable. This prevents deadlocks between processes,
   since communication tasks perform blocking MPI calls (see the sketch after this list).
 * `nbody.mpi.ompss2.cuda.bin`: The same as the previous version but using CUDA tasks to
   execute the most compute-intensive parts of the application on the available GPUs.
 * `nbody.tampi.ompss2.bin`: Parallel version using MPI + !OmpSs-2 tasks + TAMPI library. This
   version disables the artificial dependencies on the sentinel variable, so communication tasks can
   run in parallel. The TAMPI library is in charge of managing the blocking MPI calls to avoid
   blocking the underlying execution resources.
 * `nbody.tampi.ompss2.cuda.bin`: The same as the previous version but using CUDA tasks to
   execute the most compute-intensive parts of the application on the available GPUs.
 * `nbody.mpi.omp.bin`: Parallel version using MPI + OpenMP tasks, analogous to the !OmpSs-2 version above.
 * `nbody.mpi.omptarget.bin`: The same as the previous version but offloading the most compute-intensive
   parts of the application to the available GPUs with the OpenMP `target` directive.
 * `nbody.tampi.omp.bin`: Parallel version using MPI + OpenMP tasks + TAMPI library.
 * `nbody.tampi.omptarget.bin`: The same as the previous version but offloading the most compute-intensive
   parts of the application to the available GPUs with the OpenMP `target` directive.
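
To illustrate the difference between the serialized and the TAMPI versions listed above, the following is a minimal, hypothetical sketch of the two communication patterns. It is not taken from the benchmark's sources: the function names, buffers, and the sentinel variable are made up for illustration, and it assumes OmpSs-2 task syntax, with MPI initialized at the `MPI_TASK_MULTIPLE` threading level provided by TAMPI for the second variant.

{{{#!c
#include <mpi.h>

// Sentinel variable: it carries no data and is only used to order tasks.
static char serial_sentinel;

// Pattern of the *.mpi.ompss2 versions: the inout() dependency on the
// sentinel serializes all communication tasks, so two tasks blocked inside
// a blocking MPI call can never deadlock each other.
void exchange_block_serialized(double *sendbuf, double *recvbuf, int bs,
                               int src, int dst)
{
    #pragma oss task in(sendbuf[0;bs]) out(recvbuf[0;bs]) inout(serial_sentinel)
    {
        MPI_Sendrecv(sendbuf, bs, MPI_DOUBLE, dst, 0,
                     recvbuf, bs, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

// Pattern of the *.tampi.ompss2 versions: no sentinel, so communication
// tasks may run concurrently. TAMPI pauses a task while its MPI call is
// pending and hands the core to other ready tasks.
void exchange_block_tampi(double *sendbuf, double *recvbuf, int bs,
                          int src, int dst)
{
    #pragma oss task in(sendbuf[0;bs]) out(recvbuf[0;bs])
    {
        MPI_Sendrecv(sendbuf, bs, MPI_DOUBLE, dst, 0,
                     recvbuf, bs, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
}}}

The key point is that the sentinel dependency trades concurrency for safety, whereas TAMPI recovers that concurrency by handling the blocking MPI calls issued inside tasks.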

=== Building & Executing on DEEP ===

The simplest way to compile this application is:

{{{#!bash
# Clone the benchmark's repository
$ git clone https://pm.bsc.es/gitlab/DEEP-EST/apps/NBody.git
$ cd NBody

# Load the required environment (MPI, CUDA, OmpSs-2, OpenMP, etc.)
# Needed only once per session
$ source ./setenv_deep.sh

# Compile the code
$ make
}}}

The benchmark versions are built with a specific block size, which is
decided at compilation time (i.e., the binary names contain the block
size). The default block size of the benchmark is `2048`. Optionally,
you can indicate a different block size when compiling:

{{{#!bash
$ make BS=1024
}}}

The next step is the execution of the benchmark on the DEEP system. Since
this benchmark targets the offloading of computational tasks to the
GPUs, we must execute it in a DEEP partition that features this kind of
device. A good example is the `dp-dam` partition, where each node features:

 * 2x Intel® Xeon® Platinum 8260M CPU @ 2.40GHz (24 cores/socket, 2 threads/core), '''96 CPUs/node'''
 * 1x NVIDIA Tesla V100 (Volta)
 * Extoll network interconnection

In this case, we are going to request an interactive job in a `dp-dam` node.
All we need to do is:

{{{#!bash
$ srun -p dp-dam -N 1 -n 8 -c 12 -t 01:00:00 --pty /bin/bash -i
}}}

With that command, we get an interactive session on an exclusive
`dp-dam` node. We have indicated that we are going to create 8 processes with
12 CPUs per process when executing binaries with `srun` from within the node.
However, you can change this configuration (without exceeding the initially
allocated resources) by passing a different configuration to the `srun`
command when executing the binaries.
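
For example, within the allocation requested above (8 tasks with 12 CPUs each, i.e., 96 CPUs in total), some hypothetical alternative layouts would be:

{{{#!bash
# Alternative process/core layouts inside the same 96-CPU allocation
# (the benchmark options are described below)
$ srun -n 4 -c 24 ./nbody.tampi.ompss2.cuda.2048bs.bin ...
$ srun -n 2 -c 48 ./nbody.tampi.ompss2.cuda.2048bs.bin ...
}}}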

At this point, we are ready to execute the benchmark with multiple MPI processes.
The benchmark accepts several options. The most relevant ones are the total number
of particles with `-p`, the number of timesteps with `-t`, and the maximum
number of GPU processes with `-g`. More options can be listed by passing `-h`.
An example of an execution is:

{{{#!bash
$ srun -n 8 -c 12 ./nbody.tampi.ompss2.cuda.2048bs.bin -t 100 -p 16384 -g 4
}}}

in which the application performs 100 timesteps with 8 MPI processes and 12
cores per process (used by the !OmpSs-2 runtime system). The maximum number of
GPU processes is 4, so there will be 4 CPU processes and 4 GPU processes (all
processes have access to GPU devices). Since the total number of particles is
16384, each process is in charge of computing/updating 4096 forces/particles,
which corresponds to 2 blocks.
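
These numbers can be checked quickly, assuming the default block size of `2048` and that the 16384 particles are distributed among the 4 processes of each group:

{{{#!bash
$ echo "$((16384 / 4)) particles per process, $((16384 / 4 / 2048)) blocks per process"
4096 particles per process, 2 blocks per process
}}}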

In the CUDA variants, a process can belong to the GPU processes group if it has
access to at least one GPU device. However, in the case of the non-CUDA versions,
all processes can belong to the GPU processes group (i.e., the GPU processes are
simulated). For this reason, the application provides the `-g` option to control
the maximum number of GPU processes. By default, the number of GPU processes is
half of the total number of processes.
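
For instance, assuming the `-g` option behaves as described, the previous run could be limited to at most 2 GPU processes (leaving 6 CPU processes):

{{{#!bash
$ srun -n 8 -c 12 ./nbody.tampi.ompss2.cuda.2048bs.bin -t 100 -p 16384 -g 2
}}}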

Also note that the non-CUDA variants cannot compute kernels on the GPU. In these
cases, the structure of the application is kept, but the CUDA tasks are replaced
by regular CPU tasks.

== References ==