* [#QuickOverview Quick Overview]
* Examples:
  * [#N-BodyBenchmark N-Body Benchmark]
* [#References References]

== Quick Overview ==

== N-Body Benchmark ==

Users can clone or download this example from the

=== Description ===

An N-Body simulation numerically approximates the evolution of a system of
bodies in which each body continuously interacts with every other body.  A
familiar example is an astrophysical simulation in which each body represents a
(...)
electrostatic and ''Van der Waals'' forces. Turbulent fluid flow simulation and
global illumination computation in computer graphics are other examples of
problems that use N-Body simulation.

=== Requirements ===

=== Versions ===

The N-Body application has several versions, which are compiled into separate binaries
by executing the `make` command. All of them divide the particle space into smaller
blocks. MPI processes are divided into two groups: GPU processes and CPU processes.

The available versions are:

  * `nbody.mpi.bin`: Parallel version using MPI.
  * `nbody.mpi.ompss2.bin`: Parallel version using MPI + !OmpSs-2 tasks. Both computation and
    communication phases are taskified; however, communication tasks are serialized by declaring an
    artificial dependency on a sentinel variable. This prevents deadlocks between processes,
    since communication tasks perform blocking MPI calls.
  * `nbody.mpi.ompss2.cuda.bin`: The same as the previous version but using CUDA tasks to
    execute the most compute-intensive parts of the application on the available GPUs.
  * `nbody.tampi.ompss2.bin`: Parallel version using MPI + !OmpSs-2 tasks + TAMPI library. This
    version disables the artificial dependencies on the sentinel variable, so communication tasks can
    run in parallel. The TAMPI library is in charge of managing the blocking MPI calls to avoid
    blocking the underlying execution resources.
  * `nbody.tampi.ompss2.cuda.bin`: The same as the previous version but using CUDA tasks to
    execute the most compute-intensive parts of the application on the available GPUs.
  * `nbody.mpi.omp.bin`:
  * `nbody.mpi.omptarget.bin`:
  * `nbody.tampi.omp.bin`:
  * `nbody.tampi.omptarget.bin`:

=== Building & Executing on DEEP ===

The simplest way to compile this application is:

{{{
# Clone the benchmark's repository
$ git clone
$ cd NBody

# Load the required environment (MPI, CUDA, OmpSs-2, OpenMP, etc.)
# Needed only once per session
$ source ./

# Compile the code
$ make
}}}

The benchmark versions are built with a specific block size, which is
decided at compilation time (i.e., the binary names contain the block
size). The default block size of the benchmark is `2048`. Optionally,
you can indicate a different block size when compiling by doing:

{{{
$ make BS=1024
}}}

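
The binary used in the execution example of this section, `nbody.tampi.ompss2.cuda.2048bs.bin`, suggests how the block size appears in the binary names. The sketch below composes a name from a variant and a block size. The helper `binary_name` is hypothetical, and it assumes every variant follows the same naming pattern as that one binary.

```shell
# Hypothetical helper: compose a binary name from a variant and a block size,
# following the pattern of nbody.tampi.ompss2.cuda.2048bs.bin.
# Assumption: all variants use the same "<variant>.<block size>bs.bin" pattern.
binary_name() {
    # $1 = variant (e.g. tampi.ompss2.cuda), $2 = block size passed as BS
    echo "nbody.$1.${2}bs.bin"
}

# Name expected after building with "make BS=1024"
binary_name tampi.ompss2.cuda 1024
```

Under that assumption, binaries built with different block sizes can coexist in the same directory, since the block size is part of the file name.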
The next step is the execution of the benchmark on the DEEP system. Since
this benchmark targets the offloading of computational tasks to the
GPUs, we must execute it in a DEEP partition that features this kind of
device. A good example is the `dp-dam` partition, where each node features:

* 2x Intel® Xeon® Platinum 8260M CPU @ 2.40GHz (24 cores/socket, 2 threads/core), '''96 CPUs/node'''
* 1x NVIDIA Tesla V100 (Volta)
* Extoll network interconnection

In this case, we are going to request an interactive job in a `dp-dam` node.
All we need to do is:

{{{
$ srun -p dp-dam -N 1 -n 8 -c 12 -t 01:00:00 --pty /bin/bash -i
}}}

With that command, we will be placed in an interactive session on an exclusive
`dp-dam` node. We have indicated that we will create 8 processes with
12 CPUs per process when executing binaries with `srun` from within the node.
However, you can change the configuration (without exceeding the
initially requested resources) when executing the binaries by passing different
options to the `srun` command.
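
As a rough check of what staying within the initial resources means, the sketch below compares a candidate `-n`/`-c` configuration against the 8 x 12 = 96 CPUs requested for the interactive job. The helper `fits_allocation` is hypothetical and only considers the CPU count; it does not query the scheduler.

```shell
# CPUs requested for the interactive job: -n 8 -c 12
ALLOC_CPUS=$(( 8 * 12 ))

# Hypothetical helper: does a (processes, CPUs per process) configuration
# stay within the CPUs requested for the interactive job?
fits_allocation() {
    # $1 = number of processes (-n), $2 = CPUs per process (-c)
    if [ $(( $1 * $2 )) -le "$ALLOC_CPUS" ]; then
        echo "fits"
    else
        echo "exceeds"
    fi
}

fits_allocation 4 24    # 4 x 24 = 96 CPUs
fits_allocation 8 24    # 8 x 24 = 192 CPUs
```

For instance, running the binaries with `srun -n 4 -c 24` would use the same number of CPUs as the original request, with fewer and larger processes.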

At this point, we are ready to execute the benchmark with multiple MPI processes.
The benchmark accepts several options. The most relevant ones are the total number
of particles with `-p`, the number of timesteps with `-t`, and the maximum
number of GPU processes with `-g`. More options can be listed by passing the `-h` option.
An example of an execution is:

{{{
$ srun -n 8 -c 12 ./nbody.tampi.ompss2.cuda.2048bs.bin -t 100 -p 16384 -g 4
}}}

in which the application will perform 100 timesteps in 8 MPI processes with 12
cores per process (used by the !OmpSs-2 runtime system). The maximum number of
GPU processes is 4, so there will be 4 CPU processes and 4 GPU processes (all
processes have access to GPU devices). Since the total number of particles is
16384, each process will be in charge of computing/updating 4096 forces/particles
(16384 divided among the 4 processes of its group), which are 2 blocks of 2048.
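
The arithmetic of this example can be reproduced with a small sketch. The helper `per_process_load` is hypothetical, and it assumes that the particles are divided evenly among the processes of one group (4 GPU processes or 4 CPU processes here), which is the reading consistent with the figures above.

```shell
# Hypothetical helper: particles and blocks handled by each process of a group.
# Assumption: particles are split evenly among the processes of the group.
per_process_load() {
    # $1 = total particles (-p), $2 = processes in the group, $3 = block size
    per_proc=$(( $1 / $2 ))
    echo "$per_proc particles, $(( per_proc / $3 )) blocks"
}

# Example from the text: 16384 particles, groups of 4 processes, block size 2048
per_process_load 16384 4 2048
```

With these values the helper reports 4096 particles and 2 blocks per process, matching the example.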

In the CUDA variants, a process can belong to the GPU processes group if it has
access to at least one GPU device. However, in the case of the non-CUDA versions,
all processes can belong to the GPU processes group (i.e., the GPU processes are
simulated). For this reason, the application provides the `-g` option to
control the maximum number of GPU processes. By default, the number of GPU processes
will be half of the total number of processes.
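
The default split can be sketched as follows. The helper `split_groups` is hypothetical; the integer division for odd process counts and the cap of `-g` at the total number of processes are assumptions, not documented behavior. In the CUDA variants the actual split also depends on which processes can access a GPU device, which this sketch ignores.

```shell
# Hypothetical helper: split the MPI processes into GPU and CPU groups.
split_groups() {
    # $1 = total MPI processes, $2 = optional maximum of GPU processes (-g)
    total=$1
    gpu=${2:-$(( total / 2 ))}      # default: half of the processes
    if [ "$gpu" -gt "$total" ]; then
        gpu=$total                  # assumption: cannot exceed the total
    fi
    echo "$gpu GPU processes, $(( total - gpu )) CPU processes"
}

split_groups 8       # no -g given: default split
split_groups 8 2     # as if -g 2 had been passed
```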

Also note that the non-CUDA variants cannot compute kernels on the GPU. In these
cases, the structure of the application is kept but the CUDA tasks are replaced
by regular CPU tasks.
