Changes between Version 12 and Version 13 of Public/User_Guide/Offloading_hybrid_apps


Timestamp: Sep 17, 2019, 4:38:50 PM
Author: Kevin Sala

Legend: unchanged context lines are shown unmarked; lines removed from v12 are prefixed with `-`; lines added in v13 are prefixed with `+`.
  • Public/User_Guide/Offloading_hybrid_apps

    v12 → v13

    v12 lines 79–84 / v13 lines 79–84:

          calls.

    -   * `nbody.mpi.ompss2.cuda.bin`: The same as the previous version but '''offloading''' the computation
    -     tasks of forces between particles blocks to the available GPUs. Those computation tasks are
    +   * `nbody.mpi.ompss2.cuda.bin`: The same as the previous version but '''offloading''' the tasks that
    +     compute the forces between particles blocks to the available GPUs. Those computation tasks are
          offloaded by the '''GPU processes''' and they are the most compute-intensive parts of the program.
          The `calculate_forces_block_cuda` task is annotated as a regular task (e.g., with their

    v12 lines 92–101 / v13 lines 92–114:

        * `nbody.tampi.ompss2.cuda.bin`: A mix of the previous two variants where '''TAMPI''' is leveraged for
    -     allowing the concurrent execution of communication tasks, and GPU processes offload the compute-intensive
    -     tasks to the GPUs.
    -
    -   * `nbody.mpi.omp.bin`:
    -
    -   * `nbody.mpi.omptarget.bin`:
    +     allowing the concurrent execution of communication tasks, and GPU processes '''offload''' the
    +     compute-intensive tasks to the GPUs.
    +
    +   * `nbody.mpi.omp.bin`: Parallel version using MPI + OpenMP tasks. Both '''computation''' and
    +     '''communication''' phases are '''taskified''', however, communication tasks (each one sending
    +     or receiving a block) are serialized by an artificial dependency on a sentinel variable. This
    +     is to prevent deadlocks between processes, since communication tasks perform '''blocking MPI'''
    +     calls.
    +
    +   * `nbody.mpi.omptarget.bin`: The same as the previous version but '''offloading''' the tasks that
    +     compute the forces between particles blocks to the available GPUs. Those computation tasks are
    +     offloaded by the '''GPU processes''' and they are the most compute-intensive parts of the program.
    +     This is done through the `omp target` directive, declaring the corresponding dependencies, and
    +     specifying the `target` as `nowait` (i.e., asynchronous offload). Additionally, the target directive
    +     does not require the user to provide a CUDA implementation of the offloaded task. Finally, since we
    +     are using the Unified Memory feature, we do not need to specify any data movement clause. We only
    +     have to specify that the memory buffers are already device pointers (i.e., with `is_device_ptr`
    +     clause). '''Note:''' This version is not compiled by default since it is still in a ''Work in
    +     Progress'' state.

        * `nbody.tampi.omp.bin`:
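
The sketches below illustrate the three patterns described in the v13 text above. First, `nbody.mpi.ompss2.cuda.bin`: the diff context is cut short, but it states that the `calculate_forces_block_cuda` CUDA kernel is annotated as a regular task. A minimal, hypothetical sketch of such an OmpSs-2 annotation follows; the placeholder types, the dependency expressions, and the `device(cuda)`/`ndrange` launch configuration are assumptions for illustration, not the benchmark's actual code.

{{{#!c
/* Hypothetical placeholder types standing in for the benchmark's real
 * particle/force block structures (sizes and fields are illustrative). */
typedef struct { float x[2048], y[2048], z[2048], m[2048]; } particles_block_t;
typedef struct { float fx[2048], fy[2048], fz[2048]; } forces_block_t;

/* The CUDA kernel is declared as a regular OmpSs-2 task: its in/inout
 * dependencies are handled like those of any other task, device(cuda) asks
 * the runtime to launch it on a GPU, and ndrange(dims, global, block) gives
 * the kernel launch configuration. */
#pragma oss task in(*block1, *block2) inout(*forces) device(cuda) ndrange(1, 2048, 128)
__global__ void calculate_forces_block_cuda(forces_block_t *forces,
                                            const particles_block_t *block1,
                                            const particles_block_t *block2);
}}}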
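For `nbody.mpi.omp.bin`, the added text describes serializing the communication tasks through an artificial dependency on a sentinel variable, since they perform blocking MPI calls. A minimal sketch of that pattern, under assumed names and an even/odd send-receive ordering, is shown below; it is expected to be called from inside an OpenMP `parallel`/`single` region.

{{{#!c
#include <mpi.h>
#include <stddef.h>

/* Illustrative sketch: communication tasks that perform blocking MPI calls,
 * serialized by an artificial inout dependency on a sentinel variable so the
 * scheduler cannot reorder them. */
void exchange_blocks(double *send_buf, double *recv_buf, size_t block_size,
                     int nblocks, int rank, int peer)
{
    int sentinel = 0;  /* dependency token only; never actually accessed */

    for (int b = 0; b < nblocks; b++) {
        double *sb = send_buf + (size_t)b * block_size;
        double *rb = recv_buf + (size_t)b * block_size;

        /* Even ranks send first and odd ranks receive first so the blocking
         * calls always match across processes; the inout dependency on
         * 'sentinel' forces this process's communication tasks to run one at
         * a time, in creation order. */
        if (rank % 2 == 0) {
            #pragma omp task depend(inout: sentinel) firstprivate(sb, b)
            MPI_Send(sb, (int)block_size, MPI_DOUBLE, peer, b, MPI_COMM_WORLD);

            #pragma omp task depend(inout: sentinel) firstprivate(rb, b)
            MPI_Recv(rb, (int)block_size, MPI_DOUBLE, peer, b, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            #pragma omp task depend(inout: sentinel) firstprivate(rb, b)
            MPI_Recv(rb, (int)block_size, MPI_DOUBLE, peer, b, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

            #pragma omp task depend(inout: sentinel) firstprivate(sb, b)
            MPI_Send(sb, (int)block_size, MPI_DOUBLE, peer, b, MPI_COMM_WORLD);
        }
    }
    #pragma omp taskwait
}
}}}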
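Finally, for `nbody.mpi.omptarget.bin`: a hedged sketch of an asynchronous `omp target nowait` offload with task dependencies and `is_device_ptr`, assuming the buffers were allocated with CUDA Unified Memory so that no `map` clauses are needed. The function name and the loop body are placeholders rather than the benchmark's real force kernel.

{{{#!c
/* Illustrative only: offloading a force-computation task with 'omp target'.
 * 'nowait' makes the target region a deferred task whose depend clauses hook
 * into the host task graph; is_device_ptr states that the buffers are already
 * device-accessible (e.g., allocated with cudaMallocManaged), so no map
 * clauses are required. */
void calculate_forces_block(float *forces, const float *block1,
                            const float *block2, int n)
{
    #pragma omp target nowait \
            depend(in: block1[0:n], block2[0:n]) depend(inout: forces[0:n]) \
            is_device_ptr(forces, block1, block2)
    {
        #pragma omp teams distribute parallel for
        for (int i = 0; i < n; i++)
            forces[i] += block1[i] * block2[i];  /* placeholder computation */
    }
}
}}}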