Changes between Version 31 and Version 32 of Public/User_Guide/OmpSs-2
Timestamp: Jun 12, 2019, 11:46:53 AM
Public/User_Guide/OmpSs-2
[[Image(OmpSs2_logo_full.png)]]

= Programming with !OmpSs-2 =

Table of contents:

…

= Quick Overview =

!OmpSs-2 is a programming model composed of a set of directives and library routines that can be used in conjunction with a high-level programming language (such as C, C++ or Fortran) in order to develop concurrent applications. Its name originally comes from two other programming models: **OpenMP** and **!StarSs**. The design principles of these two programming models constitute the fundamental ideas used to conceive the !OmpSs philosophy.

[[Image(OmpSsOpenMP.png, 30%)]]

The !OmpSs-2 **thread-pool** execution model differs from the **fork-join** parallelism implemented in OpenMP.

[[Image(pools.png, 30%)]]

…

[[Image(taskGraph.png, 15%)]]

The reference implementation of !OmpSs-2 is based on the **Mercurium** source-to-source compiler and the **Nanos6** runtime library:
* The Mercurium source-to-source compiler provides the necessary support for transforming the high-level directives into a parallelized version of the application.
* The Nanos6 runtime library provides services to manage all the parallelism in the user application, including task creation, synchronization and data movement, as well as support for resource heterogeneity.

[[Image(MercuriumNanos.png, 35%)]]
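To give a concrete flavour of the directives described above, the following is a minimal, illustrative sketch (it is **not** taken from the official examples repository): each block update becomes a task, and the `in`/`inout` clauses let Nanos6 derive the task dependence graph. The `oss` sentinel, the dependence clauses and the array-section syntax `x[start;size]` follow the !OmpSs-2 specification.

{{{
/* Minimal OmpSs-2 tasking sketch (illustrative, not part of the official examples). */
#include <stdio.h>
#include <stdlib.h>

#define N  (1L << 20)
#define BS (1L << 16)

static void saxpy_block(double a, const double *x, double *y, long n)
{
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(void)
{
    double *x = malloc(N * sizeof(double));
    double *y = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* One task per block; the in()/inout() clauses declare the data each
       task reads/writes so that Nanos6 can order the tasks accordingly. */
    for (long b = 0; b < N; b += BS) {
        #pragma oss task in(x[b;BS]) inout(y[b;BS])
        saxpy_block(0.5, &x[b], &y[b], BS);
    }

    /* Wait for all tasks created above before using the result. */
    #pragma oss taskwait

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}
}}}

Once the `OmpSs-2` module is loaded, a file like this is typically compiled with the Mercurium driver, for instance `mcc --ompss-2 saxpy.c -o saxpy` (see the !OmpSs-2 user guide linked below for the exact flags).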
**Additional information** about the !OmpSs-2 programming model can be found at:
* !OmpSs-2 official website. [https://pm.bsc.es/ompss-2]
* !OmpSs-2 specification. [https://pm.bsc.es/ftp/ompss-2/doc/spec]
* !OmpSs-2 user guide. [https://pm.bsc.es/ftp/ompss-2/doc/user-guide]
* !OmpSs-2 examples repository. [https://pm.bsc.es/gitlab/ompss-2/examples]
* !OmpSs-2 manual with examples and exercises. [https://pm.bsc.es/ftp/ompss-2/doc/examples/index.html]
* Mercurium official website. [https://www.bsc.es/research-and-development/software-and-apps/software-list/mercurium-ccfortran-source-source-compiler Link 1], [https://pm.bsc.es/mcxx Link 2]
* Nanos official website. [https://www.bsc.es/research-and-development/software-and-apps/software-list/nanos-rtl Link 1], [https://pm.bsc.es/nanox Link 2]

…

= Quick Setup on DEEP System =

We highly recommend logging in to a **cluster module (CM) node** to begin using !OmpSs-2. To request an entire CM node for an interactive session, please execute the following command:

`srun --partition=dp-cn --nodes=1 --ntasks=48 --ntasks-per-socket=24 --ntasks-per-node=48 --pty /bin/bash -i`

Note that the command above is consistent with the actual hardware configuration of the cluster module with **hyper-threading enabled**.

!OmpSs-2 has already been installed on DEEP and can be used by simply executing the following commands:
* `modulepath="/usr/local/software/skylake/Stages/2018b/modules/all/Core:$modulepath"`
* `modulepath="/usr/local/software/skylake/Stages/2018b/modules/all/Compiler/mpi/intel/2019.0.117-GCC-7.3.0:$modulepath"`
…
* `module load OmpSs-2`

Remember that !OmpSs-2 uses a **thread-pool** execution model, which means that it **permanently uses all the threads** present on the system. Users are strongly encouraged to always check the **system affinity** by running the **NUMA command** `numactl --show`:
{{{
$ numactl --show
…
}}}

…

Notice that both commands return consistent outputs and, even though an entire node with two sockets has been requested, only the first NUMA node (i.e. socket) has been correctly bound. As a result, only the 24 threads of the first socket (0-11, 24-35), of which 12 are physical and 12 are logical (hyper-threading enabled), are going to be utilised, whilst the 24 threads available in the second socket will remain idle. Therefore, **the system affinity shown above is not valid since it does not represent the resources requested via SLURM.**

System affinity can be used to specify, for example, the ratio of MPI and !OmpSs-2 processes for a hybrid application, and it can be modified by the user in different ways:
* Via SLURM. However, if the affinity does not correspond to the requested resources, as in the previous example, it should be reported to the system administrators.
* Via the command `numactl`.
…
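Besides `numactl`, the binding that SLURM actually grants can also be double-checked from inside a program. The short C sketch below is only an illustration based on the standard Linux `sched_getaffinity` interface (it is not part of !OmpSs-2); its output should match the `physcpubind` line reported by `numactl --show`.

{{{
/* Print the hardware threads this process may run on (Linux).
   The list should match the "physcpubind" line of `numactl --show`. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("process may run on %d hardware threads:", CPU_COUNT(&mask));
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");
    return 0;
}
}}}

Compiling this with `gcc` and running it within the same `srun` session is a quick way to confirm that the affinity really corresponds to the resources requested via SLURM.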
== System configuration ==

Please refer to section [#QuickSetuponDEEPSystem Quick Setup on DEEP System] to get a functional version of !OmpSs-2 on DEEP. It is also recommended to run !OmpSs-2 on a cluster module (CM) node.

== Building and running the examples ==

…

----

= multisaxpy benchmark (!OmpSs-2) =

Users must clone/download this example's repository from [https://pm.bsc.es/gitlab/ompss-2/examples/multisaxpy] and transfer it to a DEEP working directory.

…

= dot-product benchmark (!OmpSs-2) =

Users must clone/download this example's repository from [https://pm.bsc.es/gitlab/ompss-2/examples/dot-product] and transfer it to a DEEP working directory.

…

= mergesort benchmark (!OmpSs-2) =

Users must clone/download this example's repository from [https://pm.bsc.es/gitlab/ompss-2/examples/mergesort] and transfer it to a DEEP working directory.

…

= nqueens benchmark (!OmpSs-2) =

Users must clone/download this example's repository from [https://pm.bsc.es/gitlab/ompss-2/examples/nqueens] and transfer it to a DEEP working directory.

…

= matmul benchmark (!OmpSs-2) =

Users must clone/download this example's repository from [https://pm.bsc.es/gitlab/ompss-2/examples/matmul] and transfer it to a DEEP working directory.

…

= Cholesky benchmark (!OmpSs-2+MKL) =

Users must clone/download this example's repository from [https://pm.bsc.es/gitlab/ompss-2/examples/cholesky] and transfer it to a DEEP working directory.

== Description ==

This benchmark is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose. This Cholesky decomposition is carried out with !OmpSs-2 using tasks with priorities.

There are **3 implementations** of this benchmark.

…

The Makefile has three additional rules:
* **run:** runs each version one after the other.
* **run-graph:** runs the !OmpSs-2 versions with the graph instrumentation.
* **run-extrae:** runs the !OmpSs-2 versions with the extrae instrumentation.

For the graph instrumentation, it is recommended to view the resulting PDF in single page mode and to advance through the pages. This will show the actual instantiation and execution of the code. For the extrae instrumentation, extrae must be loaded and available at least through the `LD_LIBRARY_PATH` environment variable.
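The "tasks with priorities" mentioned in the description correspond to the `priority` clause of the task construct. The sketch below is a simplified, hypothetical illustration of the idea (the clause names follow the !OmpSs-2 specification, but this is not the actual benchmark code): higher priority values hint the Nanos6 scheduler to pick those tasks first among the ones that are ready to run, which factorization codes use to favour tasks on the critical path.

{{{
/* Sketch of OmpSs-2 task priorities (illustrative only, not the Cholesky code). */
#include <stdio.h>

#define NT 16

static void work(int *slot, int value)
{
    *slot = value * value;   /* stand-in for real computation */
}

int main(void)
{
    int data[NT];

    for (int i = 0; i < NT; i++) {
        /* Hypothetical choice: treat lower indices as more urgent. */
        #pragma oss task out(data[i]) priority(NT - i)
        work(&data[i], i);
    }

    /* Reduction task that depends on every element; give it the highest
       priority so it starts as soon as its inputs are ready. */
    #pragma oss task in(data[0;NT]) priority(NT + 1)
    {
        long sum = 0;
        for (int i = 0; i < NT; i++) sum += data[i];
        printf("sum = %ld\n", sum);
    }

    #pragma oss taskwait
    return 0;
}
}}}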
…

= nbody benchmark (MPI+!OmpSs-2+TAMPI) =

Users must clone/download this example's repository from [https://pm.bsc.es/gitlab/ompss-2/examples/nbody] and transfer it to a DEEP working directory.

… it is not.

The interoperability versions (MPI+!OmpSs-2+TAMPI) are compiled only if the environment variable `TAMPI_HOME` is set to the Task-Aware MPI (TAMPI) library's installation directory.

== Execution instructions ==

…

`mpiexec -n 4 -bind-to hwthread:16 ./nbody -t 100 -p 8192`

in which the application will perform 100 timesteps in 4 MPI processes with 16 hardware threads in each process (used by the !OmpSs-2 runtime). The total number of particles will be 8192, so that each process will have 2048 particles (2 blocks per process).

== References ==

…

= heat benchmark (MPI+!OmpSs-2+TAMPI) =

Users must clone/download this example's repository from [https://pm.bsc.es/gitlab/ompss-2/examples/heat] and transfer it to a DEEP working directory.

… binaries by executing the command `make`.

The interoperability versions (MPI+!OmpSs-2+TAMPI) are compiled only if the environment variable `TAMPI_HOME` is set to the Task-Aware MPI (TAMPI) library's installation directory.

== Execution instructions ==

…

in which the application will perform 150 timesteps in 4 MPI processes with 16 hardware threads in each process (used by the !OmpSs-2 runtime). The size of the matrix in each dimension will be 8192 (8192^2^ elements in total), which means that each process will have 2048x8192 elements (16 blocks per process).

…

----

= krist benchmark (!OmpSs-2+CUDA) =

Users must clone/download this example's repository from [https://pm.bsc.es/gitlab/ompss-2/examples/krist] and transfer it to a DEEP working directory.
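Coming back to the hybrid MPI+!OmpSs-2+TAMPI versions above (nbody and heat): the pattern they rely on is to express communication as tasks and let TAMPI make blocking MPI calls issued from inside tasks task-aware. The sketch below is only a schematic illustration of that pattern, assuming TAMPI's `MPI_TASK_MULTIPLE` threading level and its `TAMPI.h` header; the real decomposition and exchange logic live in the examples' sources.

{{{
/* Schematic MPI+OmpSs-2+TAMPI exchange (illustration, not the nbody/heat code).
   Assumption: TAMPI extends MPI_Init_thread with the MPI_TASK_MULTIPLE level,
   so blocking MPI calls inside tasks cooperate with the Nanos6 runtime instead
   of blocking a worker thread. Compile and link against MPI and TAMPI. */
#include <mpi.h>
#include <TAMPI.h>
#include <stdio.h>

#define BS 1024

int main(int argc, char *argv[])
{
    int provided, rank, size;

    /* Request the task-aware threading level added by TAMPI. */
    MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
    if (provided != MPI_TASK_MULTIPLE) {
        fprintf(stderr, "TAMPI task-aware mode not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double halo_out[BS] = {0.0}, halo_in[BS] = {0.0};
    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;

    /* Communication expressed as tasks; the dependences keep the task that
       consumes halo_in from starting before the receive has completed. */
    #pragma oss task in(halo_out[0;BS])
    MPI_Send(halo_out, BS, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);

    #pragma oss task out(halo_in[0;BS])
    MPI_Recv(halo_in, BS, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    #pragma oss task in(halo_in[0;BS])
    printf("rank %d received halo, first element %f\n", rank, halo_in[0]);

    #pragma oss taskwait
    MPI_Finalize();
    return 0;
}
}}}

Such a sketch would be run with at least two MPI processes (e.g. `mpiexec -n 2 ./halo`, where `halo` is a hypothetical binary name), in the same way as the nbody and heat commands shown above.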