[[TOC]]

= Information about the batch system (SLURM) =

The DEEP prototype system is running SLURM for resource management. The Slurm documentation can be found [https://slurm.schedmd.com/ here].

== Overview ==

Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are `srun` and `sbatch`. The `srun` command can be used to spawn processes ('''please do not use mpiexec'''), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform or module).

== Available Partitions ==

Please note that there is no default partition configured. In order to run a job, you have to specify one of the following partitions, using the `--partition=...` switch:

|| '''Name''' || '''Description''' ||
|| dp-cn || dp-cn[01-50], DEEP Cluster nodes (Xeon Skylake) ||
|| dp-dam || dp-dam[01-16], DEEP DAM nodes (Xeon Cascadelake + 1 V100 + 1 Stratix 10) ||
|| dp-esb || dp-esb[01-75], DEEP ESB nodes connected with IB EDR (Xeon Cascadelake + 1 V100) ||
|| dp-sdv-esb || dp-sdv-esb[01-02], DEEP ESB Test nodes (Xeon Cascadelake + 1 V100) ||
|| ml-gpu || ml-gpu[01-03], GPU test nodes for ML applications (4 V100 cards) ||
|| knl || knl[01,04-06], KNL nodes ||
|| knl256 || knl[01,05], KNL nodes with 64 cores ||
|| knl272 || knl[04,06], KNL nodes with 68 cores ||
|| snc4 || knl[05], KNL node in snc4 memory mode ||
|| debug || all compute nodes (no gateways) ||

You can list the state of the partitions at any time with the `sinfo` command. The properties of a partition (e.g. the maximum walltime) can be seen using

{{{
scontrol show partition
}}}

== Remark about the environment ==

By default, Slurm passes the environment from your job submission session directly to the execution environment. Please be aware of this when running jobs with `srun` or when submitting scripts with `sbatch`. This behaviour can be controlled via the `--export` option. Please refer to the [https://slurm.schedmd.com/ Slurm documentation] for more information. In particular, when submitting job scripts, **it is recommended to load the necessary modules within the script and to submit the script from a clean environment.**

== An introductory example ==

Suppose you have an MPI executable (called `MPI_HelloWorld` in the examples below). There are three ways to start the binary.

=== From a shell on a node ===

If you just need one node to run your interactive session on, you can simply use the `srun` command (without `salloc`), e.g.:

{{{
[kreutz1@deepv ~]$ srun -A deep -N 1 -n 8 -p dp-cn -t 00:30:00 --pty --interactive bash
[kreutz1@dp-cn22 ~]$ srun -n 8 hostname
dp-cn22
dp-cn22
dp-cn22
dp-cn22
dp-cn22
dp-cn22
dp-cn22
dp-cn22
}}}

The environment is transported to the remote shell; no `.profile`, `.bashrc`, ... are sourced (in particular, the default modules from `/etc/profile.d/modules.sh` are not loaded).

As of March 2020, an account has to be specified using the `--account` (short `-A`) option, which is "deepsea" for DEEP-SEA project members. For people not included in the DEEP-SEA project, please use the "Budget" name you received along with your account creation.
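If you are unsure which budgets (accounts) are associated with your user, you can query the Slurm accounting database. The following is only a sketch, assuming `sacctmgr` is accessible from the login node:

{{{
# list the accounts your user may submit with
sacctmgr show associations user=$USER format=Account,User,Partition
}}}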
=== Using `salloc` to allocate several nodes ===

Assume you would like to run an MPI job on 4 cluster nodes with 2 tasks per node. In this case you need to use `salloc`:

{{{
[kreutz1@deepv Temp]$ salloc -A deep -p dp-cn -N 4 -n 8 -t 00:30:00 srun --pty --interactive /bin/bash
[kreutz1@dp-cn01 Temp]$ srun -N 4 -n 8 ./MPI_HelloWorld
Hello World from rank 3 of 8 on dp-cn02
Hello World from rank 7 of 8 on dp-cn04
Hello World from rank 2 of 8 on dp-cn02
Hello World from rank 6 of 8 on dp-cn04
Hello World from rank 0 of 8 on dp-cn01
Hello World from rank 4 of 8 on dp-cn03
Hello World from rank 1 of 8 on dp-cn01
Hello World from rank 5 of 8 on dp-cn03
}}}

Once you get to the compute node, start your application using `srun`. Note that the number of tasks used is the same as specified in the initial `salloc` command above (4 nodes with 2 tasks each). It is also possible to use fewer nodes or tasks in the `srun` command, so the following would work as well:

{{{
[kreutz1@dp-cn01 Temp]$ srun -N 1 -n 1 ./MPI_HelloWorld
Hello World from rank 0 of 1 on dp-cn01
}}}

=== Running directly from the front ends ===

You can run the application directly from the frontend, bypassing the interactive shell. Do not forget to set the correct environment for running your executable on the login node, as this will be used for the execution with `srun`.

{{{
[kreutz1@deepv Temp]$ ml GCC/10.3.0 ParaStationMPI/5.4.9-1
[kreutz1@deepv Temp]$ srun -A deep -p dp-cn -N 4 -n 8 -t 00:30:00 ./MPI_HelloWorld
Hello World from rank 7 of 8 on dp-cn04
Hello World from rank 3 of 8 on dp-cn02
Hello World from rank 6 of 8 on dp-cn04
Hello World from rank 2 of 8 on dp-cn02
Hello World from rank 4 of 8 on dp-cn03
Hello World from rank 0 of 8 on dp-cn01
Hello World from rank 1 of 8 on dp-cn01
Hello World from rank 5 of 8 on dp-cn03
}}}

It can be useful to create an allocation which can then be used for several runs of your job:

{{{
[kreutz1@deepv Temp]$ salloc -A deep -p dp-cn -N 4 -n 8 -t 00:30:00
salloc: Granted job allocation 69263
[kreutz1@deepv Temp]$ srun ./MPI_HelloWorld
Hello World from rank 7 of 8 on dp-cn04
Hello World from rank 3 of 8 on dp-cn02
Hello World from rank 6 of 8 on dp-cn04
Hello World from rank 2 of 8 on dp-cn02
Hello World from rank 5 of 8 on dp-cn03
Hello World from rank 1 of 8 on dp-cn01
Hello World from rank 4 of 8 on dp-cn03
Hello World from rank 0 of 8 on dp-cn01
...
# several more runs
...
[kreutz1@deepv Temp]$ exit
exit
salloc: Relinquishing job allocation 69263
}}}

Note that in this case the `-N` and `-n` options for the `srun` command can be omitted (they default to the corresponding options given to `salloc`).

=== Batch script ===

As stated above, it is recommended to load the necessary modules within the script and to submit the script from a clean environment.

The following script `hello_cluster.sh` will unload all modules and load the modules required for executing the given binary:

{{{
#!/bin/bash

#SBATCH --partition=dp-esb
#SBATCH -A deep
#SBATCH -N 4
#SBATCH -n 8
#SBATCH -o /p/project/cdeep/kreutz1/hello_cluster-%j.out
#SBATCH -e /p/project/cdeep/kreutz1/hello_cluster-%j.err
#SBATCH --time=00:10:00

ml purge
ml GCC ParaStationMPI

srun ./MPI_HelloWorld
}}}

This script requests 4 nodes of the ESB module with 8 tasks, specifies the stdout and stderr files, and asks for 10 minutes of walltime.
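For hybrid MPI+OpenMP applications, a similar script can combine `--cpus-per-task` with `OMP_NUM_THREADS`. The following is only a sketch; the requested sizes and the `hybrid_hello` binary are placeholders:

{{{
#!/bin/bash

#SBATCH --partition=dp-cn
#SBATCH -A deep
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --time=00:10:00

ml purge
ml GCC ParaStationMPI

# one OpenMP thread per CPU allocated to each task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./hybrid_hello
}}}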
You can submit the `hello_cluster.sh` job script as follows:

{{{
[kreutz1@deepv Temp]$ sbatch hello_cluster.sh
Submitted batch job 69264
}}}

... and check what it is doing:

{{{
[kreutz1@deepv Temp]$ squeue -u $USER
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   69264    dp-esb hello_cl  kreutz1 CG       0:04      4 dp-esb[34-37]
}}}

Once the job has finished, you can check the result (and the error file if needed):

{{{
[kreutz1@deepv Temp]$ cat /p/project/cdeep/kreutz1/hello_cluster-69264.out
Hello World from rank 7 of 8 on dp-esb37
Hello World from rank 3 of 8 on dp-esb35
Hello World from rank 5 of 8 on dp-esb36
Hello World from rank 1 of 8 on dp-esb34
Hello World from rank 6 of 8 on dp-esb37
Hello World from rank 2 of 8 on dp-esb35
Hello World from rank 4 of 8 on dp-esb36
Hello World from rank 0 of 8 on dp-esb34
}}}

{{{#!comment
JK: not available anymore in the current Slurm version

== Submitting jobs to alternative modules ==

Users can submit batch jobs to multiple modules by using the `--module-list` extension of the Slurm `sbatch` command. This extension accepts two modules: a primary module, submitted with higher priority, and an alternative module that receives a lower priority in the job queue. In the example below the job is submitted to two modules: the primary module is dp-cn, while the secondary module is dp-dam.

`sbatch --module-list=dp-cn,dp-dam job.batch`

The parameters for the alternative module are automatically calculated by using an internal conversion model. Module-list is an alternative to `--partition`, which does not apply any conversion model and submits the job to multiple partitions with the same configuration.

Available conversion models:

 * CM to DAM: number of requested nodes / 2, number of tasks per node * 2 (CPU cores ratio)
 * ESB to CM: same number of nodes, time limit * 10 (due to GPU vs CPU performance), number of tasks per node / 3 (CPU cores ratio)
 * DAM to CM: number of nodes * 2 (due to available memory), time limit * 10 (due to GPU vs CPU performance), number of tasks per node / 2 (CPU cores ratio)
 * DAM to ESB: number of nodes * 4 (due to available memory), time limit / 4 (due to using more nodes), number of tasks per node / 6 (CPU cores ratio)

At submission time it is recommended to specify the number of nodes, the number of tasks per node, the number of CPUs per task, and the number of GPUs per node.

Module-list is currently not compatible with dependencies specified with the `--dependency` clause, nor with additional partitions specified with `--partition`.
}}}

== Information on past jobs and accounting ==

The `sacct` command can be used to query the Slurm database about past jobs.

{{{
[kreutz1@deepv Temp]$ sacct -j 69268
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
69268+0            bash      dp-cn deepest-a+         96  COMPLETED      0:0
69268+0.0    MPI_Hello+            deepest-a+          2  COMPLETED      0:0
69268+1            bash     dp-dam deepest-a+        384  COMPLETED      0:0
}}}

On the Cluster (CM) nodes it is possible to query the energy consumed by a certain job:

{{{
[kreutz1@deepv kreutz1]$ sacct -o ConsumedEnergy,JobName,JobID,CPUTime,AllocNodes -j 69326
ConsumedEnergy    JobName        JobID    CPUTime AllocNodes
-------------- ---------- ------------ ---------- ----------
       496.70K hpl_MKL_O+        69326   16:28:48          1
             0      batch  69326.batch   16:28:48          1
       496.70K xlinpack_+      69326.0   08:10:24          1
}}}

This feature will also be made available for the ESB nodes.
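By default `sacct` only reports jobs started on the current day. To look further back, you can give a start time and select the fields of interest; a minimal sketch using standard `sacct` options:

{{{
# all of your jobs since 1 May, with a compact custom format
sacct -u $USER --starttime=2023-05-01 \
      --format=JobID,JobName%20,Partition,NNodes,Elapsed,State,ExitCode
}}}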
== Advanced topics ==

For further details on the batch system and `psslurm`, which is used on the DEEP system as well as on the JSC production systems, please refer to the in-depth description of the [https://apps.fz-juelich.de/jsc/hps/jureca/batchsystem.html?highlight=multithreading#allocations-jobs-and-job-steps batch system on JURECA]. Besides extended examples for the allocation of nodes, you will find information on job steps, dependency chains and multithreading there. If you are interested in pinning threads and tasks to certain CPUs or cores, please also take a look at the [https://apps.fz-juelich.de/jsc/hps/jureca/affinity.html Processor Affinity] sections of the JURECA documentation. Most of the information provided there also applies to the DEEP system.

== FAQ ==

=== Is there a cheat sheet for all main Slurm commands? ===

Yes, it is available [https://slurm.schedmd.com/pdfs/summary.pdf here].

=== Why is my job not running? ===

You can check the state of your job with

{{{
scontrol show job <jobid>
}}}

In the output, look for the `Reason` field.

You can check the existing reservations using

{{{
scontrol show res
}}}

=== How can I check which jobs are running on the machine? ===

Please use the `squeue` command (with the `-u $USER` option to list only the jobs belonging to your user ID). A graphical overview can be displayed using the `slurmtop` command.

=== How do I create chain jobs with dependencies? ===

Please consult the `sbatch`/`srun` man page, especially the

{{{
-d, --dependency=
}}}

entry. Jobs can also be chained after they have been submitted, using the `scontrol` command to update their `Dependency` field.

=== How can I check the status of partitions and nodes? ===

The main command to use is `sinfo`. By default, when called alone, `sinfo` lists the available partitions and the number of nodes in each partition in a given state. For example:

{{{
[deamicis1@deepv hybridhello]$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
sdv              up   20:00:00     11   idle deeper-sdv[06-16]
knl              up   20:00:00      1  drain knl01
knl              up   20:00:00      3   idle knl[04-06]
knl256           up   20:00:00      1  drain knl01
knl256           up   20:00:00      1   idle knl05
knl272           up   20:00:00      2   idle knl[04,06]
snc4             up   20:00:00      1   idle knl05
extoll           up   20:00:00     11   idle deeper-sdv[06-16]
ml-gpu           up   20:00:00      3   idle ml-gpu[01-03]
dp-cn            up   20:00:00      1  drain dp-cn33
dp-cn            up   20:00:00      5   resv dp-cn[09-10,25,49-50]
dp-cn            up   20:00:00     44   idle dp-cn[01-08,11-24,26-32,34-48]
dp-dam           up   20:00:00      1 drain* dp-dam08
dp-dam           up   20:00:00      2  drain dp-dam[03,07]
dp-dam           up   20:00:00      3   resv dp-dam[05,09-10]
dp-dam           up   20:00:00      2  alloc dp-dam[01,04]
dp-dam           up   20:00:00      8   idle dp-dam[02,06,11-16]
dp-dam-ext       up   20:00:00      2   resv dp-dam[09-10]
dp-dam-ext       up   20:00:00      6   idle dp-dam[11-16]
dp-esb           up   20:00:00     51 drain* dp-esb[11,26-75]
dp-esb           up   20:00:00      2  drain dp-esb[08,23]
dp-esb           up   20:00:00      2  alloc dp-esb[09-10]
dp-esb           up   20:00:00     20   idle dp-esb[01-07,12-22,24-25]
dp-sdv-esb       up   20:00:00      2   resv dp-sdv-esb[01-02]
psgw-cluster     up   20:00:00      1   idle nfgw01
psgw-booster     up   20:00:00      1   idle nfgw02
debug            up   20:00:00      1 drain* dp-dam08
debug            up   20:00:00      4  drain dp-cn33,dp-dam[03,07],knl01
debug            up   20:00:00     10   resv dp-cn[09-10,25,49-50],dp-dam[05,09-10],dp-sdv-esb[01-02]
debug            up   20:00:00      2  alloc dp-dam[01,04]
debug            up   20:00:00     69   idle deeper-sdv[06-16],dp-cn[01-08,11-24,26-32,34-48],dp-dam[02,06,11-16],knl[04-06],ml-gpu[01-03]
}}}

Please refer to the man page of `sinfo` for more information.
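If you are only interested in a specific partition or in individual nodes, the output can be restricted and reformatted. A short sketch using standard `sinfo` options:

{{{
# one line per node of the dp-cn partition, long format
sinfo -p dp-cn -N -l

# custom output: node name, state and CPU counts (allocated/idle/other/total)
sinfo -p dp-esb -N -o "%N %t %C"
}}}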
=== Can I join stderr and stdout like it was done with `-joe` in Torque? ===

Not directly. In your batch script, redirect stdout and stderr to the same file:

{{{#!sh
...
#SBATCH -o /point/to/the/common/logfile-%j.log
#SBATCH -e /point/to/the/common/logfile-%j.log
...
}}}

(The `%j` will be replaced by the job ID in the file name.) N.B. It might be more efficient to redirect the output of your script's commands to a dedicated file.

{{{#!comment

=== What is the default binding/pinning behaviour on DEEP? ===

DEEP uses a !ParTec-modified version of Slurm called psslurm. In psslurm, the options concerning binding and pinning are different from the ones provided in vanilla Slurm. By default, psslurm uses a ''by rank'' pinning strategy, assigning each Slurm task to a different physical thread on the node, starting from OS processor 0. For example:

{{{#!sh
[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=1 srun -N 1 -n 4 -p dp-cn ./HybridHello | sort -k9n -k11n
Hello from node dp-cn50, core 0; AKA rank 0, thread 0
Hello from node dp-cn50, core 1; AKA rank 1, thread 0
Hello from node dp-cn50, core 2; AKA rank 2, thread 0
Hello from node dp-cn50, core 3; AKA rank 3, thread 0
}}}

**Attention:** please be aware that the psslurm affinity settings only affect the tasks spawned by Slurm. When using threaded applications, the thread affinity will be inherited from the task affinity of the process originally spawned by Slurm. For example, for a hybrid MPI-OpenMP application:

{{{#!sh
[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=4 srun -N 1 -n 4 -c 4 -p dp-dam ./HybridHello | sort -k9n -k11n
Hello from node dp-dam01, core 0-3; AKA rank 0, thread 0
Hello from node dp-dam01, core 0-3; AKA rank 0, thread 1
Hello from node dp-dam01, core 0-3; AKA rank 0, thread 2
Hello from node dp-dam01, core 0-3; AKA rank 0, thread 3
Hello from node dp-dam01, core 4-7; AKA rank 1, thread 0
Hello from node dp-dam01, core 4-7; AKA rank 1, thread 1
Hello from node dp-dam01, core 4-7; AKA rank 1, thread 2
Hello from node dp-dam01, core 4-7; AKA rank 1, thread 3
Hello from node dp-dam01, core 8-11; AKA rank 2, thread 0
Hello from node dp-dam01, core 8-11; AKA rank 2, thread 1
Hello from node dp-dam01, core 8-11; AKA rank 2, thread 2
Hello from node dp-dam01, core 8-11; AKA rank 2, thread 3
Hello from node dp-dam01, core 12-15; AKA rank 3, thread 0
Hello from node dp-dam01, core 12-15; AKA rank 3, thread 1
Hello from node dp-dam01, core 12-15; AKA rank 3, thread 2
Hello from node dp-dam01, core 12-15; AKA rank 3, thread 3
}}}

Be sure to explicitly set the thread affinity in your script (e.g. by exporting environment variables) or directly in your code.
Taking the previous example:

{{{#!sh
[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=4 OMP_PROC_BIND=close srun -N 1 -n 4 -c 4 -p dp-dam ./HybridHello | sort -k9n -k11n
Hello from node dp-dam01, core 0; AKA rank 0, thread 0
Hello from node dp-dam01, core 1; AKA rank 0, thread 1
Hello from node dp-dam01, core 2; AKA rank 0, thread 2
Hello from node dp-dam01, core 3; AKA rank 0, thread 3
Hello from node dp-dam01, core 4; AKA rank 1, thread 0
Hello from node dp-dam01, core 5; AKA rank 1, thread 1
Hello from node dp-dam01, core 6; AKA rank 1, thread 2
Hello from node dp-dam01, core 7; AKA rank 1, thread 3
Hello from node dp-dam01, core 8; AKA rank 2, thread 0
Hello from node dp-dam01, core 9; AKA rank 2, thread 1
Hello from node dp-dam01, core 10; AKA rank 2, thread 2
Hello from node dp-dam01, core 11; AKA rank 2, thread 3
Hello from node dp-dam01, core 12; AKA rank 3, thread 0
Hello from node dp-dam01, core 13; AKA rank 3, thread 1
Hello from node dp-dam01, core 14; AKA rank 3, thread 2
Hello from node dp-dam01, core 15; AKA rank 3, thread 3
}}}

Please refer to the [https://apps.fz-juelich.de/jsc/hps/jureca/affinity.html affinity section] of the JURECA documentation for more information on how to control affinity on the DEEP system using psslurm options. Please be aware that the DEEP partitions have a different number of sockets per node and of cores/threads per socket than JURECA. Please refer to the [wiki:System_overview] or run `lstopo-no-graphics` on the compute nodes to get more information about the hardware configuration of the different modules.

=== How do I use SMT on the DEEP CPUs? ===

On DEEP, SMT is enabled by default on all nodes. Please be aware that on all JSC systems (including DEEP), each hardware thread is exposed by the OS as a separate CPU. For an ''n''-core node with ''m'' hardware threads per core, OS CPUs ''0'' to ''n-1'' correspond to the first hardware thread of all physical cores (across all sockets), OS CPUs ''n'' to ''2n-1'' to the second hardware thread of those cores, and so on.
For instance, on a Cluster node (two sockets with 12 cores each and 2 hardware threads per core):

{{{
[deamicis1@deepv hybridhello]$ srun -N 1 -n 1 -p dp-cn lstopo-no-graphics --no-caches --no-io --no-bridges --of ascii

(ASCII topology map of dp-cn50, 191GB total: two NUMA nodes / packages with 12 cores each,
 every core exposing two PUs. The first hardware threads are numbered PU P#0-11 on package P#0
 and PU P#12-23 on package P#1; the second hardware threads are PU P#24-35 and PU P#36-47.)
}}}

The `PU P#X` entries are the processing unit numbers exposed by the OS.

To exploit SMT, simply run a job using a number of tasks*threads_per_task higher than the number of physical cores available on a node. Please refer to the [https://apps.fz-juelich.de/jsc/hps/jureca/smt.html relevant page] of the JURECA documentation for more information on how to use SMT on the DEEP nodes.

**Attention**: currently the only way to assign Slurm tasks to hardware threads belonging to the same hardware core is to use the `--cpu-bind` option of psslurm with `mask_cpu` to provide affinity masks for each task.
For example:

{{{#!sh
[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=2 OMP_PROC_BIND=close OMP_PLACES=threads srun -N 1 -n 2 -p dp-dam --cpu-bind=mask_cpu:$(printf '%x' "$((2#1000000000000000000000000000000000000000000000001))"),$(printf '%x' "$((2#10000000000000000000000000000000000000000000000010))") ./HybridHello | sort -k9n -k11n
Hello from node dp-dam01, core 0; AKA rank 0, thread 0
Hello from node dp-dam01, core 48; AKA rank 0, thread 1
Hello from node dp-dam01, core 1; AKA rank 1, thread 0
Hello from node dp-dam01, core 49; AKA rank 1, thread 1
}}}

This can be cumbersome for jobs using a large number of tasks per node. In such cases, a tool like [https://www.open-mpi.org/projects/hwloc/ hwloc] (currently available on the compute nodes, but not on the login node!) can be used to calculate the affinity masks to be passed to psslurm.
}}}

{{{#!comment

== PBS/Slurm dictionary ==

|| '''PBS command''' || '''Closest Slurm equivalent''' ||
|| qsub || sbatch ||
|| qsub -I || salloc + srun --pty bash -i ||
|| qsub into an existing reservation || ... --reservation= ... ||
|| pbsnodes || scontrol show node ||
|| pbsnodes (-ln) || sinfo (-R) or sinfo -Rl -h -o "%n %12U %19H %6t %E" | sort -u ||
|| pbsnodes -c -N n || scontrol update NodeName= State=RESUME ||
|| pbsnodes -o || scontrol update NodeName= State=DRAIN reason="some comment here" ||
|| pbstop || smap ||
|| qstat || squeue ||
|| checkjob || scontrol show job ||
|| checkjob -v || scontrol show -d job ||
|| showres || scontrol show res ||
|| setres || scontrol create reservation [ReservationName= ] user=partec Nodes=j3c![053-056] StartTime=now duration=Unlimited Flags=IGNORE_JOBS ||
|| setres -u ALL || scontrol create reservation ReservationName=\ user=\ Nodes=ALL startTime=now duration=unlimited FLAGS=maint,ignore_jobs ||
|| releaseres || scontrol delete ReservationName= ||
}}}