Version 64 (modified by Anke Kreuzer, 2 months ago)

Information about the batch system (SLURM)

The DEEP prototype system uses Slurm for resource management. Documentation of Slurm can be found here.

Overview

Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are srun and sbatch. The srun command can be used to spawn processes (please do not use mpiexec), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform or module).

Available Partitions

Please note that there is no default partition configured. In order to run a job, you have to specify one of the following partitions, using the --partition=... switch:

Name        Description
dp-cn       dp-cn[01-50], DEEP Cluster nodes (Xeon Skylake)
dp-dam      dp-dam[01-16], DEEP DAM nodes (Xeon Cascadelake + 1 V100 + 1 Stratix 10)
dp-esb      dp-esb[01-75], DEEP ESB nodes connected with IB EDR (Xeon Cascadelake + 1 V100)
dp-sdv-esb  dp-sdv-esb[01-02], DEEP ESB Test nodes (Xeon Cascadelake + 1 V100)
ml-gpu      ml-gpu[01-03], GPU test nodes for ML applications (4 V100 cards)
knl         knl[01,04-06], KNL nodes
knl256      knl[01,05], KNL nodes with 64 cores
knl272      knl[04,06], KNL nodes with 68 cores
snc4        knl[05], KNL node in snc4 memory mode
debug       all compute nodes (no gateways)

You can list the state of the partitions at any time with the sinfo command. The properties of a partition (e.g. the maximum walltime) can be seen using

scontrol show partition <partition>
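
To read a single property such as the maximum walltime in a script, the output of scontrol can be filtered. The following is a minimal sketch: the function name partition_max_time and the sample text are illustrative assumptions, and the exact set of fields printed by scontrol may differ between Slurm versions.

```shell
#!/bin/bash
# Print the MaxTime value found in `scontrol show partition` output (read from stdin).
partition_max_time() {
    grep -o 'MaxTime=[^ ]*' | head -n 1 | cut -d= -f2
}

# On the system (needs Slurm):
#   scontrol show partition dp-cn | partition_max_time
# Offline demonstration with a hypothetical excerpt of the output:
sample='PartitionName=dp-cn
   DefaultTime=01:00:00 DisableRootJobs=NO MaxNodes=UNLIMITED MaxTime=20:00:00 MinNodes=1'
printf '%s\n' "$sample" | partition_max_time
```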

Remark about environment

By default, Slurm passes the environment from your job submission session directly to the execution environment. Please be aware of this when running jobs with srun or when submitting scripts with sbatch. This behavior can be controlled via the --export option. Please refer to the Slurm documentation to get more information about this.

In particular, when submitting job scripts, it is recommended to load the necessary modules within the script and submit the script from a clean environment.
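
The effect of an inherited environment can be illustrated without Slurm. The sketch below uses plain child processes as a stand-in for a job: one inherits the environment (comparable to Slurm's default behaviour), the other starts from an empty one (comparable in spirit to --export=NONE). The variable name MY_SETTING is a hypothetical example.

```shell
#!/bin/bash
# A variable exported in the submitting shell ...
export MY_SETTING=from_login_shell
bash_bin=$(command -v bash)

# ... is visible in a child process that inherits the environment,
# just as a job sees the submission environment by default ...
inherited=$("$bash_bin" -c 'echo "${MY_SETTING:-unset}"')

# ... but not in a child process started from an empty environment.
clean=$(env -i "$bash_bin" -c 'echo "${MY_SETTING:-unset}"')

echo "inherited: $inherited"
echo "clean:     $clean"
```

On the system, the corresponding choice is made at submission time, for example with sbatch --export=NONE hello_cluster.sh. How job steps inside the script inherit the environment in that case is described in the Slurm documentation.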

An introductory example

Suppose you have an MPI executable named MPI_HelloWorld. There are three ways to start the binary.

From a shell on a node

If you need just one node for your interactive session, you can simply use the srun command (without salloc), e.g.:

[kreutz1@deepv ~]$ srun -A deep -N 1 -n 8 -p dp-cn -t 00:30:00 --pty --interactive bash
[kreutz1@dp-cn22 ~]$ srun -n 8 hostname
dp-cn22
dp-cn22
dp-cn22
dp-cn22
dp-cn22
dp-cn22
dp-cn22
dp-cn22

The environment is transported to the remote shell; no .profile, .bashrc, … files are sourced (in particular, the module defaults from /etc/profile.d/modules.sh are not loaded). As of March 2020, an account has to be specified using the --account (short -A) option, which is "deepsea" for DEEP-SEA project members. If you are not a member of the DEEP-SEA project, please use the "Budget" name you received when your account was created.

Assume you would like to run an MPI application on 4 cluster nodes with 2 tasks per node. In this case it is necessary to use salloc:

[kreutz1@deepv Temp]$ salloc -A deep -p dp-cn -N 4 -n 8 -t 00:30:00 srun --pty --interactive /bin/bash
[kreutz1@dp-cn01 Temp]$ srun -N 4 -n 8 ./MPI_HelloWorld
Hello World from rank 3 of 8 on dp-cn02
Hello World from rank 7 of 8 on dp-cn04
Hello World from rank 2 of 8 on dp-cn02
Hello World from rank 6 of 8 on dp-cn04
Hello World from rank 0 of 8 on dp-cn01
Hello World from rank 4 of 8 on dp-cn03
Hello World from rank 1 of 8 on dp-cn01
Hello World from rank 5 of 8 on dp-cn03


Once you get to the compute node, start your application using srun. Note that the number of tasks used is the same as specified in the initial salloc command above (4 nodes with two tasks each). It's also possible to use fewer nodes and tasks in the srun command. So the following command would work as well:

[kreutz1@dp-cn01 Temp]$ srun -N 1 -n 1 ./MPI_HelloWorld
Hello World from rank 0 of 1 on dp-cn01
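
The relation between the node count and the task count in the examples above (4 nodes with 2 tasks each give 8 tasks) can be made explicit. The helper below is only a sketch: the function name srun_cmdline is an assumption, and it merely prints a command line rather than running it. The --ntasks-per-node option is a standard srun option that states the distribution explicitly.

```shell
#!/bin/bash
# Print an srun command line for a given node count and tasks per node.
# The total task count passed to -n is nodes * tasks_per_node.
srun_cmdline() {
    local nodes=$1 tasks_per_node=$2 binary=$3
    local total=$(( nodes * tasks_per_node ))
    echo "srun -N $nodes -n $total --ntasks-per-node=$tasks_per_node $binary"
}

srun_cmdline 4 2 ./MPI_HelloWorld
```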

Running directly from the front ends

You can run the application directly from the frontend, bypassing the shell. Do not forget to set up the correct environment for your executable on the login node, as this environment will be used for the execution with srun.

[kreutz1@deepv Temp]$ ml GCC/10.3.0 ParaStationMPI/5.4.9-1
[kreutz1@deepv Temp]$ srun -A deep -p dp-cn -N 4 -n 8 -t 00:30:00 ./MPI_HelloWorld
Hello World from rank 7 of 8 on dp-cn04
Hello World from rank 3 of 8 on dp-cn02
Hello World from rank 6 of 8 on dp-cn04
Hello World from rank 2 of 8 on dp-cn02
Hello World from rank 4 of 8 on dp-cn03
Hello World from rank 0 of 8 on dp-cn01
Hello World from rank 1 of 8 on dp-cn01
Hello World from rank 5 of 8 on dp-cn03

It can be useful to create an allocation which can be used for several runs of your job:

[kreutz1@deepv Temp]$ salloc -A deep -p dp-cn -N 4 -n 8 -t 00:30:00
salloc: Granted job allocation 69263
[kreutz1@deepv Temp]$ srun ./MPI_HelloWorld
Hello World from rank 7 of 8 on dp-cn04
Hello World from rank 3 of 8 on dp-cn02
Hello World from rank 6 of 8 on dp-cn04
Hello World from rank 2 of 8 on dp-cn02
Hello World from rank 5 of 8 on dp-cn03
Hello World from rank 1 of 8 on dp-cn01
Hello World from rank 4 of 8 on dp-cn03
Hello World from rank 0 of 8 on dp-cn01
...
# several more runs
...
[kreutz1@deepv Temp]$ exit
exit
salloc: Relinquishing job allocation 69263

Note that in this case the -N and -n options for the srun command can be skipped (they default to the corresponding options given to salloc).
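
Because the shell started by salloc looks like an ordinary login shell, it can be handy to check whether an allocation is active before calling srun. Slurm sets the SLURM_JOB_ID variable inside allocations and batch jobs; the function name allocation_info in this sketch is an assumption.

```shell
#!/bin/bash
# Report whether the current shell runs inside a Slurm allocation.
# Slurm sets SLURM_JOB_ID in shells started by salloc and in batch jobs.
allocation_info() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside allocation ${SLURM_JOB_ID}"
    else
        echo "no allocation"
    fi
}

allocation_info
```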

Batch script

As stated above, it is recommended to load the necessary modules within the script and submit the script from a clean environment.

The following script hello_cluster.sh will unload all modules and load the modules required for executing the given binary:

#!/bin/bash

#SBATCH --partition=dp-esb
#SBATCH -A deep
#SBATCH -N 4
#SBATCH -n 8
#SBATCH -o /p/project/cdeep/kreutz1/hello_cluster-%j.out
#SBATCH -e /p/project/cdeep/kreutz1/hello_cluster-%j.err
#SBATCH --time=00:10:00

ml purge
ml GCC ParaStationMPI
srun ./MPI_HelloWorld

This script requests 4 nodes of the ESB module with 8 tasks, specifies the stdout and stderr files, and asks for 10 minutes of walltime. You can submit the job script as follows:

[kreutz1@deepv Temp]$ sbatch hello_cluster.sh 
Submitted batch job 69264
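
If the job id is needed later in a script, it can be extracted from the message printed by sbatch. This is a minimal sketch with an assumed helper name, parse_job_id; alternatively, sbatch --parsable prints the job id without the surrounding text.

```shell
#!/bin/bash
# Extract the numeric job id from sbatch's "Submitted batch job <id>" message.
parse_job_id() {
    echo "$1" | awk '/^Submitted batch job/ {print $4}'
}

# On the system (needs Slurm):
#   jobid=$(parse_job_id "$(sbatch hello_cluster.sh)")
parse_job_id "Submitted batch job 69264"
```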

… and check what it's doing:

[kreutz1@deepv Temp]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             69264     dp-cn hello_cl  kreutz1 CG       0:04      4 dp-cn[01-04]

Once the job has finished, you can check the result (and the error file if needed):

[kreutz1@deepv Temp]$ cat /p/project/cdeep/kreutz1/hello_cluster-69264.out 
Hello World from rank 7 of 8 on dp-esb37
Hello World from rank 3 of 8 on dp-esb35
Hello World from rank 5 of 8 on dp-esb36
Hello World from rank 1 of 8 on dp-esb34
Hello World from rank 6 of 8 on dp-esb37
Hello World from rank 2 of 8 on dp-esb35
Hello World from rank 4 of 8 on dp-esb36
Hello World from rank 0 of 8 on dp-esb34

Information on past jobs and accounting

The sacct command can be used to query the Slurm database about a past job.

[kreutz1@deepv Temp]$ sacct -j 69268
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
69268+0            bash      dp-cn deepest-a+         96  COMPLETED      0:0 
69268+0.0    MPI_Hello+            deepest-a+          2  COMPLETED      0:0 
69268+1            bash     dp-dam deepest-a+        384  COMPLETED      0:0 
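
For use in scripts, sacct can print pipe-separated output without a header (-P and -n), restricted to the allocation itself (-X) and to selected fields (-o). The sketch below parses such output; the function name job_states and the sample line are illustrative assumptions.

```shell
#!/bin/bash
# Print "<jobid> <state>" for every line of pipe-separated sacct output on stdin.
job_states() {
    awk -F'|' '{print $1, $2}'
}

# On the system (needs Slurm):
#   sacct -j 69264 -X -n -P -o JobID,State,ExitCode | job_states
# Offline demonstration with a hypothetical output line:
printf '%s\n' '69264|COMPLETED|0:0' | job_states
```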

On the Cluster (CM) nodes it's possible to query the consumed energy for a certain job:

[kreutz1@deepv kreutz1]$ sacct -o ConsumedEnergy,JobName,JobID,CPUTime,AllocNodes -j 69326
ConsumedEnergy    JobName        JobID    CPUTime AllocNodes 
-------------- ---------- ------------ ---------- ---------- 
       496.70K hpl_MKL_O+ 69326          16:28:48          1 
             0      batch 69326.batch    16:28:48          1 
       496.70K xlinpack_+ 69326.0        08:10:24          1 

This feature will also be available for the ESB nodes.
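
sacct reports ConsumedEnergy in joules and abbreviates large values with a suffix, as in 496.70K above. The following sketch converts such a value to plain joules. It assumes that K and M denote factors of 1000 and 1000000; please check the sacct man page of the installed Slurm version. The function name energy_to_joules is hypothetical.

```shell
#!/bin/bash
# Convert a sacct ConsumedEnergy value such as "496.70K" to joules.
# Assumption: K and M stand for factors of 1000 and 1000000.
energy_to_joules() {
    echo "$1" | awk '{
        v = $1; f = 1
        if (v ~ /K$/) f = 1000
        else if (v ~ /M$/) f = 1000000
        sub(/[KM]$/, "", v)      # strip the suffix, keep the number
        printf "%.0f\n", v * f
    }'
}

energy_to_joules 496.70K
```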

Advanced topics

For further details on the batch system and psslurm, which is used on the DEEP system as well as on the JSC production systems, please refer to the in-depth description of the batch system on Jureca. In addition to extended examples for the allocation of nodes, you can find information on job steps, dependency chains and multithreading there. If you are interested in pinning threads and tasks to certain CPUs or cores, please also take a look at the Processor Affinity sections of the Jureca documentation. Most of the information provided there also applies to the DEEP system.

FAQ

Is there a cheat sheet for all main Slurm commands?

Yes, it is available here.

Why's my job not running?

You can check the state of your job with

scontrol show job <job id>

In the output, look for the Reason field.
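
The Reason field can also be extracted automatically. This is a sketch under the assumption that scontrol prints the field as Reason=<value> on the JobState line; the function name job_reason and the sample line are illustrative.

```shell
#!/bin/bash
# Print the value of the Reason field from `scontrol show job` output (stdin).
job_reason() {
    grep -o 'Reason=[^ ]*' | head -n 1 | cut -d= -f2
}

# On the system (needs Slurm):
#   scontrol show job <job id> | job_reason
# Offline demonstration with a hypothetical output line:
sample='   JobState=PENDING Reason=Resources Dependency=(null)'
printf '%s\n' "$sample" | job_reason
```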

You can check the existing reservations using

scontrol show res

How can I check which jobs are running in the machine?

Please use the squeue command (add the "-u $USER" option to list only the jobs belonging to your user id). A graphical overview can be displayed using the slurmtop command.

How do I chain jobs with dependencies?

Please consult the sbatch/srun man page, especially the

-d, --dependency=<dependency_list>

entry.

Also, jobs can be chained after they have been submitted using the scontrol command by updating their Dependency field.
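
A dependency list has the form <type>:<job id>[:<job id>...], for example afterok:69264. The helper below builds such a string from a list of job ids; the function name dependency_spec and the script name next_step.sh in the comments are hypothetical.

```shell
#!/bin/bash
# Build a dependency specification such as "afterok:69264:69265".
dependency_spec() {
    local type=$1
    shift
    local spec=$type
    local id
    for id in "$@"; do
        spec="$spec:$id"
    done
    echo "$spec"
}

# On the system (needs Slurm):
#   sbatch --dependency="$(dependency_spec afterok 69264)" next_step.sh
#   scontrol update JobId=<job id> Dependency=afterok:<other job id>
dependency_spec afterok 69264 69265
```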

How can I check the status of partitions and nodes?

The main command to use is sinfo. By default, when called alone, sinfo will list the available partitions and the number of nodes in each partition in a given status. For example:

[deamicis1@deepv hybridhello]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
sdv             up   20:00:00     11   idle deeper-sdv[06-16]
knl             up   20:00:00      1  drain knl01
knl             up   20:00:00      3   idle knl[04-06]
knl256          up   20:00:00      1  drain knl01
knl256          up   20:00:00      1   idle knl05
knl272          up   20:00:00      2   idle knl[04,06]
snc4            up   20:00:00      1   idle knl05
extoll          up   20:00:00     11   idle deeper-sdv[06-16]
ml-gpu          up   20:00:00      3   idle ml-gpu[01-03]
dp-cn           up   20:00:00      1  drain dp-cn33
dp-cn           up   20:00:00      5   resv dp-cn[09-10,25,49-50]
dp-cn           up   20:00:00     44   idle dp-cn[01-08,11-24,26-32,34-48]
dp-dam          up   20:00:00      1 drain* dp-dam08
dp-dam          up   20:00:00      2  drain dp-dam[03,07]
dp-dam          up   20:00:00      3   resv dp-dam[05,09-10]
dp-dam          up   20:00:00      2  alloc dp-dam[01,04]
dp-dam          up   20:00:00      8   idle dp-dam[02,06,11-16]
dp-dam-ext      up   20:00:00      2   resv dp-dam[09-10]
dp-dam-ext      up   20:00:00      6   idle dp-dam[11-16]
dp-esb          up   20:00:00     51 drain* dp-esb[11,26-75]
dp-esb          up   20:00:00      2  drain dp-esb[08,23]
dp-esb          up   20:00:00      2  alloc dp-esb[09-10]
dp-esb          up   20:00:00     20   idle dp-esb[01-07,12-22,24-25]
dp-sdv-esb      up   20:00:00      2   resv dp-sdv-esb[01-02]
psgw-cluster    up   20:00:00      1   idle nfgw01
psgw-booster    up   20:00:00      1   idle nfgw02
debug           up   20:00:00      1 drain* dp-dam08
debug           up   20:00:00      4  drain dp-cn33,dp-dam[03,07],knl01
debug           up   20:00:00     10   resv dp-cn[09-10,25,49-50],dp-dam[05,09-10],dp-sdv-esb[01-02]
debug           up   20:00:00      2  alloc dp-dam[01,04]
debug           up   20:00:00     69   idle deeper-sdv[06-16],dp-cn[01-08,11-24,26-32,34-48],dp-dam[02,06,11-16],knl[04-06],ml-gpu[01-03]

Please refer to the man page for sinfo for more information.

One useful command to check why nodes are unavailable (down, drained or failing) is:

sinfo -R -h -o "%n %12U %19H %6t %E" | sort -u
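
The default sinfo output can also be summarised in a script, for example to count the nodes of one partition in a given state. The sketch below assumes the default column layout shown above (partition in column 1, node count in column 4, state in column 5); the function name count_nodes is hypothetical and the sample lines are taken from the listing above.

```shell
#!/bin/bash
# Sum the NODES column of default sinfo output (stdin) for one partition and state.
count_nodes() {
    awk -v p="$1" -v s="$2" '$1 == p && $5 == s {n += $4} END {print n + 0}'
}

# On the system (needs Slurm):
#   sinfo -h | count_nodes dp-cn idle
# Offline demonstration with lines from the listing above:
sample='dp-cn           up   20:00:00      1  drain dp-cn33
dp-cn           up   20:00:00      5   resv dp-cn[09-10,25,49-50]
dp-cn           up   20:00:00     44   idle dp-cn[01-08,11-24,26-32,34-48]'
printf '%s\n' "$sample" | count_nodes dp-cn idle
```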

Can I join stderr and stdout like it was done with -j oe in Torque?

Not directly. In your batch script, redirect stdout and stderr to the same file:

...
#SBATCH -o /point/to/the/common/logfile-%j.log
#SBATCH -e /point/to/the/common/logfile-%j.log
...

(The %j will be replaced by the job id in the output file name.) N.B. It might be more efficient to redirect the output of your script's commands to a dedicated file.
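
Redirecting the output of individual commands inside the script could look like the following sketch. It uses plain shell redirection and runs without Slurm; the function name run_logged and the temporary log file are assumptions made for the demonstration.

```shell
#!/bin/bash
# Run a command with stdout and stderr both appended to one log file.
run_logged() {
    local logfile=$1
    shift
    "$@" >>"$logfile" 2>&1
}

# Demonstration: both streams of the command end up in the same file.
logfile=$(mktemp)
run_logged "$logfile" sh -c 'echo "to stdout"; echo "to stderr" >&2'
cat "$logfile"
rm -f "$logfile"
```

In a batch script, the same pattern would be applied to the srun line, e.g. run_logged my_run.log srun ./MPI_HelloWorld, where my_run.log is a placeholder name.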