= Information about the batch system (SLURM) =

For the old Torque documentation, please see [wiki:"Public/User_Guide/Batch_system_torque" the old documentation].

Please also consult {{{/etc/slurm/README}}}. The official Slurm documentation can be found [https://slurm.schedmd.com/ here].

== Overview ==

Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are `srun` and `sbatch`. The `srun` command can be used to spawn processes ('''please do not use mpiexec'''), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform).

== Remark about modules ==

By default, Slurm passes the environment of your job submission session directly to the execution environment. Please be aware of this when running jobs with `srun` or when submitting scripts with `sbatch`. This behavior can be controlled via the `--export` option; please refer to the [https://slurm.schedmd.com/ Slurm documentation] for more information. In particular, when submitting job scripts, it is recommended to load the necessary modules within the script and to submit the script from a clean environment.

== An introductory example ==

Suppose you have an MPI executable named {{{hello_cluster}}}. There are three ways to start the binary.

=== From a shell on a node ===

First, start a shell on a node. Say you would like to run your MPI task on 4 machines with 2 tasks per machine:
{{{
niessen@deepl:src/mpi > srun --partition=sdv -N 4 -n 8 --pty /bin/bash -i
niessen@deeper-sdv04:/direct/homec/zdvex/niessen/src/mpi >
}}}
Your environment is transported to the remote shell; no {{{.profile}}}, {{{.bashrc}}}, etc. are sourced (in particular, the module defaults from {{{/etc/profile.d/modules.sh}}} are not set).

Once you are on the compute node, start your application using {{{srun}}}. Note that the number of tasks used is the same as specified in the initial {{{srun}}} command above (4 nodes with 2 tasks each):
{{{
niessen@deeper-sdv04:/direct/homec/zdvex/niessen/src/mpi > srun ./hello_cluster
srun: cluster configuration lacks support for cpu binding
Hello world from process 6 of 8 on deeper-sdv07
Hello world from process 7 of 8 on deeper-sdv07
Hello world from process 3 of 8 on deeper-sdv05
Hello world from process 4 of 8 on deeper-sdv06
Hello world from process 0 of 8 on deeper-sdv04
Hello world from process 2 of 8 on deeper-sdv05
Hello world from process 5 of 8 on deeper-sdv06
Hello world from process 1 of 8 on deeper-sdv04
}}}
You can ignore the warning about the cpu binding: !ParaStation will pin your processes.
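If you want the interactive shell to start from a clean environment instead (see the remark about modules above), you can combine this with the {{{--export}}} option. A minimal sketch, assuming the module initialization file {{{/etc/profile.d/modules.sh}}} mentioned above; the module name is only a placeholder:
{{{
# Request an interactive shell on one SDV node without exporting the
# current environment (--export=NONE is a standard Slurm option):
srun --partition=sdv -N 1 -n 1 --export=NONE --pty /bin/bash -i

# On the node, initialize the module system and load what you need
# ("some_module" is a placeholder for a module available on this system):
source /etc/profile.d/modules.sh
module load some_module
}}}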
=== Running directly from the front ends ===

You can run the application directly from the frontend, bypassing the interactive shell:
{{{
niessen@deepl:src/mpi > srun --partition=sdv -N 4 -n 8 ./hello_cluster
Hello world from process 4 of 8 on deeper-sdv06
Hello world from process 6 of 8 on deeper-sdv07
Hello world from process 3 of 8 on deeper-sdv05
Hello world from process 0 of 8 on deeper-sdv04
Hello world from process 2 of 8 on deeper-sdv05
Hello world from process 5 of 8 on deeper-sdv06
Hello world from process 7 of 8 on deeper-sdv07
Hello world from process 1 of 8 on deeper-sdv04
}}}
In this case, it can be useful to create an allocation which you can reuse for several runs of your job:
{{{
niessen@deepl:src/mpi > salloc --partition=sdv -N 4 -n 8
salloc: Granted job allocation 955
niessen@deepl:~/src/mpi > srun ./hello_cluster
Hello world from process 3 of 8 on deeper-sdv05
Hello world from process 1 of 8 on deeper-sdv04
Hello world from process 7 of 8 on deeper-sdv07
Hello world from process 5 of 8 on deeper-sdv06
Hello world from process 2 of 8 on deeper-sdv05
Hello world from process 0 of 8 on deeper-sdv04
Hello world from process 6 of 8 on deeper-sdv07
Hello world from process 4 of 8 on deeper-sdv06
niessen@deepl:~/src/mpi > # several more runs
...
niessen@deepl:~/src/mpi > exit
exit
salloc: Relinquishing job allocation 955
}}}

=== Batch script ===

Given the following script {{{hello_cluster.sh}}} (it has to be executable):
{{{
#!/bin/bash

#SBATCH --partition=sdv
#SBATCH -N 4
#SBATCH -n 8
#SBATCH -o /homec/zdvex/niessen/src/mpi/hello_cluster-%j.log
#SBATCH -e /homec/zdvex/niessen/src/mpi/hello_cluster-%j.err
#SBATCH --time=00:10:00

srun ./hello_cluster
}}}
This script requests 4 nodes with 8 tasks in total, specifies the stdout and stderr files, and asks for 10 minutes of walltime. Submit it with:
{{{
niessen@deepl:src/mpi > sbatch ./hello_cluster.sh
Submitted batch job 956
}}}
Check what it is doing:
{{{
niessen@deepl:src/mpi > squeue
JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
  956       sdv hello_cl  niessen  R  0:00      4 deeper-sdv[04-07]
}}}
Check the result:
{{{
niessen@deepl:src/mpi > cat hello_cluster-956.log
Hello world from process 5 of 8 on deeper-sdv06
Hello world from process 1 of 8 on deeper-sdv04
Hello world from process 7 of 8 on deeper-sdv07
Hello world from process 3 of 8 on deeper-sdv05
Hello world from process 0 of 8 on deeper-sdv04
Hello world from process 2 of 8 on deeper-sdv05
Hello world from process 4 of 8 on deeper-sdv06
Hello world from process 6 of 8 on deeper-sdv07
}}}

== Available Partitions ==

Please note that there is no default partition configured. In order to run a job, you have to specify one of the following partitions, using the {{{--partition=...}}} switch:

 * sdv: the DEEP-ER SDV nodes
 * knl: the DEEP-ER KNL nodes (all of them, regardless of CPU type and configuration)
 * knl256: the KNL nodes with 256 hardware threads
 * knl272: the KNL nodes with 272 hardware threads
 * snc4: the KNL nodes configured in SNC-4 mode
 * knm: the DEEP-ER KNM nodes
 * ml-gpu: the machine learning nodes, equipped with 4 Nvidia Tesla V100 GPUs each
 * extoll: the SDV and KNL nodes connected to the Extoll fabric

You can list the state of the partitions at any time with the {{{sinfo}}} command. The properties of a partition can be seen using
{{{
scontrol show partition
}}}

== Interactive Jobs ==

See the introductory example above: interactive jobs can be run via {{{srun --pty}}} or within an allocation created with {{{salloc}}}.

== Batch Jobs ==

See the batch script example above: batch jobs are submitted with {{{sbatch}}}.

== FAQ ==

=== Why is my job not running? ===

You can check the state of your job with
{{{
scontrol show job
}}}
In the output, look for the {{{Reason}}} field.
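Alternatively, {{{squeue}}} can report the reason directly. A minimal sketch ({{{<jobid>}}} is a placeholder; {{{%r}}} is the reason field of the {{{squeue}}} format string):
{{{
# Print job id, partition, state and the pending reason for a single job:
squeue -j <jobid> -o "%.10i %.9P %.8T %r"
}}}
Typical values are {{{Resources}}} (waiting for free nodes) and {{{Priority}}} (other jobs are ahead in the queue).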
You can check the existing reservations using
{{{
scontrol show res
}}}

=== How can I check which jobs are running on the machine? ===

Please use the {{{squeue}}} command.

=== How do I do chain jobs with dependencies? ===

Please consult the {{{sbatch}}}/{{{srun}}} man page, especially the
{{{
-d, --dependency=
}}}
entry. A short sketch is given at the end of this page.

=== How can I get a list of broken nodes? ===

The command to use is
{{{
sinfo -Rl -h -o "%n %12U %19H %6t %E" | sort -u
}}}
See also the translation table below.

=== Can I join stderr and stdout like it was done with {{{-j oe}}} in Torque? ===

Not directly. In your batch script, redirect stdout and stderr to the same file:
{{{
...
#SBATCH -o /point/to/the/common/logfile-%j.log
#SBATCH -e /point/to/the/common/logfile-%j.log
...
}}}
(The {{{%j}}} is replaced by the job id in the file name.) N.B. It might be more efficient to redirect the output of your script's commands to a dedicated file.

=== What is the equivalent of {{{qsub -l nodes=x:ppn=y:cluster+n_b:ppn=p_b:booster}}}? ===

As of version 17.11 of Slurm, heterogeneous jobs are supported. For example, the user can run:
{{{
srun --partition=sdv -N 1 -n 1 hostname : --partition=knl -N 1 -n 1 hostname
deeper-sdv01
knl05
}}}
In order to submit a heterogeneous job, the user needs to set up the batch script similar to the following:
{{{#!sh
#!/bin/bash

#SBATCH --job-name=imb_execute_1
#SBATCH --account=deep
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH --output=job.out
#SBATCH --error=job.err
#SBATCH --time=00:02:00

#SBATCH --partition=sdv
#SBATCH --constraint=
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=1

#SBATCH packjob

#SBATCH --partition=knl
#SBATCH --constraint=
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=1

srun ./app_sdv : ./app_knl
}}}
Here the `packjob` keyword allows you to define Slurm parameters for each sub-job of the heterogeneous job.

If you need to load modules before launching the application, it is suggested to create wrapper scripts around the applications and to submit those scripts with {{{srun}}}, like this:
{{{#!sh
...
srun ./script_sdv.sh : ./script_knl.sh
}}}
where each script should contain:
{{{#!sh
#!/bin/bash

module load ...

./app_sdv
}}}
This way it is also possible to load different modules on the different partitions used in the heterogeneous job.

== PBS/Slurm dictionary ==

|| '''PBS command''' || '''closest Slurm equivalent''' ||
|| qsub || sbatch ||
|| qsub -I || salloc + srun --pty bash -i ||
|| qsub into an existing reservation || ... --reservation=<name> ... ||
|| pbsnodes || scontrol show node ||
|| pbsnodes (-ln) || sinfo (-R) or sinfo -Rl -h -o "%n %12U %19H %6t %E" | sort -u ||
|| pbsnodes -c -N n || scontrol update NodeName=<node> State=RESUME ||
|| pbsnodes -o || scontrol update NodeName=<node> State=DRAIN reason="some comment here" ||
|| pbstop || smap ||
|| qstat || squeue ||
|| checkjob || scontrol show job <jobid> ||
|| checkjob -v || scontrol show -d job <jobid> ||
|| showres || scontrol show res ||
|| setres || scontrol create reservation [ReservationName=<name>] user=partec Nodes=j3c![053-056] StartTime=now Duration=Unlimited Flags=IGNORE_JOBS ||
|| setres -u ALL || scontrol create reservation ReservationName=<name> user=<user> Nodes=ALL StartTime=now Duration=Unlimited Flags=maint,ignore_jobs ||
|| releaseres || scontrol delete ReservationName=<name> ||
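As a complement to the FAQ entry on chain jobs above, here is a minimal sketch of a dependency chain. This is an illustration only: {{{step1.sh}}} and {{{step2.sh}}} are placeholder script names.
{{{#!sh
#!/bin/bash

# Submit the first job and capture its job id (with --parsable, sbatch
# prints only the id, plus the cluster name on multi-cluster setups):
JOBID=$(sbatch --parsable step1.sh)

# Submit a second job that starts only after the first one has
# completed successfully (afterok = exit code 0):
sbatch --dependency=afterok:${JOBID} step2.sh
}}}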