Changes between Version 15 and Version 16 of Public/User_Guide/Batch_system


Timestamp:
Feb 6, 2019, 6:29:40 PM (5 years ago)
Author:
Jacopo de Amicis
Comment:

Updated info on modules, available partitions and heterogeneous jobs.

  • Public/User_Guide/Batch_system

    v15 v16  
    55Please refer to /etc/slurm/README.
    66
     7The documentation of Slurm can be found [https://slurm.schedmd.com/ here].
     8
    79== Overview ==
    810
    9 Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are {{{srun}}} and {{{sbatch}}}. The {{{srun}}} command can be used to spawn processes ('''please do not use mpiexec'''), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform).
    10 
    11 == !!!OUTDATED!!! Remark about modules ==
    12 
    13 Slurm passes the environment from your job submission session directly to the execution environment. The setup as used with Torque therefore doesn't work anymore. Please use
    14 
    15 {{{
    16 # workaround for missing module file
    17 . /etc/profile.d/modules.sh
    18 
    19 module purge
    20 module load  intel/16.3 parastation/intel1603-e10-5.1.9-1_11_gc11866c_e10 extoll
    21 }}}
    22 
    23 instead.
     11Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are `srun` and `sbatch`. The `srun` command can be used to spawn processes ('''please do not use mpiexec'''), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform).
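For example, an interactive shell on a compute node can be requested with the `--pty` option of `srun`. The following is a minimal sketch; the partition name and the time limit are placeholders to be adapted to your needs:

{{{
# request one node interactively and open a shell on it
srun --partition=sdv -N 1 -n 1 --time=00:30:00 --pty /bin/bash -i
}}}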
     12
     13== Remark about modules ==
     14
     15By default, Slurm passes the environment from your job submission session directly to the execution environment. Please be aware of this when running jobs with `srun` or when submitting scripts with `sbatch`. This behavior can be controlled via the `--export` option; please refer to the [https://slurm.schedmd.com/ Slurm documentation] for more information.
     16
     17In particular, when submitting job scripts, it is recommended to load the necessary modules within the script and submit the script from a clean environment.
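A minimal sketch of this approach could look like the following (the module names and the job parameters here are only placeholders, not a recommendation for specific versions):

{{{#!sh
#!/bin/bash
#SBATCH --partition=sdv
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# load the required modules inside the script instead of relying on the submission environment
module purge
module load intel parastation

srun ./hello_mpi
}}}

Submitting the script from a clean environment, e.g. with `sbatch --export=NONE jobscript.sh`, additionally prevents the environment of the submission session from being exported to the job.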
     18
    2419
    2520== An introductory example ==
    2621
    2722Suppose you have an MPI executable named {{{hello_mpi}}}. There are three ways to start the binary.
     23
    2824
    2925=== From a shell on a node ===
     
    140136Please note that there is no default partition configured. In order to run a job, you have to specify one of the following partitions, using the {{{--partition=...}}} switch:
    141137
    142  * cluster: decommissioned ~~The old DEEP cluster nodes {{{deep[001-128]}}}~~
    143138 * sdv: The DEEP-ER sdv nodes
    144139 * knl: The DEEP-ER knl nodes (all of them, regardless of cpu and configuration)
     
    147142 * snc4: the knls configured in SNC-4 mode
    148143 * knm: The DEEP-ER knm nodes
     144 * ml-gpu: the machine learning nodes equipped with 4 Nvidia Tesla V100 GPUs each
    149145 * extoll: the sdv and knl nodes in the extoll fabric
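For example, a simple job on two nodes of the sdv partition could be launched like this (a minimal sketch; adapt the node count and the command to your needs):

{{{
srun --partition=sdv -N 2 -n 2 hostname
}}}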
    150146
     
    200196
    201197See also the translation table below.
    202 
    203 === Can I still use the old DEEP Booster nodes? ===
    204 
    205 Yes, please use
    206 
    207 {{{
    208 qsub -q booster ...
    209 }}}
    210 
    211 You cannot run a common job on both the old DEEP cluster and DEEP booster.
    212198
    213199=== Can I join stderr and stdout like it was done with {{{-joe}}} in Torque? ===
     
    226212=== What's the equivalent of {{{qsub -l nodes=x:ppn=y:cluster+n_b:ppn=p_b:booster}}}? ===
    227213
    228 Support for mixing nodes from different partitions will appear in version 17.11 of Slurm. As a workaround, you can explicitly request nodes:
    229 
    230 {{{
    231 srun/sbatch --partition=extoll -w cluster1,...,clusterx,booster1,...,boostern_b -n ...
    232 }}}
    233 
    234 With this, the same number of processes will be launched on all allocated nodes. In the following example, the number of processes per node can differ between the partitions: one node of the sdv partition and one node of the knl partition are allocated. The -m plane=X option sets the number of processes on the first group of nodes (in this case 4, so that 1 process is left for the knl node, because -n is set to 5):
    235 
    236 {{{
    237 -bash-4.1$ srun --partition=extoll -N2 -n 5 -C '[sdv*1&knl*1]' -m plane=4 hostname
    238 deeper-sdv16
    239 deeper-sdv16
    240 deeper-sdv16
    241 deeper-sdv16
    242 knl01
    243 }}}
    244 
    245 To change the node on which your job starts (e.g. to start on one partition and then spawn the rest of the processes later from within your code), please use the -r option of srun.
    246 
    247 {{{
    248 -bash-4.1$ salloc --partition=extoll -N2 -n 5 -C '[sdv*1&knl*1]' -m plane=4
    249 salloc: Granted job allocation 5581
    250 -bash-4.1$ srun -n 1 -r 1 hostname
    251 knl02
    252 }}}
     214As of version 17.11 of Slurm, heterogeneous jobs are supported. For example, the user can run:
     215
     216{{{
     217srun --partition=sdv -N 1 -n 1 hostname : --partition=knl -N 1 -n 1 hostname
     218deeper-sdv01
     219knl05
     220}}}
     221
     222In order to submit a heterogeneous job, the user needs to set up the batch script similarly to the following:
     223
     224{{{#!sh
     225#!/bin/bash
     226
     227#SBATCH --job-name=imb_execute_1
     228#SBATCH --account=deep
     229#SBATCH --mail-user=
     230#SBATCH --mail-type=ALL
     231#SBATCH --output=job.out
     232#SBATCH --error=job.err
     233#SBATCH --time=00:02:00
     234
     235#SBATCH --partition=sdv
     236#SBATCH --constraint=
     237#SBATCH --nodes=1
     238#SBATCH --ntasks=12
     239#SBATCH --ntasks-per-node=12
     240#SBATCH --cpus-per-task=1
     241
     242#SBATCH packjob
     243
     244#SBATCH --partition=knl
     245#SBATCH --constraint=
     246#SBATCH --nodes=1
     247#SBATCH --ntasks=12
     248#SBATCH --ntasks-per-node=12
     249#SBATCH --cpus-per-task=1
     250
     251srun ./app_sdv : ./app_knl
     252}}}
     253
     254Here the `packjob` keyword allows Slurm parameters to be defined for each sub-job of the heterogeneous job.
     255
     256If you need to load modules before launching the applications, it is suggested to create wrapper scripts around the applications and to submit these scripts with srun, like this:
     257
     258{{{#!sh
     259...
     260srun ./script_sdv.sh : ./script_knl.sh
     261}}}
     262
     263where each script should contain:
     264
     265{{{#!sh
     266#!/bin/bash
     267
     268module load ...
     269./app_sdv
     270}}}
     271
     272In this way, it is also possible to load different modules on the different partitions used in the heterogeneous job.
     273
    253274
    254275== pbs/slurm dictionary ==