Changes between Version 24 and Version 25 of Public/User_Guide/Batch_system


Timestamp: Jan 22, 2020, 3:25:48 PM
Author: Jochen Kreutz
Comment: 2020-01-22 JK: added example for heterogeneous job script using a gateway; small corrections

  • Public/User_Guide/Batch_system

    v24 v25  
    1414
    1515Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are `srun` and `sbatch`. The `srun` command can be used to spawn processes ('''please do not use mpiexec'''), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform).
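For instance, a minimal sketch of an interactive session on a compute node could look like this (the partition name is only an example, see the list of available partitions below):

{{{
# Request one task on one node and attach a pseudo-terminal to it
srun --partition=dp-cn -N 1 -n 1 --pty /bin/bash -i
}}}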
     16
     17== Available Partitions ==
     18
     19Please note that there is no default partition configured. In order to run a job, you have to specify one of the following partitions, using the {{{--partition=...}}} switch:
     20
     21 * dp-cn: The DEEP-EST cluster nodes
     22 * dp-dam: The DEEP-EST DAM nodes
     23 * sdv: The DEEP-ER sdv nodes
     24 * knl: The DEEP-ER knl nodes (all of them, regardless of cpu and configuration)
     25 * knl256: the 256-core knls
     26 * knl272: the 272-core knls
     27 * snc4: the knls configured in SNC-4 mode
     28{{{#!comment KNMs removed
     29 * knm: The DEEP-ER knm nodes
     30}}}
     31 * ml-gpu: the machine learning nodes equipped with 4 Nvidia Tesla V100 GPUs each
     32 * extoll: the sdv nodes in the extoll fabric ('''KNL nodes are no longer connected via Extoll!''')
     33 * dam: prototype dam nodes, two of which are equipped with Intel Arria 10G FPGAs.
     34
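For example, a batch submission to the DAM nodes could look like this (a sketch; `job.sh` stands for your own job script):

{{{
sbatch --partition=dp-dam job.sh
}}}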
     35At any time, you can list the state of the partitions with the {{{sinfo}}} command. The properties of a partition can be inspected using
     36
     37{{{
     38scontrol show partition <partition>
     39}}}
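For instance, to restrict the output of {{{sinfo}}} to a single partition (partition name as an example):

{{{
sinfo -p dp-cn
}}}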
    1640
    1741== Remark about environment ==
     
    151175The user can also request several sets of nodes in a heterogeneous allocation using `salloc`. For example:
    152176{{{
    153 salloc --partiton=dp-cn -N 2 : --partition=dp-dam -N 4
     177salloc --partition=dp-cn -N 2 : --partition=dp-dam -N 4
    154178}}}
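Once the allocation has been granted, a job step spanning both node groups can be launched with the colon syntax of `srun` (a sketch; the two executables are placeholders):

{{{
# One heterogeneous job step: prog_cn runs on the dp-cn nodes, prog_dam on the dp-dam nodes
srun ./prog_cn : ./prog_dam
}}}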
    155179
     
    185209}}}
    186210
    187 Here the `packjob` keyword allows to define Slurm parameter for each sub-job of the heterogeneous job. Some Slurm options can be defined once at the beginning of the script and are automatically propagated to all sub-jobs of the heterogeneous job, while some others (i.e. `--nodes` or `--ntasks`) must be defined for each sub-job. You can find a list of the propagated options on the [https://slurm.schedmd.com/heterogeneous_jobs.html#submitting Slurm documentation].
      211Here the `packjob` keyword allows defining Slurm parameters for each sub-job of the heterogeneous job. Some Slurm options can be defined once at the beginning of the script and are automatically propagated to all sub-jobs of the heterogeneous job, while others (e.g. `--nodes` or `--ntasks`) must be defined for each sub-job. You can find a list of the propagated options in the [https://slurm.schedmd.com/heterogeneous_jobs.html#submitting Slurm documentation].
    188212
    189213When submitting a heterogeneous job with this colon notation using ParaStationMPI, a unique `MPI_COMM_WORLD` is created, spanning across the two partitions. If this is not desired, one can use the `--pack-group` key to submit independent job steps to the different node-groups of a heterogeneous allocation:
     
    200224
    201225In order to establish MPI communication across modules using different interconnect technologies, some special Gateway nodes must be used. On the DEEP-EST system, MPI communication across gateways is needed only between Infiniband and Extoll interconnects.
    202 
    203 **Attention:** Only ParaStation MPI supports MPI communication across gateway nodes.
    204 
    205 **Attention:** During the first part of 2020, only the DAM nodes will have Extoll interconnect, while the CM and the ESB nodes will be connected via Infiniband. This will change later during the course of the project (expected Summer 2020), when the ESB will be equipped with Extoll connectivity (Infiniband, which will be removed from the ESB and left only for the CM).
     226**Attention:** Only !ParaStation MPI supports MPI communication across gateway nodes.
     227
      228This is an example job script for setting up an Intel MPI benchmark between a Cluster node and a DAM node using an IB <-> Extoll gateway for MPI communication:
     229
     230{{{
     231#!/bin/bash
     232
     233# Script to launch IMB PingPong between DAM-CN using 1 Gateway
     234# Use the gateway allocation provided by SLURM
      235# Use the packjob feature to launch the CM and DAM executables separately
     236
     237#SBATCH --job-name=imb
     238##SBATCH --account=cdeep
     239#SBATCH --output=IMB-%j.out
     240#SBATCH --error=IMB-%j.err
     241#SBATCH --time=00:05:00
     242#SBATCH --gw_num=1
     243#SBATCH --gw_binary=/opt/parastation/bin/psgwd.extoll
     244#SBATCH --gw_psgwd_per_node=1
     245
     246#SBATCH --partition=dp-cn
     247#SBATCH --nodes=1
     248#SBATCH --ntasks=1
     249
     250#SBATCH packjob
     251
     252#SBATCH --partition=dp-dam-ext
     253#SBATCH --nodes=1
     254#SBATCH --ntasks=1
     255
     256echo "DEBUG: SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
     257echo "DEBUG: SLURM_NNODES=$SLURM_NNODES"
     258echo "DEBUG: SLURM_TASKS_PER_NODE=$SLURM_TASKS_PER_NODE"
     259
     260# Execute
     261srun hostname : hostname
     262srun module_dp-cn.sh : module_dp-dam-ext.sh
     263}}}
     264
      265It uses two execution scripts for loading the correct environment and starting the IMB on the CM and the DAM node (this approach can also be used to start different programs, e.g. a master and worker use case). The execution scripts could look like this:
     266
     267{{{
     268#!/bin/bash
     269# Script for the CN using InfiniBand
     270
     271module load Intel ParaStationMPI pscom
     272
     273# Execution
     274EXEC=$PWD/mpi-benchmarks/IMB-MPI1
     275LD_LIBRARY_PATH=/opt/parastation/lib64:$LD_LIBRARY_PATH PSP_OPENIB_HCA=mlx5_0 ${EXEC} PingPong
     276}}}
     277
     278{{{
     279#!/bin/bash
     280# Script for the DAM using Extoll
     281
     282module load Intel ParaStationMPI pscom extoll
     283
     284# Execution
     285EXEC=$PWD/mpi-benchmarks/IMB-MPI1
     286LD_LIBRARY_PATH=/opt/parastation/lib64:/opt/extoll/x86_64/lib:$LD_LIBRARY_PATH PSP_DEBUG=3 PSP_EXTOLL=1 PSP_VELO=1 PSP_RENDEZVOUS_VELO=2048 PSP_OPENIB=0 ${EXEC} PingPong
     287}}}
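Assuming the job script above is saved e.g. as `imb_gateway.sbatch` (the file name is just an example) and the two execution scripts are available on the compute nodes, the benchmark is submitted as usual:

{{{
sbatch imb_gateway.sbatch
}}}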
     288
     289**Attention:** During the first part of 2020, only the DAM nodes will have Extoll interconnect, while the CM and the ESB nodes will be connected via Infiniband. This will change later during the course of the project (expected Summer 2020), when the ESB will be equipped with Extoll connectivity (Infiniband will be removed from the ESB and left only for the CM).
    206290
    207291A general description of how the user can request and use gateway nodes is provided at [https://apps.fz-juelich.de/jsc/hps/jureca/modular-jobs.html#mpi-traffic-across-modules this section] of the JURECA documentation.
     
    234318}}}
    235319
    236 == Available Partitions ==
    237 
    238 Please note that there is no default partition configured. In order to run a job, you have to specify one of the following partitions, using the {{{--partition=...}}} switch:
    239 
    240  * dp-cn: The DEEP-EST cluster nodes
    241  * dp-dam: The DEEP-EST DAM nodes
    242  * sdv: The DEEP-ER sdv nodes
    243  * knl: The DEEP-ER knl nodes (all of them, regardless of cpu and configuration)
    244  * knl256: the 256-core knls
    245  * knl272: the 272-core knls
    246  * snc4: the knls configured in SNC-4 mode
    247 {{{#!comment KNMs removed
    248  * knm: The DEEP-ER knm nodes
    249 }}}
    250  * ml-gpu: the machine learning nodes equipped with 4 Nvidia Tesla V100 GPUs each
    251  * extoll: the sdv nodes in the extoll fabric ('''KNL nodes not on Extoll connectivity anymore! ''')
    252  * dam: prototype dam nodes, two of which equipped with Intel Arria 10G FPGAs.
    253 
    254 Anytime, you can list the state of the partitions with the {{{sinfo}}} command. The properties of a partition can be seen using
    255 
    256 {{{
    257 scontrol show partition <partition>
    258 }}}
     320
    259321
    260322== Information on past jobs and accounting ==
     
    262324The `sacct` command can be used to query the Slurm database about a past job.
    263325
     326{{{
     327[kreutz1@deepv Temp]$ sacct -j 69268
     328       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
     329------------ ---------- ---------- ---------- ---------- ---------- --------
     33069268+0            bash      dp-cn deepest-a+         96  COMPLETED      0:0
     33169268+0.0    MPI_Hello+            deepest-a+          2  COMPLETED      0:0
     33269268+1            bash     dp-dam deepest-a+        384  COMPLETED      0:0
     333}}}
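If more details are needed, the output fields of `sacct` can be selected explicitly, for example:

{{{
sacct -j 69268 --format=JobID,JobName,Partition,Elapsed,State,ExitCode
}}}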
     334
    264335
    265336== FAQ ==
     
    287358=== How can I check which jobs are running in the machine? ===
    288359
    289 Please use the {{{squeue}}} command.
      360Please use the {{{squeue}}} command (use the "-u $USER" option to list only the jobs belonging to your user ID).
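For example, to list only your own jobs:

{{{
squeue -u $USER
}}}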
    290361
    291362=== How do I do chain jobs with dependencies? ===
     
    299370entry.
    300371
    301 Also, jobs chan be chained after they have been submitted using the `scontrol` command by updating their `Dependency` field.
     372Also, jobs can be chained after they have been submitted using the `scontrol` command by updating their `Dependency` field.
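For instance, a dependency can be given at submission time, or added to an already queued job (job IDs and the script name are placeholders):

{{{
# Start job2.sh only after job 12345 has finished successfully
sbatch --dependency=afterok:12345 job2.sh

# Add or change the dependency of an already submitted job (ID 12346)
scontrol update JobId=12346 Dependency=afterok:12345
}}}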
    302373
    303374=== How can I check the status of partitions and nodes? ===
     
    475546To exploit SMT, simply run a job in which the product of tasks and threads per task is higher than the number of physical cores available on a node. Please refer to the [https://apps.fz-juelich.de/jsc/hps/jureca/smt.html relevant page] of the JURECA documentation for more information on how to use SMT on the DEEP nodes.
    476547
    477 **Attention**: currently the only way of assign Slurm tasks to hardware threads belonging to the same hardware core is to use the `--cpu-bind` option of psslurm using `mask_cpu` to provide affinity masks for each task. For example:
     548**Attention**: currently the only way to assign Slurm tasks to hardware threads belonging to the same hardware core is to use the `--cpu-bind` option of psslurm using `mask_cpu` to provide affinity masks for each task. For example:
    478549{{{#!sh
    479550[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=2 OMP_PROC_BIND=close OMP_PLACES=threads srun -N 1 -n 2 -p dp-dam --cpu-bind=mask_cpu:$(printf '%x' "$((2#1000000000000000000000000000000000000000000000001))"),$(printf '%x' "$((2#10000000000000000000000000000000000000000000000010))") ./HybridHello | sort -k9n -k11n