Changes between Version 24 and Version 25 of Public/User_Guide/Batch_system


Timestamp: Jan 22, 2020, 3:25:48 PM
Author: Jochen Kreutz
Comment: 2020-01-22 JK: added example for heterogeneous job script using a gateway; small corrections

  • Public/User_Guide/Batch_system

    v24 v25  
    1414
    1515Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are `srun` and `sbatch`. The `srun` command can be used to spawn processes ('''please do not use mpiexec'''), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform).
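For instance, a minimal sketch of an interactive session on a compute node could look like this (the partition name is only an example, see the list of available partitions below):

{{{
# Request one task on one node and attach a pseudo-terminal to it
srun --partition=dp-cn -N 1 -n 1 --pty /bin/bash -i
}}}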
     16
     17== Available Partitions ==
     18
     19Please note that there is no default partition configured. In order to run a job, you have to specify one of the following partitions, using the {{{--partition=...}}} switch:
     20
     21 * dp-cn: The DEEP-EST cluster nodes
     22 * dp-dam: The DEEP-EST DAM nodes
     23 * sdv: The DEEP-ER sdv nodes
     24 * knl: The DEEP-ER knl nodes (all of them, regardless of cpu and configuration)
     25 * knl256: the 256-core knls
     26 * knl272: the 272-core knls
     27 * snc4: the knls configured in SNC-4 mode
     28{{{#!comment KNMs removed
     29 * knm: The DEEP-ER knm nodes
     30}}}
     31 * ml-gpu: the machine learning nodes equipped with 4 Nvidia Tesla V100 GPUs each
     32 * extoll: the sdv nodes in the extoll fabric ('''KNL nodes are no longer connected via Extoll!''')
     33 * dam: prototype dam nodes, two of which are equipped with Intel Arria 10G FPGAs.
     34
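For example, a batch submission to the DAM nodes could look like this (a sketch; `job.sh` stands for your own job script):

{{{
sbatch --partition=dp-dam job.sh
}}}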
     35At any time, you can list the state of the partitions with the {{{sinfo}}} command. The properties of a partition can be inspected using
     36
     37{{{
     38scontrol show partition <partition>
     39}}}
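For instance, to restrict the output of {{{sinfo}}} to a single partition (partition name as an example):

{{{
sinfo -p dp-cn
}}}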
    1640
    1741== Remark about environment ==
     
    151175The user can also request several sets of nodes in a heterogeneous allocation using `salloc`. For example:
    152176{{{
    153 salloc --partiton=dp-cn -N 2 : --partition=dp-dam -N 4
     177salloc --partition=dp-cn -N 2 : --partition=dp-dam -N 4
    154178}}}
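Once the allocation has been granted, a job step spanning both node groups can be launched with the colon syntax of `srun` (a sketch; the two executables are placeholders):

{{{
# One heterogeneous job step: prog_cn runs on the dp-cn nodes, prog_dam on the dp-dam nodes
srun ./prog_cn : ./prog_dam
}}}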
    155179
     
    185209}}}
    186210
    187 Here the `packjob` keyword allows to define Slurm parameter for each sub-job of the heterogeneous job. Some Slurm options can be defined once at the beginning of the script and are automatically propagated to all sub-jobs of the heterogeneous job, while some others (i.e. `--nodes` or `--ntasks`) must be defined for each sub-job. You can find a list of the propagated options on the [https://slurm.schedmd.com/heterogeneous_jobs.html#submitting Slurm documentation].
      211Here the `packjob` keyword allows defining Slurm parameters for each sub-job of the heterogeneous job. Some Slurm options can be defined once at the beginning of the script and are automatically propagated to all sub-jobs of the heterogeneous job, while others (e.g. `--nodes` or `--ntasks`) must be defined for each sub-job. You can find a list of the propagated options in the [https://slurm.schedmd.com/heterogeneous_jobs.html#submitting Slurm documentation].
    188212
    189213When submitting a heterogeneous job with this colon notation using ParaStationMPI, a unique `MPI_COMM_WORLD` is created, spanning across the two partitions. If this is not desired, one can use the `--pack-group` key to submit independent job steps to the different node-groups of a heterogeneous allocation:
     
    200224
    201225In order to establish MPI communication across modules using different interconnect technologies, some special Gateway nodes must be used. On the DEEP-EST system, MPI communication across gateways is needed only between Infiniband and Extoll interconnects.
    202 
    203 **Attention:** Only ParaStation MPI supports MPI communication across gateway nodes.
    204 
    205 **Attention:** During the first part of 2020, only the DAM nodes will have Extoll interconnect, while the CM and the ESB nodes will be connected via Infiniband. This will change later during the course of the project (expected Summer 2020), when the ESB will be equipped with Extoll connectivity (Infiniband, which will be removed from the ESB and left only for the CM).
     226**Attention:** Only !ParaStation MPI supports MPI communication across gateway nodes.
     227
      228This is an example job script for setting up an Intel MPI benchmark between a Cluster node and a DAM node using an IB <-> Extoll gateway for MPI communication:
     229
     230{{{
     231#!/bin/bash
     232
     233# Script to launch IMB PingPong between DAM-CN using 1 Gateway
     234# Use the gateway allocation provided by SLURM
      235# Use the packjob feature to launch the CM and DAM executables separately
     236
     237#SBATCH --job-name=imb
     238##SBATCH --account=cdeep
     239#SBATCH --output=IMB-%j.out
     240#SBATCH --error=IMB-%j.err
     241#SBATCH --time=00:05:00
     242#SBATCH --gw_num=1
     243#SBATCH --gw_binary=/opt/parastation/bin/psgwd.extoll
     244#SBATCH --gw_psgwd_per_node=1
     245
     246#SBATCH --partition=dp-cn
     247#SBATCH --nodes=1
     248#SBATCH --ntasks=1
     249
     250#SBATCH packjob
     251
     252#SBATCH --partition=dp-dam-ext
     253#SBATCH --nodes=1
     254#SBATCH --ntasks=1
     255
     256echo "DEBUG: SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
     257echo "DEBUG: SLURM_NNODES=$SLURM_NNODES"
     258echo "DEBUG: SLURM_TASKS_PER_NODE=$SLURM_TASKS_PER_NODE"
     259
     260# Execute
     261srun hostname : hostname
     262srun module_dp-cn.sh : module_dp-dam-ext.sh
     263}}}
     264
      265It uses two execution scripts for loading the correct environment and starting the IMB on the CM and the DAM node (this approach can also be used to start different programs, e.g. a master and worker use case). The execution scripts could look like this:
     266
     267{{{
     268#!/bin/bash
     269# Script for the CN using InfiniBand
     270
     271module load Intel ParaStationMPI pscom
     272
     273# Execution
     274EXEC=$PWD/mpi-benchmarks/IMB-MPI1
     275LD_LIBRARY_PATH=/opt/parastation/lib64:$LD_LIBRARY_PATH PSP_OPENIB_HCA=mlx5_0 ${EXEC} PingPong
     276}}}
     277
     278{{{
     279#!/bin/bash
     280# Script for the DAM using Extoll
     281
     282module load Intel ParaStationMPI pscom extoll
     283
     284# Execution
     285EXEC=$PWD/mpi-benchmarks/IMB-MPI1
     286LD_LIBRARY_PATH=/opt/parastation/lib64:/opt/extoll/x86_64/lib:$LD_LIBRARY_PATH PSP_DEBUG=3 PSP_EXTOLL=1 PSP_VELO=1 PSP_RENDEZVOUS_VELO=2048 PSP_OPENIB=0 ${EXEC} PingPong
     287}}}
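Assuming the job script above is saved e.g. as `imb_gateway.sbatch` (the file name is just an example) and the two execution scripts are available on the compute nodes, the benchmark is submitted as usual:

{{{
sbatch imb_gateway.sbatch
}}}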
     288
     289**Attention:** During the first part of 2020, only the DAM nodes will have Extoll interconnect, while the CM and the ESB nodes will be connected via Infiniband. This will change later during the course of the project (expected Summer 2020), when the ESB will be equipped with Extoll connectivity (Infiniband will be removed from the ESB and left only for the CM).
    206290
    207291A general description of how the user can request and use gateway nodes is provided at [https://apps.fz-juelich.de/jsc/hps/jureca/modular-jobs.html#mpi-traffic-across-modules this section] of the JURECA documentation.
     
    234318}}}
    235319
    236 == Available Partitions ==
    237 
    238 Please note that there is no default partition configured. In order to run a job, you have to specify one of the following partitions, using the {{{--partition=...}}} switch:
    239 
    240  * dp-cn: The DEEP-EST cluster nodes
    241  * dp-dam: The DEEP-EST DAM nodes
    242  * sdv: The DEEP-ER sdv nodes
    243  * knl: The DEEP-ER knl nodes (all of them, regardless of cpu and configuration)
    244  * knl256: the 256-core knls
    245  * knl272: the 272-core knls
    246  * snc4: the knls configured in SNC-4 mode
    247 {{{#!comment KNMs removed
    248  * knm: The DEEP-ER knm nodes
    249 }}}
    250  * ml-gpu: the machine learning nodes equipped with 4 Nvidia Tesla V100 GPUs each
    251  * extoll: the sdv nodes in the extoll fabric ('''KNL nodes not on Extoll connectivity anymore! ''')
    252  * dam: prototype dam nodes, two of which equipped with Intel Arria 10G FPGAs.
    253 
    254 Anytime, you can list the state of the partitions with the {{{sinfo}}} command. The properties of a partition can be seen using
    255 
    256 {{{
    257 scontrol show partition <partition>
    258 }}}
     320
    259321
    260322== Information on past jobs and accounting ==
     
    262324The `sacct` command can be used to query the Slurm database about a past job.
    263325
     326{{{
     327[kreutz1@deepv Temp]$ sacct -j 69268
     328       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
     329------------ ---------- ---------- ---------- ---------- ---------- --------
     33069268+0            bash      dp-cn deepest-a+         96  COMPLETED      0:0
     33169268+0.0    MPI_Hello+            deepest-a+          2  COMPLETED      0:0
     33269268+1            bash     dp-dam deepest-a+        384  COMPLETED      0:0
     333}}}
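If more details are needed, the output fields of `sacct` can be selected explicitly, for example:

{{{
sacct -j 69268 --format=JobID,JobName,Partition,Elapsed,State,ExitCode
}}}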
     334
    264335
    265336== FAQ ==
     
    287358=== How can I check which jobs are running in the machine? ===
    288359
    289 Please use the {{{squeue}}} command.
      360Please use the {{{squeue}}} command (use the "-u $USER" option to list only the jobs belonging to your user ID).
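For example, to list only your own jobs:

{{{
squeue -u $USER
}}}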
    290361
    291362=== How do I do chain jobs with dependencies? ===
     
    299370entry.
    300371
    301 Also, jobs chan be chained after they have been submitted using the `scontrol` command by updating their `Dependency` field.
     372Also, jobs can be chained after they have been submitted using the `scontrol` command by updating their `Dependency` field.
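For instance, a dependency can be given at submission time, or added to an already queued job (job IDs and the script name are placeholders):

{{{
# Start job2.sh only after job 12345 has finished successfully
sbatch --dependency=afterok:12345 job2.sh

# Add or change the dependency of an already submitted job (ID 12346)
scontrol update JobId=12346 Dependency=afterok:12345
}}}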
    302373
    303374=== How can I check the status of partitions and nodes? ===
     
    475546To exploit SMT, simply run a job in which the product of tasks and threads per task is higher than the number of physical cores available on a node. Please refer to the [https://apps.fz-juelich.de/jsc/hps/jureca/smt.html relevant page] of the JURECA documentation for more information on how to use SMT on the DEEP nodes.
    476547
    477 **Attention**: currently the only way of assign Slurm tasks to hardware threads belonging to the same hardware core is to use the `--cpu-bind` option of psslurm using `mask_cpu` to provide affinity masks for each task. For example:
     548**Attention**: currently the only way to assign Slurm tasks to hardware threads belonging to the same hardware core is to use the `--cpu-bind` option of psslurm using `mask_cpu` to provide affinity masks for each task. For example:
    478549{{{#!sh
    479550[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=2 OMP_PROC_BIND=close OMP_PLACES=threads srun -N 1 -n 2 -p dp-dam --cpu-bind=mask_cpu:$(printf '%x' "$((2#1000000000000000000000000000000000000000000000001))"),$(printf '%x' "$((2#10000000000000000000000000000000000000000000000010))") ./HybridHello | sort -k9n -k11n