Changes between Version 24 and Version 25 of Public/User_Guide/Batch_system
Timestamp: Jan 22, 2020, 3:25:48 PM
Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are `srun` and `sbatch`. The `srun` command can be used to spawn processes ('''please do not use mpiexec'''), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform).

== Available Partitions ==

Please note that there is no default partition configured. In order to run a job, you have to specify one of the following partitions, using the {{{--partition=...}}} switch:

* dp-cn: the DEEP-EST Cluster nodes
* dp-dam: the DEEP-EST DAM nodes
* sdv: the DEEP-ER SDV nodes
* knl: the DEEP-ER KNL nodes (all of them, regardless of CPU and configuration)
* knl256: the 256-core KNLs
* knl272: the 272-core KNLs
* snc4: the KNLs configured in SNC-4 mode
{{{#!comment KNMs removed
* knm: The DEEP-ER knm nodes
}}}
* ml-gpu: the machine learning nodes, each equipped with 4 Nvidia Tesla V100 GPUs
* extoll: the SDV nodes in the Extoll fabric ('''KNL nodes are not on Extoll connectivity anymore!''')
* dam: prototype DAM nodes, two of which are equipped with Intel Arria 10G FPGAs

You can list the state of the partitions at any time with the {{{sinfo}}} command. The properties of a partition can be inspected using:

{{{
scontrol show partition <partition>
}}}

== Remark about environment ==

[…]

The user can also request several sets of nodes in a heterogeneous allocation using `salloc`. For example:
{{{
salloc --partition=dp-cn -N 2 : --partition=dp-dam -N 4
}}}

[…]
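Inside a heterogeneous allocation like the `salloc` example above, a heterogeneous job step can be launched with the same colon notation of `srun`. A minimal sketch (the executable names are hypothetical placeholders):

{{{
# From within the salloc session: the first executable runs on the
# dp-cn part of the allocation, the second on the dp-dam part
srun ./hello_cn : ./hello_dam
}}}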
Here the `packjob` keyword allows defining Slurm parameters for each sub-job of the heterogeneous job. Some Slurm options can be defined once at the beginning of the script and are automatically propagated to all sub-jobs of the heterogeneous job, while others (e.g. `--nodes` or `--ntasks`) must be defined for each sub-job. You can find a list of the propagated options in the [https://slurm.schedmd.com/heterogeneous_jobs.html#submitting Slurm documentation].

When submitting a heterogeneous job with this colon notation using ParaStationMPI, a unique `MPI_COMM_WORLD` is created, spanning the two partitions. If this is not desired, one can use the `--pack-group` key to submit independent job steps to the different node groups of a heterogeneous allocation:

[…]

In order to establish MPI communication across modules using different interconnect technologies, some special gateway nodes must be used. On the DEEP-EST system, MPI communication across gateways is needed only between the InfiniBand and Extoll interconnects.

**Attention:** Only !ParaStation MPI supports MPI communication across gateway nodes.
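As a sketch of the `--pack-group` approach (the executable names are hypothetical; each `srun` addresses one node group of a heterogeneous allocation):

{{{
# Run one independent job step per node group, concurrently
srun --pack-group=0 ./master &
srun --pack-group=1 ./worker &
wait
}}}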
This is an example job script for setting up an Intel MPI benchmark between a Cluster node and a DAM node, using an IB <-> Extoll gateway for MPI communication:

{{{
#!/bin/bash

# Script to launch IMB PingPong between DAM-CN using 1 gateway
# Use the gateway allocation provided by SLURM
# Use the packjob feature to launch the CM and DAM executables separately

#SBATCH --job-name=imb
##SBATCH --account=cdeep
#SBATCH --output=IMB-%j.out
#SBATCH --error=IMB-%j.err
#SBATCH --time=00:05:00
#SBATCH --gw_num=1
#SBATCH --gw_binary=/opt/parastation/bin/psgwd.extoll
#SBATCH --gw_psgwd_per_node=1

#SBATCH --partition=dp-cn
#SBATCH --nodes=1
#SBATCH --ntasks=1

#SBATCH packjob

#SBATCH --partition=dp-dam-ext
#SBATCH --nodes=1
#SBATCH --ntasks=1

echo "DEBUG: SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
echo "DEBUG: SLURM_NNODES=$SLURM_NNODES"
echo "DEBUG: SLURM_TASKS_PER_NODE=$SLURM_TASKS_PER_NODE"

# Execute
srun hostname : hostname
srun module_dp-cn.sh : module_dp-dam-ext.sh
}}}

It uses two execution scripts to load the correct environment and start the IMB on the CM and the DAM node (this approach can also be used to start different programs, e.g. for a master/worker use case).
The execution scripts could look like this:

{{{
#!/bin/bash
# Script for the CN using InfiniBand

module load Intel ParaStationMPI pscom

# Execution
EXEC=$PWD/mpi-benchmarks/IMB-MPI1
LD_LIBRARY_PATH=/opt/parastation/lib64:$LD_LIBRARY_PATH PSP_OPENIB_HCA=mlx5_0 ${EXEC} PingPong
}}}

{{{
#!/bin/bash
# Script for the DAM using Extoll

module load Intel ParaStationMPI pscom extoll

# Execution
EXEC=$PWD/mpi-benchmarks/IMB-MPI1
LD_LIBRARY_PATH=/opt/parastation/lib64:/opt/extoll/x86_64/lib:$LD_LIBRARY_PATH PSP_DEBUG=3 PSP_EXTOLL=1 PSP_VELO=1 PSP_RENDEZVOUS_VELO=2048 PSP_OPENIB=0 ${EXEC} PingPong
}}}

**Attention:** During the first part of 2020, only the DAM nodes will have the Extoll interconnect, while the CM and the ESB nodes will be connected via InfiniBand. This will change later in the course of the project (expected Summer 2020), when the ESB will be equipped with Extoll connectivity (InfiniBand will be removed from the ESB and left only for the CM).

A general description of how the user can request and use gateway nodes is provided in [https://apps.fz-juelich.de/jsc/hps/jureca/modular-jobs.html#mpi-traffic-across-modules this section] of the JURECA documentation.

[…]
== Information on past jobs and accounting ==

[…]

The `sacct` command can be used to enquire the Slurm database about a past job:

{{{
[kreutz1@deepv Temp]$ sacct -j 69268
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
69268+0            bash      dp-cn deepest-a+         96  COMPLETED      0:0
69268+0.0    MPI_Hello+            deepest-a+          2  COMPLETED      0:0
69268+1            bash     dp-dam deepest-a+        384  COMPLETED      0:0
}}}

== FAQ ==

[…]

=== How can I check which jobs are running in the machine? ===

Please use the {{{squeue}}} command (use the "-u $USER" option to only list jobs belonging to your user id).

=== How do I do chain jobs with dependencies? ===

[…]
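Slurm supports such dependencies through the `--dependency` option of `sbatch`. A minimal sketch of a two-step chain (the script names are hypothetical; `--parsable` makes `sbatch` print only the job id):

{{{
# Submit the first job and capture its job id
JOBID=$(sbatch --parsable step1.sh)
# The second job starts only after the first one has completed successfully
sbatch --dependency=afterok:$JOBID step2.sh
}}}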
Also, jobs can be chained after they have been submitted using the `scontrol` command, by updating their `Dependency` field.

=== How can I check the status of partitions and nodes? ===

[…]

To exploit SMT, simply run a job using a number of tasks*threads_per_task higher than the number of physical cores available on a node. Please refer to the [https://apps.fz-juelich.de/jsc/hps/jureca/smt.html relevant page] of the JURECA documentation for more information on how to use SMT on the DEEP nodes.

**Attention**: currently the only way to assign Slurm tasks to hardware threads belonging to the same hardware core is to use the `--cpu-bind` option of psslurm, using `mask_cpu` to provide affinity masks for each task. For example:
{{{#!sh
[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=2 OMP_PROC_BIND=close OMP_PLACES=threads srun -N 1 -n 2 -p dp-dam --cpu-bind=mask_cpu:$(printf '%x' "$((2#1000000000000000000000000000000000000000000000001))"),$(printf '%x' "$((2#10000000000000000000000000000000000000000000000010))") ./HybridHello | sort -k9n -k11n
}}}
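The hexadecimal masks passed to `mask_cpu` are plain integers with one bit per hardware thread, and can be derived in bash without Slurm. A sketch for two tasks, each bound to the two SMT threads of one physical core (assuming a node with 48 physical cores, so that hardware threads i and i+48 share a core):

{{{
# Task 0: set bits for hardware threads 0 and 48
mask0=$(printf '%x' "$(( (1 << 48) | (1 << 0) ))")
# Task 1: set bits for hardware threads 1 and 49
mask1=$(printf '%x' "$(( (1 << 49) | (1 << 1) ))")
echo "$mask0,$mask1"   # -> 1000000000001,2000000000002
}}}

These are the values to pass as `--cpu-bind=mask_cpu:$mask0,$mask1`.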