Changes between Version 53 and Version 54 of Public/User_Guide/Batch_system


Timestamp: Oct 15, 2021, 11:33:43 AM
Author: Jochen Kreutz
Comment: Move heterogeneous and workflow stuff to own wiki pages

Hello World from rank 0 of 8 on dp-esb34
}}}

== Heterogeneous jobs ==

As of version 17.11 of Slurm, heterogeneous jobs are supported. For example, the user can run:

{{{
srun --account=deep --partition=dp-cn -N 1 -n 1 hostname : --partition=dp-dam -N 1 -n 1 hostname
dp-cn01
dp-dam01
}}}

Please note the `:` separating the definitions for each sub-job of the heterogeneous job. Also, be aware that it is possible to have more than two sub-jobs in a heterogeneous job.
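
For instance, a command along the following lines would run a heterogeneous job with three sub-jobs (a minimal sketch; the `dp-esb` partition name is used here purely for illustration):

{{{
srun --account=deep --partition=dp-cn -N 1 -n 1 hostname : --partition=dp-dam -N 1 -n 1 hostname : --partition=dp-esb -N 1 -n 1 hostname
}}}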

The user can also request several sets of nodes in a heterogeneous allocation using `salloc`. For example:
{{{
salloc --partition=dp-cn -N 2 : --partition=dp-dam -N 4
}}}

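Within such an allocation, job steps can then be launched on the two sets of nodes using the same colon syntax, for example (a minimal sketch, assuming the allocation above has been granted):

{{{
srun hostname : hostname
}}}
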
In order to submit a heterogeneous job via `sbatch`, the user needs to set up a batch script similar to the following one:

{{{#!sh
#!/bin/bash

#SBATCH --job-name=imb_execute_1
#SBATCH --account=deep
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH --output=job.out
#SBATCH --error=job.err
#SBATCH --time=00:02:00

#SBATCH --partition=dp-cn
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=1

#SBATCH packjob

#SBATCH --partition=dp-dam
#SBATCH --constraint=
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=1

srun ./app_cn : ./app_dam
}}}

Here the `packjob` keyword makes it possible to define Slurm parameters for each sub-job of the heterogeneous job. Some Slurm options can be defined once at the beginning of the script and are automatically propagated to all sub-jobs of the heterogeneous job, while others (e.g. `--nodes` or `--ntasks`) must be defined for each sub-job. You can find a list of the propagated options in the [https://slurm.schedmd.com/heterogeneous_jobs.html#submitting Slurm documentation].

When submitting a heterogeneous job with this colon notation using ParaStationMPI, a single `MPI_COMM_WORLD` is created, spanning across the two partitions. If this is not desired, one can use the `--pack-group` option to submit independent job steps to the different node groups of a heterogeneous allocation:

{{{#!sh
srun --pack-group=0 ./app_cn ; srun --pack-group=1 ./app_dam
}}}

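Note that with `;` the two job steps above run one after the other. If the steps are meant to run concurrently on their respective node groups, a variant like the following can be used (a sketch only; `app_cn` and `app_dam` again stand for the two executables):

{{{#!sh
srun --pack-group=0 ./app_cn &
srun --pack-group=1 ./app_dam &
wait
}}}
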
Using this configuration implies that inter-communication must be established manually by the applications at run time, if needed.

For more information about heterogeneous jobs, please refer to the [https://slurm.schedmd.com/heterogeneous_jobs.html relevant page] of the Slurm documentation.

=== Heterogeneous jobs with MPI communication across modules ===

In order to establish MPI communication across modules using different interconnect technologies, some special Gateway nodes must be used. On the DEEP-EST system, MPI communication across gateways is needed only between Infiniband and Extoll interconnects.

**Attention:** Only !ParaStation MPI supports MPI communication across gateway nodes.

This is an example job script for setting up an Intel MPI benchmark between a Cluster and a DAM node using an IB <-> Extoll gateway for MPI communication:

{{{#!sh
#!/bin/bash

# Script to launch an IMB benchmark between DAM and CN using 1 Gateway
# Use the gateway allocation provided by SLURM
# Use the packjob feature to launch the CM and DAM executables separately


# General configuration of the job
#SBATCH --job-name=modular-imb
#SBATCH --account=deep
#SBATCH --time=00:10:00
#SBATCH --output=modular-imb-%j.out
#SBATCH --error=modular-imb-%j.err

# Configure the gateway daemon
#SBATCH --gw_num=1
#SBATCH --gw_psgwd_per_node=1

# Configure node and process count on the CM
#SBATCH --partition=dp-cn
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

#SBATCH packjob

# Configure node and process count on the DAM
#SBATCH --partition=dp-dam-ext
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

# Echo job configuration
echo "DEBUG: SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
echo "DEBUG: SLURM_NNODES=$SLURM_NNODES"
echo "DEBUG: SLURM_TASKS_PER_NODE=$SLURM_TASKS_PER_NODE"


# Set the environment to use PS-MPI
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load Intel
module load ParaStationMPI

# Show the hosts we are running on
srun hostname : hostname

# Execute
APP="./IMB-MPI1 Uniband"
srun ${APP} : ${APP}
}}}

**Attention:** During the first part of 2020, only the DAM nodes will have the Extoll interconnect (and only the nodes belonging to the `dp-dam-ext` partition will have Extoll active), while the CM and the ESB nodes will be connected via Infiniband. This will change later during the course of the project (expected end of Summer 2020), when the ESB will be equipped with Extoll connectivity (Infiniband will be removed from the ESB and left only for the CM).

A general description of how the user can request and use gateway nodes is provided in [https://apps.fz-juelich.de/jsc/hps/jureca/modular-jobs.html#mpi-traffic-across-modules this section] of the JURECA documentation.

**Attention:** Some of the information provided in the JURECA documentation does not apply to the DEEP system. In particular:
* As of 31/03/2020, the DEEP system has 2 gateway nodes.

* As of 09/01/2020, the gateway nodes are exclusive to the job requesting them. Given the limited number of gateway nodes available on the system, this may change in the future.

* As of 09/04/2020, the `xenv` utility (necessary on JURECA to load modules for different architectures - Haswell and KNL) is no longer needed on DEEP when using the latest version of ParaStationMPI (currently available in the `Devel-2019a` stage and soon available on the default production stage).

{{{#!comment
If you need to load modules before launching the application, it is suggested to create wrapper scripts around the applications and submit such scripts with srun, like this:

{{{#!sh
...
srun ./script_sdv.sh : ./script_knl.sh
}}}

where a script should contain:

{{{#!sh
#!/bin/bash

module load ...
./app_sdv
}}}

This way it will also be possible to load different modules on the different partitions used in the heterogeneous job.
}}}

== Workflows ==

The version of Slurm installed on the system enables workflows (chains of jobs) with the possibility of having some overlap between the dependent jobs. This allows them to exchange data over the network rather than writing and reading it using a common file system.

Workflows can be submitted in two ways:
- using the new `--delay` option provided by the `sbatch` command, which allows starting a job with a fixed delay after the start of the previous job;
- submitting separate jobs using an `afterok` dependency and later requesting a change in dependency type from `afterok` to `after` (using the provided shared library), which allows the second job to start if resources are available.

An example project that uses all the features discussed is provided [https://gitlab.version.fz-juelich.de/DEEP-EST/mpi_connect_test here].

The following simple example script helps to understand the mechanism of the new {{{--delay}}} switch for workflows.

{{{#!sh
[huda1@deepv scripts]$ cat test.sh
#!/bin/sh

NAME=$(hostname)
echo "$NAME: Going to sleep for $1 seconds"
sleep $1
echo "$NAME: Awake"

[huda1@deepv scripts]$ cat batch_workflow.sh
#!/bin/bash
#SBATCH -p sdv -N2 -t3

#SBATCH packjob

#SBATCH -p sdv -N1 -t3 --delay 2

srun test.sh 175

[huda1@deepv scripts]$
}}}

In the above {{{sbatch}}} script, the usage of {{{--delay}}} can be seen. The option takes values in minutes and allows delaying the start of the subsequent job in the job pack by a user-defined number of minutes from the start of the first job. After submission of this job pack (which uses the same syntax as a heterogeneous job), Slurm divides it into separate jobs. Also, Slurm ensures that the delay is respected by using reservations, rather than the usual scheduling.

Here is an example execution of this script:

{{{
[huda1@deepv scripts]$ sbatch batch_workflow.sh
Submitted batch job 81458
[huda1@deepv scripts]$ squeue -u huda1
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             81458       sdv batch_wo    huda1 CF       0:01      2 deeper-sdv[02-03]
             81459       sdv batch_wo    huda1 PD       0:00      1 (Reservation)

[huda1@deepv scripts]$
}}}

Here the second job (81459) will start 2 minutes after the start of the first job (81458), and it is listed as `PD` (`Pending`) with reason `Reservation` because it will start as soon as its reservation begins.

Similarly, the output files will be different for each separate job in the job pack. The final outputs are:
{{{
[huda1@deepv scripts]$ cat slurm-81458.out
deeper-sdv02: Going to sleep for 175 seconds
deeper-sdv03: Going to sleep for 175 seconds
deeper-sdv02: Awake
deeper-sdv03: Awake

[huda1@deepv scripts]$ cat slurm-81459.out
deeper-sdv01: Going to sleep for 175 seconds
deeper-sdv01: Awake

[huda1@deepv scripts]$
}}}

Another feature to note is that if there are multiple jobs in a job pack and any number of consecutive jobs have the same {{{--delay}}} values, they are combined into a new heterogeneous job. This makes it possible to have heterogeneous jobs within workflows. Here is an example of such a script:
{{{#!sh
[huda1@deepv scripts]$ cat batch_workflow_complex.sh
#!/bin/bash

#SBATCH -p sdv -N 2 -t 3
#SBATCH -J first

#SBATCH packjob

#SBATCH -p sdv -N 1 -t 3 --delay 2
#SBATCH -J second

#SBATCH packjob

#SBATCH -p sdv -N 1 -t 2 --delay 2
#SBATCH -J second

#SBATCH packjob

#SBATCH -p sdv -N 2 -t 3 --delay 4
#SBATCH -J third

if [ "$SLURM_JOB_NAME" == "first" ]
then
        srun ./test.sh 150

elif [ "$SLURM_JOB_NAME" == "second" ]
then
        srun ./test.sh 150 : ./test.sh 115

elif [ "$SLURM_JOB_NAME" == "third" ]
then
        srun ./test.sh 155

fi

[huda1@deepv scripts]$
}}}

Note that the {{{--delay}}} values for the second and third jobs in the script are equal.

**Attention:** The {{{--delay}}} value for the 4th job ({{{-J third}}}) is relative to the start of the first job, not to the start of the middle two jobs, so it will start 2 minutes after the start time of the middle jobs. Also, note the usage of the environment variable {{{SLURM_JOB_NAME}}} in the script to decide which command to run in which job.

The example execution leads to the following:
{{{
[huda1@deepv scripts]$ sbatch batch_workflow_complex.sh
Submitted batch job 81460

[huda1@deepv scripts]$ squeue -u huda1
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           81461+0       sdv   second    huda1 PD       0:00      1 (Resources)
           81461+1       sdv   second    huda1 PD       0:00      1 (Resources)
             81463       sdv    third    huda1 PD       0:00      2 (Resources)
             81460       sdv    first    huda1 PD       0:00      2 (Resources)

[huda1@deepv scripts]$
}}}

Note that the submitted heterogeneous job has been divided into a single job (81460), a job pack (81461) and again a single job (81463). Similarly, three different output files will be generated, one for each new job.
{{{
[huda1@deepv scripts]$ cat slurm-81460.out
deeper-sdv03: Going to sleep for 150 seconds
deeper-sdv04: Going to sleep for 150 seconds
deeper-sdv03: Awake
deeper-sdv04: Awake

[huda1@deepv scripts]$ cat slurm-81461.out
deeper-sdv01: Going to sleep for 150 seconds
deeper-sdv02: Going to sleep for 115 seconds
deeper-sdv02: Awake
deeper-sdv01: Awake

[huda1@deepv scripts]$ cat slurm-81463.out
deeper-sdv01: Going to sleep for 155 seconds
deeper-sdv02: Going to sleep for 155 seconds
deeper-sdv01: Awake
deeper-sdv02: Awake

[huda1@deepv scripts]$
}}}
If a job exits earlier than the time requested by the user, the corresponding reservation is automatically deleted 5 minutes after the end of the job, and the resources become available for other jobs. However, users should be careful with the requested time when submitting workflows, as larger time values can delay the scheduling of the workflow depending on the availability of the resources.

Workflows created using the {{{--delay}}} switch ensure an overlap between the applications. The alternative method (which uses Slurm job dependencies) does not ensure a time overlap between two consecutive jobs of a workflow; on the other hand, users do not need to guess how long a job will take and how large the delay between job start times should be.

Jobs can be chained in Slurm with the aid of the following script:
{{{#!sh
[huda1@deepv scripts]$ cat chain_jobs.sh
#!/usr/bin/env bash

if [ $# -lt 3 ]
then
    echo "$0: ERROR (MISSING ARGUMENTS)"
    exit 1
fi

LOCKFILE=$1
DEPENDENCY_TYPE=$2
shift 2
SUBMITSCRIPT=$*


if [ -f $LOCKFILE ]
then
    if [[ "$DEPENDENCY_TYPE" =~ ^(after|afterany|afterok|afternotok)$ ]]; then
        DEPEND_JOBID=`head -1 $LOCKFILE`
        echo "sbatch --dependency=${DEPENDENCY_TYPE}:${DEPEND_JOBID} $SUBMITSCRIPT"
        JOBID=`sbatch --dependency=${DEPENDENCY_TYPE}:${DEPEND_JOBID} $SUBMITSCRIPT`
    else
        echo "$0: ERROR (WRONG DEPENDENCY TYPE: choose among 'after', 'afterany', 'afterok' or 'afternotok')"
        # do not overwrite the lockfile with an empty job id
        exit 1
    fi
else
    echo "sbatch $SUBMITSCRIPT"
    JOBID=`sbatch $SUBMITSCRIPT`
fi

echo "RETURN: $JOBID"
# the JOBID is the last field of the output line
echo ${JOBID##* } > $LOCKFILE

exit 0
}}}

This is a modified version of the `chainJobs.sh` script included in JUBE, which allows selecting the desired dependency type between two consecutive jobs.
Here is an example of submission of a workflow with Slurm dependencies using the previous script (here called `chain_jobs.sh`):
{{{
[huda1@deepv scripts]$ ./chain_jobs.sh lockfile afterok simple_job.sh
sbatch simple_job.sh
RETURN: Submitted batch job 98626
[huda1@deepv scripts]$ ./chain_jobs.sh lockfile afterok simple_job.sh
sbatch --dependency=afterok:98626 simple_job.sh
RETURN: Submitted batch job 98627
[huda1@deepv scripts]$ ./chain_jobs.sh lockfile afterok simple_job.sh
sbatch --dependency=afterok:98627 simple_job.sh
RETURN: Submitted batch job 98628
[huda1@deepv scripts]$ squeue -u huda1
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             98627       sdv simple_j    huda1 PD       0:00      2 (Dependency)
             98628       sdv simple_j    huda1 PD       0:00      2 (Dependency)
             98626       sdv simple_j    huda1  R       0:21      2 deeper-sdv[01-02]
[huda1@deepv scripts]$ scontrol show job 98628 | grep Dependency
   JobState=PENDING Reason=Dependency Dependency=afterok:98627
[huda1@deepv scripts]$ cat lockfile
98628
}}}
Please note that `lockfile` must not exist prior to the first submission.
After the first job submission, that file will contain the ID of the last submitted job, which is later used by the subsequent calls to the `chain_jobs.sh` script to set the dependency.

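For example, to start a fresh chain one would typically remove any stale lockfile first (a minimal sketch; `lockfile` and `simple_job.sh` are just the names used in the example above):

{{{#!sh
rm -f lockfile          # make sure no old job id is picked up
./chain_jobs.sh lockfile afterok simple_job.sh
}}}
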
=== {{{slurm_workflow}}} Library ===

In order to improve the usability of workflows, a library has been developed and deployed on the system to allow users to interact with the scheduler from within applications involved in a workflow.
The library is called `slurm_workflow`.

The library provides two functions.

The first function is relevant to workflows created using the `--delay` switch and moves all the reservations of the remaining workflow jobs.
{{{
/*
IN:    number of minutes from now. The start time of the next reservation of the workflow is moved to this time if doable.
OUT:   0 if successful, non-zero otherwise; in case of error slurm_wf_error is set.
*/
int slurm_wf_move_all_res(uint32_t t);
}}}
The minimum value usable for the parameter is currently 2 (minutes).

The second function changes the dependency type of all jobs depending on the current job from {{{afterok:job_id}}} to {{{after:job_id}}}.
{{{
/*
OUT: 0 if successful, an error number otherwise.
*/
int slurm_change_dep();
}}}

This makes the dependent jobs of the workflow eligible for allocation by Slurm.

Both functions allow an application to notify the scheduler that it is ready for the start of the subsequent jobs of a workflow.
This is particularly relevant when a network connection must be established between the two applications, but only after a certain time from the start of the first job.

When using the library, the header file can be included with `#include <slurm/slurm_workflow.h>` and the library should be linked with `-lslurm_workflow -lslurm`.
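
As a minimal illustration (a sketch only: the surrounding application logic, the error handling and the 2-minute value are assumptions, not part of the library documentation), an application participating in a workflow could use the library as follows:

{{{
#include <stdio.h>
#include <slurm/slurm_workflow.h>

int main(void)
{
    /* ... set up the application, e.g. open the port the next job will connect to ... */

    /* For --delay based workflows: pull the reservations of the remaining
       workflow jobs forward so that the next job can start in about 2 minutes
       (2 is currently the minimum accepted value). */
    if (slurm_wf_move_all_res(2) != 0)
        fprintf(stderr, "slurm_wf_move_all_res failed\n");

    /* For dependency based workflows: relax afterok to after, making the
       dependent jobs eligible to start while this job is still running. */
    if (slurm_change_dep() != 0)
        fprintf(stderr, "slurm_change_dep failed\n");

    /* ... continue with the actual work of this workflow step ... */
    return 0;
}
}}}

Such a program would then be compiled and linked with something like `gcc workflow_step.c -o workflow_step -lslurm_workflow -lslurm` (the source and binary names here are arbitrary).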