Changes between Version 31 and Version 32 of Public/User_Guide/Batch_system


Timestamp:
Apr 14, 2020, 10:02:11 PM (4 years ago)
Author:
Zia Ul Huda
Comment:

  • Public/User_Guide/Batch_system

    v31 v32  
    319319
    320320
     321== Workflows ==
     322
     323The newly installed version of Slurm supports workflows. The idea is to let dependent jobs overlap in time so that they can exchange data over the network instead of writing it to storage and reading it back. To enable workflows, we have introduced a new switch {{{--delay}}} for the {{{sbatch}}} command. Here is a simple example script:
     324
     325{{{
     326[huda1@deepv scripts]$ cat test.sh
     327#!/bin/sh
     328
     329NAME=$(hostname)
     330echo "$NAME: Going to sleep for $1 seconds"
     331sleep $1
     332echo "$NAME: Awake"
     333
     334[huda1@deepv scripts]$ cat batch_workflow.sh
     335#!/bin/bash
     336#SBATCH -p sdv -N2 -t3
     337
     338#SBATCH packjob
     339
     340#SBATCH -p sdv -N1 -t3 --delay 2
     341
     342srun test.sh 175
     343
     344[huda1@deepv scripts]$
     345}}}
     346
     347The {{{sbatch}}} script above shows the usage of {{{--delay}}}. It takes a value in minutes and delays the corresponding job of a heterogeneous job by that many minutes, counted from the start of the first job in the job pack. After submission, Slurm splits the job pack into separate jobs at the time of the resource reservation, so multiple jobs appear in the output of the {{{squeue}}} command. Here is an example execution of this script:
     348
     349{{{
     350[huda1@deepv scripts]$ sbatch batch_workflow.sh
     351Submitted batch job 81458
     352[huda1@deepv scripts]$ squeue -u huda1
     353             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     354             81458       sdv batch_wo    huda1 CF       0:01      2 deeper-sdv[02-03]
     355             81459       sdv batch_wo    huda1 PD       0:00      1 (Reservation)
     356
     357[huda1@deepv scripts]$
     358}}}
     359
     360Here the second job (81459) will start 2 minutes after the start of the first job (81458). Each separated job in the job pack also writes its own output file. The final outputs are:
     361{{{
     362[huda1@deepv scripts]$ cat slurm-81458.out
     363deeper-sdv02: Going to sleep for 175 seconds
     364deeper-sdv03: Going to sleep for 175 seconds
     365deeper-sdv02: Awake
     366deeper-sdv03: Awake
     367
     368[huda1@deepv scripts]$ cat slurm-81459.out
     369deeper-sdv01: Going to sleep for 175 seconds
     370deeper-sdv01: Awake
     371
     372[huda1@deepv scripts]$
     373}}}
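The overlap between the two jobs in this example can be estimated with a little arithmetic. The following is only a minimal sketch: {{{overlap_seconds}}} is a hypothetical helper, not part of Slurm, and it assumes that the second job starts exactly {{{--delay}}} minutes after the first one and that scheduling overhead is negligible.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): estimate how long two workflow
# jobs run side by side, given the runtime of the first job in seconds
# and the --delay value of the second job in minutes.
overlap_seconds() {
    local runtime_s=$1
    local delay_min=$2
    local overlap=$(( runtime_s - delay_min * 60 ))
    # A negative value means the first job ended before the second started.
    if [ "$overlap" -lt 0 ]; then
        overlap=0
    fi
    echo "$overlap"
}

# First job sleeps for 175 s, second job is delayed by 2 minutes.
overlap_seconds 175 2
```

With the values from the example (175 seconds of runtime, a delay of 2 minutes) the sketch prints 55, so the two jobs would run side by side for a little under a minute.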
     374
     375Another feature to note is that if there are multiple jobs in a job pack and any number of consecutive jobs have the same {{{delay}}} values, they are combined into a new heterogeneous job. Here is an example of such a script:
     376{{{
     377[huda1@deepv scripts]$ cat batch_workflow_complex.sh
     378#!/bin/bash
     379
     380#SBATCH -p sdv -N 2 -t 3
     381#SBATCH -J first
     382
     383#SBATCH packjob
     384
     385#SBATCH -p sdv -N 1 -t 3 --delay 2
     386#SBATCH -J second
     387
     388#SBATCH packjob
     389
     390#SBATCH -p sdv -N 1 -t 2 --delay 2
     391#SBATCH -J second
     392
     393#SBATCH packjob
     394
     395#SBATCH -p sdv -N 2 -t 3 --delay 4
     396#SBATCH -J third
     397
     398if [ "$SLURM_JOB_NAME" == "first" ]
     399then
     400        srun ./test.sh 150
     401
     402elif [ "$SLURM_JOB_NAME" == "second" ]
     403then
     404        srun ./test.sh 150 : ./test.sh 115
     405
     406elif [ "$SLURM_JOB_NAME" == "third" ]
     407then
     408        srun ./test.sh 155
     409
     410fi
     411
     412[huda1@deepv scripts]$
     413}}}
     414
     415Note the {{{delay}}} values for the second and third job in the script are equal. Also, note the usage of the environment variable {{{SLURM_JOB_NAME}}} in the script to decide which command to run in which job. The example execution leads to the following:
     416{{{
     417[huda1@deepv scripts]$ sbatch batch_workflow_complex.sh
     418Submitted batch job 81460
     419
     420[huda1@deepv scripts]$ squeue -u huda1
     421             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     422           81461+0       sdv   second    huda1 PD       0:00      1 (Resources)
     423           81461+1       sdv   second    huda1 PD       0:00      1 (Resources)
     424             81463       sdv    third    huda1 PD       0:00      2 (Resources)
     425             81460       sdv    first    huda1 PD       0:00      2 (Resources)
     426
     427[huda1@deepv scripts]$
     428}}}
     429
     430Note that the submitted heterogeneous job has been divided into a single job (81460), a job pack (81461) and again a single job (81463). Similarly, three different output files will be generated, one for each new job.
     431{{{
     432[huda1@deepv scripts]$ cat slurm-81460.out
     433deeper-sdv03: Going to sleep for 150 seconds
     434deeper-sdv04: Going to sleep for 150 seconds
     435deeper-sdv03: Awake
     436deeper-sdv04: Awake
     437
     438[huda1@deepv scripts]$ cat slurm-81461.out
     439deeper-sdv01: Going to sleep for 150 seconds
     440deeper-sdv02: Going to sleep for 115 seconds
     441deeper-sdv02: Awake
     442deeper-sdv01: Awake
     443
     444[huda1@deepv scripts]$ cat slurm-81463.out
     445deeper-sdv01: Going to sleep for 155 seconds
     446deeper-sdv02: Going to sleep for 155 seconds
     447deeper-sdv01: Awake
     448deeper-sdv02: Awake
     449
     450[huda1@deepv scripts]$
     451}}}
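The dispatch on {{{SLURM_JOB_NAME}}} used in {{{batch_workflow_complex.sh}}} can also be written as a {{{case}}} statement, which may be easier to extend when a workflow has many steps. This is only a sketch: {{{choose_command}}} is a hypothetical helper introduced for illustration, and it prints the command line instead of running it so that the mapping can be checked outside of a job.

```shell
#!/bin/bash
# Sketch only: choose_command is a hypothetical helper that maps a job name
# to the command line used in batch_workflow_complex.sh.
choose_command() {
    case "$1" in
        first)  echo "srun ./test.sh 150" ;;
        second) echo "srun ./test.sh 150 : ./test.sh 115" ;;
        third)  echo "srun ./test.sh 155" ;;
        *)      echo "unknown job name: $1" >&2; return 1 ;;
    esac
}

# Inside a batch script, SLURM_JOB_NAME is set by Slurm; default to "first"
# here so the sketch can also be tried outside of a job.
choose_command "${SLURM_JOB_NAME:-first}"
```

In a real batch script the {{{echo}}} in each branch would be dropped so that the {{{srun}}} command runs directly.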
     452If a job exits before its requested time limit, the reservation for that job is deleted automatically and the resources become available to other jobs. However, users should choose the requested time carefully when submitting workflows, since larger time values can delay the scheduling of the workflow, depending on how busy the resources are.
    321453
    322454== Information on past jobs and accounting ==