Changes between Version 36 and Version 37 of Public/User_Guide/Batch_system


Timestamp:
Jul 28, 2020, 11:40:21 AM (4 years ago)
Author:
Zia Ul Huda
  • Public/User_Guide/Batch_system

    v36 v37  
    321321== Workflows ==
    322322
    323 The new version of the installed slurm now supports workflows. The idea is to have an overlap between the dependent jobs so that they can communicate the data over the network instead of writing and reading it on storage. To enable the workflows, we have introduced a new switch {{{delay}}} to {{{sbatch}}} command. Here is a simple example script.
     323The newly installed version of Slurm supports workflows. The idea is to overlap dependent jobs so that they can exchange data over the network instead of writing it to and reading it from storage. There are two ways to build a workflow. The first uses the new {{{delay}}} switch of the {{{sbatch}}} command. The second submits jobs with dependencies of type {{{afterok}}}; later, the independent job changes the dependency type of the dependent job to {{{after}}} using our shared library (explained below). Jacopo has developed an example project [https://gitlab.version.fz-juelich.de/deamicis1/mpi_connect_test/-/tree/test_zia_workflows] that uses all the features discussed here.
     324
     325The following simple example script illustrates how the new {{{delay}}} switch works for workflows.
    324326
    325327{{{
     
    450452[huda1@deepv scripts]$
    451453}}}
    452 If a job exits earlier than the allocated time asked by the user, the corresponding reservation for this job is deleted automatically and the resources become available for the other jobs. However, users should be careful with the requested time when submitting workflows as the larger time values can delay the scheduling of the workflows depending on the situation of the resources.
     454If a job exits earlier than its requested time allocation, the corresponding reservation is deleted automatically 5 minutes after the job ends, and the resources become available to other jobs. However, users should choose the requested time carefully when submitting workflows, because larger time values can delay the scheduling of the workflow, depending on resource availability.
     455
     456Workflows created with the {{{delay}}} switch guarantee an overlap between the applications. The second method, which relies on dependencies among jobs, does not guarantee an overlap, but it spares users from guessing how long a job will run and how long the delay between jobs should be. The process is simple. A user submits a job and then a dependent job with a dependency of type {{{afterok}}}. Inside the first (independent) job, the running application calls the function provided by the {{{slurm_workflow}}} library, which changes the dependency type of the dependent job to {{{after}}}. This makes the dependent job immediately eligible for allocation by Slurm. However, the actual allocation depends on the resources available in the system. The following script submits jobs as a chain with a given dependency type.
     457{{{
     458[huda1@deepv scripts]$ cat chain_jobs.sh
     459#!/usr/bin/env bash
     460
     461if [ $# -lt 3 ]
     462then
     463    echo "$0: ERROR (MISSING ARGUMENTS)"
     464    exit 1
     465fi
     466
     467LOCKFILE=$1
     468DEPENDENCY_TYPE=$2
     469shift 2
     470SUBMITSCRIPT=$*
     471
     472
     473if  [ -f $LOCKFILE ]
     474then
     475    if [[ "$DEPENDENCY_TYPE" =~ ^(after|afterany|afterok|afternotok)$ ]]; then 
     476        DEPEND_JOBID=`head -1 $LOCKFILE`
     477        echo "sbatch --dependency=${DEPENDENCY_TYPE}:${DEPEND_JOBID} $SUBMITSCRIPT"
     478        JOBID=`sbatch --dependency=${DEPENDENCY_TYPE}:${DEPEND_JOBID} $SUBMITSCRIPT`
     479    else
     480        echo "$0: ERROR (WRONG DEPENDENCY TYPE: choose among 'after', 'afterany', 'afterok' or 'afternotok')"; exit 1
     481    fi
     482else
     483    echo "sbatch $SUBMITSCRIPT"
     484    JOBID=`sbatch $SUBMITSCRIPT`
     485fi
     486
     487echo "RETURN: $JOBID"
     488# the JOBID is the last field of the output line
     489echo ${JOBID##* } > $LOCKFILE
     490
     491exit 0
     492}}}
     493
     494Here is an example submission.
     495{{{
     496[huda1@deepv scripts]$ ./chain_jobs.sh lockfile afterok simple_job.sh
     497sbatch simple_job.sh
     498RETURN: Submitted batch job 98626
     499[huda1@deepv scripts]$ ./chain_jobs.sh lockfile afterok simple_job.sh
     500sbatch --dependency=afterok:98626 simple_job.sh
     501RETURN: Submitted batch job 98627
     502[huda1@deepv scripts]$ ./chain_jobs.sh lockfile afterok simple_job.sh
     503sbatch --dependency=afterok:98627 simple_job.sh
     504RETURN: Submitted batch job 98628
     505[huda1@deepv scripts]$ squeue -u huda1
     506             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     507             98627       sdv simple_j    huda1 PD       0:00      2 (Dependency)
     508             98628       sdv simple_j    huda1 PD       0:00      2 (Dependency)
     509             98626       sdv simple_j    huda1  R       0:21      2 deeper-sdv[01-02]
     510[huda1@deepv scripts]$ scontrol show job 98628 | grep Dependency
     511   JobState=PENDING Reason=Dependency Dependency=afterok:98627
     512[huda1@deepv scripts]$ cat lockfile
     51398628
     514}}}
     515Note that the {{{lockfile}}} contains the ID of the last submitted job.
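
The two pieces of logic that {{{chain_jobs.sh}}} relies on, taking the job ID from the last field of the {{{sbatch}}} output line and checking the dependency type against a fixed list, can be tried in isolation without submitting anything. The following is only a minimal sketch: the function names {{{extract_job_id}}} and {{{is_valid_dependency_type}}} are hypothetical helpers introduced here for illustration and are not part of the script above.

```shell
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical and do not appear
# in chain_jobs.sh; they isolate two expressions used by that script.

# Print the last space-separated field of an sbatch output line,
# e.g. "Submitted batch job 98628" -> "98628".
extract_job_id() {
    local line="$1"
    # ${line##* } removes the longest prefix that ends in a space.
    echo "${line##* }"
}

# Succeed only for the dependency types that chain_jobs.sh accepts.
is_valid_dependency_type() {
    [[ "$1" =~ ^(after|afterany|afterok|afternotok)$ ]]
}

# Demonstration with a sample line taken from the submission example.
extract_job_id "Submitted batch job 98628"

if is_valid_dependency_type "afterok"; then
    echo "afterok: accepted"
fi

if ! is_valid_dependency_type "before"; then
    echo "before: rejected"
fi
```

Because the job ID is taken from the last field, the approach assumes that {{{sbatch}}} prints a line ending in the numeric ID, as in the example output above. The standard {{{--parsable}}} option of {{{sbatch}}} may be an alternative worth considering, since it is intended for scripted use.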
    453516
    454517=== {{{slurm_workflow}}} Library ===