Changes between Version 37 and Version 38 of Public/User_Guide/Batch_system


Timestamp:
Jul 28, 2020, 5:27:42 PM (4 years ago)
Author:
Jacopo de Amicis
Comment:

Reworded parts of the workflow description.

  • Public/User_Guide/Batch_system

    v37 v38  
    321321== Workflows ==
    322322
    323 The new version of the installed slurm now supports workflows. The idea is to have an overlap between the dependent jobs so that they can communicate the data over the network instead of writing and reading it on storage. We have provided two ways to achieve a workflow. One way is to use the new {{{delay}}} switch provided in {{{sbatch}}} command. While the other method is to submit jobs with dependencies of type {{{afterok}}} and later the independent job changes the dependency type of the dependent job to {{{after}}} using our provided shared library (explained below). Jacopo has developed an example project [https://gitlab.version.fz-juelich.de/deamicis1/mpi_connect_test/-/tree/test_zia_workflows] that uses all the features discussed here.
     323The version of Slurm installed on the system enables workflows (chains of jobs) with the possibility of having some overlap between the dependent jobs. This allows them to exchange data over the network rather than writing and reading it using a common file system.
     324
     325Workflows can be submitted in two ways:
     326- using the new `--delay` option of the `sbatch` command, which starts a job with a fixed delay after the start of the previous job;
     327- submitting separate jobs with an `afterok` dependency and later requesting a change of the dependency type from `afterok` to `after` (using our provided shared library), which allows the second job to start while the first is still running, if resources are available.
     328
     329An example project that uses all the features discussed is provided [https://gitlab.version.fz-juelich.de/deamicis1/mpi_connect_test/-/tree/test_zia_workflows here].
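As a rough illustration of the second submission method, the following sketch submits two scripts so that the second carries an `afterok` dependency on the first. It relies only on the standard `sbatch` options `--parsable` and `--dependency`; the function name `submit_chain` and the script names in the usage comment are hypothetical. The later switch from `afterok` to `after` through the provided shared library is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: submit two jobs where the second depends on the first (afterok).
# submit_chain and the script names used with it are hypothetical.

submit_chain() {
    local first_script=$1 second_script=$2
    local first_id second_id

    # --parsable makes sbatch print "jobid" or "jobid;cluster".
    first_id=$(sbatch --parsable "$first_script") || return 1
    first_id=${first_id%%;*}    # keep only the job ID

    # The second job is held until the first one finishes successfully.
    second_id=$(sbatch --parsable --dependency="afterok:${first_id}" "$second_script") || return 1
    second_id=${second_id%%;*}

    # Print both job IDs so the caller can track the chain.
    echo "$first_id $second_id"
}

# Possible usage (hypothetical script names):
#   submit_chain producer.sh consumer.sh
```

Capturing the job ID of the first submission is what makes it possible to express the dependency for the second one.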
    324330
    325331The following simple example script helps to understand the mechanism of the new {{{delay}}} switch for workflows.
     
    347353}}}
    348354
    349 In the above {{{sbatch}}} script, the usage of {{{--delay}}} can be seen. It takes thee values in minutes. The idea is to delay the corresponding job of a heterogeneous job by the provided number of minutes from the start of the first job in the job pack. After submission of this job pack, slurm divides it into separate jobs at the time of the resource reservation. So you can see multiple jobs in the output of {{{squeue}}} command. Here is the example execution of this script.
     355In the above {{{sbatch}}} script, the usage of {{{--delay}}} can be seen. The option takes values in minutes and delays the subsequent job by a user-defined number of minutes from the start of the first job in the job pack. After submission of this job pack (which uses the same syntax as a heterogeneous job), Slurm divides it into separate jobs. Slurm also ensures that the delay is respected by using reservations rather than the usual scheduler.
     356
     357Here is the example execution of this script.
    350358
    351359{{{
     
    360368}}}
    361369
    362 Here the second job(81458) will start 2 minutes after the start of the first job(81459). Similarly, the output files will be different for each separated job in the job pack. the final outputs are:
     370Here the second job (81459) will start 2 minutes after the start of the first job (81458). It is listed as `PD` (`Pending`) with reason `Reservation` because it will start as soon as its reservation begins.
     371
     372Similarly, the output files will be different for each separated job in the job pack. The final outputs are:
    363373{{{
    364374[huda1@deepv scripts]$ cat slurm-81458.out
     
    375385}}}
    376386
    377 Another feature to note is that if there are multiple jobs in a job pack and any number of consecutive jobs have the same {{{delay}}} values, they are combined into a new heterogeneous job. Here is an example of such a script:
     387Another feature to note is that if a job pack contains multiple jobs and any number of consecutive jobs have the same {{{delay}}} value, they are combined into a new heterogeneous job. This makes it possible to have heterogeneous jobs within workflows. Here is an example of such a script:
    378388{{{
    379389[huda1@deepv scripts]$ cat batch_workflow_complex.sh
     
    415425}}}
    416426
    417 Note the {{{delay}}} values for the second and third job in the script are equal. Also, note the usage of the environment variable {{{SLURM_JOB_NAME}}} in the script to decide which command to run in which job. The example execution leads to the following:
     427Note that the {{{delay}}} values for the second and third job in the script are equal. Also note the usage of the environment variable {{{SLURM_JOB_NAME}}} in the script to decide which command to run in which job. Currently this is the only Slurm environment variable that makes it possible to differentiate the components of a heterogeneous job (and workflow) in a static manner (i.e. without using job-dependent variables like the job ID or the node lists).
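The dispatch on {{{SLURM_JOB_NAME}}} described above could look roughly like the following sketch. The job names (`first`, `second`, `third`), the program names, and the helper `select_command` are all hypothetical; only the `SLURM_JOB_NAME` variable itself comes from Slurm.

```shell
#!/usr/bin/env bash
# Sketch only: choose the command for each workflow component by job name.
# Job names and program names below are hypothetical placeholders.

select_command() {
    case "${1:-}" in
        first)  echo "./producer" ;;
        second) echo "./consumer" ;;
        third)  echo "./postprocess" ;;
        *)      echo "unknown job name: ${1:-<unset>}" >&2; return 1 ;;
    esac
}

# In the batch script one could then run, for example:
#   cmd=$(select_command "$SLURM_JOB_NAME") && srun "$cmd"
```

Because the job name is fixed at submission time, the same script can be shared by all components without knowing their job IDs in advance.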
     428
     429The example execution leads to the following:
    418430{{{
    419431[huda1@deepv scripts]$ sbatch batch_workflow_complex.sh