Changes between Version 41 and Version 42 of Public/User_Guide/Batch_system
- Timestamp: Jul 29, 2020, 12:52:11 AM (5 years ago)
Public/User_Guide/Batch_system
v41  v42
233  233  This is an example job script for setting up an Intel MPI benchmark between a Cluster and a DAM node using an IB <-> Extoll gateway for MPI communication:
234  234
235       {{{
     235  {{{#!sh
236  236  #!/bin/bash
237  237
…    …
327  327  - submitting separate jobs using an `afterok` dependency and later requesting a change in dependency type from `afterok` to `after` (using our provided shared library), which allows the second job to start if resources are available.
328  328
329       An example project that uses all the features discussed is provided [https://gitlab.version.fz-juelich.de/deamicis1/mpi_connect_test/-/tree/test_zia_workflows here].
     329  An example project that uses all the features discussed is provided [https://gitlab.version.fz-juelich.de/DEEP-EST/mpi_connect_test here].
330  330
331  331  The following simple example script helps to understand the mechanism of the new {{{delay}}} switch for workflows.
332  332
333       {{{
     333  {{{#!sh
334  334  [huda1@deepv scripts]$ cat test.sh
335  335  #!/bin/sh
…    …
386  386
387  387  Another feature to note is that if there are multiple jobs in a job pack and any number of consecutive jobs have the same {{{delay}}} values, they are combined into a new heterogeneous job. This allows heterogeneous jobs within workflows. Here is an example of such a script:
388       {{{
     388  {{{#!sh
389  389  [huda1@deepv scripts]$ cat batch_workflow_complex.sh
390  390  #!/bin/bash
…    …
468  468  If a job exits earlier than the allocated time asked by the user, the corresponding reservation for this job is deleted automatically 5 minutes after the end of the job, and the resources become available for other jobs. However, users should be careful with the requested time when submitting workflows, as larger time values can delay the scheduling of the workflows depending on the situation of the resources.
469  469
470       The workflows created using {{{delay}}} switch ensure overlap between the applications. The second method that includes dependencies among jobs, does not ensure an overlap but avoids users to guess the time a job will take and how much should be the delay between jobs. The process is simple. A user submits a job and later a dependent job with a dependency of type {{{afterok}}}. Inside the first (independent) job, the application running calls the function provided in {{{slurm_workflow}}} library, that changes the dependency type of the dependent job to {{{after}}}. This enables the dependent job to be eligible for allocation by slurm immediately. However, the allocation of resources depends upon the situation of resources available in the system. The following script helps to submit jobs in the form of a chain with a provided dependency type.
471       {{{
     470  The workflows created using the {{{delay}}} switch ensure overlap between the applications. In contrast, the alternative method (which uses Slurm job dependencies) does not ensure a time overlap between two consecutive jobs of a workflow. In this case, however, users do not need to guess the time a job will take or how large the delay between job start times should be.
     471
     472  Jobs can be chained in Slurm with the aid of the following script:
     473  {{{#!sh
472  474  [huda1@deepv scripts]$ cat chain_jobs.sh
473  475  #!/usr/bin/env bash
…    …
506  508  }}}
507  509
508       Here is the example of submission.
     510  This is a modified version of the `chainJobs.sh` included in JUBE, which allows selecting the desired dependency type between two consecutive jobs.
     511  Here is an example of submission of a workflow with Slurm dependencies using the previous script (here called `chain_jobs.sh`):
509  512  {{{
510  513  [huda1@deepv scripts]$ ./chain_jobs.sh lockfile afterok simple_job.sh
…    …
527  530  98628
528  531  }}}
529       Note that the {{{lockfile}}} contains the id of last submitted job.
     532  Please note that `lockfile` must not exist before the first submission.
     533  After the first job submission, that file will contain the ID of the last submitted job, which is later used by the subsequent call to the `chain_jobs.sh` script to set the dependency.
     534
530  535
531  536  === {{{slurm_workflow}}} Library ===
532  537
533       We have developed a library that developers can use to change the reservation beginning times or dependency type of the dependent jobs in a workflow. This library is called {{{slurm_workflow}}}. The library has two functions.
534
535       The first function moves all the reservations of the remaining workflow jobs to an earlier time when the workflow is created using {{{--delay}}} switch.
     538  In order to improve the usability of workflows, a library has been developed and deployed on the system to allow users to interact with the scheduler from within applications involved in a workflow.
     539  The library is called `slurm_workflow`.
     540
     541  The library has two functions.
     542
     543  The first function is relevant to workflows created using the `--delay` switch and moves all the reservations of the remaining workflow jobs.
536  544  {{{
537  545  /*
…    …
541  549  int slurm_wf_move_all_res(uint32_t t);
542  550  }}}
     551  The minimum value usable for the parameter is currently 2 (minutes).
543  552
544  553  The second function changes the dependency type of all jobs dependent on the current job from {{{afterok:job_id}}} to {{{after:job_id}}}.
…    …
551  560  }}}
552  561
553       Call the above function to change all {{{afterok:$(SLURM_JOBID)}}} dependencies into {{{after:$(SLURM_JOBID)}} dependencies. This enables the jobs in workflow eligible for allocation by Slurm.
554
555       The header file can be included using {{{#include <slurm/slurm_workflow.h>}}} and should be linked using {{{-lslurm_workflow}}} and {{{-lslurm}}}.
     562  This makes the jobs in the workflow eligible for allocation by Slurm.
     563
     564  Both functions allow an application to notify the scheduler that it is ready for the start of the subsequent jobs of a workflow.
     565  This is particularly relevant in case a network connection must be established between the two applications, but only after a certain time from the start of the first job.
     566
     567  When using the library, the header file can be included using `#include <slurm/slurm_workflow.h>` and the library should be linked against using `-lslurm_workflow -lslurm`.
556  568
557  569
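
The lockfile mechanism described for `chain_jobs.sh` (whose body is elided in the hunks above) can be illustrated with a minimal sketch. This is not the script from the guide: the function name `chain_submit` and the `SBATCH` override variable are hypothetical and introduced only for illustration. The sketch assumes nothing beyond the standard `sbatch` options `--parsable` and `--dependency=<type>:<jobid>`.

```shell
#!/bin/sh
# Hypothetical sketch of lockfile-based job chaining (not the guide's chain_jobs.sh).
# Usage: chain_submit LOCKFILE DEPENDENCY_TYPE JOB_SCRIPT
chain_submit() {
    lockfile=$1
    deptype=$2
    script=$3

    # --parsable makes sbatch print "jobid" or "jobid;cluster" only.
    set -- --parsable

    # A non-empty lockfile holds the ID of the previously submitted job,
    # so the new job is made dependent on it. On the first submission the
    # lockfile does not exist and the job is submitted without a dependency.
    if [ -s "$lockfile" ]; then
        set -- "$@" "--dependency=${deptype}:$(cat "$lockfile")"
    fi

    # SBATCH is an assumed override hook that allows a dry run with a stub.
    jobid=$("${SBATCH:-sbatch}" "$@" "$script") || return 1

    # Drop the optional ";cluster" suffix and record the ID for the next call.
    jobid=${jobid%%;*}
    printf '%s\n' "$jobid" > "$lockfile"
    printf '%s\n' "$jobid"
}
```

Calling `chain_submit lockfile afterok simple_job.sh` repeatedly would build a chain in which each job depends on the one submitted before it. Deleting `lockfile` starts a new chain.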