Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are {{{srun}}} and {{{sbatch}}}. The {{{srun}}} command can be used to spawn processes ('''please do not use mpiexec'''), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform).

== An introductory example ==

Suppose you have an MPI executable named {{{hello_cluster}}}. There are three ways to start the binary.

=== From a shell on a node ===

First, start a shell on a node. Suppose you would like to run your MPI task on 4 machines with 2 tasks per machine:
{{{
niessen@deepl:src/mpi > srun --partition=sdv -N 4 -n 8 --pty /bin/bash -i
niessen@deeper-sdv04:/direct/homec/zdvex/niessen/src/mpi >
}}}

The environment is transported to the remote shell; no {{{.profile}}}, {{{.bashrc}}}, ... are sourced (in particular, not the modules defaults from {{{/etc/profile.d/modules.sh}}}).
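
Since no profile scripts are sourced, the module environment is not initialized in the remote shell. A possible workaround is to source it by hand (the module name below is only a placeholder for whatever your build needs):

{{{
source /etc/profile.d/modules.sh
module load intel-para   # placeholder module name
}}}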

Once you are on the compute node, start your application using {{{srun}}}. Note that the number of tasks used is the same as specified in the initial {{{srun}}} command above (4 nodes with 2 tasks each):
{{{
niessen@deeper-sdv04:/direct/homec/zdvex/niessen/src/mpi > srun ./hello_cluster
srun: cluster configuration lacks support for cpu binding
Hello world from process 6 of 8 on deeper-sdv07
Hello world from process 7 of 8 on deeper-sdv07
Hello world from process 3 of 8 on deeper-sdv05
Hello world from process 4 of 8 on deeper-sdv06
Hello world from process 0 of 8 on deeper-sdv04
Hello world from process 2 of 8 on deeper-sdv05
Hello world from process 5 of 8 on deeper-sdv06
Hello world from process 1 of 8 on deeper-sdv04
}}}

You can ignore the warning about the cpu binding. ParaStation will pin your processes.

=== Running directly from the front ends ===

You can run the application directly from the frontend, bypassing the shell:

{{{
niessen@deepl:src/mpi > srun --partition=sdv -N 4 -n 8 ./hello_cluster
Hello world from process 4 of 8 on deeper-sdv06
Hello world from process 6 of 8 on deeper-sdv07
Hello world from process 3 of 8 on deeper-sdv05
Hello world from process 0 of 8 on deeper-sdv04
Hello world from process 2 of 8 on deeper-sdv05
Hello world from process 5 of 8 on deeper-sdv06
Hello world from process 7 of 8 on deeper-sdv07
Hello world from process 1 of 8 on deeper-sdv04
}}}

In this case, it can be useful to create an allocation which you can use for several runs of your job:
| 58 | |
| 59 | {{{ |
| 60 | niessen@deepl:src/mpi > salloc --partition=sdv -N 4 -n 8 |
| 61 | salloc: Granted job allocation 955 |
| 62 | niessen@deepl:~/src/mpi>srun ./hello_cluster |
| 63 | Hello world from process 3 of 8 on deeper-sdv05 |
| 64 | Hello world from process 1 of 8 on deeper-sdv04 |
| 65 | Hello world from process 7 of 8 on deeper-sdv07 |
| 66 | Hello world from process 5 of 8 on deeper-sdv06 |
| 67 | Hello world from process 2 of 8 on deeper-sdv05 |
| 68 | Hello world from process 0 of 8 on deeper-sdv04 |
| 69 | Hello world from process 6 of 8 on deeper-sdv07 |
| 70 | Hello world from process 4 of 8 on deeper-sdv06 |
| 71 | niessen@deepl:~/src/mpi> # several more runs |
| 72 | ... |
| 73 | niessen@deepl:~/src/mpi>exit |
| 74 | exit |
| 75 | salloc: Relinquishing job allocation 955 |
| 76 | }}} |
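
{{{salloc}}} can also run a single command directly under the allocation and release it when the command finishes; a sketch using standard Slurm behaviour (not verified on this system):

{{{
niessen@deepl:src/mpi > salloc --partition=sdv -N 4 -n 8 srun ./hello_cluster
}}}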

=== Batch script ===

Given the following script {{{hello_cluster.sh}}} (it has to be executable):

{{{
#!/bin/bash

#SBATCH --partition=sdv
#SBATCH -N 4
#SBATCH -n 8
#SBATCH -o /homec/zdvex/niessen/src/mpi/hello_cluster-%j.log
#SBATCH -e /homec/zdvex/niessen/src/mpi/hello_cluster-%j.err
#SBATCH --time=00:10:00

srun ./hello_cluster
}}}
This script requests 4 nodes with 8 tasks in total, specifies the stdout and stderr files, and asks for 10 minutes of walltime. Submit it with {{{sbatch}}}:

{{{
niessen@deepl:src/mpi > sbatch ./hello_cluster.sh
Submitted batch job 956
}}}

Check what it's doing:

{{{
niessen@deepl:src/mpi > squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  956       sdv hello_cl  niessen  R  0:00     4 deeper-sdv[04-07]
}}}
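
On a busy system, {{{squeue}}} lists the jobs of all users; the standard {{{-u}}} option restricts the output to your own jobs:

{{{
niessen@deepl:src/mpi > squeue -u $USER
}}}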

Check the result:

{{{
niessen@deepl:src/mpi > cat hello_cluster-956.log
Hello world from process 5 of 8 on deeper-sdv06
Hello world from process 1 of 8 on deeper-sdv04
Hello world from process 7 of 8 on deeper-sdv07
Hello world from process 3 of 8 on deeper-sdv05
Hello world from process 0 of 8 on deeper-sdv04
Hello world from process 2 of 8 on deeper-sdv05
Hello world from process 4 of 8 on deeper-sdv06
Hello world from process 6 of 8 on deeper-sdv07
}}}
