Changes between Version 23 and Version 24 of Public/User_Guide/Batch_system


Timestamp: Jan 22, 2020, 11:00:46 AM
Author: Jochen Kreutz
Comment: 2020-01-22 JK: minor updates (new examples etc.) in the first part before the heterogeneous jobs section


== Overview ==

-Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are `srun` and `sbatch`. The `srun` command can be used to spawn processes ('''please do not use mpiexec'''), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform.
+Slurm offers interactive and batch jobs (scripts submitted into the system). The relevant commands are `srun` and `sbatch`. The `srun` command can be used to spawn processes ('''please do not use mpiexec'''), both from the frontend and from within a batch script. You can also get a shell on a node to work locally there (e.g. to compile your application natively for a special platform).

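The two ways of working described in the paragraph above can be sketched as follows. This is a dry-run illustration only: it prints the command lines instead of executing them (no cluster is assumed here), and the partition `dp-cn`, the node/task counts and the program name `MPI_HelloWorld` are simply taken from the examples on this page.

```shell
#!/bin/sh
# Dry-run sketch of the two Slurm modes described above; the commands are
# printed, not executed, since no Slurm installation is assumed here.

# Interactive mode: srun spawns the processes directly (do not use mpiexec)
interactive_cmd="srun -p dp-cn -N 4 -n 8 -t 00:30:00 ./MPI_HelloWorld"

# Batch mode: sbatch submits a job script into the queue
batch_cmd="sbatch hello_cluster.sh"

echo "$interactive_cmd"
echo "$batch_cmd"
```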
== Remark about environment ==
     
First, start a shell on a node. You would like to run your MPI task on 4 machines with 2 tasks per machine:
{{{
-niessen@deepl:src/mpi > srun --partition=sdv -N 4 -n 8 --pty /bin/bash -i
-niessen@deeper-sdv04:/direct/homec/zdvex/niessen/src/mpi >
+[kreutz1@deepv /p/project/cdeep/kreutz1/Temp]$ srun -p dp-cn -N 4 -n 8 -t 00:30:00 --pty /bin/bash -i
+[kreutz1@dp-cn01 /p/project/cdeep/kreutz1/Temp]$
}}}

     
Once you get to the compute node, start your application using {{{srun}}}. Note that the number of tasks used is the same as specified in the initial {{{srun}}} command above (4 nodes with two tasks each):
{{{
-niessen@deeper-sdv04:/direct/homec/zdvex/niessen/src/mpi > srun ./hello_cluster
-srun: cluster configuration lacks support for cpu binding
-Hello world from process 6 of 8 on deeper-sdv07
-Hello world from process 7 of 8 on deeper-sdv07
-Hello world from process 3 of 8 on deeper-sdv05
-Hello world from process 4 of 8 on deeper-sdv06
-Hello world from process 0 of 8 on deeper-sdv04
-Hello world from process 2 of 8 on deeper-sdv05
-Hello world from process 5 of 8 on deeper-sdv06
-Hello world from process 1 of 8 on deeper-sdv04
-}}}
-
-You can ignore the warning about the cpu binding. !ParaStation will pin you processes.
+[kreutz1@deepv Temp]$ srun -p dp-cn -N 4 -n 8 -t 00:30:00 --pty /bin/bash -i
+[kreutz1@dp-cn01 Temp]$ srun ./MPI_HelloWorld
+Hello World from rank 3 of 8 on dp-cn02
+Hello World from rank 7 of 8 on dp-cn04
+Hello World from rank 2 of 8 on dp-cn02
+Hello World from rank 6 of 8 on dp-cn04
+Hello World from rank 0 of 8 on dp-cn01
+Hello World from rank 4 of 8 on dp-cn03
+Hello World from rank 1 of 8 on dp-cn01
+Hello World from rank 5 of 8 on dp-cn03
+}}}
+
+You can ignore potential warnings about the cpu binding. !ParaStation will pin your processes.

=== Running directly from the front ends ===
     

{{{
-niessen@deepl:src/mpi > srun --partition=sdv -N 4 -n 8 ./hello_cluster
-Hello world from process 4 of 8 on deeper-sdv06
-Hello world from process 6 of 8 on deeper-sdv07
-Hello world from process 3 of 8 on deeper-sdv05
-Hello world from process 0 of 8 on deeper-sdv04
-Hello world from process 2 of 8 on deeper-sdv05
-Hello world from process 5 of 8 on deeper-sdv06
-Hello world from process 7 of 8 on deeper-sdv07
-Hello world from process 1 of 8 on deeper-sdv04
+[kreutz1@deepv Temp]$ srun -p dp-cn -N 4 -n 8 -t 00:30:00 ./MPI_HelloWorld
+Hello World from rank 7 of 8 on dp-cn04
+Hello World from rank 3 of 8 on dp-cn02
+Hello World from rank 6 of 8 on dp-cn04
+Hello World from rank 2 of 8 on dp-cn02
+Hello World from rank 4 of 8 on dp-cn03
+Hello World from rank 0 of 8 on dp-cn01
+Hello World from rank 1 of 8 on dp-cn01
+Hello World from rank 5 of 8 on dp-cn03
}}}

     

{{{
-niessen@deepl:src/mpi > salloc --partition=sdv -N 4 -n 8
-salloc: Granted job allocation 955
-niessen@deepl:~/src/mpi>srun ./hello_cluster
-Hello world from process 3 of 8 on deeper-sdv05
-Hello world from process 1 of 8 on deeper-sdv04
-Hello world from process 7 of 8 on deeper-sdv07
-Hello world from process 5 of 8 on deeper-sdv06
-Hello world from process 2 of 8 on deeper-sdv05
-Hello world from process 0 of 8 on deeper-sdv04
-Hello world from process 6 of 8 on deeper-sdv07
-Hello world from process 4 of 8 on deeper-sdv06
-niessen@deepl:~/src/mpi> # several more runs
+[kreutz1@deepv Temp]$ salloc -p dp-cn -N 4 -n 8 -t 00:30:00
+salloc: Granted job allocation 69263
+[kreutz1@deepv Temp]$ srun ./MPI_HelloWorld
+Hello World from rank 7 of 8 on dp-cn04
+Hello World from rank 3 of 8 on dp-cn02
+Hello World from rank 6 of 8 on dp-cn04
+Hello World from rank 2 of 8 on dp-cn02
+Hello World from rank 5 of 8 on dp-cn03
+Hello World from rank 1 of 8 on dp-cn01
+Hello World from rank 4 of 8 on dp-cn03
+Hello World from rank 0 of 8 on dp-cn01
...
-niessen@deepl:~/src/mpi>exit
+# several more runs
+...
+[kreutz1@deepv Temp]$ exit
exit
-salloc: Relinquishing job allocation 955
+salloc: Relinquishing job allocation 69263
}}}

=== Batch script ===

-Given the following script {{{hello_cluster.sh}}}: (it has to be executable):
+Given the following script {{{hello_cluster.sh}}}:

{{{
#!/bin/bash

-#SBATCH --partition=sdv
+#SBATCH --partition=dp-cn
#SBATCH -N 4
#SBATCH -n 8
-#SBATCH -o /homec/zdvex/niessen/src/mpi/hello_cluster-%j.log
-#SBATCH -e /homec/zdvex/niessen/src/mpi/hello_cluster-%j.err
+#SBATCH -o /p/project/cdeep/kreutz1/hello_cluster-%j.out
+#SBATCH -e /p/project/cdeep/kreutz1/hello_cluster-%j.err
#SBATCH --time=00:10:00

-srun ./hello_cluster
+srun ./MPI_HelloWorld
}}}
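In the `-o`/`-e` patterns above, Slurm replaces the `%j` placeholder with the numeric job id at submission time, so each job writes to its own log files. A minimal sketch of that substitution (the filename pattern and the job id are taken from the examples on this page):

```shell
#!/bin/sh
# Sketch: emulate Slurm's %j filename substitution for -o/-e patterns.
pattern="hello_cluster-%j.out"
jobid=69264                     # id assigned by Slurm when the job is submitted
logfile=$(printf '%s\n' "$pattern" | sed "s/%j/$jobid/")
echo "$logfile"                 # hello_cluster-69264.out
```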

{{{
-niessen@deepl:src/mpi > sbatch ./hello_cluster.sh
-Submitted batch job 956
+[kreutz1@deepv Temp]$ sbatch hello_cluster.sh
+Submitted batch job 69264
}}}

{{{
-niessen@deepl:src/mpi > squeue
+[kreutz1@deepv Temp]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
-               956       sdv hello_cl  niessen  R       0:00      4 deeper-sdv[04-07]
+             69264     dp-cn hello_cl  kreutz1 CG       0:04      4 dp-cn[01-04]
}}}
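The `ST` column of `squeue` shows the job state: `R` in the old output means the job is running, while `CG` in the new output means it is completing. A small lookup sketch covering a few common state codes (meanings as documented for `squeue`):

```shell
#!/bin/sh
# Sketch: map a few common squeue job state codes to their meanings.
state_name() {
  case "$1" in
    PD) echo "pending"    ;;   # waiting for a resource allocation
    R)  echo "running"    ;;
    CG) echo "completing" ;;   # job is in the process of finishing
    CD) echo "completed"  ;;
    *)  echo "unknown"    ;;
  esac
}
echo "R  -> $(state_name R)"
echo "CG -> $(state_name CG)"
```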

{{{
-niessen@deepl:src/mpi > cat hello_cluster-956.log
-Hello world from process 5 of 8 on deeper-sdv06
-Hello world from process 1 of 8 on deeper-sdv04
-Hello world from process 7 of 8 on deeper-sdv07
-Hello world from process 3 of 8 on deeper-sdv05
-Hello world from process 0 of 8 on deeper-sdv04
-Hello world from process 2 of 8 on deeper-sdv05
-Hello world from process 4 of 8 on deeper-sdv06
-Hello world from process 6 of 8 on deeper-sdv07
+[kreutz1@deepv Temp]$ cat /p/project/cdeep/kreutz1/hello_cluster-69264.out
+Hello World from rank 6 of 8 on dp-cn04
+Hello World from rank 3 of 8 on dp-cn02
+Hello World from rank 0 of 8 on dp-cn01
+Hello World from rank 4 of 8 on dp-cn03
+Hello World from rank 2 of 8 on dp-cn02
+Hello World from rank 7 of 8 on dp-cn04
+Hello World from rank 5 of 8 on dp-cn03
+Hello World from rank 1 of 8 on dp-cn01
}}}
