=== How can I get a list of broken nodes? ===

The command to use is

{{{
sinfo -Rl -h -o "%n %12U %19H %6t %E" | sort -u
}}}
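
For reference, a commented sketch of the same call (the format specifiers are the standard `sinfo` ones; `dp-cn` in the commented-out variant is just an example partition):

{{{#!sh
# %n = node name, %12U = user who set the reason, %19H = timestamp of the reason,
# %6t = node state (compact form), %E = reason why the node is down/drained
sinfo -Rl -h -o "%n %12U %19H %6t %E" | sort -u
# Restrict the listing to a single partition with -p:
# sinfo -Rl -h -p dp-cn -o "%n %12U %19H %6t %E" | sort -u
}}}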

See also the translation table below.

=== How can I check the status of partitions and nodes? ===

The main command to use is `sinfo`. By default, when called without options, `sinfo` lists the available partitions and the number of nodes in each partition in a given state. For example:

{{{
[deamicis1@deepv hybridhello]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
sdv             up   20:00:00     16   idle deeper-sdv[01-16]
knl             up   20:00:00      1  drain knl01
knl             up   20:00:00      3   idle knl[04-06]
knl256          up   20:00:00      1  drain knl01
knl256          up   20:00:00      1   idle knl05
knl272          up   20:00:00      2   idle knl[04,06]
snc4            up   20:00:00      1   idle knl05
dam             up   20:00:00      1  down* protodam01
dam             up   20:00:00      3   idle protodam[02-04]
extoll          up   20:00:00     16   idle deeper-sdv[01-16]
ml-gpu          up   20:00:00      1   idle ml-gpu01
dp-cn           up   20:00:00      1  drain dp-cn49
dp-cn           up   20:00:00      2  alloc dp-cn[01,50]
dp-cn           up   20:00:00     47   idle dp-cn[02-48]
dp-dam          up   20:00:00      1 drain* dp-dam01
dp-dam          up   20:00:00      1  drain dp-dam02
dp-dam          up   20:00:00     14   down dp-dam[03-16]
dp-sdv-esb      up   20:00:00      2   idle dp-sdv-esb[01-02]
psgw-cluster    up   20:00:00      1  down* nfgw01
psgw-booster    up   20:00:00      1  down* nfgw02
debug           up   20:00:00      1 drain* dp-dam01
debug           up   20:00:00      1  down* protodam01
debug           up   20:00:00      3  drain dp-cn49,dp-dam02,knl01
debug           up   20:00:00     14   down dp-dam[03-16]
debug           up   20:00:00      2  alloc dp-cn[01,50]
debug           up   20:00:00     69   idle deeper-sdv[01-16],dp-cn[02-48],knl[04-06],protodam[02-04]
}}}

Please refer to the man page for `sinfo` for more information.
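
A node-oriented view is also available; for instance (a sketch, with `dp-cn` as an example partition):

{{{#!sh
# -N prints one line per node instead of one line per partition/state group
sinfo -N -l -p dp-cn
}}}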

=== Can I join stderr and stdout like it was done with {{{-joe}}} in Torque? ===
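
In Slurm this is the default behaviour: if only `--output` is specified, both stdout and stderr are written to the same file. The two streams are only separated when `--error` points to a different file. A minimal sketch:

{{{#!sh
#!/bin/bash
#SBATCH --output=job.out
# No --error directive: stderr is merged into job.out
}}}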

=== What's the equivalent of {{{qsub -l nodes=x:ppn=y:cluster+n_b:ppn=p_b:booster}}}? ===

As of version 17.11 of Slurm, heterogeneous jobs are supported. For example, the user can run:

{{{
srun --partition=sdv -N 1 -n 1 hostname : --partition=knl -N 1 -n 1 hostname
deeper-sdv01
knl05
}}}

In order to submit a heterogeneous job, the user needs to set up the batch script similarly to the following:

{{{#!sh
#!/bin/bash

#SBATCH --job-name=imb_execute_1
#SBATCH --account=deep
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH --output=job.out
#SBATCH --error=job.err
#SBATCH --time=00:02:00

# Resources for the first sub-job (sdv partition)
#SBATCH --partition=sdv
#SBATCH --constraint=
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=1

# Separator between the sub-jobs of the heterogeneous job
#SBATCH packjob

# Resources for the second sub-job (knl partition)
#SBATCH --partition=knl
#SBATCH --constraint=
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=1

srun ./app_sdv : ./app_knl
}}}

Here the `packjob` keyword allows defining Slurm parameters for each sub-job of the heterogeneous job.

If you need to load modules before launching the application, it is suggested to create wrapper scripts around the applications and to submit such scripts with `srun`, like this:

{{{#!sh
...
srun ./script_sdv.sh : ./script_knl.sh
}}}

where each script should contain:

{{{#!sh
#!/bin/bash

module load ...
./app_sdv
}}}

This way, it is also possible to load different modules on the different partitions used in the heterogeneous job, as sketched below.
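
For instance (a sketch; module and application names are placeholders), `script_knl.sh` could load a different module stack than `script_sdv.sh`:

{{{#!sh
#!/bin/bash

# Modules specific to the knl part of the job (placeholders)
module load ...
./app_knl
}}}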

=== What is the default binding/pinning behaviour on DEEP? ===

DEEP uses a !ParTec-modified version of Slurm called psslurm. In psslurm, the options concerning binding and pinning differ from the ones provided in vanilla Slurm. By default, psslurm uses a ''by rank'' pinning strategy, assigning each Slurm task to a different physical thread on the node, starting from OS processor 0. For example:

{{{#!sh
[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=1 srun -N 1 -n 4 -p dp-cn ./HybridHello | sort -k9n -k11n
Hello from node dp-cn50, core 0; AKA rank 0, thread 0
Hello from node dp-cn50, core 1; AKA rank 1, thread 0
Hello from node dp-cn50, core 2; AKA rank 2, thread 0
Hello from node dp-cn50, core 3; AKA rank 3, thread 0
}}}
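
The default can be overridden with the `--cpu-bind` option of psslurm (see also the SMT question below). As a sketch, assuming the `sockets` binding unit is supported on the partition at hand, each task can be bound to a whole socket instead of a single thread:

{{{#!sh
srun -N 1 -n 4 -p dp-cn --cpu-bind=sockets ./HybridHello | sort -k9n -k11n
}}}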

**Attention:** please be aware that the psslurm affinity settings only affect the tasks spawned by Slurm. When using threaded applications, the thread affinity is inherited from the task affinity of the process originally spawned by Slurm. For example, for a hybrid MPI-OpenMP application:
{{{#!sh
[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=4 srun -N 1 -n 4 -c 4 -p dp-dam ./HybridHello | sort -k9n -k11n
Hello from node dp-dam01, core 0-3; AKA rank 0, thread 0
Hello from node dp-dam01, core 0-3; AKA rank 0, thread 1
Hello from node dp-dam01, core 0-3; AKA rank 0, thread 2
Hello from node dp-dam01, core 0-3; AKA rank 0, thread 3
Hello from node dp-dam01, core 4-7; AKA rank 1, thread 0
Hello from node dp-dam01, core 4-7; AKA rank 1, thread 1
Hello from node dp-dam01, core 4-7; AKA rank 1, thread 2
Hello from node dp-dam01, core 4-7; AKA rank 1, thread 3
Hello from node dp-dam01, core 8-11; AKA rank 2, thread 0
Hello from node dp-dam01, core 8-11; AKA rank 2, thread 1
Hello from node dp-dam01, core 8-11; AKA rank 2, thread 2
Hello from node dp-dam01, core 8-11; AKA rank 2, thread 3
Hello from node dp-dam01, core 12-15; AKA rank 3, thread 0
Hello from node dp-dam01, core 12-15; AKA rank 3, thread 1
Hello from node dp-dam01, core 12-15; AKA rank 3, thread 2
Hello from node dp-dam01, core 12-15; AKA rank 3, thread 3
}}}

Be sure to explicitly set the thread affinity in your script (e.g. by exporting environment variables) or directly in your code. Taking the previous example:
{{{#!sh
[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=4 OMP_PROC_BIND=close srun -N 1 -n 4 -c 4 -p dp-dam ./HybridHello | sort -k9n -k11n
Hello from node dp-dam01, core 0; AKA rank 0, thread 0
Hello from node dp-dam01, core 1; AKA rank 0, thread 1
Hello from node dp-dam01, core 2; AKA rank 0, thread 2
Hello from node dp-dam01, core 3; AKA rank 0, thread 3
Hello from node dp-dam01, core 4; AKA rank 1, thread 0
Hello from node dp-dam01, core 5; AKA rank 1, thread 1
Hello from node dp-dam01, core 6; AKA rank 1, thread 2
Hello from node dp-dam01, core 7; AKA rank 1, thread 3
Hello from node dp-dam01, core 8; AKA rank 2, thread 0
Hello from node dp-dam01, core 9; AKA rank 2, thread 1
Hello from node dp-dam01, core 10; AKA rank 2, thread 2
Hello from node dp-dam01, core 11; AKA rank 2, thread 3
Hello from node dp-dam01, core 12; AKA rank 3, thread 0
Hello from node dp-dam01, core 13; AKA rank 3, thread 1
Hello from node dp-dam01, core 14; AKA rank 3, thread 2
Hello from node dp-dam01, core 15; AKA rank 3, thread 3
}}}

Please refer to the [https://apps.fz-juelich.de/jsc/hps/jureca/affinity.html affinity page] of the JURECA documentation for more information about how to control affinity on the DEEP system using psslurm options. Please be aware that the partitions on DEEP differ from JURECA in their number of sockets per node and of cores/threads per socket. Refer to the [wiki:System_overview] or run `lstopo-no-graphics` on the compute nodes to get more information about the hardware configuration of the different modules.

=== How do I use SMT on the DEEP CPUs? ===

On DEEP, SMT is enabled by default on all nodes. Please be aware that on all JSC systems (including DEEP), each hardware thread is exposed by the OS as a separate core. For an ''n''-core node with ''m'' hardware threads per core, the OS cores from ''0'' to ''n-1'' correspond to the first hardware thread of all hardware cores (from all sockets), the OS cores from ''n'' to ''2n-1'' to the second hardware thread, and so on.

For instance, on a Cluster node (two sockets with 12 cores each, and 2 hardware threads per core):
{{{
[deamicis1@deepv hybridhello]$ srun -N 1 -n 1 -p dp-cn lstopo-no-graphics --no-caches --no-io --no-bridges --of ascii
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Machine (191GB total)                                                                                                                                    │
│                                                                                                                                                          │
│ ┌──────────────────────────────────────────────────────────────────────────┐  ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ ┌────────────────────────────────────────────────────────────────────┐ │  │ ┌────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ NUMANode P#0 (95GB)                                                │ │  │ │ NUMANode P#1 (96GB)                                                │ │ │
│ │ └────────────────────────────────────────────────────────────────────┘ │  │ └────────────────────────────────────────────────────────────────────┘ │ │
│ │                                                                        │  │                                                                        │ │
│ │ ┌────────────────────────────────────────────────────────────────────┐ │  │ ┌────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Package P#0                                                        │ │  │ │ Package P#1                                                        │ │ │
│ │ │                                                                    │ │  │ │                                                                    │ │ │
│ │ │ ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │ │  │ │ ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │ │ │
│ │ │ │ Core P#0    │  │ Core P#1    │  │ Core P#2    │  │ Core P#3    │ │ │  │ │ │ Core P#0    │  │ Core P#3    │  │ Core P#4    │  │ Core P#8    │ │ │ │
│ │ │ │             │  │             │  │             │  │             │ │ │  │ │ │             │  │             │  │             │  │             │ │ │ │
│ │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │  │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │ │
│ │ │ │ │ PU P#0  │ │  │ │ PU P#1  │ │  │ │ PU P#2  │ │  │ │ PU P#3  │ │ │ │  │ │ │ │ PU P#12 │ │  │ │ PU P#13 │ │  │ │ PU P#14 │ │  │ │ PU P#15 │ │ │ │ │
│ │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │  │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │ │
│ │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │  │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │ │
│ │ │ │ │ PU P#24 │ │  │ │ PU P#25 │ │  │ │ PU P#26 │ │  │ │ PU P#27 │ │ │ │  │ │ │ │ PU P#36 │ │  │ │ PU P#37 │ │  │ │ PU P#38 │ │  │ │ PU P#39 │ │ │ │ │
│ │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │  │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │ │
│ │ │ └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │ │  │ │ └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │ │ │
│ │ │                                                                    │ │  │ │                                                                    │ │ │
│ │ │ ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │ │  │ │ ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │ │ │
│ │ │ │ Core P#4    │  │ Core P#9    │  │ Core P#10   │  │ Core P#16   │ │ │  │ │ │ Core P#9    │  │ Core P#10   │  │ Core P#11   │  │ Core P#16   │ │ │ │
│ │ │ │             │  │             │  │             │  │             │ │ │  │ │ │             │  │             │  │             │  │             │ │ │ │
│ │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │  │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │ │
│ │ │ │ │ PU P#4  │ │  │ │ PU P#5  │ │  │ │ PU P#6  │ │  │ │ PU P#7  │ │ │ │  │ │ │ │ PU P#16 │ │  │ │ PU P#17 │ │  │ │ PU P#18 │ │  │ │ PU P#19 │ │ │ │ │
│ │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │  │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │ │
│ │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │  │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │ │
│ │ │ │ │ PU P#28 │ │  │ │ PU P#29 │ │  │ │ PU P#30 │ │  │ │ PU P#31 │ │ │ │  │ │ │ │ PU P#40 │ │  │ │ PU P#41 │ │  │ │ PU P#42 │ │  │ │ PU P#43 │ │ │ │ │
│ │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │  │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │ │
│ │ │ └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │ │  │ │ └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │ │ │
│ │ │                                                                    │ │  │ │                                                                    │ │ │
│ │ │ ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │ │  │ │ ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │ │ │
│ │ │ │ Core P#18   │  │ Core P#19   │  │ Core P#25   │  │ Core P#26   │ │ │  │ │ │ Core P#17   │  │ Core P#18   │  │ Core P#24   │  │ Core P#26   │ │ │ │
│ │ │ │             │  │             │  │             │  │             │ │ │  │ │ │             │  │             │  │             │  │             │ │ │ │
│ │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │  │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │ │
│ │ │ │ │ PU P#8  │ │  │ │ PU P#9  │ │  │ │ PU P#10 │ │  │ │ PU P#11 │ │ │ │  │ │ │ │ PU P#20 │ │  │ │ PU P#21 │ │  │ │ PU P#22 │ │  │ │ PU P#23 │ │ │ │ │
│ │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │  │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │ │
│ │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │  │ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │ │ │ │
│ │ │ │ │ PU P#32 │ │  │ │ PU P#33 │ │  │ │ PU P#34 │ │  │ │ PU P#35 │ │ │ │  │ │ │ │ PU P#44 │ │  │ │ PU P#45 │ │  │ │ PU P#46 │ │  │ │ PU P#47 │ │ │ │ │
│ │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │  │ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │ │ │ │
│ │ │ └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │ │  │ │ └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │ │ │
│ │ └────────────────────────────────────────────────────────────────────┘ │  │ └────────────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────┘  └──────────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Host: dp-cn50                                                                                                                                            │
│                                                                                                                                                          │
│ Indexes: physical                                                                                                                                        │
│                                                                                                                                                          │
│ Date: Thu 21 Nov 2019 15:22:31 CET                                                                                                                       │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
}}}
The `PU P#X` entries are the processing unit numbers exposed by the OS.
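
As a sketch of the numbering described above (assuming the hwloc tools are available on the compute node): on a Cluster node (''n'' = 24), OS processors ''i'' and ''i+24'' are the two hardware threads of the same physical core. This can be checked with `hwloc-calc`:

{{{#!sh
# Physical indexes of the PUs belonging to the first core;
# given the numbering above, this should report 0 and 24
srun -N 1 -n 1 -p dp-cn hwloc-calc --po -I PU core:0
}}}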

To exploit SMT, simply run a job using a number of tasks × threads per task higher than the number of physical cores available on a node. Please refer to the [https://apps.fz-juelich.de/jsc/hps/jureca/smt.html relevant page] of the JURECA documentation for more information on how to use SMT on the DEEP nodes.

**Attention**: currently the only way of assigning Slurm tasks to hardware threads belonging to the same hardware core is to use the `--cpu-bind` option of psslurm with `mask_cpu`, providing an affinity mask for each task. For example:
{{{#!sh
[deamicis1@deepv hybridhello]$ OMP_NUM_THREADS=2 OMP_PROC_BIND=close OMP_PLACES=threads srun -N 1 -n 2 -p dp-dam --cpu-bind=mask_cpu:$(printf '%x' "$((2#1000000000000000000000000000000000000000000000001))"),$(printf '%x' "$((2#10000000000000000000000000000000000000000000000010))") ./HybridHello | sort -k9n -k11n
Hello from node dp-dam01, core 0; AKA rank 0, thread 0
Hello from node dp-dam01, core 48; AKA rank 0, thread 1
Hello from node dp-dam01, core 1; AKA rank 1, thread 0
Hello from node dp-dam01, core 49; AKA rank 1, thread 1
}}}

This can be cumbersome for jobs using a large number of tasks per node. In such cases, a tool like [https://www.open-mpi.org/projects/hwloc/ hwloc] (currently available on the compute nodes, but not on the login node!) can be used to calculate the affinity masks to be passed to psslurm.
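
As a sketch (assuming `hwloc-calc` from the hwloc suite is available on the compute node), the hexadecimal mask covering both hardware threads of a physical core can be computed instead of being spelled out in binary:

{{{#!sh
# Hex mask covering both PUs (hardware threads) of physical core 0
hwloc-calc --taskset core:0
# The resulting masks can then be passed to --cpu-bind=mask_cpu:...
}}}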

{{{#!comment