wiki:Public/User_Guide/PaS

Version 9 (modified by Jochen Kreutz, 4 years ago)

info on slurmtop added

This page gives a short overview of known issues and provides potential solutions and workarounds for them.

To stay informed, please also read the "Message of the day" shown when logging in to the system.

Detected HW and node issues

CM nodes

  • dp-cn33: node still offline after memory issues (#2338)
  • dp-cn49 and dp-cn50: nodes currently reserved for a special use case

DAM nodes

  • dp-dam03: being investigated after unexpected reboot (#2323)
  • dp-dam07: showing problems with its FPGA (#2353)
  • dp-dam08: issues seen with the CPU in the second socket (#2304)

ESB nodes

  • dp-esb08: GPU shows PCIe x8 connection only (#2370)
  • dp-esb11: no GPU device detected, under repair (#2358)
  • dp-esb23: MCE problems (#2350)
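
Until the ESB nodes listed above are repaired, a job can avoid them with Slurm's --exclude option. The following is only a sketch: the node list is copied from this page and will go stale, and the node count and ./my_app are hypothetical placeholders.

```shell
#!/bin/bash
# ESB nodes listed above as problematic (copied from this page; update as needed).
BAD_NODES="dp-esb08,dp-esb11,dp-esb23"

# Print the command instead of running it; drop "echo" to launch for real.
# Assumption: the node count and ./my_app are placeholders.
echo srun --exclude="$BAD_NODES" -N 2 ./my_app
```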

Software issues

GPU direct usage with IB on ESB

  • currently only available via the Developer stage; for testing, load:
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0
module load ParaStationMPI
  • set PSP_CUDA=1 and PSP_UCP=1 in the job environment
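
Put together, the steps above might look as follows in a job script. This is a minimal sketch, not an official recipe: the function name, the guard around the module commands, and the commented-out srun line with ./my_cuda_app are illustrative assumptions.

```shell
#!/bin/bash
# Sketch: prepare the environment for GPU direct over IB on ESB nodes.

setup_gpudirect_env() {
    # The module command only exists on the cluster; skip it elsewhere.
    if type module >/dev/null 2>&1; then
        module --force purge
        module use "$OTHERSTAGES"
        module load Stages/Devel-2019a
        module load GCC/8.3.0
        module load ParaStationMPI
    fi
    # The two settings named above, exported so that child processes see them.
    export PSP_CUDA=1
    export PSP_UCP=1
}

setup_gpudirect_env
# srun -N 2 ./my_cuda_app   # hypothetical application launch
```

Exporting the variables (rather than only assigning them) matters because the MPI processes started later are child processes of the script.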

GPU direct usage with Extoll on DAM

  • a new Extoll driver for GPU direct over Extoll is currently being tested on the DAM nodes
  • only available via the Developer stage; for testing, load:
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0
module load ParaStationMPI
  • expect performance and stability issues

slurmtop

The slurmtop tool is currently not working properly. As a workaround, call it with its error output discarded:

slurmtop 2> /dev/null
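
To avoid typing the redirection each time, the workaround can be wrapped in a small shell function, for example in ~/.bashrc. The function name slurmtop_quiet is a hypothetical choice; any arguments are passed through to slurmtop unchanged.

```shell
# Wrapper around slurmtop that discards its error output.
# "command" makes sure the real slurmtop executable is called,
# not a shell function or alias of the same name.
slurmtop_quiet() {
    command slurmtop "$@" 2> /dev/null
}
```

After sourcing the file, run slurmtop_quiet instead of slurmtop.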