This page gives a short overview of known issues and provides potential solutions and workarounds for them.
To stay informed, please also read the information presented in the "Message of the day" when logging onto the system.
Detected HW and node issues
CM nodes
- dp-cn33: node still offline after memory issues (#2338)
- dp-cn49 and dp-cn50: nodes currently reserved for special use case
DAM nodes
- dp-dam03: being investigated after unexpected reboot (#2323)
- dp-dam07: showing problems with its FPGA (#2353)
- dp-dam08: issues seen with the CPU in the second socket (#2304)
ESB nodes
- dp-esb08: GPU shows PCIe x8 connection only (#2370)
- dp-esb11: no GPU device detected, under repair (#2358)
- dp-esb23: MCE problems (#2350)
Software issues
GPU direct usage with IB on ESB
- currently only available via the Developer stage; for testing, load:
    module --force purge
    module use $OTHERSTAGES
    module load Stages/Devel-2019a
    module load GCCcore/.8.3.0
    module load GCC/8.3.0    # or: module load Intel
    module load ParaStationMPI
- use PSP_CUDA=1 and PSP_UCP=1
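The steps above can be collected into a small shell function, for example at the top of a job script. This is only a minimal sketch: the function name setup_gpu_direct_env is hypothetical, and it assumes that PSP_CUDA and PSP_UCP are meant to be exported as environment variables before the application is started. The module names are the ones listed above. On machines without the module command the module setup is skipped.

```shell
# Sketch: prepare the environment for GPU direct tests with IB on ESB.
# setup_gpu_direct_env is a hypothetical helper name, not a site-provided tool.
setup_gpu_direct_env() {
    if command -v module >/dev/null 2>&1; then
        # Module sequence as listed on this page (Developer stage).
        module --force purge
        module use "$OTHERSTAGES"
        module load Stages/Devel-2019a
        module load GCCcore/.8.3.0
        module load GCC/8.3.0          # or: module load Intel
        module load ParaStationMPI
    else
        echo "module command not found, skipping module setup" >&2
    fi

    # Assumption: both settings are read from the environment at program start.
    export PSP_CUDA=1
    export PSP_UCP=1
}

setup_gpu_direct_env
```

After the call, the application can be launched from the same shell or job script so that it inherits both settings.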
GPU direct usage with Extoll on DAM
- new Extoll driver for GPU direct over Extoll currently being tested on the DAM nodes
- only available via the Developer stage; for testing, load:
    module --force purge
    module use $OTHERSTAGES
    module load Stages/Devel-2019a
    module load GCCcore/.8.3.0
    module load GCC/8.3.0    # or: module load Intel
    module load ParaStationMPI
- expect performance and stability issues
slurmtop
The slurmtop tool is not working properly. As a workaround, call it via
slurmtop 2> /dev/null
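To avoid retyping the redirection, the workaround can be wrapped in a shell function, for instance in ~/.bashrc. This is a sketch only; the name slurmtop_quiet is hypothetical, and the wrapper does nothing beyond discarding the error output as described above.

```shell
# Hypothetical convenience wrapper: run slurmtop with stderr discarded.
# Any arguments are passed through unchanged.
slurmtop_quiet() {
    slurmtop "$@" 2> /dev/null
}
```

Calling slurmtop_quiet then behaves like the command line shown above.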