[[TOC]]

This page gives a short overview of known issues and provides potential solutions and workarounds for them.

To stay informed, please also read the information presented in the "Message of the day" when logging onto the system.

== Detected HW and node issues ==

=== CM nodes ===

 * dp-cn33: node still offline after memory issues (#2338)
 * dp-cn49 and dp-cn50: nodes currently reserved for a special use case

=== DAM nodes ===

 * dp-dam03: being investigated after unexpected reboot (#2323)
 * dp-dam07: showing problems with its FPGA (#2353)
 * dp-dam08: issues seen with the CPU in the second socket (#2304)

=== ESB nodes ===

 * dp-esb08: GPU shows PCIe x8 connection only (#2370)
 * dp-esb11: no GPU device detected, under repair (#2358)
 * dp-esb23: MCE problems (#2350)

== Software issues ==

=== GPU direct usage with IB on ESB ===

 - currently only available via the Developer stage; for testing, load:
{{{
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0
module load ParaStationMPI
}}}
 - set `PSP_CUDA=1` and `PSP_UCP=1`

=== GPU direct usage with Extoll on DAM ===

 - a new Extoll driver for GPU direct over Extoll is currently being tested on the DAM nodes
 - only available via the Developer stage; for testing, load:
{{{
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0
module load ParaStationMPI
}}}
 - expect performance and stability issues

=== slurmtop ===

The `slurmtop` tool is not working properly. A workaround is to call it via
{{{
slurmtop 2> /dev/null
}}}
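
To avoid retyping the redirection each time, the `slurmtop` workaround could be wrapped in a small shell function, for example in `~/.bashrc`. This is only a sketch and assumes a Bash-compatible login shell; the wrapper is a user-side convenience and not something provided by the system.
{{{
# Hypothetical convenience wrapper (not part of the system setup):
# "command" skips this function and runs the real slurmtop binary,
# passing all arguments through and discarding stderr.
slurmtop() {
    command slurmtop "$@" 2> /dev/null
}
}}}
With this function defined, a plain `slurmtop` call behaves like the workaround above. Remove the function again with `unset -f slurmtop` once the tool has been fixed.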