[[TOC]]

This page gives a short overview of known issues and provides potential solutions and workarounds. To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the "Message of the day" displayed when logging onto the system.

== Detected HW and node issues ==

=== CM nodes ===

 * dp-cn33: node still offline after memory issues (#2338)
 * dp-cn49 and dp-cn50: nodes currently reserved for a special use case

=== DAM nodes ===

 * dp-dam03: being investigated after an unexpected reboot (#2323)
 * dp-dam07: showing problems with its FPGA (#2353)
 * dp-dam08: issues seen with the CPU in the second socket (#2304)

=== ESB nodes ===

 * dp-esb08: GPU shows a PCIe x8 connection only (#2370)
 * dp-esb11: no GPU device detected, under repair (#2358)
 * dp-esb23: MCE problems (#2350)

== Software issues ==

=== GPU direct usage with IB on ESB ===

 - currently only available via the Developer stage; for testing, load:
{{{
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0   # or: module load Intel
module load ParaStationMPI
}}}
 - set the environment variables `PSP_CUDA=1` and `PSP_UCP=1`

=== GPU direct usage with Extoll on DAM ===

 - a new Extoll driver for GPU direct over Extoll is currently being tested on the DAM nodes
 - only available via the Developer stage; for testing, load:
{{{
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0   # or: module load Intel
module load ParaStationMPI
}}}
 - expect performance and stability issues

=== slurmtop ===

The `slurmtop` tool is not working properly. As a workaround, call it with standard error discarded:
{{{
slurmtop 2> /dev/null
}}}
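
To avoid typing the redirection each time, the `slurmtop` workaround can be wrapped in a shell function, for example in `~/.bashrc`. This is only a sketch: the wrapper is a hypothetical convenience and not something provided on the system, and it assumes that the output being hidden arrives on standard error, as the workaround implies.

```shell
# Hypothetical convenience wrapper, not part of the system setup.
# It shadows the slurmtop name with a function that forwards all
# arguments to the real binary and discards standard error.
slurmtop() {
    # "command" bypasses this function and runs the executable found in PATH
    command slurmtop "$@" 2> /dev/null
}
```

Once the underlying problem is fixed, the wrapper can be dropped again with `unset -f slurmtop`.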