[[TOC]]

This page gives a short overview of known issues and provides potential solutions and workarounds for them.

''Last update: 2021-12-10''

[[span(style=color: #FF0000, System maintenance from Tuesday, 2021-12-14 to Thursday, 2021-12-16, limited user access!)]]

{{{#!comment
highlighted red text
[[span(style=color: #FF0000, System maintenance from Monday, 2020-09-07 to Friday, 2020-09-11, no user access !)]]
}}}

To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.

== Detected HW and node issues ==
=== CM nodes ===
 * dp-cn05: memory issue - node at Megware for repair (#2682)
 * dp-cn25: FW issues (#2495)
 * dp-cn42: memory issue (#2675)
 * dp-cn[47-50]: Rocky Linux testbed

=== DAM nodes ===
 * dp-dam08: memory issues (#2722)

=== ESB nodes ===
{{{#!comment
JK: EM client has been fixed
[[span(style=color: #FF0000, Currently facing issues in reading the ESB Energy Meter leading to nodes going offline.
A fix is ready for roll-out)]]
}}}
 * dp-esb[01-25]: currently being prepared as Rocky Linux testbed
 * dp-esb75: node currently reserved for special use case (#2568)

=== SDV nodes ===
 * deeper-sdv cluster nodes (Haswell) have been taken offline: deeper-sdv[01-16]
   - not included in SLURM anymore
   - deeper-sdv[01-10] will be used for testing
 * knl01: NVMe issues (#2011)

== Software issues ==
=== SLURM jobs ===
 - due to the introduction of accounting, user accounts within SLURM need some re-configuration to assign the correct QOS levels and priorities for the jobs
   * this might lead to (temporarily) failing job starts for certain users
   * if you cannot start jobs via SLURM, please write an email to the support list: `sup(at)deep-sea.eu`

{{{#!comment
JK: invalid
=== GPU direct usage with Extoll on DAM ===
 - new Extoll driver for GPU direct over Extoll still shows low performance on the DAM nodes
 - available via Developer stage, for testing load:
{{{
ml --force purge
ml use $OTHERSTAGES
ml load Stages/Devel-2020
ml load Intel
ml load ParaStationMPI
}}}
 - expect performance (and maybe also stability) issues
}}}