[[TOC]]

This page gives a short overview of known issues and provides potential solutions and workarounds. ''Last update: 2021-02-26''

[[span(style=color: #FF0000, CM, DAM, ESB access is limited to project-related activities! Please use "--reservation=maint-ticket_2600" for your jobs; an example batch script is given at the end of this page)]]

{{{#!comment highlighted red text
[[span(style=color: #FF0000, System maintenance from Monday, 2020-09-07 to Friday, 2020-09-11, no user access!)]]
}}}

To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.

== Detected HW and node issues ==

=== CM nodes ===
* dp-cn25: FW issues (#2495)

=== DAM nodes ===
* dp-dam[01-08]: limited access due to Fabri3 setup
* dp-dam02: node currently reserved for a special use case (#2554)
* dp-dam03: node currently reserved for a special use case (#2242)

=== ESB nodes ===
{{{#!comment JK: EM client has been fixed
[[span(style=color: #FF0000, Currently facing issues in reading the ESB Energy Meter leading to nodes going offline. A fix is ready for roll-out)]]
}}}
* **dp-esb[01-25]: currently not available due to Fabri3 installation**
* dp-esb75: node currently reserved for a special use case (#2568)

=== SDV nodes ===
* several nodes have been taken offline:
  - deeper-sdv[11-16]
  - deeper storage system (deeper-fs[01-03], deeper-raids)
* deeper-sdv[01-10]: currently not available; a configuration change is needed (low priority)
* knl01: NVMe issues (#2011)

== Software issues ==

=== SLURM jobs ===
- Due to the introduction of accounting with the start of the early access program, some re-configuration of user accounts is needed within SLURM to assign the correct QOS levels and priorities to jobs.
- This might lead to (temporarily) failing job starts for certain users.
- If you cannot start jobs via SLURM, please write an email to the support list: `sup(at)deep-est.eu`

=== GPU direct usage with IB on ESB ===
- only available via the Developer stage; for testing, load:
{{{
ml --force purge
ml use $OTHERSTAGES
ml load Stages/Devel-2020
ml load Intel
ml load ParaStationMPI
}}}
- use `PSP_CUDA=1` and `PSP_UCP=1` (see the launch sketch at the end of this page)

=== GPU direct usage with Extoll on DAM ===
- the new Extoll driver for GPU direct over Extoll still shows low performance on the DAM nodes
- available via the Developer stage; for testing, load:
{{{
ml --force purge
ml use $OTHERSTAGES
ml load Stages/Devel-2020
ml load Intel
ml load ParaStationMPI
}}}
- expect performance (and maybe also stability) issues
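
For both GPU direct variants above, the `PSP_*` settings are plain environment variables read by ParaStation MPI at launch time. A minimal launch sketch, assuming the Devel-2020 stage has already been loaded as shown above; the partition name, node/task counts and the binary `./app` are placeholders, not taken from this page:
{{{
# ParaStation MPI settings for GPU direct (as recommended above)
export PSP_CUDA=1
export PSP_UCP=1

# placeholder launch line: adjust partition, sizes and binary to your job
srun --partition=dp-esb -N 2 -n 2 ./app
}}}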
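
The maintenance reservation from the access note at the top of this page is passed to SLURM like any other reservation, either on the `srun`/`sbatch` command line or as a batch directive. A minimal batch script sketch; only the reservation name is taken from this page, all other values are placeholders to adapt to your job:
{{{
#!/bin/bash
#SBATCH --reservation=maint-ticket_2600   # reservation named in the access note above
#SBATCH --partition=dp-cn                 # placeholder: pick the module you target
#SBATCH --nodes=1                         # placeholder resources
#SBATCH --time=00:30:00

srun ./app                                # placeholder binary
}}}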