This page gives a short overview of known issues and provides potential solutions and workarounds for them.
Last update: 2021-04-09
System maintenance from Tuesday, 2021-04-13 to Thursday, 2021-04-15: no user access!
To stay informed, please refer to the News page. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.
Detected HW and node issues
CM nodes
- dp-cn25: FW issues (#2495)
DAM nodes
- dp-dam[01-08]: limited access due to Fabri3 setup
- dp-dam02: node currently reserved for special use case (#2554)
- dp-dam03: node currently reserved for special use case (#2242)
ESB nodes
- dp-esb[01-25]: currently not available due to Fabri3 installation
- dp-esb75: node currently reserved for special use case (#2568)
SDV nodes
- several nodes have been taken offline:
  - deeper-sdv[11-16]
  - deeper storage system (deeper-fs[01-03], deeper-raids)
- deeper-sdv[01-10]: currently not available: configuration change needed (low priority)
- knl01: NVMe issues (#2011)
Software issues
SLURM jobs
- with the introduction of accounting at the start of the early access program, user accounts within SLURM need to be re-configured to assign the correct QOS levels and job priorities
- this might lead to (temporarily) failing job starts for certain users
- if you cannot start jobs via SLURM, please write an email to the support list: sup(at)deep-est.eu
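When reporting a failing job start, it may help to include the account and QOS associations SLURM currently holds for your user. The following is a minimal sketch, assuming the standard SLURM client tool `sacctmgr` is available on the login node; the function name `show_my_slurm_assoc` is hypothetical and not part of the system setup.

```shell
#!/usr/bin/env bash
# Sketch only: print the SLURM associations (account, partition, QOS)
# for a user, so the output can be attached to a support email.
# show_my_slurm_assoc is a hypothetical helper name.

show_my_slurm_assoc() {
    # Default to the current user if no name is given.
    local user="${1:-${USER:-$(id -un)}}"

    # Bail out with a clear message where the SLURM tools are missing.
    if ! command -v sacctmgr >/dev/null 2>&1; then
        echo "sacctmgr not found; run this on a node with SLURM client tools" >&2
        return 1
    fi

    # One line per association, fields separated by '|'.
    sacctmgr --noheader --parsable2 show associations \
        user="$user" format=Account,Partition,QOS
}
```

Calling `show_my_slurm_assoc` without arguments reports on the current user. An empty result would suggest that no association has been set up yet, which is worth mentioning in the email.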
GPU direct usage with IB on ESB
- only available via the Developer stage; for testing, load:
  ml --force purge
  ml use $OTHERSTAGES
  ml load Stages/Devel-2020
  ml load Intel
  ml load ParaStationMPI
- use PSP_CUDA=1 and PSP_UCP=1
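The module commands and the two variables listed above can be combined into one setup step. This is a minimal sketch, not an official job template: the function name `setup_gpu_direct_env` and the commented launch line are assumptions made for illustration. Since `ml` is usually provided as a shell function, the snippet is meant to be sourced from a login shell or job script rather than executed as a separate process.

```shell
#!/usr/bin/env bash
# Sketch only: prepare a shell for testing GPU direct over IB on ESB.
# setup_gpu_direct_env is a hypothetical helper name.

setup_gpu_direct_env() {
    # Load the Developer stage only where the module tool exists,
    # so the function is harmless on machines without it.
    if command -v ml >/dev/null 2>&1; then
        ml --force purge
        ml use "$OTHERSTAGES"
        ml load Stages/Devel-2020
        ml load Intel
        ml load ParaStationMPI
    else
        echo "ml not found; skipping module setup" >&2
    fi

    # Set the two variables named in the list above.
    export PSP_CUDA=1
    export PSP_UCP=1
}

setup_gpu_direct_env

# Hypothetical launch line; ./my_gpu_app is a placeholder binary:
# srun -N 2 ./my_gpu_app
```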
GPU direct usage with Extoll on DAM
- new Extoll driver for GPU direct over Extoll still shows low performance on the DAM nodes
- available via the Developer stage; for testing, load:
  ml --force purge
  ml use $OTHERSTAGES
  ml load Stages/Devel-2020
  ml load Intel
  ml load ParaStationMPI
- expect performance (and maybe also stability) issues