[[TOC]]

This page gives a short overview of known issues and provides potential solutions and workarounds for them.

''Last update: 2020-09-01''

{{{#!comment
highlighted red text
[[span(style=color: #FF0000, 2020-05-12: Currently no login possible )]]
}}}

To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.

== Detected HW and node issues ==

=== CM nodes ===
 * dp-cn[09,10]: nodes currently reserved for special use case during working hours
 * dp-cn11: node was not responding (#2426)
 * dp-cn24: thermal trip asserted (#2443, #2306)
 * dp-cn41: node not responding (#2477)

=== DAM nodes ===
 * dp-dam03: node currently reserved for special use case (#2242)
 * dp-dam04: showing low streams performance (#2401)
 * dp-dam05: node currently reserved for special use case
 * dp-dam07: showing problems with its FPGA (#2353)
 * dp-dam[09,10]: nodes currently reserved for special use case during working hours

=== ESB nodes ===
[[span(style=color: #FF0000, Currently facing issues in reading the ESB Energy Meter leading to nodes going offline.
A fix is ready for roll-out)]]
 * dp-esb02: energy meter reading issues
 * dp-esb03: energy meter reading issues (#2466)
 * dp-esb11: wrong GPU Link Speed detected (#2358)
 * dp-esb23: MCE problems (#2350)
 * dp-esb24: CentOS8 Testbed (#2396)
 * dp-esb28: no access to BMC (#2430)
 * dp-esb33: no access to BMC (#2429)
 * dp-esb38: no access to BMC
 * dp-esb39: energy meter reading issues (#2432)
 * dp-esb52: energy meter reading issues (#2433)
 * dp-esb71: energy meter reading issues (#2432)
 * dp-esb73: energy meter reading issues (#2433)

=== SDV nodes ===
 * deeper-sdv01: node reachable via network, but marked as down in SLURM
 * nfgw[01,02]: nodes reachable via network, but marked as down in SLURM
 * knl01: NVMe issues (#2011)

{{{#!comment
JK: status to be clarified on Thursday, 2020-09-03
== Software issues ==

=== LDAP error message during login ===
 - frequent failovers between the master nodes are being observed and investigated
 - currently, a failover might lead to the following or a similar error message during login:
{{{
Error: ldap_search: failed to open connection to LDAP server(s) and search.
Exception: socket connection error while opening: [Errno 111] Connection refused
}}}
 - the message can usually be ignored
 - in addition, some of the environment variables, e.g.
`$PROJECT`, are not set (properly)
 - if you see further issues or cannot log in at all, please write an email to the support list: `sup(at)deep-est.eu`

=== SLURM jobs ===
 - due to the introduction of accounting with the start of the early access program, user accounts within SLURM need some re-configuration to assign the correct QOS levels and priorities for the jobs
 - this might lead to (temporarily) failing job starts for certain users
 - if you cannot start jobs via SLURM, please write an email to the support list: `sup(at)deep-est.eu`

=== /sdv-work corrupted ===
 - due to failing disks, the SDV work filesystem mounted at `/sdv-work` got corrupted and has to be rebuilt
 - metadata still seems to be ok, so directories and files can be seen, but no file access is possible
 - it is unclear whether any data can be recovered, since work filesystems are not backed up

=== GPU direct usage with IB on ESB ===
 - currently only available via the Developer stage; for testing, load:
{{{
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0    # or: module load Intel
module load ParaStationMPI
}}}
 - use `PSP_CUDA=1` and `PSP_UCP=1`

=== GPU direct usage with Extoll on DAM ===
 - a new Extoll driver for GPU direct over Extoll is currently being tested on the DAM nodes
 - only available via the Developer stage; for testing, load:
{{{
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0    # or: module load Intel
module load ParaStationMPI
}}}
 - expect performance and stability issues

=== Horovod ===
 - currently only working with the Developer stage

=== slurmtop ===
The `slurmtop` tool is not working properly. A workaround is to call it via
{{{
slurmtop 2> /dev/null
}}}
}}}