This page gives a short overview of known issues and provides potential solutions and workarounds.
To stay informed, please refer to the News page. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.
Detected HW and node issues
CM nodes
- dp-cn08: node offline after memory issues (#2385)
- dp-cn09: node currently reserved for special use case
- dp-cn10: node currently reserved for special use case
- dp-cn29: node still offline after memory issues (#2395)
- dp-cn49: node currently reserved for special use case
DAM nodes
- dp-dam03: node currently reserved for special use case
- dp-dam07: showing problems with its FPGA (#2353)
- dp-dam08: issues seen with the CPU on the second socket (#2304)
- dp-dam09: node currently reserved for special use case
- dp-dam10: node currently reserved for special use case
ESB nodes
- dp-esb08: GPU shows PCIe x8 connection only (#2370)
- dp-esb11: no GPU device detected, under repair (#2358)
- dp-esb23: MCE problems (#2350)
- dp-esb24: offline due to spontaneous reboot
Software issues
LDAP error message during login
- currently, a failover between the two master nodes might lead to the following (or a similar) error message during login:
Error: ldap_search: failed to open connection to LDAP server(s) and search. Exception: socket connection error while opening: [Errno 111] Connection refused
- the message can usually be ignored
- if you see further issues or cannot login at all, please write an email to the support list:
sup(at)deep-est.eu
SLURM jobs
- due to the introduction of accounting with the start of the early access program, some re-configuration of user accounts is needed within SLURM to assign the correct QOS levels and priorities for jobs
- this might lead to (temporarily) failing job starts for certain users (see below for how to check your current assignments)
- if you cannot start jobs via SLURM, please write an email to the support list:
sup(at)deep-est.eu
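Before contacting support, you can check which accounts and QOS levels are currently assigned to your user with the standard SLURM accounting tools; this is a generic sketch, the format options are only a suggestion:
sacctmgr show associations user=$USER format=Cluster,Account,User,QOS
squeue -u $USER -l    # the REASON column indicates why a pending job does not start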
/sdv-work corrupted
- due to failing disks, the SDV work filesystem mounted at
/sdv-work
got corrupted and has to be rebuilt; the metadata still seems to be OK, so directories and files can be seen, but no file access is possible
- it is not yet clear whether any data can be recovered, since work filesystems are not included in the backup
GPU direct usage with IB on ESB
- currently only available via the Developer stage; for testing, load:
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0
or
module load Intel
module load ParaStationMPI
- use PSP_CUDA=1 and PSP_UCP=1 (see the job sketch below)
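Below is a minimal batch job sketch for testing GPU direct over IB on the ESB nodes. The module and PSP settings are taken from above; the partition name dp-esb, the node count, and the application name are assumptions for illustration only:
#!/bin/bash
#SBATCH --partition=dp-esb        # assumed partition name for the ESB nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00

# load the Devel stage as described above
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0             # or: module load Intel
module load ParaStationMPI

# enable CUDA awareness and the UCP plugin in ParaStationMPI/pscom
export PSP_CUDA=1
export PSP_UCP=1

srun ./my_cuda_aware_mpi_app      # placeholder for your own binary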
GPU direct usage with Extoll on DAM
- a new Extoll driver for GPU direct over Extoll is currently being tested on the DAM nodes
- only available via the Developer stage; for testing, load the modules below (see also the sketch after this list):
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0
or
module load Intel
module load ParaStationMPI
- expect performance and stability issues
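For a quick interactive test on the DAM nodes, the same Devel stage modules can be loaded inside an allocation; the partition name dp-dam is an assumption for illustration, and no Extoll-specific environment settings are shown since these depend on the driver version under test:
salloc --partition=dp-dam --nodes=2 --time=00:30:00   # assumed partition name for the DAM nodes
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load Intel                                     # or: module load GCC/8.3.0
module load ParaStationMPI
srun --ntasks-per-node=1 ./my_mpi_test                # placeholder for your own binary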
slurmtop
The slurmtop tool is not working properly; a workaround is to call it via
slurmtop 2> /dev/null
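To avoid typing the redirection every time, a shell alias can be defined, e.g. in your ~/.bashrc:
alias slurmtop='slurmtop 2> /dev/null'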