
This page gives a short overview of known issues and provides potential solutions and workarounds.

To stay informed, please refer to the News page. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.

Detected HW and node issues

CM nodes

  • dp-cn09 - dp-cn16: nodes currently reserved for a special use case during working hours
  • dp-cn49: node currently reserved for a special use case

DAM nodes

  • dp-dam03: node currently reserved for a special use case
  • dp-dam04: showing low streams performance (#2401)
  • dp-dam05: node currently reserved for a special use case
  • dp-dam07: showing problems with its FPGA (#2353)
  • dp-dam08: issues seen with the CPU on the second socket (#2304)

ESB nodes

  • dp-esb08: GPU shows PCIe x8 connection only (#2370)
  • dp-esb11: no GPU device detected, under repair (#2358)
  • dp-esb23: MCE problems (#2350)
  • dp-esb24: offline due to spontaneous reboot

SDV ESB nodes

  • dp-sdv-esb[01,02]: replacement of V100 cards

Software issues

LDAP error message during login

  • frequent failovers between the LDAP master nodes are being observed and are under investigation
  • currently, a failover might lead to the following (or a similar) error message during login:
Error: ldap_search: failed to open connection to LDAP server(s) and search. Exception: socket connection error while opening: [Errno 111] Connection refused 
  • the message can usually be ignored
  • in addition, some environment variables, e.g. $PROJECT, might not be set (properly); a quick check is sketched below
  • if you see further issues or cannot log in at all, please write an email to the support list: sup(at)deep-est.eu
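
To check whether the login environment was set up completely after such a failover, you can inspect the affected variables. A minimal check, assuming bash (the variable selection is just an example):

echo "PROJECT=${PROJECT:-<unset>}"    # prints <unset> if the variable is missing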

SLURM jobs

  • due to the introduction of accounting with the start of the early access program, some re-configuration of user accounts is needed within SLURM to assign the correct QOS levels and priorities to jobs
  • this might lead to (temporarily) failing job starts for certain users; you can inspect your current settings as sketched below
  • if you cannot start jobs via SLURM, please write an email to the support list: sup(at)deep-est.eu
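
To see which account and QOS levels are assigned to you, and why a job might be stuck pending, the standard SLURM accounting tools can be used; a small sketch (the output columns are just examples):

sacctmgr show associations user=$USER format=Cluster,Account,User,QOS    # list your accounts and assigned QOS levels
squeue -u $USER -o "%.10i %.9P %.8T %.20R"    # show your jobs including the pending reason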

/sdv-work corrupted

  • due to failing disks, the SDV work filesystem mounted at /sdv-work got corrupted and has to be rebuilt
  • the metadata still seems to be intact, so directories and files are visible, but no file access is possible
  • it is not yet clear whether any data can be recovered, since work filesystems are not covered by backups

GPU direct usage with IB on ESB

  • currently only available via the Devel stage; for testing, load:
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0    # alternatively: module load Intel
module load ParaStationMPI
  • set the environment variables PSP_CUDA=1 and PSP_UCP=1, as shown below
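
For example (a minimal sketch; the application name and node/task counts are placeholders):

export PSP_CUDA=1    # enable CUDA awareness in ParaStation MPI
export PSP_UCP=1     # use the UCX transport for GPU direct over IB
srun -N 2 -n 2 ./my_cuda_aware_app    # placeholder for your CUDA-aware MPI binary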

GPU direct usage with Extoll on DAM

  • a new Extoll driver for GPU direct over Extoll is currently being tested on the DAM nodes
  • it is only available via the Devel stage; for testing, load:
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0    # alternatively: module load Intel
module load ParaStationMPI
  • expect performance and stability issues; a quick sanity check is sketched below
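
A quick way to confirm that the Devel stage is active and that MPI jobs start at all (the partition name and test binary are placeholders, not confirmed by this page):

module list 2>&1 | grep -E "Devel-2019a|ParaStationMPI"    # module list prints to stderr
srun --partition=dp-dam -N 2 -n 2 ./mpi_test    # placeholder for a small MPI test program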

Horovod

  • currently only works with the Devel stage

slurmtop

The slurmtop tool is not working properly; a workaround is to call it via

slurmtop 2> /dev/null
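
If you use slurmtop regularly, the redirect can be made permanent with a shell alias, e.g. in your ~/.bashrc (assuming bash):

echo "alias slurmtop='slurmtop 2> /dev/null'" >> ~/.bashrc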