[[TOC]]

This page gives a short overview of known issues and provides potential solutions and workarounds. To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.

== Detected HW and node issues ==

=== CM nodes ===

{{{#!comment
JK 2020-04-24: node back online
 * dp-cn08: node offline after memory issues (#2385)
}}}
 * dp-cn09 - dp-cn16: nodes currently reserved for a special use case during working hours
 * dp-cn49: node currently reserved for a special use case

=== DAM nodes ===

 * dp-dam03: node currently reserved for a special use case
 * dp-dam04: showing low streams performance (#2401)
 * dp-dam05: node currently reserved for a special use case
 * dp-dam07: showing problems with its FPGA (#2353)
 * dp-dam08: issues seen with the CPU on the second socket (#2304)

=== ESB nodes ===

 * dp-esb08: GPU shows a PCIe x8 connection only (#2370)
 * dp-esb11: no GPU device detected, under repair (#2358)
 * dp-esb23: MCE problems (#2350)
 * dp-esb24: offline due to a spontaneous reboot

=== SDV ESB nodes ===

 * dp-sdv-esb[01,02]: replacement of V100 cards

== Software issues ==

=== LDAP error message during login ===

 - frequent failovers between the master nodes are being observed and investigated
 - currently, a failover might lead to the following or a similar error message during login:
{{{
Error: ldap_search: failed to open connection to LDAP server(s) and search.
Exception: socket connection error while opening: [Errno 111] Connection refused
}}}
 - the message can usually be ignored
 - in addition, some of the environment variables, e.g. `$PROJECT`, may not be set (properly); a quick check is sketched below
 - if you see further issues or cannot log in at all, please write an email to the support list: `sup(at)deep-est.eu`
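If you suspect that a failover affected your session, you can check from the shell whether the login environment was set up completely. This is a minimal sketch; the exact set of affected variables can vary:
{{{
# $PROJECT should normally point to your project directory; if it is
# empty, the login environment was probably not initialized properly.
if [ -z "$PROJECT" ]; then
    echo "PROJECT is not set - an LDAP failover may have affected this login"
fi
}}}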
=== SLURM jobs ===

 - due to the introduction of accounting with the start of the early access program, some re-configuration of user accounts is needed within SLURM to assign the correct QOS levels and priorities to the jobs
 - this might lead to (temporarily) failing job starts for certain users
 - if you cannot start jobs via SLURM, please write an email to the support list: `sup(at)deep-est.eu` (a first check of your own account settings is sketched below)
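The standard SLURM accounting query shows which accounts and QOS levels are currently assigned to your user; the format fields below are an illustrative selection:
{{{
# List the SLURM associations (account, partition, QOS) of the current user
sacctmgr show associations user=$USER format=Account,User,Partition,QOS%40
}}}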
=== /sdv-work corrupted ===

 - due to failing disks, the SDV work filesystem mounted at `/sdv-work` got corrupted and has to be rebuilt
 - the metadata still seems to be intact, so directories and files are visible, but no file access is possible
 - it is unclear whether any data can be recovered, since the work filesystems are not covered by backups
=== GPU direct usage with IB on ESB ===

 - currently only available via the Developer stage; for testing, load:
{{{
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0        # alternatively: module load Intel
module load ParaStationMPI
}}}
 - use `PSP_CUDA=1` and `PSP_UCP=1` (a usage sketch follows below)
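Putting this together, a job step with CUDA awareness enabled in ParaStation MPI might look like the following; the application name and the node/task counts are placeholders:
{{{
# Enable CUDA awareness and the UCP transport in ParaStation MPI,
# then launch the job step as usual
export PSP_CUDA=1
export PSP_UCP=1
srun --nodes=2 --ntasks-per-node=1 ./my_gpu_app
}}}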
=== GPU direct usage with Extoll on DAM ===

 - a new Extoll driver for GPU direct over Extoll is currently being tested on the DAM nodes
 - it is only available via the Developer stage; for testing, load:
{{{
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0        # alternatively: module load Intel
module load ParaStationMPI
}}}
 - expect performance and stability issues
=== Horovod ===

 - currently only working with the Developer stage
=== slurmtop ===

The `slurmtop` tool is not working properly; as a workaround, call it via
{{{
slurmtop 2> /dev/null
}}}
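If you use the tool regularly, a shell alias can apply the workaround automatically; this is a convenience suggestion, not part of the official setup:
{{{
# Suppress the spurious stderr output of slurmtop on every call;
# `command` prevents the alias from invoking itself recursively.
alias slurmtop='command slurmtop 2> /dev/null'
}}}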