[[TOC]]

This page gives a short overview of known issues and provides potential solutions and workarounds. To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.

== Detected HW and node issues ==

=== CM nodes ===

{{{#!comment
JK 2020-04-24: node back online
 * dp-cn08: node offline after memory issues (#2385)
}}}
 * dp-cn09 - dp-cn16: nodes currently reserved for a special use case during working hours
 * dp-cn49: node currently reserved for a special use case

=== DAM nodes ===

 * dp-dam03: node currently reserved for a special use case
 * dp-dam04: showing low streams performance (#2401)
 * dp-dam05: node currently reserved for a special use case
 * dp-dam07: showing problems with its FPGA (#2353)
 * dp-dam08: issues seen with the CPU on the second socket (#2304)

=== ESB nodes ===

 * dp-esb08: GPU shows a PCIe x8 connection only (#2370)
 * dp-esb11: no GPU device detected, under repair (#2358)
 * dp-esb23: MCE problems (#2350)
 * dp-esb24: offline due to a spontaneous reboot

=== SDV ESB nodes ===

 * dp-sdv-esb[01,02]: replacement of V100 cards

== Software issues ==

=== LDAP error message during login ===

 - frequent failovers between the master nodes are being observed and investigated
 - currently, a failover might lead to the following or a similar error message during login:
{{{
Error: ldap_search: failed to open connection to LDAP server(s) and search.
Exception: socket connection error while opening: [Errno 111] Connection refused
}}}
 - the message can usually be ignored
 - in addition, some of the environment variables, e.g. `$PROJECT`, may not be set (properly); a quick check is sketched below
 - if you see further issues or cannot log in at all, please write an email to the support list: `sup(at)deep-est.eu`
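If you suspect that a failover affected your session, you can check from the shell whether the login environment was set up completely. This is a minimal sketch; the exact set of affected variables can vary:
{{{
# $PROJECT should normally point to your project directory; if it is
# empty, the login environment was probably not initialized properly.
if [ -z "$PROJECT" ]; then
    echo "PROJECT is not set - an LDAP failover may have affected this login"
fi
}}}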
=== SLURM jobs ===

 - due to the introduction of accounting with the start of the early access program, some re-configuration of user accounts is needed within SLURM to assign the correct QOS levels and priorities to the jobs
 - this might lead to (temporarily) failing job starts for certain users
 - if you cannot start jobs via SLURM, please write an email to the support list: `sup(at)deep-est.eu` (a first check of your own account settings is sketched below)
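The standard SLURM accounting query shows which accounts and QOS levels are currently assigned to your user; the format fields below are an illustrative selection:
{{{
# List the SLURM associations (account, partition, QOS) of the current user
sacctmgr show associations user=$USER format=Account,User,Partition,QOS%40
}}}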
=== /sdv-work corrupted ===

 - due to failing disks, the SDV work filesystem mounted at `/sdv-work` got corrupted and has to be rebuilt
 - the metadata still seems to be intact, so directories and files are visible, but no file access is possible
 - it is unclear whether any data can be recovered, since the work filesystems are not covered by backups
=== GPU direct usage with IB on ESB ===

 - currently only available via the Developer stage; for testing, load:
{{{
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0        # alternatively: module load Intel
module load ParaStationMPI
}}}
 - use `PSP_CUDA=1` and `PSP_UCP=1` (a usage sketch follows below)
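Putting this together, a job step with CUDA awareness enabled in ParaStation MPI might look like the following; the application name and the node/task counts are placeholders:
{{{
# Enable CUDA awareness and the UCP transport in ParaStation MPI,
# then launch the job step as usual
export PSP_CUDA=1
export PSP_UCP=1
srun --nodes=2 --ntasks-per-node=1 ./my_gpu_app
}}}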
=== GPU direct usage with Extoll on DAM ===

 - a new Extoll driver for GPU direct over Extoll is currently being tested on the DAM nodes
 - it is only available via the Developer stage; for testing, load:
{{{
module --force purge
module use $OTHERSTAGES
module load Stages/Devel-2019a
module load GCC/8.3.0        # alternatively: module load Intel
module load ParaStationMPI
}}}
 - expect performance and stability issues
=== Horovod ===

 - currently only working with the Developer stage
=== slurmtop ===

The `slurmtop` tool is not working properly; as a workaround, call it via
{{{
slurmtop 2> /dev/null
}}}
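If you use the tool regularly, a shell alias can apply the workaround automatically; this is a convenience suggestion, not part of the official setup:
{{{
# Suppress the spurious stderr output of slurmtop on every call;
# `command` prevents the alias from invoking itself recursively.
alias slurmtop='command slurmtop 2> /dev/null'
}}}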