[[TOC]]

This page gives a short overview of known issues and provides potential solutions and workarounds. ''Last update: 2021-02-26''

[[span(style=color: #FF0000, CM, DAM, ESB access is limited to project-related activities! Please use "--reservation=maint-ticket_2600" for your jobs; an example batch script is given at the end of this page)]]

{{{#!comment highlighted red text
[[span(style=color: #FF0000, System maintenance from Monday, 2020-09-07 to Friday, 2020-09-11, no user access!)]]
}}}

To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.

== Detected HW and node issues ==

=== CM nodes ===
* dp-cn25: FW issues (#2495)

=== DAM nodes ===
* dp-dam[01-08]: limited access due to Fabri3 setup
* dp-dam02: node currently reserved for a special use case (#2554)
* dp-dam03: node currently reserved for a special use case (#2242)

=== ESB nodes ===
{{{#!comment JK: EM client has been fixed
[[span(style=color: #FF0000, Currently facing issues in reading the ESB Energy Meter leading to nodes going offline. A fix is ready for roll-out)]]
}}}
* **dp-esb[01-25]: currently not available due to Fabri3 installation**
* dp-esb75: node currently reserved for a special use case (#2568)

=== SDV nodes ===
* several nodes have been taken offline:
  - deeper-sdv[11-16]
  - deeper storage system (deeper-fs[01-03], deeper-raids)
* deeper-sdv[01-10]: currently not available; a configuration change is needed (low priority)
* knl01: NVMe issues (#2011)

== Software issues ==

=== SLURM jobs ===
- Due to the introduction of accounting with the start of the early access program, some re-configuration of user accounts is needed within SLURM to assign the correct QOS levels and priorities to jobs.
- This might lead to (temporarily) failing job starts for certain users.
- If you cannot start jobs via SLURM, please write an email to the support list: `sup(at)deep-est.eu`

=== GPU direct usage with IB on ESB ===
- only available via the Developer stage; for testing, load:
{{{
ml --force purge
ml use $OTHERSTAGES
ml load Stages/Devel-2020
ml load Intel
ml load ParaStationMPI
}}}
- use `PSP_CUDA=1` and `PSP_UCP=1` (see the launch sketch at the end of this page)

=== GPU direct usage with Extoll on DAM ===
- the new Extoll driver for GPU direct over Extoll still shows low performance on the DAM nodes
- available via the Developer stage; for testing, load:
{{{
ml --force purge
ml use $OTHERSTAGES
ml load Stages/Devel-2020
ml load Intel
ml load ParaStationMPI
}}}
- expect performance (and maybe also stability) issues
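
For both GPU direct variants above, the `PSP_*` settings are plain environment variables read by ParaStation MPI at launch time. A minimal launch sketch, assuming the Devel-2020 stage has already been loaded as shown above; the partition name, node/task counts and the binary `./app` are placeholders, not taken from this page:
{{{
# ParaStation MPI settings for GPU direct (as recommended above)
export PSP_CUDA=1
export PSP_UCP=1

# placeholder launch line: adjust partition, sizes and binary to your job
srun --partition=dp-esb -N 2 -n 2 ./app
}}}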
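
The maintenance reservation from the access note at the top of this page is passed to SLURM like any other reservation, either on the `srun`/`sbatch` command line or as a batch directive. A minimal batch script sketch; only the reservation name is taken from this page, all other values are placeholders to adapt to your job:
{{{
#!/bin/bash
#SBATCH --reservation=maint-ticket_2600   # reservation named in the access note above
#SBATCH --partition=dp-cn                 # placeholder: pick the module you target
#SBATCH --nodes=1                         # placeholder resources
#SBATCH --time=00:30:00

srun ./app                                # placeholder binary
}}}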