Version 41 (modified by Jochen Kreutz, 3 years ago)

remove hint about GPU direct on DAM using Extoll

This page gives a short overview of known issues and provides potential solutions and workarounds.

Last update: 2021-09-09

To stay informed, please refer to the News page. Also, please pay attention to the "Message of the day" displayed when logging in to the system.

Detected HW and node issues

CM nodes

  • dp-cn05: memory issue - node at Megware for repair (#2682)
  • dp-cn25: firmware issues (#2495)
  • dp-cn42: memory issue (#2675)
  • dp-cn[47-50]: Rocky Linux testbed

DAM nodes

  • dp-dam08: memory issues (#2722)

ESB nodes

  • dp-esb[01-25]: currently being prepared as Rocky Linux testbed
  • dp-esb75: node currently reserved for special use case (#2568)

SDV nodes

  • deeper-sdv cluster nodes (Haswell) have been taken offline: deeper-sdv[01-16]
    • no longer included in SLURM
    • deeper-sdv[01-10] will be used for testing
  • knl01: NVMe issues (#2011)
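
The bracketed node ranges in the lists above (for example dp-cn[47-50]) follow the hostlist notation used by SLURM. On a system with SLURM installed, `scontrol show hostnames` expands such an expression into individual node names. As an illustration only, the following minimal Python sketch does the same for the simple patterns that appear on this page. The function name `expand_node_range` is hypothetical, and the sketch assumes a single bracket group containing comma-separated numbers or ranges; it does not cover the full SLURM syntax.

```python
import re


def expand_node_range(expr: str) -> list[str]:
    """Expand an expression such as 'dp-cn[47-50]' into node names.

    Simplified sketch (assumption): one bracket group holding
    comma-separated numbers or ranges. Zero padding is taken from
    the width of each lower bound, so '01-16' yields '01' ... '16'.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # plain node name without brackets
    prefix, body, suffix = match.groups()
    names = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        high = high or low  # a single number is a range of length one
        width = len(low)
        for n in range(int(low), int(high) + 1):
            names.append(f"{prefix}{n:0{width}d}{suffix}")
    return names


print(expand_node_range("dp-cn[47-50]"))
# → ['dp-cn47', 'dp-cn48', 'dp-cn49', 'dp-cn50']
```

A name without brackets, such as dp-esb75, is returned unchanged as a one-element list.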

Software issues

SLURM jobs

  • due to the introduction of accounting, user accounts within SLURM need to be reconfigured so that the correct QOS levels and priorities are assigned to jobs
    • this might (temporarily) cause job starts to fail for certain users
    • if you cannot start jobs via SLURM, please send an email to the support list: sup(at)deep-est.eu