wiki:Public/User_Guide/PaS

Version 27 (modified by Jochen Kreutz, 4 years ago)

This page gives a short overview of known issues and provides potential solutions and workarounds for them.

Last update: 2020-09-22

GPFS issues occurred; user login is currently not possible!

To stay informed, please refer to the News page. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.

Detected HW and node issues

CM nodes

  • dp-cn[09,10]: nodes currently reserved for special use case during working hours
  • dp-cn11: node was not responding (#2426)
  • dp-cn24: thermal trip asserted (#2443, #2306)
  • dp-cn41: node not responding (#2477)

DAM nodes

  • dp-dam03: node currently reserved for special use case (#2242)
  • dp-dam04: showing low STREAM performance (#2401)
  • dp-dam05: node currently reserved for special use case
  • dp-dam07: showing problems with its FPGA (#2353)
  • dp-dam[09,10]: nodes currently reserved for special use case during working hours

ESB nodes

  • dp-esb02: energy meter reading issues
  • dp-esb03: energy meter reading issues (#2466)
  • dp-esb11: wrong GPU Link Speed detected (#2358)
  • dp-esb23: MCE problems (#2350)
  • dp-esb24: CentOS 8 testbed (#2396)
  • dp-esb28: no access to BMC (#2430)
  • dp-esb33: no access to BMC (#2429)
  • dp-esb38: no access to BMC
  • dp-esb39: energy meter reading issues (#2432)
  • dp-esb52: energy meter reading issues (#2433)
  • dp-esb71: energy meter reading issues (#2432)
  • dp-esb73: energy meter reading issues (#2433)

SDV nodes

  • deeper-sdv01: node reachable via network, but marked as down in SLURM
  • nfgw[01,02]: nodes reachable via network, but marked as down in SLURM
  • knl01: NVMe issues (#2011)
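
The node lists above use a compact bracket notation (for example dp-cn[09,10]), which is also the host list form SLURM accepts; on a system with SLURM installed, `scontrol show hostnames` expands such a list. For readers who want to process the lists in a script, the following is a minimal sketch in Python. The helper name `expand_nodes` is hypothetical and not part of any site tooling, and it assumes a single bracket group containing comma-separated items or simple `a-b` ranges.

```python
import re


def expand_nodes(spec: str) -> list[str]:
    """Expand a compact node spec such as 'dp-cn[09,10]' into node names.

    Hypothetical helper: handles one bracket group with comma-separated
    items and zero-padded 'a-b' ranges; anything else is returned unchanged.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", spec)
    if match is None:
        return [spec]  # plain node name without brackets
    prefix, body = match.groups()
    names = []
    for item in body.split(","):
        if "-" in item:
            start, end = item.split("-", 1)
            width = len(start)  # keep the zero padding of the range start
            for n in range(int(start), int(end) + 1):
                names.append(f"{prefix}{n:0{width}d}")
        else:
            names.append(prefix + item)
    return names


print(expand_nodes("dp-cn[09,10]"))
```

The expanded names can then be used, for instance, to build an exclusion list when submitting jobs, although SLURM options such as `--exclude` also accept the bracket form directly.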

Software issues

Modular jobs failing

  • users reported failures of jobs that run MPI across more than one module via the gateways
  • the problem is under investigation