Changes between Version 28 and Version 29 of Public/User_Guide/PaS


Timestamp: Oct 27, 2020, 2:17:11 PM
Author: Jochen Kreutz
Comment: HW and SW sections updated to reflect current system status

  • Public/User_Guide/PaS

    v28 v29  
    33This page gives a short overview of known issues and provides potential solutions and workarounds.
    44
    5 ''Last update: 2020-09-23''
     5''Last update: 2020-10-27''
    66{{{#!comment highlighted red text
    77[[span(style=color: #FF0000, System maintenance from Monday, 2020-09-07 to Friday, 2020-09-11, no user access !)]]
     
    1515=== CM nodes ===
    1616
    17 * dp-cn[01,09,10]: nodes currently reserved for special use case during working hours
    18 * dp-cn33: memory issues (#2464)
    19 * dp-cn49: configuration change required (#2291)
    20 * dp-cn50: node not reachable (#2488)
     17* dp-cn25: node shows Unknown SPS FW Health (#2495)
     18       
    2119
    2220=== DAM nodes ===
    2321
    2422* dp-dam03: node currently reserved for special use case (#2242)
    25 * dp-dam04: showing low streams performance (#2401)
    26 * dp-dam05: node currently reserved for special use case
    27 * dp-dam07: showing problems with its FPGA (#2353)
    28 * dp-dam[09,10]: nodes currently reserved for special use case during working hours
    2923
    3024
     
    3529}}}
    3630
    37 * dp-esb05: node hangs in state "Idle+Completing"
    3831* dp-esb11: wrong GPU Link Speed detected (#2358)
    3932* dp-esb24: CentOS8 Testbed (#2396)
    40 * dp-esb39: energy meter reading issues (#2432)
    4133* dp-esb52: energy meter reading issues (#2433)
    42 * dp-esb61: node not reachable (#2469)
    43 * dp-esb71: energy meter reading issues (#2432)
    44 * dp-esb73: energy meter reading issues (#2433)
    4534
    4635
    4736=== SDV nodes ===
    4837
    49 * deeper-sdv01: node reachable via network, but marked as down in SLURM
     38* deeper-sdv[01-16]: currently offline after removing Extoll network
    5039* nfgw[01,02]: node reachable via network, but marked as down in SLURM
    5140* knl01: NVMe issues (#2011)
    52 * ml-gpu02: memory issues reported with MCE (#2489)
    53 
    5441
    5542
    5643== Software issues ==
    57 
    58 === Modular jobs failing ===
    59 
    60 - users reported failing jobs that are doing MPI on more than one module using the gateways
    61 - the problem is being investigated
    62 
    6344
    6445=== SLURM jobs ===