Changes between Version 27 and Version 28 of Public/User_Guide/PaS


Timestamp:
Sep 23, 2020, 9:22:54 AM
Author:
Jochen Kreutz
Comment:

HW issues updated

  • Public/User_Guide/PaS

    v27 v28  
    33This page is intended to give a short overview of known issues and to provide potential solutions and workarounds for them.
    44
    5 ''Last update: 2020-09-22''
     5''Last update: 2020-09-23''
    66{{{#!comment highlighted red text
    77[[span(style=color: #FF0000, System maintenance from Monday, 2020-09-07 to Friday, 2020-09-11, no user access !)]]
    88}}}
    9 [[span(style=color: #FF0000, GPFS issues occurred, user login currently not possible !)]]
    109
    1110To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.
     
    1615=== CM nodes ===
    1716
    18 * dp-cn[09,10]: nodes currently reserved for special use case during working hours
    19 * dp-cn11: node was not responding (#2426)
    20 * dp-cn24: thermal trip asserted (#2443, #2306)
    21 * dp-cn41: node not responding (#2477)
     17* dp-cn[01,09,10]: nodes currently reserved for special use case during working hours
     18* dp-cn33: memory issues (#2464)
     19* dp-cn49: configuration change required (#2291)
     20* dp-cn50: node not reachable (#2488)
    2221
    2322=== DAM nodes ===
     
    3635}}}
    3736
    38 * dp-esb02: energy meter reading issues
    39 * dp-esb03: energy meter reading issues (#2466)
     37* dp-esb05: node hangs in state "Idle+Completing"
    4038* dp-esb11: wrong GPU Link Speed detected (#2358)
    41 * dp-esb23: MCE problems (#2350)
    4239* dp-esb24: CentOS8 Testbed (#2396)
    43 * dp-esb28: no access to bmc (#2430)
    44 * dp-esb33: no access to bmc (#2429)
    45 * dp-esb38: no access to bmc
    4640* dp-esb39: energy meter reading issues (#2432)
    4741* dp-esb52: energy meter reading issues (#2433)
     42* dp-esb61: node not reachable (#2469)
    4843* dp-esb71: energy meter reading issues (#2432)
    4944* dp-esb73: energy meter reading issues (#2433)
     
    5550* nfgw[01,02]: nodes reachable via network, but marked as down in SLURM
    5651* knl01: NVMe issues (#2011)
     52* ml-gpu02: memory issues reported with MCE (#2489)
    5753
    5854
     
    6662
    6763
    68 {{{#!comment JK: status to be clarified on Thursday, 2020-09-03
    69 === LDAP error message during login ===
    70 
    71 - frequent failovers between the master nodes are occurring and are being investigated
    72 - currently, a failover might lead to the following or a similar error message during login:
    73 
    74 {{{
    75 Error: ldap_search: failed to open connection to LDAP server(s) and search. Exception: socket connection error while opening: [Errno 111] Connection refused
    76 }}}
    77 
    78 - the message usually can be ignored
    79 - in addition, some of the environment variables, e.g. `$PROJECT`, are not set (properly)
    80 - if you see further issues or cannot log in at all, please write an email to the support list: `sup(at)deep-est.eu`
    81 
    8264=== SLURM jobs ===
    8365
     
    8668- this might lead to (temporary) failing job starts for certain users
    8769- if you cannot start jobs via SLURM, please write an email to the support list: `sup(at)deep-est.eu`
     70
    8871
    8972=== GPU direct usage with IB on ESB ===
     
    11699
    117100- expect performance and stability issues
    118 
    119 
    120 === Horovod ===
    121 
    122 - currently only working with the developer stage
    123 
    124 
    125 }}}