Changes between Version 27 and Version 28 of Public/User_Guide/PaS
Timestamp: Sep 23, 2020, 9:22:54 AM
Public/User_Guide/PaS
  v27 v28
    3   3 This page is intended to give a short overview on known issues and to provide potential solutions and workarounds to the issues seen.
    4   4
-   5     ''Last update: 2020-09-22''
+       5 ''Last update: 2020-09-23''
    6   6 {{{#!comment highlighted red text
    7   7 [[span(style=color: #FF0000, System maintenance from Monday, 2020-09-07 to Friday, 2020-09-11, no user access !)]]
    8   8 }}}
-   9     [[span(style=color: #FF0000, GPFS issues occurred, user login currently not possible !)]]
   10   9
   11  10 To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.
    …   …
   16  15 === CM nodes ===
   17  16
-  18     * dp-cn[09,10]: nodes currently reserved for special use case during working hours
-  19     * dp-cn11: node was not responding (#2426)
-  20     * dp-cn24: thermal trip asserted (#2443, #2306)
-  21     * dp-cn41: node not responding (#2477)
+      17 * dp-cn[01,09,10]: nodes currently reserved for special use case during working hours
+      18 * dp-cn33: memory issues (#2464)
+      19 * dp-cn49: configuration change required (#2291)
+      20 * dp-cn50: node not reachable (#2488)
   22  21
   23  22 === DAM nodes ===
    …   …
   36  35 }}}
   37  36
-  38     * dp-esb02: energy meter reading issues
-  39     * dp-esb03: energy meter reading issues (#2466)
+      37 * dp-esb05: node hangs in state "Idle+Completing"
   40  38 * dp-esb11: wrong GPU Link Speed detected (#2358)
-  41     * dp-esb23: MCE problems (#2350)
   42  39 * dp-esb24: CentOS8 Testbed (#2396)
-  43     * dp-esb28: no access to bmc (#2430)
-  44     * dp-esb33: no access to bmc (#2429)
-  45     * dp-esb38: no access to bmc
   46  40 * dp-esb39: energy meter reading issues (#2432)
   47  41 * dp-esb52: energy meter reading issues (#2433)
+      42 * dp-esb61: node not reachable (#2469)
   48  43 * dp-esb71: energy meter reading issues (#2432)
   49  44 * dp-esb73: energy meter reading issues (#2433)
    …   …
   55  50 * nfgw[01,02]: node reachable via network, but marked as down in SLURM
   56  51 * knl01: NVMe issues (#2011)
+      52 * ml-gpu02: memory issues reported with MCE (#2489)
   57  53
   58  54
    …   …
   66  62
   67  63
-  68     {{{#!comment JK: status to be clarified on Thursday, 2020-09-03
-  69     === LDAP error message during login ===
-  70
-  71     - seeing frequent failovers between the master nodes that are being investigated
-  72     - currently, a failover might lead to seeing the following or a similar error message during login:
-  73
-  74     {{{
-  75     Error: ldap_search: failed to open connection to LDAP server(s) and search. Exception: socket connection error while opening: [Errno 111] Connection refused
-  76     }}}
-  77
-  78     - the message usually can be ignored
-  79     - in addition, some of the environment variables, e.g. `$PROJECT` are not set (properly)
-  80     - if you see further issues or cannot login at all, please write an email to the support list: `sup(at)deep-est.eu`
-  81
   82  64 === SLURM jobs ===
   83  65
    …   …
   86  68 - this might lead to (temporary) failing job starts for certain users
   87  69 - if you cannot start jobs via SLURM, please write an email to the support list: `sup(at)deep-est.eu`
+      70
   88  71
   89  72 === GPU direct usage with IB on ESB ===
    …   …
  116  99
  117 100 - expect performance and stability issues
- 118
- 119
- 120     === Horovod ===
- 121
- 122     - currently only working with the developer stage
- 123
- 124
- 125     }}}