Changes between Version 46 and Version 47 of Public/User_Guide/PaS


Timestamp:
Jul 8, 2022, 8:34:47 AM (22 months ago)
Author:
Jochen Kreutz
Comment:

add link to NVML driver version mismatch info from HPS

  • Public/User_Guide/PaS

    v46 v47  
    77**Please, use the support mailing list `sup(at)deep-sea-project.eu` to report any issues**
    88
    9 {{{#!comment highlighted red text
    10 [[span(style=color: #FF0000, System maintenance from Monday, 2020-09-07 to Friday, 2020-09-11, no user access !)]]
    11 }}}
     9{{{#!comment highlighted red text [[span(style=color: #FF0000, System maintenance from Monday, 2020-09-07 to Friday, 2020-09-11, no user access !)]] }}}
    1210
    13 To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.
    14 
     11To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.
    1512
    1613== Detected HW and node issues ==
     14=== CM nodes ===
     15 * dp-cn25: SEL ProblemsFW issues (#2769)
    1716
    18 === CM nodes ===
     17 * dp-cn27: MCE Errors found (#2919)
    1918
    20 * dp-cn25: SEL ProblemsFW issues (#2769)
    21 
    22 * dp-cn27: MCE Errors found (#2919)
    23 
    24        
    2519=== DAM nodes ===
    26 
    27 * dp-dam02: reserved for FPGA tests
    28 * dp-dam03: PCI link speed degraded (#2931)
    29 * dp-dam10: PMEM module issue (#2875)
    30 * dp-dam16: testbed
    31 
     20 * dp-dam02: reserved for FPGA tests
     21 * dp-dam03: PCI link speed degraded (#2931)
     22 * dp-dam10: PMEM module issue (#2875)
     23 * dp-dam16: testbed
    3224
    3325=== ESB nodes ===
    34 
    35 * dp-esb[07]: used for Rocky 8.6 tests
    36 * dp-esb[11]: memory issues
    37 
     26 * dp-esb[07]: used for Rocky 8.6 tests
     27 * dp-esb[11]: memory issues
    3828
    3929=== SDV nodes ===
     30 * deeper-sdv cluster nodes (Haswell) have been taken offline: deeper-sdv[01-16]
     31   * not included in SLURM anymore
     32   * deeper-sdv[09-10] used for testing (please contact j.kreutz(at)fz-juelich.de if you would like to get access)
    4033
    41 * deeper-sdv cluster nodes (Haswell) have been taken offline: deeper-sdv[01-16]
    42   - not included in SLURM anymore 
    43   - deeper-sdv[09-10] used for testing (please contact j.kreutz(at)fz-juelich.de if you would like to get access)
     34 * knl01: serves as golden client for imaging only
    4435
    45 * knl01: serves as golden client for imaging only
    46 
    47 * dp-sdv-esb[01,02]: Slurm update required
    48 
     36 * dp-sdv-esb[01,02]: Slurm update required
    4937
    5038== Software issues ==
    51 
    5239=== nvidia driver mismatch ===
    53 
    54 - loading the CUDA module and then running `nvidia-smi` (or any application that uses the GPU) leads to
     40 * loading the CUDA module and then running `nvidia-smi` (or any application that uses the GPU) leads to
    5541
    5642{{{
    5743Failed to initialize NVML: Driver/library version mismatch
    5844}}}
    59 
    60 - workaround is to unload the driver module: `ml -nvidia-driver/.default`
     45 * workaround is to unload the driver module: `ml -nvidia-driver/.default`
     46 * for further information, please also see [https://gitlab.jsc.fz-juelich.de/hps-public/easybuild-repository/-/wikis/Failed-to-initialize-NVML-Driver-library-version-mismatch-message here][[BR]]
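
The check described above can be sketched as a small helper that looks for the NVML error text in the output of `nvidia-smi` and, if it is found, prints the workaround from this page. This is only an illustrative sketch: the names `has_nvml_mismatch`, `check_nvidia_smi`, `NVML_MISMATCH` and `WORKAROUND` are made up for this example, and the script does not unload the module itself, because `ml` has to be run in the user's own shell session.

```python
import shutil
import subprocess

# Error text and workaround command as quoted on this wiki page.
NVML_MISMATCH = "Failed to initialize NVML: Driver/library version mismatch"
WORKAROUND = "ml -nvidia-driver/.default"


def has_nvml_mismatch(output: str) -> bool:
    """Return True if the given command output contains the NVML mismatch error."""
    return NVML_MISMATCH in output


def check_nvidia_smi() -> str:
    """Run nvidia-smi if it is available and return a short hint for the user."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found in PATH"
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # The error may appear on stdout or stderr, so both streams are checked.
    if has_nvml_mismatch(result.stdout + result.stderr):
        return f"NVML mismatch detected; try: {WORKAROUND}"
    return "no NVML version mismatch reported"


if __name__ == "__main__":
    print(check_nvidia_smi())
```

Run inside a job on a GPU node after loading the CUDA module, the script only reports what it finds; the `ml` command it suggests still has to be typed in the interactive shell or added to the job script.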
    6147
    6248=== Easybuild ===
    63 
    64 - Moving to the new Easybuild stage 2022 (in February) might cause unexpected behavior and problems with the installed software components:
    65 
    66 
    67 
    68 
    69 
     49 * Moving to the new Easybuild stage 2022 (in February) might cause unexpected behavior and problems with the installed software components:
    7050
    7151{{{#!comment JK: invalid
     52
    7253=== GPU direct usage with Extoll on DAM ===
    73 
    74 - new Extoll driver for GPU direct over Extoll still shows low performance on the DAM nodes
    75 - available via Developer stage, for testing load:
     54 * new Extoll driver for GPU direct over Extoll still shows low performance on the DAM nodes
     55 * available via Developer stage, for testing load:
    7656
    7757{{{
     
    8262ml load ParaStationMPI
    8363}}}
     64 * expect performance (and maybe also stability) issues
    8465
    85 - expect performance (and maybe also stability) issues
    8666}}}