Changes between Version 66 and Version 67 of Public/User_Guide/PaS


Timestamp:
Mar 16, 2023, 8:50:42 AM
Author:
Jochen Kreutz
Comment:

regular update to reflect current system status

  • Public/User_Guide/PaS

    v66 v67  
    33This page is intended to give a short overview of known issues and to provide potential solutions and workarounds.
    44
    5 ''Last update: 2022-12-08''
     5''Last update: 2023-03-16''
    66
    77{{{#!comment
     
    1616The system status is reported on [https://status.jsc.fz-juelich.de/ JSC status] as well.
    1717
     18
     19== Login node ==
     20
     21*  Time limit for user processes enforced on deepv login: **Processes will be killed after 24 hours**. In case of problems, please contact niessen@par-tec.com
     22
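To see which of your own processes are approaching the 24-hour limit on the login node, something like the following sketch may help. It assumes a procps-ng `ps` that supports the `etimes` (elapsed seconds) column; the helper name `filter_long_running` and the 20-hour default threshold are hypothetical and not part of the system setup.

```shell
# Hypothetical helper: reads "PID ELAPSED_SECONDS COMMAND" lines on stdin and
# prints those whose elapsed time is at least the given number of hours
# (default: 20, an arbitrary warning threshold below the 24 h limit).
filter_long_running() {
    awk -v limit="$(( ${1:-20} * 3600 ))" \
        '$2 >= limit { printf "%s %dh %s\n", $1, int($2 / 3600), $3 }'
}
```

Possible usage on the login node: `ps -u "$(id -un)" -o pid=,etimes=,comm= | filter_long_running 20`.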
    1823== Detected HW and node issues ==
    1924
    2025=== Cooling issues ===
    21  * pump failures for JSC cooling loop have been detected
    22  * root cause still to be identified
    23  * considering manual mode to allow for operation of CM and ESB nodes in the meantime
     26 * pump in JSC cooling loop is running in manual mode: low-priority HPL jobs are run frequently to create some load (waste heat)
     27   * HPL jobs can be killed on demand: in case of problems (your jobs being blocked by HPL runs), please contact j.kreutz@fz-juelich.de or niessen@par-tec.com
    2428
    2529=== CM nodes ===
    26  * dp-cn25: SEL ProblemsFW issues (#2769)
     30 * dp-cn25: Thermal issues within chassis slot (#2769)
    2731
    2832=== DAM nodes ===
    2933 * dp-dam02: reserved for FPGA tests
     34 * dp-dam13: failing healthcheck: memory_not_reclaimable
    3035 * dp-dam16: testbed
    3136
    3237=== ESB nodes ===
    33  * dp-esb[11]: memory issues (#2857)
    34  * dp-esb[31]: GPU issues (#2949)
     38 * dp-esb[07]: wrong BIOS settings (#2881)
     39 * dp-esb[17]: IB HCA issues (#3140)
     40 * dp-esb[75]: Easybuild testbed (#3094)
    3541
    3642
     
    4753
    4854=== MODULEPATH
    49 
    50  * MODULEPATH variable seems to get overwritten though being set correctly in `/etc/profile.d/modules.sh`
     55 
     56 * MODULEPATH variable might get overwritten when switching stages
    5157 * leads to various modules not being detected / found correctly
    52  * re-setting the MODULEPATH manually might solve the issue, please try:
     58 * re-setting the MODULEPATH manually might solve the issue, e.g. for the 2022 stage, please try:
    5359{{{
    5460export MODULEPATH=/usr/local/software/skylake/Stages/2022/modules/all/Compiler/sidecompiler/GCCcore/11.2.0:/usr/local/software/skylake/Stages/2022/modules/all/Compiler/GCCcore/11.2.0:/usr/local/software/skylake/Stages/2022/modules/all/Core:/usr/local/software/skylake/Stages/2022/modules/all/MPI:/usr/local/software/skylake/Stages/2022/modules/all/MPI_settings:/usr/local/software/skylake/Stages/2022/modules/all/comm_settings:/usr/local/software/skylake/Stages/2022/modules/all/pkg_settings:/usr/local/software/skylake/Stages/2022/UI/Defaults:/usr/local/software/skylake/Stages/2022/UI/Tools:/usr/local/software/skylake/Stages/2022/UI/Compilers:/usr/local/software/skylake/userinstallations:/usr/local/software/skylake/OtherStages:/usr/local/software/skylake/Devel
     
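When modules are not found after switching stages, it can help to check whether every `MODULEPATH` entry is an absolute path to an existing directory. The following is a minimal sketch for bash; the function name `check_modulepath` is hypothetical and not provided by the system.

```shell
# Hypothetical helper: checks a colon-separated path list (e.g. "$MODULEPATH").
# Prints entries that are not absolute or that do not exist as directories,
# and returns 1 if any such entry was found, 0 otherwise.
check_modulepath() {
    local entry status=0
    local IFS=':'
    for entry in $1; do
        case "$entry" in
            /*) [ -d "$entry" ] || { echo "missing: $entry"; status=1; } ;;
            *)  echo "not absolute: $entry"; status=1 ;;
        esac
    done
    return "$status"
}
```

Possible usage: `check_modulepath "$MODULEPATH"`. No output and a zero exit status mean that all entries look sane.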
    5965=== CUDA and Rocky 8.6
    6066
    61 - New CUDA drivers on the compute nodes.In case of problems, please manually prepend your `LD_LIBRARY_PATH` (first for libcuda, second for libcublas, fft, etc.):
     67- New CUDA drivers on the compute nodes. In case of problems, please manually prepend your `LD_LIBRARY_PATH` (first for libcuda, second for libcublas, fft, etc.):
    6268{{{
    6369ln -s /usr/lib64/libcuda.so.1 .