[[TOC]]

This page gives a short overview of known issues and provides potential solutions and workarounds for them.

''Last update: 2023-10-11''

{{{#!comment
{{{#!div style="font-size: 150%"
[[span(style=color: #FF0000, Due to global filesystem issues in the GPFS, user login is currently not possible !)]] 
}}}
}}}

**Please use the support mailing list `sc(at)fz-juelich.de` to report any issues.**

Please refer to the [wiki:Public/User_Guide/News Project News Page]. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.
The system status is reported on [https://status.jsc.fz-juelich.de/ JSC status] as well. 


== Login node ==

*  A time limit for user processes is enforced on the deepv login node: **processes will be killed after 24 hours**. In case of problems, please contact niessen@par-tec.com
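
To see which of your own login-node processes are approaching the 24 hour limit, their elapsed run time can be checked. The following is only a sketch: the helper name `filter_long_running` and the 20 hour threshold (72000 seconds) are illustrative assumptions, and the example invocation assumes a `ps` from procps-ng that supports the `etimes` column.

```shell
# Hypothetical helper: reads lines of "PID ELAPSED_SECONDS COMMAND" on stdin
# and prints those whose elapsed time is at or above the given threshold.
filter_long_running() {
    awk -v limit="$1" '$2 + 0 >= limit + 0 { print }'
}

# Example invocation (assumes procps-ng ps, which provides the etimes column):
#   ps -u "$USER" -o pid=,etimes=,comm= | filter_long_running 72000
```

Anything listed has been running for at least 20 hours and should be finished or restarted before it is killed.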

== Detected HW and node issues ==

=== Cooling issues ===
 * the pump in the JSC cooling loop is running in manual mode; HPL jobs are therefore run frequently (with low priority) to create some load (waste heat)
   * HPL jobs can be killed on demand: if your jobs are being blocked by HPL runs, please contact j.kreutz@fz-juelich.de or niessen@par-tec.com

=== CM nodes ===
 * dp-cn03: #2374 - Kernel update 
 * dp-cn25: #2769 - SEL Problems
 * dp-cn50: #2769 - Testbed for beegfs 

=== DAM nodes ===
 * dp-dam02: #2874 - Reserved for FPGA tests
 * dp-dam16: #3231 - ECC Memory

=== ESB nodes ===
 * dp-esb[17]: #3196 - Energy meter not working

=== SDV nodes ===
 * deeper-sdv cluster nodes (Haswell) have been taken offline: deeper-sdv[01-16]
   * not included in SLURM anymore
   * deeper-sdv[09-10] are used for testing (please contact j.kreutz(at)fz-juelich.de if you would like to get access)

 * knl01: serves as golden client for imaging only

 * dp-sdv-esb[01,02]: will only be powered on demand

== Software issues ==


=== CUDA and Rocky 8.6 ===

 * New CUDA drivers have been installed on the compute nodes. In case of problems, please manually prepend two entries to your `LD_LIBRARY_PATH` (the first for libcuda, the second for libcublas, fft, etc.):
{{{
ln -s /usr/lib64/libcuda.so.1 .
ln -s /usr/lib64/libnvidia-ml.so.1 .
LD_LIBRARY_PATH=.:/usr/local/cuda/lib64:$LD_LIBRARY_PATH srun <srun_args> <exe> <exe_args>
}}}
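
The two `ln -s` commands above create dangling links if a driver library is not present in `/usr/lib64`. A minimal sketch of the same linking step that reports missing libraries instead is shown below; the function name `link_driver_libs` is hypothetical and not part of the system software.

```shell
# Hypothetical helper: symlink the two driver libraries named in the
# workaround from src_dir into dest_dir, and report any that are missing
# instead of creating a dangling link. Returns 1 if a library was missing.
link_driver_libs() {
    src_dir=$1
    dest_dir=$2
    missing=0
    for lib in libcuda.so.1 libnvidia-ml.so.1; do
        if [ -e "$src_dir/$lib" ]; then
            ln -sf "$src_dir/$lib" "$dest_dir/$lib"
        else
            echo "missing: $src_dir/$lib" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example invocation, followed by the srun call from the workaround:
#   link_driver_libs /usr/lib64 . && \
#   LD_LIBRARY_PATH=.:/usr/local/cuda/lib64:$LD_LIBRARY_PATH srun <srun_args> <exe> <exe_args>
```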


=== nvidia driver mismatch ===
 * loading the CUDA module and then running `nvidia-smi` (or any application that uses the GPU) leads to

{{{
Failed to initialize NVML: Driver/library version mismatch
}}}
 * the workaround is to unload the driver module: `ml -nvidia-driver/.default`
 * for further information, please also see [https://gitlab.jsc.fz-juelich.de/hps-public/easybuild-repository/-/wikis/Failed-to-initialize-NVML-Driver-library-version-mismatch-message here][[BR]]
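
In a job script the workaround could be applied only when the mismatch actually occurs. The following is a sketch under that assumption; the helper name `has_nvml_mismatch` is hypothetical, and it simply looks for the error text quoted above.

```shell
# Hypothetical helper: succeeds if the text on stdin contains the NVML
# driver/library mismatch message.
has_nvml_mismatch() {
    grep -q 'Driver/library version mismatch'
}

# Example invocation on a compute node:
#   if nvidia-smi 2>&1 | has_nvml_mismatch; then
#       ml -nvidia-driver/.default
#   fi
```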


=== nvidia profiling tools ===

 * to launch the tools on a compute node using X forwarding, a second SSH session to the allocated node is needed:

{{{
srun --forward-x -p dp-esb -N 1 -n 1 --pty /bin/bash -i
ssh -X -J <your account>@deep.zam.kfa-juelich.de <your account>@<the node you received>
}}}

 * you will still see a warning "OpenGL Version check failed. Falling back to Mesa software rendering.", but the profiling tool (e.g. `nsight-sys`) should start up