[[TOC]]

This page gives a short overview of known issues and provides potential solutions and workarounds.

''Last update: 2023-03-16''

{{{#!comment
{{{#!div style="font-size: 150%"
[[span(style=color: #FF0000, Due to global filesystem issues in the GPFS, user login is currently not possible !)]]
}}}
}}}

**Please use the support mailing list `sup(at)deep-sea-project.eu` to report any issues.**

To stay informed, please refer to the [wiki:Public/User_Guide/News News page]. Also, please pay attention to the information in the "Message of the day" displayed when logging onto the system. The system status is reported on [https://status.jsc.fz-juelich.de/ JSC status] as well.

== Login node ==

 * Time limit for user processes enforced on deepv login: **Processes will be killed after 24 hours**

In case of problems, please contact niessen@par-tec.com

== Detected HW and node issues ==

=== Cooling issues ===

 * pump in JSC cooling loop is running in manual mode: HPL jobs (with low priority) are run frequently to create some load (waste heat)
 * HPL jobs can be killed on demand: in case of problems (your jobs being blocked by HPL runs), please contact j.kreutz@fz-juelich.de or niessen@par-tec.com

=== CM nodes ===

 * dp-cn25: Thermal issues within chassis slot (#2769)

=== DAM nodes ===

 * dp-dam02: reserved for FPGA tests
 * dp-dam13: failing healthcheck: memory_not_reclaimable
 * dp-dam16: testbed

=== ESB nodes ===

 * dp-esb[07]: wrong BIOS settings (#2881)
 * dp-esb[17]: IB HCA issues (#3140)
 * dp-esb[75]: Easybuild testbed (#3094)

=== SDV nodes ===

 * deeper-sdv cluster nodes (Haswell) have been taken offline: deeper-sdv[01-16]
   * not included in SLURM anymore
   * deeper-sdv[09-10] used for testing (please contact j.kreutz(at)fz-juelich.de if you would like to get access)
 * knl01: serves as golden client for imaging only
 * dp-sdv-esb[01,02]: will only be powered on demand

== Software issues ==

=== MODULEPATH ===

 * MODULEPATH variable might get overwritten when switching stages
   * leads to various modules not being detected / found correctly
   * re-setting the MODULEPATH manually might solve the issue, e.g. for the 2022 stage, please try:
{{{
export MODULEPATH=/usr/local/software/skylake/Stages/2022/modules/all/Compiler/sidecompiler/GCCcore/11.2.0:/usr/local/software/skylake/Stages/2022/modules/all/Compiler/GCCcore/11.2.0:/usr/local/software/skylake/Stages/2022/modules/all/Core:/usr/local/software/skylake/Stages/2022/modules/all/MPI:/usr/local/software/skylake/Stages/2022/modules/all/MPI_settings:/usr/local/software/skylake/Stages/2022/modules/all/comm_settings:/usr/local/software/skylake/Stages/2022/modules/all/pkg_settings:/usr/local/software/skylake/Stages/2022/UI/Defaults:/usr/local/software/skylake/Stages/2022/UI/Tools:/usr/local/software/skylake/Stages/2022/UI/Compilers:/usr/local/software/skylake/userinstallations:/usr/local/software/skylake/OtherStages:/usr/local/software/skylake/Devel
}}}

=== Cuda and Rocky 8.6 ===

 * New CUDA drivers on the compute nodes. In case of problems, please manually prepend the following to your `LD_LIBRARY_PATH` (first for libcuda, second for libcublas, fft, etc.):
{{{
ln -s /usr/lib64/libcuda.so.1 .
ln -s /usr/lib64/libnvidia-ml.so.1 .
LD_LIBRARY_PATH=.:/usr/local/cuda/lib64:$LD_LIBRARY_PATH srun
}}}

=== nvidia driver mismatch ===

 * loading the CUDA module and trying to run `nvidia-smi` (or any application trying to use the GPU) leads to
{{{
Failed to initialize NVML: Driver/library version mismatch
}}}
 * workaround is to unload the driver module: `ml -nvidia-driver/.default`
 * for further information, please also see [https://gitlab.jsc.fz-juelich.de/hps-public/easybuild-repository/-/wikis/Failed-to-initialize-NVML-Driver-library-version-mismatch-message here][[BR]]

=== nvidia profiling tools ===

 * to launch the tools on a compute node using X-Forwarding, another SSH session is needed:
{{{
srun --forward-x -p dp-esb -N 1 -n 1 --pty /bin/bash -i
ssh -X -J @deep.zam.kfa-juelich.de @
}}}
 * you will still see a warning "OpenGL Version check failed. Falling back to Mesa software rendering.", but the profiling tool (e.g. `nsight-sys`) should start up
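
Regarding the MODULEPATH issue listed under "Software issues": when re-setting a long `MODULEPATH` by hand, a single mistyped entry is easy to miss. The following is a minimal sketch of a sanity check, not an official tool of the system. The function name `check_modulepath` is hypothetical, and the script only assumes a POSIX shell. It reports entries that are empty, not absolute, or not existing directories.

```shell
# check_modulepath: hypothetical helper, not part of the module system.
# Checks a colon-separated path list (first argument, or the current
# MODULEPATH if no argument is given) and prints one line per suspicious entry.
# The body runs in a subshell so that the IFS and globbing changes stay local.
check_modulepath() (
    path_list="${1-${MODULEPATH-}}"
    status=0
    set -f      # disable globbing so entries are not expanded
    IFS=':'     # split on colons only
    for entry in $path_list; do
        if [ -z "$entry" ]; then
            echo "empty entry"
            status=1
        elif [ "${entry#/}" = "$entry" ]; then
            # Stripping a leading "/" changed nothing, so the path is relative.
            echo "not absolute: $entry"
            status=1
        elif [ ! -d "$entry" ]; then
            echo "missing directory: $entry"
            status=1
        fi
    done
    # Exit status of the function: 0 if all entries look fine, 1 otherwise.
    [ "$status" -eq 0 ]
)
```

After defining the function in the current shell, running `check_modulepath` without arguments inspects the active `MODULEPATH`, for example right after switching stages. A clean result does not guarantee that all modules are found; it only rules out malformed or non-existent entries.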