wiki:Public/User_Guide/PaS

Version 50 (modified by Jochen Kreutz, 20 months ago)

This page gives a short overview of known issues and provides potential solutions and workarounds for them.

Last update: 2022-08-04

Please use the support mailing list sup(at)deep-sea-project.eu to report any issues.

To stay informed, please refer to the News page. Also, please pay attention to the information contained in the "Message of the day" displayed when logging onto the system.

Detected HW and node issues

CM nodes

  • dp-cn25: SEL problems / FW issues (#2769)
  • dp-cn27: MCE Errors found (#2919)

DAM nodes

  • dp-dam02: reserved for FPGA tests
  • dp-dam03: PCI link speed degraded (#2931)
  • dp-dam10: PMEM module issue (#2875)
  • dp-dam16: testbed

ESB nodes

  • dp-esb[07]: used for Rocky 8.6 tests
  • dp-esb[11]: memory issues

SDV nodes

  • deeper-sdv cluster nodes (Haswell) have been taken offline: deeper-sdv[01-16]
    • not included in SLURM anymore
    • deeper-sdv[09-10] used for testing (please contact j.kreutz(at)fz-juelich.de if you would like to get access)
  • knl01: serves as golden client for imaging only
  • dp-sdv-esb[01,02]: Slurm update required

Software issues

nvidia driver mismatch

  • loading CUDA module and trying to run nvidia-smi (or any application trying to use the GPU) leads to
Failed to initialize NVML: Driver/library version mismatch
  • workaround is to unload the driver module: ml -nvidia-driver/.default
  • for further information, please also see here
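
The check and workaround above can be sketched as a small shell snippet. This is only an illustration: the helper name is_nvml_mismatch is hypothetical, and the snippet merely prints the workaround command quoted above rather than running it, since ml may not be available in non-interactive scripts.

```shell
#!/bin/sh
# Sketch only: is_nvml_mismatch is a hypothetical helper, not a site-provided tool.

# Succeeds (exit status 0) if the given text contains the NVML mismatch error.
is_nvml_mismatch() {
    case "$1" in
        *"Driver/library version mismatch"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Probe only where nvidia-smi is actually available on PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
    out=$(nvidia-smi 2>&1) || true
    if is_nvml_mismatch "$out"; then
        echo "NVML mismatch detected; workaround: ml -nvidia-driver/.default"
    fi
fi
```

If the message is printed, run the ml command in your interactive shell and try nvidia-smi again.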

nvidia profiling tools

  • to launch the tools on a compute node using X forwarding, a second SSH session to the allocated node is needed:
srun --forward-x -p dp-esb -N 1 -n 1 --pty /bin/bash -i
ssh -X -J <your account>@deep.zam.kfa-juelich.de <your account>@<the node you received>
  • you will still see the warning "OpenGL Version check failed. Falling back to Mesa software rendering.", but the profiling tool (e.g. nsight-sys) should start up
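
To avoid typos when filling in the placeholders of the ssh command above, the line can be assembled by a small helper. This is a minimal sketch: the function name jump_ssh_cmd, the account alice and the node name dp-esb01 are made up for illustration; only the login host and the ssh options are taken from the command shown above.

```shell
#!/bin/sh
# Sketch only: jump_ssh_cmd is a hypothetical helper; account and node are placeholders.

# Prints the ssh command for a given account ($1) and allocated node ($2).
jump_ssh_cmd() {
    printf 'ssh -X -J %s@deep.zam.kfa-juelich.de %s@%s\n' "$1" "$1" "$2"
}

# Example with made-up values; replace them with your own account and node.
jump_ssh_cmd alice dp-esb01
# prints: ssh -X -J alice@deep.zam.kfa-juelich.de alice@dp-esb01
```

Run the printed command from a second terminal on your local machine while the srun session is still open.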

Easybuild

  • Moving to the new Easybuild stage 2022 (in February) might cause unexpected behavior and problems with the installed software components: