= General information = We have 3 KNLs in the SDV right now.\\ All KNL nodes have their own local NVMe device installed. = Node allocation = Nodes can be allocated through the Slurm based batch system that is also used for the DEEP-EST system and the SDV Xeon Cluster. You can start an interactive session on our KNLs like this: {{{ srun --partition=knl -N 2 -n 8 --pty /bin/bash -i }}} {{{#!comment To use the knls (4 nodes) reserved for the DEEP-ER people (every day from 8:30am-5:00pm) please use: {{{ qsub -q deeper-knl -l nodes=X:ppn=Y:knl,walltime=... }}} }}} When using a batch script, you have to adapt the --partition option within your script: --partition=knl == Available knl partitions == * knl: The DEEP-ER knl nodes (all of them, regardless of cpu and configuration) * knl256: the 256-core knls (knl5) * knl272: the 272-core knls (knl4,knl6) * snc4: the knls configured in SNC-4 mode (knl5) = Compiling = Use the -xMIC-AVX512 flag instead of -mmic.\\ Check actual vectorisation with -qopt-report=5 -qopt-report-phase=vec -> info given in *.optrpt files == Multi-node Jobs == The KNL nodes are only connected via Gigabit Ethernet, hence there is no need to load the Extoll module to run jobs on multiple nodes.\\ = 5 things to consider when using KNL = 1. Make sure to use the fast MCDRAM: * When MCDRAM is in cache mode: * No changes are needed. * When MCDRAM is in flat mode: * If the total memory footprint of the application is smaller than the size of MCDRAM: numactl –m 1 ./my_application.out (Allocations that don’t fit into MCDRAM make the application fail.) * If the total memory footprint of the application is larger than the size of MCDRAM: numactl –p 1 ./my_application.out ( Allocations that don’t fit into MCDRAM spill over to DDR) * To make a manual choice of what should be allocated in the MCDRAM: Use the memkind library.\\ 2. Verify that the pinning is as you wish: * Start job on KNL node(s). * Log in on KNL. * Invoke htop. * Check the load distribution. * Remark: Each core can execute 1, 2 or 4 threads. On KNL – unlike on KNC – already one thread per core can lead to optimal performance.\\ 3. Use VTune/Advisor to analyse the performance: * Start job on KNL node(s). * Log in on KNL. * 'module load VTune / Advisor'. * Run amplxe-gui / advixe-gui. * Follow instructions. * Remark: If you run into erros of the sort “sepdk not available” please contact the administrator. Both tools rely on a kernel module to access hardware counter.\\ 4. Provide hints to the compiler: * Check *optrpt for info on vectorisation. * If you find “unaligned...” -> add alignment in your code by adding "#pragma vector aligned" before the loop. * If a loop does not vectorise although it clearly should, you can add "#pragma simd" before the loop. * Re-check *.optrpt. * Re-check in VTune / Advisor\\ 5. Verify the performance via benchmarks: * Set up JUBE for your code. * Benchmark the various versions with proper timing. * Be aware: VTune / Advisor sometimes give estimates that are a little off. It's imperative to check the actual performance.