SDV_KNLs

Timestamp:: Jan 13, 2017, 11:30:14 AM (8 years ago)
Author:: Anke Kreuzer
Comment:: —

Public/User_Guide/SDV_KNLs

-                      v4
+                      v5
 = Compiling =
-Use the -xMIC-AVX512 flag instead of -mmic.
+Use the -xMIC-AVX512 flag instead of -mmic.\\
+Check the actual vectorisation with -qopt-report=5 -qopt-report-phase=vec; the report is written to *.optrpt files.
 = Using MPI over EXTOLL =
 set PSP_RENDEZVOUS_VELO = <max. message size>
+\\
+\\
+= 5 things to consider when using KNL =
+1. Make sure to use the fast MCDRAM:
+  * When MCDRAM is in cache mode:
+    * No changes are needed.
+  * When MCDRAM is in flat mode:
+    * If the total memory footprint of the application is smaller than the size of MCDRAM: numactl -m 1 ./my_application.out (allocations that don’t fit into MCDRAM make the application fail).
+    * If the total memory footprint of the application is larger than the size of MCDRAM: numactl -p 1 ./my_application.out (allocations that don’t fit into MCDRAM spill over to DDR).
+    * To make a manual choice of what should be allocated in the MCDRAM: Use the memkind library.\\
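For the manual choice in flat mode, a minimal sketch using the hbwmalloc interface that ships with memkind could look as follows. It assumes memkind is installed on the node and the program is linked with -lmemkind. The build switch USE_HBW and the helper names alloc_fast and free_fast are hypothetical and not part of memkind; without USE_HBW the sketch falls back to plain malloc so that it also builds on machines without the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef USE_HBW            /* hypothetical build switch: -DUSE_HBW -lmemkind */
#include <hbwmalloc.h>    /* high-bandwidth memory interface of memkind */
#endif

/* Hypothetical helper: allocate n doubles, preferably in MCDRAM.
 * *in_hbw is set to 1 if the buffer came from hbw_malloc, else 0. */
double *alloc_fast(size_t n, int *in_hbw)
{
    assert(in_hbw != NULL);
    *in_hbw = 0;
#ifdef USE_HBW
    /* hbw_check_available() returns 0 when high-bandwidth memory is usable. */
    if (hbw_check_available() == 0) {
        double *p = hbw_malloc(n * sizeof(double));
        if (p != NULL) {
            *in_hbw = 1;
            return p;
        }
    }
#endif
    return malloc(n * sizeof(double));   /* fallback: regular allocation */
}

/* Release a buffer with the allocator that produced it. */
void free_fast(double *p, int in_hbw)
{
#ifdef USE_HBW
    if (in_hbw) {
        hbw_free(p);
        return;
    }
#endif
    (void)in_hbw;
    free(p);
}
```

Only the few bandwidth-critical arrays would go through alloc_fast; everything else stays with the normal allocator and therefore in DDR.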
+2. Verify that the pinning is as you wish:
+  * Start job on KNL node(s).
+  * Log in on KNL.
+  * Invoke htop.
+  * Check the load distribution.
+  * Remark: Each core can execute 1, 2 or 4 threads. On KNL, unlike on KNC, a single thread per core can already give optimal performance.\\
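In addition to watching htop, the application can report which CPUs it is allowed to run on. The following is a minimal sketch, assuming a Linux node where /proc/self/status exposes the Cpus_allowed_list field; the helper name allowed_cpu_list is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the list of CPUs this process may run on
 * (the "Cpus_allowed_list" field of /proc/self/status on Linux) into out.
 * Returns 0 on success, -1 if the field could not be read. */
int allowed_cpu_list(char *out, size_t out_size)
{
    static const char key[] = "Cpus_allowed_list:";
    char line[4096];
    FILE *f;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, key, sizeof key - 1) == 0) {
            const char *value = line + sizeof key - 1;
            value += strspn(value, " \t");        /* skip padding */
            snprintf(out, out_size, "%s", value);
            out[strcspn(out, "\n")] = '\0';       /* drop trailing newline */
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}
```

Printing the result once at start-up, for example together with the MPI rank, leaves a record of the pinning in the job output that can be compared with what htop shows.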
+3. Use VTune/Advisor to analyse the performance:
+  * Start job on KNL node(s).
+  * Log in on KNL.
+  * Load the module of the tool: 'module load VTune' or 'module load Advisor'.
+  * Run amplxe-gui (VTune) or advixe-gui (Advisor).
+  * Follow the instructions.
+  * Remark: If you run into errors of the sort “sepdk not available”, please contact the administrator. Both tools rely on a kernel module to access the hardware counters.\\
+4. Provide hints to the compiler:
+  * Check *.optrpt for info on vectorisation.
+  * If you find “unaligned...”, make sure the data is actually aligned and tell the compiler so by adding "#pragma vector aligned" before the loop.
+  * If a loop does not vectorise although it clearly should, you can add "#pragma simd" before the loop.
+  * Re-check *.optrpt.
+  * Re-check in VTune / Advisor.\\
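The alignment hint can be sketched as follows. This is a minimal example under assumptions: the names alloc_aligned and axpy_aligned are hypothetical, 64 bytes is used because that is the width of one AVX-512 register, and the pragma is guarded so that compilers other than the Intel compiler simply skip it. The pragma only asserts alignment; the data has to be allocated on an aligned address separately.

```c
#define _POSIX_C_SOURCE 200809L   /* for posix_memalign */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64   /* 64 bytes = width of one AVX-512 register */

/* Allocate n doubles on a 64-byte boundary; returns NULL on failure.
 * Release the buffer with free(). */
double *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, ALIGNMENT, n * sizeof(double)) != 0)
        return NULL;
    return p;
}

/* y[i] += a * x[i]; both arrays must start on a 64-byte boundary. */
void axpy_aligned(size_t n, double a, const double *restrict x,
                  double *restrict y)
{
    size_t i;

    /* Check the promise made to the compiler below. */
    assert((uintptr_t)x % ALIGNMENT == 0);
    assert((uintptr_t)y % ALIGNMENT == 0);

    /* The pragma is a promise, not a request: it tells the Intel compiler
     * that all accesses in the loop are aligned. */
#if defined(__INTEL_COMPILER)
#pragma vector aligned
#endif
    for (i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

If the arrays were not aligned, the pragma could make the loop crash, which is why the sketch checks the addresses before the loop.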
+5. Verify the performance via benchmarks:
+  * Set up JUBE for your code.
+  * Benchmark the various versions with proper timing.
+  * Be aware: VTune / Advisor sometimes give estimates that are a little off. It's imperative to check the actual performance.
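What "proper timing" can mean inside the code is sketched below: a monotonic wall clock and the best of several repetitions, which reduces the influence of warm-up and system noise. This is an illustration under assumptions, not a prescribed method; the names wall_seconds, best_time, sum_job and sum_kernel are hypothetical, and sum_kernel is only a stand-in for the real workload.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds from a monotonic clock. */
double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Hypothetical helper: run kernel(arg) `repeats` times and
 * return the duration of the fastest run in seconds. */
double best_time(void (*kernel)(void *), void *arg, int repeats)
{
    double best = -1.0;
    int r;

    assert(kernel != NULL && repeats > 0);
    for (r = 0; r < repeats; ++r) {
        double t0 = wall_seconds();
        kernel(arg);
        double dt = wall_seconds() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Hypothetical stand-in workload: sums 0 .. n-1. */
struct sum_job {
    size_t n;
    volatile double result;   /* volatile so the loop is not optimised away */
};

void sum_kernel(void *arg)
{
    struct sum_job *job = arg;
    double s = 0.0;
    size_t i;
    for (i = 0; i < job->n; ++i)
        s += (double)i;
    job->result = s;
}
```

Printing the value returned by best_time in a fixed format makes it easy to collect the numbers for the different code versions in a benchmarking setup.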