Optimization
C++
Tools
An interesting tool for counting CPU cycles and viewing the assembly code generated by different compilers and optimization levels.
Intel Compilers
Compiler Auto Vectorization
The Intel compiler selects the target instruction set for vectorization with the -x flag. To determine what a processor supports, examine the /proc/cpuinfo file and look for avx or sse entries in the flags field. Note that the Intel compiler will try to vectorize code with SSE2 instructions at optimization levels of -O2 or higher. Disable this by specifying -no-vec.
- -xSSE4.2 – lowest level of vectorization, needed for ww8-node1
- -xAVX – better, for all other PCs
- -xAVX2 – better still, but only on the newest PCs in the institute (ww8stud8); Intel's option reference spells this target -xCORE-AVX2
- -xHost – highest level of vectorization supported by the processor on which the simulation is compiled
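The /proc/cpuinfo check described above can also be scripted. The following is a minimal sketch, assuming a Linux system whose cpuinfo has a line starting with "flags"; the function names read_cpu_flags, suggest_x_option and suggest_for_this_machine are made up for this illustration, and the returned option names simply follow the list above.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Collect the entries of the first "flags" line of a cpuinfo-style stream.
std::set<std::string> read_cpu_flags(std::istream& in) {
    std::set<std::string> flags;
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 5, "flags") != 0) continue;  // not a "flags" line
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream words(line.substr(colon + 1));
        std::string flag;
        while (words >> flag) flags.insert(flag);
        break;  // assumption: all cores report the same flags
    }
    return flags;
}

// Map the detected flags to the highest matching option from the list above
// (option names as spelled in that list). Empty string: nothing matched.
std::string suggest_x_option(const std::set<std::string>& flags) {
    if (flags.count("avx2")) return "-xAVX2";
    if (flags.count("avx")) return "-xAVX";
    if (flags.count("sse4_2")) return "-xSSE4.2";
    return "";
}

// Convenience wrapper for the machine this program runs on (Linux only).
std::string suggest_for_this_machine() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    return suggest_x_option(read_cpu_flags(cpuinfo));
}
```

When compiling on the machine that will also run the code, -xHost makes this choice automatically; the sketch is mainly useful for checking a node before picking a flag for it.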
Multiple levels of vectorization - the -ax flag. This flag generates run-time checks that determine the level of vectorization support on the processor and then choose the optimal execution path for that processor. It also generates a baseline execution path that is taken if the level of vectorization specified with -ax is not supported. The baseline can be defined with the -x flag, with -xSSE4.2 recommended. Multiple -ax flags can be specified to create several paths. For example, compile with -axAVX -xSSE4.2. In this case, the baseline SSE4.2 execution path is taken when run on ww8-node1, and the AVX execution path is taken on all other machines.
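To make the idea of a run-time check with a baseline path more concrete, here is a hand-written sketch of the same pattern. It is not what the Intel compiler generates; it only illustrates the dispatch logic. The names sum_baseline, sum_wide, cpu_has_avx, select_sum and sum_dispatched are hypothetical, and __builtin_cpu_supports is a GCC/Clang builtin for x86 that is assumed to be available only under the guard shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using SumFn = double (*)(const std::vector<double>&);

// Baseline path: plays the role of the code built for the -x level.
double sum_baseline(const std::vector<double>& v) {
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}

// "Wide" path: stand-in for the -ax code path, written with four
// independent partial sums as a vectorizer might arrange them.
double sum_wide(const std::vector<double>& v) {
    double lane[4] = {0.0, 0.0, 0.0, 0.0};
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4)
        for (std::size_t k = 0; k < 4; ++k) lane[k] += v[i + k];
    double total = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < v.size(); ++i) total += v[i];  // leftover elements
    return total;
}

// Run-time feature check (assumption: GCC or Clang on x86; otherwise
// the sketch conservatively reports "no AVX").
bool cpu_has_avx() {
#if (defined(__GNUC__) || defined(__clang__)) && !defined(__INTEL_COMPILER) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;
#endif
}

// Choose the execution path once, based on the detected support.
SumFn select_sum(bool has_avx) { return has_avx ? sum_wide : sum_baseline; }

double sum_dispatched(const std::vector<double>& v) {
    SumFn fn = select_sum(cpu_has_avx());
    assert(fn != nullptr);
    return fn(v);
}
```

With -axAVX -xSSE4.2 the compiler produces both variants and the check itself, so none of this has to be written by hand.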
Optimizing the Code
The -vec-report flag generates diagnostic information about vectorization on stdout. It takes an optional level from 0 to 5 (e.g., -vec-report0), with 0 disabling diagnostics and 5 providing the most detailed diagnostics. The report shows which loops were vectorized, which were not, and why. The output can be useful to identify possible strategies to get a loop to vectorize.
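As an illustration of what such a report typically distinguishes, the sketch below contrasts a loop with independent iterations and a loop with a loop-carried dependency. The function names scale_add and running_sum are invented for this example, and the exact report wording depends on the compiler version, so none is quoted here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: each y[i] depends only on x[i] and y[i],
// which makes this a typical candidate for auto-vectorization.
void scale_add(std::vector<float>& y, const std::vector<float>& x, float a) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = a * x[i] + y[i];
}

// Loop-carried dependency: iteration i needs the result of iteration
// i - 1, so the compiler usually cannot vectorize this loop as written.
void running_sum(std::vector<float>& v) {
    for (std::size_t i = 1; i < v.size(); ++i)
        v[i] += v[i - 1];
}
```

Compiling a file with these two functions at -O2 and a non-zero -vec-report level is a quick way to see how the diagnostics read on a known-good and a known-problematic loop.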
Guided Auto Parallelization
The GAP feature can help analyze source code and generate advice on how to obtain better performance. In particular, GAP will suggest code changes or compiler options that will lead to better vectorized code. GAP may optionally allow the user to take advantage of the auto-parallelization capability that can generate multithreaded code for independent loop iterations; however, developers are encouraged to use explicit thread parallelism through mechanisms like OpenMP.
The GAP feature can be accessed by adding the -guide option (optional parameter =1 … =4). The report will be printed to stderr or it can be redirected to a file with the -guide-file=filename option, or -guide-file-append=filename. The GAP analysis can be targeted to a specific file, function, or source line with the -guide-opt=specification option.
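The explicit thread parallelism recommended above, as opposed to compiler auto-parallelization, might look like the following minimal OpenMP sketch. The function name dot is hypothetical. OpenMP has to be enabled at compile time (for example -fopenmp with GCC; the Intel compiler has its own OpenMP switch); without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product with the loop iterations explicitly distributed over threads.
// The reduction clause gives each thread a private partial sum that is
// combined into `sum` at the end of the loop.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Stating the parallelism in the source keeps the behaviour independent of whether the compiler's analysis happens to prove the iterations independent.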
C++ 11
Watch out for mixing ABIs when linking against libraries that were compiled with GCC older than 5.x, as they do not use the modern ABI (C++11 ABI). Beginning with GCC 5.x, the modern ABI is the default. More info: https://gcc.gnu.org/onlinedocs/libst..._dual_abi.html The modern ABI forbids copy-on-write for std::string and requires std::list to keep track of its size. If your app is multithreaded and does lots of string manipulation, then -D_GLIBCXX_USE_CXX11_ABI=0 is a good approach for winning back some lost performance.
When upgrading compilers in horrendous legacy applications, -D_GLIBCXX_USE_CXX11_ABI=0 is often a must to begin with.
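When it is unclear which ABI a given build actually uses, the macro can be inspected at compile time. The sketch below assumes libstdc++, which defines _GLIBCXX_USE_CXX11_ABI once any standard header has been included; the function name libstdcxx_abi and the returned labels are made up for this illustration.

```cpp
#include <cassert>
#include <string>

// Report which libstdc++ ABI this translation unit is compiled against.
const char* libstdcxx_abi() {
#if !defined(_GLIBCXX_USE_CXX11_ABI)
    return "unknown";  // not libstdc++, or a release without the dual ABI
#elif _GLIBCXX_USE_CXX11_ABI
    return "cxx11";    // modern ABI (default since GCC 5.x)
#else
    return "old";      // built with -D_GLIBCXX_USE_CXX11_ABI=0
#endif
}
```

Printing this value from the application and from a small test program built with each library's flags can help locate a mismatch before the linker reports undefined std::__cxx11 symbols.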