Applying Optimization Strategies

Depending on the loop, the compiler may apply any of the following optimizations: Interchange, Unrolling, Cache Blocking, and Load Pair. The following sections discuss these transformations, including how to transform loops manually and how to control them with directives or internal options.

Loop Interchange

Loop interchange is a nested-loop transformation applied by HLO that swaps the order of execution of two nested loops. It is typically done to provide sequential, unit-stride access to the array elements used inside the loop, which improves cache locality. At the -O3 (Linux*) or /O3 (Windows*) optimization level, the compiler looks for opportunities to apply loop interchange for you.
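
As a minimal sketch of the idea, assuming a row-major C array, the two functions below compute the same sum with the loops in opposite order. The names sum_before_interchange and sum_after_interchange and the size N are hypothetical, chosen only for illustration; they are not part of any compiler interface.

```c
#include <stddef.h>

#define N 256  /* hypothetical array dimension, for illustration only */

/* Before interchange: the inner loop varies the row index i, so consecutive
   iterations touch elements N doubles apart (non-unit stride in row-major C). */
double sum_before_interchange(double a[N][N]) {
    double sum = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

/* After interchange: the inner loop varies the column index j, so consecutive
   iterations touch adjacent elements in memory (unit stride). */
double sum_after_interchange(double a[N][N]) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```

Both versions visit every element exactly once; only the visiting order differs. With general floating-point data the order of the additions changes, so the two results may differ in the last bits.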

Cache Blocking

Cache blocking involves structuring data blocks so that they conveniently fit into a portion of the L1 or L2 cache. By controlling data cache locality, an application can minimize performance delays due to memory bus access. The application controls this behavior by dividing a large array into smaller blocks of memory (tiles) so that a thread can make repeated accesses to that data while it is still in cache. For example, image processing and video applications are well suited to cache blocking techniques because an image can be processed on smaller portions of the total image or video frame. Compilers often use the same technique, by grouping related blocks of instructions close together so they execute from the L2 cache.

Cache blocking is applied at HLO and is used on large arrays where the arrays can’t all fit into cache at once. This is one way of pulling a subset of data into cache (in a small region) and using this cached data as effectively as possible before the data is replaced by new data from memory.

Blocking factors differ between architectures, so determine them experimentally. The data type matters as well: for example, single-precision and double-precision data would require different blocking factors. The overall impact on performance can be significant.
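
The tiling described above can be sketched with a matrix transpose, a kernel in which one of the two arrays is always accessed with a large stride. This is an illustrative sketch, not the compiler's own transformation: the function names are hypothetical, and the blocking factor is passed as a parameter precisely because a good value has to be found by experiment.

```c
#include <stddef.h>

/* Straightforward transpose of an n x n row-major matrix:
   dst is written with stride n, so large matrices thrash the cache. */
void transpose_naive(const double *src, double *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Blocked transpose: process one block x block tile at a time so the
   touched parts of src and dst can stay in cache while the tile is handled.
   `block` is a hypothetical tuning parameter and must be greater than 0. */
void transpose_blocked(const double *src, double *dst, size_t n, size_t block) {
    for (size_t ii = 0; ii < n; ii += block) {
        for (size_t jj = 0; jj < n; jj += block) {
            /* Clamp the tile edges so n need not be a multiple of block. */
            size_t i_end = (ii + block < n) ? ii + block : n;
            size_t j_end = (jj + block < n) ? jj + block : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Because a float occupies half the space of a double, a tile with the same number of elements uses half as much cache, which is one reason the best blocking factor changes with precision.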

Load Pair (Itanium® Compiler)

Load pairs (ldfp) are instructions that load two contiguous single-precision or double-precision values from memory in a single operation. Load pairs can significantly improve performance.

Manual Loop Transformations

There might be cases where manual loop transformations are acceptable or even preferred. As a general rule, however, you should let the compiler transform loops for you. Transform loops manually only as a last resort, and only when you are trying to gain performance.

Manual loop transformations have many disadvantages, which include the following:

Application code becomes harder to maintain over time.
New compiler features can cause you to lose any optimization you gain by manually transforming the loop.
Architectural requirements might restrict your code to a specific architecture unintentionally.

The HLO report can give you an idea of what loop transformations have been applied by the compiler.

Experimentation is a critical component in manually transforming loops. You might try to apply a loop transformation that the compiler ignored. Sometimes, it is beneficial to apply a manual loop transformation that the compiler has already applied with -O3 (Linux) or /O3 (Windows).
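
One simple way to run such an experiment is to time the original and the manually transformed loop under the same compiler options. The sketch below assumes the standard C clock() function is precise enough for the kernel being measured; seconds_per_call, kernel_original, kernel_transformed, and the size N are hypothetical names introduced only for this example.

```c
#include <time.h>

#define N 512  /* hypothetical problem size, for illustration only */

static double grid[N][N];
static volatile double sink;  /* keeps the compiler from discarding the sums */

/* Original loop nest: inner loop walks down a column (stride N). */
void kernel_original(void) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += grid[i][j];
    sink = sum;
}

/* Manually transformed loop nest: inner loop walks along a row (unit stride). */
void kernel_transformed(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j];
    sink = sum;
}

/* Run `kernel` reps times (reps must be > 0) and return the mean
   CPU time per call in seconds, as measured by clock(). */
double seconds_per_call(void (*kernel)(void), int reps) {
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        kernel();
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Calling seconds_per_call(kernel_original, reps) and seconds_per_call(kernel_transformed, reps) with and without -O3 (Linux) or /O3 (Windows) shows whether the manual version still pays off once the compiler applies its own transformations. Repeat the measurement several times, because a single timing can be noisy.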