Generate the High-Level Optimization (HLO) report by entering a command similar to the following:
Platform | Example Command
---|---
Linux* | `ifort -opt-report -opt-report-phase hlo -O3 a.f b.f`
Windows* | `ifort /Qopt-report /Qopt-report-phase hlo /O3 a.f b.f`
See Optimizer Report Generation for more information about options you can use to generate reports.
HLO performs the following loop-level transformations:

- Data transformation
- Inline cabs
- HLO framework (loop recovery, data dependency test)
- Combine malloc (also memset)
- Distribution, interchange
- Fusion
- Predicate optimization
- Block
- Unroll and jam
- Scalar replacement
- Data prefetch
- LoadPair
- Runtime data dependence checking
- Loop reversal
- Profile-guided unrolling
- Loop peeling
The HLO report provides information on all of the listed areas plus structure splitting and loop-carried scalar replacement.
The following is an example of the HLO report for a matrix multiply program:
Example |
---|
multiply_d_lx.c HLO REPORT LOG OPENED |
These report results show the following:

- Two cache lines were prefetched at a distance of 74 loop iterations ahead of their use. The prefetch instruction corresponds to line 15 of the source code.
- The compiler unrolled the loop at line 15 by a factor of 8.
Manual optimization techniques, such as manual cache blocking, should generally be avoided and used only as a last resort.
The HLO report tells you explicitly which loop transformations the compiler performed. If the report does not mention a given transformation, that omission may indicate an optimization opportunity the developer can still pursue. Several of these transformations can be applied manually by the developer, or at least controlled through compiler switches. They are described in the following table:
Transformation | Description
---|---
Distribution | Distribute, or split, one large loop into two smaller loops. This may be advantageous when too many registers are being consumed in a single large loop.
Interchange | Swap the order of execution of two nested loops to gain a cache-locality or unit-stride access performance advantage.
Fusion | Fuse two smaller loops with the same trip count into one loop to improve data locality.
Block | Cache blocking arranges a loop so that it performs as many computations as possible on data already residing in cache. The next "block" of data is not read into cache until all computations with the first block are finished.
Unroll & Jam | Unrolling partially disassembles a loop structure so that fewer iterations of the loop are required, at the expense of each iteration being larger. It can be used to hide instruction and data latencies, to take advantage of floating-point loadpair instructions, and to increase the ratio of real work done per memory operation.
Prefetch | Requests that data be brought from relatively slow memory into a faster cache several loop iterations ahead of when the data is actually needed.
LoadPair | Uses an instruction that brings two floating-point data elements in from memory at a time.