Software-based Speculative Precomputation (IA-32)

Software-based Speculative Precomputation (SSP), also known as helper-threading optimization, improves the performance of some programs by creating helper threads that prefetch data dynamically. SSP can be effective for programs whose performance is dominated by data-cache misses, typically caused by pointer-chasing loops. Prefetching data reduces the impact of the long latency of main-memory accesses. To be effective, the resulting code must run on hardware with a shared data cache and multi-threading capabilities, such as Intel® Pentium® 4 Processors with Hyper-Threading Technology.
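As an illustration of the kind of code SSP targets, the following C sketch shows a pointer-chasing loop. C is used here only because it expresses pointer chasing compactly; the node type and the function names build_list, sum_list, and free_list are hypothetical and do not come from the compiler documentation. Each iteration must load p->next before the address of the next node is known, so a cache miss on one node stalls the whole traversal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list node; on a real heap the nodes may be scattered in memory. */
typedef struct node {
    struct node *next;
    long value;
} node;

/* Build a list holding the values 1..n and return its head. */
node *build_list(size_t n) {
    node *head = NULL;
    for (size_t i = n; i > 0; --i) {
        node *p = malloc(sizeof *p);
        assert(p != NULL);
        p->value = (long)i;
        p->next = head;
        head = p;
    }
    return head;
}

/* Pointer-chasing loop: the address of each load depends on the previous load. */
long sum_list(const node *p) {
    long sum = 0;
    while (p != NULL) {
        sum += p->value;  /* this load can miss the data cache */
        p = p->next;      /* the next address is unknown until this load completes */
    }
    return sum;
}

/* Release every node of the list. */
void free_list(node *p) {
    while (p != NULL) {
        node *next = p->next;
        free(p);
        p = next;
    }
}
```

For a list built with build_list(100), sum_list returns 5050. When the nodes do not fit in the data cache, most of the traversal time is spent waiting for memory rather than adding values.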

SSP Behavior

SSP is available only in the IA-32 Compiler. SSP directly executes a subset (or slice) of the original program instructions on separate helper threads in parallel with the main computation thread. The helper threads run ahead of the main thread, compute the addresses of future memory accesses, and trigger early cache misses. This behavior hides the memory latency from the main thread.
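The helper-thread idea can be imitated by hand. The sketch below is an assumption-laden illustration written with POSIX threads, not the code the compiler generates: the names helper_args, helper_slice, and sum_with_helper are hypothetical. The helper executes only the slice needed to compute future addresses (following next pointers), while the main thread performs the full computation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct node {
    struct node *next;
    long value;
} node;

/* Build a list holding the values 1..n and return its head. */
node *build_list(size_t n) {
    node *head = NULL;
    for (size_t i = n; i > 0; --i) {
        node *p = malloc(sizeof *p);
        assert(p != NULL);
        p->value = (long)i;
        p->next = head;
        head = p;
    }
    return head;
}

/* Release every node of the list. */
void free_list(node *p) {
    while (p != NULL) {
        node *next = p->next;
        free(p);
        p = next;
    }
}

/* Arguments for the hypothetical helper thread. */
typedef struct {
    const node *start;  /* where the slice begins */
    size_t visited;     /* written by the helper, read only after the join */
} helper_args;

/* The slice: only the address computation, none of the main thread's work. */
void *helper_slice(void *arg) {
    helper_args *a = arg;
    for (const node *p = a->start; p != NULL; p = p->next) {
        a->visited++;   /* loading p->next is what pulls the node into the cache */
    }
    return NULL;
}

/* Main computation, with the helper started beforehand. */
long sum_with_helper(const node *head, size_t *helper_visited) {
    helper_args args = { head, 0 };
    pthread_t tid;
    /* The helper is speculative: if it cannot start, the result is unchanged. */
    int started = (pthread_create(&tid, NULL, helper_slice, &args) == 0);

    long sum = 0;
    for (const node *p = head; p != NULL; p = p->next) {
        sum += p->value;
    }

    if (started) {
        pthread_join(tid, NULL);
    }
    if (helper_visited != NULL) {
        *helper_visited = args.visited;
    }
    return sum;
}
```

The list is read-only while both threads run, so no locking is needed. This sketch does nothing to keep the helper a bounded distance ahead of the main thread or to pin both threads to logical processors that share a cache; a real SSP implementation has to schedule the helper threads at run time. On some systems the program must be built with the -pthread option.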

To enable SSP, use the -ssp (Linux*) or /Qssp (Windows*) command-line option. Before SSP can be used, you must generate profile feedback information by running the application with a representative set of data. See profrun Utility for more information.

When invoked with SSP, the compiler moves through the following stages:

Delinquent load identification: The compiler identifies the top cache-missing loads, which are known as delinquent loads, by examining the feedback information.
Loop selection: The compiler identifies regions of code within which speculative loads will be useful. Delinquent loads typically occur within a heavily traversed loop nest.
Program slicing: Within each region, the compiler looks at each delinquent load and identifies the slices of code required to compute the addresses of the delinquent loads.
Helper thread code generation: The compiler generates code for the slices. Additionally, the compiler generates code to invoke and schedule the helper threads at run-time.
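The first stage above can be pictured as a ranking problem. The following C sketch is not the compiler's algorithm: the load_profile record, the select_delinquent function, and the coverage threshold are all hypothetical. It sorts loads by the number of cache misses attributed to them and keeps the smallest leading group that accounts for a chosen percentage of all misses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-load record derived from profile feedback. */
typedef struct {
    int load_id;           /* identifier of a static load instruction */
    unsigned long misses;  /* cache misses attributed to that load */
} load_profile;

/* qsort comparator: larger miss counts first. */
static int by_misses_desc(const void *a, const void *b) {
    const load_profile *x = a;
    const load_profile *y = b;
    if (x->misses < y->misses) return 1;
    if (x->misses > y->misses) return -1;
    return 0;
}

/*
 * Sort loads in place by descending miss count and return how many of the
 * leading entries are needed to cover at least `percent` of all misses.
 */
size_t select_delinquent(load_profile *loads, size_t n, unsigned percent) {
    assert(percent <= 100);
    if (n == 0) {
        return 0;
    }

    unsigned long total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += loads[i].misses;
    }

    qsort(loads, n, sizeof *loads, by_misses_desc);

    unsigned long covered = 0;
    size_t k = 0;
    while (k < n && covered * 100 < total * percent) {
        covered += loads[k].misses;
        ++k;
    }
    return k;
}
```

With miss counts of 10, 700, 40, and 250 for loads 1 to 4 and a threshold of 90 percent, the function returns 2 and places loads 2 and 4 first: together they account for 950 of the 1000 misses. Only such top-ranked loads would be worth the cost of slicing and helper-thread generation.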

Caution

Using SSP in conjunction with profiling and interprocedural optimization can degrade the performance of some programs. Experiment with SSP and closely watch the effect of SSP on the target applications before deploying applications using these techniques. Using SSP may also increase compile time.

Using SSP Optimization

SSP optimization requires several steps; the following procedure shows a typical use.

The example assumes three source files, a1.f, a2.f, and a3.f, which are compiled into an executable named go (Linux) or go.exe (Windows).

Create an instrumented executable by compiling the application with the -prof-gen (Linux) or /Qprof-gen (Windows) option, as shown in the examples below:

Platform	Command Examples
Linux	ifort -prof-gen a1.f a2.f a3.f -o go
Windows	ifort /Qprof-gen a1.f a2.f a3.f /Fego

For more information about the option used in this step, see the following topic:

- -prof-gen compiler option

Generate dynamic profile information by running the instrumented program with a representative set of data.

Platform	Command Examples
Linux	go
Windows	go.exe

Executing the instrumented application generates a dynamic profile information file with a .dyn suffix. You can run the program more than once with different input data. The compiler will merge all of the .dyn files into a single .dpi file during a later step.

Prepare the application for the PMU by recompiling it with both the -prof-gen-sampling and -prof-use (Linux) or /Qprof-gen-sampling and /Qprof-use (Windows) options. This produces an executable that can gather information from the hardware Performance Monitoring Unit (PMU). The following command examples show how to combine the options during compilation:

Platform	Command Examples
Linux	ifort -prof-gen-sampling -prof-use -O3 -ipo a1.f a2.f a3.f -o go
Windows	ifort /Qprof-gen-sampling /Qprof-use /O3 /Qipo a1.f a2.f a3.f /Fego

For more information about the options used in this step, see the following topics:

- -prof-gen-sampling and -prof-use compiler options

Run the application again under the profrun utility, using a representative set of data, to create a file with hardware profile information, including delinquent load information.

Platform	Command Examples
Linux	profrun -dcache go
Windows	profrun -dcache go.exe

This step executes the application and generates a file containing hardware profile information; the file resides in the local directory and has a .hpi suffix. You can run the program more than once with different input data. The hardware profile information for all runs will be merged automatically.

Compile the application a final time using both the -prof-use and -ssp (Linux) or /Qprof-use and /Qssp (Windows) options to produce an executable with SSP optimization enabled. The following command examples show how to combine the options during compilation:

Platform	Command Examples
Linux	ifort -prof-use -ssp -O3 -ipo a1.f a2.f a3.f -o go
Windows	ifort /Qprof-use /Qssp /O3 /Qipo a1.f a2.f a3.f /Fego

For more information about the -ssp (Linux) or /Qssp (Windows) option used in this step, see the following topic in Compiler Options:

- -ssp compiler option

The final step compiles and links the source files with SSP, using the feedback from the instrumented execution phase and the cache miss information from the profiling execution phase.

See Profile-guided Optimizations Overview.
