Software-based Speculative Precomputation (IA-32)

Software-based Speculative Precomputation (SSP), also known as helper-threading optimization, is a technique that improves the performance of some programs by creating helper threads that prefetch data dynamically. SSP can be effective for programs whose performance is dominated by data-cache misses, typically caused by pointer-chasing loops. Prefetching data reduces the impact of the long latency of main-memory accesses. To be effective, the resulting code must run on hardware with a shared data cache and multi-threading capabilities, such as Intel® Pentium® 4 processors with Hyper-Threading Technology.

SSP Behavior

SSP is available only in the IA-32 Compiler. SSP directly executes a subset (or slice) of the original program instructions on separate helper threads in parallel with the main computation thread. The helper threads run ahead of the main thread, compute the addresses of future memory accesses, and trigger early cache misses. This behavior hides the memory latency from the main thread.

The command-line option that enables SSP is -ssp (Linux*) or /Qssp (Windows*). Use SSP only after generating profile feedback information by running the application with a representative set of data. See profrun Utility for more information.

When invoked with SSP, the compiler moves through the following stages:

Caution

Using SSP in conjunction with profiling and interprocedural optimization can degrade the performance of some programs. Experiment with SSP and closely watch the effect of SSP on the target applications before deploying applications using these techniques. Using SSP may also increase compile time.

Using SSP Optimization

SSP optimization requires several steps. The following procedure shows a typical workflow.

This example assumes three source files, a1.f, a2.f, and a3.f, which are compiled into an executable named go (Linux) or go.exe (Windows).

  1. Create instrumented code by compiling the application with the -prof-gen (Linux) or /Qprof-gen (Windows) option, as shown in the following examples:

Platform    Command Example
Linux       ifort -prof-gen a1.f a2.f a3.f -o go
Windows     ifort /Qprof-gen a1.f a2.f a3.f /Fego

  2. Generate dynamic profile information by running the instrumented program with a representative set of data. This creates a dynamic profile information file.

Platform    Command Example
Linux       go
Windows     go.exe

Executing the instrumented application generates a dynamic profile information file with a .dyn suffix. You can run the program more than once with different input data. The compiler will merge all of the .dyn files into a single .dpi file during a later step.

  3. Prepare the application for the PMU by recompiling it with both the -prof-gen-sampling and -prof-use (Linux) or /Qprof-gen-sampling and /Qprof-use (Windows) options. This produces an executable that can gather information from the hardware Performance Monitoring Unit (PMU). The following command examples show how to combine the options during compilation:

Platform    Command Example
Linux       ifort -prof-gen-sampling -prof-use -O3 -ipo a1.f a2.f a3.f -o go
Windows     ifort /Qprof-gen-sampling /Qprof-use /O3 /Qipo a1.f a2.f a3.f /Fego

  4. Run the application again under the profrun utility, with a representative set of data, to create a file with hardware profile information, including delinquent load information.

Platform    Command Example
Linux       profrun -dcache go
Windows     profrun -dcache go.exe

This step executes the application and generates a file containing hardware profile information. The file resides in the local directory and has a .hpi suffix. You can run the program more than once with different input data. The hardware profile information for all runs is merged automatically.

  5. Compile the application a final time using both the -prof-use and -ssp (Linux) or /Qprof-use and /Qssp (Windows) options to produce an executable with SSP optimization enabled. The following command examples show how to combine the options during compilation:

Platform    Command Example
Linux       ifort -prof-use -ssp -O3 -ipo a1.f a2.f a3.f -o go
Windows     ifort /Qprof-use /Qssp /O3 /Qipo a1.f a2.f a3.f /Fego

For more information about the -ssp (Linux) or /Qssp (Windows) option used in this step, see Compiler Options.

The final step compiles and links the source files with SSP, using the feedback from the instrumented execution phase and the cache miss information from the profiling execution phase.
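The five steps above can be collected into one script. The following is a minimal dry-run sketch for Linux, not part of the compiler tooling: it only prints the commands from the procedure in order, so it can be reviewed without ifort or profrun installed. The function names ssp_steps and have_profile are hypothetical helpers introduced for this illustration, and the ./ prefix on the instrumented run assumes the current directory is not on PATH.

```shell
#!/bin/sh
# Dry-run sketch of the five-step SSP build on Linux.
# SOURCES and EXE mirror the example in the procedure above.
# ssp_steps and have_profile are hypothetical helper names.

SOURCES="a1.f a2.f a3.f"
EXE="go"

# Print the command for each step, one per line, in order.
ssp_steps() {
    # Step 1: build an instrumented executable.
    echo "ifort -prof-gen $SOURCES -o $EXE"
    # Step 2: run it to produce .dyn files
    # (./ prefix is an assumption: current directory not on PATH).
    echo "./$EXE"
    # Step 3: rebuild so the executable can gather PMU information.
    echo "ifort -prof-gen-sampling -prof-use -O3 -ipo $SOURCES -o $EXE"
    # Step 4: run under profrun to produce the .hpi file.
    echo "profrun -dcache $EXE"
    # Step 5: final build with SSP enabled.
    echo "ifort -prof-use -ssp -O3 -ipo $SOURCES -o $EXE"
}

# Succeed if at least one file with the given suffix (for example
# dyn or hpi) exists in the current directory.
have_profile() {
    for f in ./*."$1"; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Print the plan for review.
ssp_steps
```

On a machine with the compiler installed, a guard such as `have_profile hpi || exit 1` before the final compile would catch a missing hardware profile early, since step 4 writes the .hpi file to the local directory.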

See Profile-guided Optimizations Overview.