Software-based Speculative Precomputation (SSP), also known as helper-threading optimization, is a technique that improves the performance of some programs by creating helper threads that prefetch data dynamically. SSP can be effective for programs whose performance is dominated by data-cache misses, typically caused by pointer-chasing loops. Prefetching data reduces the impact of the long latency of main-memory accesses. To be effective, the resulting code must run on hardware with a shared data cache and multi-threading capabilities, such as Intel® Pentium® 4 processors with Hyper-Threading Technology.
SSP is available only in the IA-32 Compiler. SSP directly executes a subset (or slice) of the original program instructions on separate helper threads in parallel with the main computation thread. The helper threads run ahead of the main thread, compute the addresses of future memory accesses, and trigger early cache misses. This behavior hides the memory latency from the main thread.
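The idea can be illustrated with a hand-written analogue in C. This is only a sketch of the concept, not the code the compiler generates: the names `node`, `build_list`, `helper_slice`, and `sum_with_helper` are hypothetical, and the sketch assumes POSIX threads and the GCC/Clang `__builtin_prefetch` builtin. The helper thread executes only the pointer-chasing part of the loop, so it takes the cache misses that the main thread would otherwise take.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical linked-list node used only for this illustration. */
typedef struct node {
    struct node *next;
    long payload;
} node;

/* Release every node of the list. */
void free_list(node *head) {
    while (head != NULL) {
        node *next = head->next;
        free(head);
        head = next;
    }
}

/* Build a list with payloads 1..n; returns NULL if n == 0 or allocation fails. */
node *build_list(size_t n) {
    node *head = NULL;
    for (size_t i = n; i > 0; --i) {
        node *p = malloc(sizeof *p);
        if (p == NULL) {
            free_list(head);
            return NULL;
        }
        p->payload = (long)i;
        p->next = head;
        head = p;
    }
    return head;
}

/* The "slice": only the instructions needed to reach future addresses.
   Walking the list touches each node, and the prefetch hints at the next one.
   __builtin_prefetch is a GCC/Clang builtin and does not fault on NULL. */
static void *helper_slice(void *arg) {
    for (const node *p = arg; p != NULL; p = p->next) {
        __builtin_prefetch(p->next, 0 /* read */, 3 /* keep in cache */);
    }
    return NULL;
}

/* Main computation: sums the payloads while the helper walks the same list.
   Both threads only read the list, so no locking is needed. */
long sum_with_helper(const node *head) {
    pthread_t helper;
    int started = pthread_create(&helper, NULL, helper_slice, (void *)head) == 0;

    long sum = 0;
    for (const node *p = head; p != NULL; p = p->next) {
        sum += p->payload;
    }

    if (started) {
        pthread_join(helper, NULL);
    }
    return sum;
}
```

Calling `sum_with_helper(build_list(1000))` returns the same sum as a plain loop; the helper thread affects only timing, never the result. Any benefit depends on the two threads sharing a data cache, as described above, and on the helper staying ahead of the main thread.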
The command-line option that enables SSP is -ssp (Linux*) or /Qssp (Windows*). Use SSP only after generating profile feedback information by running the application with a representative set of data. See profrun Utility for more information.
When invoked with SSP, the compiler moves through the following stages:
Delinquent load identification: The compiler identifies the top cache-missing loads, which are known as delinquent loads, by examining the feedback information.
Loop selection: The compiler identifies regions of code within which speculative loads will be useful. Delinquent loads typically occur within a heavily traversed loop nest.
Program slicing: Within each region, the compiler looks at each delinquent load and identifies the slices of code required to compute the addresses of the delinquent loads.
Helper thread code generation: The compiler generates code for the slices. Additionally, the compiler generates code to invoke and schedule the helper threads at run-time.
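As a rough illustration of the slicing stage, the sketch below contrasts a loop with the slice that computes the address of its delinquent load. It is a hand-written example under stated assumptions, not compiler output: the functions `weighted_sum` and `address_slice` are hypothetical, the load of `table[index[i]]` is simply assumed to be the one the profile flagged, and `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <stddef.h>

/* Selected region: a loop whose load of table[index[i]] is assumed to be
   the delinquent load identified from the profile feedback. */
double weighted_sum(const double *table, const size_t *index,
                    const double *weight, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double v = table[index[i]];  /* delinquent load */
        acc += v * weight[i];        /* not needed to compute the address */
    }
    return acc;
}

/* Slice of the loop above: it keeps only the instructions that compute the
   address of the delinquent load (the loop counter and index[i]) and drops
   the arithmetic on weight[] and acc. Returns the number of prefetches issued. */
size_t address_slice(const double *table, const size_t *index, size_t n) {
    size_t issued = 0;
    for (size_t i = 0; i < n; ++i) {
        __builtin_prefetch(&table[index[i]], 0 /* read */, 3 /* keep in cache */);
        ++issued;
    }
    return issued;
}
```

In the final stage, code like `address_slice` would run on a helper thread ahead of `weighted_sum`. Because the slice omits the floating-point work, it can in principle advance through the loop faster than the main thread.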
Caution
Using SSP in conjunction with profiling and interprocedural optimization can degrade the performance of some programs. Experiment with SSP and closely watch the effect of SSP on the target applications before deploying applications using these techniques. Using SSP may also increase compile time.
SSP optimization requires several steps. The following procedure shows a typical use of SSP optimization.
For the following example, assume that you have the following source files: a1.f, a2.f and a3.f, which will be compiled into an executable named go (Linux) or go.exe (Windows).
Create instrumented code by compiling the application with the -prof-gen (Linux) or /Qprof-gen (Windows) option, as shown in the examples below:
Platform | Command Examples |
---|---|
Linux | ifort -prof-gen a1.f a2.f a3.f -o go |
Windows | ifort /Qprof-gen a1.f a2.f a3.f /Fego |
For more information about the option used in this step, see the following topic:
-prof-gen compiler option
Run the instrumented program with a representative set of data to create a dynamic profile information file.
Platform | Command Examples |
---|---|
Linux | go |
Windows | go.exe |
Executing the instrumented application generates a dynamic profile information file with a .dyn suffix. You can run the program more than once with different input data. The compiler will merge all of the .dyn files into a single .dpi file during a later step.
Prepare the application for the PMU by recompiling it with both the -prof-gen-sampling and -prof-use (Linux) or /Qprof-gen-sampling and /Qprof-use (Windows) options. This produces an executable that can gather information from the hardware Performance Monitoring Unit (PMU). The following command examples show how to combine the options during compilation:
Platform | Command Examples |
---|---|
Linux | ifort -prof-gen-sampling -prof-use -O3 -ipo a1.f a2.f a3.f -o go |
Windows | ifort /Qprof-gen-sampling /Qprof-use /O3 /Qipo a1.f a2.f a3.f /Fego |
For more information about the options used in this step, see the following topics:
-prof-gen-sampling and -prof-use compiler options
Run the application again under the profrun utility, with a representative set of data, to create a file with hardware profile information, including delinquent load information.
Platform | Command Examples |
---|---|
Linux | profrun -dcache go |
Windows | profrun -dcache go.exe |
This step executes the application and generates a file containing hardware profile information. The file resides in the local directory and has a .hpi suffix. You can run the program more than once with different input data. The hardware profile information for all runs will be merged automatically.
Compile the application a final time using both the -prof-use and -ssp (Linux) or /Qprof-use and /Qssp (Windows) options to produce an executable with SSP optimization enabled. The following command examples show how to combine the options during compilation:
Platform | Command Examples |
---|---|
Linux | ifort -prof-use -ssp -O3 -ipo a1.f a2.f a3.f -o go |
Windows | ifort /Qprof-use /Qssp /O3 /Qipo a1.f a2.f a3.f /Fego |
For more information about the -ssp (Linux) or /Qssp (Windows) option used in this step, see the following topic in Compiler Options:
-ssp compiler option
The final step compiles and links the source files with SSP, using the feedback from the instrumented execution phase and the cache miss information from the profiling execution phase.