Applying Performance Enhancement Strategies

Improving performance starts with identifying the characteristics of the application you are attempting to optimize. The following table lists some common application characteristics, indicates the overall potential performance impact you can expect, and provides suggested solutions to try. The key to using these strategies is experimentation.

In the context of this topic, view the potential impact categories as an indication of the possible performance increase that can be achieved when using the suggested strategy. Application or code design issues may prevent you from achieving the indicated increases; however, the listed impacts are generally accurate. The impact categories used in the tables below are Significant, High, Medium, and Low, ranked by the performance increase you can expect when compared to the initially tested performance.

The following table is ordered by application characteristics and then by strategy with the most significant potential impact.

Application Characteristics

Impact

Suggested Strategies

Technical Applications

Technical applications with loopy code

High

Technical applications are those programs that have some subset of functions that consume a majority of total CPU cycles in loop nests.

Target loop nests using -O3 (Linux*) or /O3 (Windows*) to enable more aggressive loop transformations and prefetching.

Use High-Level Optimization (HLO) reporting to determine which HLO optimizations the compiler chose to apply.

See High-Level Optimization Report.

(same as above)
Itanium® Only

High

For -O2 and -O3 (Linux) or /O2 and /O3 (Windows), use the SWP report to determine if Software Pipelining occurred on key loops, and if not, why not.

You might be able to change the code to allow pipelining under the following conditions:

  • If recurrences are listed in the report that you suspect do not exist, eliminate aliasing problems (for example, using restrict), or use IVDEP on the loop.

  • If the loop is too large or runs out of registers, distributing the loop into smaller segments might allow pipelining; distribute the loop manually or by using the distribute directive.

  • If the compiler determines the Global Acyclic Scheduler can produce better results but you think the loop should still be pipelined, use the SWP directive on the loop.

(same as above)
IA-32 and Intel® EM64T Only

High

See Vectorization Overview and the remaining topics in the Auto-Vectorization section for applicable options.

See Vectorization Reports (IA-32) for specific details about when you can change code.

(same as above)

Medium

Use a PGO profile to guide other optimizations.

See Profile-guided Optimizations Overview.

Applications with many denormalized floating-point value operations

Significant

Attempt to use flush-to-zero where appropriate.

Decide whether a high degree of precision is necessary; if it is not, using flush-to-zero might help. Denormalized floating-point values are those that are too small to be represented in the normal manner; that is, the mantissa cannot be left-justified. Handling these values requires hardware or operating system intervention. Using flush-to-zero causes denormal numbers to be treated as zero by the hardware.

Remove floating-point denormals using some common strategies:

  • Change the data type to a larger data type.

  • Depending on the target architecture, use flush-to-zero or vectorization options.

IA-32 and Intel® EM64T:

  • Flush-to-zero mode is enabled by default for SSE2 instructions. The Intel® EM64T compiler generates SSE2 instructions by default. Enable SSE2 instructions in the IA-32 compiler by using -xW, -xN, -xB, or -xP (Linux) or /QxW, /QxN, /QxB, or /QxP (Windows).

  • See Vectorization Support for more information.

Itanium®:

  • The most common, easiest flush-to-zero strategy is to use the -ftz (Linux) or /Qftz (Windows) option on the source file containing PROGRAM.

  • Selecting -O3 (Linux) or /O3 (Windows) automatically enables -ftz (Linux) or /Qftz (Windows).

After using flush-to-zero, ensure that your program still gives correct results when treating denormalized values as zero.

Sparse matrix applications

Medium

See the suggested strategy for memory pointer disambiguation (below).

Use prefetch directive or prefetch intrinsics. Experiment with different prefetching schemes on indirect arrays.

See HLO Overview or Data Prefetching for starting points on using prefetching.

Server application with branch-centric code and a fairly flat profile

Medium

Flat profile applications are those applications where no single module seems to consume CPU cycles inordinately.

Use PGO to communicate typical hot paths and functions to the compiler, so the Intel® compiler can arrange code in the optimal manner.

Use PGO on as much of the application as is feasible.

See Profile-guided Optimizations Overview.

Other Application Types

Applications with many small functions that are called from multiple locations

Low

Use -ip (Linux) or /Qip (Windows) to enable inter-procedural inlining within a single source module.

Inlining streamlines code execution for simple functions by duplicating the code within the code block that originally called the function. This will increase application size.

As a general rule, do not inline large, complicated functions.

See Interprocedural Optimizations Overview.

(same as above)

Low

Use -ipo (Linux) or /Qipo (Windows) to enable inter-procedural inlining both within and between multiple source modules. You might experience an additional increase over using -ip (Linux) or /Qip (Windows).

Using this option will increase link time due to the extended program flow analysis that occurs.

Interprocedural Optimization (IPO) can perform whole program analysis to help memory pointer disambiguation.

Apart from application-specific suggestions listed above, there are many application-, OS/Library-, and hardware-specific recommendations that can improve performance as suggested in the following tables:

Application-specific Recommendations

Application Area

Impact

Suggested Strategies

Cache Blocking

High

Optimal blocking parameters differ for single-precision and double-precision data. Use -O3 (Linux) or /O3 (Windows) to enable automatic cache blocking; use the HLO report to determine whether the compiler enabled cache blocking automatically. If not, consider manual cache blocking.

See Cache Blocking.

Compiler directives for better alias analysis

Medium

Ignore vector dependencies. Use IVDEP and other directives to increase application speed.

See Vectorization Support.

Memory pointer disambiguation compiler switches

Medium

Experiment with the following compiler options:

  • -fno-fnalias (Linux)

  • -ansi-alias (Linux) or /Qansi-alias (Windows)

  • /Oa (Windows)

  • /Ow (Windows)

  • -alias-args (Linux) or /Qalias-args (Windows)

  • -safe-cray-ptr (Linux) or /Qsafe-cray-ptr (Windows)

Math functions

Low

Call Math Kernel Library (MKL) instead of user code.

Call Fortran 90 intrinsics instead of user code (to enable optimizations).

Library/OS Recommendations

Area

Impact

Description

Symbol preemption

Low

Linux has a less performance-friendly symbol preemption model than Windows: Linux uses full preemption, while Windows uses none. On Linux, use -fminshared -fvisibility=protected.

Memory allocation

Low

Using third-party memory management libraries can help improve performance for applications that require extensive memory allocation.

Hardware/System Recommendations

Component

Impact

Description

Disk

Medium

Consider using more advanced hard drive storage strategies. For example, consider using SCSI instead of IDE.

Consider using the appropriate RAID level.

Consider increasing the number of hard drives in your system.

Memory

Low

You can experience performance gains by distributing memory in a system. For example, if you have four open memory slots and only two slots are populated, populating the other two slots with memory will increase performance.

Processor

 

For many applications, performance scales with both processor speed and cache size.

Other Optimization Strategy Information

For more information on advanced or specialized optimization strategies, refer to the Intel Developer Services: Developer Centers (http://www.intel.com/cd/ids/developer/asmo-na/eng/19284.htm) web site.

Refer to the articles and links to additional resources in the listed topic areas of the following Developer Centers:

Developer Center

Topic Area

Software Technologies

  • Intel® Extended Memory 64 Technology

  • Intel Software Tools

Intel Processors & Related Technologies

  • All