This topic provides general guidelines for coding practices and techniques for using:
- IA-32 architecture supporting MMX™ technology and Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), and Streaming SIMD Extensions 3 (SSE3)
- Itanium® architecture
This section describes practices, tools, coding rules, and recommendations associated with architecture features that can improve performance on the IA-32 and Itanium processor families. For complete details about optimization for IA-32 processors, see the Intel® Architecture Optimization Reference Manual (http://developer.intel.com/design/pentiumii/manuals/245127.htm). For complete details about optimization for the Itanium processor family, see the Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization (http://developer.intel.com/design/itanium2/manuals/251110.htm).
Note
If a guideline applies to a particular architecture only, that architecture is named explicitly. Otherwise, a guideline applies to both IA-32 and Itanium architectures.
Performance of compiler-generated code may vary from one compiler to another. The Intel® Visual Fortran Compiler generates code that is highly optimized for Intel architectures. You can significantly improve performance by using various compiler optimization options. In addition, you can help the compiler optimize your Fortran program by following the guidelines described in this section.
To achieve optimum processor performance in your Fortran application, do the following:
- avoid memory access stalls
- ensure good floating-point performance
- ensure good SIMD integer performance
- use vectorization
The following sections summarize and describe the coding practices, rules, and recommendations that contribute to optimal performance on Intel architecture-based processors.
The Intel compiler lays out Fortran arrays in column-major order. For example, in a two-dimensional array, elements A(22, 34) and A(23, 34) are contiguous in memory. For best performance, code arrays so that inner loops access them in a contiguous manner. Consider the following examples.
The code in example 1 will likely have higher performance than the code in example 2.
Example 1

  DO J = 1, N
    DO I = 1, N
      B(I,J) = A(I,J) + 1
    END DO
  END DO
The code above accesses arrays A and B contiguously in the inner loop over I, which results in good performance.
Example 2

  DO I = 1, N
    DO J = 1, N
      B(I,J) = A(I,J) + 1
    END DO
  END DO
The code above accesses arrays A and B non-contiguously in the inner loop over J, which results in poor performance.
The compiler itself can transform the code so that inner loops access memory contiguously. To enable such transformations, use advanced optimization options: -O3 (Linux*) or /O3 (Windows*) for both IA-32 and Itanium architectures, plus -ax (Linux) or /Qax (Windows) on IA-32 only.
Alignment is an increasingly important factor in ensuring good performance; aligned memory accesses are faster than unaligned accesses. If you use interprocedural optimization on multiple files, the -ipo (Linux) or /Qipo (Windows) option, the compiler analyzes the code and decides whether it is beneficial to pad arrays so that they start on an aligned boundary. Multiple arrays specified in a single common block can impose extra constraints on the compiler.
For example, consider the following COMMON statement:
Example 3

  COMMON /AREA1/ A(200), X, B(200)
If the compiler added padding to align A(1) at a 16-byte aligned address, the element B(1) would not be at a 16-byte aligned address, so it is better to split AREA1 as follows.
Example 4

  COMMON /AREA1/ A(200)
  COMMON /AREA2/ X
  COMMON /AREA3/ B(200)
The above code provides the compiler maximum flexibility in determining the padding required for both A and B.
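When splitting a common block is not practical, alignment can also be requested for individual arrays. The sketch below is a hedged illustration using the !DEC$ ATTRIBUTES ALIGN directive, an Intel Fortran extension; the program and array names are illustrative, not taken from this document.

  ! Sketch: request 16-byte alignment for local arrays directly,
  ! instead of relying on COMMON-block padding.
  PROGRAM ALIGN_DEMO
    REAL :: A(200), B(200)
    !DEC$ ATTRIBUTES ALIGN : 16 :: A
    !DEC$ ATTRIBUTES ALIGN : 16 :: B
    A = 0.0
    B = A + 1.0          ! aligned loads and stores on both arrays
    PRINT *, B(1)
  END PROGRAM ALIGN_DEMO

With both arrays aligned independently, the compiler does not face the conflicting constraints that a single common block containing A, X, and B would impose.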
To improve floating-point performance, follow these rules:
Avoid exceeding representable ranges during computation, since handling such cases can have a performance impact. Use REAL variables in single-precision format unless the extra precision obtained through DOUBLE PRECISION or REAL*8 is required; variables with a larger precision format also increase memory size and bandwidth requirements.
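As a minimal sketch of this rule, the fragment below contrasts the two declaration styles; the variable names are illustrative. Use the wider format only where the algorithm actually needs the extra precision.

  PROGRAM PRECISION_DEMO
    REAL :: S              ! 4 bytes: preferred when single precision suffices
    DOUBLE PRECISION :: D  ! 8 bytes: costs extra memory size and bandwidth
    S = 1.0 / 3.0
    D = 1.0D0 / 3.0D0
    PRINT *, S, D
  END PROGRAM PRECISION_DEMO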
For IA-32 only: avoid repeatedly switching the rounding mode among more than two values, which can lead to poor performance when the computation is done using non-SSE instructions. Hence, avoid using the FLOOR and TRUNC instructions together when generating non-SSE code; the same applies to using CEIL and TRUNC together.
Another way to avoid this problem is to use the -x (Linux) or /Qx (Windows) option so that the computation is done using SSE instructions.
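When SSE code generation is not an option, one workaround (a sketch of a common technique, not something prescribed by this document) is to derive the floor from truncation in source code, so only a single rounding mode is needed:

  PROGRAM FLOOR_DEMO
    REAL :: X, F
    X = -2.5
    ! AINT truncates toward zero; adjust downward for negative
    ! non-integer values so the result equals FLOOR(X), without
    ! mixing floor- and truncate-style rounding modes.
    F = AINT(X)
    IF (X < 0.0 .AND. F /= X) F = F - 1.0
    PRINT *, F   ! floor(-2.5) is -3.0
  END PROGRAM FLOOR_DEMO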
Reduce the impact of denormal exceptions for both architectures as described below.
Floating-point computations that underflow can produce denormal values, which have an adverse impact on performance.
For IA-32: take advantage of the SIMD capabilities of Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), and Streaming SIMD Extensions 3 (SSE3) instructions.
The -x (Linux) or /Qx (Windows) options enable the flush-to-zero (FTZ) mode in SSE and SSE2 instructions, whereby underflow results are automatically converted to zero, which improves application performance. In addition, the -xP (Linux) or /QxP (Windows) option also enables the denormals-are-zero (DAZ) mode, whereby denormals are converted to zero on input, further improving performance. An application developer willing to trade pure IEEE-754 compliance for speed would benefit from these options. For more information on FTZ and DAZ, see Setting FTZ and DAZ Flags and "Floating-point Exceptions" in the Intel® Architecture Optimization Reference Manual (http://developer.intel.com/design/pentiumii/manuals/245127.htm).
For Itanium® architecture: enable flush-to-zero (FTZ) mode with the -ftz (Linux) or /Qftz (Windows) option, which is set by -O3 (Linux) or /O3 (Windows).
Many applications can significantly increase their performance through vectorization, which uses Streaming SIMD Extensions (SSE/SSE2) instructions for the main computational loops. The Intel compiler can turn vectorization on automatically (auto-vectorization), or you can request it with compiler directives. See the Auto-vectorization (IA-32 Only) section for complete details.
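As a hedged sketch, a simple unit-stride loop like the one below is a typical auto-vectorization candidate; the !DEC$ VECTOR ALWAYS directive shown is an Intel Fortran extension that asks the compiler to vectorize a loop even when its own heuristics are undecided, and the program is illustrative only.

  PROGRAM VEC_DEMO
    INTEGER, PARAMETER :: N = 1000
    REAL :: A(N), B(N), C(N)
    INTEGER :: I
    B = 1.0
    C = 2.0
    ! Unit-stride accesses, no loop-carried dependence: vectorizable.
    !DEC$ VECTOR ALWAYS
    DO I = 1, N
      A(I) = B(I) + C(I)
    END DO
    PRINT *, A(1), A(N)
  END PROGRAM VEC_DEMO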
The Intel Fortran Compiler and the Intel® Threading Toolset provide capabilities that make developing multithreaded applications easier. See Parallelism Overview. Multithreaded applications can show significant performance gains on multiprocessor Intel symmetric multiprocessing (SMP) systems or on Intel processors with Hyper-Threading technology.
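As a minimal sketch (assuming OpenMP support is enabled, e.g. with -openmp on Linux or /Qopenmp on Windows), a loop with independent iterations can be multithreaded with a standard OpenMP directive; the program below is illustrative, not taken from this document.

  PROGRAM OMP_DEMO
    INTEGER, PARAMETER :: N = 1000
    REAL :: A(N)
    INTEGER :: I
    ! Each iteration is independent, so the work can be divided
    ! among the available threads.
    !$OMP PARALLEL DO
    DO I = 1, N
      A(I) = REAL(I) * 2.0
    END DO
    !$OMP END PARALLEL DO
    PRINT *, A(N)
  END PROGRAM OMP_DEMO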