Coding Guidelines for Intel® Architectures

This topic provides general guidelines for coding practices and techniques on Intel® architectures.

This section describes practices, tools, coding rules, and recommendations associated with the architecture features that can improve performance on the IA-32 and Itanium® processor families. For complete details about optimization for IA-32 processors, see the Intel® Architecture Optimization Reference Manual (http://developer.intel.com/design/pentiumii/manuals/245127.htm). For complete details about optimization for the Itanium processor family, see the Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization (http://developer.intel.com/design/itanium2/manuals/251110.htm).

Note

If a guideline applies to only one architecture, that architecture is named explicitly. Otherwise, the guideline applies to both IA-32 and Itanium architectures.

Performance of compiler-generated code may vary from one compiler to another. Intel® Visual Fortran Compiler generates code that is highly optimized for Intel architectures. You can significantly improve performance by using various compiler optimization options. In addition, you can help the compiler to optimize your Fortran program by following the guidelines described in this section.

To achieve optimum processor performance in your Fortran application, follow the coding practices, rules, and recommendations summarized in the following sections. They relate to the features that contribute to optimizing performance on Intel architecture-based processors.

Memory Access

The Intel compiler lays out Fortran arrays in column-major order. For example, in a two-dimensional array, elements A(22, 34) and A(23, 34) are contiguous in memory. For best performance, code arrays so that inner loops access them in a contiguous manner. Consider the following examples.

The code in example 1 will likely have higher performance than the code in example 2.

Example 1

DO J = 1, N
   DO I = 1, N
      B(I,J) = A(I,J) + 1
   END DO
END DO

In the code above, the inner loop over I accesses arrays A and B contiguously, which results in good performance.

Example 2

DO I = 1, N
   DO J = 1, N
      B(I,J) = A(I,J) + 1
   END DO
END DO

In the code above, the inner loop over J accesses arrays A and B non-contiguously: consecutive iterations touch elements that are a full column apart in memory, which results in poor performance.

The compiler itself can transform the code so that inner loops access memory in a contiguous manner. To enable this, use advanced optimization options: -O3 (Linux*) or /O3 (Windows*) for both IA-32 and Itanium architectures, and, for IA-32 only, -O3 (Linux) or /O3 (Windows) together with -ax (Linux) or /Qax (Windows).
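The two loop orders discussed above can be compared with a small self-contained program. This is a minimal sketch, not part of the original guidelines: the array size N = 1000, the array names B1 and B2, and the use of SYSTEM_CLOCK for timing are illustrative assumptions.

```fortran
! Sketch: time the contiguous and non-contiguous loop orders.
! N, B1, B2, and the timing method are assumptions for illustration.
program loop_order
  implicit none
  integer, parameter :: n = 1000
  integer, parameter :: i8 = selected_int_kind(18)
  real, allocatable :: a(:,:), b1(:,:), b2(:,:)
  integer :: i, j
  integer(i8) :: t0, t1, t2, rate

  allocate(a(n,n), b1(n,n), b2(n,n))
  call random_number(a)

  call system_clock(t0, rate)

  ! Inner loop over I: the first subscript varies fastest (contiguous).
  do j = 1, n
    do i = 1, n
      b1(i,j) = a(i,j) + 1
    end do
  end do

  call system_clock(t1)

  ! Inner loop over J: consecutive accesses are N elements apart.
  do i = 1, n
    do j = 1, n
      b2(i,j) = a(i,j) + 1
    end do
  end do

  call system_clock(t2)

  ! Both orders must compute the same values.
  if (any(b1 /= b2)) error stop 'loop orders disagree'

  print *, 'inner loop I (s):', real(t1 - t0) / real(rate)
  print *, 'inner loop J (s):', real(t2 - t1) / real(rate)
end program loop_order
```

At higher optimization levels the compiler may interchange the second loop nest itself, in which case the two timings can come out nearly identical.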

Memory Layout

Alignment is an increasingly important factor in ensuring good performance. Aligned memory accesses are faster than unaligned accesses. If you use interprocedural optimization across multiple files with the -ipo (Linux) or /Qipo (Windows) option, the compiler analyzes the code and decides whether it is beneficial to pad arrays so that they start on an aligned boundary. Multiple arrays specified in a single common block can impose extra constraints on the compiler.

For example, consider the following COMMON statement:

Example 3

COMMON /AREA1/ A(200), X, B(200)

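If the compiler added padding to align A(1) at a 16-byte boundary, the element B(1) would not be at a 16-byte boundary, because the scalar X sits between the two arrays and the layout within a common block is fixed. Assuming 4-byte default REAL elements, B(1) lies 804 bytes after A(1), and 804 is not a multiple of 16. It is therefore better to split AREA1 as follows.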

Example 4

COMMON /AREA1/ A(200)
COMMON /AREA2/ X
COMMON /AREA3/ B(200)

Because each array is now in its own common block, the compiler has maximum flexibility in determining the padding required for both A and B.
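The alignment problem with the single common block in Example 3 comes down to simple offset arithmetic. The following sketch computes the byte offset of B(1) from A(1); the program name and helper variables are hypothetical, and the result depends on the size of the default REAL kind, which the program queries rather than assumes.

```fortran
! Sketch: byte offset of B(1) relative to A(1) in the Example 3 layout.
! Program and variable names are illustrative assumptions.
program common_offset
  implicit none
  real :: a(200), x, b(200)
  common /area1/ a, x, b
  integer :: elem_bytes, offset_b

  ! STORAGE_SIZE returns the element size in bits.
  elem_bytes = storage_size(x) / 8

  ! A(1:200) and X precede B(1) in the common block's storage sequence.
  offset_b = (size(a) + 1) * elem_bytes

  print *, 'bytes per default REAL     :', elem_bytes
  print *, 'offset of B(1) from A(1)   :', offset_b
  print *, 'offset modulo 16           :', mod(offset_b, 16)
end program common_offset
```

With a 4-byte default REAL, the offset is 804 bytes and the remainder modulo 16 is 4, so A(1) and B(1) cannot both sit on 16-byte boundaries while they share one common block.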

Optimizing for Floating-point Applications

To improve floating-point performance, follow these rules:

Another way to avoid the problem is to use the -x (Linux) or /Qx (Windows) option to do the computation using SSE instructions.

Denormal Exceptions

Floating-point computations that underflow can produce denormal values, which have an adverse impact on performance.
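To see how an underflow produces a denormal value, consider the following minimal sketch. It is not taken from the original guidelines; it assumes a compiler that provides the standard IEEE_ARITHMETIC intrinsic module, and the program name is hypothetical.

```fortran
! Sketch: halving the smallest normal number underflows into the
! denormal range, unless flush-to-zero behavior is in effect.
program denormal_demo
  use, intrinsic :: ieee_arithmetic
  implicit none
  real :: x

  x = tiny(x)      ! smallest positive normal value of default REAL
  x = x / 2.0      ! result is below the normal range

  if (ieee_class(x) == ieee_positive_denormal) then
    print *, 'x is a denormal value:', x
  else
    print *, 'x is not denormal (possibly flushed to zero):', x
  end if
end program denormal_demo
```

Which branch runs depends on the compiler options and processor mode in effect, so the output is intentionally not stated here.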

Auto-vectorization (IA-32 Only)

Many applications run significantly faster when their main computational loops are vectorized, that is, compiled to use Streaming SIMD Extensions 2 (SSE2) instructions. The Intel compiler can vectorize loops automatically (auto-vectorization), or you can guide vectorization with compiler directives. See the Auto-vectorization (IA-32 Only) section for complete details.
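As an illustration of the kind of loop that is generally a good candidate for vectorization, the sketch below uses unit-stride array accesses and has no dependence between iterations. The program, array names, and size are assumptions for illustration; whether a given compiler actually vectorizes the loop depends on the options and target in use.

```fortran
! Sketch: a simple loop that is typically a vectorization candidate.
! Names and sizes are illustrative assumptions.
program vec_candidate
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n), c(n)
  integer :: i

  do i = 1, n
    b(i) = real(i)
    c(i) = 2.0
  end do

  ! Unit stride, and each iteration is independent of the others.
  do i = 1, n
    a(i) = b(i) * c(i) + 1.0
  end do

  ! By contrast, a loop such as a(i) = a(i-1) + b(i) carries a
  ! dependence from one iteration to the next.

  if (a(n) /= 2.0 * real(n) + 1.0) error stop 'unexpected result'
  print *, a(1), a(n)
end program vec_candidate
```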

Creating Multithreaded Applications

The Intel Fortran Compiler and the Intel® Threading Toolset provide capabilities that make developing multithreaded applications easier. See Parallelism Overview. Multithreaded applications can show significant benefit on Intel symmetric multiprocessing (SMP) systems or on Intel processors with Hyper-Threading Technology.