This section discusses the three major features of parallel programming supported by the Intel® compiler:

- Parallelization with OpenMP*
- Auto-parallelization
- Auto-vectorization
Each of these features contributes to application performance depending on the number of processors, the target architecture (IA-32 or Itanium® architecture), and the nature of the application. The features can also be combined to further improve application performance.
Parallel programming can be explicit, that is, defined by the programmer using OpenMP directives. It can also be implicit, that is, detected automatically by the compiler. Implicit parallelism comprises auto-parallelization of outermost loops, auto-vectorization of innermost loops, or both.
Parallelism defined with OpenMP and auto-parallelization directives is based on thread-level parallelism (TLP). Parallelism defined with auto-vectorization techniques is based on instruction-level parallelism (ILP).
The Intel® compiler supports OpenMP and auto-parallelization on both IA-32 and Itanium architectures for multiprocessor systems, as well as on single IA-32 processors with Hyper-Threading Technology (HT Technology). Auto-vectorization is supported on the Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processor families. Users can also add vectorizer directives to their programs to enhance auto-vectorization. A closely related technique, software pipelining (SWP), is available on Itanium-based systems.
The following table summarizes the different ways in which parallelism can be exploited with the Intel® Compiler.
| Parallelism | Description | Supported on |
|---|---|---|
| Explicit: OpenMP* (TLP) | Parallelism programmed by the user | IA-32 and Itanium® architectures |
| Implicit: auto-parallelization (TLP) | Parallelism generated by the compiler and by user-supplied hints | IA-32 and Itanium® architectures |
| Implicit: auto-vectorization (ILP) | Parallelism generated by the compiler and by user-supplied hints | IA-32 architecture |
The Intel® compiler supports the OpenMP* Fortran version 2.0 API specification, available from the www.openmp.org web site. The OpenMP directives relieve the user of the low-level details of iteration-space partitioning, data sharing, and thread scheduling and synchronization.
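For instance, a single worksharing directive can express iteration partitioning, data sharing, and a reduction at once, leaving the mechanics to the compiler and the OpenMP runtime. The sketch below is illustrative only; the array contents and the SCHEDULE choice are arbitrary assumptions:

```
PROGRAM OMP_SUM
INTEGER I
REAL A(1000), TOTAL
A = 1.0                ! arbitrary data for illustration
TOTAL = 0.0
! One directive describes the partitioning (SCHEDULE), the data
! sharing (SHARED/REDUCTION), and leaves thread scheduling and
! synchronization to the compiler and the OpenMP runtime.
!$OMP PARALLEL DO SHARED(A) REDUCTION(+:TOTAL) SCHEDULE(STATIC)
DO I = 1, 1000
  TOTAL = TOTAL + A(I)
ENDDO
PRINT *, TOTAL         ! prints 1000.0
END PROGRAM OMP_SUM
```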
The auto-parallelization feature of the Intel® compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. Automatic parallelization determines which loops are good worksharing candidates, performs the dataflow analysis needed to verify correct parallel execution, and partitions the data for threaded code generation, just as is needed when programming with OpenMP directives. Applications built with OpenMP or with auto-parallelization gain performance from shared memory on multiprocessor systems and on IA-32 processors with Hyper-Threading Technology.
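As an illustration, the loop below has no cross-iteration dependences, so the compiler's dataflow analysis can prove that parallel execution is safe. The program is a minimal sketch; array sizes and values are arbitrary:

```
PROGRAM AUTO_PAR
INTEGER I
REAL A(10000), B(10000)
B = 2.0                      ! arbitrary data for illustration
! Each iteration reads and writes distinct elements, so the
! auto-parallelizer (-parallel or /Qparallel) can thread the loop.
DO I = 1, 10000
  A(I) = B(I) * B(I) + 1.0
ENDDO
PRINT *, A(1), A(10000)
END PROGRAM AUTO_PAR
```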
Auto-vectorization detects low-level operations in the program that can be done in parallel and then converts the sequential code to process 2, 4, 8, or up to 16 elements in one operation, depending on the data type. In some cases, auto-parallelization and vectorization can be combined for better performance. For example, in the code below, TLP can be exploited in the outermost loop, while ILP can be exploited in the innermost loop.
Example:

```
DO I = 1, 100     ! execute groups of iterations in
                  ! different threads (TLP)
  DO J = 1, 32    ! execute in SIMD style with multimedia
                  ! extension (ILP)
    A(J,I) = A(J,I) + 1
  ENDDO
ENDDO
```
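Assuming a Linux command line, compiling this nest with both -parallel and one of the -x{K|W|N|B|P} options (described in the tables below) would request threading of the I loop and vectorization of the J loop; the exact extension letter depends on the target processor.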
Auto-vectorization can help improve the performance of an application that runs on systems based on Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors.
The following tables list the options that enable auto-vectorization, auto-parallelization, and OpenMP support. For details on each option, see the corresponding topic in Compiler Options.
Auto-vectorization: IA-32 only

| Windows* | Linux* | Description |
|---|---|---|
| /Qx | -x | Generates specialized code to run exclusively on processors with the extensions specified by {K\|W\|N\|B\|P}. |
| /Qax | -ax | Generates, in a single binary, code specialized to the extensions specified by {K\|W\|N\|B\|P} as well as generic IA-32 code. The generic code is usually slower. |
| /Qvec-report | -vec-report | Controls the diagnostic messages from the vectorizer; see the subsection that follows the tables. |

Auto-parallelization: IA-32 and Itanium® architectures

| Windows* | Linux* | Description |
|---|---|---|
| /Qparallel | -parallel | Enables the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. |
| /Qpar-threshold[:n] | -par-threshold{n} | Sets a threshold for the auto-parallelization of loops based on the probability of profitable parallel execution of the loop, n = 0 to 100. |
| /Qpar-report | -par-report | Controls the auto-parallelizer's diagnostic levels. |

OpenMP: IA-32 and Itanium® architectures

| Windows* | Linux* | Description |
|---|---|---|
| /Qopenmp | -openmp | Enables the parallelizer to generate multithreaded code based on the OpenMP directives. |
| /Qopenmp-report | -openmp-report | Controls the OpenMP parallelizer's diagnostic levels. |
| /Qopenmp-stubs | -openmp-stubs | Enables compilation of OpenMP programs in sequential mode. The OpenMP directives are ignored and a stub OpenMP library is linked. |
Note
When both -openmp (Linux) or /Qopenmp (Windows) and -parallel (Linux) or /Qparallel (Windows) are specified on the command line, the -parallel (Linux) or /Qparallel (Windows) option is only applied in routines that do not contain OpenMP directives. For routines that contain OpenMP directives, only the -openmp (Linux) or /Qopenmp (Windows) option is applied.
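To make this concrete, consider a hypothetical source file compiled with both options; the routine names are invented for illustration:

```
SUBROUTINE WITH_OMP(X, N)   ! contains an OpenMP directive, so only
INTEGER N, I                ! -openmp (/Qopenmp) applies here
REAL X(N)
!$OMP PARALLEL DO
DO I = 1, N
  X(I) = X(I) + 1.0
ENDDO
END SUBROUTINE WITH_OMP

SUBROUTINE NO_OMP(Y, N)     ! no OpenMP directives, so the
INTEGER N, I                ! auto-parallelizer (-parallel or
REAL Y(N)                   ! /Qparallel) may thread this loop
DO I = 1, N
  Y(I) = Y(I) * 2.0
ENDDO
END SUBROUTINE NO_OMP
```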
With the right choice of options, you can:

- increase the performance of your application with minimal effort
- use compiler features to develop multithreaded programs faster

Additionally, with the relatively small effort of adding OpenMP directives to your code, you can transform a sequential program into a parallel program. The following example shows OpenMP directives within the code.
Example:

```
!$OMP PARALLEL PRIVATE(NUM), SHARED(X,A,B,C)  ! Defines a parallel region
!$OMP PARALLEL DO            ! Specifies a parallel region that
                             ! implicitly contains a single DO directive
DO I = 1, 1000
  NUM = FOO(B(I), C(I))
  X(I) = BAR(A(I), NUM)      ! Assume FOO and BAR have no side effects
ENDDO
```
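In this example, NUM is PRIVATE because each iteration computes its own temporary value, while X, A, B, and C are SHARED because different iterations access distinct elements; the assumption that FOO and BAR have no side effects is what makes the iterations independent and therefore safe to execute in parallel.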
See examples of the auto-parallelization and auto-vectorization directives in the following topics.