The compiler incorporates many well-known and advanced optimization techniques designed to leverage Intel® processor features for higher performance on IA-32- and Itanium®-based systems.
The Intel® compiler has a common intermediate representation for the supported languages, so that the OpenMP* directive-guided parallelization and a majority of optimization techniques are applicable through a single high-level code transformation, irrespective of the source language.
The code transformations and optimizations in the Intel compiler can be categorized into the following functional areas:
Code restructuring and interprocedural optimizations (IPO)
OpenMP-based and automatic parallelization and vectorization
High-Level Optimizations (HLO) and scalar optimizations including memory optimizations such as loop control and data transformations, partial redundancy elimination (PRE), and partial dead store elimination (PDSE)
Low-level machine code generation and optimizations such as register allocation and instruction scheduling
The figure illustrates the interrelation of the different areas.
Parallelization, whether guided by OpenMP directives or derived by automatic data-dependency and control-flow analysis, is a high-level code transformation. It exploits both medium- and coarse-grained parallelism on Intel® processor and multiprocessor systems enabled with Hyper-Threading Technology (HT Technology) to achieve better performance and higher throughput.
The Intel® compiler has a common intermediate code representation (called IL0) into which applications are translated by the front-ends. Many optimization phases in the compiler work on the IL0 representation.
The IL0 has been extended to express the OpenMP directives. Implementing the OpenMP phase at the IL0 level allows the same implementation to be used across languages and architectures. The Intel compiler-generated code references a high-level multithreaded library API, which allows the compiler OpenMP transformation phase to be independent of the underlying operating systems.
The Intel compiler integrates OpenMP parallelization with advanced compiler optimizations to generate efficient multithreaded code that is typically faster than optimized uniprocessor code. An effective optimization phase ordering has been designed in the Intel compiler to ensure that all optimizations enabled before the OpenMP parallelization, such as IPO inlining, code restructuring, goto optimizations, and constant propagation, preserve legal OpenMP program semantics and the information necessary for parallelization.
The integration also ensures that all optimizations after the OpenMP parallelization, such as automatic vectorization, loop transformation, PRE, and PDSE, can effectively help achieve better cache locality and minimize the number of computations and memory references. For example, given a doubly nested OpenMP parallel loop, the parallelization methods are able to generate multithreaded code for the outer loop while maintaining the loop structure, memory reference behavior, and symbol table information for the innermost loop. This behavior enables subsequent intra-register vectorization of the innermost loop to fully leverage the HT Technology and Streaming SIMD Extensions (SSE) features of Intel processors.
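As a minimal sketch of the doubly nested case described above, the following C function places an OpenMP directive on the outer loop and leaves the inner loop as a plain unit-stride loop that a vectorizer could target. The function name scale_add, its parameters, and the row-major layout are assumptions made for illustration; they do not come from the compiler documentation, and whether the inner loop is actually vectorized depends on the compiler and its options.

```c
/* Hypothetical example: c[i][j] = a[i][j] * s + b[i][j] on row-major matrices.
 * The restrict qualifiers tell the compiler the arrays do not overlap, which
 * removes one common obstacle to vectorizing the inner loop. */
void scale_add(int rows, int cols,
               const float *restrict a, const float *restrict b,
               float s, float *restrict c)
{
    /* Outer loop: iterations (rows) are divided among threads. */
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i) {
        /* Inner loop: contiguous accesses with no dependence between
         * iterations, so its structure stays available for SIMD. */
        for (int j = 0; j < cols; ++j) {
            c[i * cols + j] = a[i * cols + j] * s + b[i * cols + j];
        }
    }
}
```

Each thread works on whole rows, so the inner loop keeps its original bounds and memory access pattern within a thread.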
OpenMP parallelization in the Intel compiler includes:
A pre-pass that transforms OpenMP parallel sections into parallel loops and work-sharing sections into work-sharing loops.
A work-region graph builder that builds a region hierarchical graph based on the OpenMP-aware control-flow graph.
A loop analysis phase for building the loop structure that consists of loop control variable, loop lower-bound, loop upper-bound, loop pre-header, loop header, and control expression.
A variable classification phase that performs analysis of shared and private variables.
A multithreaded code generator that generates multithreaded code at the compiler intermediate code level based on Guide, which is a multithreaded run-time library API.
A privatizer that performs privatization to handle firstprivate, private, lastprivate, and reduction variables.
A post-pass that generates code to cache threadprivate variables in thread-local storage.
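The variable classes handled by the classification and privatization phases correspond to data-sharing clauses written in the source. The sketch below shows one loop that uses private, firstprivate, lastprivate, and reduction variables together. The function name sum_squares_with_offset and its parameters are hypothetical and chosen only for illustration; the code shows the source-level view, not the code the compiler generates.

```c
/* Hypothetical example: returns the sum of (x[i]*x[i] + offset) over all i
 * and stores the square computed by the last iteration in *last_square. */
long sum_squares_with_offset(const int *x, int n, int offset, int *last_square)
{
    long total = 0;  /* reduction: per-thread partial sums are combined     */
    int sq = 0;      /* private: each thread gets its own uninitialized copy */
    int last = 0;    /* lastprivate: value from the final iteration survives */

    if (n <= 0) {    /* avoid relying on lastprivate when no iteration runs */
        *last_square = 0;
        return 0;
    }

    /* firstprivate: each thread's copy of offset starts with the original value. */
    #pragma omp parallel for private(sq) firstprivate(offset) \
                             lastprivate(last) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        sq = x[i] * x[i];
        last = sq;
        total += sq + offset;
    }

    *last_square = last;
    return total;
}
```

For the input {1, 2, 3} with an offset of 10, the loop adds 11, 14, and 19, so the function returns 44 and reports 9 as the last square.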
OpenMP, a compiler-based threading method, provides a high-level interface to the underlying thread libraries. With OpenMP, you can use directives to describe parallelism to the compiler. Using the supplied directives removes much of the complexity of explicit threading because the compiler handles the details. OpenMP is less invasive, so significant source code modifications are not usually necessary. A non-OpenMP compiler simply ignores the directives, leaving the underlying serial code intact.
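To illustrate how little the serial source changes, here is a minimal sketch of a loop annotated with a single directive. The function names add_arrays and built_with_openmp are hypothetical. The _OPENMP macro is defined by the OpenMP specification when OpenMP compilation is enabled; without it, the pragma is ignored and the loop runs serially with the same result.

```c
/* Hypothetical example: element-wise sum of two arrays.
 * Removing the pragma line leaves a valid serial function. */
void add_arrays(const int *a, const int *b, int *out, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* Reports whether this translation unit was compiled with OpenMP enabled. */
int built_with_openmp(void)
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}
```

The output of add_arrays is identical in both builds because every iteration writes a distinct element and reads only its own inputs.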
As in every other aspect of optimization, the key to attaining good parallel performance is choosing the right granularity for your application. Within the context of this discussion, granularity is the amount of work in the parallel task. If granularity is too fine, performance can suffer from increased communication overhead. Conversely, if granularity is too coarse, performance can suffer from load imbalance. The design goal is to determine the right granularity for the parallel tasks while avoiding load imbalance and communication overhead.
The amount of work for each parallel task, or granularity, of a multithreaded application greatly affects its parallel performance. When threading an application, the first step is to partition the problem into as many parallel tasks as possible. The second step is to determine the necessary communication in terms of data and synchronization. The third step is to consider the performance of the algorithm. Since communication and partitioning are not free operations, you often need to combine partitions to overcome these overheads and achieve the most efficient implementation. The combination step is the process of determining the best granularity for the application.
The granularity is often related to how balanced the workload is between threads. It is easier to balance the workload of a large number of small tasks, but too many small tasks can lead to excessive parallel overhead. Therefore, coarse granularity is usually best. However, increasing granularity too much can create load imbalance; tools like the Intel® Thread Profiler can help identify the right granularity for your application.
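One place where granularity becomes a tunable parameter in OpenMP code is the chunk size of the schedule clause. The sketch below, which uses a hypothetical function triangular_work, runs a loop whose iterations have uneven cost (iteration i does i units of work) and lets the caller choose the chunk size. A chunk size of 1 gives the finest granularity and the most scheduling overhead; a larger chunk reduces overhead but can leave some threads idle near the end. The best value is an assumption to be measured, not a fixed rule.

```c
/* Hypothetical example: counts the inner-loop iterations of a triangular
 * loop nest. The result is n*(n-1)/2 regardless of chunk size; only the
 * distribution of work among threads changes. */
long triangular_work(int n, int chunk)
{
    long total = 0;

    if (chunk < 1) {
        chunk = 1;  /* the schedule clause requires a positive chunk size */
    }

    /* dynamic scheduling hands out 'chunk' consecutive iterations at a time */
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            total += 1;  /* stands in for one unit of real work */
        }
    }
    return total;
}
```

Timing calls such as triangular_work(n, 1) and triangular_work(n, 64) on the target system is one way to see the trade-off between scheduling overhead and load balance.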
Note
For detailed information on Hyper-Threading Technology, refer to the IA-32 Intel® Architecture Optimization Reference Manual: http://developer.intel.com/design/pentium4/manuals/index_new.htm.