Programming with OpenMP*

The Intel® compiler accepts a Fortran program containing OpenMP* directives as input and produces a multithreaded version of the code. When the parallel program begins execution, a single thread, called the master thread, exists. The master thread executes serially until it encounters a parallel region.

Parallel Region

A parallel region is a block of code that is executed by a team of threads in parallel. In the OpenMP Fortran API, a parallel construct is defined by placing the OpenMP directive PARALLEL at the beginning and the directive END PARALLEL at the end of the code segment. Code segments thus bounded are executed in parallel by every thread in the team.
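A minimal sketch of a parallel region follows; OMP_GET_THREAD_NUM is the OpenMP run-time library routine that returns the calling thread's number, and it is used here only to show that every thread in the team executes the enclosed block:

```fortran
PROGRAM HELLO
INTEGER OMP_GET_THREAD_NUM
!$OMP PARALLEL
PRINT *, 'Hello from thread ', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL
END
```

Each thread prints its own line; the order of the lines is unspecified.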

A structured block of code is a collection of one or more executable statements with a single point of entry at the top and a single point of exit at the bottom.

The compiler supports worksharing and synchronization constructs. Each of these constructs consists of one or two specific OpenMP directives and sometimes the enclosed or following structured block of code. For complete definitions of constructs, see the OpenMP Fortran version 2.0 specifications.

At the end of the parallel region, threads wait until all team members have arrived. The team is logically disbanded (but may be reused in the next parallel region), and the master thread continues serial execution until it encounters the next parallel region.

Worksharing Construct

A worksharing construct divides the execution of the enclosed code region among the members of the team created on entering the enclosing parallel region. When the master thread enters a parallel region, a team of threads is formed. Starting from the beginning of the parallel region, code is replicated (executed by all team members) until a worksharing construct is encountered. A worksharing construct divides the execution of the enclosed code among the members of the team that encounter it.

The OpenMP SECTIONS or DO constructs are defined as worksharing constructs because they distribute the enclosed work among the threads of the current team. A worksharing construct is only distributed if it is encountered during dynamic execution of a parallel region. If the worksharing construct occurs lexically inside of the parallel region, then it is always executed by distributing the work among the team members. If the worksharing construct is not lexically (explicitly) enclosed by a parallel region (that is, it is orphaned), then the worksharing construct will be distributed among the team members of the closest dynamically-enclosing parallel region, if one exists. Otherwise, it will be executed serially.

When a thread reaches the end of a worksharing construct, it may wait until all team members within that construct have completed their work. When all of the work defined by the worksharing construct is finished, the team exits the worksharing construct and continues executing the code that follows.
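The division of work can be sketched as follows: the iterations of the loop, not the loop as a whole, are distributed among the team, and the implied barrier at END DO holds the team together until all iterations are complete (the array name and the bound of 100 are chosen only for illustration):

```fortran
PROGRAM FILLA
INTEGER i, n
PARAMETER (n = 100)
REAL a(n)
!$OMP PARALLEL
!$OMP DO PRIVATE(i)
DO i = 1, n
 a(i) = 2.0 * i
END DO
!$OMP END DO
!$OMP END PARALLEL
END
```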

A combined parallel/worksharing construct denotes a parallel region that contains only one worksharing construct.

Parallel Processing Directive Groups

The parallel processing directives include the following groups:

Parallel Region Directives

Worksharing Construct Directives

Combined Parallel/Worksharing Construct Directives

The combined parallel/worksharing constructs provide an abbreviated way to specify a parallel region that contains a single worksharing construct. The combined parallel/worksharing constructs in the OpenMP Fortran API are:

PARALLEL DO and END PARALLEL DO

PARALLEL SECTIONS and END PARALLEL SECTIONS

PARALLEL WORKSHARE and END PARALLEL WORKSHARE
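For example, a parallel region whose only content is a single DO worksharing construct can be collapsed into one PARALLEL DO directive; the following sketch (with an illustrative array and bound) is equivalent to a PARALLEL/DO pair:

```fortran
PROGRAM SCALEX
INTEGER i, n
PARAMETER (n = 1000)
REAL x(n)
!$OMP PARALLEL DO PRIVATE(i)
DO i = 1, n
 x(i) = 0.5 * i
END DO
!$OMP END PARALLEL DO
END
```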

Synchronization and MASTER Directives

Synchronization is the interthread communication that ensures the consistency of shared data and coordinates parallel execution among threads. Shared data is consistent within a team of threads when all threads obtain the identical value when the data is accessed. A synchronization construct is used to ensure this consistency of the shared data.

The OpenMP synchronization directives are CRITICAL, ORDERED, ATOMIC, FLUSH, and BARRIER.

CRITICAL: Within a parallel region or a worksharing construct, only one thread at a time is allowed to execute the code within a CRITICAL construct.

ORDERED: Used in conjunction with a DO or SECTIONS construct to impose a serial order on the execution of a section of code.

ATOMIC: Used to update a memory location in an uninterruptible fashion.

FLUSH: Used to ensure that all threads in a team have a consistent view of memory.

BARRIER: Forces all team members to gather at a particular point in the code. Each team member that executes a BARRIER waits there until all of the team members have arrived. A BARRIER cannot be used within worksharing or other synchronization constructs because of the potential for deadlock.

MASTER: Forces the enclosed code to be executed by the master thread only; the other team members skip it.
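As a sketch of these directives in use, the ATOMIC directive below protects the single update of the shared variable total so that concurrent increments are not lost (the names and bound are illustrative):

```fortran
PROGRAM COUNTIT
INTEGER i, n, total
PARAMETER (n = 100)
total = 0
!$OMP PARALLEL DO PRIVATE(i)
DO i = 1, n
!$OMP ATOMIC
 total = total + i
END DO
!$OMP END PARALLEL DO
PRINT *, 'total = ', total
END
```

In practice, a REDUCTION(+:total) clause is usually preferred for this pattern; ATOMIC is shown here only to illustrate the directive.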

See the list of OpenMP Directives and Clauses.

Data Sharing

Data sharing is specified at the start of a parallel region or worksharing construct by using the SHARED and PRIVATE clauses. All variables in the SHARED clause are shared among the members of a team. The application must do the following:

In addition to SHARED and PRIVATE variables, individual variables and entire common blocks can be privatized using the THREADPRIVATE directive.
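As a sketch of these clauses, the following fragment shares the array a among the team, gives each thread a private scratch variable tmp, and makes the common block /state/ per-thread with THREADPRIVATE (all names are illustrative, not from the original text):

```fortran
PROGRAM SHARING
INTEGER i, n, tmp
PARAMETER (n = 50)
REAL a(n)
INTEGER counter
COMMON /state/ counter
!$OMP THREADPRIVATE(/state/)
!$OMP PARALLEL SHARED(a) PRIVATE(tmp)
counter = 0
!$OMP DO PRIVATE(i)
DO i = 1, n
 tmp = 2 * i
 a(i) = tmp
 counter = counter + 1
END DO
!$OMP END DO
!$OMP END PARALLEL
END
```

Each thread increments its own copy of counter, so the updates never conflict, while a and n remain shared.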

Orphaned Directives

OpenMP contains a feature called orphaning that dramatically increases the expressiveness of parallel directives. Orphaning is a situation in which directives related to a parallel region are not required to occur lexically within a single program unit. Directives such as CRITICAL, BARRIER, SECTIONS, SINGLE, MASTER, and DO can occur by themselves in a program unit, dynamically "binding" to the enclosing parallel region at run time.

Orphaned directives enable parallelism to be inserted into existing code with a minimum of code restructuring. Orphaning can also improve performance by enabling a single parallel region to bind with multiple DO directives located within called subroutines. Consider the following code segment:

Example

...
!$OMP PARALLEL
CALL PHASE1
CALL PHASE2
!$OMP END PARALLEL
...

SUBROUTINE PHASE1
!$OMP DO PRIVATE(i)
DO i = 1, n
 CALL SOME_WORK(i)
END DO
!$OMP END DO
END
 

SUBROUTINE PHASE2
!$OMP DO PRIVATE(j)
DO j = 1, n
 CALL MORE_WORK(j)
END DO
!$OMP END DO
END

Orphaned Directives Usage Rules

The following usage rules apply to orphaned directives:

Preparing Code for OpenMP Processing

The following are the major stages and steps of preparing your code for using OpenMP. Typically, the first two stages can be done on uniprocessor or multiprocessor systems; later stages are typically done only on multiprocessor systems.

Before Inserting OpenMP Directives

Before inserting any OpenMP parallel directives, verify that your code is safe for parallel execution by doing the following:

Analyze

The analysis includes the following major actions:

  1. Profile the program to find out where it spends most of its time. This is the part of the program that benefits most from parallelization efforts. This stage can be accomplished using VTune™ Analyzer or basic PGO options.

  2. Wherever the program contains nested loops, choose the outermost loop that has the fewest cross-iteration dependencies.

Restructure

To restructure your program for successful OpenMP implementation, you can perform some or all of the following actions:

Tune

The tuning process should include minimizing the sequential code in critical sections and load balancing by using the SCHEDULE clause or the OMP_SCHEDULE environment variable.
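The SCHEDULE clause can be sketched as follows: DYNAMIC with a chunk size of 4 hands out small batches of iterations to whichever thread is free, which helps when iteration costs vary (the chunk size and the routine UNEVEN_WORK are illustrative assumptions, not recommendations from the original text). Writing SCHEDULE(RUNTIME) instead defers the choice to the OMP_SCHEDULE environment variable, so the policy can be changed without recompiling:

```fortran
!$OMP PARALLEL DO SCHEDULE(DYNAMIC, 4) PRIVATE(i)
DO i = 1, n
 CALL UNEVEN_WORK(i)
END DO
!$OMP END PARALLEL DO
```

For example, with SCHEDULE(RUNTIME) in the directive, setting OMP_SCHEDULE to "dynamic,4" in the environment selects the same policy.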

Note

This step is typically performed on a multiprocessor system.