The Intel® compiler accepts a Fortran program containing OpenMP* directives as input and produces a multithreaded version of the code. When the parallel program begins execution, a single thread exists. This thread is called the master thread. The master thread will continue to process serially until it encounters a parallel region.
A parallel region is a block of code that must be executed by a team of threads in parallel. In the OpenMP Fortran API, a parallel construct is defined by placing the OpenMP PARALLEL directive at the beginning and the END PARALLEL directive at the end of the code segment. Code segments thus bounded can be executed in parallel.
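A minimal sketch of a parallel region (the program and variable names are illustrative):

```fortran
! Every thread in the team executes the code between the
! PARALLEL and END PARALLEL directives.
PROGRAM HELLO
  USE OMP_LIB          ! for OMP_GET_THREAD_NUM
  IMPLICIT NONE
  INTEGER :: ID
!$OMP PARALLEL PRIVATE(ID)
  ID = OMP_GET_THREAD_NUM()
  PRINT *, 'Hello from thread ', ID
!$OMP END PARALLEL
END PROGRAM HELLO
```

Compiled with -openmp (Linux*) or /Qopenmp (Windows*), each thread in the team prints its own thread number; without the OpenMP option the directives are treated as comments and a single thread runs the code.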
A structured block of code is a collection of one or more executable statements with a single point of entry at the top and a single point of exit at the bottom.
The compiler supports worksharing and synchronization constructs. Each of these constructs consists of one or two specific OpenMP directives and sometimes the enclosed or following structured block of code. For complete definitions of constructs, see the OpenMP Fortran version 2.0 specifications.
At the end of the parallel region, threads wait until all team members have arrived. The team is logically disbanded (but may be reused in the next parallel region), and the master thread continues serial execution until it encounters the next parallel region.
When the master thread enters a parallel region, a team of threads is formed. Starting from the beginning of the parallel region, code is replicated (executed by all team members) until a worksharing construct is encountered. A worksharing construct divides the execution of the enclosed code among the members of the team that encounter it.
The OpenMP SECTIONS or DO constructs are defined as worksharing constructs because they distribute the enclosed work among the threads of the current team. A worksharing construct is only distributed if it is encountered during dynamic execution of a parallel region. If the worksharing construct occurs lexically inside of the parallel region, then it is always executed by distributing the work among the team members. If the worksharing construct is not lexically (explicitly) enclosed by a parallel region (that is, it is orphaned), then the worksharing construct will be distributed among the team members of the closest dynamically-enclosing parallel region, if one exists. Otherwise, it will be executed serially.
When a thread reaches the end of a worksharing construct, it may wait until all team members within that construct have completed their work. When all of the work defined by the worksharing construct is finished, the team exits the worksharing construct and continues executing the code that follows.
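A minimal sketch of a DO worksharing construct inside a parallel region (names are illustrative):

```fortran
! The DO directive divides the loop iterations among the team.
! The implicit barrier at END DO makes all threads wait before
! continuing with the replicated code that follows.
PROGRAM WORKSHARE_DO
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  INTEGER :: I
  REAL :: A(N)
!$OMP PARALLEL
!$OMP DO
  DO I = 1, N
     A(I) = REAL(I) * 2.0
  END DO
!$OMP END DO
!$OMP END PARALLEL
  PRINT *, A(1), A(N)
END PROGRAM WORKSHARE_DO
```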
A combined parallel/worksharing construct denotes a parallel region that contains only one worksharing construct.
The parallel processing directives include the following groups:
The PARALLEL and END PARALLEL directives define a parallel region.
The DO and END DO directives specify parallel execution of loop iterations.
The SECTIONS and END SECTIONS directives specify parallel execution for arbitrary blocks of sequential code. Each SECTION is executed once by a thread in the team.
The SINGLE and END SINGLE directives define a section of code where exactly one thread is allowed to execute the code; threads not chosen to execute this section ignore the code.
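The SECTIONS and SINGLE constructs can be sketched as follows (the work routines are illustrative stubs):

```fortran
! Each SECTION is executed once by some thread in the team;
! the SINGLE block is executed by exactly one thread.
PROGRAM SECTIONS_DEMO
  IMPLICIT NONE
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
  CALL PHASE_A()
!$OMP SECTION
  CALL PHASE_B()
!$OMP END SECTIONS
!$OMP SINGLE
  PRINT *, 'both sections done'   ! printed by one thread only
!$OMP END SINGLE
!$OMP END PARALLEL
CONTAINS
  SUBROUTINE PHASE_A()
    PRINT *, 'section A'
  END SUBROUTINE PHASE_A
  SUBROUTINE PHASE_B()
    PRINT *, 'section B'
  END SUBROUTINE PHASE_B
END PROGRAM SECTIONS_DEMO
```

The implicit barrier at END SECTIONS guarantees both sections have finished before the SINGLE block runs.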
The combined parallel/worksharing constructs provide an abbreviated way to specify a parallel region that contains a single worksharing construct. The combined parallel/worksharing constructs are:
PARALLEL DO and END PARALLEL DO
PARALLEL SECTIONS and END PARALLEL SECTIONS
PARALLEL WORKSHARE and END PARALLEL WORKSHARE
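For example, PARALLEL DO is shorthand for a parallel region containing a single DO worksharing construct (names and values are illustrative):

```fortran
! One combined directive replaces the PARALLEL/DO pair.
PROGRAM SAXPY
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  INTEGER :: I
  REAL :: X(N), Y(N), A
  A = 2.0
  X = 1.0
  Y = 0.0
!$OMP PARALLEL DO
  DO I = 1, N
     Y(I) = A * X(I) + Y(I)
  END DO
!$OMP END PARALLEL DO
  PRINT *, Y(N)   ! 2.0
END PROGRAM SAXPY
```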
Synchronization is the interthread communication that ensures the consistency of shared data and coordinates parallel execution among threads. Shared data is consistent within a team of threads when all threads obtain the identical value when the data is accessed. A synchronization construct is used to ensure this consistency of the shared data.
The OpenMP synchronization directives are CRITICAL, ORDERED, ATOMIC, FLUSH, BARRIER, and MASTER.
| Directive | Usage |
|---|---|
| CRITICAL | Within a parallel region or a worksharing construct, only one thread at a time is allowed to execute the code within a CRITICAL construct. |
| ORDERED | Used in conjunction with a DO or SECTIONS construct to impose a serial order on the execution of a section of code. |
| ATOMIC | Used to update a memory location in an uninterruptible fashion. |
| FLUSH | Used to ensure that all threads in a team have a consistent view of memory. |
| BARRIER | Forces all team members to gather at a particular point in the code. Each team member that executes a BARRIER waits at the BARRIER until all of the team members have arrived. A BARRIER cannot be used within worksharing or other synchronization constructs because of the potential for deadlock. |
| MASTER | Forces execution of the enclosed code by the master thread only. |
See the list of OpenMP Directives and Clauses.
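As an illustration, the ATOMIC directive can protect a shared counter update; a CRITICAL construct would also work, at higher cost for a single update (names are illustrative):

```fortran
! ATOMIC makes the read-modify-write of COUNTER indivisible,
! so concurrent increments from different threads are not lost.
PROGRAM SYNC_DEMO
  IMPLICIT NONE
  INTEGER :: I, COUNTER
  COUNTER = 0
!$OMP PARALLEL DO
  DO I = 1, 100
!$OMP ATOMIC
     COUNTER = COUNTER + 1
  END DO
!$OMP END PARALLEL DO
  PRINT *, COUNTER   ! 100
END PROGRAM SYNC_DEMO
```

Without the ATOMIC directive, two threads could read the same value of COUNTER and both write back the same incremented result, losing an update.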
Data sharing is specified at the start of a parallel region or worksharing construct by using the SHARED and PRIVATE clauses. All variables in the SHARED clause are shared among the members of a team. All variables in the PRIVATE clause are private to each team member; for the entire parallel region, assuming t team members, there are t+1 copies of each variable in the PRIVATE clause: one global copy that is active outside parallel regions and a PRIVATE copy for each team member. The application must do the following:
Synchronize access to SHARED variables.
Initialize PRIVATE variables at the start of a parallel region, unless the FIRSTPRIVATE clause is specified. In this case, the PRIVATE copy is initialized from the global copy at the start of the construct on which the FIRSTPRIVATE clause appears.
Update the global copy of a PRIVATE variable at the end of a parallel region. However, the LASTPRIVATE clause of a DO directive updates the global copy automatically from the team member that executed the sequentially last iteration of the loop.
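The PRIVATE, FIRSTPRIVATE, and LASTPRIVATE behavior described above can be sketched as follows (all names and values are illustrative):

```fortran
! FIRSTPRIVATE(OFFSET): each thread's private copy of OFFSET is
! initialized from the global copy (100).
! LASTPRIVATE(X): after the loop, the global X holds the value
! from the sequentially last iteration (I = N).
PROGRAM DATA_SCOPE
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 10
  INTEGER :: I, X, OFFSET
  OFFSET = 100
!$OMP PARALLEL DO FIRSTPRIVATE(OFFSET) LASTPRIVATE(X)
  DO I = 1, N
     X = OFFSET + I
  END DO
!$OMP END PARALLEL DO
  PRINT *, X   ! 110, the value from iteration N
END PROGRAM DATA_SCOPE
```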
In addition to SHARED and PRIVATE variables, individual variables and entire common blocks can be privatized using the THREADPRIVATE directive.
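A minimal sketch of privatizing a common block with THREADPRIVATE (block and variable names are illustrative):

```fortran
! THREADPRIVATE gives each thread its own copy of the common
! block /SCRATCH/, so updates inside the region do not race.
PROGRAM TP_DEMO
  IMPLICIT NONE
  INTEGER :: SCRATCH_ID
  COMMON /SCRATCH/ SCRATCH_ID
!$OMP THREADPRIVATE(/SCRATCH/)
  INTEGER :: I
!$OMP PARALLEL
  SCRATCH_ID = 0                   ! each thread resets its copy
!$OMP DO
  DO I = 1, 8
     SCRATCH_ID = SCRATCH_ID + 1   ! no race: per-thread copy
  END DO
!$OMP END DO
!$OMP END PARALLEL
END PROGRAM TP_DEMO
```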
OpenMP contains a feature called orphaning that dramatically increases the expressiveness of parallel directives. Orphaning means that directives related to a parallel region are not required to occur lexically within a single program unit. Directives such as CRITICAL, BARRIER, SECTIONS, SINGLE, MASTER, and DO can occur by themselves in a program unit, dynamically "binding" to the enclosing parallel region at run time.
Orphaned directives enable parallelism to be inserted into existing code with a minimum of code restructuring. Orphaning can also improve performance by enabling a single parallel region to bind with multiple DO directives located within called subroutines.
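A sketch of this pattern, with one parallel region binding to orphaned DO directives in two called subroutines (subroutine names and loop bodies are illustrative):

```fortran
PROGRAM ORPHAN_DEMO
  IMPLICIT NONE
!$OMP PARALLEL
  CALL PHASE1
  CALL PHASE2
!$OMP END PARALLEL
END PROGRAM ORPHAN_DEMO

SUBROUTINE PHASE1
  IMPLICIT NONE
  INTEGER :: I
  ! Orphaned DO: binds to the parallel region in the caller,
  ! so the iterations are divided among the existing team.
!$OMP DO
  DO I = 1, 100
     ! ... work for phase 1 ...
  END DO
!$OMP END DO
END SUBROUTINE PHASE1

SUBROUTINE PHASE2
  IMPLICIT NONE
  INTEGER :: J
!$OMP DO
  DO J = 1, 100
     ! ... work for phase 2 ...
  END DO
!$OMP END DO
END SUBROUTINE PHASE2
```

Because both orphaned DO directives bind to the single parallel region in the caller, the team is created once rather than once per loop.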
The following orphaned directives usage rules apply:
An orphaned worksharing construct (SECTIONS, SINGLE, or DO) encountered outside any parallel region is executed by a team consisting of one thread, that is, serially.
Any collective operation (worksharing construct or BARRIER) executed inside of a worksharing construct is illegal.
It is illegal to execute a collective operation (worksharing construct or BARRIER) from within a synchronization region (CRITICAL/ORDERED).
The opening and closing directives of a directive pair (for example, DO and END DO) must occur in a single block of the program.
Private scoping of a variable can be specified at a worksharing construct. Shared scoping must be specified at the parallel region. For complete details, see the OpenMP Fortran version 2.0 specifications.
The following are the major stages and steps of preparing your code for using OpenMP. Typically, the first two stages can be done on uniprocessor or multiprocessor systems; later stages are typically done only on multiprocessor systems.
Before inserting any OpenMP parallel directives, verify that your code is safe for parallel execution by doing the following:
Place local variables on the stack. Use -automatic or -auto-scalar (Linux*) or /automatic (Windows*) to make the locals automatic. This is the default behavior of the Intel® compiler when -openmp (Linux) or /Qopenmp (Windows) is specified. Avoid using the -save (Linux) or /save (Windows) option, which inhibits stack allocation of local variables; with -save, local variables receive static storage and are shared across threads, so you may need to add synchronization code to ensure proper access by threads.
The analysis includes the following major actions:
Profile the program to find out where it spends most of its time. This is the part of the program that benefits most from parallelization efforts. This stage can be accomplished using VTune™ Analyzer or basic PGO options.
Wherever the program contains nested loops, choose the outermost loop that has the fewest cross-iteration dependencies.
To restructure your program for successful OpenMP implementation, you can perform some or all of the following actions:
If a chosen loop is able to execute iterations in parallel, introduce a PARALLEL DO construct around this loop.
Try to remove any cross-iteration dependencies by rewriting the algorithm.
Synchronize the remaining cross-iteration dependencies by placing CRITICAL constructs around uses of and assignments to the variables involved in the dependencies.
List the variables that are present in the loop within appropriate SHARED, PRIVATE, LASTPRIVATE, FIRSTPRIVATE, or REDUCTION clauses.
List the DO index of the parallel loop as PRIVATE. This step is optional.
COMMON block elements must not be placed on the PRIVATE list if their global scope is to be preserved. The THREADPRIVATE directive can be used to privatize to each thread the COMMON block containing those variables with global scope. THREADPRIVATE creates a copy of the COMMON block for each of the threads in the team.
Any I/O in the parallel region should be synchronized.
Identify more parallel loops and restructure them.
If possible, merge adjacent PARALLEL DO constructs into a single parallel region containing multiple DO directives to reduce execution overhead.
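The restructuring steps above can be sketched as follows; here a REDUCTION clause removes the cross-iteration dependency on the accumulator without needing a CRITICAL construct (names and values are illustrative):

```fortran
! The loop index is PRIVATE, the array is SHARED, and TOTAL is
! combined across threads with REDUCTION(+:TOTAL): each thread
! sums into a private copy, and the copies are added at the end.
PROGRAM RESTRUCTURED
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  INTEGER :: I
  REAL :: A(N), TOTAL
  A = 1.0
  TOTAL = 0.0
!$OMP PARALLEL DO PRIVATE(I) SHARED(A) REDUCTION(+:TOTAL)
  DO I = 1, N
     TOTAL = TOTAL + A(I)
  END DO
!$OMP END PARALLEL DO
  PRINT *, TOTAL   ! 1000.0
END PROGRAM RESTRUCTURED
```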
The tuning process should include minimizing the sequential code in critical sections and load balancing by using the SCHEDULE clause or the OMP_SCHEDULE environment variable.
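For load balancing, SCHEDULE(RUNTIME) defers the scheduling choice to the OMP_SCHEDULE environment variable, so different schedules can be tried without recompiling (names are illustrative):

```fortran
! With SCHEDULE(RUNTIME), the schedule is taken from OMP_SCHEDULE,
! e.g.  export OMP_SCHEDULE="dynamic,16"
! which hands out chunks of 16 iterations on demand and helps
! balance loops whose iterations have uneven cost.
PROGRAM SCHED_DEMO
  IMPLICIT NONE
  INTEGER :: I
  REAL :: T
!$OMP PARALLEL DO SCHEDULE(RUNTIME) PRIVATE(T)
  DO I = 1, 1000
     T = REAL(I)        ! stand-in for work of uneven cost
  END DO
!$OMP END PARALLEL DO
END PROGRAM SCHED_DEMO
```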
Note
This step is typically performed on a multiprocessor system.