The auto-parallelization feature implements some concepts of OpenMP, such as the worksharing construct (the PARALLEL DO directive). See Programming with OpenMP for details on the worksharing construct. This section provides specifics of auto-parallelization.
A loop is parallelizable if:
The loop is countable at compile time: this means that an expression representing how many times the loop will execute (also called "the loop trip count") can be generated just before entering the loop.
There are no FLOW (READ after WRITE), OUTPUT (WRITE after WRITE), or ANTI (WRITE after READ) loop-carried data dependences. A loop-carried data dependence occurs when the same memory location is referenced in different iterations of the loop. At the compiler's discretion, a loop may be parallelized if any assumed inhibiting loop-carried dependences can be resolved by runtime dependence testing; the sketch after this list shows both a dependence-free loop and one with a FLOW dependence.
The compiler may generate a runtime test for the profitability of executing in parallel for loops whose loop parameters are not compile-time constants.
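As a minimal sketch (the array names A, B, C and the bound N are hypothetical), the first loop below is countable and writes a distinct element of A in every iteration, so it satisfies both conditions; the second carries a FLOW (READ after WRITE) dependence, because iteration I reads the value that iteration I-1 wrote:

   ! Parallelizable: the trip count N is known on entry and
   ! no memory location is touched by two different iterations.
   DO I = 1, N
      A(I) = B(I) + C(I)
   END DO

   ! Not parallelizable as written: iteration I reads A(I-1),
   ! which iteration I-1 writes (a FLOW loop-carried dependence).
   DO I = 2, N
      A(I) = A(I-1) + B(I)
   END DO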
Enhance the power and effectiveness of the auto-parallelizer by following these coding guidelines:
Expose the trip count of loops whenever possible; specifically, use constants where the trip count is known, and save loop parameters in local variables.
Avoid placing structures inside loop bodies that the compiler may assume to carry dependent data, for example, procedure calls, ambiguous indirect references or global references.
Insert the !DIR$ PARALLEL directive to disambiguate assumed data dependences.
Insert the !DIR$ NOPARALLEL directive before loops known to have insufficient work to justify the overhead of sharing among threads. Both directives are illustrated in the sketch after this list.
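A minimal sketch of the two directives (the arrays and the index vector IDX are hypothetical): !DIR$ PARALLEL asserts that the indirect references in the first loop do not overlap, so the compiler need not assume a dependence; !DIR$ NOPARALLEL suppresses parallelization of the second loop, whose few iterations cannot repay the threading overhead:

   ! Assert that accesses through the index vector IDX do not
   ! overlap, removing the assumed loop-carried dependence.
   !DIR$ PARALLEL
   DO I = 1, N
      A(IDX(I)) = A(IDX(I)) + B(I)
   END DO

   ! Too little work to justify thread creation and scheduling.
   !DIR$ NOPARALLEL
   DO I = 1, 4
      S(I) = 0.0
   END DO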
For auto-parallelization processing, the compiler performs the following steps:
Data flow analysis ---> Loop classification ---> Dependence analysis ---> High-level parallelization ---> Data partitioning ---> Multi-threaded code generation.
These steps include:
Data flow analysis: compute the flow of data through the program
Loop classification: determine loop candidates for parallelization based on correctness and on efficiency, as determined by threshold analysis
Dependence analysis: compute data dependences for the references in each loop nest
High-level parallelization:
- analyze the dependence graph to determine loops that can execute in parallel
- generate runtime dependence tests where dependences cannot be resolved at compile time
Data partitioning: examine data references and partition them based on the following types of access: SHARED, PRIVATE, and FIRSTPRIVATE (see the sketch after this list)
Multi-threaded code generation:
- modify loop parameters
- generate entry/exit per threaded task
- generate calls to parallel runtime routines for thread creation and synchronization
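Conceptually, the result of these steps resembles what an explicit OpenMP worksharing construct would express. In this hypothetical sketch (the names A, B, X, and N are illustrative, not taken from any actual compiler output), data partitioning would classify the arrays A and B as SHARED, the index I as PRIVATE, and the scalar X could be made FIRSTPRIVATE so that each thread's copy is initialized from its value before the loop:

   ! Original serial loop.
   DO I = 1, N
      A(I) = B(I) * X
   END DO

   ! Conceptual multi-threaded equivalent after data partitioning
   ! and code generation: the iteration space is divided among
   ! threads created and synchronized by the parallel runtime.
   !$OMP PARALLEL DO SHARED(A, B) PRIVATE(I) FIRSTPRIVATE(X)
   DO I = 1, N
      A(I) = B(I) * X
   END DO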