The debugging techniques for multithreaded programs discussed in this section apply to both the OpenMP Fortran API and the Intel Fortran parallel compiler directives. When a program uses parallel decomposition directives, keep in mind that a bug might be caused either by an incorrect program statement or by an incorrect parallel decomposition directive. In either case, the program to be debugged can be executed by multiple threads simultaneously.
To debug multithreaded programs, you can use:
Intel Debugger for IA-32 and Intel Debugger for Itanium-based applications (idb)
Intel Fortran Compiler debugging options and methods; in particular, Compiling Source Lines with Debugging Statements.
Intel parallelization extension routines for low-level debugging.
VTune(TM) Performance Analyzer to identify the problematic areas.
Other best known debugging methods and tips include:
Correct the program in a single-threaded, uniprocessor environment first
Statically analyze locks
Use trace statements (such as PRINT statements)
Think in parallel, make very few assumptions
Step through your code
Make sense of threads and callstack information
Identify the primary thread
Know what thread you are debugging
Single stepping in one thread does not mean single stepping in others
Watch out for context switches
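The trace-statement tip can be sketched as follows. This is a minimal illustration, not code from this guide: the program name trace_demo and the critical section name trace_io are hypothetical. Each trace line is tagged with the thread number so that you always know which thread produced it, and the output is serialized so that lines from different threads do not interleave. The lines that start with the !$ conditional compilation sentinel are compiled only when OpenMP is enabled, so the same source also builds and runs single-threaded, which supports the first tip in the list.

```fortran
program trace_demo
!$ use omp_lib
  implicit none
  integer :: id

!$OMP PARALLEL PRIVATE(id)
  ! A private variable is undefined on entry, so give it a value first.
  id = 0
  ! Compiled only with OpenMP enabled; otherwise id stays 0.
!$ id = omp_get_thread_num()

  ! Only one thread at a time may execute the trace statement.
!$OMP CRITICAL (trace_io)
  print '(a,i0,a)', 'trace: thread ', id, ' entered the parallel region'
!$OMP END CRITICAL (trace_io)
!$OMP END PARALLEL
end program trace_demo
```

The order of the trace lines can differ from run to run, because it depends on how the threads are scheduled.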
Debuggers such as Intel Debugger for IA-32 and Intel Debugger for Itanium-based applications support the debugging of programs that are executed by multiple threads. However, the currently available versions of such debuggers do not directly support the debugging of parallel decomposition directives, and therefore, there are limitations on the debugging features.
Some of the new features used in OpenMP are not yet fully supported by the debuggers, so it is important to understand how these features work to know how to debug them. The two problem areas are:
Multiple entry points
Shared variables
You can use routine names (for example, padd) and entry names (for example, _PADD, ___PADD_6__par_loop0). By default, the Fortran compiler first converts lower case and mixed case routine names to upper case. For example, pAdD() becomes PADD(), and the entry name is formed by prepending one underscore. The secondary entry name mangling happens after that, which is why the "__par_loop" part of the entry name stays in lower case. Note that the debugger did not accept the upper case routine name "PADD" when setting a breakpoint; it accepted the lower case routine name "padd" instead.
The compiler implements a parallel region by enabling the code in the region and putting it into a separate, compiler-created entry point. Although this differs from outlining (the technique employed by other compilers, which creates a separate subroutine), the same debugging technique can be applied.
The compiler-generated parallel region entry point name is constructed with a concatenation of the following strings:
"__" characters
entry point name for the original routine (for example, _parallel)
"_" character
line number of the parallel region
one of the following strings, depending on the construct:
__par_region for OpenMP parallel regions (!$OMP PARALLEL)
__par_loop for OpenMP parallel loops (!$OMP PARALLEL DO)
__par_section for OpenMP parallel sections (!$OMP PARALLEL SECTIONS)
sequence number of the parallel region (for each source file, the sequence number starts from zero)
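The concatenation rule above can be sketched as a small helper. This is only an illustration of the naming scheme as described in this section; region_entry_name and its argument names are hypothetical and are not part of the compiler or any library. It can be handy for working out which entry point name to give the debugger as a breakpoint location.

```fortran
program entry_name_demo
  implicit none
  character(len=80) :: name

  ! Parallel region at line 3 of a routine whose entry point is _parallel,
  ! first parallel region in the source file (sequence number 0).
  name = region_entry_name('_parallel', 3, '__par_region', 0)
  print '(a)', trim(name)

contains

  function region_entry_name(routine_entry, line, suffix, seq) result(name)
    character(len=*), intent(in) :: routine_entry  ! entry point name of the original routine
    integer, intent(in)          :: line           ! line number of the parallel region
    character(len=*), intent(in) :: suffix         ! '__par_region', '__par_loop' or '__par_section'
    integer, intent(in)          :: seq            ! per-file sequence number, starting at zero
    character(len=80) :: name
    character(len=16) :: line_txt, seq_txt

    ! Convert the two integers to text without leading blanks.
    write (line_txt, '(i0)') line
    write (seq_txt, '(i0)') seq

    name = '__' // trim(routine_entry) // '_' // trim(line_txt) // &
           trim(suffix) // trim(seq_txt)
  end function region_entry_name

end program entry_name_demo
```

For the arguments shown, the program prints ___parallel_3__par_region0.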
Example 1 illustrates debugging code that contains a parallel region. The listings in Example 1 were produced with this command:
ifc -openmp -g -O0 -S file.f90
Consider the code of subroutine parallel in Example 1.
Subroutine PARALLEL() source listing
1 subroutine parallel
2 integer id,OMP_GET_THREAD_NUM
3 !$OMP PARALLEL PRIVATE(id)
4 id = OMP_GET_THREAD_NUM()
5 !$OMP END PARALLEL
6 end
The parallel region is at line 3. The compiler created two entry points: parallel_ and ___parallel_3__par_region0. The first entry point corresponds to the subroutine parallel(), while the second entry point corresponds to the OpenMP parallel region at line 3.
Machine Code Listing of the Subroutine parallel()
.globl parallel_
Debugging the program at this level is just like debugging a program that uses POSIX threads directly. Breakpoints can be set in the threaded code just as in any other routine. With the GNU debugger, breakpoints can be set on source-level routine names (such as parallel) and on entry point names (such as parallel_ and ___parallel_3__par_region0). Note that the Intel Fortran Compiler for Linux converts the upper case Fortran subroutine name to lower case.
In a debugger, you can switch from one thread to another. Each thread has its own program counter, so each thread can be at a different place in the code. Example 2 shows a Fortran subroutine PADD(). A breakpoint can be set at the entry point of the OpenMP parallel region.
Source listing of the Subroutine PADD()
12. SUBROUTINE PADD(A, B, C, N)
The first call stack below is obtained by breaking at the entry to subroutine PADD using the GNU debugger. At this point, the program has not executed any OpenMP regions and therefore has only one thread. The call stack shows the system runtime function __libc_start_main calling the Fortran main program parallel(), which in turn calls subroutine padd(). When the program is executed by more than one thread, you can switch from one thread to another. The second and third call stacks are obtained by breaking at the entry to the parallel region. The call stack of the master thread contains the complete call sequence, with _padd__6__par_loop0() at the top. Invoking a threaded entry point involves a layer of Intel OpenMP library function calls (that is, functions with the __kmp prefix). The call stack of the worker thread contains a partial call sequence that begins with a layer of Intel OpenMP library function calls.
ERRATA: The GNU debugger sometimes fails to properly unwind the call stack of the immediate caller of the Intel OpenMP library function __kmpc_fork_call().
Call Stack Dump of Master Thread upon Entry to Subroutine PADD
Switching from One Thread to Another
Call Stack Dump of Master Thread upon Entry to Parallel Region
Call Stack Dump of Worker Thread upon Entry to Parallel Region
Subroutine PADD() Machine Code Listing
.globl padd_
When a variable appears in a PRIVATE, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION clause on some block, the variable is made private to the parallel region by redeclaring it in the block. SHARED data, however, is not declared in the threaded code. Instead, it gets its declaration at the routine level. At the machine code level, these shared variables become incoming subroutine call arguments to the threaded entry points (such as ___PADD_6__par_loop0).
In Example 2, the entry point ___PADD_6__par_loop0 has six incoming parameters. The corresponding OpenMP parallel region has four shared variables. The first two parameters (parameters 1 and 2) are reserved for the compiler's use, and each of the remaining four parameters corresponds to one shared variable. These four parameters exactly match the last four parameters to __kmpc_fork_call() in the machine code of PADD.
Note
The FIRSTPRIVATE, LASTPRIVATE, and REDUCTION variables also require shared variables to get the values into or out of the parallel region.
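To make the distinction between private and shared data concrete, here is a minimal sketch of a parallel loop with four shared variables and one private variable. The routine vec_add and the program sharing_demo are hypothetical; this is not the PADD listing from Example 2, only a loop of the same general shape. The loop index i is private to each thread, while a, b, c, and n are declared at the routine level and shared by all threads, so these four are the values that would have to be passed into the threaded entry point.

```fortran
program sharing_demo
  implicit none
  integer, parameter :: n = 8
  real :: a(n), b(n), c(n)
  integer :: i

  do i = 1, n
    a(i) = real(i)
    b(i) = 10.0 * real(i)
  end do

  call vec_add(a, b, c, n)
  print '(8f6.1)', c

contains

  subroutine vec_add(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n), b(n)
    real, intent(out)   :: c(n)
    integer :: i

    ! a, b, c and n are shared: every thread sees the same storage.
    ! i is private: each thread works with its own copy.
!$OMP PARALLEL DO SHARED(a, b, c, n) PRIVATE(i)
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
!$OMP END PARALLEL DO
  end subroutine vec_add

end program sharing_demo
```

Each iteration writes a different element of c, so the threads do not need any further synchronization in this sketch.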
Because of the lack of support in debuggers, the correspondence between the shared variables (under their original names) and their contents cannot be seen in the debugger at the threaded entry point level. However, you can still move up the call stack to the frame of one of the calling subroutines and examine the contents of the shared variables at that level. In Example 2, the contents of the shared variables A, B, C, and N can be examined if you move to the call stack of PARALLEL().