
Parallel Execution


This section provides information on parallel execution.

Transforming Eligible Loops for Parallel Execution (+Oparallel)

The +Oparallel option causes the compiler to transform eligible loops for parallel execution on multiprocessor machines.

If you compile with the +Oparallel option and link in a separate step, you must link with the cc command and specify +Oparallel so that the correct startup files and runtime support are linked in.

When a program is compiled with the +Oparallel option, the compiler looks for opportunities for parallel execution in loops and generates parallel code to execute the loop on the number of processors set by the MP_NUMBER_OF_THREADS environment variable discussed below. By default, this is the number of processors on the executing machine.

For a discussion of parallelization, including how to use the +Oparallel option, see "Parallelizing C Programs" below. For more detail on +Oparallel, see the description in "Controlling Specific Optimizer Features" earlier in this chapter.

Environment Variable for Parallel Programs

The environment variable MP_NUMBER_OF_THREADS is available for use with parallel programs.

MP_NUMBER_OF_THREADS

The MP_NUMBER_OF_THREADS environment variable enables you to set the number of processors that are to execute your program in parallel. If you do not set this variable, it defaults to the number of processors on the executing machine.

On the C shell, the following command sets MP_NUMBER_OF_THREADS to indicate that programs compiled for parallel execution can execute on two processors:

setenv MP_NUMBER_OF_THREADS 2

If you use the Korn shell, the command is:

export MP_NUMBER_OF_THREADS=2
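
The parallel runtime reads this variable itself; you do not need to query it in your code. Purely as an illustration, the following minimal C sketch uses the standard getenv routine to display the same setting (the messages printed here are hypothetical, not produced by the runtime):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Illustration only: report what MP_NUMBER_OF_THREADS is set to. */
    const char *nthreads = getenv("MP_NUMBER_OF_THREADS");

    if (nthreads != NULL)
        printf("MP_NUMBER_OF_THREADS = %s\n", nthreads);
    else
        printf("MP_NUMBER_OF_THREADS is not set; "
               "the runtime defaults to the number of processors.\n");
    return 0;
}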

Parallelizing C Programs

The following sections discuss how to compile C programs for parallel execution and inhibitors to parallelization.

Compiling Code for Parallel Execution

The following command lines compile (without linking) three source files: x.c, y.c, and z.c. The files x.c and y.c are compiled for parallel execution. The file z.c is compiled for serial execution, even though its object file will be linked with x.o and y.o.

cc +O3 +Oparallel -c x.c y.c
cc +O3 -c z.c

The following command line links the three object files, producing the executable file para_prog:

cc +O3 +Oparallel -o para_prog x.o y.o z.o

As this command line implies, if you compile and link separately, you must link with cc, not ld. The command line that performs the link must also include the +Oparallel and +O3 options in order to link in the correct startup files and runtime support.

NOTE: To ensure the best performance from a parallel program, do not run more than one parallel program on a multiprocessor machine at the same time. Running two or more parallel programs simultaneously, or running one parallel program on a heavily loaded system, will slow performance. You should run a parallel-executing program at a higher priority than any other user program; see rtprio(1) for information about setting real-time priorities.

HP-UX 10.20 users: At run time, compiler-inserted code checks whether a call is to a system routine or to a user-defined routine that has the same name as a system routine. If the call is to a system routine, this code inhibits parallel execution. If your program makes explicit use of threads, do not attempt to parallelize it.

Profiling Parallelized Programs

Profiling a program that has been compiled for parallel execution is performed in much the same way as it is for non-parallel programs:

  1. Compile the program with the -G option.

  2. Run the program to produce profiling data.

  3. Run gprof against the program.

  4. View the output from gprof.

The differences are:

  • Running the program in Step 2 produces a gmon.out file for the master process and gmon.out.1, gmon.out.2, and so on for each of the slave processes. Thus, if your program is to execute on two processors, Step 2 will produce two files, gmon.out and gmon.out.1.

  • The flat profile that you view in Step 4 indicates loops that were parallelized with the following notation:

    routine_name##pr_line_0123

    where routine_name is the name of the routine containing the loop, pr (parallel region) indicates that the loop was parallelized, and 0123 is the line number of the beginning of the loop or loops that are parallelized.

Conditions Inhibiting Loop Parallelization

The following sections describe different conditions that can inhibit parallelization.

Additionally, the +Onoloop_transform and +Onoinline options may be helpful if you experience problems while using +Oparallel.

Calling Routines with Side Effects

The compiler will not parallelize any loop containing a call to a routine that has side effects. A routine has side effects if it does any of the following:

  • Modifies its arguments

  • Modifies an extern or static variable

  • Redefines variables that are local to the calling routine

  • Performs I/O

  • Calls another subroutine or function that does any of the above
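
As an illustration (the routine names below are hypothetical, not part of this guide), the first loop calls a routine that performs I/O and modifies a static variable, so the compiler treats the call as having side effects and leaves the loop serial; the second loop contains no call and remains a candidate for parallelization:

#include <stdio.h>

static int call_count;            /* modified by log_value: a side effect */

void log_value(double x)          /* hypothetical routine with side effects */
{
    call_count++;                 /* modifies a static variable */
    printf("%f\n", x);            /* performs I/O */
}

void scale(double *a, double *b, int n)
{
    int i;

    for (i = 0; i < n; i++) {     /* call with side effects: not parallelized */
        a[i] = a[i] * b[i];
        log_value(a[i]);
    }

    for (i = 0; i < n; i++)       /* no calls: still eligible for parallelization */
        a[i] = a[i] * b[i];
}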

Indeterminate Iteration Counts

If the compiler cannot arrange for a loop's iteration count to be determined at run time before the loop starts to execute, it will not parallelize the loop. The reason for this precaution is that the runtime code must know the iteration count in order to know how many iterations to distribute to the different processors for execution.

The following conditions can prevent a runtime count:

  • The loop is an infinite loop.

  • A conditional break statement or goto out of the loop appears in the loop.

  • The loop modifies either the loop-control or loop-limit variable.

  • The loop is a while construct and the condition being tested is defined within the loop.
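
For example (hypothetical code, not from this guide), neither of the following loops has an iteration count that can be established before the loop begins, so neither would be parallelized:

int first_negative(const int *a, int n)
{
    int i;

    for (i = 0; i < n; i++)
        if (a[i] < 0)
            break;                /* conditional break out of the loop */
    return i;
}

void shrink_limit(const int *a, int n)
{
    int i;

    for (i = 0; i < n; i++)
        if (a[i] == 0)
            n = n / 2;            /* loop modifies its own loop-limit variable */
}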

Data Dependence

When a loop is parallelized, the iterations are executed independently on different processors, and the order of execution will differ from the serial order that occurs on a single processor. For many loops, this change in order has no effect on the results: the iterations could be executed in any order and the loop would still produce the same data. Consider the following loop:

for (i=0; i<5; i++)
    a[i] = a[i] * b[i];

In this example, the array a would always end up with the same data regardless of whether the order of execution were 0-1-2-3-4, 4-3-2-1-0, 3-1-4-0-2, or any other order. The independence of each iteration from the others makes the loop an eligible candidate for parallelization.

Such is not the case in the following:

for (i=1; i<5; i++)
    a[i] = a[i-1] * b[i];

In this loop, the order of execution does matter. The data used in iteration i depends upon the data that was produced in the previous iteration, i-1. The array a would end up with very different data if the order of execution were anything other than 1-2-3-4. The data dependence in this loop therefore makes it ineligible for parallelization.

Not all data dependences inhibit parallelization. The following paragraphs discuss some of the exceptions.

Nested Loops and Matrices

Some nested loops that operate on matrices may have a data dependence in the inner loop only, allowing the outer loop to be parallelized. Consider the following:

for (i=0; i<10; i++)
    for (j=1; j<100; j++)
        a[i][j] = a[i][j-1] + 1;

The data dependence in this nested loop occurs in the inner (j) loop: each element a[i][j] depends upon the element a[i][j-1] that was assigned in the previous iteration of the j loop. If the iterations of the j loop were to execute in any order other than the one in which they would execute on a single processor, the matrix would be assigned different values. The inner loop, therefore, must not be parallelized.

But no such data dependence appears in the outer (i) loop: each row of the matrix is processed independently of every other row. Consequently, the compiler can safely distribute entire rows of the matrix to execute on different processors; the data assignments will be the same regardless of the order in which the rows are executed, so long as each row's inner loop executes in serial order.
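
The following sketch is conceptual only; it is not the code the compiler generates. It shows how the i iterations could be split into blocks, one block per processor (NPROC is a hypothetical processor count assumed to divide the row count evenly), while the j loop in each block still runs in serial order:

#define N_I   10
#define N_J   100
#define NPROC 2                   /* hypothetical processor count */

void fill(int a[N_I][N_J])
{
    int p, i, j;

    for (p = 0; p < NPROC; p++)   /* conceptually, one block per processor */
        for (i = p * (N_I / NPROC); i < (p + 1) * (N_I / NPROC); i++)
            for (j = 1; j < N_J; j++)
                a[i][j] = a[i][j-1] + 1;
}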

Assumed Dependences

When analyzing a loop, the compiler errs on the safe side: if something looks like a data dependence, the compiler assumes that it is one and does not parallelize the loop. Consider the following:

for (i=100; i<200; i++)
    a[i] = a[i-k];

The compiler assumes that a data dependence exists in this loop because data defined in one iteration appears to be used in a later iteration. However, if the value of k is 100, the dependence is assumed rather than real: a[i-k] then refers to elements a[0] through a[99], all of which are defined outside the loop.

© Hewlett-Packard Development Company, L.P.