HP C/HP-UX Programmer's Guide: Workstations and Servers > Chapter 4 Optimizing HP C Programs > Parallel Execution
|
This section provides information on parallel execution. The +Oparallel option causes the compiler to transform eligible loops for parallel execution on multiprocessor machines. When a program is compiled with the +Oparallel option, the compiler looks for opportunities for parallel execution in loops and generates parallel code to execute each such loop on the number of processors set by the MP_NUMBER_OF_THREADS environment variable, discussed below. By default, this is the number of processors on the executing machine.

If you link separately from the compile line and you compiled with the +Oparallel option, you must link with the cc command and specify the +Oparallel option to link in the right startup files and runtime support.

For a discussion of parallelization, including how to use the +Oparallel option, see "Parallelizing C Programs" below. For more detail on +Oparallel, see the description in "Controlling Specific Optimizer Features" earlier in this chapter.

The MP_NUMBER_OF_THREADS environment variable enables you to set the number of processors that are to execute your program in parallel. If you do not set this variable, it defaults to the number of processors on the executing machine. On the C shell, the following command sets MP_NUMBER_OF_THREADS to indicate that programs compiled for parallel execution can execute on two processors:
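The example command is missing from this copy of the text; using standard C shell syntax, it would be:

```
setenv MP_NUMBER_OF_THREADS 2
```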
If you use the Korn shell, the command is:
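The Korn shell example is likewise missing here; using standard Korn shell syntax, it would be:

```shell
export MP_NUMBER_OF_THREADS=2
```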
The following sections discuss how to compile C programs for parallel execution and describe conditions that inhibit parallelization.

The following command lines compile (without linking) three source files: x.c, y.c, and z.c. The files x.c and y.c are compiled for parallel execution. The file z.c is compiled for serial execution, even though its object file will be linked with x.o and y.o.
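The command lines themselves are missing from this copy of the text. A plausible reconstruction, using the option spellings given elsewhere in this section (the link step is later said to require both +Oparallel and +O3), is:

```
cc +O3 +Oparallel -c x.c y.c
cc +O3 -c z.c
```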
The following command line links the three object files, producing the executable file para_prog:
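That command line is also missing from this copy; a plausible reconstruction, consistent with the requirement stated below that the link line use cc with +Oparallel and +O3, is:

```
cc +O3 +Oparallel -o para_prog x.o y.o z.o
```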
As this command line implies, if you compile and link separately, you must use cc, not ld, to link. The link command line must also include the +Oparallel and +O3 options in order to link in the right startup files and runtime support.
HP-UX 10.20 users: At runtime, compiler-inserted code performs a check to determine if the call is to a system routine or to a user-defined routine with the same name as a system routine. If the call is to a system routine, the code inhibits parallel execution.

If your program makes explicit use of threads, do not attempt to parallelize it.

Profiling a program that has been compiled for parallel execution is performed in much the same way as it is for non-parallel programs:
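The profiling steps are missing from this copy of the text. A sketch of such a session, assuming the conventional prof(1) workflow with -p instrumentation (the exact flags on your HP-UX release may differ; check cc(1)):

```
cc +O3 +Oparallel -p -o para_prog x.c y.c z.c
./para_prog            # produces the profile data file (mon.out)
prof para_prog
```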
The differences are:
The following sections describe conditions that can inhibit parallelization. Additionally, +Onoloop_transform and +Onoinline may be helpful options if you experience any problem while using +Oparallel.

The compiler will not parallelize any loop containing a call to a routine that has side effects. A routine has side effects if it does any of the following:
If the compiler determines that a runtime determination of a loop's iteration count cannot be made before the loop starts to execute, the compiler will not parallelize the loop. The reason for this precaution is that the runtime code must know the iteration count in order to know how many iterations to distribute to the different processors for execution. The following conditions can prevent a runtime count:
When a loop is parallelized, its iterations are executed independently on different processors, and the order of execution will differ from the serial order that occurs on a single processor. For some loops, this effect of parallelization is not a problem, because the iterations could be executed in any order with no effect on the results. Consider the following loop:
In this example, the array a would always end up with the same data regardless of whether the order of execution were 0-1-2-3-4, 4-3-2-1-0, 3-1-4-0-2, or any other order. The independence of each iteration from the others makes the loop an eligible candidate for parallelization. Such is not the case in the following:
In this loop, the order of execution does matter. The data used in iteration i is dependent upon the data that was produced in the previous iteration [i-1]. a would end up with very different data if the order of execution were any other than 1-2-3-4. The data dependence in this loop thus makes it ineligible for parallelization.

Not all data dependences inhibit parallelization. The following paragraphs discuss some of the exceptions. Some nested loops that operate on matrices may have a data dependence in the inner loop only, allowing the outer loop to be parallelized. Consider the following:
The data dependence in this nested loop occurs in the inner [j] loop: each row access of a[j][i] depends upon the preceding row [j-1] having been assigned in the previous iteration. If the iterations of the j loop were to execute in any other order than the one in which they would execute on a single processor, the matrix would be assigned different values. The inner loop, therefore, must not be parallelized.

But no such data dependence appears in the outer loop: each column access is independent of every other column access. Consequently, the compiler can safely distribute entire columns of the matrix to execute on different processors; the data assignments will be the same regardless of the order in which the columns are executed, so long as each column executes in serial order.

When analyzing a loop, the compiler will err on the safe side and assume that what looks like a data dependence really is one, and so not parallelize the loop. Consider the following:
The compiler will assume that a data dependence exists in this loop because it appears that data defined in a previous iteration is being used in a later iteration. However, if the value of k is 100, the dependence is assumed rather than real, because a[i-k] is defined outside the loop.