HPlogo HP C/HP-UX Programmer's Guide: HP 9000 Computers > Chapter 4 Optimizing HP C Programs

Level 2 Optimization Modules

» 

Technical documentation

Complete book in PDF

 » Table of Contents

Level 2 performs optimizations within each procedure. At level 2, the optimizer performs all optimizations performed at the prior level, with the following additions:

  • FMAC synthesis.

  • Coloring register allocation.

  • Induction variable elimination and strength reduction.

  • Local and global common subexpression elimination.

  • Advanced constant folding and propagation. (Simple constant folding is done by level 0 optimization.)

  • Loop invariant code motion.

  • Store/copy optimization.

  • Unused definition elimination.

  • Software pipelining.

  • Register reassociation.

  • Loop unrolling.

The examples in this section are shown at the source code level wherever possible. Transformations that cannot be shown at the source level are shown in assembly language.

Coloring Register Allocation

The name of this optimization comes from the similarity to map coloring algorithms in graph theory. This optimization determines when and how long commonly used variables and expressions occupy a register. It minimizes the number of references to memory (loads and stores) a code segment makes. This can improve run-time speed.

You can help the optimizer understand when certain variables are heavily used within a function by declaring these variables with the register qualifier. The first 10 register qualified variables encountered in the source are honored. You should pick the ten most important variables to be most effective.

The coloring register allocator may override your choices and promote to a register a variable not declared register over one that is, based on estimated speed improvements.

The following code shows the type of optimization the coloring register allocation module performs. The code:

     LDI     2,r104
COPY r104,r103
LDO 5(r103),r106
COPY r106,r105
LDO 10(r105),r107

becomes:

     LDI     2,r25
LDO 5(r25),r26
LDO 10(r26),r31

Induction Variables and Strength Reduction

The induction variables and strength reduction module removes expressions that are linear functions of a loop counter and replaces each of them with a variable that contains the value of the function. Variables of the same linear function are computed only once. This module also simplifies the function by replacing multiplication instructions with addition instructions wherever possible.

For example, the code:

     for (i=0; i<25; i++) {
r[i] = i * k;
}

becomes:

     t1 = 0;
for (i=0; i<25; i++) {
r[i] = t1;
t1 += k;
}

Local and Global Common Subexpression Elimination

The common subexpression elimination module identifies expressions that appear more than once and have the same result, computes the result, and substitutes the result for each occurrence of the expression. The types of subexpression include instructions that load values from memory, as well as arithmetic evaluation.

For example, the code:

     a = x + y + z;
b = x + y + w;

becomes:

     t1 = x + y;
a = t1 + z;
b = t1 + w;

Constant Folding and Propagation

Constant folding computes the value of a constant expression at compile time. For example:

A = 10;
B = A + 5;
C = 4 * B;

can be replaced by:

A = 10;
B = 15;
C = 60;

Loop Invariant Code Motion

The loop invariant code motion module recognizes instructions inside a loop whose results do not change and moves them outside the loop. This ensures that the invariant code is only executed once.

For example, the code:

     x = z;
for(i=0; i<10; i++)
{
a[i] = 4 * x + i;
}

becomes:

     x = z;
t1 = 4 * x;
for(i=0; i<10; i++)
{
a[i] = t1 + i;
}

Store/Copy Optimization

Where possible, the store/copy optimization module substitutes registers for memory locations, by replacing store instructions with copy instructions and deleting load instructions.

For example, the following HP C code:

     a = x + 23; 

where a is a local variable.

     return a;

produces the following code for the unoptimized case:

     LDO     23(r26),r1
STW r1,-52(0,sp)
LDW -52(0,sp),ret0

and this code for the optimized case:

     LDO     23(r26),ret0

Unused Definition Elimination

The unused definition elimination module removes unused memory location and register definitions. These definitions are often a result of transformations made by other optimization modules.

For example, the function:

     f(int x)
{
int a,b,c:

a = 1;
b = 2;
c = x * b;
return c;
}

becomes:

     f(int x)
{
int a,b,c;

b = 2;
c = x * b;
return c;
}

Software Pipelining

Software pipelining is a code transformation that optimizes program loops. It rearranges the order in which instructions are executed in a loop. It generates code that overlaps operations from different loop iterations. Software pipelining is useful for loops that contain arithmetic operations on floats and doubles.

The goal of this optimization is to avoid CPU stalls due to memory or hardware pipeline latencies. The software pipelining transformation adds code before and after the loop to achieve a high degree of optimization within the loop.

Example

The following pseudo-code fragment shows a loop before and after the software pipelining optimization. Four significant things happen:

  • A portion of the first iteration of the loop is performed before the loop.

  • A portion of the last iteration of the loop is performed after the loop.

  • The loop is unrolled twice.

  • Operations from different loop iterations are interleaved with each other.

The following is a C for loop:

#define SIZ 10000
float x[SIZ], y[SIZ]; \*Software pipelining works with*\
int i; \*floats and doubles. *\
init();
for (i = 0;i<= SIZ;i++);
{
x[i] =x[i] / y[i] + 4.00
}

When this loop is compiled with software pipelining, the optimization can be expressed in pseudo-code as follows:

R1 = 0;                Initialize
array index.
R2 = 4.0;              Load
constant value.
R3 = Y[0];             Load
first Y value.
R4 = X[0];             Load
first X value.
R5 = R4 / R3;          Perform
division on first element:n = X[0] / Y[0]. 
                      
do {                   Begin
loop.
      R6 = R1;         Save
current array index.
      R1++;            Increment
array index.
      R7 = X[R1];      Load
current X value.
      R8 = Y[R1];      Load
current Y value.
      R9 = R5 + R2;    Perform
addition on prior row:X[i] = n + 4.0.
      R10 = R7 / R8;   Perform
division on current row:m = X[i+1] / Y[i+1].
      X[R6] = R9;      Save
result of operations on prior row.
                      
      R6 = R1;         Save
current array index.
      R1++;            Increment
array index. 
      R4 = X[R1];      Load
next X value. 
      R3 = Y[R1];      Load
next Y value. 
      R11 = R10 + R2;  Perform
addition on current row:X[i+1] = m + 4 
      R5 = R4 / R3;    Perform
division on next row:n =  X[i+2] / Y[i+2] 
      X[R6] = R11      Save
result of operations on current row.
} while (R1 <= 100);   End
loop. 
                      
R9 = R5 + R2;          Perform
addition on last row:X[i+2] = n + 4 
X[R6] = R9;            Save
result of operations on last row.

This transformation stores intermediate results of the division instructions in unique registers (noted as n and m). These registers are not referenced until several instructions after the division operations. This decreases the possibility that the long latency period of the division instructions will stall the instruction pipeline and cause processing delays.

Prerequisites of Pipelining

Software pipelining is attempted on a loop that meets the following criteria:

  • It is the innermost loop.

  • There are no branches or function calls within the loop.

  • The loop is of moderate size.

This optimization produces slightly larger program files and increases compile time. It is most beneficial in programs containing loops that are executed a large number of times. This optimization is not recommended for loops that are executed only a small number of times.

Use the +Onopipeline option with the +O2, +O3, or +O4 option to suppress software pipelining if program size is more important than execution speed. This will perform level two optimization, but disable software pipelining.

Register Reassociation

Array references often require one or more instructions to compute the virtual memory address of the array element specified by the subscript expression. The register reassociation optimization implemented in the PA-RISC compilers tries to reduce the cost of computing the virtual memory address expression for array references found in loops.

Within loops, the virtual memory address expression can be rearranged and separated into a loop varying term and a loop invariant term. Loop varying terms are those items whose values may change from one iteration of the loop to another. Loop invariant terms are those items whose values are constant throughout all iterations of the loop. The loop varying term corresponds to the difference in the virtual memory address associated with a particular array reference from one iteration of the loop to the next.

The register reassociation optimization dedicates a register to track the value of the virtual memory address expression for one or more array references in a loop and updates the register appropriately in each iteration of a loop.

The register is initialized outside the loop to the loop invariant portion of the virtual memory address expression and the register is incremented or decremented within the loop by the loop variant portion of the virtual memory address expression. On PA-RISC, the update of such a dedicated register can often be performed for free using the base-register modification capability of load and store instructions.

The net result is that array references in loops are converted into equivalent but more efficient pointer dereferences.

For example:

int a[10][20][30];

void example (void)
{
int i, j, k;

for (k = 0; k < 10; k++)
for (j = 0; j < 10; j++)
for (i = 0; i < 10; i++)
{
a[i][j][k] = 1;
}
}

after register reassociation is applied to the innermost loop becomes:

int a[10][20][30];

void example (void)
{
int i, j, k;
register int (*p)[20][30];

for (k = 0; k < 10; k++)
for (j = 0; j < 10; j++)
for (p = (int (*)[20][30]) a[0][j][k], i = 0; i < 10; i++)
{
*(p++[0][0]) = 1;
}
}

In the above example, the compiler-generated temporary register variable, p, strides through the array a in the innermost loop. This register pointer variable is initialized outside the innermost loop and auto-incremented within the innermost loop as a side-effect of the pointer dereference.

Register reassociation can often enable another loop optimization. After performing the register reassociation optimization, the loop variable may be needed only to control the iteration count of the loop. If this is case, the original loop variable can be eliminated altogether by using the PA-RISC ADDIB and ADDB machine instructions to control the loop iteration count.

© Hewlett-Packard Development Company, L.P.