Optimization

Optimization is the process by which ALLBASE/SQL finds the most appropriate access path to data for the execution of a particular SQL statement. An access path is one of the following ways of getting at data:

Sequential scan -- reading the entire table.
Index scan -- using a B-tree or PCR to access particular rows or the entire table.
Hash scan -- accessing a hash page directly, then finding the row on that page or on an overflow page.
TID scan -- accessing a single row directly using its page address or TID.

Usually the most appropriate path is the one requiring the fewest I/O operations needed to read requested data.

Optimization also includes decisions about join order, join method (nested loop or sort/merge), and sort operations (indexed or not).

How Optimization is Done

Optimization is primarily done on the basis of elements in the WHERE clause of an SQL statement. The WHERE clause may contain one or many components, known as factors, which limit the amount of data that will be manipulated by the statement. Information about these factors can potentially be used to improve the performance of a query. The following example has two EQUAL factors:

SELECT * FROM T1 WHERE C1 = 'Jones' AND C2 = 123534

Assume that C1 is defined as CHAR(40) NOT NULL and C2 is defined as INTEGER NOT NULL.

To retrieve the data, the query processor must scan the table to retrieve the rows in which C1 is Jones and C2 is 123534. There are several ways of doing this:

Scan the entire table, choosing only the rows that meet both criteria.
Scan the table for rows that satisfy the criterion for C1 and then eliminate the rows that do not satisfy the criterion for C2.
Scan the table for rows that satisfy the criterion for C2 and then eliminate the rows that do not satisfy the criterion for C1.

Assuming there are B-tree indexes on both columns, what would the best approach be? In order to answer this, we need to consider several variables:

Size of the table.
Selectivity of each index.
Index cluster count of each index.
Sizes of the two indexes.

Table Size

If the table occupies only 3 pages in the DBEFile, the most that would be required to access the data without using an index is 3 I/Os (plus some I/Os to access system information; we ignore this hereafter). If the index occupies only 2 pages, the same query might need to access as few as 5 pages, but it could conceivably access many more than this, if the cluster count is high. There is clearly no advantage in using an index in this case.

As the size of the table increases, the use of an index may become more advantageous. Assume that the table has 10,000 pages, that data is stored in sorted order and that key values are found on 100 data pages. In this case, 10,000 I/Os would be needed to do the query with a serial scan. If a B-tree index on the table has 3 levels and if key values are found on two index leaves and 100 data pages, then only 104 I/Os would be needed for the same query. The index scan clearly makes sense now.

Selectivity

How can the optimizer tell whether to use a B-tree index on C1 or on C2, assuming that both exist? Index predicate selectivity and cluster count are the most important factors; the size of the index and the characteristics of the key also determine the choice. As an example of selectivity, suppose the index on C1 resulted in the selection of 200 rows, but the index on C2 resulted in 2000. Other things being equal, the optimizer would select C1 as the index with the higher selectivity, that is, the index that retrieves the smallest number of rows for its predicate.

Index Size

An index on C1 would be considerably larger than one on C2. Since key values of C1 are 40 bytes, and key values in C2 are 4 bytes, the B-tree on C1 would be ten times the size of the B-tree on C2, and the number of I/Os needed to retrieve the same number of key values using C1 would be ten times the number needed for the index on C2. In this case, if selectivity is the same, the optimizer would choose C2.

Cluster Count

The cluster count of an index indicates the number of times ALLBASE/SQL has to access a different data page to retrieve the next row during an index scan. Assume the table is sorted in the order of keys in C1. In this case, the cluster count for a B-tree on C1 would be equal to the number of pages on which rows are found. Let's say this is 100 pages for C1. Assume the same selectivity for C2, but random distribution of key values over the pages of the DBEFile in which data is stored. Each row retrieved with C2 would require a new I/O, and the cluster count could be closer to the number of rows in the table. In the present case, C1 is chosen. Cluster count is described in more detail in the "Guidelines on Logical and Physical Design" chapter. To monitor the cluster count, run SQLMON and go to the Static Cluster screen.

The optimizer makes all its choices in a similar fashion, weighing the relative cost of each scan plan and eventually choosing the most appropriate one.

Using GENPLAN

You can observe the result of the optimizer's work by using the SQL GENPLAN statement followed by a query on the system catalog pseudotable SYSTEM.PLAN. See the "Analyzing Queries with GENPLAN" section in the "Guidelines on Query Design" chapter for more information.

Using SETOPT

To override the access plan chosen by the optimizer, use the SETOPT statement. For more information, See the SETOPT section in the ALLBASE/SQL Reference Manual.