Optimization [ ALLBASE/SQL Performance Guidelines ] MPE/iX 5.0 Documentation
ALLBASE/SQL Performance Guidelines
Optimization
Optimization is the process by which ALLBASE/SQL finds the most
appropriate access path to data for the execution of a particular SQL
statement. An access path is one of the following ways of getting at
data:
* Sequential scan--reading the entire table.
* Index scan--using a B-tree or PCR to access particular rows or the
entire table.
* Hash scan--accessing a hash page directly, then finding the row on
that page or on an overflow page.
* TID scan--accessing a single row directly using its page address
or TID.
Usually the most appropriate path is the one requiring the fewest I/O
operations needed to read requested data.
Optimization also includes decisions about join order, join method
(nested loop or sort/merge), and sort operations (indexed or not).
How Optimization is Done
Optimization is primarily done on the basis of elements in the WHERE
clause of an SQL statement. The WHERE clause may contain one or many
components, known as factors, which limit the amount of data that will be
manipulated by the statement. Information about these factors can
potentially be used to improve the performance of a query. The following
example has two EQUAL factors:
SELECT * FROM T1 WHERE C1 = 'Jones' AND C2 = 123534
Assume that C1 is defined as CHAR(40) NOT NULL and C2 is defined as
INTEGER NOT NULL.
To retrieve the data, the query processor must scan the table to retrieve
the rows in which C1 is Jones and C2 is 123534. There are several ways
of doing this:
* Scan the entire table, choosing only the rows that meet both
criteria.
* Scan the table for rows that satisfy the criterion for C1 and then
eliminate the rows that do not satisfy the criterion for C2.
* Scan the table for rows that satisfy the criterion for C2 and then
eliminate the rows that do not satisfy the criterion for C1.
Assuming there are B-tree indexes on both columns, what would the best
approach be? In order to answer this, we need to consider several
variables:
* Size of the table.
* Selectivity of each index.
* Index cluster count of each index.
* Sizes of the two indexes.
Table Size.
If the table occupies only 3 pages in the DBEFile, the most that would be
required to access the data without using an index is 3 I/Os (plus some
I/Os to access system information; we ignore this hereafter). If the
index occupies only 2 pages, the same query might need to access as few
as 5 pages, but it could conceivably access many more than this, if the
cluster count is high. There is clearly no advantage in using an index
in this case.
As the size of the table increases, the use of an index may become more
advantageous. Assume that the table has 10,000 pages, that data is
stored in sorted order and that key values are found on 100 data pages.
In this case, 10,000 I/Os would be needed to do the query with a serial
scan. If a B-tree index on the table has 3 levels and if key values are
found on two index leaves and 100 data pages, then only 104 I/Os would be
needed for the same query. The index scan clearly makes sense now.
Selectivity.
How can the optimizer tell whether to use a B-tree index on C1 or on C2,
assuming that both exist? Index predicate selectivity and cluster count
are the most important factors; the size of the index and the
characteristics of the key also determine the choice. As an example of
selectivity, suppose the index on C1 resulted in the selection of 200
rows, but the index on C2 resulted in 2000. Other things being equal,
the optimizer would select C1 as the index with the higher selectivity,
that is, the index that retrieves the smallest number of rows for its
predicate.
Index Size.
An index on C1 would be considerably larger than one on C2. Since key
values of C1 are 40 bytes, and key values in C2 are 4 bytes, the B-tree
on C1 would be ten times the size of the B-tree on C2, and the number of
I/Os needed to retrieve the same number of key values using C1 would be
ten times the number needed for the index on C2. In this case, if
selectivity is the same, the optimizer would choose C2.
Cluster Count.
The cluster count of an index indicates the number of times ALLBASE/SQL
has to access a different data page to retrieve the next row during an
index scan. Assume the table is sorted in the order of keys in C1. In
this case, the cluster count for a B-tree on C1 would be equal to the
number of pages on which rows are found. Let's say this is 100 pages for
C1. Assume the same selectivity for C2, but random distribution of key
values over the pages of the DBEFile in which data is stored. Each row
retrieved with C2 would require a new I/O, and the cluster count could be
closer to the number of rows in the table. In the present case, C1 is
chosen. Cluster count is described in more detail in the "Guidelines on
Logical and Physical Design."
The optimizer makes all its choices in a similar fashion, weighing the
relative cost of each scan plan and eventually choosing the most
appropriate one.
Using GENPLAN
You can observe the result of the optimizer's work by using the SQL
GENPLAN statement followed by a query on the system catalog pseudotable
SYSTEM.PLAN.
MPE/iX 5.0 Documentation