ALLBASE/SQL Performance Guidelines
Logical Data Design
Logical data design concerns normalization of data, followed by the
creation of tables, views, indexes, and authorization schemes. When
designing logical objects in ALLBASE/SQL, know your queries beforehand.
The following paragraphs present ways of adjusting your design to arrive
at the best performance for the functionality you need to implement. The
following topics are discussed:
* Normalization Issues.
* Including Calculated Data in Tables.
* B-Tree Index Design.
* Clustering Indexes.
* Choosing Appropriate Index Types.
* Authorization Design.
* Using a Duplicate Database.
Normalization Issues
Normalization results in tables with a relatively small number of
columns, and a minimum amount of redundancy within a given table. In
general, this yields an efficient design, but there are some drawbacks.
The following table shows some of the pros and cons of normalization.
Pros and Cons of Normalization
----------------------------------------------------------------------------
| Normalization Pros               | Normalization Cons                    |
----------------------------------------------------------------------------
| Little or no data redundancy     | Relatively larger number of table     |
|                                  | accesses (for joins)                  |
----------------------------------------------------------------------------
| Relatively more efficient use of | Additional indexes (PCRs) required    |
| storage space                    | to maintain referential integrity     |
----------------------------------------------------------------------------
| Reduced data maintenance, since  | Increased overhead for processing     |
| you don't need to update as many | more join queries                     |
| occurrences of a value           |                                       |
----------------------------------------------------------------------------
Here are some tips:
* In general, design your tables to be in third normal form so that
updates can be carried out against the smallest number of tuples.
* Avoid overnormalization of tables in logical design. If two
reasonably static tables are joined frequently, consider making
them into one table.
For example, consider the following two tables in third normal form:
__________________________
| employee | department | Table A
|___________|______________|
______________________________
| department | location | Table B
|_____________|________________|
If you have a continual need to find an employee's location, you might
consider combining the two tables. In this case, you will incur extra
maintenance whenever a department's location changes, since the location
now appears in every tuple of Table C:
__________________________________________
| employee | department | location | Table C
|___________|______________|_______________|
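For example, the combined table might be created and populated as follows.
This is a minimal sketch: the table names EMPDEPT (Table A), DEPTLOC
(Table B), and EMPLOC (Table C), and the column lengths, are hypothetical:
     CREATE PUBLIC TABLE EMPLOC
          (EMPLOYEE   CHAR(20),
           DEPARTMENT CHAR(10),
           LOCATION   CHAR(20))

     INSERT INTO EMPLOC
          SELECT A.EMPLOYEE, A.DEPARTMENT, B.LOCATION
          FROM EMPDEPT A, DEPTLOC B
          WHERE A.DEPARTMENT = B.DEPARTMENT
Finding an employee's location then requires only a single-table SELECT
on EMPLOC rather than a two-table join.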
When should you not combine tables, but stick with third normal form?
Retain third normal form for tables in which you must maintain several
rows of data that are fully dependent on one particular key:
_____________________________________________
| custno | address | state | zip | phone | Table D
|________|_________|_________|______|_________|
_____________________________________________
| custno | orderno | orderdate | . . . . | Table E
|________|_________|___________|______________|
Since Table E may contain many orders for each customer number, the
amount of redundancy would be unacceptable if you were to include Table
D's columns in each row of Table E.
Combining tables can also increase the concurrency between batch
processes and on-line activities, since there is less contention for I/O
resources within the system.
Denormalizing Tables that are Consistently Joined.
Look for tables that are consistently being joined, and consider
combining tables or duplicating some columns to prevent the need for the
join. Considerations:
* Not all joins are expensive. If join columns are supported with
unique indexes, the cost of doing a join is not really an issue.
* Only consider denormalizing when there is a performance problem
that can be attributed to the join.
* Assess the impact of denormalizing on the performance of other
processes that use the tables. Consider how often each process
occurs, and tune for the most used processes.
* Duplicate data wastes storage space and can cause update problems.
* Tables that are combined in denormalization should have a 1:1
relationship with each other. The combining should not result in
an excessive row length or an excessive number of indexes.
* Tables should have the fewest indexes possible.
Horizontal Partitioning.
Large tables that contain historical data may be good candidates for
division into two tables with the same column definition. To identify
potential candidates for horizontal partitioning, review all your
queries, looking for table scans with a low number of rows returned.
Review other processes which would be impacted by dividing the table in
two. Note especially the impact on processes which need all rows. Tune
for the most critical or most frequently used processes.
As an example, consider a table containing daily stock prices. If the
current year's prices are the only ones that are accessed frequently in
looking up prices, then earlier data can be moved to a different table.
Suppose the table is created as follows:
CREATE PUBLIC TABLE SECURITIES
(NAME CHAR(10),
DATE DATE,
PRICE DECIMAL (6,2))
You can define a different table with the same column definition for the
historical data:
CREATE PUBLIC TABLE PRE1992
(NAME CHAR(10),
DATE DATE,
PRICE DECIMAL (6,2))
Then use a Type 2 INSERT (an INSERT with a SELECT clause) to transfer the rows:
INSERT INTO PRE1992 SELECT *
FROM SECURITIES
WHERE DATE < '1992-01-01'
Finally, delete the pre-1992 data from the SECURITIES table:
DELETE FROM SECURITIES
WHERE DATE < '1992-01-01'
Vertical Partitioning.
If some columns are infrequently used or if they require special
security, consider moving them to their own table. Be sure to include
the primary key values with every partition.
Vertical partitioning makes the rows smaller and more efficient to
process, since more rows fit on each page. A disadvantage of having more
rows per page is a potential reduction in concurrency because of page
level locking; using row level locking may help offset this disadvantage.
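For example, infrequently used customer columns might be split into their
own table as follows. This is a minimal sketch: the table names CUSTMAIN
and CUSTCOMMENTS and the column definitions are hypothetical. Note that
the primary key column CUSTNO is repeated in both partitions:
     CREATE PUBLIC TABLE CUSTMAIN
          (CUSTNO     INTEGER NOT NULL,
           ADDRESS    CHAR(40),
           STATE      CHAR(2),
           ZIP        CHAR(10),
           PHONE      CHAR(14))

     CREATE PUBLIC TABLE CUSTCOMMENTS
          (CUSTNO     INTEGER NOT NULL,
           COMMENTS   VARCHAR(200),
           CREDITNOTE VARCHAR(200))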
Including Calculated Data in Tables
You can include calculated data in tables instead of performing
calculations when data is retrieved. In this case, the overhead of the
calculation occurs at INSERT or UPDATE time rather than at query time.
Stored procedures are a useful way of recalculating whenever data values
change. Some calculations can be done using ALLBASE/SQL functions such
as SUM, MIN, AVG, etc.
You can create a view to return the results of a calculation, but this
may degrade performance in some cases. Specifically, do not create views
that contain aggregate functions or GROUP BY clauses. For better
performance, define the view without aggregates or GROUP BY clauses and
then apply the aggregation and grouping to the SELECT done on the view.
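For example, using the SECURITIES table shown earlier, you might define
the view as a simple projection and apply the aggregation in the query on
the view. This is a sketch; the view name SECPRICES is hypothetical:
     CREATE VIEW SECPRICES (NAME, PRICE) AS
          SELECT NAME, PRICE
          FROM SECURITIES

     SELECT NAME, AVG(PRICE)
          FROM SECPRICES
          GROUP BY NAME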
B-Tree Index Design
The use of indexes affects performance directly, since an index scan is
generally faster than a serial scan under the following conditions:
* When only a small fraction of the rows in a table will be
retrieved.
* When a large number of rows must be sorted.
Indexes are fast for retrievals, but slow for inserts, deletes, and some
updates. You should consider the following questions when designing
indexes:
* How often will an index be used in data retrieval? in joins?
* How often will an index be used in place of sorting? DBCore sorts
data on columns associated with a GROUP BY, ORDER BY, DISTINCT, or
UNION.
* How often will there be inserts, updates, or deletes to indexed
columns?
If you are using unique or primary key constraints, ALLBASE/SQL
automatically creates unique indexes which can be used for data
retrieval. If you are using referential constraints, ALLBASE/SQL creates
a PCR (parent-child relationship), which is an index on the two tables in
the referential relationship. These may be less efficient than an index
you create yourself.
Choosing Keys.
* Create indexes on columns that frequently appear in an ORDER BY or
a GROUP BY clause, or in the select list of a SELECT DISTINCT
statement.
* For frequently performed joins, try to ensure that at least one
side of each predicate in the WHERE clause has an index. For
example, if you are joining the SupplyPrice table with the
OrderItems table in the sample DBEnvironment (PartsDBE), you can
improve performance by creating indexes on the common column,
VendPartNumber, in both tables (see the example following this
list).
* Create indexes on columns frequently involved in a LIKE predicate.
In such cases, be sure the literal string in the LIKE predicate
does not begin with a wildcard character.
* Do not create indexes on columns containing YES/NO or other
columns that can have only a few values (for example, day of the
week).
* Always identify the primary and foreign keys in each entity before
creating tables. This immediately identifies which indexes to
create and which columns to specify as PRIMARY KEY or FOREIGN KEY
columns.
* INTEGER and CHAR columns are more suitable as index keys than
DECIMAL or DATE. INTEGER keys are best for extremely large tables
(gigarows). Order of efficiency:
* INTEGER
* FLOAT
* CHAR
* DECIMAL
* DATE
* Index keys greater than 20 bytes are not recommended.
* Consider creating an index of concatenated keys for WHERE clauses
that contain AND's along with an EQUAL predicate (=).
* Create separate indexes on the columns on each side of an OR in a
WHERE clause when the columns are from a single table.
* If you use compound indexes, be sure to use at least the first
column of the compound index in your WHERE clauses.
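For instance, indexes supporting the SupplyPrice/OrderItems join might be
created as shown below, along with a concatenated-key index. This is a
sketch: the index names are hypothetical, and the compound index assumes
that PartNumber and VendorNumber are SupplyPrice columns frequently
combined with ANDed equal (=) predicates in WHERE clauses:
     CREATE INDEX SPVENDPARTINDEX
          ON PURCHDB.SUPPLYPRICE (VENDPARTNUMBER)

     CREATE INDEX OIVENDPARTINDEX
          ON PURCHDB.ORDERITEMS (VENDPARTNUMBER)

     CREATE INDEX SPPARTVENDINDEX
          ON PURCHDB.SUPPLYPRICE (PARTNUMBER, VENDORNUMBER)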
Building Indexes on Large Tables.
Indexes for large tables require special consideration:
* When creating an index for a very large table, be sure there is
enough disk space for the index DBEFiles and for temporary sort
space used while building the index. Make sure the group and
account (MPE/iX) or directory (HP-UX) has plenty of space, and
make sure the index DBEFiles are large enough. See "Using
TempSpaces" in the chapter "Guidelines on System Administration"
for information on creating TempSpaces.
* When defining multiple indexes on very large tables (more than 3
million rows), use the following procedure in the specified order
(a sketch of the corresponding statements appears after this list):
1. Create DBEFileSet.
2. Create DBEFiles for data (type = TABLE).
3. Create table.
4. Load data into table.
5. Create one DBEFile (type = INDEX) large enough to hold the
first index plus sufficient additional space for growth.
6. Create the first index.
7. Create one DBEFile to hold the next index.
8. Create the next index.
9. Repeat the previous two steps until all indexes are
created.
Since the algorithm used when creating an index uses the last
DBEFile defined, each index will now reside in its own DBEFile.
(Caution: do not create a TABLE type DBEFile in the middle of the
process of defining indexes, since this will cause the search for
a DBEFile to start at the beginning instead of using the last
DBEFile defined.)
If you use this process, page splits on indexes will point forward
in the same DBEFile, resulting in better performance for inserts
(though not for retrievals) on very large tables.
* For special applications, indexes can be added or dropped at any
time. In these cases, consider how much data is involved, and
estimate the time and resources needed to drop and recreate.
* Creating a B-tree index after loading data into the table causes
minimal logging. In this case, key data is sorted and then loaded
into index pages without page splitting. The result is faster
index creation and a more compact index than when creating the
index before loading data.
* Tables and indexes for which the most I/O is performed should be
placed on the fastest available disk. If a choice must be made,
place the index (not the table) on the fastest disk. Use the
SQLUtil MOVEFILE command to place DBEFiles on different devices.
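As a sketch, the multiple-index procedure described above might look like
the following for a hypothetical HISTORY table. The DBEFileSet, DBEFile,
system file, table, column, and index names and the PAGES values are all
hypothetical, and the exact CREATE DBEFILE clause forms should be checked
against the ALLBASE/SQL reference manual:
     CREATE DBEFILESET HISTORYSET;

     CREATE DBEFILE HISTDATA1 WITH PAGES = 50000,
          NAME = 'HISTDAT1', TYPE = TABLE;
     ADD DBEFILE HISTDATA1 TO DBEFILESET HISTORYSET;

     CREATE PUBLIC TABLE HISTORY
          (KEYCOL1    INTEGER,
           KEYCOL2    CHAR(10),
           AMOUNT     DECIMAL (10,2)) IN HISTORYSET;
Load the data into the table at this point (step 4 of the procedure).
Then create one INDEX-type DBEFile and its index, and repeat for each
additional index:
     CREATE DBEFILE HISTINDEX1 WITH PAGES = 20000,
          NAME = 'HISTNDX1', TYPE = INDEX;
     ADD DBEFILE HISTINDEX1 TO DBEFILESET HISTORYSET;
     CREATE INDEX HISTKEY1INDEX ON HISTORY (KEYCOL1);

     CREATE DBEFILE HISTINDEX2 WITH PAGES = 20000,
          NAME = 'HISTNDX2', TYPE = INDEX;
     ADD DBEFILE HISTINDEX2 TO DBEFILESET HISTORYSET;
     CREATE INDEX HISTKEY2INDEX ON HISTORY (KEYCOL2);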
Maintaining Indexes.
Periodic review of indexes ensures the greatest efficiency. Use the
following recommendations:
* Review all indexes for usage, considering all queries (including
ad hoc queries). Drop all indexes that are not used.
* Review available data about process usage, looking for table scans
that return only small amounts of data, and consider creating an
index for those cases.
* B-tree indexes grow in depth as rows are inserted; that is, the
number of intermediate levels between root page and leaf page
increases. However, as rows are deleted, the depth of the B-tree
does not decrease. If many additions and deletions have been
done, consider dropping and recreating the indexes.
* If update time is more critical than retrieval time, consider
dropping indexes.
* Be sure to keep the number of indexes small on update-intensive
tables.
Clustering Indexes
Applications which frequently access rows in index order will have better
performance if the rows are physically stored on disk in index order.
Performance is better because I/O is minimized. For example, a query
that needs to read all rows that have a certain value will be faster if
all those rows reside on one page, instead of on many pages.
To automatically maintain table data in index order, ALLBASE/SQL allows
B-tree indexes and referential constraints to be defined as clustering
when they are created. When a clustering index has been defined on a
table, rows added to the table are placed physically near other rows with
similar key values whenever there is space available on the page. If
space cannot be found on the desired page, then the next row is inserted
onto the first page in the DBEFileSet that has space.
A clustering index does a good job of maintaining index order when the
number of insertions is approximately equal to, or less than, the number
of deletions on the table. Under these conditions, space for a newly
inserted row will usually be found on a page having rows with similar
keys. But when the number of insertions is greater than the number of
deletions, space will usually not be found on the desired page for the
row. In this case, time and I/O will be wasted in calculating, locating,
and examining the desired page even though the row must be inserted in
some other page in the DBEFileSet. Thus, you pay the price at INSERT
time for the attempt to optimize the placement of the row using a
clustering index.
If space does not usually exist on table pages, then a clustering index
should not be defined on the table, even if your applications frequently
access the data in index order. You should use a non-clustering index
instead, and manually maintain the table data in index order. You do
this by monitoring the cluster count of the index, and performing UNLOAD
operations followed by sorting and reloading. This process is explained
more fully later in this chapter, in the sections "Monitoring the Cluster
Count" and "Reclustering a Table."
Using Clustering Indexes.
Consider using a clustering index when accessing a subset of rows with
the same key value or when retrieving a sequential range (including an
ORDER BY, GROUP BY, DISTINCT, or UNION). Note the following:
* Clustering indexes are more expensive than non-clustering indexes
for inserts and index column updates.
* You can create empty spaces on data pages by adding "dummy" rows
during the initial table load and later deleting them. This
process is known as "sparsing" the table.
* You may have only one clustering index per table.
If a table rarely has rows deleted from it, and if it frequently has rows
added to it, then you should not define a clustering index. Extra time
(and I/O) will be wasted looking for free space each time a row is
inserted.
If you decide to use clustering, sort your data before loading the table.
Create the clustering index after sorting data and loading the table.
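For example, using the SECURITIES table shown earlier, a clustering index
on the DATE column might be created after the sorted load. This is a
sketch; the index name is hypothetical:
     CREATE CLUSTERING INDEX SECDATEINDEX
          ON SECURITIES (DATE)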
Monitoring the Cluster Count.
To measure how well clustered a B-tree index is (whether it is defined as
clustering or non-clustering), ALLBASE/SQL uses the cluster count
value, which is stored in the CCOUNT field of SYSTEM.INDEX or
SYSTEM.CONSTRAINTINDEX when you do an UPDATE STATISTICS. The CCOUNT is
the number of data page accesses made during a complete index scan of the
table. The CCOUNT is increased by one each time the next row in index
order is not stored on the same page as the last row seen. The same page
can be accessed multiple
times if the data is not physically stored in index order. The best case
is when the CCOUNT value is equal to the number of pages in the table.
If the CCOUNT is higher than this value, then more I/O than is minimally
necessary might be required for an index scan over the entire table. The
worst case is when the CCOUNT value equals the number of rows in the
table.
Use the following query to examine the cluster count for all B-tree
indexes:
SELECT T1.INDEXNAME, T1.OWNER, T1.TABLENAME,
T1.CCOUNT, T2.NROWS, T2.NPAGES
FROM SYSTEM.INDEX T1, SYSTEM.TABLE T2
WHERE T1.OWNER = T2.OWNER
AND T1.TABLENAME = T2.NAME;
Use the following query to examine the cluster count for all unique or
referential constraints:
SELECT T1.CONSTRAINTNAME, T1.OWNER, T1.TABLENAME,
T1.CCOUNT, T2.NROWS, T2.NPAGES
FROM SYSTEM.CONSTRAINTINDEX T1, SYSTEM.TABLE T2
WHERE T1.OWNER = T2.OWNER
AND T1.TABLENAME = T2.NAME;
Run these queries after doing an UPDATE STATISTICS on the tables of
interest to make sure the cluster count is up-to-date.
For indexes used in sorting or for retrieving large numbers of rows, the
cluster count is especially important. A low cluster count may indicate
good performance. Note that a clustering index does not necessarily
result in a good cluster count.
Reclustering a Table.
When CCOUNT in SYSTEM.INDEX or SYSTEM.CONSTRAINTINDEX grows to twice the
number of pages in a table, you may be able to improve performance by
reclustering the table using the following procedure (a sketch of the
corresponding statements appears after the steps):
1. Unload the data from the table using the INTERNAL option. Use a
SELECT with an ORDER BY on the index key so that data is unloaded
in sorted order.
2. Drop any referential constraints that exist on other tables
referring to the unloaded table.
3. Drop the table. This will also drop all B-tree indexes that have
been defined on the table.
4. Issue a COMMIT WORK statement. This makes the space occupied by
the table and its related indexes and constraints available for
reuse.
5. Issue the CREATE TABLE statement to recreate the table. Do not
specify any constraints.
6. Load the data back into the table using the INTERNAL option.
7. Use the CREATE INDEX statement to recreate the clustering index,
or use the ALTER TABLE statement to ADD a clustering constraint.
Note that only one clustering index or constraint can be defined
on a given table.
8. Use the ALTER TABLE statement to add any other constraints that
should be defined for the table.
9. Recreate any other B-tree indexes or referential constraints that
may have been dropped.
10. Do an UPDATE STATISTICS to update the system catalog information
about the table and its indexes and constraints.
11. Issue a COMMIT WORK statement.
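For the SECURITIES table clustered on DATE, the procedure might look
roughly like the following. This is a sketch only: the file name SECFILE
and the index name are hypothetical, the UNLOAD and LOAD lines are ISQL
commands whose exact syntax should be checked in the ISQL reference
material, and the referential constraint steps (2, 8, and 9) are omitted:
     UNLOAD TO INTERNAL SECFILE FROM
          SELECT * FROM SECURITIES ORDER BY DATE;

     DROP TABLE SECURITIES;
     COMMIT WORK;

     CREATE PUBLIC TABLE SECURITIES
          (NAME       CHAR(10),
           DATE       DATE,
           PRICE      DECIMAL (6,2));

     LOAD FROM INTERNAL SECFILE INTO SECURITIES;

     CREATE CLUSTERING INDEX SECDATEINDEX
          ON SECURITIES (DATE);

     UPDATE STATISTICS FOR TABLE SECURITIES;
     COMMIT WORK;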
Using Hash Structures.
For some tables, hashing offers better performance than B-tree indexes.
Hashing works best with a fairly uniform distribution of key values,
spreading the corresponding rows evenly over the number of pages
available. A key with a skewed distribution of values places the rows on
a correspondingly skewed set of pages. The worst keys produce hash values
that cluster tightly in a narrow range of primary pages, leaving the
other pages empty.
You must define the hash at the same time you create the table, and the
hash key must be unique (an example appears after the following list).
Considerations:
* Hash retrieval is faster than B-tree retrieval unless there is a
bad overflow situation. The larger the number of overflow pages,
the slower the access to the average tuple.
* You may also define B-tree indexes on a hash table.
* An insert to a hash structure must perform a separate calculation
to arrive at the correct page for each row of data. An ordinary
non-indexed insert must find an available page in the DBEFileSet,
and finding this page may require several I/Os; but once the
location for new inserts is determined, ordinary inserts are less
expensive than hash inserts. B-tree inserts are more expensive
than either hash or non-indexed inserts.
* You cannot update the hash key column directly; instead, you must
delete and re-insert the row.
* You cannot drop a hash index without dropping the table itself.
* Smaller rows mean more rows per page, which means less chance for
overflow when several rows hash to the same page.
* The DBEFile or DBEFiles for the hash must be empty at the time you
create the table, and there must be at least enough DBEFile space
to hold all primary pages. You should add an appropriate number
of pages (say, 30%) for growth and overflow.
* Use a prime number for the number of primary pages. An odd number
is almost as good.
* Single-column integer keys provide better hashing than
multi-column or character keys.
* You can use hashing to spread rows evenly among the pages of a
small table for the greatest page-level concurrency.
* Do not load a hash table in key-sorted order. Doing so may result
in a performance penalty.
* If the number of overflow pages grows too large, you can unload
and reload the table. See "Unloading and Reloading to Remove
Overflow Pages" later in this chapter.
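For example, a customer table hashed on its unique customer number might
be created as shown below. This is a sketch: the table, column, and
DBEFileSet names and the PAGES value are hypothetical, and the exact form
of the UNIQUE HASH clause should be checked against the CREATE TABLE
syntax in the ALLBASE/SQL reference manual:
     CREATE PUBLIC TABLE CUSTOMERS
          (CUSTNO     INTEGER NOT NULL,
           NAME       CHAR(30),
           PHONE      CHAR(14),
           UNIQUE HASH ON (CUSTNO) PAGES = 1009)
          IN CUSTFILESET
Here the PAGES value is a prime number, and the single-column INTEGER key
follows the hashing guidelines given above.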
Choosing Appropriate Index Types
It is important to consider whether a B-tree index is an appropriate
choice for a given table. The larger the table and the lower the level
of update activity, the greater the benefit of a B-tree index. A B-tree
never hurts a read-only application. Indeed, B-tree indexes are usually
defined to improve the read access requirements of an application.
B-tree indexes can, however, degrade the performance of an application
that updates the table on which the B-tree is defined because whenever a
row is inserted or deleted, the B-tree index must also be updated. In
addition, if the application updates a key column in any row, then the
B-tree index must also be updated.
The needs of the applications that access a table must be carefully
considered before a B-tree index is defined on the table. If the mix of
transactions involving the table is primarily read-intensive, then a
B-tree index can be defined to improve performance. If the mix is
primarily update-intensive, then consider the following:
* If rows are frequently inserted and/or deleted, then hashing
should probably be used instead of a B-tree index. Hashed tables
do not require a second update, as B-tree indexes do. However,
hashed tables may require more manual maintenance for optimal
performance. Such maintenance may involve occasional unloading
and reloading of data.
* If updates are made to the key columns in the row, then
performance degradation cannot be avoided: a B-tree index will
automatically be updated; and an update on the hash key requires
that the old row be deleted and the new row inserted in two
separate operations. You might consider changing the hash key or
index key to a less volatile set of columns if this is possible.
Updating Statistics.
In order for the optimizer to make the best choice among available
indexes, use the UPDATE STATISTICS statement on tables after creating
indexes and after doing many inserts or deletes. After updating
statistics, use the VALIDATE statement to revalidate any stored sections
that were invalidated by the UPDATE STATISTICS statement. While
revalidation is taking place, performance may be impaired because
exclusive locks are obtained on the system catalog. To minimize
concurrency problems, run VALIDATE immediately after UPDATE STATISTICS,
preferably during periods of low activity.
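For example (a minimal sketch; the VALIDATE form shown assumes the ALL
MODULES option is appropriate for your application):
     UPDATE STATISTICS FOR TABLE PURCHDB.PARTS;
     VALIDATE ALL MODULES;
     COMMIT WORK;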
NOTE If you do not update statistics at appropriate times, the optimizer
cannot choose the appropriate access path to data. In order to
avoid performance degradation, be sure your statistics are
up-to-date. You can query SYSTEM.TABLE to determine the time the
last UPDATE STATISTICS was done for a given table. The following
example shows the latest UPDATE STATISTICS time (USTIME) for PurchDB.Parts:
SELECT NAME, OWNER, USTIME
FROM SYSTEM.TABLE
WHERE NAME = 'PARTS'
AND OWNER = 'PURCHDB'
Authorization Design
The creation of authorization schemes can also affect performance, since
ALLBASE/SQL has to look up authorizations before carrying out specific
commands. The more complex the authorization scheme, the more system
table pages must be read by the query processor. The only authorization
verified at run time for non-dynamic queries is RUN authority.
Therefore, it helps to keep a simple hierarchy for RUN authorization. In
general, user group hierarchies should be kept simple (one or two levels)
unless it is absolutely necessary to use more. If the user running an
application is the owner of the module or has DBA authority, the
authorization check is faster.
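For example, a simple one-level scheme might grant RUN authority on an
application's module to a single group. This is a sketch: the group,
user, and module names are hypothetical, and the user ID form
(user@account) assumes MPE/iX:
     CREATE GROUP REPORTUSERS;
     ADD CLERK1@SALES TO GROUP REPORTUSERS;
     ADD CLERK2@SALES TO GROUP REPORTUSERS;
     GRANT RUN ON REPORTMOD TO REPORTUSERS;
Keeping the hierarchy to a single level like this means only one group
membership has to be checked when RUN authority is verified.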
The overhead of authorization checking is higher for dynamic queries.
This is because the first execution of a dynamic query in a transaction
requires the query to be re-authorized. It is therefore all the more
important to keep the authorization hierarchy simple.
In designing authorization schemes, you need to weigh the needs of
security and ease of maintenance against your requirements for
performance.
Using a Duplicate Database
To avoid lock contention, some database designers create a main database
for updating and a duplicate or subset database for reporting. Data can
be transferred periodically from the main database to the duplicate in
batch mode.
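For example, rows might be moved to the reporting DBEnvironment in a
nightly batch job using ISQL. This is a rough sketch only: the
DBEnvironment, table, and file names are hypothetical, and the UNLOAD and
LOAD command forms should be verified against the ISQL reference
material:
     CONNECT TO 'MainDBE';
     UNLOAD TO INTERNAL XFERFILE FROM
          SELECT * FROM PURCHDB.ORDERS;
     COMMIT WORK;
     RELEASE;

     CONNECT TO 'ReportDBE';
     LOAD FROM INTERNAL XFERFILE INTO PURCHDB.ORDERS;
     COMMIT WORK;
     RELEASE;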