ALLBASE/SQL Performance and Monitoring Guidelines: HP 9000 Computer Systems > Chapter 2 Guidelines on Logical and Physical Design

Logical Data Design

Logical data design concerns normalization of data, followed by the creation of tables, views, indexes, and authorization schemes. When designing logical objects in ALLBASE/SQL, know your queries beforehand. The following paragraphs present ways of adjusting your design to arrive at the best performance for the functionality you need to implement. The following topics are discussed:

  • Normalization Issues.

  • Including Calculated Data in Tables.

  • B-Tree Index Design.

  • Clustering Indexes.

  • Choosing Appropriate Index Types.

  • Authorization Design.

  • Using a Duplicate Database.

Normalization Issues

Normalization results in tables with a relatively small number of columns, and a minimum amount of redundancy within a given table. In general, this yields an efficient design, but there are some drawbacks. The following table shows some of the pros and cons of normalization.

Table 2-1 Pros and Cons of Normalization

Normalization Pros:

  • Less or no data redundancy

  • Relatively more efficient use of storage space

  • Reduced data maintenance, since you don't need to update as many occurrences of a value

Normalization Cons:

  • Relatively larger number of table accesses (for joins)

  • Additional indexes (PCRs) required to maintain referential integrity

  • Increased overhead for processing more join queries

Here are some tips:

  • In general, design your tables to be in third normal form so that updates can be carried out against the smallest number of tuples.

  • Avoid overnormalization of tables in logical design. If two reasonably static tables are joined frequently, consider making them into one table.

For example, consider the following two tables in third normal form:

 __________________________

| employee  |  department  |  Table A

|___________|______________|



 ______________________________

| department  |  location      |  Table B

|_____________|________________|

If you have a continual need to find the location of the employee, you might consider combining the two tables. In this case, you will incur extra maintenance if the location of the department changes, since it now appears in every tuple of Table C:

 __________________________________________

| employee  |  department  |  location     |  Table C

|___________|______________|_______________|

When should you not combine tables, but stick with third normal form? Retain the third normal form for tables in which you must maintain several rows of data which are fully dependent on one particular key:

 _____________________________________________

| custno | address |  state  |  zip |  phone  | Table D

|________|_________|_________|______|_________|



 _____________________________________________

| custno | orderno | orderdate |    . . . .   | Table E

|________|_________|___________|______________|

Since Table E may contain many orders for each customer number, the amount of redundancy would be unacceptable if you were to include Table D's columns in each row of Table E.

Combining tables can also increase the concurrency between batch processes and on-line activities, since there is less contention for I/O resources within the system.

Denormalizing Tables that are Consistently Joined

Look for tables that are consistently being joined, and consider combining tables or duplicating some columns to prevent the need for the join. Considerations:

  • Not all joins are expensive. If join columns are supported with unique indexes, the cost of doing a join is not really an issue.

  • Only consider denormalizing when there is a performance problem that can be attributed to the join.

  • Assess the impact of denormalizing on the performance of other processes that use the tables. Consider how often each process occurs, and tune for the most used processes.

  • Duplicate data wastes storage space and can cause update problems.

  • Tables that are combined in denormalization should have a 1:1 relationship with each other. The combining should not result in an excessive row length or an excessive number of indexes.

  • Tables should have the fewest indexes possible.

Horizontal Partitioning

Large tables that contain historical data may be good candidates for division into two tables with the same column definition. To identify potential candidates for horizontal partitioning, review all your queries, looking for table scans with a low number of rows returned. Review other processes which would be impacted by dividing the table in two. Note especially the impact on processes which need all rows. Tune for the most critical or most frequently used processes.

As an example, consider a table containing daily stock prices. If the current year's prices are the only ones that are accessed frequently in looking up prices, then earlier data can be moved to a different table. Suppose the table is created as follows:

CREATE PUBLIC TABLE SECURITIES
     (NAME CHAR(10),
      DATE DATE,
      PRICE DECIMAL (6,2))

You can define a different table with the same column definition for the historical data:

CREATE PUBLIC TABLE PRE1992
     (NAME CHAR(10),
      DATE DATE,
      PRICE DECIMAL (6,2))

Then use a Type 2 INSERT to transfer the rows:

INSERT INTO PRE1992 SELECT *
       FROM SECURITIES
       WHERE DATE < '1992-01-01'

Finally, delete the pre-1992 data from the SECURITIES table:

DELETE FROM SECURITIES
WHERE DATE < '1992-01-01'

Vertical Partitioning

If some columns are infrequently used or if they require special security, consider moving them to their own table. Be sure to include the primary key values with every partition.

Vertical partitioning will make the rows smaller and more efficient to process (since there will be more rows per page). A disadvantage of having more rows per page is the potential of reducing concurrency because of page level locking. The use of row level locking may help offset this disadvantage.
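
For example, the following is a minimal sketch of vertically partitioning a customer table; the table and column names are purely illustrative, and the primary key (CUSTNO) is repeated in both partitions so that rows can be rejoined when necessary:

CREATE PUBLIC TABLE CUSTOMER
     (CUSTNO  INTEGER NOT NULL,
      NAME    CHAR(30),
      STATE   CHAR(2),
      PHONE   CHAR(20))

CREATE PUBLIC TABLE CUSTREMARKS
     (CUSTNO  INTEGER NOT NULL,
      REMARKS VARCHAR(500))

Queries that need only the frequently used columns read the smaller CUSTOMER rows; the rarely used (or security-sensitive) REMARKS column is fetched only when a query joins the two tables on CUSTNO.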

Including Calculated Data in Tables

You can include calculated data in tables instead of performing calculations when data is retrieved. In this case, the overhead of the calculation occurs at INSERT or UPDATE time rather than at query time. Stored procedures are a useful way of recalculating whenever data values change. Some calculations can be done using ALLBASE/SQL functions such as SUM, MIN, AVG, etc.
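
For example, in the following sketch (table and column names are illustrative), the extended price is computed once, when the row is written, rather than in every query that reads it:

CREATE PUBLIC TABLE LINEITEMS
     (ORDERNO   INTEGER NOT NULL,
      ITEMNO    INTEGER NOT NULL,
      QTY       INTEGER,
      UNITPRICE DECIMAL (8,2),
      EXTPRICE  DECIMAL (10,2))

INSERT INTO LINEITEMS
       VALUES (1001, 1, 5, 19.95, 99.75)

Here EXTPRICE holds the precomputed value of QTY * UNITPRICE (5 * 19.95 = 99.75); the application, or a stored procedure, is responsible for keeping it consistent whenever QTY or UNITPRICE changes.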

You can create a view to return the results of a calculation, but this may degrade performance in some cases. Specifically, do not create views that contain aggregate functions or GROUP BY clauses. For better performance, define the view without aggregates or GROUP BY clauses and then apply the aggregation and grouping to the SELECT done on the view.
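
For example, following this guideline with the illustrative LINEITEMS table above, you would define the view without the aggregate and apply the SUM and GROUP BY in the SELECT done on the view:

CREATE VIEW ORDERLINES (ORDERNO, EXTPRICE) AS
       SELECT ORDERNO, EXTPRICE
       FROM LINEITEMS

SELECT ORDERNO, SUM(EXTPRICE)
       FROM ORDERLINES
       GROUP BY ORDERNO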

B-Tree Index Design

The use of indexes affects performance directly, since an index scan is generally faster than a serial scan under the following conditions:

  • When only a small fraction of the rows in a table will be retrieved.

  • When a large number of rows must be sorted.

Indexes are fast for retrievals, but slow for inserts, deletes, and some updates. You should consider the following questions when designing indexes:

  • How often will an index be used in data retrieval? in joins?

  • How often will an index be used in place of sorting? DBCore sorts data on columns associated with a GROUP BY, ORDER BY, DISTINCT, or UNION.

  • How often will there be inserts, updates, or deletes to indexed columns?

If you are using unique or primary key constraints, ALLBASE/SQL automatically creates unique indexes which can be used for data retrieval. If you are using referential constraints, ALLBASE/SQL creates a PCR (parent-child relationship), which is an index on the two tables in the referential relationship. These may be less efficient than an index you create yourself.

Choosing Keys

  • Create indexes on columns that frequently appear in an ORDER BY or a GROUP BY clause, or in the select list of a SELECT DISTINCT statement.

  • For frequently performed joins, try to ensure that at least one side of each predicate in the WHERE clause has an index. For example, if you are joining the SupplyPrice table with the OrderItems table in the sample DBEnvironment (PartsDBE), you can improve performance by creating indexes on the common column, VendPartNumber, in both tables (see the sketch after this list).

  • Create indexes on columns frequently involved in a LIKE predicate. In such cases, be sure the literal string in the LIKE predicate does not begin with a wildcard character.

  • Do not create indexes on YES/NO columns or on other columns that can have only a few distinct values (for example, day of the week).

  • Always identify the primary and foreign keys in each entity before creating tables. This immediately identifies which indexes to create and which columns to specify as PRIMARY KEY or FOREIGN KEY columns.

  • INTEGER and CHAR columns are more suitable as index keys than DECIMAL or DATE. INTEGER keys are best for extremely large tables (gigarows). Order of efficiency:

    • INTEGER

    • FLOAT

    • CHAR

    • DECIMAL

    • DATE

  • Index keys greater than 20 bytes are not recommended.

  • Consider creating a concatenated-key index for WHERE clauses that combine equal (=) predicates with AND.

  • Create separate indexes on the columns on each side of an OR in a WHERE clause when the columns are from a single table.

  • If you use compound indexes, be sure to use at least the first column of the compound index in your WHERE clauses.
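
As a sketch of the join and concatenated-key tips above, the following statements use columns from the sample PartsDBE DBEnvironment; the index names are invented, and the PartNumber and VendorNumber columns of SupplyPrice are assumed:

CREATE INDEX SupplyVendPartIndex
       ON PurchDB.SupplyPrice (VendPartNumber)

CREATE INDEX OrderVendPartIndex
       ON PurchDB.OrderItems (VendPartNumber)

CREATE INDEX SupplyPartVendIndex
       ON PurchDB.SupplyPrice (PartNumber, VendorNumber)

The first two indexes support the SupplyPrice/OrderItems join on VendPartNumber; the third is a concatenated-key index suited to WHERE clauses that combine equal predicates on PartNumber and VendorNumber with AND.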

Building Indexes on Large Tables

Indexes for large tables require special consideration:

  • When creating an index for a very large table, be sure there is enough disk space for the index DBEFiles and for the temporary sort space used while building the index. Make sure the directory used for temporary sort space has plenty of room, and make sure the index DBEFiles are large enough. See "Using TempSpaces" in the chapter "Guidelines on System Administration" for information on creating TempSpaces.

  • When defining multiple indexes on very large tables (more than 3 million rows), use the following procedure in the specified order (a condensed SQL sketch appears after this list):

    1. Create DBEFileSet.

    2. Create DBEFiles for data (type = TABLE).

    3. Create table.

    4. Load data into table.

    5. Create one DBEFile (type = INDEX) large enough to hold the first index plus sufficient additional space for growth.

    6. Create the first index.

    7. Create one DBEFile to hold the next index.

    8. Create the next index.

    9. Repeat the previous two steps until all indexes are created.

    Since the algorithm used when creating an index uses the last DBEFile defined, each index will now reside in its own DBEFile. (Caution: do not create a TABLE type DBEFile in the middle of the process of defining indexes, since this will cause the search for a DBEFile to start at the beginning instead of using the last DBEFile defined.)

    If you use this process, page splits on indexes will point forward in the same DBEFile, resulting in better performance for inserts (though not for retrievals) on very large tables.

  • For special applications, indexes can be added or dropped at any time. In these cases, consider how much data is involved, and estimate the time and resources needed to drop and recreate.

  • Creating a B-tree index after loading data into the table causes minimal logging. In this case, key data is sorted and then loaded into index pages without page splitting. The result is faster index creation and a more compact index than when creating the index before loading data.

  • Tables and indexes for which the most I/O is performed should be placed on the fastest available disk. If a choice must be made, place the index (not the table) on the fastest disk. Use the SQLUtil MOVEFILE command to place DBEFiles on different devices.
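
The following is a condensed SQL sketch of the multiple-index procedure above; the DBEFile, DBEFileSet, table, index, and system file names, the column definitions, and the page counts are all illustrative:

CREATE DBEFILESET HistorySet;

CREATE DBEFILE HistData WITH PAGES = 50000,
       NAME = 'HistData', TYPE = TABLE;
ADD DBEFILE HistData TO DBEFILESET HistorySet;

CREATE PUBLIC TABLE HISTORY
     (ACCTNO   INTEGER NOT NULL,
      TRANDATE DATE,
      AMOUNT   DECIMAL (10,2))
      IN HistorySet;

After loading the data into HISTORY, create one INDEX type DBEFile and its index, then repeat for each remaining index:

CREATE DBEFILE HistNdx1 WITH PAGES = 15000,
       NAME = 'HistNdx1', TYPE = INDEX;
ADD DBEFILE HistNdx1 TO DBEFILESET HistorySet;
CREATE INDEX HistAcctIndex ON HISTORY (ACCTNO);

CREATE DBEFILE HistNdx2 WITH PAGES = 15000,
       NAME = 'HistNdx2', TYPE = INDEX;
ADD DBEFILE HistNdx2 TO DBEFILESET HistorySet;
CREATE INDEX HistDateIndex ON HISTORY (TRANDATE);

Because the most recently added DBEFile is used when each index is built, each index ends up in its own DBEFile.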

Maintaining Indexes

Periodic review of indexes ensures the greatest efficiency. Use the following recommendations:

  • Review all indexes for usage, considering all queries (including ad hoc queries). Drop all indexes that are not used.

  • Review available data about process usage, looking for table scans that return only small amounts of data. Consider creating an index in these cases.

  • B-tree indexes grow in depth as rows are inserted; that is, the number of intermediate levels between root page and leaf page increases. However, as rows are deleted, the depth of the B-tree does not decrease. If many additions and deletions have been done, consider dropping and recreating the indexes.

  • If update time is more critical than retrieval time, consider dropping indexes.

  • Be sure to keep the number of indexes small on update-intensive tables.

  • Use SQLMON to monitor index usage. The SampleIO Indexes screen indicates which indexes are used the most. The Static Size screen displays the size of each index, and the Static Cluster screen displays an index's cluster count.

Clustering Indexes

Applications which frequently access rows in index order will have better performance if the rows are physically stored on disk in index order. Performance is better because I/O is minimized. For example, a query that needs to read all rows that have a certain value will be faster if all those rows reside on one page, instead of on many pages.

To automatically maintain table data in index order, ALLBASE/SQL allows B-tree indexes and referential constraints to be defined as clustering when they are created. When a clustering index has been defined on a table, rows added to the table are placed physically near other rows with similar key values whenever there is space available on the page. If space cannot be found on the desired page, then the next row is inserted onto the first page in the DBEFileSet that has space.

A clustering index does a good job of maintaining index order when the number of insertions is approximately equal to, or less than, the number of deletions on the table. Under these conditions, space for a newly inserted row will usually be found on a page having rows with similar keys. But when the number of insertions is greater than the number of deletions, space will usually not be found on the desired page for the row. In this case, time and I/O will be wasted in calculating, locating, and examining the desired page even though the row must be inserted in some other page in the DBEFileSet. Thus, you pay the price at INSERT time for the attempt to optimize the placement of the row using a clustering index.

If space does not usually exist on table pages, then a clustering index should not be defined on the table, even if your applications frequently access the data in index order. You should use a non-clustering index instead, and manually maintain the table data in index order. You do this by monitoring the cluster count of the index, and performing UNLOAD operations followed by sorting and reloading. This process is explained more fully later in this chapter, in the sections "Monitoring the Cluster Count" and "Reclustering a Table."

Using Clustering Indexes

Consider using a clustering index when accessing a subset of rows with the same key value or when retrieving a sequential range (including an ORDER BY, GROUP BY, DISTINCT, or UNION). Note the following:

  • Clustering indexes are more expensive than non-clustering indexes for inserts and index column updates.

  • You can create empty spaces on data pages by adding "dummy" rows during the initial table load and later deleting them. This process is known as "sparsing" the table.

  • You may have only one clustering index per table.

If a table rarely has rows deleted from it, and if it frequently has rows added to it, then you should not define a clustering index. Extra time (and I/O) will be wasted when rows are inserted in looking for free space for the row.

If you decide to use clustering, sort your data before loading the table. Create the clustering index after sorting data and loading the table.
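
For example, after sorting and loading the SECURITIES table shown earlier in this chapter, you could create a clustering index on the lookup columns (the index name is invented):

CREATE CLUSTERING INDEX SecuritiesIndex
       ON SECURITIES (NAME, DATE)

Subsequent inserts will then be placed near rows with similar NAME and DATE values whenever space is available on the page.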

Monitoring the Cluster Count

To measure how well clustered a B-tree index is (whether it is defined as clustering or non-clustering), ALLBASE/SQL uses the cluster count value, which is stored in the CCOUNT field of SYSTEM.INDEX or SYSTEM.CONSTRAINTINDEX when you do an UPDATE STATISTICS. The CCOUNT is the number of data page accesses made during a scan of the entire index; it increases by one each time the next row is not stored on the same page as the previous row. The same page can be accessed multiple times if the data is not physically stored in index order. The best case is a CCOUNT equal to the number of pages in the table; if the CCOUNT is higher than this value, an index scan over the entire table may require more I/O than is minimally necessary. The worst case is a CCOUNT equal to the number of rows in the table.

To monitor the cluster count with SQLMON, invoke the Static Cluster screen and examine the CCOUNT, TOT ROWS, and UNLOAD/LOAD SUGGESTD fields.

You can also query the system catalog to examine the cluster count for all B-tree indexes:

SELECT T1.INDEXNAME, T1.OWNER, T1.TABLENAME,
       T1.CCOUNT, T2.NROWS, T2.NPAGES
FROM SYSTEM.INDEX T1, SYSTEM.TABLE T2
WHERE T1.OWNER = T2.OWNER
AND   T1.TABLENAME = T2.NAME;

Use the following query to examine the cluster count for all unique or referential constraints:

SELECT T1.CONSTRAINTNAME, T1.OWNER, T1.TABLENAME,
       T1.CCOUNT, T2.NROWS, T2.NPAGES
FROM SYSTEM.CONSTRAINTINDEX T1, SYSTEM.TABLE T2
WHERE T1.OWNER = T2.OWNER
AND   T1.TABLENAME = T2.NAME;

Run these queries after doing an UPDATE STATISTICS on the tables of interest to make sure the cluster counts are up to date. For indexes used in sorting or for retrieving large numbers of rows, the cluster count is especially important. A low cluster count may indicate good performance. Note that a clustering index does not necessarily result in a good cluster count.

Reclustering a Table

If the UNLOAD/LOAD SUGGESTED value of the SQLMON Static Cluster screen is high, or if the CCOUNT in SYSTEM.INDEX or SYSTEM.CONSTRAINTINDEX grows to twice the number of pages in a table, you should recluster the table.

You may be able to improve performance by reclustering the table using the following procedure (a condensed SQL sketch follows the list):

  1. Unload the data from the table using the INTERNAL option. Use a SELECT with an ORDER BY on the index key so that data is unloaded in sorted order.

  2. Drop any referential constraints that exist on other tables referring to the unloaded table.

  3. Drop the table. This will also drop all B-tree indexes that have been defined on the table.

  4. Issue a COMMIT WORK statement. This makes the space occupied by the table and its related indexes and constraints available for reuse.

  5. Issue the CREATE TABLE statement to recreate the table. Do not specify any constraints.

  6. Load the data back into the table using the INTERNAL option.

  7. Use the CREATE INDEX statement to recreate the clustering index, or use the ALTER TABLE statement to ADD a clustering constraint. Note that only one clustering index or constraint can be defined on a given table.

  8. Use the ALTER TABLE statement to add any other constraints that should be defined for the table.

  9. Recreate any other B-tree indexes or referential constraints that may have been dropped.

  10. Do an UPDATE STATISTICS to update the system catalog information about the table and its indexes and constraints.

  11. Issue a COMMIT WORK statement.
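
The following sketch condenses the SQL statements of this procedure (omitting constraint recreation) for the SECURITIES table used earlier in this chapter; the index name is invented, and the unload and reload themselves are done with ISQL's UNLOAD and LOAD commands using the INTERNAL option:

DROP TABLE SECURITIES;
COMMIT WORK;

CREATE PUBLIC TABLE SECURITIES
     (NAME CHAR(10),
      DATE DATE,
      PRICE DECIMAL (6,2));

After reloading the sorted data, recreate the clustering index and refresh the statistics:

CREATE CLUSTERING INDEX SecuritiesIndex
       ON SECURITIES (NAME, DATE);

UPDATE STATISTICS FOR TABLE SECURITIES;
COMMIT WORK;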

Using Hash Structures

For some tables, hashing offers better performance than B-tree indexes. Hashing works best with a fairly uniform distribution of key values, spreading the corresponding rows evenly over the number of pages available. A key with a skewed distribution of values places rows on a correspondingly skewed set of pages; in the worst case, the hash values cluster tightly in a narrow range of primary pages, leaving the others empty.

You must define the hash at the same time you create the table. The hash key must be unique. Considerations (a CREATE TABLE sketch follows this list):

  • Hash retrieval is faster than B-tree retrieval unless there is a bad overflow situation. The larger the number of overflow pages, the slower the access to the average tuple.

  • You may also define B-tree indexes on a hash table.

  • An insert to a hash structure must perform a separate calculation to arrive at the correct page for each row of data. An ordinary non-indexed insert must find an available page in the DBEFileSet, and finding this page may require several I/Os; but once the location for new inserts is determined, ordinary inserts are less expensive than hash inserts. B-tree inserts are more expensive than either hash or non-indexed inserts.

  • You cannot update the hash key column directly; instead, you must delete and re-insert the row.

  • You cannot drop a hash index without dropping the table itself.

  • Smaller rows mean more rows per page, which means less chance for overflow when several rows hash to the same page.

  • The DBEFile or DBEFiles for the hash must be empty at the time you create the table, and there must be at least enough DBEFile space to hold all primary pages. You should add an appropriate number of pages (say, 30%) for growth and overflow. To determine how much empty space a DBEFile has, run SQLMON and go to the Static DBEFile screen.

  • Use a prime number for the number of primary pages. An odd number is almost as good.

  • Single-column integer keys provide better hashing than multi-column or character keys.

  • You can use hashing to spread rows evenly among the pages of a small table for the greatest page-level concurrency.

  • Do not load a hash table in key-sorted order. Doing so may result in a performance penalty.

  • If the number of overflow pages grows too large, you can unload and reload the table. See "Unloading and Reloading to Remove Overflow Pages" later in this chapter.
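
The following is a minimal sketch of defining a hash structure at table creation time, assuming the UNIQUE HASH ON ... PAGES = ... clause of CREATE TABLE; the table, column, and DBEFileSet names and the page count are illustrative (1009 was chosen because it is prime):

CREATE PUBLIC TABLE ACCOUNTS
     (ACCTNO  INTEGER NOT NULL,
      BALANCE DECIMAL (10,2))
      UNIQUE HASH ON (ACCTNO) PAGES = 1009
      IN AccountSet

The DBEFiles in AccountSet must be empty and large enough to hold all 1009 primary pages, plus room for growth and overflow, before this statement is executed.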

Choosing Appropriate Index Types

It is important to consider whether a B-tree index is an appropriate choice for a given table. The larger the table and the lower the level of update activity, the greater the benefit of a B-tree index. A B-tree never hurts a read-only application. Indeed, B-tree indexes are usually defined to improve the read access requirements of an application. B-tree indexes can, however, degrade the performance of an application that updates the table on which the B-tree is defined because whenever a row is inserted or deleted, the B-tree index must also be updated. In addition, if the application updates a key column in any row, then the B-tree index must also be updated.

The needs of the applications that access a table must be carefully considered before a B-tree index is defined on the table. If the mix of transactions involving the table is primarily read-intensive, then a B-tree index can be defined to improve performance. If the mix is primarily update-intensive, then consider the following:

  • If rows are frequently inserted and/or deleted, then hashing should probably be used instead of a B-tree index. Hashed tables do not require a second update, as B-tree indexes do. However, hashed tables may require more manual maintenance for optimal performance. Such maintenance may involve occasional unloading and reloading of data.

  • If updates are made to the key columns in the row, then performance degradation cannot be avoided: a B-tree index will automatically be updated; and an update on the hash key requires that the old row be deleted and the new row inserted in two separate operations. You might consider changing the hash key or index key to a less volatile set of columns if this is possible.

Updating Statistics

In order for the optimizer to make the best choice among available indexes, use the UPDATE STATISTICS statement on tables after creating indexes and after doing many inserts or deletes. After updating statistics, use the VALIDATE statement to revalidate any stored sections that were invalidated by the UPDATE STATISTICS statement. While revalidation is taking place, performance may be impaired because exclusive locks are obtained on the system catalog; to minimize concurrency problems, run both statements during periods of low activity.
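
For example, a typical low-activity maintenance sequence for the sample PurchDB.Parts table might look like the following (the ALL MODULES form of VALIDATE is assumed here; you can also validate specific modules):

UPDATE STATISTICS FOR TABLE PurchDB.Parts;
VALIDATE ALL MODULES;
COMMIT WORK;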

NOTE: If you do not update statistics at appropriate times, the optimizer cannot choose the appropriate access path to data. In order to avoid performance degradation, be sure your statistics are up to date. You can query SYSTEM.TABLE to determine the time the last UPDATE STATISTICS was done for a given table. The following example shows the latest UPDATE STATISTICS time (USTIME) for PurchDB.Parts:
SELECT NAME, OWNER, USTIME
FROM SYSTEM.TABLE
WHERE NAME = 'PARTS'
AND OWNER = 'PURCHDB'

Authorization Design

The creation of authorization schemes can also affect performance, since ALLBASE/SQL has to look up authorizations before carrying out specific commands. The more complex the authorization scheme, the more system table pages must be read by the query processor. The only authorization verified at run time for non-dynamic queries is RUN authority. Therefore, it helps to keep a simple hierarchy for RUN authorization. In general, user group hierarchies should be kept simple (one or two levels) unless it is absolutely necessary to use more. If the user running an application is the owner of the module or has DBA authority, the authorization check is faster.
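
For example, a flat scheme might place application users in a single group and grant RUN authority on the application's module directly to that group. The group, user, and module names below are invented, and the statements assume the CREATE GROUP, ADD ... TO GROUP, and GRANT RUN forms:

CREATE GROUP OrderClerks;
ADD CLERK1, CLERK2 TO GROUP OrderClerks;
GRANT RUN ON OrderEntry TO OrderClerks;

Because the hierarchy is only one level deep, the RUN authority check at run time touches a minimal number of system catalog pages.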

The overhead of authorization checking is higher for dynamic queries. This is because the first execution of a dynamic query in a transaction requires the query to be re-authorized. It is therefore all the more important to keep the authorization hierarchy simple.

In designing authorization schemes, you need to weigh the needs of security and ease of maintenance against your requirements for performance.

Using a Duplicate Database

To avoid lock contention, some database designers create a main database for updating and a duplicate or subset database for reporting. Data can be transferred periodically from the main database to the duplicate in batch mode.
