ALLBASE/SQL Performance Guidelines
Logical Data Design
Logical data design concerns normalization of data, followed by the
creation of tables, views, indexes, and authorization schemes. When
designing logical objects in ALLBASE/SQL, know your queries beforehand.
The following paragraphs present ways of adjusting your design to arrive
at the best performance for the functionality you need to implement. The
following topics are discussed:
* Normalization Issues.
* Including Calculated Data in Tables.
* B-Tree Index Design.
* Clustering Indexes.
* Choosing Appropriate Index Types.
* Authorization Design.
* Using a Duplicate Database.
Normalization Issues
Normalization results in tables with a relatively small number of
columns, and a minimum amount of redundancy within a given table. In
general, this yields an efficient design, but there are some drawbacks.
The following table shows some of the pros and cons of normalization.
Pros and Cons of Normalization
----------------------------------------------------------------------------
| Normalization Pros               | Normalization Cons                    |
----------------------------------------------------------------------------
| Little or no data redundancy     | Relatively larger number of table     |
|                                  | accesses (for joins)                  |
----------------------------------------------------------------------------
| Relatively more efficient use of | Additional indexes (PCRs) required    |
| storage space                    | to maintain referential integrity     |
----------------------------------------------------------------------------
| Reduced data maintenance, since  | Increased overhead for processing     |
| you don't need to update as many | more join queries                     |
| occurrences of a value           |                                       |
----------------------------------------------------------------------------
Here are some tips:
* In general, design your tables to be in third normal form so that
updates can be carried out against the smallest number of tuples.
* Avoid overnormalization of tables in logical design. If two
reasonably static tables are joined frequently, consider making
them into one table.
For example, consider the following two tables in third normal form:
__________________________
| employee | department | Table A
|___________|______________|
______________________________
| department | location | Table B
|_____________|________________|
If you have a continual need to find an employee's location, you might
consider combining the two tables. In this case, you will incur extra
maintenance whenever a department's location changes, since the location
now appears in every tuple of Table C:
__________________________________________
| employee | department | location | Table C
|___________|______________|_______________|
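For example, the combined table might be created and populated as follows.
This is a minimal sketch: the table names EMPDEPT (Table A), DEPTLOC
(Table B), and EMPLOC (Table C), and the column lengths, are hypothetical:
     CREATE PUBLIC TABLE EMPLOC
          (EMPLOYEE   CHAR(20),
           DEPARTMENT CHAR(10),
           LOCATION   CHAR(20))

     INSERT INTO EMPLOC
          SELECT A.EMPLOYEE, A.DEPARTMENT, B.LOCATION
          FROM EMPDEPT A, DEPTLOC B
          WHERE A.DEPARTMENT = B.DEPARTMENT
Finding an employee's location then requires only a single-table SELECT
on EMPLOC rather than a two-table join.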
When should you not combine tables, but stick with third normal form?
Retain third normal form for tables in which you must maintain several
rows of data that are fully dependent on one particular key:
_____________________________________________
| custno | address | state | zip | phone | Table D
|________|_________|_________|______|_________|
_____________________________________________
| custno | orderno | orderdate | . . . . | Table E
|________|_________|___________|______________|
Since Table E may contain many orders for each customer number, the
amount of redundancy would be unacceptable if you were to include Table
D's columns in each row of Table E.
Combining tables can also increase the concurrency between batch
processes and on-line activities, since there is less contention for I/O
resources within the system.
Denormalizing Tables that are Consistently Joined.
Look for tables that are consistently being joined, and consider
combining tables or duplicating some columns to prevent the need for the
join. Considerations:
* Not all joins are expensive. If join columns are supported with
unique indexes, the cost of doing a join is not really an issue.
* Only consider denormalizing when there is a performance problem
that can be attributed to the join.
* Assess the impact of denormalizing on the performance of other
processes that use the tables. Consider how often each process
occurs, and tune for the most used processes.
* Duplicate data wastes storage space and can cause update problems.
* Tables that are combined in denormalization should have a 1:1
relationship with each other. The combining should not result in
an excessive row length or an excessive number of indexes.
* Tables should have the fewest indexes possible.
Horizontal Partitioning.
Large tables that contain historical data may be good candidates for
division into two tables with the same column definition. To identify
potential candidates for horizontal partitioning, review all your
queries, looking for table scans with a low number of rows returned.
Review other processes which would be impacted by dividing the table in
two. Note especially the impact on processes which need all rows. Tune
for the most critical or most frequently used processes.
As an example, consider a table containing daily stock prices. If the
current year's prices are the only ones that are accessed frequently in
looking up prices, then earlier data can be moved to a different table.
Suppose the table is created as follows:
CREATE PUBLIC TABLE SECURITIES
(NAME CHAR(10),
DATE DATE,
PRICE DECIMAL (6,2))
You can define a different table with the same column definition for the
historical data:
CREATE PUBLIC TABLE PRE1992
(NAME CHAR(10),
DATE DATE,
PRICE DECIMAL (6,2))
Then use a Type 2 INSERT (an INSERT with a SELECT clause) to transfer the rows:
INSERT INTO PRE1992 SELECT *
FROM SECURITIES
WHERE DATE < '1992-01-01'
Finally, delete the pre-1992 data from the SECURITIES table:
DELETE FROM SECURITIES
WHERE DATE < '1992-01-01'
Vertical Partitioning.
If some columns are infrequently used or if they require special
security, consider moving them to their own table. Be sure to include
the primary key values with every partition.
Vertical partitioning makes the rows smaller and more efficient to
process, since more rows fit on each page. A disadvantage of having more
rows per page is a potential reduction in concurrency because of page
level locking; using row level locking may help offset this disadvantage.
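For example, infrequently used customer columns might be split into their
own table as follows. This is a minimal sketch: the table names CUSTMAIN
and CUSTCOMMENTS and the column definitions are hypothetical. Note that
the primary key column CUSTNO is repeated in both partitions:
     CREATE PUBLIC TABLE CUSTMAIN
          (CUSTNO     INTEGER NOT NULL,
           ADDRESS    CHAR(40),
           STATE      CHAR(2),
           ZIP        CHAR(10),
           PHONE      CHAR(14))

     CREATE PUBLIC TABLE CUSTCOMMENTS
          (CUSTNO     INTEGER NOT NULL,
           COMMENTS   VARCHAR(200),
           CREDITNOTE VARCHAR(200))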
Including Calculated Data in Tables
You can include calculated data in tables instead of performing
calculations when data is retrieved. In this case, the overhead of the
calculation occurs at INSERT or UPDATE time rather than at query time.
Stored procedures are a useful way of recalculating whenever data values
change. Some calculations can be done using ALLBASE/SQL functions such
as SUM, MIN, AVG, etc.
You can create a view to return the results of a calculation, but this
may degrade performance in some cases. Specifically, do not create views
that contain aggregate functions or GROUP BY clauses. For better
performance, define the view without aggregates or GROUP BY clauses and
then apply the aggregation and grouping to the SELECT done on the view.
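For example, using the SECURITIES table shown earlier, you might define
the view as a simple projection and apply the aggregation in the query on
the view. This is a sketch; the view name SECPRICES is hypothetical:
     CREATE VIEW SECPRICES (NAME, PRICE) AS
          SELECT NAME, PRICE
          FROM SECURITIES

     SELECT NAME, AVG(PRICE)
          FROM SECPRICES
          GROUP BY NAME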
B-Tree Index Design
The use of indexes affects performance directly, since an index scan is
generally faster than a serial scan under the following conditions:
* When only a small fraction of the rows in a table will be
retrieved.
* When a large number of rows must be sorted.
Indexes are fast for retrievals, but slow for inserts, deletes, and some
updates. You should consider the following questions when designing
indexes:
* How often will an index be used in data retrieval? in joins?
* How often will an index be used in place of sorting? DBCore sorts
data on columns associated with a GROUP BY, ORDER BY, DISTINCT, or
UNION.
* How often will there be inserts, updates, or deletes to indexed
columns?
If you are using unique or primary key constraints, ALLBASE/SQL
automatically creates unique indexes which can be used for data
retrieval. If you are using referential constraints, ALLBASE/SQL creates
a PCR (parent-child relationship), which is an index on the two tables in
the referential relationship. These may be less efficient than an index
you create yourself.
Choosing Keys.
* Create indexes on columns that frequently appear in an ORDER BY or
a GROUP BY clause, or in the select list of a SELECT DISTINCT
statement.
* For frequently performed joins, try to ensure that at least one
side of each predicate in the WHERE clause has an index. For
example, if you are joining the SupplyPrice table with the
OrderItems table in the sample DBEnvironment (PartsDBE), you can
improve performance by creating indexes on the common column,
VendPartNumber, in both tables (see the example following this
list).
* Create indexes on columns frequently involved in a LIKE predicate.
In such cases, be sure the literal string in the LIKE predicate
does not begin with a wildcard character.
* Do not create indexes on columns containing YES/NO or other
columns that can have only a few values (for example, day of the
week).
* Always identify the primary and foreign keys in each entity before
creating tables. This immediately identifies which indexes to
create and which columns to specify as PRIMARY KEY or FOREIGN KEY
columns.
* INTEGER and CHAR columns are more suitable as index keys than
DECIMAL or DATE. INTEGER keys are best for extremely large tables
(gigarows). Order of efficiency:
* INTEGER
* FLOAT
* CHAR
* DECIMAL
* DATE
* Index keys greater than 20 bytes are not recommended.
* Consider creating an index of concatenated keys for WHERE clauses
that contain AND's along with an EQUAL predicate (=).
* Create separate indexes on the columns on each side of an OR in a
WHERE clause when the columns are from a single table.
* If you use compound indexes, be sure to use at least the first
column of the compound index in your WHERE clauses.
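For instance, indexes supporting the SupplyPrice/OrderItems join might be
created as shown below, along with a concatenated-key index. This is a
sketch: the index names are hypothetical, and the compound index assumes
that PartNumber and VendorNumber are SupplyPrice columns frequently
combined with ANDed equal (=) predicates in WHERE clauses:
     CREATE INDEX SPVENDPARTINDEX
          ON PURCHDB.SUPPLYPRICE (VENDPARTNUMBER)

     CREATE INDEX OIVENDPARTINDEX
          ON PURCHDB.ORDERITEMS (VENDPARTNUMBER)

     CREATE INDEX SPPARTVENDINDEX
          ON PURCHDB.SUPPLYPRICE (PARTNUMBER, VENDORNUMBER)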
Building Indexes on Large Tables.
Indexes for large tables require special consideration:
* When creating an index for a very large table, be sure there is
enough disk space for the index DBEFiles and for temporary sort
space used while building the index. Make sure the group and
account (MPE/iX) or directory (HP-UX) has plenty of space, and
make sure the index DBEFiles are large enough. See "Using
TempSpaces" in the chapter "Guidelines on System Administration"
for information on creating TempSpaces.
* When defining multiple indexes on very large tables (more than 3
million rows), use the following procedure in the specified order
(a sketch of the corresponding statements appears after this list):
1. Create DBEFileSet.
2. Create DBEFiles for data (type = TABLE).
3. Create table.
4. Load data into table.
5. Create one DBEFile (type = INDEX) large enough to hold the
first index plus sufficient additional space for growth.
6. Create the first index.
7. Create one DBEFile to hold the next index.
8. Create the next index.
9. Repeat the previous two steps until all indexes are
created.
Since the algorithm used when creating an index uses the last
DBEFile defined, each index will now reside in its own DBEFile.
(Caution: do not create a TABLE type DBEFile in the middle of the
process of defining indexes, since this will cause the search for
a DBEFile to start at the beginning instead of using the last
DBEFile defined.)
If you use this process, page splits on indexes will point forward
in the same DBEFile, resulting in better performance for inserts
(though not for retrievals) on very large tables.
* For special applications, indexes can be added or dropped at any
time. In these cases, consider how much data is involved, and
estimate the time and resources needed to drop and recreate.
* Creating a B-tree index after loading data into the table causes
minimal logging. In this case, key data is sorted and then loaded
into index pages without page splitting. The result is faster
index creation and a more compact index than when creating the
index before loading data.
* Tables and indexes for which the most I/O is performed should be
placed on the fastest available disk. If a choice must be made,
place the index (not the table) on the fastest disk. Use the
SQLUtil MOVEFILE command to place DBEFiles on different devices.
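As a sketch, the multiple-index procedure described above might look like
the following for a hypothetical HISTORY table. The DBEFileSet, DBEFile,
system file, table, column, and index names and the PAGES values are all
hypothetical, and the exact CREATE DBEFILE clause forms should be checked
against the ALLBASE/SQL reference manual:
     CREATE DBEFILESET HISTORYSET;

     CREATE DBEFILE HISTDATA1 WITH PAGES = 50000,
          NAME = 'HISTDAT1', TYPE = TABLE;
     ADD DBEFILE HISTDATA1 TO DBEFILESET HISTORYSET;

     CREATE PUBLIC TABLE HISTORY
          (KEYCOL1    INTEGER,
           KEYCOL2    CHAR(10),
           AMOUNT     DECIMAL (10,2)) IN HISTORYSET;
Load the data into the table at this point (step 4 of the procedure).
Then create one INDEX-type DBEFile and its index, and repeat for each
additional index:
     CREATE DBEFILE HISTINDEX1 WITH PAGES = 20000,
          NAME = 'HISTNDX1', TYPE = INDEX;
     ADD DBEFILE HISTINDEX1 TO DBEFILESET HISTORYSET;
     CREATE INDEX HISTKEY1INDEX ON HISTORY (KEYCOL1);

     CREATE DBEFILE HISTINDEX2 WITH PAGES = 20000,
          NAME = 'HISTNDX2', TYPE = INDEX;
     ADD DBEFILE HISTINDEX2 TO DBEFILESET HISTORYSET;
     CREATE INDEX HISTKEY2INDEX ON HISTORY (KEYCOL2);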
Maintaining Indexes.
Periodic review of indexes ensures the greatest efficiency. Use the
following recommendations:
* Review all indexes for usage, considering all queries (including
ad hoc queries). Drop all indexes that are not used.
* Review available data about process usage, looking for table scans
that return only small amounts of data, and consider creating an
index for those cases.
* B-tree indexes grow in depth as rows are inserted; that is, the
number of intermediate levels between root page and leaf page
increases. However, as rows are deleted, the depth of the B-tree
does not decrease. If many additions and deletions have been
done, consider dropping and recreating the indexes.
* If update time is more critical than retrieval time, consider
dropping indexes.
* Be sure to keep the number of indexes small on update-intensive
tables.
Clustering Indexes
Applications which frequently access rows in index order will have better
performance if the rows are physically stored on disk in index order.
Performance is better because I/O is minimized. For example, a query
that needs to read all rows that have a certain value will be faster if
all those rows reside on one page, instead of on many pages.
To automatically maintain table data in index order, ALLBASE/SQL allows
B-tree indexes and referential constraints to be defined as clustering
when they are created. When a clustering index has been defined on a
table, rows added to the table are placed physically near other rows with
similar key values whenever there is space available on the page. If
space cannot be found on the desired page, then the next row is inserted
onto the first page in the DBEFileSet that has space.
A clustering index does a good job of maintaining index order when the
number of insertions is approximately equal to, or less than, the number
of deletions on the table. Under these conditions, space for a newly
inserted row will usually be found on a page having rows with similar
keys. But when the number of insertions is greater than the number of
deletions, space will usually not be found on the desired page for the
row. In this case, time and I/O will be wasted in calculating, locating,
and examining the desired page even though the row must be inserted in
some other page in the DBEFileSet. Thus, you pay the price at INSERT
time for the attempt to optimize the placement of the row using a
clustering index.
If space does not usually exist on table pages, then a clustering index
should not be defined on the table, even if your applications frequently
access the data in index order. You should use a non-clustering index
instead, and manually maintain the table data in index order. You do
this by monitoring the cluster count of the index, and performing UNLOAD
operations followed by sorting and reloading. This process is explained
more fully later in this chapter, in the sections "Monitoring the Cluster
Count" and "Reclustering a Table."
Using Clustering Indexes.
Consider using a clustering index when accessing a subset of rows with
the same key value or when retrieving a sequential range (including an
ORDER BY, GROUP BY, DISTINCT, or UNION). Note the following:
* Clustering indexes are more expensive than non-clustering indexes
for inserts and index column updates.
* You can create empty spaces on data pages by adding "dummy" rows
during the initial table load and later deleting them. This
process is known as "sparsing" the table.
* You may have only one clustering index per table.
If a table rarely has rows deleted from it, and if it frequently has rows
added to it, then you should not define a clustering index. Extra time
(and I/O) will be wasted looking for free space each time a row is
inserted.
If you decide to use clustering, sort your data before loading the table.
Create the clustering index after sorting data and loading the table.
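For example, using the SECURITIES table shown earlier, a clustering index
on the DATE column might be created after the sorted load. This is a
sketch; the index name is hypothetical:
     CREATE CLUSTERING INDEX SECDATEINDEX
          ON SECURITIES (DATE)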
Monitoring the Cluster Count.
To measure how well clustered a B-tree index is (whether it is defined as
clustering or non-clustering), ALLBASE/SQL uses the cluster count
value, which is stored in the CCOUNT field of SYSTEM.INDEX or
SYSTEM.CONSTRAINTINDEX when you do an UPDATE STATISTICS. The CCOUNT is
the number of data page accesses made during a complete index scan of the
table. The CCOUNT is increased by one each time the next row in index
order is not stored on the same page as the last row seen. The same page
can be accessed multiple
times if the data is not physically stored in index order. The best case
is when the CCOUNT value is equal to the number of pages in the table.
If the CCOUNT is higher than this value, then more I/O than is minimally
necessary might be required for an index scan over the entire table. The
worst case is when the CCOUNT value equals the number of rows in the
table.
Use the following query to examine the cluster count for all B-tree
indexes:
SELECT T1.INDEXNAME, T1.OWNER, T1.TABLENAME,
T1.CCOUNT, T2.NROWS, T2.NPAGES
FROM SYSTEM.INDEX T1, SYSTEM.TABLE T2
WHERE T1.OWNER = T2.OWNER
AND T1.TABLENAME = T2.NAME;
Use the following query to examine the cluster count for all unique or
referential constraints:
SELECT T1.CONSTRAINTNAME, T1.OWNER, T1.TABLENAME,
T1.CCOUNT, T2.NROWS, T2.NPAGES
FROM SYSTEM.CONSTRAINTINDEX T1, SYSTEM.TABLE T2
WHERE T1.OWNER = T2.OWNER
AND T1.TABLENAME = T2.NAME;
Run these queries after doing an UPDATE STATISTICS on the tables of
interest to make sure the cluster count is up-to-date.
For indexes used in sorting or for retrieving large numbers of rows, the
cluster count is especially important. A low cluster count may indicate
good performance. Note that a clustering index does not necessarily
result in a good cluster count.
Reclustering a Table.
When CCOUNT in SYSTEM.INDEX or SYSTEM.CONSTRAINTINDEX grows to twice the
number of pages in a table, you may be able to improve performance by
reclustering the table using the following procedure (a sketch of the
corresponding statements appears after the steps):
1. Unload the data from the table using the INTERNAL option. Use a
SELECT with an ORDER BY on the index key so that data is unloaded
in sorted order.
2. Drop any referential constraints that exist on other tables
referring to the unloaded table.
3. Drop the table. This will also drop all B-tree indexes that have
been defined on the table.
4. Issue a COMMIT WORK statement. This makes the space occupied by
the table and its related indexes and constraints available for
reuse.
5. Issue the CREATE TABLE statement to recreate the table. Do not
specify any constraints.
6. Load the data back into the table using the INTERNAL option.
7. Use the CREATE INDEX statement to recreate the clustering index,
or use the ALTER TABLE statement to ADD a clustering constraint.
Note that only one clustering index or constraint can be defined
on a given table.
8. Use the ALTER TABLE statement to add any other constraints that
should be defined for the table.
9. Recreate any other B-tree indexes or referential constraints that
may have been dropped.
10. Do an UPDATE STATISTICS to update the system catalog information
about the table and its indexes and constraints.
11. Issue a COMMIT WORK statement.
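For the SECURITIES table clustered on DATE, the procedure might look
roughly like the following. This is a sketch only: the file name SECFILE
and the index name are hypothetical, the UNLOAD and LOAD lines are ISQL
commands whose exact syntax should be checked in the ISQL reference
material, and the referential constraint steps (2, 8, and 9) are omitted:
     UNLOAD TO INTERNAL SECFILE FROM
          SELECT * FROM SECURITIES ORDER BY DATE;

     DROP TABLE SECURITIES;
     COMMIT WORK;

     CREATE PUBLIC TABLE SECURITIES
          (NAME       CHAR(10),
           DATE       DATE,
           PRICE      DECIMAL (6,2));

     LOAD FROM INTERNAL SECFILE INTO SECURITIES;

     CREATE CLUSTERING INDEX SECDATEINDEX
          ON SECURITIES (DATE);

     UPDATE STATISTICS FOR TABLE SECURITIES;
     COMMIT WORK;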
Using Hash Structures.
For some tables, hashing offers better performance than B-tree indexes.
Hashing works best with a fairly uniform distribution of key values,
spreading the corresponding rows evenly over the number of pages
available. A key with a skewed distribution of values places the rows on
a correspondingly skewed set of pages. The worst keys produce hash values
that cluster tightly in a narrow range of primary pages, leaving the
other pages empty.
You must define the hash at the same time you create the table, and the
hash key must be unique (an example appears after the following list).
Considerations:
* Hash retrieval is faster than B-tree retrieval unless there is a
bad overflow situation. The larger the number of overflow pages,
the slower the access to the average tuple.
* You may also define B-tree indexes on a hash table.
* An insert to a hash structure must perform a separate calculation
to arrive at the correct page for each row of data. An ordinary
non-indexed insert must find an available page in the DBEFileSet,
and finding this page may require several I/Os; but once the
location for new inserts is determined, ordinary inserts are less
expensive than hash inserts. B-tree inserts are more expensive
than either hash or non-indexed inserts.
* You cannot update the hash key column directly; instead, you must
delete and re-insert the row.
* You cannot drop a hash index without dropping the table itself.
* Smaller rows mean more rows per page, which means less chance for
overflow when several rows hash to the same page.
* The DBEFile or DBEFiles for the hash must be empty at the time you
create the table, and there must be at least enough DBEFile space
to hold all primary pages. You should add an appropriate number
of pages (say, 30%) for growth and overflow.
* Use a prime number for the number of primary pages. An odd number
is almost as good.
* Single-column integer keys provide better hashing than
multi-column or character keys.
* You can use hashing to spread rows evenly among the pages of a
small table for the greatest page-level concurrency.
* Do not load a hash table in key-sorted order. Doing so may result
in a performance penalty.
* If the number of overflow pages grows too large, you can unload
and reload the table. See "Unloading and Reloading to Remove
Overflow Pages" later in this chapter.
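For example, a customer table hashed on its unique customer number might
be created as shown below. This is a sketch: the table, column, and
DBEFileSet names and the PAGES value are hypothetical, and the exact form
of the UNIQUE HASH clause should be checked against the CREATE TABLE
syntax in the ALLBASE/SQL reference manual:
     CREATE PUBLIC TABLE CUSTOMERS
          (CUSTNO     INTEGER NOT NULL,
           NAME       CHAR(30),
           PHONE      CHAR(14),
           UNIQUE HASH ON (CUSTNO) PAGES = 1009)
          IN CUSTFILESET
Here the PAGES value is a prime number, and the single-column INTEGER key
follows the hashing guidelines given above.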
Choosing Appropriate Index Types
It is important to consider whether a B-tree index is an appropriate
choice for a given table. The larger the table and the lower the level
of update activity, the greater the benefit of a B-tree index. A B-tree
never hurts a read-only application. Indeed, B-tree indexes are usually
defined to improve the read access requirements of an application.
B-tree indexes can, however, degrade the performance of an application
that updates the table on which the B-tree is defined because whenever a
row is inserted or deleted, the B-tree index must also be updated. In
addition, if the application updates a key column in any row, then the
B-tree index must also be updated.
The needs of the applications that access a table must be carefully
considered before a B-tree index is defined on the table. If the mix of
transactions involving the table is primarily read-intensive, then a
B-tree index can be defined to improve performance. If the mix is
primarily update-intensive, then consider the following:
* If rows are frequently inserted and/or deleted, then hashing
should probably be used instead of a B-tree index. Hashed tables
do not require a second update, as B-tree indexes do. However,
hashed tables may require more manual maintenance for optimal
performance. Such maintenance may involve occasional unloading
and reloading of data.
* If updates are made to the key columns in the row, then
performance degradation cannot be avoided: a B-tree index will
automatically be updated; and an update on the hash key requires
that the old row be deleted and the new row inserted in two
separate operations. You might consider changing the hash key or
index key to a less volatile set of columns if this is possible.
Updating Statistics.
In order for the optimizer to make the best choice among available
indexes, use the UPDATE STATISTICS statement on tables after creating
indexes and after doing many inserts or deletes. After updating
statistics, use the VALIDATE statement to revalidate any stored sections
that were invalidated by the UPDATE STATISTICS statement. While
revalidation is taking place, performance may be impaired because
exclusive locks are obtained on the system catalog. To minimize
concurrency problems, run VALIDATE immediately after UPDATE STATISTICS,
preferably during periods of low activity.
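For example (a minimal sketch; the VALIDATE form shown assumes the ALL
MODULES option is appropriate for your application):
     UPDATE STATISTICS FOR TABLE PURCHDB.PARTS;
     VALIDATE ALL MODULES;
     COMMIT WORK;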
NOTE If you do not update statistics at appropriate times, the optimizer
cannot choose the appropriate access path to data. In order to
avoid performance degradation, be sure your statistics are
up-to-date. You can query SYSTEM.TABLE to determine the time the
last UPDATE STATISTICS was done for a given table. The following
example shows the latest UPDATE STATISTICS time (USTIME) for PurchDB.Parts:
SELECT NAME, OWNER, USTIME
FROM SYSTEM.TABLE
WHERE NAME = 'PARTS'
AND OWNER = 'PURCHDB'
Authorization Design
The creation of authorization schemes can also affect performance, since
ALLBASE/SQL has to look up authorizations before carrying out specific
commands. The more complex the authorization scheme, the more system
table pages must be read by the query processor. The only authorization
verified at run time for non-dynamic queries is RUN authority.
Therefore, it helps to keep a simple hierarchy for RUN authorization. In
general, user group hierarchies should be kept simple (one or two levels)
unless it is absolutely necessary to use more. If the user running an
application is the owner of the module or has DBA authority, the
authorization check is faster.
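For example, a simple one-level scheme might grant RUN authority on an
application's module to a single group. This is a sketch: the group,
user, and module names are hypothetical, and the user ID form
(user@account) assumes MPE/iX:
     CREATE GROUP REPORTUSERS;
     ADD CLERK1@SALES TO GROUP REPORTUSERS;
     ADD CLERK2@SALES TO GROUP REPORTUSERS;
     GRANT RUN ON REPORTMOD TO REPORTUSERS;
Keeping the hierarchy to a single level like this means only one group
membership has to be checked when RUN authority is verified.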
The overhead of authorization checking is higher for dynamic queries.
This is because the first execution of a dynamic query in a transaction
requires the query to be re-authorized. It is therefore all the more
important to keep the authorization hierarchy simple.
In designing authorization schemes, you need to weigh the needs of
security and ease of maintenance against your requirements for
performance.
Using a Duplicate Database
To avoid lock contention, some database designers create a main database
for updating and a duplicate or subset database for reporting. Data can
be transferred periodically from the main database to the duplicate in
batch mode.
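For example, rows might be moved to the reporting DBEnvironment in a
nightly batch job using ISQL. This is a rough sketch only: the
DBEnvironment, table, and file names are hypothetical, and the UNLOAD and
LOAD command forms should be verified against the ISQL reference
material:
     CONNECT TO 'MainDBE';
     UNLOAD TO INTERNAL XFERFILE FROM
          SELECT * FROM PURCHDB.ORDERS;
     COMMIT WORK;
     RELEASE;

     CONNECT TO 'ReportDBE';
     LOAD FROM INTERNAL XFERFILE INTO PURCHDB.ORDERS;
     COMMIT WORK;
     RELEASE;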