Physical Data Design

Physical data design means the arrangement of data inside tables, which reside in DBEFiles, and the arrangement of DBEFiles in DBEFileSets. This section includes the following topics:

  • Creating DBEFileSets.

  • Creating DBEFiles.

  • Creating Tables.

  • Initial Table Loads.

  • Unloading Data.

  • Unloading and Reloading to Remove Indirect Rows.

  • Unloading and Reloading to Remove Overflow Pages.

  • Tips on Deletions from Tables.

Additional information relating to physical design is found in the section "Load Balancing" in the chapter "Guidelines on System Administration." You may also find it useful to review the "Physical Design" chapter in the ALLBASE/SQL Database Administration Guide.

Creating DBEFileSets

DBEFileSets are logical groupings of DBEFiles. DBEFileSets are also the logical home of tables, which are always created in a particular DBEFileSet. The following guidelines offer suggestions on how to distribute tables among DBEFileSets.

Avoiding the SYSTEM DBEFileSet for User Data

Although you can create user-defined tables in the SYSTEM DBEFileSet, it is recommended that you create new DBEFileSets to keep user data separate from system catalog data to facilitate maintenance and performance tuning. Here are the reasons why:

  • System catalog data is accessed frequently. If user data is kept in the SYSTEM DBEFileSet, then the pages used for system data become intermingled with the pages used for user data. If ALLBASE/SQL needs to access a system page that physically resides near the end of the DBEFileSet, then it must first look through all the page table pages that come before it. The average time to access a particular system page increases because of the extra data stored within the DBEFileSet.

  • Sometimes ALLBASE/SQL uses pages in the SYSTEM DBEFileSet for temporary space. If this space is not available, processing terminates with an error. Mixing user and system data makes it more likely that temporary space will not be available when needed. See the section "Temporary Space in SYSTEM" in the chapter "Guidelines on System Administration."

  • Grouping tables in separate DBEFileSets allows you to identify the space requirements for particular tables. If you store all tables in the SYSTEM DBEFileSet and the DBEFileSet runs out of space, it will be difficult to tell how fast each table is growing. It will also be harder to determine which table's update or insert operation caused the system to run out of space.

In particular, you should avoid creating hash structures in the SYSTEM DBEFileSet.

Placing Large Tables in Separate DBEFileSets

You should place tables larger than 1000 pages in their own DBEFileSets. When more than one table is stored in the DBEFileSet, the pages of DBEFiles in the DBEFileSet become a mixture of pages from the different tables. In doing a sequential scan, if ALLBASE/SQL needs to access a data page that physically resides near the end of the DBEFileSet, it must first look through all the page table pages that come before it. The average time to access a data page thus increases because of the extra data stored within the DBEFileSet. The time increases severely when several large tables are stored in the DBEFileSet. To determine the size of a DBEFileSet, run SQLMON and go to the Static DBEFile screen.

Maintenance functions such as UPDATE STATISTICS also will take longer if many large tables are stored in a single DBEFileSet. Therefore, create a separate DBEFileSet for each table or group of tables that you want to maintain separately.

Gathering Small Tables into DBEFileSets

For smaller tables, there may be a performance advantage in grouping them together into DBEFiles of 1000 pages or less. When each small table occupies its own DBEFileSet, you can expect some of the following performance problems:

  • I/O might increase, for several reasons:

    • Each page table page has little information on it. More pages would have to be read in from disk because the data physically resides on more separate pages.

    • Because each page table page has little information on it, it will not stay in the data buffer pool. Pages with more information have more locality of reference and are more likely to remain in the buffer pool (thus reducing real I/O).

    If the data buffer pool were large enough, the actual increased I/O might be negligible.

  • Each DBEFileSet requires at least one DBEFile, which is an actual operating system file. Opening a file to read from it is a slow operation. Putting each table into its own DBEFileSet increases the number of files, and therefore increases the average time needed to access the data in these tables.

  • Operating systems have limits on the number of files that can be open simultaneously. To avoid exceeding the limit, successive close and open file operations may be required, which will degrade performance. ALLBASE/SQL imposes its own limit of 50 files that may be open simultaneously.

  • The operating system must search for the correct file descriptor, which may take longer with more files open.

  • Disk space usage would increase, since many virtually empty page table pages would exist.

When small tables are placed in a single DBEFileSet, disk space is used efficiently and I/O is minimized. For example, if you have fifty tables that average 5 pages each, the total of about 250 pages fits comfortably into a single DBEFile and requires only one page table page (each page table page covers 252 pages of data).

Creating DBEFiles

DBEFiles are the physical repositories of ALLBASE/SQL data on the operating system. Since you have great flexibility in defining DBEFile types, sizes, and assignment to DBEFileSets, your choices often affect the overall performance of your system. Here are some suggestions for creating DBEFiles:

  • Create DBEFiles in sizes that are multiples of 253 pages to minimize the space used by page table pages. Each 253-page segment contains one page table page followed by 252 pages of data.

  • Create separate TABLE and INDEX DBEFiles, especially for large tables, so that they can be placed on separate disks to improve I/O performance.

  • Create a MIXED DBEFile in its own DBEFileSet for a small table (less than 1000 pages) with a single index. (In this case, data and index rows will be on alternating pages, which will result in faster processing and better space utilization).

  • Use dynamic DBEFile expansion to avoid running out of DBEFile space at run time. In the CREATE DBEFILE statement, specify the initial size, the maximum size, and the increment size (see the sketch at the end of this section).

  • Use raw files to shorten the I/O path length and eliminate operating system buffering of database pages. This feature is appropriate only for large DBEFiles containing large tables accessed randomly. Standard I/O is more appropriate for small DBEFiles or DBEFiles that are frequently accessed sequentially.

For information on raw files, refer to the appendix "Using HP-UX Raw Files for DBEFiles and Logs" in the ALLBASE/SQL Database Administration Guide.
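The following sketch pulls several of these suggestions together. The DBEFileSet name (SalesFS), DBEFile name (SalesF1), and page counts are illustrative only, and the clause names for dynamic expansion are shown approximately; verify the exact CREATE DBEFILE syntax in the ALLBASE/SQL Reference Manual.

      CREATE DBEFILESET SalesFS;

      CREATE DBEFILE SalesF1
         WITH PAGES = 506,        -- initial size: 2 x 253 pages
              NAME = 'SalesF1',
              INCREMENT = 253,    -- grow one 253-page segment at a time
              MAXPAGES = 5060,    -- upper bound for dynamic expansion
              TYPE = TABLE;

      ADD DBEFILE SalesF1 TO DBEFILESET SalesFS;

Sizing the file and its increment in multiples of 253 pages keeps each new segment aligned with a full page table page, so no page table page governs an unnecessarily small amount of data.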

Avoiding Extra DBEFile Space

When sequential scans are made frequently for queries involving tables in a particular DBEFileSet, extra space should be minimized. When sequential scans are made, every page table page in the DBEFileSet is read to determine if pages might qualify for the query.

When a large amount of extra space exists, then extra I/O is required to fetch the page table pages for empty DBEFile space. One way to minimize the extra space in your DBEFiles is by specifying dynamic space expansion in the CREATE DBEFILE statement.

Creating Tables

Here are some general suggestions on table creation:

  • Create PUBLIC or PUBLICROW tables for maximum concurrency. PUBLICREAD can improve performance by reducing internal page locking, and PRIVATE is intended for special purposes. Note that PRIVATE is the default, so you usually want to specify something else (see the sketch following this list).

  • Choose data types which suit the programming language so as to avoid data conversions.

  • Native language (NLS) users should be aware that indexes and predicates on USASCII (n-computer or C) columns perform faster than indexes and predicates on native-language columns. If you have a choice, create indexes on USASCII columns and use USASCII columns in your predicates.

  • Integer values can be 5 - 15% more efficient than packed decimal values.

  • When you add a column with a default value using ALTER TABLE, every row in the table is immediately updated with the new value, which causes all rows to become indirect rows. It may be better to unload the data, drop and recreate the table with the new column and default, and then reload the data.

  • If a column is to be used consistently in the WHERE clause, it should have an index.

  • Wherever possible, load data for non-hash tables in sorted order. This is essential for tables with clustering indexes. If a multi-column index exists, the order of its columns should be the same as the order of the columns in the load file. Hash tables should not be loaded in sorted order.

  • Consider combining tables that are constantly joined together. It is a good idea to revisit normalization issues when actually creating tables.

  • For large tables, place data and indexes on separate disks using the SQLUtil MOVEFILE command. In general, spread the load of database and log files over your disks, and keep data and log files on separate devices. Refer to the section "Load Balancing" in the "Guidelines on System Administration" chapter for more information.
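As an illustration of several of these points, here is a sketch of a table created with PUBLIC locking in its own DBEFileSet, with an index on the column used in WHERE clauses. The table, column, index, and DBEFileSet names are hypothetical, and the column list is kept short for clarity.

      CREATE PUBLIC TABLE PurchDB.Orders
         (OrderNumber    INTEGER   NOT NULL,
          VendorNumber   INTEGER   NOT NULL,
          OrderDate      CHAR(8)   NOT NULL)
      IN OrdersFS;

      -- Index the column used consistently in WHERE clauses.
      CREATE INDEX VendorIndex ON PurchDB.Orders (VendorNumber);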

Avoiding NULL and Variable Length Data

  • Avoid NULLs and variable length data types (VARCHAR and VARBINARY) for the following reasons:

    • Nullable columns always require additional CPU time to check for null values. Declaring columns NOT NULL can save as much as 5% in CPU by eliminating this overhead.

    • Use of NULL and VARCHAR or VARBINARY may cause wasted space due to the inability to use a shared tuple header. (A separate header must be stored for each tuple that has a different size.) However, the use of nulls or variable length columns may actually save space if there are only a few such rows per page. If the difference between the maximum size of the row and average size of data stored in the row is greater than the size of the header, then use of NULL or variable size columns may be more efficient.

    • Updating a NULL, VARCHAR, or VARBINARY column may cause the data to move to a different page, leaving an indirect tuple behind and increasing the number of I/Os required to access the data tuple.

    • You can create a table with default values instead of nulls so that when columns are updated, rows do not have to move to another page because of an increase in tuple size (see the sketch following this list).

  • Use an appropriate (not an unreasonably long) length for variable length columns. The maximum length can affect the performance of BULK SELECT and cursor operations.

  • Indexes whose keys will be updated should never include nullable columns as keys, and should rarely include VARCHAR or VARBINARY columns. Variable length keys can cause long index chains and rearrangement when index values are updated.
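A minimal sketch of the default-value approach, using hypothetical names: fixed-length columns declared NOT NULL with default values take the place of nullable variable length columns, so an update cannot enlarge the tuple and force it to another page. The placement of the DEFAULT clause is shown approximately; verify it against the CREATE TABLE syntax in the ALLBASE/SQL Reference Manual.

      CREATE PUBLIC TABLE PurchDB.Contacts
         (VendorNumber   INTEGER                   NOT NULL,
          ContactName    CHAR(30)   DEFAULT ' '    NOT NULL,
          PhoneNumber    CHAR(15)   DEFAULT ' '    NOT NULL)
      IN VendFS;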

Using INTEGER Rather than SMALLINT Data

Since SQLCore only performs 4-byte arithmetic, all SMALLINT values are automatically converted to INTEGERs before processing by SQLCore. These conversions do not affect the optimizer's choice of a scan plan, but they may be costly in CPU for very large databases. To avoid the conversion, consider defining your SMALLINT columns as INTEGER columns. The drawback, of course, is that INTEGERs require twice as much storage space as SMALLINTs.

Initial Table Loads

For optimal performance, use the following ISQL commands before starting the load operation (a sample setup appears after this list):

  • SET LOAD_BUFFER

    Use this statement to enlarge the load buffer beyond the default size of 16,384 bytes.

  • SET AUTOLOCK ON

    Avoid lock contention by locking the table in exclusive mode when the load is performed. For PUBLIC and PUBLICROW tables, locking the table in exclusive mode avoids the overhead of obtaining locks as pages of data are added to the table. If you are loading a table that has an index defined on it, locking the table in exclusive mode also provides optimal logging.

  • SET AUTOCOMMIT ON and SET AUTOSAVE

    Together these commands cause the number of rows specified with AUTOSAVE to be automatically committed when the load buffer is full. Should the load operation subsequently fail, you can insert the remaining rows with the LOAD PARTIAL command.

  • SET SESSION DML ATOMICITY AT ROW LEVEL

    Setting the DML atomicity to row level guarantees that savepoints will not be generated during a load, reducing logging overhead when running in non-archive mode.

  • SET SESSION UNIQUE, REFERENTIAL, CHECK CONSTRAINTS DEFERRED

    By deferring constraints, you avoid problems caused by the order in which dependent values are inserted into a table when foreign or primary key constraints exist. Constraint checking is deferred until the end of the load operation.

    You may wish to load a table first, then add constraints to it using the ALTER TABLE statement. When the constraint is added, the data in the table is checked for consistency, and if inconsistent data is found, an error is generated. In this case, it is your responsibility to remove the inconsistent tuples before attempting to add the constraint.

    Deferring constraints and setting atomicity to the row level together have the effect of reducing logging overhead greatly in non-archive mode because of the use of no-log pages. However, logging is still performed for the allocation of new pages, and the non-archive log file must be large enough to include a record for each page.
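A sample ISQL setup for an initial load might look like the following. The buffer size, AUTOSAVE row count, file name, and table name are illustrative, and the LOAD command form is shown approximately; see the ISQL Reference Manual for the exact syntax.

      SET LOAD_BUFFER 65536;
      SET AUTOLOCK ON;
      SET AUTOCOMMIT ON;
      SET AUTOSAVE 1000;
      SET SESSION DML ATOMICITY AT ROW LEVEL;
      SET SESSION UNIQUE, REFERENTIAL, CHECK CONSTRAINTS DEFERRED;

      LOAD FROM INTERNAL OrdersFile TO PurchDB.Orders;
      COMMIT WORK;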

Additional tips for improving load performance:

  • For subsequent unload and load operations, use the INTERNAL option of the ISQL UNLOAD and LOAD commands. LOAD INTERNAL is 3-5 times faster than LOAD EXTERNAL.

  • Using archive logging during loading requires additional log file space, but it actually requires fewer I/Os than non-archive logging. However, the saving in I/Os must be weighed against the time required to back up log files so they can be reused.

  • Set the number of log buffers to at least 120 while loading.

  • Set data buffer pages to 200 to reduce the size of the linked list and the expense of maintaining it.

  • Add indexes after data is loaded. For clustering indexes, sort the data before the load (see the example after this list).

  • If you wish to use rules tied to INSERT operations, create the rules after loading the table. Otherwise, the operation of the rule will degrade the performance of the load. Note, however, that the operation of the rule is not retroactive, so you must ensure the consistency of your data at the time the rule is created.
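For example, after loading pre-sorted data, the clustering index can be created and the statistics refreshed. Names are hypothetical, and the LOAD form is shown approximately, as above.

      LOAD FROM INTERNAL SortedOrdersFile TO PurchDB.Orders;

      -- Data was sorted on OrderDate before the load.
      CREATE CLUSTERING INDEX OrderDateIndex
         ON PurchDB.Orders (OrderDate);

      UPDATE STATISTICS FOR TABLE PurchDB.Orders;
      COMMIT WORK;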

Unloading Data

UNLOAD operations are subject to the following performance considerations:

  • To recluster data to improve the cluster count, unload the table with an ORDER BY clause, delete all rows from the table, then reload the data in the new sorted order (a sketch appears after this list). This process is described more thoroughly earlier in this chapter under "Reclustering a Table." Also, see the next item.

  • For large tables, you can UNLOAD INTERNAL to tape. The tape must be labeled. It is important to use a blocking factor that is as close to 16K bytes as possible, since ISQL uses 16K byte chunks. If your block size is less than 16K, the unload and load will take longer and use more tape.

  • For UNLOAD operations on large amounts of data that you intend to sort, consider unloading the data first and then sorting with an external utility rather than using the ORDER BY as part of the UNLOAD. In this case, you need to use UNLOAD EXTERNAL.
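A reclustering sketch based on the first item above, with hypothetical names; the UNLOAD and LOAD command forms are approximations, so check the ISQL Reference Manual for the exact syntax. TRUNCATE TABLE is used to delete all rows, as recommended under "Tips on Deletions from Tables" later in this section.

      UNLOAD TO INTERNAL OrdersFile
         FROM SELECT * FROM PurchDB.Orders ORDER BY OrderDate;

      TRUNCATE TABLE PurchDB.Orders;
      COMMIT WORK;

      LOAD FROM INTERNAL OrdersFile TO PurchDB.Orders;
      COMMIT WORK;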

Unloading and Reloading to Remove Indirect Rows

The existence of indirect rows increases the amount of I/O that must be performed to obtain data. Indirect rows also consume more disk space, since they frequently are not able to use the shared tuple header. Therefore, indirect rows should be avoided. Use the following procedure to remove indirect rows from a table (a statement sketch follows the steps):

  1. Unload the data from the table using the INTERNAL option.

  2. Drop any referential constraints on other tables that refer to the unloaded table.

  3. Drop the table. This will also drop any B-tree indexes that have been defined on the table.

  4. Issue the CREATE TABLE statement to recreate the table. Do not include constraints.

  5. Issue the COMMIT WORK statement to reclaim space.

  6. Load the data back into the table using the INTERNAL option.

  7. Use the CREATE INDEX statement to recreate any B-tree indexes that may have been dropped, and use the ALTER TABLE statement to add back constraints (including referential constraints) that may have been dropped.

  8. Do an UPDATE STATISTICS for the table to update system catalog information about the table and its indexes and constraints.

  9. Issue the COMMIT WORK statement.
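For a simple table with one index and no referential constraints, the procedure might translate into statements like the following. All names and column definitions are hypothetical, and the UNLOAD and LOAD forms are shown approximately.

      UNLOAD TO INTERNAL PartsFile FROM PurchDB.Parts;   -- step 1

      DROP TABLE PurchDB.Parts;                          -- step 3 (drops its B-tree indexes too)

      CREATE PUBLIC TABLE PurchDB.Parts                  -- step 4: recreate without constraints
         (PartNumber   CHAR(16)   NOT NULL,
          PartName     CHAR(30)   NOT NULL,
          ListPrice    DECIMAL(10,2))
      IN PartsFS;

      COMMIT WORK;                                       -- step 5

      LOAD FROM INTERNAL PartsFile TO PurchDB.Parts;     -- step 6

      CREATE UNIQUE INDEX PartNumIndex                   -- step 7
         ON PurchDB.Parts (PartNumber);

      UPDATE STATISTICS FOR TABLE PurchDB.Parts;         -- step 8
      COMMIT WORK;                                       -- step 9

If referential constraints refer to the table, drop them before step 3 and add them back with ALTER TABLE in step 7.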

Unloading and Reloading to Remove Overflow Pages

Since many key values can hash to the same page address, it is possible for a page in a hash structure to become full. When this happens, a new row must be inserted on an overflow page. Overflow pages increase the amount of I/O that must be performed to obtain table data. The larger the number of overflow pages, the slower the average access to any tuple.

Observe the AVGLEN column in SYSTEM.HASH or the AVGOVERFLOW field of the SQLMON Static Hash screen to see the average number of page accesses required to retrieve a particular row. As the number of overflow pages increases, so will this number. Increasing the capacity of the table should reduce AVGLEN, and thus improve performance. Use the following procedure:

  1. Unload the data from the table using the INTERNAL option.

  2. Drop any referential constraints on other tables that refer to the unloaded table.

  3. Drop the table. This will also drop any B-tree indexes that have been defined on the table.

  4. Issue a COMMIT WORK statement. This makes the space occupied by the table and its related indexes and constraints available for reuse.

  5. If necessary, create additional TABLE or MIXED DBEFiles and add them to the DBEFileSet that will contain the redefined table.

  6. Issue the CREATE TABLE statement, specifying a larger number of primary pages than when you previously created the table (see the sketch following these steps). The CREATE TABLE statement should not include any constraints.

  7. Load the data back into the table using the INTERNAL option.

  8. Use the CREATE INDEX statement to recreate any B-tree indexes that may have been dropped, and use the ALTER TABLE statement to add back constraints (including referential constraints) that may have been dropped.

  9. Issue an UPDATE STATISTICS statement for the table to update system catalog information about the table and its indexes and constraints.

  10. Issue the COMMIT WORK statement.
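For instance, if a hash table was originally created with 1000 primary pages and has accumulated overflow pages, step 6 might recreate it with a larger PAGES value. The table, column, and DBEFileSet names are hypothetical, and the placement of the hash clause is shown approximately; verify the CREATE TABLE hash syntax in the ALLBASE/SQL Reference Manual.

      CREATE PUBLIC TABLE PurchDB.Inventory
         (PartNumber   CHAR(16)   NOT NULL,
          BinNumber    INTEGER    NOT NULL,
          QtyOnHand    INTEGER)
      UNIQUE HASH ON (PartNumber) PAGES = 2000    -- previously PAGES = 1000
      IN InvFS;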

Tips on Deletions from Tables

  • When deleting all rows from a table, it is more efficient to use the TRUNCATE TABLE statement instead of the DELETE statement (see the example after this list). On a large table, the DELETE statement may cause the log to run out of space, whereas the TRUNCATE TABLE statement incurs minimal logging.

  • The performance of the TRUNCATE TABLE statement degrades when the table is hashed. Dropping the table and recreating it may be faster than using the TRUNCATE TABLE statement.

  • If the table specified by the TRUNCATE TABLE statement is included in a referential constraint, it may be more efficient to drop the constraint, issue the TRUNCATE TABLE statement, and then add the constraint back.
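For example, to empty a hypothetical PurchDB.OrderItems table:

      -- Preferred: minimal logging
      TRUNCATE TABLE PurchDB.OrderItems;
      COMMIT WORK;

      -- Avoid on large tables: may exhaust log space
      -- DELETE FROM PurchDB.OrderItems;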
