ALLBASE/SQL Performance Guidelines
Physical Data Design
Physical data design means the arrangement of data inside tables, which
reside in DBEFiles, and the arrangement of DBEFiles in DBEFileSets. This
section includes the following topics:
* Creating DBEFileSets.
* Creating DBEFiles.
* Creating Tables.
* Initial Table Loads.
* Unloading Data.
* Unloading and Reloading to Remove Indirect Rows.
* Unloading and Reloading to Remove Overflow Pages.
* Tips on Deletions from Tables.
Additional information relating to physical design is found in the
section "Load Balancing" in the chapter "Guidelines on System
Administration." You may also find it useful to review the "Physical
Design" chapter in the ALLBASE/SQL Database Administration Guide.
Creating DBEFileSets
DBEFileSets are logical groupings of DBEFiles. DBEFileSets are also the
logical home of tables, which are always created in a particular
DBEFileSet. The following guidelines offer suggestions on how to
distribute tables among DBEFileSets.
Avoiding the SYSTEM DBEFileSet for User Data.
Although you can create user-defined tables in the SYSTEM DBEFileSet, it
is recommended that you create new DBEFileSets to keep user data separate
from system catalog data to facilitate maintenance and performance
tuning. Here are the reasons why:
* System catalog data is accessed frequently. If user data is kept
in the SYSTEM DBEFileSet, then the pages used for system data
become intermingled with the pages used for user data. If
ALLBASE/SQL needs to access a system page that physically resides
near the end of the DBEFileSet, then it must first look through
all the page table pages that come before it. The average time to
access a particular system page increases because of the extra
data stored within the DBEFileSet.
* Sometimes ALLBASE/SQL uses pages in the SYSTEM DBEFileSet for
temporary space. If this space is not available, processing
fails. Mixing user and system data
makes it more likely that temporary space will not be available
when needed. See the section "Temporary Space in SYSTEM" in the
chapter "Guidelines on System Administration."
* Grouping tables in separate DBEFileSets allows you to identify the
space requirements for particular tables. If you store all tables
in the SYSTEM DBEFileSet and the DBEFileSet runs out of space, it
will be difficult to tell how fast each table is growing. It will
also be harder to determine which table's update or insert
operation caused the system to run out of space.
In particular, you should avoid creating hash structures in the SYSTEM
DBEFileSet.
Placing Large Tables in Separate DBEFileSets.
You should place tables larger than 1000 pages in their own DBEFileSets.
When more than one table is stored in the DBEFileSet, the pages of
DBEFiles in the DBEFileSet become a mixture of pages from the different
tables. In doing a sequential scan, if ALLBASE/SQL needs to access a
data page that physically resides near the end of the DBEFileSet, it must
first look through all the page table pages that come before it. The
average time to access a data page thus increases because of the extra
data stored within the DBEFileSet. The time increases severely when
several large tables are stored in the DBEFileSet.
Maintenance functions such as UPDATE STATISTICS also will take longer if
many large tables are stored in a single DBEFileSet. Therefore, create a
separate DBEFileSet for each table or group of tables that you want to
maintain separately.
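For example, a large table can be given its own DBEFileSet with
statements along the following lines. The DBEFileSet, DBEFile, and
system file names are illustrative only; see the ALLBASE/SQL Reference
Manual for the complete syntax.

     isql=> CREATE DBEFILESET OrdersFS;
     isql=> CREATE DBEFILE OrdersF1 WITH PAGES = 5060,
            NAME = 'OrdersF1', TYPE = TABLE;
     isql=> ADD DBEFILE OrdersF1 TO DBEFILESET OrdersFS;

The table itself is then created IN OrdersFS, as shown under "Creating
Tables" later in this section.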
Gathering Small Tables into DBEFileSets.
For smaller tables, there may be a performance advantage in grouping them
together into DBEFiles of 1000 pages or less. When each small table
occupies its own DBEFileSet, you can expect some of the following
performance problems:
* I/O might increase, for several reasons:
* Each page table page has little information on it. More
pages would have to be read in from disk because the data
physically resides on more separate pages.
* Because each page table page has little information on it,
it will not stay in the data buffer pool. Pages with more
information have more locality of reference and are more
likely to remain in the buffer pool (thus reducing real
I/O).
If the data buffer pool were large enough, the actual increased
I/O might be negligible.
* Each DBEFileSet requires at least one DBEFile, which is an actual
operating system file. Opening a file to read from it is a slow
operation. Putting each table into its own DBEFileSet increases
the number of files, and therefore increases the average time
needed to access the data in these tables.
* Operating systems have limits on the number of files that can be
open simultaneously. To avoid exceeding the limit, successive
close and open file operations may be required, which will degrade
performance. ALLBASE/SQL imposes its own limit of 50 files that
may be open simultaneously.
* The operating system must search for the correct file descriptor,
which may take longer with more files open.
* Disk space usage would increase, since many virtually empty page
table pages would exist.
When small tables are placed in a single DBEFileSet, disk space is used
efficiently and I/O is minimized. For example, if you have fifty tables
that average 5 pages, they will all fit comfortably into a single DBEFile
and use only one page table page.
Creating DBEFiles
DBEFiles are the physical repositories of ALLBASE/SQL data on the
operating system. Since you have great flexibility in defining DBEFile
types, sizes, and assignment to DBEFileSets, your choices often affect
the overall performance of your system. Here are some suggestions for
creating DBEFiles:
* Create DBEFiles in sizes that are multiples of 253 pages to
minimize the space used by page table pages. Each 253-page
segment contains one page table page followed by 252 data pages.
* Create separate TABLE and INDEX DBEFiles, especially for large
tables, so that they can be placed on separate disks to improve
I/O performance.
* Create a MIXED DBEFile in its own DBEFileSet for a small table
(less than 1000 pages) with a single index. (In this case, data
and index rows will be on alternating pages, which will result in
faster processing and better space utilization).
* Use dynamic DBEFile expansion to avoid running out of DBEFile
space at run time. In the CREATE DBEFILE statement, specify the
initial size, the maximum size, and the increment size (see the
sketch following this list).
* Use pseudo-mapped I/O on MPE/iX or raw files on HP-UX to shorten
the I/O path length and eliminate operating system buffering of
database pages. This feature is appropriate only for large
DBEFiles containing large tables accessed randomly. Standard I/O
is more appropriate for small DBEFiles or DBEFiles that are
frequently accessed sequentially. When appropriate on MPE/iX, use
pseudo-mapped I/O with memory-resident data buffers. File system
prefetching is not available on MPE/iX when using pseudo-mapped
I/O.
Restrictions on pseudo-mapped files in MPE/iX:
* They cannot be dynamically expanded.
* You cannot use concurrent backup (that is, SQLUtil
STOREONLINE does not work with pseudo-mapped files).
For additional information on pseudo-mapped files in MPE/iX, refer to the
appendix "Using Pseudo-Mapped DBEFiles" in the MPE/iX version of the
ALLBASE/SQL Database Administration Guide. For information on raw files
in HP-UX, refer to the appendix "Using HP-UX Raw Files for DBEFiles and
Logs" in the HP-UX version of the ALLBASE/SQL Database Administration
Guide.
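As an illustration of the sizing and dynamic expansion guidelines
above, the following sketch creates a TABLE DBEFile and an INDEX
DBEFile whose sizes are multiples of 253 pages. The MAXPAGES and
INCREMENT keywords shown for dynamic expansion are representative
only; verify the exact CREATE DBEFILE syntax for your release in the
ALLBASE/SQL Reference Manual.

     isql=> CREATE DBEFILE PartsF1 WITH PAGES = 1012,
            NAME = 'PartsF1', TYPE = TABLE,
            MAXPAGES = 5060, INCREMENT = 506;
     isql=> CREATE DBEFILE PartsX1 WITH PAGES = 506,
            NAME = 'PartsX1', TYPE = INDEX;
     isql=> ADD DBEFILE PartsF1 TO DBEFILESET PartsFS;
     isql=> ADD DBEFILE PartsX1 TO DBEFILESET PartsFS;

Keeping the TABLE and INDEX DBEFiles separate lets you move them to
different disks later with the SQLUtil MOVEFILE command.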
Avoiding Extra DBEFile Space.
When sequential scans are made frequently for queries involving tables in
a particular DBEFileSet, extra space should be minimized. When
sequential scans are made, every page table page in the DBEFileSet is
read to determine if pages might qualify for the query. When a large
amount of extra space exists, then extra I/O is required to fetch the
page table pages for empty DBEFile space. One way to minimize the extra
space in your DBEFiles is by specifying dynamic space expansion in the
CREATE DBEFILE statement.
Creating Tables
Here are some general suggestions on table creation:
* Create PUBLIC and PUBLICROW tables for maximum concurrency.
PUBLICREAD can improve performance by reducing internal page
locking. Use PRIVATE for special purposes. Note that PRIVATE is
the default; usually, you want to specify something else. (A
table creation sketch follows this list.)
* Choose data types which suit the programming language so as to
avoid data conversions.
* Native language (NLS) users should be aware that indexes and
predicates that use USASCII (n-computer or C on HP-UX; NATIVE-3000
on MPE/iX) columns perform faster than indexes and predicates that
use native language columns. If there is a choice, use USASCII
for index columns in your tables, create indexes on USASCII
columns, and use USASCII columns in your predicates.
* Integer values can be 5 - 15% more efficient than packed decimal
values.
* When you use ALTER TABLE to add a column with a default value,
every row in the table is immediately updated with the new
value, which causes all rows to become indirect rows. It may
be better to unload the data, drop and recreate the table with
the new column and default, and then reload the data.
* If a column is to be used consistently in the WHERE clause, it
should have an index.
* Wherever possible, load data for non-hash tables in sorted order.
This is essential for tables with clustering indexes. If a
multi-column index exists, the order of its columns should be the
same as the order of the columns in the load file. Hash tables
should not be loaded in sorted order.
* Consider combining tables that are constantly joined together. It
is a good idea to revisit normalization issues when actually
creating tables.
* For large tables, place data and indexes on separate disks using
the SQLUtil MOVEFILE command. In general, spread the load of
database and log files over your disks, and keep data and log
files on separate devices. Refer to the section "Load Balancing"
in the "Guidelines on System Administration" chapter for more
information.
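The first guideline in the list above might be applied as in the
following sketch, which creates a PUBLIC table in its own DBEFileSet.
Table, column, and DBEFileSet names are illustrative only.

     isql=> CREATE PUBLIC TABLE PurchDB.Orders
            (OrderNumber   INTEGER   NOT NULL,
             VendorNumber  INTEGER   NOT NULL,
             OrderDate     DATE      NOT NULL)
            IN OrdersFS;
     isql=> CREATE CLUSTERING INDEX OrderIndex
            ON PurchDB.Orders (OrderNumber);

If the table will receive a large initial load, create the clustering
index after loading the sorted data, as described under "Initial
Table Loads" later in this section.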
Avoiding NULL and Variable Length Data.
* Avoid NULLs and variable length data types (VARCHAR and VARBINARY)
for the following reasons:
* Nullable columns always require additional CPU time to check
for null values. Declaring columns NOT NULL can save as much
as 5% in CPU by eliminating this overhead.
* Use of NULL and VARCHAR or VARBINARY may cause wasted space
due to the inability to use a shared tuple header. (A
separate header must be stored for each tuple that has a
different size.) However, the use of nulls or variable
length columns may actually save space if there are only a
few such rows per page. If the difference between the
maximum size of the row and average size of data stored in
the row is greater than the size of the header, then use of
NULL or variable size columns may be more efficient.
* Updating a NULL, VARCHAR, or VARBINARY column may cause the
data to move to a different page, leaving an indirect tuple
behind and increasing the number of I/Os required to access
the data tuple.
* You can create a table with default values instead of nulls
so that when columns are updated they will not have to move
to another page because of a tuple size increase (see the
sketch following this list).
* If VARCHARs or NULL columns are necessary to save space, create a
table with VARCHARs or NULLs as the trailing columns in the table
schema. This is because when row lengths change, the amount of
shifting required at the storage level is minimized.
* Use an appropriate (not an unreasonably long) length for variable
length columns. The maximum length can affect the performance of
BULK SELECT and cursor operations.
* Indexes whose keys will be updated should never include NULL
columns as keys, and rarely should they include VARCHAR or
VARBINARY columns. Variable length keys can cause long index
chains and rearranging when index values are updated.
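The following sketch, with illustrative names, shows a table that
uses NOT NULL columns with default values and places its one variable
length column last. The exact position of the DEFAULT clause in a
column definition may vary by release, so check the CREATE TABLE
syntax in the ALLBASE/SQL Reference Manual.

     isql=> CREATE PUBLIC TABLE PurchDB.Vendors
            (VendorNumber  INTEGER    NOT NULL,
             VendorCity    CHAR(20)   DEFAULT 'UNKNOWN' NOT NULL,
             OrderCount    INTEGER    DEFAULT 0 NOT NULL,
             Remarks       VARCHAR(60))
            IN VendorFS;

Here only Remarks can be null or change length, and because it is the
trailing column, row-length changes require the least shifting at the
storage level.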
Using INTEGER Rather than SMALLINT Data.
Since SQLCore performs only 4-byte arithmetic, all SMALLINT values are
automatically converted to INTEGERs before SQLCore processes them. These
conversions do not affect the optimizer's choice of a scan plan, but they
may be costly in CPU for very large databases. To avoid the conversion,
consider defining your SMALLINT columns as INTEGER columns. The
drawback, of course, is that INTEGERs require twice as much storage space
as SMALLINTs.
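For example, a column that might otherwise be declared SMALLINT can
simply be declared INTEGER; the table and column names below are
illustrative.

     isql=> CREATE PUBLIC TABLE PurchDB.OrderItems
            (OrderNumber  INTEGER  NOT NULL,
             ItemCount    INTEGER  NOT NULL)
            IN OrdersFS;

Declaring ItemCount as INTEGER rather than SMALLINT avoids the
conversion at the cost of two extra bytes of storage per row.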
Initial Table Loads
In order to achieve the highest possible performance in loading data into
your tables for the first time, create a utility program that uses the
BULK INSERT command to add rows of data to the table. For very large
tables, such a program can improve performance over ISQL's LOAD command
by 300% to 500%. C would not be the best language choice for a load
program if you are using BULK INSERT with character data, because of
differences in character handling between C and ALLBASE/SQL.
For subsequent UNLOAD and LOAD operations, use the INTERNAL option of the
ISQL UNLOAD and LOAD commands. Wherever it is appropriate, create
indexes after loading.
Be sure to turn off archive logging if you need to conserve log file
space, and use BULK INSERT statements. Additional tips:
* Using archive logging during loading requires additional log file
space, but it actually requires fewer I/Os than nonarchive
logging. However, the saving of I/Os must be weighed against the
time required to back up log files so they can be reused.
* By setting constraints DEFERRED, you avoid concern over the order
in which dependent values are inserted into a table when foreign
or primary key constraints exist. At COMMIT WORK time,
constraints are applied to the loaded tables.
* Setting the DML atomicity to row level guarantees that savepoints
will not be generated during a load, which can improve performance
while non-archive logging is in effect. Be sure to turn
AUTOCOMMIT OFF in this case: row level atomicity and constraint
deferral last only for the current transaction, and at each
automatic COMMIT WORK the settings revert to STATEMENT level
atomicity and IMMEDIATE constraint application.
* For optimal performance, use the following statements before
starting the LOAD operation (a fuller end-to-end sequence is
sketched after this list):
isql=> SET AUTOCOMMIT OFF;
isql=> SET UNIQUE, REFERENTIAL, CHECK CONSTRAINTS DEFERRED;
isql=> SET DML ATOMICITY AT ROW LEVEL;
isql=> LOCK TABLE TableName IN EXCLUSIVE MODE;
Note the following:
* If you do not turn AUTOCOMMIT off, the exclusive lock will
be released at the time of the first automatic COMMIT WORK,
and the atomicity will return to the statement level.
* Turning AUTOCOMMIT off, deferring constraints, and setting
atomicity to the row level together have the effect of
reducing logging overhead greatly in non-archive mode
because of the use of no-log pages. However, logging is
still performed for the allocation of new pages, and the
non-archive log file must be large enough to include a
record for each page.
* For PUBLIC and PUBLICROW tables, locking the table in
exclusive mode avoids the overhead of obtaining locks as
pages of data are added to the table. If you are loading a
table that has an index defined on it, locking the table in
exclusive mode also provides optimal logging.
* You may wish to load a table first, then add constraints to it
using the ALTER TABLE statement. When the constraint is added,
the data in the table is checked for consistency, and if
inconsistent data is found, an error is generated. In this case,
it is your responsibility to remove the inconsistent tuples before
attempting to add the constraint.
* If you wish to use rules tied to INSERT operations, create the
rules after loading the table. Otherwise, the operation of the
rule will degrade the performance of the LOAD. Note, however, that
the operation of the rule is not retroactive, so you must ensure
the consistency of your data at the time the rule is created.
* Set the number of log buffers to at least 120 while loading.
* Set data buffer pages to 200 to reduce the size of the linked list
and the expense of maintaining it.
* Load in single user mode to avoid lock contention.
* Add indexes after data is loaded. For clustering indexes, sort
the data before the load.
* LOAD INTERNAL is 3-5 times faster than LOAD EXTERNAL.
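Putting these tips together, a complete initial load from an
internal-format file might look like the following sketch. The file
and table names are illustrative, and the LOAD command form shown is
approximate; see the ISQL Reference Manual for the exact LOAD syntax.

     isql=> SET AUTOCOMMIT OFF;
     isql=> SET UNIQUE, REFERENTIAL, CHECK CONSTRAINTS DEFERRED;
     isql=> SET DML ATOMICITY AT ROW LEVEL;
     isql=> LOCK TABLE PurchDB.Parts IN EXCLUSIVE MODE;
     isql=> LOAD FROM INTERNAL PartsFile TO PurchDB.Parts;
     isql=> COMMIT WORK;

After the COMMIT WORK, create indexes with CREATE INDEX, add
constraints back with ALTER TABLE, and run UPDATE STATISTICS for the
table.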
Unloading Data
UNLOAD operations are subject to the following performance
considerations:
* To recluster data to improve the cluster count, unload the table
with an ORDER BY clause, delete all rows from the table, then
reload the data in the new sorted order. This process is
described more thoroughly earlier in this chapter under
"Reclustering a Table." Also, see the next item and the sketch
following this list.
* For large tables, you can UNLOAD INTERNAL to tape. The tape must
be labeled. It is important to use a blocking factor that is as
close to 16K bytes as possible, since ISQL uses 16K byte chunks.
If your block size is less than 16K, the unload and load will take
longer and use more tape.
* For UNLOAD operations on large amounts of data that you intend to
sort, consider unloading the data first and then sorting with an
external utility rather than using the ORDER BY as part of the
UNLOAD. In this case, you need to use UNLOAD EXTERNAL.
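For example, reclustering a table as described in the first item
might be entered as follows. The UNLOAD and LOAD command forms are
approximate and the names are illustrative; if your version of ISQL
does not accept a SELECT with UNLOAD TO INTERNAL, use UNLOAD EXTERNAL
and sort the file with an external utility instead.

     isql=> UNLOAD TO INTERNAL OrdersFile FROM
            SELECT * FROM PurchDB.Orders ORDER BY OrderNumber;
     isql=> DELETE FROM PurchDB.Orders;
     isql=> COMMIT WORK;
     isql=> LOAD FROM INTERNAL OrdersFile TO PurchDB.Orders;
     isql=> COMMIT WORK;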
Unloading and Reloading to Remove Indirect Rows
The existence of indirect rows increases the amount of I/O that must be
performed to obtain data. Indirect rows also consume more disk space,
since they frequently are not able to use the shared tuple header.
Therefore, indirect rows should be avoided. Use the following procedure
to remove indirect rows from a table:
1. Unload the data from the table using the INTERNAL option.
2. Drop any referential constraints on other tables that refer to the
unloaded table.
3. Drop the table. This will also drop any B-tree indexes that have
been defined on the table.
4. Issue the CREATE TABLE statement to recreate the table. Do not
include constraints.
5. Issue the COMMIT WORK statement to reclaim space.
6. Load the data back into the table using the INTERNAL option.
7. Use the CREATE INDEX statement to recreate any B-tree indexes that
may have been dropped, and use the ALTER TABLE statement to add
back constraints (including referential constraints) that may have
been dropped.
8. Do an UPDATE STATISTICS for the table to update system catalog
information about the table and its indexes and constraints.
9. Issue the COMMIT WORK statement.
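An abbreviated ISQL sketch of this procedure for a hypothetical
two-column table follows. Constraint handling is omitted for brevity,
and the UNLOAD, LOAD, and UPDATE STATISTICS forms should be verified
against the reference manuals.

     isql=> UNLOAD TO INTERNAL PartsFile FROM PurchDB.Parts;
     isql=> DROP TABLE PurchDB.Parts;
     isql=> CREATE PUBLIC TABLE PurchDB.Parts
            (PartNumber  CHAR(16)  NOT NULL,
             PartName    CHAR(30)  NOT NULL)
            IN PartsFS;
     isql=> COMMIT WORK;
     isql=> LOAD FROM INTERNAL PartsFile TO PurchDB.Parts;
     isql=> CREATE UNIQUE INDEX PartNumIndex
            ON PurchDB.Parts (PartNumber);
     isql=> UPDATE STATISTICS FOR TABLE PurchDB.Parts;
     isql=> COMMIT WORK;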
Unloading and Reloading to Remove Overflow Pages
Since many key values can hash to the same page address, it is possible
for a page in a hash structure to become full. When this happens, a new
row must be inserted on an overflow page. Overflow pages increase the
amount of I/O that must be performed to obtain table data. The larger
the number of overflow pages, the slower the average access to any tuple.
Observe the AVGLEN column in SYSTEM.HASH to see the average number of
page accesses required to retrieve a particular row. As the number of
overflow pages increases, so will this number. Increasing the capacity
of the table should reduce AVGLEN, and thus improve performance. Use the
following procedure:
1. Unload the data from the table using the INTERNAL option.
2. Drop any referential constraints on other tables that refer to the
unloaded table.
3. Drop the table. This will also drop any B-tree indexes that have
been defined on the table.
4. Issue a COMMIT WORK statement. This makes the space occupied by
the table and its related indexes and constraints available for
reuse.
5. If necessary, create additional TABLE or MIXED DBEFiles and add
them to the DBEFileSet that will contain the redefined table.
6. Issue the CREATE TABLE statement, specifying a larger number of
primary pages than when you previously created the table. The
CREATE TABLE statement should not include any constraints.
7. Load the data back into the table using the INTERNAL option.
8. Use the CREATE INDEX statement to recreate any B-tree indexes that
may have been dropped, and use the ALTER TABLE statement to add
back constraints (including referential constraints) that may have
been dropped.
9. Issue an UPDATE STATISTICS statement for the table to update
system catalog information about the table and its indexes and
constraints.
10. Issue the COMMIT WORK statement.
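The key difference from the previous procedure is step 6, where the
table is recreated as a hash structure with a larger number of
primary pages. The following sketch uses illustrative names and a
PAGES value chosen only for the example; verify the hash clause of
CREATE TABLE in the ALLBASE/SQL Reference Manual. Afterwards, you can
check AVGLEN again by querying SYSTEM.HASH.

     isql=> CREATE PUBLIC TABLE PurchDB.Invoices
            (InvoiceNumber  INTEGER  NOT NULL,
             VendorNumber   INTEGER  NOT NULL)
            UNIQUE HASH ON (InvoiceNumber) PAGES = 4000
            IN InvoiceFS;
     isql=> LOAD FROM INTERNAL InvoiceFile TO PurchDB.Invoices;
     isql=> UPDATE STATISTICS FOR TABLE PurchDB.Invoices;
     isql=> COMMIT WORK;
     isql=> SELECT * FROM SYSTEM.HASH;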
Tips on Deletions from Tables
* When deleting all rows from a table, it is more efficient to drop
and later recreate the table than to use the DELETE statement
without a predicate. The DELETE statement on a large table may
also cause the log to run out of space.
* When deleting the majority of rows from a large table, it may be
more efficient to unload the rows to be retained, drop the table,
recreate it, and reload the data, rather than selecting and
deleting the larger number of rows. A sketch of this approach
follows.
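For the second case, a sketch of the keep-and-reload approach
follows; the table definition, predicate, and command forms are
illustrative only.

     isql=> UNLOAD TO INTERNAL KeepFile FROM
            SELECT * FROM PurchDB.WorkOrders WHERE Status = 'OPEN';
     isql=> DROP TABLE PurchDB.WorkOrders;
     isql=> COMMIT WORK;
     isql=> CREATE PUBLIC TABLE PurchDB.WorkOrders
            (WorkOrderNumber  INTEGER   NOT NULL,
             Status           CHAR(10)  NOT NULL)
            IN WorkFS;
     isql=> LOAD FROM INTERNAL KeepFile TO PurchDB.WorkOrders;
     isql=> COMMIT WORK;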