ALLBASE/SQL Database Administration Guide: HP 3000 MPE/iX Computer Systems > Chapter 2 Logical Design

Designing Hash Structures


Like an index, a hash structure specifies up to 16 key columns on a table to facilitate rapid access to data. But unlike an index, a hash structure is not a separate object; instead, it is a way of arranging the actual table data in DBEFile pages. To create a hash structure, you must define a hash key and specify a number of primary pages in a DBEFileSet containing one or more empty TABLE or MIXED DBEFiles.

Because of the nature of hashing, you can define only one hash structure for a single table. You can, however, define multiple B-tree indexes on the same table. You cannot define a clustering index on a hash-structured table, because a clustering index also governs the physical placement of data, which would conflict with the placement chosen by the hash algorithm.

Another consequence of hashing is the requirement that you specify a number of primary pages. When the table grows to include more data than will fit on the primary pages, new tuples are placed on overflow pages, which can be located in the same or in other DBEFiles within the same DBEFileSet. The larger the number of overflow pages, the slower the access to the average tuple.

Understanding the Hash Function

Each time a row is inserted into the table, a hash function maps the specified key value to a logical page in the DBEFileSet associated with the hash structure. If two keys hash to the same page (in what is called a collision), then space is allocated on the same page if possible for both rows. If space cannot be allocated on the same page, a new page is allocated (an overflow page) and linked to the old page.

To access data through a hash structure, ALLBASE/SQL calculates the correct data page location in the DBEFileSet from the key value by means of the hash function.

The hash function used by ALLBASE/SQL derives the primary page number in three steps:

  1. Creates a concatenated form of the key.

  2. Folds the concatenated key into a 4-byte integer.

  3. Computes the primary page number, as follows:

       Primary Page Number = (Folded Key) mod (Number of Primary Pages + 1)
    

The important thing to note is that hashing works best when key values are fairly evenly distributed, spreading the corresponding rows evenly over the available primary pages. A key with a skewed distribution tends to place its rows on a correspondingly narrow set of pages. For example, with 101 primary pages, a folded key value of 4137 maps to primary page 4137 mod 102 = 57. For good results, use a prime number of primary pages when you are hashing on a non-integer key.

Choosing Hash Keys

In choosing a hash key, one important consideration is your query design. Queries can use the hash structure only if their predicates include an equal (=) factor for each key column; a range predicate on the key cannot be satisfied through the hash structure.
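For instance, assuming the PurchDB.OrderItems table created later in this section, which is hashed on (OrderNumber, VendPartNumber), a query like the following (the literal values are illustrative) supplies an equality condition on every key column and so can use the hash structure:

   SELECT OrderQty, ReceivedQty
     FROM PurchDB.OrderItems
    WHERE OrderNumber = 12345
      AND VendPartNumber = '9101-AX'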

Another important issue is the distribution of key values. The best key results in a set of hash values that are evenly distributed among the primary pages available. The worst key results in hash values that cluster tightly in a narrow range of primary pages, leaving others empty.

Another consideration is that the key must be unique, whether it is a single column or a combination of columns. In addition to being unique, a hash key should be non-volatile, that is, not subject to frequent update. Since you cannot use the UPDATE statement on a hash key column, you must do a DELETE followed by an INSERT when a key value must change.
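As a sketch of this two-statement workaround, again assuming the PurchDB.OrderItems table created later in this section and illustrative key and column values:

   DELETE FROM PurchDB.OrderItems
    WHERE OrderNumber = 12345
      AND VendPartNumber = '9101-AX'

   INSERT INTO PurchDB.OrderItems
          (OrderNumber, ItemNumber, VendPartNumber, PurchasePrice)
   VALUES (12345, 1, '9102-AX', 49.95)

To preserve the rest of the row, retrieve its current column values before the DELETE and supply them in the INSERT.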

An integer key such as a social security number is ideal. You can also use multiple columns to create a unique key, as in the following example:



   CREATE PUBLIC TABLE PurchDB.OrderItems
     (OrderNumber       INTEGER           NOT NULL,
      ItemNumber        INTEGER           NOT NULL,
      VendPartNumber    CHAR(16),
      PurchasePrice     DECIMAL(10,2)     NOT NULL,
      OrderQty          SMALLINT,
      ItemDueDate       CHAR(8),
      ReceivedQty       SMALLINT )
   UNIQUE HASH ON (OrderNumber, VendPartNumber) PAGES=101
   IN OrderFS


Note that use of a multicolumn key does not, in itself, ensure uniqueness.

Choosing the Number of Primary Pages

The number of primary pages in a hash structure is the number of data pages allocated as hash buckets. The optimal number depends partially on how much data you need to store, and partially on how sparse you wish the DBEFile pages to be. The larger the number of primary pages, the more sparse the pages.

In general, choose a number that is large enough for the data you need to store, plus space for a reasonable amount of growth. In practical terms, the smallest workable number of primary pages equals the size of the data divided by the usable size of a page (about 3900 bytes, allowing for overhead); for example, roughly 400,000 bytes of row data would call for at least 103 primary pages. The largest number of primary pages you might choose equals the total number of rows; for a unique hash structure with a completely even distribution of key values, this means one row per page.

For small tables subject to frequent access, create a hash structure with the number of primary pages equal to the number of rows and with an even distribution of key values. An example is a currency exchange table containing entries for 50 currencies using a currency code key with values ranging from 1001 to 1050 and 50 primary pages. This structure will enable you to lock only a single row at a time while accessing data, thereby improving concurrency.
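As a sketch, such a table might be created as follows; the table name, column names, and DBEFileSet name are illustrative, not part of the sample database:

   CREATE PUBLIC TABLE FinDB.Exchange
     (CurrencyCode     INTEGER          NOT NULL,
      CurrencyName     CHAR(20),
      ExchangeRate     DECIMAL(12,6)    NOT NULL )
   UNIQUE HASH ON (CurrencyCode) PAGES=50
   IN ExchangeFS

With 50 primary pages and 50 evenly distributed key values, this placement works out to approximately one row per primary page, as described above.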
