KSAM FILE STRUCTURE

A KSAM/3000 file is two physical files: a data file and a key file. The data file portion of a KSAM file contains all the data in the file and contains nothing but the data. Data records are written to the data file in the order in which they are received from a program. (The last record added is always written to the end of the file.) This chronological order is not necessarily in sequence by key value. At the time the file is opened, you can specify that records must be written in primary key sequence, but the default mode is to write records in any order.

The key file portion of a KSAM file contains the key entries that maintain the sequence of the data records. As a data record is written to the data file, a key entry is added for each key defined for the file, and the sequential connections between key entries maintained. This means that if there is a primary key and two alternate keys, three key entries are added with each new data record, and three sets of pointers are updated to reflect the new key sequence of each key.

The structure of the data file is like that of any MPE file. Data records may be fixed or variable in length. If fixed, each record is the size specified when the file is created (default size is equivalent to one 128-word disc sector). If variable, the actual size of each record is included in the record itself, and the maximum size of any record is used to determine the blocking. By default, data records are blocked one record per block.

The structure of the key file is more complex. The key file is organized so that locating a particular key requires the least number of accesses. For this purpose, the key files are organized in a particular structure known as a "B-Tree".^[3]B-tree structure has two main advantages:

^[3]

The number of file accesses is limited to the number of levels in the tree. If there are two levels, no more than two reads of the key file are needed to locate a particular key.
The key file is balanced. This means that each level pointer associated with a particular key value points to approximately as many higher key values as lower key values at the next level of the tree.

B-tree structure in general is discussed below, followed by a discussion of how KSAM key files use this structure.

B-TREE STRUCTURE

In a B-tree, there is always one root level block that points to blocks at a lower level. At the lowest level, the blocks are called leaves and they do not reference another level. In a two-level structure (see Figure B-1 “Two-Level B-Tree Structure”), the blocks at the second level all "leaves". If the tree has more than two levels, intermediate blocks (nodes or branches) are referenced by a higher level and themselves reference a lower level. Unless this lower level is a leaf, it also references a lower level. This continues until the lowest (leaf) level is reached.

The notion of higher and lower level does not refer to the key values, The root block key values are always central and point to blocks with lower values and blocks with higher values. Thus if there are two entries at the root or a branch level, there will be three pointers to the next level: one for key values less than the first key value, one for key values less than the second key value but greater than the first, and one for key values greater than the second.

Within each block, values are stored in ascending order. Although not all blocks are filled with values, each block in a tree is the same size. Figure B-1 “Two-Level B-Tree Structure” illustrates a simple 2-level tree with one root block and three leaf blocks. The root is a single block and each leaf is a block of the same size. (This example uses the KSAM minimum key block size consisting of four key entries per block.)

Figure B-1 Two-Level B-Tree Structure

ADDING OR DELETING KEYS.

When a key block is full and new keys are added, the root level block may need to be split, causing a new root block to be introduced and adding a new level to the tree. This process is illustrated in Figure B-2 “Split Causes New Level in Tree” where the addition of one new key to a partially filled block does not affect the tree structure, but the addition of a second key to the full block causes the block to split.

Again, this example assumes the minimum key block size for the sake of simplicity. Note that all key file maintenance is performed in the KSAM extra data segment where space is allocated for one more key than the key block size. This allows the addition of a key to a "full" block. Before the block is read back into the key file, it is split so that the key block size is maintained.

Figure B-2 Split Causes New Level in Tree

When the root block and all the leaves are full, another split becomes necessary. Figure B-3 “Tree Growth from Two to Three Levels” illustrates a split caused when a new key is added to a full two-level tree structure, forcing it to a three-level structure.

Figure B-3 Tree Growth from Two to Three Levels

Note that key blocks must always be defined with an even number of keys. As a result, when a key is added to a full block, there will be a middle value to form a block at a new level. This maintains the balance essential to B-tree structure.

As records are deleted from the data file, two blocks at the same level (brothers) may be merged into one block. If sufficient records are deleted, the root block may be merged into a higher level, thereby contracting the number of levels in the key structure.

KSAM KEY FILE STRUCTURE

A KSAM key file consists of three types of information:

Control — contains general control information such as the KSAM filename, and the number of keys defined for the file,
Key descriptor — contains general key information for each key such as the starting location in the data record of the key field, and the location in the key file of the root key entry.
Key entries — Each key entry contains information about a key associated with a data record. This information consists of:
- the key value
- a pointer into a data record in the data file with the same key value.
- pointers into other records within the key file.

The control and key descriptor information is contained in two blocks (physical records) at the beginning of each file. Regardless of the number of keys in the file, each block is 128 words (l sector) long. Thus, every key file is preceded by two sectors of control and key descriptor information.

The key entries are also organized into blocks of a fixed size. However, the exact number of blocks and the size of each block is based on a variety of factors, such as the key size, the number of keys in the file, the number of key values for each key, the key blocking factor, and so forth. (Calculation of key block size is discussed later in this section.) These key entry blocks are organized into the Btree structure discussed above. A separate key structure is maintained for each key defined for the file, Thus there may be up to 16 separate tree structures in a single KSAM file.

Refer to Figure B-4 “KSAM Key File Structure With Two Keys” for a simplilied diagram of a KSAM key file with two keys each organized into a two level tree structure. For a detailed description of the three types of block, refer to Figure B-5 “Control Block and Key Descriptor Block” and Figure B-6 “Key Entry Block Structure”.

CONTROL BLOCK.

This 128-word block contains information on the data file associated with the key file, and keeps track of the number and type of access to the key file. It also specifies the number of keys (primary and alternate) defined for the KSAM file. The name of the data file and the number of keys are essential for associating the key file with the data file. The number of keys determines how many entries are in the Key Descriptor Block. (Refer to Figure B-5 “Control Block and Key Descriptor Block”.)

KEY DESCRIPTOR BLOCK.

This 128-word block contains one 8-word entry for each key defined for the KSAM file. The first entry describes the primary key, the next entry describes the first alternate key (if there is one), and each subsequent entry describes any additional alternate keys. The first word of each entry points to the root block for that key; another important item is the location of the key in the data file record. (Refer to Figure B-5 “Control Block and Key Descriptor Block”.)

Figure B-4 KSAM Key File Structure With Two Keys

Figure B-5 Control Block and Key Descriptor Block

KEY ENTRY BLOCKS.

Each block in the key file contains, in addition to the key values, pointers that link the key blocks to each other and pointers that link each key value to an associated data record. Preceding these entries, the first item in every key block is the address of the block on disc; the next item is the number of keys in the key block.

All key block access for search and modification is performed in the KSAM extra data segment. The disc address in each key block insures that the block is returned to its correct location on disc from the extra data segment.

Figure B-6 “Key Entry Block Structure” illustrates the general layout of all key entry blocks. Each key value is followed by a pointer to a data record and a pointer to the block at the next level with higher key values. The first, pointer in each block points to a block at the next level with lower key values. These pointers are set to zero for key blocks that have no next level (the leaves on a tree structure).

Figure B-6 Key Entry Block Structure

RELATION OF KEY TO DATA FILE

The purpose of the KSAM key file is to maintain the order of data records in the data file. In order to maintain sequential order for each key, the keys blocks are connected through pointers. In addition to these pointers, each key entry must also contain a pointer linking the key value to the data record containing the corresponding key value.

When the KSAM file is created, each key is defined by its starting location in the data record, its length, and its type. The location is specified as the character position where the key value starts; the length is the maximum number of characters used by the key value; its type is the type of the value such as, an integer, a character string, or a double-word integer.

Thus, if the primary key is defined as a character string that starts in character position 3 and is 20 characters long, then KSAM expects that each data record will contain such a value in that location. Whatever is placed in the defined location is treated as the primary key and determines the order in which data records are sequenced.

The order in which records are physically written to the date file is called chronological sequence. This sequence may or may not also be a key sequence. If the records were written to the file in primary key sequence, then this sequence and the chronological sequence are the same. If there is an alternate key for the file, however, it is very unlikely that alternate key sequence is the same as the chronological sequence.




	NOTE: Key sequence in KSAM files is always in ascending order by key value.

Refer to Figure B-7 “Data File/Key File Relation” for a simplified diagram of the relation between the primary keys in the key file and the associated data records in the data file. (A similar diagram could be set up for the alternate key.) The diagram shows the pointer in each key entry pointing directly to a record in the data file.

When a data record is to be located by key value, the root block for the appropriate key is searched first, using a binary search method. If the key is in the root block, the search is over. If it is not, the key value is between two root block values or it is less than the lowest value or greater than the highest. Using the pointer in the appropriate location, a block at the next level is located. This block is then searched for the selected key. Again, if the key is found, the search is over. If the key is not found at this level, the appropriate pointer to the next level is used and the search continues.

When the selected key value is found, the pointer to the data file associated with that key value is used to locate the record in the data file.

Figure B-7 Data File/Key File Relation

^[3]Described in "Organization and Maintenance of Large Ordered Indexes" Bayer and McCreight, Acta Informatica, Springer Verlag,1972, pp 173-189.