The HP e3000--For Complete Novices
Part 33: IMAGE/SQL Application Logic—Performance
by George Stachnik
Welcome to part 33 of this series of articles on the HP e3000. We
began looking at IMAGE/SQL in part 18 of this series—which makes this
the 15th article that has focused on the HP e3000's database management
system. One might wonder whether a single component of MPE/iX deserves so much
attention. But in a very real sense, IMAGE/SQL is the heart of the
e3000, so it's appropriate that we should spend much of our time
studying it.
We've seen how to design, create, and access IMAGE/SQL databases, and
we've learned some best practices that relate to the prevention of
database corruption and locking strategies. This month, we're going to
wrap up our discussion of IMAGE fundamentals by examining why HP chose
to add SQL interfaces to IMAGE, and along the way, learn a little bit
about database performance.
This is a topic that could fill a whole book all by itself, so we
will only be able to present a cursory overview of the subject here.
There are basically two important things to understand about IMAGE
performance:
- IMAGE was designed to do one thing very quickly—online transaction
processing (or OLTP for short).
- HP has done a remarkable job of evolving IMAGE to keep up with the
changing demands that customers make on their databases. Still, it's
important to keep in mind that IMAGE was designed for OLTP. While it can
do other things, the farther away you get from a simple OLTP
environment, the less effective IMAGE/SQL is going to be, particularly
from a performance perspective.
IMAGE/SQL and OLTP
Like MPE and the HP 3000 hardware, IMAGE (the SQL was added much
later) was designed in the 1970s, at a time when databases were found
only on the largest and most expensive mainframe-style computers. The HP
3000 (the "e" did not become part of the name until the year 2000) was
the first minicomputer system that included a robust database management
system (or DBMS). This made it the first minicomputer that was
designed from the ground up for commercial use.
In part 18 of this series, we discussed the business problems faced
by HP customers at the time, and saw how HP was able to use its new DBMS
to solve many of those problems. If you still have that issue of Interact
magazine in your library, now would be a good time to review that
article. In a nutshell, the designers of what was then called
IMAGE/3000 were trying to create a product that could meet the following
three requirements:
- The DBMS would need to provide read/write access to large amounts of
data using a variety of access methods (including keyed access, direct
access, and sequential access).
- The DBMS would need to provide simultaneous access to that data for a
large community of users, using different application programs, each of
which needed to be able to have its own view of the data.
- The DBMS would need to provide that access with subsecond response time.
I suspect that if you had told the original designers of IMAGE/3000
that their creation would still be in use in the 21st century, they'd
have laughed at you. If you had told them how IMAGE databases would be
used in the 21st century, they'd have said, "That's simply impossible."
It's true that many of the design criteria that were used to create IMAGE/3000
have changed significantly since the 1970s.
- The meaning of the term "large amounts of data" has changed. Table 1
shows a few examples of how the constraints on the sizes of IMAGE databases
and datasets have changed over the years.
- The meaning of the term "large community of users" has also changed.
Early models of the HP 3000 supported up to about a dozen users before
response times suffered. Today's N-class servers can support thousands
of sessions.
The only requirement that does not seem to have changed since the
1970s is the need for subsecond response time. Users are just as
impatient today as they were when Richard Nixon was president. But even
there, things are not what they used to be—because the things that
application programmers expect HP's DBMS to do for them in less than
a thousand milliseconds have changed dramatically.
Table 1: How the constraints on IMAGE database and dataset sizes have
changed over the years. [Table missing from this copy; original file:
part33tab1.jsp]
What People Want from a DBMS
In the 1970s, the ability to read, add, delete, and update records in
a database, online, in real time, was a revolutionary concept.
Consequently, the designers of IMAGE/3000 optimized their design around
those four functions (DBGET, DBPUT, DBDELETE, and DBUPDATE). IMAGE's
ability to do these four things quickly (in most situations) is at the
heart of what made the HP 3000 an early success.
By the late 1970s and early 1980s, the day-to-day operation of many
companies was being recorded faithfully in OLTP databases that were
purring away in glass-walled data centers at each company's
headquarters. Every order, every part number, every transaction that
the company entered into was duly recorded in an OLTP database. Not
surprisingly, it didn't take long for people to become completely
dependent upon the databases. Why take the time to maintain paper files
when everything you could ever want to know was "in the computer"?
As this mountain of corporate information piled up in these OLTP
databases, managers soon discovered that the only way of getting to the
data was to use the same application programs that had put it there in
the first place. These programs were fast, but they weren't designed to
analyze the data. They were designed just to maintain it.
Customers began clamoring for ways of pulling the information out of
the database so they could use it to make better decisions. They
demanded tools to analyze it so they could run the company more
effectively. The concept of the ad hoc query soon entered the
lexicon of IT managers everywhere. Customers began demanding that
databases provide whole new levels of functionality. Instead of
retrieving data using predefined search items, as they had been doing in
the OLTP environment, they wanted to be able to retrieve and analyze
any data, from any database, and select it using the contents
of any fields they saw fit.
Naturally enough, all the major computer vendors saw this as an
opportunity to sell hardware (and software). And so, they began working
on new DBMSs that could handle these new ad hoc queries. The most
durable result of this work may well have been IBM's SQL system. Today,
most people associate the word SQL with a language—SQL is actually an
acronym for Structured Query Language. But in its
original implementation, SQL was not just a language, but a whole new kind
of DBMS, called a relational database management system.
The term "relational database management system" (or RDBMS) had
originated in academia. There was a good deal of academic argument over
the precise meaning of the word "relational" in this context. But
before long, the marketing departments of the major computer hardware
vendors weighed in on the subject, and the meaning of the term quickly
became blurred beyond any recognition, at least by the academics.
By the mid 1980s, all the major vendors had "relational" database
products (whether or not they were "truly" relational is something that
the academics are probably still arguing over). But to their surprise,
the customers did not rush to purchase them. There were two major
reasons for the delay.
- The data they wanted to analyze resided in "legacy" databases,
managed by OLTP DBMS products that were optimized around online
transaction processing. HP customers were using IMAGE, just as IBM's
customers were using a mainframe DBMS called IMS (Information Management
System).
- To handle ad hoc queries effectively, the major hardware vendors
had created new relational database management systems.
IBM created SQL, and HP came up with ALLBASE.
You can see the problem easily enough. Customers had mountains of
data sitting in OLTP databases (such as IMAGE databases on HP 3000s).
What they wanted to be able to do was process ad hoc queries against
that data. What they got instead were whole new database management
systems that enabled them to do ad hoc queries—but not on the data that
they had in the OLTP databases.
Before they could do the ad hoc queries that they wanted so badly,
customers needed to copy their data from the OLTP databases to new
relational databases. At first blush, this doesn't sound like such a
big deal. But as usual, the devil's in the details.
- Copying the data was time consuming and expensive. Remember, by this
time the OLTP databases had grown quite large.
- The OLTP databases changed constantly. Maintaining a copy of the
data in a relational database (at least an up-to-date copy) was a
daunting project at the very least.
- Copying your data into a relational database would have been a more
attractive proposition if it meant you could get rid of the OLTP
database. But in most cases, you couldn't. At least not if your
application programs were designed to work with the OLTP database. In
most cases, rewriting the OLTP applications to operate on the relational
database was an expensive project, and in any case, few (if any) of the
early relational DBMSs could provide subsecond response time.
Consequently, most IMAGE users stayed with IMAGE through the 1980s.
In spite of the capabilities of relational DBMSs, they remained little
more than a curiosity until the 1990s. With the arrival of the
networked PC and client-server data analysis tools, many customers
became convinced that the extra effort required to get to a relational
DBMS might just be worth it.
In the early 1990s, many of HP's customers chose to implement their
relational DBMSs using Oracle, running on HP 9000 computers running
HP-UX. Oracle was marketing its UNIX-based DBMS as a product that could
be used in two ways:
- as an OLTP database (like IMAGE) or
- to process ad hoc queries (like ALLBASE).
This was a very seductive marketing message, which many HP customers
implemented (with varying levels of success). When HP's loyal MPE
customer base began moving to HP-UX, the company realized that it had to
do something.
That "something" was IMAGE/SQL—HP's attempt to "graft" the SQL
language portion of ALLBASE onto the data engine of IMAGE (by then
called TurboIMAGE), its OLTP database. We'll begin looking at IMAGE/SQL
(and by extension, ALLBASE) in upcoming articles—but for the remainder
of this part of the series, we're going to take our last look at
IMAGE-without-SQL with an eye toward answering the following question.
"What do you need SQL for, anyway? Why can't you just use IMAGE to
process ad hoc queries?"
IMAGE Without SQL
In the OLTP environment that IMAGE had been designed for, the user
would typically call up information and update it using predefined
search items. For example, an order processing clerk might use a
customer's name or account number as a key value.
Part 18 of this series of articles discussed how one goes about
designing an IMAGE database. In particular, it discussed the fact that
certain fields are designated as search items. In our order
processing example, the fields containing customer names or account
numbers might well be designated as search items.
The definition of search items is important because of the way IMAGE
locates records. In the 1970s, there was an explosion of DBMS software
in the industry. Many of these products shared a common weakness. If
you asked them to locate a particular record in the database, they had
to search the database to find it. These searches were enormously
resource-intensive, and they could slow your system to a crawl. An
example of this can be found on the HP 3000 in the form of KSAM. When
you retrieve a record from a KSAM file using a key value, KSAM searches
the file in order to find the record that has the key value you're
looking for. By contrast, when you retrieve a record from an IMAGE
database using a key value, IMAGE can usually locate the record without
a search, as long as you're using a search item as your key field.
Figure 1 shows a very simple example of how this might work. We see a detail
dataset (ORDERS) with two search items—CUST-NAME and ACCT-NO—followed by
miscellaneous other fields. Suppose an application program wants to
add or retrieve records from ORDERS corresponding to a particular
customer named GEORGE. Let's investigate what would happen from a
performance perspective.
Let's suppose you've built an IMAGE database like the one shown in Figure 1,
but you haven't added any data to it yet. When you're ready to add the
record with CUST-NAME=GEORGE to the CUSTOMER master dataset, you'll call DBPUT.
DBPUT needs to figure out where to put the new record in the (thus
far empty) dataset. You might think that the common sense thing to do
would be to put it in the first slot. But IMAGE does something that may
seem odd at first, but which turns out to be very ingenious. It calculates
a record number. It takes the key value (GEORGE), passes it through a
mathematical formula, and uses the result to determine exactly where to
place it in the CUSTOMER dataset.
This formula converts the key value into an integer between 1 and
whatever the capacity of the dataset is. For example, suppose that the
CUSTOMER dataset was defined to have a capacity of 100 records. The
logic in DBPUT will convert the string GEORGE into an integer between 1
and 100, and place the GEORGE record in the slot in CUSTOMER with that
number. It's as simple as that.
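To make the idea concrete, here is a toy sketch in Python of hashed record placement. This is emphatically not IMAGE's actual hashing formula; the function names echo the IMAGE intrinsics, but the hashing logic is invented purely to illustrate the key-to-slot scheme the text describes.

```python
# Toy illustration of hashed record placement. This is NOT IMAGE's
# actual hashing formula -- just a sketch of the key-to-slot scheme.
# The function names echo the IMAGE intrinsics; the logic is invented.

CAPACITY = 101  # dataset capacity; a prime number, as recommended

def slot_for(key):
    """Map a key value to a slot number between 1 and CAPACITY."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % (2 ** 32)  # simple polynomial hash
    return (h % CAPACITY) + 1

dataset = {}  # slot number -> record

def dbput(key, record):
    """Place a record at its calculated slot (synonyms ignored here)."""
    slot = slot_for(key)
    dataset[slot] = record
    return slot

def dbget(key):
    """Retrieve a record by recomputing the same slot -- no searching."""
    return dataset.get(slot_for(key))

slot = dbput("GEORGE", {"cust-name": "GEORGE", "acct-no": "1001"})
print("GEORGE hashed to slot", slot)
print(dbget("GEORGE"))  # found directly, without scanning the dataset
```

Note that the lookup repeats the placement arithmetic rather than remembering where the record went; that is the whole trick.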
We aren't going to go into the details of how this algorithm works in
this article. But we are going to talk about why it works so well.
The most important reason is that IMAGE's hashing algorithm returns
unique results for just about any key value you throw at it.
In other words, if you pass the string GEORGE through the hashing
algorithm, you will get a different result than you would if you used
TOM, LARRY, MOE, or SHIRLEY (or Curly). This, at least, is how it's supposed to
work—and it does work that way as long as you're careful. We'll get to
what we mean by "careful" in just a moment. But first, let's see why
that's so important.
Imagine a very large database being accessed online. A request comes
in to access the GEORGE record. IMAGE's hashing algorithm enables it to
go directly to the GEORGE record without having to search for it. It
does this by repeating the logic used by DBPUT. It passes the string
GEORGE through the same mathematical equation that DBPUT uses, and uses
the result to figure out where DBPUT placed it in the CUSTOMER dataset.
Note that IMAGE does not have to search for the GEORGE record (as
KSAM, for example, would have to do). In most cases IMAGE will know
exactly where to find it in the master dataset. This is an enormously
valuable feature. Most DBMSes spend an inordinate amount of time
searching for records. IMAGE is able to go directly to whatever record
you request, as long as you follow the rules. (And again, we'll discuss
"the rules" in just a moment.)
Before we do, note also that the hashing algorithm is only used to
locate records in master datasets. How, then, does IMAGE find records
in detail datasets? Don't forget that every record in every master
contains pointers to the corresponding records in the detail dataset.
In Figure 1, CUSTOMER is a master dataset. The GEORGE record
contains pointers to any GEORGE records that might exist in the ORDERS
detail dataset. No search is necessary to locate the detail records:
the chain pointers in CUSTOMER tell IMAGE exactly where to look.
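The chain idea can be sketched like this. The in-memory layout below is an assumption made for illustration (IMAGE's on-disk structures differ): the master entry holds the head of a chain, and each detail record holds a pointer to the next record on its chain.

```python
# Toy sketch of master-to-detail chain pointers. The layout is
# assumed for illustration; IMAGE's on-disk format differs.

orders = []      # detail dataset: list of order records
customer = {}    # master dataset: key -> chain head (index into orders)

def add_order(cust_name, order_no):
    """DBPUT-style add: link the new detail record onto its chain."""
    orders.append({"cust-name": cust_name, "order-no": order_no,
                   "next": customer.get(cust_name)})  # point at old head
    customer[cust_name] = len(orders) - 1             # new chain head

def orders_for(cust_name):
    """Walk the chain: no searching, just pointer-following."""
    idx, found = customer.get(cust_name), []
    while idx is not None:
        found.append(orders[idx]["order-no"])
        idx = orders[idx]["next"]
    return found

add_order("GEORGE", 101)
add_order("SHIRLEY", 102)
add_order("GEORGE", 103)
print(orders_for("GEORGE"))  # [103, 101] -- newest first on this toy chain
```

Retrieving GEORGE's orders touches only the records on GEORGE's chain, no matter how many other customers' orders sit in the detail dataset.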
Don't Sabotage the Hashing Algorithm
This hashing algorithm is the reason why IMAGE is so fast when used
in an OLTP environment. Unfortunately, there are two things that
customers sometimes do to sabotage the hashing algorithm. So you want
to be careful not to do them. In order for the hashing algorithm to be
effective, you want to make sure that you don't violate these rules:
- The capacity of your master datasets should be a prime number. A
prime number is a number that cannot be calculated by multiplying two
other numbers together (except for the number itself, and 1). For
example, 25,013 is a prime number, because there are only two numbers
that can be multiplied by one another and give 25,013 as a result, and
they are 25,013 and 1. There's a table of large prime numbers in the
TurboIMAGE Database Management System Reference Manual (p/n 30391-90011).
- If your application is adding records to a master dataset, don't
allow the dataset to fill up. As the amount of free space shrinks, it
becomes more and more likely that the hashing algorithm will begin to
generate synonyms.
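The prime-number rule is easy to automate if you don't have the manual's table handy. Here is a small sketch that tests primality by trial division and finds the smallest prime at or above a desired capacity:

```python
def is_prime(n):
    """Trial-division primality test; fine for capacity-sized numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(n):
    """Smallest prime >= n, a candidate master dataset capacity."""
    while not is_prime(n):
        n += 1
    return n

print(is_prime(25013))    # True -- the example from the text
print(next_prime(25000))  # 25013
```

So if your sizing exercise calls for roughly 25,000 master entries, 25,013 is the nearest capacity that satisfies the rule.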
Synonyms are two or more key values that map to the same slot number
in a master dataset. I know that earlier we said that this doesn't
happen—that the hashing algorithm will return a unique result for each
key value. And that's true as long as you follow the rules. One
of the rules is that you maintain enough free space in the dataset
(there are guidelines in the IMAGE manual for how much free space is
enough to avoid synonyms). Since the capacity of the dataset is one of
the factors used by the hashing algorithm, the smaller the capacity, the
greater the likelihood that it will start generating synonyms.
IMAGE handles synonyms without generating errors. Basically, when a
synonym occurs, one of the records is placed at the calculated address
(called the primary), and the other one will be placed in a nearby slot
(called a secondary). A pointer connects the two addresses. If DBFIND
goes looking for the record that's located in the secondary slot, the
hashing algorithm will direct it to the primary address. There it will
find a record, but not the one it's looking for. It will locate the
record in the secondary slot by following the pointer, but in so doing,
it generates overhead that will begin to slow down performance.
Essentially, if you allow your IMAGE database to develop a lot of
synonyms, you're forcing IMAGE to conduct searches for records that it
would otherwise be able to locate very quickly.
The problem of secondaries can become even more severe in the event
that a slot occupied by a secondary happens to correspond to the primary
address of another key. In that case, IMAGE has to move the secondary
to yet another address, making it even harder to find in the event that
somebody else tries to locate it later on. In this case it's called a
"migrating secondary." Databases that contain a lot of migrating
secondaries are prime causes of performance problems.
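The mechanics can be sketched with a toy model. The layout below is invented for illustration (real IMAGE internals differ), but it shows the essential cost: every secondary on a chain adds an extra read to a lookup.

```python
# Toy model of primary/secondary synonym handling. The layout is
# invented for illustration; real IMAGE internals differ.

CAPACITY = 7  # deliberately tiny, so collisions are easy to produce
slots = [None] * (CAPACITY + 1)  # slot 0 unused; records live in 1..CAPACITY

def slot_for(key):
    return (sum(ord(c) for c in key) % CAPACITY) + 1

def put(key):
    target = slot_for(key)
    if slots[target] is None:                    # primary placement
        slots[target] = {"key": key, "next": None}
        return target
    node = target                                # synonym: walk the chain
    while slots[node]["next"] is not None:
        node = slots[node]["next"]
    free = next(i for i in range(1, CAPACITY + 1) if slots[i] is None)
    slots[free] = {"key": key, "next": None}     # secondary placement
    slots[node]["next"] = free                   # pointer links the two
    return free

def get(key):
    """Return (slot, reads): each secondary costs an extra read."""
    node, reads = slot_for(key), 0
    while node is not None and slots[node] is not None:
        reads += 1
        if slots[node]["key"] == key:
            return node, reads
        node = slots[node]["next"]
    return None, reads

put("A"); put("H")   # these two keys collide in this toy hash
print(get("A"))      # primary: found in one read
print(get("H"))      # secondary: found only after following the pointer
```

In a real database those extra reads are disk I/O, which is why a dataset riddled with secondaries gets slow.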
There are many database tools available from HP and from third
parties that will evaluate your IMAGE databases to see if they are being
"clogged up" with secondaries. One of the "classics" is a program
called HowMessy, which is available from Robelle. For more information
on third-party IMAGE database tools, refer to Sally Blackwell's article
on the subject. It can be found on the Web at
www.3kworld.com/content.asp?contid=86.
With these minor exceptions, IMAGE performance is very fast as long
as you're using it as it was designed to be used—locating, reading,
writing, and updating records using search items. Next we'll see what
happens if you ask IMAGE to locate a record without using any search
item at all.
The database shown in Figure 1 has two search items—CUST-NAME
and ACCT-NO. What if you want to search for a record by some other
criteria? What if you want to see all the orders that were placed for a
particular item-number? Of course, we could add an ITEM-MASTER to the
database. (Theoretically, you could make every item a search
item, although that would present its own set of problems.) But if
you're not going to be accessing the data by item number frequently,
there's a more practical approach to be considered.
Some application programs (like HP's QUERY) handle such requests by
reading parts of the database serially. The program QUERY.PUB.SYS was
one of HP's earlier attempts at providing a tool for handling ad hoc
queries, and it's bundled with every HP 3000. As long as you use it to
retrieve records using search items, it's a fairly useful (if somewhat
unfriendly) tool.
But as soon as you ask it to retrieve a record from either a master
or a detail using criteria other than a search item value, QUERY begins
to show its "dark side." It will locate the record you want by reading
every record in the dataset, one at a time, from the beginning to the
end, until it finds what it's looking for. This works, but it's a
performance disaster.
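The cost is easy to see in a sketch (the data below is made up; the point is the read count):

```python
# Sketch of a serial scan -- the kind of read QUERY falls back on when
# the selection criterion is not a search item. The data is made up.

records = [{"order-no": n, "item-no": "ITEM-%d" % (n % 50)}
           for n in range(1, 10001)]          # 10,000 order records

def serial_find(item_no):
    """Read every record, start to finish, keeping the matches."""
    reads, hits = 0, []
    for rec in records:
        reads += 1
        if rec["item-no"] == item_no:
            hits.append(rec)
    return hits, reads

hits, reads = serial_find("ITEM-7")
print(len(hits), "matches found, but", reads, "records read")
```

Compare that with the hashed and chained access described earlier, where IMAGE touches only the records it actually needs.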
These are exactly the kinds of demands that are made by ad hoc
queries. Next month, we're going to start looking at IMAGE/SQL's SQL
interface, which handles them better than QUERY does. Using logic
borrowed from ALLBASE, it is a powerful tool, and it adds a great deal
to IMAGE. But it's important to remember that IMAGE/SQL is first and
foremost an OLTP database. It's not a relational database, and although
it can handle ad hoc queries, it will never handle them as efficiently
as an RDBMS.
George Stachnik works in Hewlett-Packard's Commercial Systems Division in
Cupertino, California. He is responsible for the development and delivery
of e3000 training.