The HP e3000--For Complete Novices
Part 33: IMAGE/SQL Application Logic—Performance
by George Stachnik
Welcome to part 33 of this series of articles on the HP e3000. We
began looking at IMAGE/SQL in part 18 of this series—which makes this
the 15th article that has focused on the HP e3000's database management
system. One might wonder whether a single component of MPE/iX deserves so much
attention. But in a very real sense, IMAGE/SQL is the heart of the
e3000, so it's appropriate that we should spend much of our time
studying it.
We've seen how to design, create, and access IMAGE/SQL databases, and
we've learned some best practices that relate to the prevention of
database corruption and locking strategies. This month, we're going to
wrap up our discussion of IMAGE fundamentals by examining why HP chose
to add SQL interfaces to IMAGE, and along the way, learn a little bit
about database performance.
This is a topic that could fill a whole book all by itself, so we
will only be able to present a cursory overview of the subject here.
There are basically two important things to understand about IMAGE
performance:
- IMAGE was designed to do one thing very quickly—online transaction
processing (or OLTP for short).
- HP has done a remarkable job of evolving IMAGE to keep up with the
changing demands that customers make on their databases. Still, it's
important to keep in mind that IMAGE was designed for OLTP. While it can
do other things, the farther away you get from a simple OLTP
environment, the less effective IMAGE/SQL is going to be, particularly
from a performance perspective.
IMAGE/SQL and OLTP
Like MPE and the HP 3000 hardware, IMAGE (the SQL was added much
later) was designed in the 1970s, at a time when databases were found
only on the largest and most expensive mainframe-style computers. The HP
3000 (the "e" did not become part of the name until the year 2000) was
the first minicomputer system that included a robust database management
system (or DBMS). This made it the first minicomputer that was
designed from the ground up for commercial use.
In part 18 of this series, we discussed the business problems faced
by HP customers at the time, and saw how HP was able to use its new DBMS
to solve many of those problems. If you still have that issue of Interact
magazine in your library, now would be a good time to review that
article. In a nutshell, the designers of what was then called
IMAGE/3000 were trying to create a product that could meet the following
three requirements:
- The DBMS would need to provide read/write access to large amounts of
data using a variety of access methods (including keyed access, direct
access, and sequential access).
- The DBMS would need to provide simultaneous access to that data for a
large community of users, using different application programs, each of
which needed to be able to have its own view of the data.
- The DBMS would need to provide that access with subsecond response time.
I suspect that if you had told the original designers of IMAGE/3000
that their creation would still be in use in the 21st century, they'd
have laughed at you. If you had told them how IMAGE databases would be
used in the 21st century, they'd have said, "That's simply impossible."
It's true that many of the design criteria that were used to create IMAGE/3000
have changed significantly since the 1970s.
- The meaning of the term "large amounts of data" has changed. Table 1
shows a few examples of how the constraints on the sizes of IMAGE databases
and datasets have changed over the years.
- The meaning of the term "large community of users" has also changed.
Early models of the HP 3000 supported up to about a dozen users before
response times suffered. Today's N-class servers can support thousands
of sessions.
The only requirement that does not seem to have changed since the
1970s is the need for subsecond response time. Users are just as
impatient today as they were when Richard Nixon was president. But even
there, things are not what they used to be—because the things that
application programmers expect HP's DBMS to do for them in less than
a thousand milliseconds have changed dramatically.
Table 1: How the constraints on IMAGE database and dataset sizes have
changed over the years. [Table missing from this copy; original file:
part33tab1.jsp]
What People Want from a DBMS
In the 1970s, the ability to read, add, delete, and update records in
a database, online, in real time, was a revolutionary concept.
Consequently, the designers of IMAGE/3000 optimized their design around
those four functions (DBGET, DBPUT, DBDELETE, and DBUPDATE). IMAGE's
ability to do these four things quickly (in most situations) is at the
heart of what made the HP 3000 an early success.
By the late 1970s and early 1980s, the day-to-day operation of many
companies was being recorded faithfully in OLTP databases that were
purring away in glass-walled data centers at each company's
headquarters. Every order, every part number, every transaction that
the company entered into was duly recorded in an OLTP database. Not
surprisingly, it didn't take long for people to become completely
dependent upon the databases. Why take the time to maintain paper files
when everything you could ever want to know was "in the computer"?
As this mountain of corporate information piled up in these OLTP
databases, managers soon discovered that the only way of getting to the
data was to use the same application programs that had put it there in
the first place. These programs were fast, but they weren't designed to
analyze the data. They were designed just to maintain it.
Customers began clamoring for ways of pulling the information out of
the database so they could use it to make better decisions. They
demanded tools to analyze it so they could run the company more
effectively. The concept of the ad hoc query soon entered the
lexicon of IT managers everywhere. Customers began demanding that
databases provide whole new levels of functionality. Instead of
retrieving data using predefined search items, as they had been doing in
the OLTP environment, they wanted to be able to retrieve and analyze
any data, from any database, and select it using the contents
of any fields they saw fit.
Naturally enough, all the major computer vendors saw this as an
opportunity to sell hardware (and software). And so, they began working
on new DBMSs that could handle these new ad hoc queries. The most
durable result of this work may well have been IBM's SQL system. Today,
most people associate the word SQL with a language—SQL is actually an
acronym for Structured Query Language. But in its
original implementation, SQL was not just a language, but a whole new kind
of DBMS, called a relational database management system.
The term "relational database management system" (or RDBMS) had
originated in academia. There was a good deal of academic argument over
the precise meaning of the word "relational" in this context. But
before long, the marketing departments of the major computer hardware
vendors weighed in on the subject, and the meaning of the term quickly
became blurred beyond any recognition, at least by the academics.
By the mid 1980s, all the major vendors had "relational" database
products (whether or not they were "truly" relational is something that
the academics are probably still arguing over). But to their surprise,
the customers did not rush to purchase them. There were two major
reasons for the delay.
- The data they wanted to analyze resided in "legacy" databases,
managed by OLTP DBMS products that were optimized around online
transaction processing. HP customers were using IMAGE, just as IBM's
customers were using a mainframe DBMS called IMS (Information Management
System).
- To handle ad hoc queries effectively, the major hardware vendors
had created new relational database management systems.
IBM created SQL, and HP came up with ALLBASE.
You can see the problem easily enough. Customers had mountains of
data sitting in OLTP databases (such as IMAGE databases on HP 3000s).
What they wanted to be able to do was process ad hoc queries against
that data. What they got instead were whole new database management
systems that enabled them to do ad hoc queries—but not on the data that
they had in the OLTP databases.
Before they could do the ad hoc queries that they wanted so badly,
customers needed to copy their data from the OLTP databases to new
relational databases. At first blush, this doesn't sound like such a
big deal. But as usual, the devil's in the details.
- Copying the data was time consuming and expensive. Remember, by this
time the OLTP databases had grown quite large.
- The OLTP databases changed constantly. Maintaining a copy of the
data in a relational database (at least an up-to-date copy) was a
daunting project at the very least.
- Copying your data into a relational database would have been a more
attractive proposition if it meant you could get rid of the OLTP
database. But in most cases, you couldn't. At least not if your
application programs were designed to work with the OLTP database. In
most cases, rewriting the OLTP applications to operate on the relational
database was an expensive project, and in any case, few (if any) of the
early relational DBMSs could provide subsecond response time.
Consequently, most IMAGE users stayed with IMAGE through the 1980s.
In spite of the capabilities of relational DBMSs, they remained little
more than a curiosity until the 1990s. With the arrival of the
networked PC and client-server data analysis tools, many customers
became convinced that the extra effort required to get to a relational
DBMS might just be worth it.
In the early 1990s, many of HP's customers chose to implement their
relational DBMSs using Oracle, running on HP 9000 computers running
HP-UX. Oracle was marketing its UNIX-based DBMS as a product that could
be used in two ways:
- as an OLTP database (like IMAGE) or
- to process ad hoc queries (like ALLBASE).
This was a very seductive marketing message, which many HP customers
implemented (with varying levels of success). When HP's loyal MPE
customer base began moving to HP-UX, the company realized that it had to
do something.
That "something" was IMAGE/SQL—HP's attempt to "graft" the SQL
language portion of ALLBASE onto the data engine of IMAGE (by then
called TurboIMAGE), its OLTP database. We'll begin looking at IMAGE/SQL
(and by extension, ALLBASE) in upcoming articles—but for the remainder
of this part of the series, we're going to take our last look at
IMAGE-without-SQL with an eye toward answering the following question.
"What do you need SQL for, anyway? Why can't you just use IMAGE to
process ad hoc queries?"
IMAGE Without SQL
In the OLTP environment that IMAGE had been designed for, the user
would typically call up information and update it using predefined
search items. For example, an order processing clerk might use a
customer's name or account number as a key value.
Part 18 of this series of articles discussed how one goes about
designing an IMAGE database. In particular, it discussed the fact that
certain fields are designated as search items. In our order
processing example, the fields containing customer names or account
numbers might well be designated as search items.
The definition of search items is important because of the way IMAGE
locates records. In the 1970s, there was an explosion of DBMS software
in the industry. Many of these products shared a common weakness. If
you asked them to locate a particular record in the database, they had
to search the database to find it. These searches were enormously
resource-intensive, and they could slow your system to a crawl. An
example of this can be found on the HP 3000 in the form of KSAM. When
you retrieve a record from a KSAM file using a key value, KSAM searches
the file in order to find the record that has the key value you're
looking for. By contrast, when you retrieve a record from an IMAGE
database using a key value, IMAGE can usually locate the record without
a search, as long as you're using a search item as your key field.
Figure 1 shows a very simple example of how this might work. We see a detail
dataset (ORDERS) with two search items—CUST-NAME and ACCT-NO—followed by
miscellaneous other fields. Suppose an application program wants to
add or retrieve records from ORDERS corresponding to a particular
customer named GEORGE. Let's investigate what would happen from a
performance perspective.
Let's suppose you've built an IMAGE database like the one shown in Figure 1,
but you haven't added any data to it yet. When you're ready to add the
record with CUST-NAME=GEORGE to the CUSTOMER master dataset, you'll call DBPUT.
DBPUT needs to figure out where to put the new record in the (thus
far empty) dataset. You might think that the common sense thing to do
would be to put it in the first slot. But IMAGE does something that may
seem odd at first, but which turns out to be very ingenious. It calculates
a record number. It takes the key value (GEORGE), passes it through a
mathematical formula, and uses the result to determine exactly where to
place it in the CUSTOMER dataset.
This formula converts the key value into an integer between 1 and
whatever the capacity of the dataset is. For example, suppose that the
CUSTOMER dataset was defined to have a capacity of 100 records. The
logic in DBPUT will convert the string GEORGE into an integer between 1
and 100, and place the GEORGE record in the slot in CUSTOMER with that
number. It's as simple as that.
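To make the idea concrete, here is a toy sketch in Python of hashed record placement. This is emphatically not IMAGE's actual hashing formula; the function names echo the IMAGE intrinsics, but the hashing logic is invented purely to illustrate the key-to-slot scheme the text describes.

```python
# Toy illustration of hashed record placement. This is NOT IMAGE's
# actual hashing formula -- just a sketch of the key-to-slot scheme.
# The function names echo the IMAGE intrinsics; the logic is invented.

CAPACITY = 101  # dataset capacity; a prime number, as recommended

def slot_for(key):
    """Map a key value to a slot number between 1 and CAPACITY."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % (2 ** 32)  # simple polynomial hash
    return (h % CAPACITY) + 1

dataset = {}  # slot number -> record

def dbput(key, record):
    """Place a record at its calculated slot (synonyms ignored here)."""
    slot = slot_for(key)
    dataset[slot] = record
    return slot

def dbget(key):
    """Retrieve a record by recomputing the same slot -- no searching."""
    return dataset.get(slot_for(key))

slot = dbput("GEORGE", {"cust-name": "GEORGE", "acct-no": "1001"})
print("GEORGE hashed to slot", slot)
print(dbget("GEORGE"))  # found directly, without scanning the dataset
```

Note that the lookup repeats the placement arithmetic rather than remembering where the record went; that is the whole trick.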
We aren't going to go into the details of how this algorithm works in
this article. But we are going to talk about why it works so well.
The most important reason is that IMAGE's hashing algorithm returns
unique results for just about any key value you throw at it.
In other words, if you pass the string GEORGE through the hashing
algorithm, you will get a different result than you would if you used
TOM, LARRY, MOE, or SHIRLEY (or Curly). This, at least, is how it's supposed to
work—and it does work that way as long as you're careful. We'll get to
what we mean by "careful" in just a moment. But first, let's see why
that's so important.
Imagine a very large database being accessed online. A request comes
in to access the GEORGE record. IMAGE's hashing algorithm enables it to
go directly to the GEORGE record without having to search for it. It
does this by repeating the logic used by DBPUT. It passes the string
GEORGE through the same mathematical equation that DBPUT uses, and uses
the result to figure out where DBPUT placed it in the CUSTOMER dataset.
Note that IMAGE does not have to search for the GEORGE record (as
KSAM, for example, would have to do). In most cases IMAGE will know
exactly where to find it in the master dataset. This is an enormously
valuable feature. Most DBMSes spend an inordinate amount of time
searching for records. IMAGE is able to go directly to whatever record
you request, as long as you follow the rules. (And again, we'll discuss
"the rules" in just a moment.)
Before we do, note also that the hashing algorithm is only used to
locate records in master datasets. How, then, does IMAGE find records
in detail datasets? Don't forget that every record in every master
contains pointers to the corresponding records in the detail dataset.
In Figure 1, CUSTOMER is a master dataset. The GEORGE record
contains pointers to any GEORGE records that might exist in the ORDERS
detail dataset. No search is necessary to locate the detail records:
the chain pointers in CUSTOMER tell IMAGE exactly where to look.
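The chain idea can be sketched like this. The in-memory layout below is an assumption made for illustration (IMAGE's on-disk structures differ): the master entry holds the head of a chain, and each detail record holds a pointer to the next record on its chain.

```python
# Toy sketch of master-to-detail chain pointers. The layout is
# assumed for illustration; IMAGE's on-disk format differs.

orders = []      # detail dataset: list of order records
customer = {}    # master dataset: key -> chain head (index into orders)

def add_order(cust_name, order_no):
    """DBPUT-style add: link the new detail record onto its chain."""
    orders.append({"cust-name": cust_name, "order-no": order_no,
                   "next": customer.get(cust_name)})  # point at old head
    customer[cust_name] = len(orders) - 1             # new chain head

def orders_for(cust_name):
    """Walk the chain: no searching, just pointer-following."""
    idx, found = customer.get(cust_name), []
    while idx is not None:
        found.append(orders[idx]["order-no"])
        idx = orders[idx]["next"]
    return found

add_order("GEORGE", 101)
add_order("SHIRLEY", 102)
add_order("GEORGE", 103)
print(orders_for("GEORGE"))  # [103, 101] -- newest first on this toy chain
```

Retrieving GEORGE's orders touches only the records on GEORGE's chain, no matter how many other customers' orders sit in the detail dataset.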
Don't Sabotage the Hashing Algorithm
This hashing algorithm is the reason why IMAGE is so fast when used
in an OLTP environment. Unfortunately, there are two things that
customers sometimes do to sabotage the hashing algorithm. So you want
to be careful not to do them. In order for the hashing algorithm to be
effective, you want to make sure that you don't violate these rules:
- The capacity of your master datasets should be a prime number. A
prime number is a number that cannot be calculated by multiplying two
other numbers together (except for the number itself, and 1). For
example, 25,013 is a prime number, because there are only two numbers
that can be multiplied by one another and give 25,013 as a result, and
they are 25,013 and 1. There's a table of large prime numbers in the
TurboIMAGE Database Management System Reference Manual (p/n 30391-90011).
- If your application is adding records to a master dataset, don't
allow the dataset to fill up. As the amount of free space shrinks, it
becomes more and more likely that the hashing algorithm will begin to
generate synonyms.
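The prime-number rule is easy to automate if you don't have the manual's table handy. Here is a small sketch that tests primality by trial division and finds the smallest prime at or above a desired capacity:

```python
def is_prime(n):
    """Trial-division primality test; fine for capacity-sized numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(n):
    """Smallest prime >= n, a candidate master dataset capacity."""
    while not is_prime(n):
        n += 1
    return n

print(is_prime(25013))    # True -- the example from the text
print(next_prime(25000))  # 25013
```

So if your sizing exercise calls for roughly 25,000 master entries, 25,013 is the nearest capacity that satisfies the rule.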
Synonyms are two or more key values that map to the same slot number
in a master dataset. I know that earlier we said that this doesn't
happen—that the hashing algorithm will return a unique result for each
key value. And that's true as long as you follow the rules. One
of the rules is that you maintain enough free space in the dataset
(there are guidelines in the IMAGE manual for how much free space is
enough to avoid synonyms). Since the capacity of the dataset is one of
the factors used by the hashing algorithm, the smaller the capacity, the
greater the likelihood that it will start generating synonyms.
IMAGE handles synonyms without generating errors. Basically, when a
synonym occurs, one of the records is placed at the calculated address
(called the primary), and the other one will be placed in a nearby slot
(called a secondary). A pointer connects the two addresses. If DBFIND
goes looking for the record that's located in the secondary slot, the
hashing algorithm will direct it to the primary address. There it will
find a record, but not the one it's looking for. It will locate the
record in the secondary slot by following the pointer, but in so doing,
it generates overhead that will begin to slow down performance.
Essentially, if you allow your IMAGE database to develop a lot of
synonyms, you're forcing IMAGE to conduct searches for records that it
would otherwise be able to locate very quickly.
The problem of secondaries can become even more severe in the event
that a slot occupied by a secondary happens to correspond to the primary
address of another key. In that case, IMAGE has to move the secondary
to yet another address, making it even harder to find in the event that
somebody else tries to locate it later on. In this case it's called a
"migrating secondary." Databases that contain a lot of migrating
secondaries are prime causes of performance problems.
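The mechanics can be sketched with a toy model. The layout below is invented for illustration (real IMAGE internals differ), but it shows the essential cost: every secondary on a chain adds an extra read to a lookup.

```python
# Toy model of primary/secondary synonym handling. The layout is
# invented for illustration; real IMAGE internals differ.

CAPACITY = 7  # deliberately tiny, so collisions are easy to produce
slots = [None] * (CAPACITY + 1)  # slot 0 unused; records live in 1..CAPACITY

def slot_for(key):
    return (sum(ord(c) for c in key) % CAPACITY) + 1

def put(key):
    target = slot_for(key)
    if slots[target] is None:                    # primary placement
        slots[target] = {"key": key, "next": None}
        return target
    node = target                                # synonym: walk the chain
    while slots[node]["next"] is not None:
        node = slots[node]["next"]
    free = next(i for i in range(1, CAPACITY + 1) if slots[i] is None)
    slots[free] = {"key": key, "next": None}     # secondary placement
    slots[node]["next"] = free                   # pointer links the two
    return free

def get(key):
    """Return (slot, reads): each secondary costs an extra read."""
    node, reads = slot_for(key), 0
    while node is not None and slots[node] is not None:
        reads += 1
        if slots[node]["key"] == key:
            return node, reads
        node = slots[node]["next"]
    return None, reads

put("A"); put("H")   # these two keys collide in this toy hash
print(get("A"))      # primary: found in one read
print(get("H"))      # secondary: found only after following the pointer
```

In a real database those extra reads are disk I/O, which is why a dataset riddled with secondaries gets slow.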
There are many database tools available from HP and from third
parties that will evaluate your IMAGE databases to see if they are being
"clogged up" with secondaries. One of the "classics" is a program
called HowMessy, which is available from Robelle. For more information
on third-party IMAGE database tools, refer to Sally Blackwell's article
on the subject. It can be found on the Web at
www.3kworld.com/content.asp?contid=86.
With these minor exceptions, IMAGE performance is very fast as long
as you're using it as it was designed to be used—locating, reading,
writing, and updating records using search items. Next we'll see what
happens if you ask IMAGE to locate a record without using any search
item at all.
The database shown in Figure 1 has two search items—CUST-NAME
and ACCT-NO. What if you want to search for a record by some other
criteria? What if you want to see all the orders that were placed for a
particular item-number? Of course, we could add an ITEM-MASTER to the
database. (Theoretically, you could make every item a search
item, although that would present its own set of problems.) But if
you're not going to be accessing the data by item number frequently,
there's a more practical approach to be considered.
Some application programs (like HP's QUERY) handle such requests by
reading parts of the database serially. The program QUERY.PUB.SYS was
one of HP's earlier attempts at providing a tool for handling ad hoc
queries, and it's bundled with every HP 3000. As long as you use it to
retrieve records using search items, it's a fairly useful (if somewhat
unfriendly) tool.
But as soon as you ask it to retrieve a record from either a master
or a detail using criteria other than a search item value, QUERY begins
to show its "dark side." It will locate the record you want by reading
every record in the dataset, one at a time, from the beginning to the
end, until it finds what it's looking for. This works, but it's a
performance disaster.
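The cost is easy to see in a sketch (the data below is made up; the point is the read count):

```python
# Sketch of a serial scan -- the kind of read QUERY falls back on when
# the selection criterion is not a search item. The data is made up.

records = [{"order-no": n, "item-no": "ITEM-%d" % (n % 50)}
           for n in range(1, 10001)]          # 10,000 order records

def serial_find(item_no):
    """Read every record, start to finish, keeping the matches."""
    reads, hits = 0, []
    for rec in records:
        reads += 1
        if rec["item-no"] == item_no:
            hits.append(rec)
    return hits, reads

hits, reads = serial_find("ITEM-7")
print(len(hits), "matches found, but", reads, "records read")
```

Compare that with the hashed and chained access described earlier, where IMAGE touches only the records it actually needs.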
These are exactly the kinds of demands that are made by ad hoc
queries. Next month, we're going to start looking at IMAGE/SQL's SQL
interface, which handles them better than QUERY does. Using logic
borrowed from ALLBASE, it is a powerful tool, and it adds a great deal
to IMAGE. But it's important to remember that IMAGE/SQL is first and
foremost an OLTP database. It's not a relational database, and although
it can handle ad hoc queries, it will never handle them as efficiently
as an RDBMS.
George Stachnik works in Hewlett-Packard's Commercial Systems Division in
Cupertino, California. He is responsible for the development and delivery
of e3000 training.