Building A Data Warehouse on the
HP3000:
A Practical Example
Cailean Sherman - Taurus Software
Cheryl Grandy - DISC
Jennie Hou - HP
Taurus Software
1032 Elwell Court
Suite 245
Palo Alto, CA 94303
650-961-1323
cailean@taurus.com
Dynamic Information Systems Corporation
5733 Central Avenue
Boulder, CO 80301
303-444-4000
cag@disc.com
Hewlett-Packard
19447 Pruneridge Avenue
Cupertino, CA 95014
408-447-5971
Jennie_Hou@hp.com
408-447-5649
Alvina_Nishimoto@hp.com
Building A Data Warehouse on the HP3000:
A Practical Example
Table of Contents
WHY A DATA WAREHOUSE OR DATA MART?
WHAT IS A DATA WAREHOUSE OR DATA MART?
Technical goals of the data warehouse
Technical structure of the data warehouse
The data
The data model: Re-architected database - star schema
The environment
BUILDING A DATA WAREHOUSE STEP BY STEP
Analyze the business needs and determine value
Determine the scope of your project
Building a target model
Map the data
Extract operational data and load the target
Optimize queries
End user interface and deployment
Determine the value of the mart
CONCLUSION
This
paper explains the steps involved in building a data warehouse on the HP3000
and how that data can be used for in-depth analysis of a variety of business
parameters. It will discuss the nature of data warehouses and data marts,
the technology used to build them, and their value to an organization. Because
data warehouses can adapt to the needs of any organization, this paper will
detail the process one company went through to decide what they wanted in
their data mart and their step-by-step implementation.
The
literature of data warehousing is studded with sound bites that suggest how
companies use data warehouses and data marts for competitive advantage. No
company in its right mind will give away real secrets, so the examples below
are more illustrative than real.
"
A supermarket found a correlation in its data warehouse
that linked sales of beer and diapers on Friday evenings between 5 and 7 PM.
By sending observers into stores at that time, they found that dads with infants
were doing the shopping. The retailer placed the highest-margin diapers alongside
the beer. Sales and profit on diapers soared.
"
A web retailer
of women. s clothes sold an expensive mail list to a Weight Loss Clinic
for all customers whose dress size increased from the previous order.
"
A Managed
Care Organization found that 31% of its medical costs came from 8% of its
members, and that the MCO was losing money on that group. By focusing on this
segment, the MCO put in place programs to return the group to profit.
HOW TO FIND THE
VALUE IN A DATA MART
The goal of data warehousing is to increase profitability.
Any investment proposal that increases net Earnings Before Interest and Taxes
(EBIT) will attract off-budget and immediate funding if the return is large
and the risk small. Driving up EBIT can be done by increasing revenues,
cutting costs or improving margins, separately or in conjunction.
"
Cutting
Costs
Most companies have cut their costs to the bone by
reducing headcount and automating processes. There is usually very little
fat left that a data warehouse or mart can help, with the exception of Inventory
Management, if that function is unsophisticated.
"
Increasing
revenue and improving margins can be achieved in a number of ways:
1)
Discount analysis: In a direct sales operation, you can carefully monitor which rep gives
what discount percentage to which customers and when. Cutting the overall
rate of discounting to win business by 1% increases sales in a $100 million
company by $1 million. The same amount drops straight to EBIT. If your sales
are larger, or your gain is more than 1%, the ROI for a Billings or Bookings
data warehouse or mart can be staggering.
2)
Customer profitability: You can find out who your best customers are, not
only in sales volume, but also in EBIT, if you can factor in informal estimates
of the costs of servicing the customer. Maybe you are consciously buying market
share by selling to some customers at a loss or at minimal profit. If not, a Sales Analysis warehouse or mart
will help identify your most profitable customers. Your sales force can then
find more customers like them, and you can drop or re-price the less profitable
customers.
Customer lifetime value can reveal
that you should spend most of your effort on your base rather than looking
for new accounts, with profound implications for your sales organization's
productivity.
3) Product profitability: The
same analysis can be applied to products.
Most companies sell a wide range of products to one
customer with a wide range of individual gross margins. If you could satisfy
the customer with a product mix with adequate function, with an average 1%
increase in margin, and do it without
reducing revenues, the 1% and $1 million or more would drop straight to EBIT.
If your customer set and your product set both can
be improved in this fashion, then you can adopt both cross-selling and up-selling.
The product becomes a weapon for targeting your best customers with new product
and higher margin variants of the products they already buy.
ROI is of little relevance unless you can find a senior line
executive who is directly affected by the ROI, and who is willing to go
to bat for the data warehouse or data mart. "Directly affected" means that
the ROI favorably affects their personal Key Performance Indicators (KPIs).
VPs of Sales are keenly interested in anything that promises more sales. VPs
of Manufacturing like a firm schedule that makes high margin products. CFOs
drool over anything that offers a sniff at improving EBIT.
What is a data warehouse or data mart?
Let's go to the source for this definition. In the
words of Ralph Kimball, one of the gurus of
data warehousing, "A data warehouse is a powerful database model that
significantly enhances managers' ability to quickly analyze large, multidimensional
data sets." A data warehouse contains cleansed and organized data that
allows managers to make business decisions based on facts,
not on intuition.
A "data mart" is data collected on a single subject
area, usually extracted from the data warehouse or pulled directly from an
operational system.
Experience has been that companies build many marts,
and all of these may need similar source data. A company may start with a
Billings Mart containing information on sales derived from invoices. Later
they may decide to have a Finance Mart containing information on profit derived
from the same information.
Why should both marts pull this data directly from
the source system? A data warehouse can feed both sets of marts and queries,
thereby reducing redundant work.
To
provide the information users need to increase profitability, you must meet
both of the following goals of the data warehouse or data mart:
·
To
provide easy access to corporate data:
1)
The end user access tools must
be easy to use. The end user access tools should live on the PC in an environment familiar
to the end user. They must have a short learning curve and be highly intuitive.
If you have high level managers asking questions of the data warehouse, they
must have a simple way to interact with it.
2)
The access should be graphic,
so it is easy to spot trends. People have been using Microsoft Excel® for years.
Spotting trends in a columnar report listing 500 customers' sales for the
quarter is a very tedious process. However, if you take the same list and
chart it graphically, it is easy for the human eye to spot the winners and
losers.
3)
The access should be manageable
by end users. MIS is no longer the end user of the tools that access the data. End
users are analysts, mid-level managers, even in some cases, high level managers.
They must be able to easily get answers to their questions, and ask new questions,
all without getting MIS involved.
4)
The process must be fast.
Questions leapfrog, so you have
to get answers fast. The very nature of data analysis is that not all requests
are known beforehand, so there's a lot of unpredictable, ad hoc inquiry.
In an analytical environment, the end user is directly interacting
with the data, rather than looking at a printed report. The answers have to
be delivered fast, before users lose their train of thought. If you lose the
interest of a user, they won't come back!
·
To provide data for analysis
that is clean and reliable
1)
For consistent analysis, the
environment must be stable. Now that users can create their
own reports, they must have consistent data. One department doing an analysis must get the same results as any other.
You can't have two managers showing up at a meeting with different
sales figures for last month!
2)
Source conflicts must be resolved. Transactional (production) systems often have
data stored in several different applications. Customer addresses, for example, may be in the invoicing system,
as well as the sales order system. Chase
Manhattan Bank may be stored as "Chase" in one system and "Chase
Manhattan Bank" in another. A straight load of data for this customer into the data warehouse
would result in two customers being represented . and with invoices being
listed with no supporting sales orders and visa versa. These conflicts in
the source data need to be resolved and merged as they move into the analytical
environment.
3)
Historical analysis must be
possible. There are two aspects to historical data that
must be considered. The amount of
historical data in the warehouse has to be sufficient to satisfy the need
to spot trends over time. To see the
growth in sales of a product over 3 years, perhaps to help predict sales of
future products, you need three years of data.
The second historical consideration is the ability to see the state
of affairs at a historical point in time.
Take, for example, a situation where you want to see what your inventory
amounts have been over the last year to help determine if you can carry less
inventory and still satisfy demand. You
may need to be able to see what your inventory amount was for product X on
the 3rd, 4th, and 5th of July. As data is added to the data warehouse, it
needs to be added and date stamped - not updated like it would be in the transactional
inventory system.
4)
Performance of operational
systems must not be impaired. It's a given
that the data warehouse will run on a separate, dedicated system to avoid
conflicts with the performance needs of the operational systems. Extracting
production data to populate the data warehouse will have some impact on the
operational system. Consideration
must be given to whether full loads of data are practical, or whether incremental
loads are needed, using only changes since the last load.
Structure of the data warehouse
To
create a data warehouse, there are several technical requirements involved.
Creating data to be analyzed requires
that the data be subject-oriented, integrated, time referenced and non-volatile.
Making sure that the data can be accessed quickly and can meet the ad hoc
queries that users need requires that the data be organized in a new database
design, the star or multidimensional data model. And there are a whole new set of systems and tools required
to build and maintain this new data environment.
Data warehouses group data by subject rather than by
activity. Transactional systems are organized around activities - claim processing,
ordering product, shipping product. Data organized by transaction cannot answer
questions asked by subject, such as "how many shoes did we sell in Europe
this year?" This request would require heavy searching of sales orders
and customer address records. If the transactional system does not have a
key or index on the country in the address record, the query would most likely
be very slow.
Data used for analysis needs to be organized around subjects: customers, sales,
and products. Because of the design that will be used for the data warehouse
or data mart, if you ask for the sales in Europe, you are able to search through
very few records to get the answer to your question.
Integrated refers to de-duplicating data and merging
it from many sources into one consistent location. If you are going to find
your top 20 customers, you must know that HP and Hewlett Packard are the same
company. There must be one customer number for any form of Hewlett-Packard
in your data warehouse or mart. You can't have one from the sales system that
is alphanumeric, and one from the shipping system that is a concatenation
of their company name and phone number.
Much of the transformation and loading work that goes
into the data warehouse or mart occurs here, integrating data and standardizing
it. There are often problems found in the source data that need to be corrected
as well, such as absence of data.
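As a small illustration of this integration step (a generic Python sketch, not the tooling this paper's project used, and with invented table and field names), variant customer names from two source feeds can be collapsed to one canonical warehouse customer:

# Illustrative only: collapse name variants from two source systems
# into one warehouse customer record keyed by a canonical name.

# Hypothetical alias table, normally maintained by the warehouse team.
ALIASES = {
    "HP": "HEWLETT-PACKARD",
    "HEWLETT PACKARD": "HEWLETT-PACKARD",
    "CHASE": "CHASE MANHATTAN BANK",
}

def standardize(name):
    """Upshift, trim, and map known aliases to the canonical name."""
    cleaned = " ".join(name.upper().split())
    return ALIASES.get(cleaned, cleaned)

def integrate(sales_rows, shipping_rows):
    """Merge customers from both sources under one canonical name."""
    customers = {}
    for row in list(sales_rows) + list(shipping_rows):
        canonical = standardize(row["cust_name"])
        customers.setdefault(canonical, {"sources": []})
        customers[canonical]["sources"].append(row["source"])
    return customers

sales = [{"cust_name": "HP", "source": "sales"}]
shipping = [{"cust_name": "Hewlett Packard ", "source": "shipping"}]
print(integrate(sales, shipping))  # one customer, two source references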
The data warehouse user typically asks
questions about prior states of being of the data. For example: "What was
my backlog this month last year vs. this year?" or "How many refrigerator
units were in my service department this time two years ago?" To answer
these questions, you need to know what the order status was at a point in
time last year, or what a service inventory count was at a historic point
in time.
Data warehouses handle this
technically by time stamping data as it is added to the warehouse and by continuing
to append data to the warehouse without replacing or updating existing information.
Many snapshots of data are moved into the warehouse, one on top of another.
In this way, you can accurately see the inventory count at a certain time, or the backorders outstanding on a specific day. This allows us to answer questions that compare past performance to the present, providing the ability to spot trends in the data and to see if these trends are consistent across different product lines. That makes it possible to better forecast future performance with facts rather than intuition.
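The append-and-date-stamp idea can be sketched in a few lines of Python (illustrative only; the field names and dates are invented):

from datetime import date

# Append-only snapshot store: one row per product per load date.
inventory_snapshots = []

def load_snapshot(snapshot_date, levels):
    """Append today's inventory levels; never update earlier rows."""
    for product, qty in levels.items():
        inventory_snapshots.append(
            {"snap_date": snapshot_date, "product": product, "on_hand": qty}
        )

def on_hand_as_of(product, as_of):
    """Return the most recent snapshot for a product on or before a date."""
    rows = [r for r in inventory_snapshots
            if r["product"] == product and r["snap_date"] <= as_of]
    return max(rows, key=lambda r: r["snap_date"])["on_hand"] if rows else None

load_snapshot(date(1998, 7, 3), {"X": 120})
load_snapshot(date(1998, 7, 4), {"X": 95})
load_snapshot(date(1998, 7, 5), {"X": 140})
print(on_hand_as_of("X", date(1998, 7, 4)))  # 95, the level on July 4th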
The data warehouse contents are typically not modified
or updated. There are several reasons for this:
First, there really is no reason to update a data warehouse.
As mentioned, it is a snapshot in time. You would not want to go back and
change history. If a data element is bad in the warehouse, you don't as a
rule want to make changes to your company's data in the warehouse environment.
You typically want to change it in the source (transactional) system
or in the load process. In that way, when the next batch of data gets loaded,
the change continues to be implemented, and the history of the change is recorded.
Second, the data warehouse model is designed for queries
and bulk loads; updates can be very expensive. Data warehouse models often
replicate data in many places (de-normalize the data). An example of this
would be a Health Insurance Claim Data Warehouse. Each insured member belongs
to an insurance plan and each plan belongs to a group. You may set up a table
listing each Plan and the Group it belongs to. A Group is now repeated several times in the database, once for each
Plan that belongs to it. This design is used to make it easy to query via
plan or group. If, however, you need to change a group name, you have to update
many locations, and that can be very costly.
Third, data warehouse environments are large by nature.
They contain historical data, summarized data and very detailed data.
Typically these environments take up gigabytes of space. Updating gigabytes of data can be very costly.
Fourth, the data warehouse
environment is heavily indexed. To
allow for the slicing and dicing of data that an analyst requires, many indexes
are placed on the data. To update these environments is very CPU intensive.
In short, data should simply
be loaded, then accessed, never updated.
Transactional systems are designed
to push one transaction at a time through the system at high speed.
The database organization is designed for processing efficiency, with
minimum redundancy. This "normalized" database organization is
complex and very inefficient for data analysis purposes.
The data warehouse adopts a
data model called a star or snowflake schema that maximizes the performance of queries. The star consists of one large central
fact table (data set) in the center of the star. It is the only table (data
set) in the schema with multiple joins (paths), each connecting it to other
satellite tables (data sets) called "dimension" tables.
The central fact table of a
star schema is where the "numerical measurements of the business are
stored. Each of these measurements is taken at the intersection of all the
dimensions." (Ralph Kimball)
The fact table is linked to
several dimension tables that contain data describing more detail about the
fact they are linked to. These dimensions are used to query quickly into the
fact table via keys; the fields in these dimensions are your query data points.
The data in this organization
is deliberately designed for redundancy so that users will find it easy to
create queries and they can run very fast, with minimal linking between tables.
If a database is going to be
used for analysis, it should be designed for analysis - hence the star schema.
See the following graphic for an example of a star schema.
The fact table is INVOICE_LINES. The
rest of the tables are dimensions.
You can see from the diagram
above that it is easy to find sales using various combinations of criteria
such as a particular product for a calendar month. You can select all of the
invoices for that calendar month then easily find the parts you are inquiring
about. What could have been a large serial pass to read through millions of
records in a transactional system turns into a finely tuned pass through a
fraction of the records using the star schema data model.
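To make the query pattern concrete, here is a simplified Python sketch (not the client's actual schema or tools): the small dimension tables are filtered first, and only the matching fact rows are touched:

# Simplified star schema held in memory; a real mart uses a DBMS, of course.
products = {1: {"part_no": "A-100", "family": "Widgets"},
            2: {"part_no": "B-200", "family": "Gadgets"}}
periods  = {10: {"month": "1999-05"}, 11: {"month": "1999-06"}}

invoice_lines = [  # fact rows carry only keys and numeric measures
    {"product_id": 1, "period_id": 10, "qty": 5, "billed_amt": 500.0},
    {"product_id": 2, "period_id": 10, "qty": 2, "billed_amt": 150.0},
    {"product_id": 1, "period_id": 11, "qty": 7, "billed_amt": 700.0},
]

def sales_for(family, month):
    """Resolve dimension criteria to keys, then scan only matching facts."""
    prod_keys = {k for k, p in products.items() if p["family"] == family}
    per_keys  = {k for k, p in periods.items() if p["month"] == month}
    return sum(f["billed_amt"] for f in invoice_lines
               if f["product_id"] in prod_keys and f["period_id"] in per_keys)

print(sales_for("Widgets", "1999-05"))  # 500.0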
Relational databases may be
used to house these star schemas, and until recently, these databases were
the only environments accessible by the popular end-user tools that did the
analysis needed for decision support. With DISC's introduction of specialized
OMNIDEX indexes for Data Warehousing, IMAGE/SQL is now an advantageous option
as a data warehouse repository. In
fact the capabilities of an IMAGE database enhanced with OMNIDEX go far beyond
standard relational database technology.
To satisfy the goals of the
data warehouse and to accommodate the new data structures, several types of
tools are used today to populate the data warehouse, house the data warehouse,
index the data for query optimization, and allow fast user access.
The diagram above shows the
different technologies used. Reading from right to left there are Extraction processes to move data out
of the transactional systems and into the warehouse. There is the database management system that houses
the warehouse. There are indexes added to the database to optimize the queries. And there are the user access tools.
There are two ways to populate a data warehouse. You
can use custom programs written in a language like Cobol. Or you can use tools
built specifically to move data.
Custom programs have some advantages: You may have
a staff of programmers who are familiar with a language. Or you may already
have many processes written that you can leverage to extract data from your
transactional systems.
The disadvantage of custom programs is the maintenance
cost. Data warehouses are constantly changing, and if you are using a third
generation language to populate the warehouse, it will need to be re-compiled
each time you make a change. Each change made to the mapping and transformation
similarly takes more time to implement in a third-generation language than in a tool written
specifically to transform data.
ETL tools used to populate a data warehouse must be
able to perform the following functions:
1) Accept data from many sources
2) Move data to many targets
3) Handle any needed data transformations
4) Allow inserts, deletions or updates of any data
5) Move data in batch
6) Perform fast loads
7) Handle any network in the mix
8) Keep track of mapping (metadata)
9) Optionally, detect changes and incrementally update the warehouse
Sources and targets must not be limited to databases;
they must also include external input that may arrive via files. The sources
and targets must all be accessible for inserts, updates and deletes if an
ETL tool is going to handle all of the data manipulation you will require.
Moving data can be done all at once or incrementally.
Take, for example, a data warehouse that is refreshed each night with new
information. You can drop the existing data and reload all of your
history (drop and load). The advantage to this method is that the logic is
simple. The disadvantage is that if you have much data it can be too time
consuming. It can take more time than the batch window available to populate
the target.
Change
oriented extraction grabs only the data elements that
have been added, updated or deleted since the last refresh of the data warehouse.
The advantage to this method is that the volume of data is significantly less
than a full refresh. The disadvantage is that you have to find a way to
make sure you track all changes. Using a date stamp on a record only traps the
inserts and updates to that table or dataset. Deletes are lost. You must either
use "triggers" in a relational
database or find a way to detect change in a non-relational database to ensure
that you retrieve all changes to the data, including any deletes.
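A minimal Python sketch of change-oriented extraction, with invented field names, shows the caveat: a last-extract timestamp catches inserts and updates, while deletes must come from a trigger or change log because they no longer appear in the source:

from datetime import datetime

def extract_changes(source_rows, deleted_keys, last_extract):
    """Pick up rows touched since the last run, plus logged deletes."""
    changed = [r for r in source_rows if r["last_updated"] > last_extract]
    # Deletes never show up in source_rows; they must be captured elsewhere
    # (database triggers, a change log, or a change-detect tool).
    deletes = [{"cust_no": k, "action": "delete"} for k in deleted_keys]
    return changed, deletes

source = [
    {"cust_no": 1, "last_updated": datetime(1999, 6, 1, 22, 0)},
    {"cust_no": 2, "last_updated": datetime(1999, 6, 2, 9, 30)},
]
changed, deletes = extract_changes(source, deleted_keys=[7],
                                   last_extract=datetime(1999, 6, 2, 0, 0))
print(len(changed), len(deletes))  # 1 changed row, 1 logged delete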
Data transformations required in populating a data
warehouse need to be as robust and encompassing as third generation languages.
Data needs to be cleansed, integrated, follow rules, be standardized, manipulated
logically and physically and flagged when incomplete or in error. Problems
in the data are typically flagged and corrected early on in the process. Not
only is the data transformation process a means to reorganize the data, it
is a QA process.
The ETL must be able to take advantage of specific
DBMS features to facilitate fast loads of large amounts of data. It should
be able to move data across the network without requiring an additional software
layer that can slow performance.
There are two types of databases typically used to
house data warehouses: relational databases and multidimensional databases.
Relational databases are familiar to most of us: Oracle, DB2, and SQL Server.
When using a Star schema model, these databases can provide accessibility
to the data warehouse required by the users. Over the last few years, these
environments have been enhanced by their vendors to improve multi-table queries.
Multidimensional databases were developed to accommodate
more ad-hoc queries. They handle unpredictable access and analysis of the
data they contain, allowing the data to be examined in many ways with optimal performance. The disadvantage of multidimensional
databases is that they cannot handle more than 5 or 6 dimensions before the
build and refresh times become unmanageable. Multidimensional databases do not support updates,
take a long time to load, do not handle as much data as standard relational
databases (typically only 50 GB), and do not allow data summaries at levels
that are not pre-defined.
Many users choose to use both relational and multidimensional
databases in their data warehouse architectures. Often the foundation or data
warehouse will use a relational database, while the data mart or single subject
area will be populated in a multidimensional (Cube) model.
There are also tools available which can enhance databases
(including IMAGE/SQL) with speed and capabilities that go beyond relational
and multidimensional databases, while eliminating their limitations.
As an example, OMNIDEX for Data Warehousing provides unlimited multidimensional
analysis across any number of dimensions.
OMNIDEX for Data Warehousing also eliminates the need
for summary tables in relational databases, replacing them with high-speed
dynamic data summaries. Plus it provides
fast access to the underlying detail data.
There are many kinds of questions
you can ask of your corporate data warehouse. These seem to fall into three
categories of access methods and tools:
Reporting tools
provide pre-defined formats and content of data either to the desk top or
printed reports. Some of these Report Writers come from a production reporting
background, others are newer, web based offerings. Typically, queries are pre-written and executed by users with parameters
or templates. Products that fall into this category are Cognos, Actuate, SQR
and Microsoft Access.
Online Analytical
Processing (OLAP)
tools allow you to perform multidimensional analysis of the data warehouse.
Typically, OLAP lets you answer not only "what happened"
questions that reporting tools provide, but also "why?".
To find out why your sales for May were down 50% from last year, you
need to learn more about what comprises that total sales amount you see on
your screen. If you can drill down into the products sold -- where they were
sold, what customers bought them, what sales reps sold them -- you may be able to find out why your sales are down and hopefully
change this. These features (drill down, drill up, and pivoting) are found in OLAP
products.
Arbor, Business Objects, Brio, Microstrategy, Information Advantage, Microsoft and Cognos all offer OLAP tools.
Data mining tools
use artificial
intelligence to predict what may happen in the future. They use complex statistical
analysis to look at the data in your data warehouse, spot trends and make
an educated guess at what will happen next. Examples of data mining tool companies
are Datamind, NHC Software, and IBM.
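As a toy illustration of the drill-down idea described under OLAP above (a generic Python sketch, not a representation of any of these products), sales are first totaled by product and then broken out by customer for one product:

from collections import defaultdict

# Minimal fact rows; an OLAP tool works against the mart in the same spirit.
facts = [
    {"product": "A-100", "customer": "Acme", "sales": 400.0},
    {"product": "A-100", "customer": "Baker", "sales": 100.0},
    {"product": "B-200", "customer": "Acme", "sales": 250.0},
]

def rollup(rows, by):
    """Summarize sales along one dimension."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[by]] += r["sales"]
    return dict(totals)

print(rollup(facts, "product"))                      # the "what happened" view
drill = [r for r in facts if r["product"] == "A-100"]
print(rollup(drill, "customer"))                     # drill down to ask "why"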
Building a data mart/data warehouse step by step
Briefly stated, the steps involved in building a data warehouse
or data mart are these:
1) Analyze the business need and determine value.
2) Pick the scope of initial deployment.
3) Build the target model.
4) Map data from production to the mart.
5) Extract operational and external data and load the mart.
6) Optimize queries into the mart.
7) Deploy the end user interface.
8) Evaluate the value of the mart to the end user.
Our Case Study
This section will use the case
study of a company that distributes a range of imported electronic components
throughout North America. The sales channel has both direct reps and distributors.
We initially started working
with this company because they thought that they wanted to create a consolidated
reporting environment which merged information from several systems into one
central database. They initially were
interested in allowing users access to their sales data for ad hoc queries.
Much of their in-house expertise
is on the HP3000, so they elected to put a data mart on an HP3000.
DecisionVault, a data warehousing technology bundle specifically designed
for HP3000 customers, was selected to build and populate the data mart.
DecisionVault is comprised
of BridgeWare from Taurus and Quest, OMNIDEX from DISC, BrioQuery from Brio
Technology, and the HP3000 with IMAGE/SQL.
It allowed the customer to create a high-performance, very cost-competitive
data warehouse environment on an HP3000.
We first met with our client's CIO and MIS staff.
They initially were interested in a consolidated reporting environment
to satisfy many of the reporting requests coming from users.
They estimated that they had two people working full time writing reports
and were hoping to free up these resources.
The goal of the needs analysis for a data warehouse
or mart is to get executive sponsorship, and a clear definition of measurable
business issues which can be addressed.
It is critically important for MIS to work with their
users and find the pains they are facing. This accomplishes three things:
1) MIS knows what the users want.
2) MIS has user buy-in on the project.
3) MIS will be able to get user funding without
having to use the MIS budget.
Finding Pain:
A fast way to failure in a
data warehousing project is to ask the users for a list of the data items they
want to see on reports. Business issues
or Pains that VPs face are what is needed.
If we can't find a business issue, let's spend the time and money on
something else. While IT or users
may point to areas of Pain, the only way to specify and value Pain is to uncover
it personally with the top executive team.
Find an executive
Sponsor. Without Pain you won't find
an Executive Sponsor. Even with Pain,
you need a firm commitment from one of the three or four top executives in
the company (CEO, CFO, VP Sales, VP Operations, VP Manufacturing).
The main reason
for failure in Data Warehousing is lack of one or both of these key requirements!
Conducting interviews
with users to uncover pains. Only
after you have formal support from an executive should you go to the users.
Concentrate on the direct reports of your Executive
Sponsor and their staffs. Ask these
interviewees whom they interact with in other functions on the Pain, and go
to interview those people with a view to uncovering new Pain that you can
raise with another VP.
The Pain and Interview process is a skilled task.
Do not attempt to do this without training or at least without having read
deeply into Kimball's material on the subject.
Where you can, use an experienced business consultant to help you.
In our example company, the VP of Sales was directly
paid under a management by objective (MBO) scheme. His targets included not only the traditional Sales measures, but
also Gross Margin on Product. He couldn't measure
his progress accurately from any data on hand, so he couldn't be sure
he was claiming all of his due compensation.
Personal pain like this can be a strong motivator despite
the VP's reluctance to articulate the dollar impact. Underlying his personal
issue was the knowledge that he was largely in the dark about who was selling
what, particularly sales to large customers who could buy, either from his
salespeople or from his distributors, with widely varying margin contributions.
In addition, the CFO had suspected that gross product
margins varied widely. From a prototype built on billing data, he saw spreads
on functionally similar products of up to 20%. He agreed immediately that
the average margin could be improved by 5% with appropriate knowledge and
training. The 5% translated to $5 million in EBIT, which would give an ROI of twenty times the cost of the initial data mart.
No matter how much ROI you uncover, be sure to start with one data mart, with one functional area, say Billings or Bookings. Start with the simplest, get the mart operational and prove your ROI before you even start the next mart. Small bites of a big meal prevent indigestion.
Start with the end in mind, so lay down your first data mart on a foundation that will later support other data marts. Don't extract directly from production to the mart, but by way of a staging area or central data warehouse that other data marts can later draw on.
Begin by listing the potential marts. The process of uncovering an ROI will also uncover the functional areas that merit a data mart, as well as the priority. In the example company, a number of areas promised some ROI: Billings, Bookings, Backlog, Inventory and Finances. All of these are candidate marts that will make up a company-wide data warehouse in time.
For each mart, you need to discover what elements are
most important for analysis. This is done by interviewing the users. You will
probably find that the CFO is interested in sales and gross margins by product
and by all the descriptors that can apply to a product, including product
family -- line, group, subcategory, color, size, supplier and so on. Most
products have upwards of 100 descriptors that can reveal areas for margin
improvement.
In dimensional modeling, you call the areas of analysis,
such as customer and product, the DIMENSIONs. The descriptors are called attributes,
and they become fields in the dimension tables (data sets). Even though you
select one data mart to start, you define the dimensions across all the marts
at the outset because you want CONFORMED dimensions that all the marts can
share. For example, you will need the same product dimension for a billings
mart as you will later want for a bookings mart and for a backlog mart.
Listing the marts will also uncover the facts or numeric
measures that underpin the ROI you are seeking. Almost invariably, facts are
numeric and not textual. Text indicators almost certainly belong in a dimension
as an attribute.
In the case of the CFO, gross margin down to the part
number will be required. You need to discover how to compute that margin.
Is it list price minus cost, or invoiced price less cost? Is the cost acquisition
cost, or that cost plus carrying costs, or are you adding customer service
costs by product, or should you include freight and insurance?
For our project, numeric indicators were chosen for
Billings; these items are what we now call FACTS, and they are represented in a FACT
table. Cost of Goods was also defined as acquisition cost.
Billings was the chosen area because it offered the
most ROI in a relatively simple area of the business. You can also be sure
that the data is reasonably clean since customers generally complain if their
invoices are inaccurate.
Building
a target model can be broken into four steps which we will cover in more
detail.
·
Selecting the grain for the fact table
·
Selecting the dimensions
·
Selecting the individual facts for the fact table
·
Completing the design and implementing it
The
"grain" or "granularity" is the lowest level of detail at which facts are
held.
Our VP of Sales was very interested in sales by customer by fiscal quarter, since he gets paid that way. But if your grain were that coarse, by customer by quarter, you would never be able to see analyses by sales person, by product, by week, or by invoice number.
You
generally want to hold facts at the deepest level of detail that is possible
within the limitations of the source systems providing the data. From that base
level, you can easily provide aggregates, perhaps by week, month, quarter and
year.
In our
project's Billings Mart, the details found on the individual lines of each
invoice, supplemented from the invoice header, were chosen to provide the finest
granularity and therefore the greatest flexibility in analysis.
When you
determine the scope of the project in the section above, you list all of the
attributes by which you think you will want to perform analysis, such as
customer, geographic region, time, and product descriptions. These are the
dimensions or points on your star schema.
Now you
select the subset of dimensions that is meaningful in describing and amplifying
the meaning of your selected Fact Table.
For our
Invoice Lines Fact Table, our client required dimension tables representing
Customers, Products, Sales People, Addresses and Time.
You saw
above some examples of the attributes for the Product dimension. Time in the
form of a Periods table allows every individual day that the Billings Mart may
cover to be described, so you can analyze billings by those attributes, down to
the day. Each
day clearly falls into a given calendar year, quarter, month, and week. If
fiscal year does not match calendar year, add fiscal year, quarter, month, and
week. You can also have attributes for a day that describe a season, say Back to
School or Easter. You could have an attribute for an annual shutdown, to
explain why nothing happened in a given week.
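A Periods table of this kind is straightforward to generate. The Python sketch below is illustrative only and assumes, for the sake of example, a fiscal year beginning in October:

from datetime import date, timedelta

def build_periods(start, end, fiscal_start_month=10):
    """One row per day with calendar and fiscal attributes."""
    rows, day = [], start
    while day <= end:
        fiscal_year = day.year + (1 if day.month >= fiscal_start_month else 0)
        fiscal_month = (day.month - fiscal_start_month) % 12 + 1
        rows.append({
            "period_id": day.toordinal(),        # surrogate key for the day
            "cal_date": day,
            "cal_year": day.year,
            "cal_quarter": (day.month - 1) // 3 + 1,
            "cal_month": day.month,
            "cal_week": day.isocalendar()[1],
            "fiscal_year": fiscal_year,
            "fiscal_quarter": (fiscal_month - 1) // 3 + 1,
        })
        day += timedelta(days=1)
    return rows

periods = build_periods(date(1999, 1, 1), date(1999, 12, 31))
print(len(periods), periods[0]["cal_quarter"], periods[0]["fiscal_quarter"])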
Select the individual facts for the Fact
Table
The Fact
Table will contain pointers to the Dimensions that were selected. In this case,
an Invoice Header tells you what customer was billed, at what billing address,
and who took the order. These pointers (foreign keys in database talk) point to
the rows (records) in the Customer, Address and Sales People Dimension Tables.
These foreign keys were put in the detailed row (record) in the Fact Table. A
detailed fact row (record) for each invoice number and invoice line number was
the chosen grain for this data mart.
You now
need to add the numeric indicators, the individual facts for each detailed fact
row. You know that quantity of product billed and the individual monetary
amounts on an invoice line will always sum up correctly to the invoice level or
the month level, so those facts are fully additive, and you can insert them in
your row design. Our billings mart allowed for the quantity, the list price, and
the billed price.
Note
that some
numeric indicators, such as levels or balances, are not fully additive across
all dimensions. For example, adding a percentage discount as a fact in our
invoice line would be meaningless when we summarize. Averages and percentages
are not additive at all and shouldn't be facts. You can re-compute ratios
whenever you need them, so we didn't include any of these measures.
You can
take numeric data as you find it in the invoicing system, such as quantity and
billed amount, and you can add computed facts if you want. In this case, you
might find it useful to compute the difference between list price and billed
price as the Discount Given amount, and the difference between the billed price
and Cost of Goods Sold as the Gross Profit amount. Storing these
amounts in the Fact record saves the analysis programs the effort of computing them
on the fly.
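A small Python sketch (with invented line items) shows why the stored amounts work and a stored percentage would not: Discount Given and Gross Profit sum correctly to any level, while the margin percentage has to be recomputed from the summed amounts:

lines = [  # quantity and money amounts per invoice line (illustrative)
    {"qty": 10, "list_price": 12.0, "billed_price": 10.0, "cogs": 6.0},
    {"qty": 5,  "list_price": 30.0, "billed_price": 27.0, "cogs": 15.0},
]

for l in lines:
    # Fully additive computed facts, stored on the fact row.
    l["discount_given"] = (l["list_price"] - l["billed_price"]) * l["qty"]
    l["gross_profit"] = (l["billed_price"] - l["cogs"]) * l["qty"]

total_billed = sum(l["billed_price"] * l["qty"] for l in lines)
total_profit = sum(l["gross_profit"] for l in lines)

# Ratios are NOT additive; recompute them from the summed amounts rather
# than averaging per-line percentages.
margin_pct = 100.0 * total_profit / total_billed
print(total_profit, round(margin_pct, 1))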
Our
design now looked like this (a modeling tool such as Erwin can be used to help visualize the
model).
From
here, we implemented surrogate keys instead of the actual customer and product
numbers to link the tables. As you can imagine, there are many reasons not to
use the "real" numbers that identify customers and products, or any of the
foreign keys that point to the dimension tables. You may re-use those key values
in the transactional system while the number still points to old data in the
data mart.
You may
add a division to the data mart that has a customer under a different "real"
number. Or, say you have acquired a company for its customers, and you want to
count the sales by both companies under one customer's name.
You may
want to have two rows of dimension data that describe different attributes for
different periods. Say your rep sells a new customer. In the data mart, you
assign that sale to that rep, and his team, group and division. Two years later,
you re-assign the customer to another rep and division. If you just change the
one existing row, that customer will disappear from the original rep's sales
numbers and appear in the new rep's numbers, even though they might not have
worked for the company at that time.
For disk
space and speed of join processing, it's more efficient to use small integers
for your keys rather than the long text strings that the "real"
numbers may have.
To
neatly handle these situations, use surrogate keys that are artificial and never
seen by the user. This adds some complexity each time you want to add a new
dimension row, say to handle the sales rep problem above. It is well worth the
effort. You will want to implement surrogate keys for all the dimension tables,
including the Time dimension.
CUST_ID
and CUST_ADDR_ID, listed above, are examples of surrogate keys.
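The idea behind surrogate keys can be sketched in a few lines of Python (illustrative only; the load script shown later implements the same idea with key tables in the data mart):

# Surrogate key table: (source_system, natural_key) -> small integer id.
key_table = {}
next_id = 1

def surrogate_key(source_system, natural_key):
    """Return the existing surrogate id, or assign the next one."""
    global next_id
    k = (source_system, natural_key)
    if k not in key_table:
        key_table[k] = next_id
        next_id += 1
    return key_table[k]

# Two systems can reuse the same "real" customer number without colliding,
# and a renumbered customer can still be mapped back to its original id.
print(surrogate_key("SALES", "C-1001"))     # 1
print(surrogate_key("SHIPPING", "C-1001"))  # 2, a different customer
print(surrogate_key("SALES", "C-1001"))     # 1 again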
The last
step is to physically build the empty files for your Fact and Dimension Tables,
so that you can load and test them with data. In contrast to designing a data
warehouse using a relational database, we did not need to design and build
summary tables to aggregate the data at this point, since our client used
OMNIDEX as part of DecisionVault. OMNIDEX will provide the data summaries at
high speed using its own summary indexes.
Now that
we had built the target mart schema it was time to populate it.
The
first step of every data movement project is to develop a mapping document. The
mapping document details where the data is coming from; the transformations, editing
rules, and lookups needed; and what needs to happen if an error occurs. This
document becomes the specification for the extraction routines.
The
mapping document is done after the data model for the data mart or warehouse is
complete. For the purposes of illustration, let's look at a simplified version
of our data model. In this version, we have only a single dimension table and
the fact table.
Loading
of the data mart begins with the dimensions, so we started there.
Working
with the data analyst most familiar with the source data, we found a source for
each of the target fields. To develop the transformation rules, it is sometimes
helpful to print the data for this dimension, e.g. a list of all customers.
Normally seeing sample data will help identify any problem areas, e.g. customers
without addresses, etc.
Together
with the data analyst, we developed a minimum criterion for the acceptance of a
dimension record. Our client wanted to accept all customers. Don't worry;
moving all the data over to the data mart is a self-correcting problem. As
executive management begins reviewing the data and finds junk, they will either
impose "their" rules or demand that the source data be corrected. Either way,
the data will get "cleaned" up.
Our
project's mapping document was developed in Excel. You could use any tool that
works for you. Excel is simple to use as it is easy to "paste" sample data onto a
worksheet, and it is easy to email the documents between the parties involved in
developing the data warehouse. Below is our mapping document for the customer
dimension.
Target Table: CUSTOMERS

Source Field          | Transformation Rule                                   | Target Field
----------------------|-------------------------------------------------------|--------------
cust.cust_name        |                                                       | cust_name
parent.parent_id      | Check to see if parent already exists; if not, add    | parent_id
parent.user_cust_no   |                                                       | parent_nbr
parent.cust_name      |                                                       | parent_name
no source             | constant: SRCAPPL                                     | source_system
no source             | constant: current time/date                           | last_updated
cust_seg.active_flag  | Retrieve from cust_seg using cust_no and seg_no = 317 | active_flag
cust_commcn.area_cd   | Retrieve from cust_commcn using cust_no               | area_nbr
cust_commcn.phone_no  | Retrieve from cust_commcn using cust_no               | phone_no
cust.cust_type_cd     |                                                       | cust_type
cust.duns_no          |                                                       | duns_no
cust.duns_rtng        |                                                       | duns_rating
cust.fed_tax_id       |                                                       | tax_id
cust.sic_cd           |                                                       | sic_code
You can see from this document that it took four source tables to
provide enough data to populate the dimension table. For those fields that
required data from other tables, we made sure to include information on how to
make those linkages from our primary source table (cust) to the other source tables (parent, cust_seg, and cust_commcn).
If there is decoding involved, either create a lookup table
or provide another sheet that details the translation, e.g. this source code
gets this description. The mapping document should be complete enough that an
outside programmer could write code from the document.
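Where decoding is involved, the lookup can be as simple as the following Python sketch; the codes and descriptions are invented for illustration:

# Hypothetical decode table taken from the mapping document's lookup sheet.
CUST_TYPE_DECODE = {
    "D": "Distributor",
    "O": "OEM",
    "R": "Retail",
}

def decode_cust_type(code):
    """Translate a source code to its description; flag unknown codes."""
    return CUST_TYPE_DECODE.get(code, "UNKNOWN (%s)" % code)

print(decode_cust_type("D"))   # Distributor
print(decode_cust_type("Z"))   # UNKNOWN (Z), flagged for data cleanup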
You should write the mapping documents for all of the
dimensions first and then move to the Fact Table(s). The Fact Table mapping will
assume that all of the dimensions exist in the data model before the fact is
added. In other words, you shouldn't have to check if the customer exists,
because there shouldn't be any invoices added that do not have a customer. If
your source system doesn't check this referential integrity, either as part of
the application or as part of the database structure, you will need to include
the rules needed to ensure that the dimensional data is in place prior to
writing a fact row. The mapping document is a "working" document that will
change as you discover different aspects of the data, both before and during the
load process.
Once the
mapping document for a particular dimension is complete, it is time to begin the
development of the extraction routines. All data marts begin with an initial
load that loads the entire applicable contents of the transactional source
application into the data mart. This is a bulk load process that starts with the
dimensions and then loads the fact tables.
As with
most initial loads for data marts, we started with the dimensions first. The customer data is
key to the model and is the cornerstone of all of the information within the
mart. When we
talked to the users about their customer data, they indicated that the data was
clean for a number of reasons: product had been shipped to the address in the
database, the application has an automatic de-duping feature to ensure that
duplicate catalogs are not sent to the same address, and address fields are
automatically updated by the application.
Deciding on a refresh frequency:
Once the
initial load is complete, deciding on a methodology to keep the data mart
up-to-date is essential. The choices are to reload the data every month, week or
night, or to develop an incremental load scheme. The advantage of the reload process
is that the same procedures that did the initial load can be used without any
changes. The disadvantage is that the time window available to reload the data
may not be long enough. In addition to this problem, the individual incremental
changes, e.g. inventory changing from 10 to 20 with a net change of 10, are
lost. You will just be left with the end-of-day inventory levels. Depending on
what kinds of questions your data is trying to answer, this may not be
acceptable.
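The inventory example can be made concrete with a small Python sketch (invented numbers): a nightly reload keeps only the end-of-day level, while an incremental change feed preserves each movement:

# One day's inventory movements for one product (illustrative).
movements = [+10, -10, +10]          # receipts and issues during the day
start_of_day = 10

# Nightly reload: only the closing level survives.
end_of_day = start_of_day + sum(movements)   # 20

# Incremental feed: every movement is kept, so intraday activity is visible.
incremental_rows = [{"product": "X", "change": m} for m in movements]

print(end_of_day)             # 20: net change of 10, detail lost
print(len(incremental_rows))  # 3 individual changes preserved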
Our
client started with a full load process which was executed weekly. As the users started
really using the system, they asked for a daily refresh. They wanted to query
with yesterday's results in the billings mart, since they were using it for reporting
as well as analysis.
In their case they did not have enough time available overnight to add
another process which moved millions of records across the network and into the
mart. They had
to have an incremental update process.
This was
one reason they decided to use DecisionVault: the technology bundle allows for
both types of updating strategies. Taurus' BridgeWare product captures the
transactions that have modified data in the selected databases or files. The
transactions can then be applied to the data mart. This update can be done
either in real time or at some interval that you set up.
Initial data load:
Let's
look at the first dimension, CUSTOMERS. The
source system for this application was an Oracle database. This meant that the
data was already edited for type, meaning you could depend on numeric data
having numeric values, dates having valid date values, and character data being
already upshifted and stripped of trailing spaces. The data cleansing issues
were then reduced to finding duplicate customers. During your review of the
customer data, you might find duplicated customers. The customer elected to load
all duplicate customers and force data correction on the source side.
BridgeWare data mapping and movement could have been done
either using the Forklift GUI development tool or developing a script in the
editor. When we
developed scripts, we tried to break them into small pieces (modules) and then
put all the modules together at the end. The scripting broke into three major
components: accessing files, selecting data, and the data movement actions. Let's examine each
of these components separately.
The
scripting language has all the power of a third generation programming language.
To access a file, you
either create it or open it. Open is used for those files that already exist. In
this situation, you are accessing an Oracle database (your source) and an Image
database (your target). The syntax of the open statement is:
open TAG [remote MACHINE LOGININFO] databasetype DATABASEPARMS.
In our
case, the client was accessing the source locally and writing to the target
remotely:
open PROD ORACLE PROD/pass &
home=/oracle/product/723 sid=prod
open DW REMOTE hp3k user=mgr.dw image dw &
pass=secret mode=1
Once the
files are open for access, we tell BridgeWare what data we want from those
files. In this situation, we wanted all of the customer information. There was
no selection or filtering done. Selecting or reading is accomplished with the
read statement. The syntax of the read statement is:
read TAG = DBTAG.DATASET [for SELECTION].
In this
case, we read the CUST table
from the source database called PROD in the
open
statement. The read
statement was:
read SRC_CUST = PROD.CUST
As you
remember from the mapping document, we needed to read four source records to
populate the single data mart table, CUSTOMERS.
BridgeWare allows the creation of local record variables to hold the various
pieces until you have a completed record. To create a local record variable, use
the define
statement. The define
statement can be told to use the record layout of a file that is already open
using the using clause. To save us from having to define each item, we used this
clause:
define CUST using DW.CUSTOMERS
As the
various data become available, we set the various fields to their appropriate
values using the setvar
statement. For example:
setvar CUST.PARENT_ID = null_id
setvar CUST.PARENT_NBR = null_nbr
setvar CUST.PARENT_NAME = SRC_CUST.CUST_NAME
When we
had the various pieces, we could copy the record from the local record variable
to the target dataset. This is done using the copy statement. The syntax for the
copy
statement is: copy tag to dbtag.dataset. So for
our example it was:
copy cust to dw.customers
An incomplete script to
read data from our source and copy it to our target looks like this:
open PROD ORACLE PROD/pass &
     home=/oracle/product/723 sid=prod
open DW REMOTE hp3k user=mgr.dw image dw &
     pass=secret mode=1
define CUST using DW.CUSTOMERS
read SRC_CUST = PROD.CUST
  setvar CUST.PARENT_ID = null_id
  setvar CUST.PARENT_NBR = null_nbr
  setvar CUST.PARENT_NAME = SRC_CUST.CUST_NAME
  copy cust to dw.customers
endread
Let's
expand the script to match our project's real world.
open PROD ORACLE PROD/pass &
     home=/oracle/product/723 sid=prod
open DW REMOTE hp3k user=mgr.dw image &
     dw pass=secret mode=1
define CUST_KEY using DW.CUSTOMER_KEY
define CUST using DW.CUSTOMERS
define PARENT_KEY using DW.PARENT_KEY
define PARENTS using DW.PARENTS
define add_entry : boolean
define error_flag : boolean
define no_id_flag : boolean
define null_nbr : oracle varchar2(12) value 'NONE'
define null_id : oracle number(11)
**
* Add any new Customer Numbers to the Customer key table
* (Generate the Surrogate Customer Id)
**
read SRC_CUST = PROD.CUST
  setvar add_entry = $true
  read C_KEY = DW.CUSTOMER_KEY &
       for C_KEY.CUST_NBR = SRC_CUST.USER_CUST_NO &
       and C_KEY.SOURCE_SYSTEM = 'SRCAPPL'
    setvar add_entry = $false
  endread
  if add_entry = $true then
    setvar CUST_KEY.CUST_NBR = SRC_CUST.USER_CUST_NO
    setvar CUST_KEY.SOURCE_SYSTEM = 'SRCAPPL'
    setvar CUST_KEY.LAST_UPDATED = $now
    copy CUST_KEY to DW.CUSTOMER_KEY
  endif
  setvar CUST.CUST_ID = 0
  read C_KEY = DW.CUSTOMER_KEY &
       for C_KEY.CUST_NBR = SRC_CUST.USER_CUST_NO
    setvar CUST.CUST_ID = C_KEY.CUST_ID
  endread
  setvar no_id_flag = $true
  if CUST.CUST_ID > 0 then
    setvar no_id_flag = $false
    setvar CUST.CUST_NBR = SRC_CUST.USER_CUST_NO
    read SRC_CUST_SEG = PROD.CUST_SEG &
         for SRC_CUST_SEG.CUST_NO = SRC_CUST.CUST_NO &
         and SRC_CUST_SEG.SEG_NO = 317
      setvar CUST.ACTIVE = SRC_CUST_SEG.ACTIVE_FLAG
    endread
    read SRC_CUST_COMMCN = PROD.CUST_COMMCN &
         for SRC_CUST_COMMCN.CUST_NO = SRC_CUST.CUST_NO
      setvar CUST.AREA_NBR = SRC_CUST_COMMCN.AREA_CD
      setvar CUST.PHONE_NBR = SRC_CUST_COMMCN.PHONE_NO
    endread
    setvar CUST.PARENT_ID = null_id
    setvar CUST.PARENT_NBR = null_nbr
    setvar CUST.PARENT_NAME = SRC_CUST.CUST_NAME
    read SRC_PARENT = PROD.CUST &
         for SRC_PARENT.CUST_NO = SRC_CUST.WITHIN_CUST_NO
      setvar add_entry = $true
      read P_KEY = DW.PARENT_KEY &
           for P_KEY.CUST_NBR = SRC_PARENT.USER_CUST_NO &
           and P_KEY.SOURCE_SYSTEM = 'SRCAPPL'
        setvar add_entry = $false
      endread
      if add_entry = $true then
        setvar PARENT_KEY.CUST_NBR = SRC_PARENT.USER_CUST_NO
        setvar PARENT_KEY.SOURCE_SYSTEM = 'SRCAPPL'
        setvar PARENT_KEY.LAST_UPDATED = $now
        copy PARENT_KEY to DW.PARENT_KEY
      endif
      setvar CUST.PARENT_ID = 0
      read P_KEY = DW.PARENT_KEY &
           for P_KEY.CUST_NBR = SRC_PARENT.USER_CUST_NO &
           and P_KEY.SOURCE_SYSTEM = 'SRCAPPL'
        setvar CUST.PARENT_ID = P_KEY.PARENT_ID
      endread
      setvar add_entry = $true
      read PRNTS = DW.PARENTS &
           for PRNTS.PARENT_ID = CUST.PARENT_ID
        setvar add_entry = $false
      endread
      if add_entry = $true then
        setvar PARENTS.PARENT_ID = CUST.PARENT_ID
        setvar PARENTS.PARENT_NBR = SRC_PARENT.USER_CUST_NO
        setvar PARENTS.PARENT_NAME = SRC_PARENT.CUST_NAME
        setvar PARENTS.LAST_UPDATED = $now
        copy PARENTS to DW.PARENTS
      endif
      setvar CUST.PARENT_NBR = SRC_PARENT.USER_CUST_NO
      setvar CUST.PARENT_NAME = SRC_PARENT.CUST_NAME
    endread
    setvar CUST.CUST_NAME = SRC_CUST.CUST_NAME
    setvar CUST.CUST_TYPE = SRC_CUST.CUST_TYPE_CD
    setvar CUST.DUNS_NBR = SRC_CUST.DUNS_NO
    setvar CUST.DUNS_RATING = SRC_CUST.DUNS_RTNG
    setvar CUST.TAX_ID = SRC_CUST.FED_TAX_ID
    setvar CUST.SIC_CODE = SRC_CUST.SIC_CD
    setvar CUST.LAST_UPDATED = $now
    copy CUST to DW.CUSTOMERS
  endif
  if no_id_flag = $true then
    print 'Problem during load for customer ', src_cust.cust_no
  endif
endread
This
script from our example first opens the source and target databases. It creates
local record variables for those tables to which you are going to add records.
There are four tables: customer_key (this
is the customer's surrogate key), customers
(contains the data as described in the mapping document in the previous
section), parents (the
customer's parent, e.g. HP is the parent for all the various division locations
of HP), and parent_key (this
is the parent's surrogate key). Notice surrogate keys have been used. Surrogate
keys are used when data can come from more than one system using the same
"customer number" but meaning two different customers. The surrogate creates a
unique identifier for the customer within the data mart.
Next, we
defined local variables that are used as switches and intermediate storage.
The
script then reads our source. It checks to see if this customer already exists
in the data mart by checking the surrogate key table for customers, customer_key. If the
customer doesn't exist, it creates the surrogate key and writes it out. Next it goes and
gets the status information from the cust_seg table
in the source database and puts it in the record variable for the customer. It does the
same thing for the customer's phone number, which it gets from the cust_commcn table.
The next
logic checks for a parent company. If there is a parent, it uses the parent's
key. If the parent does not yet exist in the mart, it creates the parent record
and puts an entry in the parent surrogate key dataset.
Finally,
the script assigns the remaining fields and copies the record to the customers dataset
in the data mart.
Incremental load:
Once the
data mart was loaded and began being used, it was clear that the customer wanted
a daily refresh of data. They did not want to perform a full refresh
each night because they did not have a batch window large enough for a
full refresh.
DecisionVault's BridgeWare combines DataBridge with a
real-time change detect component for the HP3000. On relational databases it is easy to
capture changes as they are posted to the database using triggers. On the
HP3000, Image does not have such a feature. Although Image has logging, the log
file doesn't capture enough information to apply changes to the data mart.
BridgeWare's technology is best explained using a
diagram.
BridgeWare, a joint development with Quest Software, combines
Quest's change detection technology with Taurus' ETL. Their component, called
SharePlex, detects changes in data files on the HP3000 and writes them to a
message file. SharePlex works at the operating system level, detecting changes
before they reach the data file. DataBridge is then used to define how the
captured changes should be transformed and moved.
The message file holds the records of the changes, and
contains a header record with the name of the file changed, the type of
transaction, a sequence number, and the date and time the transaction occurred.
Each type of record (insert, update, delete) always writes a header plus the
appropriate record images. An insert has the record image for the
record inserted. A delete has the record image of the record before it was
deleted. An update has the before and after record image.
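To illustrate how change records like these can be applied to the mart (a generic Python sketch, not BridgeWare's or SharePlex's actual record formats), each record carries a transaction type, a sequence number, and the appropriate before and/or after images:

# Target table keyed by customer number (illustrative).
customers = {1: {"cust_no": 1, "name": "Acme"}}

def apply_change(rec):
    """Apply one captured change record to the target table."""
    if rec["type"] == "insert":
        customers[rec["after"]["cust_no"]] = rec["after"]
    elif rec["type"] == "update":
        # The before image is also available, e.g. for auditing.
        customers[rec["after"]["cust_no"]] = rec["after"]
    elif rec["type"] == "delete":
        customers.pop(rec["before"]["cust_no"], None)

changes = [
    {"type": "insert", "seq": 1, "after": {"cust_no": 2, "name": "Baker"}},
    {"type": "update", "seq": 2, "before": {"cust_no": 1, "name": "Acme"},
     "after": {"cust_no": 1, "name": "Acme Corp"}},
    {"type": "delete", "seq": 3, "before": {"cust_no": 2, "name": "Baker"}},
]
for rec in sorted(changes, key=lambda r: r["seq"]):  # apply in sequence order
    apply_change(rec)
print(customers)  # {1: {'cust_no': 1, 'name': 'Acme Corp'}}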
As our client moved to an incremental load process they had
to consider how to handle each of their transaction types: adds, changes and
deletes. The
rest of the load process remained the same.
Once data was loaded into the IMAGE/SQL data warehouse using
BridgeWare, it was almost ready to be queried, analyzed, and retrieved by the
end-users using BrioQuery. At this point, performance became an issue because
users can be quite demanding about query retrieval times (think of user demands
for response time on the web).
Query
retrieval times are very important because the #1 complaint of data warehousing
managers, according to Alan Paller of TDWI (The Data Warehousing Institute),
is:
"I should have paid
more attention to performance."
We evaluated our options for improving query performance. According to the data warehousing literature, the standard options for satisfying user query requests are to:
· Read serially through large "fact" tables
· Build summary tables (typically with relational databases)
· Load the data into an MDD (multidimensional database) and query against it
Serial reads through large tables or data sets are slow. They are not viable for any significant amount of data. Even with extremely large expenditures on CPUs or parallel processors, they are typically not fast enough for online reporting with multiple concurrent users competing for resources.
At first glance, using a relational database and building summary tables is a possibility. Summary tables store pre-calculated and sorted values that can answer specific queries. But we were worried that the number of tables needed to cover the queries that came out of our user interviews would be prohibitive, because a different summary table must be built for almost every query.
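To make the point concrete, even two of the simplest requests already call for two separate pre-built tables. The SQL below is purely illustrative; the table and column names are assumptions, not the client's schema.

# Hypothetical summary tables for the relational approach: each query
# shape needs its own pre-aggregated table (all names are illustrative).

SALES_BY_REGION_TABLE = """
    CREATE TABLE sum_sales_by_region AS
    SELECT a.region_name, SUM(f.sales_amount) AS total_sales
    FROM sales_fact f JOIN addresses a ON f.address_key = a.address_key
    GROUP BY a.region_name
"""

SALES_BY_PRODUCT_AND_MONTH_TABLE = """
    CREATE TABLE sum_sales_by_product_month AS
    SELECT p.prod_line, d.calendar_month_name, SUM(f.sales_amount) AS total_sales
    FROM sales_fact f
         JOIN products p ON f.product_key = p.product_key
         JOIN periods  d ON f.period_key  = d.period_key
    GROUP BY p.prod_line, d.calendar_month_name
"""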
As mentioned above, multidimensional databases are limited in the number of dimensions they can handle, and they cannot be updated incrementally. Typically only 5 or 6 dimensions can be used before the build and refresh times become unmanageable and another data mart must be built. This was not going to meet our users' requirements, since they wanted to expand the data mart into a data warehouse over time, and the refresh times would outgrow their nightly batch window.
This is where DISC's OMNIDEX indexing software came in. OMNIDEX is included in DecisionVault, and our client had used it before as a high-speed keyword search engine for the HP3000. In the past few years, OMNIDEX had been specifically enhanced for fast and flexible access to data warehouses and data marts. It provides access that is not possible with IMAGE/SQL (or Oracle or SQL Server databases) alone.
OMNIDEX delivers high-speed access for ad hoc queries by supplementing the existing database with Multidimensional and Aggregation Indexes that are specifically designed for decision support queries. OMNIDEX provides:
· High-speed data retrieval for query optimization
· ODBC/Web middleware residing between the database and popular end-user tools such as Brio
· Multidimensional analysis, with an unlimited number of dimensions
· Dynamic data aggregations, without the need for summary tables
· Instant drill-down to detail
· Optimization for star or snowflake schemas
· Fast index build rates and compressed disk space
To show how OMNIDEX was used to deliver the query performance we needed on our data mart, let's look at the database's star schema design and the retrieval requirements.
The current model has a basic sales fact table and five dimensions (refer again to the star schema diagram). The dimensions that can be sliced and diced in any combination are:
1) Customers
2) Addresses
3) Products
4) Periods
5) Salespeople
For our client, the primary purpose of the Billing/Sales Analysis data mart was to analyze customer characteristics and buying patterns, such as:
1) Customer demographics
2) Product performance
3) Sales rep performance
4) Regional preferences
Let's see how OMNIDEX handled multidimensional queries, data aggregations, and drill-down to detail for this client.
As with most data warehouse projects, we determined that most of the queries that needed to be run were complex and multidimensional in nature, using multiple criteria against multiple fields or across multiple dimensions. Our end-users rarely wanted to access data by only one field or dimension, such as finding the number of customers in the state of California. They wanted to ask complex questions, such as how many customers in California have purchased product B or D in the last year.
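Expressed as SQL against the star schema, that last question looks roughly like the sketch below; the table and column names are illustrative assumptions based on the model described earlier, not the client's actual layout.

# Illustrative multidimensional query: customers in California who bought
# product line B or D in the last year (all names are assumptions).

QUALIFYING_COUNT = """
    SELECT COUNT(DISTINCT f.customer_key)
    FROM sales_fact f
         JOIN addresses a ON f.address_key = a.address_key
         JOIN products  p ON f.product_key = p.product_key
         JOIN periods   d ON f.period_key  = d.period_key
    WHERE a.state_province_code = 'CA'
      AND p.prod_line IN ('B', 'D')
      AND d.calendar_year = 1998    -- "the last year" (illustrative)
"""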
Our client's sales manager wanted to see the sum of sales for his sales reps for their new products during a certain time period, such as last month. The inventory managers might want to see the quantity of a product sold over the last month and compare it to last year's. A marketing manager might want to see the sum of sales for particular states and customer types.
We knew
that the performance of these multidimensional queries in a data warehouse
environment using a star or snowflake database design would be greatly enhanced
through the use of OMNIDEX multidimensional indexes.
OMNIDEX Multidimensional Indexes are specially designed for unlimited multidimensional access, allowing complex queries using any number of criteria or dimensions, which is exactly what we needed to satisfy our requirements.
We
placed OMNIDEX Multidimensional indexes on each column in the dimension tables
so the end-users could query by any combination. When an end-user issues a
query, a qualifying count based on index access only is returned without
touching the actual data. This speeds the query because it is much more
efficient to service a query request by simply looking in an index or indexes
instead of going to the primary source of data (the database itself). This is
true for both pre-defined and ad hoc queries.
Because the end-users quickly obtain the results each time
they perform an ad hoc search, they can decide whether to add criteria to
further qualify the selection, aggregate it, drill down into the detail data, or
start over. With the multidimensional indexes applied, our client's end-users
were able to query repeatedly in seconds or minutes instead of hours or days,
even with large amounts of data. This allowed them to interact with and explore
the data in unpredictable ways, very quickly. Only when the end-users wanted to view the detail data was I/O performed against the database.
In
addition to multidimensional queries, our client's end-users wanted to see data
summaries (e.g., COUNT, SUM, AVERAGE, MINIMUM, MAXIMUM) such as total sales
dollars, number of customers, average quantity, etc., and group them by a field
such as sum of sales by region, or average quantity of product line B sold by
sales rep.
OMNIDEX performs data summaries using only its Aggregation indexes, summarizing quantities or amounts dynamically at the time of the query. Up to one million values per second can be queried.
We placed an OMNIDEX aggregation index on the fields to be
summarized and were able to query at these very high speeds while still
satisfying our client's requirement for heavy ad hoc inquiries.
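Aggregation requests of this kind translate into GROUP BY queries of roughly the following shape (the names again are illustrative assumptions); OMNIDEX answers them dynamically from its Aggregation indexes rather than from summary tables.

# Illustrative GROUP BY aggregations of the kind described above; table
# and column names are assumptions, not the client's actual schema.

SUM_OF_SALES_BY_REGION = """
    SELECT a.region_name, SUM(f.sales_amount) AS total_sales
    FROM sales_fact f JOIN addresses a ON f.address_key = a.address_key
    GROUP BY a.region_name
"""

AVG_QUANTITY_OF_LINE_B_BY_REP = """
    SELECT s.last_name, AVG(f.quantity) AS avg_quantity
    FROM sales_fact f
         JOIN sales_reps s ON f.rep_key     = s.rep_key
         JOIN products   p ON f.product_key = p.product_key
    WHERE p.prod_line = 'B'
    GROUP BY s.last_name
"""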
The multidimensional and aggregation indexes that we set up with the client contain pointers back to the actual data in the fact table, so the underlying detail data can be instantly accessed and displayed to the user at any time. This means data can be kept at any level of granularity, quickly aggregated, and then the detail displayed - a capability that is lost when using multidimensional databases or summary tables with relational databases.
Once the star schema database is designed, the general query needs are determined, and data is loaded, OMNIDEX is ready to be installed. To enhance this client's star schema with OMNIDEX indexes, an "environment catalog" was set up that defines the database layout (like a data dictionary).
Once the environment catalog was created and compiled, an OMNIDEX consultant worked with the client's staff to further analyze the query needs. They first determined what snowflake tables would optimize query performance for "Group By" fields from large data sets to be used in data aggregations.
Based on the cardinality of the data (the number of possible values per field) and the size of the dimension tables, the snowflakes created to benefit this Sales Analysis data mart were as follows. (OMNIDEX has a program called SNOWGEN that generates the snowflake tables automatically.)
From Customers: Cust name, Cust type, Parent name, Parent nbr, Parent ID, SIC code, Tax ID
From Addresses: City name, Country name, Region name, State/Province code, Status
From Products: Prod Group, Prod Type, Prod Line, Part Desc, Vendor, ABC code
From SalesReps: District code, Office code, Region code, Last name
From Periods: Calendar month name, Day name
After the snowflakes were created and added to the environment catalog, the OMNIDEX installation job for the data mart was defined based on the overall query needs. First, all of the fields in the dimension tables were defined as multidimensional keys (MDKs).
Once the indexes were created and loaded, it was time to install the OMNIDEX client/server software on each user's PC, create some data source definitions, and invoke a "listener" process on the HP3000 server. High-performance queries then became available to the users through ODBC-compliant analysis and reporting tools.
When a pre-defined or ad hoc query or report issues a request for information, ODBC passes the query through the OMNIDEX search engine and SQL Optimizer, which accesses the indexes and passes the qualifying count or data aggregation back to the PC query or reporting tool. The OMNIDEX Multidimensional and Aggregation indexes allowed a broad range of queries to be optimized with one set of indexes that took much less build time and disk space than other "standard" data warehousing options.
If the user requests the actual detail data, the pointers for the selected records are used to quickly retrieve only those records of interest from the sales fact data set. The OMNIDEX SQL optimizer ensures that all queries against a star schema or snowflake are completed in the fastest manner, and never cause slow, I/O-intensive serial reads of the data.
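From the client side, any ODBC-capable tool or script can drive this path. As a minimal present-day illustration, the Python sketch below issues one of the aggregation queries over ODBC; the data source name (SALESMART), credentials, query, and the pyodbc package are all assumptions rather than part of the original setup.

# Minimal sketch of issuing a query over ODBC from a client PC; the DSN
# name, credentials, and query are illustrative assumptions.
import pyodbc

conn = pyodbc.connect("DSN=SALESMART;UID=analyst;PWD=secret")
cursor = conn.cursor()

# The ODBC middleware routes the SQL to the server-side engine, which
# resolves the aggregation from its indexes and returns only the results.
cursor.execute(
    "SELECT a.region_name, SUM(f.sales_amount) "
    "FROM sales_fact f JOIN addresses a ON f.address_key = a.address_key "
    "GROUP BY a.region_name"
)
for region, total in cursor.fetchall():
    print(region, total)

conn.close()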
Our Case Study - Step 7
Once we had clean, certified data in our data mart with indexes installed on it, the Billings data mart was ready to hand over to the VPs and their staff. We needed to choose a business intelligence tool that users would like. We knew that any ODBC-compliant tool can access an IMAGE database through the OMNIDEX interface. Our client chose BrioQuery because of its short learning curve and ease of use.
Brio Technology's range of products allows users to access the data mart from their desktops or from their web browsers. BrioQuery automatically generates SQL requests that are sent to OMNIDEX, which quickly retrieves the data from the data mart. The data comes back to the user initially in row-and-column format, ready to be graphed and analyzed.
This sample shows the data requested from the data mart on the "Request" line, and below that the 6,999 rows of data that were returned.
Our client chose to develop standard analyses using charts, pivot spreadsheets, and serial reports that cover most of the detail that the VPs want to focus on to meet their ROI goals. Fortunately, Brio's products are easy enough to use that an Excel user can create new analyses very quickly.
Here is an example bar chart: the analysis shows how the salespeople have been discounting each quarter. We know that the VP of Sales and the CFO believe they can reduce the discounting rate from an average of 14% to at least 12% if they can see where the problems are. That 2% is $2 million at the end of the year in sales and margin, with no effort. We can "drill into" this chart to find out which reps are the worst offenders and whether any customer or product is particularly responsible. Managers can then act to correct the worst offenders.
Here's an example of "green bar paper" reporting brought up to date. Note that the output contains interactive graphic analyses that change as the control breaks on the listing change.
Here's an example of a pivot table for product margin analysis. The display is color-coded to reflect the user's thresholds for "good" and "bad" performance (blue and red). The purple horizontal and vertical axes show the attributes of the dimensions that this analysis uses. The gray tab at the end of each purple axis allows any horizontal dimension to be pivoted to the vertical and vice versa. The OLAP structure that underlies this analysis is then re-computed.
It is important for an MIS organization to investigate the perceived and realized benefits provided by the data warehouse or mart. Once the users begin using the warehouse or mart, they will ALWAYS ask for changes, and it is important to document the value of the new environment to help prioritize and justify new enhancements and new marts.
Determining the value of the mart can be done through interviews with end users or even through written surveys. The following list of sample questions was pulled from a Data Warehousing Institute survey currently being conducted by the Survey Research Center at the University of Georgia.
Strategic Benefits:
How did the mart increase our competitive position?
How did it improve relationships with our customers?
How did it improve relationships with suppliers?
How did it increase our ability to conduct one-to-one marketing?
How did it improve our business processes?
How did it improve cost efficiency?
How did it decrease the cost of supplying information?
How did it decrease the effort in supplying information?
How did it decrease the time to build new applications?
Information Benefits:
How did the mart improve our consistency?
How did the mart improve integration?
How did the mart improve comprehensiveness?
How did the mart improve availability?
How did the mart improve timeliness?
How did the mart improve accuracy?
How did these improvements affect our business, and what value did this have?
You have seen how Pain is vital to the success of a Data Mart Project, and how we built our client's Data Mart.
You have seen that, for the first time, there is a viable, comprehensive solution for creating data warehouses and data marts on an HP3000 from data that had previously been locked up in transactional databases on the HP3000 or HP9000. Better yet, the data for this solution can be easily assembled by lower-level MIS personnel and used by (good grief!) marketing and executive types to increase the productivity and profitability of your organization.
We recommend that anyone entering the data warehousing arena start by reading the works of Ralph Kimball, at a minimum, and attend his workshops. Ralph has codified the principles of data warehousing in a simple and down-to-earth fashion that addresses the business needs rather than the technology. No serious data warehousing project should start before the designers have fully absorbed his principles, particularly in data modeling. Ralph's two main works are:
The Data Warehouse Toolkit
The Data Warehouse Lifecycle Toolkit, ISBN 0-471-25547-5