DBMS - December 1996
DBMS Online: Data Warehouse Architect By Ralph Kimball

Letting the Users Sleep, Part 1

Nine Decisions in the Design of a Data Warehouse.

The job of a data warehouse designer is a daunting one. Often the newly appointed data warehouse designer is drawn to the job because of the high visibility and importance of the data warehouse function. In effect, management says to the designer: "Take all of the enterprise data and make it available to management so that they can answer all of their questions and sleep at night. And please do it very quickly, and we're sorry but we can't add any more staff until the proof of concept is successful."

This responsibility is exciting and very visible, but most designers feel overwhelmed by the sheer enormity of the task. Something real needs to be accomplished, and fast. Where do you start? Which data should be brought up first? Which management needs are most important? Does the design depend on the details of the most recent interview, or are there some underlying and more constant design guidelines that you can depend on? How do you scope the project down to something manageable, yet at the same time build an extensible architecture that will gradually let you build a comprehensive data warehouse environment?

These questions are close to causing a crisis in the data warehouse industry. Much of the recent surge in the industry toward "data marts" is a reaction to these very issues. Designers want to do something simple and achievable. No one is willing to sign up for a galactic design that must somehow get everything right on the first try. Everyone hopes that in the rush to simplification, the long-term coherence and extendibility of the design will not be compromised.

Fortunately, a pathway though this design challenge achieves an implementable immediate result, and at the same time it continuously augments the design so that eventually a true enterprise-scale data warehouse is built. The secret is to keep in mind a design methodology, which I call the "Nine-Step Method." (See Table 1.) I will develop the first three of the nine steps in this month's article and the remaining six steps in next month's article.

As a result of interviewing marketing users, finance users, sales force users, operational users, first- and second-level management, and senior management, a picture emerges of what is keeping these people awake at night. You can list and prioritize the primary business issues facing the enterprise. At the same time, you should conduct a set of interviews with the legacy systems' DBAs, who will reveal which data sources are clean, which contain valid and consistent data, and which will remain supported over the next few years.

Preparing for the design with a proper set of interviews is crucial. Interviewing is also one of the hardest things to teach people. I find it helpful to reduce the interviewing process to a tactic and an objective. Crudely put, the tactic is to make the end users talk about what they do, and the objective is to gain insights that feed the nine design decisions. The tricky part is that the interviewer can't pose the design questions directly to the end users. End users don't have opinions about data warehouse system design issues; they have opinions about what is important in their business lives. End users are intimidated by system design questions, and they are quite right when they insist that system design is IS's province, not theirs. Thus, the challenge of the data mart designer is to meet the users far more than halfway.

In any event, armed with the top-down view (what keeps management awake) and the bottom-up view (which data sources are available), the data warehouse designer is ready for:

Step 1: Choosing the Process. By "process" I mean the subject matter of a particular data mart. The first data mart you build should be the one with the most bang for the buck. It should simultaneously answer the most important business questions and be the most accessible from a data extraction point of view. A great place to start in most enterprises is to build a data mart that consists of customer invoices or monthly statements. (See Figure 1.) This data source is probably fairly accessible and of fairly high quality. One of Kimball's laws is that the best data source in any enterprise is the record of How Much Money They Owe Us. Unless costs and profitability are easily available before the data mart is even designed, it's best to avoid adding these items to this first data mart. Nothing drags down a data mart implementation faster than a heroic or impossible mission to provide activity-based costing as part of the first deliverable.

Step 2: Choosing the Grain. This second step seems like a technical detail at this early point, but it is actually the secret to making progress on the design. Choosing the grain means deciding exactly what a fact table record represents. Recall that the fact table is the large central table in the dimensional design that has a multipart key. Each of the components of the multipart key is a foreign key to an individual dimension table. In the customer invoices example in Figure 1, the grain of the fact table is the individual line item on the customer invoice. In other words, a line item on an invoice is a single fact table record, and vice versa.

Only when you have chosen the grain can you have a coherent discussion of what the dimensions of the data mart's fact table are. This is the subject of Step 3.

Step 3: Identifying and Conforming the Dimensions. The dimensions are the drivers of the data mart. The dimensions are the platforms for browsing the allowable constraint values and launching these constraints. The dimensions are the source of row headers in the user's final reports; they carry the enterprise's vocabulary to the users. A well-architected set of dimensions makes the data mart understandable and easy to use. A poorly presented or incomplete set of dimensions robs the data mart of its usefulness.

Dimensions should be chosen with the long-range data warehouse in mind. This choice presents the primary moment at which the data mart architect must lift up his or her gaze from the data mart details and consider the longer-range plans. If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other. Only in this way can two data marts share one or more dimensions in the same application. When a dimension is used in two data marts, this dimension is said to be conformed. Good examples of dimensions that absolutely must be conformed between data marts are the customer and product dimensions in an enterprise. If these dimensions are allowed to drift out of synchronization between data marts, the overall data warehouse will fail, because the two data marts will not be able to be used together.

The requirement to conform dimensions across data marts is very strong. Careful thought must be given to this requirement before the first data mart is implemented. The data mart team must figure out what an enterprise customer ID is and what an enterprise product ID is. If this task is done correctly, successive data marts can be built at different times, on different machines, and by different development teams, and these data marts will merge coherently into an overall data warehouse. In particular, if the dimensions of two data marts are conformed, it is easy to implement drill across by sending separate queries to the two data marts, and then sort-merging the two answer sets on a set of common row headers. The row headers can be made to be common only if they are drawn from a conformed dimension common to the two data marts. (See Figure 2.) I first described Drilling Across in my March 1996 column. (See "Drilling Down, Up, and Across," DBMS, March 1996, page 14.)

With these first three steps correctly implemented, you can attack the last six steps. Each step gets easier if the preceding steps have been performed correctly. I'll describe the six remaining steps in next month's article.


TABLE 1.
Nine-Step Method in the Design of a Data Warehouse

1. Choosing the Process
2. Choosing the Grain
3. Identifying and Conforming the Dimensions
4. Choosing the Facts
5. Storing Precalculations in the Fact Table
6. Rounding Out the Dimension Tables
7. Choosing the Duration of the Database
8. The Need to Track Slowly Changing Dimensions
9. Deciding the Query Priorities and the Query Modes


Figure 1.


--A Customer Invoices Data Mart. Each record in this fact table represents an invoice line item. The Invoice Number is a "degenerate" dimension that allows line items on a particular invoice to be grouped.


Figure 2.


--Two data marts with conformed dimensions (Time, Customer, Product, and Promotion). These two data marts can work together to produce a combined report with Product and Promotion as common row headers.


Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.

This is a copy of an article published @
http://www.dbmsmag.com/