DBMS - November 1996 - Data Warehouse Architect

DBMS - November 1996 Data Warehouse Architect By Ralph Kimball

Causal (Not Casual) Dimensions

Consider causal dimensions when you want your data warehouse to be more informative.

One of the most interesting and valuable dimensions in a data warehouse is one that explains why a fact table record exists. In most data warehouses, you build a fact table record when something happens. For example:

When the cash register rings in a retail store, a fact table record is created for each line item on the sales ticket. The obvious dimensions of this fact table record are product, store, customer, sales ticket, and time, as shown in Figure 1.
At a bank ATM, a fact table record is created for every customer transaction. The dimensions of this fact table record are financial service, ATM location, customer, transaction type, and time.
When the telephone rings, the phone company creates a fact table record for each "hook event." A complete call-tracking data warehouse in a telephone company records each completed call, busy signal, wrong number, and partially dialed call.

In all three of these cases, a physical event takes place, and the data warehouse responds by storing a fact table record. However, the physical events and the corresponding fact table records are more interesting than simply storing a small piece of rev enue. Each event represents a conscious decision by the customer to use the product or the service. A good marketing person is fascinated by these events. Why did the customer choose to buy the product or use the service at that exact moment? If we only had a dimension called "Why Did The Customer Buy My Product Just Now?" our data warehouses could answer almost any marketing question. We call a dimension like this a "causal" dimension, because it explains what caused the event.

Surprisingly, in many cases the available data can build a good approximation of a causal dimension. This data masquerades under headings such as Promotion, Store Condition, Deal, Contract, Rate Card, or Reason. For instance, in a retail environment, a n umber of management decisions are in effect at any time for a product, including Temporary Price Reduction, On Ad, or On Display. Each of these management decisions arguably affects the volume of sales. Most of these decisions are viewed as retail promot ions.

At a bank ATM, there may be a New Account Drive, a Promotional Mailing, or a Branch Teller Surcharge. Again, each of these management decisions affects the volume and the patterns of ATM usage. There may also be exogenous effects on ATM usage, such as a national holiday or bad weather, that are not the result of a human management decision.

The telephone company hook events are similarly "explained" by causal dimensions such as Reduced Rate Dialing Specials, Lifeline Rates, and Off-Peak Usage Incentives. Some of these descriptors can be found in legacy data in the form of Contracts, Deals, or Rate Cards.

One of the best things a data warehouse designer can do is search for and build causal dimensions. Data for causal conditions, such as promotions, store conditions, or contracts, is often available somewhere in the corporate environment but is rarely lin ked in a clean way to the primary transaction data feed. Retail transaction systems are the most likely to have a link to causal data, largely because retail transaction systems must keep track of price reductions and markdowns. Less commonly, the retail transaction system also keeps track of whether an item is on display or is being advertised. In some cases, the data warehouse team can ask the production point-of-sale programming staff to add an advisory data field to the legacy data. The store manage r can fill in the field on a regular basis to record whether an item is being promoted through displays or advertisements. This kind of business re-engineering greatly simplifies data extraction for the data warehouse team and improves the power of the d ata warehouse.

ATM transaction and telephone switch usage data almost never contain links to causal data. In these cases, causal data needs to be merged into the transaction data from an entirely separate source, such as a marketing promotion system.

A useful causal dimension need not describe every minor variation in a promotion or the store condition. It may be most useful to build a causal dimension at a reasonably high level, gradually building up a few hundred types of promotion descriptions or store conditions. Figure 2 shows a useful causal dimension for a retail store point-of-sale fact table. In this case, the relevant causal conditions being measured include price treatment, ad description, and display description. A ny given promotion for a product in a store on a given day will consist of some combination of these factors. For example, orange juice may be discounted today in all of the stores, but only some of the stores may accompany the discount with a special in -store display. Notice that one of the most important records in this promotion dimension is the record describing "no promotion." Most of the products in a store on a given day are probably sold under the "no promotion" causal condition.

A causal dimension is a kind of advisory dimension that should not change the fundamental grain of a fact table. Recall that the grain of a fact table identifies the meaning of a single fact table record. In Figure 1, the grain of the fact table is the i ndividual line item on a particular customer's sales ticket. I stated earlier in this article that the natural dimensions of this fact table are product, store, customer, sales ticket, and time. If you decide that you can describe each sale more specific ally by a set of promotion conditions, store conditions, and exogenous conditions, then you can add a special key into the fact table that points to the relevant combined causal description for each sales record. The addition of such a key does not chang e the number of fact table records. All of the old applications continue to work, continue to produce exactly the same results, and do not require recoding. This setup is an example of the robustness of the star join database organization described in my August 1996 DBMS article Dangerous Preconceptions". In this case, the dangerous preconception is that you cannot add additional information such as a causal dimension to the design after the data warehouse becomes opera tional. As this example illustrates, you can add new dimensions at any time, as long as you are careful to preserve the original grain. As I pointed out in August, it is easy to preserve the original grain if you start with the lowest-level transactions in the business, because in a very fundamental sense it is not possible to create a more granular view of the business. A sales transaction is a sales transaction, whether or not you accompany it with fancy causal descriptors.

Some of you may be bothered by the assumption that the causal dimension "explains" why the customer bought the product. Obviously, you never know for sure why anyone buys anything. In some cases, you can't even be sure whether the presumed stimulus (the ad or the display) was even noticed by the customer. For these reasons, causal factors are usually classified as "absolute" or "suspected." An absolute causal factor, such as a price reduction, is a factor you know affected some aspect of the sale, such as the price. A suspected causal factor, such as a newspaper ad or bad weather, is simply a causal factor that existed at the same time as the sale but may not have been visible to or even noticed by the customer at the time of purchase. In the long run, it is up to advanced techniques such as data mining to determine if a correlation exists between these suspected causal factors and any change in sales.

The link between causal factors and business performance leads to the most important business question surrounding a causal dimension - namely, "Was my promotion profitable?" Equivalently, we ask, "Did the promotion (or other causal factor) make any diff erence?" There are at least three increasingly sophisticated ways to ask this question. The most basic form of the question is: Was I profitable while the promotion or other causal factor was happening? The intermediate form of the question is: What was the lift of the promotion compared to the baseline sales? And the most advanced form of the question is: What were the patterns of cannibalization and time shifting as a result of the promotion? Which other products were affected, and which other product s showed no effect? In a forthcoming feature article on data mining, I will explain these promotion measures and describe how a data warehouse is used to answer such questions.

The existence of a causal dimension often provokes the "what didn't happen?" question. For example, what was on promotion that did not sell? Even with a causal dimension, you cannot answer these questions with a fact table that records what did happen. A companion fact table, called a coverage table, is needed in this case. The set difference between the coverage fact table and the primary fact table provides the answer. In my September DBMS article "Factless Fact Tables", I d escribed the structure of coverage tables that help us show where causal factors did not produce the results we had hoped for.

Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.

Figure 1

--A typical star join schema for a set of retail sales transactions. Each fact table record is a line item on a particular customer ticket. In this article, I show how to add a sixth "causal" dimension that explains why the sale took place.

Figure 2

--A useful causal dimension in a retail environment. The new causal key is simply inserted into the existing fact table without violating the grain of the fact table or changing any existing applications.

This is a copy of an article published @ http://www.dbmsmag.com/