DBMS - August 1997
DBMS Online: Data Warehouse Architect By Ralph Kimball

Data Warehouse Role Models

What to do when a single dimension appears several times in the same fact table.

Given the title, you may expect this month's column to be about stalwart IS citizens who play inspirational roles in making their data warehouses successful. Sorry, but yours truly is a modeler to the end. So what this column is really about is how "roles" arise in a data warehouse design and how we model them.

A role in a data warehouse is a situation in which a single dimension appears several times in the same fact table. This can happen in a number of ways. Last month I discussed the Time dimension. In certain kinds of fact tables, Time can appear repeatedly. For instance, we may build a fact table to record the status and final disposition of a customer order. This kind of fact table is called an accumulating snapshot fact table. The dimensions of this table could be Order Date, Packaging Date, Shipping Date, Delivery Date, Payment Date, Return Date, Refer to Collection Date, Order Status, Customer, Product, Warehouse, and Promotion.

Each of these dimensions is a foreign key in the fact table record pointing to a specific dimension table. Dimension tables contain the descriptive textual attributes of the data warehouse and are the source of most of our interesting constraints.

The first seven dimensions in the design are all times! However, we cannot join these seven foreign keys to the same table. SQL would interpret such a seven-way simultaneous join as requiring that all of the dates be the same. That doesn't seem very likely.

Instead of a seven-way join, we need to fool SQL into believing that there are seven independent Time dimension tables. We even need to go to the length of labeling all of the columns in each of the tables uniquely. If we don't label the columns uniquely, we get into the embarrassing position of not being able to tell the columns apart if several of them have been dragged into a report.

Even though we cannot literally use a single time table, we still want to build and administer a single time table behind the scenes. For the user, we can create the illusion of seven independent time tables in a couple of ways. We can either make seven identical physical copies of the time table, or we can create seven virtual copies of the time table with the SQL SYNONYM command. Regardless of the approach, once we have made these seven clones, we still have to define a SQL view on each copy in order to make the field names uniquely different.

Now that we have seven differently described Time dimensions, they can be used as if they were independent. They can have completely unrelated constraints, and they can play different roles in a report.

The scenario described in the previous few paragraphs is the classic role model exercise in a data warehouse. Although the other examples I am about to describe have nothing to do with time, they are handled in exactly the same way.

The second example was developed in my May column on voyages and networks. (See "Traveling through Databases," DBMS, May 1997, page 16.) We saw that those voyage data marts that were trying to represent journeys all needed to have at least four "port" dimensions to properly describe the context of a journey segment. The dimensions of a frequent flyer flight segment fact table need to include Flight Date, Segment Origin Airport, Segment Destination Airport, Trip Origin Airport, Trip Destination Airport, Flight, Fare Class, and Customer.

The four Airport dimensions are four different roles played by a single underlying Airport table. We build and administer these exactly the way we did the seven Time tables in the previous example.

The telecommunications industry has many situations requiring the use of role models. With the advent of deregulation, a number of competing entities may all extract revenue from a single phone call. On a single call, these entities might include Source System Provider, Local Switch Provider, Long Distance Provider, and Added Value Service Provider.

These four entities need to be dimensions on every call. In the complex and evolving telecommunications industry, it may be very difficult and confusing to maintain four different partially overlapping tables of business entities. Some business entities will play several of these roles. It will be far easier to keep a single Business Entity table and use it repeatedly within a data warehouse role model framework.

Actually, in building a full-blown call revenue analysis fact table, we would also recognize that there are at least two more Business Entity roles that should be added to the design. The fifth and sixth Business Entity dimensions could be Calling Party and Called Party.

It is worth pointing out that the master billing production system for a large telecommunications company presents both an enormous opportunity and a dilemma. On the one hand, the billing system may well identify the revenue attributed to the four competing Providers mentioned above. But on the other hand, competitive regulations may limit whether the billing system will allow any one party to perform a detailed competitive analysis.

The telecommunications industry also has a very well-developed notion of Location. Many telco dimensions have a precise geographic location as part of their description. This location may be resolved to a physical address, or even to a highly precise latitude and longitude. Using our role-modeling skills, we can imagine building a master Location table and then using it repeatedly. The Location table could be part of Working Telephone Number, Billing Telephone Number, Equipment Inventory, Network Inventory (including poles and switch boxes), Real Estate Inventory, Service Location, Dispatch Location, Right of Way, and even Business Entity. Each record in the location table is a point in space. Points in space are great because they roll up to every conceivable geography. Points in space roll up just as easily to area codes as they roll up to counties, census tracts, and sales territories. A single Location record should probably include all of these rollups simultaneously.

However, you will have noticed a little different emphasis in this discussion of Location. Location is more naturally thought of as a component of a dimension, not as a standalone dimension all by itself. The Working Telephone Number dimension is the most basic description of where all the phone numbers are actually located. But this dimension also needs to describe the customer who owns this working telephone number, and perhaps something about the type of service this number represents. For instance, is it part of a Centrex system? This larger description seems to confuse the issue of creating a role for the Location table.

The use of an embedded role, such as Location, in a variety of unrelated larger dimensions, is one of the few places where I would encourage and support "snowflaking." In other words, what I recommend here is to create a join from each of the tables needing to describe Location to a clone of the Location table. The issues in creating Location clones are exactly the same as creating the Time, Airport, and Business Entity clones earlier in this article. We need separate physical copies, or synonyms, for each use of the Location table, together with overlying SQL views to create distinguishable column names. These views are then used to join into their respective larger dimensions, such as Working Telephone Number.

My concern with snowflaking comes from the abuse that the entity-relation approach encourages when creating a plethora of little tables by squeezing out all of the redundancy from the dimension table. The blizzard of snowflaked tables can seriously freeze a data warehouse. All of the little tables intimidate end users, who don't like visual complexity, and all of the little tables destroy the performance of cross browsing amongst dimensional attributes that aren't hierarchically related. But that's a topic for my Manifesto, which appears on page 61 in this issue.

To finish off, although it might be a stretch to implement such a table, let's imagine a single fact table that combines three of the examples in this article into one design. Imagine that we are capturing switch traffic at an internal node of a large telecommunications company. We want a very detailed view of this switch traffic so that we can make investments in new alternative routes for calls, new capacity for the current switch, and new features for the current switch. To make these decisions, we want to go all the way down to a microanalysis of who is using the switch for what purpose. We further imagine that after capturing the raw call traffic though the switch, we come back later in the billing cycle and correctly rate the revenue of each call. So perhaps the dimensions of our Rated Switch Traffic data mart fact table look like: Call Date, Rated Date, Billed Date, Paid Date, Source System Provider, Local Switch Provider, Long Distance Provider, Added Value Service Provider, Calling Party (with embedded Location), Called Party (with embedded Location), Prior Switch (with embedded Location), and Subsequent Switch (with embedded Location).

The numeric measured facts in this record could be: Call Time of Day, Call Duration in Seconds, and Rated Revenue.

Here we have Time playing four roles in the first four dimensions, Provider (or Business Entity) playing four roles in the next four dimensions, and Location playing four roles by being embedded in the last four dimensions. Although there is tremendous expressive power in these 12 dimensions, most of the work in creating the dimensions is focused on just three underlying tables. Any data warehouse designer who pulled off a design like this would be considered a good role model indeed.


Ralph Kimball was coinventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. His book The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses (Wiley, 1996) is now available. You can reach Ralph through his Web page at www.rkimball.com.


This is a copy of an article published @ http://www.dbmsmag.com/