DBMS - May 1997
DBMS Online: Data Warehouse Architect By Ralph Kimball

Traveling through Databases

Ralph guides us through voyages and networks.

An interesting design theme that has come up in many data warehouses I have designed is the business application where a person or a thing goes on a voyage. Although ship voyages come to mind most readily, I am really talking about any situation where the person or thing travels from a beginning to an end with some stops in the middle. There are many examples of voyage situations: ocean container cargo shipping, ocean cruises, airline passenger travel, airline cargo shipping, rail transport, truck transport, package delivery services, United States mail (any postal system, really), distribution systems, material movement systems, business-travel credit-card tracking (including car rentals and hotel stays), and, surprisingly, telephone company network-inventory tracking (all of the switches and lines in the network).

All of these voyage situations can be thought of as networks. The network is the map of every possible voyage between ports. (See Figure 1.) Although it took a while to dawn on me, the design issues for network-oriented business data warehouses are often the same as the design issues for classic voyage data warehouses. Think about the telephone company network inventory example as a map of the possible voyages a telephone call could make between an origin "port" telephone number and a destination port telephone number.

Walking Through the Design

Voyage and network databases raise a number of unique design issues not found in other data warehouses. But voyage databases contain an inherent trap. The most straightforward and obvious star schema design easily contains all possible information about voyages taken, but it may fail miserably to answer the most basic business questions about the voyages. These questions include "Why did the person or thing take the voyage?" and "Where are they going?"

To understand this dilemma and see how to fix it, suppose that you want to build a frequent-flyer data mart for the marketing department at a major airline. Using the methodology described in my DBMS column of December 1996, first choose the business process for this data mart. The business process is the system for generating passenger tickets. You will extract directly from the production ticketing system into the frequent-flyer data mart. The second step is to decide on the "grain" of the main fact table in the star schema. The grain of a fact table identifies the exact context of a fact table record.

In voyage databases, it is inevitable that the grain of the fact table must be the individual voyage segment taken. In the frequent-flyer case, this means that there is a record in the main fact table for each segment flown by a frequent flyer. You must use the segment as the grain, because all of the interesting information about what happened to the frequent flyer is uniquely available at the segment level. This information includes the origin airport for the segment, the destination airport for the segment, the fare class flown, whether an upgrade was purchased or claimed, the type of aircraft, the location of the passenger's seat, whether the flight was late, whether the service was satisfactory, the actual miles flown, and the miles awarded.

The third and fourth steps in the design are deciding on the specific dimensions and facts. It is fairly easy to arrive at the design shown in Figure 2. At first glance, this design seems to be very satisfactory. The dimensionality is very clean and obvious, and it seems that there would be many useful "entry points" into this fact table through the dimension tables that would let you construct all sorts of interesting business queries.

Strangely enough, this schema fails miserably. It is nearly impossible for marketing department end users to ask why the frequent flyers bought a ticket in the first place! The grain of the individual segment masks the nature of the voyage. On a real business flight shown in Figure 3, I may leave from San Jose and fly to Detroit with an intermediate hub transfer stop in Dallas. After staying in Detroit for two days, I go on to Atlanta. Then I return to San Jose, again though Dallas. My five segments look like:

I would have to study the flight times closely to realize that the real marketing answer is:

Adding Dimensions

No query tool -- or SQL that I am aware of -- will deliver this answer from the basic design in Figure 2. There is no reasonable way to determine what a meaningful "business destination" is in this sequence of segments, and any query that tries to constrain across disparate records in the database is the kiss of death for SQL.

Fortunately, there is a simple solution to his dilemma. If you add two more airport dimensions to Figure 2, suddenly the design becomes very responsive to the end users. These added dimensions are shown with the arrow in Figure 2. These new TRIP_ORIGIN and TRIP_DESTINATION dimensions can be thought of as providing the context for the voyage segment. Now the marketing end users can perform a simple SQL query such as

Select count(distinct ticket_number) ...
Where TRIP_ORIGIN = 'San Jose' and TRIP_DESTINATION = 'Detroit'.

The TRIP_ORIGIN and TRIP_DESTINATION dimensions are identified during the production data extract process. An airline would normally identify a stopover of more than four hours as a meaningful stop at a destination. While the ticket is being processed during the data extract, the TRIP_ORIGIN and TRIP_DESTINATION are easily identified and entered as fields on the individual segment records.

The theme of providing two additional "context" dimensions for the voyage segment is the secret of the voyage and network schema designs. All of these schemas need these two additional dimensions. Armed with these two additional dimensions, end users can ask many interesting questions, including:

An analogous question for a telephone network that needs the two context dimensions might be: What are the ultimate origins and destinations of the high volume of traffic we are seeing on this particular line between switching points?

The four airport dimensions must be logically independent in this design so that they can be constrained in random and uncorrelated ways. However, you only need to build a single physical airport table to support all four logical dimensions. Although the SQL SYNONYM construct will work to create the illusion of four independent logical tables, the SYNONYM approach by itself is awkward for the end users. The four independent logical tables will end up having identical field names. This makes report building and ad hoc querying confusing. The field called AIRPORT_NAME would be found in all four logical tables, and it becomes cosmetically difficult to tell the four roles apart. A better approach might be to also define four SQL views on these four SYNONYMS, where each view relabels all the fields to names such as TRIP_ORIGIN_AIRPORT_NAME and TRIP_DESTINATION_AIRPORT_NAME.

Voyage schemas often have a large number of dimensions. A schema for a container shipping business could easily have 12 or more dimensions, such as original shipper, ultimate consignor, foreign consolidator, domestic consolidator, carrier, voyage origin, voyage destination, segment origin, segment destination, commodity shipped, container vessel, and bill of lading number (degenerate).

The first five of these dimensions are commercial business entities that play roles in shipping the commodity from the original shipper to the ultimate consignor. Like the airport dimension in the previous example, these five dimensions could be implemented in a single commercial entity physical table, but five SYNONYMS and five SQL views should be used to separate the five business roles.

Images and Maps

It is very compelling to combine graphic images and maps in a voyage database. Voyages and networks are inherently physical, and businesses often have rich sources of graphics and maps that fit nicely with the grain of the voyage or network segment. For instance, in your frequent-flyer database, you could have images of airplanes, cabin layouts, cities, and airports to accompany each segment record. Maps of flight segments, trips, and overall passenger tickets would be very useful. Maps could be available for gate layouts in airports showing the distance between gates.

Even an ordinary relational database can support extensions to maps and graphic images. For graphic images, the name of a JPEG file in the dimension record for the aircraft or the airport is enough for an application to open the image while using the database. I have frequently added graphic images to my queries using Visual Basic just by passing the JPEG file name to the ImageMan Visual Basic plug-in routine. ImageMan handles many different graphic image formats seamlessly; the images really liven up the analysis. ImageMan is available from Data Techniques Inc. of Burnsville, N.C. (www.datatech.com).

Similarly, latitudes and longitudes can be passed to mapping software, which can display a map in a window alongside the query results. In Visual Basic I would take a look at ESRI's MapObjects routines for map generation. ESRI's well-designed Web site is found at www.esri.com.


Ralph Kimball was coinventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. His book The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses (Wiley, 1996) is now available. You can reach Ralph through his Web page at www.rkimball.com.


Figure 1.


--A network is a map of every possible voyage between ports. This could just as well be a set of shipping routes or a set of telephone cables.

Figure 2.


--The Frequent Flyer Data Mart Schema. This design is almost useless until the extra two dimensions are added. The design then becomes very powerful and can answer most marketing questions.

Figure 3.


--A typical business trip starting and ending in San Jose. Dallas is not a true destination. The interesting city pairs are San Jose to Detroit, Detroit to Atlanta, and Atlanta to San Jose.


This is a copy of an article published @
http://www.dbmsmag.com/