DBMS - April 1996 - Data Warehouse Architect

DBMS, April 1996. Data Warehouse Architect. By Ralph Kimball

Slowly Changing Dimensions

Unlike OLTP Systems, Data Warehouses Can Track Historical Data.

One major difference between an OLTP system and a data warehouse is the ability to accurately describe the past. OLTP systems are usually very poor at correctly representing a business as of a month or a year ago. A good OLTP system is always evolving. Orders are being filled and, thus, the order backlog is constantly changing. Descriptions of products, suppliers, and customers are constantly being updated, usually by overwriting. The large volume of data in an OLTP system is typically purged every 90 to 180 days. For these reasons, it is difficult for an OLTP system to correctly represent the past. In an OLTP system, do you really want to keep old order statuses, product descriptions, supplier descriptions, and customer descriptions over a multiyear period?

The data warehouse must accept the responsibility of accurately describing the past. By doing so, the data warehouse simplifies the responsibilities of the OLTP system. Not only does the data warehouse relieve the OLTP system of almost all forms of reporting, but the data warehouse contains special structures that have several ways of tracking historical data. (OLTP systems produce "flash reports" for management, and the people who run OLTP systems are proud of that capability. But beyond these simple daily and weekly summaries and counts, the OLTP environment is a very costly environment in which to do any kind of complex reporting. Whether an OLTP shop likes it or not, the economics of reporting favor the data warehouse.)

A dimensional data warehouse database consists of a large central fact table with a multipart key. This fact table is surrounded by a single layer of smaller dimension tables, each containing a single primary key. In a dimensional database, these issues of describing the past mostly involve slowly changing dimensions. A typical slowly changing dimension is a product dimension in which the detailed description of a given product is occasionally adjusted. For example, a minor ingredient change or a minor packaging change may be so small that production does not assign the product a new SKU number (which the data warehouse has been using as the primary key in the product dimension), but nevertheless gives the data warehouse team a revised description of the product. The data warehouse team faces a dilemma when this happens. If they want the data warehouse to track both the old and new descriptions of the product, what do they use for the key? And where do they put the two values of the changed ingredient attribute?

Other common slowly changing dimensions are the district and region names for a sales force. Every company that has a sales force reassigns these names every year or two. This is such a common problem that this example is something of a joke in data warehousing classes. When the teacher asks, "How many of your companies have changed the organization of your sales force recently?" everyone raises their hands.

There are three main techniques for handling slowly changing dimensions in a data warehouse: overwriting, creating another dimension record, and creating a current value field. Each technique handles the problem differently. The designer chooses among these techniques depending on the users' needs.

Overwriting

The first technique is the simplest and fastest. But it doesn't maintain past history! Nevertheless, overwriting is frequently used when the data warehouse team legitimately decides that the old value of the changed dimension attribute is not interesting. For example, if you find incorrect values in the city and state attributes in a customer record, then overwriting would almost certainly be used. After the overwrite, certain old reports that depended on the city or state values would not return exactly the same values. Most of us would argue that this is the correct outcome.
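As a minimal sketch of the overwrite, assume a small customer dimension held in an in-memory SQLite database. The table name customer_dim, its columns, the helper overwrite_city_state, and the sample customer are all hypothetical and chosen only for illustration. The point is that the correction is a plain UPDATE, so the old value is gone afterward.

```python
import sqlite3

# Hypothetical customer dimension; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    " customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT, state TEXT)"
)
# The city was keyed in incorrectly in the source system.
conn.execute(
    "INSERT INTO customer_dim VALUES (1, 'Acme Supply', 'Sprngfield', 'IL')"
)


def overwrite_city_state(db, customer_key, city, state):
    """First technique: update the attribute in place; no history is kept."""
    db.execute(
        "UPDATE customer_dim SET city = ?, state = ? WHERE customer_key = ?",
        (city, state, customer_key),
    )


overwrite_city_state(conn, 1, "Springfield", "IL")

# Only the corrected value remains; the key and row count are unchanged.
rows = conn.execute(
    "SELECT customer_key, city, state FROM customer_dim"
).fetchall()
print(rows)
```

Any report that groups by city will now place this customer's entire history under the corrected value, which is the behavior the overwrite technique accepts.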

Creating Another Dimension Record

The second technique is the most common and has a number of powerful advantages. Suppose you work in a manufacturing company and one of your main data warehouse schemas is the company's shipments. The product dimension is one of the most important dimensions in this dimensional schema. (See Figure 1.) A typical product dimension in a shipments schema would have several thousand detailed records, each representing a distinguishable product capable of being shipped. A good product dimension table would have at least 50 attributes describing the products, including hierarchical attributes such as brand and category, as well as nonhierarchical attributes such as flavor and package type. An important attribute provided by manufacturing operations is the SKU number assigned to the product. You should start by using the SKU number as the key to the product dimension table.

Suppose that manufacturing operations makes a slight change in packaging of SKU #38, and the packaging description changes from "glued box" to "pasted box." Along with this change, manufacturing operations decides not to change the SKU number of the product, or the bar code (UPC) that is printed on the box. If the data warehouse team decides to track this change, the best way to do this is to issue another product record, as if the pasted box version were a brand new product. The only difference between the two product records is the packaging description. Even the SKU numbers are the same. The only way you can issue another record is if you generalize the key to the product dimension table to be something more than the SKU number. A simple technique is to use the SKU number plus two or three version digits. Thus the first instance of the product key for a given SKU might be SKU# + 01. When, and if, another version is needed, it becomes SKU# + 02, and so on. Notice that you should probably also park the SKU number in a separate dimension attribute (field) because you never want an application to be parsing the key to extract the underlying SKU number. Note the separate SKU attribute in the Product dimension in Figure 1.
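The key generalization described above can be sketched in a few lines. This is only one possible reading of "SKU# + 01": the key is treated as a string formed by appending zero-padded version digits to the SKU. The names ProductRecord and make_product_key are hypothetical, as is the choice of a string key.

```python
from typing import NamedTuple


class ProductRecord(NamedTuple):
    product_key: str  # generalized key: SKU followed by version digits
    sku: str          # SKU parked in its own attribute, never parsed from the key
    pkg_type: str


def make_product_key(sku: str, version: int, digits: int = 2) -> str:
    """Append zero-padded version digits to the SKU (assumed key format)."""
    if not 1 <= version < 10 ** digits:
        raise ValueError("version does not fit in the version digits")
    return f"{sku}{version:0{digits}d}"


# SKU #38 before and after the packaging change.
original = ProductRecord(make_product_key("38", 1), "38", "glued box")
revised = original._replace(
    product_key=make_product_key("38", 2), pkg_type="pasted box"
)

print(original)
print(revised)
```

The two records differ only in the key and the packaging description; applications that need the SKU read the sku attribute rather than decoding the key.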

This technique for tracking slowly changing dimensions is very powerful because new dimension records automatically partition history in the fact table. The old version of the dimension record points to all history in the fact table prior to the change. The new version of the dimension record points to all history after the change. There is no need for a timestamp in the product table to record the change. In fact, a timestamp in the dimension record may be meaningless because the event of interest is the actual use of the new product type in a shipment. This is best recorded by a fact table record with the correct new product key.

Another advantage of this technique is that you can gracefully track as many changes to a dimensional item as you wish. Each change generates a new dimension record, and each record partitions history perfectly. The main drawbacks of the technique are the requirement to generalize the dimension key, and the growth of the dimension table itself.
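To illustrate how a second dimension record partitions history, here is a small sketch using an in-memory SQLite database. The table names product_dim and shipment_fact, the key values '3801' and '3802' (SKU 38 plus two version digits), and the shipment rows are all made up for the example; a real product dimension would carry many more attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key TEXT PRIMARY KEY,  -- SKU plus two version digits (assumed format)
    sku         TEXT,
    pkg_type    TEXT
);
CREATE TABLE shipment_fact (
    ship_date   TEXT,
    product_key TEXT REFERENCES product_dim(product_key),
    units       INTEGER
);
-- Two versions of the same SKU; only the packaging description differs.
INSERT INTO product_dim VALUES ('3801', '38', 'glued box');
INSERT INTO product_dim VALUES ('3802', '38', 'pasted box');
-- Shipments carry whichever product key was in use when they occurred.
INSERT INTO shipment_fact VALUES ('1996-01-15', '3801', 100);
INSERT INTO shipment_fact VALUES ('1996-02-10', '3801', 120);
INSERT INTO shipment_fact VALUES ('1996-03-05', '3802', 90);
""")

# Constraining on the changed attribute splits history at the change,
# without any timestamp in the dimension table.
by_package = conn.execute("""
    SELECT p.pkg_type, SUM(f.units)
    FROM shipment_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.pkg_type
    ORDER BY p.pkg_type
""").fetchall()

# Grouping on the separate SKU attribute reunites both versions.
by_sku = conn.execute("""
    SELECT p.sku, SUM(f.units)
    FROM shipment_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.sku
""").fetchall()

print(by_package)
print(by_sku)
```

The first query separates the glued box shipments from the pasted box shipments purely through the join. The second shows why the SKU is kept as its own attribute: it lets an application ignore the versioning when the packaging change is not of interest.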

Creating a Current Value Field

You use the third technique when you want to track a change in a dimension value, but it is legitimate to use the old value both before and after the change. This situation occurs most often in the infamous sales force realignments, where although you have changed the names of your sales regions, you still have a need to state today's sales in terms of yesterday's region names, just to "see how they would have done" using the old organization. You can attack this requirement, not by creating a new dimension record as in the second technique, but by creating a new "current value" field.

Suppose in a sales team dimension table, where the records represent sales teams, you have a field called "region." When you decide to rearrange the sales force and assign each team to newly named regions, you create a new field in the sales dimension table called "current_region." You should probably rename the old field "previous_region." (See Figure 2.) No alterations are made to the sales dimension record keys or to the number of sales team records. These two fields now allow an application to group all sales fact records by either the old sales assignments (previous region) or the new sales assignments (current region). This schema allows only the most recent sales force change to be tracked, but it offers the immense flexibility of being able to state all of the history by either of the two sales force assignment schemas. It is conceivable, although somewhat awkward, to generalize this approach to the two most recent changes. If many of these sales force realignments take place and it is desired to track them all, then the second technique should probably be used.
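A rough sketch of the current value field, again using an in-memory SQLite database, might look like the following. The table names sales_team_dim and sales_fact, the team and region names, and the dollar figures are invented for the example. For portability across SQLite versions, this sketch leaves the old column named region rather than renaming it to previous_region, and aliases it in the query instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_team_dim (
    team_key  INTEGER PRIMARY KEY,
    team_name TEXT,
    region    TEXT                 -- plays the role of previous_region
);
CREATE TABLE sales_fact (team_key INTEGER, dollars INTEGER);
INSERT INTO sales_team_dim VALUES (1, 'Team A', 'East');
INSERT INTO sales_team_dim VALUES (2, 'Team B', 'East');
INSERT INTO sales_team_dim VALUES (3, 'Team C', 'West');
INSERT INTO sales_fact VALUES (1, 500);
INSERT INTO sales_fact VALUES (2, 300);
INSERT INTO sales_fact VALUES (3, 200);
""")

# Realignment: add the new field, default it to the old assignment,
# then move the reassigned team. Keys and row count are untouched.
conn.execute("ALTER TABLE sales_team_dim ADD COLUMN current_region TEXT")
conn.execute("UPDATE sales_team_dim SET current_region = region")
conn.execute(
    "UPDATE sales_team_dim SET current_region = 'Central' WHERE team_key = 2"
)


def sales_by(db, column):
    """Total all sales history grouped by one of the two region fields."""
    if column not in ("region", "current_region"):
        raise ValueError("unexpected grouping column")
    return db.execute(
        f"SELECT t.{column} AS region_name, SUM(f.dollars) "
        "FROM sales_fact f JOIN sales_team_dim t ON t.team_key = f.team_key "
        "GROUP BY region_name ORDER BY region_name"
    ).fetchall()


by_previous = sales_by(conn, "region")
by_current = sales_by(conn, "current_region")
print(by_previous)
print(by_current)
```

Both queries cover the same fact rows; only the grouping attribute changes, which is what lets all of history be stated under either organization.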

Choosing a Technique

The second and third techniques described here will handle the great majority of applications with slowly changing dimensions. The second technique, creating another dimension record, works very well for dimension tables with up to several hundred thousand records. Even the addition of many new records to these moderately large dimensions will not compromise performance in a DBMS with good indexing techniques, such as bit vector indexing. However, eventually a point may be reached in very large dimensions, such as multimillion record customer lists, where the second technique cannot be used. In this case, you are forced to resort to a cruder technique, appropriate for Monster Dimensions. This will be the subject of my column next month.

Figure 1.

--A typical manufacturing shipments schema with five dimensions, showing the Product dimension expanded. In this article I show how to track a meaningful change of the package type (pkg_type) attribute over time when the OLTP system refuses to change the master product key (SKU #).

Figure 2.

--A typical Sales Team dimension for almost any company that sells products. In this article I show how to track a change in the region attribute when you need to see both the old and new versions of the attribute over all historical data.

Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.

This is a copy of an article published @ http://www.dbmsmag.com/