In last month's column I showed how to handle slowly changing dimensions, including descriptions of products that occasionally changed and assignments of sales territories that occasionally changed. We saw that the two major techniques available to the data modeler were creating a new dimension record (often used for product changes) and creating a new "current" field (often used for sales territory changes). I finished the column with the warning that neither of these techniques works very well for "monster" dimensions, such as a multimillion-row customer list.
In today's data warehouses, a medium-size dimension, such as a master product list for a retailer, may have 500,000 records. Serious DBMSs are capable of browsing among the textual attributes, such as flavor and package type, in a 500,000-row table and giving the user good interactive performance. Last month's technique of adding new records poses no threat to performance in these medium-size dimension tables, as long as the table doesn't swell to much more than a million rows. Vendors such as Red Brick Systems, Oracle Corp., and Sybase Inc. are offering bit-vector indexes on these kinds of dimension tables that give dramatically better browsing performance among the various attributes than the old-fashioned B-tree indexing schemes.
The biggest customer dimensions, however, are at least ten times bigger than the biggest product dimensions. Every large company that deals with individual human customers has a customer list ranging from a few million to as many as 100 million records. This dimension cannot be compressed or summarized. The customer list dimension drives just about every interesting fact table in these businesses and must be available at the lowest level of granularity in order to subset the business based on detailed customer characteristics.
Unfortunately, big customer lists are even more likely to "slowly change" than the medium-size product lists. Retailers are eager to update their customer information periodically. Insurance companies must update their information about their customers, their insured automobiles, and their insured homes, because it is critical to have an accurate description of these items at the time a policy is approved and when a claim is made. Figure 1 (see page 16) shows a typical customer list, with "hot" demographic fields that are especially interesting to track as they change. It would seem, then, that you are caught between a rock and a hard place. You must track the slowly changing nature of the customer list, but you don't dare use either of the techniques appropriate for medium-size dimensions because of the table size.
The solution to this dilemma is to break off the hot customer attributes into their own separate "demographics" dimension table as shown in Figure 2 (see page 16). You leave behind "constant" information in the original customer table. You gain the advantage of being able to track the changing nature of the customer descriptions, but you make some compromises as well.
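To make the idea concrete, here is a minimal sketch of the split schema, using SQLite purely for illustration. The table and column names are my own stand-ins; your actual constant and "hot" fields are whatever appears in Figures 1 and 2.

```python
# A minimal sketch of the split schema, with illustrative column names.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (                    -- constant descriptors stay here
    customer_key            INTEGER PRIMARY KEY,
    name                    TEXT,
    address                 TEXT,
    city                    TEXT,
    state                   TEXT,
    date_of_birth           TEXT,
    date_of_first_purchase  TEXT
);

CREATE TABLE demographics (                -- new mini-dimension of banded "hot" attributes
    demographics_key        INTEGER PRIMARY KEY,
    age_band                TEXT,
    income_band             TEXT,
    purchases_band          TEXT,
    credit_band             TEXT,
    family_size_band        TEXT
);

CREATE TABLE sales_fact (                  -- every event carries BOTH keys
    time_key                INTEGER,
    product_key             INTEGER,
    customer_key            INTEGER REFERENCES customer(customer_key),
    demographics_key        INTEGER REFERENCES demographics(demographics_key),
    sales_dollars           REAL
);
""")
```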
First, you need to make a subtle change in order to successfully create this new demographics dimension. All of the continuously variable demographic measures, such as income and total purchases, must be converted to banded values. In other words, you force the attributes in this new dimension to have a relatively small number of discrete values. You then build the dimension with all possible discrete attribute combinations. For example, in Figure 2, if each of the five attributes has ten possible values, then the demographics dimension will have exactly 10^5 = 100,000 records. You also need to construct an artificial demographics key for this dimension. It may as well consist of the consecutive numbers from one to 100,000.
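Building every possible combination up front sounds laborious, but it is a one-time, purely mechanical step. Here is a sketch, with hypothetical band labels, of how the 100,000 rows and their consecutive artificial keys might be generated:

```python
# A sketch of building the demographics mini-dimension once, at creation
# time, as the Cartesian product of the banded values. Band labels are
# hypothetical; five attributes with ten values each yield 10**5 rows.
from itertools import product

age_bands      = [f"{10*i}-{10*i + 9}" for i in range(1, 11)]     # "10-19" .. "100-109"
income_bands   = [f"${10*i}K-${10*i + 9}K" for i in range(10)]    # "$0K-$9K" .. "$90K-$99K"
purchase_bands = [f"purchases band {i}" for i in range(1, 11)]
credit_bands   = [f"credit band {i}" for i in range(1, 11)]
family_bands   = [str(i) for i in range(1, 10)] + ["10+"]

demographics_rows = [
    (key,) + combo                                                 # artificial key first
    for key, combo in enumerate(
        product(age_bands, income_bands, purchase_bands, credit_bands, family_bands),
        start=1,                                                   # keys run 1 .. 100,000
    )
]

assert len(demographics_rows) == 10 ** 5
```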
Now, every time something interesting happens with one of your customers, such as a sales event or an insurance claim event, you put a record into the appropriate fact table describing this event. Two of the keys will always be customer and demographics, as shown in Figure 2. Because the decision to associate the demographics key with the customer key is made each time a record is placed into the fact table, you can change the demographics description of the customer as frequently as you wish. Amazingly, there is no added overhead in the dimension tables with this approach, because you built every possible demographics combination at table creation time. All you have to do is plug in the current correct demographics profile for that customer whenever an event occurs and a fact table record is created.
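Continuing the same illustrative sketch, the load step amounts to banding the customer's current raw measures, looking up the pre-assigned demographics key, and writing the fact record with both keys. The function and field names here are hypothetical.

```python
# Continuing the previous sketch: map each combination of band values
# back to its pre-assigned artificial key.
key_for_profile = {row[1:]: row[0] for row in demographics_rows}

def income_band(income):
    """Collapse a raw income figure into one of the ten fixed bands."""
    i = min(int(income) // 10_000, 9)
    return f"${10*i}K-${10*i + 9}K"

def record_sale(time_key, product_key, customer_key, raw_profile, dollars):
    """Build one sales fact row, stamped with the customer's CURRENT profile."""
    profile = (
        raw_profile["age_band"],
        income_band(raw_profile["income"]),
        raw_profile["purchases_band"],
        raw_profile["credit_band"],
        raw_profile["family_size_band"],
    )
    return (time_key, product_key, customer_key, key_for_profile[profile], dollars)
```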
The big advantage of this approach is that you can support very frequent snapshotting of customer profiles with no increase in data storage or data complexity as you increase the number of snapshots. But there are some trade-offs. First, you are forced to clump the demographic attributes into banded ranges of discrete values. This places a limit on the specificity of the data (such as income), and makes it impractical to change to a different set of value bands at a later time. Once you decide on your bands, you are stuck with them. Second, the demographics dimension itself cannot be allowed to grow too large. The 100,000 predetermined values in the example are a reasonable upper limit on the total number of possible values. There are certainly some cases in which you need more than five demographic attributes with ten values each. Surprisingly, a workable solution to this problem is to build a second demographics dimension. At a large retailer recently, I faced the problem of two sets of demographics measures. One set related to traditional income, family, and education measures of the customer, and the other related to volatile purchase and credit behavior measured while buying the retailer's products. All of these attributes combined resulted in several million possible combinations. I was worried that the demographics dimension approach was not going to work. Somewhat reluctantly, I proposed using two demographics dimensions. To my surprise, the resulting data warehouse was very successful. The retailer was very pleased with the flexibility of the data model and with the ability to track changes in both family demographics and customer behavior very responsively.
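In schema terms, the two-dimension version of the earlier sketch simply gives the fact table a second key. Again, the names below are illustrative rather than the retailer's actual fields.

```python
# Continuing the earlier SQLite sketch: a second mini-dimension for the
# volatile purchase/credit behavior bands, and a fact table that carries
# both demographics keys alongside the customer key.
conn.executescript("""
CREATE TABLE behavior (                    -- second mini-dimension
    behavior_key            INTEGER PRIMARY KEY,
    purchase_frequency_band TEXT,
    credit_behavior_band    TEXT
);

CREATE TABLE sales_fact_v2 (               -- one key per mini-dimension
    time_key                INTEGER,
    product_key             INTEGER,
    customer_key            INTEGER,
    demographics_key        INTEGER,       -- income/family/education bands
    behavior_key            INTEGER,       -- purchase/credit behavior bands
    sales_dollars           REAL
);
""")
```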
A third potential drawback to this approach is that we have separated the hot demographics data from the more constant descriptors of the customer, such as geographic location and birth date. Now it is harder to browse the data as if it were a single dimension implemented as a flat file. The demographics data can now only be browsed along with the more constant customer data by linking through the fact table. This will usually be a slower and more expensive browse than if all the data were in a single dimension table. I don't think this is a killer criticism, however. Relational systems will give the correct answer to the browse; they simply may be a little slow. This seems like a reasonable price to pay for the freedom to track the customers so responsively.
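To see what that browse costs, here is roughly the query a combined browse now requires, using the illustrative table names from the first sketch. With a single flat customer dimension, the same question would be answered by scanning one table; here the join must traverse the fact table.

```python
# A combined browse of constant customer attributes and hot demographic
# bands must now link through the fact table (illustrative names).
browse_sql = """
SELECT c.state, d.income_band, COUNT(DISTINCT c.customer_key) AS customers
FROM   sales_fact   f
JOIN   customer     c ON c.customer_key     = f.customer_key
JOIN   demographics d ON d.demographics_key = f.demographics_key
GROUP  BY c.state, d.income_band
"""
rows = conn.execute(browse_sql).fetchall()   # empty here, since no rows were loaded
```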
Finally, sharp-eyed readers may point out that when using this approach you can associate demographics with the customer only when you actually generate a fact table record, such as a sales event or an insurance claim event. Theoretically, if a customer generates no sales, there are no fact table records for that customer, and you will never be able to link the demographics to that customer. While this is technically true, there are easy ways around this dilemma. If sales are so sparse that you actually are in danger of missing a demographics measurement, then all you need to do is define a demographics transaction event. This event has no sales dollars associated with it, but it serves as a good place to register a new demographics profile of the customer. Fact table records are cheap. There are billions of them in fact tables anyway.
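In the illustrative sketch above, such an event is nothing more than a fact record with zero dollars:

```python
# A sketch of the "demographics transaction" escape hatch: when a new
# profile arrives with no sale attached, register it as a zero-dollar
# fact row so the customer-to-demographics link is never lost.
def record_demographics_change(time_key, customer_key, raw_profile):
    """Zero-dollar event whose only job is to snapshot the new profile."""
    return record_sale(time_key,
                       product_key=None,    # no product involved
                       customer_key=customer_key,
                       raw_profile=raw_profile,
                       dollars=0.0)
```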
In my last few columns, I have focused on end-user analysis issues and dimension modeling issues. This is where the application analyst MBAs hang out. Next month I will move decisively into the back room, where the DBAs hang out. The toughest problem in data warehousing needs to be faced: the dreaded data extraction problem.