DBMS - June 1996
Data Warehouse Architect By Ralph Kimball

Mastering Data Extraction

Steps you must complete when migrating legacy data into a data warehouse.

This month I tackle the longest and riskiest part of building a data warehouse: building the production data extract system. At the beginning of a data warehouse project, most of you are pretty good at estimating how big the databases will be and planning for all the hardware and software components. But when you finally have received your funding and bought all the pieces, you often stumble trying to deliver the first results. Most often this happens because you underestimate the complexity and resources required to build the data extract system.

To understand why this happens, I'll decompose the data extract system. Figure 1 shows a typical breakdown of the extraction steps for a dimensional data warehouse that is being fed by a series of mainframe legacy applications. You should assume a generic orders-and-shipments kind of business in this discussion, although these extract steps apply across all forms of data warehouses. In this example, the first five steps take place on the mainframe and the last seven steps take place on the target data warehouse system, which might be a large Unix server. Step 6 is the migration step between the two machines, but this step could occur as early as step 2 and as late as step 8, depending on where the data warehouse team can most conveniently arrange for the computing resources. Let's look briefly at each step.

1. Read the legacy data. The first challenge is reading the data! If you are lucky enough to be dealing with an open and well-documented legacy application, this step is easy. You can read all of the legacy application system files and you know the meaning of each data field. More typically, you may be able to read the data physically, but there are many fields you don't understand. Furthermore, you may not be sure when the various fields are updated by the legacy application. In the worst case, you can't read the data at all! Perhaps you are dealing with a third-party proprietary application for which the vendor has not revealed the formats of the underlying files. In this case you are reduced to running reports or "extract jobs" that contain only the content and formats the application vendor allows.
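
If the files do turn out to be readable, this step usually amounts to decoding fixed-length records against a documented (or reverse-engineered) field layout. The following Python sketch shows the idea; the layout, the EBCDIC code page, and the file name are assumptions made for the example, not features of any particular legacy system.

# Minimal sketch: decode fixed-width legacy records using a hypothetical layout.
# The field offsets, the code page, and the file name are illustrative assumptions.

LAYOUT = [                    # (field name, start offset, length)
    ("order_id",     0, 10),
    ("customer_id", 10,  8),
    ("order_date",  18,  8),  # YYYYMMDD
    ("amount",      26, 11),  # zoned decimal, two implied decimal places (simplified)
]

def parse_record(raw_bytes):
    """Convert one fixed-width record into a field dictionary."""
    text = raw_bytes.decode("cp037")          # assume EBCDIC code page 037
    record = {}
    for name, start, length in LAYOUT:
        record[name] = text[start:start + length].strip()
    return record

def read_legacy_file(path, record_length=37):
    """Yield parsed records from a fixed-length-record file."""
    with open(path, "rb") as f:
        while True:
            raw = f.read(record_length)
            if len(raw) < record_length:
                break
            yield parse_record(raw)

# Example: for rec in read_legacy_file("ORDERS.DAT"): print(rec)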

2. Decide what changed. It is critical to isolate the changed data from the legacy system as early in the extract process as possible. This step drastically reduces the amount of data to be migrated down to the data warehouse. But changed data is more than just the new orders that have been entered. Changed data also includes new customer profiles, new product descriptions, and new relationships between customers, products, and suppliers. Ideally the legacy system timestamps all of its underlying records, and is capable of isolating and providing these changed records as a simple extraction step. More likely, the changed records are not timestamped, and the data warehouse team must compare the master customer file and the master product file from the legacy system against "gold copies" to see which records have changed. The gold copies are, in a sense, huge tables of metadata.

The worst legacy systems are not transaction systems at all. There may be no easy way to isolate the changed data without a lot of application programming.
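
When no timestamps are available, the gold-copy comparison can be as simple as diffing today's master file against yesterday's saved copy, record key by record key. The sketch below illustrates one way to do it in Python; the record layout and the in-memory dictionaries standing in for the master files are assumptions of the example.

# Sketch of gold-copy change detection: compare today's master file
# against yesterday's saved copy, keyed on the production key.
# Record layout and key field are illustrative assumptions.

def detect_changes(gold_copy, current):
    """gold_copy and current are dicts of key -> record (itself a dict)."""
    new_records, changed_records, deleted_keys = [], [], []
    for key, record in current.items():
        old = gold_copy.get(key)
        if old is None:
            new_records.append(record)        # brand-new customer or product
        elif old != record:
            changed_records.append(record)    # attributes changed since yesterday
    for key in gold_copy:
        if key not in current:
            deleted_keys.append(key)          # dropped from the master file
    return new_records, changed_records, deleted_keys

gold  = {"C001": {"name": "Acme", "city": "Dayton"}}
today = {"C001": {"name": "Acme", "city": "Columbus"},   # changed
         "C002": {"name": "Zenith", "city": "Omaha"}}    # new
print(detect_changes(gold, today))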

3. Generalize keys for slowly changing dimensions. If the data warehouse team decides to track changing customer descriptions or changing product descriptions by adding new dimension records, then the respective keys must be generalized. I discussed this technique of adding new dimension records in my April column. (See "Slowly Changing Dimensions," DBMS, April 1996, page 18.) In that article I showed that generalizing the key could be as simple as adding two or three version digits at the end of the production key supplied by the legacy system, or it could be as complicated as creating a new anonymous key. In either case, you need a "key administration application" in your data extract pipeline to keep track of these keys as you generate them.
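
As a concrete illustration of the version-digit approach, the sketch below appends a two-digit version to the production key and records each assignment in a small key-administration table. The in-memory registry is an assumption of the example; a real pipeline would persist these assignments between loads.

# Sketch of a key administration step for slowly changing dimensions:
# append a two-digit version to the legacy production key whenever the
# customer description changes. The in-memory registry is illustrative.

key_registry = {}   # production key -> (current version, last seen attributes)

def generalize_key(production_key, attributes):
    version, last_attrs = key_registry.get(production_key, (0, None))
    if last_attrs is None:
        version = 1                      # first time we see this key
    elif attributes != last_attrs:
        version += 1                     # description changed: new dimension record
    key_registry[production_key] = (version, attributes)
    return "%s%02d" % (production_key, version)   # e.g. "C00123" + "02"

print(generalize_key("C00123", {"city": "Dayton"}))    # C0012301
print(generalize_key("C00123", {"city": "Dayton"}))    # C0012301 (unchanged)
print(generalize_key("C00123", {"city": "Columbus"}))  # C0012302 (new version)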

4. Combine separate sources under each record key. In most cases the legacy applications are implemented as many separate files. The fact table records will often be fairly obvious one-to-one transformations of legacy records, but for the dimension tables, you often must go through a significant process of denormalizing and conforming separate inputs into single records.
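
The sketch below shows the flavor of this step: the pieces of a customer description scattered across several legacy files are merged, by production key, into one denormalized dimension record. The file contents and field names are assumptions of the example.

# Sketch of conforming several legacy inputs into one denormalized
# customer dimension record per key. File contents and field names
# are illustrative assumptions.

customer_master = {"C001": {"name": "Acme Mfg", "region": "Midwest"}}
credit_file     = {"C001": {"credit_limit": "50000"}}
sales_territory = {"C001": {"rep": "J. Smith", "district": "Ohio Valley"}}

def combine_sources(key, *sources):
    """Merge the pieces for one key into a single dimension record."""
    record = {"customer_key": key}
    for source in sources:
        record.update(source.get(key, {}))   # later sources may fill or override fields
    return record

print(combine_sources("C001", customer_master, credit_file, sales_territory))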

5. Create load record images. In this step you prepare record-by-record images of the final base data to be loaded. You have not yet generated any aggregate (summary) records. All the source data fields such as dates, numbers, and currencies are unpacked and expressed in standard formats. If the final target database record has 18 fields in it, then you make sure that all 18 fields are filled in at this time.
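
A load-image builder might look like the following sketch, which unpacks dates and implied-decimal amounts into standard formats and refuses to emit a record until every target column is filled. The target column list and field formats are assumptions of the example.

# Sketch of building a load record image: unpack the raw legacy fields
# into standard formats and guarantee that every target column is filled.
# The target column list and formats are illustrative assumptions.

from datetime import datetime
from decimal import Decimal

TARGET_COLUMNS = ["order_key", "customer_key", "product_key",
                  "order_date", "ship_date", "dollars", "units"]

def build_load_image(raw):
    image = {
        "order_key":    raw["order_id"],
        "customer_key": raw["customer_key"],
        "product_key":  raw["product_key"],
        "order_date":   datetime.strptime(raw["order_date"], "%Y%m%d").date().isoformat(),
        "ship_date":    datetime.strptime(raw["ship_date"], "%Y%m%d").date().isoformat()
                        if raw.get("ship_date") else "9999-12-31",   # standard "not shipped" date
        "dollars":      str(Decimal(raw["amount"]) / 100),           # implied two decimal places
        "units":        int(raw["units"]),
    }
    # Every target column must be present before the record is written out.
    assert all(col in image for col in TARGET_COLUMNS)
    return [image[col] for col in TARGET_COLUMNS]

print(build_load_image({"order_id": "O1", "customer_key": "C0012302",
                        "product_key": "P77", "order_date": "19960601",
                        "ship_date": "", "amount": "1299", "units": "3"}))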

6. Migrate the data from the mainframe to the data warehouse server. You can move the data from the mainframe system to the data warehouse system at any point between step 2 and step 8, but it is good to delay the migration at least to this point. Often the access methods and data transformation routines needed to unpack, compare, combine, and transform the data are more naturally available on the original machine that created the data rather than on a machine downstream with a different operating system.

7. Sort and re-sort the load record images to create aggregate records. You should create aggregate (summary) records outside of the DBMS whenever possible for a couple of reasons. First, creating aggregates is nothing more than sorting the data on one or more of its attributes and then producing break rows while adding up the data in a linear pass. Each type of aggregate requires a separate sorting pass. You can't easily produce both product aggregates and customer geography aggregates in the same linear pass through the data. It would be normal to re-sort the load record images at least a dozen times during the aggregate creation step. I prefer to do this sort outside of the DBMS because dedicated sort utilities such as Syncsort Corp.'s SyncSort are usually vastly better at sorting than are relational DBMSs. SyncSort, for example, runs on a variety of platforms, including MVS and Unix. You can also perform this step on the mainframe, before the migration step.

The second reason for creating aggregate records prior to the load is that these aggregate records can then be included in the high-speed bulk load (step 9). Not only is this faster than inserting the aggregate records one at a time from a DBMS application, but you can save these aggregate records along with the base-level records as part of the daily incremental snapshot of data. In the event of a mass recovery of data, you don't have to recalculate the aggregates; they can simply be reloaded.
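
The sort-then-break pass itself is straightforward, as the sketch below shows; a dedicated sort utility would do the sorting at production volumes, and Python's built-in sort merely stands in here. Note that product aggregates and geography aggregates each require their own pass. The field names are assumptions of the example.

# Sketch of one aggregate pass: sort the load record images on the
# aggregation attribute, then add up the measures at each break row.
# Field names are illustrative assumptions.

from itertools import groupby
from operator import itemgetter

base_records = [
    {"product": "widget", "region": "East", "dollars": 100.0, "units": 4},
    {"product": "widget", "region": "West", "dollars": 250.0, "units": 9},
    {"product": "gadget", "region": "East", "dollars":  75.0, "units": 1},
]

def aggregate_on(records, attribute):
    """One linear pass per aggregate type: sort, then sum within each break group."""
    records = sorted(records, key=itemgetter(attribute))
    rollup = []
    for value, group in groupby(records, key=itemgetter(attribute)):
        group = list(group)
        rollup.append({attribute: value,
                       "dollars": sum(r["dollars"] for r in group),
                       "units":   sum(r["units"]   for r in group)})
    return rollup

print(aggregate_on(base_records, "product"))   # product aggregates
print(aggregate_on(base_records, "region"))    # separate pass for geography aggregates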

8. Generalize keys for aggregate records. The aggregate records need keys. This step is different from step 3. The keys for aggregate records need to be completely artificial and must not conflict with keys for base-level records. The data warehouse team needs to build an application for generating and keeping track of these aggregate keys.
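
One possible convention, shown in the sketch below, is to assign aggregate keys from a reserved prefix that no base-level production key can carry, so the two key spaces can never collide. The prefix and key format are assumptions of the example.

# Sketch of an aggregate key generator. The convention here -- a reserved
# prefix that no base-level production key can carry -- is one possible
# scheme, shown only to illustrate keeping the two key spaces disjoint.

class AggregateKeyAdmin:
    def __init__(self, prefix="AGG"):
        self.prefix = prefix
        self.next_id = 1
        self.assigned = {}          # aggregate definition -> artificial key

    def key_for(self, definition):
        """definition is a tuple such as ('product', 'widget', 'ALL regions')."""
        if definition not in self.assigned:
            self.assigned[definition] = "%s%08d" % (self.prefix, self.next_id)
            self.next_id += 1
        return self.assigned[definition]

admin = AggregateKeyAdmin()
print(admin.key_for(("product", "widget", "ALL")))   # AGG00000001
print(admin.key_for(("product", "gadget", "ALL")))   # AGG00000002
print(admin.key_for(("product", "widget", "ALL")))   # same key, reused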

9. Bulk load all the records with referential integrity turned on. When all of the input records have finally been assembled, they should be loaded with the DBMS's bulk data loader, ideally exploiting parallel processing at both the hardware and software levels. It is desirable to inhibit index updating during this step so that the bulk loading can be done at the fastest possible speed. Several DBMSs, including DB2, Informix, and Red Brick, allow table indexes to be segmented so that only a small portion of the index is dropped during the load process and rebuilt afterward.

It is very important during this step to enforce referential integrity. In a star-join schema, referential integrity means making sure that the components of the fact table key are always true foreign keys with respect to each of their dimension tables. In a billion-row fact table, if you ever load a record that violates referential integrity, you probably will never find and fix it. It is simply too expensive to go looking for random fact records whose key components are "not in" various dimension tables.
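
Because a bad fact record is so hard to find after the fact, it is worth screening the fact-key components against the dimension keys before the file ever reaches the bulk loader, as a complement to the loader's own integrity checking. The sketch below shows such a pre-load screen; the key names and in-memory key sets are assumptions of the example.

# Sketch of screening fact records for referential integrity before the
# bulk load: every key component must already exist in its dimension
# table. Key names and key sets are illustrative assumptions.

dimension_keys = {
    "customer_key": {"C0012301", "C0012302"},
    "product_key":  {"P77", "P78"},
    "date_key":     {"1996-06-01", "1996-06-02"},
}

def screen_fact_records(fact_records):
    clean, exceptions = [], []
    for record in fact_records:
        bad = [k for k, valid in dimension_keys.items() if record[k] not in valid]
        if bad:
            exceptions.append((record, bad))   # route to the load-exception step
        else:
            clean.append(record)
    return clean, exceptions

facts = [
    {"customer_key": "C0012302", "product_key": "P77", "date_key": "1996-06-01", "dollars": 12.99},
    {"customer_key": "C9999999", "product_key": "P77", "date_key": "1996-06-01", "dollars": 45.00},
]
clean, exceptions = screen_fact_records(facts)
print(len(clean), "loadable;", len(exceptions), "sent to exception processing")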

10. Process load exceptions. Invariably you will have some records that fail the load process. Usually they fail the referential integrity check. You must collect and process these records by fixing the bad components of the fact table keys. This load exception processing is a separate application.

11. Index the newly loaded records. Once all of the data is loaded, you must rebuild all of the affected indexes. You may need an application to keep track of the status of these indexes.

12. Quality assure the data load. Just before the data warehouse goes "live," you must quality assure the newly loaded data to ensure that it is fit for consumption. Remember that you have the same responsibilities as a traditional publisher. You are a professional, and your readers (users) are counting on you to make sure that the data is complete and of high quality. The quality assurance step is a series of DBMS reports that count and graph the main dimensional buckets in the data to check whether they are within reasonable bounds and whether they tie to "flash totals" from the legacy systems.
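
A typical check totals the newly loaded facts by a major dimension and compares each bucket against the legacy system's flash totals, flagging anything outside a tolerance. The sketch below shows the idea; the totals and the one-percent tolerance are assumptions of the example.

# Sketch of a quality assurance check: total the newly loaded facts by a
# major dimension and compare against the legacy system's flash totals.
# Totals and tolerance are illustrative assumptions.

loaded_totals = {"East": 1250400.00, "West": 983200.00, "Central": 15.00}
flash_totals  = {"East": 1250400.00, "West": 983150.00, "Central": 875300.00}
TOLERANCE = 0.01   # allow a one percent discrepancy

def qa_report(loaded, flash, tolerance):
    problems = []
    for bucket, expected in flash.items():
        actual = loaded.get(bucket, 0.0)
        if expected == 0 or abs(actual - expected) / expected > tolerance:
            problems.append("%s: loaded %.2f vs flash total %.2f" % (bucket, actual, expected))
    return problems

for line in qa_report(loaded_totals, flash_totals, TOLERANCE):
    print("QA FAILURE:", line)     # e.g. Central is far below the flash total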

13. Publish. The publishing step is a manual, or perhaps even automated, email message sent to all users each morning, summarizing the status of the previous day's load. This is a very valuable step and is much appreciated by the users. It is part of your publishing responsibility.
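
An automated version of the morning status message can be as simple as the sketch below; the mail host, addresses, and wording are placeholders for the example.

# Sketch of the automated morning status message. The mail host,
# addresses, and wording are placeholders for the example.

import smtplib
from email.message import EmailMessage

def publish_status(rows_loaded, exceptions, load_date):
    msg = EmailMessage()
    msg["Subject"] = "Data warehouse load status for %s" % load_date
    msg["From"] = "warehouse-admin@example.com"
    msg["To"] = "warehouse-users@example.com"
    msg.set_content(
        "Last night's load is complete.\n"
        "Rows loaded: %d\n"
        "Load exceptions held for reprocessing: %d\n" % (rows_loaded, exceptions))
    with smtplib.SMTP("mailhost.example.com") as server:
        server.send_message(msg)

# publish_status(1843220, 37, "1996-06-14")   # example invocation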

Stepping Out

This list of steps varies from data warehouse to data warehouse, and may be arranged in a different order, but it is difficult to leave out any of these steps. The key observation is that there are many separate applications and administrative steps in this pipeline. It is very difficult to plan the development of this suite of extract steps without a detailed analysis. Only after the extract pipeline has been decomposed into all of its individual programming steps, and the underlying issues of the data have been teased out, can the data warehouse team accurately plan this development.

There are a number of powerful (and expensive) data extraction tools available to help automate this process. They can be quite helpful when their capabilities exactly match your data extraction needs. None of the available tools automates all 13 steps. In next month's special product review issue of DBMS, you will see how some of the more capable extraction tools match up to the entire data warehouse extraction task.


Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.
