DBMS - July 1996
Data Warehouse Architect by Ralph Kimball

Automating Data Extraction

Ralph reviews Carleton Passport, ETI Extract, and Prism Warehouse Manager.

In last month's column I described the architecture of data extraction from a legacy system into a data warehouse. I described 13 steps that encompass your complete responsibility from accessing the legacy data all the way to publishing the resulting changes in the final data warehouse. These 13 steps take place every time you update your data warehouse, potentially every day. The 13 steps are:

1. Read the legacy data.
2. Decide what changed.
3. Generalize keys for slowly changing dimensions.
4. Combine separate sources under each record key.
5. Create load record images.
6. Migrate the data from the mainframe to the data warehouse server.
7. Sort and re-sort the load record images to create aggregate records.
8. Generalize keys for aggregate records.
9. Bulk load all the records with referential integrity turned on.
10. Process load exceptions.
11. Index the newly loaded records.
12. Quality assure the data load.
13. Publish.
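
To make the flow of these steps concrete, here is a minimal sketch (in Python) of a nightly load driver that runs the steps in order and refuses to publish if any step fails. The step functions and their names are hypothetical placeholders; none of the tools discussed below generates this code.

    # Hypothetical nightly load driver: run the 13 steps in sequence and
    # stop, without publishing, as soon as any step fails.
    import logging

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    log = logging.getLogger("nightly_load")

    def read_legacy_data(): ...      # step 1: pull the source extracts
    def detect_changes(): ...        # step 2: diff against yesterday's snapshot
    # ...the remaining eleven step functions would be defined the same way

    STEPS = [
        ("read legacy data", read_legacy_data),
        ("detect changes", detect_changes),
        # one ("name", function) entry per remaining step, in order
    ]

    def run_nightly_load():
        for name, step in STEPS:
            log.info("starting step: %s", name)
            try:
                step()
            except Exception:
                log.exception("step failed: %s; the load is NOT published", name)
                return False
        log.info("all steps succeeded; publishing the load")
        return True

    if __name__ == "__main__":
        run_nightly_load()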

During the last month, I have studied the product literature and talked to the marketing executives from three leading extraction tool providers: Carleton Corp., Evolutionary Technologies International (ETI), and Prism Solutions. These data extraction tool providers all aim to automate the process of accessing, transforming, and migrating data. I have tried to take an IS user's perspective in judging what they do and how well their tools map to my overall tasks and worries. As you will see, these tools automate a number of the most difficult tasks in my list of 13 steps, but they also leave me with a lot of messy work to do on my own.

All three tools offer a similar architecture for controlling the data extraction process, shown in Figure 1 (page 18). A PC or a workstation client acts as the controller for the extraction process. In the most typical configuration, the client PC generates Cobol programs and associated JCL for an IBM host that reads and manipulates source data, and prepares the data for migration and loading into a target DBMS on another machine. All three systems rely extensively on metadata to describe the source and target databases, and control the running of the extract process.
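
As a rough illustration of the kind of metadata involved, here is a small Python sketch of one source-to-target mapping. The dataset names, field names, and transformation names are invented for illustration and do not reflect any vendor's actual metadata schema.

    # Hypothetical metadata for one source-to-target mapping. Real tools keep
    # far richer descriptions (copybooks, data types, transformation rules,
    # and schedules) than this sketch suggests.
    from dataclasses import dataclass

    @dataclass
    class ColumnMap:
        source_field: str    # field in the legacy record layout
        target_column: str   # column in the warehouse fact or dimension table
        transform: str       # symbolic name of the transformation to apply

    SALES_MAPPING = {
        "source": {"system": "MVS", "dataset": "PROD.SALES.DAILY", "layout": "fixed"},
        "target": {"dbms": "warehouse", "table": "sales_fact"},
        "columns": [
            ColumnMap("STORE-NBR", "store_key", "lookup_store_key"),
            ColumnMap("SKU", "product_key", "lookup_product_key"),
            ColumnMap("QTY-SOLD", "units_sold", "to_integer"),
            ColumnMap("EXT-PRICE", "dollar_sales", "packed_decimal_to_number"),
        ],
    }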

The flagship products from these three vendors include:

* Passport, from Carleton Corp.
* Extract, from Evolutionary Technologies International (ETI)
* Warehouse Manager, from Prism Solutions Inc.

Detecting the Differences

One of the challenges an IS consumer faces is that it is difficult to tell the differences among these three products from their outward descriptions. They all address the central problem of extracting data to a data warehouse. Following are some of the differences that stand out on closer examination.

Client Platform. Carleton and Prism run on Windows 3.1 and OS/2 PCs, while ETI runs only on Unix workstations running X-Windows and Motif.

Host Platforms. All three tools sport a long list of IBM mainframe systems, Unix-flavor DBMSs, and diverse data file formats that they can read from and write to. The lists include sequential file formats and the major mainframe databases such as IMS, IDMS, and DB2, as well as data warehouse DBMSs such as Red Brick, Oracle, Informix, Sybase, and others. All of the tools also navigate the mainframe sources with the help of data repositories and mainframe data dictionaries such as the Platinum Repository from Platinum Technology Inc. and the Predict Repository from Software AG of North America Inc. Although each vendor's list certainly differs in detail, the message from all three vendors is similar: Each supports many diverse source file formats.

Metadata Storage. Carleton and Prism store their metadata in whatever DBMS the user wishes. ETI stores its metadata only in an internal format on the Unix workstation.

Merging Multiple Data Sets. All three vendors stress how well their tools read from multiple inputs and write to multiple outputs in a single execution pass.
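
As a sketch of what a single-pass merge and split looks like, the following Python fragment reads several input files once and routes each record to one of several output files. The file layout and the "type" field are made up for illustration and are not how any of the tools actually organizes its work.

    # Hypothetical single-pass routing: one read of the input files feeds
    # several output files at once, instead of re-reading the sources
    # once per target.
    import csv

    def route_records(input_paths, routes):
        """routes maps a record's 'type' field to an open csv.DictWriter."""
        for path in input_paths:
            with open(path, newline="") as source:
                for record in csv.DictReader(source):
                    writer = routes.get(record["type"])
                    if writer is not None:
                        writer.writerow(record)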

Support for Data Loading. All three vendors support the generation and execution of scripts that load data into the designated data warehouse on the target machine.
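
A minimal Python sketch of what generating and executing a load script can amount to is shown below. The loader command name and the control-file syntax are generic placeholders, not any vendor's or DBMS's actual output.

    # Hypothetical sketch: write a bulk-load control script for the target DBMS,
    # run the loader, and report whether the load succeeded.
    import subprocess

    def generate_load_script(table, datafile, script_path):
        # Generic placeholder syntax; a real tool emits the target DBMS's own loader dialect.
        with open(script_path, "w") as script:
            script.write(f"LOAD DATA INFILE '{datafile}'\n")
            script.write(f"INTO TABLE {table}\n")
            script.write("FIELDS TERMINATED BY ','\n")

    def execute_load(script_path):
        # "bulkload" stands in for whatever loader utility the warehouse DBMS provides.
        result = subprocess.run(["bulkload", "-control", script_path])
        return result.returncode == 0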

Determining What Changed in the Legacy Data and Generating New Keys. Prism is the only vendor that tackles this problem head on. The Prism Change Manager automatically handles timestamped mainframe database records and mainframe transaction log files. However, if your legacy system provides no direct clues as to what has changed (say, in a master product list), then all three extraction tools leave you with a manual programming job to compare today's legacy data with a scratch copy from yesterday, and then decide what to do with newly inserted, deleted, or changed records. Additionally, if you are confronted with a "slowly changing dimension" (see my April 1996 column, page 18) where the legacy system has overwritten one or more of the product attributes (for example), then you are on your own if you decide to generalize the production key and keep both the old and new descriptions of the product. All of these nuances add up to a significant amount of hand-programming by the data warehouse extraction team. It appears that once you have written the custom modules for handling these comparison and key-generation steps, all three of the extract tools will incorporate them into the overall extraction job logic.
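
For readers who want to see what this hand-programming looks like in miniature, here is a hypothetical Python sketch of the comparison and key-generalization logic. It assumes yesterday's product master is kept as a scratch copy and that the warehouse assigns its own surrogate keys; it illustrates the approach, not any vendor's implementation, and it ignores deleted records for brevity.

    # Hypothetical sketch of steps 2 and 3: compare today's product master with
    # yesterday's scratch copy and issue a new surrogate (generalized) key
    # whenever a product is new or its attributes have been overwritten.
    import itertools

    next_surrogate = itertools.count(start=2)   # surrogate key 1 is already used below

    def diff_and_generalize(yesterday, today, key_map):
        """yesterday and today map production key -> attribute dict;
        key_map maps production key -> current surrogate key."""
        new_dimension_rows = []
        for prod_key, attrs in today.items():
            if prod_key not in yesterday or attrs != yesterday[prod_key]:
                # New product, or attributes overwritten in the source:
                # keep any old row and add a new one under a new surrogate key.
                key_map[prod_key] = next(next_surrogate)
                new_dimension_rows.append((key_map[prod_key], prod_key, attrs))
        return new_dimension_rows

    # usage with made-up records: the category of product A100 was overwritten
    yesterday = {"A100": {"desc": "Widget", "category": "Hardware"}}
    today = {"A100": {"desc": "Widget", "category": "Tools"},
             "B200": {"desc": "Gadget", "category": "Hardware"}}
    key_map = {"A100": 1}
    print(diff_and_generalize(yesterday, today, key_map))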

Building and Managing Aggregates. Although all of the tools can easily produce summary records (aggregates), none is tightly integrated with the data warehouse aggregate navigators. In my October 1995 column on aggregate navigation I remarked that navigators such as HP's Intelligent Warehouse generate some very interesting metadata describing which aggregates should be built to improve data warehouse performance. In that column I stated that there was no immediate link to the extraction tools for automatically building these suggested aggregates. Nine months later, this still appears to be the case.
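
As a small illustration of what building an aggregate amounts to, the following Python sketch rolls detailed load records up to a category-by-day grain and assigns surrogate keys to the resulting aggregate rows. The record layout is invented for the example.

    # Hypothetical sketch of steps 7 and 8: summarize detail records at a
    # coarser grain and assign surrogate keys to the new aggregate rows.
    from collections import defaultdict

    def build_category_day_aggregates(detail_rows, category_of):
        """detail_rows: (product_key, store_key, day_key, units, dollars) tuples;
        category_of maps product_key -> category name."""
        totals = defaultdict(lambda: [0, 0.0])
        for product_key, _store_key, day_key, units, dollars in detail_rows:
            bucket = totals[(category_of[product_key], day_key)]
            bucket[0] += units
            bucket[1] += dollars
        # assign a surrogate key to each new (category, day) aggregate record
        return [(agg_key, category, day, units, dollars)
                for agg_key, ((category, day), (units, dollars))
                in enumerate(sorted(totals.items()), start=1)]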

Interfacing to Query Tool Metadata. Prism and Carleton provide useful interfaces to end users at query time through the Prism Directory Manager and Carleton's End User Browsing facility in Passport.

Handling Load Exceptions. None of the three tools directly controls the target DBMS data load process or watches over the exceptions that result from referential-integrity violations. However, in fairness to the vendors, they all said that they would rather check for referential integrity much earlier in the extraction process than at the final DBMS load. All three tools can validate key values, although it is unclear how automated this capability is.
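
A minimal Python sketch of such an early referential-integrity check follows. It assumes the dimension keys are available as in-memory sets, which is an illustration of the idea rather than how any of the tools actually works.

    # Hypothetical sketch: validate fact-record keys against the dimension
    # tables before the bulk load, diverting violators to an exception list
    # instead of letting the DBMS reject them at load time.
    def precheck_referential_integrity(fact_rows, product_keys, store_keys):
        clean, exceptions = [], []
        for row in fact_rows:
            if row["product_key"] in product_keys and row["store_key"] in store_keys:
                clean.append(row)
            else:
                exceptions.append(row)
        return clean, exceptions

    # usage with made-up keys: product 9 does not exist in the product dimension
    facts = [{"product_key": 1, "store_key": 10, "units": 3},
             {"product_key": 9, "store_key": 10, "units": 1}]
    good_rows, exception_rows = precheck_referential_integrity(facts, {1, 2}, {10, 11})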

Quality Assuring (QA) the Final Load. None of the three tools directly launches quality assurance applications on the final loaded data in the target data warehouse. Such a quality assurance application would count various buckets in the newly loaded data and check for reasonableness. For instance, if the data load was supposed to represent a large retail chain with 300 stores, then a simple QA check would count the number of stores in yesterday's data. If the first check passed, then the overall number of products could be checked. If this second check passed, then a matrix of the number of products by each store could be checked. Another check would be the total sales volume in both units and dollars, first for the chain as a whole, and then by each store. If all these numbers fell within specified QA upper and lower bounds, then the "publisher" DBA would be fairly confident that the load was of high quality, and the users could trust the content.
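
Such a QA application is straightforward to sketch. The following hypothetical Python fragment runs a series of count checks against agreed bounds and only declares the load publishable if every check passes; the bounds and the stub count functions are invented stand-ins for real queries against the warehouse.

    # Hypothetical post-load QA: each check compares a measured count or total
    # against upper and lower bounds agreed on with the "publisher" DBA.
    def count_stores():    return 300      # stand-in for a count of stores in yesterday's load
    def count_products():  return 15000    # stand-in for a product-count query
    def total_dollars():   return 1.2e7    # stand-in for a chain-wide dollar total

    QA_CHECKS = [
        # (description, lower bound, upper bound, function returning the measured value)
        ("store count", 295, 305, count_stores),
        ("product count", 14000, 16000, count_products),
        ("chain dollar volume", 9.0e6, 14.0e6, total_dollars),
    ]

    def qa_load():
        for description, low, high, measure in QA_CHECKS:
            value = measure()
            if not low <= value <= high:
                print(f"QA failed: {description} = {value}, expected {low}..{high}")
                return False
        print("QA passed; the load can be published")
        return True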

Stepping back from the details, I came away feeling that the extraction tools were very good at automating "linear" data extraction and transformation, where a very complicated switchboard-like application may be needed to shuffle data out of well-understood legacy sources and rearrange it into target files for loading into a DBMS. However, the tools still have a way to go before they become the data warehouse DBA's best friend. The nightly data load consists of all 13 steps, and these steps have to be performed in as automated a way as possible. A perfect data load should run all the way to completion without any human intervention.

The extraction tools all do a good job of automating steps 1, 4, 5, and 6. With a fair amount of explicit programming, the tools can also automate steps 2, 3, 7, and 8. If step 9 (referential integrity) is moved up much earlier in the process, most of it can be subsumed by the extraction tools as well. This still leaves the DBA on his or her own with steps 10, 11, 12, and 13. To summarize, about one third of the process (4 steps out of 13) is exactly what the extraction doctor ordered. Another third of the steps (4 or perhaps 5 out of 13) can be automated with some difficulty and maybe a fair amount of programming. The last third is not well addressed by today's extraction tools, and the DBA is left to construct an ad hoc process.

Prism Gets the Nod

Of the three vendors, Prism gets the nod for being the most data warehouse aware, both in its literature and in its product features, especially with Prism's new Directory Manager software. Prism also benefits from the contributions of one of its founders, Bill Inmon, the father of data warehousing. Carleton is close behind and is clearly competing head to head with Prism. ETI has a reputation for having an extremely powerful product, but according to users I have talked to, it has a steep learning curve and takes significant effort to use effectively. Additionally, ETI requires a Unix X-Windows and Motif workstation environment for developing the control programs, rather than the ostensibly more familiar Intel platforms.


Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.
* Carleton Corp., 3 New England Executive Park, Burlington, MA 01803; 617-272-4310 or fax 617-272-2910; http://www.carleton.com.
* Evolutionary Technologies International, 4301 Westbank Dr., Austin, TX 78746; 512-327-6994 or fax 512-327-6117; http://www.evtech.com; [email protected].
* Prism Solutions Inc., 1000 Hamlin Ct., Sunnyvale, CA 94089; 408-752-1888 or fax 408-752-1875; http://www.prismsolutions.com.

