Data warehousing is beginning to develop a rich set of facilities and tools that distinguish it from its older cousin, transaction processing. As we in the data warehouse segment of the market understand more deeply what we are trying to coax out of our databases, a set of themes has begun to emerge in our query and analysis tools that has quite a different texture from the old transaction processing report writers of five years ago.
In this column I discuss a selected set of new high-end query tool capabilities appearing in various tools that I think are important to getting the data warehousing job done. This list of capabilities is by no means complete! Consider it Ralph Kimball's idiosyncratic view of the data warehouse market. All of these features are deep and powerful. It would be wonderful if members of the query tool vendor community would raise the bar and begin competing with each other on these types of features.
I am certain that many readers will be interested to know exactly which query tool vendors support these capabilities. Because it is difficult for me to conduct a fair, balanced industry-wide survey of query tool providers, I will use this column to ask the query tool providers: 1) to comment on whether they have these features, or equivalently whether they can provide the equivalent functionality in some other way, and 2) to suggest additional features that I should have included in my list. I will report on all responses I get to this request in a future column.
If you are wondering exactly who all the query tool providers are, check out Larry Greenfield's list. Larry maintains a wonderful industry resource for data warehousing on his Web site, "The Data Warehousing Information Center," at pwp.starnetinc.com/larryg/index.html. This site is a significant tool for all of us in the data warehouse industry, and I recommend it highly. I crawl through it myself several times per month.
Here are my serious query tool features:
Cross-Browsing of Dimension Attributes. Almost every use of a query tool against a data warehouse involves a characteristic two-step dance: First, you visit some or all of the dimension tables in your star join schema in order to set the constraints, and second, after setting the constraints, you launch the multi-table join involving several of the dimension tables together with the large central fact table. Cross-browsing is needed to perform the first step. A real dimension table, such as a list of all of your products or customers, is a large table with many, many attributes (fields). It is absolutely mandatory for a query tool to present, in real time, a list of the valid values in a dimension attribute (for example, product brand) and let the user choose one or more of the values to set a constraint. This basic browsing capability is now fairly standard in sophisticated query tools. Cross-browsing, on the other hand, refers to the capability of a query tool to present the valid values of the product brand, subject to a constraint elsewhere on that dimension table. In other words, you only want to browse the brand names of products in the salad dressing category. This ability to cross-browse distinguishes showroom demos from query tools used for serious querying in large data warehouse environments. If you can't cross-browse, you may ask for the descriptions of all of the salad dressings and get 16,000 of them (as I did a few years ago).
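The difference between a plain browse and a cross-browse can be sketched in a few lines of SQL. The following is a minimal illustration using Python's built-in sqlite3 module; the product table, its columns, and the sample rows are all hypothetical stand-ins for a real dimension table.

```python
import sqlite3

# Hypothetical miniature product dimension, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (product_key INTEGER, description TEXT, "
    "brand TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?)",
    [(1, "Zesty Italian 16oz", "Brand A", "Salad Dressing"),
     (2, "Creamy Ranch 12oz",  "Brand B", "Salad Dressing"),
     (3, "Cola 2L",            "Brand C", "Soft Drinks")])

# Plain browse: every brand in the dimension, regardless of other constraints.
all_brands = [r[0] for r in conn.execute(
    "SELECT DISTINCT brand FROM product ORDER BY brand")]

# Cross-browse: only the brands that survive the category constraint.
cross = [r[0] for r in conn.execute(
    "SELECT DISTINCT brand FROM product WHERE category = ? ORDER BY brand",
    ("Salad Dressing",))]

print(all_brands)  # ['Brand A', 'Brand B', 'Brand C']
print(cross)       # ['Brand A', 'Brand B']
```

The point is that the browse query itself carries the user's other dimension constraints in its WHERE clause, so the pick list shrinks to only the values that can actually occur.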
Open Aggregate Navigation. I discussed aggregate navigation extensively in DBMS in the November 1995 and July 1996 issues. Aggregate navigation is the ability to automatically choose pre-stored summaries, or aggregates, in the course of processing a user's SQL requests. Aggregate navigation must be performed silently and anonymously, without the end user or the application developer being aware that the aggregations even exist. Open aggregate navigation occurs when the aggregate navigation facility is a separate module that is available for all query tool clients simultaneously. In my opinion, nothing is worse, or more shortsighted, than an aggregate navigation facility embedded in a proprietary query tool and unavailable to other end-user clients. Unless the current proprietary aggregate navigators embedded in query tools are made into openly accessible modules, big DBMS vendors may take this business away from query tool providers.
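The core decision an aggregate navigator makes can be sketched as choosing the smallest pre-stored table whose grain can still answer the user's request. This is a simplified sketch, not any vendor's actual algorithm; the catalog of tables, their row counts, and the rollup relationships are all hypothetical.

```python
# Hypothetical catalog: table name -> (grain attributes, approximate rows).
CATALOG = {
    "sales_fact":           ({"product", "store", "day"}, 1_000_000_000),
    "sales_by_brand_day":   ({"brand", "day"},                5_000_000),
    "sales_by_brand_month": ({"brand", "month"},                200_000),
}

# Which requested attributes each stored grain can supply: a product-level
# table can answer brand queries, a day-level table can answer month queries.
ROLLUPS = {"product": {"product", "brand"}, "brand": {"brand"},
           "day": {"day", "month"}, "month": {"month"}, "store": {"store"}}

def navigate(requested):
    """Pick the smallest table whose grain covers every requested attribute."""
    candidates = []
    for table, (grain, rows) in CATALOG.items():
        answerable = set().union(*(ROLLUPS[g] for g in grain))
        if requested <= answerable:
            candidates.append((rows, table))
    return min(candidates)[1]  # fewest rows wins

print(navigate({"brand", "month"}))  # sales_by_brand_month
print(navigate({"product", "day"}))  # sales_fact
```

Making this module a separate, openly callable service is what turns it from a proprietary query-tool trick into open aggregate navigation: every client issues base-level SQL, and the navigator silently rewrites it against the winning table.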
Multipass SQL. To calculate comparisons or to correctly calculate non-additive measures in report break rows, the query tool must break the report down into a number of simple queries that are processed separately by the DBMS. The query tool then automatically combines the results of the separate queries in an intelligent way. Breaking a single complex request into several small requests is called multipass SQL. Multipass SQL also allows drilling across several conformed data marts in different databases, in which the processing of a single galactic SQL statement would otherwise be impossible. Finally, multipass SQL gives the aggregate navigator a chance to speed up the report, because each atomic SQL request is simple and easily analyzed by the aggregate navigator.
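A this-year versus last-year comparison illustrates the multipass pattern: two simple single-year queries go to the DBMS, and the tool combines them. The sketch below uses sqlite3 with hypothetical table and column names; in a real tool each simple pass is exactly the kind of request an aggregate navigator can analyze and speed up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (brand TEXT, year INTEGER, dollars REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("Brand A", 1996, 100.0), ("Brand A", 1997, 120.0),
                  ("Brand B", 1996, 80.0),  ("Brand B", 1997, 60.0)])

def pass_totals(year):
    """One simple pass: brand totals for a single year."""
    rows = conn.execute(
        "SELECT brand, SUM(dollars) FROM sales WHERE year = ? GROUP BY brand",
        (year,))
    return dict(rows)

this_year, last_year = pass_totals(1997), pass_totals(1996)

# The query tool, not the DBMS, stitches the passes into one comparison row:
# (this year, last year, change).
report = {b: (this_year[b], last_year.get(b, 0.0),
              this_year[b] - last_year.get(b, 0.0)) for b in this_year}
print(report["Brand A"])  # (120.0, 100.0, 20.0)
```

Because each pass touches only one database (and one year), the same combining step also lets a tool drill across conformed data marts living in entirely different DBMSs.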
Semi-Additive Summations. There is an important class of numeric measures in common business fact tables that are not completely additive. Anything that is a measure of intensity is generally not additive, especially across the time dimension. For example, inventory levels and account balances are not additive across time. These facts are called semi-additive. Everyone is familiar with the idea of taking one of these semi-additive facts, such as a bank balance, and creating a useful summary at the end of the month by averaging across time. Unfortunately, you cannot use the basic SQL avg function to calculate this kind of average across time. avg averages across all of the dimensions, not just time. If you fetch five accounts and four time periods from the DBMS, avg will divide the total account balance by 20 (five times four) rather than doing what you want, which is to divide by four. It isn't difficult to divide by four, but it is a distraction for the end user or the application developer, who must stop and store the number four in the application explicitly. What is needed is a generalization of the sum operator to become avgtimesum. This function automatically performs a sum, but also automatically divides by the cardinality of the time constraint in the surrounding query. This feature makes all applications involving inventory levels, account balances, and other measures of intensity significantly simpler.
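The arithmetic can be made concrete with five accounts and four months. A sketch of the avgtimesum idea, using sqlite3 and hypothetical table names: sum across everything, then divide only by the cardinality of the time constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (account TEXT, month INTEGER, balance REAL)")
# Five accounts x four months, each balance 100: the total held in any
# one month is 500, so the average balance across time should be 500.
conn.executemany("INSERT INTO balances VALUES (?, ?, ?)",
                 [("acct%d" % a, m, 100.0)
                  for a in range(5) for m in range(1, 5)])

# Naive SQL AVG divides by all 20 fetched rows (5 accounts x 4 months).
naive = conn.execute("SELECT AVG(balance) FROM balances").fetchone()[0]

# avgtimesum: sum everything, then divide only by the number of time periods.
total, months = conn.execute(
    "SELECT SUM(balance), COUNT(DISTINCT month) FROM balances").fetchone()
avgtimesum = total / months

print(naive)       # 100.0 -- divided by 20, the wrong answer
print(avgtimesum)  # 500.0 -- divided by 4, the balance actually on deposit
```

A query tool offering avgtimesum as a first-class operator performs this sum-then-divide automatically, so the end user never has to stop and store the number four in the application.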
Show Me What Is Important. The growth in the power and capacities of data warehouses is a good news/bad news story in some respects. The good news is that you can now store mind-boggling amounts of low-level data in your databases: Fact tables with a billion records are fairly commonplace. The bad news is that you are in much more danger of getting back too much data to comprehend usefully. Increasingly, your query tools must help you automatically sift through the data to show you only what is important. At the low end, you simply need to show data rows in your reports that meet certain threshold criteria. This process involves more than just adding a HAVING clause to the SQL. In an aggregate-navigated, multipass SQL environment, the criteria for including or excluding a record from the user's view may not be known until after the DBMS has long since passed back all the results. Thus this filtering function is rightfully the responsibility of the query tool. The high end of showing what is important is the exciting new area of data mining, which merits a whole separate article. Increasingly, query tools need to embed data mining capabilities into their user interfaces and underlying architectures.
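The low-end threshold case can be sketched in a few lines. Because the criterion below depends on a year-over-year change assembled from two separate passes, no single pass's HAVING clause could have applied it; the tool filters the stitched-together rows itself. The row values and the 10 percent threshold are hypothetical.

```python
# Rows as assembled by the query tool after its multipass combining step.
rows = [
    {"brand": "Brand A", "this_year": 120.0, "last_year": 100.0},
    {"brand": "Brand B", "this_year": 60.0,  "last_year": 80.0},
    {"brand": "Brand C", "this_year": 101.0, "last_year": 100.0},
]

def important(row, min_change=0.10):
    """Keep rows whose year-over-year change exceeds 10% in either direction."""
    change = (row["this_year"] - row["last_year"]) / row["last_year"]
    return abs(change) >= min_change

report = [r["brand"] for r in rows if important(r)]
print(report)  # ['Brand A', 'Brand B'] -- Brand C's 1% change is suppressed
```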
Behavioral Studies. An interesting class of applications involves taking the results of a previous report or set of reports and then using these results over and over again at a later time. A manufacturer might run a series of reports analyzing customer ordering behavior. From the original group of 50,000 customers, a subset of 2,000 "problem customers" might be derived. At this point, the user wants to run a whole series of follow-up reports on these 2,000 customers. The problem is that it is very difficult to constrain on the set of 2,000 customers. They almost certainly are not defined by any reasonable constraint on the customer dimension table. They are really only defined by the complex outcome of the original analysis. What is needed is a two-step capability in the query tool. First, when the original defining report is run that shows the 2,000 problem customers, it must be possible with a single command to capture directly the underlying customer keys in a special separate table (call it a behavioral study). This table then gets a permanent name and is independent from the original defining report. Second, the user must be able to attach this special table of customer keys to any fact table as a direct constraint on the customer key component of that fact table's composite key. This process then automatically constrains all subsequent analysis to the 2,000 problem customers. In this way, a full range of follow-up reports can be run on the behaviorally defined group.
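The two-step capability can be sketched directly in SQL via sqlite3. Step one captures the keys behind the defining report into a named study table; step two joins that table to a different fact table as a direct key constraint. All table names, the $1,000 cutoff, and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_key INTEGER, dollars REAL);
CREATE TABLE returns_fact (customer_key INTEGER, return_count INTEGER);
""")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, 5000.0), (3, 20.0), (4, 6000.0)])
conn.executemany("INSERT INTO returns_fact VALUES (?, ?)",
                 [(1, 0), (2, 7), (3, 1), (4, 9)])

# Step 1: capture the customer keys behind the defining report into a
# permanent, separately named behavioral study table.
conn.execute("""
CREATE TABLE study_problem_customers AS
SELECT customer_key FROM orders
GROUP BY customer_key
HAVING SUM(dollars) > 1000""")

# Step 2: attach the study to any fact table as a direct key constraint,
# automatically restricting all follow-up reports to the study group.
rows = conn.execute("""
SELECT f.customer_key, f.return_count
FROM returns_fact f
JOIN study_problem_customers s ON s.customer_key = f.customer_key
ORDER BY f.customer_key""").fetchall()
print(rows)  # only customers 2 and 4 qualified for the study
```

Because the study table holds nothing but keys, it stays small and can be joined to any fact table sharing that dimension, which is exactly what makes the behaviorally defined group reusable across a whole series of follow-up reports.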
Next month, I'll venture further afield and talk about the impact of the Internet, which is posing such a significant new architectural issue for front-end tools that many of the front-end tool providers are recasting themselves as server companies.
TABLE 1. Features All Query Tools Should Have

- Cross-Browsing of Dimension Attributes
- Open Aggregate Navigation
- Multipass SQL
- Semi-Additive Summations
- Show Me What Is Important
- Behavioral Studies