DBMS - September 1997 - Data Warehouse Architect

DBMS - September 1997 DBMS Online: Data Warehouse Architect By Ralph Kimball

Turbocharge Your Query Tools

How To Answer Really Difficult Business Questions.

SQL is very seductive. Simple SQL statements seem to read like English-language requests for information from a database. After all, almost anyone can figure out the intent of a simple SQL request such as this one that asks for the September sales of each of our products:

Select Product_Description, Sum(dollars) -- all the columns in the final output
>From Product, Time, Sales -- the tables needed in the query
Where Sales.Product_key = Product.Product_key -- joining the Sales table to the Product table
And Sales.Time_key = Time.Time_key -- joining the Sales table to the Time table
And Time.Month = 'September, 1997' -- the "application" constraint
Group by Product_Description -- the row header in the final output

Unfortunately, in most cases a more ambitious business request begins to make the SQL complex to write and read. For too long, query tool vendors did not venture far beyond the safe havens of simple SQL requests, like our example. In the early days, most query tools automated the construction of simple SQL requests, sometimes even showing the SQL clauses as they were being built. Only in the past two or three years have query tool vendors tackled how to issue the complex SQL needed for serious business questions. Some vendors, such as Speedware Corp. Inc., Business Objects Inc., and Brio Technology Inc. have deepened their tools by allowing the user to construct embedded subqueries. Some of these vendors as well as vendors such as MicroStrategy Inc. and Sagent Technology Inc. have also implemented multipass SQL where complex requests are broken into many separate queries whose results are combined after the database has completed processing all of them.

Are these approaches sufficient? Are we able to pose all the business questions we want? Are there some business results that beg to be recognized but are trapped in the database because we just can't "speak" clearly enough? Should the SQL language committee give us more power to ask difficult business questions, and could the database vendors implement these language extensions, all before the next millenium? (Only 935 days from the creation of this article!)

To get some perspective on these issues, let me propose seven categories of business questions. These are ordered from the most simple to the most complex, in terms of the logical complexity of isolating the exact records in the database needed to answer the question. This taxonomy isn't the only one possible for business questions or SQL queries, but it is useful as a scale to judge SQL and SQL-producing tools. As you read the following seven categories of queries, try to imagine whether SQL could pose such a query and whether your query tool could produce such SQL. The seven categories of queries are:

Simple Constraints: Constraints against literal constants, such as "Show the sales of candy products in September 1997."
Simple Subqueries: Constraints against a global value found in the data, such as "Show the sales of candy products in September 1997 in those stores that had above average sales of candy products."
Correlated Subqueries: Constraints against a value defined by each output row, for example, "Show the sales of candy products for each month of 1997 in those stores that had above average sales of candy in that month."
Simple Behavioral Queries: Constraints against values resulting from an exception report or a complex series of queries that isolate desired behavior, such as "Show the sales of those candy products in September 1997 whose household penetration for our grocery chain in the 12 months prior to September were more than two standard deviations less than the household penetration of the same products across our 10 biggest retail competitors." This query is a variation of the classic opportunity gap analysis.
Derived Behavioral Queries: Constraints against values found in set operations (union, intersection, and set difference) on more than one complex exception report or series of queries, such as "Show the sales of those candy products identified in example number 4, and which also experienced a merchandise return rate more than two standard deviations greater than our 10 biggest retail competitors." This request is a set intersection of two behavioral queries.
Progressive Subsetting Queries: Constraints against values, as in number 4, but temporally ordered so that membership in an exception report is dependent on membership in a previous exception report: "Show the sales of those candy products in example number 4 that were similarly selected in August 1997 but were not similarly selected in either June or July 1997." A health care example of a progressive subsetting query would be "Show the oldest 100 patients who initially complained of chest pain, then had either treatment A or treatment B, then did not have surgery, and are still alive today."
Classification Queries: Constraints on values that are the results of classifying records against a set of defined clusters using nearest neighbor and fuzzy matching logic: "Show the percentage of low-calorie candy sales contained in the 1,000 market baskets whose content most closely matches a young, health-conscious family profile."

The business questions in these seven categories grow progressively more interesting as we move down the list. The questions in categories 4 through 7 lead almost directly to decisions made and actions taken. Decisions made and actions taken are the true outputs of a data warehouse, and in that sense we should devote a large part of our creative energy to making these kinds of difficult queries possible.

How did you do in comparing these business questions to SQL and to industry-standard tools? Not too well, I assume. Most query tools can really only do category 1 (simple constraints) easily. Nearly all the seasoned tools that I regularly write about in this column can also do category 2 (simple subqueries), although the user interface commands for doing these subqueries may be cumbersome. A few, such as Speedware, are aggressively selling their ability to perform category 3 queries (correlated subqueries). As far as I know, none of the standard query or reporting products has direct user interfaces for categories 4 through 7. If you are trying to support queries in these categories, you are faced with an architectural dilemma: These business questions are getting too complex to express in a single request. Not only should the user partition these queries into sequential processing steps in order to think more clearly about the problem, but the underlying algorithms may be more stable and controllable if they are doing less in each step. So how do you attack these difficult problems in categories 4 though 7, and can you use your current tools to get you part way or all the way to useful answers?

In my February 1997 DBMS column, "Features for Query Tools" (see page 12), I briefly described a technique for handling behavioral queries. Since then, I have been discussing this technique with many groups, and I am becoming convinced that this approach could be an important step forward for query tools that would extend their reach across categories 4, 5, and 6 (simple and derived behavioral queries and progressive subsetting queries). The technique partitions the problem into two steps:

Run an exception report or series of queries that defines the complex behavior you wish to label. For instance, "Define the candy products in September 1997 whose household penetration for our grocery chain in the 12 months prior to September were more than two standard deviations less than the household penetration of the same products across our 10 biggest retail competitors." Although this is a complex request, most of the good report and analysis systems on the market should be able to handle it. After running the exception report (in this case yielding a list of products), capture the product keys of the products identified in the exception report as an actual physical table, consisting of a single product key column.
Now, use the special behavior dimension table (see Figure 1) of product keys whenever you wish to constrain any analysis on any table to that set of specially defined products. The only requirement is that the target fact table must contain a product key as part of its overall composite key.

The special behavior dimension is attached only with an equijoin to the product key component of the fact table's composite key. This can even be done in a view that hides the explicit join to the special behavior dimension. In this way, the star schema looks and behaves like a regular, uncomplicated star. If the special dimension table is hidden under a view and we call this view "Special Product Sales" instead of "Sales," then virtually every query tool, report writer, and ROLAP tool should be able to analyze this specially restricted fact star schema smoothly, without paying a syntax or user interface penalty for the complex processing that defined the original exception report.

Like any design decision, this one involves certain compromises. First, this approach requires a user interface for capturing, creating, and administering real physical tables in the data warehouse. We can imagine building a simple applet in Visual Basic (or another capable application development tool) that sits outside of your favorite query tool or report writer. Whenever a complex exception report has been defined on the screen, you make sure that the appropriate set of keys are displayed and then capture them into the applet with cut and paste. This applet then creates the special behavior dimension table.

A second reality is that these tables must live in the same database space as the primary fact table because they are going to be joined to the fact table directly, thus affecting the DBA's responsibilities.

Third, the use of a random set of product keys, as in our example, will affect aggregate navigation on the product dimension. A sophisticated approach could build custom aggregations on the special behavior set for all those queries that summed over the whole set rather than enumerating the individual members. Aggregate navigation on all the other dimensions should be unaffected.

By storing the keys in the special behavior dimensions in sorted order, set operations of union, intersection, and set difference can be handled in a single pass, thus allowing us to construct derived behavior dimensions that are the combinations of two or more complex exception reports or series of queries.

Finally, by generalizing the behavior dimension to include a time key in each record along with the product key (in the example), the time key can be used to constrain records in the master fact table that occurred after the behaviorally defined event. In this way, we can search for behaviors in a sequential order.

This approach offers the hope that we can use our existing tools to extend our queries beyond simple constraints and subqueries. The Star Tracker freeware query tool, which I have distributed to more than 100,000 end users, contains the interfaces for creating and manipulating special behavior dimensions (called "studies" in Star Tracker). You could even use these tables with other query tools. On my home page (www.rkimball.com) you will see that the company "if..." has acquired the commercial rights for Star Tracker, and they have an upgraded version of Star Tracker freeware as well various commercial versions.

Next month I will tackle the seventh level of complex queries, which, as you may expect, require full-fledged data mining. Data mining is a big topic and the subject of October's column.

Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. His book The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses (Wiley, 1996) is now available. You can reach Ralph through his Web page at www.rkimball.com.

Figure 1.

--A special behavior dimension is added to a normal star schema as a way of restricting the schema to the context of a behaviorally defined group of products.

This is a copy of an article published @ http://www.dbmsmag.com/