DBMS - January 1996
DBMS Online: Data Warehouse Architect By Ralph Kimball

The Problem with Comparisons

A Freshman in Business Needs a Ph.D. in SQL, Part 1.

I have implemented several of the ideas from this column in a free Visual Basic report writer called Star Tracker, which is available to anyone with access to the Internet (http://www.rkimball.com).

The most difficult area of data warehousing is the translation of simple business analyses into SQL. SQL was not designed with business reports in mind. SQL was really an interim language designed to allow relational table semantics to be expressed in a convenient and accessible form, and to enable researchers and early developers to proceed with building the first relational systems in the mid-1970s.

How else can you explain the fact that there is no direct way in SQL to compare this year to last year? Or to compare one group of customers to another? Or to number the answer set? Very simple business questions require complex and daunting SQL. In this and next month's columns, I explore the two most glaring SQL issues: the inability to perform comparisons and the lack of sequential computations. I attack the comparison issue this month.

The early days of relational systems were marked by arguments at conferences between the academics at Berkeley and members of the more traditional Cobol camp. When the Cobol camp withdrew, some of the standard techniques that had been learned from data processing were lost in the shuffle. This situation would not have been too bad if SQL had been more dynamic and alive as a language over the intervening 20 years, but the efforts of the language committees for both SQL-89 and SQL-92 were concerned with extending the relational semantics of SQL, rather than with adding simple business data analysis capabilities. Therefore, the industry is struggling today with a 1970s interim relational processing language that was invented by kindly academic professors, rather than by businesspeople.

Let me present some examples from common business reports to show what is so hard in SQL. The first, and most glaring, problem is the difficulty of performing simple comparisons. Suppose you want to show this year's sales versus last year's sales:

4Q95 4Q94
Sales Sales
Doodads 57 66
Toggles 29 24
Widgets 115 89

This little report is the bedrock of business analysis. A number in the business world rarely makes much sense unless it is compared to some other number. In this report, the comparison numbers are side by side, but you could just as well have used them together in a ratio to show the growth in sales from last year to this year. This-year-versus-last-year is the standard example. You can see that sales of Widgets and Toggles are up substantially from last year, and the sales of Doodads are down. This immediately suggests further analysis and perhaps even management action.

Amazingly, it is very difficult in SQL to get an answer set that looks like this. The usual SQL template fails you. You start by writing a fragment of the SQL needed for the fourth quarter 1995 sales (4Q95):

SELECT Product.Product_Name, SUM(Sales.Dollars)
FROM Sales, Product, Time
WHERE . . . Time.Quarter= '4Q95'
AND Product.Product_Name in ('Doodads', 'Toggles', 'Widgets')
GROUP BY Product.Product_Name

If you have been reading this column for the past few months, you would expect that the underlying dimensional database schema consists of a large central fact table (Sales) joined to a few dimension tables (including Product and Time). To keep this column focused, I have omitted some of the bookkeeping required in the SQL, such as listing all the join constraints in the WHERE clause required to hook the tables together.

Now, how do you get the sales numbers for 4Q94? SQL leaves you with no good alternatives. If you expand the Time constraint to include both quarters, as follows:

WHERE . . . Time.Quarter IN ('4Q95', '4Q94')

then the sum expression adds up the sales from both quarters, which you do not want.

SQL-92 provided a case expression that would seem to offer a way out. Put both quarters' data into the query, and then sort out the two quarters in the select list. This is a kind of if-then logic:

SELECT Product.Product_Name,
SUM(CASE(Time.Quarter = '4Q95', Sales.Dollars, 0)),
SUM(CASE(Time.Quarter = '4Q94', Sales.Dollars, 0))
FROM Sales, Product, Time
WHERE . . . Time.Quarter IN ('4Q95', '4Q94')
AND Product.Product_Name in ('Doodads', 'Toggles', 'Widgets')
GROUP BY Product.Product_Name
The first CASE expression tests each record in the candidate answer set. If the Time.Quarter is 4Q95, then Sales.Dollars is added to the accumulating SUM. If Time.Quarter is something else, zero is added. The same logic applies to the second case statement, except that the time period is 4Q94. This logic allows the two Sales.Dollar columns to be computed.

Unfortunately, several things are wrong with this scenario. First, I don't know of a query tool that supports building this kind of logic. The case construct would have to be hand-coded at the SQL level into applications. Second, the approach doesn't scale to more realistic examples. Suppose you wanted to do a more interesting set of comparisons, as follows:

4Q95
Sales
Fraction
4Q95 4Q94 1995 1994 to All
Sales Sales Sales Sales Products
Doodads 5766 210213 16%
Toggles 29 24 110 93 8%
Widgets 115 89 409 295 32%

This example turns into a mind-boggling set of case statements. Notice the little trap in the last column, where "All Products" represents a larger scope than just Doodads, Toggles, and Widgets.

The third problem with the CASE approach is that the SQL gets so complicated that the optimizer can't really tell what is going on. The proliferation of CASE statements obscures the original simple goal. As you add more comparisons, there is a real possibility that the optimizer will "lose it" and do something zany (like a full table scan of Sales), resulting in disastrous effects on performance. A language designer would look at the SQL and say that it "lacks intentionality." In other words, it doesn't say clearly what it is trying to do.

A fourth problem with the CASE approach in this last example is that the WHERE clause must be emasculated to such a point that it lets almost the entire database into the query. Because the CASE approach defers the issue of constraining the records until they arrive in the answer set, the WHERE clause is forced to let through all the Sales records for all products sold both this year and the previous year. If you had hoped to use a speedy aggregate record (see November's column, "The Aggregate Navigator," page 18), you will be disappointed, because the use of the All Products aggregate is hopelessly embedded in the query using the CASE approach.

Stepping back from the details, how do you program comparisons using a relational database? Historically, you have had only four choices: CASE statements, SQL self-joins, SQL correlated subselects, and separate queries combined in the client application.

The second and third approaches (self-joins and correlated subselects) are older techniques that are even uglier than the CASE statement. Neither approach is viable for data warehouse comparisons. Note that the SQL UNION operator does not address the issues of comparisons within a row, because UNION appends rows, not columns, from multiple queries.

Thus, you are left with the fourth choice: separate queries combined in the client application. In other words, in the last example, you would send the following five separate queries to the database:

4Q95 sales for Doodads, Toggles, and Widgets,
4Q94 sales for Doodads, Toggles, and Widgets,
1995 sales for Doodads, Toggles, and Widgets,
1994 sales for Doodads, Toggles, and Widgets, and
4Q95 sales for All Products.

Your client application must then combine these results outside of the database. Although this approach obviously increases the complexity of the client application, there are some valuable benefits. First, each of the five SQL statements is extremely simple SQL (which all query tools are good at producing). Second, the optimizer can easily analyze these simple SQL statements in order to choose appropriate evaluation plans. Third, the aggregate navigator will process each of these separate queries smoothly, jumping to the yearly aggregates for queries 3 and 4, and jumping to the All Products aggregate for query 5.

The only real glitch is that the client application must perform an outer join on the five answer sets. Keep in mind that there is no guarantee that Doodads, Widgets, and Toggles are all sold in the various time periods requested. If Doodads had not been sold in 1994, you would be forced to enter null data elements into the appropriate cells of your report. Fortunately, outer join is just a fancy name for sort-merge from the old Cobol data processing days. Building the answer set from the five queries requires only that the five answer sets be sort-merged on the set of sorted row headers ("Doodads," "Toggles," and "Widgets").

The separate query approach is extremely simple and general. You can build very complicated reports from separate queries, and the database performance is controllable and scales linearly in the number of queries. I believe that the separate query approach is the only viable approach for standard business reports.

In thinking about how the database industry should attack the problem of building these standard reports, DBMS vendors should address part of the problem and the tool vendors should address the rest. DBMS vendors should extend the concept of the SQL answer set so that multiple queries with the same row headers can be "accreted" column by column into an answer set with symmetric outer join (sort-merge on the row headers). Once this master answer set has been assembled, it can be fetched back to the requesting application for final analysis and presentation.

Tool vendors should make it possible to create comparisons like those in this column. Initially, tool vendors will have to do the sort-merge in the client; later, DBMS vendors can evolve this capability on the server side.

As an example (and as a goad to the industry), I have implemented several of these ideas in a free Visual Basic report writer called Star Tracker, which is available to anyone with access to the Internet (http://www.rkimball.com). This free software is also available on the DBMS Buyer's Guide CD-ROM, shipped with this issue of the magazine. Star Tracker works on Access, Red Brick, Oracle, Informix, and Sybase databases.

Next month I will conclude my analysis of "Business versus SQL" by looking at sequential processing, which is not just difficult, it's impossible.


Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.

This is a copy of an article published @ http://www.dbmsmag.com/