DBMS - February 1996 - Data Warehouse Architect

DBMS, February 1996 DBMS Online: Data Warehouse Architect By Ralph Kimball

SQL Roadblocks and Pitfalls

Exploring some of the more glaring limitations of SQL.

In last month's column, I discussed the first of two glaring problems in SQL that make even simple business analyses frustratingly difficult. Last month I looked at the difficulty of performing comparisons within SQL. This month I take up the other big problem: the lack of sequential operations. Unlike last month's problem, for which you can attempt to program comparisons using existing SQL syntax, you are out of luck in the case of sequential operations. Despite my fellow contributing editor's (the inimitable Joe Celko) heroic programming efforts, there is no way to perform simple sequential operations using standard SQL.

Let's review the really useful sequential operations. All sequential operations process the answer set in order, from the first record to the last, accumulating some kind of calculation along the way. In many of these cases, the answer set represents a time series, and the records are presented in time series order.

The following simple report shows a number of useful sequential operations applied to a time series. Only the first column can be produced in SQL:

Dollar Day Cumulative Three-Day Moving
Sales Number Total Average
February 1 20 1 20 20
February 2 24 2 44 22
February 3 16 3 60 20
February 4 32 4 92 24
February 5 21 5 113 23
and so on

The Day Number column is a numbering of the answer set starting with the first record. The Cumulative Total and the Three-Day Moving Average are equally simple. Amazingly, SQL does not provide any of these functions. It is left up to the client application to provide the ability to perform these calculations. However, it is extremely important not to defer these calculations to the client application. Consider a variation of the previous report:

Dollar Average Cumulative Three-Day Moving
Sales Unit Price Unit Price Unit Price
February 1 20 2.0 2.0 2.0
February 2 24 2.4 2.2 2.2
February 3 16 2.0 2.2 .20
February 4 32 1.6 1.9 1.8
February 5 21 1.7 1.7 1.6
and so on

SQL can calculate the Average Unit Price easily by dividing the Dollar Sales by a hidden Unit Sales value in the data. But if the two sequential computations in columns 3 and 4 are deferred until the data arrives at the client application, it is very difficult for the tool to calculate the correct answer. The Unit Sales data has been lost. Most non-additive business calculations, such as Average Price, exhibit this problem. The calculations must be performed in the SQL engine where all the components of the calculations are available.

Another very useful sequential computation is the rank. Many companies base their business analysis around ranking reports such as the following:

Doodad Toggle Widget Total
Sales Sales Sales Sales
Rank Rank Rank Rank
East 1 2 3 1
Atlantic 3 5 2 4
Southeast 4 3 7 5
Midwest 5 4 1 3
Southwest 7 7 5 7
Pacific 2 2 4 2
Northwest 6 6 6 6

Creating this report using standard SQL and spreadsheet manipulation is a mind-boggling exercise. Only the actual sales numbers are returned from SQL. The rest of the application consists of a complicated series of sorts and macros in the spreadsheet. A simple ranking function in SQL would make this application a breeze. Note that "serious" ranking functions must assign ties and skip the next ranking numbers where the ranks are the same.

A variant of ranking is N-tiling. Tertiles divide the ranks into three categories: high, medium, and low. Quartiles divide the ranks into four categories, and so on. Advanced tertiling companies, such as Clorox, implement smart tertiling where the boundaries between high and medium, and between medium and low, are adjusted automatically to accomodate clumping of the data values.

Another sequential computation related to ranking is Top N, and, of course, Bottom N. Top N can be implemented in terms of rank if the ranking function is allowed in the SQL where clause:

WHERE . . . rank(sales) <= 10

This is the same as "Top 10." You want several flavors of Top N, including Top N by Value, Top N Percent of a list, and Top N Percent of Contribution in a column with an additive measure.

A quite different kind of sequential operation is the break row. In this case, you accumulate the computation in every column until the row header changes value. Then you insert a break row (underlined in the following table) with the correct values:

Average
Unit
Sales Price
Doodads Atlantic 72 2.00
Doodads East 46 2.20
Doodads Southeast 28 2.16
Total Doodads 146 2.09

Toggles Atlantic 66 3.95
Toggles East 56 3.85
Toggles Southeast 32 4.02
Total Toggles 144 3.92

Once again, you dare not compute the break row in the client application because non-additive computations such as Average Unit Price, in general, cannot be calculated correctly outside of the database. This means that you need BREAK BY syntax to go along with GROUP BY and ORDER BY, as follows:

SELECT prod_description, region, SUM(sales), 
       SUM(sales)/SUM(units) 
  FROM sales_fact, product, market, time
  WHERE  . . . <join constraints> AND <application constraints> 
     GROUP BY prod_description, region
     ORDER BY prod_description, region
     BREAK BY prod_description 
          SUMMING 3, 4 DISTRIBUTED 
          RESET BY prod_description

The SUMMING 3, 4 DISTRIBUTED phrase means sum column 3 directly, and sum column 4's components before doing the division.

A further refinement of BREAK BY adds a RESET BY clause that restarts the sequential computation at specified breakpoints. This is required in this example because it is meaningless to accumulate the Average Unit Price computation across the product break. However, in time series computations, the sequential computation often continues across the break even though the report includes a monthly break row.

Sequential computations can be added gracefully to existing SQL because the syntax extensions are compatible with the existing language. Several DBMS vendors have added bits and pieces of sequential processing to their SQL implementations; however, most are incomplete and do not address the important issues of distributed computations in sequential functions and break rows or sequential computations in WHERE clauses. Red Brick Systems (Los Gatos, Calif.) provides the most complete implementation of sequential operations. In fairness to other vendors, I promise to provide in a future column a scorecard of sequential processing extensions to various vendors' SQL (if they will send me their current language manuals).

Fixing the SQL Problem

Application developers and end users of the world are wasting enormous amounts of valuable time implementing simple comparisons and simple sequential computations in their client applications. If they are not wasting their valuable time, they are probably not even trying to provide the comparisons or the sequential computations.

As I pointed out last month, complex comparisons are best performed by "accreting" the answer set's columns. You can do this by performing an entire series of queries with identical row headers and then building the final answer set with a full outer join. I urge DBMS vendors to provide column accretion via SQL. It is not that hard to do, and it would sharply reduce the complexity of client applications.

Sequential operations are all examples of post-processing the answer set; application designers should think that way. Numbering, ranking, tiling, and break-row processing are all performed after the answer set is ready to be shipped to the user. Comparisons and sequential operations can coexist harmoniously. First you accrete the comparisons; then you perform the sequential processing on the generalized answer set.

To investigate these issues further, you can access my home page on the World Wide Web (http://www.rkimball.com). From there you can download my proposed SQL syntax for comparisons and sequential computations. Also, you can download my free Star Tracker software, which implements a full range of comparisons and sequential computations against Microsoft Access, Red Brick, Oracle, Sybase, and Informix databases. Because the major DBMSs do not provide answer set accretion, Star Tracker does its best to implement outer joins on the client. It also knows something about the DBMS's sequential processing SQL extensions, and passes them through when it can; otherwise, Star Tracker does its best to process sequential computations itself. Although this works, it makes applications more complex. The experience of implementing Star Tracker has convinced me that the world would be much better if the DBMS vendors came to the comparison and sequential processing party.

Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.

This is a copy of an article published @ http://www.dbmsmag.com/

	Dollar	Day	Cumulative	Three-Day Moving
	Sales	Number	Total	Average
February 1	20	1	20	20
February 2	24	2	44	22
February 3	16	3	60	20
February 4	32	4	92	24
February 5	21	5	113	23
and so on

	Doodad	Toggle	Widget	Total
	Sales	Sales	Sales	Sales
	Rank	Rank	Rank	Rank
East	1	2	3	1
Atlantic	3	5	2	4
Southeast	4	3	7	5
Midwest	5	4	1	3
Southwest	7	7	5	7
Pacific	2	2	4	2
Northwest	6	6	6	6

		Average Unit
	Sales	Price
Doodads Atlantic	72	2.00
Doodads East	46	2.20
Doodads Southeast	28	2.16
Total Doodads	146	2.09

Toggles Atlantic	66	3.95
Toggles East	56	3.85
Toggles Southeast	32	4.02
Total Toggles	144	3.92