DBMS - October 1997
DBMS Online: Data Warehouse Architect, by Ralph Kimball

Digging Into Data Mining

Your Data Warehouse is Your Data Mining Platform.


Data mining is one of the hottest topics in data warehousing. Virtually every IS organization believes that data mining is part of its future and that it is somehow linked to the investment the organization has already made in the data warehouse. But behind all of the excitement is a lot of confusion. Just exactly what is data mining? Is data mining just a generic name for analyzing my data, or do I need special tools and special knowledge in order to do it? Is data mining a coherent body of knowledge or is it an eclectic set of incompatible techniques? Once I have a basic understanding of data mining, can I automatically use my data warehouse to mine my data or do I have to perform yet another extraction to a special "platform"?

In this month's column, I will define the main categories of data mining, and in my next column, I will show what transformations need to be done on your data warehouse data to make it ready for data mining.

Before descending into the details, let's paint the big picture:

The Roots of Data Mining

Although the marketplace for data mining currently features a host of new products and companies, the underlying subject matter has a rich tradition of research and practice that goes back at least 30 years. The first name for data mining, beginning in the 1960s, was statistical analysis. The pioneers of statistical analysis, in my opinion, were SAS, SPSS, and IBM. All three of these companies are very active in the data mining field today and have very credible product offerings based on their years of experience. Originally, statistical analysis consisted of classical statistical routines such as correlation, regression, chi-square, and cross tabulation. SAS and SPSS in particular still offer these classical approaches, but they -- and data mining in general -- have moved beyond these statistical measures to more insightful approaches that try to explain or predict what is going on in the data.

In the late 1980s, classical statistical analysis was augmented with a more eclectic set of techniques with names such as fuzzy logic, heuristic reasoning, and neural networks. This was the heyday of artificial intelligence (AI). Although this is perhaps a harsh indictment, we should admit that AI, as packaged and sold in the 1980s, was a failure. Far too much was promised. The successes of AI turned out to be limited to special problem domains and often required a complicated investment to encode a human expert's knowledge into the system. And perhaps most seriously, AI forever remained a black box to which most of us normal IS people couldn't relate. Try selling the CEO on an expensive package that performs "fuzzy logic."

Now in the late 1990s, we have learned how to take the best approaches from classical statistical analysis, neural networks, decision trees, market basket analysis, and other powerful techniques, and package and present them in a much more compelling and effective way. Additionally, I believe that the arrival of serious data warehouse systems is the necessary ingredient that has made data mining real and actionable.

The Categories of Data Mining

The best way to talk about data mining is to talk about what it does. A useful breakdown of data mining activities includes: clustering, classifying, estimating and predicting, and affinity grouping. For the discussion of this taxonomy I am indebted to Michael Berry and Gordon Linoff for their wonderful new book, Data Mining Techniques for Marketing, Sales, and Customer Support (John Wiley & Sons, 1997).

An example of clustering is looking through a large number of initially undifferentiated customers and trying to see if they fall into natural groupings. This is a pure example of "undirected data mining" where the user has no preordained agenda and is hoping that the data mining tool will reveal some meaningful structure. The input records to this clustering exercise ideally should be high-quality verbose descriptions of each customer with both demographic and behavioral indicators attached to each record. Clustering algorithms work well with all kinds of data, including categorical, numerical, and textual data. It is not even necessary to identify inputs and outputs at the start of the job run. Usually the only decision the user must make is to ask for a specific number of candidate clusters. The clustering algorithm will find the best partitioning of all the customer records (in our example) and will provide descriptions of the "centroid" of each cluster in terms of the user's original data. In many cases, these clusters have an obvious interpretation that provides insight into the customer base.
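To make the mechanics concrete, here is a minimal sketch in Python of one common clustering algorithm, k-means. It is only an illustration, not any particular vendor's method: the customer records and their attributes (age, income, visits per month) are hypothetical, and a real tool would also handle categorical and textual inputs.

    import random

    def distance(a, b):
        # Squared Euclidean distance; adequate for "nearest" comparisons.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def kmeans(records, k, iterations=20):
        """Partition numeric records into k clusters; return the centroids."""
        centroids = random.sample(records, k)  # initial guesses drawn from the data
        for _ in range(iterations):
            # Assignment step: attach each record to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for rec in records:
                nearest = min(range(k), key=lambda i: distance(rec, centroids[i]))
                clusters[nearest].append(rec)
            # Update step: move each centroid to the mean of its members.
            for i, members in enumerate(clusters):
                if members:
                    centroids[i] = [sum(col) / len(members) for col in zip(*members)]
        return centroids

    # Hypothetical customers: (age, income in $000s, store visits per month).
    customers = [[34, 55, 2], [29, 48, 3], [61, 120, 1],
                 [58, 110, 1], [23, 30, 8], [25, 28, 9]]

    for centroid in kmeans(customers, k=3):
        print("cluster centroid:", [round(x, 1) for x in centroid])

Notice that the user's only real decision is the number of clusters to ask for, and that the centroids come back in the user's original units, which is what makes them interpretable.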

Specific techniques that can be used for clustering include standard statistics, memory-based reasoning, neural networks, and decision trees. See Berry and Linoff's book for a very readable introduction to each of these types of tools.

An example of classifying is to examine a candidate customer and assign that customer to a predetermined cluster or classification. Another example of classifying is medical diagnosis. In both cases, a verbose description of the customer or patient is fed into the classification algorithm. The classifier determines to which cluster centroid the candidate customer or patient is nearest or most similar. Viewed in this way, we see that the previous activity of clustering may well be a natural first step that is followed by the activity of classifying. Classifying in the most general sense is immensely useful in many data warehouse environments. A classification is a decision. We may be classifying customers as creditworthy or not creditworthy, or we may be classifying patients as either needing or not needing treatment.
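Continuing the hypothetical example above, here is a minimal nearest-centroid sketch in Python. Real classifiers such as decision trees or neural networks are far more sophisticated, but the essential act of assigning a candidate to the most similar group is the same.

    def distance(a, b):
        # Squared Euclidean distance; adequate for "nearest" comparisons.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def classify(candidate, centroids):
        """Assign the candidate record to the index of its nearest centroid."""
        return min(range(len(centroids)),
                   key=lambda i: distance(candidate, centroids[i]))

    # Hypothetical centroids from a prior clustering run over
    # (age, income in $000s, store visits per month) customer records.
    centroids = [[31.5, 51.5, 2.5],   # younger, mid-income, occasional visitors
                 [59.5, 115.0, 1.0],  # older, high-income, rare visitors
                 [24.0, 29.0, 8.5]]   # young, low-income, frequent visitors

    print("candidate assigned to cluster", classify([28, 45, 6], centroids))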

Techniques that can be used for classifying include standard statistics, memory-based reasoning, genetic algorithms, link analysis, decision trees, and neural networks.

Estimating and predicting are two similar activities that normally yield a numerical measure as the result. For example, we may find a set of existing customers that have the same profile as a candidate customer. From the set of existing customers we may estimate the overall indebtedness of the candidate customer. Prediction is the same as estimation except that we are trying to determine a result that will occur at some future time. Estimation and prediction can also drive classification. For instance, we may decide that all customers with more than $100,000 of indebtedness are to be classified as poor credit risks. Numerical estimates have the additional advantage that the candidates can be rank-ordered. We may have enough money in an advertising budget to send promotion offers to the top 10,000 customers ranked by an estimate of their future value to the company. In this case, an estimate is more useful than a simple binary classification.
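As an illustration, here is a minimal sketch in Python of estimation by k-nearest neighbors, one simple way to "find the existing customers with the same profile." The profiles and dollar figures are hypothetical.

    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def estimate_debt(candidate, known, k=3):
        """Average the indebtedness of the k known customers most like the candidate."""
        nearest = sorted(known, key=lambda rec: distance(candidate, rec[0]))[:k]
        return sum(debt for profile, debt in nearest) / k

    # Hypothetical (profile, indebtedness) pairs; profile = (age, income in $000s).
    known = [((34, 55), 12000), ((29, 48), 18000), ((61, 120), 5000),
             ((58, 110), 7000), ((23, 30), 25000), ((25, 28), 22000)]

    # Estimates can be rank-ordered, for example to pick the top candidates
    # for a promotion, or thresholded to drive a classification.
    candidates = {"A": (27, 40), "B": (60, 115), "C": (33, 50)}
    ranked = sorted(candidates,
                    key=lambda c: estimate_debt(candidates[c], known), reverse=True)
    for name in ranked:
        print(name, "estimated indebtedness:",
              round(estimate_debt(candidates[name], known)))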

Specific techniques that can be used for estimating and predicting include standard statistics and neural networks for numerical variables, as well as all the techniques described for classifying when only predicting a discrete outcome.

Affinity grouping is a special kind of clustering that identifies events or transactions that occur simultaneously. A well-known example of affinity grouping is market basket analysis. Market basket analysis attempts to understand which items are sold together in the same transaction. This is a hard problem from a data processing point of view because in a typical retail environment there are thousands of different products. It is pointless to enumerate all the combinations of items sold together because the list quickly reaches astronomical proportions. The art of market basket analysis is to find the meaningful combinations of different levels in the item hierarchy that are sold together. For instance, it may be most meaningful to discover that the individual item "Coca Cola 12 oz." is very frequently sold with the category of "Frozen Pasta Dinners." In my next column, I will investigate some clever techniques for teasing this kind of insight out of raw retail transaction records.
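Once the items (or item categories) to count are chosen, the tallying itself is simple. Here is a minimal sketch in Python that counts pairs of items appearing in the same basket and reports those above a support threshold; the transactions are hypothetical, and a serious tool would prune the search far more cleverly.

    from collections import Counter
    from itertools import combinations

    # Hypothetical transactions, already rolled up to the levels of the item
    # hierarchy we care about (individual items mixed with categories).
    transactions = [
        {"Coca Cola 12 oz.", "Frozen Pasta Dinners", "Paper Towels"},
        {"Coca Cola 12 oz.", "Frozen Pasta Dinners"},
        {"Frozen Pasta Dinners", "Parmesan Cheese"},
        {"Coca Cola 12 oz.", "Frozen Pasta Dinners", "Parmesan Cheese"},
    ]

    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    min_support = 0.5  # keep pairs appearing in at least half of all baskets
    for pair, count in pair_counts.most_common():
        support = count / len(transactions)
        if support >= min_support:
            print(pair, "support = {:.0%}".format(support))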

Specific techniques that can be used for affinity grouping include standard statistics, memory-based reasoning, link analysis, and special purpose market basket analysis tools.

After reading Berry and Linoff's book to learn about the specific tools and techniques, you should carefully work through Larry Greenfield's comprehensive Web resource devoted to vendors and products for data warehousing. Larry has a very complete section on data mining. The site can be accessed at pwp.starnetinc.com/larryg/index.html.

In my next column, we will be ready to descend to the next level: preparing your data warehouse most effectively for data mining.


Ralph Kimball works as an independent consultant designing large data warehouses. His book, "The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses" (Wiley, 1996), is now available. You can email Ralph at [email protected] or reach his Web page at www.rkimball.com.
