Within the past five years the PC industry has gone through a one-time revolution in which everyone's disk storage has been effectively doubled by means of low-level disk compression technology. Although it took a couple of generations of software and the settlement of a few high-profile lawsuits between Microsoft and Stac Inc., we are now all the beneficiaries of very reliable compression technology that is virtually seamless and automatic. This software sits deep within the file system and requires no user intervention. It is also so fast that most of us aren't aware when compression and decompression take place.
The database industry and data warehousing in particular need to go through this same one-time revolution. Data warehousing is ideally suited to harvest the data compression benefit. Our data sets are monstrous, and in our data warehouses, we aren't changing the data in real time. Most data warehouses consist of gigabytes of static data that is queried but not updated or changed. Even with disk prices dropping to less than $1000 per gigabyte for commercial-grade disk drives, you can still spend $1 million for a terabyte of storage when you count all of the controllers and buses for such large disk farms. In this article, I argue that data compression of our data warehouses should be able to harvest 25 percent or more of this space, which could translate into at least a $250,000 benefit for a terabyte data warehouse.
Data warehousing presents some interesting tradeoffs for compression that are different from the tradeoffs appropriate for the general PC market. A database is quite different from a heterogeneous system of files. Assume that the disk storage for a mature multi-gigabyte data warehouse is split roughly into three sections: 40 percent for data, 40 percent for indexes, and 20 percent for scratch space needed for sorting and copying of files. In this scenario, the main candidate for compression is the first 40 percent, namely the data and aggregates in record form. This data should be highly compressible.
On a PC, the current compression technology averages well over a factor of 2.0 for compressing all forms of files, including text files and large binary files. Text files by themselves often compress by a factor of 5.0 or more. My informal experience with large ASCII files of numeric results from databases shows that these files also compress dramatically. I would therefore expect that the record portion of a data warehouse (the first 40 percent mentioned above) would shrink by roughly a factor of 3.0. This reduction of the first 40 percent to roughly 13 percent yields a harvest of 27 percent of the overall disk space.
Even if you assume that the index portion and the scratch space portion of the data warehouse are not effectively compressed, I believe I have justified the claim of a 25 percent reduction in disk space for a large data warehouse. If a portion of these spaces can be compressed as well, the benefit may reach one-third or more of the total disk space.
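As a back-of-envelope check on this arithmetic, here is a small Python sketch. The 40 percent data fraction, the 3.0 compression ratio, and the $1000-per-gigabyte figure are simply the assumptions quoted above, not measurements:

```python
# Rough estimate of disk space harvested by compressing only the record
# (data) portion of a warehouse; indexes and scratch space are assumed
# to stay uncompressed, per the argument above.

DATA_FRACTION = 0.40       # data and aggregates in record form
COMPRESSION_RATIO = 3.0    # assumed ratio for record-style data

compressed_data = DATA_FRACTION / COMPRESSION_RATIO   # 0.40 shrinks to ~0.13
savings = DATA_FRACTION - compressed_data              # ~0.27 of total disk

TERABYTE_IN_GB = 1000
COST_PER_GB = 1000         # dollars, the commercial-grade figure quoted above
benefit = savings * TERABYTE_IN_GB * COST_PER_GB

print(f"Disk space harvested: {savings:.0%} of the warehouse")   # about 27%
print(f"Rough benefit on a 1TB warehouse: ${benefit:,.0f}")      # roughly $267,000
```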
Data warehouse compression could be implemented either at the disk level or at the DBMS index level. Implementing at the disk level, as in the PC compression technologies, means applying compression to designated file objects without using any special knowledge of what the file object is. Data, metadata, and indexes would potentially all be compressed. The advantage of this approach is that it could be implemented below the DBMS and might not require changing the DBMS software itself. The disadvantage is that the compression is less efficient because it is performed by sequentially scanning each file object without a detailed knowledge of what the object is.
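A minimal sketch of the disk-level approach, using Python's standard zlib module, makes the point: the file object is compressed page by page with no knowledge of whether its pages hold data, metadata, or indexes. The page size and file names here are invented for illustration; a real implementation would live below the DBMS in the file system.

```python
import zlib

PAGE_SIZE = 8192  # hypothetical fixed page size; the real value is DBMS-specific

def compress_file_object(src_path: str, dst_path: str) -> None:
    """Compress a designated file object page by page, with no knowledge
    of whether the pages hold data, metadata, or index blocks."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            page = src.read(PAGE_SIZE)
            if not page:
                break
            packed = zlib.compress(page)
            # Store each page with a 4-byte length prefix so it can be
            # located and decompressed independently on a later read.
            dst.write(len(packed).to_bytes(4, "big"))
            dst.write(packed)

# compress_file_object("fact_table.dat", "fact_table.dat.z")  # example use
```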
Implementing compression at the index level uses the DBMS's own indexing schemes to generate the compression. It is an axiom of all forms of indexing that the same information that allows an index to be built can be used to compress what you are indexing. An index, by its nature, organizes the data for rapid lookup in a certain order. This capability is equivalent to knowing where all of the repeated instances of the data are.
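The point can be made in a few lines of Python: once an index has put a key column into sorted order, the repeated instances sit next to each other and collapse into simple run-length form. The store keys below are invented for illustration.

```python
from itertools import groupby

def run_length_encode(sorted_values):
    """Collapse the repeated instances that sorted (indexed) order exposes.
    Each run of identical key values is stored once with a count."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(sorted_values)]

# A store key column as an index would see it: sorted, with long runs.
store_keys = [101] * 5000 + [102] * 4200 + [103] * 6100   # invented values
encoded = run_length_encode(store_keys)

print(encoded)                                      # [(101, 5000), (102, 4200), (103, 6100)]
print(len(store_keys), "values stored as", len(encoded), "runs")
```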
I have argued in this column that we have been stuck with a particular set of compromises in the relational database world for a long time because of the need to perform transaction processing against the data as well as to query it. As a result, we have always needed to keep the data "available" for quick updating. This has inhibited us from compressing the data or recognizing that some of it could be deleted because the equivalent information already resided in the indexes we have built. In some sense, if data is sufficiently indexed, we don't need to store it in its original form, because the index contains all of the information necessary to access and reconstruct the data. In a data warehouse, we are more willing than we are in a transaction processing environment to build elaborate indexes for data retrieval that are not easy to update.
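A tiny sketch, with invented values, makes that claim concrete: an inverted index that maps each distinct value to the row positions where it occurs carries everything needed to rebuild the column it indexes.

```python
def build_inverted_index(column):
    """Map each distinct value to the row positions where it appears."""
    index = {}
    for row, value in enumerate(column):
        index.setdefault(value, []).append(row)
    return index

def reconstruct_column(index, num_rows):
    """Rebuild the original column purely from the index."""
    column = [None] * num_rows
    for value, rows in index.items():
        for row in rows:
            column[row] = value
    return column

region = ["East", "West", "East", "North", "West"]       # invented values
idx = build_inverted_index(region)
assert reconstruct_column(idx, len(region)) == region    # nothing was lost
```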
A more daring compression opportunity is to remove the keys from the underlying storage of the large fact table in the data warehouse. Recall that all dimensional data warehouses are dominated by extremely large central fact tables possessing multi-part keys, in which each component of the multi-part key is joined to a single, much smaller dimension table. The data storage portion of the data warehouse is overwhelmingly dominated by the fact table, which is often 95 percent of the data storage. You could argue that if all accesses to the fact table come through the index on the large composite key of the fact table, then storing the value of the keys along with the data is redundant. You already know what the key values are because you have just traversed the index to get to the data! In a typical retail-oriented fact table, the fact table record would normally consist of four keys (Time, Product, Store, and Promotion), and each record might contain as few as four facts (such as Unit Sales, Dollar Sales, Dollar Cost, and Customer Count). A reasonable design could easily represent all of these keys and data elements as four-byte integers or scaled integers. Suppressing the keys (four of the eight fields in each record) could result in a whopping 50 percent reduction in data storage for the fact table.
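The arithmetic behind that 50 percent figure, sketched in Python under the assumed four-byte layout (the field names follow the retail example above):

```python
FIELD_BYTES = 4   # assumed four-byte integers or scaled integers

keys = ["time_key", "product_key", "store_key", "promotion_key"]
facts = ["unit_sales", "dollar_sales", "dollar_cost", "customer_count"]

full_record = (len(keys) + len(facts)) * FIELD_BYTES   # 32 bytes per fact record
keyless_record = len(facts) * FIELD_BYTES              # 16 bytes per fact record

# The composite index supplies the key values on the way to the record,
# so only the facts need to be stored with it.
savings = 1 - keyless_record / full_record
print(f"Fact table storage saved by suppressing keys: {savings:.0%}")   # 50%
```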
There is no end to the growth of data. Fourteen years ago, in 1982, while working at Metaphor Computer Systems, my colleagues and I searched banks, manufacturers, and retail organizations for large end-user databases. We had trouble finding many larger than 50MB. The idea of a database as large as 1GB was mind-boggling. In the intervening years, the picture has not changed much, except that now the databases are almost exactly 1000 times as large: The average end-user database in large environments today is perhaps 50GB, and production examples of end-user databases larger than 1TB have appeared. Telephone call tracking and retail transaction databases will certainly grow well into the multi-terabyte range, and this is just for databases storing text and numbers! Even without attaching multimedia data to our databases, we will constantly be pursuing ever more finely grained transaction data. I believe that we will ultimately record hundreds of data points from every individual every day across all of our business databases. Fourteen years from now, we will be laughing at the minuscule terabyte databases of the 1990s.
I also believe that we are finally ready to attach multimedia data to our conventional text and number data warehouse. Before the end of the decade, a set of juggernauts will collide: different technologies and industries that have been "warming up" on the sidelines. One such juggernaut is geographic mapping: Every geographic data element implies a series of surrounding maps, and these maps will become part of our data warehouses. Another juggernaut is the availability of high-quality scanned images: Every product, person, and location will be accompanied by a scanned image. A third juggernaut, the CD-ROM, is about to leap into its second generation with the digital versatile disk (DVD) technology capable of storing as much as 17GB of data on the same size media we use today. And, of course, the biggest juggernaut is the Internet, which connects all of us to each other.
In some ways, data compression seems like a finger in the dike when viewed against the growth of data storage that we will demand in the next few years. But at least it will give us a one-time breather as we wait for the next generation of more potent storage technologies.