Big Warehouse, Big Decisions
Mark Madsen
The large or very large data warehouse has special storage requirements. For 24x7 availability and adequate performance, you have to look beyond the usual answers.
The storage purchase is a decision that will have a direct impact on the performance, availability, and cost of the data warehouse. The correct choice often goes unnoticed, but the incorrect choice can lead to many difficulties. For a data warehouse of large or VLDB proportions, this purchase decision is no longer a simple matter of buying disks from the server vendor or buying a storage array from the approved third-party storage vendor. In a nutshell, the criteria for selection should include availability, reliability, manageability, performance, serviceability, and cost--with cost definitely last on the list.
When evaluating your options, you must consider a number of technologies and components. In particular, taking a total system view--rather than simply looking at RAID and assuming that performance and availability will be a natural outcome--is crucial.
Multiple sources for storage products give you the same rich set of choices in product features and configuration that have been available in the open-systems server market for years. But this choice comes with the curse of open systems: the difficulty of deciding which of the many products to evaluate, determining what factors are important, and deciding how to prioritize the various options. However, you can narrow the field of choices by finding out which products support your operating system, work with your server hardware, and can support the capacity you need. When you've narrowed the set of products, you can start to look at more detailed criteria to make a selection.
Most of the larger data warehouses in production are running on symmetric multiprocessor (SMP) platforms, so the discussion here will not address the specific needs of clustered systems or shared-nothing (massively parallel processor, or MPP) systems. These types of systems require either many individual storage units or single storage units that support concurrent connections from multiple hosts.
STORAGE SUBSYSTEM BASICS
There are three basic hardware models for disk storage. The simplest option is internal or external storage in peripheral bays. In this configuration, the disks are attached to an I/O channel, usually a SCSI bus, and the server operating system addresses and manages each disk. Storage vendors refer to this model as "just a bunch of disks" (JBOD).
Platform vendors have added features to their operating systems to help manage the increasing amount of storage attached to their systems. One common feature is a volume manager that allows the system administrator to treat the disks as a pool of storage and to carve up or combine units of storage across multiple disks. Often, the volume manager includes software support for RAID that is usually limited to levels 0 and 1.
The second storage model is external storage arrays, which manage the disks internally and connect to the server via an I/O channel as a single unit. The server typically does not address individual disks or manage the placement of data on those disks. The arrays can be as simple as a set of bonded disks, but the usual practice is to provide multiple levels of RAID managed via a hardware controller in the array.
The third storage model is the centralization of storage for multiple servers onto a special network called a storage area network (SAN) or network-attached storage (NAS). In this model, modular units of storage servers are connected to a network to form a centralized storage complex that is also accessible by servers and managed from a central point, solving the attachment and management problems of large amounts of storage. The architecture of this storage complex looks like a network, with devices connected via switches and loops into complex topologies. Storage arrays and serial interfaces--serial storage architecture (SSA), fibre channel, and fibre channel arbitrated loop (FC-AL) in particular--are the enabling technologies for this storage complex. The result is a high-performance and high-availability storage solution that can be easily reconfigured to meet changing requirements.
DATA WAREHOUSE REQUIREMENTS
There are a few key items specific to data warehousing that limit the storage alternatives available--including the computing platform, database size, and warehouse architecture.
The computing platform is a factor when evaluating whether the database's size makes standard peripheral storage inappropriate. Smaller Unix or NT servers usually have a limited number of I/O channels--thus limiting the amount of addressable storage. For these servers, you must turn to storage arrays to provide the required volume of storage.
The size of the database is, of course, an important factor when considering storage alternatives. There will be a cutoff point in size at which it is no longer feasible to use standard vendor disk storage and you must consider storage array products to meet your needs. On typical high-end Unix systems such as those from Sun and Sequent, the point is usually in the 400GB range. The size cutoff varies depending on the type of computing platform and hardware vendor, but a good rule of thumb is to look to storage arrays when the raw storage requirements exceed 200 or 300GB. Beyond 500GB, you should almost certainly be using storage arrays.
The warehouse architecture can also be a major factor in your decision. If the warehouse follows a single, centralized model, you can base your decision on computing platform and database size as described. But if you plan to have a warehouse and multiple data marts, you could consider using network storage arrays that can be organized as a central storage complex. Many small, individual storage-management tasks can become a burden to support. With a central storage complex, you can take advantage of a single point of management, the higher availability of high-end arrays, and the purchasing economies of scale.
A central storage requirement immediately limits the product set to those that use serial connection schemes (SSA or fibre channel) and support multiple platforms and operating systems. These products are still new to the market and require careful consideration, but they offer advantages in manageability and cost in a distributed environment.
RELIABILITY AND AVAILABILITY
Availability is a key consideration for data warehouses. Just because a data warehouse is not executing transactions doesn't mean it's not subject to the same mission-critical considerations as OLTP systems. In fact, many companies are running customer service departments that require 24x7 access to the warehouse.
Storage selection is a key consideration in data warehouse availability because of the volume of data stored. The amount of raw storage can easily add up to hundreds of gigabytes for many businesses. The implication of having this much storage is that you will have lots of disks, and therefore a much shorter mean time to failure (MTTF) for the storage subsystem as a whole.
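To see why, consider a minimal sketch of the failure arithmetic. The 500,000-hour per-drive MTTF is a hypothetical figure, and the calculation assumes drives fail independently; the point is the scaling, not the absolute numbers.

```python
HOURS_PER_YEAR = 24 * 365

def subsystem_mttf_hours(drive_mttf_hours: float, drive_count: int) -> float:
    """Approximate MTTF of the whole drive set, assuming independent failures:
    the subsystem MTTF is roughly one drive's MTTF divided by the drive count."""
    return drive_mttf_hours / drive_count

for drives in (10, 100, 300):
    mttf = subsystem_mttf_hours(500_000, drives)
    print(f"{drives:3d} drives: expect a drive failure roughly every "
          f"{mttf / HOURS_PER_YEAR:.1f} years")
```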
The consequence is that should any single drive fail, you may lose warehouse data or at least suffer downtime. The only way to avoid downtime because of a bad drive is to use RAID. RAID, however, is not the only criterion for higher reliability. While RAID systems are designed to withstand the loss of a single drive without losing data, they can still expose the warehouse to unavailability while bad drives are replaced and data is restored.
For a large warehouse, the array must support the ability to replace drives without affecting uptime. One feature that can increase availability when drives fail is hot sparing. In this approach, the array contains one or more spare drives that automatically take the place of failed drives. There is also a second area where redundancy is important: that of power supplies and cooling fans. Most good storage subsystems provide redundant power supplies and fans over the minimum required to support the configuration. With this redundancy, the failure of any single supply or fan will not cause the array to shut down.
Another area that is often a single point of failure is the array controller. The array should provide at least two internal I/O controllers so that if one fails, the other remains available to serve I/O requests from the server. Automatic controller failover is desirable but is less useful if the failover requires a restart of the storage array, thus making the array unavailable for a period of time.
As long as the I/O channel is unaffected by an array controller failover, the server won't know that a failure occurred. However, the failure of the I/O channel is another story. There can be failures where the path to data is broken but both the warehouse server and storage subsystem are working normally. In such a failure, the hardware should automatically switch from the failed I/O channel to an alternate channel (called alternate pathing).
This feature is not always available because it requires support from both the storage subsystem and the server. Many storage arrays do not provide the capability to inform the server automatically that a channel has failed and provide access via a redundant path. Many servers do not have mechanisms in the operating system or hardware to manage failover to a redundant I/O controller. Even so, this feature should be a consideration for very large (multiterabyte) warehouses where the likelihood of a controller failure is high.
Another issue with controller failure in an array is the use of write cache. Typical arrays will use cache to boost write performance and work around the well-known RAID5 write penalty. If an array uses write cache and the controller fails, any data not flushed from that cache is lost, which could lead to a corrupted database. Therefore, it is critical to use cache mirroring to guarantee data integrity. Nonvolatile cache memory alone will not be enough because the data stored in that cache must be retrieved, and data stored on a dead controller may be very difficult to restore. But mirroring write cache is pointless if the mirror cache is on the same controller or even the same power supply as the primary cache. The loss of any of these components will cause an outage and potential data loss. Cache mirroring does not apply to read cache because that data can be easily repopulated and its loss will not affect the database.
Some vendors have chosen to reduce costs in the memory used for read and write cache. The cache should have the ability to correct single-bit errors so data will not be corrupted in the event of an error. Vendors that provide this capability will usually have some reference to error correction code (ECC) in their product information.
PERFORMANCE AND SCALABILITY
Data warehouse storage projects always stress performance and scalability because the data volumes involved introduce unique issues. Scalability has two aspects: the ability to grow in size and the ability to maintain consistent performance as the system grows.
The storage subsystem must be balanced so that it grows not just in raw capacity but in all other areas as well. The key is to look for storage products that are modular in nature--not a single large unit to which disks may be added. A modular design should allow for the addition of cache, bus, controllers, and channels as disks are added. This capability reduces the chance that any single component will become a bottleneck as the warehouse grows.
Some storage arrays rely on multiple controllers sharing a single internal bus, and growth in storage capacity means adding more disks and possibly more controllers or cache. If the internal bus, channels, and cache were not designed to support the maximum configuration, the internal bus or the host channel may become a bottleneck.
You can calculate the maximum amount of storage on a server by multiplying the number of I/O channels, the number of devices supported per channel, and the maximum capacity of a single device. Without storage arrays, the maximum size of the device is that of a single disk. A storage array can typically present itself as one large chunk of storage many times the size of an individual drive, thus allowing much more storage to be attached to the server.
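As a rough illustration of this arithmetic, here is a minimal sketch; the channel count, the 15-devices-per-channel limit, and the drive and array capacities are illustrative assumptions, not any particular server's specification.

```python
def max_storage_gb(channels: int, devices_per_channel: int,
                   device_capacity_gb: float) -> float:
    """Maximum raw capacity = channels x devices per channel x device capacity."""
    return channels * devices_per_channel * device_capacity_gb

# Individual disks only: 4 wide-SCSI channels x 15 disks per channel x 9GB drives.
print(max_storage_gb(4, 15, 9))    # 540 GB

# Storage arrays: each channel instead presents one 500GB logical unit.
print(max_storage_gb(4, 1, 500))   # 2000 GB
```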
One area of growth the storage subsystem can't affect is the server I/O channels. If you can't add more channels to the server as storage needs increase, adding storage will begin to degrade overall performance by saturating the channels. The only solution at this point is to purchase a larger server with greater I/O capacity.
There are two distinct usage models that a data warehouse imposes on the storage subsystem: heavy write activity during loading and heavy read activity during daily query usage. Each model has a different set of characteristics and will affect performance, so both must be considered when evaluating products.
Write performance. During the typical load several things occur: Data is moved to the warehouse server, usually as large files. The files are then loaded into the warehouse database, summaries or aggregates are computed, indexes are recreated, and old data may be archived or purged.
All these activities have one thing in common: heavy writing of data. If indexes are re-created or aggregates updated, there will be a high number of reads as well. Some of these activities create additional "hidden" write activity, such as sorting to disk while creating indexes. As a result, the storage subsystem must be able to handle sustained write activity without creating a bottleneck.
Parity RAID is known for introducing a write penalty, so expect write performance to degrade by some percentage. The way most vendors work around this problem is to use write cache on the RAID controller. The problem is that under such heavy write loads the cache can become saturated, after which point performance is gated by the speed of the disk drives.
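The penalty comes from the extra physical I/Os behind each logical write: in the classic description of a RAID5 small write, the controller must read the old data and old parity, then write the new data and new parity. The sketch below illustrates the effect once the cache is saturated; the per-drive I/O rate is an assumed figure for illustration only.

```python
def raid5_write_iops(drive_count: int, per_drive_iops: float) -> float:
    """Random-write throughput of a RAID5 set with no cache benefit: every
    logical write costs four physical I/Os (read data, read parity,
    write data, write parity)."""
    return drive_count * per_drive_iops / 4

def raid01_write_iops(drive_count: int, per_drive_iops: float) -> float:
    """RAID0+1 for comparison: every logical write costs two physical writes,
    one to each mirror."""
    return drive_count * per_drive_iops / 2

print(raid5_write_iops(10, 120))   # 300.0 writes/sec
print(raid01_write_iops(10, 120))  # 600.0 writes/sec
```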
Two often-overlooked areas where heavy database writes can bog down the storage subsystem are the database transaction logs and temporary space. During database update activity, the logs see constant sequential write access. Temporary space sees its heaviest use during index creation and during queries that perform large joins or sorts; the problem is that these high volumes of writes can occur during the hours when the database is simply being queried, so the two activities compete. Some storage products offer separate read and write caches to avoid this contention.
But there are many storage products that force all operations to go through a cache. When the workload becomes high enough, these products typically suffer a sudden drop where throughput slows to a crawl. The only solution is to add more cache and hope it solves the problem.
Read performance. Using cache memory to increase read performance relies on repeated reads of the same data. The first read takes the time of a disk access, and all subsequent accesses to the same data can be read directly from cache memory. This approach tends to break down under two scenarios: when access is not to commonly used data and when large volumes of data are accessed.
With random access to data not used by other queries, those data blocks sit in the cache but are never read again. Given that much warehouse usage doesn't involve looking at the same data repeatedly, it is likely that only a small percentage of the data actually needed is in the cache.
Very large databases limit the utility of cache memory. The percentage of the database that fits into the cache is very small, decreasing the likelihood of a query finding the data it needs in the cache. For example, a 4GB cache--considered very large--is only one percent of a 400GB database. The chance of two queries finding the same data in the cache is small.
When data is accessed in large sequential operations, as is typical of queries driving full table scans or index range scans, the data is usually not already in the cache. Worse, it may not even fit in the cache.
Using the earlier example of a 4GB read cache, a single full table scan of a 10GB table will fill the cache two and a half times. This means the query fills the cache with data that it will very soon replace. A second query running simultaneously, even on the same table, has almost no chance of finding the data it needs in the cache because of the actions of the first query.
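The arithmetic behind these examples is simple enough to sketch directly, using the same figures as the text.

```python
cache_gb = 4        # read cache size from the example
database_gb = 400   # total database size
table_gb = 10       # table scanned by a single query

print(f"Cache holds {cache_gb / database_gb:.1%} of the database")   # 1.0%
print(f"One full scan of the {table_gb}GB table overwrites the cache "
      f"{table_gb / cache_gb:.1f} times")                            # 2.5 times
```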
You can also use RAID to improve read performance. RAID striping increases performance by presenting a set of disks to the host environment as a virtual single disk. By using the full transfer rate of each disk on the channel, the theoretical throughput is the single disk transfer speed multiplied by the number of disks rather than the throughput of a single disk.
For example, an array using five disks with a transfer rate of 10MB/sec will provide 50MB/sec of bandwidth on a single channel. To avoid exceeding the capacity of a single I/O channel it is important to look at its throughput and expected usage when connecting RAID storage.
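A minimal sketch of that calculation, with the channel as the ceiling; the 40MB/sec channel figure (roughly Ultra-wide SCSI) is an illustrative assumption.

```python
def stripe_throughput_mb_s(disks: int, per_disk_mb_s: float,
                           channel_mb_s: float) -> float:
    """Aggregate sequential throughput of a stripe set, capped by the channel."""
    return min(disks * per_disk_mb_s, channel_mb_s)

# Five 10MB/sec disks could deliver 50MB/sec in aggregate,
# but a 40MB/sec channel becomes the ceiling.
print(stripe_throughput_mb_s(disks=5, per_disk_mb_s=10, channel_mb_s=40))  # 40
```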
The RAID level used will have a direct impact on performance, as will the vendor's implementation of that RAID level. RAID5 is designed to handle many small I/O transactions typical in OLTP databases. RAID3 is optimized for large synchronous data transfers such as those found in film editing. RAID0+1, a combination of striping and mirroring, is more expensive in raw storage but provides an optimal mix of performance and reliability. The better RAID solutions will offer multiple user-configurable RAID levels.
Server connection. The bandwidth of the channel is also a key limitation in both performance and size. In general, the faster the channel, the more devices may be attached to it--resulting in a direct increase in the volume of storage that may be attached to the server.
The gating factor in I/O throughput is the speed of the slowest component, usually the individual disk drive. With storage arrays and RAID, multiple drives bonded together can easily exceed the bandwidth of a single channel. SCSI is still the most common connection medium, but fibre channel exceeds SCSI performance by at least a factor of two and is the market direction for the high end. If you're looking for storage alternatives for a VLDB, the product you choose should support FC-AL--or at least offer an upgrade path to FC-AL that doesn't require you to unload and reload all your data.
The bottom line is that measuring performance is more difficult than simply looking at vendor throughput specs. The only way to know if you'll face performance problems due to size is to perform tests with the expected data volumes and concurrent activity on the actual hardware configuration suggested by the vendor. If you test with a smaller volume of data, then it is crucial that the storage configuration be scaled down equally--reducing the raw storage capacity, number of channels, number of controllers, and amount of read and write cache. If you do not reduce these factors to scale, the throughput measurements will be very misleading.
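As a rough illustration, a test plan might scale every sizing parameter by the same ratio as the data volume; the baseline configuration below is hypothetical.

```python
def scale_config(full_config: dict, test_data_gb: float,
                 full_data_gb: float) -> dict:
    """Reduce every sizing parameter by the same ratio as the test data set,
    keeping at least one of each component."""
    ratio = test_data_gb / full_data_gb
    return {name: max(1, round(value * ratio))
            for name, value in full_config.items()}

full = {"raw_storage_gb": 800, "channels": 8, "controllers": 4, "cache_gb": 4}
print(scale_config(full, test_data_gb=100, full_data_gb=800))
# {'raw_storage_gb': 100, 'channels': 1, 'controllers': 1, 'cache_gb': 1}
```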
MANAGEABILITY AND SERVICEABILITY
Manageability and serviceability of the subsystem should be considered in product evaluation for a VLDB. If they're not, the effort and cost required to maintain the warehouse platform can grow dramatically.
Manageability should be examined in a number of areas: system monitoring, exception detection and reporting, and the ease of doing repetitive storage management activities. The product should integrate with existing systems management products in the data center. This integration is usually enhanced through compliance to standards such as SNMP or through an enterprise management framework such as Tivoli.
The administrator should also be able to administer the storage product in multiple environments; for example, via a directly attached console, over a network, or via the server. It is important to have the capability to manage the product over the network so the administrator won't be required to go to a console for every administrative task. In the event of a network or server failure, it is equally important that there be some way to access the product using a directly attached console or a dialup line.
The typical data warehouse also goes through database reorganizations and is usually growing in size--all of which means the administrator will be working with the storage subsystem frequently and on many repetitive tasks. The ease of administration is therefore an important factor. This criterion covers many items, such as ease of initial setup and configuration, expansion, reconfiguration of both hardware and software, performance monitoring, and tuning. A requirement for storage products in this environment is that they allow dynamic tuning, reconfiguration, and expansion. You don't want to shut down the warehouse just to add disks or change tuning parameters.
Having a single point to monitor the entire storage subsystem is a big advantage in large configurations. Some products allow the individual storage arrays to be managed collectively as a single unit. Collective management can be an enormous help when a simple reconfiguration is required. The alternative is to issue the same set of commands to every individual storage unit, increasing the likelihood of errors and the time it takes to complete the work.
A product should also be as user-serviceable as possible. Favor designs that allow an operator to change or add hardware (such as disks) easily without a vendor service person present, and that allow for simple reconfiguration of RAID parameters and levels.
Another key serviceability requirement is hot pluggability: a feature that allows an operator to replace bad drives without powering down the storage subsystem. If the storage product does not support this capability, the data warehouse will be unavailable for the time it takes to shut down, replace the bad disk, and power up again.
Something to keep in mind is that hot-pluggable drives can cause problems with SCSI-based storage products. Replacing drives on a live SCSI bus can create noise caused by changes in bus capacitance, which can lead to SCSI bus errors. Some vendors use special drive connectors to minimize these effects, but those connectors are more expensive. One well-known Unix systems vendor opted for lower product cost in its arrays, with the result that on average one out of every five drive changes on a live array causes the entire array to hang, requiring a full system reboot. This is one example of why cost should be relatively low on the list of criteria: You get what you pay for.
Other elements of the storage product should be hot-pluggable as well. Power supplies and fans often go bad before the disk drives, and the loss of these peripheral elements should not force the shutdown of the storage subsystem. This capability is very important for 24x7 data centers where availability is often a key service-level metric.
COST AND INVESTMENT PROTECTION
The cost of storage can be more than 50 percent of the total system purchase price, so the last set of considerations involves the cost of the product and ensuring that the investment in storage is protected to some degree.
The cost of storage derives from two sources. First, there is the initial price of storage for the warehouse, which involves a large up-front investment. If your data warehouse will contain three years' data but the initial load will be only one year of data, it doesn't make sense to buy capacity for the fully sized warehouse. It is far wiser to budget for incremental storage additions over the next two years and minimize your initial cash outlay. Given the rate of technology change, you would likely want to upgrade the older storage within two years anyway, so this approach minimizes upgrade costs as well as the initial outlay.
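A hypothetical illustration of that reasoning: the starting price per gigabyte and the 30 percent annual price decline below are assumptions chosen only to make the arithmetic concrete, not market figures.

```python
yearly_need_gb = [300, 300, 300]   # capacity added in years 0, 1, and 2
price_per_gb = 100.0               # hypothetical year-0 price
annual_decline = 0.30              # hypothetical yearly price drop

# Buy all three years of capacity up front at today's price.
upfront = sum(yearly_need_gb) * price_per_gb

# Buy each year's capacity when it is needed, at that year's lower price.
incremental = sum(need * price_per_gb * (1 - annual_decline) ** year
                  for year, need in enumerate(yearly_need_gb))

print(f"Up-front purchase:    ${upfront:,.0f}")      # $90,000
print(f"Incremental purchase: ${incremental:,.0f}")  # $65,700
```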
The other aspect of cost is that of incremental growth. The cost of expansion can be measured in small increments or large increments. Typically, the small cost increments come from adding disks into the array or peripheral bay; the larger cost increments come from adding more arrays or enclosures to expand storage capacity. It is wise to plan for expansion by having some unused space in the storage enclosures for additional drives so that the next storage addition will not require a large incremental investment.
Another issue that you will likely face is the price differential between fibre channel and SCSI storage. A complete fibre channel product--from the channel to the disks--is generally more expensive, which is one reason some vendors offer a fibre channel host adapter for standard, internal SCSI drives. These mixed subsystems often cost as little as pure SCSI subsystems.
In terms of relative cost, one important point about pure fibre channel products is that the incremental cost of storage goes down as more disks are added because of the technology's ability to manage as many as 125 devices on a single adapter (although most vendors recommend about half this number for reasonable performance). Even a limit of 60 drives per adapter represents four times the number of drives that can be attached to a single SCSI adapter, saving the cost of three additional SCSI adapters.
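The adapter arithmetic is easy to sketch; the figure of 15 devices per SCSI adapter assumes wide SCSI addressing, and 60 drives per fibre channel adapter reflects the vendor recommendation mentioned above.

```python
import math

def adapters_needed(drive_count: int, drives_per_adapter: int) -> int:
    """Number of host adapters required for a given drive count."""
    return math.ceil(drive_count / drives_per_adapter)

drives = 60
scsi_adapters = adapters_needed(drives, 15)   # 4 wide-SCSI adapters
fc_adapters = adapters_needed(drives, 60)     # 1 fibre channel adapter
print(f"{drives} drives: {scsi_adapters} SCSI adapters vs "
      f"{fc_adapters} fibre channel adapter")
```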
Another thing to avoid is the standard metric of cost per megabyte. This metric applies to the cost of individual disks but is difficult to apply to storage subsystems. Cost per megabyte leaves out a key factor required in the warehouse: price/performance. (For data on price/performance comparisons between ultra-SCSI and fibre channel, go to the Intel site at iolab.co.intel.com/comparison.htm.)
Protecting your storage investment requires looking for a few features in the products. The first requirement is that the product support more than one server vendor. With open systems, it is not uncommon for a server to be used for two or three years and replaced with a larger, faster server. The ability to change out only the server and retain the storage can save a considerable amount of money during an upgrade cycle.
A feature that is typically not required is support for simultaneous access by multiple heterogeneous platforms. Vendors will often bill this feature as a way to support easy migration of data from source systems to the warehouse during the extract cycle. Unfortunately, this feature is usually expensive and it rarely performs as reliably or as well as advertised.
An additional requirement to consider is backward compatibility with technology changes. This criterion is sometimes difficult to evaluate because it comes down to design decisions that are not always apparent. Features to look for include support for different drive speeds, drive form factors, and drive capacities in a single array so that as new drives come out they can be housed next to older drives.
The same consideration applies to types of channel attachment, storage controllers, and internal components like cache memory. If cache memory is available that is faster or increases the supportable cache size, it would be nice to be able to upgrade without changing out your entire storage subsystem.
For channel attachment an even more important consideration is the ability of the product to support multiple channel attachment types and controllers. If the product supports this capability, you have an upgrade path to higher channel speeds by simply changing the controllers. For example, support for both Fast-20 and Fast-40 SCSI means that it is possible to connect the storage to both older servers and newer servers that support the faster SCSI standard. Support for multiple attachment types is doubly important because most high-end products in the storage industry will be moving from SCSI channel to fibre channel over the next few years, and it can be very costly to perform a "forklift upgrade"--that is, to replace the entire subsystem with new technology.
This industry direction toward faster, more reliable, and more flexible storage products has been a key enabler for the data warehouse industry. As the industry evolves, storage solutions will improve to the point where such detailed product analysis is no longer a requirement. For the next few years, however, purchase decisions in support of a VLDB will continue to require careful evaluation.
Mark Madsen is a specialist in the data warehouse business with a focus on the analysis and design of large-scale systems. He is currently employed by Millennia Vision Corp. in Mountain View, Calif. You can reach Mark at [email protected].