Bonnie K. O'Neil, Northern Lights Software, Ltd.

Data warehousing is on most people's minds today. Before moving forward, however, make sure to review your transformation rules. They could have a serious impact on how your company does business.
Discovering, deciphering and managing transformation rules is usually the most difficult part of a data warehouse project. This is because legacy systems contain conflicting data: Which is the right value? How can you determine if elements mean the same thing? This article offers some practical tips on how to cope with transformation rules.
Data Quality Issues
What is the use of a data warehouse if the information it provides is WRONG? All the data mining tools in the world, all the OLAP technologies, ad hoc query tools, etc. will not compensate for BAD DATA.
If the quality of the data is poor, it is very easy to make incorrect business decisions, simply because they are based on incorrect data. For instance, suppose a company is trying to decide where to open a new branch office and wants to find the geographic region with the highest concentration of both prospects and clients. Analysis of the underlying data might indicate that region A is the right choice simply because there is a huge redundancy factor in the data for that region, causing the company to overlook region B. The company then spends a lot of money opening an office and hiring staff to support region A, when the truth might be that region B is the better choice. If the data had been scrubbed and cleansed properly, the analysis might have correctly identified region B.
Bad data can exact a high cost in marketing dollars, especially for mailings. Customer mailing lists are classic targets for data cleansing. Everyone has been a victim of this: everyone alive has received more than one copy of the same mailer, sometimes three or more, and everyone receives subscription requests for a magazine they already subscribe to. If the mailing list were appropriately pruned, the company could save a lot of money per mailing just by eliminating these duplicate names. In addition, companies spend a lot mailing to addresses that the post office considers invalid and returns as "undeliverable." These pieces never reach their destination, but the company pays postage regardless.
Types of Cleansing Problems
Sometimes it helps when you are researching data quality issues to understand the types of problems that can cause bad data.
Integration of Information: Cross Referencing
How are customers reconciled if the customer ID is different in each system? How do you know that customer 123 in System A is the same customer as customer 45AV198 in System B?
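In practice, the sleuthing described later in this article produces a cross-reference (or master key) table that maps every source system's customer ID to a single warehouse key. Below is a minimal Python sketch of that idea; the system names and keys are hypothetical, drawn only from the example above.

    # Hypothetical sketch: map (source system, source customer ID) to one
    # unified warehouse customer key. The keys below are invented examples.
    customer_xref = {
        ("System A", "123"):     "CUST-0001",
        ("System B", "45AV198"): "CUST-0001",  # same customer, different source key
        ("System B", "77XQ004"): "CUST-0002",
    }

    def warehouse_key(system, source_id):
        """Return the unified customer key, or None if no match has been made yet."""
        return customer_xref.get((system, source_id))

    # Customer 123 in System A and customer 45AV198 in System B resolve to the
    # same warehouse customer:
    assert warehouse_key("System A", "123") == warehouse_key("System B", "45AV198")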
The cause of these types of problems is the "Islands of Information" issue that is so prevalent in companies today. Each system is developed when a need is identified, independently of all other systems in the enterprise. Somebody says, "We need an inventory management system," and poof: one is developed. Then the need for a sales system is found, then an order entry system, then finance, and so on, until you have a whole series of independent systems that don't talk to one another and don't share information easily.
Cross referencing information is crucial, even a matter of life and death in health care systems. It is very important to track the same patient throughout different treatments, hospital visits, etc. If John Smith is being treated in the emergency room and has previously been treated for a drug-related allergy, it is vital to cross reference the two visits.
Even when it is not a matter of life and death, cross referencing affects sales, which is the lifeblood of all companies. As every marketer knows, it's easier to sell to someone you have already sold to than to someone new. But how can you take advantage of this when you can't cross reference the customers and prospects in one division's system with those in another?
Cross selling was one of the most important goals for one large banking enterprise that we assisted with a data warehouse. They wanted to know how many current customers were buying products from more than one division. For instance, they sold many different financial products and also managed portfolios, including investment vehicles. They wanted to expand into "Relationship Banking": establishing a relationship with a client and selling the different financial products offered by the different divisions in the company. But they couldn't cross reference the same customer across the different systems in the different divisions. The bank knew this was a major problem, and it was a strong motivating factor behind the data warehouse project.
"Relationship Banking" also depends upon likely trends. For example, a customer with a certain demographic profile has a high likelihood of buying Financial Product A. If you know that John Doe matches that demographic profile and has not already bought Financial Product A, you know that there might be a good chance he might buy it if you offer it to him. However, the bank would lose lots of credibility if the representative called him and tried to sell it to him-and he has already purchased it! So the motivation for data cleansing is very high. Even the discovery of the trends is highly dependent upon clean data. How do you know it's an accurate trend, if the data you are using to discern the trend is wrong?
Another example of cross selling occurs when you can identify members of the same household, consolidate their accounts effectively, and identify new needs based on the household rather than treating each family member as an individual customer. An insurance company might benefit greatly: John Doe has a car insurance policy with your company. Mary Smith marries John Doe. She has a car insurance policy with another company but a life insurance policy with your company. If you as the agent knew that they got married, you might be able to consolidate their insurance requirements and capture business from your competition.
Mailing List Merging
Your company is sponsoring a big marketing event, and you purchase a mailing list from a magazine. You are also mailing to your "home grown" mailing list, which includes both prospects and customers. How do you merge them, and not embarrass yourself in front of both current and prospective customers?
This problem actually crops up when merging internal data sources as well. Each internal data source has its own data cleanliness issues, and merging them only compounds the problem. You hope that the purchased mailing list is fairly clean, which makes your job a little easier, but the merging issue still remains.
Application Merging
Data matching and synthesis issues aren't unique to customer lists (although the customer problem is so obvious that it can't be ignored). Another example arises when the data warehouse has to report on a data element that's cross referenced in different ways across different systems.
We worked on a data warehouse which was supposed to report on different aspects of product management. It was tied into financial reporting of product performance, an object-oriented workflow system, and several different kinds of product information systems. Merging them became very complex due to the object-oriented system: although its database was the same relational database chosen for the data warehouse, we couldn't do a straight data export, because the meaning of the data was embedded in the OO code. The data values themselves weren't meaningful; the code interpreted the values. We had to go through the application programming interface (API) and have the application feed us the data elements we needed. Even so, we had a very hard time matching up the corresponding products. The workflow system provided vital information which was not recorded anywhere else, so we couldn't bypass it. Watch out for these issues!
Acquisition of Another Company
Another common integration and cleansing problem arises when one firm buys another. Both have their own information systems: sales, finance, customer management, product, inventory management, etc. Each information system has its own "primary key" (unique identifier) scheme, and each system tracks the data important to the problem it was built to solve.
The Data Cleansing Process: Sleuthing
Understanding, researching and documenting data cleansing problems can be the most time consuming part of a data warehouse project: one estimate says that it represents 60 percent of the overall data warehouse project. And it is almost impossible to estimate how long it will take. It is a function of how clean the data is-and there's no knowing that until you actually begin researching.
I call the researching process "sleuthing." You really feel like Sherlock Holmes, ferreting out data anomalies from often very obscure clues. Sleuthing involves uncovering hidden meanings in legacy and operational data elements; it entails lots and lots of research and experimentation. Armed with a spreadsheet listing every field from each system in the leftmost column, you must find out what each field means, what the "source of record" is for each, which other systems carry that same field and what the discrepancies are (if any), what the allowable values are for each, and what each value means. Sometimes it feels like a wild goose chase. Even worse is when you encounter the "nobody knows" routine: the last person doing maintenance on the system left long ago, and now you are told "nobody knows anything about that system anymore."
The good news is, tools can help speed up this process considerably. Data cleansing tools will be discussed later.
Process-Oriented Data Quality Problems
Sometimes the sleuthing process will uncover issues that cannot be resolved with a simple transform solution. For example, a retail chain wants to analyze impulse buying trends, association purchases, etc. If a customer buys item A, what's the likelihood that they will also purchase item B?
In the sleuthing process you discover that they have successfully recorded the amounts of each item purchased, but have no record of what type of item it is. You then find out that the clerks find it easier to punch in the amount than the item type. Why? Is it a function of the type of input terminal that the enterprise is using? Or have the clerks not been properly trained? Is there a better way to do it, like using bar codes? These are not database-oriented or scrub/transform oriented problems at all. If there is no item type recorded, no amount of scrubbing will make any difference at all! But this is a very important discovery, and can aid the business greatly in producing quality data in the future. This company cannot produce an accurate data warehouse until they improve their business process involving the recording of purchased items. Be on the lookout for these types of quality problems too.
Merging Data Sources
There are many types of data quality problems you will have to deal with, simply because you are merging several diverse data sources into one unified whole. The problem is to reconcile each data element so it can be unified with others to form the warehouse.
Each legacy system was usually developed to solve a specific problem, with no reference to other systems that may have been in place at the time. (The only exception is when the system had to be fed from extracts of pre-existing systems.) The obvious problem this creates is that each application is myopic, looking at the world from its own point of view. The data elements reflect this point of view, even if the rest of the organization has a different idea of what those elements mean. The system serves that specific application's problem domain only.
It is easy to see that each application's view of the data elements may sharply conflict with other applications' views of the same elements. This is one of the key reasons why companies need a data warehouse to begin with, and why the CEO can't answer embarrassing questions like, "How many widgets did we sell last year, and at what profit?" He has no idea how many widgets the entire company sold, because he can't get a company-wide report. Several departments keep track of widgets in different ways, such that what one department calls a single widget is actually three different widgets in another department's reporting!
One reason many data warehouse projects encounter difficulty is rooted in this issue of data cleansing. Many people just don't have the stamina or patience to ferret out and resolve the myriad data cleansing problems which surface during the analysis phase. Indeed, many books on data warehousing don't even discuss this issue. Yet we feel it is absolutely critical to the success of any data warehouse project, and it can "make or break" the project.
The Key is Semantics and "Splitting Hairs"
Semantics, or the study of "meanings", is at the root of the data cleansing problem. This is precisely why many people don't want to deal with it-they don't like "splitting hairs". And yet if you are unwilling to "split hairs" and get into fine nuances of meanings, your project will not be successful. This is what analysis is really about: searching for nuances of meanings.
You must totally understand the meaning of all data elements, and these meanings and definitions must be documented, either in a standalone data dictionary, or a metadata repository.
For instance, you may think that everybody in the company knows what a "customer" is. However, this author has seen many situations where the notion of customer is obscured by different departments within the firm: some departments define a customer by their own requirements, such as the credit department defining a customer as "someone who has had their credit approved," while sales people combine the notions of "customer" and "prospect."
Two Different Kinds of Problems: Field vs. Value
You need to understand both what each element type (column) means and what individual values mean.
For example: there is a field called "address." What does address mean? Home address? Work address? Whose address? The customer's? What if the company has several addresses: which address is being referred to? What about ship-to and bill-to addresses? How are these indicated? What if there is more than one ship-to address for a given customer?
This is a distinctly different problem from the cleansing of an individual address. For example, you find three addresses associated with "Howard Smith": "3 Milktoast Way," "3 Milktoast Dr." and "56 Northstar Circle." This could mean several things. The first two records could be the same address, in which case you have to reconcile "Way" and "Dr." The last address could be a historical one, or it could be the new one and the Milktoast address could be old; or it could represent Mr. Smith's summer home; or you might have two different Howard Smiths in your database. This is cleansing of values. But the fundamental problem involves the field itself: what is meant by "address"? You have to resolve this first before you can resolve individual values.
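Value-level cleansing of addresses usually starts with normalizing common variants. The following is a minimal sketch, assuming a tiny hand-built suffix table (real cleansing products ship with very large lookup tables); it also shows why normalization alone cannot decide whether "Way" and "Dr." refer to the same street.

    # Minimal value-cleansing sketch: normalize common street-suffix variants
    # before comparing candidate duplicates. The suffix table is invented.
    SUFFIXES = {"dr": "drive", "dr.": "drive", "drive": "drive",
                "st": "street", "st.": "street", "street": "street",
                "cir": "circle", "cir.": "circle", "circle": "circle"}

    def normalize_address(addr):
        words = addr.lower().replace(",", " ").split()
        return " ".join(SUFFIXES.get(w, w) for w in words)

    # "Dr." and "Drive" normalize to the same value...
    assert normalize_address("3 Milktoast Dr.") == normalize_address("3 Milktoast Drive")
    # ...but "Way" vs. "Dr." still differ, so deciding whether the two Howard
    # Smith records are the same address takes sleuthing, not just a lookup table.
    assert normalize_address("3 Milktoast Way") != normalize_address("3 Milktoast Dr.")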
Scrubbing is further complicated by historical values, which in many databases are seldom updated in a timely manner. Often an older system's fields mean something different than those in a newer system, because the elements have evolved over time. But even if two data elements have exactly the same meaning, often one system is used more than another as the "source of record," so it gets updated in a timely fashion while the lesser-used system gets neglected. This means that values can get out of synch very fast. I know of a company that has customer addresses stored in more than 200 different systems! In a setup like that, data getting out of synch isn't just likely; it's guaranteed!
What Does the Value Mean?
If one system is known to have more current information than the others, you can create a "transformation rule" which defers to that system's data whenever it conflicts with other systems' data. Often, however, it is not as simple as this.
The term "Source of Record" refers to the system containing the most accurate, up-to-date value for a specific data item. The data warehouse must contain the value for each data item that is contained in the source of record.
One Column, Different Sources
A common finding when sleuthing is that the source of record may vary for the same field, depending upon its value. This is where the fun begins! You have to determine what the business processes are and under which conditions each source supplies the correct value.
For example, in a non-profit organization, we found that one system was updated more frequently for a specific donor type: System A was used to update donor type 2 more frequently, but for any other donor type, System B was the more likely source of record. So in our data transformation rule, if the donor type was 2 we took the value from System A; for all other donor types we took the value from System B.
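A transformation rule like this is easy to express in code once the research is done. Here is a hedged sketch of the donor-type rule just described (the field and system names are illustrative), along with the kind of plain-English rationale that should accompany it in the metadata.

    # Sketch of the source-of-record rule described above; names are illustrative.
    def pick_donor_field(donor_type, value_from_system_a, value_from_system_b):
        """System A is the source of record for donor type 2; System B otherwise."""
        if donor_type == 2:
            return value_from_system_a
        return value_from_system_b

    # Plain-English rationale, stored alongside the rule in the metadata:
    RULE_RATIONALE = ("System A is updated more frequently for donor type 2; "
                      "System B is the source of record for all other donor types.")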
It is very important to document these findings, and the reason behind the conclusion. You may be challenged on these decisions-after all, we are only human, and we may jump to the wrong conclusion.
Embedded Values: One Column, Different Values
Yes, Virginia, there was a time when memory and disk were VERY expensive, and systems packed as many meanings into one byte as possible. Bit maps had different meanings: for example, the combination 01100110 meant one thing, and 00100111 meant something totally different (and unrelated). Or worse: if the second character is a 3, get the value of field A; if the second character is a 2 and the third character is a 5, then get the value of field B. These types of situations occur in older systems, especially in the telephony world. Older switches, in order to maximize performance, were filled with bitmaps of all types. Decoding them can be a real challenge!
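Rules like the one just described end up as conditional logic in the scrub code. The sketch below is purely illustrative (the positions and field names are hypothetical); the real meanings must be verified against the owning application and documented.

    # Illustrative decoder for an embedded-value rule of the kind described above.
    # Positions and field names are hypothetical; verify the real rules with the
    # owning application and a business person, then document them carefully.
    def decode_embedded(code, field_a, field_b):
        if len(code) >= 2 and code[1] == "3":
            return field_a
        if len(code) >= 3 and code[1] == "2" and code[2] == "5":
            return field_b
        return None  # meaning unknown -- set aside for further sleuthing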
These sorts of problems occur even in more recent systems, however (I know, I know, it's a violation of normalization! I certainly don't condone this practice!). I ran into one of these in a system that was only a few years old, built on a relational database. Sometimes they occur as a kludge for subtype/supertype relationships, and sometimes they show up when the database designer is trying hard to avoid NULLable columns.
When you encounter one of these, you must be careful:
- Find out if the field(s) in question should be included in the warehouse at all (sometimes these types of fields fall into disuse, because they're old);
- If they must be included, find out if the application interprets the values (which is usually the case);
- Use the application as much as possible to decipher the meanings;
- Verify from a business person (or two!) what the appropriate meanings are, then DOCUMENT IT CAREFULLY.
Subtypes and Supertypes
Since the relational model doesn't handle subtype/supertype relationships very well, ever since the beginning of RDBMSes (relational database management systems) we have had to kludge subtypes and supertypes. Different strategies are deployed, depending on whether attributes are "inherited" from the supertype or differentiate the subtypes. Sometimes flags are used, sometimes separate tables hold the special attributes of a specific subtype, and sometimes nullable columns are added to the supertype table. Sometimes a totally different strategy is used, one that may resemble the bit map technique discussed above.
Many times one system treats a group of items as a single item, while another system does not recognize the group and deals only with its components. The tricky part is when the group is called by the same name as one or more of the components! Deciphering these things can be a challenge. You have to be alert; this is really a variation of the subtype/supertype problem. Parts explosion problems, which are common in manufacturing environments and sometimes in retail, are examples of this.
Types of Transforms
There are several common types of data transforms that occur in systems integration:
- Source of Record Transforms (described above)
- Synonyms
- Homonyms
Synonyms and Homonyms can occur both on the field level and on the value level, so WATCH OUT.
Synonyms
A Synonym refers to two distinctly different words which share the same meaning. For instance, "car" and "automobile" mean the same thing. This can get tricky, however; shades of meaning may be lurking underneath. Make sure that the two words truly are synonyms, and are not hiding extra meaning. For example, "racer" and "car" both refer to an automobile, but the "racer" is a special type of car. It could therefore be argued that the value "racer" in one system is NOT the same thing as the value "car" in another system.
The systems integration effort requires resolving synonym field names in different systems. But just as with the "address" problem alluded to earlier, there may be shades of meaning attached to the data in one system that are not there in another system. Be on the lookout for subtle differences.
Synonyms can also crop up in field values. A Customer Type field may be the same in two systems, but the domain of allowable values may be different. This situation requires further research.
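When the research is complete, value-level synonyms are usually resolved with an explicit mapping from each system's domain to the warehouse domain. This is a hypothetical sketch; the codes and domains are invented for illustration.

    # Hypothetical value-level synonym map: each system's Customer Type codes
    # are translated into one warehouse domain. The values here are invented.
    CUSTOMER_TYPE_MAP = {
        ("System A", "R"):      "RETAIL",
        ("System A", "W"):      "WHOLESALE",
        ("System B", "RETAIL"): "RETAIL",
        ("System B", "TRADE"):  "WHOLESALE",  # synonym discovered during sleuthing
    }

    def unify_customer_type(system, value):
        try:
            return CUSTOMER_TYPE_MAP[(system, value)]
        except KeyError:
            # Unmapped values need more research; don't guess silently.
            raise ValueError("Unmapped Customer Type %r from %s" % (value, system))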
Homonyms
A homonym refers to the same word with two (or more) distinct meanings. In a data integration effort, you may discover homonyms all over the place. When one group refers to a "Policy", they might mean "Auto Policy" which has all sorts of characteristics and behaviors separate and distinct from other types of "Policies" sold by different divisions within the company. In most situations, homonyms can be resolved by using subtype and supertype relationships, or by carefully defining business terms in unambiguous language. Business Rules are very helpful for this; see my paper "Business Rules to the Rescue!"
Homonyms, like synonyms, can be very sneaky and difficult to find. You can use data discovery tools to help in this effort.
Data Cleansing Tools
There are several different kinds of tools that help with cleansing: Sleuthing/discovery, transform/code, and specific problems. Here are some examples of these:
Sleuthing/Discovery tools
My favorite of the sleuthing/discovery tools is DB Star's Migration Architect. It helps in the determination of synonyms and homonyms. How can you tell if certain data elements are the same, different, or related? Migration Architect analyzes these relationships and reports on the percentage of data affinity. For instance, if customer name is related to customer address, there should be a high affinity. The lower the affinity, the more data quality problems may exist.
DB Star also helps in disparate system integration. It can report on the percentage of customer names that are the same across systems.
DB Star analyzes data files (flat files). Other discovery tools work on code/programs.
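To make the idea concrete, here is a greatly simplified illustration of the kind of statistic such a discovery tool might report: the percentage of names in one extract that also appear, after light normalization, in another extract. This is not DB Star's algorithm, just a sketch of the concept.

    # Simplified illustration only -- not DB Star's algorithm. It computes the
    # percentage of names in one extract that also appear in another extract
    # after trivial normalization (lower case, collapsed whitespace).
    def overlap_percentage(names_a, names_b):
        normalize = lambda s: " ".join(s.lower().split())
        set_b = {normalize(n) for n in names_b}
        matched = sum(1 for n in names_a if normalize(n) in set_b)
        return 100.0 * matched / len(names_a) if names_a else 0.0

    print(overlap_percentage(["Howard Smith", "Mary Doe"],
                             ["howard  smith", "John Q. Public"]))  # 50.0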
Transform/Code
After you have discovered and tracked down what the transform rules are, you can choose a tool which tracks these transforms and generates the scrub code for you. Most of these tools generate COBOL, because it runs on almost all platforms, even the most persnickety, like mainframes (many of which have no C compiler). Examples of products that help in this arena are Carleton's Passport and Prism. In my opinion, you want to look for products which support an open repository that stores its information in Oracle. Then you can extend the repository and store other objects important to your specific environment (see my paper "Extending Metadata Repositories" at this IOUW). Some products store their information in their own internal representation, and extending it or getting information out of it is very difficult.
Specific Problems
Data cleansing problems are certainly not limited to addresses, but often organizations concentrate on them because addresses are tightly related to customer service efficiency. There are quite a few data cleansing products which help organizations with this specific problem. Some products include very large lookup tables which they can use to correct common discrepancies such as "Street" and "Road". The primary address and customer list cleansing tool vendors are Vality, Harte-Hanks, and ISI.
On-Going Data Quality: Scrub Error Detection
In addition to integration issues such as merging fields and creating scrub code, there is the issue of on-going data quality. What happens during a load when a data value is found to be incorrect? The scrub code must have appropriate ways of processing errors.
Some errors can be tolerated, and some for obvious reasons, cannot be allowed into the warehouse. Levels of error tolerance should be documented and stored in the metadata so it can be clearly understood. Often, data doesn't have to be perfect, but it must meet certain cleanliness standards.
Error handling should also consider procedures for dealing with missing values. If NULL values are not allowed, techniques to deal with missing information must be put in place.
Imhoff and Geiger also describe four types of errors which should be detected and handled in scrub code (a brief sketch of this handling follows the list):
- Inaccurate data: Most likely, the scrub code should reject inaccurate values, kicking the record out and storing it in an errors table to be dealt with later. Inaccurate data may take some sleuthing to resolve, and the load process does not have enough information to resolve it automatically. A bank certainly does not want phony account numbers loaded into the data warehouse; if one is encountered, the record should be rejected.
- Errors within an acceptable tolerance or threshold: These values can probably be loaded into the warehouse successfully, with perhaps a flag set if appropriate, indicating a problem is present. In a non-profit warehouse, a donor type might be incorrect. The organization may choose to allow the erroneous donor type code, but flag it as incorrect.
- Simple transform errors: This category of errors can be handled by transform code. Imhoff uses the example of upper case conversion if the value is discovered to be in lower case.
- Default substitution: A default value can be substituted when either no value or a wrong value is provided. If a sales rep is not provided on a customer record, indicate the default value "new customer" in the sales rep field.
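Here is a minimal sketch of how scrub code might route each of these four categories during a load. The record layout, field names, and valid-value sets are invented for illustration; a real implementation would drive them from the metadata.

    # Minimal sketch of scrub-time error handling for the four categories above.
    # Record layout, field names, and valid-value sets are invented examples.
    VALID_ACCOUNTS = {"1001", "1002"}
    VALID_DONOR_TYPES = {1, 2, 3}

    def scrub(record, error_table, warehouse):
        # 1. Inaccurate data: reject the record into an errors table for later sleuthing.
        if record.get("account_no") not in VALID_ACCOUNTS:
            error_table.append(record)
            return
        # 2. Errors within tolerance: load the record, but flag the suspect value.
        if record.get("donor_type") not in VALID_DONOR_TYPES:
            record["donor_type_suspect"] = True
        # 3. Simple transform errors: correctable in code (e.g., case conversion).
        record["customer_name"] = record.get("customer_name", "").upper()
        # 4. Default substitution: supply a default when no value is provided.
        record.setdefault("sales_rep", "NEW CUSTOMER")
        warehouse.append(record)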
Documentation and Metadata
Transform and integration rules must be documented properly; this cannot be overemphasized. The integration of systems is one of the largest benefits of the data warehouse effort, and these rules were never documented properly in the first place! ALL transforms must be explained in plain English.
I usually start out with a spreadsheet, because it's easy. The information in the spreadsheets gets transferred later into a cleansing tool which will generate code, and will tie some sort of metadata structure to it.
In the process of researching transform rules, the business rules behind the transform rules are often discovered. For instance, you find that a Project Code of "NF" may have an explanation attached. In the process, you learn that "F" stands for Funded, and that if a project is denied funding ("NF"), the explanation field contains the rationale. You will want to document the relationship between the Project Code and the explanation field, and the fact that the explanation holds a description of why funding was not obtained.
Data quality is definitely affected by the passage of time, changing business conditions, and changing business rules. Packaging changes may affect the way a retail company summarizes data in the warehouse. A sales rep gets transferred to a different division, and sales from the old territory come in while he or she is working the new territory: which group reports this revenue? A rental car company wants a report showing fleet utilization by month, but a business rule dictating the percentage of cars which should not be allowed to be reserved in advance changed in the middle of the year. The data takes on different meanings once the business rules are understood.
All of these things should be stored in the metadata repository, because they affect how the data is interpreted. Data quality thresholds and their rationales should also be documented carefully in the warehouse's metadata. (See my paper "Extending Metadata Repositories" in this year's IOUW).
Summary/Conclusion
Scrubbing is a major part of a data warehouse project, and it should be pursued properly. It is very important to ensure that the business receives quality data from the warehouse, to facilitate sound decision making. Pay scrubbing the attention it deserves, and your business will get the most benefit out of the data warehouse. Skimp in this area, and you will pay the price in bad business decisions.
References
Brackett, Michael. The Data Warehouse Challenge: Taming Data Chaos. John Wiley & Sons, 1996.
Imhoff, Claudia, and Jonathan Geiger. "Data Quality in the Data Warehouse." Data Management Review, April 1996.
Kimball, Ralph. "Dealing with Dirty Data." DBMS, September 1996.