It's Time to Think About Archiving

By Huw Price

Editor's Note: This article addresses the archiving of historical data and not the Oracle archiving of redo logs.

Archiving data from relational databases is often the most complex task in a database project. In most projects it gets put off or shelved, and if it does get done it is usually an afterthought that can cause more problems than it fixes. The very flexibility that makes relational databases so powerful to build on is what makes archiving data from them so complex.

If you do not archive, the perils become obvious over time: unused data can bring a system to its knees, purchasing new disks to cope with the increased size is expensive, and routine maintenance takes longer and can overrun batch and maintenance windows.

So it's time to think about archiving. What's the best strategy? As ever, it depends on your site, but the following are a few well-used strategies that we know from experience have proven successful.

The first thing to remember is that archiving is the most complex procedure in a database. Simply hacking out rows without taking account of the database rules is a recipe for disaster. The most complex business rules built into an application must be understood before archiving can begin.

The initial problem is that all too often, the team (or person) setting up the archiving did not work on the application and the group who built the business rules has now moved on. In a lot of cases, good database and process documentation exists, but it is not up to date.

Throughout this article I will reference various "smart" utilities that we have developed to make life much easier when managing archiving. Whether you have these (or similar) utilities available, the concepts remain the same.

Build a Plan

Before you start, spend some time defining your project plan. An example plan would be as follows:

Clean your data - run integrity checks on the data.
Explore your data - iteratively build your extracts.
Extract then delete - split the extract from the delete portion of the archive.
Managing constraints - you may not remove data if it violates a business or referential rule.
Test the archives - techniques for testing archives.
Add archive controls - build controls into your archive.
Recursive sets - managing more complex relationships.
Updating your audit - retain as much audit information as possible.
Storing your archived data - where to store the archived data.
Restoring data - how to restore data.
Maintaining archives - how to maintain archives in a dynamic environment.

Clean Your Data

Before you begin archiving, it is worth spending some time considering the integrity of your data. If you try to remove data from a database where "business" or integrity links are missing, you risk not removing data or removing the wrong data. For example, if you have rows in the Ledger table that are not linked to their corresponding Items, those Ledger rows will never be removed if your archive relies on the Ledger-to-Item link.

Before you begin the process of exploring your data, it is worth validating and repairing it. We scan the data (using SmartCheck) against defined business and integrity rules and identify data errors. This process can throw up a lot of errors, and time should be allocated to dealing with them.
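
The kind of check such a scan performs can be sketched directly in SQL. As a minimal example, assuming hypothetical LEDGER and ITEM tables joined on ITEMID, any rows returned are ledger entries whose parent item is missing and which the archive would therefore never pick up:

SELECT L.*
FROM LEDGER L
WHERE NOT EXISTS (SELECT 1
                  FROM ITEM I
                  WHERE I.ITEMID = L.ITEMID);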

Data Exploration

In some cases the errors are not errors in the data but errors in the definition of the "rule." As your archive will rely on this rule when extracting the data, it is worth spending time redefining it. A common example is where a value in a column is used to indicate that a row is treated differently from other rows. For example, a Customer ID of 999999 in an Order table has been set up to mean that the order is not associated with a Customer but is an internal direct Order. As Customer 999999 does not exist, any checking run will initially throw these rows out as errors. The rule would have to be further defined as follows:

if CUSTID on the ORDER table not = 999999 then
ORDER >---| CUSTOMER
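
The same refined rule can be expressed as a SQL check. The sketch below assumes hypothetical CUSTOMER and ORDERS tables keyed on CUSTID (ORDERS is used to avoid the reserved word ORDER); any rows returned are genuine violations rather than internal direct orders:

SELECT O.*
FROM ORDERS O
WHERE O.CUSTID <> 999999
  AND NOT EXISTS (SELECT 1
                  FROM CUSTOMER C
                  WHERE C.CUSTID = O.CUSTID);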

As you run the scans on your data, you will gradually fill in the gaps in the data model or confirm any documentation. This is an important process in building your archives.

Once you have a fair idea of how the data hangs together, you can start defining what you want archived and what conditions you want on the archive. This breaks down into six components.

1. Decide on Your Driving Table

This is the table you will use as the primary selection in the archive. Take some time deciding on this table. For example, you may be archiving off Customers, so the natural starting point would seem to be the Customer table. Upon reflection, it may be better to start at the Order table and identify any Orders prior to a certain date. Don't be afraid to turn your archive on its head, especially when you are experimenting at the beginning.

2. Define Your Extract Set

Once you've picked your driving table, select any tables related to the driving table that you wish to include. For each of these tables you must in turn select any related tables to that table. After a while you will build up a set of tables you wish to extract or archive. The following would be a simple example set:

    CUSTOMER
      |----< ORDER
      |        |----< ITEM
      |                 |----< PRODUCT
      |----< ADDRESS
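
As a minimal sketch of how such a set cascades down from the driving table, the statements below build one extract table per level, each joining back to the level above it. The CUSTOMER, ORDERS and ITEM tables, the CUSTID and ORDERID keys, and the CREATE TABLE ... AS SELECT syntax are all assumptions for illustration:

CREATE TABLE EXTRACT_CUSTOMER AS
  SELECT * FROM CUSTOMER
  WHERE CUST_LAST_UPD < TO_DATE('01-01-1992', 'DD-MM-YYYY');

CREATE TABLE EXTRACT_ORDER AS
  SELECT O.* FROM ORDERS O
  WHERE O.CUSTID IN (SELECT CUSTID FROM EXTRACT_CUSTOMER);

CREATE TABLE EXTRACT_ITEM AS
  SELECT I.* FROM ITEM I
  WHERE I.ORDERID IN (SELECT ORDERID FROM EXTRACT_ORDER);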

3. Include Conditions

You will have to decide what you want included in your extract. You have picked a driving table; however, you may wish to select rows within that table based on some table joins or sub-selects. For example, the selection on the Customer table could be:

SELECT * FROM CUSTOMER
WHERE CUST_LAST_UPD < '01-01-92'

However, you may want to further qualify this with a join to the Address table so that only Customers in a particular State are included:

SELECT C.*
FROM CUSTOMER C, ADDRESS A
WHERE C.CUST_LAST_UPD < :W-1
  AND C.CUSTID = A.CUSTID
  AND A.PRIMARY_ADDRESS = 'Y'
  AND A.STATE = :W-2

In this example two variables W-1 and W-2 have been included which would be substituted at run time.

4. Exclude Conditions

Quite often, when thinking of an archive, you tend to think in terms of "I want all this data, but I don't want any of that data." So you need to add exclusion conditions to the extract. For example, you may wish to exclude any customers who have orders that are still active (an Order_Status of 'A').

These exclusion conditions may occur on your extract set several "relationships" away from the driving table, but must be built into your extract.
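
As a sketch, again using the hypothetical CUSTOMER and ORDERS tables, such an exclusion can be added to the driving-table selection as a NOT EXISTS condition:

SELECT C.*
FROM CUSTOMER C
WHERE C.CUST_LAST_UPD < :W-1
  AND NOT EXISTS (SELECT 1
                  FROM ORDERS O
                  WHERE O.CUSTID = C.CUSTID
                    AND O.ORDER_STATUS = 'A')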

5. Fine Tune the Relationships.

When you first start running your extracts, one of two things usually happens: you extract the entire database, or you extract no data at all. This is all part of the data exploration process. It is going to take several iterations to get the extract correct. You may have to redesign the extract completely or add further SQL conditions to links further down the tree.

6. Link Archives Together

When building archives, it is sometimes more logical to split a large or cumbersome archive into several smaller ones. There are two reasons to do this: archives may become large or difficult to manage, or different analysts may be more familiar with different portions of an application. If this happens, build links between the archives so that one archive feeds the next. For example, the accounts payable team may define an archive and pass a file of Ledger items that need to be removed to the audit and control team, who will define their own archive to remove the audit information.

Extract then Delete

In the definitions above, I referred to the extract and not the archive. This is an important point. When building an archive, you should separate the extract portion from the delete portion. Deleting data as you extract is very dangerous; there are numerous horror stories of output files being piped to nowhere, or output datasets being deleted.

You may wish to include a sign off between the extract and delete step as part of your audit and control procedures.

Splitting the extract from the delete has the added advantage of allowing you to test without having any serious performance impact on production. The extracts do not take any data locks and usually run quickly.

Managing Constraints

Managing constraints in a relational database is very complex and managing them within an archive is no exception. Whether you have a fully constrained database or a database with no constraints you should still build the logic to handle constraints into your archive.

When you use your extract set to remove the data from the database, you must remove the data in the correct sequence. For example, remove orders before customers as the former has a logical business rule link to the latter and may contain a physical foreign key link.
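
A minimal sketch of a delete step in the correct sequence, reusing the hypothetical extract tables from earlier, removes the child tables first:

DELETE FROM ITEM
WHERE ORDERID IN (SELECT ORDERID FROM EXTRACT_ORDER);

DELETE FROM ORDERS
WHERE ORDERID IN (SELECT ORDERID FROM EXTRACT_ORDER);

DELETE FROM CUSTOMER
WHERE CUSTID IN (SELECT CUSTID FROM EXTRACT_CUSTOMER);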

Now, with many interrelated tables the order in which they are to be deleted may be difficult to determine and a utility may help (we use Smart Archive).

You must consider the entire set of data when archiving, not just the tables you want. A common occurrence is to run the delete step of an archive and find that no rows are removed because a relationship was forgotten. For example, if you forgot the Item table in the archive, you could not remove any orders, as there would still be items attached to them; because the orders are not removed, the customers cannot be removed, and so on.

Testing Archives

Get other people involved in testing. When you remove data from the database, it is usually another department that screams first, as they may be relying on some of the data. Print out reports of what is being extracted and circulate them to as many people as possible, with a covering note stating that unless you hear otherwise, these rows are going to be removed from the database.

Once you have defined your extract, a useful technique is to start running the archives on a regular basis but without running the delete step. This has several advantages.

Two useful safety checks are:

Controlling the Archives

When building your archives, where possible, add as many control parameters as you can. The control parameters can include:

When designing your archives, you may have many links to small reference tables or code tables. It may make more sense to create a separate extract of the small tables which is run at the same time as the main archive. This has the benefit of reducing the number of repetitive joins the extract requires and reducing the complexity of the archive. These small tables will not be removed from the database; however, storing them separately has advantages when restoring data.

When building your archive it may make sense to include a large set of data in the extract and then only remove a portion of the data at delete time. For example, you may wish to include the Product information in the extract dataset but not remove Products when running the delete step. This has the advantage of creating a full set if you need to move the data to another instance or schema and has advantages when restoring data.

Recursive Sets

In some more complex systems it may not be easy to define a set of data. This occurs when interrelated tables refer back to themselves. For example, a Purchase Order may refer to an Invoice which refers back to another Purchase Order and so on. The problem here is that when archiving old Purchase Orders they may refer back up to active Purchase Orders.

In the case of recursion, it is best to run a separate step which identifies groups of related rows in the recursive set. When doing this, it is best to set a limit on the level of recursion, so that the number of sets extracted at each recursion depth can be examined. The resulting output file can then be used as an input to the extract set and archive.
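
As a sketch only, assuming a database version that supports recursive WITH subqueries and hypothetical PURCHASE_ORDER and PO_LINK tables (keyed on PO_ID, with PO_LINK holding the PO_ID to RELATED_PO_ID references), the identification step might look like the following; the DEPTH column lets you examine how many rows each level of recursion pulls in:

WITH RELATED (PO_ID, DEPTH) AS (
    SELECT PO_ID, 1
    FROM PURCHASE_ORDER
    WHERE PO_STATUS = 'CLOSED'
      AND PO_LAST_UPD < TO_DATE('01-01-1992', 'DD-MM-YYYY')
    UNION ALL
    SELECT L.RELATED_PO_ID, R.DEPTH + 1
    FROM RELATED R, PO_LINK L
    WHERE L.PO_ID = R.PO_ID
      AND R.DEPTH < 5            -- limit on the level of recursion
)
SELECT PO_ID, MIN(DEPTH) AS DEPTH
FROM RELATED
GROUP BY PO_ID;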

Audit Information

When running the archive you should maintain as much audit information as possible. At a minimum, you should store the table definitions and the index definitions.

Store these in tables local to your database, or take copies of the definitions and add them to the extract datasets.
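
One way to take such copies, sketched here on the assumption of an Oracle database where the DBMS_METADATA package is available and using a hypothetical ARCHIVE_DDL audit table, is to capture the generated DDL at archive time:

CREATE TABLE ARCHIVE_DDL (
    OBJECT_NAME  VARCHAR2(30),
    OBJECT_TYPE  VARCHAR2(30),
    ARCHIVED_ON  DATE,
    DDL_TEXT     CLOB
);

INSERT INTO ARCHIVE_DDL
SELECT TABLE_NAME, 'TABLE', SYSDATE, DBMS_METADATA.GET_DDL('TABLE', TABLE_NAME)
FROM USER_TABLES
WHERE TABLE_NAME IN ('CUSTOMER', 'ORDERS', 'ITEM');

INSERT INTO ARCHIVE_DDL
SELECT INDEX_NAME, 'INDEX', SYSDATE, DBMS_METADATA.GET_DDL('INDEX', INDEX_NAME)
FROM USER_INDEXES
WHERE TABLE_NAME IN ('CUSTOMER', 'ORDERS', 'ITEM');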

You may wish to store additional information including:

Storing Your Archived Data

You can store archived data in a variety of places:

When extracting data you may end up with quite large datasets. If you choose to compress these, ensure that the compression routine you use is backward compatible, so the data can still be uncompressed when you need it.

A handy tip when storing your archived data is to store, alongside each extract dataset, the load program or loader scripts needed to restore it. This can save time when you wish to restore the data.

Restoring Data

There are several different techniques to restore data from an archive. Which one you use will depend on how critical the data is and how many safety checks you want.

The first and easiest is to restore all or a section of the data file into your original database. If the table definitions have not changed, this will work fine, provided you do not overwrite any existing rows. Once you have completed this, it is worth running some integrity checks to ensure you have not introduced any errors.

A safer method is to restore the data into a separate staging schema: use the DDL statements you stored with the archived data to build the tables, and load the data using the matching load program or script that was archived with it. Once you have all the data restored, you will see the advantage of having extracted more data than was removed during your archive.

If you run a compare between the production data and the staging area and ignore any deleted and inserted rows, you will be able to see any changed rows. This information can be very useful. For example, you may have extracted a small code table which has since been changed: the value "Z" in the Status Code column now means Inactive, whereas when it was archived "Z" meant Pending, and Pending is now represented by the value "P."

Before you restore the data you would have to set any values of "Z" to the value "P," otherwise the restored data would in effect be incorrect. As the data is now back in a database, applying this change can be done with a simple UPDATE statement.
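
A minimal sketch of that compare and fix-up, assuming hypothetical STATUS_CODE reference tables in PROD and STAGING schemas and a restored STAGING.ORDERS table: the first query lists codes whose meaning has changed since the archive was taken, and the UPDATE then remaps the old "Z" (Pending) rows to the current "P" value before the data is moved back to production.

SELECT S.STATUS_CODE,
       S.DESCRIPTION AS ARCHIVED_MEANING,
       P.DESCRIPTION AS CURRENT_MEANING
FROM STAGING.STATUS_CODE S, PROD.STATUS_CODE P
WHERE P.STATUS_CODE = S.STATUS_CODE
  AND P.DESCRIPTION <> S.DESCRIPTION;

UPDATE STAGING.ORDERS
SET ORDER_STATUS = 'P'
WHERE ORDER_STATUS = 'Z';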

When you are ready to move the data to production use a load utility (we use Smart Load) which takes account of any table changes between schemas to automatically reformat the data.

Maintaining Archives

One of the problems with archive procedures is that once you add new tables to the data model, these must immediately be included in the archive. If you do not include them, the next time the archive runs you will be left with hanging or unrelated data. It is important that archiving does not get forgotten in your application planning.

The Author

Huw Price is the Managing Director of BitbyBit International Ltd., a UK-based software house specialising in relational database tools. He is the principal architect of the CheckMate product, a powerful data manipulation toolset, and is a specialist in archiving solutions. He may be contacted via email at [email protected] or by calling +011.44.0.181.780.5300.






This is a copy of an article published @ http://www.ioug.org/