How to Create a Good Test Database

by Wayne Linton

Just as a seasoned sportsperson knows that the finals of the Canadian Open fishing tournament at Lake Erie aren't the place to try a new lure or different bait, experienced DBAs know that live data warehouses and large production databases are not good places to develop and test reports and programs. Such work is best done in a test database containing a representative sample of the warehouse data. The test database is much smaller than the warehouse but should be identical in structure in terms of its tables, indexes, constraints, and other elements.

Test runs of new reports complete quickly on a small database, and when the data is representative of the warehouse, you can be assured that the report will run the same way against the live data. Similarly, having good data that mimics the warehouse in structure and content is vital to developing and testing new programs.

The challenge is to ensure that the data in the test environment remains fresh and useful for development work. Program testing changes the data content, and as structural and data changes are made, the test database tends to lose its effectiveness as a useful testing environment.

You can typically select a small number of key data elements in a database to serve as representatives of the whole. These elements, when extracted along with all related data, provide a functionally complete subset of the live database. You can test reports or programs against this subset and then run them against the live warehouse with confidence that your testing environment was appropriate.

Foreign keys enforce referential integrity at the database level and form a network of table relationships in a database. You can exploit these relationships to identify subsets of data for populating a test schema or database, archiving data, or removing obsolete data. This article discusses a technique that uses dynamic SQL to do just that.

A Typical Scenario
Figure 1 shows a small portion of the model for a sales data warehouse that contains thousands of customers with billions of sales records. For the purposes of this example, you need to create a report on customer buying patterns in a test environment. You have a handful of customers that provide a representative sample. You can use these customers to seed the data-extraction process to provide the data for testing.

Making Preparations
Create a few objects to prepare for the data extraction. You can create these in the same schema as the data being extracted or in a separate one. If you create a separate schema, you must have direct (via your userid, not through a role) SELECT access on all tables in the foreign-key hierarchy that will be touched. It's easiest if you have the SELECT ANY TABLE system privilege.

First, you need to create a table to hold the rowids of the initial seed rows from CUSTOMERS and of all subsequent rows that are collected. Create a unique constraint to prevent duplicates, using the following code:


CREATE TABLE PARENT_CHILD_ROWIDS (
  pcr_sequence       NUMBER,
  p_level            NUMBER,
  p_owner            VARCHAR2(30),
  parent_table_name  VARCHAR2(30),
  p_rowid            ROWID);

ALTER TABLE PARENT_CHILD_ROWIDS
  ADD CONSTRAINT PCR_P_ROWID_UK UNIQUE (P_ROWID);

CREATE SEQUENCE PCR_SEQUENCE INCREMENT BY 1 START WITH 1;

You initially seed this table with rowids from the CUSTOMERS table by running a script with appropriate selection criteria, such as the following:

-- Substitute the actual schema name for owner.
INSERT INTO PARENT_CHILD_ROWIDS
SELECT PCR_SEQUENCE.NEXTVAL, 0, 'owner', 'CUSTOMERS', ROWID
  FROM owner.CUSTOMERS
 WHERE CUSTOMER_ID IN (12345, 23456, 34567, 45678, 56789);

Because this process could search the data dictionary views thousands or millions of times, you should create a denormalized constraint-columns table (see Listing 1). This table is used to traverse the foreign-key hierarchy and to construct the selection SQL executed by the dynamic SQL routines. You can limit the foreign-key hierarchy search to a specific owner by replacing the DBA views with USER views or by coding an owner into the SQL in Listing 1.
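
Listing 1 is not reproduced here, but a minimal sketch of such a denormalized table, built from DBA_CONSTRAINTS and DBA_CONS_COLUMNS and assuming the column names implied by the index shown later in the Performance section, might look like this:

-- A hypothetical reconstruction of Listing 1: one row per parent/child
-- column pair for every foreign key visible in the data dictionary.
CREATE TABLE PARENT_CHILD_BASE AS
SELECT pk.owner          p_owner,
       pk.table_name     p_table,
       pkc.column_name   pk_col,
       fk.owner          c_owner,
       fk.table_name     c_table,
       fkc.column_name   fk_col,
       fkc.position      col_position
  FROM dba_constraints  fk,
       dba_constraints  pk,
       dba_cons_columns fkc,
       dba_cons_columns pkc
 WHERE fk.constraint_type  = 'R'
   AND pk.owner            = fk.r_owner
   AND pk.constraint_name  = fk.r_constraint_name
   AND fkc.owner           = fk.owner
   AND fkc.constraint_name = fk.constraint_name
   AND pkc.owner           = pk.owner
   AND pkc.constraint_name = pk.constraint_name
   AND pkc.position        = fkc.position;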

Last, compile the code contained in this zip file. The package has four procedures that use the seed rows recorded in PARENT_CHILD_ROWIDS and the denormalized constraint information in PARENT_CHILD_BASE to identify all related data in the foreign-key hierarchy. All columns of each primary or unique key are matched to the corresponding foreign-key columns in a dynamically constructed SELECT statement, which is then executed via dynamic SQL. The collected rows are inserted into the PARENT_CHILD_ROWIDS table and become seeds for the next iteration. The process continues until no more rows are inserted.

Performance
The EXTRACT process will scan the PARENT_CHILD_BASE table thousands or millions of times. If this table is small enough, you can improve performance with in-memory table scans by altering the table to CACHE it in the SGA. Or, you can eliminate the table scans by creating this index:

CREATE INDEX Parent_Child_Base_IX
  ON Parent_Child_Base
     (P_owner, P_table, PK_col, C_owner, C_table, FK_col);
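
If you choose the caching alternative mentioned above instead, it is a single statement; CACHE tells Oracle to keep the table's blocks at the most recently used end of the buffer cache when they are read by full table scans:

ALTER TABLE parent_child_base CACHE;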

When you execute the EXTRACT package, the dynamic SQL execution procedure will perform full table scans wherever the foreign key being searched is not indexed. Make sure the required indexes are in place, because tables in a data warehouse can be extremely large.
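
One way to spot foreign-key columns that lack supporting indexes is to compare the child columns recorded in PARENT_CHILD_BASE against the data dictionary. The following is only a rough check (it looks for the foreign-key column as the leading column of some index on the child table):

-- Foreign-key columns recorded in PARENT_CHILD_BASE that are not the
-- leading column of any index on their child table.
SELECT DISTINCT b.c_owner, b.c_table, b.fk_col
  FROM parent_child_base b
 WHERE NOT EXISTS (
         SELECT 1
           FROM dba_ind_columns ic
          WHERE ic.table_owner     = b.c_owner
            AND ic.table_name      = b.c_table
            AND ic.column_name     = b.fk_col
            AND ic.column_position = 1);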

Only relationships declared as foreign keys appear in the PARENT_CHILD_BASE table. Since the process uses this table to search the foreign-key hierarchy, you can modify it. For example, an informal relationship with another table may be maintained by the application rather than by a foreign-key constraint. You can make such a relationship visible to the process by inserting an entry into the PARENT_CHILD_BASE table, as in the sketch below.
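
A hypothetical example (the SALE_NOTES table and its SALE_ID column are invented here, and the column list assumes the PARENT_CHILD_BASE structure sketched earlier):

-- Hypothetical: the application relates SALE_NOTES.SALE_ID to
-- SALE_EVENTS.SALE_ID without a declared foreign key; this row lets
-- the extract follow the relationship as if it were a constraint.
INSERT INTO parent_child_base
       (p_owner, p_table, pk_col, c_owner, c_table, fk_col, col_position)
VALUES ('SALES', 'SALE_EVENTS', 'SALE_ID', 'SALES', 'SALE_NOTES', 'SALE_ID', 1);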

Self-referencing and bidirectional foreign keys may generate unexpected and undesirable results. For example, if one of the customers has a CUST_REPORTS_TO_CUST_FK, the EXTRACT routine will select the reports-to customer and then run through its foreign-key hierarchy, which may select far more data than was intended. You can delete any undesired foreign-key relationship from the PARENT_CHILD_BASE table so that it is not searched.
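
For example, a hypothetical delete of the self-referencing relationship (the REPORTS_TO_CUST_ID column name is invented for illustration):

-- Stop the extract from following the CUST_REPORTS_TO_CUST_FK relationship.
DELETE FROM parent_child_base
 WHERE p_table = 'CUSTOMERS'
   AND c_table = 'CUSTOMERS'
   AND fk_col  = 'REPORTS_TO_CUST_ID';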

Be careful when adding rows to or removing rows from the PARENT_CHILD_BASE table. An alternative is to create a constraint that does not yet exist, or to temporarily disable an unwanted foreign key, before creating the PARENT_CHILD_BASE table. Only relationships defined in the PARENT_CHILD_BASE table are used in the foreign-key searches.

Identifying the Detail Data
Data collection works best when the database rows being identified and collected will not change. The process can take time if a large volume of data is being selected. With everything in place, identify the detail data by running:

EXECUTE EXTRACT.DetailData

DetailData matches the columns of the primary (or unique) keys of the seed tables in PARENT_CHILD_ROWIDS to foreign-key columns of tables that reference them. This information is used to form the dynamic SQL SELECT commands. When these are executed, all matching rows of related data are inserted into PARENT_CHILD_ROWIDS and become seeds for the next iteration. This continues until the end of the foreign-key hierarchy is reached.
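
As an illustration only (the SALES schema name and the CUSTOMER_ID join column are assumptions based on the example model, and the real statement is assembled at run time from PARENT_CHILD_BASE), one SELECT that DetailData might generate for the CUSTOMERS-to-CUSTOMER_ACCOUNTS relationship could resemble:

-- Hypothetical generated statement: collect the rowids of CUSTOMER_ACCOUNTS
-- rows whose foreign-key column matches the key of a seed CUSTOMERS row
-- already recorded in PARENT_CHILD_ROWIDS.
SELECT c.ROWID
  FROM sales.customer_accounts c,
       sales.customers         p,
       parent_child_rowids     r
 WHERE r.parent_table_name = 'CUSTOMERS'
   AND r.p_level           = 0
   AND p.ROWID             = r.p_rowid
   AND c.customer_id       = p.customer_id;

The rowids returned by such a statement are what end up in PARENT_CHILD_ROWIDS; because of the unique constraint on P_ROWID, any rowid collected twice is simply rejected.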

For example, one CUSTOMERS seed may result in the collection of a dozen CUSTOMER_ACCOUNTS. These accounts can then collect 60 or so CARDS. The cards in turn collect thousands of SALE_EVENTS and SALE_ITEMS. At the end of the detail collection, the PARENT_CHILD_ROWIDS table might look like Table 1 (selected rows displayed).

In the table, you can identify each iteration by an increase in the P_LEVEL. The PCR_SEQUENCE number increases sequentially but may skip numbers when it hits duplicate rows (they can't be inserted into the table).

Identifying the Reference Data
This next step may be unnecessary for some purposes, such as moving the data into a database that already contains the reference data. But if you need to include all reference and code data, you must backtrack through the foreign-key hierarchy, using:

EXECUTE EXTRACT.ReferenceData

ReferenceData matches the columns of foreign keys from seed tables in PARENT_CHILD_ROWIDS to primary (or unique) key columns from tables that are referenced by the foreign keys. The procedure is otherwise much like DetailData. For the reference data collection, the P_LEVEL is started at -1 and counts backward while the sequence number continues to increment, so you'll know in what order each piece of data was collected, as well as the process and iteration level.
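
Continuing the hypothetical example (the ACCOUNT_TYPES reference table and its ACCOUNT_TYPE_ID key are invented for illustration), a statement generated by ReferenceData might resemble:

-- Hypothetical generated statement: collect the rowid of the ACCOUNT_TYPES
-- row that a collected CUSTOMER_ACCOUNTS row refers to through its foreign key.
SELECT t.ROWID
  FROM sales.account_types     t,
       sales.customer_accounts c,
       parent_child_rowids     r
 WHERE r.parent_table_name = 'CUSTOMER_ACCOUNTS'
   AND c.ROWID             = r.p_rowid
   AND t.account_type_id   = c.account_type_id;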

All the rows in all the tables that are related to the original seed rowids will have been identified at the completion of this procedure. The following scripts may lend some insights:

select count(*),p_level,parent_table_name from parent_child_rowids
group by p_level,parent_table_name;

select count(*),parent_table_name from parent_child_rowids
group by parent_table_name;

The first script lists the number of rows in each table that were selected by each iteration of the extraction process. Where tables are referenced by a number of foreign keys, you will see the same table names listed at different iteration levels. The second script simply lists the total number of rows gathered for each table. This information can help you estimate data space requirements.

Extracting the Data
The PARENT_CHILD_ROWIDS table contains the rowids of all the data to be extracted. If this table is large, you can improve data-collection performance by creating an index:

CREATE INDEX PCR_IX
  ON PARENT_CHILD_ROWIDS (P_OWNER, PARENT_TABLE_NAME, P_ROWID);

You can extract the data into small versions of the original tables by spooling the output of a script such as:

select 'create table '||parent_table_name||
    ' as select T.* from '||p_owner||'.'||parent_table_name||
    ' T, parent_child_rowids PCR '||
    ' where PCR.p_owner = '''||p_owner||
    ''' and PCR.parent_table_name = '''||parent_table_name||
    ''' and PCR.p_rowid = T.rowid;'
 from parent_child_rowids
 group by p_owner, parent_table_name
 order by p_owner, parent_table_name;

When you run the spooled file, all the data is collected into tables of the same name in the current schema. You can then apply indexes and foreign keys as required. You now have a complete, referentially intact subset of data.

To archive, you can export the tables as a complete set. You can also make quick work of deleting obsolete detail data by using the PARENT_TABLE_NAME and P_ROWID columns of the PARENT_CHILD_ROWIDS table.
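
A minimal sketch of such a delete-generation script, under the assumption that detail rows carry nonnegative P_LEVEL values and that deleting the deepest detail first (highest P_LEVEL) keeps any enabled foreign keys satisfied:

-- Generate one DELETE per collected detail rowid, children before parents.
select 'delete from '||p_owner||'.'||parent_table_name||
       ' where rowid = chartorowid('''||p_rowid||''');'
  from parent_child_rowids
 where p_level >= 0
 order by p_level desc, pcr_sequence desc;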

This technique works well for collecting the data into a schema where the tables do not already exist. Inserting the data into an existing database is more complicated. You can prevent duplicate rows by allowing primary (or unique) key constraints to fail upon insert. If you were to insert a large number of rows via a single insert statement and one or more duplicate rows were found, the entire set would fail. If you insert the rows one at a time, then only the duplicates fail.

With foreign keys enabled, you need to consider row-insertion order. Foreign keys keep you from inserting rows into a detail table if the master table row doesn't exist.

To insert into a database with foreign keys, you can use a script such as the one in Printing a Hierarchical List of Referring and Referenced. Run the script in the schema in which the tables and constraints reside to produce the LEVEL_TAB table. You can use this table to provide the correct insertion order. The following script uses LEVEL_TAB to generate individual insert statements in the correct order for insertion into an existing database:

select 'insert into '||R.parent_table_name||
    ' select * from '||R.p_owner||'.'||R.parent_table_name||
    ' where rowid = chartorowid('''||R.p_rowid||''');'
 from parent_child_rowids R, owner.level_tab L
 where R.parent_table_name = L.table_name
 order by L.level_number, R.p_owner, R.parent_table_name, R.pcr_sequence;

Inserting rows singly isn't efficient. If you're sure there's no row duplication between the target database and the insertion data, you can modify the script to insert in table sets:

select 'insert into '||R.parent_table_name||
    ' (select T.* from '||R.p_owner||'.'||R.parent_table_name||
    ' T, parent_child_rowids PCR '||
    ' where PCR.p_owner = '''||R.p_owner||
    ''' and PCR.parent_table_name = '''||R.parent_table_name||
    ''' and PCR.p_rowid = T.rowid );'
 from parent_child_rowids R, owner.level_tab L
 where R.parent_table_name = L.table_name
 group by L.level_number, R.p_owner, R.parent_table_name
 order by L.level_number, R.p_owner, R.parent_table_name;

Summary
Providing good data for program and report testing can greatly improve the quality of your end product: the production database. The process described here can be a painless and reasonably quick way to extract a representative and referentially intact subset of data from a larger database. A key factor is knowing the data and the foreign-key relationships well. Choosing seed rows wisely will guarantee a good crop of data, just as testing the waters of Lake Erie will help you catch the winning fish.

Wayne Linton ([email protected]) is the Oracle database administrator on a large development project at Shell Canada Ltd.


This is a copy of an article published @ http://www.oramag.com/