Creating my own #ETL data validation #FTW

Recently on the same project I created an Agile Data Warehouse and Extract, Transform, and Load automated test suite, I was tasked to create a data validation process. We need to create a foundational process that could be leveraged to provide ongoing data validation for the data load process. We were responsible for loading data into the Operational Data Store, but we were the last step in a lengthy process. Previously, the data made hops from Legacy to web applications, to SAP, and then finally to the Operational Data store. Further complicating the issue was that at each step, there was a different data technology used:

  • Legacy – ISAM
  • Web Applications – Sybase
  • SAP – Oracle
  • Operational Data Store – SQL Server 2012

Aggregation is your friend!

What we designed was a process where we created a set of the same aggregation reports at each of the four points along the data migration. Typically people create aggregation reports for the obvious financial transactions with numeric totals. These are reports that will sum the transactions across dates, clients, or accounts. We decided we needed to create aggregate reports across all tables and objects in the databases whether they were transactional or not.

Once we analyzed our Operational Data Store with 380+ tables, we determined that we required 77 aggregate reports to provide coverage across the data model.

For example, the aggregate reports that reported on agreements/contracts were created with the following SQL:

— contracts/agreements by client

SELECT client_number, COUNT(*) AS total
INTO etl_validation_temp.dbo.validation_agreement_count_client
FROM ods.dbo.agreement, ods.dbo.client
WHERE ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;

— sum contract/agreement numbers by client

SELECT client_number, sum(coalesce(agreement_number,0)) AS total_agreement_number
INTO etl_validation_temp.dbo.validation_agreement_number_client
FROM ods.dbo.agreement, ods.dbo.client
WHERE ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;

— contract/agreement numbers by renewal month

SELECT renewal_month, COUNT(*) AS total
INTO etl_validation_temp.dbo.validation_agreement_count_renewal_month
FROM ods.dbo.agreement
GROUP BY renewal_month;

— sum contract/agreement numbers by renewal month

SELECT renewal_month, sum(coalesce(agreement_number,0)) AS total_agreement_number
INTO etl_validation_temp.dbo.validation_agreement_number_renewal_month
FROM ods.dbo.agreement
GROUP BY renewal_month;

The Process

The most interesting thing about the process was how I was able to create a copy of each report in a separate database. The solution is designed so that the process can run every day and create 77 reports in a ‘etl_validation’ database suffixed with the date. The solution is also configurable so that databases older than 90 days are automatically deleted.

The challenge I had was that the ‘INTO’ statement couldn’t have a variable that named a unique database suffixed with the date. So you will see in the SQL that the reports are created in a “etl_validation_temp” database that is later renamed to a database name suffixed with the date. I thought about creating dynamic SQL where I could have specified the database name in the INTO clause, but I thought that solution would not have been as maintainable.

The Problem

So all was good right? Not so fast. It worked for the first database, but then I ran into a problem when I created the second database. Although I could rename the database, I couldn’t programmatically rename the files the database uses. So then second database couldn’t be created as it tried to use the same file name.

Fair enough…

So I determined I can create the databases initially with the database name suffixed with the date, rename the database to be suffixed with “temp” to allow the creation of the aggregate reports, and then rename it back to the original database name suffixed with the date. After this I thought about whether the dynamic SQL solution was cleaner, but I decided to stick with this one.

Summary

Regardless, the process works and has been running to create these reports in under 20 minutes every day. Not too bad since some of the tables have over a 100 million records and the database is larger than 80 gigs…

Here is the SQL is created to initially create the database:

SELECT ‘started at ‘ + CAST(GETDATE() AS NVARCHAR(30))

DECLARE @currentdate NVARCHAR(20)
DECLARE @newname sysname

SET @currentdate = CONVERT(VARCHAR(20),GETDATE(),112)

set @newname = ‘etl_validation_’ + @currentdate

exec(‘create database ‘ + @newname)

USE master

EXEC(‘ALTER DATABASE ‘ + @newname +’ Modify Name = etl_validation_temp;’)
GO

And to rename it at the end..

DECLARE @currentdate NVARCHAR(20)
DECLARE @newname sysname

SET @currentdate = CONVERT(VARCHAR(20),GETDATE(),112)

set @newname = ‘etl_validation_’ + @currentdate

USE master;

EXEC(‘ALTER DATABASE etl_validation_temp Modify Name = ‘ + @newname + ‘;’)
GO

SELECT ‘ended at ‘ + CAST(GETDATE() AS NVARCHAR(30))
GO

And finally, the SQL to clean up database older than 90 days:

USE master
go

DECLARE @currentdate DATETIME
DECLARE @limitdate DATETIME

DECLARE @currentdatetxt NVARCHAR(20)
DECLARE @limitdatetxt NVARCHAR(20)

DECLARE @maxcounter INTEGER
DECLARE @counter INTEGER

DECLARE @database_name NVARCHAR(80)
DECLARE @sql NVARCHAR(255)

SET @currentdate = GETDATE()
SET @limitdate = @currentdate – 90

SET @currentdatetxt = CONVERT(VARCHAR(20),@currentdate,112)
SET @limitdatetxt = CONVERT(VARCHAR(20),@limitdate,112)

DECLARE @olddbs TABLE
(
id INT IDENTITY(1, 1) PRIMARY KEY WITH FILLFACTOR = 100
,[dbname] nvarchar(255)
)
INSERT INTO @olddbs ( [dbname])
SELECT name FROM sys.databases WHERE name LIKE ‘etl_validation%’ AND create_date < @limitdate

SET @MaxCounter = @@ROWCOUNT

SET @counter = 1
WHILE @counter <= @MaxCounter
BEGIN

——Execution

SELECT @database_name = [dbname]
FROM @olddbs
WHERE id = @counter

SET @sql = ‘drop database ‘ + @database_name + ‘;’

PRINT @sql

exec sp_executesql @SQL

SET @Counter = @counter + 1;

END;
go

 

 

Advertisement

Author: Terry Bunio

Terry Bunio is passionate about his work as the Manager of the Project Management Office at the University of Manitoba. Terry oversees the governance on Information Technology projects to make sure the most important projects are being worked on in a consistent and effective way. Terry also provides leadership on the customized Project Methodology that is followed. The Project Methodology is a equal mix of Prince2, Agile, Traditional, and Business Value. Terry strives to bring Brutal Visibility, Eliminating Information islands, Right Sizing Documentation, Promoting Collaboration and Role-Based Non-Consensus, and short Feedback Loops to Minimize Inventory to the Agile Project Management Office. As a fan of pragmatic Agile, Terry always tries to determine if we can deliver value as soon as possible through iterations. As a practical Project Manager, Terry is known to challenge assumptions and strive to strike the balance between the theoretical and real world approaches for both Traditional and Agile approaches. Terry is a fan of AWE (Agile With Estimates), the Green Bay Packers, Winnipeg Jets, and asking why?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: