Recently, on the same project where I created an Agile Data Warehouse and an automated Extract, Transform, and Load (ETL) test suite, I was asked to create a data validation process: a foundational process that could be leveraged to provide ongoing validation of the data load. We were responsible for loading data into the Operational Data Store, but we were the last step in a lengthy chain. Before reaching us, the data hopped from the legacy system to the web applications, then to SAP, and finally to the Operational Data Store. Further complicating the issue, each step used a different data technology:
- Legacy – ISAM
- Web Applications – Sybase
- SAP – Oracle
- Operational Data Store – SQL Server 2012
Aggregation is your friend!
What we designed was a process where we created the same set of aggregation reports at each of the four points along the data migration. Typically, people create aggregation reports only for the obvious financial transactions with numeric totals: reports that sum the transactions across dates, clients, or accounts. We decided we needed aggregate reports across all tables and objects in the databases, whether they were transactional or not.
Once we analyzed our Operational Data Store with 380+ tables, we determined that we required 77 aggregate reports to provide coverage across the data model.
For example, the aggregate reports covering agreements/contracts were created with the following SQL:
-- count of contracts/agreements by client
SELECT client_number, COUNT(*) AS total
INTO etl_validation_temp.dbo.validation_agreement_count_client
FROM ods.dbo.agreement
JOIN ods.dbo.client ON ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;

-- sum of contract/agreement numbers by client
SELECT client_number, SUM(COALESCE(agreement_number, 0)) AS total_agreement_number
INTO etl_validation_temp.dbo.validation_agreement_number_client
FROM ods.dbo.agreement
JOIN ods.dbo.client ON ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;

-- count of contracts/agreements by renewal month
SELECT renewal_month, COUNT(*) AS total
INTO etl_validation_temp.dbo.validation_agreement_count_renewal_month
FROM ods.dbo.agreement
GROUP BY renewal_month;

-- sum of contract/agreement numbers by renewal month
SELECT renewal_month, SUM(COALESCE(agreement_number, 0)) AS total_agreement_number
INTO etl_validation_temp.dbo.validation_agreement_number_renewal_month
FROM ods.dbo.agreement
GROUP BY renewal_month;
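Once two copies of the same report live on one SQL Server instance, for example two dated snapshots, or a report loaded from an upstream system alongside the Operational Data Store version, a set-difference query is one simple way to flag mismatches. A minimal sketch, with hypothetical snapshot database names:

-- Report rows present in the first snapshot but not the second (database names are illustrative)
SELECT client_number, total
FROM etl_validation_20140101.dbo.validation_agreement_count_client
EXCEPT
SELECT client_number, total
FROM etl_validation_20140102.dbo.validation_agreement_count_client;

Running the same query in both directions (or joining on the grouping key) confirms the two copies of the report agree exactly.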
The Process
The most interesting part of the process was how I created a copy of each report in a separate database. The solution is designed so that the process can run every day and create the 77 reports in an 'etl_validation' database suffixed with the date. It is also configurable so that databases older than 90 days are automatically deleted.
The challenge I had was that the INTO clause can't take a variable, so the reports couldn't be written directly into a uniquely named database suffixed with the date. That is why the SQL above creates the reports in an 'etl_validation_temp' database that is later renamed to a name suffixed with the date. I thought about writing dynamic SQL so that I could specify the database name in the INTO clause, but I felt that solution would not have been as maintainable.
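For reference, a minimal sketch of what that dynamic SQL alternative could look like, using the first agreement report from above. Every one of the 77 reports would have to live inside a string like this, which is the maintainability cost:

-- Dynamic SQL sketch: build the dated database name into the INTO clause at run time
DECLARE @dbname sysname = 'etl_validation_' + CONVERT(VARCHAR(20), GETDATE(), 112);
DECLARE @sql NVARCHAR(MAX) =
    'SELECT client_number, COUNT(*) AS total
     INTO ' + QUOTENAME(@dbname) + '.dbo.validation_agreement_count_client
     FROM ods.dbo.agreement AS a
     JOIN ods.dbo.client AS c ON a.client_id = c.client_id
     GROUP BY client_number;';
EXEC sp_executesql @sql;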
The Problem
So all was good, right? Not so fast. It worked for the first database, but I ran into a problem when I created the second one. Although I could rename the database, I couldn't programmatically rename the files the database uses, so the second database couldn't be created because it tried to use the same file names.
Fair enough…
So I determined I could create each database initially with the name suffixed with the date, rename it to the name suffixed with "temp" so the aggregate reports could be created, and then rename it back to the original name suffixed with the date. After this I reconsidered whether the dynamic SQL solution was cleaner, but I decided to stick with this one.
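Putting the pieces together, the nightly run ends up with this shape (a descriptive outline only; the actual scripts follow in the next section):

-- 1. CREATE DATABASE etl_validation_YYYYMMDD, then rename it to etl_validation_temp
--    (the physical files keep the dated names, so the next day's CREATE DATABASE does not collide)
-- 2. Run the 77 aggregate report scripts, each doing SELECT ... INTO etl_validation_temp.dbo.<report>
-- 3. Rename etl_validation_temp back to etl_validation_YYYYMMDD
-- 4. Drop any etl_validation_* database older than 90 days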
Summary
Regardless, the process works and has been creating these reports in under 20 minutes every day. Not too bad, since some of the tables have over 100 million records and the database is larger than 80 GB…
Here is the SQL used to initially create the database:
SELECT 'started at ' + CAST(GETDATE() AS NVARCHAR(30));

DECLARE @currentdate NVARCHAR(20);
DECLARE @newname sysname;

-- Build the dated database name, e.g. etl_validation_YYYYMMDD
SET @currentdate = CONVERT(VARCHAR(20), GETDATE(), 112);
SET @newname = 'etl_validation_' + @currentdate;

-- Create the dated database, then rename it to etl_validation_temp
-- so the report scripts can use a fixed name in their INTO clauses
EXEC('CREATE DATABASE ' + @newname);

USE master
EXEC('ALTER DATABASE ' + @newname + ' MODIFY NAME = etl_validation_temp;');
GO
And the SQL to rename it back at the end:
DECLARE @currentdate NVARCHAR(20);
DECLARE @newname sysname;

-- Rename etl_validation_temp back to the dated name now that the reports exist
SET @currentdate = CONVERT(VARCHAR(20), GETDATE(), 112);
SET @newname = 'etl_validation_' + @currentdate;

USE master;
EXEC('ALTER DATABASE etl_validation_temp MODIFY NAME = ' + @newname + ';');
GO

SELECT 'ended at ' + CAST(GETDATE() AS NVARCHAR(30));
GO
And finally, the SQL to clean up databases older than 90 days:
USE master;
GO

DECLARE @currentdate DATETIME;
DECLARE @limitdate DATETIME;
DECLARE @maxcounter INTEGER;
DECLARE @counter INTEGER;
DECLARE @database_name NVARCHAR(80);
DECLARE @sql NVARCHAR(255);

SET @currentdate = GETDATE();
SET @limitdate = DATEADD(DAY, -90, @currentdate);

-- Collect the validation databases older than the 90-day limit
DECLARE @olddbs TABLE
(
    id INT IDENTITY(1, 1) PRIMARY KEY
    ,[dbname] NVARCHAR(255)
);

INSERT INTO @olddbs ([dbname])
SELECT name
FROM sys.databases
WHERE name LIKE 'etl_validation%' AND create_date < @limitdate;

SET @maxcounter = @@ROWCOUNT;
SET @counter = 1;

-- Drop each expired database
WHILE @counter <= @maxcounter
BEGIN
    SELECT @database_name = [dbname]
    FROM @olddbs
    WHERE id = @counter;

    SET @sql = 'DROP DATABASE ' + QUOTENAME(@database_name) + ';';
    PRINT @sql;
    EXEC sp_executesql @sql;

    SET @counter = @counter + 1;
END;
GO