Many times as Data professionals we no longer have full control over the quality of data in the source systems. I am discussing SAP in my example, but I could have easily mentioned PeopleSoft, SalesForce, or a number of other purchased solutions. Usually those solutions are purchased and then we are tasked with maintaining those environments and also extracting data from those environments to be incorporated into a Business Intelligence corporate solution.
Our issue is one somewhat of our own choosing as well. We want to enforce integrity and constraints at a greater level than what was intended and specified in the purchased applications. This may be for a variety of reasons including that the business never specified it as a requirement. It may also be that the purchased application was never built to handle that level of integrity.
To be clear, this isn’t a complaint but more a reflection of reality. We as Data professionals are going to receive data that is not as consistent and complete as we as Data professionals want it to be. (I purposely did not state ‘require’ as there could be a discussion of what is truly required) So what are we to do?
The Problem
Typically we end up extracting data from these purchased applications and load them into a consolidated database. This database can be either a relational or dimensional database. We also typically need to cleanse the data we are loading so load the business can report on the data in a clear and consistent manner.
The challenge is what we do with data that we cannot load in a consistent manner. We really have two options; modify the data or reject it outright. Although there are many types of inconsistent data we may need to correct, I will limit my discussion to data that links tables together. Typically we define Referential Integrity or Foreign Keys constraints to ensure that the data to link tables are valid so that reports and queries return correct results.
Possible Solution
When we have more control over the quality of source systems, I usually see the solution embedded in the Extract, Transform, and Load (ETL) solution that extracts and loads the data into a corporate database. This is because the data issues will be more known, of lesser frequency, and the data issues are things we can correct ourselves. In this type of solution, the Foreign Key constraints are always enabled and the ETL solution validates all the data values before trying to insert the data in the database. Any errors that are encountered will result in the data being changed or rejected and an error written to a log file.
There are two majors issue with this approach:
1) Performance – The look-up to validate all Foreign Keys row by row can cause the process to run slower. It can eliminate a performant two step approach where some of the fields can be set in a subsequent SQL Update statement. (Depending on the column’s Nullability) It can also prevent the use of some bulk load methods in SQL Server Integration Services.
2) Availability – If major data issues are encountered, the data issues may prevent the data load from continuing and may affect the availability of the database.
Our Solution
Since we are loading data from multiple external providers, we designed a different solution.
Although we have Foreign Key constraints on the entire database, they will be disabled during the load. (and during the week) We will enable them every Sunday to validate the data loaded has not broken integrity rules. If we find we cannot re-enable any constraint, we will email the Data Team informing them of the offending constraint for investigation. If all Foreign Key constraints can be re-enabled, we will inform the Data Team of the success and disable them again.
We could also do this re-enabling nightly if we start to encountered more frequent data errors.
In this manner, we are in a better position to react to data outside of our control and load the data as quickly as possible.
Our SQL Server Solution
A couple of things to note about our SQL Server solution. Frequently I see the solution to re-enable all constraints use the sp_msforeachtable stored procedure. A sample of how to do this is listed below:
EXEC sp_msforeachtable “ALTER TABLE ? NOCHECK CONSTRAINT all”
This solution is virtually useless you can guarantee all your constraints can be re-enabled without failure. If one constraint fails, it will stop the process. Not good.
To accommodate the ability to re-enable all constraints even when errors are encountered we created our own processes to disable and re-enable our constraints using a cursor.
Here is the disable constraints SQL
DECLARE @disable_sql NVARCHAR(255)
SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
QUOTENAME(o.name) AS CONSTRAINT_NAME,
QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
CASE fk.is_disabled
WHEN 0 THEN ‘CHECK’
ELSE ‘NOCHECK’
END AS [ENABLED]
INTO temp_disable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
AND rccu.CONSTRAINT_NAME = o.name
AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
AND ric.index_id = fk.key_index_id
AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
AND rc.column_id = ric.column_id
DECLARE disable_cursor CURSOR for
SELECT ‘ALTER TABLE ‘ + FOREIGN_TABLE_SCHEMA + ‘.’ + FOREIGN_TABLE_NAME
+ ‘ ‘ + ‘ NOCHECK CONSTRAINT ‘ + CONSTRAINT_NAME
FROM temp_disable_constraints
OPEN disable_cursor
FETCH NEXT FROM disable_cursor INTO @disable_sql
WHILE @@FETCH_STATUS = 0
BEGIN
PRINT @disable_sql
EXEC sp_executesql @disable_sql
FETCH NEXT FROM disable_cursor INTO @disable_sql
END
CLOSE disable_cursor
DEALLOCATE disable_cursor
DROP TABLE temp_disable_constraints
And our re-enable constraint SQL:
DECLARE @enable_sql NVARCHAR(255)
SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
QUOTENAME(o.name) AS CONSTRAINT_NAME,
QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
CASE fk.is_disabled
WHEN 0 THEN ‘CHECK’
ELSE ‘NOCHECK’
END AS [ENABLED]
INTO temp_enable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
AND rccu.CONSTRAINT_NAME = o.name
AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
AND ric.index_id = fk.key_index_id
AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
AND rc.column_id = ric.column_id
DECLARE enable_cursor CURSOR for
SELECT ‘ALTER TABLE ‘ + FOREIGN_TABLE_SCHEMA + ‘.’ + FOREIGN_TABLE_NAME
+ ‘ ‘ + ‘ WITH CHECK CHECK CONSTRAINT ‘ + CONSTRAINT_NAME
FROM temp_enable_constraints
OPEN enable_cursor
FETCH NEXT FROM enable_cursor INTO @enable_sql
WHILE @@FETCH_STATUS = 0
BEGIN
BEGIN TRY
EXEC sp_executesql @enable_sql
END TRY
BEGIN CATCH
PRINT ‘ERROR–>’ + @enable_sql
FETCH NEXT FROM enable_cursor INTO @enable_sql
CONTINUE
END CATCH
FETCH NEXT FROM enable_cursor INTO @enable_sql
END
CLOSE enable_cursor
DEALLOCATE enable_cursor
DROP TABLE temp_enable_constraints
Conclusion
This solution has provided us the flexibility to load our data as efficiently as possible and validate our Foreign Key relationships on a recurring basis. It also minimizes the chance that our load process will stop mid-stream. Did I mentioned this is a key requirements as we are loading data into the Data Warehouse every 60 minutes? 🙂
I was initially concerned with how long it would take to re-enable the constraints, but it only takes 75 minutes to re-enable 616 Foreign Key constraints on a 1.1 Terabyte database. Thanks Microsoft!
Now that we have this process we also plan to use it on large software deployments just to ensure to major data issues were introduced with the deployment as well.