#SAP, Breaking Data, and Re-enabling #SQLServer Database Referential Integrity Constraints #Microsoft #FTW

Many times, as Data professionals, we do not have full control over the quality of data in the source systems. I am discussing SAP in my example, but I could have easily mentioned PeopleSoft, SalesForce, or a number of other purchased solutions. Usually those solutions are purchased, and then we are tasked with maintaining those environments and also with extracting their data to be incorporated into a corporate Business Intelligence solution.

Our issue is also somewhat one of our own choosing. We want to enforce integrity and constraints at a greater level than what was intended and specified in the purchased applications. This may be for a variety of reasons, including that the business never specified it as a requirement, or that the purchased application was never built to handle that level of integrity.

To be clear, this isn’t a complaint but more a reflection of reality. As Data professionals, we are going to receive data that is not as consistent and complete as we want it to be. (I purposely did not state ‘require’ as there could be a discussion of what is truly required.) So what are we to do?

The Problem

Typically we end up extracting data from these purchased applications and loading it into a consolidated database. This database can be either a relational or a dimensional database. We also typically need to cleanse the data we are loading so that the business can report on it in a clear and consistent manner.

The challenge is what we do with data that we cannot load in a consistent manner. We really have two options: modify the data or reject it outright. Although there are many types of inconsistent data we may need to correct, I will limit my discussion to data that links tables together. Typically we define Referential Integrity, or Foreign Key, constraints to ensure that the data linking tables is valid so that reports and queries return correct results.
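As a quick illustration, a Foreign Key constraint simply declares that a column may only hold values that exist in the referenced table. The table and column names below are illustrative, not from any particular source system:

ALTER TABLE dbo.agreement
ADD CONSTRAINT FK_agreement_client
FOREIGN KEY (client_id) REFERENCES dbo.client (client_id);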

Possible Solution

When we have more control over the quality of source systems, I usually see the solution embedded in the Extract, Transform, and Load (ETL) process that extracts and loads the data into a corporate database. This is because the data issues are better known, less frequent, and generally things we can correct ourselves. In this type of solution, the Foreign Key constraints are always enabled and the ETL process validates all the data values before trying to insert the data into the database. Any errors that are encountered result in the data being changed or rejected and an error written to a log file.

There are two major issues with this approach:

1) Performance – The look-up to validate all Foreign Keys row by row can cause the process to run slower. It can eliminate a performant two-step approach where some of the fields are set in a subsequent SQL UPDATE statement. (Depending on the column’s nullability.) It can also prevent the use of some bulk load methods in SQL Server Integration Services.

2) Availability – If major data issues are encountered, the data issues may prevent the data load from continuing and may affect the availability of the database.

Our Solution

Since we are loading data from multiple external providers, we designed a different solution.

Although we have Foreign Key constraints on the entire database, they will be disabled during the load. (and during the week) We will enable them every Sunday to validate the data loaded has not broken integrity rules. If we find we cannot re-enable any constraint, we will email the Data Team informing them of the offending constraint for investigation. If all Foreign Key constraints can be re-enabled, we will inform the Data Team of the success and disable them again.
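As a side note, a quick way to see which Foreign Key constraints are currently disabled, or enabled but not trusted, is to query sys.foreign_keys. A minimal sketch:

SELECT name AS constraint_name,
       OBJECT_SCHEMA_NAME(parent_object_id) AS table_schema,
       OBJECT_NAME(parent_object_id) AS table_name,
       is_disabled,
       is_not_trusted
FROM sys.foreign_keys
WHERE is_disabled = 1 OR is_not_trusted = 1;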

We could also do this re-enabling nightly if we start to encounter more frequent data errors.

In this manner, we are in a better position to react to data outside of our control and load the data as quickly as possible.

Our SQL Server Solution

A couple of things to note about our SQL Server solution. Frequently I see solutions that disable and re-enable all constraints using the sp_msforeachtable stored procedure. A sample of how to disable them all is listed below:

EXEC sp_msforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT all'
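The matching command to re-enable and re-validate every constraint follows the same pattern:

EXEC sp_msforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT all'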

This solution is virtually useless unless you can guarantee all your constraints can be re-enabled without failure. If one constraint fails, it will stop the process. Not good.

To accommodate re-enabling all constraints even when errors are encountered, we created our own processes to disable and re-enable our constraints using a cursor.

Here is the disable constraints SQL:

DECLARE @disable_sql NVARCHAR(255)

SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
QUOTENAME(o.name) AS CONSTRAINT_NAME,
QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
CASE fk.is_disabled
WHEN 0 THEN 'CHECK'
ELSE 'NOCHECK'
END AS [ENABLED]
INTO temp_disable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
AND rccu.CONSTRAINT_NAME = o.name
AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
AND ric.index_id = fk.key_index_id
AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
AND rc.column_id = ric.column_id

DECLARE disable_cursor CURSOR for
SELECT 'ALTER TABLE ' + FOREIGN_TABLE_SCHEMA + '.' + FOREIGN_TABLE_NAME
+ ' ' + ' NOCHECK CONSTRAINT ' + CONSTRAINT_NAME
FROM temp_disable_constraints

OPEN disable_cursor
FETCH NEXT FROM disable_cursor INTO @disable_sql

WHILE @@FETCH_STATUS = 0
BEGIN

PRINT @disable_sql

EXEC sp_executesql @disable_sql
FETCH NEXT FROM disable_cursor INTO @disable_sql

END

CLOSE disable_cursor
DEALLOCATE disable_cursor
DROP TABLE temp_disable_constraints

And our re-enable constraint SQL:

DECLARE @enable_sql NVARCHAR(255)

SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
QUOTENAME(o.name) AS CONSTRAINT_NAME,
QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
CASE fk.is_disabled
WHEN 0 THEN 'CHECK'
ELSE 'NOCHECK'
END AS [ENABLED]
INTO temp_enable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
AND rccu.CONSTRAINT_NAME = o.name
AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
AND ric.index_id = fk.key_index_id
AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
AND rc.column_id = ric.column_id

DECLARE enable_cursor CURSOR for
SELECT 'ALTER TABLE ' + FOREIGN_TABLE_SCHEMA + '.' + FOREIGN_TABLE_NAME
+ ' ' + ' WITH CHECK CHECK CONSTRAINT ' + CONSTRAINT_NAME
FROM temp_enable_constraints

OPEN enable_cursor
FETCH NEXT FROM enable_cursor INTO @enable_sql

WHILE @@FETCH_STATUS = 0
BEGIN

BEGIN TRY
EXEC sp_executesql @enable_sql
END TRY

BEGIN CATCH
PRINT 'ERROR --> ' + @enable_sql
FETCH NEXT FROM enable_cursor INTO @enable_sql
CONTINUE
END CATCH

FETCH NEXT FROM enable_cursor INTO @enable_sql

END

CLOSE enable_cursor
DEALLOCATE enable_cursor
DROP TABLE temp_enable_constraints
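The email notification to the Data Team is not shown in the script above. A minimal sketch using Database Mail, assuming Database Mail is configured and that the profile name and recipient below are placeholders, could be called from the CATCH block or after the cursor completes:

-- hypothetical profile and recipient; adjust to your environment
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'ETL Notifications',
    @recipients = 'datateam@example.com',
    @subject = 'Foreign Key constraint failed to re-enable',
    @body = 'One or more Foreign Key constraints could not be re-enabled. See the job output for details.';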

Conclusion

This solution has provided us the flexibility to load our data as efficiently as possible and to validate our Foreign Key relationships on a recurring basis. It also minimizes the chance that our load process will stop mid-stream. Did I mention this is a key requirement, as we are loading data into the Data Warehouse every 60 minutes? 🙂

I was initially concerned with how long it would take to re-enable the constraints, but it only takes 75 minutes to re-enable 616 Foreign Key constraints on a 1.1 Terabyte database. Thanks Microsoft!

Now that we have this process, we also plan to use it on large software deployments just to ensure no major data issues were introduced with the deployment.

Creating my own #ETL data validation #FTW

Recently, on the same project where I created an Agile Data Warehouse and an Extract, Transform, and Load automated test suite, I was tasked with creating a data validation process. We needed to create a foundational process that could be leveraged to provide ongoing data validation for the data load process. We were responsible for loading data into the Operational Data Store, but we were the last step in a lengthy process. Previously, the data made hops from Legacy to web applications, to SAP, and then finally to the Operational Data Store. Further complicating the issue was that at each step, a different data technology was used:

  • Legacy – ISAM
  • Web Applications – Sybase
  • SAP – Oracle
  • Operational Data Store – SQL Server 2012

Aggregation is your friend!

What we designed was a process where we created a set of the same aggregation reports at each of the four points along the data migration. Typically people create aggregation reports for the obvious financial transactions with numeric totals. These are reports that will sum the transactions across dates, clients, or accounts. We decided we needed to create aggregate reports across all tables and objects in the databases whether they were transactional or not.

Once we analyzed our Operational Data Store with 380+ tables, we determined that we required 77 aggregate reports to provide coverage across the data model.

For example, the aggregate reports that reported on agreements/contracts were created with the following SQL:

-- contracts/agreements by client

SELECT client_number, COUNT(*) AS total
INTO etl_validation_temp.dbo.validation_agreement_count_client
FROM ods.dbo.agreement, ods.dbo.client
WHERE ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;

-- sum contract/agreement numbers by client

SELECT client_number, sum(coalesce(agreement_number,0)) AS total_agreement_number
INTO etl_validation_temp.dbo.validation_agreement_number_client
FROM ods.dbo.agreement, ods.dbo.client
WHERE ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;

-- contracts/agreements by renewal month

SELECT renewal_month, COUNT(*) AS total
INTO etl_validation_temp.dbo.validation_agreement_count_renewal_month
FROM ods.dbo.agreement
GROUP BY renewal_month;

-- sum contract/agreement numbers by renewal month

SELECT renewal_month, sum(coalesce(agreement_number,0)) AS total_agreement_number
INTO etl_validation_temp.dbo.validation_agreement_number_renewal_month
FROM ods.dbo.agreement
GROUP BY renewal_month;

The Process

The most interesting thing about the process was how I was able to create a copy of each report in a separate database. The solution is designed so that the process can run every day and create the 77 reports in an ‘etl_validation’ database suffixed with the date. The solution is also configurable so that databases older than 90 days are automatically deleted.

The challenge I had was that the ‘INTO’ clause couldn’t reference a database name held in a variable, so I couldn’t SELECT INTO a uniquely named database suffixed with the date. So you will see in the SQL that the reports are created in an ‘etl_validation_temp’ database that is later renamed to a database name suffixed with the date. I thought about creating dynamic SQL where I could have specified the database name in the INTO clause, but I thought that solution would not have been as maintainable.
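For comparison, the dynamic SQL alternative I decided against would have looked roughly like the sketch below (the report query is abbreviated; the point is only that the database name can be concatenated into the INTO clause):

DECLARE @dbname sysname
DECLARE @sql NVARCHAR(MAX)

SET @dbname = 'etl_validation_' + CONVERT(VARCHAR(20), GETDATE(), 112)

SET @sql = N'SELECT client_number, COUNT(*) AS total
INTO ' + QUOTENAME(@dbname) + N'.dbo.validation_agreement_count_client
FROM ods.dbo.agreement, ods.dbo.client
WHERE ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;'

EXEC sp_executesql @sql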

The Problem

So all was good, right? Not so fast. It worked for the first database, but then I ran into a problem when I created the second database. Although I could rename the database, I couldn’t programmatically rename the files the database uses. So the second database couldn’t be created, as it tried to use the same file names.

Fair enough…

So I determined I could create each database initially with the database name suffixed with the date, rename it to the name suffixed with “temp” to allow the creation of the aggregate reports, and then rename it back to the original name suffixed with the date. After this I thought again about whether the dynamic SQL solution was cleaner, but I decided to stick with this one.

Summary

Regardless, the process works and has been creating these reports in under 20 minutes every day. Not too bad, since some of the tables have over 100 million records and the database is larger than 80 gigs…

Here is the SQL I created to initially create the database:

SELECT 'started at ' + CAST(GETDATE() AS NVARCHAR(30))

DECLARE @currentdate NVARCHAR(20)
DECLARE @newname sysname

SET @currentdate = CONVERT(VARCHAR(20),GETDATE(),112)

SET @newname = 'etl_validation_' + @currentdate

EXEC('CREATE DATABASE ' + @newname)

USE master

EXEC('ALTER DATABASE ' + @newname + ' MODIFY NAME = etl_validation_temp;')
GO

And to rename it back at the end:

DECLARE @currentdate NVARCHAR(20)
DECLARE @newname sysname

SET @currentdate = CONVERT(VARCHAR(20),GETDATE(),112)

SET @newname = 'etl_validation_' + @currentdate

USE master;

EXEC('ALTER DATABASE etl_validation_temp MODIFY NAME = ' + @newname + ';')
GO

SELECT 'ended at ' + CAST(GETDATE() AS NVARCHAR(30))
GO

And finally, the SQL to clean up databases older than 90 days:

USE master
go

DECLARE @currentdate DATETIME
DECLARE @limitdate DATETIME

DECLARE @currentdatetxt NVARCHAR(20)
DECLARE @limitdatetxt NVARCHAR(20)

DECLARE @maxcounter INTEGER
DECLARE @counter INTEGER

DECLARE @database_name NVARCHAR(80)
DECLARE @sql NVARCHAR(255)

SET @currentdate = GETDATE()
SET @limitdate = @currentdate - 90

SET @currentdatetxt = CONVERT(VARCHAR(20),@currentdate,112)
SET @limitdatetxt = CONVERT(VARCHAR(20),@limitdate,112)

DECLARE @olddbs TABLE
(
id INT IDENTITY(1, 1) PRIMARY KEY WITH FILLFACTOR = 100
,[dbname] nvarchar(255)
)
INSERT INTO @olddbs ( [dbname])
SELECT name FROM sys.databases WHERE name LIKE 'etl_validation%' AND create_date < @limitdate

SET @MaxCounter = @@ROWCOUNT

SET @counter = 1
WHILE @counter <= @MaxCounter
BEGIN

-- Execution

SELECT @database_name = [dbname]
FROM @olddbs
WHERE id = @counter

SET @sql = 'DROP DATABASE ' + @database_name + ';'

PRINT @sql

exec sp_executesql @SQL

SET @Counter = @counter + 1;

END;
go


How to create 10,000 Extract, Transform, and Load automated tests using 4 tables #agile #data

The thing I love about my chosen profession is the ability to learn new things and improve on lessons learned from past projects. Recently I was able to take on a problem that I have experienced on multiple past projects.

“How can we easily create automated tests for a Data Migration or Extract, Transform, and Load application?”

Recently I have been lucky to be on Agile projects where we were able to create a large number of automated tests. I was able to see the huge increase in quality that came with these automated tests. All of these projects were Web Applications developed in either C# or Java.

In my coding days, I was always either a Data Modeler or Database Programmer. I had been on three projects where I was the lead developer on the extract, transform, and load (ETL) application that was responsible for loading data into new enterprise databases. Sadly, we had absolutely no automated tests in any of these projects. As we developed our ETL application, we had to manually test the loads to ensure they were operating as intended. This became especially painful near the end of the project as a small insignificant change resulted in hours and hours of manual retesting. We soon discovered that we were more likely to make a mistake testing the changes than making the actual coding change. This situation was something that was not sustainable.

After working with Agile teams and seeing how they were able to easily create a large number of automated tests, I hoped I would get the chance to try to create them on the next Extract, Transform, and Load project I was on.

The Opportunity

My most recent project allowed me to again be part of a team that recreated the corporate Data Warehouse and also needed to create an entirely new ETL application to load it. The database technology selected was the Microsoft stack. We used SQL Server 2012 as our database engine and SSIS as our technology to create the ETL application. The Data Warehouse we were loading had over 200 main tables that the ETL process was loading.

One thing I wanted to ensure we did was to create a large number of automated tests for our ETL application. We investigated multiple existing frameworks, but none of them seemed to allow us to easily create the number of automated tests we wanted. All of them seemed to still require a large amount of test set-up, and the tests themselves did not adapt easily to changes in the database schema. This had always been a problem in the past as I tried to create automated tests for ETL applications.

Serendipity

In my time as a Data Modeler/DBA I became very good at writing queries against the Data Dictionary tables of the database itself to generate SQL statements that would then be executed. Then came the serendipitous moment – could we also read the Data Dictionary tables to easily generate automated tests for the ETL application?

The answer was yes, with a small number of customized tables that contained the column to column data mapping information. (Since this information was not stored in the Data Dictionary). The Data Dictionary tables we accessed in SQL Server were the tables that are part of the INFORMATION_SCHEMA.
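For example, the column metadata that drives the generated tests can be pulled with a simple query like the one below (the table name is illustrative):

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'agreement'
ORDER BY ORDINAL_POSITION;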

The Solution

Our solution contained the following elements:

  • tSQLt open source framework as the automated testing framework
  • Data Mapping Spreadsheet that defined the Column to Column mapping
  • 4 custom tables that contain information from the Data Mapping Spreadsheet
  • Stored Procedures that read the INFORMATION_SCHEMA and 4 custom tables to automatically generate the tests

4 Custom Tables

[Schema diagram: ETL Automated Tests – the 4 custom data mapping tables]

With these four custom tables loaded from the Data Mapping Spreadsheet we created the following Stored Procedures to generate tests:

Table to Table – Every Table

  • TstTableCount: Compares record counts between source data and target data. This is done on a table-to-table basis, starting from the target table and comparing to the associated source table or tables. (A sketch of one generated test follows this list.)
  • TstTableColumnDistinct: Compares counts of distinct values in columns. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
  • TstTableColumnNull: Generates a report of all columns whose contents are entirely null. This typically highlights situations where the column was not assigned in the ETL process.
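To give a flavour of the output, here is a sketch of the kind of test TstTableCount generates for a single source/target table pair. The schema and table names are illustrative, not our actual generated code:

EXEC tSQLt.NewTestClass 'ETLTests';
GO

CREATE PROCEDURE ETLTests.[test agreement row count matches source]
AS
BEGIN
    DECLARE @expected INT, @actual INT;

    -- count the rows in the source (staging) table and the target table
    SELECT @expected = COUNT(*) FROM staging.dbo.agreement;
    SELECT @actual = COUNT(*) FROM dw.dbo.agreement;

    EXEC tSQLt.AssertEquals @expected, @actual;
END;
GO

EXEC tSQLt.Run 'ETLTests.[test agreement row count matches source]';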

Column – Every Column

  • TstColumnDataMapping: Compares columns directly assigned from a source column on a field-by-field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumnConstantMapping: Compares columns assigned a constant on a field-by-field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumnNullMapping: Compares columns assigned a Null value on a field-by-field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumnTransformedMapping: Compares transformed columns on a field-by-field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.

The Results

By creating these 4 custom tables and stored procedures, we are now able to generate tens of thousands of tests nightly. More importantly, the tests are flexible to schema changes because they are generated by reading the INFORMATION_SCHEMA and the 4 custom tables. A large part of generating our tests is now data driven.


12 Rules to create a Dimensional Model from a Normalized Model in an #agile way

Recently I have been on a project where I have been fortunate enough to develop both a normalized Operational Data Store and a Dimensional Data Warehouse. In addition to this, I have been fortunate to be able to recreate the Dimensional Data Warehouse three times over. This has been because the project has been done in an Agile and iterative manner. I have been amazed how the Dimensional model changed and improved along the way. Although some aspects of the model have remained constant, I would estimate that a good 1/3 of the model changed and improved with each major revision. That by itself is a great testament for why Data Warehouse projects can and should be done in an Agile process.

In the process of having to recreate the Dimensional Model three times, I have followed a process that has proved beneficial and been proven out.

So here is my 12 step program with associated level of difficulty in my humble opinion.

1) Remove tables that you don’t need for analytical reporting or history – Difficulty:Easy

First step is to recognize that not all tables in an Operational Data Store make sense to store in a Data Warehouse. Some tables are truly operational and don’t make sense for analytical reporting or trend analysis. Some tables also don’t require that historical data be kept on them. Be brutal and get rid of these tables at the start. These are the tables that can clutter a Dimensional Model.

2) De-normalize reference tables on master tables – Difficulty:Easy

This next step is also pretty easy. De-normalize your reference tables onto the main tables. Remember to de-normalize shared reference tables to all the main tables that reference them. There is some discussion in the Data Warehouse world about whether you should just store the reference table description and not the reference table codes, but I like to store both as they allow for more efficient reporting and querying.

3) Prefix column names with the table names to assist in data lineage before further de-normalization – Difficulty:Easy

This was a lesson learned from the first time I created a Dimensional model. If you don’t prefix column names with the current table names, it becomes very difficult later to determine where a field initially came from. The second time, I created abbreviations for the tables and prefixed the columns with those abbreviations. The third time, I prefixed the column names with the full table names, and this third design is the best of the three. I now have a self-documenting model that communicates the data lineage automatically. I ended up using a ‘#’ to separate the table name from the column name for each column.

4) Collapse sub-types onto super-types – Difficulty:Easy

The last easy step in the process. This is the step where you collapse sub-types onto the super-type. This situation is where you may have a person entity for shared attributes and then have two sub-types like employee and manager for attributes specific for each type of person. It is relatively easy to de-normalize these sub-types attributes back onto the super-type – especially since we have prefixed the column names with the table names in step 3.

5) Duplicate and collapse shared main tables – like address/phone/email/bank accounts – Difficulty:Medium

This step is similar to step 2, except that some main tables are shared like reference tables. These main tables are usually real-world objects like addresses, email addresses, phone numbers, and bank accounts. Usually these result in one main table in a normalized model that is then shared and linked to multiple other main tables. Duplicate these shared main tables and de-normalize them onto the main tables that reference them. This is an easy step once you identify these shared main tables, but sometimes it does take a little experience to identify them.

6) Remove non-enterprise entities – Difficulty:Easy

This is a second step to weed out tables that may over-complicate the model. After you have collapsed the sub/super types and shared main tables, you may have seen areas of the Dimensional model that highlight non-enterprise entities. If these areas complicate the model, it is good to ask if your Data Warehouse model really requires them.  If you are unsure, remove them for now. You can always add them back if needed.

The example I had was where we had a type of person that was a Client Administration clerk. This person was different enough to require specific tables and attributes, but was not really required for enterprise reporting on claims, premiums, and bills. For this reason, we removed the objects from our Data Warehouse until we know we have a firm requirement for this type of person and these attributes.

7) Categorize your tables as Facts, Dimensions, Bridges, and Outriggers

I found it very helpful to then categorize the tables into four main categories:

  • Facts – Tables that define events or transactions. These tables usually have the attributes you will sum up in reports. Example: Claim, Bill
  • Dimensions – Tables that describe the Facts. These tables usually provide the filter and grouping criteria for your reports. Example: Client, Subscriber
  • Bridges – Tables that resolve many-to-many links between Dimensions and Facts. Example: Client_Subscriber
  • Outriggers – Tables that represent one-to-many relationships to Facts and Dimensions where you do not want to collapse them onto the main tables. Example: Claim_request

There is much more theory behind these categories that would be the topic of an entire book. If you are interested, the best book I have found on the subject is Chris Adamson’s Star Schema – highly recommended.

8) Don’t over de-normalize – Difficulty:Hard

I’ve listed this step as hard because it does take some experience to not over de-normalize. One of the goals of a dimensional model is to simplify the data model so there is the tendency to view the objective as having as few tables as humanly possible. While this is somewhat true, you need to remember to let the data define the model and not to force the data into your view of what the model should be.

Sometimes forcing tables together will massively increase the row count or not allow for flexibility if the cardinality of the relationships change in the future. This is something that needs to be balanced. The next step of creating Natural Keys will assist in this analysis.

9) Create Natural Keys – Difficulty:Hard

Create Natural Keys for all tables. This means identifying the keys that will define record uniqueness. This is a critical step to validate your model. The Natural Key for some tables will be one field; this is a great indication that the table is cohesive and well defined. If many fields together define uniqueness, you may want to examine whether you over de-normalized the table. It may turn out that you didn’t, but over de-normalization is a common cause of a complicated Natural Key. Sometimes the data, and therefore the Natural Keys, are just complicated.

This step will provide crucial feedback for your design.
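A simple way to test a candidate Natural Key is to look for duplicates across the proposed key columns. A sketch, with illustrative table and column names:

SELECT client_number, agreement_number, COUNT(*) AS row_count
FROM dbo.dim_agreement
GROUP BY client_number, agreement_number
HAVING COUNT(*) > 1;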

10) Make all columns mandatory

These last two steps are important to improve the usability of the Data Warehouse.

The first step is to make all the columns mandatory. This addresses the bane of all Data Warehouse querying: having to specify the dreaded ‘IS NOT NULL’/’IS NULL’ as part of every query against the Data Warehouse.

11) Create dummy records for every Dimension, Bridge, and Outrigger table

Since all columns are now mandatory, you will also need to create a dummy record in every Dimension, Bridge, and Outrigger table. These dummy records are used when an invalid value doesn’t allow the load process to link a record to a valid Dimension, Bridge, or Outrigger. This design addresses the second bane of Data Warehouse querying – the dreaded outer join.
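As an illustration (the table and columns are hypothetical), the dummy record is usually a reserved key such as -1 with obvious placeholder values:

-- hypothetical dimension table; -1 is the reserved 'unknown' member
INSERT INTO dbo.dim_client (client_key, client_number, client_name, client_status)
VALUES (-1, 'UNKNOWN', 'Unknown Client', 'Unknown');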

12) Iterate Immediately

As I mentioned, I was able to vastly improve my Dimensional Model on subsequent attempts. I would recommend that you create your model and plan on recreating it immediately after you are finished. You will learn enough during the process, that you will be able to create a better model immediately.

Summary

These rules have helped me to create a Dimensional Model that I have been able to improve and iterate on as the project has evolved. The 13th rule I didn’t mention was to trust your instinct. Trust your intuition if it feels wrong to combine two tables. You are more than likely sensing some issue that will cause problems in the future.

Adaptive Data Model – #Agile or Anathema?

I have seen the concept of an Adaptive Data Model proposed as an Agile method to Data Modelling lately. (Most recently in Ken Collier’s excellent book – “Agile Analytics”) The theory is that you can be more Agile using an  Adaptive Data Model instead of a traditional Data Model of the business domain.

Definition

An Adaptive Data Model is a Data Model that doesn’t model the business data. Rather, it is a data model of the data model that models the business data. 🙂 The Adaptive Data Model describes tables that contain the meta-data that describes the data model. In this way, the entire Data Model is data driven and stored in a series of tables. The advantage of this approach is that changes to the model can be made by updating the meta-data in the tables. (As opposed to having to generate ALTER statements to update the database structures in the database.)
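To make the idea concrete, the metadata tables in such a model look roughly like the sketch below. These table names are illustrative only and are not Collier’s design:

-- the model of the model: entities, their attributes, and the values
CREATE TABLE entity_type (
    entity_type_id INT PRIMARY KEY,
    entity_type_name VARCHAR(100));

CREATE TABLE attribute_type (
    attribute_type_id INT PRIMARY KEY,
    entity_type_id INT REFERENCES entity_type,
    attribute_name VARCHAR(100),
    data_type VARCHAR(30));

CREATE TABLE entity_instance (
    entity_instance_id INT PRIMARY KEY,
    entity_type_id INT REFERENCES entity_type);

CREATE TABLE attribute_value (
    entity_instance_id INT REFERENCES entity_instance,
    attribute_type_id INT REFERENCES attribute_type,
    attribute_value VARCHAR(4000));

-- 'adding a column' to the business model is now an INSERT into attribute_type,
-- not an ALTER TABLE against the database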

A sample Adaptive Data Model is shown below. (Copyright Ken Collier – Agile Analytics)

[Diagram: sample Adaptive Data Model]

A subsequent layer would then need to be created to allow the data to be extracted in a traditional sense by the application. (And to be made sense of by the business.) Some suggestions recommend that this layer be created with a series of views or stored procedures.

Agile?

Although the Adaptive Data Model does allow for the easy modification of the Data Model, is it Agile?

I propose that an Adaptive Data Model is neither Agile nor a Data Model.

According to Wikipedia a Data Model is:

“A data model is an abstract model that documents and organizes the business data for communication between team members and is used as a plan for developing applications, specifically how data are stored and accessed.”

An Adaptive Data Model does not meet this definition of a Data Model. It is a construct created to allow changes to be made to a data model.

In addition, I would propose that an Adaptive Data Model is also not Agile. It does not encourage frequent delivery and iterative development. An Adaptive Data Model is a complex solution that is not easily deployed in iterations to deliver value quickly and often to the business. If anything, it increases the technical debt of the project.

Summary

We as data professionals should be striving to make our processes more Agile and to allow them to be refined iteratively like other areas of Software Development. Software Development adopted practices that allow for iterative development; those practices did not include creating an Object Model of the Object Model so that teams could avoid fully embracing adaptive development practices.

An Adaptive Data Model feels like a shortcut. It is hard to be Agile and iterative on a data project, but proposing an Adaptive Data Model as a solution seems like a wrong turn. We need to find ways to allow our data designs to be change tolerant, adaptive, and test driven.

Why #Dimensional Modeling matters

I’ve recently completed a data modeling initiative on a major project. After doing this I’ve come to two major conclusions:

  1. The coverage area in Insurance is probably the most devious and twisted area of data that I have ever modeled.
  2. Dimensional Modeling should be done on every model to ensure you can simply model the domain.

Why Dimensional Model?

For the model I worked on, there are 261 tables in the Relational Model. In the Dimensional Model, there are only 25. The process to distill a Relational Model into a Dimensional one is not easy, but the process highlighted problems in my Relational Model that I was unaware of. These issues would have been exposed later during testing or post-production, but being able to identify them in development was extremely valuable.

So what is Dimensional Modeling? Although most people have been exposed to the rules of Data Normalization, not as many people have been exposed to Dimensional Modeling. I will try to explain my view of Dimensional Modeling, but if you are interested I would highly recommend ‘Star Schema: The Complete Reference’ by Chris Adamson. Simply put, it is the most complete and concise book ever written on Dimensional Modeling. Dimensional Modeling is usually done to create a model for a Data Warehouse. This Data Warehouse can then be used for Operational and Analytic reporting. But the process of Dimensional Modeling does not need to be limited to a Data Warehouse. The process itself has value to validate the Relational Model.

The term that is commonly used to refer to a Dimensional Model is a ‘Star Schema’ model. This is due to the fact that most Dimensional Models look like a star. (With the Facts in the centre and the Dimensions around the Fact)

Dimensional Modeling

Dimensional Modeling is the process of taking a Relational Model in some normal form and being able to distill the many tables down to the business objects that the business works with. In most cases this would result in single tables that correspond to objects like:

  • Client
  • Product
  • Sales Rep
  • Store
  • Sales Transaction

These business objects fall into two categories:

Facts – These are Facts about the business. They typically correspond to events or transactions and have metrics or measures that the business is concerned about. In many cases these fact tables are what the business wants to report on.

Dimensions – These are Dimensions to the Facts. They typically correspond to business objects that interact with Facts. These Dimensions represent how the business describes these objects and how they like to slice, dice, filter, sort, group, and order the Facts. A standard Dimension that occurs in all Dimensional Models that does not correspond to a business object is Time. Time is a dimension for almost all business events and transactions.

Initially, the act of creating these Facts and Dimensions is a large task of de-normalization. Although this sounds like a simple task, it is anything but. A Relational Model with hundreds of tables must be distilled down to a handful of tables. Many-to-many relationships must be transformed into simple relationships that are easy to understand and report against. A natural key must also be defined that uniquely identifies a row in each Fact and Dimension. Although this sounds easy, you may find that some Dimensions don’t actually have a natural key. This is a clue that the Dimensional Model needs more attention.

A Dimensional Model also places the modeling of history at the forefront. A lot of attention is spent ensuring that data can accurately represent the changes that occur over time. Most of the time, a Relational Model is primarily concerned with current state.

Note: This is a gross over-simplification, if this is interesting to you I would recommend the book by Chris Adamson on all of the theory and complexity behind Dimensional Modeling. Or you can just start Googling!

Conclusion

The act of creating a Dimensional Model and creating the 25 Facts and Dimensions highlighted inconsistencies and duplication in my model. It also challenged my understanding of the data domain. It raised questions that I realized I didn’t have the answer for. It is easy for inconsistency and errors to hide in a data model with 200+ tables. A data model that has 10-30 tables has nowhere to hide. It has the brutal transparency so valuable in Agile.

#SDEC12 Conference Review #Agile

Well, another Software Development and Evolution conference has come and gone. (You’ve always wondered what SDEC stood for, didn’t you?) It was a lot of work and effort to make it all happen, but in the end it was very enjoyable. I learned an immense amount and can’t wait until next year.

My Highlights

      • The Joe Justice/Wikispeed Keynote on day 1 was entertaining and inspiring. If you aren’t familiar with the Joe Justice and Wikispeed story, I highly recommend you doing a search on YouTube or Google. Inspirational stuff on what can be accomplished when you ask why not? WikiSpeed
      • The Luke Hohmann/Innovation Games keynote on day 2 was energizing. I have been a fan of Innovation Games for a long time and it was energizing to hear Luke speak and provide the context on how and why Innovation Games are successful. InnovationGames
      • Adam Yuret @AdamYuret brought Lean Coffee to SDEC12. It was a highlight of mine to attend his session on Lean Coffee and learn how we can have our own Lean Coffee discussions. Although I must admit, I would prefer an afternoon coffee instead of an early morning one.
      • Chris Dagenais @MDChris had a couple of engaging and informative sessions on team building and peer feedback. Great sessions and audience was very engaged and interactive.
      • Lightning Talks made their first appearance at SDEC and were very well attended. There were great talks and tons of practical information compressed in 5 minute chunks.
      • Best presentation I attended was presented by Mark Kulchycki and Alyson Teterenko of Manitoba Hydro International. It was a real life tale from the trenches on how their team evolved and incorporated Agile principles into their PSCAD product development team. Awesome presentation, pragmatic approaches that everyone can use.
      • My personal highlight of the conference was the Innovation Games workshop with Luke Hohmann after the conference. It was an excellent session where Luke not only covered the Innovation Games themselves, but also the science and psychology of the games and the art of facilitation. Probably learned more in one day than I have for a long time.
      • I loved presenting my Agile Data Warehouse talk. I’m hopeful that I can have a follow-up presentation at SDEC13 that illustrates more how we used Innovation Games and show the actual models that were created.
      • It was great being able to just talk and share with everyone at the conference on what worked for them and what people are still struggling with.

Summary

It was a great conference with over 200 attendees. This was a new record for SDEC and caused us to be flexible to modify the lunch process for Day 2 to be more efficient. 🙂

I can’t wait for next year. We are gathering the feedback forms and listening to our Advisory council to assure the content and structure is even better next year! Thanks for your support and see you at SDEC13!

#Data #Hoarding and #Norm-aholics

Are you a sufferer of Data Hoarding? This is an affliction that affects thousands of businesses every day. Common symptoms are terabytes of data being kept around just in case you ever need them. The data ends up being stored in a multitude of disconnected tables just piled on top of one another with no organization. When a new piece of data is thought to be needed, another table is created and thrown on the pile.

The data is kept around if it is ever needed, but good luck ever finding it in the future when it is buried under piles of other data. Now where did I put that snapshot of the ledger balances from March 2011?

Even worse, most new applications create an isolated database to hold the data that the application requires. It doesn’t matter that at least 30-40% of the data already exists somewhere; it is too hard to integrate and create one common data source. Much easier to copy the data once and then modify it as the application desires.

It doesn’t have to be this way.

Data Hoarders Anonymous

At Data Hoarders Anonymous we work with you and take you through a seven step program.

Phase I – Repent

1) Admit you have a problem

The first step to any recovery is admitting you have a problem. Look around at your databases, do you have a data dictionary? Do you have 20+ tables with the same name? Have you copied tables directly into a database without doing any analysis or integration?

2) Understand the Enterprise’s perspective

Data is an asset for the enterprise, but only if it is easy to understand. That is the key word: the enterprise must be able to get and understand the data easily. Typically, when data has been replicated across god’s green earth, there are multiple copies of the truth and they are anything but easy to find. The data is worthless if it isn’t easily understood.

3) Apologize to Developers you have made extract data from multiple complex disconnected schemas

Take donuts with you, and apologize to those developers that you have made create programs that need to extract data from multiple sources and somehow make sense of it all. SQL should not be a language where you need a PhD to understand it. If your SQL is longer than a screen, you probably have a Data Hoarding problem.

4) Apologize to Business users you have made explain inconsistent data to Vice-presidents

Take alcohol with you and apologize to the business users who don’t know what is included in the allocated income calculation and what table actually has the true up to date value. Apologize for not having a Data Dictionary available to them that would tell them where the data exists and what it actually means.

Phase II – Recovery

Now that you have repented, you can move onto the recovery portion…

5) Organize your data with Data Normalization

Data Normalization is your new friend. Learn the rules and let them assist you in removing data redundancy and enforcing consistency and integrity. But be very careful. Many on the recovery path of Data Hoarding end up becoming ‘Norm-aholics’. They normalize to at least third normal form all the time, even though it creates excessively complex schemas for the developers and data that can’t be accessed by business users. Even worse, ‘Norm-aholics’ have no self-awareness that they have an addiction. Never let the rules of Data Normalization affect the progress you have made. Sadly, many DBAs fall into this trap and let the theory of Data Normalization affect the understanding of the data.

Data Normalization must increase the understanding of the data, if not Data Normalization has been corrupted.

6) Simplify your data with Dimensional Analysis

Now that you have used Data Normalization to remove data redundancy and maximize consistency and integrity, the time has come to use the tools of Dimensional Analysis to ensure the data design is simple. Use Dimensional Analysis to ensure that you haven’t traded one type of complexity for another. Evaluate your data across time and Master Data Dimensions to ensure that the transaction concepts are consistent and simple.

7) There can be only one

At the last step, you will need to embrace the principle that there can be only one…

  • One owner and source of the data – The data must be owned by one business area and system
  • One place to update the data – the data must be updated in only one system that the owner controls
  • One way to get the data – there can not be multiple query paths to retrieve the required data
  • One truth – the data must have only one meaning and this meaning must be documented in a Data Dictionary

Coming to you soon on A&E – Data Hoarders Anonymous

Why #Dimensional Analysis should be done on every #datamodel

Those of you who have worked with me, know of my fondness for Operational Data stores. I have always believed in the importance of having an enterprise or holistic view of the data requirements for every application. An Operational Data Store seemed to be the perfect vehicle to ensure this happened. Perhaps my fondness was related to not wanting to stray too far from the normalization rules that I knew quite well. In this way, it was a new-ish discipline or context that really wasn’t new.

I always looked kinda sideways at those weird Dimensional modelers with their Star Schemas and Snowflakes. I mean, if they really put their minds to it, they would be able to figure out how to solve their data needs with a nice relational, normalized Operational Data Store. Only exceptional and massive amounts of data require the Dimensional modeling constructs that these models typically use, right? I mean, what is so complicated about a model with only 100 main tables? Shouldn’t everyone know how to write SQL by hand?

On my latest project, I have had the opportunity to become re-introduced to Dimensional Analysis and modeling and I have found the process fascinating and very valuable. Besides the obvious benefits that are being realized by being able to model the data in a way that allows the clients to efficiently write and execute queries, there was an unexpected benefit.

Taking a normalized Data Model and attempting to translate it into a Dimensional Model really challenges and validates your data model. It is far easier to create a model with a multitude of complex relationships than it is to distill it down to a handful of Facts and Dimensions. With so many relationships, it is possible for inconsistencies to exist and hide in the data model. I found multiple modeling errors in the process of trying to create a Dimensional model from my relational model. When you distill a relational model down to a Dimensional model, inconsistencies and errors become very apparent in the creation of the Facts and Dimensions.

Dimensional Analysis also forces you to look at the data in a different way. Instead of a relational/hierarchical view, I find it forces me to look at the data in a chronological way and to consider data changes, data history, and data latency in ways I may not have considered before. A model that does not have to account for data across time and verify consistency at every point is quite a bit simpler.

Summary

I am a convert to using Dimensional Analysis on all my data models for validation of the data model and additional analysis of the data. I’ve discovered that I need to understand the data better to create a Dimensional Model than a normalized model. More factors need to be considered, and creating the Dimensional model with fewer objects requires that the data model has greater consistency, integrity, and cohesion.

Simple is hard. 


Why do we #DataModel at all?

People in the Database world take Normalization and Data Modeling as something that should be done without question. I compare it to best practices like versioning software: no one expects that anyone would create software without version control anymore. But more often recently I do get questioned and challenged on why we need to normalize and model data. Is it even required with the cheap disk space, memory, and server capacity available?

According to Wikipedia and others, the objective of normalization is:

“Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.”

The rules of Normalization

1NF – Definition: A relation is in 1NF if and only if all underlying domains contain scalar values only. Violation: First Normal Form is violated when the table contains repeating groups.

2NF – Definition: A relation is in 2NF if and only if it is in 1NF and every non-key attribute is irreducibly dependent on the primary key – every column must depend solely on the primary key. Violation: Second Normal Form is violated when a non-key field is a fact about a subset of a key.

3NF – Definition: A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. Violation: Third Normal Form is violated when a non-key field is a fact about another non-key field.

4NF – Definition: Relation R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the (nontrivial) MVD A->>B is satisfied, then all attributes of R are also functionally dependent on A. Violation: Fourth Normal Form is violated when a table contains two or more independent multi-valued facts about an entity. (In addition, the record must satisfy third normal form.)

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is “functionally dependent” on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can’t be two records having the same X value.
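As a small illustration with hypothetical tables, here is a Third Normal Form violation and its fix. If customer_city depends on customer_id rather than on the key order_id, it is a fact about another non-key field and belongs in its own table:

-- violates 3NF: customer_city depends on customer_id, not on the key order_id
CREATE TABLE sales_order_denormalized (
    order_id INT PRIMARY KEY,
    customer_id INT,
    customer_city VARCHAR(50),
    order_total DECIMAL(12,2));

-- normalized: the transitive dependency is removed
CREATE TABLE customer (
    customer_id INT PRIMARY KEY,
    customer_city VARCHAR(50));

CREATE TABLE sales_order (
    order_id INT PRIMARY KEY,
    customer_id INT REFERENCES customer,
    order_total DECIMAL(12,2));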

The Questions

Now that we have reviewed the objectives and rules of normalization, let us summarize. The objective of Normalization is to:

  1. Minimize Redundancy
  2. Minimize Dependency

But what if we have extra storage available and storing redundant copies of data is not a problem? In fact, it probably will speed up our query response time. What if we also don’t require frequent modification of data so that having de-normalized data won’t result in update or deletion anomalies caused by excessive data dependency? Why should we still model our data and normalize?

The three reasons to Data Model

Simplicity, Consistency and Integrity, and Future Flexibility.

Simplicity

Every one of the violations mentioned above would require extra code and extra unit tests to validate proper functioning. Depending on the number of violations, this can become a significant amount of technical debt needlessly built into the software. There is an entire movement dedicated to the elimination of If statements. (www.antiifcampaign.com) Software that is Data Driven rather than Condition Driven is simpler and easier to maintain over the life of the application.

An application that is Data Driven can also automate the creation of their test cases to validate proper functioning of the application. This combined with the enhanced simplicity greatly adds to the quality of the application.

Consistency and Integrity

Even if the solution being modeled can accommodate redundant data and currently has minimal potential for update and deletion anomalies, significant risk is being assumed by leaving these potential situations in your data model. How can you ensure that redundant data will be kept in sync and that update and deletion anomalies do not get introduced in the future as people and requirements change? Either through additional software development code or through additional processes and latent knowledge in resident experts. Neither of these situations is a good use of time and energy.

This is an example of an application-centric view of the data model. Unfortunately, not all activity on the Data Model can be guaranteed to always go through the application. Data Fixes, Conversions, and enhancements all have the ability to bypass the application’s business logic and compromise the integrity of the data. All it takes is one high value client with inaccurate or inconsistent data to irreparably harm a company’s reputation and revenue stream.

Future Flexibility

Solutions that are data driven and do not have excessive functional dependencies are much easier to evolve in the future. For example, I may have a business requirement to split one account type or combine account types. This type of conversion will be quite routine if I have modeled my data properly and minimized dependencies. If not, the conversion can be quite convoluted and I will probably need to evaluate code before I can determine the implications of making such a change. Then I have to be sure I address and update all the redundant code throughout the application. Just because the situation doesn’t exist currently with update and deletion anomalies doesn’t mean those situations won’t happen in the future.  

In addition, these changes to split or combine account types would probably also require code changes. If the solution was Data Driven, the possibility of these code changes would be minimized. (not to say they would never be required, but the probability of code changes would be minimized)

Summary

A well designed application and user interface will be able to be used with minimal training. It just makes sense and models the client’s current processes and tasks.

A well designed data model should have the same intuitive qualities. It also makes sense and models the business’s data – not how the application functions, but how the business exists. Modeling the data in this manner minimizes the work required now to use the data and the work required in the future to accommodate change.

In Object Oriented parlance, the Data Model itself should be loosely coupled with high cohesion. Both the Object Model and Data Model should share this characteristic. (Although they will implement it in quite distinct ways)