#BreakingBad at #SQLSATWPG

I’m very excited to be presenting my Breaking Bad presentation on OStress and SQL Profiler to SQL Saturday Winnipeg on November 22nd. There are 20 sessions that will be presented in 4 separate tracks with speakers from all over North America. (And a recent addition from South America)

The first person to refer to me as ‘Mr Heisenberg’ when asking a question will earn a special prize!

Hope to see you there!

Terry

Advertisement

#SAP, Breaking Data, and Re-enabling #SQLServer Database Referential Integrity Constraints #Microsoft #FTW

Many times as Data professionals we no longer have full control over the quality of data in the source systems. I am discussing SAP in my example, but I could have easily mentioned PeopleSoft, SalesForce, or a number of other purchased solutions. Usually those solutions are purchased and then we are tasked with maintaining those environments and also extracting data from those environments to be incorporated into a Business Intelligence corporate solution.

Our issue is one somewhat of our own choosing as well. We want to enforce integrity and constraints at a greater level than what was intended and specified in the purchased applications. This may be for a variety of reasons including that the business never specified it as a requirement. It may also be that the purchased application was never built to handle that level of integrity.

To be clear, this isn’t a complaint but more a reflection of reality. We as Data professionals are going to receive data that is not as consistent and complete as we as Data professionals want it to be. (I purposely did not state ‘require’ as there could be a discussion of what is truly required) So what are we to do?

The Problem

Typically we end up extracting data from these purchased applications and load them into a consolidated database. This database can be either a relational or dimensional database. We also typically need to cleanse the data we are loading so load the business can report on the data in a clear and consistent manner.

The challenge is what we do with data that we cannot load in a consistent manner. We really have two options; modify the data or reject it outright. Although there are many types of inconsistent data we may need to correct, I will limit my discussion to data that links tables together. Typically we define Referential Integrity or Foreign Keys constraints to ensure that the data to link tables are valid so that reports and queries return correct results.

Possible Solution

When we have more control over the quality of source systems, I usually see the solution embedded in the Extract, Transform, and Load (ETL) solution that extracts and loads the data into a corporate database.  This is because the data issues will be more known, of lesser frequency, and the data issues are things we can correct ourselves. In this type of solution, the Foreign Key constraints are always enabled and the  ETL solution validates all the data values before trying to insert the data in the database. Any errors that are encountered will result in the data being changed or rejected and an error written to a log file.

There are two majors issue with this approach:

1) Performance – The look-up to validate all Foreign Keys row by row can cause the process to run slower. It can eliminate a performant two step approach where some of the fields can be set in a subsequent SQL Update statement. (Depending on the column’s Nullability) It can also prevent the use of some bulk load methods in SQL Server Integration Services.

2) Availability – If major data issues are encountered, the data issues may prevent the data load from continuing and may affect the availability of the database.

Our Solution

Since we are loading data from multiple external providers, we designed a different solution.

Although we have Foreign Key constraints on the entire database, they will be disabled during the load. (and during the week) We will enable them every Sunday to validate the data loaded has not broken integrity rules. If we find we cannot re-enable any constraint, we will email the Data Team informing them of the offending constraint for investigation. If all Foreign Key constraints can be re-enabled, we will inform the Data Team of the success and disable them again.

We could also do this re-enabling nightly if we start to encountered more frequent data errors.

In this manner, we are in a better position to react to data outside of our control and load the data as quickly as possible.

Our SQL Server Solution

A couple of things to note about our SQL Server solution. Frequently I see the solution to re-enable all constraints use the sp_msforeachtable stored procedure. A sample of how to do this is listed below:

EXEC sp_msforeachtable “ALTER TABLE ? NOCHECK CONSTRAINT all”

This solution is virtually useless you can guarantee all your constraints can be re-enabled without failure. If one constraint fails, it will stop the process. Not good.

To accommodate the ability to re-enable all constraints even when errors are encountered we created our own processes to disable and re-enable our constraints using a cursor.

Here is the disable constraints SQL

DECLARE @disable_sql NVARCHAR(255)

SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
QUOTENAME(o.name) AS CONSTRAINT_NAME,
QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
CASE fk.is_disabled
WHEN 0 THEN ‘CHECK’
ELSE ‘NOCHECK’
END AS [ENABLED]
INTO temp_disable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
AND rccu.CONSTRAINT_NAME = o.name
AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
AND ric.index_id = fk.key_index_id
AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
AND rc.column_id = ric.column_id

DECLARE disable_cursor CURSOR for
SELECT ‘ALTER TABLE ‘ + FOREIGN_TABLE_SCHEMA + ‘.’ + FOREIGN_TABLE_NAME
+ ‘ ‘ + ‘ NOCHECK CONSTRAINT ‘ + CONSTRAINT_NAME
FROM temp_disable_constraints

OPEN disable_cursor
FETCH NEXT FROM disable_cursor INTO @disable_sql

WHILE @@FETCH_STATUS = 0
BEGIN

PRINT @disable_sql

EXEC sp_executesql @disable_sql
FETCH NEXT FROM disable_cursor INTO @disable_sql

END

CLOSE disable_cursor
DEALLOCATE disable_cursor
DROP TABLE temp_disable_constraints

And our re-enable constraint SQL:

DECLARE @enable_sql NVARCHAR(255)

SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
QUOTENAME(o.name) AS CONSTRAINT_NAME,
QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
CASE fk.is_disabled
WHEN 0 THEN ‘CHECK’
ELSE ‘NOCHECK’
END AS [ENABLED]
INTO temp_enable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
AND rccu.CONSTRAINT_NAME = o.name
AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
AND ric.index_id = fk.key_index_id
AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
AND rc.column_id = ric.column_id

DECLARE enable_cursor CURSOR for
SELECT ‘ALTER TABLE ‘ + FOREIGN_TABLE_SCHEMA + ‘.’ + FOREIGN_TABLE_NAME
+ ‘ ‘ + ‘ WITH CHECK CHECK CONSTRAINT ‘ + CONSTRAINT_NAME
FROM temp_enable_constraints

OPEN enable_cursor
FETCH NEXT FROM enable_cursor INTO @enable_sql

WHILE @@FETCH_STATUS = 0
BEGIN

BEGIN TRY
EXEC sp_executesql @enable_sql
END TRY

BEGIN CATCH
PRINT ‘ERROR–>’ + @enable_sql
FETCH NEXT FROM enable_cursor INTO @enable_sql
CONTINUE
END CATCH

FETCH NEXT FROM enable_cursor INTO @enable_sql

END

CLOSE enable_cursor
DEALLOCATE enable_cursor
DROP TABLE temp_enable_constraints

Conclusion

This solution has provided us the flexibility to load our data as efficiently as possible and validate our Foreign Key relationships on a recurring basis. It also minimizes the chance that our load process will stop mid-stream. Did I mentioned this is a key requirements as we are loading data into the Data Warehouse every 60 minutes? 🙂

I was initially concerned with how long it would take to re-enable the constraints, but it only takes 75 minutes to re-enable 616 Foreign Key constraints on a 1.1 Terabyte database. Thanks Microsoft!

Now that we have this process we also plan to use it on large software deployments just to ensure to major data issues were introduced with the deployment as well.

Creating my own #ETL data validation #FTW

Recently on the same project I created an Agile Data Warehouse and Extract, Transform, and Load automated test suite, I was tasked to create a data validation process. We need to create a foundational process that could be leveraged to provide ongoing data validation for the data load process. We were responsible for loading data into the Operational Data Store, but we were the last step in a lengthy process. Previously, the data made hops from Legacy to web applications, to SAP, and then finally to the Operational Data store. Further complicating the issue was that at each step, there was a different data technology used:

  • Legacy – ISAM
  • Web Applications – Sybase
  • SAP – Oracle
  • Operational Data Store – SQL Server 2012

Aggregation is your friend!

What we designed was a process where we created a set of the same aggregation reports at each of the four points along the data migration. Typically people create aggregation reports for the obvious financial transactions with numeric totals. These are reports that will sum the transactions across dates, clients, or accounts. We decided we needed to create aggregate reports across all tables and objects in the databases whether they were transactional or not.

Once we analyzed our Operational Data Store with 380+ tables, we determined that we required 77 aggregate reports to provide coverage across the data model.

For example, the aggregate reports that reported on agreements/contracts were created with the following SQL:

— contracts/agreements by client

SELECT client_number, COUNT(*) AS total
INTO etl_validation_temp.dbo.validation_agreement_count_client
FROM ods.dbo.agreement, ods.dbo.client
WHERE ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;

— sum contract/agreement numbers by client

SELECT client_number, sum(coalesce(agreement_number,0)) AS total_agreement_number
INTO etl_validation_temp.dbo.validation_agreement_number_client
FROM ods.dbo.agreement, ods.dbo.client
WHERE ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;

— contract/agreement numbers by renewal month

SELECT renewal_month, COUNT(*) AS total
INTO etl_validation_temp.dbo.validation_agreement_count_renewal_month
FROM ods.dbo.agreement
GROUP BY renewal_month;

— sum contract/agreement numbers by renewal month

SELECT renewal_month, sum(coalesce(agreement_number,0)) AS total_agreement_number
INTO etl_validation_temp.dbo.validation_agreement_number_renewal_month
FROM ods.dbo.agreement
GROUP BY renewal_month;

The Process

The most interesting thing about the process was how I was able to create a copy of each report in a separate database. The solution is designed so that the process can run every day and create 77 reports in a ‘etl_validation’ database suffixed with the date. The solution is also configurable so that databases older than 90 days are automatically deleted.

The challenge I had was that the ‘INTO’ statement couldn’t have a variable that named a unique database suffixed with the date. So you will see in the SQL that the reports are created in a “etl_validation_temp” database that is later renamed to a database name suffixed with the date. I thought about creating dynamic SQL where I could have specified the database name in the INTO clause, but I thought that solution would not have been as maintainable.

The Problem

So all was good right? Not so fast. It worked for the first database, but then I ran into a problem when I created the second database. Although I could rename the database, I couldn’t programmatically rename the files the database uses. So then second database couldn’t be created as it tried to use the same file name.

Fair enough…

So I determined I can create the databases initially with the database name suffixed with the date, rename the database to be suffixed with “temp” to allow the creation of the aggregate reports, and then rename it back to the original database name suffixed with the date. After this I thought about whether the dynamic SQL solution was cleaner, but I decided to stick with this one.

Summary

Regardless, the process works and has been running to create these reports in under 20 minutes every day. Not too bad since some of the tables have over a 100 million records and the database is larger than 80 gigs…

Here is the SQL is created to initially create the database:

SELECT ‘started at ‘ + CAST(GETDATE() AS NVARCHAR(30))

DECLARE @currentdate NVARCHAR(20)
DECLARE @newname sysname

SET @currentdate = CONVERT(VARCHAR(20),GETDATE(),112)

set @newname = ‘etl_validation_’ + @currentdate

exec(‘create database ‘ + @newname)

USE master

EXEC(‘ALTER DATABASE ‘ + @newname +’ Modify Name = etl_validation_temp;’)
GO

And to rename it at the end..

DECLARE @currentdate NVARCHAR(20)
DECLARE @newname sysname

SET @currentdate = CONVERT(VARCHAR(20),GETDATE(),112)

set @newname = ‘etl_validation_’ + @currentdate

USE master;

EXEC(‘ALTER DATABASE etl_validation_temp Modify Name = ‘ + @newname + ‘;’)
GO

SELECT ‘ended at ‘ + CAST(GETDATE() AS NVARCHAR(30))
GO

And finally, the SQL to clean up database older than 90 days:

USE master
go

DECLARE @currentdate DATETIME
DECLARE @limitdate DATETIME

DECLARE @currentdatetxt NVARCHAR(20)
DECLARE @limitdatetxt NVARCHAR(20)

DECLARE @maxcounter INTEGER
DECLARE @counter INTEGER

DECLARE @database_name NVARCHAR(80)
DECLARE @sql NVARCHAR(255)

SET @currentdate = GETDATE()
SET @limitdate = @currentdate – 90

SET @currentdatetxt = CONVERT(VARCHAR(20),@currentdate,112)
SET @limitdatetxt = CONVERT(VARCHAR(20),@limitdate,112)

DECLARE @olddbs TABLE
(
id INT IDENTITY(1, 1) PRIMARY KEY WITH FILLFACTOR = 100
,[dbname] nvarchar(255)
)
INSERT INTO @olddbs ( [dbname])
SELECT name FROM sys.databases WHERE name LIKE ‘etl_validation%’ AND create_date < @limitdate

SET @MaxCounter = @@ROWCOUNT

SET @counter = 1
WHILE @counter <= @MaxCounter
BEGIN

——Execution

SELECT @database_name = [dbname]
FROM @olddbs
WHERE id = @counter

SET @sql = ‘drop database ‘ + @database_name + ‘;’

PRINT @sql

exec sp_executesql @SQL

SET @Counter = @counter + 1;

END;
go

 

 

How to create 10,000 Extract, Transform, and Load automated tests using 4 tables #agile #data

The thing I love about my chosen profession is the ability to learn new things and improve on lessons learned from past projects. Recently I was able to take on a problem that I have experienced on multiple past projects.

“How can we easily create automated tests for a Data Migration or Extract, Transform, and Load application?”

Recently I have been lucky to be on Agile projects where we were able to create a large number of automated tests. I was able to see the huge increase in quality that came with these automated tests. All of these projects were Web Applications developed in either C# or Java.

In my coding days, I was always either a Data Modeler or Database Programmer. I had been on three projects where I was the lead developer on the extract, transform, and load (ETL) application that was responsible for loading data into new enterprise databases. Sadly, we had absolutely no automated tests in any of these projects. As we developed our ETL application, we had to manually test the loads to ensure they were operating as intended. This became especially painful near the end of the project as a small insignificant change resulted in hours and hours of manual retesting. We soon discovered that we were more likely to make a mistake testing the changes than making the actual coding change. This situation was something that was not sustainable.

After working with Agile teams and seeing how they were able to easily create a large number of automated tests, I hoped I would get the chance to try to create them on the next Extract, Transform, and Load project I was on.

The Opportunity

My most recent project allowed me to again be part of a team that recreated the corporate Data Warehouse and also needed to create an entirely new ETL application to load the Data Warehouse. The database technology that was selected was the Microsoft stack. We used SQL Server 2012 as our database engine and SSIS as our technology to create the ETL application. The Data Warehouse we were loading had over 200+ main tables that the ETL process was loading.

One thing I wanted to ensure we did was to create a large number of automated tests for our ETL application. We investigated multiple frameworks that existed, but none of them seemed to allow us to easily create the number of automated tests we wanted. All of them seemed to still require a large amount of test set up and the tests themselves did not adapt easily to changes in the database schema. This had always been a problem in the past as I tried to created automated tests for ETL applications.

Serendipity

In my time as a Data Modeler/DBA I became very good at writing queries to read the Data Dictionary tables of the database itself to generate SQL statements to then be executed. Then was the serendipitous moment – could we also read the Data Dictionary tables to easily generate automated tests for the ETL application?

The answer was yes, with a small number of customized tables that contained the column to column data mapping information. (Since this information was not stored in the Data Dictionary). The Data Dictionary tables we accessed in SQL Server were the tables that are part of the INFORMATION_SCHEMA.

The Solution

Our solution contained the following elements:

  • tSQLt  source framework for the automated testing framework
  • Data Mapping Spreadsheet that defined the Column to Column mapping
  • 4 custom tables that contain information from the Data Mapping Spreadsheet
  • Stored Procedures that read the INFORMATION_SCHEMA and 4 custom tables to automatically generate the tests

4 Custom Tables

schema

ETL Automated Tests

With these four custom tables loaded from the Data Mapping Spreadsheet we created the following Stored Procedures to generate tests:

Table to Table – Every Table

  • TstTableCount: Compares record counts between source data and target data. This will be done on table to table basis. This will be done starting from the target table and comparing to the associated source table or tables.
  • TstTableColumnDistinct: Compares counts on distinct values of columns. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
  • TstTableColumnNull: Generates a report of all columns where all the contents of a field is all null. This typically can highlight situations where the column was not assigned in the ETL process.

Column – Every Column

  •  TstColumn­DataMapping: Compares columns directly assigned from a source column on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumn­ConstantMapping: Compares columns assigned a constant on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumn­NullMapping: Compares columns assigned a Null value on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumnTransformedMapping: Compares transformed columns on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.

The Results

By creating these 4 custom tables and stored procedures we are now able to generated 10’s of thousands tests nightly. More importantly, we are able to have these tests be flexible to schema changes as the tests are generated by reading the INFORMATION_SCHEMA  and 4 custom tables. A large part of generating our tests is now data driven.

 

Business Case for creating a Data Model and Data Normalization

I recently came across a great article by @datachick on “Why having a Data Model is important?” on DataVersity. Highly recommended read.

Karen Lopez provides a great list of items on why she finds a Data Model helpful. I do believe there is something missing from the list of items though. All of the items mostly describe reasons from an IT point of view.

Here is Karen’s list on why you should create a data model:

  • Technologies used to move and persist data come in many forms, over time and at the same time. I support the creation of XML Documents, multiple DBMSs, multiple versions of the same DBMS. By having a single logical data model of data requirements, I can separate the rules and definitions of the requirements from their implementation.
  • The same data exists in many platforms, in many locations.  I need to be able to map those sources to targets across platforms and systems.  Doing this mapping, or data lineage, with a data model is much easier.  It also helps me understand the implications of making a change in one system on another system.  I can’t do that by just looking at code.
  • Writing stuff down is often a good way to impress people that have to provide requirements, compliance issues, security requirements, etc. They like not having to answer the same question over and over for different people and roles.  I can’t tell you how many times a business person has asked me if IT ever writes anything down.  Imagine your frustration if you had someone working on your house and ever tradesperson started their day with “Tell me about what sort of house you want? How many people will live here? Do you want to be able go get to the second floor? Do you need a bathroom?  How about a shower?”  That’s how many business professionals see us in IT.
  • A data model is a great way of capturing rules, constraints, definitions in a method that is technology independent. I can capture those things once and reference them in many places.
  • We can measure databases against our data models to assess fit. This can be done for application packages and custom development.
  • Enterprise data is complex.  I work with a data model that has 32,000 objects (tables, columns, datatypes, constraints, etc.) in it.  There is no way I could professionally manage change by just trying to remember all this information.  Nor could anyone else.
  • Modeling helps you ask the right questions before a bunch of time is spent coding, creating screens, reports, etc. This reduces costs and the number of bug fixes required.
  • Like all models, the data model is a communication tool and is good for tying requirements directly to designs and implementations.  This is especially true when I generate data prototypes based on the model.
  • Data governance can’t easily be done via reverse-engineered pictures of databases.
  • Once you’ve worked with a great data model, you can’t go back.

I agree with all of Karen’s points and I’d even like to add a few of my favourite benefits from an IT point of view as well.

  • I find HUGE benefits in being able to abstract domains across my entire data model so that changes to domains in the future can be done easily and ripple through the entire data model. The use of domains also ensures that my design is consistent and coherent across all the hundreds and thousands of tables in my data models.
  • I find HUGE benefits in being able to version my data model alongside the application code so I can revert to any point during the entire project.
  • I find HUGE benefits in helping me think through the problem. The danger with documents is they don’t have to be holistic and complete. You can start a document and describe functionality and relationships but you could forget to describe a couple of relationships without people noticing. But once you transform that description into a diagram, be it a data model or another diagram, orphans and missed relationships become very apparent. I like data models because they help me ensure my solution is correct, consistent, and complete.

The Business Case

One thing I always find fascinating is how hard it is to discuss and gain agreement from clients on why you need a data model and need to have a nicely normalized data model. Everyone in the data world takes this as just a fact, but how can you convince a client that they should spend the months of person-time to create one? Why should they spend tens of thousands of dollars? What is the business benefit?

I believe it comes down to two basic business benefits

1) The Data Model enables communication between IT professionals and the clients using visual means – As we are learning in all areas of the business world, a picture is worth a thousand words. We could create paragraphs upon paragraphs on how we believe the client’s business is structured and we would be wrong every time. We will still be wrong with a data model, but we will be less wrong as clients find it easier to point out the relationships we have misunderstood.

2) Normalization promotes and enables future business flexibility – A normalized database promotes a design that is highly cohesive and loosely coupled. Why does this matter? To the business it matters greatly! It means we can adapt to new realities in the business world quickly and easily. That means quicker time to market and less budget required for new functionality.

  • Need to integrate a new line of business? No Problem.
  • Need to accommodate a new language in the application? No Problem.
  • Need to roll out enhancements to keep up with the competition? No Problem.

Summary

If you agree with the benefits above, a Data Model is just the most efficient way of enabling these benefits. Especially with free tools like Oracle’s SQL Developer Data Modeler. With these free tools, you no longer need a large investment to start using them.

So why do I create a Data Model and normalize my data designs? Because they make good business sense and return benefits far exceeding the cost of creating and maintaining them.

12 Rules to create a Dimensional Model from a Normalized Model in an #agile way

Recently I have been on a project where I have been fortunate enough to develop both a normalized Operational Data Store and a Dimensional Data Warehouse. In addition to this, I have been fortunate to be able to recreate the Dimensional Data Warehouse three times over. This has been because the project has been done in an Agile and iterative manner. I have been amazed how the Dimensional model changed and improved along the way. Although some aspects of the model have remained constant, I would estimate that a good 1/3 of the model changed and improved with each major revision. That by itself is a great testament for why Data Warehouse projects can and should be done in an Agile process.

In the process of having to recreate the Dimensional Model three times, I have followed a process that has proved beneficial and been proven out.

So here is my 12 step program with associated level of difficulty in my humble opinion.

1) Remove tables that you don’t need for analytical reporting or history – Difficulty:Easy

First step is to recognize that not all tables in an Operational Data Store make sense to store in a Data Warehouse. Some tables are truly operational and don’t make sense for analytical reporting or trend analysis. Some tables also don’t require that historical data be kept on them. Be brutal and get rid of these tables at the start. These are the tables that can clutter a Dimensional Model.

2) De-normalize reference tables on master tables – Difficulty:Easy

This next step is also pretty easy. De-normalize your reference tables onto the main tables. Remember to de-normalize shared reference tables to all the main tables that reference them. There is some discussion in the Data Warehouse world about whether you should just store the reference table description and not the reference table codes, but I like to store both as they allow for more efficient reporting and querying.

3) Prefix column names with the table names to assist in data lineage before further de-normalization – Difficulty:Easy

This was a lesson learned from the first time I created a Dimensional model. If you don’t prefix column names with the current table names, it becomes very difficult later to determine where the field initially came from. The second time I created abbreviations for the tables and prefixed the columns with those abbreviations. The third time I prefixed the column names with the full table names. The design this third time is optimized. I know have a self-documenting model that communicates the data lineage automatically. I ended up using a ‘#’ to separate the table name from the column name for each column.

4) Collapse sub-type onto super types – Difficulty:Easy

The last easy step in the process. This is the step where you collapse sub-types onto the super-type. This situation is where you may have a person entity for shared attributes and then have two sub-types like employee and manager for attributes specific for each type of person. It is relatively easy to de-normalize these sub-types attributes back onto the super-type – especially since we have prefixed the column names with the table names in step 3.

5) Duplicate and collapse shared main tables – like address/phone/email/bank accounts – Difficulty:Medium

This step is similar to step 2 except that some main tables are shared like reference tables. These main tables are usually real world objects like addresses, email addresses, phone number, and bank accounts. Usually these result in one main table in a normalized model that is then shared and link to multiple other main tables. Duplicate these shared main tables and de-normalize them onto the main tables that reference them. This is an easy step once you identify these shared main tables, but sometime it does take a little experience to identify them.

6) Remove non-enterprise entities – Difficulty:Easy

This is a second step to weed out tables that may over-complicate the model. After you have collapsed the sub/super types and shared main tables, you may have seen areas of the Dimensional model that highlight non-enterprise entities. If these areas complicate the model, it is good to ask if your Data Warehouse model really requires them.  If you are unsure, remove them for now. You can always add them back if needed.

The example I had was where we had a type of person that was a Client Administration clerk. This person was different enough to require specific tables and attributes, but was not really required for enterprise reporting on claims, premiums, and bills. For this reason, we removed the objects from our Data Warehouse until we know we have a firm requirement for this type of person and these attributes.

7) Categorize your tables as Facts, Dimensions, Bridges, and Outriggers

I found it very helpful to then categorize the tables into four main categories:

  • Facts – Tables that define events or transactions. These tables usually have the attributes you will sum up in reports. Example: Claim,Bill
  • Dimensions – Tables that describe the Facts. These tables usually provide the filter and grouping criteria for your reports. Example: Client, Subscriber
  • Bridges – Tables that resolve many to many links between Dimensions and Facts. Example: Client_Subscriber
  • Outriggers – Tables that represents one to many relationships to Facts and Dimensions where you do not want to collapse them onto the main tables. Example: Claim_request

There is much more theory behind these categories that would be the topic of an entire book. If you are interested, the best book I have found on the subject is Chris Adamson’s Star Schema – highly recommended.

8) Don’t over de-normalize – Difficulty:Hard

I’ve listed this step as hard because it does take some experience to not over de-normalize. One of the goals of a dimensional model is to simplify the data model so there is the tendency to view the objective as having as few tables as humanly possible. While this is somewhat true, you need to remember to let the data define the model and not to force the data into your view of what the model should be.

Sometimes forcing tables together will massively increase the row count or not allow for flexibility if the cardinality of the relationships change in the future. This is something that needs to be balanced. The next step of creating Natural Keys will assist in this analysis.

9) Create Natural Keys – Difficulty:Hard

Create Natural Keys for all tables. This means identifying the keys that will define record uniqueness. This is a critical step to validate your model. The Natural keys for some tables will be one field. This is a great indication that the table is cohesive and defined well. If many fields together define uniqueness, you may want to examine if you over de-normalized the table. It may turn out that you didn’t but this may be a cause of the complicated Natural Key. Sometimes, the data and Natural Keys are just complicated depending on the data.

This step will provide crucial feedback for your design.

10) Make all columns mandatory

These last two steps are important to improve the usability of the Data Warehouse.

First step is to make all the columns mandatory. This will address the bane of all Data Warehouse querying and having to specify the dreaded ‘IS NOT NULL’/’IS NULL’ as part of all the queries in a Data Warehouse.

11) Create dummy records for every Dimension, Bridge, and Outrigger table

Since all columns are now mandatory, you will also need to create a dummy record in every Dimension, Bridge, and Outrigger table. These dummy records are used when an invalid value doesn’t allow the load process to link a record to a valid Dimension, Bridge, or Outrigger. This design addresses the second bane of Data Warehouse querying – the dreaded outer join.

12) Iterate Immediately

As I mentioned, I was able to vastly improve my Dimensional Model on subsequent attempts. I would recommend that you create your model and plan on recreating it immediately after you are finished. You will learn enough during the process, that you will be able to create a better model immediately.

Summary

These rules have helped me to create a Dimensional Model that I have been able to improve and iterate on as the project has evolved. The 13th rule I didn’t mention was to trust your instinct. Trust your intuition if it feels wrong to combine two tables. You are more than likely sensing some issue that will cause problems in the future.

My SQL Saturday Experience #SQLsat175

I just recently completed my presentation at SQL Saturday in Fargo. The experience was excellent. The facilities at the Microsoft Executive Briefing Centre in Fargo were outstanding. Very impressive organization throughout the entire event. I am very much looking forward to presenting at future SQL Saturdays.

I was presenting on my experiences installing and configuring SSRS and SharePoint over the last few months. It was a bit of a journey and the presentation was my homage to the multiple Blog posts that helped me through my issues.

It was very gratifying to hear that other people had also encountered the same issues that I had. When you create a new presentation, it is always a nervous experience to learn if people find value in your presentation. Thankfully, the attendees did find the presentation valuable and had run into the same issues I had.

Thanks to everyone who attended and for their great questions and discussion after the presentation. It was very enjoyable, and I met some great people whom  I hope to trade experiences with in the future. 🙂

My only regret is that the Microsoft store wasn’t open on the weekend. I think we will definitely need to make a special trip down to Faro again to visit the store in the future.

Thanks again for the organizers for selecting my talk. I was honoured to present with the other speakers. I hope I can speak at other SQL Saturdays in the future.

Cheers

Terry

#Agile Data Conversion

Last year I presented on Agile Data Warehouse – the final frontier at SDEC12. You can find the presentation here if you are interested.

This year on the same project I have been challenged with another new frontier. Now that we are executing and evolving the Data Warehouse, the question was posed as to how we can run Data Conversion projects in an Agile way. The project has a requirement to convert massive amounts of data from old Legacy systems to a new application. These old Legacy systems are written in Cobol and have existed since the early 1970’s. Basically they are old-school Legacy.

We have encountered the traditional quandary. Data conversion projects and application development projects need to occur in parallel because of the constraints on the project. Then both of these streams of projects need to be validated and tested for the start of Integration Testing. This clearly can be a challenge as changes in one can affect the other. The issue is that if we wait until the end of all the design of the applications before we can consider the specifications for conversion to be materially complete, we would not then have adequate time to be able to develop, convert, and validate and reconcile for the start of Integration Testing.Although the application projects and teams have embraced Agile methods, the Data conversion teams have been quite hesitant. The focus for data conversion has still be to have a complete set of specifications and to test and validate the entire data conversion process.

Once Data Conversion is code complete there is a large amount of work required to reconcile and validate the converted data. Although there will be some test automation, the amount of investigation can be significant to resolve one defect. In short, data conversion validation will take much longer than application validation. Yet both streams need to feed the Integration Test.

Agile?

The question that was posed is whether this was a truly Agile process.

We didn’t deliver frequently and we didn’t minimize inventory. I would say we failed on both counts.

We were tasked to determine how we could possibly supply converted data for Integration Testing.

Epiphany

The epiphany was asking ourselves what an Agile Data Conversion would look like. We discussed that we would be able to do just enough to allow the testing to proceed. It turns out we are already doing some of that for the generation of Sample files when working with the package vendor. We are generating the files in accordance to the current specification and adapting as we learn more. So that is good – early and frequent feedback.

One thing we were not doing in our development was placing some rigor around how we adapted. In this case, doing things in an Agile way actually resulted some more structure. We still had to plan since we were time-boxed into when we needed to be complete. This was a change to ensure we had a plan to adapt rather than just ad-hoc adaption. Which all Agile project can suffer from.

But the one area we were not as Agile as we could be was the reconciliation and validation of the data conversions. We had planned that we had to balance all the conversions before we could say the data was ready for Integration Testing. This was going to result in multiple Mock conversions over many months.

Why?

The Solution

The solution we came up with was that we should be able to reconcile and validate the converted data iteratively. After we are able to convert a full set of data, we are not going to wait until we can validate all the data. We are going to find between 5-10 clients and their associated data that balance and look good. These clients and data will be simple at first. (although we will add complex clients if they balance) We will then add to the clients we use for Integration Testing as we are able to validate and balance them as we go along. If we can’t find 5 clients to balance off the hop, we have a larger issue. And in the spirit of Agile, it is surely better to know that right away that wait until we try to validate the entire set of data.

Ultimately we will end up validating all the converted data but in an iterative way.

I’ll report back as we see results. We also do have a plan B if needed. 🙂

Adaptive Data Model – #Agile or Anathema?

I have seen the concept of an Adaptive Data Model proposed as an Agile method to Data Modelling lately. (Most recently in Ken Collier’s excellent book – “Agile Analytics”) The theory is that you can be more Agile using an  Adaptive Data Model instead of a traditional Data Model of the business domain.

Definition

An Adaptive Data Model is a Data Model that doesn’t model the business data. Rather it is a data model of the data model that models the business data. 🙂   The Adaptive Data Model describes tables that contains the meta-data that describes the data model. In this way, the entire Data Model is data driven and stored in a series of tables. The advantages of this approach is that changes to the model can be made by updating the meta-data in the tables. (as opposed to having to generate Alter statements to update the database structures in the database)

A sample Adaptive Data Model is shown below. (Copyright Ken Collier – Agile Analytics)

ADM

 

A subsequent layer would then need to be created to allow for the data to be extracted in a traditional sense by the application. (and to be made sense of by the business) Some suggestions for this layer have recommended that this layer could be created with a series of views or stored procedures.

Agile?

Although the Adaptive Data Model does allow for the easy modification of the Data Model, is it Agile?

I propose that an Adaptive Data Model is neither Agile or a Data Model.

According to Wikipedia a Data Model is:

“A data model is an abstract model that documents and organizes the business data for communication between team members and is used as a plan for developing applications, specifically how data are stored and accessed.”

An Adaptive Data Model does not meet this definition of a Data Model. It is a construct created to allow changes to be made to a data model.

In addition, I would propose that an Adaptive Data Model is also not Agile. It does not encourage frequent delivery and iterative development. An Adaptive Data Model is a complex solution that is not easily deployed in iterations to deliver value quickly and often to the business. If anything, it increases the technical debt of the project.

Summary

We as data professionals should be striving to make our processes more Agile and to be able to allow our processes to be refined iteratively like other areas of Software Development. Software Development proposed practices that allowed for iterative development. These didn’t include creating an Object Model of the Object model so that they didn’t have to fully embrace adaptive development practices.

An Adaptive Data Model feels like a short cut. It is hard to be Agile and iterative on data project, but trying to propose an Adaptive Data Model as a solution seems like a wrong turn. We need to find ways to allow our data designs to be change tolerant, adaptive, and test driven.

Memories of #ADABAS – My first database love

I recently was watching a DataVersity webinar on MongoDB schema design and I had flashbacks to my first DBA job at Investors Group in the early 90’s. You can find the webinar here.

MongoDB

MongoDB is one of the many NoSQL databases available. When I saw that DataVersity was going to have a webinar on MongoDB and how you define schemas in MongoDB I was very interested in learning about the topic. (It is a great presentation and introduction on the topic if you are interested)

One of the most interesting concepts was that you define documents which are like tables in SQL. These documents contain columns which are just name-value pairs of attributes when the data is stored. These documents can then be joined to other documents in one of two ways. Either they can be embedded into the main document or linked to the main document. This pattern is seen frequently is Object Oriented development but it is one that frequently causes issues in normalized databases.

These two methods also reminded me how I sometime like to organize tables in SQL databases along the same lines. I also like to identify which tables are truly independent (linked candidates) and which tables are dependent(embedded candidates) and need the context of other tables to typically be used. For example, Client and Product can be thought of as independent tables while the related tables of client_address and product_rate are dependent as they really only make sense when context is provided by Client and Product. I try to think of tables in these two ways as this categorization helps when we need to do Dimensional modeling.

I was intrigued with the functionality provided in MongoDB to allow for this embedding of other tables or documents.

But I did have concerns as all of the attributes to be saved in these documents are name-value pairs that are defined in each data command. There is also no consistency that is validated or verified on multiple data commands that operate the same documents. Yipe! Unlike SQL which separates Data Modification Language(DML) from Data Definition Language(DDL), in Mongo DB everything is a DML statement. As a result there is a lot of faith and confidence placed in the application to manage the integrity of the data. Although there is some additional work to define the data structure first and then operate on it, I feel that this structure is beneficial and has value.

Love at First Sight

As I was listening to the Webinar, I though back to my experience with ADABAS in the 90’s. After working with Oracle, Informix, Sybase, and SQL Server – I still think about the functionality I had with ADABAS. Unlike the other relational SQL databases, ADABAS did provide functionality that allowed you to embed tables/files in other tables/files through multi-value fields and periodic groups.(Tables are called Files and Indexes are called Descriptors in ADABAS, but the functionality is the same) Unlike MongoDB, ADABAS did provide functionality that required the creation of the tables/files first with their own DDL language and then modification of the data with their own DML.

In addition, ADABAS was able to provide performance throughput similar to what I have seen in any of the top-flight relational database engines. It certainly was high-performing. Now some of that high performance may have been helped by the lack of referential integrity provided by ADABAS. (That was the one drawback always mentioned when ADABAS was compared to other relational DBMSs) Any referential integrity must be maintained by the applications.

In retrospect, I don’t think I realized what a good DBMS I was using at the time.

I noticed that ADABAS now has a community edition that I am currently downloading… Maybe I will see what the technology looks like today. 🙂

Summary

Although MongoDb looks promising, I think ADABAS should try to inject their name and product into the NoSQL discussion. They have been providing NoSQL databases for over 40 years. They were cool before it was cool to be cool. 🙂

If I’ve peaked your curiosity, here are a couple of ADABAS links:

ADABAS Wikpedia Page

ADABAS Home Page