Recently I have been on a project where I have been fortunate enough to develop both a normalized Operational Data Store and a Dimensional Data Warehouse. In addition to this, I have been fortunate to be able to recreate the Dimensional Data Warehouse three times over. This has been because the project has been done in an Agile and iterative manner. I have been amazed how the Dimensional model changed and improved along the way. Although some aspects of the model have remained constant, I would estimate that a good 1/3 of the model changed and improved with each major revision. That by itself is a great testament for why Data Warehouse projects can and should be done in an Agile process.
In the process of having to recreate the Dimensional Model three times, I have followed a process that has proved beneficial and been proven out.
So here is my 12 step program with associated level of difficulty in my humble opinion.
1) Remove tables that you don’t need for analytical reporting or history – Difficulty:Easy
First step is to recognize that not all tables in an Operational Data Store make sense to store in a Data Warehouse. Some tables are truly operational and don’t make sense for analytical reporting or trend analysis. Some tables also don’t require that historical data be kept on them. Be brutal and get rid of these tables at the start. These are the tables that can clutter a Dimensional Model.
2) De-normalize reference tables on master tables – Difficulty:Easy
This next step is also pretty easy. De-normalize your reference tables onto the main tables. Remember to de-normalize shared reference tables to all the main tables that reference them. There is some discussion in the Data Warehouse world about whether you should just store the reference table description and not the reference table codes, but I like to store both as they allow for more efficient reporting and querying.
3) Prefix column names with the table names to assist in data lineage before further de-normalization – Difficulty:Easy
This was a lesson learned from the first time I created a Dimensional model. If you don’t prefix column names with the current table names, it becomes very difficult later to determine where the field initially came from. The second time I created abbreviations for the tables and prefixed the columns with those abbreviations. The third time I prefixed the column names with the full table names. The design this third time is optimized. I know have a self-documenting model that communicates the data lineage automatically. I ended up using a ‘#’ to separate the table name from the column name for each column.
4) Collapse sub-type onto super types – Difficulty:Easy
The last easy step in the process. This is the step where you collapse sub-types onto the super-type. This situation is where you may have a person entity for shared attributes and then have two sub-types like employee and manager for attributes specific for each type of person. It is relatively easy to de-normalize these sub-types attributes back onto the super-type – especially since we have prefixed the column names with the table names in step 3.
5) Duplicate and collapse shared main tables – like address/phone/email/bank accounts – Difficulty:Medium
This step is similar to step 2 except that some main tables are shared like reference tables. These main tables are usually real world objects like addresses, email addresses, phone number, and bank accounts. Usually these result in one main table in a normalized model that is then shared and link to multiple other main tables. Duplicate these shared main tables and de-normalize them onto the main tables that reference them. This is an easy step once you identify these shared main tables, but sometime it does take a little experience to identify them.
6) Remove non-enterprise entities – Difficulty:Easy
This is a second step to weed out tables that may over-complicate the model. After you have collapsed the sub/super types and shared main tables, you may have seen areas of the Dimensional model that highlight non-enterprise entities. If these areas complicate the model, it is good to ask if your Data Warehouse model really requires them. If you are unsure, remove them for now. You can always add them back if needed.
The example I had was where we had a type of person that was a Client Administration clerk. This person was different enough to require specific tables and attributes, but was not really required for enterprise reporting on claims, premiums, and bills. For this reason, we removed the objects from our Data Warehouse until we know we have a firm requirement for this type of person and these attributes.
7) Categorize your tables as Facts, Dimensions, Bridges, and Outriggers
I found it very helpful to then categorize the tables into four main categories:
- Facts – Tables that define events or transactions. These tables usually have the attributes you will sum up in reports. Example: Claim,Bill
- Dimensions – Tables that describe the Facts. These tables usually provide the filter and grouping criteria for your reports. Example: Client, Subscriber
- Bridges – Tables that resolve many to many links between Dimensions and Facts. Example: Client_Subscriber
- Outriggers – Tables that represents one to many relationships to Facts and Dimensions where you do not want to collapse them onto the main tables. Example: Claim_request
There is much more theory behind these categories that would be the topic of an entire book. If you are interested, the best book I have found on the subject is Chris Adamson’s Star Schema – highly recommended.
8) Don’t over de-normalize – Difficulty:Hard
I’ve listed this step as hard because it does take some experience to not over de-normalize. One of the goals of a dimensional model is to simplify the data model so there is the tendency to view the objective as having as few tables as humanly possible. While this is somewhat true, you need to remember to let the data define the model and not to force the data into your view of what the model should be.
Sometimes forcing tables together will massively increase the row count or not allow for flexibility if the cardinality of the relationships change in the future. This is something that needs to be balanced. The next step of creating Natural Keys will assist in this analysis.
9) Create Natural Keys – Difficulty:Hard
Create Natural Keys for all tables. This means identifying the keys that will define record uniqueness. This is a critical step to validate your model. The Natural keys for some tables will be one field. This is a great indication that the table is cohesive and defined well. If many fields together define uniqueness, you may want to examine if you over de-normalized the table. It may turn out that you didn’t but this may be a cause of the complicated Natural Key. Sometimes, the data and Natural Keys are just complicated depending on the data.
This step will provide crucial feedback for your design.
10) Make all columns mandatory
These last two steps are important to improve the usability of the Data Warehouse.
First step is to make all the columns mandatory. This will address the bane of all Data Warehouse querying and having to specify the dreaded ‘IS NOT NULL’/’IS NULL’ as part of all the queries in a Data Warehouse.
11) Create dummy records for every Dimension, Bridge, and Outrigger table
Since all columns are now mandatory, you will also need to create a dummy record in every Dimension, Bridge, and Outrigger table. These dummy records are used when an invalid value doesn’t allow the load process to link a record to a valid Dimension, Bridge, or Outrigger. This design addresses the second bane of Data Warehouse querying – the dreaded outer join.
12) Iterate Immediately
As I mentioned, I was able to vastly improve my Dimensional Model on subsequent attempts. I would recommend that you create your model and plan on recreating it immediately after you are finished. You will learn enough during the process, that you will be able to create a better model immediately.
Summary
These rules have helped me to create a Dimensional Model that I have been able to improve and iterate on as the project has evolved. The 13th rule I didn’t mention was to trust your instinct. Trust your intuition if it feels wrong to combine two tables. You are more than likely sensing some issue that will cause problems in the future.