When Data Modeling goes too far

One thing I have struggled with when creating Operational Data Stores is the tendency to create generic tables that promote re-use. These are usually tables like Address and Person. In an enterprise environment there may be many applications or Systems of Record that store Person or Address information, and there is a tendency, when we create the objects in an Operational Data Store, to store like information in one table.

Why?

In most cases, having multiple Address or Person tables doesn't violate any rules of Normalization. It is just an organizational habit to try to have only one table for a given entity. We probably don't have a requirement to report on all addresses together. Maybe we do for Person, but not for Address.

So the question is, Is this over-modeling?

Early in my modeling career I would have thought the question ridiculous. But now, I'm not so sure. If I had it to do over, I think I would duplicate some fields on different tables and leave the objects separate. Is there a problem created by having Clients and Sales Representatives in separate tables with the same columns? Certainly the Data Model has more tables, but in many ways it is easier to understand and query. One of the main reasons I would do this is that generic entities complicate querying and reporting. Although they can produce a more aesthetically pleasing data model, the opposite could be said of the queries required to pull information out of the database. And if the queries and reports are more complicated, the same is true of the Extract, Transform, and Load processes that load the data.
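To make the trade-off concrete, here is a rough sketch in generic SQL (table and column names are invented for illustration, not taken from any real model) of how a generic Person table forces a type filter into every query, while separate tables keep the queries plain:

```sql
-- Generic approach: one shared table, so every query needs a type filter
CREATE TABLE person (
    person_id   INTEGER PRIMARY KEY,
    person_type VARCHAR(20) NOT NULL,  -- e.g. 'CLIENT' or 'SALES_REP'
    first_name  VARCHAR(50),
    last_name   VARCHAR(50)
);

SELECT first_name, last_name
FROM   person
WHERE  person_type = 'CLIENT';  -- easy to forget, easy to get wrong

-- Separate approach: more tables, but plain queries and plain ETL
CREATE TABLE client (
    client_id  INTEGER PRIMARY KEY,
    first_name VARCHAR(50),
    last_name  VARCHAR(50)
);

CREATE TABLE sales_rep (
    sales_rep_id INTEGER PRIMARY KEY,
    first_name   VARCHAR(50),
    last_name    VARCHAR(50)
);

SELECT first_name, last_name FROM client;  -- no type filter required
```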

Going forward, I think I will resist the urge to combine similar entities when there isn’t a business reason to do so.

How about you?


Business Case for creating a Data Model and Data Normalization

I recently came across a great article by @datachick on “Why having a Data Model is important?” on DataVersity. Highly recommended read.

Karen Lopez provides a great list of reasons why she finds a Data Model helpful. I do believe something is missing from the list, though: the items mostly describe reasons from an IT point of view.

Here is Karen’s list on why you should create a data model:

  • Technologies used to move and persist data come in many forms, over time and at the same time. I support the creation of XML Documents, multiple DBMSs, multiple versions of the same DBMS. By having a single logical data model of data requirements, I can separate the rules and definitions of the requirements from their implementation.
  • The same data exists in many platforms, in many locations.  I need to be able to map those sources to targets across platforms and systems.  Doing this mapping, or data lineage, with a data model is much easier.  It also helps me understand the implications of making a change in one system on another system.  I can’t do that by just looking at code.
  • Writing stuff down is often a good way to impress people that have to provide requirements, compliance issues, security requirements, etc. They like not having to answer the same question over and over for different people and roles.  I can’t tell you how many times a business person has asked me if IT ever writes anything down.  Imagine your frustration if you had someone working on your house and every tradesperson started their day with “Tell me about what sort of house you want? How many people will live here? Do you want to be able to get to the second floor? Do you need a bathroom?  How about a shower?”  That’s how many business professionals see us in IT.
  • A data model is a great way of capturing rules, constraints, definitions in a method that is technology independent. I can capture those things once and reference them in many places.
  • We can measure databases against our data models to assess fit. This can be done for application packages and custom development.
  • Enterprise data is complex.  I work with a data model that has 32,000 objects (tables, columns, datatypes, constraints, etc.) in it.  There is no way I could professionally manage change by just trying to remember all this information.  Nor could anyone else.
  • Modeling helps you ask the right questions before a bunch of time is spent coding, creating screens, reports, etc. This reduces costs and the number of bug fixes required.
  • Like all models, the data model is a communication tool and is good for tying requirements directly to designs and implementations.  This is especially true when I generate data prototypes based on the model.
  • Data governance can’t easily be done via reverse-engineered pictures of databases.
  • Once you’ve worked with a great data model, you can’t go back.

I agree with all of Karen’s points and I’d even like to add a few of my favourite benefits from an IT point of view as well.

  • I find HUGE benefits in being able to abstract domains across my entire data model so that changes to domains in the future can be done easily and ripple through the entire data model. The use of domains also ensures that my design is consistent and coherent across all the hundreds and thousands of tables in my data models.
  • I find HUGE benefits in being able to version my data model alongside the application code so I can revert to any point during the entire project.
  • I find HUGE benefits in helping me think through the problem. The danger with documents is they don’t have to be holistic and complete. You can start a document and describe functionality and relationships but you could forget to describe a couple of relationships without people noticing. But once you transform that description into a diagram, be it a data model or another diagram, orphans and missed relationships become very apparent. I like data models because they help me ensure my solution is correct, consistent, and complete.

The Business Case

One thing I always find fascinating is how hard it is to gain agreement from clients on why you need a data model and why it should be nicely normalized. Everyone in the data world takes this as a given, but how can you convince a client that they should spend months of person-time to create one? Why should they spend tens of thousands of dollars? What is the business benefit?

I believe it comes down to two basic business benefits:

1) The Data Model enables communication between IT professionals and the clients using visual means – As we are learning in all areas of the business world, a picture is worth a thousand words. We could create paragraphs upon paragraphs on how we believe the client’s business is structured and we would be wrong every time. We will still be wrong with a data model, but we will be less wrong as clients find it easier to point out the relationships we have misunderstood.

2) Normalization promotes and enables future business flexibility – A normalized database promotes a design that is highly cohesive and loosely coupled. Why does this matter? To the business it matters greatly! It means we can adapt to new realities in the business world quickly and easily. That means quicker time to market and less budget required for new functionality.

  • Need to integrate a new line of business? No Problem.
  • Need to accommodate a new language in the application? No Problem.
  • Need to roll out enhancements to keep up with the competition? No Problem.

Summary

If you agree with the benefits above, a Data Model is simply the most efficient way of enabling them. Especially with free tools like Oracle's SQL Developer Data Modeler, you no longer need a large investment to get started.

So why do I create a Data Model and normalize my data designs? Because they make good business sense and return benefits far exceeding the cost of creating and maintaining them.

12 Rules to create a Dimensional Model from a Normalized Model in an #agile way

Recently I have been on a project where I have been fortunate enough to develop both a normalized Operational Data Store and a Dimensional Data Warehouse. In addition, I have been fortunate to be able to recreate the Dimensional Data Warehouse three times over, because the project has been done in an Agile and iterative manner. I have been amazed at how the Dimensional model changed and improved along the way. Although some aspects of the model have remained constant, I would estimate that a good 1/3 of the model changed and improved with each major revision. That by itself is a great testament to why Data Warehouse projects can and should be done in an Agile way.

In the process of recreating the Dimensional Model three times, I have followed a process that has proven beneficial.

So here is my 12-step program, with each step's level of difficulty in my humble opinion.

1) Remove tables that you don’t need for analytical reporting or history – Difficulty:Easy

First step is to recognize that not all tables in an Operational Data Store make sense to store in a Data Warehouse. Some tables are truly operational and don’t make sense for analytical reporting or trend analysis. Some tables also don’t require that historical data be kept on them. Be brutal and get rid of these tables at the start. These are the tables that can clutter a Dimensional Model.

2) De-normalize reference tables on master tables – Difficulty:Easy

This next step is also pretty easy. De-normalize your reference tables onto the main tables. Remember to de-normalize shared reference tables to all the main tables that reference them. There is some discussion in the Data Warehouse world about whether you should just store the reference table description and not the reference table codes, but I like to store both as they allow for more efficient reporting and querying.
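As a hedged illustration (invented names, generic SQL), de-normalizing a claim-status reference table onto a Claim table means carrying both the code and the description on the target table:

```sql
-- Normalized source: CLAIM(claim_id, claim_status_code)
--                    CLAIM_STATUS(claim_status_code, description)

-- De-normalized target: both the code and the description are carried,
-- so reports can filter or group on either without an extra join
CREATE TABLE claim (
    claim_id          INTEGER PRIMARY KEY,
    claim_status_code VARCHAR(10)  NOT NULL,   -- e.g. 'OPN'
    claim_status_desc VARCHAR(100) NOT NULL    -- e.g. 'Open'
);
```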

3) Prefix column names with the table names to assist in data lineage before further de-normalization – Difficulty:Easy

This was a lesson learned from the first time I created a Dimensional model. If you don't prefix column names with their current table names, it becomes very difficult later to determine where a field originally came from. The second time, I created abbreviations for the tables and prefixed the columns with those abbreviations. The third time, I prefixed the column names with the full table names. This third design is the best of the three: I now have a self-documenting model that communicates the data lineage automatically. I ended up using a '#' to separate the table name from the column name for each column.
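As a small, hypothetical example of the naming convention (Oracle happens to allow '#' in unquoted identifiers; other databases may require quoted identifiers or a different separator):

```sql
-- Each column carries the name of the table it originally came from,
-- separated from the column name by '#'
CREATE TABLE policy_dim (
    policy#policy_id      INTEGER,       -- from POLICY
    policy#effective_date DATE,          -- from POLICY
    policy_type#code      VARCHAR(10),   -- from POLICY_TYPE (step 2)
    policy_type#descr     VARCHAR(100)   -- from POLICY_TYPE (step 2)
);
```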

4) Collapse sub-types onto super-types – Difficulty:Easy

The last easy step in the process. This is the step where you collapse sub-types onto the super-type. This situation is where you may have a Person entity for shared attributes and then two sub-types, like Employee and Manager, for attributes specific to each type of person. It is relatively easy to de-normalize these sub-type attributes back onto the super-type – especially since we prefixed the column names with the table names in step 3.
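A hedged sketch (invented attributes) of what the collapsed table might look like, with the step-3 prefixes preserving the lineage of each sub-type's attributes:

```sql
-- PERSON super-type collapsed with its EMPLOYEE and MANAGER sub-types
CREATE TABLE person_dim (
    person#person_id        INTEGER,
    person#person_type_code VARCHAR(10),    -- 'EMPLOYEE' or 'MANAGER'
    person#last_name        VARCHAR(50),
    employee#hire_date      DATE,           -- employee-specific attribute
    manager#budget_amount   DECIMAL(12,2)   -- manager-specific attribute
);
```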

5) Duplicate and collapse shared main tables – like address/phone/email/bank accounts – Difficulty:Medium

This step is similar to step 2, except that some main tables are shared like reference tables. These main tables usually represent real-world objects like addresses, email addresses, phone numbers, and bank accounts. Usually these result in one main table in a normalized model that is then shared and linked to multiple other main tables. Duplicate these shared main tables and de-normalize them onto the main tables that reference them. This is an easy step once you identify the shared main tables, but sometimes it does take a little experience to spot them.

6) Remove non-enterprise entities – Difficulty:Easy

This is a second step to weed out tables that may over-complicate the model. After you have collapsed the sub/super types and shared main tables, you may have seen areas of the Dimensional model that highlight non-enterprise entities. If these areas complicate the model, it is good to ask if your Data Warehouse model really requires them.  If you are unsure, remove them for now. You can always add them back if needed.

The example I had was where we had a type of person that was a Client Administration clerk. This person was different enough to require specific tables and attributes, but was not really required for enterprise reporting on claims, premiums, and bills. For this reason, we removed the objects from our Data Warehouse until we know we have a firm requirement for this type of person and these attributes.

7) Categorize your tables as Facts, Dimensions, Bridges, and Outriggers

I found it very helpful to then categorize the tables into four main categories:

  • Facts – Tables that define events or transactions. These tables usually have the attributes you will sum up in reports. Examples: Claim, Bill
  • Dimensions – Tables that describe the Facts. These tables usually provide the filter and grouping criteria for your reports. Examples: Client, Subscriber
  • Bridges – Tables that resolve many-to-many links between Dimensions and Facts. Example: Client_Subscriber
  • Outriggers – Tables that represent one-to-many relationships to Facts and Dimensions where you do not want to collapse them onto the main tables. Example: Claim_request

There is much more theory behind these categories – enough to be the topic of an entire book. If you are interested, the best book I have found on the subject is Chris Adamson's Star Schema: The Complete Reference – highly recommended.
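To tie the four categories to the examples above, here is a minimal, hypothetical star-schema sketch (names and columns are my own, surrogate keys are assumed, and the step-3 prefixes are omitted for brevity):

```sql
-- Dimension: describes the facts; supplies filter and group-by columns
CREATE TABLE client_dim (
    client_key  INTEGER PRIMARY KEY,        -- surrogate key
    client_name VARCHAR(100) NOT NULL
);

-- Fact: the event/transaction carrying the measures that get summed
CREATE TABLE claim_fact (
    claim_key    INTEGER PRIMARY KEY,
    client_key   INTEGER NOT NULL REFERENCES client_dim,
    claim_amount DECIMAL(12,2) NOT NULL
);

-- Bridge: resolves a many-to-many between dimensions (subscriber_dim omitted)
CREATE TABLE client_subscriber_bridge (
    client_key     INTEGER NOT NULL REFERENCES client_dim,
    subscriber_key INTEGER NOT NULL
);

-- Outrigger: a one-to-many detail deliberately not collapsed onto the fact
CREATE TABLE claim_request_outrigger (
    claim_request_key INTEGER PRIMARY KEY,
    claim_key         INTEGER NOT NULL REFERENCES claim_fact,
    request_date      DATE NOT NULL
);
```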

8) Don’t over de-normalize – Difficulty:Hard

I’ve listed this step as hard because it takes some experience not to over de-normalize. One of the goals of a dimensional model is to simplify the data model, so there is a tendency to treat the objective as having as few tables as humanly possible. While this is somewhat true, you need to remember to let the data define the model and not force the data into your view of what the model should be.

Sometimes forcing tables together will massively increase the row count, or will not allow for flexibility if the cardinality of the relationships changes in the future. This is something that needs to be balanced. The next step of creating Natural Keys will assist in this analysis.

9) Create Natural Keys – Difficulty:Hard

Create Natural Keys for all tables. This means identifying the keys that define record uniqueness, and it is a critical step for validating your model. For some tables the Natural Key will be a single field – a great indication that the table is cohesive and well defined. If many fields together are needed to define uniqueness, you may want to examine whether you over de-normalized the table. It may turn out that you didn't, and the Natural Key is simply complicated because the data is complicated.

This step will provide crucial feedback for your design.
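A small, hypothetical example of declaring the natural key alongside the surrogate key (column names invented):

```sql
-- The surrogate key is the primary key; the natural key is declared as a
-- unique constraint so the model (and the load) can prove what defines
-- a unique row
CREATE TABLE subscriber_dim (
    subscriber_key  INTEGER PRIMARY KEY,       -- surrogate key
    policy_no       VARCHAR(20)  NOT NULL,
    member_no       VARCHAR(20)  NOT NULL,
    subscriber_name VARCHAR(100) NOT NULL,
    CONSTRAINT subscriber_dim_nk UNIQUE (policy_no, member_no)
);
-- A single-column natural key suggests a cohesive, well-defined table;
-- a multi-column key like this one is a prompt to check for over
-- de-normalization.
```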

10) Make all columns mandatory

These last two steps are important to improve the usability of the Data Warehouse.

The first of these is to make all the columns mandatory. This addresses the bane of Data Warehouse querying: having to specify the dreaded ‘IS NOT NULL’/’IS NULL’ conditions throughout the queries in a Data Warehouse.

11) Create dummy records for every Dimension, Bridge, and Outrigger table

Since all columns are now mandatory, you will also need to create a dummy record in every Dimension, Bridge, and Outrigger table. These dummy records are used when a missing or invalid value prevents the load process from linking a record to a valid Dimension, Bridge, or Outrigger row. This design addresses the second bane of Data Warehouse querying – the dreaded outer join.
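Continuing the hypothetical sketch from step 7, steps 10 and 11 together might look like this (the -1 dummy key is an assumption; any reserved value works):

```sql
-- Dummy row in the dimension for records that cannot be matched
INSERT INTO client_dim (client_key, client_name)
VALUES (-1, 'Unknown / Not Provided');

-- Every foreign key on the fact is mandatory and defaults to the dummy row
CREATE TABLE bill_fact (
    bill_key    INTEGER PRIMARY KEY,
    client_key  INTEGER DEFAULT -1 NOT NULL REFERENCES client_dim,
    bill_amount DECIMAL(12,2) NOT NULL
);

-- With no NULLs and every key resolving to a row, reports need no
-- IS NULL checks and no outer joins – a plain inner join always works
SELECT d.client_name, SUM(f.bill_amount) AS total_billed
FROM   bill_fact f
JOIN   client_dim d ON d.client_key = f.client_key
GROUP BY d.client_name;
```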

12) Iterate Immediately

As I mentioned, I was able to vastly improve my Dimensional Model on subsequent attempts. I would recommend that you create your model and plan on recreating it immediately after you are finished. You will learn enough during the process that you will be able to create a better model right away.

Summary

These rules have helped me create a Dimensional Model that I have been able to improve and iterate on as the project has evolved. The 13th rule I didn't mention is to trust your instinct: if it feels wrong to combine two tables, trust your intuition. You are more than likely sensing an issue that will cause problems in the future.

Adaptive Data Model – #Agile or Anathema?

I have seen the concept of an Adaptive Data Model proposed lately as an Agile approach to Data Modelling. (Most recently in Ken Collier's excellent book, “Agile Analytics”.) The theory is that you can be more Agile using an Adaptive Data Model instead of a traditional Data Model of the business domain.

Definition

An Adaptive Data Model is a Data Model that doesn't model the business data. Rather, it is a data model of the data model that models the business data. 🙂 The Adaptive Data Model describes tables that contain the meta-data describing the data model. In this way, the entire Data Model is data driven and stored in a series of tables. The advantage of this approach is that changes to the model can be made by updating the meta-data in those tables (as opposed to having to generate ALTER statements to update the database structures).
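This is easier to picture with a rough sketch. The following is my own generic, simplified illustration of the idea (sometimes described as an entity-attribute-value style design); it is not the model from Agile Analytics:

```sql
-- The "model" lives in metadata tables rather than in the schema itself
CREATE TABLE entity_type (
    entity_type_id INTEGER PRIMARY KEY,
    entity_name    VARCHAR(100) NOT NULL          -- e.g. 'Client'
);

CREATE TABLE attribute_type (
    attribute_type_id INTEGER PRIMARY KEY,
    entity_type_id    INTEGER NOT NULL REFERENCES entity_type,
    attribute_name    VARCHAR(100) NOT NULL,      -- e.g. 'Birth Date'
    data_type         VARCHAR(30)  NOT NULL       -- e.g. 'DATE'
);

-- The business data itself is stored generically as attribute values
CREATE TABLE entity_instance (
    entity_instance_id INTEGER PRIMARY KEY,
    entity_type_id     INTEGER NOT NULL REFERENCES entity_type
);

CREATE TABLE attribute_value (
    entity_instance_id INTEGER NOT NULL REFERENCES entity_instance,
    attribute_type_id  INTEGER NOT NULL REFERENCES attribute_type,
    value_text         VARCHAR(4000)              -- everything stored as text
);

-- "Adding a column" is now an INSERT into attribute_type,
-- not an ALTER TABLE against the database structures
```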

A sample Adaptive Data Model is shown below. (Copyright Ken Collier – Agile Analytics)

[Diagram: Adaptive Data Model (ADM)]

A subsequent layer would then need to be created to allow the data to be extracted in a traditional form by the application (and to be made sense of by the business). Some suggestions are that this layer could be built as a series of views or stored procedures.

Agile?

Although the Adaptive Data Model does allow for the easy modification of the Data Model, is it Agile?

I propose that an Adaptive Data Model is neither Agile nor a Data Model.

According to Wikipedia a Data Model is:

“A data model is an abstract model that documents and organizes the business data for communication between team members and is used as a plan for developing applications, specifically how data are stored and accessed.”

An Adaptive Data Model does not meet this definition of a Data Model. It is a construct created to allow changes to be made to a data model.

In addition, I would propose that an Adaptive Data Model is also not Agile. It does not encourage frequent delivery and iterative development. An Adaptive Data Model is a complex solution that is not easily deployed in iterations to deliver value quickly and often to the business. If anything, it increases the technical debt of the project.

Summary

We as data professionals should be striving to make our processes more Agile and to allow our designs to be refined iteratively, like other areas of Software Development. Software Development adopted practices that allow for iterative development – and those practices did not include creating an Object Model of the Object Model so that developers could avoid fully embracing adaptive development practices.

An Adaptive Data Model feels like a shortcut. It is hard to be Agile and iterative on data projects, but proposing an Adaptive Data Model as the solution seems like a wrong turn. We need to find ways to make our data designs change tolerant, adaptive, and test driven.

The state of #Agile according to Data Modellers

DataVersity released their snapshot survey on Data Modelling and the results aren’t good for Data Modellers who want more adoption of Agile by the Data Modelling community.

Before we get into the details, DataVersity is a great source of references and webinars for all things data. You can find them by clicking on this link: DataVersity

The Results

1) How important is it to know how to work with Agile Teams? (this was graded on a scale of 1-10. I’ve tried to group them into categories)

  • 27% thought it was important (1-3 on a scale of 10)
  • 27% were neutral (4-6 on a scale of 10)
  • 46% thought it was not important (7-10 on a scale of 10)

2) How much experience do you have with Agile or Scrum projects?

  • 12% – 10+ projects
  • 9% – 5-10 projects
  • 20% – 3-4 projects
  • 16% – 2 projects
  • 8% – 1 project
  • 33% – no experience

3) If you have been the data architect or data modeller on an Agile project, how satisfied were you with the results?

  • 22% – very satisfied
  • 46% – fairly satisfied
  • 32% – not satisfied

Analysis

What I took from these responses is that the majority of people feel Agile isn't that important to them. But I hope there is some light at the end of the tunnel, as a good percentage of Data Modellers (41%) have either never been on an Agile project or have only been on one. In addition, of the ones who were on Agile projects, 68% reported that they were either very satisfied or fairly satisfied. Those are encouraging numbers. Although some Data Modellers don't think Agile is important to them, there seems to be a correlation that once they are exposed to the methods, they view them positively.

Dimensional Modelling

Those of you who have read my blogs have seen my positive comments on Dimensional Modelling. Another concern I had with the results of the survey is the percentage of Data Modellers who feel that understanding Dimensional Modelling concepts is not important to them.

  • 28% thought it was important (1-3 on a scale of 10)
  • 34% were neutral (4-6 on a scale of 10)
  • 40% thought it was not important (7-10 on a scale of 10)

Given that virtually all Data Modellers have had to model data to support reporting, this is a somewhat concerning statistic. Dimensional Modelling is not something new, either, unlike XML and NoSQL data stores. Shockingly, 62% felt that knowledge of XML and NoSQL Data Stores is not important to them.

Summary

I think we have a ways to go to encourage the adoption of new processes, methods, and technologies in the Data Modelling and Database professions. Even within our own profession, items like Dimensional Modelling still lack full acceptance.

I would encourage every Data Modeller out there to learn about Agile and Agile methods. As we move into the second decade of Agile, its adoption is increasing and it will affect every type of project in the future. Understanding the methods will help the Data Modeller determine how best to integrate with Agile methods and practices.

I feel that Data Modellers must remember that we are providing a service to projects and the business. If the projects and the business are becoming more Agile, we must also become Agile. If not, projects and businesses will be drawn towards other solutions and services that do align with their methods. This could drive more projects toward NoSQL and XML even when they are not good candidates for those technologies.

 

#Agile Enterprise Data Model – #holistic and #iterative

I’ve recently been working on a Data Warehouse project that I am trying to do in an Agile way. Unlike Agile Software Development, Agile Data projects have some complexities that cause you to adapt the common Agile methods and practices. I am a big fan of User Story Mapping and of creating a User Story Map to guide the development of an application and to create and prioritize the backlog. But I have felt that a Data Model tends not to allow stories to be sliced as thinly as application functionality. In many cases a group of tables must be taken together to make any sense at all. Unlike application components, which can be segmented more easily, it is sometimes quite difficult to segment a data model into parts that can be worked on independently.

I have been involved in other Data Modelling projects where intense analysis was done up front to try to understand all the intricacies of the data before application development started. This analysis sometimes lasted 3 months, 6 months, or even years. It goes without saying that this is not an efficient way to organize a project.

If you try to segment application user stories, individual reports, and data requirements as stories, they end up being a bit too fine-grained from a database point of view. Creating a data model from a single user story or report can lead to a data model that has to undergo a great amount of re-factoring as new user stories, reports, and data requirements are considered. Frequently a new user story that enhances application functionality requires an entirely different data model. 😦

Then I started to think that this can happen on most Agile projects. They slice the User Stories thin from an application point of view, and that works because the application components can be somewhat independent – but what happens when you try to line up the data components?

I am sure this impedance mismatch has led to the kind of drastic database re-factoring that has given database re-factoring its bad reputation and caused people to write books on the subject.

Epiphany

Then it hit me. Although the User Story Mapping exercise does allow for an end-to-end view of the application that can then be iterated on, it does not create an end-to-end view of the Data Model or the Data Domain. In fact, some critical components of the Data Model may not be encountered until much later. This can cause significant rework of the design that was done earlier.

So what could be the solution?

Agile Enterprise Data Model

An Agile Enterprise Data Model is a thin data model of the Data Domain that can be done as a two-week technical spike. It provides a User Story Map for the data: a Data Model with placeholders for discussion and further refinement. The idea is not to do detailed Data Modelling but to understand, at a high level, all the data components the project will need to consider. Used in conjunction with a User Story Map, it can be a valuable tool to allow the application and data model to be worked on in true iterations.

Typically an Agile Enterprise Data Model is a conceptual data model with 20-30 entities. There are minimal attributes; discussion and understanding revolve more around entities and relationships than anything else. The Agile Enterprise Data Model can then be refined to create the application database, Operational Data Store, and Data Warehouse.

Summary

I have used the Agile Enterprise Data Model with very good success. It works especially well in a Data Domain you are unfamiliar with, as it allows you to understand the domain holistically up front and then deliver in iterations. The best of both worlds.

Why #Dimensional Modeling matters

I’ve recently completed a data modeling initiative on a major project. After doing this I’ve come to two major conclusions:

  1. The coverage area in Insurance is probably the most devious and twisted area of data that I have ever modeled.
  2. Dimensional Modeling should be done on every model, to ensure that the domain can be modeled simply.

Why Dimensional Model?

For the model I worked on, there are 261 tables in the Relational Model. In the Dimensional Model, there are only 25. The process of distilling a Relational Model into a Dimensional one is not easy, but it highlighted problems in my Relational Model that I was unaware of. These issues would have been exposed later during testing or post-production, but being able to identify them in development was extremely valuable.

So what is Dimensional Modeling? Although most people have been exposed to the rules of Data Normalization, not as many have been exposed to Dimensional Modeling. I will try to explain my view of Dimensional Modeling, but if you are interested I would highly recommend Star Schema: The Complete Reference by Chris Adamson. Simply put, it is the most complete and concise book ever written on Dimensional Modeling. Dimensional Modeling is usually done to create a model for a Data Warehouse, which can then be used for Operational and Analytic reporting. But the process of Dimensional Modeling does not need to be limited to a Data Warehouse. The process itself has value in validating the Relational Model.

The term that is commonly used to refer to a Dimensional Model is a ‘Star Schema’ model. This is due to the fact that most Dimensional Models look like a star. (With the Facts in the centre and the Dimensions around the Fact)

Dimensional Modeling

Dimensional Modeling is the process of taking a Relational Model in some normal form and being able to distill the many tables down to the business objects that the business works with. In most cases this would result in single tables that correspond to objects like:

  • Client
  • Product
  • Sales Rep
  • Store
  • Sales Transaction

These business objects fall into two categories:

Facts – These are Facts about the business. They typically correspond to events or transactions and have metrics or measures that the business is concerned about. In many cases these fact tables are what the business wants to report on.

Dimensions – These are Dimensions to the Facts. They typically correspond to business objects that interact with Facts. These Dimensions represent how the business describes these objects and how they like to slice, dice, filter, sort, group, and order the Facts. A standard Dimension that occurs in all Dimensional Models that does not correspond to a business object is Time. Time is a dimension for almost all business events and transactions.

Initially, the act of creating these Facts and Dimensions is a large task of de-normalization. Although this sounds like a simple task, it is anything but. A Relational Model with hundreds of tables must be distilled down to a handful of tables. Many-to-many relationships must be transformed into simple relationships that are easy to understand and report against. A natural key must also be defined that uniquely identifies a row in each Fact and Dimension. Although this sounds easy, you may find that some Dimensions don't actually have a natural key. This is a clue that the Dimensional Model needs more attention.
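As a hedged illustration of the de-normalization involved (invented source tables, Oracle-style sequence for the surrogate key), populating a Product dimension from a normalized model might look like this:

```sql
-- Normalized source: PRODUCT -> CATEGORY, PRODUCT -> BRAND
-- Dimensional target: one wide row per product in PRODUCT_DIM
INSERT INTO product_dim (product_key, product_no, product_name,
                         category_name, brand_name)
SELECT product_dim_seq.NEXTVAL,   -- surrogate key
       p.product_no,              -- natural key of the dimension
       p.product_name,
       c.category_name,
       b.brand_name
FROM   product  p
JOIN   category c ON c.category_id = p.category_id
JOIN   brand    b ON b.brand_id    = p.brand_id;
```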

A Dimensional Model also places the modeling of history at the forefront. A lot of attention is spent ensuring that data can accurately represent the changes that occur over time. Most of the time, a Relational Model is primarily concerned with current state.

Note: This is a gross over-simplification, if this is interesting to you I would recommend the book by Chris Adamson on all of the theory and complexity behind Dimensional Modeling. Or you can just start Googling!

Conclusion

The act of creating a Dimensional Model and distilling the design down to 25 Facts and Dimensions highlighted inconsistencies and duplication in my model. It also challenged my understanding of the data domain and raised questions I realized I didn't have answers for. It is easy for inconsistency and errors to hide in a data model with 200+ tables. A data model with 10-30 tables has nowhere to hide. It has the brutal transparency that is so valuable in Agile.

Why do we #DataModel at all?

People in the Database world take Normalization and Data Modeling as something that should be done without question. I compare it to best practices like versioning software: no one expects that anyone would create software without version control anymore. But more often recently I do get questioned and challenged on why we need to normalize and model data. Is it even required with the cheap disk space, memory, and server capacity available today?

According to Wikipedia and others, the objective of normalization is:

“Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.”

The rules of Normalization

  • 1NF – A relation is in 1NF if and only if all underlying domains contain scalar values only. Violated when the table contains repeating groups.
  • 2NF – A relation is in 2NF if and only if it is in 1NF and every non-key attribute is irreducibly dependent on the primary key – that is, every column must depend solely on the primary key. Violated when a non-key field is a fact about a subset of the key.
  • 3NF – A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. Violated when a non-key field is a fact about another non-key field.
  • 4NF – A relation R is in 4NF if and only if, whenever there exists a nontrivial multi-valued dependency A ->> B between subsets A and B of the attributes of R, all attributes of R are also functionally dependent on A. Violated when a table contains two or more independent multi-valued facts about an entity (the relation must also satisfy Third Normal Form).

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is “functionally dependent” on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can’t be two records having the same X value.
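As a small, hypothetical illustration of the third-normal-form case: department_name below is a fact about department_id, not about the key of the table:

```sql
-- Violates 3NF: department_name depends on department_id, which is not the key
CREATE TABLE employee_unnormalized (
    employee_id     INTEGER PRIMARY KEY,
    employee_name   VARCHAR(100),
    department_id   INTEGER,
    department_name VARCHAR(100)   -- transitively dependent on the key
);

-- Normalized: the department name is stated exactly once
CREATE TABLE department (
    department_id   INTEGER PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL
);

CREATE TABLE employee (
    employee_id   INTEGER PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INTEGER REFERENCES department
);
```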

The Questions

Now that we have reviewed the objectives and rules of normalization, let us summarize. The objective of Normalization is to:

  1. Minimize Redundancy
  2. Minimize Dependency

But what if we have extra storage available and storing redundant copies of data is not a problem? In fact, it probably will speed up our query response time. What if we also don’t require frequent modification of data so that having de-normalized data won’t result in update or deletion anomalies caused by excessive data dependency? Why should we still model our data and normalize?

The three reasons to Data Model

Simplicity, Consistency and Integrity, and Future Flexibility.

Simplicity

Every one of the violations mentioned above would require extra code and extra unit tests to validate proper functioning. Depending on the number of violations, this can become a severe amount of technical debt needlessly built into the software. There is an entire movement dedicated to the elimination of If statements (www.antiifcampaign.com). Software that is Data Driven rather than Condition Driven is simpler and easier to maintain over the life of the application.
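A tiny, hypothetical SQL illustration of the difference (invented account tables): in the first query the business knowledge is hard-coded as conditions; in the second it is data in a reference table:

```sql
-- Condition driven: business knowledge buried in code, repeated in every
-- query and every unit test that exercises it
SELECT account_no,
       CASE account_type_code
            WHEN 'CHQ' THEN 'Chequing'
            WHEN 'SAV' THEN 'Savings'
            ELSE 'Unknown'
       END AS account_type_desc
FROM   account;

-- Data driven: the same knowledge lives once, in a reference table
SELECT a.account_no,
       t.description AS account_type_desc
FROM   account a
JOIN   account_type t ON t.account_type_code = a.account_type_code;
```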

An application that is Data Driven can also automate the creation of its test cases to validate proper functioning. This, combined with the enhanced simplicity, greatly adds to the quality of the application.

Consistency and Integrity

Even if the solution being modeled can currently accommodate redundant data with minimal risk of update and deletion anomalies, significant risk is being assumed by leaving these potential situations in your data model. How can you ensure that redundant data will be kept in sync, and that update and deletion anomalies do not get introduced in the future as people and requirements change? Either this is done through additional software development code or through additional processes and latent knowledge held by resident experts. Neither of these situations is a good use of time and energy.

This is an example of an application-centric view of the data model. Unfortunately, not all activity on the Data Model can be guaranteed to always go through the application. Data Fixes, Conversions, and enhancements all have the ability to bypass the application’s business logic and compromise the integrity of the data. All it takes is one high value client with inaccurate or inconsistent data to irreparably harm a company’s reputation and revenue stream.

Future Flexibility

Solutions that are data driven and do not have excessive functional dependencies are much easier to evolve in the future. For example, I may have a business requirement to split one account type or to combine account types. This type of conversion will be quite routine if I have modeled my data properly and minimized dependencies. If not, the conversion can be quite convoluted, and I will probably need to evaluate code before I can determine the implications of making such a change. Then I have to be sure I address and update all the redundant code throughout the application. Just because update and deletion anomalies don't exist today doesn't mean those situations won't happen in the future.

In addition, these changes to split or combine account types would probably also require code changes. If the solution was Data Driven, the possibility of these code changes would be minimized. (not to say they would never be required, but the probability of code changes would be minimized)

Summary

A well designed application and user interface can be used with minimal training. It just makes sense and models the client's current processes and tasks.

A well designed data model should also have the same intuitive qualities. It also makes sense and models the business’s data. Not how the application functions, but how the business exists. Modeling the data in this manner minimizes work currently to work with the data and in the future to accommodate change.

In Object Oriented parlance, the Data Model itself should be loosely coupled with high cohesion. Both the Object Model and Data Model should share this characteristic. (Although they will implement it in quite distinct ways)

Object Model and Data Model differences – Embrace the diversity

In my experience there is a distinctive difference in Data Models when they are created by developers with an Object Model mindset. Usually there is some work and negotiation that needs to be done to properly overcome the Object Model-Data Model impedance problem. I believe that both extreme points lead to less than optimal designs and architectures:

1) Implementing an Object Model directly into a Data Model

2) Implementing a Data Model directly into an Object Model

Both models have a rationale and logic as to how they structure components and sometimes that doesn’t translate well between each other. There are good reasons on why they differ.

Recently I’ve had a situation where I’ve come up against these differences, and they have arisen in regard to how each camp sometimes models Super/Subtype tables (and subsequently reference tables).

Definitions

I am using the term Super/Subtype tables to refer to a collection of tables that represent a Generic/Specific relationship. Unlike Parent/Child relationships, which represent a hierarchy, the Generic and Specific entities are at the same entity level. Perhaps a few examples will assist the discussion:

Parent/Child

Typical Parent/Child relationships represent ‘belonging’ or ‘ownership’. Examples of this type of relationship would be:

  • Person/Address
  • Account/Transaction

In these examples, there is a clear understanding of precedence and hierarchy. You obviously need a Person before you can have a related Address. Because there is a Parent/Child relationship and hierarchy, there is also usually a clearer distinction between the types. In many cases this is because the two entities have different cardinality. Even the most fervent object modeller will create separate objects if they have different cardinality.

Generic/Specific

Generic/Specific relationships do not represent ‘belonging’ or ‘ownership’. In addition, they typically do not have different cardinality. These relationships represent ‘refinement’, or a type of ‘Master-Detail’ relationship: one Master entity with multiple Specific entities. To be a Generic/Specific relationship, there must be attributes at the Master level that apply to all Specific types, and attributes at the different Specific types that only apply to each Specific type. (That is why the Specific entities exist; otherwise we would just put the attributes at the Master level.)

Usually these relationships are created in the Data Model in one of two ways:

1) Multiple tables – one Master table with multiple specific Detail tables (and a type code on the Generic entity to define which related Specific entity applies for each row).

2) One table with nullable fields that are populated based upon the type code of the row (a sketch of both options follows the examples below).

Examples of this type of relationship are:

  • Financial Account – [Equity Account, Fixed Income Account]
  • Coverage – [Maximum, Deductible, Co-insurance]

In these examples, there is not a pronounced hierarchy. You obviously need a Financial Account before you can have a related Equity Account, but they are at the same level. As we mentioned before, these entities in the relationship do not have different cardinality.
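Here is a hedged sketch of the two options using the Financial Account example (the attribute names are invented for illustration):

```sql
-- Option 1: one generic (master) table plus one table per specific type,
-- with a type code on the master saying which specific row applies
CREATE TABLE financial_account (
    account_id        INTEGER PRIMARY KEY,
    account_type_code VARCHAR(10) NOT NULL,   -- 'EQUITY' or 'FIXED_INCOME'
    open_date         DATE NOT NULL           -- applies to all types
);

CREATE TABLE equity_account (
    account_id    INTEGER PRIMARY KEY REFERENCES financial_account,
    exchange_code VARCHAR(10)                 -- equity-specific attribute
);

CREATE TABLE fixed_income_account (
    account_id    INTEGER PRIMARY KEY REFERENCES financial_account,
    maturity_date DATE                        -- fixed-income-specific attribute
);

-- Option 2: a single table with nullable, type-dependent columns
CREATE TABLE financial_account_single (
    account_id        INTEGER PRIMARY KEY,
    account_type_code VARCHAR(10) NOT NULL,
    open_date         DATE NOT NULL,
    exchange_code     VARCHAR(10),            -- only when type = 'EQUITY'
    maturity_date     DATE                    -- only when type = 'FIXED_INCOME'
);
```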

My Experience

My experience is that Data Models and Object Models are typically the same when we are discussing Parent/Child entities and relationships. Differences of opinion occur with the Generic/Specific entities and relationships. Since they do not have different cardinality, they are very easy and natural to represent as a single object.

  • From a data modeller's point of view, this practice creates data entities with embedded data knowledge. No longer does the structure of the data represent the true nature of the data. Depending on how many Specific Types and attributes exist, this can cause considerable complexity. The design also breaks the rule of Second Normal Form.
  • From an object modeller's point of view, this practice creates an object that can be re-used and leveraged. Rather than needing to create separate objects, I create one object! Isn't this just Polymorphism?

Isn’t this only a problem depending on what the database is being used for? Since I am a data modeller, I think it is a problem no matter what the database is used for; the issue is just how large a problem it is. There is less of an issue if it is an application database with a business layer that can encapsulate this complexity and shield it from the consumers of the data. It is a much larger issue if the design is for an Operational Data Store or Reporting Database.

Summary

There are valid reasons on why designs and models should differ between objects and tables. This is especially true when analysing Generic/Specific relationships. IMHO.

Data Modeling mistake – Violating 4th Normal Form

As I read David C. Hay’s awesome book on Data Model Patterns, I am starting to realize the mistakes I have made creating certain Data Models in the past. In particular, one mistake that I made repeatedly became very apparent.


Looking back in some of the Data Models I have created, I would say they have been pretty good and for the most part modeled the data in the enterprise pretty well. But in a couple of cases, we did create some components of the models that in hindsight leave me somewhat regretful.

The one situation that caused me to reflect was the solution we used to model the relationships between three primary entities and addresses. Each of these relationships was a many-to-many relationship, so an intermediate table was certainly required. In addition, we also needed to define an attribute for the intended purpose of the address. As this was an attribute of the relationship and not of the address itself, this was another reason a relationship table was required. So far, so good.

What we eventually decided on was the following structure as an address could be related to an account OR an instruction OR a client.

  • contact_method_id (PK)
  • Address_id (FK)
  • Account_id(FK) – nullable
  • Client_id(FK) – nullable
  • Instruction_id(FK) – nullable
  • Address fields (no sense listing them all out)

We thought it would be simpler to have one table that contained all the relationships to Address. Basically, we thought it was simpler from the Data Modeler's point of view and made the diagram simpler. But what we underestimated is that it required the developers and business users to understand a more complicated model and to know when the different nullable foreign keys needed to be populated. In a sense, we built additional logic into the data model in order to reduce the number of tables.
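In hindsight, a structure closer to what we should have built might look like the following sketch (names are illustrative; the purpose attribute stays on each relationship, where it belongs):

```sql
-- One relationship table per independent relationship, instead of one
-- table with three nullable foreign keys
CREATE TABLE account_address (
    account_id   INTEGER NOT NULL,
    address_id   INTEGER NOT NULL,
    purpose_code VARCHAR(10) NOT NULL,   -- purpose of the address for this account
    PRIMARY KEY (account_id, address_id, purpose_code)
);

CREATE TABLE client_address (
    client_id    INTEGER NOT NULL,
    address_id   INTEGER NOT NULL,
    purpose_code VARCHAR(10) NOT NULL,
    PRIMARY KEY (client_id, address_id, purpose_code)
);

CREATE TABLE instruction_address (
    instruction_id INTEGER NOT NULL,
    address_id     INTEGER NOT NULL,
    purpose_code   VARCHAR(10) NOT NULL,
    PRIMARY KEY (instruction_id, address_id, purpose_code)
);
-- More tables, but no nullable foreign keys and no hidden rules about
-- which combination of columns is allowed to be populated
```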

4th Normal Form

What we were doing was violating 4th Normal Form for the sake of perceived simplicity. In hindsight, I experienced first-hand the problems that violating 4th Normal Form can cause.

4th Normal Form is defined as: “Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity.”

All my Data Models are now in 4th Normal Form. They may have more entities, but more entities do not mean a more complex model. My standard now is for my Relational models to start in 4th Normal Form, not just 3rd Normal Form.