Business Case for creating a Data Model and Data Normalization

I recently came across a great article by @datachick on “Why having a Data Model is important?” on DataVersity. Highly recommended read.

Karen Lopez provides a great list of items on why she finds a Data Model helpful. I do believe there is something missing from the list of items though. All of the items mostly describe reasons from an IT point of view.

Here is Karen’s list on why you should create a data model:

  • Technologies used to move and persist data come in many forms, over time and at the same time. I support the creation of XML Documents, multiple DBMSs, multiple versions of the same DBMS. By having a single logical data model of data requirements, I can separate the rules and definitions of the requirements from their implementation.
  • The same data exists in many platforms, in many locations.  I need to be able to map those sources to targets across platforms and systems.  Doing this mapping, or data lineage, with a data model is much easier.  It also helps me understand the implications of making a change in one system on another system.  I can’t do that by just looking at code.
  • Writing stuff down is often a good way to impress people that have to provide requirements, compliance issues, security requirements, etc. They like not having to answer the same question over and over for different people and roles.  I can’t tell you how many times a business person has asked me if IT ever writes anything down.  Imagine your frustration if you had someone working on your house and ever tradesperson started their day with “Tell me about what sort of house you want? How many people will live here? Do you want to be able go get to the second floor? Do you need a bathroom?  How about a shower?”  That’s how many business professionals see us in IT.
  • A data model is a great way of capturing rules, constraints, definitions in a method that is technology independent. I can capture those things once and reference them in many places.
  • We can measure databases against our data models to assess fit. This can be done for application packages and custom development.
  • Enterprise data is complex.  I work with a data model that has 32,000 objects (tables, columns, datatypes, constraints, etc.) in it.  There is no way I could professionally manage change by just trying to remember all this information.  Nor could anyone else.
  • Modeling helps you ask the right questions before a bunch of time is spent coding, creating screens, reports, etc. This reduces costs and the number of bug fixes required.
  • Like all models, the data model is a communication tool and is good for tying requirements directly to designs and implementations.  This is especially true when I generate data prototypes based on the model.
  • Data governance can’t easily be done via reverse-engineered pictures of databases.
  • Once you’ve worked with a great data model, you can’t go back.

I agree with all of Karen’s points and I’d even like to add a few of my favourite benefits from an IT point of view as well.

  • I find HUGE benefits in being able to abstract domains across my entire data model so that changes to domains in the future can be done easily and ripple through the entire data model. The use of domains also ensures that my design is consistent and coherent across all the hundreds and thousands of tables in my data models.
  • I find HUGE benefits in being able to version my data model alongside the application code so I can revert to any point during the entire project.
  • I find HUGE benefits in helping me think through the problem. The danger with documents is they don’t have to be holistic and complete. You can start a document and describe functionality and relationships but you could forget to describe a couple of relationships without people noticing. But once you transform that description into a diagram, be it a data model or another diagram, orphans and missed relationships become very apparent. I like data models because they help me ensure my solution is correct, consistent, and complete.

The Business Case

One thing I always find fascinating is how hard it is to discuss and gain agreement from clients on why you need a data model and need to have a nicely normalized data model. Everyone in the data world takes this as just a fact, but how can you convince a client that they should spend the months of person-time to create one? Why should they spend tens of thousands of dollars? What is the business benefit?

I believe it comes down to two basic business benefits

1) The Data Model enables communication between IT professionals and the clients using visual means – As we are learning in all areas of the business world, a picture is worth a thousand words. We could create paragraphs upon paragraphs on how we believe the client’s business is structured and we would be wrong every time. We will still be wrong with a data model, but we will be less wrong as clients find it easier to point out the relationships we have misunderstood.

2) Normalization promotes and enables future business flexibility – A normalized database promotes a design that is highly cohesive and loosely coupled. Why does this matter? To the business it matters greatly! It means we can adapt to new realities in the business world quickly and easily. That means quicker time to market and less budget required for new functionality.

  • Need to integrate a new line of business? No Problem.
  • Need to accommodate a new language in the application? No Problem.
  • Need to roll out enhancements to keep up with the competition? No Problem.

Summary

If you agree with the benefits above, a Data Model is just the most efficient way of enabling these benefits. Especially with free tools like Oracle’s SQL Developer Data Modeler. With these free tools, you no longer need a large investment to start using them.

So why do I create a Data Model and normalize my data designs? Because they make good business sense and return benefits far exceeding the cost of creating and maintaining them.

Why do we #DataModel at all?

People in the Database world take Normalization and Data Modeling as something that should be done without question. I compare it to best practices like versioning software. No one expects that anyone would create software without version control anymore.But more often recently I do get questioned and challenged on why we need to normalize and model data. Is it even required with the cheap disk space, memory, and server capacity available ?

According to Wikipedia and others, the objective of normalization is:

“Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.”

The rules of Normalization

Normal form

Brief definition

Violation

1NF A relation is in 1NF if and only if all underlying domains contain scalar values only First Normal Form is violated when the table contains repeating groups
2NF A relation is in 2NF if and only if it is in 1NF and every non-key attribute is irreducibly dependent on the primary key – every column must depend solely on the primary key Second Normal Form is violated when a non-key field is a fact about a subset of a key
3NF A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. Third normal form is violated when a non-key field is a fact about another non-key field
4NF Relation R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the (nontrivial) MVD A->>B is satisfied, then all attributes of R are also functionally dependent on A. Fourth Normal Form is violated when a table contains two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is “functionally dependent” on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can’t be two records having the same X value.

The Questions

Now that we have reviewed the objectives and rules of normalization, let us summarize. The objective of Normalization is to:

  1. Minimize Redundancy
  2. Minimize Dependency

But what if we have extra storage available and storing redundant copies of data is not a problem? In fact, it probably will speed up our query response time. What if we also don’t require frequent modification of data so that having de-normalized data won’t result in update or deletion anomalies caused by excessive data dependency? Why should we still model our data and normalize?

The three reasons to Data Model

Simplicity, Consistency and Integrity, and Future Flexibility.

Simplicity

Every one of the violations mentioned above would require extra code and extra unit tests to validate proper functioning. Depending on the amount of violations, this can become a severe amount of technical debt that will needlessly be required in the software. There is an entire movement dedicated to the elimination of If statements. (www.antiifcampaign.com) Software that is Data Driven rather than Condition Driven is simpler and easier to maintain over the life of the application.

An application that is Data Driven can also automate the creation of their test cases to validate proper functioning of the application. This combined with the enhanced simplicity greatly adds to the quality of the application.

Consistency and Integrity

Even if the solution being modeled can accommodate redundant data and has the potential for minimal update and deletion anomalies currently, significant risk is being assumed by having these potential situations in your data model. How can you ensure that redundant data will be kept in sync and that update and deletion anomalies do not get introduced in the future as people and requirements change? Either this is through additional software development code or by additional processes and latent knowledge in resident experts. Neither of these situations are a good use of time and energy.

This is an example of an application-centric view of the data model. Unfortunately, not all activity on the Data Model can be guaranteed to always go through the application. Data Fixes, Conversions, and enhancements all have the ability to bypass the application’s business logic and compromise the integrity of the data. All it takes is one high value client with inaccurate or inconsistent data to irreparably harm a company’s reputation and revenue stream.

Future Flexibility

Solutions that are data driven and do not have excessive functional dependencies are much easier to evolve in the future. For example, I may have a business requirement to split one account type or combine account types. This type of conversion will be quite routine if I have modeled my data properly and minimized dependencies. If not, the conversion can be quite convoluted and I will probably need to evaluate code before I can determine the implications of making such a change. Then I have to be sure I address and update all the redundant code throughout the application. Just because the situation doesn’t exist currently with update and deletion anomalies doesn’t mean those situations won’t happen in the future.  

In addition, these changes to split or combine account types would probably also require code changes. If the solution was Data Driven, the possibility of these code changes would be minimized. (not to say they would never be required, but the probability of code changes would be minimized)

Summary

A well designed application and user interface will be able to be used with minimal training. It just makes sense and models the clients current processes and tasks.

A well designed data model should also have the same intuitive qualities. It also makes sense and models the business’s data. Not how the application functions, but how the business exists. Modeling the data in this manner minimizes work currently to work with the data and in the future to accommodate change.

In Object Oriented parlance, the Data Model itself should be loosely coupled with high cohesion. Both the Object Model and Data Model should share this characteristic. (Although they will implement it in quite distinct ways)