#Data #Hoarding and #Norm-aholics

Are you a sufferer of Data Hoarding? This is an affliction that affects thousands of businesses every day. Common symptoms are terabytes of data being kept around just in case you ever need them. The data ends up being stored in a multitude of disconnected tables just piled on top of one another with no organization. When a new piece of data is thought to be needed, another table is created and thrown on the pile.

The data is kept around in case it is ever needed, but good luck finding it in the future when it is buried under piles of other data. Now where did I put that snapshot of the ledger balances from March 2011?

Even worse, most new applications create an isolated database to hold the data that the application requires. It doesn't matter that 30-40% of the data already exists somewhere; it is too hard to integrate and create one common data source. Much easier to copy the data once and then modify it as the application desires.

It doesn’t have to be this way.

Data Hoarders Anonymous

At Data Hoarders Anonymous we work with you and take you through a seven-step program.

Phase I – Repent

1) Admit you have a problem

The first step to any recovery is admitting you have a problem. Look around at your databases: do you have a data dictionary? Do you have 20+ tables with the same name? Have you copied tables directly into a database without doing any analysis or integration?

2) Understand the Enterprise’s perspective

Data is an asset for the enterprise, but only if it is easy to understand. That is the key: the enterprise must be able to get and understand the data easily. Typically, when data has been replicated across god's green earth, there are multiple copies of the truth and they are anything but easy to find. The data is worthless if it isn't easily understood.

3) Apologize to the developers you have made extract data from multiple complex, disconnected schemas

Take donuts with you. Apologize to the developers you have made write programs that extract data from multiple sources and somehow make sense of it all. SQL should not be a language that requires a PhD to understand. If your SQL is longer than a screen, you probably have a Data Hoarding problem.

4) Apologize to the business users you have made explain inconsistent data to vice-presidents

Take alcohol with you and apologize to the business users who don't know what is included in the allocated income calculation or which table actually has the true, up-to-date value. Apologize for not having a Data Dictionary available to them that would tell them where the data exists and what it actually means.

Phase II – Recovery

Now that you have repented, you can move onto the recovery portion…

5) Organize your data with Data Normalization

Data Normalization is your new friend. Learn the rules and let them assist you in removing data redundancy and enforcing consistency and integrity. But be very careful. Many on the recovery path from Data Hoarding end up becoming 'Norm-aholics'. They normalize to at least third normal form every time, even when it creates excessively complex schemas for the developers and data that can't be accessed by business users. Even worse, 'Norm-aholics' have no self-awareness that they have an addiction. Never let the rules of Data Normalization undo the progress you have made. Sadly, many DBAs fall into this trap and let the theory of Data Normalization get in the way of understanding the data.

Data Normalization must increase the understanding of the data; if it does not, Data Normalization has been corrupted.

6) Simplify your data with Dimensional Analysis

Now that you have used Data Normalization to remove data redundancy and maximize consistency and integrity, the time has come to use the tools of Dimensional Analysis to ensure the data design is simple. Use Dimensional Analysis to ensure that you haven't traded one type of complexity for another. Evaluate your data across time and Master Data Dimensions to ensure that the transaction concepts are consistent and simple.
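To make the shape of that simplicity concrete, here is a minimal star-schema sketch. The table and column names (fact_transaction, dim_date, dim_account) are hypothetical, not taken from any real model; it only illustrates the kind of structure Dimensional Analysis tends to push you toward.

```sql
-- Hypothetical star schema: one fact table surrounded by simple dimensions.
CREATE TABLE dim_date (
    date_key        INTEGER PRIMARY KEY,   -- e.g. 20110331
    calendar_date   DATE NOT NULL,
    fiscal_period   VARCHAR(10) NOT NULL
);

CREATE TABLE dim_account (
    account_key     INTEGER PRIMARY KEY,
    account_number  VARCHAR(20) NOT NULL,
    account_type    VARCHAR(30) NOT NULL   -- a master data dimension
);

-- Each row is one transaction measured against the dimensions above.
CREATE TABLE fact_transaction (
    date_key        INTEGER NOT NULL REFERENCES dim_date (date_key),
    account_key     INTEGER NOT NULL REFERENCES dim_account (account_key),
    transaction_amt DECIMAL(15,2) NOT NULL
);
```

Most business questions then become one join per dimension, which is a reasonable working test of "simple".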

7) There can be only one

At the last step, you will need to embrace the principle that there can be only one…

  • One owner and source of the data – The data must be owned by one business area and system
  • One place to update the data – the data must be updated in only one system that the owner controls
  • One way to get the data – there cannot be multiple query paths to retrieve the required data
  • One truth – the data must have only one meaning and this meaning must be documented in a Data Dictionary (a rough sketch of what such a dictionary could look like follows this list)
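As a rough illustration of the last point, a Data Dictionary does not need to be elaborate. The hypothetical sketch below keeps one row per column, with the single owner, single source system, and single documented meaning right beside it.

```sql
-- Hypothetical, minimal data dictionary kept alongside the data it describes.
CREATE TABLE data_dictionary (
    table_name      VARCHAR(100) NOT NULL,
    column_name     VARCHAR(100) NOT NULL,
    business_owner  VARCHAR(100) NOT NULL,  -- the one owning business area
    source_system   VARCHAR(100) NOT NULL,  -- the one system that updates it
    definition      VARCHAR(4000) NOT NULL, -- the one documented meaning
    PRIMARY KEY (table_name, column_name)
);
```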

Coming to you soon on A&E – Data Hoarders Anonymous


Why #Dimensional Analysis should be done on every #datamodel

Those of you who have worked with me, know of my fondness for Operational Data stores. I have always believed in the importance of having an enterprise or holistic view of the data requirements for every application. An Operational Data Store seemed to be the perfect vehicle to ensure this happened. Perhaps my fondness was related to not wanting to stray too far from the normalization rules that I knew quite well. In this way, it was a new-ish discipline or context that really wasn’t new.

I always looked kinda sideways at those weird Dimensional modelers with their Star Schemas and Snowflakes. I mean, if they really put their minds to it, they would be able to figure out how to solve their data needs with a nice relational, normalized Operational Data Store. Only exceptional and massive amounts of data require the Dimensional modeling constructs that these models typically use, right? I mean, what is so complicated about a model with only 100 main tables? Shouldn't everyone know how to write SQL by hand?

On my latest project, I have had the opportunity to become re-introduced to Dimensional Analysis and modeling and I have found the process fascinating and very valuable. Besides the obvious benefits that are being realized by being able to model the data in a way that allows the clients to efficiently write and execute queries, there was an unexpected benefit.

Taking a normalized Data Model and attempting to translate it into a Dimensional Model really challenges and validates your data model. It is far easier to create a model with a multitude of complex relationships than it is to distill it down to a handful of FACTs and Dimensions. With so many relationships, it is possible for inconsistencies to exist and hide in the data model. I found multiple modeling errors in the process of trying to create a Dimensional model from my relational model. When you distill a relational model down to a Dimensional model, inconsistencies and errors become very apparent in the creation of the FACTs and Dimensions.

Dimensional Analysis also forces you to look at the data in a different way. Instead of a relational/hierarchical way, I find it forces me to look at the data in a chronological way and to consider data changes, data history, and data latency in ways I may not have considered before. A model that doesn't have to account for data across time and verify consistency at every point is quite a bit simpler.

Summary

I am a convert to using Dimensional Analysis on all my data models for validation of the data model and additional analysis of the data. I've discovered that I need to understand the data better to create a Dimensional Model than to create a normalized model. More factors need to be considered, and creating the Dimensional model with fewer objects requires that the data model have greater consistency, integrity, and cohesion.

Simple is hard. 

 

Why do we #DataModel at all?

People in the Database world take Normalization and Data Modeling as something that should be done without question. I compare it to best practices like versioning software: no one expects that anyone would create software without version control anymore. But more often recently I get questioned and challenged on why we need to normalize and model data. Is it even required with the cheap disk space, memory, and server capacity available?

According to Wikipedia and others, the objective of normalization is:

“Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.”

The rules of Normalization

  • 1NF – Definition: A relation is in 1NF if and only if all underlying domains contain scalar values only. Violation: the table contains repeating groups.
  • 2NF – Definition: A relation is in 2NF if and only if it is in 1NF and every non-key attribute is irreducibly dependent on the primary key (every column must depend solely on the primary key). Violation: a non-key field is a fact about a subset of a key.
  • 3NF – Definition: A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. Violation: a non-key field is a fact about another non-key field.
  • 4NF – Definition: Relation R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the (nontrivial) MVD A->>B is satisfied, all attributes of R are also functionally dependent on A. Violation: the table contains two or more independent multi-valued facts about an entity (the record must also satisfy Third Normal Form).

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is “functionally dependent” on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can’t be two records having the same X value.
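To ground the 2NF and 3NF rows of the table above, here is a hedged sketch using a hypothetical order_line table; the names are mine, purely for illustration.

```sql
-- Violates 2NF and 3NF: product_name depends only on product_id (part of the key),
-- and customer_city depends on customer_id, which is itself a non-key field.
CREATE TABLE order_line_denormalized (
    order_id        INTEGER NOT NULL,
    product_id      INTEGER NOT NULL,
    quantity        INTEGER NOT NULL,
    product_name    VARCHAR(100),   -- fact about a subset of the key (2NF violation)
    customer_id     INTEGER,
    customer_city   VARCHAR(100),   -- fact about another non-key field (3NF violation)
    PRIMARY KEY (order_id, product_id)
);

-- Normalized: each fact lives in exactly one place.
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, product_name  VARCHAR(100) NOT NULL);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_city VARCHAR(100) NOT NULL);
CREATE TABLE orders   (order_id    INTEGER PRIMARY KEY,
                       customer_id INTEGER NOT NULL REFERENCES customer (customer_id));
CREATE TABLE order_line (
    order_id   INTEGER NOT NULL REFERENCES orders (order_id),
    product_id INTEGER NOT NULL REFERENCES product (product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
```

In the denormalized version, product_name and customer_city have to be updated in many rows to stay correct; in the normalized version, each fact is stored once and propagated through the defined relationships.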

The Questions

Now that we have reviewed the objectives and rules of normalization, let us summarize. The objective of Normalization is to:

  1. Minimize Redundancy
  2. Minimize Dependency

But what if we have extra storage available and storing redundant copies of data is not a problem? In fact, it probably will speed up our query response time. What if we also don’t require frequent modification of data so that having de-normalized data won’t result in update or deletion anomalies caused by excessive data dependency? Why should we still model our data and normalize?

The three reasons to Data Model

Simplicity, Consistency and Integrity, and Future Flexibility.

Simplicity

Every one of the violations mentioned above would require extra code and extra unit tests to validate proper functioning. Depending on the number of violations, this can become a significant amount of technical debt needlessly built into the software. There is an entire movement dedicated to the elimination of If statements (www.antiifcampaign.com). Software that is Data Driven rather than Condition Driven is simpler and easier to maintain over the life of the application.

An application that is Data Driven can also automate the creation of its test cases to validate proper functioning of the application. This, combined with the enhanced simplicity, greatly adds to the quality of the application.
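As a hedged illustration of Data Driven versus Condition Driven, compare hard-coding the business rules in the query with reading them from a reference table. The table and column names (account, account_type_rate) and the rates are invented for the example.

```sql
-- Condition driven: every new account type means a code change and a new test.
SELECT account_id,
       CASE account_type
            WHEN 'SAVINGS'  THEN balance * 0.02
            WHEN 'CHEQUING' THEN balance * 0.00
            WHEN 'PREMIUM'  THEN balance * 0.03
       END AS interest_amount
FROM   account;

-- Data driven: adding or changing an account type is a row, not a release.
SELECT a.account_id,
       a.balance * r.interest_rate AS interest_amount
FROM   account a
JOIN   account_type_rate r ON r.account_type = a.account_type;
```

In the second form a new account type is a new row in account_type_rate; no new branch to write and no new branch to unit test.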

Consistency and Integrity

Even if the solution being modeled can accommodate redundant data and has the potential for minimal update and deletion anomalies currently, significant risk is being assumed by having these potential situations in your data model. How can you ensure that redundant data will be kept in sync and that update and deletion anomalies do not get introduced in the future as people and requirements change? Either this is done through additional software development code or through additional processes and latent knowledge held by resident experts. Neither of these situations is a good use of time and energy.

This is an example of an application-centric view of the data model. Unfortunately, not all activity on the Data Model can be guaranteed to always go through the application. Data Fixes, Conversions, and enhancements all have the ability to bypass the application’s business logic and compromise the integrity of the data. All it takes is one high value client with inaccurate or inconsistent data to irreparably harm a company’s reputation and revenue stream.
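One common mitigation, sketched below with hypothetical names, is to declare the core rules in the database itself so that data fixes and conversions that bypass the application still cannot violate them.

```sql
-- Declarative rules that hold even when a data fix bypasses the application layer.
ALTER TABLE account
    ADD CONSTRAINT fk_account_client
        FOREIGN KEY (client_id) REFERENCES client (client_id);

ALTER TABLE account
    ADD CONSTRAINT ck_account_status
        CHECK (account_status IN ('OPEN', 'CLOSED', 'SUSPENDED'));
```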

Future Flexibility

Solutions that are data driven and do not have excessive functional dependencies are much easier to evolve in the future. For example, I may have a business requirement to split one account type or combine account types. This type of conversion will be quite routine if I have modeled my data properly and minimized dependencies. If not, the conversion can be quite convoluted and I will probably need to evaluate code before I can determine the implications of making such a change. Then I have to be sure I address and update all the redundant code throughout the application. Just because the situation doesn’t exist currently with update and deletion anomalies doesn’t mean those situations won’t happen in the future.  

In addition, these changes to split or combine account types would probably also require code changes. If the solution was Data Driven, the possibility of these code changes would be minimized. (not to say they would never be required, but the probability of code changes would be minimized)
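Continuing the hypothetical reference-table sketch from the Simplicity section, splitting an account type in a Data Driven design is mostly a data conversion rather than a code change.

```sql
-- Introduce the new type as reference data...
INSERT INTO account_type_rate (account_type, interest_rate)
VALUES ('PREMIUM_PLUS', 0.035);

-- ...then move the affected accounts over with an ordinary conversion script.
UPDATE account
SET    account_type = 'PREMIUM_PLUS'
WHERE  account_type = 'PREMIUM'
AND    balance      > 100000;
```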

Summary

A well designed application and user interface will be able to be used with minimal training. It just makes sense and models the client's current processes and tasks.

A well designed data model should have the same intuitive qualities. It also makes sense and models the business's data: not how the application functions, but how the business exists. Modeling the data in this manner minimizes the work needed now to work with the data and the work needed in the future to accommodate change.

In Object Oriented parlance, the Data Model itself should be loosely coupled with high cohesion. Both the Object Model and Data Model should share this characteristic. (Although they will implement it in quite distinct ways)

Object Model and Data Model differences – Embrace the diversity

In my experience there is a distinctive difference in Data Models when they are created by developers with an Object Model mindset. Usually there is some work and negotiation that needs to be done to properly overcome the Object Model-Data Model impedance problem. I believe that both extreme points lead to less than optimal designs and architectures:

1) Implementing an Object Model directly into a Data Model

2) Implementing a Data Model directly into an Object Model

Both models have a rationale and logic for how they structure components, and sometimes that doesn't translate well from one to the other. There are good reasons why they differ.

Recently I've had a situation where I've come up against these differences. They have arisen in regard to how each camp sometimes models Super/Subtype tables (and, subsequently, reference tables).

Definitions

I am using the term Super/Subtype tables to refer to a collection of tables that represent a Generic/Specific relationship. Unlike Parent/Child relationships, which represent a relationship with a hierarchy, the Generic/Specific entities are at the same entity level. Perhaps a few examples will assist the discussion:

Parent/Child

Typical Parent/Child relationships represent 'belonging' or 'ownership'. Examples of this type of relationship would be:

  • Person/Address
  • Account/Transaction

In these examples, there is a clear understanding of precedence and hierarchy. You obviously need a Person before you can have a related Address. Because there is a Parent/Child relationship and hierarchy, there is also usually a clearer distinction between the types. In many cases, this is because the two entities have different cardinality. Even the most fervent object modeller will create separate objects if they have different cardinality.

Generic/Specific

Generic/Specific relationships do not represent 'belonging' or 'ownership'. In addition, they typically do not have different cardinality. These relationships represent 'refinement' or a type of 'Master-Detail' relationship, with one Master entity and multiple Specific entities. To be a Generic/Specific relationship, there must be attributes at the Master level that apply to all Specific types, and attributes at each Specific type that apply only to that Specific type. (Which is why the Specific entities exist; otherwise we would just keep the attributes at the Master level.)

Usually these relationships are created in the Data Model in one of two ways:

1) Multiple tables – one Master table with multiple specific Detail tables (and a type code on the Generic entity to define which related Specific entity applies for each row)

2) One table with nullable fields that are populated based upon the type code of the row

Examples of this type of relationship are:

  • Financial Account – [Equity Account, Fixed Income Account]
  • Coverage – [Maximum, Deductible, Co-insurance]

In these examples, there is not a pronounced hierarchy. You obviously need a Financial Account before you can have a related Equity Account, but they are at the same level. As we mentioned before, these entities in the relationship do not have different cardinality.
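For illustration, here is a hedged SQL sketch of the two options above using the Financial Account example; the column names are invented.

```sql
-- Option 1: one Generic table plus one table per Specific type.
CREATE TABLE financial_account (
    account_id    INTEGER PRIMARY KEY,
    account_type  VARCHAR(20) NOT NULL,   -- type code naming the Specific table that applies
    open_date     DATE NOT NULL           -- attributes common to all types
);
CREATE TABLE equity_account (
    account_id             INTEGER PRIMARY KEY REFERENCES financial_account (account_id),
    dividend_reinvest_flag CHAR(1) NOT NULL
);
CREATE TABLE fixed_income_account (
    account_id    INTEGER PRIMARY KEY REFERENCES financial_account (account_id),
    maturity_date DATE NOT NULL
);

-- Option 2: one table with nullable, type-dependent columns.
CREATE TABLE financial_account_single (
    account_id             INTEGER PRIMARY KEY,
    account_type           VARCHAR(20) NOT NULL,
    open_date              DATE NOT NULL,
    dividend_reinvest_flag CHAR(1),   -- populated only when the type is equity
    maturity_date          DATE       -- populated only when the type is fixed income
);
```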

My Experience

My experience is that Data Models and Object Models are typically the same when we are discussing Parent/Child entities and relationships. Differences of opinions occur with these Generic/Specific entities and relationships. Since they do not have different cardinality, they are very easy and natural to represent as an object.

  • From a data modeller point of view, this practice creates data entities with embedded data knowledge. No longer does the structure of the data represent the true nature of the data. Depending on how many Specific Types and attributes exist, this can cause considerable complexity. The design also breaks the rule of Second Normal Form.
  • From an object modeller point of view, this practice creates an object that can be re-used and leveraged. Rather than needing to create separate objects, I create one object! Isn’t this just Polymorphism?

Isn’t this only a problem depending on what the database is being used for? Since I am a data modeller, I think it is a problem no matter what the database is used for. I guess the issue is just how large of a problem it is. There is less of an issue if it is an application database that has a business layer that can incorporate this complexity and  shield it from the consumers of the data. This may be a much larger issue if this design is for an Operational Data Store or Reporting Database.

Summary

There are valid reasons why designs and models should differ between objects and tables. This is especially true when analysing Generic/Specific relationships. IMHO.

How an #Agile Data Warehouse leveraged an Innovation Game #agile2012 #sdec2012

On my latest project I have been doing a lot of research and learning on how to conduct a Data Warehouse project in an Agile manner. Actually I’m really honoured to be able to combine my two passions in my career: Data Modeling and Agile.

This investigation has led me to what I feel is a unique approach to an Agile Data Warehouse project. Many of the Agile Data Warehouse projects I hear of and read about typically seem to follow a somewhat standard process. This process usually involves building a Data Warehouse in increments by working with departments within the corporation to understand their data requirements (in many situations, a very detailed analysis of those requirements). In this way, the Data Warehouse is built up over time until an Enterprise Data Warehouse is created. While this is a far cry better than trying to analyze the data requirements for the entire corporation, I do believe there are some limitations with this approach.

Limitations

1) If the Data Modelers are not experts in the existing data domain, this process may involve significant rework as data modeled for one department may need to be modified as new information is learned and existing data structures need to be integrated with data structures designed in subsequent releases. This risk can be significantly reduced if the data modelers are experts in the data domain, but this can be a real risk for consultants that are not experts.

2) We are still not implementing a Data Warehouse in an iterative fashion, only incrementally. While this is a step in the right direction, it is still falling short of being able to implement a Data Warehouse iteratively and get even quicker feedback. Since we are also essentially time-boxing implementing data requirements on a department by department basis, we are also only working on the most important data requirements on a departmental basis. We are not working on the prioritized data requirements for the entire corporation. If every department receives the same time allocation, we may work on some data requirements while other more important ones are left in the backlog. And finally, the data requirement for each department can get quite detailed. In many ways, we are still doing Big Design Up Front – in chunks.

Our Approach

My background in Data Modeling has always been aligned with Bill Inmon's thoughts rather than Ralph Kimball's. So, to no surprise, I felt I needed a good understanding of the data domain for the Enterprise before I could start to model the data. To address this I proposed we create an Agile Enterprise Data Model. An Agile Enterprise Data Model is an Enterprise Data Model that takes 6 weeks to create instead of 6 months (or 6 years). Its purpose is to validate the entities, relationships, and primary attributes. It is not intended to drill down into every single attribute and ensure alignment. In my experience, this excessive detail has derailed other Enterprise Data Model projects (as consensus cannot be reached or the data is always evolving). But in creating an Agile Enterprise Data Model, we now understand enough of the enterprise's data that even though we may only be modeling one department's data, we know what integrations and relationships will be required in the future.

I felt this addressed the first limitation.

The second limitation was much harder to address. It was difficult to engage clients in small slices or iterations of their data requirements in a lightweight fashion that wouldn't require reviewing their current data requirements in detail. How could we get a picture of the data requirements for the corporation at a high level? Many of the clients were concerned that if the project operated in iterations they would not get all of their requirements satisfied. In addition, our current project had a list of existing reports that the clients would like migrated to the new environment. How could we engage our clients and work with them to determine what their corporation's data requirements were in an iterative, lightweight manner? How could we provide visibility into what the entire corporation's data requirements were?

I turned to Innovation Games.

I am a fan of Innovation Games and any lightweight process that can help to visualize data and engage the clients better. I was searching for a variant of the 'Prune the Product Tree' game to use. But I didn't feel that just placing data requirements on the Product Tree would truly help the engagement of the client and provide the visibility required. What we really needed was a variant of the Product Tree that helped the clients 'see' their data requirements, confirmed that we got them right, and showed the requirements for the entire corporation on one wall.

We ended up merging a ‘Prune the Product Tree’ game with a ‘Spider Web’ game with some unique additions.

Here is what we are doing:

1) We are seeding the visualizations with 30-40 enterprise level reports that are used internally and externally.

2) We have asked for the top 20 reports from each department to further seed the visualizations. This will help to populate the visualization and I believe help the clients to generate other data requirements that we are missing.

3) We are capturing data requirements or reports in the simplest mode possible. We decided to leverage the Goal, Question, Metric method proposed by Victor Basili. Our data requirement stories will specify the following:

  • Noun : The object of primary interest
  • Question : The main inquiry about the noun
  • Reason : What is the business reason for the question
  • Frequency : How often do I need the answer?

An example is provided below:

  • Noun: Claims
  • Question: Amount > $1,000
  • Reason: To generate audits
  • Frequency: Monthly

We will capture these additional data requirement stories in Silent Brainstorming sessions with users. This Silent Brainstorming will occur after we present the visualizations and review the data requirements that have seeded the visualization.

The final piece of the puzzle

Even with those first three pieces, it was still unclear how we could visualize the data and reporting requirements and gain engagement with the clients. Then during an Innovation Games course that was educating us on how to customize and create our own game, it struck me. The Product Tree is the wrong metaphor for our data requirements. We need something that visualizes the data requirements themselves. We iterated through a spider web metaphor, to a 3 dimensional axis metaphor, until we ended up on a hexagon. I’d recommend you find the shape that best fits your data domain. The hex and its six dimensions fit our data domain nicely.

The hex diagram below is our customized Innovation Game we will use to seed data requirements and generate new requirements.

The Concept

Usually for all data requirements or reports there is a person object at the centre. This is no exception. We will have a data hex for each type of person that the corporation needs to know about. The six axes were chosen by understanding the data domain and by reviewing existing reports that are used by the client. The existing data requirements and reports will be translated into data requirement stories and placed on the hex. Sessions will be facilitated to add additional stories onto the data hexes.

Once all of the stories have been placed on the data hexes we will prioritize the stories by moving the data requirement stories based on priority. Data Requirement stories that are high priority will be moved near the centre. Stories of less priority will be moved to the outer rings.

The Rationale

Using this metaphor we are able to do the following:

1) Visualize the hundreds of data requirements and reports of the corporation at once and prioritize them.

2) Ensure we are working on the true data requirements across the entire corporation.

3) Work in a true iterative fashion in how we deliver data requirements and ultimately build the Data Warehouse iteratively.

4) Use a light weight method that limits detailed design up front

Summary

If you want to hear how this process worked, the results will be presented at Agile 2012 in Dallas and at SDEC2012 in Winnipeg! I’ll also have a subsequent blog post that publishes the results. So far the feedback has been encouraging and we will be expanding the use of the process over the next few weeks.

How #Oracle helped to make my Data Model #Agile

Many people who know me know that I have a certain fondness for tools that can help to make our jobs easier. I agree that a fool with a tool is still a fool, but I would argue that a craftsman with the right tool can be an even better craftsman. In essence, the problem isn't the tool, it is the person wielding the tool. So let's not be too quick to paint the use of all tools in a negative light.

Agile Data Modeling

It has always been a struggle to determine how we can manage our Data Models and Databases in an Agile way. In the past, the Data Modeling tools were limited in the functionality they provided that allowed both collaboration and versioning. Recently the Data Modeling tools have become much better at providing that functionality, but most of the Data Modeling tools I have seen have still provided that collaboration and versioning functionality in a proprietary way that was created by the Data Modeling vendor.

This has made the Data Model and Database operate as second-class citizens on Agile projects. Usually separate processes need to be created to determine how we will integrate the Data Model and database into Continuous Integration environments. I believe that some of the resistance of Database Modelers and Database Administrators to adopting Agile for Data Design activities is due to the fact that the tools at their disposal have placed them at an Agile disadvantage. Because of this, Data Modelers prefer the illusion of a more stable Data Model, since it potentially limits the number of SQL change scripts they need to generate (and thereby limits their technical debt). SQL change scripts still seem to be the primary way that people roll out changes to their databases. It is not uncommon for projects to have a huge number of these scripts that then themselves need to be maintained. I have seen a wide variety of ways to generate these scripts, varying from a Data Modeling tool generating them to a DBA writing them from scratch.

Having a large backlog of these scripts, generated in such ad hoc ways, is a huge hindrance to being Agile and accepting change. Having to then try to synchronize these scripts with check-in versions can be beyond frustrating. And I don't even want to talk about having to maintain all of these scripts and manually check them in.
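For readers who have not lived through this, a typical hand-maintained change script looks roughly like the hypothetical, Oracle-flavoured sketch below; now imagine dozens of these that must be kept in the right order and in step with every application check-in.

```sql
-- 042_add_broker_code_to_account.sql  (hypothetical example of a change script)
ALTER TABLE account ADD (broker_code VARCHAR2(10));

-- Backfill existing rows before tightening the column.
UPDATE account SET broker_code = 'UNKNOWN';

ALTER TABLE account MODIFY (broker_code NOT NULL);
```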

Agile Data Warehouse

For a recent project, we were tasked with creating a Data Warehouse and we needed to create the Data Warehouse in an Agile way. We needed to embrace Agility for two reasons:

  1. To deliver the maximum value to the client and deliver the best design that meets actual business needs
  2. To address the reality that we will be required to accept changes throughout the duration of the projects as sub-projects are completed

After reviewing Data Modeling tools that I have used in the past, I was still struggling with how we could develop a process that would allow us to easily adopt Agile practices for our Data Models and databases that would allow the Data Warehouse to be run as Agile as other projects.

Oracle SQL Developer Data Modeler

I did some initial investigation of what Data Modeling tools are now available. My objectives were to find an affordable tool that could be used to do the Data Modeling and that all developers could use to view the Data Models and collaborate on them. Most of the tools seemed to fall into two camps:

  1. Affordable Data Modeling tools that provided good collaboration but limited Data Modeling and integrated versioning functionality
  2. Expensive Data Modeling tools that provided good Data Modeling functionality but limited collaboration and integrated versioning functionality

Then on a list that reviewed the best Data Modeling tools, there was mention of Oracle SQL Developer Data Modeler. It was stated that it was a robust Data Modeling tool that was free. This was certainly something that needed further investigation.

Rarely does a new tool provide so much functionality that you need time just to figure out how best to use it. 🙂

The functionality that was provided with Oracle SQL Developer Data Modeler absolutely changed our project. We now had the ability to be a first class Agile citizen with the rest of the Agile projects and be able to adapt to change easily. We would be able to roll out changes easily and efficiently as new requirements became apparent.

Oracle SQL Developer Data Modeler provided the following functionality:

  • Full, integrated functionality with Subversion. The Data Models can be checked in and versioned along with the application code.
  • Ability to define a domain library, naming standards, and glossary to ensure a consistent model
  • Ability to apply design rules to a model to ensure Data Modeling standards are being followed
  • Ability to model Logical, Relational, Dimensional, and Physical Models
  • Ability to store DML to populate data along with each table definition. (and determine when they should be executed – at the start/before object creation/after object creation/at the end)
  • Use of Barker notation. (Pet Peeve!)
  • Extensive functionality to generate different types of models and compare different types of models
  • Full versioning functionality to report on the differences between the versions and what changes will be committed on check-in.
  • Robust Data Modeling functionality to create Data Models and create physical databases.

Did I mention it is free, so all the developers could use it to review the models and make requests?

Summary

Although we still need to use Database Re-factoring practices to adapt to changes, the versioning and collaboration features of Oracle SQL Developer Data Modeler did cause us to pause and determine how we could best use its functionality. In fact, it caused us to define a process for how we could efficiently roll out Re-factoring changes in an automated way as Oracle SQL Developer Data Modeler now made the Re-factoring manageable.

If you are looking for an Agile Data Modeling tool, I would highly recommend it!

Did I mention it is free? 🙂

And the support for the product from Oracle is top-notch. Questions I have had have been answered the same day.

Data Modeling mistake – Violating 4th Normal Form

As I read David C Hay's awesome book on Data Model Patterns, I am starting to realize the mistakes I have made creating certain Data Models in the past. In particular, one mistake that I made repeatedly became very apparent.

Data Modeling mistake – Violating 4th Normal Form

Looking back in some of the Data Models I have created, I would say they have been pretty good and for the most part modeled the data in the enterprise pretty well. But in a couple of cases, we did create some components of the models that in hindsight leave me somewhat regretful.

The one situation that caused me to reflect was the solution we used to model the relationships between three primary entities and addresses. Each of these relationships was a many-to-many relationship, so an intermediate table was certainly required. In addition, we also needed to define an attribute for the intended purpose of the address. As this was an attribute of the relationship and not the address itself, it was another requirement for a relationship table. So far, so good.

What we eventually decided on was the following structure as an address could be related to an account OR an instruction OR a client.

  • contact_method_id (PK)
  • Address_id (FK)
  • Account_id(FK) – nullable
  • Client_id(FK) – nullable
  • Instruction_id(FK) – nullable
  • Address fields (no sense listing them all out)

We thought it would be simpler having one table that contained all the relationships to Address. Basically, I think we thought that it was simpler from the Data Modeler's point of view and made the diagram simpler. But what we underestimated was that it required the developers and business users to understand a more complicated model and to know when the different nullable foreign keys needed to be populated. In a sense, we built additional logic into the data model to reduce the number of tables.
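In rough SQL terms (columns simplified from the list above, supporting tables assumed to exist), the design we chose and a 4NF-friendly alternative look something like this.

```sql
-- What we built: one relationship table carrying three independent relationships.
CREATE TABLE contact_method (
    contact_method_id INTEGER PRIMARY KEY,
    address_id        INTEGER NOT NULL REFERENCES address (address_id),
    account_id        INTEGER REFERENCES account (account_id),         -- nullable
    client_id         INTEGER REFERENCES client (client_id),           -- nullable
    instruction_id    INTEGER REFERENCES instruction (instruction_id), -- nullable
    address_purpose   VARCHAR(30) NOT NULL
);

-- The 4NF-friendly alternative: one relationship table per independent fact.
CREATE TABLE account_address (
    account_id      INTEGER NOT NULL REFERENCES account (account_id),
    address_id      INTEGER NOT NULL REFERENCES address (address_id),
    address_purpose VARCHAR(30) NOT NULL,
    PRIMARY KEY (account_id, address_id, address_purpose)
);
-- ...plus client_address and instruction_address built the same way.
```

More tables, but each one states exactly one relationship, and nobody has to remember which foreign keys are allowed to be null for which type of row.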

4th Normal Form

What we were doing was violating the 4th Normal Form for the reasons of perceived simplicity. In hindsight, I experienced the problems that violating 4th Normal Form can cause.

4th Normal Form is defined as: “Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity.”

All my Data Models now are in 4th Normal Form. They may have more entities, but more entities does not mean a more complex model. My standard now is for my Relational models to start in 4th Normal Form and not just 3rd Normal Form.

#Agile Data Modeling – still a ways to go

I have wanted to write a blog entry on Agile Data Modelling for a while now. It combines the two prime areas of interest for me, as I really started as a DBA/Data Architect and then moved on towards Project Management and Agile Project Management. But truly, Data Modelling has been and will always be my first true love. (Very appropriate that I am writing this article the day before Valentine's Day.)

I am on a project currently where I am struggling to not fall back into the traditional ways I have done Data Modelling in the past. Since almost all of my Data Modelling experience has been on more traditional projects, it is easy to fall back into that pattern. Thanks to Scott Ambler and Steve Rogalsky for reminding me of how we can continue to make Data Modelling more Agile.

More often than not, the areas of Database Design and Data Modelling have been among the most resistant to Agile methods. Recently I came across this blog post by Tom Haughey on the Erwin site:

Agile Development and Data Modeling

In some ways, I thought Tom was quite Agile in his preference to segment Data Modelling projects into 3-6 month phases or increments to help increase the chances of success. But other statements reminded me that we as Data Modellers still have a ways to go before we have joined the rest of the Agile team.

Some of the concerning statements were:

“Data modeling has always been performed in an iterative and incremental manner. The data model has always been expanded and enriched in a collaborative manner. In my 28 years of involvement in data management, no qualities of data modeling have been more consistently reiterated, not even non-redundancy. It is absurd to imply that traditional data modeling is done in one continuous act or that it is done all upfront by an isolated team without involving Subject Matter Experts and without sensible examination of requirements.”

By this same definition one could also say that all analysis has been iterative and incremental, which we know is incorrect. I believe the misunderstanding may lie in what people define as an iteration. Of course Data Models are iterated on as analysis and data design are done and more requirements are gathered. But is the data design part of an end-to-end iteration where a segment of the data model is promoted to production and used? Or is there a horizontal iteration of creating a high-level Enterprise Data Model before detailed data modeling is done? On almost all the projects I have been on, the answer is a resounding no. There usually is a big bang implementation of the data model to the developers after months of analysis. If anything, Data Modelling tends to be more incremental than iterative.

“In summary, traditional data modeling is incremental, evolutionary and collaborative (and thereby agile) in its own right.”

Being incremental, evolutionary, and collaborative doesn't necessarily make you Agile. I also don't know if you can ever achieve Agile as an end state. We are striving to be more Agile, and I don't believe the striving should ever end so that we can rest because we are Agile.

“The implications of Agile proponents like Scott Ambler is that “the traditional approach of creating a (nearly) complete set of logical and physical data models up front or ‘early’ isn’t going to work.” One issue with a statement like this is what does “up front” or “early” mean. He says that the main advantage of the traditional approach is “that it makes the job of the database administrator (DBA) much easier – the data schema is put into place early and that’s what people use.” Actually, the main advantages are that it is a clear expression of business information requirements plus developers have a stable base from which to work.”

This is the one statement that is perhaps most troubling. The desire to get a stable base for developers is very similar to trying to get a stable analysis base for developers. In my experience, Data Modelers can be perfectionists (like all great analysts) and they struggle with releasing something that is not fully done. But this goes against Agile. Just like other functionality, we should try data models early and often and get feedback on how they are used and how they perform in production. We can then use that feedback to make the Data Models better as the project progresses. The following definitions best highlight the difference between incremental and iterative Data Modelling:

Incremental – releasing stable sections of the Data Model for development and use. Limited changes to the Data Model are expected.

Iterative – releasing initial version of the Data Model for development and feedback to make the Data Model and future Data Models better. Moderate to significant changes to the Data Model are expected and embraced.

They say that it requires the designers “to get it right early, forcing you to identify most requirements even earlier in the project, and therefore forcing your project team into taking a serial approach to development.” On the contrary, data and process modeling, and thereby data design and program design, should be done in a flip-flop manner. You collaborate on the requirements, model some data, model some processes, and iterate this process till the modeling is done – using a white-board and Post”

Hopefully Tom Haughey will read this blog post and clarify this statement with me. It does sound like there may be aspects of Iterative Data Modeling being proposed, but this conflicts with the earlier statement, so I am unsure. The iterations still seem to be focused on the modeling and not on explicitly incorporating development and having the functionality promoted and used by the clients in production. (The only true measure of value.)

“But remember this. The traditional SDLC (System Development Life Cycle), whatever its faults, has successfully delivered the core systems that run business across the world. Imagine delivering a new large brokerage trading system in 2-week intervals, or going live with a space shuttle project 2-weeks at a time, or delivering a robotic systems for heart surgery in 2-week intervals. Much, but not all, of Agile development has focused on apps like web-based systems and smaller, non-strategic systems.”

I'm not sure I agree with these comments. I guess it depends on how you define success. With the statistics of the Standish Chaos reports, I'm not sure how anyone can say the traditional SDLC has successfully delivered core systems. It is true that it has delivered core systems, but many of those projects may not be defined as a success by the clients. The statement that Agile development has been focused on smaller, non-strategic systems is also concerning. I've personally used Agile on large, strategic systems, and I'm sure many other people would agree.

“Database refactoring represents a simple change to a database schema that improves its design while retaining both its behavioral and informational semantics. Database refactoring is more difficult than code refactoring. Code refactorings only need to maintain behavior. Database refactorings also must maintain existing integrity and other business rules. The term “database” includes structural objects, such as tables and columns, and logic objects such as stored procedures and triggers.”

While I agree that refactoring a database can be complicated, the risk of extreme changes to the Data Model can be mitigated by creating a High Level Enterprise Data Model in Iteration 0 (and potentially by other methods). Frequently, people against Agile state that iterations start without any initial work and that, as such, changes can be drastic and complex. This is incorrect. Agile is a continuum, and if small phases of foundation work have value, Agile encourages their use. I have found this method very valuable.

Having experience in both creating Enterprise Data Models and writing software code, I would say refactoring significant portions of either hurts. So I would recommend trying to minimize drastic changes by doing some upfront high-level modeling. I would not say refactoring Data Designs is easier, though. A major framework change would be much more intensive and invasive.
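As a hedged illustration of why database refactorings must preserve informational semantics as well as behaviour, here is the classic rename-a-column refactoring done as a transition. The names are hypothetical, the syntax is Oracle-flavoured, and the pattern comes from the database refactoring literature rather than from Tom's article.

```sql
-- Step 1: add the new column and backfill it from the old one.
ALTER TABLE client ADD (preferred_name VARCHAR2(100));
UPDATE client SET preferred_name = nick_name;

-- Step 2: keep old and new columns in sync during the transition period,
-- so existing programs keep working while they are migrated.
CREATE OR REPLACE TRIGGER trg_client_sync_name
BEFORE INSERT OR UPDATE ON client
FOR EACH ROW
BEGIN
    IF :NEW.preferred_name IS NULL THEN
        :NEW.preferred_name := :NEW.nick_name;
    ELSE
        :NEW.nick_name := :NEW.preferred_name;
    END IF;
END;
/

-- Step 3: once nothing references nick_name, drop the old column and the trigger.
-- ALTER TABLE client DROP COLUMN nick_name;
-- DROP TRIGGER trg_client_sync_name;
```

The transition period is what keeps both the behavioural and the informational semantics intact, and it is also why a tool that can version and compare models makes this kind of change far less painful.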

Summary

So where does Agile Data Modeling go from here? Given that this was a pretty recent article, I'd say there is still quite a way to go to incorporate Agile methods into Data Modeling methods. The good news is that Agile Data Modeling has much to offer Agile projects. We just need to help promote the use of Iterative Data Modeling in addition to Incremental Data Modeling. (Incremental Data Modeling is still better than the alternative.)