The state of #Agile according to Data Modellers

DataVersity released their snapshot survey on Data Modelling and the results aren’t good for Data Modellers who want more adoption of Agile by the Data Modelling community.

Before we get into the details, DataVersity is a great source of references and webinars for all things data. You can find them by clicking on this link: DataVersity

The Results

1) How important is it to know how to work with Agile Teams? (this was graded on a scale of 1-10. I’ve tried to group them into categories)

  • 27% thought it was important (1-3 on a scale of 10)
  • 27% were neutral (4-6 on a scale of 10)
  • 46% thought it was not important (7-10 on a scale of 10)

2) How much experience do you have with Agile or Scrum projects?

  • 12% – 10+ projects
  • 9% – 5-10 projects
  • 20% – 3-4 projects
  • 16% – 2 projects
  • 8% – 1 project
  • 33% – no experience

3) If you have been the data architect or data modeller on an Agile project, how satisfied were you with the results?

  • 22% – very satisfied
  • 46% – fairly satisfied
  • 32% – not satisfied

Analysis

What I took from these responses is that the largest group of respondents feels that Agile isn’t that important to them. But I hope there is some light at the end of the tunnel, as a good percentage of Data Modellers (41%) have either never been on an Agile project or have been on only one. In addition, of the ones who were on Agile projects, 68% reported that they were either very satisfied or fairly satisfied. Those are encouraging numbers. Although some Data Modellers don’t think Agile is important to them, there seems to be a correlation: once they are exposed to the methods, they view them as positive.

Dimensional Modelling

Those of you who have read my blogs have seen my positive comments on Dimensional Modelling. Another concern I had with the results of the survey is the percentage of Data Modellers who feel that understanding Dimensional Modelling concepts is not important to them.

  • 28% thought it was important (1-3 on a scale of 10)
  • 34% were neutral (4-6 on a scale of 10)
  • 40% thought it was not important (7-10 on a scale of 10)

Given that all Data Modellers have had to model data to support reporting, this is a somewhat concerning statistic. Dimensional Modelling is not something new, either, unlike XML and NoSQL data stores. Shockingly, 62% felt that knowledge of XML and NoSQL data stores is not important to them.

Summary

I think we have a ways to go to encourage the adoption of new processes, methods, and technologies in the Data Modelling and Database professions. Even within our own profession, items like Dimensional Modelling still lack full acceptance.

I would encourage every Data Modeller out there to learn about Agile and Agile methods. As we move into the second decade of Agile, its adoption is increasing and it will affect every type of project in the future. Understanding the methods will help the Data Modeller determine how best to integrate with Agile methods and practices.

I feel that Data Modellers must remember that we are providing a service to projects and the business. If the projects and the business are becoming more Agile, we must also become Agile. If not, the projects and businesses will be drawn towards other solutions and services that do align with their methods. This could drive more projects toward NoSQL and XML even when they are not good candidates for those technologies.

 


#Agile Enterprise Data Model – #holistic and #iterative

I’ve recently been working on a Data Warehouse project that I am trying to do in an Agile way. Unlike Agile Software Development, Agile Data projects have some complexities that force you to adapt the common Agile methods and practices. I am a big fan of User Story Mapping and of creating a User Story Map to guide the development of an application and to create and prioritize your backlog. But I have felt that a Data Model tends not to allow stories to be sliced as thinly as application functionality. In many cases a group of tables must be taken together to make any sense at all. Unlike application components, which can be segmented more easily, it is sometimes quite difficult to segment a data model into parts that can be worked on independently.

I have been involved in other Data Modelling projects where intense analysis was done to try to understand all the intricacies of the data before the application development started. This analysis sometimes lasted 3 months, 6 months, or even years. It goes without saying that this is not an efficient way to organize a project.

If you try to segment application user stories, individual reports, and data requirements as stories, they end up being a bit too fine-grained from a database point of view. Creating a data model from a single user story or report can lead to a data model that has to undergo a great amount of re-factoring as new user stories, reports, and data requirements are considered. Frequently a new user story that enhances application functionality requires an entirely different data model.

Then I started to think that this can happen on most Agile projects. They slice the User Stories thin from an application point of view, and that works because the application components can be somewhat independent, but what happens when you try to line up the data components?

I am sure this impedance has led to the kind of drastic database re-factoring that has given the practice a bad reputation and caused people to write books on the subject.

Epiphany

Then it hit me. Although the User Story Mapping exercise does allow for an end-to-end view of the application that can then be iterated on, it does not create an end-to-end view of the Data Model or the Data Domain. In fact, some critical components of the Data Model may not be encountered until much later. This can cause significant rework of the design that has been done earlier.

So what could be the solution?

Agile Enterprise Data Model

An Agile Enterprise Data Model is a thin data model of the Data Domain that can be done as a two-week technical spike. This Agile Enterprise Data Model provides a User Story Map for the data. It provides a Data Model with placeholders for discussion and further refinement. The concept is not to do detailed Data Modelling but to understand, at a high level, all the data components that will need to be considered by the project. Used in conjunction with a User Story Map, it can be a valuable tool to allow the application and data model to be worked on in true iterations.

Typically an Agile Enterprise Data Model is a conceptual data model with 20-30 entities. There are minimal attributes, and discussion and understanding revolve more around entities and relationships than anything else. The Agile Enterprise Data Model can then be refined to create the application database, Operational Data Store, and Data Warehouse.
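
As a rough sketch of what those placeholder entities might look like, here are a few skeleton tables with keys and relationships only and almost no attributes. The entity names are hypothetical, not from any particular project.

    -- Skeleton conceptual entities (illustrative names): keys and relationships
    -- only, to be refined iteration by iteration.
    CREATE TABLE client (
        client_id INTEGER PRIMARY KEY
    );

    CREATE TABLE policy (
        policy_id INTEGER PRIMARY KEY,
        client_id INTEGER NOT NULL REFERENCES client (client_id)
    );

    CREATE TABLE claim (
        claim_id  INTEGER PRIMARY KEY,
        policy_id INTEGER NOT NULL REFERENCES policy (policy_id)
    );

Each table is a placeholder for a conversation; the detailed attributes arrive later as the related stories are pulled from the backlog.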

Summary

I have used the Agile Enterprise Data Model with very good success. It works especially well in a Data Domain you are unfamiliar with, as it allows you to understand the Data Domain holistically at the start and then deliver in iterations. The best of both worlds.

Why #Dimensional Modeling matters

I’ve recently completed a data modeling initiative on a major project. After doing this I’ve come to two major conclusions:

  1. The coverage area in Insurance is probably the most devious and twisted area of data that I have ever modeled.
  2. Dimensional Modeling should be done on every model to ensure you can model the domain simply.

Why Dimensional Model?

For the model I worked on, there are over 261 tables in the Relational Model. In the Dimensional Model, there are only 25. The process to distill a Relational Model to a Dimensional one is not easy, but the process highlighted problems in my Relational Model that I was unaware of. These issues would have been exposed later during testing or post-production, but being able to identify them in development was extremely valuable.

So what is Dimensional Modeling? Although most people have been exposed to the rules of Data Normalization, not as many people have been exposed to Dimensional Modeling. I will try to explain my view of Dimensional Modeling, but if you are interested I would highly recommend ‘Star Schema: The Complete Reference’ by Chris Adamson. Simply put, it is the most complete and concise book ever written on Dimensional Modeling. Dimensional Modeling is usually done to create a model for a Data Warehouse. This Data Warehouse can then be used for Operational and Analytic reporting. But the process of Dimensional Modeling does not need to be limited to a Data Warehouse. The process itself has value for validating the Relational Model.

The term commonly used to refer to a Dimensional Model is a ‘Star Schema’ model. This is because most Dimensional Models look like a star, with the Fact in the centre and the Dimensions around it.

Dimensional Modeling

Dimensional Modeling is the process of taking a Relational Model in some normal form and being able to distill the many tables down to the business objects that the business works with. In most cases this would result in single tables that correspond to objects like:

  • Client
  • Product
  • Sales Rep
  • Store
  • Sales Transaction

These business objects fall into two categories:

Facts – These are Facts about the business. They typically correspond to events or transactions and have metrics or measures that the business is concerned about. In many cases these fact tables are what the business wants to report on.

Dimensions – These are Dimensions to the Facts. They typically correspond to business objects that interact with Facts. These Dimensions represent how the business describes these objects and how they like to slice, dice, filter, sort, group, and order the Facts. A standard Dimension that occurs in all Dimensional Models but does not correspond to a business object is Time. Time is a dimension for almost all business events and transactions.
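
To make the Fact/Dimension split concrete, here is a minimal star schema sketch in SQL built around the kind of business objects listed above. The table and column names are illustrative only, not taken from any real model.

    -- Dimensions (illustrative names): how the business describes, filters,
    -- and groups the facts.
    CREATE TABLE dim_date (
        date_key      INTEGER PRIMARY KEY,
        calendar_date DATE NOT NULL,
        month_name    VARCHAR(20),
        year_number   INTEGER
    );

    CREATE TABLE dim_client (
        client_key    INTEGER PRIMARY KEY,
        client_name   VARCHAR(100),
        client_region VARCHAR(50)
    );

    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,
        product_name  VARCHAR(100),
        product_line  VARCHAR(50)
    );

    -- Fact: one row per sales transaction, carrying the measures to report on.
    CREATE TABLE fact_sales (
        date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
        client_key   INTEGER NOT NULL REFERENCES dim_client (client_key),
        product_key  INTEGER NOT NULL REFERENCES dim_product (product_key),
        quantity     INTEGER,
        sales_amount DECIMAL(12,2)
    );

    -- Reports then slice and group the fact by any dimension.
    SELECT d.year_number,
           c.client_region,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
           JOIN dim_date   d ON d.date_key   = f.date_key
           JOIN dim_client c ON c.client_key = f.client_key
    GROUP  BY d.year_number, c.client_region;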

Initially, the act of creating these Facts and Dimensions is a large task of de-normalization. Although this sounds like a simple task, it is anything but. A Relational Model with hundreds of tables must be distilled down to a handful of tables. Many-to-many relationships must be transformed into simple relationships that are easy to understand and report against. A natural key must also be defined that uniquely identifies a row in each Fact and Dimension. Although this sounds easy, you may find that some Dimensions don’t actually have a natural key. This is a clue that the Dimensional Model needs more attention.

A Dimensional Model also places the modeling of history at the forefront. A lot of attention is spent ensuring that data can accurately represent the changes that occur over time. Most of the time, a Relational Model is primarily concerned with current state.
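
One common way to put history at the forefront is a versioned dimension (a Type 2 slowly changing dimension; the technique is my illustration here, not something prescribed above). Each change produces a new row bounded by effective dates, so facts can be joined to the version that was true at the time:

    -- Versioned client dimension (hypothetical table): the natural key
    -- (client_number) can appear many times, once per version.
    CREATE TABLE dim_client_history (
        client_key     INTEGER PRIMARY KEY,   -- surrogate key
        client_number  VARCHAR(20) NOT NULL,  -- natural key
        client_region  VARCHAR(50),
        effective_date DATE NOT NULL,
        end_date       DATE                   -- NULL for the current version
    );

    -- Find the version of a client that was in effect on a given date.
    SELECT *
    FROM   dim_client_history
    WHERE  client_number = 'C-1001'
    AND    effective_date <= DATE '2013-06-30'
    AND    (end_date IS NULL OR end_date > DATE '2013-06-30');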

Note: This is a gross over-simplification. If this is interesting to you, I would recommend the book by Chris Adamson for all of the theory and complexity behind Dimensional Modeling. Or you can just start Googling!

Conclusion

The act of creating a Dimensional Model and creating the 25 Facts and Dimensions highlighted inconsistencies and duplication in my model. It also challenged my understanding of the data domain. It raised questions that I realized I didn’t have the answer for. It is easy for inconsistency and errors to hide in a data model with 200+ tables. A data model that has 10-30 tables has nowhere to hide. It has the brutal transparency so valuable in Agile.

#Data #Hoarding and #Norm-aholics

Are you a sufferer of Data Hoarding? This is an affliction that affects thousands of businesses every day. Common symptoms are terabytes of data being kept around just in case you ever need them. The data ends up being stored in a multitude of disconnected tables just piled on top of one another with no organization. When a new piece of data is thought to be needed, another table is created and thrown on the pile.

The data is kept around in case it is ever needed, but good luck ever finding it in the future when it is buried under piles of other data. Now where did I put that snapshot of the ledger balances from March 2011?

Even worse, most new applications create an isolated database to hold the data that the application requires. It doesn’t matter that at least 30-40% of the data already exists somewhere; it is too hard to integrate it and create one common data source. Much easier to copy the data once and then modify it as the application desires.

It doesn’t have to be this way.

Data Hoarders Anonymous

At Data Hoarders Anonymous we work with you and take you through a seven-step program.

Phase I – Repent

1) Admit you have a problem

The first step to any recovery is admitting you have a problem. Look around at your databases: do you have a data dictionary? Do you have 20+ tables with the same name? Have you copied tables directly into a database without doing any analysis or integration?

2) Understand the Enterprise’s perspective

Data is an asset for the enterprise, but only if it is easy to understand. Easy is the key word: the enterprise must be able to get and understand the data easily. Typically, when data has been replicated across god’s green earth, there are multiple copies of the truth and they are anything but easy to find. The data is worthless if it isn’t easily understood.

3) Apologize to Developers you have made extract data from multiple complex disconnected schemas

Take donuts with you. But apologize to those developers you have made create programs that need to extract data from multiple sources and somehow make sense of it all. SQL should not be a language that requires a PhD to understand. If your SQL is longer than a screen, you probably have a Data Hoarding problem.

4) Apologize to Business users you have made explain inconsistent data to Vice-presidents

Take alcohol with you and apologize to the business users who don’t know what is included in the allocated income calculation or which table actually has the true, up-to-date value. Apologize for not having a Data Dictionary available to them that would tell them where the data exists and what it actually means.

Phase II – Recovery

Now that you have repented, you can move on to the recovery portion…

5) Organize your data with Data Normalization

Data Normalization is your new friend. Learn the rules and let them assist you in removing data redundancy and enforcing consistency and integrity. But be very careful. Many on the recovery path of Data Hoarding end up becoming ‘Norm-aholics’. They normalize to at least third normal form all the time, even though it creates excessively complex schemas for the developers and data that can’t be accessed by business users. Even worse, ‘Norm-aholics’ have no self-awareness that they have an addiction. Never let the rules of Data Normalization affect the progress you have made. Sadly, many DBAs fall into this trap and let the theory of Data Normalization affect the understanding of the data.

Data Normalization must increase the understanding of the data, if not Data Normalization has been corrupted.

6) Simplify your data with Dimensional Analysis

Now that you have used Data Normalization to remove data redundancy and maximize consistency and integrity, the time has come to use the tools of Dimensional Analysis to ensure the data design is simple. Use Dimensional Analysis to ensure that you haven’t traded one type of complexity for another. Evaluate your data across time and across Master Data Dimensions to ensure that the transaction concepts are consistent and simple.

7) There can be only one

At the last step, you will need to embrace the principle that there can be only one…

  • One owner and source of the data – The data must be owned by one business area and system
  • One place to update the data – the data must be updated in only one system, which the owner controls
  • One way to get the data – there cannot be multiple query paths to retrieve the required data
  • One truth – the data must have only one meaning and this meaning must be documented in a Data Dictionary

Coming to you soon on A&E – Data Hoarders Anonymous

Why #Dimensional Analysis should be done on every #datamodel

Those of you who have worked with me know of my fondness for Operational Data Stores. I have always believed in the importance of having an enterprise or holistic view of the data requirements for every application. An Operational Data Store seemed to be the perfect vehicle to ensure this happened. Perhaps my fondness was related to not wanting to stray too far from the normalization rules that I knew quite well. In this way, it was a new-ish discipline or context that really wasn’t new.

I always looked kinda sideways at those weird Dimensional modelers with their Star Schemas and Snowflakes. I mean, if they really put their minds to it, they would be able to figure out how to solve their data needs with a nice relational, normalized Operational Data Store. Only exceptional and massive amounts of data require the Dimensional modeling constructs that these models typically use, right? I mean, what is so complicated about a model with only 100 main tables? Shouldn’t everyone know how to write SQL by hand?

On my latest project, I have had the opportunity to become re-introduced to Dimensional Analysis and modeling and I have found the process fascinating and very valuable. Besides the obvious benefits that are being realized by being able to model the data in a way that allows the clients to efficiently write and execute queries, there was an unexpected benefit.

Taking a normalized Data Model and attempting to translate it into a Dimensional Model really challenges and validates your data model. It is far easier to create a model with a multitude of complex relationships than it is to distill it down to a handful of Facts and Dimensions. With so many relationships, it is possible for inconsistencies to exist and hide in the data model. I found multiple modeling errors in the process of trying to create a Dimensional Model from my relational model. When you distill a relational model down to a Dimensional Model, inconsistencies and errors become very apparent in the creation of the Facts and Dimensions.

Dimensional Analysis also forces you to look at the data in a different way. Instead of a relational/hierarchical view, I find it forces me to look at the data in a chronological way and to consider data changes, data history, and data latency in ways I may not have considered before. A model that does not have to account for data across time and verify consistency at every point in time is quite a bit simpler.

Summary

I am a convert to using Dimensional Analysis on all my data models, both to validate the data model and to further analyze the data. I’ve discovered that I need to understand the data better to create a Dimensional Model than to create a normalized model. More factors need to be considered, and creating the Dimensional Model with fewer objects requires that the data model have greater consistency, integrity, and cohesion.

Simple is hard. 

 

Why do we #DataModel at all?

People in the Database world take Normalization and Data Modeling as something that should be done without question. I compare it to best practices like versioning software: no one expects anyone to create software without version control anymore. But more and more often I get questioned and challenged on why we need to normalize and model data. Is it even required, given the cheap disk space, memory, and server capacity available?

According to Wikipedia and others, the objective of normalization is:

“Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.”

The rules of Normalization

  • 1NF – A relation is in 1NF if and only if all underlying domains contain scalar values only. First Normal Form is violated when a table contains repeating groups.
  • 2NF – A relation is in 2NF if and only if it is in 1NF and every non-key attribute is irreducibly dependent on the primary key (every column must depend solely on the primary key). Second Normal Form is violated when a non-key field is a fact about a subset of a key.
  • 3NF – A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. Third Normal Form is violated when a non-key field is a fact about another non-key field.
  • 4NF – A relation R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the (nontrivial) MVD A->>B is satisfied, all attributes of R are also functionally dependent on A. Fourth Normal Form is violated when a table contains two or more independent multi-valued facts about an entity (the relation must also satisfy Third Normal Form).

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is “functionally dependent” on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can’t be two records having the same X value.
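
As a small, hypothetical illustration of the third-normal-form rule (my example, not one from the post): if an order row carries the customer’s address, the address is a fact about a non-key field (the customer number) rather than about the order itself.

    -- Violates 3NF (hypothetical tables): customer_address depends on
    -- customer_number, not on order_id.
    CREATE TABLE customer_order_denormalized (
        order_id         INTEGER PRIMARY KEY,
        customer_number  VARCHAR(20) NOT NULL,
        customer_address VARCHAR(200),
        order_total      DECIMAL(12,2)
    );

    -- Normalized: the address lives in exactly one place and is updated in one place.
    CREATE TABLE customer (
        customer_number  VARCHAR(20) PRIMARY KEY,
        customer_address VARCHAR(200)
    );

    CREATE TABLE customer_order (
        order_id        INTEGER PRIMARY KEY,
        customer_number VARCHAR(20) NOT NULL REFERENCES customer (customer_number),
        order_total     DECIMAL(12,2)
    );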

The Questions

Now that we have reviewed the objectives and rules of normalization, let us summarize. The objective of Normalization is to:

  1. Minimize Redundancy
  2. Minimize Dependency

But what if we have extra storage available and storing redundant copies of data is not a problem? In fact, it probably will speed up our query response time. What if we also don’t require frequent modification of data so that having de-normalized data won’t result in update or deletion anomalies caused by excessive data dependency? Why should we still model our data and normalize?

The three reasons to Data Model

Simplicity, Consistency and Integrity, and Future Flexibility.

Simplicity

Every one of the violations mentioned above would require extra code and extra unit tests to validate proper functioning. Depending on the number of violations, this can become a severe amount of technical debt needlessly carried in the software. There is an entire movement dedicated to the elimination of If statements (www.antiifcampaign.com). Software that is Data Driven rather than Condition Driven is simpler and easier to maintain over the life of the application.

An application that is Data Driven can also automate the creation of its test cases to validate proper functioning of the application. This, combined with the enhanced simplicity, greatly adds to the quality of the application.
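
Here is a rough sketch of the Data Driven versus Condition Driven distinction, using hypothetical account and account_type tables. When a rule such as a fee rate is embedded in conditions, it has to be repeated (and tested) everywhere it is used; when it is stored as data, it is maintained in one place.

    -- Condition Driven (hypothetical tables): the fee rule is baked into
    -- every query that needs it.
    SELECT account_id,
           CASE account_type_code
                WHEN 'CHEQUING' THEN balance * 0.010
                WHEN 'SAVINGS'  THEN balance * 0.005
                ELSE 0
           END AS monthly_fee
    FROM   account;

    -- Data Driven: the rate is data in a reference table, changed without
    -- touching any query or code.
    SELECT a.account_id,
           a.balance * t.monthly_fee_rate AS monthly_fee
    FROM   account a
           JOIN account_type t ON t.account_type_code = a.account_type_code;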

Consistency and Integrity

Even if the solution being modeled can accommodate redundant data and currently has little potential for update and deletion anomalies, significant risk is being assumed by leaving these potential situations in your data model. How can you ensure that redundant data will be kept in sync and that update and deletion anomalies do not get introduced in the future as people and requirements change? Either through additional software development code or through additional processes and latent knowledge held by resident experts. Neither of these is a good use of time and energy.

This is an example of an application-centric view of the data model. Unfortunately, not all activity on the Data Model can be guaranteed to always go through the application. Data Fixes, Conversions, and enhancements all have the ability to bypass the application’s business logic and compromise the integrity of the data. All it takes is one high value client with inaccurate or inconsistent data to irreparably harm a company’s reputation and revenue stream.

Future Flexibility

Solutions that are data driven and do not have excessive functional dependencies are much easier to evolve in the future. For example, I may have a business requirement to split one account type or to combine account types. This type of conversion will be quite routine if I have modeled my data properly and minimized dependencies. If not, the conversion can be quite convoluted, and I will probably need to evaluate code before I can determine the implications of making such a change. Then I have to be sure I address and update all the redundant code throughout the application. Just because update and deletion anomalies don’t exist today doesn’t mean those situations won’t happen in the future.

In addition, these changes to split or combine account types would probably also require code changes. If the solution were Data Driven, the possibility of those code changes would be minimized. (Not to say they would never be required, but the probability of code changes would be minimized.)
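
Continuing the hypothetical account-type example from earlier: when the account type and its attributes are reference data, splitting a type is largely a data change rather than a code change.

    -- Add the new account type as reference data (hypothetical tables)...
    INSERT INTO account_type (account_type_code, description, monthly_fee_rate)
    VALUES ('SAVINGS_PLUS', 'Premium savings account', 0.004);

    -- ...then move the affected accounts to it. Code that reads the rate from
    -- account_type keeps working without modification.
    UPDATE account
    SET    account_type_code = 'SAVINGS_PLUS'
    WHERE  account_type_code = 'SAVINGS'
    AND    balance >= 100000;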

Summary

A well-designed application and user interface can be used with minimal training. It just makes sense and models the client’s current processes and tasks.

A well-designed data model should have the same intuitive qualities. It also just makes sense and models the business’s data: not how the application functions, but how the business exists. Modeling the data in this manner minimizes the work needed to use the data today and to accommodate change in the future.

In Object Oriented parlance, the Data Model itself should be loosely coupled with high cohesion. Both the Object Model and Data Model should share this characteristic. (Although they will implement it in quite distinct ways)

How an #Agile Data Warehouse leveraged an #InnovationGame – Iteration 2!

A few weeks ago I authored a post that explained how we were leveraging an Innovation Game on my current Agile Data Warehouse project. You can find the original post here.

Iteration 2

One of the aspects I love about Agile is the freedom it gives you to be bold, because it acknowledges that it is impossible to get things perfect right off the bat. I find that this encourages people to take a chance and try new things. If it doesn’t work? No problem, we will just adjust and get better as we progress.

My initial blog proposed using a visual Innovation Game with Visual Report Boards to allow for brutal transparency and the management of the data requirements for an Agile Data Warehouse project. Each Visual Report Board had an object at the centre with 6 dimensions around it that illustrated how the object could be reported on. (I affectionately refer to these boards or diagrams as Hexes.) Our objects were the people the corporation was interested in. It turns out that this method has been successful. It has allowed us to create a visual backlog of data requirements and have the clients prioritize the work and guide it according to the overall business priorities.

Why am I writing this blog then?

Well it turned out that once we started placing the data requirements or reporting stories on the Visual Report Boards, it became apparent that our objects and dimensions of the diagrams required tweaking.

Initially I had placed a Person object at the centre of the diagram. The dimensions were then aspects of how we needed to report on that person. Although this object and these dimensions may work for other projects, they did not work for ours. (Actually, I think most Data Warehouse projects would probably make the same change we did, but I’ll let you decide that for yourselves.)

Our experience was that most of the reports or data requirements were very clustered on the Visual Report Boards and the diagrams did not allow for the visual communication of what the report was or what data was required. I was starting to worry that this process might not provide the brutal transparency that allowed for the efficient creation of a Data Warehouse in an Agile way.

Just the Facts

After reviewing the requirements, it became apparent I had the wrong objects at the centre! Rather than people at the centre, the objects at the centre should be transactional data. The centre objects needed to be the actual data that was summed, aggregated, filtered, sliced, and diced. Once I made this change, the value of the Visual Report Boards increased exponentially. Now they communicated the content and purpose of the data requirements.

The real indication that I was on the right track was that the Revenue and Expenses Hexes I now had were also the first two Fact tables needed in the Data Warehouse! This method of visualization and analysis aligned 100% with the Data Warehouse design process. Of course the Hexes were Fact tables. This made perfect sense. I imagine more Hexes will be needed in the future as we discover the need for more Fact tables.

In addition, we created one more Master Data Hex as some reports and data requirements are not related to transactions.

Summary

I am convinced the use of these Visual Report Boards and the related use of an Innovation Game enable the creation of a Data Warehouse in an Agile manner. We are executing in Iterations and not increments and the clients are thrilled with the control, visibility, and value they get every 2 weeks. I’ll post another blog once we have created enough of the Fact tables to provide more lessons learned.

Object Model and Data Model differences – Embrace the diversity

In my experience there is a distinctive difference in Data Models when they are created by developers with an Object Model mindset. Usually some work and negotiation needs to be done to properly overcome the Object Model-Data Model impedance problem. I believe that both extremes lead to less than optimal designs and architectures:

1) Implementing an Object Model directly into a Data Model

2) Implementing a Data Model directly into an Object Model

Both models have a rationale and logic for how they structure components, and sometimes those structures don’t translate well from one to the other. There are good reasons why they differ.

Recently I’ve had a situation where I’ve come up against these differences. They have arisen in regard to how each camp sometimes models Super/Subtype tables (and, subsequently, reference tables).

Definitions

I am using the term Super/Subtype tables to refer to a collection of tables that represent a Generic/Specific relationship. Unlike Parent/Child relationships that represent a relationship with a hierarchy,  the Generic/Specific are at the same entity level. Perhaps a few examples would assist the discussion:

Parent/Child

Typical Parent/Child relationships represent ‘belonging’ or ‘ownership’. Examples of this type of relationship would be:

  • Person/Address
  • Account/Transaction

In these examples, there is a clear understanding of precedence and hierarchy. You obviously need a Person before you can have a related Address. Because there is a Parent/Child relationship and hierarchy, there is also usually a clearer distinction between the types. In many cases, this is because the two entities have different cardinality. Even the most fervent object modeller will create separate objects if they have different cardinality.

Generic/Specific

Generic/Specific relationships do not represent ‘belonging’ or ‘ownership’. In addition, the entities typically do not have different cardinality. These relationships represent ‘refinement’ or a type of ‘Master-Detail’ relationship, with one Master entity and multiple Specific entities. To be a Generic/Specific relationship, there must be attributes at the Master level that apply to all Specific types, and attributes at each Specific type that apply only to that Specific type. (Which is why the Specific entities exist; otherwise those attributes would just live at the Master level.)

Usually these relationships are created in the Data Model in one of two ways:

1) Multiple tables – one Master table with multiple Specific detail tables, and a type code on the Generic entity to define which related Specific entity applies for each row

2) One table with nullable fields that are populated based upon the type code of the row

Examples of this type of relationship are:

  • Financial Account – [Equity Account, Fixed Income Account]
  • Coverage – [Maximum, Deductible, Co-insurance]

In these examples, there is not a pronounced hierarchy. You obviously need a Financial Account before you can have a related Equity Account, but they are at the same level. As mentioned before, the entities in this relationship do not have different cardinality.
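
For illustration, the two approaches listed above might look like this for the Financial Account example. The column names are hypothetical.

    -- Option 1 (hypothetical columns): a Generic (master) table plus one table
    -- per Specific type, with a type code indicating which Specific row applies.
    CREATE TABLE financial_account (
        account_id        INTEGER PRIMARY KEY,
        account_type_code CHAR(2) NOT NULL,   -- e.g. 'EQ' or 'FI'
        account_name      VARCHAR(100)
    );

    CREATE TABLE equity_account (
        account_id    INTEGER PRIMARY KEY REFERENCES financial_account (account_id),
        exchange_code VARCHAR(10)
    );

    CREATE TABLE fixed_income_account (
        account_id    INTEGER PRIMARY KEY REFERENCES financial_account (account_id),
        maturity_date DATE
    );

    -- Option 2: a single table with nullable columns populated according to
    -- the type code of each row.
    CREATE TABLE financial_account_single (
        account_id        INTEGER PRIMARY KEY,
        account_type_code CHAR(2) NOT NULL,
        account_name      VARCHAR(100),
        exchange_code     VARCHAR(10),        -- only populated for equity accounts
        maturity_date     DATE                -- only populated for fixed income accounts
    );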

My Experience

My experience is that Data Models and Object Models are typically the same when we are discussing Parent/Child entities and relationships. Differences of opinion occur with Generic/Specific entities and relationships. Since the Specific types do not have different cardinality, they are very easy and natural to represent as a single object.

  • From a data modeller’s point of view, this practice creates data entities with embedded data knowledge. No longer does the structure of the data represent the true nature of the data. Depending on how many Specific Types and attributes exist, this can cause considerable complexity. The design also breaks the rule of Second Normal Form.
  • From an object modeller’s point of view, this practice creates an object that can be re-used and leveraged. Rather than needing to create separate objects, I create one object! Isn’t this just Polymorphism?

Isn’t this only a problem depending on what the database is being used for? Since I am a data modeller, I think it is a problem no matter what the database is used for. I guess the issue is just how large a problem it is. There is less of an issue if it is an application database with a business layer that can incorporate this complexity and shield it from the consumers of the data. It may be a much larger issue if the design is for an Operational Data Store or Reporting Database.

Summary

There are valid reasons why designs and models should differ between objects and tables. This is especially true when analysing Generic/Specific relationships. IMHO.

SQL Server 2012 – Cheaper than Open Source Database options

On a recent project we were tasked to review and recommend a database technology platform for a new Data Warehouse environment. This quickly turned into a database evaluation not just for the new Data Warehouse, but also for all future applications. The client was currently on Sybase but was open to the most cost-effective option. The client also had past experience with Open Source technologies for both server operating systems (Linux) and reporting solutions (JasperSoft).

The criteria we specified had cost and performance as two important factors. Since we had limited experience with any Open Source databases in a Data Warehouse environment, we felt it would be useful to evaluate the options from a cost, functionality, and performance point of view.

This post is a recap of that process and of the somewhat surprising recommendation.

The Candidates

Since the client was already using Sybase, it was a no-brainer to include it. In addition to Sybase, we included SQL Server and Oracle from the commercial DBMS’s due to their market share and existing use at the client. DB2 was excluded because it is not currently used at the client and because of the perceived higher cost of implementing DB2 (rightly or wrongly). It was felt that either Oracle or SQL Server could represent the commercial DBMS’s.

On the Open Source side, MySQL and PostgreSQL were selected. MySQL was selected due to its large adoption in the Open Source community and its recognition for performance. PostgreSQL was selected for its recognition in the Open Source community for robustness and performance. They were the only two Open Source DBMS’s included, as it was felt they were the clear leaders among Open Source DBMS’s.

After performing detailed performance tests between MySQL and PostgreSQL, PostgreSQL was chosen as the Open Source DBMS option to compare against the commercial DBMS’s. This was because MySQL had significantly poorer performance when using the InnoDB engine, which guarantees ACID compliance and provides referential integrity. MySQL has an interesting piece of functionality that allows you to use different database engines, each providing different functionality and types of processing. Unfortunately, the engine used in many of the published benchmarks does not provide the functionality required by standard enterprise DBMS solutions.
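
For reference, MySQL chooses the storage engine per table, which is why benchmark results can vary so much. A minimal illustration (table names are hypothetical):

    -- InnoDB: transactional, ACID-compliant engine with foreign key support.
    CREATE TABLE staging_row (
        row_id  INT PRIMARY KEY,
        payload TEXT
    ) ENGINE=InnoDB;

    -- MyISAM: a non-transactional engine often seen in older benchmarks;
    -- it offers no foreign keys and no ACID guarantees.
    CREATE TABLE staging_row_archive (
        row_id  INT PRIMARY KEY,
        payload TEXT
    ) ENGINE=MyISAM;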

The Requirements

The current Data Warehouse requirements were reviewed and it was determined that the vast majority of the reporting requirements were operational reports delivered in text format. This is probably not unlike many companies out there that are early in the use of Data Warehouses. This is not to preclude the use of more analytical functions in the future, but that was not the use of the Data Warehouse currently.

The current Data Warehouse also stored data in the tens of gigabytes. So although this was a considerable amount of data, it did not require advanced Data Warehouse appliances or Big Data solutions. Traditional Data Warehouse architecture and designs would be sufficient for now and well into the future.

The requirement was that the Open Source solutions must be supported. For example, this meant that we would require support from EnterpriseDB if we chose PostgreSQL. I propose this is pretty typical for larger companies evaluating Open Source.

The Architecture

Due to the current requirements and expected future requirements, the recommended Data Warehouse solution architecture was designed to incorporate both an Operational Data Store and a Star Schema Data Warehouse. The Operational Data Store would be used to generate pure operational reports and would be loaded at least daily. The Star Schema Data Warehouse would be used to generate more analytical reports and would be loaded on a weekly basis.

The Evaluation

The recommended configuration for the evaluation was two servers, each with two quad-core CPUs. This was based on the current workload and the expected increase in workload in the future. We did evaluate higher levels of configuration to ensure the cost remained linear for future growth.

This was not a small Data Warehouse and is probably typical of most enterprises as an entry into Data Warehouse technology.

We separated our cost evaluation into two sections:

1) Initial Cost

2) Ongoing Cost

Initial Cost

Initial cost incorporated the server and DBMS license fees for both development and production environments and also two human factors:

Training – the cost of the initial training required for the administrators.

Efficiency – the expected efficiency of the administrators in the first three months.

The results can be seen below:


As you can see, the supported PostgreSQL option was significantly more affordable in terms of initial cost.

Annual Cost

The Annual cost incorporated the following two factors:

License Support Costs – the support and maintenance costs for both the Production and Development environments.

Estimated Annual Maintenance Costs – the additional ongoing effort estimated to maintain each DBMS solution compared to the current Sybase solution. This reflects the complexity of the DBMS solution, the available tools for it, and the available support in its user community.

Estimated Annual Maintenance Rating and Rationale:

  • Sybase ASE – 0%: No additional maintenance effort is required because Sybase is currently used.
  • Microsoft SQL Server 2012 Enterprise – 5%: No additional maintenance effort is required because Microsoft has an excellent set of tools and resources and the environment is similar to the Sybase environment currently used. A baseline 5% additional cost has been added to account for the fact that a second DBMS must now be learned and supported.
  • Oracle 11 – 20%: Additional maintenance effort of 15% is estimated due to the complexity and toolset of Oracle, plus the baseline 5% for learning and supporting a second DBMS.
  • PostgreSQL (supported) – 10%: Additional maintenance effort of 5% is estimated due to the simplicity of the PostgreSQL solution and its relatively limited toolset and resources, plus the baseline 5% for learning and supporting a second DBMS.

The results can be seen below:

As you can see, the annual ongoing costs end up being in favour of SQL Server.

The 10-year Total Cost of Ownership (TCO) is:

  • Sybase ASE – $1,173,057.20
  • Microsoft SQL Server 2012 Enterprise – $1,019,909.00
  • Oracle – $4,040,910.00
  • PostgreSQL (supported) – $1,074,025.00

Although SQL Server has an initial license cost, the ongoing support costs are less than supported PostgreSQL in almost all categories.

Conclusion

Over a ten-year period, the TCO for SQL Server is less than that of a supported version of PostgreSQL. This also includes the server licensing costs. As an added benefit, our solution also required a Reporting solution and an Extract, Transform, and Load solution – SQL Server provides both of these solutions bundled with the DBMS at no additional cost.

From a cost perspective, SQL Server is the clear winner.

For our situation, although cost was important, it was not the overriding factor. In fact, cost accounted for only 10% of the entire weight. The factors and their weights are listed below:

Evaluation Criteria and Weights (out of 1000):

  • MBC Required Functionality – 350
    • Cost – 100
    • Future Commercial Viability – 100
    • High Availability – 50
    • Scalability – 50
    • Product Track record/Future stability – 50
  • Technical Functionality – 350
    • Architectural Standards and Connectivity – 50
    • DBMS standard functionality – 100
    • Data Warehouse standard functionality – 150
    • Security/Encryption – 50
  • DBMS Management Functionality – 50
    • Ease of Maintenance/Management – 30
    • Database Configuration – 20
  • People and Future Flexibility – 50
    • Future Migration/Flexibility – 20
    • Standard Technology/People Availability – 30
  • Performance Functionality – 200
    • Bulk Load Performance – 20
    • Bulk Update Performance – 20
    • Bulk Delete Performance – 10
    • Basic Query Performance – 20
    • Cross Product Query Performance – 20
    • Sub Query Performance – 20
    • Parallel Query Performance – 20
    • Analytical Query Performance – 20
    • Indexed Update Performance – 20
    • Indexed Delete Performance – 20
    • Non-Indexed Update Performance – 10
  • Total – 1000

After all the factors were evaluated, SQL Server finished 196 points ahead of PostgreSQL.

SQL Server was selected and I was enlightened. Although people assume Open Source DBMS’s are the least expensive options, SQL Server is the leader once you include other factors. In fact, SQL Server was the leader in both cost and functionality.

SQL Server Integration Services (SQL Server’s ETL solution) and SQL Server Reporting Services (SQL Server’s Reporting solution) were NOT included in this comparison. If they had been, the results would have been even more pronounced.

I encourage you to do your own evaluations. You may be surprised at the results. I know I was.

Please let me know if anyone is interested in specific evaluation criteria. I’d be happy to share our rationale and SQL performance scripts.