Good Form, Peter Pan!

Corny title of this section aside, database design is often a subject that developers gloss over. And because developers really cannot get by without utilizing databases, I’m writing this guide for them. It’s not even exclusive to any specific relational database management system (RDBMS); though, it must be an RDBMS- NoSQL solutions don’t fall under these guidelines by definition. But what I’m going to discuss here are the bare minimums that you should understand as a developer.

Database design is a very dense subject. It contains both theoretical and practical subjects. Regardless of how complicated the subject may be, a poorly designed database can bring even the best database-driven application to its knees. Unfortunately, developers never spend enough time in the design phase. Many developers don’t even recognize that they should invest more time in writing the database structure. Never mind understanding what to look for. So that said, I intend to water down the more important subjects.

And what are those subjects? Well, of course, they’re the six forms of database normalization! I’ll only be discussing the first three in this guide though. And of the six I’m going to invest the most amount of my energy into forms one through three. Forms four, five and six are most certainly important but if all you ever become familiar with are the first three, then you’re okay.

Normalizing the Abnormal

What is database normalization anyways? It’s the process of ensuring efficiency within your database. That’s an awful generic statement though. So to be more specific: it’s the process of ensuring there are no unnecessary redundancies or relationships (specifically dependent relationships) within the database. When writing code we adhere to principles such as KISS, DRY and refactoring to ensure we aren’t writing poor, spaghetti-like code. Well, we normalize our databases to ensure we don’t have a spaghetti-like structure.

Let’s start looking at normalization examples. It’ll make more sense soon.

I do want to note that the definitions I am going into here adhere those written by Edgar Codd. Codd is the father of relational databases, so his word is worth a few seconds of thought (in my opinion!). That said- Chris Date (another prominent thinker on relational theory) has some contradicting definitions to Codd’s. I may come back and introduce Date’s at a later… date (no pun intended). But for now, we’ll stick to the tried & proven definitions by Codd.

First Normal Form – No More Multivalue Attributes!

First normal form (1NF for short) concerns itself primarily with ensuring that fields contain only one piece of data. In other words- no serializing an array and slapping it into a single field to hold all of the data you want. Or worse yet- putting your data in as a concatenated string and then splitting it apart in the application.

Mr. Database Engineer receives a phone call one sunny morning from the Ms. HR Manager. The phone call went a little like this: “We need to know the dependents of our employees… when they drop dead on the job we need to know who to cut a check out too!”. So Mr. Database Engineer goes to the drawing board and comes out with this:

normalization_tut_01.png

Why is this a poor idea? Well, what happens when you have to add more dependents? You don’t just insert a new record. Instead, you need to retrieve the current list of dependents and modify the value.

The other half of the definition behind 1NF is that all values must be atomic. To be atomic means to be as simplified as possible. In other words- data is not atomic if it can be broken down. Can we break down the dependents list in the above relation? I would say so.

How do we normalize this relation according to 1NF? Well, we know we cannot have the dependents in the same relation as the employee’s data. After all, we need multiple dependents and cannot have multiple employee records (imagine managing that!). So what can we do? How about we remove the dependents field from the tuple entirely, and bring it into a relation of its own?

normalization_tut_02.png

What we see above is now compliant to first normal form. We cannot further simplify the data.

Second Normal Form – No Partial Dependencies

Second normal form (2NF) states: no attribute within a relation should be dependent on only part of the candidate key.

What’s a candidate key? It’s the set of attributes that make a tuple unique within a relation. In our employee relation above, first_name is not a candidate key because we’ll almost certainly have more than one John working for us eventually. last_name[/b] cannot be a feasible candidate key either, as we may have more than one Mr. or Ms. Doe working for us. But, [i]{first_name, last_name} may just make up a viable candidate key. In a real-world scenario, we would have a unique primary key (an employee ID) within this tuple.

Let’s assume Mr. Database Engineer at our pretend company failed to read my last section in its entirety. And because of that, his employees relation looks like this:

normalization_tut_03.png

Technically this complies to 1NF. Congratulations, Mr. Database Engineer. But now Ms. HR Manager is calling him saying: “Hey! We need to know where our employees live!”. Mr. Database Engineer then returns with this new structure:

normalization_tut_04.png

It’s compliant to 1NF and he’s now showing the employee’s address. What’s the problem here? Well- there’s a whole lot of redundancy- that’s the problem! Remember what I said earlier about what it means to normalize?

To make this mess compliant to 2NF we cannot have the employee’s address within all of those tuples. The address is pertinent only to the tuple’s candidate key ({first_name, last_name}). It has nothing to do with the other attributes. We need to simplify this.

normalization_tut_05.png

Oh look- Mr. Database Engineer now realizes if he had read my entire section on 1NF he could have avoided this!

Third Normal Form – It’s All About the Key

The third normal form requires that all attributes within a relation pertain strictly to the key. An attribute may not relate to another attribute that is not the key.

Mrs. CIO wants to list all of the office branches and their site directors, with a photo, on the website. She asks Mr. Database Engineer to make the storage of this data possible. So he whips this up:

normalization_tut_06.png

The city, state and zip code make up the candidate key. Address and/or director can make a strong case for inclusion in the candidate key due to the possibility of multiple branches on one campus (who knows why). But directors would presumably come and go- the address doesn’t. Thus it’s more fitting.

With a candidate key of {city, state, zip, address} and trying to achieve 3NF, our other attributes can ONLY describe that key alone. Every branch has a director- so we’re okay there. But what is the photo attribute describing? If we removed the director, it would mean nothing to the candidate key. Thus we can say it’s a transient relationship. The photo attribute only describes the candidate key through the director’s name attribute. Transient is not allowed. If it’s not directly describing the candidate key then it’s a no go.

What could we do to fix this? One could remove the photo attribute from the relation. Would that comply with 3NF? I think it would…

normalization_tut_07.png

Everything in the branches relation (top) pertains to the candidate key. And everything within the directors_photos relation is compliant to 3NF.

Bravo, we’ve made our way from a horrific mess to a normalized database. We’ve removed unnecessary redundancies and improved manageability!

Why Should I Care? What I Have Now Works Just Fine…

You, like most developers, have problem written plenty of database schemas that were a far cry from normalized. And they worked just fine. Well- you made them work just fine anyways. Let’s recap and expand on why you should care:

1. Redundancy

As stated, the goal of normalization is to reduce, if not eliminate, redundancy. There are plenty of good times for redundancy in IT and engineering. You want redundant power supplies in your storage application. And for that same storage application to be running on, for example, a RAID 5 so you have [S|H]DD redundancy; you can lose a drive and not lose any data. However, you do not want the same pieces of data to be thrown around your database recklessly. Why?

2. Maintainability

The answer to #1 and a reason all on its own is maintainability. When you have data just thrown around haphazardly you run the risk of needing to update multiple relations when changing just one thing. When Mr. Database Engineer screwed up and half-implemented my advice for accomplishing 1NF he generated a whole lot of redundancy. If there had been a mistake with the employee’s name, it would have had to be updated in many places.

3. Efficiency and Portability

These two go hand in hand, and really just build on the last two. Efficiency here ought to be self-explanatory. Less redundancy means less data. Less data typically means faster read-write actions. You also gain more grounds for indexing when your relations are horizontal and not vertical (aka, more but smaller relations vs. less but larger). Indexing is the greatest form of efficiency this side of the Mississippi.

Portability plays very nicely with efficiency. As a consequence of being efficient, your data is going to be more accessible via other mediums. Perhaps you’re going to take your application and expand it into an iOS app. If you aren’t relying on the application code to make up for all of the mistakes in the database, you can more easily work with that database. But if your database is riffled with inefficiencies due to not being normalized?

Well, there’s a good chance your iOS app is now going to have to support a lot of “fix it” code. Nobody likes “fix it” code.

Final Words

I started this tutorial some time ago (… as in months ago), but never got around to wrapping it up. So you may notice a disconnect in how the tutorial is written. I apologize for that.

There’s a glaring simplification of terminology and concepts within this tutorial. Unfortunately, normalization is difficult to explain on a solid middle ground. Either I word things completely accurately and have to write an entire dictionary to coexist with the tutorial, or I simplify and ensure maximum readability. I opted for the latter. I’ve tried to not let my simplification come in the way of accuracy. But, in an effort to do so, I may have inadvertently slipped up. Should you notice such an instance let me know and I will reword that portion.

Either way- I hope this serves its purpose as an introduction into database normalization.

Relational Database Design for Developers (Database Normalization)