Sunday, June 22, 2014

Database Normalization

Next to Codd's Rules, this is the concept a DBA should be aware. A database is normalized to avoid redundancy of data. For this, large tables are broken into small tables and relationships are defined among them. This avoid anomalies when making additions, deletions and updates to the data.

Edgar F. Codd defined normalization forms in database which was later extended by Raymond F. Boyce. Codd specified three normal forms namely First Normal Form (1NF), Second Normal Form (2NF) and Third Normal Form (3NF). Later Codd and Boyce specified Boyce-Codd Normal Form which is the next level of 3NF.

There are several normal forms defined,

    1) First Normal Form (1NF)
    2) Second Normal Form (2NF)
    3) Third Normal Form (3NF)
    4) Fourth Normal Form (4NF)
    5) Fifth Normal Form (5NF)
    6) Sixth Normal Form (6NF)
    7) Boyce-Codd Normal Form (BCNF)
    8) Elementary Key Normal Form (EKNF)
    9) Domain/Key Normal Form (DKNF)

Among the above normal forms, 1NF, 2NF, 3NF and BCNF are the most important. Most of the organizations normalize their database up to BCNF. The remaining are the higher levels of normalization. If you reach 3NF then your database is said to be normalized. In this post, we will discuss about 1NF, 2NF, 3NF and BCNF.

Before that let's have a glance at the technical terms used in Normalization concept,

Tuple - A tuple is a row or a record of the table.

Candidate Key - A candidate key is a primary key defined on single column of the table.

Super Key - A composite primary key is called a Super Key.

First Normal Form (1NF)


According to Codd, a table is said to be in 1NF if each attribute contains atomic values. This means each column of the table should contain only one value. Data in the table should be organized into rows. When data is fetched from the table, there shouldn't be multiple rows displaying same data from the set of columns. There should exist a Primary Key which distinguishes rows as unique among them. This Primary Key column shouldn't have duplicate values.

Consider a scenario where there is a table that shows the information about film actors acted in different films as below,

Actor Code
Name
Film
JD
Johnny Depp
Pirates of Carribbean, Blow
GC
George Clooney
Gravity, Oceans Eleven

The above table shows the information about actors acted in various films. It violates 1NF as it contains multiple values in a row for single column. If that has to be organized then it should be as follows,

Actor Code
Actor
Film
JD
Johnny Depp
Pirates of Carribbean
JD
Johnny Depp
Blow
GC
George Clooney
Gravity
GC
George Clooney
Oceans Eleven

Now the table is organized in such a way that in each row there is only one value for a column. But this still violates 1NF in the aspect of Primary Key. In the above table, the column Actor Code can be considered as Primary Key as it remains unique though there exists another actor with the same name, he will be given a different code. 1NF states that column defined as Primary Key has unique columns. To have that uniqueness, the above table can be broken into two tables, one being Actor and the other being Films as below,

                               Actor
Actor Code (PK)
Name
 JD
Johnny Depp
GC
George Clooney







                           Films
Actor Code (FK)
Film
JD
Pirates of Caribbean
JD
Blow
GC
Gravity
GC
Oceans Eleven










Now there are two tables namely Actor and Films which are in 1NF. In the table Actor, Actor Code is the primary key which is a Foreign Key in the table Films.

Second Normal Form (2NF)


According to Codd, a table is in 2NF if it is already in 1NF and no non-prime attribute is dependent on any proper subset of the candidate key. In database, multiple columns can be concatenated as a single primary key. Codd's statement means that every column other than the columns defined in primary key must be dependent on all the columns of primary key.

If there is a single Primary Key then there is no violation of 2NF. If there is a composite primary key then the non-prime attributes should be in such a way that they should depend on the entire composite primary key. i.e., on all the columns used in the primary key.

Consider the following table,

Actor Code (PK)
Film (PK)
Actor
Genre
JD
Blow
Johnny Depp
Crime
GC
Gravity
George Clooney
Sci-Fi
BP
Fight Club
Brad Pitt
Drama

Here there is a composite primary key containing Actor Code and Film columns. But the remaining columns Actor and Genre do not depend on the entire primary key. Actor depends only on Actor Code whereas Genre depends only on Film. This states that there is a partial dependency of non-prime attributes on the primary key which violates 2NF. To have this table in 2NF, it can be split into two with names Actor as one table and Film as another table between whom there can be a relation.

              Actor                                         Films

Actor Code (PK)
Actor
JD
Johnny Depp
GC
George Clooney
BP
Brad Pitt
Film (PK)
Genre
Blow
Crime
Gravity
Sci-Fi
Fight Club
Drama


Now both tables are in 2NF. A relation can be established between them by either taking Film as Foreign Key in Actor table or Actor Code as Foreign key in table Film.

Third Normal Form (3NF)


This is the next level of 2NF. Codd stated that, a table is said to be in 3NF when it is already in 2NF and there exists no transitive dependency among the attributes of table. This means there is no dependency among columns of table such that Column2 is dependent on Column1 and Column3 is dependent on Column2. So, here Column3 is dependent on Column1 through Column2.

Consider the following table which shows the information about Academy Award winners for different years,

Category (PK)
Year (PK)
Winner
Date of Birth
Best Picture
1992
Clint Eastwood
May 31, 1930
Best Actor
1993
Tom Hanks
July 9, 1956
Best Director
1993
Steven Spielberg
December 18, 1946

Assume there is a composite primary key combining Category and Year columns. There exists a Transitive Dependency violating 3NF. The Column Winner is dependent on primary key columns whereas Date of Birth is dependent on Winner. So, Date of Birth is dependent on primary key columns through the column Winner where violation occurs. To have the table in 3NF, it can be split into two as Dates of Birth and Oscar Winners. There can be a foreign key relationship between the two tables as below,

      Dates of Birth
Winner (PK)
Date of Birth
Clint Eastwood
May 31, 1930
Tom Hanks
July 9, 1956
Steven Spielberg
December 18, 1946

                                        Oscar Winners
Category (PK)
Year (PK)
Winner (FK)
  Best Picture
1992
Clint Eastwood
Best Actor
1993
Tom Hanks
Best Director
1993
Steven Spielberg

Now transitive dependency is removed and the tables are in 3NF.

Boyce - Codd Normal Form (BCNF)


When your database tables reach 3NF it means that your database is normalized. BCNF is a higher level than 3NF which reduces more anomalies in your database tables.

According to Boyce and Codd, a table is said to be in BCNF if it is already in 3NF and there is no overlapping of multiple candidate keys. 

Consider the case of a person's information. A person can be identified by his Driving License or by his Passport or PAN. BCNF states that every attribute should describe every part of the Super key as the above three keys are combined as Super Key. If an attribute describes one of the candidate keys but not another candidate key then it violates BCNF. This means, if a person has Driving License, Passport and PAN then entity is in BCNF. If he doesn't have any of the above then the entity is not in BCNF.

This is all about important normal forms to be aware of. If you want to acquire knowledge on other specified normal forms, follow the following link,

http://en.wikipedia.org/wiki/Database_normalization

No comments:

Post a Comment