Thursday 9 May 2013

Database normalization

Introduction

Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency.
The central concept in this discussion is the notion of Functional Dependency (FDs), which depends on the semantics of the Data and which deals with what Information in a relation is dependent on what other information in the relation.
We perform normalization on database to avoid redundancy and associated problems and anomalies that may occur. The data redundancies yield the following anomalies
1. Update Anomaly If one copy of redundant data is updated, an inconsistency is created unless all copies are similarly updated.
2. Insertion Anomaly It may not be possible to store some information unless some other key information is stored as well.
3. Deletion Anomaly It may not be possible to delete some information without loosing some other related information as well.
Normalization decomposes a Relation into two or more smaller relations. Normalization removes the anomalies in the database, but the data that could be retrieved from one relation will require several relations to be joined after normalization.

Functional Dependency

There are several stages of the normalization process. They are called 1NF, 2NF, 3NF, EKNF, BCNF, 4NF, 5NF, DKNF, 6NF. The higher normal forms is a subset of the lower normal forms; for example, BCNF is always in 3NF as well as in 2NF and 1NF.
First, let me introduce Functional Dependencies FD:
A function or mapping ƒ from set A to a set B, denoted ƒ:A→B, is a correspondence in which to each element x of A (domain) corresponds exactly one element y=f(x) of B (range). We use the notation x-->y, saying x determines y or y is determined by x. This states that, In a legal relation state of R, for any value of x there exists only a single definite value of y, this is known as Single Valued Dependency. Functional Dependencies define constraints on the set of legal relations.
A functional dependency is trivial if it is satisfied by all instances of a relation, this means that A --> B holds, when B is a subset of A;
e.g. FullName --> FirstName, where FullName = FirstName + LastName; or simply X-->X
The Right Hand Side is dependent on the Left Hand Side (determinant) of the FD. Sets of functional dependencies may have redundant dependencies that can be inferred from the others. Intuitively, a canonical cover of F is a minimal set of FD equivalent to F+, having no redundant dependencies or redundant parts of dependencies. Clearly F+ is a superset of F.

all about Keys and non-keys

Super Key is the set of attributes which when taken collectively can uniquely identify a tuple in a relation. This means the super-key has the uniqueness property.
A super-key also having the irreducible property is a candidate key, which implies that any proper-subset of the candidate key no longer remains the candidate key (minimal superkey).
One of the candidate key is chosen by the DBA as the principal key called the Primary key, the rest if any are the alternate keys or secondary keys. Primary key enforces the Entity Integrity rule which states that no component of the primary key is allowed to accept NULL values.
Referential Integrity rule states that the database must not contain any unmatched foreign key. A Foreign key in a base relation R2, is a subset of the set of attributes of R2, say FK, such that:
There exists a base relation R1 not necessarily distinct with a candidate key CK and for every value of FK in the current value of R2 is identical to the value of CK in some tuple in the current value of R1.
Foreign keys may be referenced to any candidate keys either the primary key or unique keys.
The attributes which are a part of a key attribute is called the prime attribute, the rest which are not a part of any key attribute are non-prime attributes.

Determine keys in R with F

We say that an attribute Y is functionally determined by X if X→Y. To test whether a set X is a key, we must compute the set of attributes functionally determined by X. We call the set of all attributes functionally determined by X under a set F of FD the closure of X under F, denoting this by X+. If X+ contains all the attributes in R, or X→R, then X is a superkey in R. The steps in finding the keys in R with a given set of FD is:
Step 1: Draw a dependency graph of F, each vertices corresponds to an attribute, and dependency by directed edges.
Step 2: Identify the set of source-vertices Vn that have no incoming edges, and the set of sink-vertices Vo that have only incoming edges. (Vn≈»indegree=0; Vo≈»outdegree=0)
Step 3: Compute Vn+. A candidate key is a set of attributes that:
1. contains all the attributes in Vn (attributes present only in the LHS of a FD), Vn is the set of the prime attributes.
2. contains no attributes in Vo (attributes present only in the RHS of a FD), Vo is the set of the non-prime attributes.
3. it is the minimal superkey, confirmed by the closure of remaining attributes in R.

First Normal Form 1NF

A relation is in 1NF if all underlying domains contain (indivisible) atomic values only.
There should not be any duplicate rows in the table
Each cell is single-valued (no repeating groups or arrays)
Entries in a column are of the same kind.
All non-key attributes are dependent on the key attributes
Clearly, when after we have defined the key attributes and all the attributes in R are single valued, the relation R is in 1NF

Second Normal Form 2NF

A relation is in 2NF if it is in 1NF and every non-prime (non-key) attribute is fully dependent on the primary key of the relation. Clearly, a table is in 2NF if it satisfies 1NF and has no partial dependencies.

In this context, fully dependent means that, no proper-subset of the determinant can any longer determine the RHS of a FD; partial dependencies mean, any proper-subset of the determinant may determine the RHS of the FD.
Again, if all the keys are single valued, then a relation which is in 1NF is also in 2NF. A Relation which is in 1NF is not in 2NF only if there exists a functional dependency of the form AB-->C, where either A-->C or B-->C holds (a partial dependency exists), having AB the composite key, and C the non-prime attribute.

Third Normal Form 3NF

A relation R is in 3NF if R is in 2NF and every non-key attribute of R is non-transitively dependent on each candidate key of R. All non-prime attributes in R are required to be directly dependent on each candidate key of the relation.
A relation is in 3NF only if whenever a nontrivial FD A-->B holds in R, either A is a candidate key or B is a prime attribute(member of a candidate key) of R.
A relation R which is in 2NF is not in 3NF only if there exists a non-key attribute which determines another non-key attribute in R.

Boyce Codd Normal Form BCNF

A relation is in BCNF if it is in 3NF and every determinant (LHS of a FD) is a candidate key. A relational schema R is in BCNF if whenever a nontrivial functional dependency holds in R, then the determinant must be a key of R. BCNF does not allow a non-key attribute to functionally determine a prime attribute which was relaxed in 3NF.
A relation which is in 3NF is not in BCNF only if all the following conditions hold:
1. there are more than one candidate keys
2. the keys are composite and overlapping
3. there exists a FD from a non-overlapping attribute of one key to the non overlapping attribute of another key.

MultiValued Dependencies and 4NF

We have worked with single valued dependency; while Multivalued Dependency A-->>B defines a relationship in which a definite set of values of attribute B are determined by a single value of A. This means for any value of A instead of having a single value of B we have a definite set of values for B. Functional Dependencies sometimes are referred to as equality-generating dependencies, and Multivalued Dependencies are referred to as tuple-generating dependencies.
More formally, a MVD X-->>Y is called trivial if either Y is a subset of X or X union Y equals R. It is to be noted that every FD is also an MVD, since A-->B implies A-->{B,Φ}.
In 4NF we consider the non-trivial MVD; A relation R to be in 4NF, if whenever a MVD holds X-->>Y then either the dependency is trivial or X is a candidate key for R.
A relation which is in BCNF is not in 4NF, only if there exists 2 or more independent multivalued facts (MVD ≈ one-to-many or many-to-many relationships) about an entity. So, the relation has 3 or more attributes in R, based on the non-trivial multivalued dependency X-->>Y (X multidetermines Y) we decompose R into R1=(X U Y) and R2=(R-Y).

Fifth Normal Form 5NF

5NF sometimes called projection-join normal form (PJNF) where every non-trivial join dependency in the table is implied by the superkeys of the table.
Projection is the process of separating one relation into sub-relations and Join is the process of consolidating these sub-relations back into one relation. Sometimes, this join and projection operation produces spurious tuples because of join dependencies, we get more not less. The key objective of 5NF is to remove the join dependencies.
A Join Dependency JD denoted as JD(R1,R2) is equivalent to the MVD R1∩R2-->>(R1-R2) symmetrically R1∩R2-->>(R2-R1). Conversely, every MVD A-->>B is also a JD *[A U (R-B), A U B].
In 5NF we consider the non-additive join decomposition making sure that Lossless-Join decomposition can be achieved. 5NF does not differ from 4NF unless there exists a symmetric constraint such as a rule about the correspondence between the attributes in R (cyclic dependencies).

Domain/key normal form and 6NF

Domain/Key normal form requires every constraint on the table is a logical consequence of the table's domain constraints and key constraints. DKNF enforces database constraints by checking that each attribute in a tuple is of the appropriate domain and the key constraints are enforced.

DKNF defined by Ronald Fagin, theoretically known as the ultimate normal form; later C.J. Date with Hugh Darwen, and Nikos Lorentzos defined the Sixth Normal form which requires that the Table features no non-trivial join dependencies at all (with reference to generalized join operator). This implies that a Relvar R of degree n is in 6NF, iff it is in 5NF and it has no key of degree less than n-1.

Goals of Normalization

Let R be a relation schema with a set F of functional dependencies.
  • each relation scheme should be in a good form.
  • the decomposition is a lossless-join decomposition.
  • Preferably, the decomposition is dependency preserving.
A relation R is said to be lossless decomposition into R1 and R2, if only the natural-join of these sub relations produce the original relation without creating additional spurious tuples.
Let, R be decomposed into R1 and R2 (binary decomposition),
If either R1 ∩ R2 → R1 or R1 ∩ R2 → R2 holds then
the decomposition is loss-less else it is undesired lossy.
It is not necessary nor suggested to normalize our database to the highest normal form possible, I consider 3NF sufficient for most database systems. Normalization has the disadvantages that it may require complex join to query the database and make the programmer task harder in extra coding and effort. Although, we have a solution of using materialized view, we may want to use non-normalized schema for performance. The main idea of normalization is to decompose the relation into a number of single-theme tables removing the database anomalies and problems of data redundancy.

No comments:

Post a Comment