12. Normalization

Normalization — Designing Tables That Do Not Fight Back

So far we have learned how to create tables, how to query them, and how to protect them with constraints. But there is one question we have quietly skipped: which columns should go into which table in the first place?

You can store the same information in many different ways. Some designs work fine on day one and quietly fall apart three months later. Normalization is the set of rules that separates the designs that age well from the ones that do not. The rules have formal names and mathematical foundations, but the underlying idea is straightforward: every independent fact belongs in exactly one place.

Let’s start by looking at how things go wrong.

A “Broken” Database to Begin With

Imagine our school wants to keep track of who takes which subject, who teaches it, and what grade the student got. Someone (maybe in a hurry) decides to put everything into one big table. It looks innocent enough:

student_id	student_name	student_phone	class_name	class_room	course_id	course_title	teacher_name	teacher_kuerzel	grade
1	Anna Gruber	0664-111, 0676-222	3AHITM	R204	101	Databases	Mr. Maurhart	MAO	2
1	Anna Gruber	0664-111, 0676-222	3AHITM	R204	102	Networking	Ms. Kender	KEN	1
2	Ben Hofer	0699-333	3AHITM	R204	101	Databases	Mr. Maurhart	MAO	3
3	Cara Wimmer	0660-444	4AHITM	R110	101	Databasse	Mr. Maurhart	MAO	2

At first glance this table answers every question we have. But look more closely — it is already falling apart:

It repeats itself. Class 3AHITM and its room “R204” appear three times. “Mr. Maurhart” and his Kürzel “MAO” appear three times. All of that is stored once per enrollment, even though none of it has anything to do with any individual grade.
It contradicts itself. Row 4 says the course is called “Databasse”. Is that a typo, or a different course? The table cannot tell you, because the truth is scattered across many rows instead of being stored in exactly one place.
It mixes several things into one cell. student_phone holds two numbers separated by a comma. Now try to write a query that finds everyone reachable on a 0676 number — you can’t do it cleanly, because the database sees one long string, not two phone numbers.

This single table is trying to be a student list, a class directory, a course catalog, and a gradebook all at once. Whenever you force unrelated facts to live in the same row, you create the problems we look at next.

The Anomalies

The problems that come from badly designed tables have a name: anomalies. An anomaly is something that goes wrong when you try to insert, update, or delete data. There are three classic kinds, and they are worth knowing by name because they show up again and again.

Insertion anomaly — you cannot add a piece of information because some unrelated information is missing.
Update anomaly — to change one fact, you have to change it in many rows, and if you miss one, the data becomes inconsistent.
Deletion anomaly — deleting one thing accidentally destroys another thing you wanted to keep.

Let’s see each of them in our broken table, with two examples each.

Insertion Anomaly

You want to add a fact, but the table forces you to also know facts that have nothing to do with it.

Example 1 — A new course with no students yet. The school adds a new subject “Cyber Security”, taught by Ms. Schmidt. You want to record it now, before any class has been assigned to it. But every row needs a student_id and a grade. With this table you simply cannot store a course that has no students — even though the course clearly exists.
Example 2 — A new student not yet assigned to a class. A student transfers in mid-year. The admin has not yet decided which class they belong to, so there is no class_name or class_room to write down. There is also no course_id to fill in yet. With this table, the student cannot exist in the database until every other column is known too.

Update Anomaly

A single fact is stored in many rows. To change it, you must find and fix every copy.

Example 1 — The class moves to a different room. 3AHITM is relocated from R204 to R307 because of a renovation. Their room is written in two different rows (once for Anna, once for Ben). You must update both. Forget one, and the database now claims 3AHITM occupies two rooms at the same time.
Example 2 — Fixing the course title. You notice “Databasse” is a typo and want to correct it to “Databases”. But the course title is duplicated across multiple rows. Fix one and miss another, and the title is still inconsistent.

Deletion Anomaly

You delete one thing, and an unrelated thing disappears with it.

Example 1 — A student drops a subject. Cara is the only student enrolled in a particular course. She drops it, so you delete her row. With that single deletion, all knowledge of that course and its teacher is gone too — even though the course continues next semester.
Example 2 — Cleaning up at the end of the year. You delete the rows of students who have graduated. If a teacher only ever appeared next to those students, that teacher’s name and Kürzel quietly vanish from the database — even though the teacher is still on staff.

What Is Normalization?

Normalization is the process of organizing your columns into well-chosen tables so that each fact is stored only once, and the anomalies above cannot happen.

We do this in stages called normal forms (NF). Each normal form is a small checklist. You start with a messy table, apply the rule for the First Normal Form (1NF), then the Second Normal Form (2NF), then the Third Normal Form (3NF). The forms build on each other: a table cannot be in 2NF unless it is already in 1NF, and so on.

There are higher forms too (BCNF, 4NF, 5NF), but for almost all real-world database design getting to 3NF is the goal. In practice, “the database is normalized” usually means “it is in 3NF”.

We will not throw all the definitions at you up front. Instead, we introduce each definition exactly when we need it — that is how the inventors of these rules thought about them, and it is much easier to remember.

First Normal Form (1NF) — One Value per Cell

The very first rule is the simplest:

1NF: Every cell holds a single, atomic value, and there are no repeating groups. On top of that, every row must be uniquely identifiable (so the table needs a primary key).

“Atomic” just means cannot sensibly be split further for the way you use it. A phone number is atomic. A list of two phone numbers crammed into one cell is not.

Two things in our table break 1NF:

student_phone contains 0664-111, 0676-222 — two values in one cell.
There is no proper primary key yet that guarantees each row is unique.

To fix the multi-value problem, we pull the phone numbers out into their own rows, in their own table. One phone number per row:

PostgreSQL
MariaDB / MySQL

CREATE TABLE student_phones (
  student_id INTEGER     NOT NULL,
  phone      VARCHAR(30) NOT NULL,
  PRIMARY KEY (student_id, phone)
);

CREATE TABLE student_phones (
  student_id INT         NOT NULL,
  phone      VARCHAR(30) NOT NULL,
  PRIMARY KEY (student_id, phone)
);

Now a student can have one, two, or ten phone numbers — each one is a clean, separate row you can query and filter:

-- Find every student reachable on a 0676 number. Easy now:
SELECT student_id
FROM   student_phones
WHERE  phone LIKE '0676%';

For the rest of the data, we put it into one big table that does have a real primary key. Since a single student_id is not unique (Anna appears twice) and a single course_id is not unique either, the thing that uniquely identifies a row is the combination of both:

PostgreSQL
MariaDB / MySQL

CREATE TABLE enrollments (
  student_id      INTEGER      NOT NULL,
  student_name    VARCHAR(100) NOT NULL,
  class_name      VARCHAR(10)  NOT NULL,
  class_room      VARCHAR(10)  NOT NULL,
  course_id       INTEGER      NOT NULL,
  course_title    VARCHAR(100) NOT NULL,
  teacher_name    VARCHAR(100) NOT NULL,
  teacher_kuerzel VARCHAR(10)  NOT NULL,
  grade           SMALLINT,
  PRIMARY KEY (student_id, course_id)
);

CREATE TABLE enrollments (
  student_id      INT          NOT NULL,
  student_name    VARCHAR(100) NOT NULL,
  class_name      VARCHAR(10)  NOT NULL,
  class_room      VARCHAR(10)  NOT NULL,
  course_id       INT          NOT NULL,
  course_title    VARCHAR(100) NOT NULL,
  teacher_name    VARCHAR(100) NOT NULL,
  teacher_kuerzel VARCHAR(10)  NOT NULL,
  grade           SMALLINT,
  PRIMARY KEY (student_id, course_id)
);

This table is now in 1NF: every cell is atomic, and (student_id, course_id) uniquely identifies each row. Good — but it is still full of repetition. “Mr. Maurhart”, “MAO” and “R204” still appear multiple times. 1NF made the data tidy, not non-redundant. That’s the job of the next step.

Second Normal Form (2NF) — Depend on the Whole Key

To talk about 2NF we need one new idea: the functional dependency.

Now the rule:

2NF: The table is in 1NF, and every non-key column depends on the whole primary key — not just on part of it.

This rule only bites when the primary key has more than one column, like our (student_id, course_id). Let’s check each non-key column and ask “what does it really depend on?”:

student_name, class_name, class_room → depend on student_id only. (The course has nothing to do with which class a student is in.)
course_title, teacher_name, teacher_kuerzel → depend on course_id only.
grade → depends on both student_id and course_id. (A grade is a student’s result in a specific course.)

The columns that depend on only part of the key are called partial dependencies, and they are exactly what 2NF forbids. The fix is to give each “part” its own table:

PostgreSQL
MariaDB / MySQL

-- Facts that depend on the student only:
CREATE TABLE students (
  student_id   SERIAL       PRIMARY KEY,
  student_name VARCHAR(100) NOT NULL,
  class_name   VARCHAR(10)  NOT NULL,
  class_room   VARCHAR(10)  NOT NULL
);

-- Facts that depend on the course only:
CREATE TABLE courses (
  course_id       SERIAL       PRIMARY KEY,
  course_title    VARCHAR(100) NOT NULL,
  teacher_name    VARCHAR(100) NOT NULL,
  teacher_kuerzel VARCHAR(10)  NOT NULL
);

-- Facts that depend on BOTH (the enrollment itself):
CREATE TABLE enrollments (
  student_id INTEGER  NOT NULL REFERENCES students(student_id),
  course_id  INTEGER  NOT NULL REFERENCES courses(course_id),
  grade      SMALLINT,
  PRIMARY KEY (student_id, course_id)
);

-- Facts that depend on the student only:
CREATE TABLE students (
  student_id   INT          AUTO_INCREMENT PRIMARY KEY,
  student_name VARCHAR(100) NOT NULL,
  class_name   VARCHAR(10)  NOT NULL,
  class_room   VARCHAR(10)  NOT NULL
);

-- Facts that depend on the course only:
CREATE TABLE courses (
  course_id       INT          AUTO_INCREMENT PRIMARY KEY,
  course_title    VARCHAR(100) NOT NULL,
  teacher_name    VARCHAR(100) NOT NULL,
  teacher_kuerzel VARCHAR(10)  NOT NULL
);

-- Facts that depend on BOTH (the enrollment itself):
CREATE TABLE enrollments (
  student_id INT      NOT NULL,
  course_id  INT      NOT NULL,
  grade      SMALLINT,
  PRIMARY KEY (student_id, course_id),
  FOREIGN KEY (student_id) REFERENCES students(student_id),
  FOREIGN KEY (course_id)  REFERENCES courses(course_id)
);

Look what we just won. A course now exists as one row in courses, completely independent of whether any student takes it. That kills the insertion anomaly (we can add a course before assigning it to any class) and the deletion anomaly (removing the last student no longer erases the course). And “Mr. Maurhart” is written exactly once — per course, not per enrollment.

But we are not done. Look at both intermediate tables now:

student_id	student_name	class_name	class_room
1	Anna Gruber	3AHITM	R204
2	Ben Hofer	3AHITM	R204
3	Cara Wimmer	4AHITM	R110

course_id	course_title	teacher_name	teacher_kuerzel
101	Databases	Mr. Maurhart	MAO
102	Networking	Ms. Kender	KEN

If 3AHITM has thirty students, “R204” appears thirty times in students. If Mr. Maurhart teaches five courses, “MAO” appears five times in courses. The repetition is back, just in different tables. Why? Because class_room doesn’t really depend on the student — it depends on the class. And teacher_kuerzel doesn’t depend on the course — it depends on the teacher. That’s exactly what 3NF solves.

Third Normal Form (3NF) — No Detours Through Other Columns

One more definition, and it is the one most students trip over — so let’s make it concrete first.

The rule:

3NF: The table is in 2NF, and no non-key column depends on another non-key column. Every non-key column must depend on the key, the whole key, and nothing but the key.

Both tables above break 3NF. The fix is the same in both cases — give the “middle” entity its own table:

PostgreSQL
MariaDB / MySQL

-- The class gets its own table; the room belongs to the class, not the student:
CREATE TABLE classes (
  class_name VARCHAR(10) PRIMARY KEY,
  class_room VARCHAR(10) NOT NULL
);

CREATE TABLE students (
  student_id   SERIAL       PRIMARY KEY,
  student_name VARCHAR(100) NOT NULL,
  class_name   VARCHAR(10)  NOT NULL REFERENCES classes(class_name)
);

-- The teacher gets their own table; the Kürzel belongs to the teacher, not the course:
CREATE TABLE teachers (
  teacher_id      SERIAL       PRIMARY KEY,
  teacher_name    VARCHAR(100) NOT NULL,
  teacher_kuerzel VARCHAR(10)  NOT NULL UNIQUE
);

CREATE TABLE courses (
  course_id    SERIAL       PRIMARY KEY,
  course_title VARCHAR(100) NOT NULL,
  teacher_id   INTEGER      NOT NULL REFERENCES teachers(teacher_id)
);

-- The class gets its own table; the room belongs to the class, not the student:
CREATE TABLE classes (
  class_name VARCHAR(10) PRIMARY KEY,
  class_room VARCHAR(10) NOT NULL
);

CREATE TABLE students (
  student_id   INT          AUTO_INCREMENT PRIMARY KEY,
  student_name VARCHAR(100) NOT NULL,
  class_name   VARCHAR(10)  NOT NULL,
  FOREIGN KEY (class_name) REFERENCES classes(class_name)
);

-- The teacher gets their own table; the Kürzel belongs to the teacher, not the course:
CREATE TABLE teachers (
  teacher_id      INT          AUTO_INCREMENT PRIMARY KEY,
  teacher_name    VARCHAR(100) NOT NULL,
  teacher_kuerzel VARCHAR(10)  NOT NULL UNIQUE
);

CREATE TABLE courses (
  course_id    INT          AUTO_INCREMENT PRIMARY KEY,
  course_title VARCHAR(100) NOT NULL,
  teacher_id   INT          NOT NULL,
  FOREIGN KEY (teacher_id) REFERENCES teachers(teacher_id)
);

Now 3AHITM’s room is stored in exactly one place: one row in classes. When the class moves to a different room, you change a single value and every student in 3AHITM is instantly correct. And Mr. Maurhart’s Kürzel lives in one row in teachers — change it once and every course he teaches reflects it. Both update anomalies are gone.

Here is the complete design after reaching 3NF. Five small tables, each about one clear thing:

classes — which classes exist and where they sit. (class_name, class_room)
students — who the students are and which class they belong to. (student_id, student_name, class_name)
teachers — who the teachers are and their official Kürzel. (teacher_id, teacher_name, teacher_kuerzel)
courses — which subjects exist and who teaches them. (course_id, course_title, teacher_id)
enrollments — which student takes which course, and the resulting grade. (student_id, course_id, grade)

To get the original wide view back, you simply JOIN the pieces together when you need them:

SELECT s.student_name,
       s.class_name,
       cl.class_room,
       c.course_title,
       t.teacher_name,
       t.teacher_kuerzel,
       e.grade
FROM   enrollments e
JOIN   students s  ON s.student_id  = e.student_id
JOIN   classes  cl ON cl.class_name = s.class_name
JOIN   courses  c  ON c.course_id   = e.course_id
JOIN   teachers t  ON t.teacher_id  = c.teacher_id;

The data is stored without redundancy, and the convenient “everything in one row” view is just one query away. That is the whole trade normalization offers: store it clean, assemble it on demand.

A Quick Word on the Higher Normal Forms

You will hear about BCNF (a stricter version of 3NF), 4NF, and 5NF. They deal with rarer situations involving overlapping candidate keys and complex many-to-many relationships. They are good to have heard of, but in everyday database design they almost never change what you would do after reaching 3NF. For this course — and for most professional work — 3NF is where you aim.

Can You Normalize Too Much?

Yes. And this is the part many textbooks forget to mention.

Every time you split a table to reach a higher normal form, you create another table that must be joined back at query time. A perfectly normalized design can spread a single logical “thing” across six or seven tables. Reading it back then means a six-table JOIN — and joins cost time. On a busy system with millions of rows, that cost is real.

Normalization optimizes for writing (no duplicated facts, no anomalies). But many systems spend most of their life reading. There are situations where deliberately keeping some redundancy makes the system faster and simpler:

Reporting and analytics. A dashboard that adds up last month’s sales doesn’t want to join ten tables for every chart. Data warehouses often use denormalized “star schemas” on purpose.
Expensive repeated joins. If 95% of your queries always join the same five tables, sometimes it is worth storing a pre-joined copy.
Values that are cheap to copy and never change. Storing the country name next to its code may technically be redundant, but it can save a join for almost no risk.

Deliberately stepping back from full normalization is called denormalization. The key word is deliberately: you denormalize after you understand the normalized design and have a measured reason to deviate — never because you were too lazy to split the table in the first place.

Summary

Normal form	The question it answers	The fix
1NF	Does every cell hold a single atomic value, and is every row uniquely identifiable?	Split multi-valued cells into separate rows; define a primary key.
2NF	Does every non-key column depend on the whole primary key?	Move columns that depend on only part of a composite key into their own table.
3NF	Does any non-key column depend on another non-key column?	Move the “detour” columns (transitive dependencies) into their own table.

The big ideas to take with you:

Anomalies (insertion, update, deletion) are the symptoms; redundant, tangled data is the disease.
Normalization cures it by storing every fact exactly once, one table per “thing”.
3NF is the practical target for almost all designs.
Denormalization is a conscious, measured exception for read-heavy systems — not an excuse to skip the thinking.

Design clean first. Optimize later, on purpose, with numbers in hand.

Learning Outcomes: What you should be able to do after this chapter

After completing this chapter, students should be able to:

Name: name the insertion, update and deletion anomalies as well as the normal forms 1NF–3NF.
Explain: explain which problem each normal form solves (atomicity, full functional dependency, transitivity).
Apply: transform a given table step by step into 1NF, 2NF and 3NF.
Analyze: analyze functional dependencies and redundancies in a schema.
Evaluate: weigh up when denormalization is justified.