12. Normalization
Normalization — Designing Tables That Do Not Fight Back
Section titled “Normalization — Designing Tables That Do Not Fight Back”So far we have learned how to create tables, how to query them, and how to protect them with constraints. But there is one question we have quietly skipped: which columns should go into which table in the first place?
You can store the same information in many different ways. Some designs work fine on day one and quietly fall apart three months later. Normalization is the set of rules that separates the designs that age well from the ones that do not. The rules have formal names and mathematical foundations, but the underlying idea is straightforward: every independent fact belongs in exactly one place.
Let’s start by looking at how things go wrong.
A “Broken” Database to Begin With
Section titled “A “Broken” Database to Begin With”Imagine our school wants to keep track of who takes which subject, who teaches it, and what grade the student got. Someone (maybe in a hurry) decides to put everything into one big table. It looks innocent enough:
| student_id | student_name | student_phone | class_name | class_room | course_id | course_title | teacher_name | teacher_kuerzel | grade |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Anna Gruber | 0664-111, 0676-222 | 3AHITM | R204 | 101 | Databases | Mr. Maurhart | MAO | 2 |
| 1 | Anna Gruber | 0664-111, 0676-222 | 3AHITM | R204 | 102 | Networking | Ms. Kender | KEN | 1 |
| 2 | Ben Hofer | 0699-333 | 3AHITM | R204 | 101 | Databases | Mr. Maurhart | MAO | 3 |
| 3 | Cara Wimmer | 0660-444 | 4AHITM | R110 | 101 | Databasse | Mr. Maurhart | MAO | 2 |
At first glance this table answers every question we have. But look more closely — it is already falling apart:
- It repeats itself. Class 3AHITM and its room “R204” appear three times. “Mr. Maurhart” and his Kürzel “MAO” appear three times. All of that is stored once per enrollment, even though none of it has anything to do with any individual grade.
- It contradicts itself. Row 4 says the course is called “Databasse”. Is that a typo, or a different course? The table cannot tell you, because the truth is scattered across many rows instead of being stored in exactly one place.
- It mixes several things into one cell.
student_phoneholds two numbers separated by a comma. Now try to write a query that finds everyone reachable on a0676number — you can’t do it cleanly, because the database sees one long string, not two phone numbers.
This single table is trying to be a student list, a class directory, a course catalog, and a gradebook all at once. Whenever you force unrelated facts to live in the same row, you create the problems we look at next.
The Anomalies
Section titled “The Anomalies”The problems that come from badly designed tables have a name: anomalies. An anomaly is something that goes wrong when you try to insert, update, or delete data. There are three classic kinds, and they are worth knowing by name because they show up again and again.
- Insertion anomaly — you cannot add a piece of information because some unrelated information is missing.
- Update anomaly — to change one fact, you have to change it in many rows, and if you miss one, the data becomes inconsistent.
- Deletion anomaly — deleting one thing accidentally destroys another thing you wanted to keep.
Let’s see each of them in our broken table, with two examples each.
Insertion Anomaly
Section titled “Insertion Anomaly”You want to add a fact, but the table forces you to also know facts that have nothing to do with it.
-
Example 1 — A new course with no students yet. The school adds a new subject “Cyber Security”, taught by Ms. Schmidt. You want to record it now, before any class has been assigned to it. But every row needs a
student_idand agrade. With this table you simply cannot store a course that has no students — even though the course clearly exists. -
Example 2 — A new student not yet assigned to a class. A student transfers in mid-year. The admin has not yet decided which class they belong to, so there is no
class_nameorclass_roomto write down. There is also nocourse_idto fill in yet. With this table, the student cannot exist in the database until every other column is known too.
Update Anomaly
Section titled “Update Anomaly”A single fact is stored in many rows. To change it, you must find and fix every copy.
-
Example 1 — The class moves to a different room. 3AHITM is relocated from R204 to R307 because of a renovation. Their room is written in two different rows (once for Anna, once for Ben). You must update both. Forget one, and the database now claims 3AHITM occupies two rooms at the same time.
-
Example 2 — Fixing the course title. You notice “Databasse” is a typo and want to correct it to “Databases”. But the course title is duplicated across multiple rows. Fix one and miss another, and the title is still inconsistent.
Deletion Anomaly
Section titled “Deletion Anomaly”You delete one thing, and an unrelated thing disappears with it.
-
Example 1 — A student drops a subject. Cara is the only student enrolled in a particular course. She drops it, so you delete her row. With that single deletion, all knowledge of that course and its teacher is gone too — even though the course continues next semester.
-
Example 2 — Cleaning up at the end of the year. You delete the rows of students who have graduated. If a teacher only ever appeared next to those students, that teacher’s name and Kürzel quietly vanish from the database — even though the teacher is still on staff.
What Is Normalization?
Section titled “What Is Normalization?”Normalization is the process of organizing your columns into well-chosen tables so that each fact is stored only once, and the anomalies above cannot happen.
We do this in stages called normal forms (NF). Each normal form is a small checklist. You start with a messy table, apply the rule for the First Normal Form (1NF), then the Second Normal Form (2NF), then the Third Normal Form (3NF). The forms build on each other: a table cannot be in 2NF unless it is already in 1NF, and so on.
There are higher forms too (BCNF, 4NF, 5NF), but for almost all real-world database design getting to 3NF is the goal. In practice, “the database is normalized” usually means “it is in 3NF”.
We will not throw all the definitions at you up front. Instead, we introduce each definition exactly when we need it — that is how the inventors of these rules thought about them, and it is much easier to remember.
First Normal Form (1NF) — One Value per Cell
Section titled “First Normal Form (1NF) — One Value per Cell”The very first rule is the simplest:
1NF: Every cell holds a single, atomic value, and there are no repeating groups. On top of that, every row must be uniquely identifiable (so the table needs a primary key).
“Atomic” just means cannot sensibly be split further for the way you use it. A phone number is atomic. A list of two phone numbers crammed into one cell is not.
Two things in our table break 1NF:
student_phonecontains0664-111, 0676-222— two values in one cell.- There is no proper primary key yet that guarantees each row is unique.
To fix the multi-value problem, we pull the phone numbers out into their own rows, in their own table. One phone number per row:
CREATE TABLE student_phones ( student_id INTEGER NOT NULL, phone VARCHAR(30) NOT NULL, PRIMARY KEY (student_id, phone));CREATE TABLE student_phones ( student_id INT NOT NULL, phone VARCHAR(30) NOT NULL, PRIMARY KEY (student_id, phone));Now a student can have one, two, or ten phone numbers — each one is a clean, separate row you can query and filter:
-- Find every student reachable on a 0676 number. Easy now:SELECT student_idFROM student_phonesWHERE phone LIKE '0676%';For the rest of the data, we put it into one big table that does have a real primary key. Since a single student_id is not unique (Anna appears twice) and a single course_id is not unique either, the thing that uniquely identifies a row is the combination of both:
CREATE TABLE enrollments ( student_id INTEGER NOT NULL, student_name VARCHAR(100) NOT NULL, class_name VARCHAR(10) NOT NULL, class_room VARCHAR(10) NOT NULL, course_id INTEGER NOT NULL, course_title VARCHAR(100) NOT NULL, teacher_name VARCHAR(100) NOT NULL, teacher_kuerzel VARCHAR(10) NOT NULL, grade SMALLINT, PRIMARY KEY (student_id, course_id));CREATE TABLE enrollments ( student_id INT NOT NULL, student_name VARCHAR(100) NOT NULL, class_name VARCHAR(10) NOT NULL, class_room VARCHAR(10) NOT NULL, course_id INT NOT NULL, course_title VARCHAR(100) NOT NULL, teacher_name VARCHAR(100) NOT NULL, teacher_kuerzel VARCHAR(10) NOT NULL, grade SMALLINT, PRIMARY KEY (student_id, course_id));This table is now in 1NF: every cell is atomic, and (student_id, course_id) uniquely identifies each row. Good — but it is still full of repetition. “Mr. Maurhart”, “MAO” and “R204” still appear multiple times. 1NF made the data tidy, not non-redundant. That’s the job of the next step.
Second Normal Form (2NF) — Depend on the Whole Key
Section titled “Second Normal Form (2NF) — Depend on the Whole Key”To talk about 2NF we need one new idea: the functional dependency.
Now the rule:
2NF: The table is in 1NF, and every non-key column depends on the whole primary key — not just on part of it.
This rule only bites when the primary key has more than one column, like our (student_id, course_id). Let’s check each non-key column and ask “what does it really depend on?”:
student_name,class_name,class_room→ depend onstudent_idonly. (The course has nothing to do with which class a student is in.)course_title,teacher_name,teacher_kuerzel→ depend oncourse_idonly.grade→ depends on bothstudent_idandcourse_id. (A grade is a student’s result in a specific course.)
The columns that depend on only part of the key are called partial dependencies, and they are exactly what 2NF forbids. The fix is to give each “part” its own table:
-- Facts that depend on the student only:CREATE TABLE students ( student_id SERIAL PRIMARY KEY, student_name VARCHAR(100) NOT NULL, class_name VARCHAR(10) NOT NULL, class_room VARCHAR(10) NOT NULL);
-- Facts that depend on the course only:CREATE TABLE courses ( course_id SERIAL PRIMARY KEY, course_title VARCHAR(100) NOT NULL, teacher_name VARCHAR(100) NOT NULL, teacher_kuerzel VARCHAR(10) NOT NULL);
-- Facts that depend on BOTH (the enrollment itself):CREATE TABLE enrollments ( student_id INTEGER NOT NULL REFERENCES students(student_id), course_id INTEGER NOT NULL REFERENCES courses(course_id), grade SMALLINT, PRIMARY KEY (student_id, course_id));-- Facts that depend on the student only:CREATE TABLE students ( student_id INT AUTO_INCREMENT PRIMARY KEY, student_name VARCHAR(100) NOT NULL, class_name VARCHAR(10) NOT NULL, class_room VARCHAR(10) NOT NULL);
-- Facts that depend on the course only:CREATE TABLE courses ( course_id INT AUTO_INCREMENT PRIMARY KEY, course_title VARCHAR(100) NOT NULL, teacher_name VARCHAR(100) NOT NULL, teacher_kuerzel VARCHAR(10) NOT NULL);
-- Facts that depend on BOTH (the enrollment itself):CREATE TABLE enrollments ( student_id INT NOT NULL, course_id INT NOT NULL, grade SMALLINT, PRIMARY KEY (student_id, course_id), FOREIGN KEY (student_id) REFERENCES students(student_id), FOREIGN KEY (course_id) REFERENCES courses(course_id));Look what we just won. A course now exists as one row in courses, completely independent of whether any student takes it. That kills the insertion anomaly (we can add a course before assigning it to any class) and the deletion anomaly (removing the last student no longer erases the course). And “Mr. Maurhart” is written exactly once — per course, not per enrollment.
But we are not done. Look at both intermediate tables now:
| student_id | student_name | class_name | class_room |
|---|---|---|---|
| 1 | Anna Gruber | 3AHITM | R204 |
| 2 | Ben Hofer | 3AHITM | R204 |
| 3 | Cara Wimmer | 4AHITM | R110 |
| course_id | course_title | teacher_name | teacher_kuerzel |
|---|---|---|---|
| 101 | Databases | Mr. Maurhart | MAO |
| 102 | Networking | Ms. Kender | KEN |
If 3AHITM has thirty students, “R204” appears thirty times in students. If Mr. Maurhart teaches five courses, “MAO” appears five times in courses. The repetition is back, just in different tables. Why? Because class_room doesn’t really depend on the student — it depends on the class. And teacher_kuerzel doesn’t depend on the course — it depends on the teacher. That’s exactly what 3NF solves.
Third Normal Form (3NF) — No Detours Through Other Columns
Section titled “Third Normal Form (3NF) — No Detours Through Other Columns”One more definition, and it is the one most students trip over — so let’s make it concrete first.
The rule:
3NF: The table is in 2NF, and no non-key column depends on another non-key column. Every non-key column must depend on the key, the whole key, and nothing but the key.
Both tables above break 3NF. The fix is the same in both cases — give the “middle” entity its own table:
-- The class gets its own table; the room belongs to the class, not the student:CREATE TABLE classes ( class_name VARCHAR(10) PRIMARY KEY, class_room VARCHAR(10) NOT NULL);
CREATE TABLE students ( student_id SERIAL PRIMARY KEY, student_name VARCHAR(100) NOT NULL, class_name VARCHAR(10) NOT NULL REFERENCES classes(class_name));
-- The teacher gets their own table; the Kürzel belongs to the teacher, not the course:CREATE TABLE teachers ( teacher_id SERIAL PRIMARY KEY, teacher_name VARCHAR(100) NOT NULL, teacher_kuerzel VARCHAR(10) NOT NULL UNIQUE);
CREATE TABLE courses ( course_id SERIAL PRIMARY KEY, course_title VARCHAR(100) NOT NULL, teacher_id INTEGER NOT NULL REFERENCES teachers(teacher_id));-- The class gets its own table; the room belongs to the class, not the student:CREATE TABLE classes ( class_name VARCHAR(10) PRIMARY KEY, class_room VARCHAR(10) NOT NULL);
CREATE TABLE students ( student_id INT AUTO_INCREMENT PRIMARY KEY, student_name VARCHAR(100) NOT NULL, class_name VARCHAR(10) NOT NULL, FOREIGN KEY (class_name) REFERENCES classes(class_name));
-- The teacher gets their own table; the Kürzel belongs to the teacher, not the course:CREATE TABLE teachers ( teacher_id INT AUTO_INCREMENT PRIMARY KEY, teacher_name VARCHAR(100) NOT NULL, teacher_kuerzel VARCHAR(10) NOT NULL UNIQUE);
CREATE TABLE courses ( course_id INT AUTO_INCREMENT PRIMARY KEY, course_title VARCHAR(100) NOT NULL, teacher_id INT NOT NULL, FOREIGN KEY (teacher_id) REFERENCES teachers(teacher_id));Now 3AHITM’s room is stored in exactly one place: one row in classes. When the class moves to a different room, you change a single value and every student in 3AHITM is instantly correct. And Mr. Maurhart’s Kürzel lives in one row in teachers — change it once and every course he teaches reflects it. Both update anomalies are gone.
Here is the complete design after reaching 3NF. Five small tables, each about one clear thing:
-
classes— which classes exist and where they sit. (class_name,class_room) -
students— who the students are and which class they belong to. (student_id,student_name,class_name) -
teachers— who the teachers are and their official Kürzel. (teacher_id,teacher_name,teacher_kuerzel) -
courses— which subjects exist and who teaches them. (course_id,course_title,teacher_id) -
enrollments— which student takes which course, and the resulting grade. (student_id,course_id,grade)
To get the original wide view back, you simply JOIN the pieces together when you need them:
SELECT s.student_name, s.class_name, cl.class_room, c.course_title, t.teacher_name, t.teacher_kuerzel, e.gradeFROM enrollments eJOIN students s ON s.student_id = e.student_idJOIN classes cl ON cl.class_name = s.class_nameJOIN courses c ON c.course_id = e.course_idJOIN teachers t ON t.teacher_id = c.teacher_id;The data is stored without redundancy, and the convenient “everything in one row” view is just one query away. That is the whole trade normalization offers: store it clean, assemble it on demand.
A Quick Word on the Higher Normal Forms
Section titled “A Quick Word on the Higher Normal Forms”You will hear about BCNF (a stricter version of 3NF), 4NF, and 5NF. They deal with rarer situations involving overlapping candidate keys and complex many-to-many relationships. They are good to have heard of, but in everyday database design they almost never change what you would do after reaching 3NF. For this course — and for most professional work — 3NF is where you aim.
Can You Normalize Too Much?
Section titled “Can You Normalize Too Much?”Yes. And this is the part many textbooks forget to mention.
Every time you split a table to reach a higher normal form, you create another table that must be joined back at query time. A perfectly normalized design can spread a single logical “thing” across six or seven tables. Reading it back then means a six-table JOIN — and joins cost time. On a busy system with millions of rows, that cost is real.
Normalization optimizes for writing (no duplicated facts, no anomalies). But many systems spend most of their life reading. There are situations where deliberately keeping some redundancy makes the system faster and simpler:
- Reporting and analytics. A dashboard that adds up last month’s sales doesn’t want to join ten tables for every chart. Data warehouses often use denormalized “star schemas” on purpose.
- Expensive repeated joins. If 95% of your queries always join the same five tables, sometimes it is worth storing a pre-joined copy.
- Values that are cheap to copy and never change. Storing the country name next to its code may technically be redundant, but it can save a join for almost no risk.
Deliberately stepping back from full normalization is called denormalization. The key word is deliberately: you denormalize after you understand the normalized design and have a measured reason to deviate — never because you were too lazy to split the table in the first place.
Summary
Section titled “Summary”| Normal form | The question it answers | The fix |
|---|---|---|
| 1NF | Does every cell hold a single atomic value, and is every row uniquely identifiable? | Split multi-valued cells into separate rows; define a primary key. |
| 2NF | Does every non-key column depend on the whole primary key? | Move columns that depend on only part of a composite key into their own table. |
| 3NF | Does any non-key column depend on another non-key column? | Move the “detour” columns (transitive dependencies) into their own table. |
The big ideas to take with you:
- Anomalies (insertion, update, deletion) are the symptoms; redundant, tangled data is the disease.
- Normalization cures it by storing every fact exactly once, one table per “thing”.
- 3NF is the practical target for almost all designs.
- Denormalization is a conscious, measured exception for read-heavy systems — not an excuse to skip the thinking.
Design clean first. Optimize later, on purpose, with numbers in hand.
Learning Outcomes: What you should be able to do after this chapter
Section titled “Learning Outcomes: What you should be able to do after this chapter”After completing this chapter, students should be able to:
- Name: name the insertion, update and deletion anomalies as well as the normal forms 1NF–3NF.
- Explain: explain which problem each normal form solves (atomicity, full functional dependency, transitivity).
- Apply: transform a given table step by step into 1NF, 2NF and 3NF.
- Analyze: analyze functional dependencies and redundancies in a schema.
- Evaluate: weigh up when denormalization is justified.