11. Data Integrity
Data Integrity — Ensuring Data Quality in SQL
Section titled “Data Integrity — Ensuring Data Quality in SQL”Motivation: Imagine an online shop where a customer order references a customer ID that does not exist, or a product price is accidentally stored as a negative number. Both situations lead to inconsistent, unreliable data — and ultimately to wrong decisions or broken applications.
Data integrity is the set of rules and mechanisms that prevent exactly these problems. It ensures that data stored in a database is correct, consistent, and meaningful at all times.
These three words — correct, consistent, and meaningful — are worth defining precisely, because they are often used loosely in everyday language.
-
Correct means that every stored value is a valid instance of what the column is supposed to represent. A date of birth column contains an actual calendar date, not an arbitrary string. A price column contains a non-negative number, not
NULLor a letter. Correctness is enforced by data types,NOT NULL, andCHECKconstraints. -
Consistent means that the data does not contradict itself — neither within a single row nor across related tables. An order row that references a customer who does not exist is inconsistent. An
end_datethat is earlier than thestart_datein the same row is inconsistent. Consistency is enforced byFOREIGN KEYconstraints and multi-columnCHECKconstraints. -
Meaningful means that the data makes sense in the real-world context it is meant to model. A value can be correct (technically a valid integer) and consistent (no contradictions) but still be meaningless — for example, a quantity of zero in an order line, or a discount of 150%. Meaningful data satisfies the business rules of the domain. These rules are enforced by user-defined
CHECKconstraints.
Why Data Integrity Matters
Section titled “Why Data Integrity Matters”Without integrity rules, databases are fragile:
- A foreign key column could reference a row that was deleted.
- A salary column could hold a negative value.
- Two users could accidentally share the same email address.
- A mandatory field could be left empty.
These problems are often hard to detect after the fact and expensive to fix. By defining integrity constraints at the database level, errors are caught immediately at the moment of insertion or update — regardless of which application or user caused them.
The key benefits of enforcing data integrity are:
| Benefit | Explanation |
|---|---|
| Reliability | Data is consistent and trustworthy across the whole system. |
| Automation | The database enforces rules automatically, reducing the burden on application code. |
| Early error detection | Invalid data is rejected immediately, before it can cause harm. |
| Interoperability | Multiple applications using the same database all benefit from the same rules. |
Categories of Data Integrity
Section titled “Categories of Data Integrity”Data integrity is traditionally divided into four categories. Each addresses a different dimension of data quality.
Entity Integrity
Section titled “Entity Integrity”Every table must be able to uniquely identify each of its rows. No two rows may be indistinguishable from one another, and no row may be “anonymous”.
Rule: Every table must have a primary key. The primary key must be unique and must never be NULL.
-- The id column uniquely identifies each customer.CREATE TABLE customers ( id SERIAL PRIMARY KEY, name TEXT NOT NULL);-- The id column uniquely identifies each customer.CREATE TABLE customers ( id INT AUTO_INCREMENT PRIMARY KEY, name TEXT NOT NULL);Without entity integrity, it would be impossible to reliably address a specific row — for example when updating or deleting a record.
Referential Integrity
Section titled “Referential Integrity”When one table references rows in another table (via a foreign key), those referenced rows must actually exist. A reference to a non-existent row is called a dangling reference and leads to inconsistent data.
Rule: Every value in a foreign key column must either match an existing primary key value in the referenced table, or be NULL (if the relationship is optional).
-- An order must reference an existing customer.CREATE TABLE orders ( id SERIAL PRIMARY KEY, customer_id INTEGER NOT NULL REFERENCES customers(id), placed_at DATE NOT NULL);-- An order must reference an existing customer.CREATE TABLE orders ( id INT AUTO_INCREMENT PRIMARY KEY, customer_id INT NOT NULL, placed_at DATE NOT NULL, FOREIGN KEY (customer_id) REFERENCES customers(id));If someone tries to insert an order with a customer_id that does not exist in customers, the database will reject the operation immediately.
Domain Integrity
Section titled “Domain Integrity”Each column has a defined domain: the set of all values that are valid for that column. Domain integrity ensures that only values within this domain can be stored.
The domain of a column is determined by:
- its data type (e.g.,
INTEGER,DATE,VARCHAR(100)) - additional constraints (e.g.,
NOT NULL,CHECK,DEFAULT)
Examples:
- A
birth_datecolumn of typeDATEwill reject the text"hello". - A
pricecolumn withCHECK (price >= 0)will reject-5.00. - A
statuscolumn withCHECK (status IN ('ACTIVE', 'INACTIVE'))rejects any other string.
User-Defined Integrity
Section titled “User-Defined Integrity”Beyond the three structural categories above, databases also support user-defined rules that reflect specific business requirements. These are constraints that cannot be derived from the schema structure alone but must be explicitly specified.
Examples:
- A
discountpercentage must be between0and100. - A
start_datemust not lie after anend_date. - A
quantityin an order line must be greater than zero.
In SQL, user-defined integrity is primarily expressed using CHECK constraints and, in more advanced scenarios, triggers or stored procedures.
Integrity Constraints in SQL
Section titled “Integrity Constraints in SQL”SQL provides a set of constraints that are defined directly in the table schema. The database engine evaluates them automatically on every INSERT, UPDATE, and DELETE.
NOT NULL
Section titled “NOT NULL”A column declared NOT NULL must always contain a value. An attempt to insert or update a row with NULL in that column is rejected.
CREATE TABLE employees ( id SERIAL PRIMARY KEY, first_name TEXT NOT NULL, last_name TEXT NOT NULL, email TEXT -- optional (NULL is allowed));CREATE TABLE employees ( id INT AUTO_INCREMENT PRIMARY KEY, first_name TEXT NOT NULL, last_name TEXT NOT NULL, email TEXT -- optional (NULL is allowed));Use NOT NULL for every column where an empty value makes no sense — for example, a person’s name or a product code.
UNIQUE
Section titled “UNIQUE”A column (or combination of columns) declared UNIQUE must contain no duplicate values across all rows. Unlike a primary key, a UNIQUE column may contain NULL — and, in most database systems, multiple NULL values are allowed because NULL is never considered equal to anything, including another NULL.
CREATE TABLE users ( id SERIAL PRIMARY KEY, username TEXT NOT NULL UNIQUE, email TEXT NOT NULL UNIQUE);-- MariaDB/MySQL requires VARCHAR (not TEXT) for UNIQUE constraints,-- because index keys on TEXT columns need an explicit prefix length.CREATE TABLE users ( id INT AUTO_INCREMENT PRIMARY KEY, username VARCHAR(255) NOT NULL UNIQUE, email VARCHAR(255) NOT NULL UNIQUE);PRIMARY KEY
Section titled “PRIMARY KEY”The PRIMARY KEY constraint is the central pillar of entity integrity. It combines two rules in one:
- The column (or columns) must be unique.
- The column (or columns) must never be
NULL.
A table can have only one primary key, but that key can span multiple columns (a composite primary key).
-- Single-column primary keyCREATE TABLE products ( product_id SERIAL PRIMARY KEY, name TEXT NOT NULL);
-- Composite primary key (common in junction tables)CREATE TABLE course_registrations ( student_id INTEGER NOT NULL, course_id INTEGER NOT NULL, PRIMARY KEY (student_id, course_id));-- Single-column primary keyCREATE TABLE products ( product_id INT AUTO_INCREMENT PRIMARY KEY, name TEXT NOT NULL);
-- Composite primary key (common in junction tables)CREATE TABLE course_registrations ( student_id INT NOT NULL, course_id INT NOT NULL, PRIMARY KEY (student_id, course_id));FOREIGN KEY and Referential Actions
Section titled “FOREIGN KEY and Referential Actions”A FOREIGN KEY constraint links a column in one table to the primary key of another table, enforcing referential integrity.
CREATE TABLE orders ( id SERIAL PRIMARY KEY, customer_id INTEGER NOT NULL REFERENCES customers(id), total NUMERIC(10, 2) NOT NULL CHECK (total >= 0));What happens when the referenced row changes?
Section titled “What happens when the referenced row changes?”A question arises immediately: what should happen to the child rows when a referenced row in the parent table is deleted or updated?
SQL allows you to specify this behavior using referential actions:
| Action | Effect |
|---|---|
RESTRICT | The operation on the parent row is rejected if child rows exist. This is the default in most systems. |
CASCADE | The operation is propagated to all child rows automatically (e.g., delete or update them). |
SET NULL | The foreign key columns in child rows are set to NULL. |
SET DEFAULT | The foreign key columns in child rows are set to their default value. |
NO ACTION | Similar to RESTRICT, but the check may be deferred until the end of a transaction. |
CREATE TABLE orders ( id SERIAL PRIMARY KEY, customer_id INTEGER NOT NULL REFERENCES customers(id) ON DELETE RESTRICT ON UPDATE CASCADE, placed_at DATE NOT NULL);CREATE TABLE orders ( id INT AUTO_INCREMENT PRIMARY KEY, customer_id INT NOT NULL, placed_at DATE NOT NULL, CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE RESTRICT ON UPDATE CASCADE);Reading the example above:
ON DELETE RESTRICT— if someone tries to delete a customer who still has orders, the database refuses the deletion.ON UPDATE CASCADE— if a customer’sidchanges (rare, but possible), all matchingcustomer_idvalues inordersare updated automatically.
A CHECK constraint defines a logical condition that every row must satisfy. The condition can involve any column in the same table and any SQL expression that returns a boolean result.
CREATE TABLE products ( id SERIAL PRIMARY KEY, name TEXT NOT NULL, price NUMERIC(10, 2) NOT NULL CHECK (price >= 0), discount NUMERIC(5, 2) CHECK (discount BETWEEN 0 AND 100));CREATE TABLE products ( id INT AUTO_INCREMENT PRIMARY KEY, name TEXT NOT NULL, price DECIMAL(10, 2) NOT NULL CHECK (price >= 0), discount DECIMAL(5, 2) CHECK (discount BETWEEN 0 AND 100));You can also write table-level CHECK constraints that compare multiple columns with each other:
CREATE TABLE projects ( id SERIAL PRIMARY KEY, start_date DATE NOT NULL, end_date DATE, CONSTRAINT chk_dates CHECK (end_date IS NULL OR end_date >= start_date));CREATE TABLE projects ( id INT AUTO_INCREMENT PRIMARY KEY, start_date DATE NOT NULL, end_date DATE, CONSTRAINT chk_dates CHECK (end_date IS NULL OR end_date >= start_date));Here, end_date may be left empty (the project is still running), but if it is set it must not be earlier than start_date.
DEFAULT
Section titled “DEFAULT”A DEFAULT value is not a constraint in the strictest sense — it does not reject invalid data. Instead, it provides a fallback value that is used automatically when no explicit value is supplied for a column during an INSERT.
CREATE TABLE articles ( id SERIAL PRIMARY KEY, title TEXT NOT NULL, published BOOLEAN NOT NULL DEFAULT FALSE, created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP);CREATE TABLE articles ( id INT AUTO_INCREMENT PRIMARY KEY, title TEXT NOT NULL, published BOOLEAN NOT NULL DEFAULT FALSE, created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP);If you insert a new article without specifying published or created_at, the database fills in FALSE and the current timestamp automatically.
DEFAULT is particularly useful for:
- Timestamps — recording when a row was created.
- Boolean flags — setting a sensible initial state (e.g., “inactive”, “unpublished”).
- Status columns — starting every new record in a defined state.
Naming Constraints
Section titled “Naming Constraints”The database generates a name automatically for every constraint that has no explicit name. These auto-generated names are often cryptic and hard to work with when you receive an error message or need to drop a constraint later.
It is good practice to name every constraint explicitly using the CONSTRAINT keyword:
CREATE TABLE order_items ( id SERIAL PRIMARY KEY, order_id INTEGER NOT NULL, product_id INTEGER NOT NULL, quantity INTEGER NOT NULL, unit_price NUMERIC(10, 2) NOT NULL,
CONSTRAINT fk_order_items_order FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE,
CONSTRAINT fk_order_items_product FOREIGN KEY (product_id) REFERENCES products(id) ON DELETE RESTRICT,
CONSTRAINT chk_order_items_quantity CHECK (quantity > 0),
CONSTRAINT chk_order_items_unit_price CHECK (unit_price >= 0));CREATE TABLE order_items ( id INT AUTO_INCREMENT PRIMARY KEY, order_id INT NOT NULL, product_id INT NOT NULL, quantity INT NOT NULL, unit_price DECIMAL(10, 2) NOT NULL,
CONSTRAINT fk_order_items_order FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE,
CONSTRAINT fk_order_items_product FOREIGN KEY (product_id) REFERENCES products(id) ON DELETE RESTRICT,
CONSTRAINT chk_order_items_quantity CHECK (quantity > 0),
CONSTRAINT chk_order_items_unit_price CHECK (unit_price >= 0));A widely used naming convention is:
pk_<table>— primary keyuq_<table>_<column>— unique constraintfk_<table>_<referenced_table>— foreign keychk_<table>_<column>— check constraint
Constraint Violations
Section titled “Constraint Violations”When an operation violates a constraint, the database aborts the statement and returns an error. The data is not changed. Here are the most common violations and what triggers them:
| Constraint | Violation Example |
|---|---|
NOT NULL | Inserting a row without a value for a required column. |
UNIQUE | Inserting a row whose email already exists in the table. |
PRIMARY KEY | Inserting a row with an id that already exists. |
FOREIGN KEY | Inserting an order for a customer_id that does not exist. |
CHECK | Inserting a product with a negative price. |
-- Attempting to insert an order for customer_id = 999 (does not exist):INSERT INTO orders (customer_id, placed_at) VALUES (999, CURRENT_DATE);-- Result: ERROR — foreign key constraint violation-- Attempting to insert a product with a negative price:INSERT INTO products (name, price) VALUES ('Widget', -5.00);-- Result: ERROR — check constraint violationError messages differ between database systems, but they always tell you which constraint was violated and on which table. Reading these messages carefully is an important skill when working with SQL.
Adding and Removing Constraints
Section titled “Adding and Removing Constraints”Constraints can also be added to or removed from an existing table using ALTER TABLE.
-- Add a CHECK constraint to an existing tableALTER TABLE productsADD CONSTRAINT chk_products_price CHECK (price >= 0);
-- Add a UNIQUE constraintALTER TABLE usersADD CONSTRAINT uq_users_email UNIQUE (email);
-- Add a FOREIGN KEY constraintALTER TABLE ordersADD CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id) REFERENCES customers(id);
-- Remove a constraint by nameALTER TABLE productsDROP CONSTRAINT chk_products_price;-- Add a CHECK constraintALTER TABLE productsADD CONSTRAINT chk_products_price CHECK (price >= 0);
-- Add a UNIQUE constraintALTER TABLE usersADD CONSTRAINT uq_users_email UNIQUE (email);
-- Add a FOREIGN KEY constraintALTER TABLE ordersADD CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id) REFERENCES customers(id);
-- Remove a CHECK constraintALTER TABLE productsDROP CONSTRAINT chk_products_price;
-- Remove a FOREIGN KEY constraintALTER TABLE ordersDROP FOREIGN KEY fk_orders_customer;
-- Remove a UNIQUE constraint (treated as an index in MariaDB/MySQL)ALTER TABLE usersDROP INDEX uq_users_email;Summary
Section titled “Summary”Data integrity is the foundation of a trustworthy database. It ensures that only correct, consistent, and meaningful data can enter the system — automatically, at the database level, for every application that uses it.
| Constraint | Category | What it enforces |
|---|---|---|
PRIMARY KEY | Entity | Unique, non-null identifier for every row |
NOT NULL | Domain | Column must always contain a value |
UNIQUE | Entity / Domain | No duplicate values in a column |
FOREIGN KEY | Referential | References must point to existing rows |
CHECK | Domain / User-defined | Values must satisfy a logical condition |
DEFAULT | Domain | Fallback value when no value is supplied |
Learning Outcomes: What you should be able to do after this chapter
Section titled “Learning Outcomes: What you should be able to do after this chapter”After completing this chapter, students should be able to:
- Name: name the categories of data integrity (entity, referential and domain integrity) and common constraints.
- Explain: explain why data integrity is essential for data quality.
- Apply: define, name, add and remove integrity constraints (
PRIMARY KEY,FOREIGN KEY,NOT NULL,UNIQUE,CHECK). - Analyze: analyze a constraint violation based on an error message and determine its cause.
- Evaluate: assess which constraints secure the integrity of a given data model.