Skip to content

11. Data Integrity

Switch to Zen Mode

Data Integrity — Ensuring Data Quality in SQL

Section titled “Data Integrity — Ensuring Data Quality in SQL”

Motivation: Imagine an online shop where a customer order references a customer ID that does not exist, or a product price is accidentally stored as a negative number. Both situations lead to inconsistent, unreliable data — and ultimately to wrong decisions or broken applications.

Data integrity is the set of rules and mechanisms that prevent exactly these problems. It ensures that data stored in a database is correct, consistent, and meaningful at all times.

These three words — correct, consistent, and meaningful — are worth defining precisely, because they are often used loosely in everyday language.

  • Correct means that every stored value is a valid instance of what the column is supposed to represent. A date of birth column contains an actual calendar date, not an arbitrary string. A price column contains a non-negative number, not NULL or a letter. Correctness is enforced by data types, NOT NULL, and CHECK constraints.

  • Consistent means that the data does not contradict itself — neither within a single row nor across related tables. An order row that references a customer who does not exist is inconsistent. An end_date that is earlier than the start_date in the same row is inconsistent. Consistency is enforced by FOREIGN KEY constraints and multi-column CHECK constraints.

  • Meaningful means that the data makes sense in the real-world context it is meant to model. A value can be correct (technically a valid integer) and consistent (no contradictions) but still be meaningless — for example, a quantity of zero in an order line, or a discount of 150%. Meaningful data satisfies the business rules of the domain. These rules are enforced by user-defined CHECK constraints.

Without integrity rules, databases are fragile:

  • A foreign key column could reference a row that was deleted.
  • A salary column could hold a negative value.
  • Two users could accidentally share the same email address.
  • A mandatory field could be left empty.

These problems are often hard to detect after the fact and expensive to fix. By defining integrity constraints at the database level, errors are caught immediately at the moment of insertion or update — regardless of which application or user caused them.

The key benefits of enforcing data integrity are:

BenefitExplanation
ReliabilityData is consistent and trustworthy across the whole system.
AutomationThe database enforces rules automatically, reducing the burden on application code.
Early error detectionInvalid data is rejected immediately, before it can cause harm.
InteroperabilityMultiple applications using the same database all benefit from the same rules.

Data integrity is traditionally divided into four categories. Each addresses a different dimension of data quality.

Every table must be able to uniquely identify each of its rows. No two rows may be indistinguishable from one another, and no row may be “anonymous”.

Rule: Every table must have a primary key. The primary key must be unique and must never be NULL.

-- The id column uniquely identifies each customer.
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);

Without entity integrity, it would be impossible to reliably address a specific row — for example when updating or deleting a record.

When one table references rows in another table (via a foreign key), those referenced rows must actually exist. A reference to a non-existent row is called a dangling reference and leads to inconsistent data.

Rule: Every value in a foreign key column must either match an existing primary key value in the referenced table, or be NULL (if the relationship is optional).

-- An order must reference an existing customer.
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(id),
placed_at DATE NOT NULL
);

If someone tries to insert an order with a customer_id that does not exist in customers, the database will reject the operation immediately.

Each column has a defined domain: the set of all values that are valid for that column. Domain integrity ensures that only values within this domain can be stored.

The domain of a column is determined by:

  • its data type (e.g., INTEGER, DATE, VARCHAR(100))
  • additional constraints (e.g., NOT NULL, CHECK, DEFAULT)

Examples:

  • A birth_date column of type DATE will reject the text "hello".
  • A price column with CHECK (price >= 0) will reject -5.00.
  • A status column with CHECK (status IN ('ACTIVE', 'INACTIVE')) rejects any other string.

Beyond the three structural categories above, databases also support user-defined rules that reflect specific business requirements. These are constraints that cannot be derived from the schema structure alone but must be explicitly specified.

Examples:

  • A discount percentage must be between 0 and 100.
  • A start_date must not lie after an end_date.
  • A quantity in an order line must be greater than zero.

In SQL, user-defined integrity is primarily expressed using CHECK constraints and, in more advanced scenarios, triggers or stored procedures.

SQL provides a set of constraints that are defined directly in the table schema. The database engine evaluates them automatically on every INSERT, UPDATE, and DELETE.

A column declared NOT NULL must always contain a value. An attempt to insert or update a row with NULL in that column is rejected.

CREATE TABLE employees (
id SERIAL PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
email TEXT -- optional (NULL is allowed)
);

Use NOT NULL for every column where an empty value makes no sense — for example, a person’s name or a product code.

A column (or combination of columns) declared UNIQUE must contain no duplicate values across all rows. Unlike a primary key, a UNIQUE column may contain NULL — and, in most database systems, multiple NULL values are allowed because NULL is never considered equal to anything, including another NULL.

CREATE TABLE users (
id SERIAL PRIMARY KEY,
username TEXT NOT NULL UNIQUE,
email TEXT NOT NULL UNIQUE
);

The PRIMARY KEY constraint is the central pillar of entity integrity. It combines two rules in one:

  1. The column (or columns) must be unique.
  2. The column (or columns) must never be NULL.

A table can have only one primary key, but that key can span multiple columns (a composite primary key).

-- Single-column primary key
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
-- Composite primary key (common in junction tables)
CREATE TABLE course_registrations (
student_id INTEGER NOT NULL,
course_id INTEGER NOT NULL,
PRIMARY KEY (student_id, course_id)
);

A FOREIGN KEY constraint links a column in one table to the primary key of another table, enforcing referential integrity.

CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(id),
total NUMERIC(10, 2) NOT NULL CHECK (total >= 0)
);

What happens when the referenced row changes?

Section titled “What happens when the referenced row changes?”

A question arises immediately: what should happen to the child rows when a referenced row in the parent table is deleted or updated?

SQL allows you to specify this behavior using referential actions:

ActionEffect
RESTRICTThe operation on the parent row is rejected if child rows exist. This is the default in most systems.
CASCADEThe operation is propagated to all child rows automatically (e.g., delete or update them).
SET NULLThe foreign key columns in child rows are set to NULL.
SET DEFAULTThe foreign key columns in child rows are set to their default value.
NO ACTIONSimilar to RESTRICT, but the check may be deferred until the end of a transaction.
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL
REFERENCES customers(id)
ON DELETE RESTRICT
ON UPDATE CASCADE,
placed_at DATE NOT NULL
);

Reading the example above:

  • ON DELETE RESTRICT — if someone tries to delete a customer who still has orders, the database refuses the deletion.
  • ON UPDATE CASCADE — if a customer’s id changes (rare, but possible), all matching customer_id values in orders are updated automatically.

A CHECK constraint defines a logical condition that every row must satisfy. The condition can involve any column in the same table and any SQL expression that returns a boolean result.

CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
price NUMERIC(10, 2) NOT NULL CHECK (price >= 0),
discount NUMERIC(5, 2) CHECK (discount BETWEEN 0 AND 100)
);

You can also write table-level CHECK constraints that compare multiple columns with each other:

CREATE TABLE projects (
id SERIAL PRIMARY KEY,
start_date DATE NOT NULL,
end_date DATE,
CONSTRAINT chk_dates CHECK (end_date IS NULL OR end_date >= start_date)
);

Here, end_date may be left empty (the project is still running), but if it is set it must not be earlier than start_date.

A DEFAULT value is not a constraint in the strictest sense — it does not reject invalid data. Instead, it provides a fallback value that is used automatically when no explicit value is supplied for a column during an INSERT.

CREATE TABLE articles (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
published BOOLEAN NOT NULL DEFAULT FALSE,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

If you insert a new article without specifying published or created_at, the database fills in FALSE and the current timestamp automatically.

DEFAULT is particularly useful for:

  • Timestamps — recording when a row was created.
  • Boolean flags — setting a sensible initial state (e.g., “inactive”, “unpublished”).
  • Status columns — starting every new record in a defined state.

The database generates a name automatically for every constraint that has no explicit name. These auto-generated names are often cryptic and hard to work with when you receive an error message or need to drop a constraint later.

It is good practice to name every constraint explicitly using the CONSTRAINT keyword:

CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INTEGER NOT NULL,
product_id INTEGER NOT NULL,
quantity INTEGER NOT NULL,
unit_price NUMERIC(10, 2) NOT NULL,
CONSTRAINT fk_order_items_order
FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE,
CONSTRAINT fk_order_items_product
FOREIGN KEY (product_id) REFERENCES products(id) ON DELETE RESTRICT,
CONSTRAINT chk_order_items_quantity
CHECK (quantity > 0),
CONSTRAINT chk_order_items_unit_price
CHECK (unit_price >= 0)
);

A widely used naming convention is:

  • pk_<table> — primary key
  • uq_<table>_<column> — unique constraint
  • fk_<table>_<referenced_table> — foreign key
  • chk_<table>_<column> — check constraint

When an operation violates a constraint, the database aborts the statement and returns an error. The data is not changed. Here are the most common violations and what triggers them:

ConstraintViolation Example
NOT NULLInserting a row without a value for a required column.
UNIQUEInserting a row whose email already exists in the table.
PRIMARY KEYInserting a row with an id that already exists.
FOREIGN KEYInserting an order for a customer_id that does not exist.
CHECKInserting a product with a negative price.
-- Attempting to insert an order for customer_id = 999 (does not exist):
INSERT INTO orders (customer_id, placed_at) VALUES (999, CURRENT_DATE);
-- Result: ERROR — foreign key constraint violation
-- Attempting to insert a product with a negative price:
INSERT INTO products (name, price) VALUES ('Widget', -5.00);
-- Result: ERROR — check constraint violation

Error messages differ between database systems, but they always tell you which constraint was violated and on which table. Reading these messages carefully is an important skill when working with SQL.

Constraints can also be added to or removed from an existing table using ALTER TABLE.

-- Add a CHECK constraint to an existing table
ALTER TABLE products
ADD CONSTRAINT chk_products_price CHECK (price >= 0);
-- Add a UNIQUE constraint
ALTER TABLE users
ADD CONSTRAINT uq_users_email UNIQUE (email);
-- Add a FOREIGN KEY constraint
ALTER TABLE orders
ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(id);
-- Remove a constraint by name
ALTER TABLE products
DROP CONSTRAINT chk_products_price;

Data integrity is the foundation of a trustworthy database. It ensures that only correct, consistent, and meaningful data can enter the system — automatically, at the database level, for every application that uses it.

ConstraintCategoryWhat it enforces
PRIMARY KEYEntityUnique, non-null identifier for every row
NOT NULLDomainColumn must always contain a value
UNIQUEEntity / DomainNo duplicate values in a column
FOREIGN KEYReferentialReferences must point to existing rows
CHECKDomain / User-definedValues must satisfy a logical condition
DEFAULTDomainFallback value when no value is supplied

Learning Outcomes: What you should be able to do after this chapter

Section titled “Learning Outcomes: What you should be able to do after this chapter”

After completing this chapter, students should be able to:

  • Name: name the categories of data integrity (entity, referential and domain integrity) and common constraints.
  • Explain: explain why data integrity is essential for data quality.
  • Apply: define, name, add and remove integrity constraints (PRIMARY KEY, FOREIGN KEY, NOT NULL, UNIQUE, CHECK).
  • Analyze: analyze a constraint violation based on an error message and determine its cause.
  • Evaluate: assess which constraints secure the integrity of a given data model.