6. Physical Organization
Physical Organization of Data in Database Management Systems
Section titled “Physical Organization of Data in Database Management Systems”Physical organization forms the bridge between the relational world of SQL and the mechanical reality of storage media. Logical entities like tables and columns must be efficiently mapped by the Database Management System (DBMS) to physical drives. This layer is critical for performance, as disk I/O remains a primary bottleneck compared to high-speed RAM.
To optimize this mapping, DBMS usually organize data into fixed-size units called pages or blocks, to utilize caching strategies and buffer management.
Layout of Data on Storage Media
Section titled “Layout of Data on Storage Media”The physical storage of a database management system starts with a dedicated Data Directory on the file system. Most relational databases, including MariaDB and PostgreSQL, use a structured hierarchy of folders to separate system-wide metadata from actual user data. While MariaDB often uses a directory per database schema (containing .ibd files for InnoDB), PostgreSQL manages its cluster through the PGDATA environment variable.
”… data directory, commonly referred to as PGDATA … A common location for PGDATA is
/var/lib/pgsql/data.” See: PostgreSQL Storage Layout for more details.
Typical layouts of the PostgreSQL data directory include:
/var/lib/pgdata├── base│ ├── 1│ ├── 4│ ├── 5│ └── 16384├── global│ ├── 1213│ ├── 1213_fsm│ ├── 1213_vm│ ├── 1214│ ├── 1232│ ├── 1233│ ├── 1260│ ├── 1260_fsm│ ├── 1260_vm...│ ├── 6303│ ├── pg_control│ ├── pg_filenode.map│ └── pg_internal.init├── pg_commit_ts├── pg_dynshmem├── pg_hba.conf├── pg_ident.conf├── pg_logical│ ├── mappings│ ├── replorigin_checkpoint│ └── snapshots├── pg_multixact│ ├── members│ └── offsets├── pg_notify├── pg_replslot├── pg_serial├── pg_snapshots├── pg_stat│ └── pgstat.stat├── pg_stat_tmp├── pg_subtrans│ └── 0000├── pg_tblspc├── pg_twophase├── PG_VERSION├── pg_wal│ ├── 000000010000000000000001│ ├── archive_status│ └── summaries├── pg_xact│ └── 0000├── postgresql.auto.conf├── postgresql.conf└── postmaster.optsTypical layouts include:
-
Directory-per-Database: In PostgreSQL, the
base/directory contains subfolders named after the database’s OID (Object Identifier, a unique numeric identifier, e.g. 23), whereas MariaDB typically uses the literal database name for its folders (“northwind”). -
System Catalogs: Both systems store global information (like users and permissions) in a central location. PostgreSQL uses the
global/directory for this purpose. -
Transaction Logs: To ensure durability, databases maintain write-ahead logs. PostgreSQL stores these in
pg_wal/, similar to MariaDB’sib_logfileor redo logs.
To decouple physical storage from the logical database structure, many systems implement Tablespaces. This abstraction allows administrators to distribute data across different physical storage devices to balance I/O load. While MariaDB (via InnoDB) supports file-per-table or general tablespaces, PostgreSQL can map tables to different mount points or drives.
“Tablespaces in PostgreSQL allow database administrators to define locations in the file system where the files representing database objects can be stored. Once created, a tablespace can be referred to by name when creating database objects.” See: PostgreSQL Tablespaces for more details.
CREATE TABLESPACE fastspace LOCATION '/ssd1/postgresql/data';
CREATE TABLE mytable ( id SERIAL PRIMARY KEY, data TEXT) TABLESPACE fastspace;On the file system level, databases represent tables and indexes as Relation Files. A common challenge for DBMS is handling extremely large tables that might exceed file system limits. While some systems use a single large file, PostgreSQL and others employ a segmented approach.
PostgreSQL, specifically, limits each physical file to exactly 1 GB to ensure compatibility across different operating systems and ease of backup.
-
File Segmentation: If a table grows to 10 GB, PostgreSQL creates 10 files (e.g.,
16384,16384.1, etc.). In contrast, MariaDB’s InnoDB usually grows a single.ibdfile dynamically. -
Mapping Identifiers: Most databases do not name files after the table name to avoid naming collisions or OS-specific character issues; instead, they use internal IDs (OIDs).
-
Auxiliary Files: Beyond the main data, systems maintain maps for free space. PostgreSQL uses
_fsm(Free Space Map) and_vm(Visibility Map) files to optimize internal operations.
Finally, the physical layout is designed to interact with the Operating System’s I/O Layer. Most modern databases do not interact with the hardware directly (raw devices) but use standard file system calls. This means the DBMS relies on the OS kernel for low-level tasks like disk scheduling. To ensure that data is actually “safe” on the platter or flash cells, databases use synchronization commands.
Pages and Blocks
Section titled “Pages and Blocks”To manage data efficiently, a DBMS does not operate on individual rows or bytes when interacting with storage, it organizes all data into fixed-size units known as pages (or blocks).
“In postgresql, the block stored in the disk is called Page, and the block in memory is called Buffer. Tables and indexes are called Relation, and rows are called Tuple. When reading and writing data, the page is the smallest unit, each page has a default size of 8KB.” See: [Architecture of PostgreSQL DB] (https://medium.com/swlh/architecture-of-postgresql-db-d6b1ac4cc231)
This granularity allows the database to minimize the number of expensive I/O operations by reading an entire page into memory even if only a single record is requested. In PostgreSQL, the default page size is 8 KB, a value chosen to balance the overhead of the page header against the efficiency of modern file system block sizes. While other systems like MariaDB’s InnoDB use 16 KB by default, the principle remains the same: the page is the smallest unit of data that can be transferred between the disk and the database’s internal memory buffer.
Inside the page is a header that contains metadata about the page, such as the number of records, free space, and transaction visibility information. The rest of the page is used to store the actual data rows (tuples) and an array of line pointers that point to the location of each tuple within the page.
Heap Files and Tuples
Section titled “Heap Files and Tuples”To understand how a database finds a specific record, you have to look at the relationship between the Index (B-Tree) and the Heap (Pages).
In database terminology, a table is often called a Heap because data is stored in no particular order. The index is the “map” that tells the database exactly which page and which slot within that page contains the data.
The next sections will break down the physical organization of data in a PostgreSQL database, focusing on how the B-Tree index interacts with the heap files to efficiently retrieve records. MariaDB’s InnoDB engine has a similar structure, but we’ll focus on PostgreSQL for the sake of clarity and depth.
The CTID (Item Pointer)
Section titled “The CTID (Item Pointer)”Every record (tuple) in a PostgreSQL table has a physical address called a TID or CTID. This is a 6-byte value that acts as a direct pointer to the physical storage. It consists of two numbers:
- Block Number: Which 8 KB page in the file contains the record?
- Offset Number: Which “Line Pointer” (index) inside that page points to the actual data?
For example, a CTID of (42, 5) means: “Go to the 42nd block of the table file and look at the 5th line pointer (inside that block).”
The Role of the B-Tree
Section titled “The Role of the B-Tree”Without an index, the database would have to perform a Sequential Scan, reading every single 8 KB page from disk to find a record. The B-Tree Index is a separate physical file that stores a sorted list of values (e.g., user_id or any primary key) along with their corresponding CTIDs.
When you run a query like SELECT * FROM users WHERE id = 35:
-
Traversing the Tree: The database starts at the “Root” of the B-Tree and moves down through “Internal Nodes” by comparing the value
35. -
Finding the Leaf: It eventually reaches a Leaf Node. This leaf node contains the key
35and its physical address in the table: e.g.(Block 12, Offset 3). -
The “Pointer” Jump: The database engine now knows exactly which 8 KB page to load from the table file. It jumps directly to Block 12 and grabs the 3rd record.
Decoupling for Performance
Section titled “Decoupling for Performance”A crucial detail of the physical organization is that the B-Tree does not point to the data bytes directly. It points to the Line Pointer (Item Identifier) at the top of the page.
- Why? If the database needs to move a record within a page (e.g., to reclaim space), it only needs to update the internal Line Pointer.
- The Benefit: The B-Tree index remains valid because the “Offset Number” hasn’t changed, even if the “Data Bytes” moved a few “centimeters” to the left or right within that 8 KB block.
Summary of the Path
Section titled “Summary of the Path”- User Query ➡️
WHERE id = 35 - B-Tree Index ➡️ Searches for
35, findsCTID (Block, Offset) - Buffer Manager ➡️ Checks if
Blockis in RAM; if not, reads 8 KB from disk. - Page Access ➡️ Looks at
Offsetin the page’s Line Pointer array. - Data Retrieval ➡️ Follows the Line Pointer to the actual Tuple bytes.
MVCC and Versioning
Section titled “MVCC and Versioning”One key feature of modern RDBMS, especially PostgreSQL, is how they handle concurrent transactions. Accessing the same data simultaneously can lead to conflicts, and different databases have different strategies to manage this.
In PostgreSQL’s instead of using traditional row-level locking—which would force readers to wait for writers, PostgreSQL uses Multi-Version Concurrency Control (MVCC). In this model, the database does not overwrite data in place. Instead, it maintains multiple versions of the same row simultaneously.
The Physical Impact of Updates and Deletes
Section titled “The Physical Impact of Updates and Deletes”In many other databases (like MariaDB with InnoDB), updates are often handled using “undo logs.” In PostgreSQL, however, every UPDATE is physically treated as a DELETE followed by an INSERT.
- INSERT: A new tuple is added to a page.
- DELETE: The existing tuple is not removed from the disk; it is merely marked as “expired.”
- UPDATE: The old version of the row is marked as expired, and a brand-new version (with the updated values) is stored elsewhere in the heap, potentially even on a different page.
This means that at any given time, a single 8 KB page may contain several “dead” tuples that are no longer visible to any active transaction but still occupy physical space.
Table Bloat and VACUUM
Section titled “Table Bloat and VACUUM”The physical consequence of this architecture is a phenomenon known as Table Bloat. Since updated and deleted rows remain on disk, the database files will continue to grow even if the amount of “live” data remains constant. To reclaim this space, PostgreSQL employs a background process called VACUUM.
- The Process: VACUUM scans the pages, identifies tuples and marks space as “available” for future inserts.
- Autovacuum: In modern installations, a daemon automatically triggers this process based on the percentage of changed rows.
- Limitations: A standard VACUUM does not return space to the operating system; it only makes it available for new data within the existing 1 GB relation files. Only a
VACUUM FULL(which rewrites the entire table) can shrink the file size on disk, though it requires an exclusive lock on the table.
TOAST (The Oversized-Attribute Storage Technique)
Section titled “TOAST (The Oversized-Attribute Storage Technique)”Since PostgreSQL uses a fixed page size of 8 KB, it faces a physical challenge when a single row contains a large amount of data—such as a long text block, a high-resolution image, or a complex JSONB object. Because a tuple must fit entirely within a single page, PostgreSQL employs a mechanism called TOAST. This “out-of-line” storage technique ensures that large attributes do not bloat the main heap table or exceed the physical limits of a block.
How TOAST Works
Section titled “How TOAST Works”When a row is inserted or updated, PostgreSQL checks if the total size exceeds a certain threshold (usually 2 KB). If the data is too large, the system triggers the TOAST process:
- Compression: First, PostgreSQL attempts to compress the large value to see if it can fit back into the 8 KB page.
- Out-of-line Storage: If the compressed data is still too large, it is moved to a separate, internal TOAST table associated with the main table.
- The Pointer: In the original heap page, the actual data is replaced by a small Pointer (a “Toast Pointer”). This pointer contains the OID of the TOAST table and the specific location where the large value is stored.
Advantages for Performance
Section titled “Advantages for Performance”This physical separation offers a significant performance advantage for standard queries. When you execute a SELECT * but only filter by a small column (like a user_id or status), the database only needs to read the 8 KB pages of the main heap. The massive “TOASTed” data is never loaded into memory unless the specific large column is actually requested. This keeps the Buffer Cache clean and allows more “small” rows to fit into a single page, drastically speeding up sequential scans and index lookups.
Indexes at the Physical Level
Section titled “Indexes at the Physical Level”While both tables and indexes in PostgreSQL are stored as Relation Files divided into 8 KB pages, their internal physical structure differs significantly. In a heap file, data is added to any page with enough free space, creating an unordered collection. In contrast, an index file maintains a highly organized, hierarchical structure.
For a B-Tree index, the physical pages are categorized into meta-pages, internal nodes, and leaf nodes. The meta-page at the beginning of the file contains global information about the index, while the internal nodes store keys that act as signposts to guide the search. The leaf nodes, located at the bottom of the hierarchy, store the actual indexed values along with the CTIDs that point back to the heap.
Write Ahead Logging (WAL)
Section titled “Write Ahead Logging (WAL)”The Write Ahead Log (WAL) is the fundamental mechanism that ensures data integrity and recovery in the event of a system crash. Instead of immediately writing every change to the main 8 KB heap and index pages—which would require expensive random I/O—PostgreSQL first records the change as a sequential entry in the WAL.
This log, stored in the pg_wal directory, serves as a chronological diary of all modifications. The “Write Ahead” principle dictates that no data page can be flushed to the permanent relation files on disk until the corresponding log entry has been safely committed to the WAL.
This sequential writing is significantly faster than updating various scattered data files, allowing the database to provide high performance while guaranteeing that no committed transaction is lost.
Backup and Recovery
Section titled “Backup and Recovery”Database backup and recovery strategies are essential for ensuring data durability and availability. The physical organization of data, including the structure of heap files, indexes, and the WAL, directly impacts how backups are created and how recovery processes work.
Reasons for Backup and Recovery
Section titled “Reasons for Backup and Recovery”Despite the robust physical architecture of modern DBMS like PostgreSQL, data remains vulnerable to various external and internal threats. Backups are not merely a safety net but a critical component of data lifecycle management for the following reasons:
-
Hardware and System Failures: Even with high-end storage solutions, physical media can fail. A “head crash” on a traditional HDD or a controller failure on an SSD can lead to unrecoverable data corruption within the relation files.
-
Human Error and Malicious Activity: Accidental
DROP TABLEorDELETEcommands are among the most common causes of data loss. Since these operations are technically “valid” transactions, the database will faithfully execute them and record them in the WAL. In such cases, only a backup created prior to the mistake (combined with Point-in-Time Recovery) can restore the lost information. This also applies to malicious attacks, such as ransomware, which may encrypt the physical data directory. -
Software Defects and Corrupted Updates: Bugs in the DBMS itself, a poorly written extension, or an interrupted major version upgrade of an application can lead to data corruption.
-
Compliance and Legal Requirements: Many industries are legally mandated to retain snapshots of their data for several years. Physical backups serve as immutable records of the database state at a specific point in history, fulfilling auditing requirements that a live, constantly changing database cluster cannot meet.
Creating Backups
Section titled “Creating Backups”In many DBMS like PostgreSQL and MariaDB, there are two fundamentally different ways to create a backup, each interacting differently with the physical storage layer. Choosing the right method depends on the size of the database, the required recovery speed, and whether the backup needs to be portable across different versions or operating systems.
Logical Backups (pg_dump)
Section titled “Logical Backups (pg_dump)”A logical backup extracts the database structure and its content into a set of SQL commands or a custom archive format. When you run pg_dump (or mysqldump for MariaDB), the tool connects to the database like a regular client and reads the data through the SQL layer.
-
Mechanism: It performs a sequential scan of all tables, effectively bypassing the physical 8 KB page structure and focusing on the data values themselves.
-
Portability: Because it stores data as SQL (e.g.,
INSERTstatements), a logical backup can be restored onto a different CPU architecture or a newer version of PostgreSQL. -
Granularity: You can easily back up a single table, a specific schema, or the entire database.
-
Downside: For very large databases (Terabytes), logical backups are slow because the database must convert every physical row into text/SQL and then re-parse and re-index it during restoration.
Create a logical backup for the database ‘my_database’ as the user ‘db_admin’ with pg_dump in PostgreSQL:
pg_dump -U db_admin -d my_database &> BACKUP_2026-03-18.sqlCheck the size of the backup file (15KB in this example):
ls -hl BACKUP_2026-03-18.sql-rw-r--r-- 1 root root 15K Mar 18 08:12 BACKUP_2026-03-18.sqlCheck the first few lines of the backup file to confirm it contains SQL commands (omitting empty lines):
grep -v '^$' BACKUP_2026-03-18.sql | head -n 20---- PostgreSQL database dump--\restrict eDhVav5z1NQbVObw4zGJcnWnIu3FzAe9rR56JvPg1jp3F7SpZW7rGGULIzvwEAs-- Dumped from database version 17.6 (Debian 17.6-2.pgdg13+1)-- Dumped by pg_dump version 17.6 (Debian 17.6-2.pgdg13+1)SET statement_timeout = 0;SET lock_timeout = 0;SET idle_in_transaction_session_timeout = 0;SET transaction_timeout = 0;SET client_encoding = 'UTF8';SET standard_conforming_strings = on;SELECT pg_catalog.set_config('search_path', '', false);SET check_function_bodies = false;SET xmloption = content;SET client_min_messages = warning;SET row_security = off;---- Name: RoleEnum; Type: TYPE; Schema: public; Owner: digidog--Physical Backups (pg_basebackup)
Section titled “Physical Backups (pg_basebackup)”A physical backup, often called a “Binary Backup,” creates a bit-for-bit copy of the files in the PGDATA directory. This includes all 1 GB relation segments, the WAL files, and the internal metadata.
It is important to note that a physical backup is not just a simple file copy. The database must ensure that the backup is consistent, meaning that all files are in a state that can be used to start the database without corruption.
⚠️ This is complex because the database is constantly writing to these files, and a simple cp or tar command of the data directory would likely result in a corrupted backup. ⚠️
-
Mechanism: Tools like
pg_basebackupcopy the actual 8 KB pages directly from the file system. It is much faster for large datasets because it avoids the overhead of SQL processing. -
Consistency:
pg_basebackupuses a special replication protocol to ensure that the copied files, combined with the WAL segments generated during the backup, result in a consistent state. -
Restoration Speed: Recovery is significantly faster than logical backups because the database engine simply “starts up” using the copied files, only needing to replay a few WAL entries instead of rebuilding all indexes from scratch.
Incremental and Differential Backups
Section titled “Incremental and Differential Backups”While a physical “full” backup (pg_basebackup) copies every single 8 KB page of the entire database, Incremental Backups only store the pages that have actually changed since the last backup.
This approach significantly reduces the volume of data transferred and stored, especially for multi-terabyte databases where only a small percentage of rows are modified daily. In PostgreSQL (starting with version 17), this is implemented through a “block-level” tracking mechanism. Instead of scanning the entire heap to find changes, the database can maintain a summarized file of modified blocks, allowing the backup tool to pinpoint and extract only the necessary 8 KB units.
Steps for an incremental backup:
-
Configure the database to track block changes. You may need to enable this feature in the PostgreSQL configuration (
postgresql.conf) and restart the server. -
Initial Full Backup: Start with a complete physical backup using
pg_basebackup.
pg_basebackup -D /path/to/full_backup -Fp -P -U db_admin-
Track Changes: The database engine tracks which blocks have been modified since the last backup.
-
Create Incremental Backup: Use a tool that reads the block change tracking file to create a backup containing only the changed blocks.
pg_basebackup -D /path/to/incremental_1 --incremental=/path/to/full_backup backup_manifest -Fp -P -U db_adminRestore
Section titled “Restore”A backup is only as good as the ability to restore it. The restore process is the operational reversal of the backup, where data is moved from the backup storage back into the database engine’s physical layout. Depending on whether a logical or physical approach was used, the restore process interacts differently with the system’s pages, indexes, and write-ahead logs.
Restoring a Logical Backup
Section titled “Restoring a Logical Backup”Restoring a logical backup (created with pg_dump or mysqldump) is essentially an execution of a large SQL script. Since the backup consists of standard DDL (CREATE TABLE) and DML (INSERT) statements, the database engine must process every line as if it were a new transaction.
-
The Process: You use the
psql(ormariadb) utility. -
Physical Execution: As the commands are executed, the DBMS must allocate new 8 KB pages, insert tuples into the heap, and—most importantly—rebuild every index from scratch.
-
Performance: This is CPU and I/O intensive. Because indexes are recalculated during the import, restoring a multi-terabyte database logically can take significantly longer than the original backup took to create.
To restore a plain SQL backup file to a database:
psql -U db_admin -d my_database -f BACKUP_2026-03-18.sqlRestoring a Physical Backup
Section titled “Restoring a Physical Backup”Restoring a physical backup (created with pg_basebackup) is more like a file system operation than a database operation. Instead of executing SQL, you are replacing the contents of the PGDATA directory in PostgreSQL.
- The Process:
- The database server must be stopped.
- The existing (corrupted or old) data directory is cleared.
- The backup files (the 8 KB pages and control files) are copied into the
PGDATAlocation. - The server is started.
-
Physical Execution: Upon startup, the PostgreSQL engine enters “Recovery Mode.” It checks the
pg_controlfile and uses the archived WAL segments to “replay” any transactions that occurred during or after the backup was taken. -
Performance: This is extremely fast because the indexes and tables are already in their final binary form. There is no need for SQL parsing or index recalculation.
Point-in-Time Recovery (PITR)
Section titled “Point-in-Time Recovery (PITR)”Point-in-Time Recovery (PITR) is an advanced form of physical restore. It allows you to restore a database to a specific moment in time (e.g., “10:14 AM last Tuesday”), which is invaluable if a mistake—like accidentally dropping a production table—happened at 10:15 AM.
PITR requires two components:
- A Base Backup: A physical snapshot of the data directory.
- WAL Archiving: A continuous stream of WAL segments saved since that base backup was taken.
During recovery, you tell PostgreSQL the “recovery target time.” The engine starts with the base backup and “fast-forwards” through the WAL logs, re-applying every physical change block-by-block, until it reaches the exact microsecond before the error occurred. This provides a level of data resilience that is nearly impossible to achieve with logical backups alone.
The “recovery.conf” file is used to configure the recovery process, specifying the location of the base backup and the WAL archive, as well as the target time for recovery.
-- Specify recovery target timerestore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'recovery_target_time = '2026-01-25 14:34:00'recovery_target_action = 'promote'
-- Alternative targets:-- recovery_target_xid = '12345' # Recover to specific transaction-- recovery_target_lsn = '0/15D68C50' # Recover to specific WAL position-- recovery_target_name = 'before_migration' # Recover to named restore pointSee How to Recover Data with Point-in-Time Recovery in PostgreSQL for a detailed walkthrough of the PITR process. [16]