8. SQL Aggregation and Grouping

Aggregation and Grouping in SQL

While retrieving individual rows is essential, the real power of SQL lies in its ability to transform raw data into meaningful insights. In professional environments, you rarely need to see every single transaction or student record; instead, you need answers to high-level questions:

What is the average grade in a class?
How many students are enrolled in each subject?
What was the total revenue for the last month?

This is where Aggregation Functions and the GROUP BY clause come into play. Aggregation allows you to “collapse” multiple rows of data into a single summary value, while grouping enables you to partition your data into logical buckets before performing those calculations. Mastery of these tools is what turns a database user into a data analyst, allowing you to identify patterns and trends within your relational data.

Grouping Data (`GROUP BY`)

The GROUP BY clause is used to arrange identical data into groups. This “collapsing” of rows is the prerequisite for performing calculations on specific subsets of your data (e.g., calculating the average grade per class instead of for the entire school).

How it Works: The Bucket Concept

Imagine your students table. When you group by the class column, SQL creates a “bucket” for each unique class name (4A, 4B, etc.). Every row belonging to “4A” is placed into the 4A bucket.

The Golden Rule of Grouping:

When using GROUP BY, every column listed in your SELECT statement must satisfy one of two conditions:

It is featured in the GROUP BY clause.
It is wrapped in an Aggregate Function (like SUM or COUNT).

SELECT class, COUNT(student_id)
FROM students
GROUP BY class;

The query below is not valid, because first_name is neither grouped nor aggregated. The database wouldn’t know which first_name to show for the whole class, since there are multiple students in each class.

SELECT class, first_name
FROM students
GROUP BY class;

Grouping by More Than One Column

So far we have grouped by a single column. You can also list several columns in the GROUP BY clause. The key thing to understand is what becomes a bucket: SQL creates one bucket for every unique combination of values across all listed columns — not one bucket per column.

Think back to the boxes. With GROUP BY class the label on each box was just the class name. With GROUP BY class, subject the label on each box becomes a pair: (class, subject). Two rows land in the same box only if they agree on both values.

Imagine a grades table like this:

subject	score
Databases	1
Databases	1
Databases	2
Networking	1
Networking	3
Networking	3

Now we count how many times each score was given per subject:

SELECT subject, score, COUNT(*) AS how_many
FROM grades
GROUP BY subject, score;

SQL forms one bucket per unique (subject, score) combination and counts the rows in each:

subject	score	how_many
Databases	1	2
Databases	2	1
Networking	1	1
Networking	3	2

Things to notice:

“Databases” on its own is no longer a bucket. (Databases, 1) and (Databases, 2) are two separate buckets, because the second grouping column splits them apart.
The more columns you add to GROUP BY, the more buckets you get — and the smaller each bucket becomes (finer granularity).
The order of the columns in GROUP BY does not change which buckets are formed: GROUP BY subject, score and GROUP BY score, subject produce the same groups. (It only matters for the displayed order if you add an ORDER BY.)
The Golden Rule still applies: every column in the SELECT list (subject and score) is either listed in GROUP BY or wrapped in an aggregate function.

Aggregate Functions

Aggregate functions perform a calculation on a set of values and return a single value. They are almost always used in conjunction with GROUP BY.

Function	Description	Example
`COUNT()`	Returns the number of rows.	`COUNT(*)` counts all rows; `COUNT(col)` counts non-null values.
`SUM()`	Calculates the total sum of a numeric column.	`SUM(score)`
`AVG()`	Calculates the arithmetic mean (average).	`AVG(score)`
`MIN()`	Finds the minimum value in a set.	`MIN(birth_date)` (The oldest student)
`MAX()`	Finds the maximum value in a set.	`MAX(birth_date)` (The youngest student)

Example - Analyzing Class Performance:

SELECT
    subject,
    AVG(score) AS average_grade,
    COUNT(*) AS total_exams
FROM grades
GROUP BY subject;

Filtering: `WHERE` vs. `HAVING`

One of the most common points of confusion is when to use WHERE and when to use HAVING. Both are used for filtering, but they act at different stages of the query.

The `WHERE` Clause

The WHERE clause filters individual rows before any grouping happens. If a row doesn’t meet the WHERE condition, it never even makes it into a “bucket.”

The `HAVING` Clause

The HAVING clause filters groups after the GROUP BY and aggregate calculations have been performed. You use HAVING when you want to filter based on an aggregate result (e.g., “only show classes with more than 20 students”).

Comparison Example:

In the example below, WHERE filters out students born before 2010 before grouping, while HAVING filters out groups (classes) that have 2 or fewer students after grouping.

SELECT class, COUNT(*) as student_count
FROM students
WHERE birth_date > '2010-01-01'
GROUP BY class
HAVING COUNT(*) > 2;

The Logical Order of Execution

To understand why HAVING comes after GROUP BY, it helps to look at the order in which the database actually processes your command. Even though we write SELECT first, the database executes it much later:

FROM & JOIN: The database gathers all the raw data.
WHERE: Individual rows are filtered out.
GROUP BY: The remaining rows are organized into buckets.
HAVING: Entire buckets (groups) are filtered out based on aggregate values.
SELECT: The final columns and calculations are prepared for display.
ORDER BY: The final result set is sorted.

Summary: Aggregation and Grouping

In relational databases, data analysis often requires shifting from viewing individual records to generating high-level summaries. This chapter covered how to use Aggregate Functions to perform calculations and the GROUP BY clause to organize data into logical subsets.

1. The Core Concept of Grouping

Grouping allows you to “collapse” multiple rows into single summary rows based on shared values in specific columns.

The Bucket Logic: When you group by a column, SQL places all records with the same value into a single “bucket” for calculation.
The Golden Rule: Every column in your SELECT list must either be part of the GROUP BY clause or be wrapped in an aggregate function. This ensures there is no ambiguity about which data point to display for a group.

2. Essential Aggregate Functions

These functions perform calculations on a set of values within a group (or the entire table) to return a single informative value.

Function	Purpose	Common Use Case
`COUNT()`	Counts the number of items.	Counting students per class or total orders.
`SUM()`	Adds up numeric values.	Calculating total revenue or total points.
`AVG()`	Finds the arithmetic mean.	Determining average test scores or prices.
`MIN()`	Finds the lowest value.	Identifying the lowest grade or earliest date.
`MAX()`	Finds the highest value.	Identifying the top score or latest entry.

3. Filtering: `WHERE` vs. `HAVING`

Understanding the timing of these two clauses is critical for accurate data analysis.

WHERE (Pre-filter): Acts on individual rows before grouping. Use it to exclude specific data from being calculated at all (e.g., “Exclude inactive users”).
HAVING (Post-filter): Acts on groups after aggregation. Use it to filter based on the results of functions like SUM or COUNT (e.g., “Only show departments with more than 10 employees”).

4. Logical Execution Order

To write successful complex queries, you must remember that the database does not process clauses in the order they are written. The execution flow is:

FROM / JOIN (Gather data)
WHERE (Filter rows)
GROUP BY (Organize into buckets)
HAVING (Filter buckets)
SELECT (Compute results)
ORDER BY (Sort final list)

Learning Outcomes: What you should be able to do

Calculate: Use functions like SUM, AVG, and COUNT to extract metrics from raw data.
Organize: Apply GROUP BY to generate reports categorized by specific attributes (e.g., by date, category, or department).
Distinguish: Correctly choose between WHERE and HAVING based on whether the filter applies to raw rows or aggregated results.
Analyze: Determine the validity of a query based on the “Golden Rule” of grouping.
Sequence: Structure complex SQL statements following the logical order of execution to avoid syntax and logic errors.