Why is `SELECT DISTINCT country` Slow on Large Tables?

Published on 2026-04-15 11:08 by Frugle Me (Last updated: 2026-04-15 11:08)

#distinct #slow


Running a SELECT DISTINCT query on a column like country might seem simple, but as your users table grows into the millions of rows, performance often hits a wall. Here is a technical breakdown of why this happens and how to fix it.


⚙️ How the Database Processes Your Query

When you execute SELECT DISTINCT country FROM users;, the database engine cannot simply "know" which countries exist. It has to find every unique value by following these steps:

1. Full Table Scans

If the country column is not indexed, the database must perform a Full Table Scan.
* It reads every single row from the disk.
* It loads every data page into memory.
* For a table with 50 million users, this is a massive I/O operation.
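You can see this plan for yourself. Here is a minimal sketch using SQLite (table and column names mirror the article's `users`/`country` example); with no index on `country`, the planner has no option but to scan the whole table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO users (country) VALUES (?)",
    [("US",), ("DE",), ("US",), ("JP",)],
)

# With no index on country, the only possible plan is a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT DISTINCT country FROM users"
).fetchall()
details = " ".join(row[3] for row in plan)
print(details)  # includes "SCAN users"
```

Other engines expose the same information through their own `EXPLAIN` output (e.g. `Seq Scan` in PostgreSQL).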

2. The Sorting and Hashing Overhead

To identify "distinct" values, the database has to compare every value it finds against the values it has already seen. It usually does this in one of two ways:
* Hash Aggregation: It builds a temporary hash table in memory. If that table exceeds the allocated RAM (e.g. `work_mem` in PostgreSQL), it spills to disk, slowing down the process significantly.
* Sort-Based Deduplication: It sorts the entire column so duplicates land next to each other, then emits one value per group. Sorting is computationally expensive ($O(n \log n)$).
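The two strategies above can be sketched as toy Python functions (illustrative only; real engines operate on disk pages, not lists):

```python
def distinct_hash(values):
    """Hash aggregation: one pass, memory grows with the number of uniques."""
    seen = set()
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

def distinct_sort(values):
    """Sort-based dedup: O(n log n) sort puts duplicates next to each other."""
    out = []
    for v in sorted(values):
        if not out or out[-1] != v:
            out.append(v)
    return out

rows = ["US", "DE", "US", "JP", "DE"]
print(distinct_hash(rows))  # ['US', 'DE', 'JP'] (input order preserved)
print(distinct_sort(rows))  # ['DE', 'JP', 'US'] (sorted order)
```

Note the trade-off: hashing is a single pass but needs memory proportional to the unique count, while sorting touches every row in $O(n \log n)$ regardless of how few uniques there are.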

3. High Cardinality vs. Low Cardinality

* Low Cardinality: If you only have 200 countries in a table of 10 million rows, the database still has to read all 10 million rows and discard nearly all of them as duplicates.
* High Cardinality: If the column has many unique values (like email), the memory footprint for the "unique list" becomes massive.
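A toy Python sketch of the second point: the size of the deduplication structure tracks cardinality, not row count (values here are synthetic stand-ins for country codes and email addresses):

```python
import sys

rows = 1_000_000
low = {f"country-{i % 200}" for i in range(rows)}      # ~200 unique values
high = {f"user-{i}@example.com" for i in range(rows)}  # 1,000,000 unique values

# Same number of input rows, wildly different memory for the "unique list".
print(len(low), sys.getsizeof(low))    # 200 entries, a few KB
print(len(high), sys.getsizeof(high))  # 1,000,000 entries, tens of MB
```

(`sys.getsizeof` measures only the set's own hash table, not the strings inside it, but the disparity is the point.)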

🛠️ The "Index" Misconception

Adding a standard B-Tree index on country helps, but it isn't a magic bullet.

Even with an index, the database may still perform an Index Full Scan. While reading an index is faster than reading the raw table (because indexes are smaller), the database still has to traverse the entire index leaf-node list to ensure it hasn't missed a single unique entry.
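Continuing the SQLite sketch from above, adding the index changes the plan from a table scan to an index scan, but it is still a scan over every entry:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
conn.execute("CREATE INDEX idx_users_country ON users (country)")

# The index is smaller than the table and already sorted, but the plan
# still walks its entire leaf level: a "covering index" scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT DISTINCT country FROM users"
).fetchall()
details = " ".join(row[3] for row in plan)
print(details)  # includes "COVERING INDEX idx_users_country"
```

The win is that the sort step disappears (the index is already ordered) and fewer pages are read, but the I/O is still proportional to the row count.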


🚀 Solutions for Better Performance

If your query is taking seconds or minutes, consider these architectural changes:

1. Use a Lookup Table (Normalization)

Instead of storing "United States" 10 million times in the users table:
* Create a countries table with an id and name.
* Store country_id in the users table.
* New Query: SELECT name FROM countries; (this scans only ~200 rows, so it returns almost instantly).
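A minimal sketch of the normalized schema in SQLite (schema and names are illustrative, mirroring the article's example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        country_id INTEGER REFERENCES countries (id)
    );
""")
conn.execute("INSERT INTO countries (name) VALUES ('United States'), ('Japan')")
us_id = conn.execute(
    "SELECT id FROM countries WHERE name = 'United States'"
).fetchone()[0]
conn.executemany("INSERT INTO users (country_id) VALUES (?)", [(us_id,)] * 3)

# The "which countries exist?" question no longer touches users at all.
names = [r[0] for r in conn.execute("SELECT name FROM countries ORDER BY name")]
print(names)  # ['Japan', 'United States']
```

The cost of the distinct-countries question is now bounded by the size of the lookup table, not by the number of users.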

2. Loose Index Scan (Skip Scan)

Some databases (like MySQL 8.0+) support this natively; in PostgreSQL you can emulate it with a recursive CTE. Instead of reading every row, the database:
1. Grabs the first value in the index.
2. Jumps (skips) directly to the next higher value in the index, repeating until no higher value exists.
This turns a multi-million-row scan into a few hundred index probes.
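The recursive-CTE emulation can be sketched in SQLite (the same query shape works in PostgreSQL); each recursive step is one index probe for "the smallest value greater than the last one found":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
conn.execute("CREATE INDEX idx_users_country ON users (country)")
conn.executemany(
    "INSERT INTO users (country) VALUES (?)",
    [("US",)] * 1000 + [("DE",)] * 1000 + [("JP",)] * 1000,
)

# 3 probes instead of a 3000-row scan: each MIN(...) with a > filter
# is a single descent into the index.
skip_scan = """
WITH RECURSIVE uniq(country) AS (
    SELECT MIN(country) FROM users
    UNION ALL
    SELECT (SELECT MIN(country) FROM users WHERE country > uniq.country)
    FROM uniq
    WHERE uniq.country IS NOT NULL
)
SELECT country FROM uniq WHERE country IS NOT NULL;
"""
uniques = [r[0] for r in conn.execute(skip_scan)]
print(uniques)  # ['DE', 'JP', 'US']
```

The number of probes equals the number of distinct values plus one, so with 200 countries this stays cheap no matter how many user rows exist.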

3. Materialized Views or Caching

If the list of countries doesn't change every second:
* Cache the result in Redis.
* Use a Materialized View that refreshes once an hour.
* Maintain a separate summary table updated by triggers.
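The trigger-maintained summary table can be sketched in SQLite as follows (names are illustrative; note that a production version would also need delete/update handling, which is omitted here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE distinct_countries (country TEXT PRIMARY KEY);

    -- Keep the summary table in sync on every insert.
    CREATE TRIGGER trg_users_country AFTER INSERT ON users
    BEGIN
        INSERT OR IGNORE INTO distinct_countries (country)
        VALUES (NEW.country);
    END;
""")
conn.executemany(
    "INSERT INTO users (country) VALUES (?)",
    [("US",), ("DE",), ("US",), ("JP",)],
)

# Readers hit the tiny summary table, never the big one.
summary = [r[0] for r in conn.execute(
    "SELECT country FROM distinct_countries ORDER BY country"
)]
print(summary)  # ['DE', 'JP', 'US']
```

This moves the deduplication cost to write time, where it is amortized one row at a time, instead of paying it on every read.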


📝 Summary

The query is slow because DISTINCT is an exhaustive operation. Without a "Skip Scan" (native or emulated), a normalized schema, or a precomputed summary, the database is forced to do the heavy lifting of reading and deduplicating every single row in the table.
