Subqueries
Subqueries are queries nested within other queries. They appear in SELECT lists, FROM clauses, and WHERE clauses. Supporting subqueries requires both parsing them correctly and planning them efficiently.
Types of Subqueries
Scalar Subqueries
A scalar subquery returns a single value and can appear wherever a scalar expression is valid:
SELECT id, name,
(SELECT AVG(salary) FROM employees) AS avg_salary
FROM employees
The subquery (SELECT AVG(salary) FROM employees) returns one value that applies to every row. If a scalar subquery returns more than one row, that is an error.
Correlated Subqueries
A correlated subquery references columns from the outer query:
SELECT id, name,
(SELECT COUNT(*) FROM orders WHERE orders.customer_id = customers.id) AS order_count
FROM customers
The inner query references customers.id from the outer query. This correlation creates a dependency: conceptually, the inner query runs once per row of the outer query.
Uncorrelated Subqueries
An uncorrelated subquery is self-contained:
SELECT * FROM orders
WHERE total > (SELECT AVG(total) FROM orders WHERE region = 'West')
The subquery does not reference the outer query, so it can be evaluated once and the result substituted.
EXISTS and IN Subqueries
The EXISTS predicate tests whether a subquery returns any rows:
SELECT id FROM customers
WHERE EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.id)
This returns customers who have at least one order.
The IN predicate tests membership in a set:
SELECT * FROM products
WHERE category_id IN (SELECT id FROM categories WHERE active = true)
Both NOT EXISTS and NOT IN provide the negated forms.
Planning Subqueries
Uncorrelated Subqueries
Uncorrelated scalar subqueries are straightforward: execute them once during planning (or once at the start of execution) and substitute the result as a literal value.
SELECT * FROM orders WHERE total > (SELECT AVG(total) FROM orders)
Becomes:
SELECT * FROM orders WHERE total > 42500.00 -- after evaluating subquery
Correlated Subqueries: The Naive Approach
A naive implementation executes the correlated subquery once per row of the outer query. For the order count example:
SELECT id, name,
(SELECT COUNT(*) FROM orders WHERE orders.customer_id = customers.id) AS order_count
FROM customers
If customers has 100,000 rows, we run 100,000 separate queries against orders. This is extremely slow.
Decorrelation: Converting Subqueries to Joins
The solution is decorrelation: rewriting correlated subqueries as joins. The query above becomes:
SELECT c.id, c.name, COALESCE(o.order_count, 0) AS order_count
FROM customers c
LEFT JOIN (
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
) o ON c.id = o.customer_id
Now we scan orders once, aggregate, and join, which is much faster.
EXISTS to Semi Join
An EXISTS subquery becomes a semi join. A semi join returns rows from the left side where at least one match exists on the right, without duplicating left rows when multiple matches exist.
SELECT id FROM foo WHERE EXISTS (SELECT 1 FROM bar WHERE foo.id = bar.id)
Becomes:
Projection: foo.id
LeftSemi Join: foo.id = bar.id
Scan: foo
Scan: bar
NOT EXISTS to Anti Join
NOT EXISTS becomes an anti join, which returns rows from the left side where no match exists on the right:
SELECT id FROM foo WHERE NOT EXISTS (SELECT 1 FROM bar WHERE foo.id = bar.id)
Becomes:
Projection: foo.id
LeftAnti Join: foo.id = bar.id
Scan: foo
Scan: bar
IN to Semi Join
IN subqueries also become semi joins:
SELECT * FROM products WHERE category_id IN (SELECT id FROM categories WHERE active = true)
Becomes:
LeftSemi Join: products.category_id = categories.id
Scan: products
Filter: active = true
Scan: categories
Implementation Complexity
Subquery decorrelation is one of the more complex parts of a query engine. Challenges include:
Identifying correlation: The planner must determine which columns in the subquery reference the outer query.
Choosing the right join type: EXISTS maps to semi join, NOT EXISTS to anti join, scalar subqueries to left outer joins with aggregation.
Handling multiple correlations: A subquery might reference multiple outer tables in a complex query.
Preserving semantics: The rewritten query must produce exactly the same results, including handling of NULLs.
Nested subqueries: Subqueries can contain subqueries, requiring recursive decorrelation.
KQuery does not currently implement subqueries. Production query engines spend significant effort on subquery support since it is essential for SQL compatibility.
When Decorrelation Is Not Possible
Some correlated subqueries cannot be decorrelated into standard joins. These “lateral” or “dependent” subqueries must be evaluated per-row. Modern databases support LATERAL joins for this case:
SELECT c.*, recent_orders.*
FROM customers c,
LATERAL (SELECT * FROM orders WHERE customer_id = c.id ORDER BY date DESC LIMIT 3) recent_orders
This returns each customer with their three most recent orders. The LIMIT 3 depends on which customer we are processing, so it cannot be rewritten as a simple join.
Handling lateral joins requires either per-row evaluation (slow) or specialized operators that combine joining with limiting.
This book is also available for purchase in ePub, MOBI, and PDF format from https://leanpub.com/how-query-engines-work
Copyright © 2020-2025 Andy Grove. All rights reserved.