Subqueries

Subqueries are queries nested within other queries. They appear in SELECT lists, FROM clauses, and WHERE clauses. Supporting subqueries requires both parsing them correctly and planning them efficiently.

Types of Subqueries

Scalar Subqueries

A scalar subquery returns a single value and can appear wherever a scalar expression is valid:

SELECT id, name,
       (SELECT AVG(salary) FROM employees) AS avg_salary
FROM employees

The subquery (SELECT AVG(salary) FROM employees) returns one value that applies to every row. If a scalar subquery returns more than one row, that is an error.

Correlated Subqueries

A correlated subquery references columns from the outer query:

SELECT id, name,
       (SELECT COUNT(*) FROM orders WHERE orders.customer_id = customers.id) AS order_count
FROM customers

The inner query references customers.id from the outer query. This correlation creates a dependency: conceptually, the inner query runs once per row of the outer query.

Uncorrelated Subqueries

An uncorrelated subquery is self-contained:

SELECT * FROM orders
WHERE total > (SELECT AVG(total) FROM orders WHERE region = 'West')

The subquery does not reference the outer query, so it can be evaluated once and the result substituted.

EXISTS and IN Subqueries

The EXISTS predicate tests whether a subquery returns any rows:

SELECT id FROM customers
WHERE EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.id)

This returns customers who have at least one order.

The IN predicate tests membership in a set:

SELECT * FROM products
WHERE category_id IN (SELECT id FROM categories WHERE active = true)

Both NOT EXISTS and NOT IN provide the negated forms.

Planning Subqueries

Uncorrelated Subqueries

Uncorrelated scalar subqueries are straightforward: execute them once during planning (or once at the start of execution) and substitute the result as a literal value.

SELECT * FROM orders WHERE total > (SELECT AVG(total) FROM orders)

Becomes:

SELECT * FROM orders WHERE total > 42500.00  -- after evaluating subquery

Correlated Subqueries: The Naive Approach

A naive implementation executes the correlated subquery once per row of the outer query. For the order count example:

SELECT id, name,
       (SELECT COUNT(*) FROM orders WHERE orders.customer_id = customers.id) AS order_count
FROM customers

If customers has 100,000 rows, we run 100,000 separate queries against orders. This is extremely slow.

Decorrelation: Converting Subqueries to Joins

The solution is decorrelation: rewriting correlated subqueries as joins. The query above becomes:

SELECT c.id, c.name, COALESCE(o.order_count, 0) AS order_count
FROM customers c
LEFT JOIN (
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
) o ON c.id = o.customer_id

Now we scan orders once, aggregate, and join, which is much faster.

EXISTS to Semi Join

An EXISTS subquery becomes a semi join. A semi join returns rows from the left side where at least one match exists on the right, without duplicating left rows when multiple matches exist.

SELECT id FROM foo WHERE EXISTS (SELECT 1 FROM bar WHERE foo.id = bar.id)

Becomes:

Projection: foo.id
    LeftSemi Join: foo.id = bar.id
        Scan: foo
        Scan: bar

NOT EXISTS to Anti Join

NOT EXISTS becomes an anti join, which returns rows from the left side where no match exists on the right:

SELECT id FROM foo WHERE NOT EXISTS (SELECT 1 FROM bar WHERE foo.id = bar.id)

Becomes:

Projection: foo.id
    LeftAnti Join: foo.id = bar.id
        Scan: foo
        Scan: bar

IN to Semi Join

IN subqueries also become semi joins:

SELECT * FROM products WHERE category_id IN (SELECT id FROM categories WHERE active = true)

Becomes:

LeftSemi Join: products.category_id = categories.id
    Scan: products
    Filter: active = true
        Scan: categories

Implementation Complexity

Subquery decorrelation is one of the more complex parts of a query engine. Challenges include:

Identifying correlation: The planner must determine which columns in the subquery reference the outer query.

Choosing the right join type: EXISTS maps to semi join, NOT EXISTS to anti join, scalar subqueries to left outer joins with aggregation.

Handling multiple correlations: A subquery might reference multiple outer tables in a complex query.

Preserving semantics: The rewritten query must produce exactly the same results, including handling of NULLs.

Nested subqueries: Subqueries can contain subqueries, requiring recursive decorrelation.

KQuery does not currently implement subqueries. Production query engines spend significant effort on subquery support since it is essential for SQL compatibility.

When Decorrelation Is Not Possible

Some correlated subqueries cannot be decorrelated into standard joins. These "lateral" or "dependent" subqueries must be evaluated per-row. Modern databases support LATERAL joins for this case:

SELECT c.*, recent_orders.*
FROM customers c,
LATERAL (SELECT * FROM orders WHERE customer_id = c.id ORDER BY date DESC LIMIT 3) recent_orders

This returns each customer with their three most recent orders. The LIMIT 3 depends on which customer we are processing, so it cannot be rewritten as a simple join.

Handling lateral joins requires either per-row evaluation (slow) or specialized operators that combine joining with limiting.

Subquery decorrelation is itself a form of query optimization, transforming a slow nested execution into an efficient join. The next chapter explores other optimizations that query engines apply to improve performance.

This book is also available for purchase in ePub, MOBI, and PDF format from https://leanpub.com/how-query-engines-work

Keyboard shortcuts

How Query Engines Work