Introduction
Query engines are the invisible workhorses powering modern data infrastructure. Every time you run a SQL query against a database, execute a Spark job, or query a data lake, a query engine transforms your high-level request into an efficient execution plan. Understanding how query engines work gives you insight into one of the most important abstractions in computing.
This book takes a hands-on approach to demystifying query engines. Rather than surveying existing systems, we will build a fully functional query engine from scratch, covering each component in enough depth that you could implement your own.
Who This Book Is For
This book is for software engineers who want to understand the internals of query engines. You might be:
- A data engineer who wants to understand why queries perform the way they do
- A database developer looking to learn foundational concepts
- A software engineer curious about compiler-like systems
- Someone building tooling that needs to parse or analyze SQL
Basic programming knowledge is assumed. The examples use Kotlin, chosen for its conciseness, but the concepts apply to any language.
What You Will Learn
By the end of this book, you will understand how to:
- Design a columnar type system using Apache Arrow
- Build data source connectors for CSV and Parquet files
- Represent queries as logical and physical plans
- Create a DataFrame API for building queries programmatically
- Translate logical plans into executable physical plans
- Implement query optimizations like projection and predicate push-down
- Parse SQL and convert it to query plans
- Execute queries in parallel across multiple CPU cores
- Design distributed query execution across a cluster
How This Book Is Organized
The book follows the natural architecture of a query engine, building each layer on top of the previous one.
The first four chapters cover the foundations: What is a Query Engine?, Apache Arrow, Type System, and Data Sources. We start with what a query engine is, then introduce Apache Arrow as the memory model, a type system for representing data, and data source abstractions for reading files.
The next three chapters cover query representation: Logical Plans, DataFrames, and SQL Support. We define logical plans and expressions to represent queries abstractly, build a DataFrame API for constructing plans programmatically, and add SQL support so that queries can also be written as SQL text.
The Physical Plans, Query Planning, and Joins chapters cover execution. We translate logical plans into physical plans containing executable code, implement a query planner to automate that translation, then cover joins, one of the most complex operations in query processing.
The Subqueries, Query Optimizers, and Query Execution chapters continue with more advanced topics. We handle subqueries, build optimizer rules to transform plans into more efficient forms, and execute queries to compare performance.
The Parallel Query Execution and Distributed Query Execution chapters cover scaling. We extend the engine to execute queries in parallel across CPU cores, then across distributed clusters.
The Testing and Benchmarks chapters cover quality. We discuss testing strategies, including fuzzing, and benchmarking approaches for measuring performance.
This book is also available for purchase in ePub, MOBI, and PDF formats from https://leanpub.com/how-query-engines-work
Copyright © 2020-2025 Andy Grove. All rights reserved.