SQL Support
The source code discussed in this chapter can be found in the sql module of the KQuery project.
The previous chapter showed how to build queries using the DataFrame API. But most users expect to write SQL. This chapter covers parsing SQL text into a logical plan, which involves two steps: parsing (text to syntax tree) and planning (syntax tree to logical plan).
The Journey from SQL to Logical Plan
Consider this query:
SELECT id, first_name, salary * 1.1 AS new_salary
FROM employee
WHERE state = 'CO'
To execute this, we need to:
- Tokenize: Break the text into tokens (keywords, identifiers, literals, operators)
- Parse: Build a syntax tree that represents the query structure
- Plan: Convert the syntax tree into a logical plan
The result is the same logical plan we could build with the DataFrame API, but constructed from SQL text instead of code.
Tokenizing
The tokenizer (or lexer) converts a string into a sequence of tokens. Each token has a type and a value.
data class Token(val text: String, val type: TokenType, val endOffset: Int)
Token types include:
- Keywords: SELECT, FROM, WHERE, AND, OR, etc.
- Identifiers: table names, column names
- Literals: strings ('hello'), numbers (42, 3.14)
- Symbols: operators and punctuation (+, -, *, /, =, <, >, (, ), ,)
For the query SELECT a + b FROM c, tokenizing produces:
listOf(
    Token("SELECT", Keyword.SELECT, ...),
    Token("a", Literal.IDENTIFIER, ...),
    Token("+", Symbol.PLUS, ...),
    Token("b", Literal.IDENTIFIER, ...),
    Token("FROM", Keyword.FROM, ...),
    Token("c", Literal.IDENTIFIER, ...)
)
The tokenizer handles details like recognizing that SELECT is a keyword but employee is an identifier, parsing string literals with their quotes, and recognizing multi-character operators like <= and !=.
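To make this concrete, here is a minimal tokenizer sketch. It is illustrative rather than KQuery's actual implementation: it produces raw token strings and uses a keyword set to distinguish SELECT from an identifier like employee, but omits quoted strings and multi-character operators.

```kotlin
// Minimal tokenizer sketch (illustrative, not KQuery's actual code).
val keywords = setOf("SELECT", "FROM", "WHERE", "AND", "OR")

fun tokenize(sql: String): List<String> {
    val tokens = mutableListOf<String>()
    var i = 0
    while (i < sql.length) {
        val c = sql[i]
        when {
            c.isWhitespace() -> i++                  // skip whitespace
            c.isLetter() -> {                        // keyword or identifier
                val start = i
                while (i < sql.length && (sql[i].isLetterOrDigit() || sql[i] == '_')) i++
                tokens.add(sql.substring(start, i))
            }
            c.isDigit() -> {                         // numeric literal
                val start = i
                while (i < sql.length && (sql[i].isDigit() || sql[i] == '.')) i++
                tokens.add(sql.substring(start, i))
            }
            else -> {                                // single-character symbol
                tokens.add(c.toString())
                i++
            }
        }
    }
    return tokens
}
```

A real tokenizer also handles quoted string literals and multi-character operators such as <= and !=, typically by peeking one character ahead before emitting a symbol token.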
Parsing with Pratt Parsers
Parsing turns tokens into a tree structure. The challenge is handling operator precedence correctly. In 1 + 2 * 3, multiplication binds tighter than addition, so the result should be 1 + (2 * 3), not (1 + 2) * 3.
KQuery uses a Pratt parser, based on Vaughan Pratt’s 1973 paper “Top Down Operator Precedence”. Pratt parsers handle precedence elegantly and produce clear, debuggable code.
The core algorithm is remarkably simple:
interface PrattParser {

    fun parse(precedence: Int = 0): SqlExpr? {
        var expr = parsePrefix() ?: return null
        while (precedence < nextPrecedence()) {
            expr = parseInfix(expr, nextPrecedence())
        }
        return expr
    }

    fun nextPrecedence(): Int
    fun parsePrefix(): SqlExpr?
    fun parseInfix(left: SqlExpr, precedence: Int): SqlExpr
}
The algorithm:
- Parse a “prefix” expression (a literal, identifier, or unary operator)
- While the next operator has higher precedence than what we started with, parse it as an “infix” expression (binary operator) with the current expression as the left side
- Return when we hit a lower-precedence operator
The magic is in parseInfix, which recursively calls parse with the new operator’s precedence. This naturally groups higher-precedence operations first.
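To see the loop in action, here is a self-contained miniature of the same algorithm, specialized to arithmetic over a token list; the class and member names are illustrative, not KQuery's. It returns a fully parenthesized string so the grouping is visible.

```kotlin
// A runnable miniature of the Pratt loop, specialized to arithmetic tokens.
class ExprParser(private val tokens: List<String>) {
    private var pos = 0
    private val precedence = mapOf("+" to 50, "-" to 50, "*" to 60, "/" to 60)

    private fun nextPrecedence(): Int =
        if (pos < tokens.size) precedence[tokens[pos]] ?: 0 else 0

    fun parse(minPrecedence: Int = 0): String {
        var expr = tokens[pos++]                // parsePrefix: a number token
        while (minPrecedence < nextPrecedence()) {
            val opPrecedence = nextPrecedence()
            val op = tokens[pos++]              // consume the operator
            val right = parse(opPrecedence)     // recurse at the operator's precedence
            expr = "($expr $op $right)"         // parseInfix: combine with left side
        }
        return expr
    }
}
```

Calling ExprParser(listOf("1", "+", "2", "*", "3")).parse() yields "(1 + (2 * 3))", while swapping the operators yields "((1 * 2) + 3)": the recursive call at the operator's own precedence is what groups the tighter-binding operation first.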
SQL Expressions
The parser builds a syntax tree using SQL expression types:
interface SqlExpr
data class SqlIdentifier(val id: String) : SqlExpr
data class SqlString(val value: String) : SqlExpr
data class SqlLong(val value: Long) : SqlExpr
data class SqlDouble(val value: Double) : SqlExpr
data class SqlBinaryExpr(val l: SqlExpr, val op: String, val r: SqlExpr) : SqlExpr
data class SqlAlias(val expr: SqlExpr, val alias: SqlIdentifier) : SqlExpr
data class SqlFunction(val name: String, val args: List<SqlExpr>) : SqlExpr
These mirror logical expressions but stay closer to SQL syntax. We use a generic SqlBinaryExpr with a string operator rather than separate classes for each operator since the distinctions matter more in the logical plan.
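For example, the projection expression salary * 1.1 AS new_salary from the opening query parses into the following tree (the type definitions from above are repeated so the snippet stands alone):

```kotlin
// The syntax tree for: salary * 1.1 AS new_salary
interface SqlExpr
data class SqlIdentifier(val id: String) : SqlExpr
data class SqlDouble(val value: Double) : SqlExpr
data class SqlBinaryExpr(val l: SqlExpr, val op: String, val r: SqlExpr) : SqlExpr
data class SqlAlias(val expr: SqlExpr, val alias: SqlIdentifier) : SqlExpr

val newSalary: SqlExpr = SqlAlias(
    SqlBinaryExpr(SqlIdentifier("salary"), "*", SqlDouble(1.1)),
    SqlIdentifier("new_salary")
)
```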
Precedence in Action
Consider parsing 1 + 2 * 3:
Tokens: [1] [+] [2] [*] [3]
Precedence: 0 50 0 60 0
Addition has precedence 50, multiplication has 60. Walking through:
- Parse prefix: SqlLong(1)
- Next token + has precedence 50 > 0, so parse infix
- In parseInfix, consume +, then recursively call parse(50)
- Parse prefix: SqlLong(2)
- Next token * has precedence 60 > 50, so parse infix
- Consume *, recursively call parse(60)
- Parse prefix: SqlLong(3)
- No more tokens, return SqlLong(3)
- Return SqlBinaryExpr(SqlLong(2), "*", SqlLong(3))
- Next precedence is 0 < 50, so return
- Return SqlBinaryExpr(SqlLong(1), "+", SqlBinaryExpr(...))
Result: 1 + (2 * 3), as expected.
Compare with 1 * 2 + 3:
Tokens: [1] [*] [2] [+] [3]
Precedence: 0 60 0 50 0
- Parse prefix: SqlLong(1)
- * has precedence 60 > 0, parse infix
- Consume *, call parse(60)
- Parse prefix: SqlLong(2)
- + has precedence 50 < 60, so return SqlLong(2)
- Return SqlBinaryExpr(SqlLong(1), "*", SqlLong(2))
- + has precedence 50 > 0, parse infix
- Consume +, call parse(50)
- Parse prefix: SqlLong(3)
- No more tokens, return SqlLong(3)
- Return SqlBinaryExpr(SqlBinaryExpr(...), "+", SqlLong(3))
Result: (1 * 2) + 3, correct again.
Parsing SELECT Statements
Beyond expressions, we need to parse complete SQL statements. A SELECT statement has the following structure:
data class SqlSelect(
    val projection: List<SqlExpr>,
    val tableName: String,
    val selection: SqlExpr?,
    val groupBy: List<SqlExpr>,
    val having: SqlExpr?,
    val orderBy: List<SqlSort>,
    val limit: Int?
) : SqlRelation
Parsing a SELECT statement is straightforward procedural code: expect SELECT, parse a comma-separated list of expressions, expect FROM, parse the table name, optionally parse WHERE and its expression, and so on.
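A runnable miniature of that procedural style, handling only SELECT ... FROM ... with an optional WHERE over pre-tokenized input; the names here are illustrative, not KQuery's, and a real parser would of course parse the expressions rather than collect raw tokens:

```kotlin
// Procedural parsing sketch: consume tokens left to right, expecting each
// clause in order.
data class MiniSelect(
    val projection: List<String>,
    val table: String,
    val selection: List<String>?
)

fun parseSelect(tokens: List<String>): MiniSelect {
    var pos = 0
    fun expect(t: String) {
        require(pos < tokens.size && tokens[pos] == t) { "expected $t" }
        pos++
    }
    expect("SELECT")
    val projection = mutableListOf<String>()
    while (tokens[pos] != "FROM") {                // comma-separated list
        if (tokens[pos] == ",") pos++ else projection.add(tokens[pos++])
    }
    expect("FROM")
    val table = tokens[pos++]
    val selection = if (pos < tokens.size && tokens[pos] == "WHERE") {
        pos++
        tokens.subList(pos, tokens.size)           // rest is the predicate
    } else null
    return MiniSelect(projection, table, selection)
}
```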
SQL Planning: The Hard Part
Parsing is mechanical. Planning is where things get interesting.
Consider this query:
SELECT id, first_name, salary/12 AS monthly_salary
FROM employee
WHERE state = 'CO' AND monthly_salary > 1000
The WHERE clause references both state (a column from the table) and monthly_salary (an alias defined in the SELECT list). This is natural for humans but creates a problem: the filter needs columns that exist at different points in the plan.
If we filter before projecting, monthly_salary does not exist yet. If we filter after projecting, state may no longer be available.
Solution: Intermediate Projections
One approach adds columns to an intermediate projection:
Projection: #id, #first_name, #monthly_salary
    Filter: #state = 'CO' AND #monthly_salary > 1000
        Projection: #id, #first_name, #salary/12 AS monthly_salary, #state
            Scan: employee
The inner projection computes all needed columns including state. The filter can then reference everything. The outer projection removes state from the final output.
Translation Logic
The planner walks the SQL expression tree and builds logical expressions:
fun createLogicalExpr(expr: SqlExpr, input: LogicalPlan): LogicalExpr {
    return when (expr) {
        is SqlIdentifier -> Column(expr.id)
        is SqlString -> LiteralString(expr.value)
        is SqlLong -> LiteralLong(expr.value)
        is SqlDouble -> LiteralDouble(expr.value)
        is SqlAlias -> Alias(createLogicalExpr(expr.expr, input), expr.alias.id)
        is SqlBinaryExpr -> {
            val l = createLogicalExpr(expr.l, input)
            val r = createLogicalExpr(expr.r, input)
            when (expr.op) {
                "=" -> Eq(l, r)
                "!=" -> Neq(l, r)
                ">" -> Gt(l, r)
                ">=" -> GtEq(l, r)
                "<" -> Lt(l, r)
                "<=" -> LtEq(l, r)
                "AND" -> And(l, r)
                "OR" -> Or(l, r)
                "+" -> Add(l, r)
                "-" -> Subtract(l, r)
                "*" -> Multiply(l, r)
                "/" -> Divide(l, r)
                else -> throw SQLException("Unknown operator: ${expr.op}")
            }
        }
        else -> throw SQLException("Unsupported expression: $expr")
    }
}
Finding Column References
To determine which columns the filter needs, we walk the expression tree:
fun findColumnReferences(expr: LogicalExpr, columns: MutableSet<String>) {
    when (expr) {
        is Column -> columns.add(expr.name)
        is Alias -> findColumnReferences(expr.expr, columns)
        is BinaryExpr -> {
            findColumnReferences(expr.l, columns)
            findColumnReferences(expr.r, columns)
        }
    }
}
With this, the planner can compare columns in the filter against columns in the projection and add any missing ones to the intermediate projection.
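For the example query above, that comparison can be sketched as follows. This is an illustrative sketch rather than KQuery's exact planner code: the types are simplified, and the literal operands of the predicate are elided since only column references matter here.

```kotlin
// Simplified logical-expression types for illustration.
interface LogicalExpr
data class Column(val name: String) : LogicalExpr
data class Alias(val expr: LogicalExpr, val alias: String) : LogicalExpr
data class BinaryExpr(val op: String, val l: LogicalExpr, val r: LogicalExpr) : LogicalExpr

fun findColumnReferences(expr: LogicalExpr, columns: MutableSet<String>) {
    when (expr) {
        is Column -> columns.add(expr.name)
        is Alias -> findColumnReferences(expr.expr, columns)
        is BinaryExpr -> {
            findColumnReferences(expr.l, columns)
            findColumnReferences(expr.r, columns)
        }
    }
}

// Filter: state = 'CO' AND monthly_salary > 1000 (literal operands elided)
val filterColumns = mutableSetOf<String>()
findColumnReferences(BinaryExpr("AND", Column("state"), Column("monthly_salary")), filterColumns)

// The SELECT list provides these names; anything the filter needs beyond
// them must be added to the intermediate projection.
val projectionColumns = setOf("id", "first_name", "monthly_salary")
val missing = filterColumns - projectionColumns
```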
Aggregate Queries
Aggregate queries add more complexity. Consider:
SELECT department, AVG(salary) AS avg_salary
FROM employee
WHERE state = 'CO'
GROUP BY department
HAVING avg_salary > 50000
The planner must:
- Identify aggregate functions (AVG)
- Separate grouping expressions (department) from aggregates
- Handle HAVING, which filters after aggregation
- Ensure columns in SELECT are either in GROUP BY or inside aggregates
The full implementation handles these cases but the code is involved. See the source repository for details.
Why Build Your Own Parser?
You might wonder why we build a parser instead of using a parser generator like ANTLR or a library like Apache Calcite.
Building a hand-written parser has advantages:
- Control: You decide exactly what SQL features to support
- Error messages: You can produce clear, context-specific errors
- Simplicity: No external dependencies or generated code
- Learning: Understanding parsing deepens your understanding of the whole system
For a production system, using an existing SQL parser is often sensible. But for learning how query engines work, building a parser reveals how SQL’s apparent flexibility maps to structured operations.
This book is also available for purchase in ePub, MOBI, and PDF format from https://leanpub.com/how-query-engines-work
Copyright © 2020-2025 Andy Grove. All rights reserved.