SQL Support
The source code discussed in this chapter can be found in the sql module of the KQuery project.
The previous chapter showed how to build queries using the DataFrame API. But most users expect to write SQL. This chapter covers parsing SQL text into a logical plan, which involves two steps: parsing (text to syntax tree) and planning (syntax tree to logical plan).
The Journey from SQL to Logical Plan
Consider this query:
SELECT id, first_name, salary * 1.1 AS new_salary
FROM employee
WHERE state = 'CO'
To execute this, we need to:
- Tokenize: Break the text into tokens (keywords, identifiers, literals, operators)
- Parse: Build a syntax tree that represents the query structure
- Plan: Convert the syntax tree into a logical plan
The result is the same logical plan we could build with the DataFrame API, but constructed from SQL text instead of code.
Tokenizing
The tokenizer (or lexer) converts a string into a sequence of tokens. Each token has a type and a value.
data class Token(val text: String, val type: TokenType, val endOffset: Int)
Token types include:
- Keywords: SELECT, FROM, WHERE, AND, OR, etc.
- Identifiers: table names, column names
- Literals: strings ('hello'), numbers (42, 3.14)
- Symbols: operators and punctuation (+, -, *, /, =, <, >, (, ), ,)
For the query SELECT a + b FROM c, tokenizing produces:
listOf(
    Token("SELECT", Keyword.SELECT, ...),
    Token("a", Literal.IDENTIFIER, ...),
    Token("+", Symbol.PLUS, ...),
    Token("b", Literal.IDENTIFIER, ...),
    Token("FROM", Keyword.FROM, ...),
    Token("c", Literal.IDENTIFIER, ...)
)
The tokenizer handles details like recognizing that SELECT is a keyword but employee is an identifier, parsing string literals with their quotes, and recognizing multi-character operators like <= and !=.
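To make this concrete, here is a minimal tokenizer sketch. It is illustrative rather than KQuery's actual implementation: it produces raw token strings and uses a keyword set to distinguish SELECT from an identifier like employee, but omits quoted strings and multi-character operators.

```kotlin
// Minimal tokenizer sketch (illustrative, not KQuery's actual code).
val keywords = setOf("SELECT", "FROM", "WHERE", "AND", "OR")

fun tokenize(sql: String): List<String> {
    val tokens = mutableListOf<String>()
    var i = 0
    while (i < sql.length) {
        val c = sql[i]
        when {
            c.isWhitespace() -> i++                  // skip whitespace
            c.isLetter() -> {                        // keyword or identifier
                val start = i
                while (i < sql.length && (sql[i].isLetterOrDigit() || sql[i] == '_')) i++
                tokens.add(sql.substring(start, i))
            }
            c.isDigit() -> {                         // numeric literal
                val start = i
                while (i < sql.length && (sql[i].isDigit() || sql[i] == '.')) i++
                tokens.add(sql.substring(start, i))
            }
            else -> {                                // single-character symbol
                tokens.add(c.toString())
                i++
            }
        }
    }
    return tokens
}
```

A real tokenizer also handles quoted string literals and multi-character operators such as <= and !=, typically by peeking one character ahead before emitting a symbol token.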
Parsing with Pratt Parsers
Parsing turns tokens into a tree structure. The challenge is handling operator precedence correctly. In 1 + 2 * 3, multiplication binds tighter than addition, so the result should be 1 + (2 * 3), not (1 + 2) * 3.
KQuery uses a Pratt parser, based on Vaughan Pratt’s 1973 paper “Top Down Operator Precedence”. Pratt parsers handle precedence elegantly and produce clear, debuggable code.
The core algorithm is remarkably simple:
interface PrattParser {

    fun parse(precedence: Int = 0): SqlExpr? {
        var expr = parsePrefix() ?: return null
        while (precedence < nextPrecedence()) {
            expr = parseInfix(expr, nextPrecedence())
        }
        return expr
    }

    fun nextPrecedence(): Int
    fun parsePrefix(): SqlExpr?
    fun parseInfix(left: SqlExpr, precedence: Int): SqlExpr
}
The algorithm:
- Parse a “prefix” expression (a literal, identifier, or unary operator)
- While the next operator has higher precedence than what we started with, parse it as an “infix” expression (binary operator) with the current expression as the left side
- Return when we hit a lower-precedence operator
The magic is in parseInfix, which recursively calls parse with the new operator’s precedence. This naturally groups higher-precedence operations first.
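To see the loop in action, here is a self-contained miniature of the same algorithm, specialized to arithmetic over a token list; the class and member names are illustrative, not KQuery's. It returns a fully parenthesized string so the grouping is visible.

```kotlin
// A runnable miniature of the Pratt loop, specialized to arithmetic tokens.
class ExprParser(private val tokens: List<String>) {
    private var pos = 0
    private val precedence = mapOf("+" to 50, "-" to 50, "*" to 60, "/" to 60)

    private fun nextPrecedence(): Int =
        if (pos < tokens.size) precedence[tokens[pos]] ?: 0 else 0

    fun parse(minPrecedence: Int = 0): String {
        var expr = tokens[pos++]                // parsePrefix: a number token
        while (minPrecedence < nextPrecedence()) {
            val opPrecedence = nextPrecedence()
            val op = tokens[pos++]              // consume the operator
            val right = parse(opPrecedence)     // recurse at the operator's precedence
            expr = "($expr $op $right)"         // parseInfix: combine with left side
        }
        return expr
    }
}
```

Calling ExprParser(listOf("1", "+", "2", "*", "3")).parse() yields "(1 + (2 * 3))", while swapping the operators yields "((1 * 2) + 3)": the recursive call at the operator's own precedence is what groups the tighter-binding operation first.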
SQL Expressions
The parser builds a syntax tree using SQL expression types:
interface SqlExpr
data class SqlIdentifier(val id: String) : SqlExpr
data class SqlString(val value: String) : SqlExpr
data class SqlLong(val value: Long) : SqlExpr
data class SqlDouble(val value: Double) : SqlExpr
data class SqlBinaryExpr(val l: SqlExpr, val op: String, val r: SqlExpr) : SqlExpr
data class SqlAlias(val expr: SqlExpr, val alias: SqlIdentifier) : SqlExpr
data class SqlFunction(val name: String, val args: List<SqlExpr>) : SqlExpr
These mirror logical expressions but stay closer to SQL syntax. We use a generic SqlBinaryExpr with a string operator rather than separate classes for each operator since the distinctions matter more in the logical plan.
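For example, the projection expression salary * 1.1 AS new_salary from the opening query parses into the following tree (the type definitions from above are repeated so the snippet stands alone):

```kotlin
// The syntax tree for: salary * 1.1 AS new_salary
interface SqlExpr
data class SqlIdentifier(val id: String) : SqlExpr
data class SqlDouble(val value: Double) : SqlExpr
data class SqlBinaryExpr(val l: SqlExpr, val op: String, val r: SqlExpr) : SqlExpr
data class SqlAlias(val expr: SqlExpr, val alias: SqlIdentifier) : SqlExpr

val newSalary: SqlExpr = SqlAlias(
    SqlBinaryExpr(SqlIdentifier("salary"), "*", SqlDouble(1.1)),
    SqlIdentifier("new_salary")
)
```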
Precedence in Action
Consider parsing 1 + 2 * 3:
Tokens: [1] [+] [2] [*] [3]
Precedence: 0 50 0 60 0
Addition has precedence 50, multiplication has 60. Walking through:
- Parse prefix: SqlLong(1)
- Next token + has precedence 50 > 0, so parse infix
- In parseInfix, consume +, then recursively call parse(50)
- Parse prefix: SqlLong(2)
- Next token * has precedence 60 > 50, so parse infix
- Consume *, recursively call parse(60)
- Parse prefix: SqlLong(3)
- No more tokens, return SqlLong(3)
- Return SqlBinaryExpr(SqlLong(2), "*", SqlLong(3))
- Next precedence is 0 < 50, so return
- Return SqlBinaryExpr(SqlLong(1), "+", SqlBinaryExpr(...))
Result: 1 + (2 * 3), as expected.
Compare with 1 * 2 + 3:
Tokens: [1] [*] [2] [+] [3]
Precedence: 0 60 0 50 0
- Parse prefix: SqlLong(1)
- * has precedence 60 > 0, parse infix
- Consume *, call parse(60)
- Parse prefix: SqlLong(2)
- + has precedence 50 < 60, so return SqlLong(2)
- Return SqlBinaryExpr(SqlLong(1), "*", SqlLong(2))
- + has precedence 50 > 0, parse infix
- Consume +, call parse(50)
- Parse prefix: SqlLong(3)
- No more tokens, return SqlLong(3)
- Return SqlBinaryExpr(SqlBinaryExpr(...), "+", SqlLong(3))
Result: (1 * 2) + 3, correct again.
Parsing SELECT Statements
Beyond expressions, we need to parse complete SQL statements. A SELECT statement has the following structure:
data class SqlSelect(
    val projection: List<SqlExpr>,
    val tableName: String,
    val selection: SqlExpr?,
    val groupBy: List<SqlExpr>,
    val having: SqlExpr?,
    val orderBy: List<SqlSort>,
    val limit: Int?
) : SqlRelation
Parsing a SELECT statement is straightforward procedural code: expect SELECT, parse a comma-separated list of expressions, expect FROM, parse the table name, optionally parse WHERE and its expression, and so on.
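A runnable miniature of that procedural style, handling only SELECT ... FROM ... with an optional WHERE over pre-tokenized input; the names here are illustrative, not KQuery's, and a real parser would of course parse the expressions rather than collect raw tokens:

```kotlin
// Procedural parsing sketch: consume tokens left to right, expecting each
// clause in order.
data class MiniSelect(
    val projection: List<String>,
    val table: String,
    val selection: List<String>?
)

fun parseSelect(tokens: List<String>): MiniSelect {
    var pos = 0
    fun expect(t: String) {
        require(pos < tokens.size && tokens[pos] == t) { "expected $t" }
        pos++
    }
    expect("SELECT")
    val projection = mutableListOf<String>()
    while (tokens[pos] != "FROM") {                // comma-separated list
        if (tokens[pos] == ",") pos++ else projection.add(tokens[pos++])
    }
    expect("FROM")
    val table = tokens[pos++]
    val selection = if (pos < tokens.size && tokens[pos] == "WHERE") {
        pos++
        tokens.subList(pos, tokens.size)           // rest is the predicate
    } else null
    return MiniSelect(projection, table, selection)
}
```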
SQL Planning: The Hard Part
Parsing is mechanical. Planning is where things get interesting.
Consider this query:
SELECT id, first_name, salary/12 AS monthly_salary
FROM employee
WHERE state = 'CO' AND monthly_salary > 1000
The WHERE clause references both state (a column from the table) and monthly_salary (an alias defined in the SELECT list). This is natural for humans but creates a problem: the filter needs columns that exist at different points in the plan.
If we filter before projecting, monthly_salary does not exist yet. If we filter after projecting, state may no longer be available.
Solution: Intermediate Projections
One approach adds columns to an intermediate projection:
Projection: #id, #first_name, #monthly_salary
    Filter: #state = 'CO' AND #monthly_salary > 1000
        Projection: #id, #first_name, #salary/12 AS monthly_salary, #state
            Scan: employee
The inner projection computes all needed columns including state. The filter can then reference everything. The outer projection removes state from the final output.
Translation Logic
The planner walks the SQL expression tree and builds logical expressions:
fun createLogicalExpr(expr: SqlExpr, input: LogicalPlan): LogicalExpr {
    return when (expr) {
        is SqlIdentifier -> Column(expr.id)
        is SqlString -> LiteralString(expr.value)
        is SqlLong -> LiteralLong(expr.value)
        is SqlDouble -> LiteralDouble(expr.value)
        is SqlAlias -> Alias(createLogicalExpr(expr.expr, input), expr.alias.id)
        is SqlBinaryExpr -> {
            val l = createLogicalExpr(expr.l, input)
            val r = createLogicalExpr(expr.r, input)
            when (expr.op) {
                "=" -> Eq(l, r)
                "!=" -> Neq(l, r)
                ">" -> Gt(l, r)
                ">=" -> GtEq(l, r)
                "<" -> Lt(l, r)
                "<=" -> LtEq(l, r)
                "AND" -> And(l, r)
                "OR" -> Or(l, r)
                "+" -> Add(l, r)
                "-" -> Subtract(l, r)
                "*" -> Multiply(l, r)
                "/" -> Divide(l, r)
                else -> throw SQLException("Unknown operator: ${expr.op}")
            }
        }
        else -> throw SQLException("Unsupported expression: $expr")
    }
}
Finding Column References
To determine which columns the filter needs, we walk the expression tree:
fun findColumnReferences(expr: LogicalExpr, columns: MutableSet<String>) {
    when (expr) {
        is Column -> columns.add(expr.name)
        is Alias -> findColumnReferences(expr.expr, columns)
        is BinaryExpr -> {
            findColumnReferences(expr.l, columns)
            findColumnReferences(expr.r, columns)
        }
    }
}
With this, the planner can compare columns in the filter against columns in the projection and add any missing ones to the intermediate projection.
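For the example query above, that comparison can be sketched as follows. This is an illustrative sketch rather than KQuery's exact planner code: the types are simplified, and the literal operands of the predicate are elided since only column references matter here.

```kotlin
// Simplified logical-expression types for illustration.
interface LogicalExpr
data class Column(val name: String) : LogicalExpr
data class Alias(val expr: LogicalExpr, val alias: String) : LogicalExpr
data class BinaryExpr(val op: String, val l: LogicalExpr, val r: LogicalExpr) : LogicalExpr

fun findColumnReferences(expr: LogicalExpr, columns: MutableSet<String>) {
    when (expr) {
        is Column -> columns.add(expr.name)
        is Alias -> findColumnReferences(expr.expr, columns)
        is BinaryExpr -> {
            findColumnReferences(expr.l, columns)
            findColumnReferences(expr.r, columns)
        }
    }
}

// Filter: state = 'CO' AND monthly_salary > 1000 (literal operands elided)
val filterColumns = mutableSetOf<String>()
findColumnReferences(BinaryExpr("AND", Column("state"), Column("monthly_salary")), filterColumns)

// The SELECT list provides these names; anything the filter needs beyond
// them must be added to the intermediate projection.
val projectionColumns = setOf("id", "first_name", "monthly_salary")
val missing = filterColumns - projectionColumns
```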
Aggregate Queries
Aggregate queries add more complexity. Consider:
SELECT department, AVG(salary) AS avg_salary
FROM employee
WHERE state = 'CO'
GROUP BY department
HAVING avg_salary > 50000
The planner must:
- Identify aggregate functions (AVG)
- Separate grouping expressions (department) from aggregates
- Handle HAVING, which filters after aggregation
- Ensure columns in SELECT are either in GROUP BY or inside aggregates
The full implementation handles these cases but the code is involved. See the source repository for details.
Why Build Your Own Parser?
You might wonder why we build a parser instead of using a parser generator like ANTLR or a library like Apache Calcite.
Building a hand-written parser has advantages:
- Control: You decide exactly what SQL features to support
- Error messages: You can produce clear, context-specific errors
- Simplicity: No external dependencies or generated code
- Learning: Understanding parsing deepens your understanding of the whole system
For a production system, using an existing SQL parser is often sensible. But for learning how query engines work, building a parser reveals how SQL’s apparent flexibility maps to structured operations.
This book is also available for purchase in ePub, MOBI, and PDF format from https://leanpub.com/how-query-engines-work
Copyright © 2020-2025 Andy Grove. All rights reserved.