Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Logical Plans and Expressions

The source code discussed in this chapter can be found in the logical-plan module of the KQuery project.

When you write a SQL query, you describe what data you want. The query engine must figure out how to get it. The first step is building a logical plan: a tree structure that represents the computation without specifying exactly how to execute it.

Consider this query:

SELECT name, salary * 1.1 AS new_salary
FROM employees
WHERE department = 'Engineering'

Before we can execute this, we need a structured representation. The logical plan captures:

  • Read from the employees table
  • Keep only rows where department = 'Engineering'
  • Compute name and salary * 1.1 for each remaining row

This chapter covers how to represent these operations as a tree of logical plans and expressions.

Why Separate Logical from Physical?

We could jump straight from SQL to execution, but separating logical and physical planning has advantages:

  1. Validation: We can check that columns exist and types are compatible before doing any work
  2. Optimization: We can transform the logical plan to make it more efficient
  3. Flexibility: The same logical plan might execute differently depending on data size or available resources

A logical plan says "filter rows where X" without specifying whether to use an index, a hash table, or a sequential scan. Those are physical execution details decided later.

The LogicalPlan Interface

A logical plan represents a relation: a set of rows with a known schema. Each plan can have child plans as inputs, forming a tree.

/**
 * A logical plan represents a data transformation or action that returns a
 * relation (a set of tuples).
 */
interface LogicalPlan {

  /**
   * Returns the schema of the data that will be produced by this logical plan.
   */
  fun schema(): Schema

  /**
   * Returns the children (inputs) of this logical plan. This method is used to
   * enable use of the visitor pattern to walk a query tree.
   */
  fun children(): List<LogicalPlan>
  ...
}

schema() returns the output schema, the columns and their types that this plan produces. This is essential for validation. If a later plan references a column, we can check that it exists in the input schema.

children() returns the input plans. A scan has no children (it reads from a data source). A filter has one child (its input). A join has two children (left and right inputs). This method enables walking the plan tree.

Printing Logical Plans

Debugging query engines requires seeing what the plan looks like. We print plans as indented trees where children are nested under parents:

/** Format a logical plan in human-readable form */
fun format(plan: LogicalPlan, indent: Int = 0): String {
  val b = StringBuilder()
  0.until(indent).forEach { b.append("\t") }
  b.append(plan.toString()).append("\n")
  plan.children().forEach { b.append(format(it, indent + 1)) }
  return b.toString()
}

Our example query might print as:

Projection: #name, #salary * 1.1 AS new_salary
  Filter: #department = 'Engineering'
    Scan: employees; projection=None

Read this bottom-up: scan the employees table, filter to Engineering, project the columns we want.

Logical Expressions

Plans describe data flow. Expressions describe computations within a plan. A filter plan contains an expression that evaluates to true or false for each row. A projection plan contains expressions that compute output columns.

Expressions can be simple (a column reference, a literal value) or complex (nested arithmetic, function calls). Here are common expression types:

Expression TypeExamples
Literal Value"hello", 12.34, true
Column Referenceuser_id, first_name, salary
Math Expressionsalary * 0.1, price + tax
Comparison Expressionage >= 21, status != 'inactive'
Boolean Expressionage >= 21 AND country = 'US'
Aggregate ExpressionMIN(salary), MAX(salary), SUM(amount), COUNT(*)
Scalar FunctionUPPER(name), CONCAT(first_name, ' ', last_name)
Aliased Expressionsalary * 1.1 AS new_salary

Expressions form trees. The expression (a + b) * c has a multiply at the root with two children: an add expression (with children a and b) and a column reference c.

The LogicalExpr Interface

During planning, we need to know what type of value an expression produces. If you write a + b, that is only valid if both columns are numeric. The interface captures this:

/**
 * Logical Expression for use in logical query plans. The logical expression
 * provides information needed during the planning phase such as the name and
 * data type of the expression.
 */
interface LogicalExpr {

  /**
   * Return meta-data about the value that will be produced by this expression
   * when evaluated against a particular input.
   */
  fun toField(input: LogicalPlan): Field
}

The toField method returns the name and data type of the expression's output. It takes the input plan because some expressions depend on the input schema. A column reference has the type of whatever column it references. A comparison expression always returns boolean regardless of its inputs.

Column Expressions

The simplest expression references a column by name:

/** Logical expression representing a reference to a column by name. */
class Column(val name: String) : LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return input.schema().fields.find { it.name == name }
        ?: throw SQLException(
            "No column named '$name' in ${input.schema().fields.map { it.name }}")
  }

  override fun toString(): String {
    return "#$name"
  }
}

The toField implementation looks up the column in the input schema. If it does not exist, that is an error, which we catch during planning rather than execution.

The # prefix in toString is a convention to distinguish column references from literal strings when printing plans.

Literal Expressions

Expressions like salary * 1.1 need to represent the literal value 1.1. We need literal expressions for each data type:

/** Logical expression representing a literal string value. */
class LiteralString(val str: String) : LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return Field(str, ArrowTypes.StringType)
  }

  override fun toString(): String {
    return "'$str'"
  }
}
/** Logical expression representing a literal long value. */
class LiteralLong(val n: Long) : LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return Field(n.toString(), ArrowTypes.Int64Type)
  }

  override fun toString(): String {
    return n.toString()
  }
}
/** Logical expression representing a literal double value. */
class LiteralDouble(val n: Double) : LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return Field(n.toString(), ArrowTypes.DoubleType)
  }

  override fun toString(): String {
    return n.toString()
  }
}

Literal expressions do not depend on the input plan since their type is fixed.

Binary Expressions

Most operators take two inputs: comparison (=, <, >), boolean logic (AND, OR), and arithmetic (+, -, *, /). We can share structure across these:

abstract class BinaryExpr(
    val name: String,
    val op: String,
    val l: LogicalExpr,
    val r: LogicalExpr
) : LogicalExpr {

  override fun toString(): String {
    return "$l $op $r"
  }
}

The name identifies the expression type. The op is the operator symbol for printing. The l and r are the left and right operands.

Comparison and Boolean Expressions

Comparisons and boolean operators always produce boolean results:

/** Binary expressions that return a boolean type */
abstract class BooleanBinaryExpr(
    name: String,
    op: String,
    l: LogicalExpr,
    r: LogicalExpr
) : BinaryExpr(name, op, l, r) {

  override fun toField(input: LogicalPlan): Field {
    return Field(name, ArrowTypes.BooleanType)
  }
}

/** Logical expression representing an equality (`=`) comparison */
class Eq(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("eq", "=", l, r)

/** Logical expression representing an inequality (`!=`) comparison */
class Neq(l: LogicalExpr, r: LogicalExpr) :
    BooleanBinaryExpr("neq", "!=", l, r)

/** Logical expression representing a greater than (`>`) comparison */
class Gt(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("gt", ">", l, r)

/**
 * Logical expression representing a greater than or equals (`>=`) comparison
 */
class GtEq(l: LogicalExpr, r: LogicalExpr) :
    BooleanBinaryExpr("gteq", ">=", l, r)

/** Logical expression representing a less than (`<`) comparison */
class Lt(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("lt", "<", l, r)

/** Logical expression representing a less than or equals (`<=`) comparison */
class LtEq(l: LogicalExpr, r: LogicalExpr) :
    BooleanBinaryExpr("lteq", "<=", l, r)

/** Logical expression representing a logical AND */
class And(l: LogicalExpr, r: LogicalExpr) :
    BooleanBinaryExpr("and", "AND", l, r)

/** Logical expression representing a logical OR */
class Or(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("or", "OR", l, r)

Math Expressions

In KQuery, arithmetic expressions preserve the type of the left operand. This is a simplified approach. A production system would handle type promotion.

abstract class MathExpr(
    name: String,
    op: String,
    l: LogicalExpr,
    r: LogicalExpr
) : BinaryExpr(name, op, l, r) {

  override fun toField(input: LogicalPlan): Field {
    return Field(name, l.toField(input).dataType)
  }
}

class Add(l: LogicalExpr, r: LogicalExpr) : MathExpr("add", "+", l, r)

class Subtract(l: LogicalExpr, r: LogicalExpr) :
    MathExpr("subtract", "-", l, r)

class Multiply(l: LogicalExpr, r: LogicalExpr) : MathExpr("mult", "*", l, r)

class Divide(l: LogicalExpr, r: LogicalExpr) : MathExpr("div", "/", l, r)

class Modulus(l: LogicalExpr, r: LogicalExpr) : MathExpr("mod", "%", l, r)

Aggregate Expressions

Aggregates reduce multiple rows to a single value: SUM, MIN, MAX, AVG, COUNT. They appear in aggregate plans (covered later) and have special semantics.

Most aggregates return the same type as their input expression. MIN(salary) on a double column returns a double. MAX(age) on an integer column returns an integer. This makes sense since min, max, sum, and average values share the type of what they are aggregating.

/**
 * Base class for aggregate functions that are of the same type as the
 * underlying expression
 */
abstract class AggregateExpr(val name: String, val expr: LogicalExpr) :
    LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return Field(name, expr.toField(input).dataType)
  }

  override fun toString(): String {
    return "$name($expr)"
  }
}

/** Logical expression representing the SUM aggregate expression. */
class Sum(input: LogicalExpr) : AggregateExpr("SUM", input)

/** Logical expression representing the MIN aggregate expression. */
class Min(input: LogicalExpr) : AggregateExpr("MIN", input)

/** Logical expression representing the MAX aggregate expression. */
class Max(input: LogicalExpr) : AggregateExpr("MAX", input)

/** Logical expression representing the AVG aggregate expression. */
class Avg(input: LogicalExpr) : AggregateExpr("AVG", input)

COUNT is different. It returns a count of rows, which is always an integer regardless of what column you are counting. COUNT(salary) returns Int32 whether salary is stored as an integer, double, or string.

/** Logical expression representing the COUNT aggregate expression. */
class Count(input: LogicalExpr) : AggregateExpr("COUNT", input) {

  override fun toField(input: LogicalPlan): Field {
    return Field("COUNT", ArrowTypes.Int32Type)
  }

  override fun toString(): String {
    return "COUNT($expr)"
  }
}

Aliased Expressions

SQL's AS keyword renames an expression's output:

/** Aliased expression e.g. `expr AS alias`. */
class Alias(val expr: LogicalExpr, val alias: String) : LogicalExpr {
  override fun toField(input: LogicalPlan): Field {
    return Field(alias, expr.toField(input).dataType)
  }

  override fun toString(): String {
    return "$expr as $alias"
  }
}

The alias changes the name but preserves the type.

Logical Plans

With expressions defined, we can build the plans that use them.

Scan

Scan reads from a data source. It is the leaf node in every query tree, the place where data enters the plan.

/** Represents a scan of a data source */
class Scan(
    val path: String,
    val dataSource: DataSource,
    val projection: List<String>
) : LogicalPlan {

  val schema = deriveSchema()

  override fun schema(): Schema {
    return schema
  }

  private fun deriveSchema(): Schema {
    val schema = dataSource.schema()
    if (projection.isEmpty()) {
      return schema
    } else {
      return schema.select(projection)
    }
  }

  override fun children(): List<LogicalPlan> {
    return listOf()
  }

  override fun toString(): String {
    return if (projection.isEmpty()) {
      "Scan: $path; projection=None"
    } else {
      "Scan: $path; projection=$projection"
    }
  }
}

The projection parameter lists which columns to read. If empty, read all columns. This optimization matters because reading fewer columns means less I/O and less memory.

Selection (Filter)

Selection keeps only rows where an expression evaluates to true. This corresponds to SQL's WHERE clause.

/** Logical plan representing a selection (a.k.a. filter) against an input */
class Selection(val input: LogicalPlan, val expr: LogicalExpr) : LogicalPlan {
  override fun schema(): Schema {
    return input.schema()
  }

  override fun children(): List<LogicalPlan> {
    // selection does not change the schema of the input
    return listOf(input)
  }

  override fun toString(): String {
    return "Selection: $expr"
  }
}

The schema passes through unchanged since filtering only removes rows, not columns.

Projection

Projection computes new columns from expressions. This corresponds to SQL's SELECT list.

/**
 * Logical plan representing a projection (evaluating a list of expressions)
 * against an input
 */
class Projection(val input: LogicalPlan, val expr: List<LogicalExpr>) :
    LogicalPlan {
  override fun schema(): Schema {
    return Schema(expr.map { it.toField(input) })
  }

  override fun children(): List<LogicalPlan> {
    return listOf(input)
  }

  override fun toString(): String {
    return "Projection: ${ expr.map { it.toString() }.joinToString(", ") }"
  }
}

The output schema comes from the expressions. If you project name and salary * 1.1 AS bonus, the output schema has two columns with those names and appropriate types.

Aggregate

Aggregate groups rows and computes aggregate functions. This corresponds to SQL's GROUP BY with aggregate functions.

/** Logical plan representing an aggregate query against an input. */
class Aggregate(
    val input: LogicalPlan,
    val groupExpr: List<LogicalExpr>,
    val aggregateExpr: List<AggregateExpr>
) : LogicalPlan {

  override fun schema(): Schema {
    return Schema(
        groupExpr.map { it.toField(input) } +
            aggregateExpr.map { it.toField(input) })
  }

  override fun children(): List<LogicalPlan> {
    return listOf(input)
  }

  override fun toString(): String {
    return "Aggregate: groupExpr=$groupExpr, aggregateExpr=$aggregateExpr"
  }
}

The output schema has the grouping columns first, followed by the aggregate results. For SELECT department, AVG(salary) FROM employees GROUP BY department, the output has two columns: department and AVG(salary).

Join

Join combines rows from two inputs based on join keys. Unlike the plans we have seen so far, joins have two children: a left input and a right input.

class Join(
    val left: LogicalPlan,
    val right: LogicalPlan,
    val joinType: JoinType,
    val on: List<Pair<String, String>>
) : LogicalPlan {

  override fun schema(): Schema {
    ...
  }

  override fun children(): List<LogicalPlan> {
    return listOf(left, right)
  }

  override fun toString(): String {
    return "Join: type=$joinType, on=$on"
  }
}

The on parameter specifies pairs of column names to join on: the first element names a column from the left input, the second from the right. The output schema combines columns from both inputs, deduplicating columns with matching names in the join keys.

enum class JoinType {
  Inner,
  Left,
  Right
}

The join type determines which rows appear in the output: inner joins return only matching rows, outer joins include unmatched rows with nulls.

Joins are fundamental to relational queries but come with significant complexity: multiple join types, various join algorithms with different performance characteristics, and optimization challenges. The Joins chapter covers these topics in depth.

Putting It Together

Here is how our example query becomes a logical plan:

SELECT name, salary * 1.1 AS new_salary
FROM employees
WHERE department = 'Engineering'

Building bottom-up:

val scan = Scan("employees", employeeDataSource, listOf())

val filter = Selection(
    scan,
    Eq(Column("department"), LiteralString("Engineering"))
)

val project = Projection(
    filter,
    listOf(
        Column("name"),
        Alias(Multiply(Column("salary"), LiteralDouble(1.1)), "new_salary")
    )
)

Printed:

Projection: #name, #salary * 1.1 as new_salary
  Filter: #department = 'Engineering'
    Scan: employees; projection=None

This logical plan can now be validated (do the columns exist?), optimized (can we push the projection into the scan?), and eventually converted to a physical plan for execution.

Serialization

Query plans sometimes need to be serialized: sent across a network, stored for later, or passed between systems in different languages.

Options include language-specific serialization (JSON with Jackson in Java, kotlinx.serialization in Kotlin) or language-agnostic formats like Protocol Buffers or Avro.

A newer standard called Substrait aims to provide cross-language serialization for relational algebra. This is exciting because it enables mixing components: use Apache Calcite for query planning in Java, serialize to Substrait, execute in a Rust or C++ engine. If you are building a query engine today, Substrait is worth investigating.

This book is also available for purchase in ePub, MOBI, and PDF format from https://leanpub.com/how-query-engines-work

Copyright © 2020-2025 Andy Grove. All rights reserved.