Testing
The source code discussed in this chapter can be found in the test directories throughout the KQuery project.
Query engines are complex systems where subtle bugs can cause queries to return incorrect results. Unlike a crash or obvious error, a query that silently returns wrong data is particularly dangerous because users may make decisions based on faulty information. Rigorous testing is essential.
Why Query Engines Are Hard to Test
Several factors make testing query engines challenging:
Operators and expressions can be nested arbitrarily, so there are effectively infinite ways to combine them. Even a simple query with three operators and two expressions per operator has many possible combinations, and hand-written tests cannot cover them all.
Type coercion adds another dimension. An expression like a > b must work correctly for integers, floats, strings, dates, and more. Each comparison operator multiplied by each type combination multiplied by null handling creates many test cases.
Edge cases abound: empty tables, single-row tables, tables with all nulls, extremely large values, NaN for floating point, and Unicode strings with special characters. Production data will eventually exercise all of these.
Query optimizers can introduce bugs. A query might work correctly without optimization but fail after the optimizer rewrites it. Or the optimizer might produce a plan that is semantically different from the original.
Unit Testing
Unit tests verify that individual components work correctly in isolation. For a query engine, this means testing expressions, operators, and data sources independently before testing them in combination.
Testing Expressions
Expression tests should verify that each expression produces correct results for valid inputs and handles edge cases appropriately.
Here is a test for the greater-than-or-equals expression across different numeric types:
@Test
fun `gteq longs`() {
    val schema = Schema(listOf(
        Field("a", ArrowTypes.Int64Type),
        Field("b", ArrowTypes.Int64Type)
    ))
    val a: List<Long> = listOf(111, 222, 333, Long.MIN_VALUE, Long.MAX_VALUE)
    val b: List<Long> = listOf(111, 333, 222, Long.MAX_VALUE, Long.MIN_VALUE)
    val batch = Fuzzer().createRecordBatch(schema, listOf(a, b))
    val expr = GtEqExpression(ColumnExpression(0), ColumnExpression(1))
    val result = expr.evaluate(batch)
    assertEquals(a.size, result.size())
    (0 until result.size()).forEach {
        assertEquals(a[it] >= b[it], result.getValue(it))
    }
}
This test includes MIN_VALUE and MAX_VALUE to verify boundary behavior. Similar tests should exist for bytes, shorts, integers, floats, doubles, and strings. Each numeric type can have different overflow or comparison semantics.
For floating point types, testing NaN behavior is critical:
@Test
fun `gteq doubles`() {
    val schema = Schema(listOf(
        Field("a", ArrowTypes.DoubleType),
        Field("b", ArrowTypes.DoubleType)
    ))
    val a: List<Double> = listOf(0.0, 1.0,
        Double.MIN_VALUE, Double.MAX_VALUE, Double.NaN)
    val b = a.reversed()
    val batch = Fuzzer().createRecordBatch(schema, listOf(a, b))
    val expr = GtEqExpression(ColumnExpression(0), ColumnExpression(1))
    val result = expr.evaluate(batch)
    assertEquals(a.size, result.size())
    (0 until result.size()).forEach {
        assertEquals(a[it] >= b[it], result.getValue(it))
    }
}
NaN comparisons follow IEEE 754 semantics: every ordered comparison or equality test involving NaN returns false, even NaN compared with itself (and consequently NaN != NaN is true). This surprises many developers, so explicit tests catch implementations that handle NaN incorrectly.
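These semantics can be demonstrated in plain Kotlin, which also exposes a subtle trap: the JVM's `compareTo` imposes a total order in which NaN equals itself and exceeds every other value, so a comparison operator implemented via `compareTo` would silently diverge from IEEE 754.

```kotlin
fun main() {
    val nan = Double.NaN

    // IEEE 754: every ordered comparison involving NaN is false
    println(nan >= nan)   // false
    println(nan >= 0.0)   // false
    println(0.0 >= nan)   // false

    // The JVM's total ordering (used by compareTo and sorting) differs:
    // it treats NaN as equal to itself and greater than all other values
    println(nan.compareTo(nan) == 0)              // true
    println(nan.compareTo(Double.MAX_VALUE) > 0)  // true
}
```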
Creating Test Data
Tests need a convenient way to create record batches with arbitrary data. A helper method that takes a schema and column values makes tests cleaner:
fun createRecordBatch(schema: Schema, columns: List<List<Any?>>): RecordBatch {
    val rowCount = columns[0].size
    val root = VectorSchemaRoot.create(schema.toArrow(), rootAllocator)
    root.allocateNew()
    (0 until rowCount).forEach { row ->
        (0 until columns.size).forEach { col ->
            val v = root.getVector(col)
            val value = columns[col][row]
            when (v) {
                is BitVector -> v.set(row, if (value as Boolean) 1 else 0)
                is TinyIntVector -> v.set(row, value as Byte)
                is SmallIntVector -> v.set(row, value as Short)
                is IntVector -> v.set(row, value as Int)
                is BigIntVector -> v.set(row, value as Long)
                is Float4Vector -> v.set(row, value as Float)
                is Float8Vector -> v.set(row, value as Double)
                is VarCharVector -> v.set(row, (value as String).toByteArray())
                else -> throw IllegalStateException()
            }
        }
    }
    root.rowCount = rowCount
    return RecordBatch(schema, root.fieldVectors.map { ArrowFieldVector(it) })
}
With this helper, tests become declarative: define the schema, provide the data, and verify the result.
What to Test
Here are categories of unit tests that every query engine should have:
Type handling: What happens when an expression receives an unexpected type? For example, computing SUM on strings should produce a clear error.
Boundary values: Test minimum and maximum values for each numeric type. Test empty strings and very long strings.
Null handling: Every expression must handle null inputs correctly. Under SQL three-valued logic, comparisons and arithmetic involving null typically produce null, while AND and OR have special rules (for example, FALSE AND NULL is FALSE, not NULL).
Overflow and underflow: What happens when multiplying two large integers overflows? The behavior should be documented and tested.
Special floating point values: Test positive and negative infinity, NaN, positive and negative zero.
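As a concrete illustration of the null-handling rules above, here is a minimal, self-contained sketch (not KQuery code) of SQL three-valued AND, where a Kotlin `Boolean?` value of null stands in for SQL's UNKNOWN:

```kotlin
// SQL three-valued AND: null plays the role of UNKNOWN.
// FALSE dominates regardless of the other operand; otherwise
// UNKNOWN propagates; TRUE only when both sides are TRUE.
fun sqlAnd(l: Boolean?, r: Boolean?): Boolean? = when {
    l == false || r == false -> false
    l == null || r == null -> null
    else -> true
}

fun main() {
    println(sqlAnd(false, null))  // false -- not null!
    println(sqlAnd(true, null))   // null
    println(sqlAnd(true, true))   // true
}
```

A test suite for an expression implementation should assert this entire truth table, not just the all-null case.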
Testing SQL Parsing
SQL parsing deserves dedicated tests because the parser is a critical component that interprets user queries. Parser bugs can cause queries to be misinterpreted, potentially with serious consequences.
Testing Operator Precedence
Mathematical expressions must respect operator precedence. These tests verify that multiplication binds tighter than addition:
@Test
fun `1 + 2 * 3`() {
    val expr = parse("1 + 2 * 3")
    val expected = SqlBinaryExpr(
        SqlLong(1),
        "+",
        SqlBinaryExpr(SqlLong(2), "*", SqlLong(3))
    )
    assertEquals(expected, expr)
}

@Test
fun `1 * 2 + 3`() {
    val expr = parse("1 * 2 + 3")
    val expected = SqlBinaryExpr(
        SqlBinaryExpr(SqlLong(1), "*", SqlLong(2)),
        "+",
        SqlLong(3)
    )
    assertEquals(expected, expr)
}
The tree structure of the parsed expression reveals which operations bind first. In 1 + 2 * 3, the 2 * 3 forms a subtree because multiplication has higher precedence.
Testing Query Structure
Tests should verify that each SQL clause is parsed correctly:
@Test
fun `parse SELECT with WHERE`() {
    val select = parseSelect(
        "SELECT id, first_name, last_name FROM employee WHERE state = 'CO'"
    )
    assertEquals(
        listOf(
            SqlIdentifier("id"),
            SqlIdentifier("first_name"),
            SqlIdentifier("last_name")
        ),
        select.projection
    )
    assertEquals(
        SqlBinaryExpr(SqlIdentifier("state"), "=", SqlString("CO")),
        select.selection
    )
    assertEquals("employee", select.tableName)
}

@Test
fun `parse SELECT with aggregates`() {
    val select = parseSelect(
        "SELECT state, MAX(salary) FROM employee GROUP BY state"
    )
    assertEquals(
        listOf(
            SqlIdentifier("state"),
            SqlFunction("MAX", listOf(SqlIdentifier("salary")))
        ),
        select.projection
    )
    assertEquals(listOf(SqlIdentifier("state")), select.groupBy)
}
Integration Testing
Integration tests verify that components work together correctly. For a query engine, this means executing complete queries and verifying the results.
End-to-End Query Tests
These tests exercise the full query path from parsing through execution:
@Test
fun `employees in CO using DataFrame`() {
    val ctx = ExecutionContext(mapOf())
    val df = ctx.csv(employeeCsv)
        .filter(col("state") eq lit("CO"))
        .project(listOf(col("id"), col("first_name"), col("last_name")))
    val batches = ctx.execute(df).asSequence().toList()
    assertEquals(1, batches.size)
    val batch = batches.first()
    assertEquals("2,Gregg,Langford\n" + "3,John,Travis\n", batch.toCSV())
}

@Test
fun `employees in CA using SQL`() {
    val ctx = ExecutionContext(mapOf())
    val employee = ctx.csv(employeeCsv)
    ctx.register("employee", employee)
    val df = ctx.sql(
        "SELECT id, first_name, last_name FROM employee WHERE state = 'CA'"
    )
    val batches = ctx.execute(df).asSequence().toList()
    assertEquals(1, batches.size)
    val batch = batches.first()
    assertEquals("1,Bill,Hopkins\n", batch.toCSV())
}
These tests use a small, static test file so the expected results are known. The CSV output format makes assertions readable and diff-friendly when tests fail.
Testing Complex Operations
Aggregation, joins, and other complex operations need thorough integration testing:
@Test
fun `aggregate query`() {
    val ctx = ExecutionContext(mapOf())
    val df = ctx.csv(employeeCsv)
        .aggregate(
            listOf(col("state")),
            listOf(Max(cast(col("salary"), ArrowType.Int(32, true))))
        )
    val batches = ctx.execute(df).asSequence().toList()
    assertEquals(1, batches.size)
    val batch = batches.first()
    val expected = "CO,11500\n" + "CA,12000\n" + ",11500\n"
    assertEquals(expected, batch.toCSV())
}

@Test
fun `inner join using DataFrame`() {
    val leftSchema = Schema(listOf(
        Field("id", ArrowTypes.Int32Type),
        Field("name", ArrowTypes.StringType)
    ))
    val rightSchema = Schema(listOf(
        Field("id", ArrowTypes.Int32Type),
        Field("dept", ArrowTypes.StringType)
    ))
    val leftData = Fuzzer().createRecordBatch(
        leftSchema,
        listOf(listOf(1, 2, 3), listOf("Alice", "Bob", "Carol"))
    )
    val rightData = Fuzzer().createRecordBatch(
        rightSchema,
        listOf(listOf(1, 2, 4), listOf("Engineering", "Sales", "Marketing"))
    )
    val leftSource = InMemoryDataSource(leftSchema, listOf(leftData))
    val rightSource = InMemoryDataSource(rightSchema, listOf(rightData))
    val ctx = ExecutionContext(mapOf())
    val leftDf = DataFrameImpl(Scan("left", leftSource, listOf()))
    val rightDf = DataFrameImpl(Scan("right", rightSource, listOf()))
    val joinedDf = leftDf.join(rightDf, JoinType.Inner, listOf("id" to "id"))
    val batches = ctx.execute(joinedDf).asSequence().toList()
    assertEquals(1, batches.size)
    val batch = batches.first()
    assertEquals(2, batch.rowCount())
    assertEquals("1,Alice,Engineering\n2,Bob,Sales\n", batch.toCSV())
}
The join test creates in-memory data sources to have precise control over the input data. This makes the expected output predictable and the test deterministic.
Comparative Testing
Comparative testing executes the same query against a trusted reference implementation and verifies that both produce the same results. This is particularly valuable for catching subtle bugs.
Common approaches include:
- Running queries against PostgreSQL, DuckDB, or another established database and comparing results
- Comparing DataFrame API queries against equivalent SQL queries within the same engine
- Running optimized queries against unoptimized versions to verify the optimizer preserves semantics
The challenge with comparative testing is handling differences in null ordering, floating point precision, and result ordering when ORDER BY is not specified.
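A normalization step is usually needed before comparing. The sketch below (helper names are illustrative, with results modeled as CSV-style row strings) sorts rows when the query has no ORDER BY so that both engines' outputs can be compared deterministically:

```kotlin
// Compare result sets from two engines, tolerating row-order
// differences when the query did not specify an ORDER BY.
fun normalize(rows: List<String>, ordered: Boolean): List<String> =
    if (ordered) rows else rows.sorted()

fun sameResults(reference: List<String>, candidate: List<String>, ordered: Boolean): Boolean =
    normalize(reference, ordered) == normalize(candidate, ordered)

fun main() {
    val fromReferenceDb = listOf("CA,1", "CO,2")
    val fromOurEngine = listOf("CO,2", "CA,1")
    println(sameResults(fromReferenceDb, fromOurEngine, ordered = false))  // true
    println(sameResults(fromReferenceDb, fromOurEngine, ordered = true))   // false
}
```

Floating point columns would additionally need rounding to a tolerance before comparison, since two correct engines can legitimately produce slightly different sums.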
Fuzzing
Hand-written tests inevitably miss edge cases. Fuzzing generates random inputs to discover bugs that humans would not think to test.
Random Expression Generation
The fuzzer creates random expression trees by recursively building either leaf nodes (columns, literals) or binary expressions:
fun createExpression(input: DataFrame, depth: Int, maxDepth: Int): LogicalExpr {
    return if (depth == maxDepth) {
        // return a leaf node
        when (rand.nextInt(4)) {
            0 -> ColumnIndex(rand.nextInt(input.schema().fields.size))
            1 -> LiteralDouble(rand.nextDouble())
            2 -> LiteralLong(rand.nextLong())
            3 -> LiteralString(randomString(rand.nextInt(64)))
            else -> throw IllegalStateException()
        }
    } else {
        // return a binary expression over two random subexpressions
        val l = createExpression(input, depth + 1, maxDepth)
        val r = createExpression(input, depth + 1, maxDepth)
        when (rand.nextInt(8)) {
            0 -> Eq(l, r)
            1 -> Neq(l, r)
            2 -> Lt(l, r)
            3 -> LtEq(l, r)
            4 -> Gt(l, r)
            5 -> GtEq(l, r)
            6 -> And(l, r)
            7 -> Or(l, r)
            else -> throw IllegalStateException()
        }
    }
}
A depth limit prevents unbounded recursion. The random seed should be fixed for reproducibility.
Random Plan Generation
Similarly, the fuzzer can generate random query plans:
fun createPlan(input: DataFrame,
               depth: Int,
               maxDepth: Int,
               maxExprDepth: Int): DataFrame {
    return if (depth == maxDepth) {
        input
    } else {
        // recursively create an input plan
        val child = createPlan(input, depth + 1, maxDepth, maxExprDepth)
        // apply a transformation to the plan
        when (rand.nextInt(2)) {
            0 -> {
                // project between one and four random expressions
                val exprCount = rand.nextInt(1, 5)
                child.project((1..exprCount).map {
                    createExpression(child, 0, maxExprDepth)
                })
            }
            1 -> child.filter(createExpression(child, 0, maxExprDepth))
            else -> throw IllegalStateException()
        }
    }
}
Here is an example of a generated plan:
Filter: 'VejBmVBpYp7gHxHIUB6UcGx' OR 0.7762591612853446
  Filter: 'vHGbOKKqR' <= 0.41876514212913307
    Filter: 0.9835090312561898 <= 3342229749483308391
      Filter: -5182478750208008322 < -8012833501302297790
        Filter: 0.3985688976088563 AND #1
          Filter: #5 OR 'WkaZ54spnoI4MBtFpQaQgk'
            Scan: employee.csv; projection=None
Enhanced Random Values
Naive random number generation misses interesting edge cases. An enhanced random generator deliberately produces boundary values:
class EnhancedRandom(val rand: Random) {
    fun nextDouble(): Double {
        return when (rand.nextInt(8)) {
            0 -> Double.MIN_VALUE
            1 -> Double.MAX_VALUE
            2 -> Double.POSITIVE_INFINITY
            3 -> Double.NEGATIVE_INFINITY
            4 -> Double.NaN
            5 -> -0.0
            6 -> 0.0
            7 -> rand.nextDouble()
            else -> throw IllegalStateException()
        }
    }

    fun nextLong(): Long {
        return when (rand.nextInt(4)) {
            0 -> Long.MIN_VALUE
            1 -> Long.MAX_VALUE
            // unlike -0.0 for doubles, -0 is identical to 0 for integers,
            // so a single zero case suffices here
            2 -> 0
            3 -> rand.nextLong()
            else -> throw IllegalStateException()
        }
    }
}
By weighting the distribution toward edge cases, the fuzzer finds bugs faster than pure random generation.
Handling Invalid Plans
Most randomly generated plans are invalid. This is actually useful because it tests error handling. However, if you want to test execution specifically, the fuzzer can be made “smarter” by:
- Generating type-aware expressions (e.g., only compare compatible types)
- Generating AND/OR expressions with Boolean operands
- Ensuring aggregate expressions use aggregate functions
The trade-off is that smarter fuzzers may miss bugs in error paths.
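One way to make generation type-aware is to track the type each subtree must produce. The sketch below uses a simplified expression model (its classes are illustrative, not KQuery's actual AST): AND/OR nodes only ever receive boolean operands, because the generator builds comparisons, never raw literals, at the boolean leaves.

```kotlin
import kotlin.random.Random

// Simplified, illustrative expression model (not KQuery's classes)
sealed interface Expr
data class NumLit(val v: Long) : Expr
data class Compare(val op: String, val l: Expr, val r: Expr) : Expr
data class Logic(val op: String, val l: Expr, val r: Expr) : Expr

class TypedFuzzer(private val rand: Random) {
    // Always produces a boolean-typed expression
    fun booleanExpr(depth: Int): Expr =
        if (depth == 0 || rand.nextInt(3) == 0)
            // boolean leaf: a comparison of two numeric operands
            Compare(listOf("=", "<", ">").random(rand), numericExpr(), numericExpr())
        else
            // interior node: AND/OR with boolean operands by construction
            Logic(if (rand.nextBoolean()) "AND" else "OR",
                booleanExpr(depth - 1), booleanExpr(depth - 1))

    private fun numericExpr(): Expr = NumLit(rand.nextLong())
}

fun main() {
    println(TypedFuzzer(Random(42)).booleanExpr(3))
}
```

Every plan this generator produces type-checks, so it exercises the execution path rather than the error path.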
Using Fuzzing Effectively
Some practical tips for fuzzing:
Fix the random seed: Use a constant seed so failures are reproducible. Log the seed if using system time.
Run many iterations: Fuzzing finds bugs probabilistically. Run thousands or millions of iterations.
Shrink failing cases: When fuzzing finds a bug, try to reduce the failing input to a minimal reproduction.
Keep regression tests: When fuzzing finds a bug, add the simplified case as a permanent test.
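These tips can be combined into a small harness. The sketch below (the names are illustrative) derives a distinct seed per iteration from a fixed base seed and reports the failing seed so that the exact case can be replayed:

```kotlin
import kotlin.random.Random

// Run `runOnce` many times with per-iteration seeds derived from a
// fixed base seed; on failure, report the seed for exact reproduction.
fun fuzz(iterations: Int, baseSeed: Long, runOnce: (Random) -> Unit) {
    for (i in 0 until iterations) {
        val seed = baseSeed + i
        try {
            runOnce(Random(seed))
        } catch (e: Throwable) {
            throw AssertionError("fuzz iteration $i failed (seed=$seed)", e)
        }
    }
}

fun main() {
    fuzz(1000, baseSeed = 42) { rand ->
        // placeholder for: generate a random plan, execute it,
        // and check invariants on the result
        check(rand.nextInt(10) in 0..9)
    }
    println("all iterations passed")
}
```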
Golden Testing
Golden testing (also called snapshot testing) captures the output of a query and stores it as the expected result. Future runs compare against this “golden” output.
This approach works well for:
- Query plan pretty-printing (verifying plan structure)
- Explain output format
- Error messages
The downside is that golden tests can be brittle. Any change to output formatting breaks the tests, even if the underlying behavior is correct.
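A minimal golden-file helper can mitigate the brittleness by making intentional updates a one-flag operation. This sketch (the file layout and system property name are illustrative) records the output on first run and compares on subsequent runs:

```kotlin
import java.nio.file.Files
import java.nio.file.Path

// Compare `actual` against a stored golden file; create or update the
// file when it is missing or when -Dupdate.golden=true is set.
fun assertMatchesGolden(actual: String, goldenFile: Path) {
    val update = System.getProperty("update.golden") == "true"
    if (update || !Files.exists(goldenFile)) {
        goldenFile.parent?.let { Files.createDirectories(it) }
        Files.writeString(goldenFile, actual)  // record golden output
    } else {
        val expected = Files.readString(goldenFile)
        check(expected == actual) { "output differs from golden file ${goldenFile.fileName}" }
    }
}

fun main() {
    val golden = Files.createTempFile("plan", ".golden")
    Files.delete(golden)  // start with no golden file: first run records it
    val plan = "Projection: #id\n  Scan: employee\n"
    assertMatchesGolden(plan, golden)  // records the output
    assertMatchesGolden(plan, golden)  // passes: output unchanged
    println("golden test passed")
}
```

Reviewing the golden-file diff in code review then becomes the mechanism for approving intentional output changes.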
Testing Optimizations
Optimizer bugs are particularly insidious because the unoptimized query works correctly. Test optimizations by:
- Verifying that optimized and unoptimized plans produce identical results
- Verifying that specific optimizations fire when expected
- Testing that invalid optimizations do not fire
For example, to test projection push-down:
@Test
fun `projection pushdown reduces columns read`() {
    val plan = // build a plan that selects 3 columns from a 10 column table
    val optimized = Optimizer().optimize(plan)
    // verify the scan only reads the 3 needed columns
    val scan = findScan(optimized)
    assertEquals(3, scan.projection.size)
}

@Test
fun `optimization preserves query semantics`() {
    val plan = // build a query plan
    val optimized = Optimizer().optimize(plan)
    val originalResult = execute(plan)
    val optimizedResult = execute(optimized)
    assertEquals(originalResult, optimizedResult)
}
Debugging Test Failures
When a test fails, the key is reproducing and understanding the failure.
Pretty-print plans: Implement a pretty() method on logical and physical plans. When a test fails, print the plan to understand what query was executed.
Log intermediate results: For debugging, add logging that shows data flowing through each operator.
Minimize the reproduction: Start with a failing query and systematically simplify it while keeping the failure. Remove unnecessary columns, filters, and joins until you have the smallest query that exhibits the bug.
Check boundary conditions: Many bugs occur at boundaries. If a test with 100 rows fails, try 0, 1, and 2 rows.
Verify test data: Sometimes the test itself is wrong. Double-check that the test data and expected results are correct.
Continuous Integration
Automated testing in CI catches regressions before they reach users. A good CI setup for a query engine includes:
- Run all unit tests on every commit
- Run integration tests on every pull request
- Run longer fuzzing sessions nightly or weekly
- Track test execution time to catch performance regressions
Test failures should block merging. Flaky tests (tests that sometimes pass and sometimes fail) must be fixed immediately because they train developers to ignore failures.
Summary
Testing query engines requires multiple complementary approaches. Unit tests verify individual components. Integration tests verify components work together. Fuzzing discovers edge cases that humans miss. Comparative testing catches semantic bugs by checking against a reference implementation.
Invest in testing infrastructure early. Good test utilities (for creating test data, comparing results, pretty-printing plans) make writing tests easier, which means more tests get written. A well-tested query engine gives users confidence that their query results are correct.
This book is also available for purchase in ePub, MOBI, and PDF format from https://leanpub.com/how-query-engines-work
Copyright © 2020-2025 Andy Grove. All rights reserved.