squel

A useless row-oriented SQL query engine written to understand something about query processing.

With it, you can:

  1. Run SQL queries on CSV files
  2. Store and query .tdb files (see below)

Design

The system is implemented using the iterator / volcano processing model, where each relational operator implements a getTuple function. Each operator in the query plan calls getTuple on its children and, after processing the result, passes the tuple to its parent (if any).

There are three subsystems:

  • The parser, which parses the raw SQL query into a parse tree describing the query
  • The planner, which composes a query plan of relational operators from the parse tree and binds identifiers to data
  • The executor, which executes the query described by the query plan. Execution means iterating over the operators.

SQL-support

The query engine supports some basic SQL features:

  • Aggregation (COUNT, AVG, SUM, MIN, MAX)
  • Inner joins between two tables
  • Subqueries
  • Filtering (WHERE-clauses)
  • EXPLAIN, CREATE TABLE and INSERT INTO statements

This is about as complicated as a query can get:

SELECT COUNT(unemployed)
FROM (
    SELECT lt.long_term_unemployed,lt.time,u.unemployed
    FROM
    './test/data/lt_unemployed.csv' AS lt 
    JOIN
    './test/data/unemployed.csv' AS u
        ON u.time=lt.time
)
WHERE long_term_unemployed > 10000 OR unemployed > 50000

See ./test/ for further examples.

Query planning

The query planner is relatively straightforward. It creates a query plan composed of the following operators (see src/include/planner/planner.h):

OP_SCANTDB
OP_SCAN
OP_PROJECT
OP_FILTER
OP_JOIN
OP_AGGREGATE
OP_HASHJOIN

Two operators are noteworthy:

  • OP_SCANTDB describes a scan on a .tdb table, whereas OP_SCAN is a scan on a plaintext file.
  • OP_HASHJOIN describes a hash join. Any join based only on a single equality comparison is turned into a hash join; OP_JOIN is a plain nested-loop join.

Custom file format .tdb

A table can be persisted in the custom .tdb file format. A .tdb file consists of a header (describing the table: column types, names, etc.) followed by the actual records. The data is stored in a contiguous block where each record is of fixed size. The layout is described in src/include/io/tdb.h.

During query execution, a tuple is represented in memory in the same format. A column can be extracted from the block using an offset and datatype information. Queries on plaintext files are converted to the same format (and thus are slower than queries on .tdb-files).

See test/test_tdb.bats for an example of how to create and insert into a tdb-table.

Memory management

Thanks to the iterator model, the system needs to keep only a single tuple in memory at a time during execution. This means that querying a 1KB table consumes as much memory as querying a 1GB table (assuming the same table definition).

During query execution, the operators call getTuple() on their child operators. The parent passes a pointer to a tuple to the child and is responsible for managing the memory allocated for the tuple.

Assume a simple query with PROJECT, FILTER, and SCAN operators. The executor allocates memory for a single tuple. A pointer to that memory is passed to PROJECT and forwarded through FILTER to SCAN. The scan populates the memory with data from disk, the filter checks whether the tuple satisfies the predicate, and the projection selects the requested columns. At best, this requires only a single allocation by the executor. Once the executor has printed the tuple (for example), it can reuse the allocated memory.

Joins and other blocking operators are an exception. Currently the inner table of a join is stored in memory in its entirety, so one must be careful when crafting the query (choose the smaller table as the inner table).

Building

Simply run make to build under ./build:

make

You should then be able to run the following query:

./build/squel "SELECT * FROM './test/data/small.csv'"

For running tests / cleaning up, you can do:

make tests
make clean
