Conversation

@stuhood (Collaborator) commented Oct 15, 2025

The `TermSet` `Query` currently produces one `Scorer`/`DocSet` per matched term by scanning the term dictionary and then consuming posting lists. For very large sets of terms and a fast field, it is faster to scan the fast field column while intersecting with a `HashSet` of (encoded) term values.
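Conceptually, the fast-field path boils down to a single column scan with an O(1) membership test per document. The sketch below is illustrative only (the closure-based column accessor, `u64` encoded values, and `u32` doc ids are simplifications, not the exact tantivy types):

```rust
use std::collections::HashSet;

/// Illustrative stand-in for the fast-field execution path: `column` maps a
/// doc id to its encoded fast field value, and `term_values` holds the
/// encoded values of the query's terms.
fn matching_docs(
    num_docs: u32,
    column: impl Fn(u32) -> u64,
    term_values: &HashSet<u64>,
) -> Vec<u32> {
    (0..num_docs)
        // One sequential pass over the column with a hash lookup per doc,
        // instead of one posting-list Scorer per term.
        .filter(|&doc| term_values.contains(&column(doc)))
        .collect()
}
```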

Following the pattern set by the two execution modes of `RangeQuery`, this PR introduces a variant of `TermSet` which uses fast fields, and then uses it when there are more than 1024 input terms (an arbitrary threshold!).
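The mode selection is a simple size-based dispatch, roughly like this sketch (the 1024 cutoff is the threshold from this PR; the enum and function names are made up for illustration):

```rust
/// Hypothetical strategy enum, not the actual types in this PR.
enum TermSetStrategy {
    /// Few terms: walk the term dictionary and union posting lists.
    PostingLists,
    /// Many terms on a fast field: scan the column, intersecting with a HashSet.
    FastFieldScan,
}

const FAST_FIELD_TERM_THRESHOLD: usize = 1024;

fn choose_strategy(num_terms: usize, field_is_fast: bool) -> TermSetStrategy {
    if field_is_fast && num_terms > FAST_FIELD_TERM_THRESHOLD {
        TermSetStrategy::FastFieldScan
    } else {
        TermSetStrategy::PostingLists
    }
}
```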

Performance is significantly improved for large `TermSet`s of primitives.

@stuhood force-pushed the stuhood.term-set-fast-fields branch from 7b5d85c to 72a22f6 on October 15, 2025 at 23:25
@stuhood marked this pull request as ready for review on October 15, 2025 at 23:39
@stuhood (Collaborator, Author) commented Oct 16, 2025

Upstream at quickwit-oss#2718

@stuhood merged commit a8baceb into main on Oct 16, 2025 (5 checks passed)
@stuhood deleted the stuhood.term-set-fast-fields branch on October 16, 2025 at 16:18
stuhood added a commit to paradedb/paradedb that referenced this pull request Oct 16, 2025
## What

Add a variant of `TermSet` for very large sets of terms which scans a
fast fields column and intersects it with the `TermSet`.

## Why

ParadeDB users occasionally use `TermSet` as a "limited total size join"
between two tables (essentially an explicit hash join). But the posting-list
implementation of `TermSet` requires creating one `Scorer` per term, and may
seek many times to read and merge posting lists.

This implementation is approximately 2x faster for a
`paradedb.aggregate` call operating over an input `TermSet` query
containing 10 million bigint terms.

## How

See paradedb/tantivy#69.
stuhood added a commit that referenced this pull request Oct 16, 2025
(#70)

Missed in #69.

The `TermSet` fast fields implementation was cribbed from `RangeQuery`'s fast fields implementation: ... which also has this bug. Will fix upstream.