Conversation

@stuhood (Collaborator) commented Oct 15, 2025

The `TermSet` `Query` currently produces one `Scorer`/`DocSet` per matched term by scanning the term dictionary and then consuming posting lists. For very large sets of terms and a fast field, it is faster to scan the fast field column while intersecting with a `HashSet` of (encoded) term values.
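Conceptually, the fast-field path boils down to a single column scan with an O(1) membership test per document. The sketch below is illustrative only (the closure-based column accessor, `u64` encoded values, and `u32` doc ids are simplifications, not the exact tantivy types):

```rust
use std::collections::HashSet;

/// Illustrative stand-in for the fast-field execution path: `column` maps a
/// doc id to its encoded fast field value, and `term_values` holds the
/// encoded values of the query's terms.
fn matching_docs(
    num_docs: u32,
    column: impl Fn(u32) -> u64,
    term_values: &HashSet<u64>,
) -> Vec<u32> {
    (0..num_docs)
        // One sequential pass over the column with a hash lookup per doc,
        // instead of one posting-list Scorer per term.
        .filter(|&doc| term_values.contains(&column(doc)))
        .collect()
}
```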

Following the pattern set by the two execution modes of `RangeQuery`, this PR introduces a variant of `TermSet` which uses fast fields, and then uses it when there are more than 1024 input terms (an arbitrary threshold!).
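The mode selection is a simple size-based dispatch, roughly like this sketch (the 1024 cutoff is the threshold from this PR; the enum and function names are made up for illustration):

```rust
/// Hypothetical strategy enum, not the actual types in this PR.
enum TermSetStrategy {
    /// Few terms: walk the term dictionary and union posting lists.
    PostingLists,
    /// Many terms on a fast field: scan the column, intersecting with a HashSet.
    FastFieldScan,
}

const FAST_FIELD_TERM_THRESHOLD: usize = 1024;

fn choose_strategy(num_terms: usize, field_is_fast: bool) -> TermSetStrategy {
    if field_is_fast && num_terms > FAST_FIELD_TERM_THRESHOLD {
        TermSetStrategy::FastFieldScan
    } else {
        TermSetStrategy::PostingLists
    }
}
```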

Performance is significantly improved for large `TermSet`s of primitives.

@stuhood force-pushed the stuhood.term-set-fast-fields branch from 7b5d85c to 72a22f6 on October 15, 2025 at 23:25
@stuhood marked this pull request as ready for review on October 15, 2025 at 23:39
@stuhood (Collaborator, Author) commented Oct 16, 2025

Upstream at quickwit-oss#2718

@stuhood merged commit a8baceb into main on Oct 16, 2025 (5 checks passed)
@stuhood deleted the stuhood.term-set-fast-fields branch on October 16, 2025 at 16:18
stuhood added a commit to paradedb/paradedb that referenced this pull request Oct 16, 2025
## What

Add a variant of `TermSet` for very large sets of terms which scans a
fast fields column and intersects it with the `TermSet`.

## Why

ParadeDB users occasionally use `TermSet` as a "limited total size join"
between two tables (essentially an explicit hash join). But the posting-list
implementation of `TermSet` requires creating one `Scorer` per term, and may
seek many times to read and merge posting lists.

This implementation is approximately 2x faster for a
`paradedb.aggregate` call operating over an input `TermSet` query
containing 10 million bigint terms.

## How

See paradedb/tantivy#69.
stuhood added a commit that referenced this pull request Oct 16, 2025
(#70)

Missed in #69.

The `TermSet` fast fields implementation was cribbed from `RangeQuery`'s fast fields implementation: ... which also has this bug. Will fix upstream.