forked from quickwit-oss/tantivy
fix: added field validation to aggregations #87
Open: mdashti wants to merge 68 commits into main from moe/fix-agg-validation
Conversation
This PR modifies internal API signatures and implementation details so that `FileSlice`s are passed down into the innards of (at least) the `BlockwiseLinearCodec`. This allows tantivy to defer dereferencing large slices of bytes when reading numeric fast fields, and instead dereference only the slice of bytes it needs for any given compressed Block. The motivation here is for external `Directory` implementations where it's not exactly efficient to dereference large slices of bytes.
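To make the idea concrete, here is a minimal hedged sketch of the deferred-read pattern. The helper name and the way the block's byte range is obtained are assumptions for illustration; only `FileSlice::slice` and `FileSlice::read_bytes` are real tantivy APIs.

```rust
use std::ops::Range;

use tantivy::directory::FileSlice;

// `block_byte_range` stands in for whatever the codec computes from its block
// index; only that sub-slice is ever turned into real bytes.
fn read_one_block(column: &FileSlice, block_byte_range: Range<usize>) -> std::io::Result<Vec<u8>> {
    // `slice` is lazy: no bytes are fetched from the Directory yet.
    let block_slice = column.slice(block_byte_range);
    // Only here does the Directory get asked for the block's bytes.
    let bytes = block_slice.read_bytes()?;
    Ok(bytes.as_slice().to_vec())
}
```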
…mponents (#20) This overhauls `SegmentReader` to put its various components behind `OnceLock`s so that they can be opened and read on first use, as opposed to when a `SegmentReader` is constructed -- which happens once for every segment when an Index is opened. This breaks one of Tantivy's expectations: that an existing `SegmentReader` can still read from physical files deleted by a merge. That is no longer true now that the segment's physical files aren't opened until needed, so I've `#[ignore]`'d six tests that expose the problem. From our (pg_search's) side of things, we don't really have physical files and don't need to rely on the filesystem/kernel to allow reading unlinked files that are still open. Overall, this cuts down a significant number of disk reads during pg_search's query planning. With my test data it goes from 808 individual reads totalling 999,799 bytes to 18 reads totalling 814,514 bytes, and reduces the time it takes to plan a simple query from about 1.4ms to 0.436ms -- roughly a 3.2x improvement.
This removes some overhead that the profiler exposed. In the case I was testing, fast fields no longer show up in the profile at all. I also renamed `BlockWithLength` to `BlockWithData`.
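A minimal sketch of the lazy-open pattern being described (stand-in fields, not the real `SegmentReader` internals): the expensive component is opened by whichever caller touches it first, and cached for everyone else.

```rust
use std::sync::OnceLock;

struct LazyComponent {
    // The expensive-to-open piece is created on first use, not at construction.
    postings: OnceLock<Vec<u8>>,
}

impl LazyComponent {
    fn new() -> Self {
        Self { postings: OnceLock::new() }
    }

    fn postings(&self) -> &[u8] {
        // The first caller pays the cost of opening/reading; later callers get
        // the cached value. Opening an Index no longer touches every segment file.
        self.postings.get_or_init(|| {
            // stand-in for "open the postings file and read it"
            vec![0u8; 16]
        })
    }
}
```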
Prior to this commit, the Footer tantivy serialized at the end of every file included a JSON blob that could be an arbitrary size. This changes the Footer to be exactly 24 bytes (six u32s), making sure to keep the `crc` value. The other change here is to not actually read/validate the footer bytes when opening a file. From pg_search's perspective, that is unnecessary Postgres buffer cache I/O and adds index/segment opening overhead, which is something pg_search does often, for each query. Two tests are ignored because they exercise physical index files stored in the repo, which this change completely breaks.
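A sketch of what a fixed-size footer can look like. Apart from `crc`, the field names here are assumptions for illustration, not this fork's actual layout; the point is the fixed 24-byte, six-u32 encoding with no variable-size JSON to parse.

```rust
#[derive(Debug, Clone, Copy)]
struct FixedFooter {
    magic: u32,
    version_major: u32,
    version_minor: u32,
    version_patch: u32,
    payload_len: u32,
    crc: u32,
}

impl FixedFooter {
    /// Six u32s, always 24 bytes.
    fn to_bytes(&self) -> [u8; 24] {
        let fields = [
            self.magic,
            self.version_major,
            self.version_minor,
            self.version_patch,
            self.payload_len,
            self.crc,
        ];
        let mut buf = [0u8; 24];
        for (i, field) in fields.into_iter().enumerate() {
            buf[i * 4..i * 4 + 4].copy_from_slice(&field.to_le_bytes());
        }
        buf
    }
}
```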
This allows a `SegmentId` to be constructed from a `[u8; 16]` byte array. It also adds a `impl Default for SegementId`, which defaults to all nulls
This teaches tantivy how to "directly" delete a document in a segment.
Our use case from pg_search is that we already know the segment_id and doc_id, so it's far more efficient for us to delete docs through our `ambulkdelete()` routine.
It avoids doing a search, and all the machinery around one, for each of the "ctid" terms we want to delete.
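A purely conceptual sketch of "direct" deletion, with stand-in types rather than this fork's real API: when the segment and doc id are already known, deletion reduces to recording the doc in that segment's delete bitset, with no term lookup or scoring involved.

```rust
use std::collections::HashMap;

type SegmentId = [u8; 16];
type DocId = u32;

#[derive(Default)]
struct DeleteBitsets {
    // One "deleted docs" bitset per segment, indexed by doc id (stand-in
    // representation; tantivy's real delete bitsets are more compact).
    per_segment: HashMap<SegmentId, Vec<bool>>,
}

impl DeleteBitsets {
    fn delete(&mut self, segment_id: SegmentId, doc_id: DocId) {
        let bitset = self.per_segment.entry(segment_id).or_default();
        if bitset.len() <= doc_id as usize {
            bitset.resize(doc_id as usize + 1, false);
        }
        // Deletion is just flipping a bit for a known (segment, doc) pair.
        bitset[doc_id as usize] = true;
    }
}
```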
Tantivy creates thread pools for some of its background work, specifically committing and merging. If one of the worker threads panics, rayon may simply abort the process. That is terrible for pg_search, as it takes down the entire Postgres cluster. These changes allow a Directory to assign a panic handler that gets called in such cases, which lets pg_search gracefully roll back the current transaction while presenting the panic message to the user.
This adds a function named `wants_cancel() -> bool` to the `Directory` trait. It allows a Directory implementation to indicate that it would like Tantivy to cancel an operation. Right now, this function is only queried at key points during index merging, but it _could_ be used in other places. Technically, segment merging is the only "black box" in tantivy that users don't otherwise have direct control over. The default implementation of `wants_cancel()` returns false, so there's no fear of default tantivy spuriously cancelling a merge. Cancellation happens "cleanly": if `wants_cancel()` returns true, an `Err(TantivyError::Cancelled)` is returned from the calling function at that point and the error result is propagated up the stack. No panics are raised.
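A self-contained sketch of the cancellation hook, using stand-in types rather than tantivy's real `Directory` trait and `TantivyError`; the shape (default `false`, clean `Err` on cancel) matches the description above.

```rust
trait CancellableDirectory {
    /// Default matches stock tantivy: never spuriously cancel anything.
    fn wants_cancel(&self) -> bool {
        false
    }
}

#[derive(Debug)]
enum MergeError {
    Cancelled,
}

fn merge_segments(dir: &dyn CancellableDirectory, segments: &[&str]) -> Result<(), MergeError> {
    for segment in segments {
        // Checked at key points; bail out cleanly instead of panicking, and let
        // the error propagate up the stack.
        if dir.wants_cancel() {
            return Err(MergeError::Cancelled);
        }
        let _ = segment; // ... merge this segment's data ...
    }
    Ok(())
}
```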
The `TopNComputer` was refactored in 0e04ec3 to support both ascending and descending order. But that refactor did not adjust the `TopNComputer::threshold` to include the `ComparableDoc`, which is what provides the relevant `Ord` implementation. Add a simple failing test, make it pass, and then expand the tests into proptests (which are already in use elsewhere in the codebase). Upstream at: quickwit-oss#2651
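A simplified stand-in (not tantivy's actual `TopNComputer` or `ComparableDoc`) showing why the threshold needs to carry the `ComparableDoc`: the order-aware `Ord` lives on that type, so a bare-feature threshold would ignore the collection direction.

```rust
use std::cmp::Ordering;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ComparableDoc<const REVERSE_ORDER: bool> {
    feature: u64,
    doc: u32,
}

impl<const REVERSE_ORDER: bool> Ord for ComparableDoc<REVERSE_ORDER> {
    fn cmp(&self, other: &Self) -> Ordering {
        let by_feature = self
            .feature
            .cmp(&other.feature)
            .then_with(|| other.doc.cmp(&self.doc));
        if REVERSE_ORDER { by_feature.reverse() } else { by_feature }
    }
}

impl<const REVERSE_ORDER: bool> PartialOrd for ComparableDoc<REVERSE_ORDER> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

struct TopNComputer<const REVERSE_ORDER: bool> {
    top_n: usize,
    buffer: Vec<ComparableDoc<REVERSE_ORDER>>,
    // Storing the whole `ComparableDoc` (not just a bare feature value) means
    // the threshold check below goes through the same order-aware `Ord` used
    // when sorting -- the gist of the fix described above.
    threshold: Option<ComparableDoc<REVERSE_ORDER>>,
}

impl<const REVERSE_ORDER: bool> TopNComputer<REVERSE_ORDER> {
    fn push(&mut self, candidate: ComparableDoc<REVERSE_ORDER>) {
        if let Some(threshold) = self.threshold {
            // A bare-feature threshold would ignore REVERSE_ORDER here and
            // silently drop documents when collecting in ascending order.
            if candidate <= threshold {
                return;
            }
        }
        self.buffer.push(candidate);
        if self.buffer.len() >= 2 * self.top_n {
            self.buffer.sort_unstable_by(|a, b| b.cmp(a));
            self.buffer.truncate(self.top_n);
            self.threshold = self.buffer.last().copied();
        }
    }
}
```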
Not all queries use positions, and it's okay if we (from the perspective of `pg_search`, anyway) defer opening/loading them until they're first needed. This is probably completely wrong for an mmap-based Directory, but we (again, `pg_search`) decided long ago that we don't care about that use case. This saves a lot of disk I/O when an index has lots of segments and the query doesn't need positions. As a drive-by, make sure one of the Vecs has enough capacity before pushing items to it. This showed up in the profiler, believe it or not.
…tionary`. (#55) When a fast fields string/bytes `Dictionary` is opened, we currently read the entire dictionary from `FileSlice` -> `OwnedBytes`... and then immediately wrap it back into a `FileSlice`. Switching to `Dictionary::open` preserves the `FileSlice`, such that only the portions of the `Dictionary` which are actually accessed are read from disk/buffers.
We would like to be able to lazily load `BitpackedCodec` columns (similar to what 020bdff did for `BlockwiseLinearCodec`), because in the context of `pg_search`, immediately constructing `OwnedBytes` means copying the entire content of the column into memory. To do so, we expose some (slightly overlapped) block boundaries from `BitUnpacker`, and then lazily load each block when it is requested. Only the `get_val` function uses the cache: `get_row_ids_for_value_range` does not (yet), because it would be necessary to partition the row ids by block, and most of the time consumers using it are already loading reasonably large ranges anyway. See paradedb/paradedb#2894 for usage. There are a few 2x speedups in the benchmark suite, as well as a 1.8x speedup on a representative customer query. Unfortunately there are also some 13-19% slowdowns on aggregates: it looks like that is because aggregates use `get_vals`, for which the default implementation is to just call `get_val` in a loop.
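To illustrate the last point, a tiny sketch of the pattern (illustrative trait and names, not tantivy's exact column API): a default `get_vals` that loops over `get_val` pays one lazy block lookup per row, which is the source of the aggregate slowdowns mentioned above.

```rust
trait LazyColumn {
    fn get_val(&self, row_id: u32) -> u64;

    // Default implementation: per-row lookups, no batching across a block.
    fn get_vals(&self, row_ids: &[u32], output: &mut Vec<u64>) {
        output.extend(row_ids.iter().map(|&row_id| self.get_val(row_id)));
    }
}
```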
## What
Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).
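A conceptual, self-contained sketch of the tuple idea (illustrative names, not the PR's actual `TopOrderable`/`Feature` traits): each element produces a comparable value for a doc, and tuples compare lexicographically, which is exactly "order by A, then B, then C".

```rust
trait Orderable {
    type Output: Ord;
    fn value_for_doc(&self, doc: u32) -> Self::Output;
}

// A tuple of orderables is itself orderable; the real trait does this for
// tuples of length 1 through 3 via a macro.
impl<A: Orderable, B: Orderable> Orderable for (A, B) {
    type Output = (A::Output, B::Output);
    fn value_for_doc(&self, doc: u32) -> Self::Output {
        (self.0.value_for_doc(doc), self.1.value_for_doc(doc))
    }
}
```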
## How
Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
* To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
* This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers only rarely come into play in the comparison (because most candidates are eliminated by earlier columns).
* It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".
This interface additionally could not use the `CustomScorer` APIs, because those do not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).
## Note
This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!
----
Upstream at quickwit-oss#2681
This removes a check against `IndexWriterOptions` which disallowed zero indexing worker threads (`num_worker_threads`).
- Use a bitset to track used buckets in the `SharedArenaHashmap`, allowing for more efficient iteration
- Create a global pool for both `MemoryArena` and `IndexingContext`
- Reduce the `MemoryArena` page size by half (it's now 512KB instead of 1MB)
- Centralize thread pool instances in `SegmentUpdater` so we can elide creating them if all thread-pool sizes are zero
Adjusts `Feature` to use `Option` where necessary. `StringFeature::SegmentOutput` avoids being wrapped in an `Option` because the `u64::MAX` "niche" is safe to use in that case (since it would require that many terms to trigger a collision). Surprisingly, no performance impact downstream. I additionally needed to adjust the unit tests to:
* include a NULL
* use a stable id, because for some reason at three segments (but not two?) the fact that segment ords are not stable was exposed
A MemoryArena should support allocations up to 4GB, and #60 broke this by not accounting for the "max page id" now that pages are half the size they originally were. This cleans up the code so things stay in sync if we change `NUM_BITS_PAGE_ADDR` again, and adds a unit test.
The `TermSet` `Query` currently produces one `Scorer`/`DocSet` per matched term by scanning the term dictionary and then consuming posting lists. For very large sets of terms and a fast field, it is faster to scan the fast field column while intersecting with a `HashSet` of (encoded) term values. Following the pattern set by the two execution modes of `RangeQuery`, this PR introduces a variant of `TermSet` which uses fast fields, and then uses it when there are more than 1024 input terms (an arbitrary threshold!). Performance is significantly improved for large `TermSet`s of primitives.
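A hedged sketch of the two execution strategies; the 1024 threshold comes from the description above, while the function and helper names are illustrative rather than the real query code.

```rust
use std::collections::HashSet;

const FAST_FIELD_TERM_THRESHOLD: usize = 1024;

fn matching_docs(terms: &[u64], column: &[u64]) -> Vec<u32> {
    if terms.len() > FAST_FIELD_TERM_THRESHOLD {
        // Fast-field mode: one scan over the column, with a HashSet membership
        // test per (encoded) value.
        let term_set: HashSet<u64> = terms.iter().copied().collect();
        column
            .iter()
            .enumerate()
            .filter(|&(_, value)| term_set.contains(value))
            .map(|(doc, _)| doc as u32)
            .collect()
    } else {
        // Posting-list mode (stand-in): one DocSet per matched term, unioned.
        let mut docs: Vec<u32> = terms.iter().flat_map(|term| postings_for(*term)).collect();
        docs.sort_unstable();
        docs.dedup();
        docs
    }
}

fn postings_for(_term: u64) -> Vec<u32> {
    Vec::new() // stand-in for reading a posting list
}
```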
`MergeOptimizedInvertedIndexReader` was added in #32 in order to avoid making small reads to our underlying `FileHandle`. It did so by reading the entire content of the posting lists and positions at open time. As that PR says:

> A likely downside to this approach is that now pg_search will be, indirectly, holding onto a lot of heap-allocated memory that was read from its block storage. Perhaps in the (near) future we can further optimize the new `MergeOptimizedInvertedIndexReader` such that it pages in blocks of a few megabytes at a time, on demand, rather than the whole file.

This PR makes that change. It additionally removes code that was later added in #47 to borrow individual entries rather than creating `OwnedBytes` for them. I believe that code was added due to a misunderstanding: `OwnedBytes` is a misnomer -- the bytes are not "owned", they are immutably borrowed and reference counted. An `OwnedBytes` object can be created for any type which derefs to a slice of bytes, and it can be cheaply cloned and sliced. So there is no need to actually borrow _or_ copy the buffer under the `OwnedBytes`. Removing the code that did so allows us to safely recreate our buffer without worrying about the lifetimes of buffers that we've handed out.
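A tiny demonstration of that point, assuming only `OwnedBytes::new`, `clone`, and `len`: an `OwnedBytes` is a reference-counted view over its backing buffer, so cloning it does not copy the bytes.

```rust
use tantivy::directory::OwnedBytes;

fn share_without_copying() {
    let bytes = OwnedBytes::new(vec![0u8; 1024 * 1024]);
    // Cheap: bumps a refcount rather than copying a megabyte.
    let view = bytes.clone();
    assert_eq!(view.len(), bytes.len());
}
```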
## What

Reduce the per-segment buffer sizes from 4MB to 512KB.

## Why

#71 moved from buffers which covered the entire file to maximum 4MB buffers. But for merges with very large segment counts, we need to use more conservative buffer sizes. 512KB will still eliminate most posting list reads: posting lists larger than 512KB will skip the buffer.
* Removes allocation in a bunch of places
* Removes sorting of terms if we're going to use the fast field execution method
* Adds back the (accidentally dropped) cardinality threshold
* Removes `bool` support -- using the posting lists is always more efficient for a `bool`, since there are at most two of them
* More eagerly constructs the term `HashSet` so that it happens once, rather than once per segment
Adds `SnippetGenerator::snippets` to render multiple snippets in either score or position order. Additionally: renames the existing `limit` and `offset` arguments to disambiguate between "match" positions (which are concatenated into fragments), and "snippet" positions. Co-authored-by: Stu Hood <[email protected]>
This commit adds support for Polish language stemming. The previously used rust-stemmers crate is abandoned and unmaintained, which blocked the addition of new languages. This change addresses a user request for Polish stemming to improve BM25 recall in their use case. The tantivy-stemmers crate is a modern, maintained alternative that also opens the door for supporting many other languages in the future.

- Added the tantivy-stemmers crate as a dependency to the workspace, alongside the existing rust-stemmers dependency (for backward compatibility)
- Introduced an internal enum that can hold an algorithm from either rust-stemmers or tantivy-stemmers
- Added Polish to the main Language enum, mapped to the new tantivy-stemmers implementation
- Updated the token stream to handle both types of stemmers internally
- Added the POLISH variant to the stopwords list
- Existing tests pass
- Added test_pl_tokenizer to verify that the Polish stemmer works correctly
…resent (#84) Co-authored-by: Stu Hood <[email protected]>
- Add missing size_hint module declaration
- Remove test-only export serialize_and_load_u64_based_column_values
- Fix quickwit CI issues
Ticket(s) Closed
What
This PR adds validation to aggregation queries to ensure that field names exist in the schema and are configured as fast fields. Previously, aggregations would silently return empty results when given invalid field names, making it difficult to debug typos and configuration errors.
Why
Users were experiencing confusing behavior where aggregations with typos in field names would succeed but return empty results (`{"buckets": []}`), with no indication that the field didn't exist. This made it impossible to distinguish a genuinely empty result from a misspelled or misconfigured field. This behavior was inconsistent with other Tantivy query types (like `ExistsQuery`) and with user expectations from SQL databases and Elasticsearch, where invalid column/field references return clear errors.
How
Modified the aggregation field accessor functions in `accessor_helpers.rs`:

- `get_ff_reader()`: Now validates field existence before returning a column reader. Returns:
  - `FieldNotFound` error if the field doesn't exist in the schema
  - `SchemaError` if the field exists but isn't configured as a fast field
- `get_all_ff_reader_or_empty()`: Added the same validation logic for terms aggregations that handle multiple column types

The validation checks the schema to distinguish between non-existent fields and valid fields that happen to be empty in a particular segment, ensuring we only error on actual configuration problems.
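A hedged sketch of the validation described above, simplified relative to the real change in `accessor_helpers.rs` (which also resolves the column reader); the function name here is illustrative, but the error variants and schema calls are existing tantivy APIs.

```rust
use tantivy::schema::Schema;
use tantivy::TantivyError;

fn validate_agg_field(schema: &Schema, field_name: &str) -> Result<(), TantivyError> {
    // `FieldNotFound` if the name isn't in the schema at all.
    let field = schema.get_field(field_name)?;
    // `SchemaError` if the field exists but isn't a fast field.
    if !schema.get_field_entry(field).is_fast() {
        return Err(TantivyError::SchemaError(format!(
            "field `{field_name}` is not configured as a fast field"
        )));
    }
    Ok(())
}
```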
Tests
- Added `test_aggregation_invalid_field_returns_error`, covering all major aggregation types (date_histogram, histogram, terms, avg, range)

Breaking Change: Code using invalid field names in aggregations will now receive errors instead of empty results. This is intentional, to catch configuration errors early.