
Conversation

@itrofimow

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.


TLDR: ~2x speedup of the HNSW index by __builtin_prefetch-ing tensors

Hi!
I was benchmarking my Vespa setup the other day, and decided to get a flamegraph of what proton is doing:
[flamegraph image: trunk, cpu-cycles]
As expected, a lot of CPU cycles are being spent in the HNSW index doing HNSW things, one of them being binary_hamming_distance. I decided to give it a closer look and soon realized that in my setup (gcc 14.2, aarch64) gcc fails miserably to produce an unrolled/vectorized version of the code, and that it could probably be improved by either SimSIMD or hand-written intrinsics. Inspired, I implemented a version of binary_hamming_distance that beats the current code by ~1.8x in micro-benchmarks, deployed it, benchmarked... and saw no difference at all, none, zero.

That confused me, so I decided to give another function a look: get_num_subspaces_and_flag. Well, that's a one-instruction (three with asserts) function, huh? Could it be the overhead of it living in another TU and actually requiring a call? Well... unlikely. That confused me even more, and I went on to check whether my perf setup was working correctly.


At some point I realized that I have an HNSW index over ~1B tensors of tensor<int8>(d1[128]), which is a lot of memory with a very unpredictable access pattern, and that the one-instruction get_num_subspaces_and_flag function is actually a memory load. So I built a flamegraph for last-level-cache misses (perf record -e LLC-load-misses), and suddenly everything made sense:
[flamegraph image: trunk, LLC-load-misses]
As one can see, the flamegraphs for cpu-cycles and LLC-load-misses look basically the same for the HNSW index.

Looking closer at perf data for LLC I concluded that


The good thing about the HNSW memory access pattern is that, although it is very hard for the hardware to predict and prefetch, we know exactly what memory we will have to access: given a vertex in the graph, we check all of its neighbors. We can therefore rearrange how we walk the neighbors so that, with some hints to the hardware, there are far fewer misses (a rough sketch follows the list below):

  1. for every neighbor, prefetch the memory in TensorAttribute::_refVector
  2. for every neighbor, prefetch its tensor in TensorBufferOperations (which requires a load from TensorAttribute::_refVector, but hopefully at this point the memory would already be brought to caches)
  3. for every neighbor, do what we currently do (and hopefully the tensor memory would already be in caches)
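In rough C++ terms the rearranged walk looks something like this (a minimal sketch with hypothetical helper names, not the exact code in this patch: ref_vector stands for TensorAttribute::_refVector, tensor_data_ptr for the TensorBufferOperations address lookup, and consider_candidate for the existing HNSW logic):

void examine_neighbors(std::span<const uint32_t> neighbors) {
    for (uint32_t nbr : neighbors) {                  // 1. prefetch the refs
        __builtin_prefetch(&ref_vector[nbr]);
    }
    for (uint32_t nbr : neighbors) {                  // 2. prefetch the tensor data;
        __builtin_prefetch(tensor_data_ptr(ref_vector[nbr]));  //    reads the (hopefully cached) ref
    }
    for (uint32_t nbr : neighbors) {                  // 3. the usual distance work,
        consider_candidate(nbr);                      //    hopefully hitting warm caches
    }
}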

That's a lot of "hopefully", but that's just how prefetching hints work, and it turns out they work wonders: with the patch applied, the flamegraphs for proton look like this:
[flamegraph image: patch, cpu-cycles]

with the LLC-load-misses flamegraph looking almost the same as it did before, as expected (the LLC misses from prefetching are still there, but now they are asynchronous and don't stall us nearly as much):
[flamegraph image: patch, LLC-load-misses]

Comparing the trunk cpu-cycles flamegraph with the patched one, there are roughly 2x fewer cycles spent in the HNSW index, which amounts to ~10% of the total CPU cycles spent, and that matches the CPU usage/timings I observe when benchmarking.

All benchmarks were conducted at commit b60aa0d, with ~1B tensors of tensor<int8>(d1[128]), on AWS Graviton4.

@itrofimow
Author

Hi @boeker ! I see you've committed plenty to hnsw_index.cpp recently; would you be able to give this PR a look?

@vekterli
Member

Thanks for the detailed and very interesting writeup! We'll get to reviewing this as soon as time permits.

As expected, a lot of CPU cycles are being spent in the HNSW index doing HNSW things, one of them being binary_hamming_distance. I decided to give it a closer look and soon realized that in my setup (gcc 14.2, aarch64) gcc fails miserably to produce an unrolled/vectorized version of the code, and that it could probably be improved by either SimSIMD or hand-written intrinsics. Inspired, I implemented a version of binary_hamming_distance that beats the current code by ~1.8x in micro-benchmarks, deployed it, benchmarked... and saw no difference at all, none, zero.

I was inspired by your inspiration 🙂 and decided to implement an explicitly vectorized binary hamming distance function (via Highway) in #35073.

On NEON it beats the auto-vectorized code by ~1.6x on 128-byte vectors and ~2.1x for 8192-byte vectors. Would be very interested in hearing what vector length 1.8x was observed on, and your approach for getting there.

On SVE/SVE2 I get a ~2.1x speedup for 128 bytes. For 8192 bytes SVE(2) beats the auto-vectorized code by ~3x.

The difference on x64 AVX3-DL (AVX-512 + VPOPCNT and friends) is less pronounced for short vectors; ~1.2x for 128 bytes, but ~3.2x for 8192 (tested on a Sapphire Rapids system).

Note: these vector kernels are not yet enabled by default—they will be soon.

Benchmarked on an AWS Graviton 4 node using benchmark functionality added as part of #35073:

$ ~/git/vespa/vespalib/src/tests/hwaccelerated/vespalib_hwaccelerated_bench_app --benchmark_filter='Hamming'
Run on (16 X 2000 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x16)
  L1 Instruction 64 KiB (x16)
  L2 Unified 2048 KiB (x16)
  L3 Unified 36864 KiB (x1)
Load Average: 5.98, 1.51, 0.51
-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------
Binary Hamming Distance/uint8/Highway/SVE2_128/8                   2.14 ns         2.14 ns    326275775 bytes_per_second=6.94765Gi/s
Binary Hamming Distance/uint8/Highway/SVE2_128/16                  1.97 ns         1.97 ns    355847438 bytes_per_second=15.117Gi/s
Binary Hamming Distance/uint8/Highway/SVE2_128/32                  2.50 ns         2.50 ns    279695679 bytes_per_second=23.8174Gi/s
Binary Hamming Distance/uint8/Highway/SVE2_128/64                  3.22 ns         3.22 ns    217195340 bytes_per_second=37.0362Gi/s
Binary Hamming Distance/uint8/Highway/SVE2_128/128                 4.65 ns         4.65 ns    150592830 bytes_per_second=51.2758Gi/s
Binary Hamming Distance/uint8/Highway/SVE2_128/256                 7.73 ns         7.73 ns     91102525 bytes_per_second=61.6708Gi/s
Binary Hamming Distance/uint8/Highway/SVE2_128/512                 14.0 ns         14.0 ns     50044748 bytes_per_second=68.1854Gi/s
Binary Hamming Distance/uint8/Highway/SVE2_128/1024                25.7 ns         25.7 ns     27278905 bytes_per_second=74.3529Gi/s
Binary Hamming Distance/uint8/Highway/SVE2_128/2048                48.7 ns         48.7 ns     14377607 bytes_per_second=78.3409Gi/s
Binary Hamming Distance/uint8/Highway/SVE2_128/4096                94.6 ns         94.6 ns      7398785 bytes_per_second=80.6318Gi/s
Binary Hamming Distance/uint8/Highway/SVE2_128/8192                 187 ns          186 ns      3753357 bytes_per_second=81.8166Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/8                       2.14 ns         2.14 ns    326356462 bytes_per_second=6.94836Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/16                      2.14 ns         2.14 ns    326412702 bytes_per_second=13.8968Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/32                      2.50 ns         2.50 ns    279761490 bytes_per_second=23.8235Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/64                      3.04 ns         3.04 ns    230346183 bytes_per_second=39.2319Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/128                     4.42 ns         4.42 ns    158180823 bytes_per_second=53.9691Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/256                     7.57 ns         7.57 ns     92433529 bytes_per_second=62.9942Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/512                     13.8 ns         13.8 ns     50581024 bytes_per_second=69.0917Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/1024                    25.9 ns         25.9 ns     27067123 bytes_per_second=73.6499Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/2048                    49.5 ns         49.5 ns     14141836 bytes_per_second=77.0653Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/4096                    96.3 ns         96.3 ns      7274359 bytes_per_second=79.2639Gi/s
Binary Hamming Distance/uint8/Highway/SVE2/8192                     190 ns          190 ns      3683384 bytes_per_second=80.3518Gi/s
Binary Hamming Distance/uint8/Highway/SVE/8                        2.14 ns         2.14 ns    326384690 bytes_per_second=6.94818Gi/s
Binary Hamming Distance/uint8/Highway/SVE/16                       2.14 ns         2.14 ns    326384429 bytes_per_second=13.8954Gi/s
Binary Hamming Distance/uint8/Highway/SVE/32                       2.50 ns         2.50 ns    279750080 bytes_per_second=23.822Gi/s
Binary Hamming Distance/uint8/Highway/SVE/64                       3.04 ns         3.04 ns    230365430 bytes_per_second=39.2339Gi/s
Binary Hamming Distance/uint8/Highway/SVE/128                      4.42 ns         4.42 ns    158403845 bytes_per_second=53.954Gi/s
Binary Hamming Distance/uint8/Highway/SVE/256                      7.55 ns         7.55 ns     92363289 bytes_per_second=63.1823Gi/s
Binary Hamming Distance/uint8/Highway/SVE/512                      13.8 ns         13.8 ns     50417105 bytes_per_second=68.8678Gi/s
Binary Hamming Distance/uint8/Highway/SVE/1024                     25.9 ns         25.9 ns     27007504 bytes_per_second=73.7061Gi/s
Binary Hamming Distance/uint8/Highway/SVE/2048                     49.6 ns         49.6 ns     14171277 bytes_per_second=76.9504Gi/s
Binary Hamming Distance/uint8/Highway/SVE/4096                     96.2 ns         96.2 ns      7273306 bytes_per_second=79.3231Gi/s
Binary Hamming Distance/uint8/Highway/SVE/8192                      190 ns          190 ns      3686645 bytes_per_second=80.3354Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/8                  4.29 ns         4.29 ns    163193312 bytes_per_second=3.47402Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/16                 1.88 ns         1.88 ns    373025813 bytes_per_second=15.8813Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/32                 2.38 ns         2.38 ns    293670870 bytes_per_second=24.9967Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/64                 3.67 ns         3.67 ns    190136479 bytes_per_second=32.4591Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/128                5.75 ns         5.75 ns    121426875 bytes_per_second=41.4332Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/256                10.1 ns         10.1 ns     68900422 bytes_per_second=47.1435Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/512                19.4 ns         19.4 ns     36156255 bytes_per_second=49.2574Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/1024               36.0 ns         36.0 ns     19467269 bytes_per_second=53.0508Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/2048               69.3 ns         69.3 ns     10093008 bytes_per_second=55.0118Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/4096                136 ns          136 ns      5137968 bytes_per_second=56.0133Gi/s
Binary Hamming Distance/uint8/Highway/NEON_BF16/8192                270 ns          270 ns      2592217 bytes_per_second=56.5077Gi/s
Binary Hamming Distance/uint8/Highway/NEON/8                       4.29 ns         4.29 ns    163189935 bytes_per_second=3.47411Gi/s
Binary Hamming Distance/uint8/Highway/NEON/16                      1.88 ns         1.88 ns    372998663 bytes_per_second=15.8808Gi/s
Binary Hamming Distance/uint8/Highway/NEON/32                      2.39 ns         2.39 ns    293665980 bytes_per_second=24.9902Gi/s
Binary Hamming Distance/uint8/Highway/NEON/64                      3.68 ns         3.68 ns    190686133 bytes_per_second=32.4209Gi/s
Binary Hamming Distance/uint8/Highway/NEON/128                     5.74 ns         5.74 ns    120971837 bytes_per_second=41.5435Gi/s
Binary Hamming Distance/uint8/Highway/NEON/256                     10.1 ns         10.1 ns     69245435 bytes_per_second=47.1529Gi/s
Binary Hamming Distance/uint8/Highway/NEON/512                     19.4 ns         19.4 ns     36102549 bytes_per_second=49.2254Gi/s
Binary Hamming Distance/uint8/Highway/NEON/1024                    35.9 ns         35.9 ns     19482024 bytes_per_second=53.0748Gi/s
Binary Hamming Distance/uint8/Highway/NEON/2048                    69.3 ns         69.3 ns     10092005 bytes_per_second=55.0084Gi/s
Binary Hamming Distance/uint8/Highway/NEON/4096                     136 ns          136 ns      5138878 bytes_per_second=56.0072Gi/s
Binary Hamming Distance/uint8/Highway/NEON/8192                     270 ns          270 ns      2591820 bytes_per_second=56.4957Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/8                       2.50 ns         2.50 ns    279773451 bytes_per_second=5.9554Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/16                      2.86 ns         2.86 ns    244788992 bytes_per_second=10.4222Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/32                      3.57 ns         3.57 ns    195839495 bytes_per_second=16.6749Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/64                      5.45 ns         5.45 ns    128349545 bytes_per_second=21.8573Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/128                     9.46 ns         9.46 ns     74011213 bytes_per_second=25.2011Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/256                     18.6 ns         18.6 ns     37623082 bytes_per_second=25.6614Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/512                     34.3 ns         34.3 ns     20427657 bytes_per_second=27.8403Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/1024                    69.4 ns         69.4 ns     10008466 bytes_per_second=27.4843Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/2048                     143 ns          143 ns      4884483 bytes_per_second=26.6131Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/4096                     284 ns          284 ns      2461372 bytes_per_second=26.8325Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON/8192                     566 ns          566 ns      1236663 bytes_per_second=26.9668Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/8          2.50 ns         2.50 ns    279738648 bytes_per_second=5.95454Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/16         2.86 ns         2.86 ns    244788349 bytes_per_second=10.4215Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/32         3.57 ns         3.57 ns    195825366 bytes_per_second=16.6747Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/64         5.45 ns         5.45 ns    128316425 bytes_per_second=21.8551Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/128        9.46 ns         9.46 ns     73975734 bytes_per_second=25.2099Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/256        18.6 ns         18.6 ns     37632029 bytes_per_second=25.6462Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/512        34.2 ns         34.2 ns     20457758 bytes_per_second=27.8521Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/1024       72.6 ns         72.6 ns      9990263 bytes_per_second=26.2759Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/2048        143 ns          143 ns      4885825 bytes_per_second=26.6159Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/4096        284 ns          284 ns      2461872 bytes_per_second=26.8367Gi/s
Binary Hamming Distance/uint8/AutoVec/NEON_FP16_DOTPROD/8192        566 ns          566 ns      1237442 bytes_per_second=26.9762Gi/s

@itrofimow
Author

Cool!

On NEON it beats the auto-vectorized code by ~1.6x on 128-byte vectors and ~2.1x for 8192-byte vectors. Would be very interested in hearing what vector length 1.8x was observed on, and your approach for getting there.

I think it was a single-file gbench with a copy-pasted binary_hamming_distance vs the SimSIMD implementation on 128-byte vectors, but when I tried to incorporate SimSIMD into the IAccelerated framework, gcc failed to unroll the SimSIMD implementation (¯\_(ツ)_/¯), and my hand-written intrinsics gave about the same ~1.6x speedup in vespalib_hwaccelerated_bench_app.

I think the difference between 1.8x and 1.6x has to do with jumping through the IAccelerated hoops, but I didn't dig any further and decided to first investigate the performance difference (or rather the absence thereof) in macro-benchmarks, hence this PR.
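For reference, a minimal sketch of that kind of hand-written kernel (illustrative only; this is neither my exact code nor the Highway version in #35073, and it assumes the length is a multiple of 16):

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Binary Hamming distance over byte buffers using NEON per-byte popcounts.
uint64_t hamming_neon(const uint8_t* a, const uint8_t* b, size_t len) {
    uint64_t total = 0;
    for (size_t i = 0; i < len; i += 16) {
        uint8x16_t diff = veorq_u8(vld1q_u8(a + i), vld1q_u8(b + i)); // differing bits
        total += vaddlvq_u8(vcntq_u8(diff)); // per-byte popcount, then horizontal sum
    }
    return total;
}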

@itrofimow
Author

Hi @vekterli ! Did you have a chance to give this PR a closer look?

@vekterli
Member

Sorry about the delay, we've been rather busy 😓 Hoping we'll get around to it this week.

My gut feeling is that we may want to template the search layer functions (filtered and unfiltered) on some kind of prefetching policy and then have a runtime decision on whether we want to actually prefetch anything, based on config and/or the size of the graph. In my anecdotal experience, explicitly prefetching often makes the performance go down if enough stuff is already present in the cache hierarchy, which may be the case for small graphs. But at some point the curves will intersect and prefetching should present an increasingly visible gain.

It could also be an interesting experiment to see if prefetching only into caches beyond L1 would be beneficial. L1D is comparatively tiny, so when prefetching many vectors we may (emphasis: may—needs benchmarking!) risk evicting useful stuff from it that we'll end up needing before we actually get around to using the vectors themselves.
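Something along these lines, just to illustrate the shape of it (hypothetical names, not a concrete proposal for hnsw_index.cpp; the third argument to __builtin_prefetch is the temporal-locality hint, where lower values suggest the data need not be kept in the innermost cache levels):

struct NoPrefetch {
    static void prefetch(const void*) noexcept {}
};
struct PrefetchBeyondL1 {
    static void prefetch(const void* p) noexcept { __builtin_prefetch(p, 0, 1); }
};

template <typename Prefetch>
void search_layer_helper(/* graph, query, candidates, ... */) {
    // for each neighbor: Prefetch::prefetch(address_of_neighbor_tensor);
    // ... existing search logic ...
}

void search_layer(bool want_prefetch /* from config and/or graph size */) {
    if (want_prefetch) {
        search_layer_helper<PrefetchBeyondL1>();
    } else {
        search_layer_helper<NoPrefetch>();
    }
}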

@itrofimow
Author

I mostly agree, and I happen to share the same anecdotal experience.


template the search layer functions (filtered and unfiltered) on some kind of prefetching policy and then have a runtime decision on whether we want to actually prefetch anything, based on config and/or the size of the graph.

Sounds very reasonable to me, although I believe that prefetching the TensorAttribute::_refVector entries could very well always be beneficial. Prefetching the actual tensors is definitely another story, as the data could easily not fit into L1 and completely thrash it as well.


The prefetching policy you mentioned: I assume it has to make a run-time decision based on:

  • number of links
  • tensor size
  • L1d cache size
  • some config value

Is that right?

@vekterli vekterli self-requested a review November 5, 2025 11:36
@vekterli vekterli self-assigned this Nov 5, 2025
@vekterli
Member

vekterli commented Nov 6, 2025

The prefetching policy you mentioned: I assume it has to make a run-time decision based on:

  • number of links
  • tensor size
  • L1d cache size
  • some config value

Is that right?

To avoid falling for the delicious temptation to make a complex policy I think that an initial implementation should probably be one where the prefetching decision is made by the query and/or the rank profile rather than being deduced by the code. This lets us easily do performance testing with/without prefetching for various scenarios without having to recompile or reconfigure the entire system.

One question regarding the diff: it adds prefetching to SerializedFastValueAttribute, which I would only really expect to see used when you have sparse dimensions (i.e. multiple subspaces). Is the int8 dense tensor contained within a sparse outer tensor somehow? Dense tensors are usually kept in a DenseTensorAttribute optimized for this purpose, which does not support multiple subspaces. This should also make prefetching easier, since accessing a tensor does not entail reading a header in memory to route to the correct subspace.

Member

@vekterli vekterli left a comment


Sorry for the delay, lots of stuff popping up... 👀

As I alluded to in the TensorBufferStore code, I have a concern that we risk polluting the caches with the current approach when there are many subspaces. Ideally we should start out with tensors stored in DenseTensorStore to avoid this risk, since dense tensors do not have multiple subspaces.

Re: my earlier comment/question, it is not entirely clear to me why your seemingly dense tensors ended up instantiating a SerializedFastValueAttribute, so that would be good to figure out first.

@itrofimow
Author

Sorry for the late reply, I also got consumed by other things.

Regarding the tensor type: it is actually defined as tensor<int8>(d0{},d1[128]); I omitted the d0{} part because I thought it was not significant. My bad, sorry for the confusion.

I am also not sure that prefetching the whole tensor data is a good thing to do, due to exactly the same concerns you outlined above. Initially I tried prefetching just the first 4 cache lines, but that felt arbitrary, so I ended up with the current approach, with no noticeable difference between the two.
Regarding prefetching the tensor header first and then prefetching the needed subspace: I'm not sure there would be enough time for the memory subsystem to actually bring the header in before it's accessed in an attempt to prefetch the subspace.

Unfortunately, I won't be able to easily measure this change with dense tensors, as there aren't any in my setup.
Following up on the size of the sparse tensor prefetch: I think it could be okay to do the full prefetch as long as prefetching is allowed by the policy we discussed above. What do you think?

@itrofimow itrofimow force-pushed the hnsw_index_prefetch_tensors branch from caaa801 to dd22a05 Compare November 30, 2025 00:46
@itrofimow
Author

I've addressed your inline comments and implemented a simple on/off policy for the prefetching.

Please forgive the force-push: I'm upstreaming these changes from a fork that lags considerably behind, and rebasing on top of the current master turned out to be non-trivial due to recent changes.

@itrofimow itrofimow requested a review from vekterli November 30, 2025 00:58
@itrofimow
Author

Also, I've got some follow-up work that does prefetching in ranking as well (when accessing attribute values, tensors, etc.), and it shows improvements in my specific setup, so we could probably rename the prefetch-tensors thing into something more generic that could also be reused to guard prefetching in ranking.

@boeker
Contributor

boeker commented Dec 5, 2025

@itrofimow I made some changes in the HNSW index that now cause a merge conflict. I am cleaning up these changes right now, which should resolve the merge conflict, so please don't try to fix the merge conflict yourself right now. 😇

Edit: Done!

@itrofimow
Author

Thanks @boeker ! Basically, me:

[image]

@itrofimow
Author

Hi @vekterli ! Could you please give this PR another look?

Member

@vekterli vekterli left a comment


I think this looks good, just some minor nitpicks.

(Would also be great if @arnej27959 or @toregge could have a quick look at the code that touches the tensor/data stores 👀)

We've had an internal discussion regarding the added rank profile property, and the consensus is that it makes sense to add it as-is. However, we most likely want to do some additional work around this before we feel it's ready to be an officially documented feature. Off the top of my head:

  • We probably want a client query property that mirrors the rank profile property which can override prefetching on a per query basis.
  • Prefetching of sparse tensors should be subspace-aware.
  • We should unify prefetching between search_layer_helper and search_layer_filter_first_helper.
  • Since updates to the graph also involve a search, we should look into if/how we can expose feed-time prefetching as well, not just query-time.
  • We should set up and do our own large-scale (100M-1B) HNSW index experiments so that we have an environment where we can measure the impact of toggling prefetching.
  • We need to understand how prefetching interacts with paged tensors, both when tensors fit in the buffer cache and when the combined tensor footprint is >> main memory size.

Ideally we'd get this merged very soon to start getting perf test results (this will also tell us if we need to make the neighbor prefetching conditional for small graphs), but I have a heap of unspent vacation days that have accumulated and are bursting at the seams, so I can't really follow this up until after New Year's 😅 🎅 Someone else™ will need to follow it up until then.

void set_exploration_slack(double v) { _exploration_slack = v; }
double get_exploration_slack() const { return _exploration_slack; }
void set_prefetch_tensors(bool v) { _prefetch_tensors = v; };
bool get_perfetch_tensors() const { return _prefetch_tensors; }
Member


Typo nit: perfetch -> prefetch

* before actually accessing the tensors for distance calculation, which could drastically reduce latencies,
* but could also completely trash the memory caches and make things considerably worse.
*
* Benchmakring the specific setup is recommended before enabling.
Member


Typo nit: Benchmakring -> Benchmarking

virtual vespalib::eval::TypedCells get_vector(uint32_t docid, uint32_t subspace) const noexcept = 0;
virtual VectorBundle get_vectors(uint32_t docid) const noexcept = 0;

virtual void prefetch_docid(uint32_t) const noexcept {}
Member


I think it would be good to add a code comment on what the difference is between these two, especially since the other interface functions retrieve vectors from doc ids (i.e. the two concepts are interlinked) whereas here they are separate.

virtual VectorBundle get_vectors(uint32_t docid) const noexcept = 0;

virtual void prefetch_docid(uint32_t) const noexcept {}
virtual void prefetch_vector(uint32_t) const noexcept {}
Member


Consider adding names to these parameters since their semantics can't necessarily be inferred from the generic uint32_t type alone (will need a [[maybe_unused]] or a (void)param_name)
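For instance, one possible shape (a sketch of the suggested change):

virtual void prefetch_docid(uint32_t docid) const noexcept { (void)docid; }
virtual void prefetch_vector(uint32_t docid) const noexcept { (void)docid; }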

void prefetch_vector(const DocVectorAccess& vectors, uint32_t docid)
{
vectors.prefetch_vector(docid);
}
Member


Should these just be vectors.prefetch_... directly in the caller instead of going via forwarding freestanding functions?

@itrofimow
Copy link
Author

itrofimow commented Dec 12, 2025

We probably want a client query property that mirrors the rank profile property which can override prefetching on a per query basis.

I agree with this; it could probably be added later in another PR, or do you want me to add it in this one?

We should unify prefetching between search_layer_helper and search_layer_filter_first_helper.

Makes sense to me. I didn't do that right away because it wasn't my use case, but I believe it's easily doable. Should I do it in this PR or later?

Prefetching of sparse tensors should be subspace-aware.

This is actually not at all straightforward, because in order to calculate the subspace's memory location we have to access the tensor buffer, which itself leads to LLC misses if not prefetched; we could prefetch the header for all the neighbors and then access it, but there likely wouldn't be enough time for the prefetch to actually bring the memory into the caches. Maybe some smart reordering of the prefetching could help with that.
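One reordering I have in mind (a hypothetical, untested sketch; prefetch_tensor_header and prefetch_subspace are made-up helpers) would software-pipeline the header prefetches so the memory system gets a few iterations of slack before each header is actually read:

constexpr size_t kDistance = 4;  // prefetch distance, would need tuning
for (size_t i = 0; i < neighbors.size(); ++i) {
    if (i + kDistance < neighbors.size()) {
        prefetch_tensor_header(neighbors[i + kDistance]);  // hint issued kDistance steps ahead
    }
    prefetch_subspace(neighbors[i]);  // reads the (hopefully cached) header to locate the subspace
}
// ...followed by the existing distance calculations over the same neighbors.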

Since updates to the graph also involves a search, we should look into if/how we can expose feed-time prefetching as well, not just query-time.

Thinking about it more, shouldn't (or couldn't) this be a property of the index itself? An index property, which could be overridden by the rank profile, which could in turn be overridden by a query property.

We need to understand how prefetching interacts with paged tensors, both when tensors fit in the buffer cache and when the combined tensor footprint is >> main memory size.

To the best of my knowledge, prefetching shouldn't generate page faults, meaning it won't help much for paged tensors unless they are already faulted in. "Shouldn't", however, doesn't mean it doesn't, so...


Also, before this gets in (if ever), I was thinking about changing the schema for the property into something like:

prefetching:
    ann: true|false
    ranking: true|false (in my use-case prefetching attributes/tensors in ranking loop is profitable)
    ... potentially some other places (disk indexes come to mind): true|false

What do you think?

@itrofimow itrofimow changed the title [searchlib] NFC: prefetch tensors in HNSW index search [searchlib] add an option to prefetch tensors in HNSW index search Dec 12, 2025