[searchlib] add an option to prefetch tensors in HNSW index search #35057
Conversation
Hi @boeker! I see you've committed plenty into hnsw_index.cpp recently; would you be able to give this PR a look?
Thanks for the detailed and very interesting writeup! We'll get to reviewing this as soon as time permits.

I was inspired by your inspiration 🙂 and decided to implement an explicitly vectorized binary hamming distance function (via Highway) in #35073. On NEON it beats the auto-vectorized code by ~1.6x on 128-byte vectors and ~2.1x on 8192-byte vectors. Would be very interested in hearing what vector length the 1.8x was observed on, and your approach for getting there. On SVE/SVE2 I get a ~2.1x speedup for 128 bytes; for 8192 bytes, SVE(2) beats the auto-vectorized code by ~3x. Difference on x64 AVX3-DL (AVX-512 + ...) ...

Note: these vector kernels are not yet enabled by default; they will be soon.

Benchmarked on an AWS Graviton 4 node using benchmark functionality added as part of #35073:
Cool!
I think it was a single-file gbench with copy-pasted ... I think the difference between x1.8 and x1.6 has to do with jumping through ...
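For context, a single-file gbench of that kind might look roughly like the sketch below; the scalar kernel and every name in it are illustrative stand-ins written for this comment, not the actual vespalib `binary_hamming_distance`.

```cpp
// hamming_bench.cpp - illustrative single-file Google Benchmark harness.
// Build (assumption): g++ -O3 hamming_bench.cpp -lbenchmark -lpthread
#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

// Scalar stand-in: XOR + popcount over 64-bit words, tail handled bytewise.
static size_t hamming(const uint8_t* a, const uint8_t* b, size_t sz) {
    size_t sum = 0;
    size_t i = 0;
    for (; i + 8 <= sz; i += 8) {
        uint64_t wa;
        uint64_t wb;
        std::memcpy(&wa, a + i, 8);
        std::memcpy(&wb, b + i, 8);
        sum += __builtin_popcountll(wa ^ wb);
    }
    for (; i < sz; ++i) {
        sum += __builtin_popcount(static_cast<unsigned>(a[i] ^ b[i]));
    }
    return sum;
}

static void BM_Hamming(benchmark::State& state) {
    const size_t sz = static_cast<size_t>(state.range(0));
    std::vector<uint8_t> a(sz);
    std::vector<uint8_t> b(sz);
    std::mt19937 rng(42);
    for (size_t i = 0; i < sz; ++i) {
        a[i] = static_cast<uint8_t>(rng());
        b[i] = static_cast<uint8_t>(rng());
    }
    for (auto _ : state) {
        benchmark::DoNotOptimize(hamming(a.data(), b.data(), sz));
    }
    state.SetBytesProcessed(int64_t(state.iterations()) * int64_t(sz) * 2);
}
// 128 bytes matches the tensor<int8>(d1[128]) case from the PR description.
BENCHMARK(BM_Hamming)->Arg(128)->Arg(8192);

BENCHMARK_MAIN();
```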
Hi @vekterli! Did you have a chance to give this PR a closer look?
Sorry about the delay, we've been rather busy 😓 Hoping we'll get around to it this week.

My gut feeling is that we may want to template the search layer functions (filtered and unfiltered) on some kind of prefetching policy and then have a runtime decision on whether we want to actually prefetch anything, based on config and/or the size of the graph. In my anecdotal experience, explicit prefetching often makes performance go down if enough stuff is already present in the cache hierarchy, which may be the case for small graphs. But at some point the curves will intersect and prefetching should present an increasingly visible gain.

Could also be an interesting experiment to see if prefetching into only caches > L1 would be beneficial. L1D is comparatively tiny, so when prefetching many vectors we may (emphasis: may, needs benchmarking!) risk evicting useful stuff from it that we'll end up needing before we actually get around to using the vectors themselves.
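To make the idea concrete, a rough sketch of such a policy-templated helper with a runtime dispatch could look like the following; all names here are illustrative and not the actual hnsw_index.cpp code.

```cpp
#include <cstdint>
#include <vector>

// No-op policy: compiles away entirely when prefetching is disabled.
struct NoPrefetch {
    static void prefetch(const void*) noexcept {}
};

// Locality hint 2 targets caches beyond L1D, which is roughly what the
// "prefetch into only caches > L1" experiment would correspond to.
struct PrefetchOuterCaches {
    static void prefetch(const void* p) noexcept { __builtin_prefetch(p, 0 /*read*/, 2 /*locality*/); }
};

template <typename Prefetch>
uint64_t search_layer_helper(const std::vector<const uint8_t*>& neighbor_vectors) {
    // Hint the upcoming loads before the distance loop touches them.
    for (const uint8_t* vec : neighbor_vectors) {
        Prefetch::prefetch(vec);
    }
    uint64_t checksum = 0;
    for (const uint8_t* vec : neighbor_vectors) {
        checksum += vec[0]; // stand-in for the actual distance calculation
    }
    return checksum;
}

// Runtime decision (config and/or graph size), compile-time policy.
uint64_t search_layer(const std::vector<const uint8_t*>& neighbor_vectors, bool prefetch_enabled) {
    return prefetch_enabled ? search_layer_helper<PrefetchOuterCaches>(neighbor_vectors)
                            : search_layer_helper<NoPrefetch>(neighbor_vectors);
}
```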
I mostly agree, and I happen to share the same anecdotal experience.

Sounds very reasonable to me, although I believe that prefetching in ... The prefetching policy you mentioned: I assume it has to make a run-time decision based on a ..., right?
To avoid falling for the delicious temptation to make a complex policy, I think that an initial implementation should probably be one where the prefetching decision is made by the query and/or the rank profile rather than being deduced by the code. This lets us easily do performance testing with/without prefetching for various scenarios without having to recompile or reconfigure the entire system.

One question regarding the diff: it adds prefetching to ...
vekterli left a comment
Sorry for the delay, lots of stuff popping up... 👀
As I alluded to in the TensorBufferStore code, I have a concern that we risk polluting the caches with the current approach when there are many subspaces. Ideally we should start out with tensors stored in DenseTensorStore to avoid this risk, since dense tensors do not have multiple subspaces.
Re: my earlier comment/question, it is not entirely clear to me why your seemingly dense tensors ended up instantiating a SerializedFastValueAttribute, so that would be good to figure out first.
(Resolved inline comment thread on searchlib/src/vespa/searchlib/tensor/serialized_fast_value_attribute.cpp, now outdated.)
Sorry for the late reply, I also got consumed by other things.

Regarding the tensor type: it actually is defined as ...

I am also not sure that prefetching the whole tensor data is a good thing to do, due to exactly the same concerns you outlined above. Initially I tried prefetching just the first 4 cache lines, but that felt arbitrary, so I ended up with the current approach, with no noticeable difference. Unfortunately, I won't be able to easily measure this change with dense tensors, as there aren't any in my setup.
(Force-pushed from caaa801 to dd22a05.)
I've addressed your inline comments and implemented a simple on/off policy for the prefetching. Please forgive the force-push: I'm upstreaming these changes from a fork that lags considerably behind, and rebasing on top of current master turned out to be non-trivial due to recent changes.
Also, I've got some follow-up work which does prefetching in ranking as well (when accessing attribute values, tensors, etc.), and it shows improvements in my specific setup, so probably we could rename the ...
@itrofimow I made some changes in the HNSW index that now cause a merge conflict. I am cleaning up these changes right now, which should resolve the conflict, so please don't try to fix it yourself right now. 😇

Edit: Done!
Thanks @boeker! Basically, me:
Hi @vekterli! Could you please give this PR another look?
I think this looks good, just some minor nitpicks.
(Would also be great if @arnej27959 or @toregge could have a quick look at the code that touches the tensor/data stores 👀)
We've had an internal discussion regarding the added rank profile property and consensus is that it makes sense to add as-is. However, we most likely want to do some additional work around this before we feel it's ready to be an officially documented feature. Off the top of my head:
- We probably want a client query property that mirrors the rank profile property which can override prefetching on a per query basis.
- Prefetching of sparse tensors should be subspace-aware.
- We should unify prefetching between `search_layer_helper` and `search_layer_filter_first_helper`.
- Since updates to the graph also involve a search, we should look into if/how we can expose feed-time prefetching as well, not just query-time.
- We should set up and do our own large-scale (100M-1B) HNSW index experiments so that we have an environment where we can measure the impact of toggling prefetching.
- We need to understand how prefetching interacts with paged tensors, both when tensors fit in the buffer cache and when the combined tensor footprint is >> main memory size.
Ideally we'd get this merged very soon to start getting perf test results (this will also tell us if we need to make the neighbor prefetching conditional for small graphs), but I have a heap of unspent vacation days that have accumulated and are bursting at the seams, so I can't really follow this up until after New Year's 😅 🎅 Someone else™ would need to follow this up until then.
```cpp
void set_exploration_slack(double v) { _exploration_slack = v; }
double get_exploration_slack() const { return _exploration_slack; }
void set_prefetch_tensors(bool v) { _prefetch_tensors = v; };
bool get_perfetch_tensors() const { return _prefetch_tensors; }
```
Typo nit: perfetch -> prefetch
```cpp
 * before actually accessing the tensors for distance calculation, which could drastically reduce latencies,
 * but could also completely trash the memory caches and make things considerably worse.
 *
 * Benchmakring the specific setup is recommended before enabling.
```
Typo nit: Benchmakring -> Benchmarking
```cpp
virtual vespalib::eval::TypedCells get_vector(uint32_t docid, uint32_t subspace) const noexcept = 0;
virtual VectorBundle get_vectors(uint32_t docid) const noexcept = 0;

virtual void prefetch_docid(uint32_t) const noexcept {}
```
I think it would be good to add a code comment on what the difference is between these two, especially since the other interface functions retrieve vectors from doc ids (i.e. the two concepts are interlinked) whereas here they are separate.
```cpp
virtual VectorBundle get_vectors(uint32_t docid) const noexcept = 0;

virtual void prefetch_docid(uint32_t) const noexcept {}
virtual void prefetch_vector(uint32_t) const noexcept {}
```
Consider adding names to these parameters since their semantics can't necessarily be inferred from the generic uint32_t type alone (will need a [[maybe_unused]] or a (void)param_name)
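Something along these lines, perhaps (the wrapper type and comment wording below are illustrative, just to show the shape of the suggestion):

```cpp
#include <cstdint>

// Sketch only: a stand-in for the real interface, illustrating named
// parameters plus a comment separating the two prefetch hooks.
struct DocVectorAccessSketch {
    virtual ~DocVectorAccessSketch() = default;

    /**
     * Best-effort cache hints; both are no-ops by default.
     * prefetch_docid() hints the per-document bookkeeping needed to locate the
     * vector data (e.g. the tensor ref), while prefetch_vector() hints the
     * vector data itself.
     */
    virtual void prefetch_docid([[maybe_unused]] uint32_t docid) const noexcept {}
    virtual void prefetch_vector([[maybe_unused]] uint32_t docid) const noexcept {}
};
```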
```cpp
void prefetch_vector(const DocVectorAccess& vectors, uint32_t docid)
{
    vectors.prefetch_vector(docid);
}
```
Should these just be vectors.prefetch_... directly in the caller instead of going via forwarding freestanding functions?
I agree with this; probably it could be added later in another PR, or do you want me to add it in this PR?
Makes sense to me. I didn't do that right away because that wasn't my use case, but I believe it's easily doable. Should I do it in this PR or later?
This is actually not at all straightforward, because in order to calculate the subspace memory location we have to access the tensor buffer, which leads to LLC misses if it is not prefetched; we could prefetch the header for all the neighbors and then access it, but there likely wouldn't be enough time for the prefetch to actually bring the memory into the caches. Maybe some smart reordering of the prefetching could help with that.
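In sketch form, that reordering could look something like this; the types and the header layout below are made up for illustration, not the real TensorBufferOperations format:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-neighbor view of a serialized (mapped) tensor.
struct NeighborTensor {
    const uint8_t* header;    // start of the buffer: subspace count lives here
    const uint8_t* subspaces; // resolved only after the header has been read
};

// Phase 1: hint all headers. Phase 2: read them (hopefully now cached) to
// resolve subspace locations and hint those, before any distance calculation.
inline void prefetch_subspace_aware(std::vector<NeighborTensor>& neighbors) {
    for (const NeighborTensor& n : neighbors) {
        __builtin_prefetch(n.header, 0, 3);
    }
    for (NeighborTensor& n : neighbors) {
        uint32_t num_subspaces = 0;
        std::memcpy(&num_subspaces, n.header, sizeof(num_subspaces)); // the load that used to miss LLC
        n.subspaces = n.header + 8; // assumed (made-up) header size
        for (uint32_t s = 0; s < num_subspaces; ++s) {
            __builtin_prefetch(n.subspaces + s * 128, 0, 3); // assumed 128-byte subspaces
        }
    }
}
```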
Thinking more about it, shouldn't (couldn't) this be a property of the index itself? An index property, which could be overridden by a rank-profile property, which could be overridden by a query property.
To the best of my knowledge, prefetching shouldn't generate page faults, meaning it won't help much for paged tensors unless they are already faulted in. "Shouldn't", however, doesn't mean it doesn't, so...

Also, before this gets in (if ever), I was thinking about changing the schema for the property into something like ... What do you think?
I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.
TLDR: ~x2 speedup of the HNSW index by `__builtin_prefetch`-ing tensors.

Hi!
I was benchmarking my Vespa setup the other day, and decided to get a flamegraph of what proton is doing:

[flamegraph: cpu-cycles, trunk]

As expected, a lot of CPU cycles are being spent in the HNSW index doing HNSW things, one of them being `binary_hamming_distance`. I decided to give it a closer look and soon realized that in my setup (gcc 14.2, aarch64) gcc fails miserably to produce an unrolled/vectorized version of the code, and that it could probably be improved by either SimSIMD or hand-written intrinsics. Inspired, I implemented a version of `binary_hamming_distance` that beats the current code by ~x1.8 in micro-benchmarks, deployed it, benchmarked... and saw no difference at all, none, zero.

That confused me, so I decided to give another function a look: `get_num_subspaces_and_flag`. Well, that's a one-instruction (3 with asserts) function, huh? Could it be the overhead of it being in another TU and actually requiring a `call`? Well... unlikely. That confused me even more, and I went on to check if my perf was working OK.

At some point I realized that I have an HNSW index over ~1B of `tensor<int8>(d1[128])`, which is a lot of memory with a very unpredictable access pattern, and that this one-instruction `get_num_subspaces_and_flag` function is actually a memory load. So I built a flamegraph for last-level-cache misses (`perf record -e LLC-load-misses`), and suddenly everything made sense:

[flamegraph: LLC-load-misses, trunk]

As one can see, the flamegraphs for cpu-cycles and LLC-load-misses look basically the same for the HNSW index.
Looking closer at the perf data for LLC, I concluded that the LLC misses in `search::tensor::SerializedFastValueAttribute::get_vector` are coming from here:

vespa/searchlib/src/vespa/searchlib/tensor/serialized_fast_value_attribute.cpp, line 47 (at 7691c8e)

- `get_num_subspaces_and_flag` is a load from the tensor buffer
- the loads in `binary_hamming_distance` are also loads from the tensor buffer

The good thing about the HNSW memory access pattern is that although it's very hard for the hardware to predict and prefetch, we know exactly what memory we will have to access: given a vertex in the graph, check all its neighbors. Thus we can rearrange how we walk the neighbors in such a way that, with some hints to the hardware, there would be way fewer misses (see the sketch at the end of this description):

- prefetch the refs in `TensorAttribute::_refVector` for all the neighbors
- prefetch the tensor buffers via `TensorBufferOperations` (which requires a load from `TensorAttribute::_refVector`, but hopefully at this point the memory would already be brought to caches)

That's a lot of "hopefully", but that's just how prefetching hints work, and it turns out they work wonders: with the patch applied, the flamegraphs for proton look like this:
[flamegraph: cpu-cycles, with the patch applied]
with the LLC-load-misses flamegraph looking almost the same as it did, as expected (LLC-misses from prefetching are still there, but now they are async and don't stall us that much)
[flamegraph: LLC-load-misses, with the patch applied]
Comparing the trunk cpu-cycles flamegraph with the patched cpu-cycles flamegraph, it looks like there are ~x2 fewer cycles spent in the HNSW index, which amounts to ~10% of total CPU cycles spent, and that matches the CPU usage/timings I'm observing when benchmarking.
All benchmarks were conducted at commit b60aa0d, with ~1B of `tensor<int8>(d1[128])`, on AWS Graviton4.
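For illustration, the rearranged neighbor walk described above boils down to something like the sketch below; the types and names are stand-ins, not the actual TensorAttribute/TensorBufferOperations code.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for TensorAttribute::_refVector plus the buffer lookup it enables.
struct RefVectorSketch {
    std::vector<const uint8_t*> buffers; // indexed by docid
    const void* ref_address(uint32_t docid) const { return &buffers[docid]; }
    const uint8_t* tensor_buffer(uint32_t docid) const { return buffers[docid]; }
};

// Walk a vertex's neighbors in three passes so the hardware gets the hints
// well before the loads are needed:
//   1. hint the ref entries (the _refVector loads),
//   2. hint the tensor buffers they point to (the loads done for the
//      subspace header and the distance calculation),
//   3. only then compute the distances.
uint64_t score_neighbors(const RefVectorSketch& refs, const std::vector<uint32_t>& neighbors) {
    for (uint32_t docid : neighbors) {
        __builtin_prefetch(refs.ref_address(docid), 0, 3);
    }
    for (uint32_t docid : neighbors) {
        // The ref load should now hit cache; hint the buffer it points to.
        __builtin_prefetch(refs.tensor_buffer(docid), 0, 3);
    }
    uint64_t checksum = 0;
    for (uint32_t docid : neighbors) {
        checksum += refs.tensor_buffer(docid)[0]; // stand-in for binary_hamming_distance
    }
    return checksum;
}
```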