Propagate cancellation within leaf search #6002
97823ad to 05093f0
quickwit/quickwit-search/src/leaf.rs
Outdated
        }
    }
    while let Some(result) = join_set.join_next().await {
        incremental_merge_collector.add_result(result??)?;
This is not the original behavior, right? Before, we would keep going; now we return an error right away.
We were already erroring right away for JoinError. For regular errors we were continuing, but adding a SplitSearchError with "unknown" split_id to the list of failed splits. I think the most likely reason a child request might fail is a merge error. Given that the user doesn't know how many and which splits failed, it seems quite unlikely that the result can be reasonably used. I can revert this part if you think the partial result is valuable in this scenario.
@rdettai-sk you might be right, and this 100% deserves diving into the details and doing a refactoring...
However, we do not have time to do this at the moment, and the rest of the PR is valuable. Is it possible to revert this part?
done ✅
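For readers following this thread, the two behaviors under discussion can be contrasted in a small sketch. This is not the code from `leaf.rs`: `TaskResult`, `merge_fail_fast`, and `merge_partial` are hypothetical names, and a plain `u64` hit count stands in for the real merged search result.

```rust
/// Hypothetical, simplified stand-in for the result of one child task.
type TaskResult = Result<u64, String>;

/// Fail fast: the first error aborts the merge and is returned right away.
fn merge_fail_fast(results: Vec<TaskResult>) -> Result<u64, String> {
    let mut total = 0;
    for result in results {
        total += result?;
    }
    Ok(total)
}

/// Keep going: sum the successes and record each failure next to the partial result.
fn merge_partial(results: Vec<TaskResult>) -> (u64, Vec<String>) {
    let mut total = 0;
    let mut failures = Vec::new();
    for result in results {
        match result {
            Ok(hits) => total += hits,
            Err(err) => failures.push(err),
        }
    }
    (total, failures)
}
```

The second form matches the behavior that was restored: the caller still gets a partial result, plus a list of failures it can inspect.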
            // An explicit task cancellation is not an error.
            continue;
        }
        let position = split_with_task_id
Can we also use a map from task ID to split ID here, instead of having to do the find index + remove dance?
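A minimal sketch of that suggestion, using only the standard library. `TaskId` is a hypothetical `u64` alias standing in for whatever task identifier the real code uses, and both function names are made up for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a task identifier.
type TaskId = u64;

/// The "find index + remove" approach: linear search, then removal by position.
fn take_split_id_vec(pending: &mut Vec<(TaskId, String)>, task_id: TaskId) -> Option<String> {
    let position = pending.iter().position(|(id, _)| *id == task_id)?;
    Some(pending.remove(position).1)
}

/// The suggested alternative: a map keyed by task ID, so lookup and removal are one call.
fn take_split_id_map(pending: &mut HashMap<TaskId, String>, task_id: TaskId) -> Option<String> {
    pending.remove(&task_id)
}
```

Both functions return the split ID for a finished task and forget it afterwards. The map version avoids the linear scan and the element shifting that `Vec::remove` performs.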
Description
We observed that query spikes create huge backlogs of leaf search tasks that don't get cancelled when the queries time out.
This happens because the timeout cancellation isn't propagated to spawned tasks.
This implementation is based on JoinSet, a Tokio primitive that helps manage the lifecycle of a group of tasks. It is crucial to make sure all the tasks get cancelled when the leaf request times out.
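The underlying idea can be illustrated without Tokio. A `JoinSet` aborts all of its tasks when it is dropped, so dropping it on timeout takes the spawned work down with it. The sketch below is a standard-library analogue only, not the PR's implementation: it uses threads and a shared atomic flag (threads cannot be aborted, so cancellation here is cooperative), and `TaskGroup` and `run_and_cancel` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical task group: owns its workers and cancels them when dropped.
struct TaskGroup {
    cancelled: Arc<AtomicBool>,
    handles: Vec<JoinHandle<()>>,
}

impl TaskGroup {
    fn new() -> Self {
        TaskGroup { cancelled: Arc::new(AtomicBool::new(false)), handles: Vec::new() }
    }

    /// Spawns a worker that receives the shared cancellation flag.
    fn spawn<F>(&mut self, work: F)
    where
        F: FnOnce(Arc<AtomicBool>) + Send + 'static,
    {
        let flag = Arc::clone(&self.cancelled);
        self.handles.push(thread::spawn(move || work(flag)));
    }
}

impl Drop for TaskGroup {
    fn drop(&mut self) {
        // Signal every worker, then wait for each of them to wind down.
        self.cancelled.store(true, Ordering::SeqCst);
        for handle in self.handles.drain(..) {
            let _ = handle.join();
        }
    }
}

/// Starts `n` workers that run until cancelled, then drops the group
/// (simulating a timed-out request). Returns how many workers stopped.
fn run_and_cancel(n: usize) -> usize {
    let stopped = Arc::new(AtomicUsize::new(0));
    {
        let mut group = TaskGroup::new();
        for _ in 0..n {
            let stopped = Arc::clone(&stopped);
            group.spawn(move |cancelled| {
                while !cancelled.load(Ordering::SeqCst) {
                    thread::sleep(Duration::from_millis(1));
                }
                stopped.fetch_add(1, Ordering::SeqCst);
            });
        }
    } // `group` is dropped here: all workers are cancelled and joined.
    stopped.load(Ordering::SeqCst)
}
```

Tying cancellation to `Drop` means no code path (early return, error, or timeout) can leave workers running once the owner of the group is gone, which is the property the description asks for.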