@rdettai-sk
Collaborator

@rdettai-sk rdettai-sk commented Nov 28, 2025

Description

We observed that query spikes create huge backlogs of leaf search tasks that don't get cancelled when the queries time out.

This happens because the timeout cancellation isn't propagated to the spawned tasks.

This implementation is based on JoinSet, a Tokio primitive that helps manage the lifecycle of a group of tasks. It is crucial that all the tasks get cancelled when the leaf request times out.

How was this PR tested?

Describe how you tested this PR.

@rdettai-sk rdettai-sk self-assigned this Nov 28, 2025
@guilload guilload self-requested a review December 2, 2025 18:30
@rdettai-sk rdettai-sk force-pushed the propagate-leaf-search-cancel branch from 97823ad to 05093f0 Compare December 5, 2025 14:05
}
}
while let Some(result) = join_set.join_next().await {
incremental_merge_collector.add_result(result??)?;
Member

This is not the original behavior, right? Before, we would keep going; now we return an error right away.

Collaborator Author

We were already erroring right away for JoinError. For regular errors we were continuing, but adding a SplitSearchError with "unknown" split_id to the list of failed splits. I think the most likely reason a child request might fail is a merge error. Given that the user doesn't know how many and which splits failed, it seems quite unlikely that the result can be reasonably used. I can revert this part if you think the partial result is valuable in this scenario.
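
To make the two behaviors under discussion concrete, here is a rough sketch. The types and function names are simplified stand-ins invented for illustration, not the actual structs or collector API: a child result is reduced to a hit count, and errors are plain strings.

```rust
/// Simplified stand-in for an entry in the list of failed splits.
#[derive(Debug, PartialEq)]
struct SplitSearchError {
    split_id: String,
    error: String,
}

/// Simplified stand-in for the merged leaf response.
#[derive(Debug, Default, PartialEq)]
struct MergedResponse {
    num_hits: u64,
    failed_splits: Vec<SplitSearchError>,
}

/// A child result: a hit count on success, an error message on failure.
type ChildResult = Result<u64, String>;

/// Fail-fast: the first child error aborts the whole merge.
fn merge_fail_fast(results: Vec<ChildResult>) -> Result<MergedResponse, String> {
    let mut merged = MergedResponse::default();
    for result in results {
        merged.num_hits += result?;
    }
    Ok(merged)
}

/// Lenient: keep merging and record each failure with an "unknown" split ID.
fn merge_lenient(results: Vec<ChildResult>) -> MergedResponse {
    let mut merged = MergedResponse::default();
    for result in results {
        match result {
            Ok(hits) => merged.num_hits += hits,
            Err(error) => merged.failed_splits.push(SplitSearchError {
                split_id: "unknown".to_string(),
                error,
            }),
        }
    }
    merged
}

fn main() {
    let results = || vec![Ok(2), Err("merge error".to_string()), Ok(3)];
    println!("{:?}", merge_fail_fast(results()));
    println!("{:?}", merge_lenient(results()));
}
```

The lenient variant returns a partial result, but the caller only learns that some unidentified split failed, which is the limitation described above.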

Collaborator

@rdettai-sk you might be right, and this 100% deserves diving into the details and doing a refactoring...

However, we do not have time to do this at the moment, and the rest of the PR is valuable. Is it possible to revert this part?

Collaborator Author

done ✅

// An explicit task cancellation is not an error.
continue;
}
let position = split_with_task_id
Member

Can we also use a map from task ID to split ID here, instead of having to do the find index + remove dance?
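
A rough sketch of the difference, assuming the current code keeps a `Vec` of `(task ID, split ID)` pairs (the excerpt above is truncated, so this is a guess). `TaskId` is a `u64` stand-in for Tokio's task ID type, and both helper names are hypothetical.

```rust
use std::collections::HashMap;

/// Stand-in for the runtime's task ID type (an assumption for this sketch).
type TaskId = u64;

/// The "find index + remove" approach on a Vec of (task ID, split ID) pairs.
fn take_split_id_from_vec(
    split_with_task_id: &mut Vec<(TaskId, String)>,
    task_id: TaskId,
) -> Option<String> {
    let position = split_with_task_id
        .iter()
        .position(|(id, _)| *id == task_id)?;
    Some(split_with_task_id.remove(position).1)
}

/// The map-based approach: a single lookup-and-remove call.
fn take_split_id_from_map(
    split_id_by_task_id: &mut HashMap<TaskId, String>,
    task_id: TaskId,
) -> Option<String> {
    split_id_by_task_id.remove(&task_id)
}

fn main() {
    let mut pairs = vec![(1, "split-a".to_string()), (2, "split-b".to_string())];
    let mut map: HashMap<TaskId, String> = pairs.iter().cloned().collect();

    println!("{:?}", take_split_id_from_vec(&mut pairs, 2));
    println!("{:?}", take_split_id_from_map(&mut map, 2));
}
```

Both return the split ID for a finished task and forget the entry. The map version avoids the linear scan and the element shifting that `Vec::remove` performs.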
