-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Open
Labels
multi-stageRelated to the multi-stage query engineRelated to the multi-stage query engineobservability
Description
We've found it difficult to do post-hoc analysis of query timeouts/failures on the multistage engine without access to the stage stats. Debugging these queries have involved:
- Piecing together distributed server logs to understand where the query failed or slowed down
- Re-running the query, raising the timeout as needed, and collecting the stage stats adhoc
Given the complexity of the multistage engine, it'd be ideal for these queries to have a concise summary log from the broker similar to how single stage does it here. Some metadata that'd be helpful for debugging:
- Query success or failure
- Query latency
- Concise representation of the stage graph (1->[2,3],2->4,etc) including which stages are leafs
- Which stages of the query succeeded or failed
- For stages that failed, which servers failed the complete the stage
- How long (wall clock) time did each stage take
This would make it much easier to debug production issues on the fly and answer questions like:
- Is the query failing due to single server or multiple?
Metadata
Metadata
Assignees
Labels
multi-stageRelated to the multi-stage query engineRelated to the multi-stage query engineobservability