
Summary broker query log for multistage engine queries #16001

@dang-stripe

We've found it difficult to do post-hoc analysis of query timeouts and failures on the multistage engine without access to the stage stats. Debugging these queries has involved:

  1. Piecing together distributed server logs to understand where the query failed or slowed down
  2. Re-running the query, raising the timeout as needed, and collecting the stage stats ad hoc

Given the complexity of the multistage engine, it would be ideal for these queries to have a concise summary log from the broker, similar to the query log the broker already emits for single-stage queries. Some metadata that would help with debugging:

  1. Query success or failure
  2. Query latency
  3. Concise representation of the stage graph (e.g. 1->[2,3],2->4), including which stages are leaves
  4. Which stages of the query succeeded or failed
  5. For stages that failed, which servers failed to complete the stage
  6. How long (wall-clock time) each stage took
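To make the requested fields concrete, here is a minimal sketch of how a broker could render them as a single summary line. It is illustrative only: `MultistageQuerySummary`, `StageSummary`, `formatGraph`, `formatSummary`, and the `key=value` layout are hypothetical names, not existing Pinot classes or log formats, and the stage ids, timings, and `server-7` in `main` are made-up example values.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch; none of these types exist in Pinot.
public class MultistageQuerySummary {

  // Hypothetical per-stage outcome the broker would collect.
  record StageSummary(int stageId, List<Integer> children, boolean succeeded,
                      long wallClockMs, List<String> failedServers) {
    boolean isLeaf() {
      return children.isEmpty();  // a stage with no child stages is a leaf
    }
  }

  // Renders edges as "1->[2,3],2->4": brackets only when a stage has several children.
  static String formatGraph(Map<Integer, StageSummary> stages) {
    StringJoiner edges = new StringJoiner(",");
    for (StageSummary s : stages.values()) {
      if (s.isLeaf()) {
        continue;  // leaves have no outgoing edges to print
      }
      String targets = s.children().size() == 1
          ? String.valueOf(s.children().get(0))
          : s.children().stream().map(String::valueOf)
              .collect(Collectors.joining(",", "[", "]"));
      edges.add(s.stageId() + "->" + targets);
    }
    return edges.toString();
  }

  // Builds the single line the broker would log once the query finishes or fails.
  static String formatSummary(boolean querySucceeded, long latencyMs,
                              Map<Integer, StageSummary> stages) {
    String leaves = stages.values().stream()
        .filter(StageSummary::isLeaf)
        .map(s -> String.valueOf(s.stageId()))
        .collect(Collectors.joining(",", "[", "]"));
    StringJoiner perStage = new StringJoiner(",", "{", "}");
    for (StageSummary s : stages.values()) {
      String entry = s.stageId() + ":" + (s.succeeded() ? "OK" : "FAILED")
          + "/" + s.wallClockMs() + "ms";
      if (!s.failedServers().isEmpty()) {
        entry += "(" + String.join("|", s.failedServers()) + ")";  // servers that failed this stage
      }
      perStage.add(entry);
    }
    return "status=" + (querySucceeded ? "SUCCESS" : "FAILED")
        + " latencyMs=" + latencyMs
        + " graph=" + formatGraph(stages)
        + " leaves=" + leaves
        + " stages=" + perStage;
  }

  public static void main(String[] args) {
    Map<Integer, StageSummary> stages = new TreeMap<>();  // TreeMap keeps stage ids ordered
    stages.put(1, new StageSummary(1, List.of(2, 3), false, 10005, List.of()));
    stages.put(2, new StageSummary(2, List.of(4), true, 420, List.of()));
    stages.put(3, new StageSummary(3, List.of(), false, 10000, List.of("server-7")));
    stages.put(4, new StageSummary(4, List.of(), true, 380, List.of()));
    // prints: status=FAILED latencyMs=10012 graph=1->[2,3],2->4 leaves=[3,4] stages={1:FAILED/10005ms,2:OK/420ms,3:FAILED/10000ms(server-7),4:OK/380ms}
    System.out.println(formatSummary(false, 10012, stages));
  }
}
```

Keeping the whole summary on one `key=value` line is just one possible design: it stays easy to grep and to parse with log tooling, at the cost of long lines for queries with many stages.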

This would make it much easier to debug production issues on the fly and answer questions like:

  1. Is the query failing because of a single server or multiple servers?
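With a per-stage list of failed servers in the summary, that question reduces to counting distinct servers across stages. A minimal sketch, assuming the summary exposes failed servers keyed by stage id; `FailureScope`, `Scope`, and `classify` are hypothetical names, and the server names are example values.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of Pinot.
public class FailureScope {

  enum Scope { NONE, SINGLE_SERVER, MULTIPLE_SERVERS }

  // Collects every failed server across all stages and classifies how widespread the failure is.
  static Scope classify(Map<Integer, List<String>> failedServersByStage) {
    Set<String> distinct = new HashSet<>();
    for (List<String> servers : failedServersByStage.values()) {
      distinct.addAll(servers);  // the same server failing several stages counts once
    }
    if (distinct.isEmpty()) {
      return Scope.NONE;
    }
    return distinct.size() == 1 ? Scope.SINGLE_SERVER : Scope.MULTIPLE_SERVERS;
  }

  public static void main(String[] args) {
    // prints SINGLE_SERVER: stages 3 and 5 both failed on the same server
    System.out.println(classify(Map.of(3, List.of("server-7"), 5, List.of("server-7"))));
    // prints MULTIPLE_SERVERS
    System.out.println(classify(Map.of(3, List.of("server-7", "server-2"))));
  }
}
```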

cc @gortiz @Jackie-Jiang @jadami10
