[tailsamplingprocessor] Remove misleading metric #42620
Conversation
This PR removes the metric `otelcol_processor_tail_sampling_sampling_decision_latency`. Originally reported in open-telemetry#38502, this metric does not measure the latency of a particular policy. Instead, it measures the latency since policy evaluation began, which is mostly not a useful signal. To make matters worse, profiling shows that recording this metric accounts for >20% of the CPU time spent evaluating policies. Since the tailsamplingprocessor is bottlenecked on the single-threaded decision loop, that 20% is much better spent on making decisions than on measuring a misleading metric.
axw left a comment
This metric does not measure the latency of a particular policy. Instead, it measures the latency since policy evaluation began which is mostly not a useful signal.
Agreed, but is that intentional? It seems like a bug that could be fixed. I think it could be useful to know how long each policy's evaluator takes, particularly for more expensive ones like the OTTL evaluator.
Also, #42508 goes in the direction of making evaluators pluggable, so they may be arbitrarily complex.
To make matters worse, profiling shows that recording this metric accounts for >20% of cpu time spent evaluating policies. Since the tailsamplingprocessor is bottlenecked on the single threaded decision loop, this 20% is much better spent on making decisions rather than measuring a misleading metric.
If that's the primary motivation, could you take the single-threadedness into account to reduce the instrumentation overhead? For example, accumulate locally and only update metrics after all policies have been evaluated; it appears there's something like that already in `policyMetrics.addDecision`.
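A minimal sketch of that local-accumulation idea, written against the generic OpenTelemetry Go metrics API rather than the processor's actual types (the `evaluatePolicies` helper, the closure-based policy slice, and the `policy` attribute key are placeholders, not the real tailsamplingprocessor code):

```go
package example

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// evaluatePolicies times each policy separately, buffers the samples locally,
// and only touches the histogram instrument after the decision loop finishes.
// A histogram still needs one Record call per sample, so this moves the
// instrumentation out of the hot loop rather than reducing the number of calls.
func evaluatePolicies(ctx context.Context, hist metric.Int64Histogram, policies []func(context.Context) error) error {
	type sample struct {
		policy int
		micros int64
	}
	samples := make([]sample, 0, len(policies))

	for i, evaluate := range policies {
		start := time.Now()
		if err := evaluate(ctx); err != nil {
			return err
		}
		samples = append(samples, sample{policy: i, micros: time.Since(start).Microseconds()})
	}

	// All metric recording happens here, after the policies have been evaluated.
	for _, s := range samples {
		hist.Record(ctx, s.micros, metric.WithAttributes(attribute.Int("policy", s.policy)))
	}
	return nil
}
```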
```go
startTime := time.Now()

// Check all policies before making a final decision.
for i, p := range tsp.policies {
	decision, err := p.evaluator.Evaluate(ctx, id, trace)
	latency := time.Since(startTime)
	tsp.telemetry.ProcessorTailSamplingSamplingDecisionLatency.Record(ctx, int64(latency/time.Microsecond), p.attribute)
```
So the problem is really that this is cumulative of all preceding policies? In which case the metric, as-is, will really only be meaningful if there's a single policy. That could be fixed by moving the startTime to the top of the loop, if it's important to keep the metric.
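For illustration, a fragment sketching that per-policy variant; it reuses the variables from the snippet quoted above (`tsp`, `ctx`, `id`, `trace`), so it is a sketch in context rather than a standalone program:

```go
// Sketch only: reset the timer inside the loop so each recorded value covers a
// single policy evaluation instead of the cumulative time since the loop began.
for _, p := range tsp.policies {
	startTime := time.Now()
	decision, err := p.evaluator.Evaluate(ctx, id, trace)
	latency := time.Since(startTime)
	tsp.telemetry.ProcessorTailSamplingSamplingDecisionLatency.Record(
		ctx, int64(latency/time.Microsecond), p.attribute)
	// ... existing decision / error handling continues as before ...
	_, _ = decision, err
}
```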
@axw Thanks for the review!
Is that possible with a histogram? I don't see any way to accumulate other than putting all the intermediate latencies in a slice and then calling Record for each one after the loop.
These are good points, but my opinion is that we shouldn't keep code around that reduces performance so much when it's not producing actionable results. I can think of a couple ways to make it ok performance-wise and keep the ability to locate slow policies:
My preference would be for (3), but that's still a breaking change for the metric, so I'm not sure it needs to be created in this same PR. Thoughts?
Co-authored-by: Andrew Wilkins <[email protected]>
Ah, I had missed that it was a histogram. I don't think we have an option at the moment then.
This sounds OK to me. @portertech thoughts?
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
From the follow-up PR that implements this feedback:

#### Description

This PR removes the metric `otelcol_processor_tail_sampling_sampling_decision_latency`. It adds a pair of metrics as a replacement, called `processor_tail_sampling_sampling_policy_cpu_time` and `processor_tail_sampling_sampling_policy_executions`. It is an implementation of the feedback received in #42620.

Originally reported in #38502, this metric does not measure the latency of a particular policy. Instead, it measures the latency since policy evaluation began, which is mostly not a useful signal. To make matters worse, profiling shows that recording this metric accounts for >20% of the CPU time spent evaluating policies. Since the tailsamplingprocessor is bottlenecked on the single-threaded decision loop, that 20% is much better spent on making decisions than on measuring a misleading metric.

As a replacement, I've added a metric to track total time spent on each policy as well as count total executions. This allows slow policies to still be identified by checking their total or average execution time, without the heavy CPU / GC pressure / synchronization cost of recording a histogram in the inner loop.

#### Link to tracking issue

Fixes #38502 (closed on accident, and I am not otel enough to reopen it).
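A rough sketch of what that counter-based replacement could look like with the OpenTelemetry Go metrics API. The instrument names come from the PR description above; the unit, the `policy` attribute key, and the `policyTelemetry` helper are illustrative assumptions, not the merged implementation:

```go
package example

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// policyTelemetry holds the two replacement instruments: a running total of
// time spent per policy and a count of how often each policy ran.
type policyTelemetry struct {
	cpuTime    metric.Int64Counter // processor_tail_sampling_sampling_policy_cpu_time
	executions metric.Int64Counter // processor_tail_sampling_sampling_policy_executions
}

func newPolicyTelemetry(meter metric.Meter) (policyTelemetry, error) {
	cpuTime, err := meter.Int64Counter(
		"processor_tail_sampling_sampling_policy_cpu_time",
		metric.WithUnit("us"), // assumed unit: microseconds
	)
	if err != nil {
		return policyTelemetry{}, err
	}
	executions, err := meter.Int64Counter("processor_tail_sampling_sampling_policy_executions")
	if err != nil {
		return policyTelemetry{}, err
	}
	return policyTelemetry{cpuTime: cpuTime, executions: executions}, nil
}

// recordPolicy adds one execution and its elapsed time for the named policy.
// Per the PR description, the intent is to avoid recording a histogram sample
// for every policy evaluation inside the single-threaded decision loop.
func (t policyTelemetry) recordPolicy(ctx context.Context, policy string, elapsed time.Duration) {
	attrs := metric.WithAttributes(attribute.String("policy", policy))
	t.cpuTime.Add(ctx, elapsed.Microseconds(), attrs)
	t.executions.Add(ctx, 1, attrs)
}
```

A slow policy then shows up as a large total, or a large average (`cpu_time` / `executions`), for its policy attribute, which matches the identification strategy described in the PR.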
Link to tracking issue
Originally reported in #38502, which I closed accidentally with a related PR.