
Conversation

@Logiraptor
Contributor

Description

This PR removes the metric otelcol_processor_tail_sampling_sampling_decision_latency.

This metric does not measure the latency of a particular policy. Instead, it measures the latency since policy evaluation began, which is mostly not a useful signal.

To make matters worse, profiling shows that recording this metric accounts for >20% of the CPU time spent evaluating policies. Since the tailsamplingprocessor is bottlenecked on its single-threaded decision loop, that 20% is much better spent making decisions than measuring a misleading metric.

[Screenshot: profiling results]

Link to tracking issue

Originally reported in #38502, which I closed accidentally with a related PR.

@Logiraptor Logiraptor marked this pull request as ready for review September 10, 2025 03:18
@Logiraptor Logiraptor requested a review from a team as a code owner September 10, 2025 03:18
@Logiraptor Logiraptor requested a review from axw September 10, 2025 03:18
@github-actions github-actions bot added the processor/tailsampling (Tail sampling processor) label Sep 10, 2025
@github-actions github-actions bot requested a review from portertech September 10, 2025 03:18
Contributor

@axw axw left a comment


This metric does not measure the latency of a particular policy. Instead, it measures the latency since policy evaluation began which is mostly not a useful signal.

Agreed, but is that intentional? It seems like a bug that could be fixed. I think it could be useful to know how long each policy's evaluator takes, particularly for more expensive ones like the OTTL evaluator.

Also, #42508 goes in the direction of making evaluators pluggable, so they may be arbitrarily complex.

To make matters worse, profiling shows that recording this metric accounts for >20% of cpu time spent evaluating policies. Since the tailsamplingprocessor is bottlenecked on the single threaded decision loop, this 20% is much better spent on making decisions rather than measuring a misleading metric.

If that's the primary motivation, could you take the single-threadedness into account to reduce the instrumentation overhead? i.e. by accumulating locally and only updating metrics after all policies have been evaluated -- it appears there's something like that already in policyMetrics.addDecision.

Comment on lines -465 to -471
startTime := time.Now()

// Check all policies before making a final decision.
for i, p := range tsp.policies {
	decision, err := p.evaluator.Evaluate(ctx, id, trace)
	latency := time.Since(startTime)
	tsp.telemetry.ProcessorTailSamplingSamplingDecisionLatency.Record(ctx, int64(latency/time.Microsecond), p.attribute)

So the problem is really that this is cumulative of all preceding policies? In which case the metric, as-is, will really only be meaningful if there's a single policy. That could be fixed by moving the startTime to the top of the loop, if it's important to keep the metric.
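
For illustration, a minimal sketch of that fix, using stand-in policy types and creating the histogram directly through the OpenTelemetry Go metric API (the processor's real types and generated telemetry differ): resetting the start time at the top of each iteration makes every recording cover only that policy's evaluation.

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// policy is a stand-in for the processor's policy type.
type policy struct {
	name     string
	evaluate func(ctx context.Context) error
}

func main() {
	meter := otel.Meter("tailsampling-sketch")
	latencyHist, _ := meter.Int64Histogram("sampling_decision_latency", metric.WithUnit("us"))

	policies := []policy{
		{name: "latency-policy", evaluate: func(context.Context) error { return nil }},
		{name: "ottl-policy", evaluate: func(context.Context) error { return nil }},
	}

	ctx := context.Background()
	for _, p := range policies {
		start := time.Now() // reset at the top of every iteration, not once before the loop
		_ = p.evaluate(ctx)
		// Each recording now covers only this policy's evaluation.
		latencyHist.Record(ctx, int64(time.Since(start)/time.Microsecond),
			metric.WithAttributes(attribute.String("policy", p.name)))
	}
}
```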

@Logiraptor
Contributor Author

Logiraptor commented Sep 10, 2025

@axw Thanks for the review!

If that's the primary motivation, could you take the single-threadedness into account to reduce the instrumentation overhead? i.e. by accumulating locally and only updating metrics after all policies have been evaluated -- it appears there's something like that already in policyMetrics.addDecision.

Is that possible with a histogram? I don't see any way to accumulate other than putting all the intermediate latencies in a slice and then calling Record in a loop. But that doesn't make it any faster.

Agreed, but is that intentional? It seems like a bug that could be fixed. I think it could be useful to know how long each policy's evaluator takes, particularly for more expensive ones like the OTTL evaluator.
Also, #42508 goes in the direction of making evaluators pluggable, so they may be arbitrarily complex.

These are good points, but my opinion is that we shouldn't keep code around that reduces performance so much when it's not producing actionable results.

I can think of a few ways to make it OK performance-wise and keep the ability to locate slow policies:

  1. Make it optional, like otelcol_processor_tail_sampling_count_spans_sampled. In this case I would disable it in my infrastructure for now, because the CPU cost is not worth it.
  2. Record timings for a subset of evaluations, based on some sampling rate
  3. Refactor the code to record total time spent instead of a histogram. In other words, it would be a single counter of total seconds per policy which is easy to accumulate and record after the loop.

My preference would be for (3), but that's still a breaking change for the metric, so I'm not sure it needs to be created in this same PR. Thoughts?
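
A minimal sketch of option (3), again with stand-in types rather than the processor's real ones: durations are accumulated in a plain slice inside the evaluation loop, and a single per-policy counter is updated only after the loop, so the hot path makes no metric API calls.

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// policy is a stand-in for the processor's policy type.
type policy struct {
	name     string
	evaluate func(ctx context.Context) error
}

func main() {
	meter := otel.Meter("tailsampling-sketch")
	// A single monotonic counter of total evaluation time, attributed per policy.
	totalSeconds, _ := meter.Float64Counter("sampling_policy_total_seconds", metric.WithUnit("s"))

	policies := []policy{
		{name: "latency-policy", evaluate: func(context.Context) error { return nil }},
		{name: "ottl-policy", evaluate: func(context.Context) error { return nil }},
	}

	ctx := context.Background()
	elapsed := make([]time.Duration, len(policies))

	// Hot path: only cheap local accumulation, no metric API calls.
	for t := 0; t < 3; t++ { // stand-in for the traces handled in one pass of the decision loop
		for i, p := range policies {
			start := time.Now()
			_ = p.evaluate(ctx)
			elapsed[i] += time.Since(start)
		}
	}

	// One counter update per policy, after the loop.
	for i, p := range policies {
		totalSeconds.Add(ctx, elapsed[i].Seconds(),
			metric.WithAttributes(attribute.String("policy", p.name)))
	}
}
```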

@axw
Contributor

axw commented Sep 11, 2025

Is that possible with a histogram? I don't see any way to accumulate other than putting all the intermediate latencies in a slice and then calling Record in a loop. But that doesn't make it any faster.

Ah, I had missed that it was a histogram. I don't think we have an option at the moment then.
Theoretically there could be two options, but the metrics API does not support either of them:

  • Maintain a local histogram and later merge it in
  • Maintain a local histogram and later, for each bucket make a recording with the total & count

Refactor the code to record total time spent instead of a histogram. In other words, it would be a single counter of total seconds per policy which is easy to accumulate and record after the loop.

This sounds OK to me. @portertech thoughts?

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Sep 25, 2025
atoulme pushed a commit that referenced this pull request Oct 3, 2025
Hoping to help spread the load of code ownership, and I'm spending lots
of time working on the TSP anyway.

List of contributions:

* #41888
* #41617
* #41546
* #39761
* #37722
* #37035
* #41656
* #38502
* #42620

---------

Co-authored-by: Christos Markou <[email protected]>
@github-actions
Contributor

github-actions bot commented Oct 9, 2025

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Oct 9, 2025
songy23 pushed a commit that referenced this pull request Nov 17, 2025

#### Description

This PR removes the metric
otelcol_processor_tail_sampling_sampling_decision_latency. It adds a
pair of replacement metrics,
`processor_tail_sampling_sampling_policy_cpu_time` and
`processor_tail_sampling_sampling_policy_executions`, implementing the
feedback received in
#42620.

Originally reported in
#38502,
this metric does not measure the latency of a particular policy.
Instead, it measures the latency since policy evaluation began, which
is mostly not a useful signal.

To make matters worse, profiling shows that recording this metric
accounts for >20% of the CPU time spent evaluating policies. Since the
tailsamplingprocessor is bottlenecked on its single-threaded decision
loop, that 20% is much better spent making decisions than measuring a
misleading metric.

As a replacement, I've added metrics that track the total time spent
on each policy as well as the total number of executions. This still
allows slow policies to be identified by checking their total or
average execution time, without the heavy CPU, GC pressure, and
synchronization cost of recording a histogram in the inner loop.
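
A rough sketch of the shape of the replacement, with the instrument names taken from the description above; in the collector itself these instruments come from the component's generated telemetry rather than being created directly like this.

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	meter := otel.Meter("tailsampling-sketch")

	// Stand-ins for the new instruments; in the collector they come from the
	// component's generated telemetry rather than being created here.
	cpuTime, _ := meter.Float64Counter("processor_tail_sampling_sampling_policy_cpu_time",
		metric.WithUnit("s"))
	executions, _ := meter.Int64Counter("processor_tail_sampling_sampling_policy_executions")

	ctx := context.Background()
	attrs := metric.WithAttributes(attribute.String("policy", "ottl-policy"))

	start := time.Now()
	// ... evaluate the policy for one trace here ...
	cpuTime.Add(ctx, time.Since(start).Seconds(), attrs)
	executions.Add(ctx, 1, attrs)

	// A slow policy shows up as a high total, or as a high average computed
	// downstream as cpu_time / executions, with no histogram in the hot loop.
}
```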

#### Link to tracking issue
Fixes #38502 - closed by accident, and I am not otel enough to reopen it