Skip to content

[tailsamplingprocessor] Decision Timer Latency Metric is misleading #38502

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Logiraptor opened this issue Mar 10, 2025 · 1 comment · Fixed by #39761
Closed

[tailsamplingprocessor] Decision Timer Latency Metric is misleading #38502

Logiraptor opened this issue Mar 10, 2025 · 1 comment · Fixed by #39761
Labels
bug Something isn't working needs triage New item requiring triage processor/tailsampling Tail sampling processor

Comments

@Logiraptor
Copy link
Contributor

Component(s)

processor/tailsampling

What happened?

Description

After #37722, I realized there is another issue with the decision timer latency metric.

Basically, it currently measures a latency from starting policy evaluation until just after each trace is evaluated. This isn't super useful, consider the following scenario:

  • A batch with 10 traces, each taking 1ms to evaluate all policies

Steps to Reproduce

Run the tsp, observe the otelcol_processor_tail_sampling_sampling_decision_timer_latency metric.

Expected Result

I would expect this metric to report one of two things, either:

  1. The total time to evaluate a batch (in this case 10ms)
  2. The time to evaluate each trace (in this case 1ms)

IMO (1) is more useful, since it's a direct indication of whether the tsp is at risk of falling behind on processing traces.

Actual Result

  • The histogram will record times 1ms, 2ms, 3ms, etc up to 10ms.
  • In the end the p99 will be 9ms, p50 will be ~5ms, and average will be 5ms.

I considered opening a PR to change the implementation to (1) above, but figured I would open this issue first to make sure I'm not missing an important use-case for (2). Happy to submit the PR though if not!

Collector version

N/A

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

Log output

Additional context

No response

@Logiraptor Logiraptor added bug Something isn't working needs triage New item requiring triage labels Mar 10, 2025
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Mar 10, 2025
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Logiraptor added a commit to Logiraptor/opentelemetry-collector-contrib that referenced this issue Apr 29, 2025
It will now measure the total time spent on one tick of the decision timer. This
is more useful, since the decision tick loop needs to run at least once per
second for the tsp to keep up with incoming traces.

Fixes open-telemetry#38502
Logiraptor added a commit to Logiraptor/opentelemetry-collector-contrib that referenced this issue Apr 29, 2025
It will now measure the total time spent on one tick of the decision timer. This
is more useful, since the decision tick loop needs to run at least once per
second for the tsp to keep up with incoming traces.

Fixes open-telemetry#38502
@atoulme atoulme closed this as completed in 459d156 May 2, 2025
vincentfree pushed a commit to ing-bank/opentelemetry-collector-contrib that referenced this issue May 6, 2025
…ric (open-telemetry#39761)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description

It will now measure the total time spent on one tick of the decision
timer. This is more useful, since the decision tick loop needs to run at
least once per second for the tsp to keep up with incoming traces.



<!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. -->
#### Link to tracking issue
Fixes open-telemetry#38502
vincentfree pushed a commit to ing-bank/opentelemetry-collector-contrib that referenced this issue May 20, 2025
…ric (open-telemetry#39761)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description

It will now measure the total time spent on one tick of the decision
timer. This is more useful, since the decision tick loop needs to run at
least once per second for the tsp to keep up with incoming traces.



<!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. -->
#### Link to tracking issue
Fixes open-telemetry#38502
dragonlord93 pushed a commit to dragonlord93/opentelemetry-collector-contrib that referenced this issue May 23, 2025
…ric (open-telemetry#39761)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description

It will now measure the total time spent on one tick of the decision
timer. This is more useful, since the decision tick loop needs to run at
least once per second for the tsp to keep up with incoming traces.



<!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. -->
#### Link to tracking issue
Fixes open-telemetry#38502
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage New item requiring triage processor/tailsampling Tail sampling processor
Projects
None yet
1 participant