Question about tail_sampling result #28638

Closed
Liubey opened this issue Oct 26, 2023 · 7 comments
Labels
processor/tailsampling Tail sampling processor

Comments

@Liubey
Contributor

Liubey commented Oct 26, 2023

Component(s)

processor/tailsampling

Describe the issue you're reporting

I have a tail_sampling policy meant to keep long-running requests that take more than 10 seconds.
However, I observe the following:

  1. Not ALL requests longer than 10 seconds are collected. For example, a request with a duration of 50 seconds is not collected.
  2. For a request only slightly longer than 10 seconds, such as 11 seconds (consisting entirely of one span), should it be collected?

For example 1, the trace duration is 31.37s and one span has a duration of 31.22s. I think the 31.22s span arrives after decision_wait, so will the spans that arrived before the 31.22s span be discarded?
Will the trace be incomplete, or will all of its spans be dropped?
Will the spans that arrive after the 31.22s span be dropped or collected?
But num_traces caches the spans of up to 5000 traces in memory (whether they should be dropped or collected).
So if the spans are still cached by num_traces (5000 traces), the spans before the 31.22s span will not be discarded,
and the trace will be complete.
Is that right?

For example 2, the trace duration is 51.37s (far beyond 30s) and one span has a duration of 51.22s.
Because the `num_traces` cache is full, the spans of the trace that arrived before the 51.22s span will be dropped from the cache.
So when the 51.22s span and the other spans of the trace arrive, they should be collected, and I should get an incomplete trace.
But I got nothing in jaeger-collector (elasticsearch).

So, how should I understand num_traces and decision_wait?
I want to collect ALL requests whose duration is more than 10 seconds.
Does the num_traces config cache all spans of 5000 traces?

the config:

      tail_sampling:
        decision_wait: 30s
        num_traces: 5000
        expected_new_traces_per_sec: 500
        policies:
          [
            {
              name: error,
              type: boolean_attribute,
              boolean_attribute: { key: error, value: true }
            },
            {
              name: latency,
              type: latency,
              latency: { threshold_ms: 10000 }
            },
          ]
@Liubey Liubey added the needs triage New item requiring triage label Oct 26, 2023
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Oct 26, 2023
@Liubey
Contributor Author

Liubey commented Oct 26, 2023

I saw #23648. Could you do me a favor?
@mizhexiaoxiao Are the spans of a trace deleted after decision_wait? What does the collector do when it receives part of an already dropped trace after decision_wait?

@jpkrohling
Member

jpkrohling commented Oct 26, 2023

  1. spans arriving after the decision_wait (30s in your example) will be evaluated separately, which means that traces can indeed be broken
  2. similar to the above, if a single span takes longer than the decision wait time but the rest of the trace isn't taking longer than the threshold (10s in your example), the whole trace will be discarded except for the span that arrived late and was evaluated individually

"I want collect ALL OF request who duration more than 10 seconds."

Unfortunately, this is not possible with the current implementation: decisions are made when "decision_wait" elapses, and the decision isn't recorded.

@jpkrohling jpkrohling removed the needs triage New item requiring triage label Oct 26, 2023
@Liubey
Contributor Author

Liubey commented Oct 26, 2023

I have a question about point 2.
Suppose I create a trace like this:

root span(50000.5ms)
   span1(0.1ms)
   span2(0.1ms)
   span3(0.1ms)
   span4(50000ms)
   span5(0.1ms)
   span6(0.1ms)

"the whole trace will be discarded except for the span that arrived late and was evaluated individually"
So will I get the trace as follows, or will I get nothing from the trace at all?

root span(50000.5ms)
   span4(50000ms)
   span5(0.1ms)
   span6(0.1ms)

@jpkrohling thank you so much.

@jpkrohling
Member

You get the 50s span and everything that arrives within its decision wait.

@Liubey
Contributor Author

Liubey commented Oct 27, 2023

@jpkrohling Sorry to disturb you again.
I am on collector version 0.77.0.
I created a long span with a duration of 50s, with decision_wait=30s.
According to your description, I should get this incomplete trace:

root span(50000.5ms)
   span4(50000ms)
   span5(0.1ms)
   span6(0.1ms)

But I get nothing, even after testing many times.
I am also confused that the collector logs do not print "notSampled"; it seems these spans (span4, span5, span6 and the root span) have simply disappeared.

@jpkrohling
Member

I tried the following and it seemed to work:

receivers:
  otlp:
    protocols:
      grpc:

processors:
  tail_sampling:
    decision_wait: 5s
    num_traces: 50_000
    expected_new_traces_per_sec: 10_000
    policies:
      [
          {
            name: longer-than-1s,
            type: latency,
            latency: {threshold_ms: 1_000}
          },
      ]

exporters:
  logging:

connectors:

service:
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - tail_sampling
      exporters:
        - logging

I changed the telemetrygen tool to generate the root span with a duration of 10s:

diff --git a/cmd/telemetrygen/internal/traces/worker.go b/cmd/telemetrygen/internal/traces/worker.go
index cbcdc09ac1..6ae1c0abe8 100644
--- a/cmd/telemetrygen/internal/traces/worker.go
+++ b/cmd/telemetrygen/internal/traces/worker.go
@@ -35,7 +35,7 @@ type worker struct {
 const (
        fakeIP string = "1.2.3.4"
 
-       fakeSpanDuration = 123 * time.Microsecond
+       fakeSpanDuration = 10 * time.Second
 
        charactersPerMB = 1024 * 1024 // One character takes up one byte of space, so this number comes from the number of bytes in a megabyte
 )

I ran it with go run ./ traces --traces 100 --otlp-insecure and got all traces sampled (I ran it twice, hence the count of 200):

> curl -s localhost:8888/metrics | grep otelcol_processor_tail_sampling_count_traces_sampled
# HELP otelcol_processor_tail_sampling_count_traces_sampled Count of traces that were sampled or not
# TYPE otelcol_processor_tail_sampling_count_traces_sampled counter
otelcol_processor_tail_sampling_count_traces_sampled{policy="longer-than-1s",sampled="true",service_instance_id="6b7aa468-9b96-4397-b37c-ee59d74f7fe8",service_name="otelcol-contrib",service_version="0.87.0"} 200

Can you please check if the example is consistent with your use-case?
