Question about tail_sampling result #28638

Liubey · 2023-10-26T07:58:43Z

Component(s)

processor/tailsampling

Describe the issue you're reporting

I have a tail_sampling policy for long-running requests (longer than 10 seconds).
But I observe the following:

I verified that not ALL requests longer than 10 seconds are collected. For example, a request with a duration of 50 seconds is not collected.
A request that lasts only a little more than 10 seconds, such as 11 seconds (entirely within one span): should it be collected?

Example 1: the trace duration is 31.37s and one span's duration is 31.22s. I think the 31.22s span arrives after decision_wait, so will the spans that arrived before it be discarded?
Will the trace be incomplete, or will all of its spans be dropped?
Will the spans that arrive after the 31.22s span be dropped or collected?
But num_traces caches the spans of up to 5000 traces in memory (whether they end up dropped or collected).
So if the spans were cached by num_traces (5000 traces), the spans before the 31.22s span will not be discarded,
and the trace will be complete.
Is that right?

Example 2: the trace duration is 51.37s (well beyond the 30s decision_wait) and one span's duration is 51.22s.
Because the num_traces cache is full, the spans of the trace that arrive before the 51.22s span will be dropped from the cache.
So when the 51.22s span and the other spans of the trace arrive, they should be collected, and I should get an incomplete trace.
But I got nothing in jaeger-collector (elasticsearch).

So how should I understand num_traces and decision_wait?
I want to collect ALL requests whose duration is more than 10 seconds.
Does the num_traces setting cache all spans of 5000 traces?

The config:

      tail_sampling:
        decision_wait: 30s
        num_traces: 5000
        expected_new_traces_per_sec: 500
        policies:
          [
            {
              name: error,
              type: boolean_attribute,
              boolean_attribute: { key: error, value: true }
            },
            {
              name: latency,
              type: latency,
              latency: { threshold_ms: 10000 }
            },
          ]
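
A rough way to reason about how num_traces relates to decision_wait is the back-of-the-envelope check below. It is only a sketch: it assumes new traces really arrive at the configured expected_new_traces_per_sec rate (the thread does not state the actual rate) and that the oldest buffered trace is evicted once num_traces is reached. The function names are illustrative and not part of the collector.

```python
import math


def buffered_seconds(num_traces: int, new_traces_per_sec: float) -> float:
    """Approximate age of the oldest trace still in memory at a steady arrival rate."""
    return num_traces / new_traces_per_sec


def min_num_traces(new_traces_per_sec: float, decision_wait_s: float) -> int:
    """Smallest buffer size that keeps a trace in memory for the whole decision_wait."""
    return math.ceil(new_traces_per_sec * decision_wait_s)


if __name__ == "__main__":
    # Values from the config above: num_traces=5000, 500 new traces/s, decision_wait=30s.
    print(buffered_seconds(5000, 500))  # 10.0
    print(min_num_traces(500, 30))      # 15000
```

Under those assumptions the buffer would hold roughly 10 seconds of traces, which is less than the 30s decision_wait, so a trace could be evicted before its decision is made.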

github-actions · 2023-10-26T07:59:04Z

Pinging code owners:

processor/tailsampling: @jpkrohling

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Liubey · 2023-10-26T08:01:02Z

I saw #23648. Could you do me a favor?
@mizhexiaoxiao are the spans of a trace deleted after decision_wait? What does the collector do when it receives part of a dropped trace after decision_wait?

jpkrohling · 2023-10-26T08:13:58Z

1. Spans arriving after the decision_wait (30s in your example) will be evaluated separately, which means that traces can indeed be broken.
2. Similar to the above, if a single span takes longer than the decision wait time but the rest of the trace isn't taking longer than the threshold (10s in your example), the whole trace will be discarded except for the span that arrived late and was evaluated individually.

"I want to collect ALL requests whose duration is more than 10 seconds."

Unfortunately, this is not possible with the current implementation: decisions are made at "decision_wait" and the decision isn't recorded.

Liubey · 2023-10-26T08:41:46Z

I have a doubt about point 2.
Suppose I create a trace like this:

root span(50000.5ms)
   span1(0.1ms)
   span2(0.1ms)
   span3(0.1ms)
   span4(50000ms)
   span5(0.1ms)
   span6(0.1ms)

"the whole trace will be discarded except for the span that arrived late and was evaluated individually"
So will I get the trace as follows, or will I get nothing of the trace?

root span(50000.5ms)
   span4(50000ms)
   span5(0.1ms)
   span6(0.1ms)

@jpkrohling thank you so much.

jpkrohling · 2023-10-26T09:11:49Z

You get the 50s span and everything that arrives within the decision wait for it.
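
The behaviour described in this exchange can be illustrated with a toy model in Python. It is not the processor's code. It assumes that a span reaches the collector when it ends, that a batch of spans is decided decision_wait after its first span arrives, that later arrivals start a new batch, and that the latency policy compares the threshold with the time between the earliest start and the latest end inside the batch. The names Span, evaluation_batches and simulate are made up, and the start and end times are an assumed sequential layout of the example trace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    start_ms: float
    end_ms: float  # assumed to also be the arrival time at the collector


def evaluation_batches(spans, decision_wait_ms):
    """Group spans by arrival: a batch closes decision_wait_ms after its first span."""
    batches, deadline = [], None
    for span in sorted(spans, key=lambda s: s.end_ms):
        if deadline is None or span.end_ms > deadline:
            batches.append([])  # late arrival: evaluated as a separate batch
            deadline = span.end_ms + decision_wait_ms
        batches[-1].append(span)
    return batches


def simulate(spans, decision_wait_ms, threshold_ms):
    """Return the names of spans kept by a latency policy in this toy model."""
    kept = []
    for batch in evaluation_batches(spans, decision_wait_ms):
        duration = max(s.end_ms for s in batch) - min(s.start_ms for s in batch)
        if duration >= threshold_ms:
            kept.extend(s.name for s in batch)
    return kept


# The example trace from this thread, laid out sequentially (assumed timings).
TRACE = [
    Span("root", 0.0, 50000.5),
    Span("span1", 0.0, 0.1),
    Span("span2", 0.1, 0.2),
    Span("span3", 0.2, 0.3),
    Span("span4", 0.3, 50000.3),
    Span("span5", 50000.3, 50000.4),
    Span("span6", 50000.4, 50000.5),
]

if __name__ == "__main__":
    # decision_wait=30s, threshold=10s: span1..span3 are decided (and dropped) early.
    print(sorted(simulate(TRACE, 30_000, 10_000)))  # ['root', 'span4', 'span5', 'span6']
    # A decision_wait longer than the trace keeps all seven spans together.
    print(len(simulate(TRACE, 60_000, 10_000)))  # 7
```

In this model the early spans form a 0.3ms batch that fails the 10s threshold, while the late batch contains the root span and therefore passes it, which matches the broken trace sketched above.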

Liubey · 2023-10-27T07:36:41Z

@jpkrohling sorry to disturb you again.
I am on collector version 0.77.0.
I created a long span with a duration of 50s, with decision_wait=30s.
According to your description, I should get this incomplete trace:

root span(50000.5ms)
   span4(50000ms)
   span5(0.1ms)
   span6(0.1ms)

But I get nothing, even after testing many times.
I am also confused that the collector logs do not print "notSampled"; it seems like these spans (span4, span5, span6 and the root span) have disappeared.

jpkrohling · 2023-11-01T12:47:53Z

I tried the following and it seemed to work:

receivers:
  otlp:
    protocols:
      grpc:

processors:
  tail_sampling:
    decision_wait: 5s
    num_traces: 50_000
    expected_new_traces_per_sec: 10_000
    policies:
      [
          {
            name: longer-than-1s,
            type: latency,
            latency: {threshold_ms: 1_000}
          },
      ]

exporters:
  logging:

connectors:

service:
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - tail_sampling
      exporters:
        - logging

I changed the telemetrygen tool to generate the root span with a duration of 10s:

diff --git a/cmd/telemetrygen/internal/traces/worker.go b/cmd/telemetrygen/internal/traces/worker.go
index cbcdc09ac1..6ae1c0abe8 100644
--- a/cmd/telemetrygen/internal/traces/worker.go
+++ b/cmd/telemetrygen/internal/traces/worker.go
@@ -35,7 +35,7 @@ type worker struct {
 const (
        fakeIP string = "1.2.3.4"
 
-       fakeSpanDuration = 123 * time.Microsecond
+       fakeSpanDuration = 10 * time.Second
 
        charactersPerMB = 1024 * 1024 // One character takes up one byte of space, so this number comes from the number of bytes in a megabyte
 )

I ran it with go run ./ traces --traces 100 --otlp-insecure and got all traces sampled (I ran it twice):

> curl -s localhost:8888/metrics | grep otelcol_processor_tail_sampling_count_traces_sampled
# HELP otelcol_processor_tail_sampling_count_traces_sampled Count of traces that were sampled or not
# TYPE otelcol_processor_tail_sampling_count_traces_sampled counter
otelcol_processor_tail_sampling_count_traces_sampled{policy="longer-than-1s",sampled="true",service_instance_id="6b7aa468-9b96-4397-b37c-ee59d74f7fe8",service_name="otelcol-contrib",service_version="0.87.0"} 200
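
For repeated test runs, the counter above can also be read programmatically. The sketch below is a minimal parser for lines in this text format; it assumes the metrics text has already been fetched (for example with the curl command above) and only handles simple labelled samples like the one shown. sampled_counts is a hypothetical helper name, not a collector API.

```python
import re

# One labelled sample: metric_name{label="value",...} number
SAMPLE_LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

METRIC = "otelcol_processor_tail_sampling_count_traces_sampled"


def sampled_counts(metrics_text, metric=METRIC):
    """Sum the counter per (policy, sampled) label pair."""
    counts = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        key = (labels.get("policy"), labels.get("sampled"))
        counts[key] = counts.get(key, 0.0) + float(match.group("value"))
    return counts


# The output quoted above, used as sample input.
EXAMPLE = '''# HELP otelcol_processor_tail_sampling_count_traces_sampled Count of traces that were sampled or not
# TYPE otelcol_processor_tail_sampling_count_traces_sampled counter
otelcol_processor_tail_sampling_count_traces_sampled{policy="longer-than-1s",sampled="true",service_instance_id="6b7aa468-9b96-4397-b37c-ee59d74f7fe8",service_name="otelcol-contrib",service_version="0.87.0"} 200
'''

if __name__ == "__main__":
    print(sampled_counts(EXAMPLE))  # {('longer-than-1s', 'true'): 200.0}
```

Comparing the sampled="true" and sampled="false" counts after each run shows whether the latency policy made a decision for the generated traces at all.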

Can you please check if the example is consistent with your use-case?

Liubey added the "needs triage" (New item requiring triage) label Oct 26, 2023

github-actions bot added the "processor/tailsampling" (Tail sampling processor) label Oct 26, 2023

jpkrohling removed the "needs triage" (New item requiring triage) label Oct 26, 2023

jpkrohling closed this as completed Oct 26, 2023

github-actions bot mentioned this issue Oct 31, 2023

Weekly Report: 2023-10-24 - 2023-10-31 #28813

Closed
