
Add new tail sampling processor policy: probabilistic #3876


Merged: bogdandrutu merged 19 commits into open-telemetry:main from percentage_sampling on Aug 19, 2021

Conversation

@yvrhdn (Contributor) commented Jun 24, 2021

Description:
This adds a new tail sampling policy that samples a percentage of traces.

@yvrhdn yvrhdn requested a review from jpkrohling as a code owner June 24, 2021 20:27
@yvrhdn yvrhdn requested a review from a team June 24, 2021 20:27
@@ -15,6 +15,7 @@ Multiple policies exist today and it is straight forward to add more. These incl
- `always_sample`: Sample all traces
- `latency`: Sample based on the duration of the trace. The duration is determined by looking at the earliest start time and latest end time, without taking into consideration what happened in between.
- `numeric_attribute`: Sample based on number attributes
- `percentage`: Sample a percentage of traces. Only traces that have not been sampled yet by another policy are taken into account.
yvrhdn (Contributor Author):

Oh, wasn't aware of the probabilistic sampling processor (for some reason) 🤦🏻 I'll take a look at it and make sure our terminology is consistent.

I think the main difference is when the sampling decision happens. If you only want percentage/probabilistic sampling, it doesn't make sense to use the tail sampling processor at all.
But combining this with other tail sampling policies makes it more meaningful. For instance:

    policies:
      [
          {
            name: all-errors,
            type: status_code,
            status_code: {status_codes: [ERROR]}
          },
          {
            name: half-of-remaining,
            type: percentage,
            percentage: {percentage: 0.5}
          },
      ]

This pipeline would sample all traces with status code error, plus 50% of the remaining traces. Using the probabilistic sampling processor instead, you would risk dropping traces with errors.

Member:

Good point. Make sure to mention that in the README, then.

Contributor:

Is there a difference between this and just putting the probabilistic sampling processor after tail sampling at 50%?

yvrhdn (Contributor Author):

The result will be slightly different: putting the probabilistic sampling processor after the tail sampling processor filters what the tail sampling processor has sampled.
The percentage policy as implemented here filters what has not been sampled by any other policy.

So, for instance, this pipeline:

    tail sampling (sample all errors) -> probabilistic (at 50%)

Result: 50% of traces with errors.
Why:

  - the tail sampler drops every non-error trace
  - the probabilistic sampler drops 50% of what the tail sampler returns

While the following:

    tail sampling (sample all errors -> sample 50%)

Result: all traces with errors + 50% of traces without.
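
As a rough config sketch of the two setups (the names probabilistic_sampler, probabilistic, and sampling_percentage follow the naming this PR converged on, but the exact config shape here is an assumption, not verified documentation):

    # Sketch A: probabilistic sampler processor placed after tail sampling.
    # Only the traces the tail sampler kept are then reduced to 50%.
    processors:
      tail_sampling:
        policies:
          [
            {name: all-errors, type: status_code, status_code: {status_codes: [ERROR]}}
          ]
      probabilistic_sampler:
        sampling_percentage: 50

    # Sketch B: probabilistic policy inside the tail sampling processor.
    # All error traces are kept, plus 50% of everything else.
    processors:
      tail_sampling:
        policies:
          [
            {name: all-errors, type: status_code, status_code: {status_codes: [ERROR]}},
            {name: half-of-remaining, type: probabilistic, probabilistic: {sampling_percentage: 50}}
          ]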

github-actions bot commented Jul 8, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot commented Jul 20, 2021
This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jul 20, 2021
@yvrhdn (Contributor Author) commented Jul 20, 2021

Not stale; I've been busy with some other things, but I will continue working on this soon 🤞🏻

@jpkrohling jpkrohling removed the Stale label Jul 21, 2021
github-actions bot commented Jul 31, 2021
This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jul 31, 2021
Koenraad Verheyden added 2 commits August 4, 2021 20:24
Koenraad Verheyden added 2 commits August 6, 2021 21:13
Use sampling_percentage instead of percentage
Make sampling_percentage a value between 0 and 100
@yvrhdn yvrhdn marked this pull request as draft August 6, 2021 19:46
@yvrhdn yvrhdn changed the title Add new tail sampling processor policy: percentage Add new tail sampling processor policy: probabilistic Aug 9, 2021
@yvrhdn yvrhdn marked this pull request as ready for review August 9, 2021 07:40
@yvrhdn (Contributor Author) commented Aug 9, 2021

Looking at how the probabilistic sampling processor is implemented, I went for a very similar approach: instead of keeping a counter, we can sample based on the hashed trace ID. This is stateless (simpler code) and allows the tail sampling processor to run distributed (i.e. if you have multiple collectors behind the trace-ID-aware load-balancing exporter).

This does not take already-sampled traces into account, so the effective sampling percentage might be higher if other policies sample a lot. But the probabilistic sampling processor behaves the same way with its sampling.priority handling.

Example:

policies:
  latency:        samples 10% of traces
  probabilistic:  samples 25% of traces

effective sampling percentage: 10% + (25% of the remaining 90%) = 32.5%

I dislike that I basically had to copy a large part of the probabilistic sampling processor. Maybe we can extract the hashing and sampling part into a shared library? otel-go also does trace-ID-based sampling and might also benefit from this.

Running the probabilistic sampling processor is more efficient than the tail sampling processor.
The probabilistic sampling policy makes decisions based on the trace ID, so waiting until more spans have arrived will not influence its decision.

...you are already using the tail sampling processor: add the probabilistic sampling policy as last in your chain.
Member:

Is adding the probabilistic sampling policy at the head of the chain considered a configuration error? If so, should a warning be generated?

yvrhdn (Contributor Author):

Hmm, after thinking about it some more, this isn't actually true: the order does not matter because the tail sampling processor always runs all policies. If any one of the policies returns Sampled, it will keep the trace. I was thinking of this too much as a pipeline.
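
A simplified sketch of that evaluation model (hypothetical Go types and names, not the processor's actual code): the trace is kept if any policy votes Sampled, so policy order has no effect.

    package sampling

    // Decision is a hypothetical per-policy verdict.
    type Decision int

    const (
        NotSampled Decision = iota
        Sampled
    )

    // Policy is a hypothetical stand-in for a tail sampling policy.
    type Policy func(traceID [16]byte) Decision

    // evaluate runs every policy against the trace. The trace is kept
    // if any policy returns Sampled, which is why reordering the
    // policies cannot change the outcome.
    func evaluate(policies []Policy, traceID [16]byte) Decision {
        decision := NotSampled
        for _, policy := range policies {
            if policy(traceID) == Sampled {
                decision = Sampled
            }
        }
        return decision
    }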

@jpkrohling (Member) commented Aug 9, 2021

instead of keeping a counter we can sample based upon the hashed trace ID

Jaeger does something very similar under the name of downsampling at the storage plugin: https://github.com/jaegertracing/jaeger/blob/c5642b708be9e2577cc1c494889ae97946ccc78a/storage/spanstore/downsampling_writer.go#L131-L140

It uses fnv64a, which would be my preference for this one here as well.

Maybe we can extract the hashing and sampling part into a shared library? otel-go also does trace ID-based sampling and might also profit from this.

Great idea. I think we are using CRC for the load-balancing exporter, but we should settle on the same hash across the board. My choice would be FNV instead of Murmur, though. Would you prefer to work on the refactoring first and then change this PR, or to get this PR merged first and do the refactoring later?

@yvrhdn (Contributor Author) commented Aug 9, 2021

It uses fnv64a, which would be my preference for this one here as well.

Sounds good 👍 The implementation will be easier as well since it's in the standard library.
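
For reference, a minimal sketch of trace-ID-keyed sampling with FNV-1a from Go's standard library hash/fnv; the bucket constant and function names are illustrative, not the PR's exact code:

    package sampling

    import "hash/fnv"

    // numBuckets sets the decision resolution (0.001% here); an
    // illustrative constant, not the value used in the PR.
    const numBuckets = 100000

    // shouldSample makes a deterministic decision from the trace ID:
    // FNV-1a hashes the ID, and the resulting bucket is compared
    // against the configured percentage (0-100). Every collector
    // reaches the same verdict for the same trace, so no shared state
    // is needed and the processor can run behind the load-balancing
    // exporter.
    func shouldSample(traceID [16]byte, samplingPercentage float64) bool {
        h := fnv.New64a()
        h.Write(traceID[:]) // hash.Hash.Write never returns an error
        bucket := h.Sum64() % numBuckets
        return float64(bucket) < samplingPercentage/100*numBuckets
    }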

Would you prefer to work on the refactoring first, and then change this PR, or to get this PR merged first, and do the refactoring later?

How about I switch this sampler to fnv64a, then we can (finally) merge this PR and look at refactoring/combining logic?

@jpkrohling (Member) commented Aug 9, 2021

Sounds good to me.

@bogdandrutu bogdandrutu removed the Stale label Aug 9, 2021
@yvrhdn (Contributor Author) commented Aug 9, 2021

Not sure why CI is failing; it might be a performance issue.

Tests pass locally but take 13s; I'm guessing the runners are slower than my machine.

➜ go test -race ./...
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor	1.911s
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/idbatcher	0.848s
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling	13.687s

I'll lower the size of the test.

@bogdandrutu (Member):
Please rebase

# Conflicts:
#	CHANGELOG.md
@yvrhdn (Contributor Author) commented Aug 19, 2021

Done 👍 Would you like me to squash the commits, or does the repo squash on merge?

@bogdandrutu bogdandrutu merged commit c51bea5 into open-telemetry:main Aug 19, 2021
@yvrhdn yvrhdn deleted the percentage_sampling branch August 19, 2021 17:11
Aneurysm9 pushed a commit that referenced this pull request Aug 19, 2021
* Add new tail sampling processor policy: percentage

* Update CHANGELOG.md

* Fix tail_sampling_config.yaml

* Fix typo

* Reset counters to avoid overflow, improve tests

* Move into internal as well

* Use sampling_percentage similar to the probabilistic sampling processor

Use sampling_percentage instead of percentage
Make sampling_percentage a value between 0 and 100

* Combine tests in a single table-test

* Rename percentage -> probabilistic for consistency with probabilistic processor

* Add IncludeAlreadySampled option

* Clarify test cases

* Use the same algorithm as the probabilistic sampling processor

* Simplify if-else case

* Typo

* Order of tail sampling policies does not matter

* Switch hashing implementation from murmur3 to fnv-1a

* Lower amount of traces to sample in test to speed up tests