
Add new tail sampling processor policy: probabilistic #3876


Merged: bogdandrutu merged 19 commits into open-telemetry:main from percentage_sampling on Aug 19, 2021

Conversation

@yvrhdn (Contributor) commented Jun 24, 2021

Description:
This adds a new tail sampling policy that samples a percentage of traces.

@yvrhdn yvrhdn requested a review from jpkrohling as a code owner June 24, 2021 20:27
@yvrhdn yvrhdn requested a review from a team June 24, 2021 20:27
@@ -15,6 +15,7 @@ Multiple policies exist today and it is straight forward to add more. These incl
- `always_sample`: Sample all traces
- `latency`: Sample based on the duration of the trace. The duration is determined by looking at the earliest start time and latest end time, without taking into consideration what happened in between.
- `numeric_attribute`: Sample based on number attributes
- `percentage`: Sample a percentage of traces. Only traces that have not been sampled yet by another policy are taken into account.
yvrhdn (Contributor Author):

Oh, wasn't aware of the probabilistic sampling processor (for some reason) 🤦🏻 I'll take a look at it and make sure our terminology is consistent.

I think the main difference is when the sampling decision happens. If you only want percentage/probabilistic sampling, it doesn't make sense to use the tail sampling processor at all.
But combining this with other tail sampling policies makes it more meaningful. For instance:

    policies:
      [
          {
            name: all-errors,
            type: status_code,
            status_code: {status_codes: [ERROR]}
          },
          {
            name: half-of-remaining,
            type: percentage,
            percentage: {percentage: 0.5}
          },
      ]

This pipeline would sample all traces with status code error, plus 50% of the remaining traces. Using the probabilistic sampling processor instead, you would risk dropping traces with errors.

Member:

Good point. Make sure to mention that in the README, then.

Contributor:

Is there a difference between this and just putting the probabilistic sampling processor after tail sampling at 50%?

yvrhdn (Contributor Author):

The result will be slightly different: putting the probabilistic sampling processor after the tail sampling processor filters what the tail sampling processor has sampled.
The percentage policy as implemented here filters what has not been sampled by any other policy.

So, for instance, this pipeline:

    tail sampling (sample all errors) -> probabilistic (at 50%)

Result: 50% of traces with errors.
Why:

  - the tail sampler drops every non-error trace
  - the probabilistic sampler drops 50% of what the tail sampler returns

While the following:

    tail sampling (sample all errors -> sample 50%)

Result: all traces with errors + 50% of traces without.
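
As a rough config sketch of the two setups (the names probabilistic_sampler, probabilistic, and sampling_percentage follow the naming this PR converged on, but the exact config shape here is an assumption, not verified documentation):

    # Sketch A: probabilistic sampler processor placed after tail sampling.
    # Only the traces the tail sampler kept are then reduced to 50%.
    processors:
      tail_sampling:
        policies:
          [
            {name: all-errors, type: status_code, status_code: {status_codes: [ERROR]}}
          ]
      probabilistic_sampler:
        sampling_percentage: 50

    # Sketch B: probabilistic policy inside the tail sampling processor.
    # All error traces are kept, plus 50% of everything else.
    processors:
      tail_sampling:
        policies:
          [
            {name: all-errors, type: status_code, status_code: {status_codes: [ERROR]}},
            {name: half-of-remaining, type: probabilistic, probabilistic: {sampling_percentage: 50}}
          ]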

github-actions bot commented Jul 8, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot commented Jul 20, 2021
This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jul 20, 2021
@yvrhdn (Contributor Author) commented Jul 20, 2021

Not stale; I've been busy with some other things, but I will continue working on this soon 🤞🏻

@jpkrohling jpkrohling removed the Stale label Jul 21, 2021
github-actions bot commented Jul 31, 2021
This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jul 31, 2021
Koenraad Verheyden added 2 commits August 4, 2021 20:24
Koenraad Verheyden added 2 commits August 6, 2021 21:13
Use sampling_percentage instead of percentage
Make sampling_percentage a value between 0 and 100
@yvrhdn yvrhdn marked this pull request as draft August 6, 2021 19:46
@yvrhdn yvrhdn changed the title Add new tail sampling processor policy: percentage Add new tail sampling processor policy: probabilistic Aug 9, 2021
@yvrhdn yvrhdn marked this pull request as ready for review August 9, 2021 07:40
@yvrhdn (Contributor Author) commented Aug 9, 2021

Looking at how the probabilistic sampling processor is implemented, I went for a very similar approach: instead of keeping a counter, we can sample based on the hashed trace ID. This is stateless (simpler code) and allows the tail sampling processor to run distributed (i.e. if you have multiple collectors behind the trace-ID-aware load-balancing exporter).

This does not take already-sampled traces into account, so the effective sampling percentage might be higher if other policies sample a lot. But the probabilistic sampling processor behaves the same way with its sampling.priority handling.

Example:

policies:
  latency:        samples 10% of traces
  probabilistic:  samples 25% of traces

effective sampling percentage: 10% + (25% of the remaining 90%) = 32.5%

I dislike that I basically had to copy a large part of the probabilistic sampling processor. Maybe we can extract the hashing and sampling part into a shared library? otel-go also does trace-ID-based sampling and might also benefit from this.

Running the probabilistic sampling processor is more efficient than the tail sampling processor.
The probabilistic sampling policy makes decisions based on the trace ID, so waiting until more spans have arrived will not influence its decision.

...you are already using the tail sampling processor: add the probabilistic sampling policy as last in your chain.
Member:

Is adding the probabilistic sampling policy at the head of the chain considered a configuration error? If so, should a warning be generated?

yvrhdn (Contributor Author):

Hmm, after thinking about it some more, this isn't actually true: the order does not matter because the tail sampling processor always runs all policies. If any one of the policies returns Sampled, it will keep the trace. I was thinking of this too much as a pipeline.
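
A simplified sketch of that evaluation model (hypothetical Go types and names, not the processor's actual code): the trace is kept if any policy votes Sampled, so policy order has no effect.

    package sampling

    // Decision is a hypothetical per-policy verdict.
    type Decision int

    const (
        NotSampled Decision = iota
        Sampled
    )

    // Policy is a hypothetical stand-in for a tail sampling policy.
    type Policy func(traceID [16]byte) Decision

    // evaluate runs every policy against the trace. The trace is kept
    // if any policy returns Sampled, which is why reordering the
    // policies cannot change the outcome.
    func evaluate(policies []Policy, traceID [16]byte) Decision {
        decision := NotSampled
        for _, policy := range policies {
            if policy(traceID) == Sampled {
                decision = Sampled
            }
        }
        return decision
    }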

@jpkrohling (Member) commented Aug 9, 2021

instead of keeping a counter we can sample based upon the hashed trace ID

Jaeger does something very similar under the name of downsampling at the storage plugin: https://github.com/jaegertracing/jaeger/blob/c5642b708be9e2577cc1c494889ae97946ccc78a/storage/spanstore/downsampling_writer.go#L131-L140

It uses fnv64a, which would be my preference for this one here as well.

Maybe we can extract the hashing and sampling part into a shared library? otel-go also does trace ID-based sampling and might also profit from this.

Great idea. I think we are using CRC for the load-balancing exporter, but we should settle on the same hash across the board. My choice would be FNV instead of Murmur, though. Would you prefer to work on the refactoring first and then change this PR, or to get this PR merged first and do the refactoring later?

@yvrhdn (Contributor Author) commented Aug 9, 2021

It uses fnv64a, which would be my preference for this one here as well.

Sounds good 👍 The implementation will be easier as well since it's in the standard library.
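
For reference, a minimal sketch of trace-ID-keyed sampling with FNV-1a from Go's standard library hash/fnv; the bucket constant and function names are illustrative, not the PR's exact code:

    package sampling

    import "hash/fnv"

    // numBuckets sets the decision resolution (0.001% here); an
    // illustrative constant, not the value used in the PR.
    const numBuckets = 100000

    // shouldSample makes a deterministic decision from the trace ID:
    // FNV-1a hashes the ID, and the resulting bucket is compared
    // against the configured percentage (0-100). Every collector
    // reaches the same verdict for the same trace, so no shared state
    // is needed and the processor can run behind the load-balancing
    // exporter.
    func shouldSample(traceID [16]byte, samplingPercentage float64) bool {
        h := fnv.New64a()
        h.Write(traceID[:]) // hash.Hash.Write never returns an error
        bucket := h.Sum64() % numBuckets
        return float64(bucket) < samplingPercentage/100*numBuckets
    }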

Would you prefer to work on the refactoring first, and then change this PR, or to get this PR merged first, and do the refactoring later?

How about I switch this sampler to fnv64a, then we can (finally) merge this PR and look at refactoring/combining logic?

@jpkrohling (Member) commented Aug 9, 2021

Sounds good to me.

@bogdandrutu bogdandrutu removed the Stale label Aug 9, 2021
@yvrhdn (Contributor Author) commented Aug 9, 2021

Not sure why CI is failing; it might be a performance issue.

Tests pass locally but take 13s; I'm guessing the runners are slower than my machine.

➜ go test -race ./...
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor	1.911s
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/idbatcher	0.848s
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling	13.687s

I'll lower the size of the test.

@bogdandrutu (Member):
Please rebase

# Conflicts:
#	CHANGELOG.md
@yvrhdn (Contributor Author) commented Aug 19, 2021

Done 👍 Would you like me to squash the commits, or does the repo squash on merge?

@bogdandrutu bogdandrutu merged commit c51bea5 into open-telemetry:main Aug 19, 2021
@yvrhdn yvrhdn deleted the percentage_sampling branch August 19, 2021 17:11
Aneurysm9 pushed a commit that referenced this pull request Aug 19, 2021
* Add new tail sampling processor policy: percentage

* Update CHANGELOG.md

* Fix tail_sampling_config.yaml

* Fix typo

* Reset counters to avoid overflow, improve tests

* Move into internal as well

* Use sampling_percentage similar to the probabilistic sampling processor

Use sampling_percentage instead of percentage
Make sampling_percentage a value between 0 and 100

* Combine tests in a single table-test

* Rename percentage -> probabilistic for consistency with probabilistic processor

* Add IncludeAlreadySampled option

* Clarify test cases

* Use the same algorithm as the probabilistic sampling processor

* Simplify if-else case

* Typo

* Order of tail sampling policies does not matter

* Switch hashing implementation from murmur3 to fnv-1a

* Lower amount of traces to sample in test to speed up tests