Add new tail sampling processor policy: probabilistic #3876
```diff
@@ -15,6 +15,7 @@ Multiple policies exist today and it is straight forward to add more. These incl
 - `always_sample`: Sample all traces
 - `latency`: Sample based on the duration of the trace. The duration is determined by looking at the earliest start time and latest end time, without taking into consideration what happened in between.
 - `numeric_attribute`: Sample based on number attributes
+- `percentage`: Sample a percentage of traces. Only traces that have not been sampled yet by another policy are taken into account.
```
Why would I use this instead of the probabilistic sampling processor?
https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/probabilisticsamplerprocessor
Oh, I wasn't aware of the probabilistic sampling processor (for some reason) 🤦🏻 I'll take a look at it and make sure our terminology is consistent.
I think the main difference is when the sampling decision happens. If you only want percentage or probabilistic sampling it doesn't make sense to use this tail sampling processor.
But combining this with other tail sampling policies makes it more meaningful. For instance:
```yaml
policies:
  [
    {
      name: all-errors,
      type: status_code,
      status_code: {status_codes: [ERROR]}
    },
    {
      name: half-of-remaining,
      type: percentage,
      percentage: {percentage: 0.5}
    },
  ]
```
This pipeline would sample all traces with status code `ERROR` and 50% of the remaining traces. Using the probabilistic sampling processor instead, you would risk dropping traces with errors.
Good point. Make sure to mention that in the readme then.
Is there a difference between this and just putting the probabilistic sampling processor after tail sampling at 50%?
The result will be slightly different: putting the probabilistic sampling processor after the tail sampling processor will filter what the tail sampling processor samples.
The percentage sampler as implemented here will filter what is not sampled by another policy.
So for instance the pipeline:

tail sampling (sample all errors) -> probabilistic (at 50%)

Result: 50% of traces with errors.
Why:
- the tail sampler drops every non-error trace
- the probabilistic sampler drops 50% of what the tail sampler returns

While the following:

tail sampling (sample all errors -> sample 50%)

Result: all traces with errors + 50% of traces without.
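To make the difference concrete, here is a toy Go sketch (not the processor's real code). The `trace` struct, the function names, and the deterministic even-ID filter `keepHalf` (standing in for the real trace ID hash at 50%) are all illustrative assumptions:

```go
package main

import "fmt"

// trace is a minimal stand-in for a batch of spans sharing a trace ID.
type trace struct {
	id       int
	hasError bool
}

// keepHalf is a hypothetical deterministic 50% filter (even IDs survive);
// the real processors hash the trace ID instead of checking parity.
func keepHalf(t trace) bool { return t.id%2 == 0 }

// probabilisticAfterTail models: tail sampling (errors only) followed by the
// probabilistic processor at 50%. The second stage filters what the first kept.
func probabilisticAfterTail(traces []trace) int {
	kept := 0
	for _, t := range traces {
		if t.hasError && keepHalf(t) {
			kept++
		}
	}
	return kept
}

// percentagePolicyInTail models one tail sampling processor with two policies
// (errors, 50%): a trace is kept if any policy samples it.
func percentagePolicyInTail(traces []trace) int {
	kept := 0
	for _, t := range traces {
		if t.hasError || keepHalf(t) {
			kept++
		}
	}
	return kept
}

func makeTraces() []trace {
	traces := make([]trace, 0, 100)
	for i := 0; i < 100; i++ {
		traces = append(traces, trace{id: i, hasError: i < 10}) // first 10 are errors
	}
	return traces
}

func main() {
	traces := makeTraces()
	fmt.Println(probabilisticAfterTail(traces)) // 5: half of the 10 error traces
	fmt.Println(percentagePolicyInTail(traces)) // 55: all 10 errors + 45 of the rest
}
```

With 100 traces of which 10 are errors, chaining the processors keeps only 5 traces, while the two-policy tail sampler keeps 55.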
This PR was marked stale due to lack of activity. It will be closed in 7 days.

Not stale, I've been busy with some other stuff but I will continue working on this soon 🤞🏻

This PR was marked stale due to lack of activity. It will be closed in 7 days.
# Conflicts: # CHANGELOG.md
Use sampling_percentage instead of percentage
Make sampling_percentage a value between 0 and 100
Looking at how the probabilistic sampling processor is implemented, I went for a very similar approach: instead of keeping a counter, we can sample based on the hashed trace ID. This is stateless (easier code) and allows the tail sampling processor to run distributed (i.e. if you have multiple collectors behind the trace ID aware load-balancing exporter). This does not take into account already sampled traces, so the effective sampling percentage might be higher if other policies sample a lot. But the probabilistic sampling processor has the same behavior.

I dislike that I basically had to copy a large part of the probabilistic sampling processor. Maybe we can extract the hashing and sampling part into a shared library? otel-go also does trace ID-based sampling and might benefit from this as well.
Running the probabilistic sampling processor is more efficient than the tail sampling processor.
The probabilistic sampling policy makes decisions based upon the trace ID, so waiting until more spans have arrived will not influence its decision.

...you are already using the tail sampling processor: add the probabilistic sampling policy as last in your chain.
Is adding the probabilistic sampling policy at the head of the chain considered a configuration error? If so, should a warning be generated?
Hmm, after thinking some more this isn't actually true: the order does not matter because the tail sampling processor will always run all policies. If one of the policies returns `Sampled`, it will keep the trace. I was thinking too much of this as a pipeline.
Jaeger does something very similar under the name of downsampling at the storage plugin: https://github.com/jaegertracing/jaeger/blob/c5642b708be9e2577cc1c494889ae97946ccc78a/storage/spanstore/downsampling_writer.go#L131-L140 It uses fnv64a, which would be my preference for this one here as well.
Great idea. I think we are using CRC for the load balancing exporter, but we should settle on the same across the board. My choice would be FNV instead of Murmur, though. Would you prefer to work on the refactoring first and then change this PR, or to get this PR merged first and do the refactoring later?
Sounds good 👍 The implementation will be easier as well since it's in the standard library.
How about I switch this sampler to fnv64a, then we can (finally) merge this PR and look at refactoring/combining logic?
Sounds good to me.
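The fnv64a-based decision discussed above can be sketched in a few lines using Go's standard `hash/fnv` package. This is only an illustrative sketch; the function name and signature are assumptions, not the processor's actual API:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample hashes a trace ID with FNV-1a and compares the result against
// a threshold derived from the sampling percentage, similar in spirit to
// Jaeger's downsampling writer. The decision is stateless and deterministic,
// so multiple collectors behind a trace ID aware load balancer agree on it.
func shouldSample(traceID []byte, samplingPercentage float64) bool {
	if samplingPercentage >= 100 {
		return true
	}
	h := fnv.New64a()
	h.Write(traceID)
	// Map the percentage onto the 64-bit hash space.
	threshold := uint64(samplingPercentage / 100 * float64(^uint64(0)))
	return h.Sum64() < threshold
}

func main() {
	id := []byte("0123456789abcdef")
	// Same ID, same percentage, same decision on every collector.
	fmt.Println(shouldSample(id, 50) == shouldSample(id, 50)) // true
}
```

Because the threshold grows with the percentage, a trace sampled at 10% is also sampled at any higher percentage, which keeps decisions consistent when the configuration is raised.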
Not sure why CI is failing; it might be a performance issue. Tests work locally but take 13s, and I'm guessing the runners are slower than my machine.
I'll lower the size of the test.
Please rebase
# Conflicts: # CHANGELOG.md
Done 👍 Would you like me to squash the commits, or does the repo squash on merge?
* Add new tail sampling processor policy: percentage
* Update CHANGELOG.md
* Fix tail_sampling_config.yaml
* Fix typo
* Reset counters to avoid overflow, improve tests
* Move into internal as well
* Use sampling_percentage similar to the probabilistic sampling processor

  Use sampling_percentage instead of percentage
  Make sampling_percentage a value between 0 and 100
* Combine tests in a single table-test
* Rename percentage -> probabilistic for consistency with probabilistic processor
* Add IncludeAlreadySampled option
* Clarify test cases
* Use the same algorithm as the probabilistic sampling processor
* Simplify if-else case
* Typo
* Order of tail sampling policies does not matter
* Switch hashing implementation from murmur3 to fnv-1a
* Lower amount of traces to sample in test to speed up tests
Description:
This adds a new tail sampling policy that samples a percentage of traces.