Refactor the probabilistic sampler processor; add FailClosed configuration, prepare for OTEP 235 support (#31946)
**Description:**
Refactors the probabilistic sampling processor to prepare it for more
OTEP 235 support.
This clarifies existing inconsistencies between tracing and logging
samplers, see the updated README. The tracing priority mechanism applies
a 0% or 100% sampling override (e.g., "1" implies 100% sampling),
whereas the logging sampling priority mechanism supports
variable-probability override (e.g., "1" implies 1% sampling).
This pins down cases where no randomness is available, and organizes the
code to improve readability. A new type called `randomnessNamer` carries
the randomness information (from the sampling package) and the name of
the policy that derived it. When sampling priority causes the effective
sampling probability to change, the value "sampling.priority" replaces
the source of randomness, which is currently limited to "trace_id_hash"
or, for logs, the name of the randomness-source attribute.
While working on #31894, I discovered that some inputs fall through to
the hash function with zero bytes of input randomness. The hash
function, computed on an empty input (for logs) or on 16 bytes of zeros
(which OTel calls an invalid trace ID), produces a fixed "random"
value. So, for example, when logs are sampled with no TraceID and no
randomness attribute value, items pass the sampler only when the
configured sampling percentage is approximately 82.9% or above.
In the refactored code, an error is returned when there is no input
randomness. A new boolean configuration field determines the outcome
when there is an error extracting randomness from an item of telemetry.
By default, items of telemetry with errors will not pass through the
sampler. When `FailClosed` is set to false, items of telemetry with
errors will pass through the sampler.
The original hash function, which uses 14 bits of information, is
structured as an "acceptance threshold": the test for sampling
translates into a positive decision when `Randomness <
AcceptThreshold`. In the OTEP 235 scheme, thresholds are rejection
thresholds, so this PR converts the original 14-bit accept threshold
into a 56-bit reject threshold, using the Threshold and Randomness
types from the sampling package. Reframed in this way, the subsequent
PR (i.e., #31894) can seamlessly convey the effective sampling
probability using OTEP 235 semantic conventions.
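The arithmetic of the reframing can be sketched as follows (illustrative only; the actual code uses the Threshold and Randomness types from the sampling package rather than raw integers):

```go
package main

import "fmt"

func main() {
	p := 0.25 // configured sampling probability

	// Old framing: a 14-bit acceptance threshold; an item is kept
	// when Randomness < accept14.
	accept14 := uint64(p * (1 << 14))

	// OTEP 235 framing: a 56-bit rejection threshold; an item is
	// kept when Randomness >= reject56, i.e. the top p fraction of
	// the 56-bit randomness space.
	reject56 := uint64((1 - p) * (1 << 56))

	fmt.Println(accept14, reject56)
}
```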
Note, both traces and logs processors are now reduced to a function like
this:
```go
return commonSamplingLogic(
ctx,
l,
lsp.sampler,
lsp.failClosed,
lsp.sampler.randomnessFromLogRecord,
lsp.priorityFunc,
"logs sampler",
lsp.logger,
)
```
which is a generic function that handles the common logic on a
per-item basis and ends in a single metric event. This structure makes
it clear how traces and logs are currently processed differently, with
different prioritization schemes. It also makes it easy to introduce
new sampler modes, as shown in #31894. After this and #31940 merge,
the changes in #31894 will be relatively simple to review as the third
part in a series.
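For illustration, a hedged sketch of the shape such a generic per-item helper can take (all names and signatures here are hypothetical, not the PR's actual declarations):

```go
package main

import "fmt"

// commonLogic is an illustrative stand-in for a generic per-item
// sampling helper: extract randomness, fall back to the FailClosed
// policy on error, and compare against a rejection threshold.
func commonLogic[T any](
	item T,
	randomnessOf func(T) (uint64, error),
	failClosed bool,
	rejectThreshold uint64,
) bool {
	r, err := randomnessOf(item)
	if err != nil {
		// FailClosed decides the outcome when randomness is missing.
		return !failClosed
	}
	return r >= rejectThreshold
}

func main() {
	// Hypothetical randomness extractor, for demonstration only.
	fromLen := func(s string) (uint64, error) { return uint64(len(s)), nil }
	fmt.Println(commonLogic("hello", fromLen, true, 3))
}
```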
**Link to tracking Issue:**
Depends on #31940.
Part of #31918.
**Testing:** Added. Existing tests already cover the exact random
behavior of the current hashing mechanism. Even more testing will be
introduced with the last step of this series. Note that #32360 was
added ahead of this change to ensure that the refactoring does not
change results.
**Documentation:** Added.
---------
Co-authored-by: Kent Quirk <[email protected]>
For trace spans, this sampler supports probabilistic sampling based on
a configured sampling percentage applied to the TraceID. In addition,
the sampler recognizes a `sampling.priority` annotation, which can
force the sampler to apply 0% or 100% sampling.

For log records, this sampler can be configured to use the embedded
TraceID and follow the same logic as applied to spans. When the
TraceID is not defined, the sampler can be configured to apply hashing
to a selected log record attribute. This sampler also supports
sampling priority.
## Consistency guarantee
A consistent probability sampler is a Sampler that supports
independent sampling decisions for each span or log record in a group
(e.g. by TraceID), while maximizing the potential for completeness as
follows.

Consistent probability sampling requires that for any span in a given
trace, if a Sampler with lesser sampling probability selects the span
for sampling, then the span would also be selected by a Sampler
configured with greater sampling probability.
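The guarantee can be sketched numerically (a simplified model, not the processor's exact arithmetic): if each probability maps to a fixed threshold over one shared randomness value, a decision made at a lesser probability implies the same decision at every greater probability.

```go
package main

import "fmt"

// sampled models a consistent decision: keep the item when its
// randomness falls below the acceptance region for probability p.
// Because thresholds grow monotonically with p over one shared
// randomness value, any item kept at a lesser probability is also
// kept at every greater probability.
func sampled(randomness uint64, p float64) bool {
	const scale = 1 << 56
	return randomness < uint64(p*scale)
}

func main() {
	r := uint64(1) << 50 // one item's fixed randomness value
	fmt.Println(sampled(r, 0.01), sampled(r, 0.10), sampled(r, 0.50))
}
```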
## Completeness property
A trace is complete when all of its members are sampled. A
"sub-trace" is complete when all of its descendants are sampled.

Ordinarily, Trace and Logging SDKs configure parent-based samplers,
which decide whether to sample based on the Context, because this
leads to completeness.

When non-root spans or logs make independent sampling decisions
instead of using the parent-based approach (e.g., using the
`TraceIDRatioBased` sampler for a non-root span), incompleteness may
result, and when spans and log records are independently sampled in a
processor, as by this component, the same potential for incompleteness
arises. The consistency guarantee helps minimize this issue.

Consistent probability samplers can be safely used with a mixture of
probabilities and preserve sub-trace completeness, provided that child
spans and log records are sampled with probability greater than or
equal to the parent context.

Using 1%, 10% and 50% probabilities for example, in a consistent
probability scheme the 50% sampler must sample when the 10% sampler
does, and the 10% sampler must sample when the 1% sampler does. A
three-tier system could be configured with 1% sampling in the first
tier, 10% sampling in the second tier, and 50% sampling in the bottom
tier. In this configuration, 1% of traces will be complete, 10% of
traces will be sub-trace complete at the second tier, and 50% of
traces will be sub-trace complete at the third tier, thanks to the
consistency property.

These guidelines should be considered when deploying multiple
collectors with different sampling probabilities in a system. For
example, a collector serving frontend servers can be configured with a
smaller sampling probability than a collector serving backend servers,
without breaking sub-trace completeness.
## Sampling randomness
To achieve consistency, sampling randomness is taken from a
deterministic aspect of the input data. For traces pipelines, the
source of randomness is always the TraceID. For logs pipelines, the
source of randomness can be the TraceID or another log record
attribute, if configured.

For log records, the `attribute_source` and `from_attribute` fields
determine the source of randomness. When `attribute_source` is set to
`traceID`, the TraceID will be used. When `attribute_source` is set
to `record` or the TraceID field is absent, the value of
`from_attribute` is taken as the source of randomness (if configured).
## Sampling priority
The sampling priority mechanism is an override, which takes precedence
over the probabilistic decision in all modes.

🛑 Compatibility note: Logs and Traces have different behavior.

In traces pipelines, when the priority attribute has value 0, the
configured probability will be modified to 0% and the item will not
pass the sampler. When the priority attribute is non-zero, the
configured probability will be set to 100%. The sampling priority
attribute is not configurable, and is called `sampling.priority`.

In logs pipelines, when the priority attribute has value 0, the
configured probability will be modified to 0%, and the item will not
pass the sampler. Otherwise, the logs sampling priority attribute is
interpreted as a percentage, with values >= 100 equal to 100%
sampling. The logs sampling priority attribute is configured via
`sampling_priority`.
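As a hedged illustration of the logs-pipeline behavior described above (a hypothetical helper, not the processor's code):

```go
package main

import "fmt"

// effectiveProbability sketches the logs-pipeline priority override:
// a priority attribute of 0 disables sampling; otherwise the value
// is interpreted as a percentage, capped at 100.
func effectiveProbability(configuredPct, priority float64, hasPriority bool) float64 {
	if !hasPriority {
		return configuredPct
	}
	if priority <= 0 {
		return 0
	}
	if priority >= 100 {
		return 100
	}
	return priority
}

func main() {
	fmt.Println(effectiveProbability(15, 0, true))   // 0: never sample
	fmt.Println(effectiveProbability(15, 50, true))  // 50: overrides 15%
	fmt.Println(effectiveProbability(15, 250, true)) // 100: capped
}
```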
## Sampling algorithm
### Hash seed
The hash seed method uses the FNV hash function applied to either a
Trace ID (spans, log records) or to the value of a specified
attribute (only logs). The hashed value, presumed to be random, is
compared against a threshold value that corresponds with the sampling
percentage.

This mode requires configuring the `hash_seed` field. This mode is
enabled when the `hash_seed` field is not zero, or when log records
are sampled with `attribute_source` set to `record`.

In order for hashing to be consistent, all collectors for a given tier
(e.g. behind the same load balancer) must have the same
`hash_seed`. It is also possible to leverage a different `hash_seed`
at different collector tiers to support additional sampling
requirements.

This mode uses 14 bits of sampling precision.
### Error handling
This processor considers it an error when the arriving data has no
randomness. This includes conditions where the TraceID field is
invalid (16 zero bytes) and where the log record attribute source has
zero bytes of information.

By default, when there are errors determining sampling-related
information from an item of telemetry, the data will be refused. This
behavior can be changed by setting the `fail_closed` property to
false, in which case erroneous data will pass through the processor.
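For example, a minimal sketch using the `fail_closed` option described above to let items that lack randomness pass through the sampler:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10
    fail_closed: false
```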
## Configuration
The following configuration options can be modified:

- `sampling_percentage` (32-bit floating point, required): Percentage at which items are sampled; >= 100 samples all items, 0 rejects all items.
- `hash_seed` (32-bit unsigned integer, optional, default = 0): An integer used to compute the hash algorithm. Note that all collectors for a given tier (e.g. behind the same load balancer) should have the same hash_seed.
- `fail_closed` (boolean, optional, default = true): Whether to reject items with sampling-related errors.

### Logs-specific configuration

- `attribute_source` (string, optional, default = "traceID"): defines where to look for the attribute specified in `from_attribute`. The allowed values are `traceID` or `record`.
- `from_attribute` (string, optional, default = ""): The name of a log record attribute used for sampling purposes, such as a unique log record ID. The value of the attribute is only used if the trace ID is absent or if `attribute_source` is set to `record`.
- `sampling_priority` (string, optional, default = ""): The name of a log record attribute used to set a different sampling priority from the `sampling_percentage` setting. 0 means to never sample the log record, and >= 100 means to always sample the log record.
Examples:
Sample 15% of log records according to trace ID using the OpenTelemetry
specification:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 15
```

Sample log records according to a `logID` attribute:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 15
    attribute_source: record
    from_attribute: logID # value is required if the source is not traceID
```

Give sampling priority to log records according to the attribute named
`priority`:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 15
    sampling_priority: priority
```

## Detailed examples

Refer to [config.yaml](./testdata/config.yaml) for detailed examples
on using the processor.