Skip to content

Commit 187259e

Browse files
felixbarnycarsonip
andauthored
[exporter/elasticsearch] Add _metric_names_hash to avoid metric rejections (#37511)
If metrics that have the same timestamp and dimensions aren't grouped into the same document, ES will consider them to be a duplicate. This adds a hash of the metric names that will be mapped as a dimension in Elasticsearch. The tradeoff is that if the composition of the metrics grouping changes over time, a new time series will be created. That has an impact on the rate aggregation for counters. ES mapping changes: elastic/elasticsearch#120952 --------- Co-authored-by: Carson Ip <[email protected]>
1 parent 63825ea commit 187259e

File tree

5 files changed

+90
-27
lines changed

5 files changed

+90
-27
lines changed
Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
# Use this changelog template to create an entry for release notes.
2+
3+
# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
4+
change_type: enhancement
5+
6+
# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
7+
component: elasticsearchexporter
8+
9+
# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
10+
note: Add `_metric_names_hash` field to metric documents in `otel` mode to avoid metric rejections
11+
12+
# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
13+
issues: [37511]
14+
15+
# (Optional) One or more lines of additional information to render under the primary note.
16+
# These lines will be padded with 2 spaces and then inserted directly into the document.
17+
# Use pipe (|) for multiline entries.
18+
subtext:
19+
20+
# If your change doesn't affect end users or the exported elements of any package,
21+
# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
22+
# Optional: The change log or logs in which this entry should be included.
23+
# e.g. '[user]' or '[user, api]'
24+
# Include 'user' if the change is relevant to end users.
25+
# Include 'api' if there is a change to a library API.
26+
# Default: '[user]'
27+
change_logs: [user]

exporter/elasticsearchexporter/README.md

Lines changed: 37 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -457,23 +457,44 @@ processors:
457457

458458
### version_conflict_engine_exception
459459

460-
Symptom: elasticsearchexporter logs an error "failed to index document" with `error.type` "version_conflict_engine_exception" and `error.reason` containing "version conflict, document already exists".
461-
462-
This happens when the target data stream is a TSDB metrics data stream (e.g. using OTel mapping mode sending to a 8.16+ Elasticsearch). See the following scenarios.
460+
Symptom: `elasticsearchexporter` logs an error "failed to index document" with `error.type` "version_conflict_engine_exception" and `error.reason` containing "version conflict, document already exists".
461+
462+
This happens when the target data stream is a TSDB metrics data stream (e.g. using OTel mapping mode sending to a 8.16+ Elasticsearch).
463+
464+
Elasticsearch [Time Series Data Streams](https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html) requires that there must only be one document per timestamp with the same dimensions.
465+
The purpose is to avoid duplicate data when re-trying a batch of metrics that were previously sent but failed to be indexed.
466+
The dimensions are mostly made up of resource attributes, scope attributes, scope name, attributes, and the unit.
467+
468+
The exporter can only group metrics with the same dimensions into the same document if they arrive in the same batch.
469+
To ensure metrics are not dropped even if they arrive in different batches in the exporter, the exporter adds a fingerprint of the metric names to the document in the `otel` mapping mode.
470+
Note that you'll need to be on a minimum version of Elasticsearch in order for this to take effect 8.16.5, 8.17.3, 8.19.0, 9.0.0.
471+
If you are on an earlier version, either update your Elasticsearch cluster or install this custom component template:
472+
473+
```shell
474+
PUT _component_template/metrics-otel@custom
475+
{
476+
"template": {
477+
"mappings": {
478+
"properties": {
479+
"_metric_names_hash": {
480+
"type": "keyword",
481+
"time_series_dimension": true
482+
}
483+
}
484+
}
485+
}
486+
}
487+
```
463488

464-
1. When sending different metrics with the same dimension (mostly made up of resource attributes, scope attributes, attributes),
465-
`version_conflict_engine_exception` is returned by Elasticsearch when these metrics are not grouped into the same document.
466-
It also means that they have to be in the same batch in the exporter, as metric grouping is done per-batch in elasticsearchexporter.
467-
To work around the issue, use a [transform processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/transformprocessor/README.md) to ensure different metrics to never share the same set of dimensions. This is done at the expense of storage efficiency.
468-
This workaround will no longer be necessary once the limitation is lifted in Elasticsearch (see [issue](https://github.com/elastic/elasticsearch/issues/99123)).
489+
While in most situations, this error is just a sign that Elasticsearch's duplicate detection is working as intended, the data may be classified as a duplicate while it was not.
490+
This implies data is lost.
469491

470-
```yaml
471-
processors:
472-
transform/unique_dimensions:
473-
metric_statements:
474-
- context: datapoint
475-
statements:
476-
- set(attributes["metric_name"], metric.name)
477-
```
492+
1. If the data is not sent in `otel` mapping mode to `metrics-*.otel-*` data streams, the metrics name fingerprint is not applied.
493+
This can happen for OTel host and k8s metrics that the [`elasticinframetricsprocessor`](https://github.com/elastic/opentelemetry-collector-components/tree/main/processor/elasticinframetricsprocessor) has translated to the format the host and k8s dashboards in Kibana can consume.
494+
If these metrics arrive in the `elasticsearchexporter` in different batches, they will not be grouped to the same document.
495+
This can cause the `version_conflict_engine_exception` error.
496+
Try to remove the `batchprocessor` from the pipeline (or set `send_batch_max_size: 0`) to ensure metrics are not split into different batches.
497+
This gives the exporter the opportunity to group all related metrics into the same document.
478498

479499
2. Otherwise, check your metrics pipeline setup for misconfiguration that causes an actual violation of the [single writer principle](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#single-writer).
500+
This means that the same metric with the same dimensions is sent from multiple sources, which is not allowed in the OTel metrics data model.

exporter/elasticsearchexporter/exporter_test.go

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1176,19 +1176,19 @@ func TestExporterMetrics(t *testing.T) {
11761176
expected := []itemRequest{
11771177
{
11781178
Action: []byte(`{"create":{"_index":"metrics-generic.otel-default","dynamic_templates":{"metrics.metric.foo":"histogram"}}}`),
1179-
Document: []byte(`{"@timestamp":"0.0","data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"metric.foo":{"counts":[1,2,3,4],"values":[0.5,1.5,2.5,3.0]}},"resource":{},"scope":{}}`),
1179+
Document: []byte(`{"@timestamp":"0.0","data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"metric.foo":{"counts":[1,2,3,4],"values":[0.5,1.5,2.5,3.0]}},"resource":{},"scope":{},"_metric_names_hash":"f7fdad9f"}`),
11801180
},
11811181
{
11821182
Action: []byte(`{"create":{"_index":"metrics-generic.otel-default","dynamic_templates":{"metrics.metric.foo":"histogram"}}}`),
1183-
Document: []byte(`{"@timestamp":"3600000.0","data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"metric.foo":{"counts":[4,5,6,7],"values":[2.0,4.5,5.5,6.0]}},"resource":{},"scope":{}}`),
1183+
Document: []byte(`{"@timestamp":"3600000.0","data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"metric.foo":{"counts":[4,5,6,7],"values":[2.0,4.5,5.5,6.0]}},"resource":{},"scope":{},"_metric_names_hash":"f7fdad9f"}`),
11841184
},
11851185
{
11861186
Action: []byte(`{"create":{"_index":"metrics-generic.otel-default","dynamic_templates":{"metrics.metric.sum":"gauge_double"}}}`),
1187-
Document: []byte(`{"@timestamp":"3600000.0","data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"metric.sum":1.5},"resource":{},"scope":{},"start_timestamp":"7200000.0"}`),
1187+
Document: []byte(`{"@timestamp":"3600000.0","data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"metric.sum":1.5},"resource":{},"scope":{},"start_timestamp":"7200000.0","_metric_names_hash":"6e599000"}`),
11881188
},
11891189
{
11901190
Action: []byte(`{"create":{"_index":"metrics-generic.otel-default","dynamic_templates":{"metrics.metric.summary":"summary"}}}`),
1191-
Document: []byte(`{"@timestamp":"10800000.0","data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"metric.summary":{"sum":1.5,"value_count":1}},"resource":{},"scope":{},"start_timestamp":"10800000.0"}`),
1191+
Document: []byte(`{"@timestamp":"10800000.0","data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"metric.summary":{"sum":1.5,"value_count":1}},"resource":{},"scope":{},"start_timestamp":"10800000.0","_metric_names_hash":"45a9e3cb"}`),
11921192
},
11931193
}
11941194

@@ -1257,7 +1257,7 @@ func TestExporterMetrics(t *testing.T) {
12571257
expected := []itemRequest{
12581258
{
12591259
Action: []byte(`{"create":{"_index":"metrics-generic.otel-default","dynamic_templates":{"metrics.sum":"gauge_long","metrics.summary":"summary"}}}`),
1260-
Document: []byte(`{"@timestamp":"0.0","_doc_count":10,"data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"sum":0,"summary":{"sum":1.0,"value_count":10}},"resource":{},"scope":{}}`),
1260+
Document: []byte(`{"@timestamp":"0.0","_doc_count":10,"data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"sum":0,"summary":{"sum":1.0,"value_count":10}},"resource":{},"scope":{},"_metric_names_hash":"7dc58200"}`),
12611261
},
12621262
}
12631263

@@ -1307,11 +1307,11 @@ func TestExporterMetrics(t *testing.T) {
13071307
expected := []itemRequest{
13081308
{
13091309
Action: []byte(`{"create":{"_index":"metrics-generic.otel-default","dynamic_templates":{"metrics.histogram.summary":"summary"}}}`),
1310-
Document: []byte(`{"@timestamp":"0.0","_doc_count":10,"data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"attributes":{},"metrics":{"histogram.summary":{"sum":1.0,"value_count":10}},"resource":{},"scope":{}}`),
1310+
Document: []byte(`{"@timestamp":"0.0","_doc_count":10,"data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"attributes":{},"metrics":{"histogram.summary":{"sum":1.0,"value_count":10}},"resource":{},"scope":{},"_metric_names_hash":"acbaed6b"}`),
13111311
},
13121312
{
13131313
Action: []byte(`{"create":{"_index":"metrics-generic.otel-default","dynamic_templates":{"metrics.exphistogram.summary":"summary"}}}`),
1314-
Document: []byte(`{"@timestamp":"3600000.0","_doc_count":10,"data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"attributes":{},"metrics":{"exphistogram.summary":{"sum":1.0,"value_count":10}},"resource":{},"scope":{}}`),
1314+
Document: []byte(`{"@timestamp":"3600000.0","_doc_count":10,"data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"attributes":{},"metrics":{"exphistogram.summary":{"sum":1.0,"value_count":10}},"resource":{},"scope":{},"_metric_names_hash":"29641c64"}`),
13151315
},
13161316
}
13171317

@@ -1350,7 +1350,7 @@ func TestExporterMetrics(t *testing.T) {
13501350
expected := []itemRequest{
13511351
{
13521352
Action: []byte(`{"create":{"_index":"metrics-generic.otel-default","dynamic_templates":{"metrics.foo.bar":"gauge_long","metrics.foo":"gauge_long","metrics.foo.bar.baz":"gauge_long"}}}`),
1353-
Document: []byte(`{"@timestamp":"0.0","data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"foo":0,"foo.bar":0,"foo.bar.baz":0},"resource":{},"scope":{}}`),
1353+
Document: []byte(`{"@timestamp":"0.0","data_stream":{"dataset":"generic.otel","namespace":"default","type":"metrics"},"metrics":{"foo":0,"foo.bar":0,"foo.bar.baz":0},"resource":{},"scope":{},"_metric_names_hash":"204c382a"}`),
13541354
},
13551355
}
13561356

exporter/elasticsearchexporter/internal/serializer/otelserializer/metrics.go

Lines changed: 17 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,9 @@ package otelserializer // import "github.com/open-telemetry/opentelemetry-collec
66
import (
77
"bytes"
88
"fmt"
9+
"hash/fnv"
10+
"sort"
11+
"strconv"
912

1013
"github.com/elastic/go-structform"
1114
"github.com/elastic/go-structform/json"
@@ -47,10 +50,11 @@ func serializeDataPoints(v *json.Visitor, dataPoints []datapoints.DataPoint, val
4750

4851
dynamicTemplates := make(map[string]string, len(dataPoints))
4952
var docCount uint64
50-
metricNames := make(map[string]bool, len(dataPoints))
53+
metricNamesSet := make(map[string]bool, len(dataPoints))
54+
metricNames := make([]string, 0, len(dataPoints))
5155
for _, dp := range dataPoints {
5256
metric := dp.Metric()
53-
if _, present := metricNames[metric.Name()]; present {
57+
if _, present := metricNamesSet[metric.Name()]; present {
5458
*validationErrors = append(
5559
*validationErrors,
5660
fmt.Errorf(
@@ -61,7 +65,8 @@ func serializeDataPoints(v *json.Visitor, dataPoints []datapoints.DataPoint, val
6165
)
6266
continue
6367
}
64-
metricNames[metric.Name()] = true
68+
metricNamesSet[metric.Name()] = true
69+
metricNames = append(metricNames, metric.Name())
6570
// TODO here's potential for more optimization by directly serializing the value instead of allocating a pcommon.Value
6671
// the tradeoff is that this would imply a duplicated logic for the ECS mode
6772
value, err := dp.Value()
@@ -86,5 +91,14 @@ func serializeDataPoints(v *json.Visitor, dataPoints []datapoints.DataPoint, val
8691
if docCount != 0 {
8792
writeUIntField(v, "_doc_count", docCount)
8893
}
94+
sort.Strings(metricNames)
95+
hasher := fnv.New32a()
96+
for _, name := range metricNames {
97+
_, _ = hasher.Write([]byte(name))
98+
}
99+
// workaround for https://github.com/elastic/elasticsearch/issues/99123
100+
// should use a string field to benefit from run-length encoding
101+
writeStringFieldSkipDefault(v, "_metric_names_hash", strconv.FormatUint(uint64(hasher.Sum32()), 16))
102+
89103
return dynamicTemplates
90104
}

exporter/elasticsearchexporter/internal/serializer/otelserializer/metrics_test.go

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -56,5 +56,6 @@ func TestSerializeMetricsConflict(t *testing.T) {
5656
"metrics": map[string]any{
5757
"foo": json.Number("42"),
5858
},
59+
"_metric_names_hash": "a9f37ed7",
5960
}, result, eventAsJSON)
6061
}

0 commit comments

Comments
 (0)