Description
Problem
We are looking to add the ability to attach data point labels, specified in the Collector config file, to all metrics that pass through the Collector. This functionality does not currently exist. We request feedback from maintainers on the proposed solutions described below.
Use Case
Our need for this functionality stems from Cortex. For reliability, we want multiple collectors running concurrently and feeding data to Cortex, so that if any collector instance fails for any reason, the others continue collecting metrics and no data is lost. This inevitably means that many duplicate metrics are exported to Cortex; however, Cortex has a deduplication method using “HA labels”. This requires each metric to carry two labels, “cluster” and “__replica__”, and, crucially, each individual metric source (in this case, each collector instance) must have a different “__replica__” value. This is feasible because the Collector config file supports environment variables, so a unique value such as a pod name can be used. Within each cluster, Cortex chooses one “__replica__” to collect metrics from, which resolves the duplicate-metric issue.
More information can be found here: https://cortexmetrics.io/docs/production/ha-pair-handling/
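Cortex's HA deduplication can be sketched roughly as follows: per “cluster”, one “__replica__” is elected and samples from all other replicas are dropped. This is a simplified illustration of the idea, not Cortex's actual implementation; the sample structure and function name are made up for this sketch.

```python
# Simplified sketch of Cortex-style HA deduplication (illustrative only).
# Each sample carries "cluster" and "__replica__" labels; the first replica
# seen per cluster is elected, and samples from other replicas in that
# cluster are dropped.
def dedupe(samples):
    elected = {}  # cluster -> elected replica
    kept = []
    for sample in samples:
        cluster = sample["labels"]["cluster"]
        replica = sample["labels"]["__replica__"]
        elected.setdefault(cluster, replica)
        if elected[cluster] == replica:
            kept.append(sample)
    return kept
```

Real Cortex additionally fails over to another replica if the elected one stops sending samples for some timeout; that logic is omitted here.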
To be more specific, in OTLP format if a metric looks like this:
[
  {
    "resource": {
      "attributes": [
        {
          "key": "service.name",
          "value": {
            "Value": {
              "string_value": "otel-collector"
            }
          }
        }
      ]
    },
    "instrumentation_library_metrics": [
      {
        "metrics": [
          {
            "name": "otelcol_process_cpu_seconds",
            "description": "Total CPU user and system time in seconds",
            "unit": "s",
            "Data": {
              "double_gauge": {
                "data_points": [
                  {
                    "labels": [
                      {
                        "key": "service_instance_id",
                        "value": "xxxx"
                      }
                    ],
                    "time_unix_nano": 1601479169981000000
                  }
                ]
              }
            }
          }
        ]
      }
    ]
  }
]
We want it to look like this after modifications:
[
  {
    "resource": {
      "attributes": [
        {
          "key": "service.name",
          "value": {
            "Value": {
              "string_value": "otel-collector"
            }
          }
        }
      ]
    },
    "instrumentation_library_metrics": [
      {
        "metrics": [
          {
            "name": "otelcol_process_cpu_seconds",
            "description": "Total CPU user and system time in seconds",
            "unit": "s",
            "Data": {
              "double_gauge": {
                "data_points": [
                  {
                    "labels": [
                      {
                        "key": "service_instance_id",
                        "value": "xxxx"
                      },
                      {
                        "key": "cluster",
                        "value": "some value"
                      },
                      {
                        "key": "__replica__",
                        "value": "some unique value"
                      }
                    ],
                    "time_unix_nano": 1601479169981000000
                  }
                ]
              }
            }
          }
        ]
      }
    ]
  }
]
Note that the only difference between the two payloads is the two additional data point labels, “cluster” and “__replica__”.
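The desired transformation can be sketched in Python over the OTLP-style JSON shown above. The function name and structure handling are illustrative only; an actual processor would operate on the Collector's internal pdata representation, not on JSON.

```python
# Illustrative sketch: append configured labels to every data point of every
# metric in an OTLP-style JSON payload like the example above. A real
# processor would work on the Collector's internal representation instead.
def attach_labels(resource_metrics, extra_labels):
    for rm in resource_metrics:
        for ilm in rm.get("instrumentation_library_metrics", []):
            for metric in ilm.get("metrics", []):
                data = metric.get("Data", {})
                for points in data.values():  # e.g. the "double_gauge" entry
                    for dp in points.get("data_points", []):
                        dp.setdefault("labels", []).extend(
                            {"key": k, "value": v}
                            for k, v in extra_labels.items()
                        )
    return resource_metrics
```

The labels to append (e.g. {"cluster": "some value", "__replica__": "some unique value"}) would come from the Collector config file.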
Potential Solutions
(Recommended) Solution 1: New ‘labelprocessor’
A possible solution is a new processor that reads labels specified in the Collector config file and attaches them to the data point labels of all metrics passing through.
Pros:
- Flexibility: labels can be added to any metrics for whatever reason customers may deem fit
- Having the functionality here conforms with the definition of a processor
Cons:
- This will require more engineering effort than the other proposed solutions, which only change already existing components
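A hypothetical configuration for such a processor could look like the following. The processor name, the ‘labels’ key, and the pipeline wiring are all illustrative; none of this exists yet. Environment-variable expansion (here $POD_NAME) is what gives each collector instance a unique “__replica__” value.

```yaml
processors:
  # Hypothetical 'labelprocessor' configuration (does not exist yet).
  # $POD_NAME is expanded from the environment, so each collector
  # instance gets a distinct "__replica__" value.
  label:
    labels:
      cluster: my-cluster
      __replica__: $POD_NAME

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [label]
      exporters: [prometheusremotewrite]
```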
Solution 2: Modify ‘resourceprocessor’ to allow changes to data point labels
The ‘resourceprocessor’ performs a similar transformation, but the key point is that within OTLP metrics, resource attributes are not the same as data point labels. ‘resourceprocessor’ can add resource attributes, whereas the desired functionality is the ability to add data point labels.
Here is an example metric from logging exporter that shows the difference:
Resource labels:
     -> service.name: STRING(otel-collector)
     -> host.hostname: STRING(0.0.0.0)
     -> port: STRING(8888)
     -> scheme: STRING(http)
InstrumentationLibraryMetrics #0
Metric #0
Descriptor:
     -> Name: otelcol_process_cpu_seconds
     -> Description: Total CPU user and system time in seconds
     -> Unit: s
     -> DataType: DoubleGauge
DoubleDataPoints #0
Data point labels:
     -> service_instance_id: xxxxx
StartTime: 0
Timestamp: 1601479909978000000
Value: 0.000000
Pros:
- Similar in effect to a new ‘labelprocessor’, but modifying an existing processor requires less development than writing a new one entirely
Cons:
- This may be out of scope for what should be expected of a ‘resourceprocessor’, given that we want to make changes within metrics as opposed to resource attributes
Solution 3: Add functionality to Prometheus Remote Write Exporter
What this solution entails is (1) using the ‘resourceprocessor’ to add our labels as resource attributes, and (2) ‘Prometheus remote write exporter’ taking these resource attributes and converting them to valid Prometheus labels.
The functionality for step (2) of this solution does not exist, so it would have to be added.
Pros:
- No new processor is required, hence there is less engineering effort compared to a new processor
Cons:
- Although this satisfies our own use case, the scope is limited to Prometheus; a new processor is more flexible and can add labels to any metric
- It is not ideal to always add all resource attributes as labels, so the ‘Prometheus remote write exporter’ would need some way to know which resource attributes were added from the configuration file
- That could mean a change to ‘resourceprocessor’, for example a flag on each attribute specifying whether it should be added as a metric label. This again may not conform with the idea of what a ‘resource’ is.
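Step (2) of this solution would amount to something like the following label-name sanitization, sketched here in Python. Prometheus label names must match [a-zA-Z_][a-zA-Z0-9_]*, so characters such as the ‘.’ in “service.name” must be replaced. The function is illustrative, not the exporter's actual code.

```python
import re

# Illustrative sketch of converting OTLP resource attributes into valid
# Prometheus labels (not the actual exporter code). Invalid characters in
# attribute keys are replaced with '_', and a leading digit is escaped,
# so the result matches [a-zA-Z_][a-zA-Z0-9_]*.
def resource_attrs_to_labels(attributes):
    labels = {}
    for key, value in attributes.items():
        name = re.sub(r"[^a-zA-Z0-9_]", "_", key)
        if name and name[0].isdigit():
            name = "_" + name
        labels[name] = str(value)
    return labels
```

Note this conversion loses the original key, which is part of the coordination problem described above: the exporter cannot tell which attributes were added from the configuration file.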
Solution 4: Add batch label editing to metricstransformprocessor
metricstransformprocessor is a processor in the opentelemetry-collector-contrib repo that almost does what we want; however, it can only add labels to metrics specified by name, whereas we wish to add labels to all metrics. A possible solution is therefore to add the ability to apply transformations to all metrics in this processor.
Pros:
- A chunk of the required logic already exists, so these changes may take less engineering effort than a new processor
Cons:
- May be out of the intended scope of this processor, since its documentation very clearly states that it is not intended for batch metric changes
Additional Context
Prometheus Receiver
The Prometheus receiver has the functionality to add labels, but this does not suit our use case: during scraping, Prometheus drops all labels starting with “__”, so the “__replica__” label cannot be set here.
Conclusion
The recommended solution for this issue is to make a new ‘labelprocessor’ that can be used to add labels to any metrics that are passing through the collector. Although our scope is currently limited to Prometheus and Cortex, this processor can be used for a more general purpose. We request approval from the Collector maintainers to proceed with our recommended solution, and we are open to discussion on other potential solutions.
cc - @bogdandrutu , @tigrannajaryan , @pjanotti , @alolita