
prometheus exporter should use same normalization code as otel collector #6704

Open
ywwg opened this issue Apr 28, 2025 · 25 comments
Labels
enhancement New feature or request pkg:exporter:prometheus Related to the Prometheus exporter package

Comments

@ywwg

ywwg commented Apr 28, 2025

Problem Statement

Currently the prometheus exporter has hand-rolled code to convert metric names from Otel-standard to Prometheus form. This is out of sync with the exporters in otel-collector-contrib and the Prometheus OTel endpoint.

Proposed Solution

We should use the new https://github.com/prometheus/otlptranslator library to standardize the way that metric and label names are translated, including the different possible configurations
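
For illustration, this is roughly the kind of translation involved, shown as a minimal hand-written sketch (hypothetical helper, not the otlptranslator API): invalid characters become underscores, and unit/type suffixes are appended depending on configuration.

// Illustrative sketch only -- not the otlptranslator API.
func normalizeName(name, unit string, monotonicSum bool) string {
	out := make([]rune, 0, len(name))
	for i, r := range name {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r == '_',
			i > 0 && r >= '0' && r <= '9':
			out = append(out, r)
		default:
			out = append(out, '_') // e.g. the dots in "http.server.duration"
		}
	}
	n := string(out)
	if unit == "s" { // the real code maps UCUM units, e.g. "s" -> "seconds", "By" -> "bytes"
		n += "_seconds"
	}
	if monotonicSum { // monotonic sums get the Prometheus counter suffix
		n += "_total"
	}
	return n
}

// normalizeName("http.server.duration", "s", true) == "http_server_duration_seconds_total"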

Alternatives

We could clone the current otel-collector-contrib code, but that code will soon be updated to use the new library and it seems unnecessary to go the extra step

Additional Context

There is a related issue here: open-telemetry/opentelemetry-collector-contrib#35459

@ywwg ywwg added the enhancement New feature or request label Apr 28, 2025
@pellared pellared added the pkg:exporter:prometheus Related to the Prometheus exporter package label Apr 29, 2025
@pellared
Member

pellared commented Apr 29, 2025

@ywwg, I think this is a good idea.

I see that otlptranslator depends on pdata from the Collector. Is it possible that otlptranslator would remove this dependency so that we have fewer dependencies and mitigate the possibility of cyclic dependencies (some Collector modules depend on OTel Go modules)?

Moreover, we would ask for a stable release of github.com/prometheus/otlptranslator (in the future) so that we could stabilize go.opentelemetry.io/otel/exporters/prometheus without worrying that people could run into build failures when bumping dependencies.

@ywwg
Author

ywwg commented May 1, 2025

@ArthurSens

@aknuds1

aknuds1 commented May 6, 2025

I see that otlptranslator depends on pdata from the Collector. Is it possible that otlptranslator would remove this dependency so that we have fewer dependencies and mitigate the possibility of cyclic dependencies (some Collector modules depend on OTel Go modules)?

@pellared Could we cross that bridge when it actually becomes a problem? For cyclic dependencies to arise from otlptranslator depending on go.opentelemetry.io/collector/pdata, the latter would have to depend on open-telemetry/opentelemetry-go, right? I'm worried about having to keep metric types in otlptranslator in sync with upstream.

@aknuds1

aknuds1 commented May 6, 2025

@pellared Thinking about it some more, I don't think it makes sense to define OTel metric types in a Prometheus library (otlptranslator). The definitions are not Prometheus-specific, and should therefore be in an open-telemetry-owned library. If there are cyclic dependencies from otlptranslator depending on pdata, that's better solved by defining the metric types in another open-telemetry-owned library.
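
For concreteness, the metric type definitions under discussion amount to little more than an enum over the OTLP data model -- a minimal sketch with hypothetical names:

type MetricType int

const (
	MetricTypeGauge MetricType = iota
	MetricTypeSum // counter-like; may be monotonic or non-monotonic
	MetricTypeHistogram
	MetricTypeExponentialHistogram
	MetricTypeSummary
)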

@ywwg
Author

ywwg commented May 6, 2025

Is it possible that otlptranslator would remove this dependency so that we have fewer dependencies and mitigate the possibility of cyclic dependencies (some Collector modules depend on OTel Go modules)?

The intention is that the collector-contrib prometheusexporter and remotewriteexporter will depend on this otlptranslator package in order to unify the translation logic. Would that qualify as a cyclic dependency? I would prefer not to extract yet another library without a concrete need to do so. Would it be a problem to wait until that actually happens, and then fix it? Or is there some ossification that could take place that would make it harder to fix later?

@aknuds1 While "otlptranslator" lives in the Prometheus organization, my understanding was that we wanted to consider it a joint venture between otel and prometheus, not A Prometheus Library or An Otel Library. We chose to put it in the prometheus org because most of the logic revolves around prometheus' standards for normalization. But its dependence on open telemetry means the code necessarily needs to be a collaboration between the two organizations because it is the critical interface point between the two systems. So I would prefer not to make decisions based on "who owns" the code, but rather the mutual needs of the two codebases.

@aknuds1

aknuds1 commented May 6, 2025

Would it be a problem to wait until that actually happens, and then fix it?

That's what I'm arguing for. Let's not create a problem for ourselves unless it's actually necessary (i.e. until otlptranslator's dependency on pdata actually causes a cyclic dependency when trying to build). I hope we're on the same page in this regard then?

But its dependence on open telemetry means the code necessarily needs to be a collaboration between the two organizations because it is the critical interface point between the two systems. So I would prefer not to make decisions based on "who owns" the code, but rather the mutual needs of the two codebases.

@ywwg My argument here isn't fundamentally about ownership, but that I would like to avoid replicating OTel metric type definitions in otlptranslator. Let's say we were to add metric type definitions to otlptranslator: those definitions would still live on in pdata (or another OTel library), right?

OTOH, if we wanted to put the canonical OTel metric type definitions in a single library that doesn't cause cyclic dependencies, otlptranslator doesn't make sense to me as that library, considering that Prometheus is only one OTel backend. Does that make sense?

So long as there's no real technical obstacle to keeping metric type definitions in pdata, why take the extra step of replicating them in otlptranslator and thereby increase fragmentation?

@pellared
Member

pellared commented May 12, 2025

Would it be a problem to wait until that actually happens, and then fix it?

That's what I'm arguing for. Let's not create a problem for ourselves unless it's actually necessary (i.e. until otlptranslator's dependency on pdata actually causes a cyclic dependency when trying to build).

To clarify, we're not talking about a package-level cyclic dependency, but rather a product-level one.

For example, imagine we need to introduce a new metric data type (this is the type of change I am most concerned about). We wouldn't be able to add it to go.opentelemetry.io/otel/exporters/prometheus until it's first added to pmetric and the Collector has released that change.

Meanwhile, any Collector component that depends on go.opentelemetry.io/otel/exporters/prometheus wouldn't be able to support the new type until that happens. So, even if there's no import cycle, the dependency chain between projects could still block progress.

However, notice there is an even bigger problem that you already have: how would you handle translation for a new metric data type in otlptranslator in the first place? You would be blocked on a release of pkg.go.dev/go.opentelemetry.io/collector/pdata introducing the new type in pmetric, and the Collector code would not be able to handle translation for the new type until otlptranslator had been updated as well.

For me the design is already broken and requires fixing 😉

@aknuds1

aknuds1 commented May 12, 2025

Thanks very much for the clarification @pellared. I will try to digest it once I have some time. We have an off-site this week, so fairly occupied because of that.

@pellared
Member

pellared commented May 12, 2025

But its dependence on open telemetry means the code necessarily needs to be a collaboration between the two organizations because it is the critical interface point between the two systems. So I would prefer not to make decisions based on "who owns" the code, but rather the mutual needs of the two codebases.

@ywwg, I totally agree with you. I guess we could also have it in the https://github.com/open-telemetry organization (it does not matter to me personally). The problem right now is that its design is tightly coupled to the Collector (as mentioned in #6704 (comment)), making it not really reusable. I guess it may be worth doing a spike/prototype to validate whether it is even worth having a reusable component, as introducing new types and mapping everything may be both less performant and harder to maintain than duplicating some of the translation code in two places. Or maybe otlptranslator should provide more low-level functions that the Collector and SDK could reuse, such as NormalizeLabel?
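
For example, such a low-level helper could look roughly like this (an illustrative sketch, not the actual otlptranslator or Collector implementation):

// Sketch of a reusable label normalizer: invalid characters become
// underscores and a label starting with a digit gets a prefix.
func normalizeLabel(label string) string {
	out := []rune(label)
	for i, r := range out {
		valid := r == '_' ||
			(r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') ||
			(i > 0 && r >= '0' && r <= '9')
		if !valid {
			out[i] = '_'
		}
	}
	if len(out) > 0 && out[0] >= '0' && out[0] <= '9' {
		return "key_" + string(out)
	}
	return string(out)
}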

@aknuds1

aknuds1 commented May 12, 2025

For me the design is already broken and requires fixing 😉

@pellared OK, I understand the nature of the problem now. Thanks for laying it out for those of us unfamiliar with the OTel ecosystem (I'm coming from the Prometheus side)!

I guess it may be worth doing a spike/prototype to validate whether it is even worth having a reusable component, as introducing new types and mapping everything may be both less performant and harder to maintain than duplicating some of the translation code in two places.

I think that, ideally, it would be best to have the OTel metric type definitions in one place, and it doesn't make sense for that one place to be an OTel/Prometheus interop library, at least if interop libraries for other OTel backends should want to use them. Could you explain how it would be harder to maintain a free-standing (i.e. independent of the Collector) OTel library containing the metric type definitions, rather than duplicating them between libraries (couldn't there be more than two places, if backends other than Prometheus need to use them)? Also, what could make this solution less performant? I can't really see what would lead to the latter.

To make sure we're on the same page; by spike/prototype, do you mean a spike that uses prometheus/otlptranslator in open-telemetry/opentelemetry-collector-contrib and open-telemetry/opentelemetry-go? Or do you mean a spike of an OTel library containing the metric type definitions? I'm generally for the idea of doing a spike :)

@pellared
Member

Could you explain how it would be harder to maintain a free-standing (i.e. independent of the Collector) OTel library containing the metric type definitions, rather than duplicating them between libraries (couldn't there be more than two places, if backends other than Prometheus need to use them)?

Maintaining another repository, making releases, and making sure that the API does not have any breaking changes is not free.

Also, what could make this solution less performant? I can't really see what would lead to the latter.

I could be wrong here. I just thought that maybe we would need to translate some of the types from our representation to the library's representation. I would need to see the code and ideally benchmarks 😉

To make sure we're on the same page; by spike/prototype, do you mean a spike that uses prometheus/otlptranslator in open-telemetry/opentelemetry-collector-contrib and open-telemetry/opentelemetry-go? Or do you mean a spike of an OTel library containing the metric type definitions?

There can be multiple spikes so that we can even compare different designs.

@aknuds1

aknuds1 commented May 13, 2025

@pellared What do you think of a spike that uses the prometheus/otlptranslator branch containing the OTel metric type definitions, in open-telemetry/opentelemetry-collector-contrib and open-telemetry/opentelemetry-go? I think it would be valuable as a PoC of the otlptranslator API design.

@pellared
Member

@pellared What do you think of a spike that uses the prometheus/otlptranslator branch containing the OTel metric type definitions, in open-telemetry/opentelemetry-collector-contrib and open-telemetry/opentelemetry-go? I think it would be valuable as a PoC of the otlptranslator API design.

Yes. See: prometheus/otlptranslator#29 (comment)

@pellared
Member

FYI @dashpole

@alexandreLamarre
Contributor

Hey, I know this isn't directly related to the issue at hand (but somewhat related), but the current experience with exporter/prometheus is a little odd with respect to names.

If we do:

// Assumed imports: context, expprom = go.opentelemetry.io/otel/exporters/prometheus,
// sdkmetric = go.opentelemetry.io/otel/sdk/metric, and otelmetric = go.opentelemetry.io/otel/metric
// (the instrument options come from the API package, not the SDK).
exporter, _ := expprom.New()
meterProvider := sdkmetric.NewMeterProvider(
	sdkmetric.WithReader(exporter),
)
meter := meterProvider.Meter("example")
counter, _ := meter.Int64Counter(
	"example.counter",
	otelmetric.WithDescription("Example Counter"),
	otelmetric.WithUnit("unit"),
)
counter.Add(context.Background(), 1)

with the promhttp.Handler(), it exposes:

# HELP "example.counter_total" Example Counter
# TYPE "example.counter_total" counter
{"example.counter_total", ...} 1

which is now valid by default, since the default name validation scheme from prometheus/common/model is now:

NameValidationScheme = UTF8Validation

This feels like bad dev UX atm, especially since we want to expose metrics following the best practices:
https://prometheus.io/docs/practices/naming/

For now you can fix this by setting the NameValidationScheme to the legacy one, which will eventually be removed.
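
For reference, that workaround looks roughly like this, using the global from github.com/prometheus/common/model quoted above:

import "github.com/prometheus/common/model"

func init() {
	// Workaround: force legacy (underscore-only) name validation globally.
	// NameValidationScheme is a package-level global and is slated for removal.
	model.NameValidationScheme = model.LegacyValidation
}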

In the above meter constructor, manually changing "example.counter" to "example_counter" results in the OTel SDK renaming it to "example.counter".

It feels like conflicting practices are being used.

Wondering if a temporary patch to change the way the naming scheme is handled in exporters/prometheus would be useful to address this, since:

  • setting the global NameValidationScheme inside a complex product like https://github.com/rancher/rancher might cause some confusion/frustration if multiple teams are adding metrics (think also of importing downstream go.mod dependencies that register metrics using these global variables and rely on NameValidationScheme being UTF-8, while in rancher/rancher we change it to the legacy one)

@ywwg
Author

ywwg commented May 27, 2025

Instead of changing the default "NameValidationScheme", we should change the default translation method. See: https://github.com/prometheus/docs/blob/main/content/docs/guides/opentelemetry.md#utf-8. Probably we want this to be UnderscoreEscapingWithSuffixes until otel depends on the new shared otlptranslator repo
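
For a rough idea of the difference, and leaving unit suffixes aside, a counter named "example.counter" (as in the earlier example) would come out as:

	UnderscoreEscapingWithSuffixes:  example_counter_total
	NoTranslation:                   example.counter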

@alexandreLamarre
Contributor

A Prometheus docs update went live and moved the link to: https://github.com/prometheus/docs/blob/main/docs/guides/opentelemetry.md#utf-8

@dashpole
Contributor

@ywwg, at the Prometheus WG, we had talked about keeping the existing options (negotiation for escaping, config for suffixes). Does your comment above mean you want to switch to using the same "strategies" as Prometheus in the configuration?

@alexandreLamarre
Contributor

There are also inconsistencies between exporter/prometheus and otel-contrib's pkg/translator/prometheus with regard to the UCUM unit suffixes map:

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/65bb39e831d570362ac310f7392d89e6f1d49d96/pkg/translator/prometheus/normalize_unit.go#L108-L115

adds the plain-text unit as a suffix anyway if it is not in the UCUM map, whereas this code in exporter/prometheus:

	if suffix, ok := unitSuffixes[m.Unit]; ok && !c.withoutUnits && !strings.HasSuffix(name, suffix) {
		name += "_" + suffix
	}

does not add any suffix if the unit is not in the UCUM map. For example, a unit like "pages" that is not in the map would be appended as "_pages" by the collector-contrib translator but dropped by exporter/prometheus.

I see that https://github.com/prometheus/otlptranslator takes the former stance on handling units (probably intentionally).

I'm not sure what the timeline is for stabilizing https://github.com/prometheus/otlptranslator, but in the interim can we consider changing the behaviour of exporter/prometheus to match the incoming https://github.com/prometheus/otlptranslator?

@ywwg
Author

ywwg commented May 28, 2025

@ywwg, at the Prometheus WG, we had talked about keeping the existing options (negotiation for escaping, config for suffixes). Does your comment above mean you want to switch to using the same "strategies" as Prometheus in the configuration?

aha, I see -- I meant that negotiation is for determining what Prometheus can accept and configuration is for what the client wants to send. And "NameValidationScheme" indicates what the code can support. This is a lot of moving parts but I think it all holds together.

  • NameValidationScheme: should always be "UTF8" in new code, which is why it's now deprecated. This value should not be used as configuration because it's a dumb global. Even in UTF-8 mode, code can still optionally check for legacy validity whenever it wants.
  • Content Negotiation: Prometheus announces what it can Accept -- can it accept UTF-8 or does it require escaping? (See the example Accept header after this list.) This says nothing about whether there are suffixes, and it also does not insist that the client "must" send UTF-8. Prometheus can say that it Accepts UTF-8 and an endpoint can still totally validly send underscore+suffix names. My main objection to putting things like suffix-addition in content negotiation is that it starts embedding OTel configuration inside the Prometheus protocol.
  • Configuration: This allows endpoints to determine what they send to prometheus. Maybe they send untranslated UTF-8, but maybe they always produce underscore+suffix names even when Prometheus announces that it can accept UTF-8.
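
As a concrete example of the negotiation side: a scraper that can handle UTF-8 names advertises that in the escaping parameter of its Accept header, roughly

	Accept: application/openmetrics-text; version=1.0.0; escaping=allow-utf-8

whereas a scraper that requires escaped names would ask for something like escaping=underscores. Either way, whether suffixes are added remains the endpoint's own configuration.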

So one way to migrate is: Set up endpoints to always produce underscore+suffix names. Prometheus says it can Accept underscores, or it can Accept UTF-8 -- doesn't matter, the endpoint will always produce escaped names. Later, the operator can update the endpoint configuration to produce untranslated names.

David, you talked about a situation where a customer has many endpoints and wants to update them all at the same time. One way to do this would be:

  1. Configure the endpoints to, when asked for UTF-8, produce untranslated metrics. When asked for escaping, produce underscore+suffix names.
  2. Set up prometheus to only Accept underscore escaping.
  3. At this point, all names will still be underscore+suffix names.
  4. Prom configuration is later updated to Accept UTF-8.
  5. Now the endpoints see the support for UTF-8 in the Accept header and produce untranslated metrics.

The only hole I'm seeing in this more complex plan is the need for separate configuration on the endpoints for UTF-8 vs. escaped name formation.

I have some new prom docs to try to explain this:

https://github.com/prometheus/docs/blob/main/docs/instrumenting/content_negotiation.md
https://github.com/prometheus/docs/blob/main/docs/instrumenting/escaping_schemes.md
https://github.com/prometheus/docs/blob/main/docs/guides/utf8.md#scrape-content-negotiation-for-utf-8-escaping

@dashpole
Contributor

Thanks, makes sense. I think having configuration is fine.

Do you think this configuration should be on the OpenTelemetry exporter, or should it be part of the prometheus client? E.g. on something like HandlerOpts, rather than an option in go.opentelemetry.io/otel/exporters/prometheus?

I would prefer that OTel exporters be "dumb" wrappers around prometheus clients as much as possible, so that defining a metric using OTel with a Prometheus exporter and defining the same metric using the prometheus client directly produce the same result.

dashpole added a commit that referenced this issue Jun 2, 2025
…ntrib (#6839)

Related to
#6704 (comment)

---------

Signed-off-by: Alexandre Lamarre <[email protected]>
Co-authored-by: David Ashpole <[email protected]>
@ywwg
Author

ywwg commented Jun 3, 2025

edit: rereading what you said

I have been assuming this should be configuration on the prometheus endpoint code, not on the otel exporter.

i.e., the config lives in the place where the conversion is being done from otel to prometheus, not in a pure-otel part of the code.

@ywwg
Author

ywwg commented Jun 3, 2025

I'm not sure what you mean by "prometheus client" here -- I am used to talking about the "Otel Exporter" to mean the thing that sends native OTel to an OTel endpoint (which could be Prometheus), and the "Otel Prometheus Exporter" to mean the thing that exposes a /metrics endpoint on the collector.

@ywwg
Author

ywwg commented Jun 3, 2025

I think I agree that the configuration should go in the prometheus part of the code so that it applies everywhere

@dashpole
Contributor

dashpole commented Jun 3, 2025

I think in an ideal world, controlling the character set and the addition of suffixes would be a feature built into Prometheus clients (e.g. client_golang). OpenTelemetry would benefit from that as a user of the client, but it wouldn't be OpenTelemetry-specific.

But that would also presumably mean preserving the existing behavior of the Prometheus client as the default, which I believe is equivalent to NoTranslation. I'm not sure if we are comfortable with that being the default for OTel SDKs today.
