
[chore] Revert dc8e2dd #12917

Closed
wants to merge 4 commits

Conversation

@songy23 (Member) commented Apr 24, 2025

Description

reverts #12856, breaks internal metrics

Testing

Tested locally at localhost:8888/metrics. Before the revert:
[Screenshot 2025-04-23 at 9 38 43 PM]

After:
[Screenshot 2025-04-23 at 9 36 36 PM]

@songy23 added the release:blocker label (The issue must be resolved before cutting the next release) on Apr 24, 2025

codecov bot commented Apr 24, 2025

Codecov Report

Attention: Patch coverage is 77.27273% with 10 lines in your changes missing coverage. Please review.

Project coverage is 91.60%. Comparing base (d020c90) to head (7196ba0).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
service/internal/graph/connector.go 33.33% 1 Missing and 1 partial ⚠️
service/internal/graph/exporter.go 33.33% 1 Missing and 1 partial ⚠️
service/internal/graph/processor.go 33.33% 1 Missing and 1 partial ⚠️
service/internal/graph/receiver.go 33.33% 1 Missing and 1 partial ⚠️
service/telemetry/logger.go 93.75% 1 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (77.27%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12917      +/-   ##
==========================================
- Coverage   91.65%   91.60%   -0.05%     
==========================================
  Files         499      499              
  Lines       27426    27437      +11     
==========================================
- Hits        25138    25135       -3     
- Misses       1809     1818       +9     
- Partials      479      484       +5     

@djaglowski (Member) commented Apr 24, 2025

I don't know if reverting this is a good idea either, as it was done as a patch for #12870 (see comment).

@djaglowski (Member)

> #12856, breaks internal metrics

Can we be more specific about what is broken?

The intended effect of open-telemetry/opentelemetry-collector-contrib#12856 was to add scope attributes to all telemetry describing components. We discussed lack of support for scope attributes as a potential concern but didn't notice any immediate issues. However, it seems that we may have overlooked one. Is it possible this is just a single consumer of the telemetry which needs to correctly handle scope attributes?

@jade-guiton-dd (Contributor) commented Apr 24, 2025

The issue

I haven't investigated how the Prometheus exporter works in detail, but here is my guess on what the issue is:

It looks like until now, there was no way to identify which batch processor instance a metric like otelcol_processor_batch_metadata_cardinality came from. You get a "processor" attribute with the component ID, but nothing identifying which signal's pipeline it's in. This means that when running a config like:

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

The metrics from both batch instances were aggregated on the OTel side into a single metric stream, and output by Prometheus as a single metric stream.

open-telemetry/opentelemetry-collector-contrib#12856 automatically injects additional attributes which allow differentiating the metric streams from the two instances. However, these are instrumentation scope attributes, which are not currently supported by the Prometheus exporter. I had assumed this simply meant that the distinction would be aggregated away at the exporter level, but it sounds like the Prometheus exporter does not implement any aggregation logic and is very strict about its notion of "different metrics" matching that of the input. Because we no longer aggregate these two metric streams at the OTel level due to the differing scope attributes, and the Prometheus exporter ignores scope attributes entirely and therefore treats the streams as duplicates, it errors out.

In my opinion, this is clearly a bug in the Prometheus exporter. Having two metric streams on the OTel side that the exporter cannot tell apart because it lacks support for some identifying properties should not result in an unrecoverable error. And this could happen again if internal telemetry starts using other identifying properties the exporter doesn't support.
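
To make the failure mode concrete, here is a minimal, self-contained Go sketch of what I believe is happening, driving the OTel Go SDK Prometheus exporter directly rather than the full Collector. The otelcol.pipeline.id attribute key and the port are made up for illustration; the point is only that two streams which differ solely in scope attributes collapse into identical Prometheus series:

package main

import (
	"context"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))

	// Two meters with the same scope name but different scope attributes,
	// mimicking the two batch processor instances in the logs and traces pipelines.
	for _, pipeline := range []string{"logs", "traces"} {
		meter := provider.Meter(
			"go.opentelemetry.io/collector/processor/batchprocessor",
			metric.WithInstrumentationAttributes(
				attribute.String("otelcol.pipeline.id", pipeline), // hypothetical key, illustration only
			),
		)
		gauge, err := meter.Int64Gauge("otelcol_processor_batch_metadata_cardinality")
		if err != nil {
			log.Fatal(err)
		}
		gauge.Record(context.Background(), 1,
			metric.WithAttributes(attribute.String("processor", "batch")))
	}

	// Scraping /metrics should hit the "collected before with the same name and
	// label values" error, because the exporter emits otel_scope_name/version
	// labels but drops scope attributes, leaving two identical series.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("localhost:8888", nil))
}

With the revert, both instances report under an identical scope again, the SDK aggregates them into a single stream, and the collision goes away.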

What to do

Because I expect many users to be relying on the Prometheus exporter, this needs to be fixed before the next release.

The obvious long-term course of action is to modify the Prometheus exporter to support instrumentation scope attributes, and perhaps to make it more lenient towards unexpectedly finely aggregated input.
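
As a very rough sketch of what that support might look like (purely hypothetical code, not the exporter's actual internals), the exporter could emit one label per scope attribute alongside the existing otel_scope_name / otel_scope_version labels, so that streams differing only in scope attributes stay distinguishable:

package main

import (
	"fmt"
	"strings"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/instrumentation"
)

// scopeLabels is a hypothetical sketch, not the exporter's real implementation:
// it folds scope attributes into Prometheus labels in addition to the scope
// name and version.
func scopeLabels(scope instrumentation.Scope) map[string]string {
	labels := map[string]string{
		"otel_scope_name":    scope.Name,
		"otel_scope_version": scope.Version,
	}
	for _, kv := range scope.Attributes.ToSlice() {
		// Replace characters that are invalid in Prometheus label names.
		key := strings.Map(func(r rune) rune {
			switch {
			case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
				return r
			default:
				return '_'
			}
		}, string(kv.Key))
		labels["otel_scope_"+key] = kv.Value.Emit()
	}
	return labels
}

func main() {
	scope := instrumentation.Scope{
		Name: "go.opentelemetry.io/collector/processor/batchprocessor",
		// Hypothetical identifying attribute injected per component instance.
		Attributes: attribute.NewSet(attribute.String("otelcol.pipeline.id", "logs")),
	}
	fmt.Println(scopeLabels(scope))
}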

If we estimate that we can't get that done before the next release:

@djaglowski (Member)

Related issue in contrib: #12923

@djaglowski (Member)

If this is something that can be solved in the prometheus exporter, I'd much prefer we resolve it there. I suggest we give the code owners a day to weigh in before attempting other solutions.

@songy23 (Member, Author) commented Apr 24, 2025

Disclaimer: I don't have much context on the related changes; I opened this PR just because I was pinged in #12812 and followed the conversation there :)

Personally I would prefer to do a clean revert first - that should bring back the v0.124.0 behavior, IIUC - and then make any compensating changes in a follow-up PR. That is easier to review and easier to isolate (e.g., what if the follow-up PR introduces new issues?). But ultimately I will defer to @djaglowski & @jade-guiton-dd

@codeboten (Contributor) left a comment

Thanks for opening the PR and marking it as a release blocker to prevent this from going out next week, @songy23. I agree with @djaglowski: let's see if we can address the underlying issue first. Marking this as "request changes" to prevent it from accidentally getting merged 👍🏻

@TylerHelmuth (Member) commented Apr 24, 2025

FYI, a Helm chart user ran into this: open-telemetry/opentelemetry-helm-charts#1642 (comment). I set with_resource_constant_labels.included on the Prometheus exporter, and I believe that solved the problem.

@djaglowski (Member)

> Personally I would prefer to do a clean revert first

Normally I would agree, but we already have a stack of problems that reverting won't cleanly resolve, so I think pressing ahead may be safer.

@codeboten (Contributor) commented Apr 24, 2025

> Because we no longer aggregate these two metric streams at the OTel level due to the differing scope attributes, and the Prometheus exporter ignores scope attributes entirely and therefore treats the streams as duplicates, it errors out.

I'm not sure the exporter ignores the scope attributes. Based on the output I saw when I ran the collector, the scope name and version are set as labels, but they are not different for the different pipelines, which causes the issue:

An error has occurred while serving metrics:

2 error(s) occurred:
* collected metric "otelcol_processor_batch_metadata_cardinality" { label:{name:"otel_scope_name" value:"go.opentelemetry.io/collector/processor/batchprocessor"} label:{name:"otel_scope_version" value:""} label:{name:"processor" value:"batch"} gauge:{value:1}} was collected before with the same name and label values
* collected metric "otelcol_processor_batch_metadata_cardinality" { label:{name:"otel_scope_name" value:"go.opentelemetry.io/collector/processor/batchprocessor"} label:{name:"otel_scope_version" value:""} label:{name:"processor" value:"batch"} gauge:{value:1}} was collected before with the same name and label values

Would adding the attribute for individual pipelines address the underlying problem that the Prometheus client library doesn't know what to do with metrics it considers duplicates (https://github.com/prometheus/client_golang/blob/main/prometheus/registry.go#L943)?

@jade-guiton-dd (Contributor) commented Apr 24, 2025

@codeboten It handles the name and version of an instrumentation scope, but not its attribute set; those are different fields.
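
To make the distinction concrete on the API side (the attribute key below is only illustrative):

package main

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	provider := sdkmetric.NewMeterProvider()

	// Scope name and version are passed when the Meter is created; the exporter
	// maps them to the otel_scope_name / otel_scope_version labels seen in the
	// output above. Scope attributes are a separate option on the same call,
	// and that is the field the exporter currently has no label mapping for.
	meter := provider.Meter(
		"go.opentelemetry.io/collector/processor/batchprocessor",
		metric.WithInstrumentationVersion(""),
		metric.WithInstrumentationAttributes(
			attribute.String("otelcol.pipeline.id", "logs"), // hypothetical key
		),
	)
	_ = meter
}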

@codeboten (Contributor)

@jade-guiton-dd ah thanks, I misunderstood your comment.

@codeboten (Contributor)

Related issue: open-telemetry/opentelemetry-go#5846

@codeboten (Contributor)

I opened #12918 to track ensuring that this problem is caught by a test in core.

@jade-guiton-dd (Contributor) commented Apr 24, 2025

If we want feedback from Prometheus exporter codeowners, should we file an issue about this on opentelemetry-go?

@jmacd (Contributor) commented Apr 24, 2025

I see three solutions, short/medium/long term.

  1. As described in opentelemetry-specification#4450 ("Endorse use of runtime-defined instrumentation scope attributes", see comment), if we follow the existing approach to scope attributes, then we would extend each metric with a set of identifying scope-attribute values. This allows scopes to include many non-identifying attributes but duplicates all identifying attributes on both otel_scope_info and the metrics.
  2. A unique identifier could be added, making it possible for a single join key to be used (e.g., a 64-bit ID). This is more compact and doesn't require an entity definition to declare identifying attributes.
  3. We could try to add a feature in the OTel-Go exporter to copy scope attributes into every metric. This would restore the original data and enable a slow transition to one or the other solution described above.

@swapdonkey

> Related issue in contrib: open-telemetry/opentelemetry-collector#12923

Just wanted to clarify: this issue is using the OTLP endpoint in Prometheus 3, not the Prometheus exporter. So I don't think any changes to the Prometheus exporter will fix it.

@ArthurSens (Member)

> Related issue in contrib: #12923

Hi folks, I accidentally landed here while triaging issues in contrib. Collector internal metrics are exposed through the Go SDK Prometheus exporter, not the collector's Prometheus exporter component. So the issue in contrib is not related to the problem here.

I'm not sure how I can be helpful since I'm still getting used to OTel's spec, but please don't hesitate to ping me if you feel like I could be useful.

@songy23 (Member, Author) commented Apr 28, 2025

Superseded by #12933.

Labels
release:blocker The issue must be resolved before cutting the next release

8 participants