Same target from two different jobs missing after targetallocator upgrade 0.121.0+ #4044

Closed
gracewehner opened this issue May 28, 2025 · 5 comments · Fixed by #4066
Labels
area:target-allocator (Issues for target-allocator) · bug (Something isn't working)

Comments

@gracewehner
Contributor

Component(s)

target allocator

What happened?

Description

In practice, this happens when two different jobs point to the same target but each job has a different metrics_path. The target is now only scraped for one of the jobs, whereas before it was scraped for both.

I was able to change one of the unit tests to reproduce this and attach a debugger. I have narrowed it down to PR #3832, which changes the hash for a target from t.JobName + t.TargetURL + strconv.FormatUint(t.Labels.Hash(), 10) to t.Labels.Hash().
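
To make the collision concrete, here is a standalone sketch (not the allocator's actual code; the target struct and its fields are illustrative) comparing the old and new hash keys for the same address discovered by two jobs:

package main

import (
	"fmt"
	"strconv"

	"github.com/prometheus/prometheus/model/labels"
)

// target is a hypothetical stand-in for the allocator's Target type.
type target struct {
	JobName   string
	TargetURL string
	Labels    labels.Labels
}

// oldHash mirrors the pre-#3832 key: job name + target URL + label hash.
func oldHash(t target) string {
	return t.JobName + t.TargetURL + strconv.FormatUint(t.Labels.Hash(), 10)
}

// newHash mirrors the post-#3832 key: label hash only.
func newHash(t target) uint64 {
	return t.Labels.Hash()
}

func main() {
	// Discovery only provides __address__, so both jobs see identical labels.
	lbls := labels.FromStrings("__address__", "prom.domain:9001")
	a := target{JobName: "prometheus", TargetURL: "prom.domain:9001", Labels: lbls}
	b := target{JobName: "prometheus2", TargetURL: "prom.domain:9001", Labels: lbls}

	fmt.Println(oldHash(a) == oldHash(b)) // false: the job names keep the targets distinct
	fmt.Println(newHash(a) == newHash(b)) // true: identical label sets collide
}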

The targetgroup map coming from the Prometheus discovery manager does not contain job as a label when the targets are being processed by the Target Allocator and stored in the target list:

[Debugger screenshot: the targetgroup map's per-target label sets contain only the address and static labels, with no job label.]

The job name is the key to a targetgroup object in the map, and each targetgroup does have the address as a label per target. The job name is also not added as a label when the targets are processed right before hashing: https://github.com/swiatekm/opentelemetry-operator/blob/main/cmd/otel-allocator/internal/target/discovery.go#L195. The metrics path is also not a label at that point.
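
For reference, a small sketch of the shape of the data the discovery manager hands over (using Prometheus's targetgroup and model packages; the values mirror the repro config below): the job name exists only as the map key, and each target's label set carries little more than __address__.

package main

import (
	"fmt"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/discovery/targetgroup"
)

func main() {
	// Roughly what the discovery manager produces: groups keyed by job name.
	// Neither "job" nor "__metrics_path__" appears in the target label sets.
	groups := map[string][]*targetgroup.Group{
		"prometheus": {{
			Targets: []model.LabelSet{{model.AddressLabel: "prom.domain:9001"}},
			Source:  "0",
		}},
		"prometheus2": {{
			Targets: []model.LabelSet{{model.AddressLabel: "prom.domain:9001"}},
			Source:  "0",
		}},
	}

	for job, tgs := range groups {
		for _, tg := range tgs {
			fmt.Println(job, tg.Targets) // the job name exists only as the map key
		}
	}
}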

Steps to Reproduce

Use a scrape config pointing to the same target from two different jobs, with Target Allocator version 0.121.0 or later:

config:
  scrape_configs:
  - job_name: prometheus
    static_configs:
    - targets: ["prom.domain:9001", "prom.domain:9002", "prom.domain:9003"]
  - job_name: prometheus2
    static_configs:
    - targets: ["prom.domain:9001"]

Expected Result

Target prom.domain:9001 will be allocated to a collector for both jobs, prometheus and prometheus2.

Actual Result

Target prom.domain:9001 will be allocated to a collector for only one of the jobs, prometheus or prometheus2, but not both.

Kubernetes Version

1.31.8

Operator version

v0.121.0+

Collector version

v0.121.0+

Environment information

Environment

OS: "Ubuntu 20.04"

Log output

Additional context

No response

@gracewehner added the bug and needs triage labels on May 28, 2025
@swiatekm
Contributor

Are you sure this also happens if the metrics paths are different? The logic behind #3832 was that all the data that used to go into the hash calculation was present in the labels anyway. That is definitely true for __address__ and __metrics_path__, but for the job name it is only true when using Prometheus CRs. So there's definitely a bug here, and it should be resolved by adding the job name label for the purpose of hash calculation.
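
A minimal sketch of that idea (illustrative only, written against the recent Prometheus model/labels API, and not necessarily how the fix will be implemented): inject the job name into the label set just for the hash, so the two jobs' copies of prom.domain:9001 stop colliding.

package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

// hashWithJob adds the job name as a label purely for hash calculation.
// Sketch only; the name and approach are illustrative.
func hashWithJob(jobName string, lbls labels.Labels) uint64 {
	b := labels.NewBuilder(lbls)
	b.Set("job", jobName)
	return b.Labels().Hash()
}

func main() {
	lbls := labels.FromStrings("__address__", "prom.domain:9001")
	// Different jobs now produce different hashes for the same address.
	fmt.Println(hashWithJob("prometheus", lbls) == hashWithJob("prometheus2", lbls)) // false
}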

Thank you for the report and detailed reproduction!

@swiatekm added the area:target-allocator label and removed the needs triage label on May 29, 2025
@gracewehner
Contributor Author

Thanks @swiatekm for the quick reply. The above debugger screenshot is for the config:

config:
  scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics1
    static_configs:
    - targets: ["prom.domain:9001", "prom.domain:9002", "prom.domain:9003"]
  - job_name: prometheus2
    metrics_path: /metrics2
    static_configs:
    - targets: ["prom.domain:9001"]

I was surprised not to see the __metrics_path__ label there too, along with job.

I looked into it more just to double-check: the labels in the targetgroup and target returned by the discovery manager are the bare minimum, with the address and some static labels. The scrape manager later takes these and combines them with more labels from the scrape config for job, metrics_path, etc.: https://github.com/prometheus/prometheus/blob/main/scrape/target.go#L445

Discovery manager target groups: https://github.com/prometheus/prometheus/blob/main/discovery/targetgroup/targetgroup.go
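
A simplified stand-in for that merge step (not Prometheus's actual PopulateLabels code, just an illustration of the split in responsibilities): the scrape layer starts from the per-job scrape config defaults and lets the discovered labels take precedence.

package main

import (
	"fmt"

	"github.com/prometheus/common/model"
)

// populateLabels is a simplified stand-in for what Prometheus's scrape layer
// does in PopulateLabels (scrape/target.go): merge scrape-config settings like
// job, __metrics_path__ and __scheme__ into the minimal discovered label set.
func populateLabels(discovered model.LabelSet, jobName, metricsPath, scheme string) model.LabelSet {
	defaults := model.LabelSet{
		model.JobLabel:         model.LabelValue(jobName),
		model.MetricsPathLabel: model.LabelValue(metricsPath),
		model.SchemeLabel:      model.LabelValue(scheme),
	}
	// Merge returns a copy in which the discovered labels override the defaults.
	return defaults.Merge(discovered)
}

func main() {
	discovered := model.LabelSet{model.AddressLabel: "prom.domain:9001"}
	fmt.Println(populateLabels(discovered, "prometheus", "/metrics1", "http"))
}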

I am happy to help by making a PR with a fix, but I'm not sure what strategy we want to take.

@swiatekm
Contributor

swiatekm commented May 30, 2025

That's a bit annoying. I don't want to revert the entirety of #3832 because it's conceptually sound and the performance improvement is significant. Unfortunately, Prometheus's labels.Labels don't let us pass in our own hasher. However, the implementation isn't particularly complicated, so I think the simplest fix is to copy it and add the job name and target URL at the end, thereby restoring the previous behaviour. Afterwards, we can rethink whether we want to mirror Prometheus's label initialization logic or do something bespoke. Does that make sense?
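
For illustration, one way such a copied hasher could look (assuming xxhash over the name/value pairs, as Prometheus's own label hashing uses; a sketch only, not the actual change that landed in the fix):

package main

import (
	"fmt"

	"github.com/cespare/xxhash/v2"
	"github.com/prometheus/prometheus/model/labels"
)

// hash sketches the proposal above: hash the label set the way Prometheus does
// (xxhash over name/value pairs), then mix in the job name and target URL at
// the end so the pre-#3832 uniqueness is restored.
func hash(jobName, targetURL string, lbls labels.Labels) uint64 {
	sep := []byte{'\xff'} // separator byte, as in Prometheus's label hashing
	h := xxhash.New()
	lbls.Range(func(l labels.Label) {
		_, _ = h.WriteString(l.Name)
		_, _ = h.Write(sep)
		_, _ = h.WriteString(l.Value)
		_, _ = h.Write(sep)
	})
	_, _ = h.WriteString(jobName)
	_, _ = h.Write(sep)
	_, _ = h.WriteString(targetURL)
	return h.Sum64()
}

func main() {
	lbls := labels.FromStrings("__address__", "prom.domain:9001")
	fmt.Println(hash("prometheus", "prom.domain:9001", lbls) !=
		hash("prometheus2", "prom.domain:9001", lbls)) // true: jobs stay distinct
}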

@gracewehner
Contributor Author

Ok, sounds good, thanks, that makes sense to me.

@gracewehner
Contributor Author

Thanks @swiatekm, I made an initial PR for this, let me know what you think.
