Same target from two different jobs missing after targetallocator upgrade 0.121.0+ #4044

Closed
gracewehner opened this issue May 28, 2025 · 5 comments · Fixed by #4066
Labels
area:target-allocator (Issues for target-allocator) · bug (Something isn't working)

Comments

@gracewehner
Contributor

Component(s)

target allocator

What happened?

Description

In practice, this happens when two different jobs point to the same target but each job has a different metrics_path. The target is now only scraped for one of the jobs, whereas before it was scraped for both.

I was able to change one of the unit tests to reproduce this and attach a debugger. I have narrowed it down to PR #3832, which changes the hash for a target from t.JobName + t.TargetURL + strconv.FormatUint(t.Labels.Hash(), 10) to t.Labels.Hash().
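
To make the collision concrete, here is a standalone sketch (not the allocator's actual code; the target struct and its fields are illustrative) comparing the old and new hash keys for the same address discovered by two jobs:

package main

import (
	"fmt"
	"strconv"

	"github.com/prometheus/prometheus/model/labels"
)

// target is a hypothetical stand-in for the allocator's Target type.
type target struct {
	JobName   string
	TargetURL string
	Labels    labels.Labels
}

// oldHash mirrors the pre-#3832 key: job name + target URL + label hash.
func oldHash(t target) string {
	return t.JobName + t.TargetURL + strconv.FormatUint(t.Labels.Hash(), 10)
}

// newHash mirrors the post-#3832 key: label hash only.
func newHash(t target) uint64 {
	return t.Labels.Hash()
}

func main() {
	// Discovery only provides __address__, so both jobs see identical labels.
	lbls := labels.FromStrings("__address__", "prom.domain:9001")
	a := target{JobName: "prometheus", TargetURL: "prom.domain:9001", Labels: lbls}
	b := target{JobName: "prometheus2", TargetURL: "prom.domain:9001", Labels: lbls}

	fmt.Println(oldHash(a) == oldHash(b)) // false: the job names keep the targets distinct
	fmt.Println(newHash(a) == newHash(b)) // true: identical label sets collide
}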

The targetgroup map coming from the Prometheus discovery manager does not contain job as a label when the targets are being processed by the Target Allocator and stored in the target list:

[Debugger screenshot: the targetgroup map's per-target label sets contain only the address and static labels, with no job label.]

The job name is the key to a targetgroup object in the map, and each targetgroup does have the address as a label per target. The job name is also not added as a label when the targets are processed right before hashing: https://github.com/swiatekm/opentelemetry-operator/blob/main/cmd/otel-allocator/internal/target/discovery.go#L195. The metrics path is also not a label at that point.
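
For reference, a small sketch of the shape of the data the discovery manager hands over (using Prometheus's targetgroup and model packages; the values mirror the repro config below): the job name exists only as the map key, and each target's label set carries little more than __address__.

package main

import (
	"fmt"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/discovery/targetgroup"
)

func main() {
	// Roughly what the discovery manager produces: groups keyed by job name.
	// Neither "job" nor "__metrics_path__" appears in the target label sets.
	groups := map[string][]*targetgroup.Group{
		"prometheus": {{
			Targets: []model.LabelSet{{model.AddressLabel: "prom.domain:9001"}},
			Source:  "0",
		}},
		"prometheus2": {{
			Targets: []model.LabelSet{{model.AddressLabel: "prom.domain:9001"}},
			Source:  "0",
		}},
	}

	for job, tgs := range groups {
		for _, tg := range tgs {
			fmt.Println(job, tg.Targets) // the job name exists only as the map key
		}
	}
}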

Steps to Reproduce

Use a scrape config pointing to the same target from two different jobs, with Target Allocator version 0.121.0 or later:

config:
  scrape_configs:
  - job_name: prometheus
    static_configs:
    - targets: ["prom.domain:9001", "prom.domain:9002", "prom.domain:9003"]
  - job_name: prometheus2
    static_configs:
    - targets: ["prom.domain:9001"]

Expected Result

Target prom.domain:9001 will be allocated to a collector for both jobs, prometheus and prometheus2.

Actual Result

Target prom.domain:9001 will be allocated to a collector for only one of the jobs, prometheus or prometheus2, but not both.

Kubernetes Version

1.31.8

Operator version

v0.121.0+

Collector version

v0.121.0+

Environment information

Environment

OS: "Ubuntu 20.04"

Log output

Additional context

No response

@gracewehner added the bug and needs triage labels on May 28, 2025
@swiatekm
Contributor

Are you sure this also happens if the metrics paths are different? The logic behind #3832 was that all the data that used to go into the hash calculation was present in the labels anyway. That is definitely true for __address__ and __metrics_path__, but for the job name it is only true when using Prometheus CRs. So there's definitely a bug here, and it should be resolved by adding the job name label for the purpose of hash calculation.
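
A minimal sketch of that idea (illustrative only, written against the recent Prometheus model/labels API, and not necessarily how the fix will be implemented): inject the job name into the label set just for the hash, so the two jobs' copies of prom.domain:9001 stop colliding.

package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

// hashWithJob adds the job name as a label purely for hash calculation.
// Sketch only; the name and approach are illustrative.
func hashWithJob(jobName string, lbls labels.Labels) uint64 {
	b := labels.NewBuilder(lbls)
	b.Set("job", jobName)
	return b.Labels().Hash()
}

func main() {
	lbls := labels.FromStrings("__address__", "prom.domain:9001")
	// Different jobs now produce different hashes for the same address.
	fmt.Println(hashWithJob("prometheus", lbls) == hashWithJob("prometheus2", lbls)) // false
}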

Thank you for the report and detailed reproduction!

@swiatekm added the area:target-allocator label and removed the needs triage label on May 29, 2025
@gracewehner
Contributor Author

Thanks @swiatekm for the quick reply. The above debugger screenshot is for the config:

config:
  scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics1
    static_configs:
    - targets: ["prom.domain:9001", "prom.domain:9002", "prom.domain:9003"]
  - job_name: prometheus2
    metrics_path: /metrics2
    static_configs:
    - targets: ["prom.domain:9001"]

I was surprised not to see the __metrics_path__ label there too, along with job.

I looked into it more just to double-check: the labels in the targetgroup and target returned by the discovery manager are the bare minimum, with the address and some static labels. The scrape manager later takes these and combines them with more labels from the scrape config for job, metrics_path, etc.: https://github.com/prometheus/prometheus/blob/main/scrape/target.go#L445

Discovery manager target groups: https://github.com/prometheus/prometheus/blob/main/discovery/targetgroup/targetgroup.go
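
A simplified stand-in for that merge step (not Prometheus's actual PopulateLabels code, just an illustration of the split in responsibilities): the scrape layer starts from the per-job scrape config defaults and lets the discovered labels take precedence.

package main

import (
	"fmt"

	"github.com/prometheus/common/model"
)

// populateLabels is a simplified stand-in for what Prometheus's scrape layer
// does in PopulateLabels (scrape/target.go): merge scrape-config settings like
// job, __metrics_path__ and __scheme__ into the minimal discovered label set.
func populateLabels(discovered model.LabelSet, jobName, metricsPath, scheme string) model.LabelSet {
	defaults := model.LabelSet{
		model.JobLabel:         model.LabelValue(jobName),
		model.MetricsPathLabel: model.LabelValue(metricsPath),
		model.SchemeLabel:      model.LabelValue(scheme),
	}
	// Merge returns a copy in which the discovered labels override the defaults.
	return defaults.Merge(discovered)
}

func main() {
	discovered := model.LabelSet{model.AddressLabel: "prom.domain:9001"}
	fmt.Println(populateLabels(discovered, "prometheus", "/metrics1", "http"))
}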

I am happy to help by making a PR with a fix, but I'm not sure what strategy we want to take.

@swiatekm
Contributor

swiatekm commented May 30, 2025

That's a bit annoying. I don't want to revert the entirety of #3832 because it's conceptually sound and the performance improvement is significant. Unfortunately, Prometheus's labels.Labels don't let us pass in our own hasher. However, the implementation isn't particularly complicated, so I think the simplest fix is to copy it and add the job name and target URL at the end, thereby restoring the previous behaviour. Afterwards, we can rethink whether we want to mirror Prometheus's label initialization logic or do something bespoke. Does that make sense?
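
For illustration, one way such a copied hasher could look (assuming xxhash over the name/value pairs, as Prometheus's own label hashing uses; a sketch only, not the actual change that landed in the fix):

package main

import (
	"fmt"

	"github.com/cespare/xxhash/v2"
	"github.com/prometheus/prometheus/model/labels"
)

// hash sketches the proposal above: hash the label set the way Prometheus does
// (xxhash over name/value pairs), then mix in the job name and target URL at
// the end so the pre-#3832 uniqueness is restored.
func hash(jobName, targetURL string, lbls labels.Labels) uint64 {
	sep := []byte{'\xff'} // separator byte, as in Prometheus's label hashing
	h := xxhash.New()
	lbls.Range(func(l labels.Label) {
		_, _ = h.WriteString(l.Name)
		_, _ = h.Write(sep)
		_, _ = h.WriteString(l.Value)
		_, _ = h.Write(sep)
	})
	_, _ = h.WriteString(jobName)
	_, _ = h.Write(sep)
	_, _ = h.WriteString(targetURL)
	return h.Sum64()
}

func main() {
	lbls := labels.FromStrings("__address__", "prom.domain:9001")
	fmt.Println(hash("prometheus", "prom.domain:9001", lbls) !=
		hash("prometheus2", "prom.domain:9001", lbls)) // true: jobs stay distinct
}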

@gracewehner
Contributor Author

Ok, sounds good, thanks, that makes sense to me.

@gracewehner
Contributor Author

Thanks @swiatekm, I made an initial PR for this, let me know what you think.
