Describe the bug
otel-collector running with the Prometheus receiver configured to scrape Prometheus-compatible endpoints discovered via kubernetes_sd_configs stops scraping when some service-discovery endpoints change or become unreachable (which naturally happens during every deployment and its subsequent rolling restart).
The receiver appears to hit a deadlock somewhere while updating the SD target groups.
Steps to reproduce
otel-collector config: https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
To trigger the issue, it's enough to initiate a rolling restart of one of the target deployments. When this happens, the collector debug logs show the following:
{"level":"info","ts":1601986494.9710436,"caller":"service/service.go:252","msg":"Everything is ready. Begin running and processing data."}
{"level":"debug","ts":1601995775.1718767,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.37.173:1234/","err":"Get \"http://10.1.37.173:1234/\": dial tcp 10.1.37.173:1234: connect: connection refused"}
{"level":"warn","ts":1601995775.1720421,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995775171,"target_labels":"map[component:oap instance:10.1.37.173:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995776.6160927,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.7.143:1234/","err":"Get \"http://10.1.7.143:1234/\": dial tcp 10.1.7.143:1234: connect: connection refused"}
{"level":"warn","ts":1601995776.6162364,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995776615,"target_labels":"map[component:oap instance:10.1.7.143:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995798.0816824,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.49.45:1234/","err":"Get \"http://10.1.49.45:1234/\": context deadline exceeded"}
{"level":"debug","ts":1601995824.7997108,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}
{"level":"debug","ts":1601995829.799763,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}
(ad infinitum)
After this, all Prometheus receiver scraping stops (or at least the Prometheus exporter endpoint stops updating).
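The repeating "Discovery receiver's channel was full so will retry the next cycle" line suggests the consumer side of the discovery sync channel has stopped draining updates. Below is a minimal, self-contained Go sketch of that non-blocking-send-with-retry pattern (an illustration only, not the actual collector or Prometheus code; the channel name, buffer size, and target values are assumptions) showing how a stuck consumer turns every later target-group update into a drop-and-retry, which would explain why the receiver never picks up the new pod IPs after the rolling restart:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Stand-in for the discovery -> scrape-manager sync channel
	// (buffer size chosen arbitrarily for the sketch).
	syncCh := make(chan []string, 1)

	// Producer: every cycle, try to hand the latest target groups to the
	// consumer without blocking; on a full channel, log and retry later.
	go func() {
		ticker := time.NewTicker(500 * time.Millisecond)
		defer ticker.Stop()
		for range ticker.C {
			groups := []string{"10.1.37.173:1234", "10.1.7.143:1234"}
			select {
			case syncCh <- groups:
				fmt.Println("sent updated target groups")
			default:
				// This branch corresponds to the repeated debug line above.
				fmt.Println("Discovery receiver's channel was full so will retry the next cycle")
			}
		}
	}()

	// Consumer: applies the first update and then blocks, simulating a
	// scrape-pool reload that never returns. From then on the producer
	// can only drop and retry, so target changes are never applied.
	go func() {
		for groups := range syncCh {
			fmt.Printf("applying %d targets\n", len(groups))
			select {} // block forever
		}
	}()

	time.Sleep(3 * time.Second)
}
```

If the real consumer (the scrape manager's reload loop) blocks like this, the discovery manager keeps logging the "channel was full" message on every cycle and the scrape pools are never updated, matching the behaviour described above.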
What did you expect to see?
The Prometheus receiver gracefully handling some targets becoming unavailable, as well as changes in the service-discovery targets.
What did you see instead?
Prometheus receiver scraping stops functioning completely.
What version did you use?
from /debug/servicez:
GitHash c8aac9e3
BuildType release
Goversion go1.14.7
OS linux
Architecture amd64
What config did you use?
Config: https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
Environment
Goversion go1.14.7
OS linux
Architecture amd64
Kubernetes 1.17 on EKS
Additional context
The issue exists in at least 0.2.7, 0.8.0, 0.10.0, and the latest master.