Describe the bug
otel-collector running with the Prometheus receiver configured to scrape Prometheus-compatible endpoints discovered via kubernetes_sd_configs stops scraping when some service-discovery endpoints change or become unreachable (which naturally happens during every deployment and its subsequent rolling restart).
The receiver appears to hit a deadlock somewhere while updating the SD target groups.
Steps to reproduce
otel-collector config: https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
To trigger the issue, it's enough to initiate a rolling restart of one of the target deployments. When this happens, the collector debug logs show the following:
{"level":"info","ts":1601986494.9710436,"caller":"service/service.go:252","msg":"Everything is ready. Begin running and processing data."}
{"level":"debug","ts":1601995775.1718767,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.37.173:1234/","err":"Get \"http://10.1.37.173:1234/\": dial tcp 10.1.37.173:1234: connect: connection refused"}
{"level":"warn","ts":1601995775.1720421,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995775171,"target_labels":"map[component:oap instance:10.1.37.173:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995776.6160927,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.7.143:1234/","err":"Get \"http://10.1.7.143:1234/\": dial tcp 10.1.7.143:1234: connect: connection refused"}
{"level":"warn","ts":1601995776.6162364,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995776615,"target_labels":"map[component:oap instance:10.1.7.143:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995798.0816824,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.49.45:1234/","err":"Get \"http://10.1.49.45:1234/\": context deadline exceeded"}
{"level":"debug","ts":1601995824.7997108,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}
{"level":"debug","ts":1601995829.799763,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}
(ad infinitum)
After this, all Prometheus receiver scraping stops (or at least the Prometheus exporter endpoint stops updating).
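The repeating "Discovery receiver's channel was full so will retry the next cycle" line suggests the consumer side of the discovery sync channel has stopped draining updates. Below is a minimal, self-contained Go sketch of that non-blocking-send-with-retry pattern (an illustration only, not the actual collector or Prometheus code; the channel name, buffer size, and target values are assumptions) showing how a stuck consumer turns every later target-group update into a drop-and-retry, which would explain why the receiver never picks up the new pod IPs after the rolling restart:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Stand-in for the discovery -> scrape-manager sync channel
	// (buffer size chosen arbitrarily for the sketch).
	syncCh := make(chan []string, 1)

	// Producer: every cycle, try to hand the latest target groups to the
	// consumer without blocking; on a full channel, log and retry later.
	go func() {
		ticker := time.NewTicker(500 * time.Millisecond)
		defer ticker.Stop()
		for range ticker.C {
			groups := []string{"10.1.37.173:1234", "10.1.7.143:1234"}
			select {
			case syncCh <- groups:
				fmt.Println("sent updated target groups")
			default:
				// This branch corresponds to the repeated debug line above.
				fmt.Println("Discovery receiver's channel was full so will retry the next cycle")
			}
		}
	}()

	// Consumer: applies the first update and then blocks, simulating a
	// scrape-pool reload that never returns. From then on the producer
	// can only drop and retry, so target changes are never applied.
	go func() {
		for groups := range syncCh {
			fmt.Printf("applying %d targets\n", len(groups))
			select {} // block forever
		}
	}()

	time.Sleep(3 * time.Second)
}
```

If the real consumer (the scrape manager's reload loop) blocks like this, the discovery manager keeps logging the "channel was full" message on every cycle and the scrape pools are never updated, matching the behaviour described above.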
What did you expect to see?
The Prometheus receiver gracefully handling some targets becoming unavailable, as well as changes in the service-discovery targets.
What did you see instead?
Prometheus receiver scraping stops functioning completely.
What version did you use?
from /debug/servicez:
GitHash c8aac9e3
BuildType release
Goversion go1.14.7
OS linux
Architecture amd64
What config did you use?
Config: https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
Environment
Goversion go1.14.7
OS linux
Architecture amd64
Kubernetes 1.17 on EKS
Additional context
The issue exists in at least 0.2.7, 0.8.0, 0.10.0, and the latest master.