[collector] collector pod evicted due to being unhealthy (health_check extension enabled) #3825

davgia · 2025-03-19T18:08:46Z

Component(s)

collector

What happened?

Description

The opentelemetry-collector pod appears to be restarted because it is considered unhealthy (health_check extension enabled).

Steps to Reproduce

Deploy the grafana/tempo chart at version 1.18.2
Deploy the opentelemetry-operator Helm chart at the version listed under "Operator version" below
Apply the following collector definition:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: "${local.namespace}"
spec:
  mode: deployment
  config:
    receivers:
      jaeger:
        protocols:
          grpc: {}
          thrift_http: {}
          thrift_compact: {}
      otlp:
        protocols:
          grpc: {}
          http: {}
      zipkin: {}
    extensions:
      health_check: {}
      pprof: {}
      zpages: {}
    processors:
      batch: {}
    exporters:
      otlp:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
    service:
      extensions:
        - health_check
        - pprof
        - zpages
      pipelines:
        traces:
          receivers:
            - otlp
            - jaeger
            - zipkin
          processors:
            - batch
          exporters:
            - otlp

Observe that the collector pod starts. The log shows the message: Everything is ready. Begin running and processing data.
Port-forward to the pod directly and curl localhost:13133/. It returns 200 OK: {"status":"Server available","upSince":"2025-03-19T17:48:23.084107432Z","uptime":"14.231745165s"}
Observe that the collector pod is restarted: Received signal from OS {"signal": "terminated"}

Expected Result

The collector pod should become healthy because the probes (which seem to be correctly configured to make an HTTP GET request to :13133/) return 200 OK.

Actual Result

The collector pod goes into CrashLoopBackOff because the cluster terminates it too many times, even though the probes seem to return 200 OK.

The conditions of the collector deployment are:

# kubectl describe deploy/otel-collector -n [REDACTED]
  Containers:
   otc-container:
    Image:       [REDACTED]/otel/opentelemetry-collector-contrib:0.120.0
    Ports:       13133/TCP, 8888/TCP, 4317/TCP, 4318/TCP
    Args:
      --config=/conf/collector.yaml
    Liveness:   http-get http://:13133/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:13133/ delay=0s timeout=1s period=10s #success=1 #failure=3
# [...]
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    False   ProgressDeadlineExceeded

Kubernetes Version

1.31.4

Operator version

0.83.0

Collector version

0.120.0

Environment information

Environment

OS: (e.g., "Ubuntu 24.04.1 LTS")

Log output

2025-03-19T17:52:12.860Z    info    [email protected]/service.go:193    Setting up own telemetry...
2025-03-19T17:52:12.860Z    warn    [email protected]/service.go:241    service::telemetry::metrics::address is being deprecated in favor of service::telemetry::metrics::readers
2025-03-19T17:52:12.862Z    info    [email protected]/service.go:258    Starting otelcol-contrib...    {"Version": "0.120.1", "NumCPU": 8}
2025-03-19T17:52:12.862Z    info    extensions/extensions.go:40    Starting extensions...
2025-03-19T17:52:12.862Z    info    extensions/extensions.go:44    Extension is starting...    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.862Z    info    [email protected]/zpagesextension.go:54    Registered zPages span processor on tracer provider    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.862Z    info    [email protected]/zpagesextension.go:64    Registered Host's zPages    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.863Z    info    [email protected]/zpagesextension.go:76    Starting zPages extension    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension", "config": {"Endpoint":"localhost:55679","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0}}
2025-03-19T17:52:12.863Z    info    extensions/extensions.go:61    Extension started.    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.863Z    info    extensions/extensions.go:44    Extension is starting...    {"otelcol.component.id": "pprof", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.863Z    info    [email protected]/pprofextension.go:61    Starting net/http/pprof server    {"otelcol.component.id": "pprof", "otelcol.component.kind": "Extension", "config": {"TCPAddr":{"Endpoint":"localhost:1777","DialerConfig":{"Timeout":0}},"BlockProfileFraction":0,"MutexProfileFraction":0,"SaveToFile":""}}
2025-03-19T17:52:12.863Z    info    extensions/extensions.go:61    Extension started.    {"otelcol.component.id": "pprof", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.863Z    info    extensions/extensions.go:44    Extension is starting...    {"otelcol.component.id": "health_check", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.863Z    info    [email protected]/healthcheckextension.go:32    Starting health_check extension    {"otelcol.component.id": "health_check", "otelcol.component.kind": "Extension", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2025-03-19T17:52:12.863Z    info    extensions/extensions.go:61    Extension started.    {"otelcol.component.id": "health_check", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.865Z    info    [email protected]/otlp.go:116    Starting GRPC server    {"otelcol.component.id": "otlp", "otelcol.component.kind": "Receiver", "endpoint": "0.0.0.0:4317"}
2025-03-19T17:52:12.865Z    info    [email protected]/otlp.go:173    Starting HTTP server    {"otelcol.component.id": "otlp", "otelcol.component.kind": "Receiver", "endpoint": "0.0.0.0:4318"}
2025-03-19T17:52:12.865Z    info    healthcheck/handler.go:132    Health Check state change    {"otelcol.component.id": "health_check", "otelcol.component.kind": "Extension", "status": "ready"}
2025-03-19T17:52:12.865Z    info    [email protected]/service.go:281    Everything is ready. Begin running and processing data.
2025-03-19T17:52:42.527Z    info    [email protected]/collector.go:339    Received signal from OS    {"signal": "terminated"}
2025-03-19T17:52:42.527Z    info    [email protected]/service.go:323    Starting shutdown...
2025-03-19T17:52:42.527Z    info    healthcheck/handler.go:132    Health Check state change    {"otelcol.component.id": "health_check", "otelcol.component.kind": "Extension", "status": "unavailable"}
2025-03-19T17:52:42.527Z    info    extensions/extensions.go:68    Stopping extensions...
2025-03-19T17:52:42.527Z    info    [email protected]/zpagesextension.go:105    Unregistered zPages span processor on tracer provider    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:42.527Z    info    [email protected]/service.go:337    Shutdown complete.

Additional context

No response

shuheiktgw · 2025-03-24T07:53:37Z

This seems to be the same issue as #3688. A possible workaround can be found here: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/examples/deployment-only/rendered/configmap.yaml#L19-L21
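
For context, the startup log in the report shows the health_check extension listening on localhost:13133. Kubelet probes target the pod IP, so they cannot reach a listener bound to loopback, while a port-forward connects from inside the pod's network namespace and therefore succeeds. A minimal sketch of the kind of override the linked example points to, assuming an environment variable named MY_POD_IP (a name chosen here for illustration) is injected from status.podIP:

```yaml
spec:
  env:
    # Expose the pod IP to the collector process (variable name is an assumption).
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  config:
    extensions:
      health_check:
        # Bind to the pod IP instead of the localhost default so probes can connect.
        endpoint: ${env:MY_POD_IP}:13133
```

Only the env entry and the health_check endpoint differ from the reported definition; the rest of the spec would stay unchanged.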

davgia · 2025-03-24T08:08:18Z

You are right, it's the same issue. I had already figured out that it could be the address binding, though. I will wait for the fix. Thanks!

yuriolisa · 2025-04-07T10:59:37Z

@davgia, this issue was fixed through #3856. Please test it out, and if it's not behaving as expected, feel free to reopen it.

davgia · 2025-04-09T08:40:55Z

Hi @yuriolisa, I have updated the opentelemetry-operator Helm chart to the latest version (chart version 0.84.2), but I still see the same behaviour.

This is my collector definition (it's managed from Terraform, hence the "$$" escape and "${}" interpolation):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: "${local.namespace}"
spec:
  mode: deployment
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "environment"
                operator: In
                values:
                  - "${var.name}"
  env:
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: $${env:POD_IP}:4317
          http:
            endpoint: $${env:POD_IP}:4318
    extensions:
      health_check: {}
    processors:
      batch: {}
    exporters:
      otlp:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
    service:
      extensions:
        - health_check
      pipelines:
        traces:
          receivers:
            - otlp
          processors:
            - batch
          exporters:
            - otlp
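
Note that health_check is still left at its defaults in this definition. A sketch of pinning its endpoint explicitly, reusing the POD_IP variable and the Terraform "$$" escape from above; whether this is still required after #3856 is an assumption, not something confirmed in this thread:

```yaml
    extensions:
      health_check:
        # Same pattern as the otlp receiver endpoints above; 13133 is the port the probes target.
        endpoint: $${env:POD_IP}:13133
```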

davgia · 2025-04-22T10:52:17Z

Hi @yuriolisa, any update? I do not have permission to reopen this issue.

davgia added the "bug" (Something isn't working) and "needs triage" labels Mar 19, 2025

shuheiktgw mentioned this issue Mar 30, 2025

Set default endpoint for health check extension #3856

Merged

yuriolisa closed this as completed Apr 7, 2025

jackgopack4 mentioned this issue May 30, 2025

[processor/resourcedetection] Azure provider fails with timeout >30s open-telemetry/opentelemetry-collector-contrib#40372

Open
