[collector] collector pod evicted due to being unhealthy (health_check extension enabled) #3825

Closed
davgia opened this issue Mar 19, 2025 · 5 comments
Labels
bug Something isn't working needs triage

Comments

@davgia

davgia commented Mar 19, 2025

Component(s)

collector

What happened?

Description

The opentelemetry-collector pod is repeatedly restarted, apparently because it is considered unhealthy (health_check extension enabled).

Steps to Reproduce

  1. Deploy grafana/tempo chart at version 1.18.2
  2. Deploy opentelemetry-operator helm chart at the specified version
  3. Specify the following collector:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: "${local.namespace}"
spec:
  mode: deployment
  config:
    receivers:
      jaeger:
        protocols:
          grpc: {}
          thrift_http: {}
          thrift_compact: {}
      otlp:
        protocols:
          grpc: {}
          http: {}
      zipkin: {}
    extensions:
      health_check: {}
      pprof: {}
      zpages: {}
    processors:
      batch: {}
    exporters:
      otlp:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
    service:
      extensions:
        - health_check
        - pprof
        - zpages
      pipelines:
        traces:
          receivers:
            - otlp
            - jaeger
            - zipkin
          processors:
            - batch
          exporters:
            - otlp
  4. See that the collector pod starts. The log shows the following message: Everything is ready. Begin running and processing data.
  5. Port-forward to the pod directly and curl localhost:13133/. It returns 200 OK: {"status":"Server available","upSince":"2025-03-19T17:48:23.084107432Z","uptime":"14.231745165s"}
  6. See that the collector pod is restarted: Received signal from OS {"signal": "terminated"}

Expected Result

The collector pod should become healthy because the probes (which appear to be correctly configured to make an HTTP GET request to :13133/) return 200 OK.

Actual Result

The collector pod goes into CrashLoopBackOff because the cluster terminates it repeatedly, even though the probes seem to return 200 OK.

The conditions of the collector deployment are:

# kubectl describe deploy/otel-collector -n [REDACTED]
  Containers:
   otc-container:
    Image:       [REDACTED]/otel/opentelemetry-collector-contrib:0.120.0
    Ports:       13133/TCP, 8888/TCP, 4317/TCP, 4318/TCP
    Args:
      --config=/conf/collector.yaml
    Liveness:   http-get http://:13133/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:13133/ delay=0s timeout=1s period=10s #success=1 #failure=3
# [...]
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    False   ProgressDeadlineExceeded
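
For reference, the Liveness and Readiness lines above correspond to a probe definition roughly like the following. This is a sketch reconstructed from the describe output, not copied from the generated Deployment, so the exact manifest the operator renders may differ.

```yaml
# Probe spec equivalent to:
#   http-get http://:13133/ delay=0s timeout=1s period=10s #success=1 #failure=3
livenessProbe:
  httpGet:
    path: /
    port: 13133        # no host set, so the kubelet targets the pod IP
  initialDelaySeconds: 0
  timeoutSeconds: 1
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3  # 3 failed probes, 10s apart, before the container is restarted
```

With these values the container is restarted after three consecutive failed probes, which fits the roughly 30 seconds between startup and the terminated signal in the log output of this report.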

Kubernetes Version

1.31.4

Operator version

0.83.0

Collector version

0.120.0

Environment information

Environment

OS: (e.g., "Ubuntu 24.04.1 LTS")

Log output

2025-03-19T17:52:12.860Z    info    [email protected]/service.go:193    Setting up own telemetry...
2025-03-19T17:52:12.860Z    warn    [email protected]/service.go:241    service::telemetry::metrics::address is being deprecated in favor of service::telemetry::metrics::readers
2025-03-19T17:52:12.862Z    info    [email protected]/service.go:258    Starting otelcol-contrib...    {"Version": "0.120.1", "NumCPU": 8}
2025-03-19T17:52:12.862Z    info    extensions/extensions.go:40    Starting extensions...
2025-03-19T17:52:12.862Z    info    extensions/extensions.go:44    Extension is starting...    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.862Z    info    [email protected]/zpagesextension.go:54    Registered zPages span processor on tracer provider    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.862Z    info    [email protected]/zpagesextension.go:64    Registered Host's zPages    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.863Z    info    [email protected]/zpagesextension.go:76    Starting zPages extension    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension", "config": {"Endpoint":"localhost:55679","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0}}
2025-03-19T17:52:12.863Z    info    extensions/extensions.go:61    Extension started.    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.863Z    info    extensions/extensions.go:44    Extension is starting...    {"otelcol.component.id": "pprof", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.863Z    info    [email protected]/pprofextension.go:61    Starting net/http/pprof server    {"otelcol.component.id": "pprof", "otelcol.component.kind": "Extension", "config": {"TCPAddr":{"Endpoint":"localhost:1777","DialerConfig":{"Timeout":0}},"BlockProfileFraction":0,"MutexProfileFraction":0,"SaveToFile":""}}
2025-03-19T17:52:12.863Z    info    extensions/extensions.go:61    Extension started.    {"otelcol.component.id": "pprof", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.863Z    info    extensions/extensions.go:44    Extension is starting...    {"otelcol.component.id": "health_check", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.863Z    info    [email protected]/healthcheckextension.go:32    Starting health_check extension    {"otelcol.component.id": "health_check", "otelcol.component.kind": "Extension", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2025-03-19T17:52:12.863Z    info    extensions/extensions.go:61    Extension started.    {"otelcol.component.id": "health_check", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:12.865Z    info    [email protected]/otlp.go:116    Starting GRPC server    {"otelcol.component.id": "otlp", "otelcol.component.kind": "Receiver", "endpoint": "0.0.0.0:4317"}
2025-03-19T17:52:12.865Z    info    [email protected]/otlp.go:173    Starting HTTP server    {"otelcol.component.id": "otlp", "otelcol.component.kind": "Receiver", "endpoint": "0.0.0.0:4318"}
2025-03-19T17:52:12.865Z    info    healthcheck/handler.go:132    Health Check state change    {"otelcol.component.id": "health_check", "otelcol.component.kind": "Extension", "status": "ready"}
2025-03-19T17:52:12.865Z    info    [email protected]/service.go:281    Everything is ready. Begin running and processing data.
2025-03-19T17:52:42.527Z    info    [email protected]/collector.go:339    Received signal from OS    {"signal": "terminated"}
2025-03-19T17:52:42.527Z    info    [email protected]/service.go:323    Starting shutdown...
2025-03-19T17:52:42.527Z    info    healthcheck/handler.go:132    Health Check state change    {"otelcol.component.id": "health_check", "otelcol.component.kind": "Extension", "status": "unavailable"}
2025-03-19T17:52:42.527Z    info    extensions/extensions.go:68    Stopping extensions...
2025-03-19T17:52:42.527Z    info    [email protected]/zpagesextension.go:105    Unregistered zPages span processor on tracer provider    {"otelcol.component.id": "zpages", "otelcol.component.kind": "Extension"}
2025-03-19T17:52:42.527Z    info    [email protected]/service.go:337    Shutdown complete.

Additional context

No response

@davgia davgia added bug Something isn't working needs triage labels Mar 19, 2025
@shuheiktgw
Contributor

@davgia
Author

davgia commented Mar 24, 2025

> This seems to be the same issue as #3688. A possible workaround can be found here: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/examples/deployment-only/rendered/configmap.yaml#L19-L21

You are right, it's the same issue. I had already figured out that it could be the address binding, though. I will wait for the fix. Thanks!

@yuriolisa
Contributor

@davgia, this issue was fixed through #3856. Please test it out, and if it's not behaving as expected, feel free to reopen it.

@davgia
Author

davgia commented Apr 9, 2025

Hi @yuriolisa, I have updated the opentelemetry-operator helm chart to the latest version but I see the same behaviour (the chart version is 0.84.2).

This is my collector definition (it's managed from Terraform, hence the "$$" escape and "${}" interpolation):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: "${local.namespace}"
spec:
  mode: deployment
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "environment"
                operator: In
                values:
                  - "${var.name}"
  env:
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: $${env:POD_IP}:4317
          http:
            endpoint: $${env:POD_IP}:4318
    extensions:
      health_check: {}
    processors:
      batch: {}
    exporters:
      otlp:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
    service:
      extensions:
        - health_check
      pipelines:
        traces:
          receivers:
            - otlp
          processors:
            - batch
          exporters:
            - otlp
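
Note that health_check is still configured as {} in this definition, so the extension may still be listening on its default localhost:13133 endpoint (as the startup log in the original report shows), which the kubelet probes cannot reach through the pod IP. A minimal sketch of the workaround linked earlier, applied here, assuming the POD_IP variable defined above and the same Terraform "$$" escape. It has not been verified against the updated operator version:

```yaml
    extensions:
      health_check:
        # Bind to the pod IP instead of the default localhost so kubelet probes can connect.
        endpoint: $${env:POD_IP}:13133
```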

@davgia
Author

davgia commented Apr 22, 2025

Hi @yuriolisa, any update? I do not have permission to reopen this issue.

3 participants