[connector/datadog] Memory leak and increased CPU usage in v0.121.0 #38904

Open · srolel opened this issue Mar 24, 2025 · 14 comments
Labels: bug (Something isn't working), connector/datadog, priority:p1 (High)


srolel commented Mar 24, 2025

Component(s)

No response

What happened?

Description

Hi, we've noticed a slow-ish memory leak (over several days) and increased CPU usage after upgrading from v0.120.0 to v0.121.0, with no other changes.

Image

(The beginning of the ramp-up in memory is the v0.121.0 deployment, followed by a rollback.)

Steps to Reproduce

Expected Result

Memory and CPU usage remain unchanged after the upgrade.

Actual Result

Memory leak and increased CPU usage are observed

Collector version

v0.121.0

Environment information

Environment

OS: Docker image on ECS. The collector runs as a gateway for other collectors.

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 8192
    timeout: 1s

  transform:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(resource.attributes["_source"], "otlp_sensitive") where IsMatch(body, "\\[SENSITIVE\\]")
      - context: resource
        statements:
          # _sourceCategory
          - set(attributes["_sourceCategory"], Concat([attributes["deployment.environment.name"], attributes["service.name"], attributes["log.category"]], "_")) where attributes["_sourceCategory"] == nil

          # _account
          - set(attributes["_account"], attributes["cloud.account.id"]) where attributes["cloud.account.id"] != nil and attributes["_account"] == nil

          # _region
          - set(attributes["_region"], attributes["cloud.region"]) where attributes["cloud.region"] != nil and attributes["_region"] == nil

          # _from
          - set(attributes["_from"], Concat([attributes["_account"], attributes["_region"], attributes["deployment.environment.name"], attributes["service.namespace"], attributes["service.name"], attributes["log.category"]], "/")) where attributes["_from"] == nil

  resource:
    attributes:
      - action: delete
        pattern: ^(process|telemetry|thread)\..*
      - action: insert
        key: _sourceHost
        from_attribute: host.name
      - action: insert
        key: _sourceName
        from_attribute: log.file.path_resolved
      - action: insert
        key: service.namespace
        value: 'services'

  filter:
    error_mode: ignore
    traces:
      span:
        - 'IsMatch(attributes["db.statement"], "MGET")'

  cumulativetodelta:

exporters:
  otlphttp:
    endpoint: '${env:ENDPOINT}'
  datadog/exporter:
    api:
      key: ${env:DD_API_KEY}
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1000

connectors:
  datadog/connector:

extensions:
  health_check:
    endpoint: '0.0.0.0:13133'
  pprof:
  zpages:

service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - periodic:
            interval: 60000
            exporter:
              otlp:
                protocol: grpc
                endpoint: collector:4317
  extensions:
    - health_check
    - pprof
    - zpages
  pipelines:
    metrics:
      receivers:
        - datadog/connector
        - otlp
      processors:
        - batch
      exporters:
        - datadog/exporter
    traces:
      receivers:
        - otlp
      processors:
        - filter
        - resource
        - transform
        - batch
      exporters:
        - datadog/exporter
        - datadog/connector
    logs:
      receivers:
        - otlp
      processors:
        - resource
        - transform
        - batch
      exporters:
        - otlphttp

Log output

Additional context

srolel added the bug and needs triage labels on Mar 24, 2025
crobert-1 (Member) commented:

Hello @srolel, thanks for filing! Can you please share your collector config? You're welcome to remove sensitive data, but as much information as possible would be helpful.

(With so many components and no specific configuration on each, it would be hard to make progress debugging this.)


srolel commented Mar 24, 2025

@crobert-1 Thanks, got it. I've added a version of our config.


Pinging code owners for exporter/datadogexporter: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr @ankitpatel96 @jade-guiton-dd @IbraheemA. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.


Pinging code owners for connector/datadogconnector: @mx-psi @dineshg13 @ankitpatel96 @jade-guiton-dd @IbraheemA. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.


Pinging code owners for processor/transformprocessor: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley @edmocosta. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.

crobert-1 (Member) commented:

I'm adding more labels than likely necessary in an effort to try to engage code owners and determine which component is responsible here.


songy23 commented Mar 26, 2025

Hey @srolel, with so many components running here it's hard to tell which one is contributing to the problem. Could you get a profile while running v0.121.0 to see the memory breakdown?


truthbk commented Mar 26, 2025

Since the pprof extension is enabled, it might also be worth collecting a goroutine dump to check whether it's a goroutine leak. It looks like the user is not overriding the pprof endpoint, so that would look like this:

curl -X GET http://localhost:1777/debug/pprof/goroutine?debug=2 -o goroutine.dump
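
(For reference, the same pprof endpoint also serves the standard Go net/http/pprof heap and CPU profiles, which can help distinguish heap growth from a goroutine leak. A minimal sketch, assuming the default 1777 port; the output file names are arbitrary:)

# In-use heap profile
curl -X GET "http://localhost:1777/debug/pprof/heap" -o heap.pprof

# 30-second CPU profile
curl -X GET "http://localhost:1777/debug/pprof/profile?seconds=30" -o cpu.pprof

# Inspect locally with the Go toolchain
go tool pprof -top heap.pprof
go tool pprof -top cpu.pprof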


srolel commented Mar 27, 2025

This is what I get from top in pprof:

Image

and here's a dump with curl -X GET http://localhost:1777/debug/pprof/goroutine?debug=2 -o goroutine.dump:

goroutine.dump.txt

pguinard-public-com commented:

We run a similar setup and have a working binary that we build locally with the Dockerfile below. We only use this for debugging; reproducibility between the provided image and the locally built image is what we're after.

FROM alpine:latest@sha256:a8560b36e8b8210634f77d9f7f9efd7ffa463e380b75e2e74aff4511df3ef88c AS prep
RUN apk --update add ca-certificates

FROM golang:1.24.2-bookworm AS gobuilder
RUN go install go.opentelemetry.io/collector/cmd/builder@<version>
RUN git clone https://github.com/open-telemetry/opentelemetry-collector-contrib.git /otelcol
RUN mkdir /otelcol/cmd/otelcontribcol/cmd
WORKDIR /otelcol/cmd/otelcontribcol
RUN git checkout v0.120.0
COPY builder-config.yaml /otelcol/cmd/otelcontribcol/builder-config.yaml
RUN /go/bin/builder --config builder-config.yaml

#FROM scratch
FROM alpine:latest

RUN apk add --no-cache gcompat

ARG USER_UID=10001
ARG USER_GID=10001
USER ${USER_UID}:${USER_GID}

COPY --from=prep /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
COPY --from=gobuilder /otelcol/cmd/otelcontribcol/cmd/otelcontribcol/otelcontribcol /otelcontribcol
EXPOSE 4317 55680 55679
ENTRYPOINT ["/otelcontribcol"]
CMD ["--config", "/etc/otel-contrib/config.yaml"]

# OUR INTERNAL CONFIG

ENV DECISION_WAIT=30
ENV QUEUE_SAMPLING_PERCENTAGE=1
ENV NUM_TRACES=50000
ENV TRACE_EXPORTERS=[datadog/exporter]
ENV METRICS_EXPORTERS=[datadog/exporter]
ENV METRICS_SCRAPE_INTERVAL=30
# Add prometheus to receive internal metrics
ENV METRICS_RECEIVERS=[otlp]
# Allow localstack in docker run, HOME=/ should only needed for testing
ENV HOME=/

ADD config.yaml /etc/otel-contrib/config.yaml
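
(A minimal build-and-run sketch for the Dockerfile above, assuming builder-config.yaml and config.yaml sit next to it in the build context; the image tag is arbitrary and the config is baked in at build time via the ADD line:)

docker build -t otelcontribcol-local:v0.120.0 .
docker run --rm -p 4317:4317 -p 4318:4318 otelcontribcol-local:v0.120.0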

Here's a builder-config.yaml that works with no memory leaks:


dist:
  module: github.com/open-telemetry/opentelemetry-collector-contrib/cmd/otelcontribcol
  name: otelcontribcol
  description: Local OpenTelemetry Collector Contrib binary, testing only.
  version: 0.124.1-dev
  output_path: ./cmd/otelcontribcol

extensions:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckextension v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckv2extension v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/pprofextension v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage v0.120.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/debugexporter v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter v0.120.0
processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/attributesprocessor v0.121.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/filterprocessor v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.120.0
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/simpleprometheusreceiver v0.120.0
connectors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/datadogconnector v0.120.1
providers:
  - gomod: go.opentelemetry.io/collector/confmap/provider/envprovider v1.30.0
  - gomod: go.opentelemetry.io/collector/confmap/provider/fileprovider v1.30.0

Replacing the following module makes the memory leak appear, so the regression is between v0.120.1 and v0.121.0:

-  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/datadogconnector v0.120.1
+  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/datadogconnector v0.121.0
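
(In other words, bumping only the connector module in builder-config.yaml and rebuilding flips the behaviour between healthy and leaking. A minimal sketch of that rebuild step, assuming the builder path and working directory from the Dockerfile above:)

# Bump only the Datadog connector module, then rebuild and rerun
sed -i 's#connector/datadogconnector v0.120.1#connector/datadogconnector v0.121.0#' builder-config.yaml
/go/bin/builder --config builder-config.yaml
./cmd/otelcontribcol/otelcontribcol --config /etc/otel-contrib/config.yaml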

TylerHelmuth added the priority:p1 label and removed the exporter/datadog, processor/transform, and needs triage labels on Apr 21, 2025

zarbis commented May 20, 2025

I'll chime in with additional info. I also faced this issue when upgrading from 0.117.0 to 0.126.0, and then narrowed the regression down with a binary search to the same version, 0.121.0, as the OP.

Some additional info:

  • I run logs, metrics, and traces collectors as separate otelcols via the otel-operator
  • This issue is present ONLY in the metrics collector
  • I don't run the Datadog connector

Here is the CPU profile:
Image

Memory:
Image

Goroutines:

Image

Here is my config:

config:
  connectors:
    forward/all: {}
  exporters:
    debug/detailed:
      sampling_initial: 5
      sampling_thereafter: 100
      verbosity: detailed
    debug/normal:
      verbosity: normal
    otlphttp:
      auth:
        authenticator: basicauth/client
      compression: gzip
      endpoint: <REDACTED>
  extensions:
    basicauth/client:
      client_auth:
        password: ${OTEL_GW_PASSWORD}
        username: ${OTEL_GW_USERNAME}
    health_check:
      endpoint: 0.0.0.0:13133
      path: /
    pprof:
      endpoint: 0.0.0.0:1777
  processors:
    attributes:
      actions:
      - action: upsert
        key: cluster
        value: <REDACTED>
      - action: upsert
        key: alerting_tier
        value: "1"
    batch:
      send_batch_max_size: 5000
      send_batch_size: 2000
    memory_limiter:
      check_interval: 1s
      limit_percentage: 80
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    prometheus:
      config:
        scrape_configs:
        - job_name: dummy
          static_configs:
          - targets:
            - 127.0.0.1:8888
  service:
    extensions:
    - basicauth/client
    - health_check
    - pprof
    pipelines:
      metrics/forward:
        receivers:
        - otlp
        - prometheus
        processors:
        - memory_limiter
        - attributes
        exporters:
        - forward/all
      metrics/otlp:
        receivers:
        - forward/all
        processors:
        - batch
        exporters:
        - otlphttp


gracewehner commented May 22, 2025

We are also seeing very high CPU on some clusters with the prometheusreceiver and otlpexporter after upgrading to 0.121.0. It only repros on some clusters, though, and we are currently trying to narrow down what the difference is.

mx-psi changed the title from "Memory leak and increased CPU usage in v0.121.0" to "[connector/datadog] Memory leak and increased CPU usage in v0.121.0" on May 23, 2025

mx-psi commented May 23, 2025

@zarbis @gracewehner Let's keep this issue for Datadog connector related issues. If you are not using the Datadog connector on your Collector or your profiles point to somewhere else, please file a separate issue. Thanks!


mx-psi commented May 23, 2025

@srolel I took a look at the info you provided. First, I want to thank you for providing the details regarding your issue, and I also want to apologize for taking so long to reply. To continue investigating this issue, I would benefit from having:

If these contain any confidential information please file a ticket on https://www.datadoghq.com/support/, otherwise we can continue here. Thanks!
