[connector/datadog] Memory leak and increased CPU usage in v0.121.0 #38904

Open · srolel opened this issue Mar 24, 2025 · 14 comments
Labels: bug (Something isn't working), connector/datadog, priority:p1 (High)


srolel commented Mar 24, 2025

Component(s)

No response

What happened?

Description

Hi, we've noticed a slow-ish memory leak (over several days) and increased CPU usage after upgrading from v0.120.0 to v0.121.0, with no other changes.

Image

(The beginning of the ramp-up in memory is the v0.121.0 deployment, followed by a rollback.)

Steps to Reproduce

Expected Result

Memory and CPU usage remain unchanged after the upgrade.

Actual Result

Memory leak and increased CPU usage are observed

Collector version

v0.121.0

Environment information

Environment

OS: Docker image on ECS. The collector runs as a gateway for other collectors.

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 8192
    timeout: 1s

  transform:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(resource.attributes["_source"], "otlp_sensitive") where IsMatch(body, "\\[SENSITIVE\\]")
      - context: resource
        statements:
          # _sourceCategory
          - set(attributes["_sourceCategory"], Concat([attributes["deployment.environment.name"], attributes["service.name"], attributes["log.category"]], "_")) where attributes["_sourceCategory"] == nil

          # _account
          - set(attributes["_account"], attributes["cloud.account.id"]) where attributes["cloud.account.id"] != nil and attributes["_account"] == nil

          # _region
          - set(attributes["_region"], attributes["cloud.region"]) where attributes["cloud.region"] != nil and attributes["_region"] == nil

          # _from
          - set(attributes["_from"], Concat([attributes["_account"], attributes["_region"], attributes["deployment.environment.name"], attributes["service.namespace"], attributes["service.name"], attributes["log.category"]], "/")) where attributes["_from"] == nil

  resource:
    attributes:
      - action: delete
        pattern: ^(process|telemetry|thread)\..*
      - action: insert
        key: _sourceHost
        from_attribute: host.name
      - action: insert
        key: _sourceName
        from_attribute: log.file.path_resolved
      - action: insert
        key: service.namespace
        value: 'services'

  filter:
    error_mode: ignore
    traces:
      span:
        - 'IsMatch(attributes["db.statement"], "MGET")'

  cumulativetodelta:

exporters:
  otlphttp:
    endpoint: '${env:ENDPOINT}'
  datadog/exporter:
    api:
      key: ${env:DD_API_KEY}
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1000

connectors:
  datadog/connector:

extensions:
  health_check:
    endpoint: '0.0.0.0:13133'
  pprof:
  zpages:

service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - periodic:
            interval: 60000
            exporter:
              otlp:
                protocol: grpc
                endpoint: collector:4317
  extensions:
    - health_check
    - pprof
    - zpages
  pipelines:
    metrics:
      receivers:
        - datadog/connector
        - otlp
      processors:
        - batch
      exporters:
        - datadog/exporter
    traces:
      receivers:
        - otlp
      processors:
        - filter
        - resource
        - transform
        - batch
      exporters:
        - datadog/exporter
        - datadog/connector
    logs:
      receivers:
        - otlp
      processors:
        - resource
        - transform
        - batch
      exporters:
        - otlphttp

Log output

Additional context

srolel added the bug and needs triage labels on Mar 24, 2025
crobert-1 (Member) commented:

Hello @srolel, thanks for filing! Can you please share your collector config? You're welcome to remove sensitive data, but as much information as possible would be helpful.

(With so many components and no specific configuration on each, it would be hard to make progress debugging this.)


srolel commented Mar 24, 2025

@crobert-1 Thanks, got it. I've added a version of our config.


Pinging code owners for exporter/datadogexporter: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr @ankitpatel96 @jade-guiton-dd @IbraheemA. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.


Pinging code owners for connector/datadogconnector: @mx-psi @dineshg13 @ankitpatel96 @jade-guiton-dd @IbraheemA. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.


Pinging code owners for processor/transformprocessor: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley @edmocosta. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.

crobert-1 (Member) commented:

I'm adding more labels than likely necessary in an effort to try to engage code owners and determine which component is responsible here.


songy23 commented Mar 26, 2025

Hey @srolel, with so many components running here it's hard to tell which one is contributing to the problem. Could you get a profile while running v0.121.0 to see the memory breakdown?


truthbk commented Mar 26, 2025

Since the pprof extension is enabled, it might also be worth collecting a goroutine dump to check whether it's a goroutine leak. It looks like the user is not overriding the pprof endpoint, so that would look like this:

curl -X GET http://localhost:1777/debug/pprof/goroutine?debug=2 -o goroutine.dump
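
(For reference, the same pprof endpoint also serves the standard Go net/http/pprof heap and CPU profiles, which can help distinguish heap growth from a goroutine leak. A minimal sketch, assuming the default 1777 port; the output file names are arbitrary:)

# In-use heap profile
curl -X GET "http://localhost:1777/debug/pprof/heap" -o heap.pprof

# 30-second CPU profile
curl -X GET "http://localhost:1777/debug/pprof/profile?seconds=30" -o cpu.pprof

# Inspect locally with the Go toolchain
go tool pprof -top heap.pprof
go tool pprof -top cpu.pprof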


srolel commented Mar 27, 2025

This is what I get from top in pprof:

Image

and here's a dump with curl -X GET http://localhost:1777/debug/pprof/goroutine?debug=2 -o goroutine.dump:

goroutine.dump.txt

pguinard-public-com commented:

We run a similar setup and have a working binary that we build locally with the Dockerfile below. We only use this for debugging; reproducibility between the provided image and the locally built image is what we're after.

FROM alpine:latest@sha256:a8560b36e8b8210634f77d9f7f9efd7ffa463e380b75e2e74aff4511df3ef88c AS prep
RUN apk --update add ca-certificates

FROM golang:1.24.2-bookworm AS gobuilder
RUN go install go.opentelemetry.io/collector/cmd/builder@<version>
RUN git clone https://github.com/open-telemetry/opentelemetry-collector-contrib.git /otelcol
RUN mkdir /otelcol/cmd/otelcontribcol/cmd
WORKDIR /otelcol/cmd/otelcontribcol
RUN git checkout v0.120.0
COPY builder-config.yaml /otelcol/cmd/otelcontribcol/builder-config.yaml
RUN /go/bin/builder --config builder-config.yaml

#FROM scratch
FROM alpine:latest

RUN apk add --no-cache gcompat

ARG USER_UID=10001
ARG USER_GID=10001
USER ${USER_UID}:${USER_GID}

COPY --from=prep /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
COPY --from=gobuilder /otelcol/cmd/otelcontribcol/cmd/otelcontribcol/otelcontribcol /otelcontribcol
EXPOSE 4317 55680 55679
ENTRYPOINT ["/otelcontribcol"]
CMD ["--config", "/etc/otel-contrib/config.yaml"]

# OUR INTERNAL CONFIG

ENV DECISION_WAIT=30
ENV QUEUE_SAMPLING_PERCENTAGE=1
ENV NUM_TRACES=50000
ENV TRACE_EXPORTERS=[datadog/exporter]
ENV METRICS_EXPORTERS=[datadog/exporter]
ENV METRICS_SCRAPE_INTERVAL=30
# Add prometheus to receive internal metrics
ENV METRICS_RECEIVERS=[otlp]
# Allow localstack in docker run, HOME=/ should only needed for testing
ENV HOME=/

ADD config.yaml /etc/otel-contrib/config.yaml
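
(A minimal build-and-run sketch for the Dockerfile above, assuming builder-config.yaml and config.yaml sit next to it in the build context; the image tag is arbitrary and the config is baked in at build time via the ADD line:)

docker build -t otelcontribcol-local:v0.120.0 .
docker run --rm -p 4317:4317 -p 4318:4318 otelcontribcol-local:v0.120.0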

Here's a builder-config.yaml that works with no memory leaks:


dist:
  module: github.com/open-telemetry/opentelemetry-collector-contrib/cmd/otelcontribcol
  name: otelcontribcol
  description: Local OpenTelemetry Collector Contrib binary, testing only.
  version: 0.124.1-dev
  output_path: ./cmd/otelcontribcol

extensions:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckextension v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckv2extension v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/pprofextension v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage v0.120.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/debugexporter v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter v0.120.0
processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/attributesprocessor v0.121.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/filterprocessor v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.120.0
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.120.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/simpleprometheusreceiver v0.120.0
connectors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/datadogconnector v0.120.1
providers:
  - gomod: go.opentelemetry.io/collector/confmap/provider/envprovider v1.30.0
  - gomod: go.opentelemetry.io/collector/confmap/provider/fileprovider v1.30.0

Replacing the following module makes the memory leak appear, so the regression is between v0.120.1 and v0.121.0:

-  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/datadogconnector v0.120.1
+  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/datadogconnector v0.121.0
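
(In other words, bumping only the connector module in builder-config.yaml and rebuilding flips the behaviour between healthy and leaking. A minimal sketch of that rebuild step, assuming the builder path and working directory from the Dockerfile above:)

# Bump only the Datadog connector module, then rebuild and rerun
sed -i 's#connector/datadogconnector v0.120.1#connector/datadogconnector v0.121.0#' builder-config.yaml
/go/bin/builder --config builder-config.yaml
./cmd/otelcontribcol/otelcontribcol --config /etc/otel-contrib/config.yaml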

TylerHelmuth added the priority:p1 label and removed the exporter/datadog, processor/transform, and needs triage labels on Apr 21, 2025

zarbis commented May 20, 2025

I'll chime in with additional info. I also faced this issue when upgrading from 0.117.0 to 0.126.0, and then narrowed the regression down with a binary search to the same version, 0.121.0, as the OP.

Some additional info:

  • I run logs, metrics, and traces collectors as separate otelcols via the otel-operator
  • This issue is present ONLY in the metrics collector
  • I don't run the Datadog connector

Here is the CPU profile:
Image

Memory:
Image

Goroutines:

Image

Here is my config:

config:
  connectors:
    forward/all: {}
  exporters:
    debug/detailed:
      sampling_initial: 5
      sampling_thereafter: 100
      verbosity: detailed
    debug/normal:
      verbosity: normal
    otlphttp:
      auth:
        authenticator: basicauth/client
      compression: gzip
      endpoint: <REDACTED>
  extensions:
    basicauth/client:
      client_auth:
        password: ${OTEL_GW_PASSWORD}
        username: ${OTEL_GW_USERNAME}
    health_check:
      endpoint: 0.0.0.0:13133
      path: /
    pprof:
      endpoint: 0.0.0.0:1777
  processors:
    attributes:
      actions:
      - action: upsert
        key: cluster
        value: <REDACTED>
      - action: upsert
        key: alerting_tier
        value: "1"
    batch:
      send_batch_max_size: 5000
      send_batch_size: 2000
    memory_limiter:
      check_interval: 1s
      limit_percentage: 80
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    prometheus:
      config:
        scrape_configs:
        - job_name: dummy
          static_configs:
          - targets:
            - 127.0.0.1:8888
  service:
    extensions:
    - basicauth/client
    - health_check
    - pprof
    pipelines:
      metrics/forward:
        receivers:
        - otlp
        - prometheus
        processors:
        - memory_limiter
        - attributes
        exporters:
        - forward/all
      metrics/otlp:
        receivers:
        - forward/all
        processors:
        - batch
        exporters:
        - otlphttp


gracewehner commented May 22, 2025

We are also seeing very high CPU on some clusters with the prometheusreceiver and otlpexporter after upgrading to 0.121.0. It only repros on some clusters, though, and we are currently trying to narrow down what the difference is.

mx-psi changed the title from "Memory leak and increased CPU usage in v0.121.0" to "[connector/datadog] Memory leak and increased CPU usage in v0.121.0" on May 23, 2025

mx-psi commented May 23, 2025

@zarbis @gracewehner Let's keep this issue for Datadog connector related issues. If you are not using the Datadog connector on your Collector or your profiles point to somewhere else, please file a separate issue. Thanks!


mx-psi commented May 23, 2025

@srolel I took a look at the info you provided. First, I want to thank you for providing the details regarding your issue, and I also want to apologize for taking so long to reply. To continue investigating this issue, I would benefit from having:

If these contain any confidential information please file a ticket on https://www.datadoghq.com/support/, otherwise we can continue here. Thanks!
