[connector/datadog] Memory leak and increased CPU usage in v0.121.0 #38904
Comments
Hello @srolel, thanks for filing! Can you please share your collector config? You're welcome to remove sensitive data, but as much information as possible would be helpful. (With so many components and no specific configuration on each, it would be hard to make progress debugging this.)
@crobert-1 thanks, got it. Added a version of our config.
Pinging code owners for exporter/datadogexporter: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr @ankitpatel96 @jade-guiton-dd @IbraheemA. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
Pinging code owners for connector/datadogconnector: @mx-psi @dineshg13 @ankitpatel96 @jade-guiton-dd @IbraheemA. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
Pinging code owners for processor/transformprocessor: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley @edmocosta. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
I'm adding more labels than likely necessary in an effort to engage code owners and determine which component is responsible here.
Hey @srolel, with so many components running here it is hard to tell which one is contributing to the problem. Could you get a profile when running v0.121.0 to see the memory breakdown?
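If it helps, here is a minimal sketch of how a heap or goroutine profile can be exposed with the contrib pprof extension; this is a fragment to merge into an existing config, and the bind address shown is illustrative:

extensions:
  pprof:
    # Once enabled, the standard Go pprof endpoints are served here,
    # e.g. `go tool pprof http://127.0.0.1:1777/debug/pprof/heap`
    # or /debug/pprof/goroutine for a goroutine dump.
    endpoint: 127.0.0.1:1777

service:
  extensions:
    - pprof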
Since the
We run a similar setup and have gotten a working binary that we're building locally with this
Here's a
Replacing the following plugins causes the memory leak to happen; the regression is between v0.120.1 and v0.121.0:
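For anyone attempting the same bisection, a rough sketch of an OpenTelemetry Collector Builder (ocb) manifest that pins the Datadog components to v0.120.1 while keeping the rest of the build at v0.121.0 is shown below. The module paths are the upstream ones, but the component set and pinned versions here are illustrative, and mixing versions like this may not always resolve cleanly:

dist:
  name: otelcol-custom
  output_path: ./otelcol-custom

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.121.0

connectors:
  # Pinned to the last version without the reported regression.
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/datadogconnector v0.120.1

exporters:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter v0.120.1
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.121.0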
I will chime in with additional info. I've also faced this issue while initially upgrading from
Some additional info:
Goroutines:
Here is my config:
config:
  connectors:
    forward/all: {}
  exporters:
    debug/detailed:
      sampling_initial: 5
      sampling_thereafter: 100
      verbosity: detailed
    debug/normal:
      verbosity: normal
    otlphttp:
      auth:
        authenticator: basicauth/client
      compression: gzip
      endpoint: <REDACTED>
  extensions:
    basicauth/client:
      client_auth:
        password: ${OTEL_GW_PASSWORD}
        username: ${OTEL_GW_USERNAME}
    health_check:
      endpoint: 0.0.0.0:13133
      path: /
    pprof:
      endpoint: 0.0.0.0:1777
  processors:
    attributes:
      actions:
        - action: upsert
          key: cluster
          value: <REDACTED>
        - action: upsert
          key: alerting_tier
          value: "1"
    batch:
      send_batch_max_size: 5000
      send_batch_size: 2000
    memory_limiter:
      check_interval: 1s
      limit_percentage: 80
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    prometheus:
      config:
        scrape_configs:
          - job_name: dummy
            static_configs:
              - targets:
                  - 127.0.0.1:8888
  service:
    extensions:
      - basicauth/client
      - health_check
      - pprof
    pipelines:
      metrics/forward:
        receivers:
          - otlp
          - prometheus
        processors:
          - memory_limiter
          - attributes
        exporters:
          - forward/all
      metrics/otlp:
        receivers:
          - forward/all
        processors:
          - batch
        exporters:
          - otlphttp
We are also seeing very high CPU on some clusters with the prometheusreceiver and otlpexporter after upgrading to v0.121.0. It only reproduces on some clusters, though, and we are currently trying to narrow down what the difference is.
@zarbis @gracewehner Let's keep this issue for Datadog connector-related issues. If you are not using the Datadog connector on your Collector, or your profiles point somewhere else, please file a separate issue. Thanks!
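For context, the component this issue is scoped to is the Datadog connector, which is typically wired between a traces pipeline and a metrics pipeline roughly as sketched below (the receiver choice and the API key placeholder are illustrative):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  # Computes trace/APM stats and re-emits them as metrics.
  datadog/connector:

exporters:
  datadog:
    api:
      key: ${DD_API_KEY}

service:
  pipelines:
    traces:
      receivers:
        - otlp
      exporters:
        - datadog/connector
        - datadog
    metrics:
      receivers:
        - datadog/connector
      exporters:
        - datadog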
@srolel I took a look at the info you provided. First, I wanted to thank you for providing the details regarding your issue and to apologize for taking so long to reply. To continue investigating this issue I would benefit from having:
If these contain any confidential information, please file a ticket on https://www.datadoghq.com/support/; otherwise we can continue here. Thanks!
Component(s)
No response
What happened?
Description
Hi, we've noticed a slow-ish memory leak (over several days) and increased CPU usage after upgrading from v0.120.0 to v0.121.0 with no other changes.
(The beginning of the ramp-up in memory is the v0.121.0 deployment, followed by a rollback.)
Steps to Reproduce
Expected Result
Memory and CPU usage are unchanged from v0.120.0.
Actual Result
A memory leak and increased CPU usage are observed.
Collector version
v0.121.0
Environment information
Environment
OS: Docker image on ECS. The collector runs as a gateway for other collectors.
OpenTelemetry Collector configuration
Log output
Additional context