Throttle exporting from persistence queue to reduce memory consumption #11018

Open
Nav-Kpbor opened this issue Aug 29, 2024 · 2 comments

Is your feature request related to a problem? Please describe.
My team and I have encountered an issue where our collectors use a large amount of memory when re-ingesting telemetry from a file storage queue after a disruption to our backend connection. In these tests we simulated an hour of connection failures to the backend to let the file storage queue grow. After the hour passed, we restored the connection and saw a spike in both exported telemetry and memory usage.
[Screenshots: memory usage and exported telemetry spiking after the connection is restored]
Here is an example of the behavior we see from the persistent sending queue during the test period. Notice how the sending queue drops to zero immediately after reconnecting to the backend.
[Screenshot: persistent sending queue size during the test period]
It seems that on reconnecting to the backend, everything in the file storage queue gets pulled into memory at once. We are hoping to control this spike so we can ensure memory won't exceed a certain threshold when running on Windows VMs.

Describe the solution you'd like
Is there a feature we could add that throttles how quickly the consumers pull from the file storage queue and send to the backend endpoint? For example, something that lets us configure how many batches are pulled from the queue over a specified time frame?
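Something like the following is the kind of control we have in mind (purely illustrative; the read_rate keys below do not exist in the collector today):

exporters:
  otlp:
    sending_queue:
      storage: file_storage/backup
      num_consumers: 10
      # hypothetical setting: drain at most this many batches from the
      # persistent queue per interval when catching up after an outage
      read_rate:
        batches: 10
        interval: 1s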

Describe alternatives you've considered
We have tried the memory_limiter processor and the GOMEMLIMIT environment variable, but neither has been successful. My guess is that the garbage collector can't reclaim the memory because the telemetry is still being actively sent. We have also tried reducing the number of consumers and the batch size, but we are still seeing the spikes.

Additional context
Collector version: v0.99 (contrib)
Tested on Windows Server 2016

Here is the config we used for testing, in case there are any changes we could make to improve memory usage with the current version of the collector.

extensions:
  health_check:
    endpoint: localhost:4313
  file_storage/backup:
    directory: {Directory of Collector on the machine}
    compaction:
      on_rebound: true
      directory: {Directory of Collector on the machine}
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10
      check_interval: 5s

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
  batch:
    send_batch_size: 8192
    send_batch_max_size: 8192
    timeout: 10s

exporters:   
  otlp:
    endpoint: "http://{IP of backend server}:4317"
    retry_on_failure:
      max_elapsed_time: 0
    sending_queue:
      queue_size: 1000
      storage: file_storage/backup
      num_consumers: 10
    tls:
      insecure: true
    

service:
  extensions: [health_check, file_storage/backup]
  telemetry:
    metrics:
      address: "localhost:4315"
      
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
jmacd (Contributor) commented Sep 23, 2024

The num_consumers setting in the persistent queue is capable of throttling the export path; consider lowering it to 1 and working back up if the recovery is too slow.
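Applied to the config above, only the relevant keys change:

exporters:
  otlp:
    sending_queue:
      storage: file_storage/backup
      num_consumers: 1   # start at 1; raise it if draining the backlog is too slow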

mattsains (Contributor) commented Apr 29, 2025

I think jmacd's suggestion is a good workaround, but this feature request is still valid and relevant to the current improvements happening in exporterhelper. It also fits nicely with the issue about creating rate/resource limiter extensions: #12603. It seems to me that either the persistent queue or the consumers in the exporter could be configured to interface with one of these limiters, in the same way receivers would.
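As a purely hypothetical sketch of how that could surface in configuration (the extension name and the limiter key below are placeholders for the idea in #12603; neither exists today):

extensions:
  ratelimiter:                 # hypothetical limiter extension from #12603
    rate: 10                   # e.g. batches per second, illustrative only

exporters:
  otlp:
    sending_queue:
      storage: file_storage/backup
      limiter: ratelimiter     # hypothetical hook from the persistent queue to the limiter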
