[exporterhelper persistent-queue] Not working as expected #12711

Open
julianocosta89 opened this issue Feb 11, 2025 · 11 comments
Labels
bug Something isn't working

Comments

@julianocosta89
Member

julianocosta89 commented Feb 11, 2025

Component(s)

extension/storage/filestorage

What happened?

Description

I've been trying to use the storage extension and I'm having some trouble understanding it. The README is a bit confusing and the example is super basic.

When using the storage extension with sending_queue.storage: file_storage I can see that my traces, metrics and logs are being saved in the files exporter_otlp_<signal>, but that only happens when the data is exported.

Imagine the following scenario:

  • Collector receives the data.
  • The Batch processor is configured to wait for 30s to send the data.
  • The Collector dies for some reason before exporting the data.

In that case, the data would be lost, and even when the Collector is restarted it wouldn't be aware of the missing data.
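
For reference, these are the relevant pieces of my configuration (the full config is further down):

processors:
  batch:
    timeout: 30s

exporters:
  otlp:
    sending_queue:
      storage: file_storage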

Steps to Reproduce

I'll paste a simple app I'm using below, but this can also be reproduced by using the Collector config here: https://github.com/julianocosta89/otel-lab/blob/main/q-and-a/otel-collector/otelcol-config.yaml

If you would like to test in the app, here it is:

git clone https://github.com/julianocosta89/otel-lab
cd otel-lab/q-and-a

docker compose up -d

After everything is up and running you can run the script ./requests.sh to send 100 requests to the weather service.

Once the script has finished, you can kill the Collector; when you start it again, nothing happens (none of the previously received data is exported).

docker kill q-and-a-otel-collector-1

docker compose up -d

Expected Result

I'd expect the received Traces, Metrics and Logs to be stored in the filesystem as soon as the Collector receives them, and if something happens before the data is exported, I'd expect the Collector to retry sending the data until it succeeds.

I have this expectation due to this diagram here: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/exporterhelper#persistent-queue

Actual Result

If the Collector dies for some reason before exporting the data, the data is lost, and even when the Collector is restarted it isn't aware of the missing data.

Collector version

Contrib 0.119.0

Environment information

OS: macOS Sequoia 15.3 (24D60), but I'm running everything with Docker.

OpenTelemetry Collector configuration

extensions:
  file_storage: {}

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:OTEL_COLLECTOR_HOST}:${env:OTEL_COLLECTOR_PORT_GRPC}
      http:
        endpoint: ${env:OTEL_COLLECTOR_HOST}:${env:OTEL_COLLECTOR_PORT_HTTP}
  postgresql:
    endpoint: ${env:POSTGRES_HOST}:${env:POSTGRES_PORT}
    username: ${env:POSTGRES_USER}
    password: ${env:POSTGRES_PWD}
    databases:
      - ${env:POSTGRES_DB}
    tls:
      insecure: true

processors:
  batch:
    timeout: 30s
  transform:
    error_mode: ignore
    metric_statements:
      - context: scope
        conditions:
          - name == "github.com/open-telemetry/opentelemetry-collector-contrib/receiver/postgresqlreceiver"
        statements:
          - set(resource.attributes["service.name"], "${env:POSTGRES_DB}")

connectors:
  spanmetrics: {}

exporters:
  debug: {}
  otlp:
    endpoint: "aspire-dashboard:18889"
    tls:
      insecure: true
    sending_queue:
      storage: file_storage
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
    sending_queue:
      storage: file_storage

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, otlp, otlp/jaeger, spanmetrics]
    metrics:
      receivers: [otlp, postgresql, spanmetrics]
      processors: [transform, batch]
      exporters: [debug, otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, otlp]

Additional context

It would be great if we could have more detailed documentation on how to use the components. In the main docs (opentelemetry.io) we only mention what extensions are, but there is not a single example there.
In the READMEs we have examples, but without data actually being sent anywhere, or, even worse, with nop components.

@julianocosta89 julianocosta89 added the bug Something isn't working label Feb 11, 2025
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@swiatekm
Contributor

@julianocosta89 I think the issue you want to look at is #4646. While your concern is valid, and this is an acknowledged problem in otel collector core, the storage extension itself cannot do anything about it.

The storage extension is simply a way for components to store data in some persistent storage medium. It is used by the persistent queue in exporterhelper, but also by receivers which need to persist their state, like filelog receiver or the sqlquery receiver. If this is not apparent from the README, I'd be happy to hear suggestions on how we can improve this.
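
For example, a single file_storage extension can back both a receiver's checkpoints and an exporter's persistent queue. A rough sketch (the storage directory, log path and endpoint are just placeholders):

extensions:
  file_storage:
    directory: /var/lib/otelcol/storage

receivers:
  filelog:
    include: [/var/log/app/*.log]
    storage: file_storage    # filelog keeps its file offsets here across restarts

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      storage: file_storage  # the persistent queue keeps queued batches here

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]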

When it comes to documenting how the storage extension and the persistent queue in exporterhelper fit together, I think the best place for that might be https://github.com/open-telemetry/opentelemetry.io/.

@swiatekm
Contributor

/label -bug -needs-triage

@github-actions github-actions bot removed the bug Something isn't working label Feb 11, 2025
@julianocosta89
Member Author

Thanks for the reply and pointing me to the issue @swiatekm!

I think the main issue is with this sentence:

When persistent queue is enabled, the batches are being buffered using the provided storage extension - filestorage is a popular and safe choice. If the collector instance is killed while having some items in the persistent queue, on restart the items will be picked and the exporting is continued.

Can this be achieved somehow?
If yes, I'd love to know how, and I'm happy to contribute documenting it.

What I'm struggling with is that, if I understand the issue correctly, I'd need to define 2 pipelines: 1 that stores the data right away, and another one that has the batch processor.
I haven't tried it, but what confuses me is that there is no filestorage receiver that would read from the file and check whether all persisted data was sent.

Would you have any example of this?
I don't think we have an owner for the exporter helper, do we?

@swiatekm
Contributor

swiatekm commented Feb 11, 2025

The root cause of the issue is actually the batch processor. It is asynchronous, but it doesn't have persistent state and doesn't apply backpressure effectively. In general, it's not a problem for the collector not to persist the data it receives, as long as it hasn't acknowledged receipt to the upstream provider. The problem with the batch processor is that it acknowledges receipt but keeps the data in memory for some unknown amount of time, during which the collector may be killed.

The general solution to this problem otel core decided on is to move batching to exporterhelper. You can track that work in #8122. As of right now, exporters need to individually opt into the new configuration. This is not always documented in READMEs, and is currently experimental. See the otlp exporter for example.

exporterhelper is jointly owned by the otel core maintainers. @dmitryax is the person leading the effort to fix batching, and he should be able to give you more up-to-date answers on this effort.

@adriangb

Shouldn't / can't the batching flush data when it receives a graceful termination signal? This obviously won't help with e.g. an OOM kill, but it should handle relatively rare (in terms of operating time) things like being moved to another node pretty well by just sending out smaller batches. IIRC that's what SDKs do when they are shut down.

@julianocosta89
Member Author

@adriangb that would help, but as you said, it wouldn't solve the case where the crash is unexpected.

@adriangb

adriangb commented Mar 3, 2025

Agreed! But solving unexpected crashes is much more difficult: you need to never have data that is stored only in memory, which has trade-offs. Graceful shutdowns should be much more common, so there is great value in addressing them even if it doesn't help with unexpected shutdowns.

@dmitryax
Member

dmitryax commented Mar 4, 2025

@adriangb the exporter batching is in good shape now. I'd suggest trying it and giving us feedback if possible. Instead of using the batch processor, add this section to the otlp exporter:

  batcher:
    enabled: true
    flush_timeout: 30s
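
In the config from this issue, that would look roughly like this (a sketch of the otlp exporter section; keep sending_queue.storage so the queue stays persistent, and drop batch from the pipelines' processors lists):

exporters:
  otlp:
    endpoint: "aspire-dashboard:18889"
    tls:
      insecure: true
    sending_queue:
      storage: file_storage
    batcher:
      enabled: true
      flush_timeout: 30s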

@julianocosta89
Member Author

Hey @dmitryax, sorry for taking so long to come back to this.
I've run some tests and I think we may have an issue with the last 2 releases of the Collector.

I've tested with 0.120.0 and 0.121.0 and I got this:

otel-collector-1  | 2025-03-17T13:41:00.279Z    info    [email protected]/service.go:193 Setting up own telemetry...
otel-collector-1  | 2025-03-17T13:41:00.279Z    info    builders/builders.go:26 Development component. May change in the future.        {"otelcol.component.id": "debug", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
otel-collector-1  | 2025-03-17T13:41:00.280Z    info    builders/builders.go:26 Unmaintained component. Actively looking for contributors. Component will become deprecated after 3 months of remaining unmaintained.      {"otelcol.component.id": "postgresql", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics"}
otel-collector-1  | 2025-03-17T13:41:00.280Z    info    [email protected]/connector.go:110  Building spanmetrics connector  {"otelcol.component.id": "spanmetrics", "otelcol.component.kind": "Connector", "otelcol.signal": "traces", "otelcol.signal.output": "metrics"}
otel-collector-1  | 2025-03-17T13:41:00.280Z    info    builders/builders.go:26 Development component. May change in the future.        {"otelcol.component.id": "debug", "otelcol.component.kind": "Exporter", "otelcol.signal": "traces"}
otel-collector-1  | 2025-03-17T13:41:00.280Z    info    builders/builders.go:26 Development component. May change in the future.        {"otelcol.component.id": "debug", "otelcol.component.kind": "Exporter", "otelcol.signal": "logs"}
otel-collector-1  | 2025-03-17T13:41:00.280Z    info    [email protected]/service.go:258 Starting otelcol-contrib...     {"Version": "0.121.0", "NumCPU": 10}
otel-collector-1  | 2025-03-17T13:41:00.280Z    info    extensions/extensions.go:40     Starting extensions...
otel-collector-1  | 2025-03-17T13:41:00.280Z    info    extensions/extensions.go:44     Extension is starting...        {"otelcol.component.id": "file_storage", "otelcol.component.kind": "Extension"}
otel-collector-1  | 2025-03-17T13:41:00.280Z    info    extensions/extensions.go:61     Extension started.      {"otelcol.component.id": "file_storage", "otelcol.component.kind": "Extension"}
otel-collector-1  | 2025-03-17T13:41:00.282Z    info    exporterqueue/persistent_queue.go:410   Fetching items left for dispatch by consumers   {"otelcol.component.id": "otlp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics", "numberOfItems": 3}
otel-collector-1  | 2025-03-17T13:41:00.282Z    info    exporterqueue/persistent_queue.go:452   Moved items for dispatching back to queue       {"otelcol.component.id": "otlp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics", "numberOfItems": 3}
otel-collector-1  | 2025-03-17T13:41:00.283Z    info    [email protected]/connector.go:204  Starting spanmetrics connector  {"otelcol.component.id": "spanmetrics", "otelcol.component.kind": "Connector", "otelcol.signal": "traces", "otelcol.signal.output": "metrics"}
otel-collector-1  | panic: runtime error: invalid memory address or nil pointer dereference
otel-collector-1  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xb9118]
otel-collector-1  | 
otel-collector-1  | goroutine 85 [running]:
otel-collector-1  | time.(*Timer).Reset(0x40013fdac0?, 0x0?)
otel-collector-1  |     time/sleep.go:172 +0x18
otel-collector-1  | go.opentelemetry.io/collector/exporter/exporterhelper/internal/batcher.(*defaultBatcher).resetTimer(...)
otel-collector-1  |     go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/batcher/default_batcher.go:60
otel-collector-1  | go.opentelemetry.io/collector/exporter/exporterhelper/internal/batcher.(*defaultBatcher).Consume(0x4000bf4280, {0xaf10130, 0x11d90780}, {0xaee6410?, 0x40013fdac0?}, {0xae63e00, 0x40013fdae0})
otel-collector-1  |     go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/batcher/default_batcher.go:93 +0x7c8
otel-collector-1  | go.opentelemetry.io/collector/exporter/exporterqueue.(*asyncQueue[...]).Start.func1()
otel-collector-1  |     go.opentelemetry.io/collector/[email protected]/exporterqueue/async_queue.go:47 +0x78
otel-collector-1  | created by go.opentelemetry.io/collector/exporter/exporterqueue.(*asyncQueue[...]).Start in goroutine 1
otel-collector-1  |     go.opentelemetry.io/collector/[email protected]/exporterqueue/async_queue.go:39 +0x88
otel-collector-1 exited with code 2

With 0.121.0 I got 1 out of 10 runs working; with 0.119.0 it always worked and I never got an exception.

I have a sample that can be used to reproduce.

Clone this repo: https://github.com/julianocosta89/otel-lab

Navigate to the q-and-a folder and run ./run_test.sh.

This is currently configured with otel/opentelemetry-collector-contrib:0.119.0 so everything will work.

Stop the containers (docker compose down) and change the otel/opentelemetry-collector-contrib image to 0.121.0.
Run the script again.

When you start the Collector (after the script kills it), it should crash with the error message above.

@mx-psi mx-psi transferred this issue from open-telemetry/opentelemetry-collector-contrib Mar 24, 2025
@julianocosta89
Member Author

@mx-psi do you think this one deserves the bug tag?
The error is reproducible, and the behaviour differs between 0.119.0 and newer versions.

@mx-psi mx-psi added the bug Something isn't working label Mar 28, 2025