[exporterhelper persistent-queue] Not working as expected #12711
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
@julianocosta89 I think the issue you want to look at is #4646. While your concern is valid, and this is an acknowledged problem in otel collector core, the storage extension itself cannot do anything about it. The storage extension is simply a way for components to store data in some persistent storage medium. It is used by the persistent queue in exporterhelper, but also by receivers which need to persist their state, like filelog receiver or the sqlquery receiver. If this is not apparent from the README, I'd be happy to hear suggestions on how we can improve this. When it comes to documenting how the storage extension and the persistent queue in exporterhelper fit together, I think the best place for that might be https://github.com/open-telemetry/opentelemetry.io/. |
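A minimal sketch of how the storage extension and the persistent queue are wired together, for illustration only. The directory path and the backend endpoint are placeholders (assumptions, not taken from this thread); the directory must exist and be writable by the Collector.

```yaml
extensions:
  # The storage extension only provides a place to persist data.
  file_storage:
    directory: /var/lib/otelcol/file_storage  # placeholder path

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder endpoint
    sending_queue:
      enabled: true
      # Refers to the extension by its component ID; this is what
      # turns the in-memory queue into a persistent one.
      storage: file_storage

service:
  # The extension has to be enabled here as well as referenced above.
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

Note that the queue lives in the exporter, so only data that has reached the exporter is persisted. Anything still held by an earlier component in the pipeline is not covered.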
/label -bug -needs-triage |
Thanks for the reply and pointing me to the issue @swiatekm! I think the main issue is with this sentence:
Can this be achieved somehow? What I'm struggling with is that, if I understand the issue correctly, I'd need to define 2 pipelines: 1 that stores the data right away, and another one that has the Would you have any example of this? |
The root cause of the issue is actually the batch processor. It is asynchronous, but doesn't have persistent state and doesn't conduct backpressure effectively. In general, it's not a problem for the collector to not persist data it receives as long as it didn't acknowledge receipt to the upstream provider. The problem with the batch processor is that it acknowledges receipt, but stores the data in memory for some unknown amount of time, during which the collector may be killed. The general solution to this problem otel core decided on is to move batching to exporterhelper. You can track that work in #8122. As of right now, exporters need to individually opt into the new configuration. This is not always documented in READMEs, and is currently experimental. See the otlp exporter for example. exporterhelper is jointly owned by the otel core maintainers. @dmitryax is the person leading the effort to fix batching, and he should be able to give you more up-to-date answers on this effort. |
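As a rough sketch of what moving batching into the exporter can look like: the fragment below drops the batch processor from the pipeline and enables batching on the otlp exporter instead, next to a persistent queue. It assumes the experimental `batcher` settings that the otlp exporter exposed around the time of this thread. The key names have changed across Collector releases, so check the exporterhelper README for the version you run. The path, endpoint, and timeout value are placeholders.

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage  # placeholder path

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder endpoint
    sending_queue:
      enabled: true
      storage: file_storage
    # Experimental exporter-side batching (assumed key names, see above).
    batcher:
      enabled: true
      flush_timeout: 200ms  # illustrative value

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      # No batch processor here: data goes from the receiver straight
      # to the exporter's queue instead of waiting in processor memory.
      exporters: [otlp]
```

The intent is that receipt is only acknowledged upstream once the data has been handed to the exporter queue, rather than while it is still sitting in an in-memory batch.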
Shouldn't / can't the batching flush data when it receives a graceful termination signal? This obviously won't help with e.g. an OOM kill, but it should handle relatively rare (in terms of operating time) events like being moved to another node pretty well by just sending out smaller batches. IIRC that's what SDKs do when they are shut down. |
@adriangb that would help, but as you said, it wouldn't cover the case where the crash is unexpected. |
Agreed! But solving unexpected crashes is much more difficult: you need to never have data that is stored only in memory, which has tradeoffs. Graceful shutdowns should be much more common, so there is great value in addressing them even if it doesn't help with unexpected shutdowns. |
@adriangb the exporter batching is in good shape now. I'd suggest trying it and giving us feedback if possible. Instead of using the
|
Hey @dmitryax, sorry for taking so long to get back to this. I've tested with
With I have a sample that can be used to reproduce. Clone this repo: https://github.com/julianocosta89/otel-lab Navigate to This is currently configured with Stop the containers ( When you start the Collector (after the script kills it), it should crash with the error message above. |
@mx-psi do you think this one deserves the |
Component(s)
extension/storage/filestorage
What happened?
Description
I've been trying to use the storage extension and I'm having some trouble understanding it. The README is a bit confusing and the example is super basic.
When using the storage extension with sending_queue.storage.file_storage I can see that my traces, metrics and logs are being saved in the files exporter_otlp_<signal>, but that only happens when the data is exported.
Imagine the following scenario:
In that case, the data would be lost, and even when the Collector is restarted it wouldn't be aware of the missing data.
Steps to Reproduce
I'll paste a simple app I'm using below, but this can be reproduced by using the Collector Config in here: https://github.com/julianocosta89/otel-lab/blob/main/q-and-a/otel-collector/otelcol-config.yaml
If you would like to test in the app, here it is:
After everything is up and running you can run the script ./requests.sh to send 100 requests to the weather service.
Once the script is finished, you can kill the Collector, and when starting it again nothing happens.
Expected Result
I'd expect the received Traces, Metrics and Logs to be stored in the filesystem as soon as the Collector receives them, and if something happens before the data is exported, I'd expect the Collector to retry sending the data until it succeeds.
I have this expectation due to this diagram here: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/exporterhelper#persistent-queue
Actual Result
If the Collector dies for some reason before exporting the data, the data is lost, and even when the Collector is restarted it isn't aware of the missing data.
Collector version
Contrib 0.119.0
Environment information
OS: macOS Sequoia 15.3 (24D60)
But I'm running everything with Docker.
OpenTelemetry Collector configuration
Additional context
It would be great if we could have more detailed documentation on how to use the components. In the main docs (opentelemetry.io) we only mention what extensions are, but there is not a single example there.
In the READMEs we have examples, but without data actually being sent anywhere, or even worse, with nop components.