Skip to content

[otap-dataflow] parquet exporter gracefully handle errors #504

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
albertlockett opened this issue May 30, 2025 · 1 comment
Open

[otap-dataflow] parquet exporter gracefully handle errors #504

albertlockett opened this issue May 30, 2025 · 1 comment
Labels
enhancement New feature or request parquet-exporter Parquet Exporter related tasks pipeline Rust Pipeline Related Tasks rust Pull requests that update Rust code

Comments

@albertlockett
Copy link
Member

There are a few types of errors we may encounter in the parquet exporter:

Non-retryable errors:

  • e.g. some message has malformed record batches

Retryable errors:

  • e.g. temporary network failures or service interruptions make it so we can't write parquet to remote object storage

Misconfigurations:

  • e.g. the bucket/storage location does not exist or can not be accessed

In each case, we should have a strategy for handling these.

For example, we may wish to emit a Nack control message which could suggest to the Retry processor whether it should retry the batch.

For the parquet exporter in-particular, careful attention will need to be payed to how we handle retried messages. For example, if we have written Logs to some file successfully, but LogAttrs for this batch fails for some retryable reason, we'll need to ensure that we:

  • don't flush the Logs file until LogAttrs is finally retried, written and flushed
  • don't rewrite the Logs message
    In this case, it might make the retry handling simpler if we don't even write the Logs message until we've sucessfully written LogsAttr. We can consider this more carefully when implementing this issue.

Prerequisites:

@albertlockett albertlockett added enhancement New feature or request pipeline Rust Pipeline Related Tasks rust Pull requests that update Rust code parquet-exporter Parquet Exporter related tasks labels May 30, 2025
@lquerel
Copy link
Contributor

lquerel commented May 30, 2025

Added more details on the Retry Processor in #509, and on the Failover Processor as well #510 (sub-issue of #368)

jmacd added a commit that referenced this issue Jun 2, 2025
Basic implementation of Parquet Exporter.

What's included in this PR:
- Base parquet exporter
- Generates IDs that are unique within some randomly generated partition
- Basic functionality to identify partition from schema metadata
- Writer module with:
  - Support for hive-style partitioning
  - Avoids flushing parent records before children
---

This is still very much Work In Progress, but I wanted to get the PR up
for early review before it got too long.

Some things still TODO before marking this "Ready":
- ~Handling various types of Control messages.~
- ~Need to review with @lquerel to better understand what is expected
from the exporter implementation~
- After discussion w/ Laurent on 05/29, decided to open TODOs for many
of these and link issues in comments
- ~Various unit tests are not finished / implemented~
- ~Another pass of documentation for adding clarity + correcting any
spelling/grammar errors~
- ~Fixing many clippy / formatting issues~
- ~Better resilience in computing the unique IDs. Currently this assumes
the IDs are not delta encoded. It might wish to error if the IDs are not
in plain encoding, and we might require an upstream processor to remove
delta encoding.~
(#512 related helper
that could simplify code in future)
---

Plan for followups after this PR:
- Immediate:
- Add support for other signals types. Currently only logs is supported.
Need to add support for metrics & traces. Currently there aren't any
tests for traces / metrics, and the code in ID generation & writer needs
to be adapted
  #503

  
- Near-term:
  - Better error handling (including retrying failed batches)
    #504
  - Add support for remote object stores (S3, GCS, Azure) 
  #501
- Add support for date-bucket partitioning (I had started implementing
that in this PR, and took it out to simplify review). It might be best
to just do this partitioning in a processor (and include benchmarks). I
have some additional thoughts on this ~I'll document in a separate
issue~
#496 (issue created)
- Add options to control the parquet writer to the base config (e.g.
controlling row group size)
  #502
  
- Future:
- we could consider adding a `parquet` feature to the crate to avoid
bringing in parquet & object_store dependencies unless this feature is
enabled
  
I'll document some of these TODOs for posterity in separate github
issues.
  
---
A few notes on the unique ID generation:

We wanted to have a scheme where we are able to generate unique IDs for
some given batch without any coordination between multiple nodes (nodes
in this case being exporters on other threads, or other processes).

The approach this PR takes is to keep have the exporter generate IDs
that are unique within some partition, and write a partitioned parquet
file using hive-style partitioning. Management of the partition is the
sole responsibility of this instance of the exporter. It does this by
randomly generating a partition ID, and then incrementing the IDs of
each subsequent batch by the maximum ID of the previous batch. When the
IDs overflow, it starts a new partition.

There are some trade-offs to this approach:
- replaying the data means that the data will get written twice (it's
not idempotent)
- the reader needs to be aware of this partitioning scheme

I'm sure there are other better ID Generation schemes we could explore
in the future, but seemed like a simple way to get something basic
working.

---

part of #399

---------

Co-authored-by: Joshua MacDonald <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request parquet-exporter Parquet Exporter related tasks pipeline Rust Pipeline Related Tasks rust Pull requests that update Rust code
Projects
None yet
Development

No branches or pull requests

2 participants