Exporter batcher dynamic sharding of the partitions #12473
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@bogdandrutu lack of partitioning is holding us back from using the batcher, and I was anticipating the single-threaded issue, so I'd certainly welcome these capabilities. Do you have in mind that the batcher itself does the partitioning (#10825), or should that responsibility be further up the pipeline, before queuing? i.e. the alternative would be that each partition has its own (logical) queue for isolation, and behind that is a partition-specific instance of the batcher; each batcher would have its own sharder task. Related:
@axw thanks for the comment. For this specific issue I was mostly focused on the dynamic sharding and will try to keep it that way as much as possible. I do understand that for partitioning there are 2 use cases: 1. The downstream service is the same, so all partitions are part of the same failure domain; 2. The downstream service is different, so every partition is part of a different failure domain. Will discuss separately in the partitioning issue how to handle these 2 cases.
Thanks for putting together the proposal, @bogdandrutu. It makes sense. I believe the performance bottleneck can be addressed separately using dynamic sharding only, given that the partitioning can potentially be done in a different place (as mentioned in case 2 above).
I'm very excited about this development. I would like to test using the OTEL Collector as a batching mechanism for ingestion in a multi-tenant system. If it's able to do per-tenant batching and persist to disk so data is not lost because of upstream failures, it would be a perfect out-of-the-box system for this use case that a lot of telemetry systems have 😄. I think that's why these features are being developed, but since I wasn't 100% sure, I wanted to share the use case.
I'm super excited about this proposal. Thank you!
#### Description
This PR introduces two new components:
* `Partitioner` - an interface for fetching the batching key. A partitioner type should implement the function `GetKey()`, which returns the batching key. The `Partitioner` should be provided to the `queue_batcher` along with the `sizer` in `queue_batch::Settings`.
* `multi_batcher` - supports key-based batching by routing requests to the corresponding `shard_batcher`. Each `shard_batcher` corresponds to a shard described in #12473.

#### Link to tracking issue
#12795

Co-authored-by: Dmitry Anoshin <[email protected]>
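As a rough illustration, a minimal sketch of what such a `Partitioner` could look like in Go. The `GetKey` parameter list, the `request` type, and the `tenantPartitioner` example are assumptions for illustration only, not the actual `exporterhelper` API:

```go
package main

import (
	"context"
	"fmt"
)

// request stands in for the exporter's request type.
type request struct {
	tenant string
	items  int
}

// partitioner mirrors the Partitioner idea from the PR: GetKey returns the
// batching key used to route a request to its shard_batcher.
type partitioner interface {
	GetKey(ctx context.Context, req request) string
}

// tenantPartitioner keys requests by tenant so each tenant is batched separately.
type tenantPartitioner struct{}

func (tenantPartitioner) GetKey(_ context.Context, req request) string {
	return req.tenant
}

func main() {
	var p partitioner = tenantPartitioner{}
	fmt.Println(p.GetKey(context.Background(), request{tenant: "team-a", items: 10}))
}
```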
Component(s)
exporter/exporterhelper
Describe the issue you're reporting
Current Status
Currently the exporter batcher does not support "metadata" batching (which is supported by the batch processor), and it only has one active batch at any moment.
The current implementation, which supports only one active batch, has throughput issues: the collector cannot scale linearly, because batching a request requires an exclusive lock, so only one goroutine can "batch" at any point.
To solve the linear-scaling problem, we need to be able to consume multiple requests from the "queue" in parallel while constructing multiple batches at the same time. This naturally happens when using metadata batching, but even that will sometimes suffer from hot partitions that do not allow multiple updates at the same time.
In the current design, where we always have a queue (even for sync requests that wait for the response, there is still a logical queue), all implementations suffer from this problem, since batching is "single threaded".
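To make the bottleneck concrete, here is a minimal sketch (an assumption-laden toy, not the collector's actual code) of why giving each shard its own lock and pending batch removes the single-threaded constraint: consumers whose keys hash to different shards can batch concurrently instead of serializing on one mutex.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shard holds its own lock and its own in-progress batch.
type shard struct {
	mu      sync.Mutex
	pending []string
}

type shardedBatcher struct {
	shards []*shard
}

func newShardedBatcher(n int) *shardedBatcher {
	s := make([]*shard, n)
	for i := range s {
		s[i] = &shard{}
	}
	return &shardedBatcher{shards: s}
}

// add routes an item to a shard by hashing its partition key; only that
// shard's lock is taken, so other shards keep making progress in parallel.
func (b *shardedBatcher) add(key, item string) {
	h := fnv.New32a()
	h.Write([]byte(key))
	sh := b.shards[int(h.Sum32())%len(b.shards)]
	sh.mu.Lock()
	sh.pending = append(sh.pending, item)
	sh.mu.Unlock()
}

func main() {
	b := newShardedBatcher(4)
	b.add("tenant-a", "span-1")
	b.add("tenant-b", "span-2")
	fmt.Println("shards:", len(b.shards))
}
```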
Proposal
In this proposal, we use the following terminology:
Implementation details
In order to implement per-partition sharding, the queue batcher needs to keep some statistics:
There will be a "sharder" task that will support dynamic sharding using two actions:
The sharder task will run periodically (e.g. every minute) and, based on the number of "blocked" requests and the traffic pattern over the last N minutes per partition, will trigger split and/or merge requests for the affected partitions.
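A minimal sketch of such a periodic sharder, assuming hypothetical per-partition statistics (`blockedRequests`, `requestsLastMin`) and arbitrary thresholds; the real statistics and split/merge policy are whatever the final design specifies.

```go
package main

import "fmt"

// partitionStats is a hypothetical stand-in for the statistics the queue
// batcher would keep per partition.
type partitionStats struct {
	blockedRequests int
	requestsLastMin int
	shards          int
}

// rebalance applies a toy policy: split a partition when many consumers are
// blocked on it, merge shards back when its recent traffic is low.
func rebalance(stats map[string]*partitionStats) {
	for key, s := range stats {
		switch {
		case s.blockedRequests > 10:
			s.shards *= 2
			fmt.Printf("split partition %q to %d shards\n", key, s.shards)
		case s.requestsLastMin < 100 && s.shards > 1:
			s.shards /= 2
			fmt.Printf("merge partition %q to %d shards\n", key, s.shards)
		}
	}
}

func main() {
	stats := map[string]*partitionStats{
		"tenant-a": {blockedRequests: 25, requestsLastMin: 5000, shards: 1},
		"tenant-b": {blockedRequests: 0, requestsLastMin: 20, shards: 4},
	}
	// The real sharder task would run this periodically (e.g. every minute).
	rebalance(stats)
}
```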
The consumer goroutines that consume requests from the "queue" (as explained, there is always a queue, whether or not the caller waits for the response):