Support for configuring Kafka consumer rebalance strategy and group instance ID #39513

Open
bvsvas opened this issue Apr 21, 2025 · 5 comments

bvsvas commented Apr 21, 2025

Component(s)

receiver/kafka

Is your feature request related to a problem? Please describe.

Support for configuring Sticky Rebalancing Strategy (Consumer.Group.Rebalance.Strategy = sticky) in the Kafka receiver implementation via IBM Sarama client.

This includes:

  • Allow configuring the consumer group rebalance strategy (range, roundrobin, or sticky)
  • Optionally support Group.InstanceId to enable static consumer membership (as per KIP-345)

The kafkareceiver currently uses the IBM Sarama client with the default range rebalancing strategy for consumer group coordination. This often leads to uneven partition assignments and large-scale rebalances when pods are restarted or scaled, causing unnecessary cache reloading, CPU spikes, and latency due to metric metadata being recomputed or fetched again.

This is especially problematic in large-scale OpenTelemetry Collector deployments that rely on consistent partition ownership for optimized caching and reduced memory churn.
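To illustrate the uneven assignments described above, here is a minimal sketch of range-style assignment. It assumes the per-topic logic of Kafka's range assignor (sorted members, contiguous blocks, remainder going to the first members); Sarama's implementation may differ in detail, and the member and topic names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeAssign mimics the per-topic logic of a range assignor: for every topic,
// members are sorted and each receives partitions/members partitions, with the
// first (partitions % members) members receiving one extra partition.
// It returns the total number of partitions owned by each member.
func rangeAssign(members []string, topics map[string]int) map[string]int {
	sorted := append([]string(nil), members...)
	sort.Strings(sorted)

	load := make(map[string]int, len(sorted))
	for _, m := range sorted {
		load[m] = 0
	}
	if len(sorted) == 0 {
		return load
	}
	for _, partitions := range topics {
		base, extra := partitions/len(sorted), partitions%len(sorted)
		for i, m := range sorted {
			load[m] += base
			if i < extra {
				load[m]++ // the remainder always lands on the first members
			}
		}
	}
	return load
}

func main() {
	members := []string{"collector-0", "collector-1", "collector-2"}
	// Hypothetical topics, each with 4 partitions.
	topics := map[string]int{"otlp_metrics": 4, "otlp_logs": 4, "otlp_spans": 4}

	load := rangeAssign(members, topics)
	for _, m := range members {
		fmt.Printf("%s owns %d partitions\n", m, load[m])
	}
}
```

Because the remainder is handed out per topic, the first member collects the extra partition of every topic: in this example collector-0 ends up with 6 of the 12 partitions while the other two own 3 each.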

Describe the solution you'd like

Expose support for the sticky rebalancing strategy (stickyBalanceStrategy) in the kafkareceiver using the IBM Sarama client.

Specifically:

  • Add a configuration option to kafkareceiver that sets the Sarama client's Consumer.Group.Rebalance.Strategy
  • Optionally allow specifying Group.InstanceId to leverage static membership (e.g., group_instance_id: ${POD_NAME} for StatefulSets)
  • Default to the current range strategy if no value is provided, to maintain backward compatibility
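The defaulting behaviour requested above could be sketched as follows. This is not the receiver's actual code: the function and constant names are hypothetical, and the sketch only resolves the strategy name. A real implementation would then translate the name into the corresponding Sarama balance strategy (for example the value returned by `sarama.NewBalanceStrategySticky()`).

```go
package main

import "fmt"

// Hypothetical constants matching the values proposed in this issue.
const (
	strategyRange      = "range"
	strategyRoundRobin = "roundrobin"
	strategySticky     = "sticky"
)

// resolveRebalanceStrategy returns the strategy to use for a configured value.
// An empty value falls back to range for backward compatibility; any value
// outside the three supported names is rejected.
func resolveRebalanceStrategy(value string) (string, error) {
	switch value {
	case "":
		return strategyRange, nil
	case strategyRange, strategyRoundRobin, strategySticky:
		return value, nil
	default:
		return "", fmt.Errorf("unsupported group_rebalance_strategy %q (want range, roundrobin, or sticky)", value)
	}
}

func main() {
	for _, v := range []string{"", "sticky", "cooperative"} {
		s, err := resolveRebalanceStrategy(v)
		fmt.Printf("%q -> %q, err=%v\n", v, s, err)
	}
}
```

Rejecting unknown values at configuration time, rather than silently falling back to range, makes a typo in the setting visible at startup.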

Example config:

protocol_version: 3.2.1
group_id: otel-metrics-group
group_rebalance_strategy: sticky  # optional, new property, default value is range, possible values: range, roundrobin, sticky
group_instance_id: ${POD_NAME}  # optional, new property, enables static membership

Note: Group.InstanceId (static membership) requires Kafka 2.3 or later.
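The version requirement for static membership could be checked at configuration time. The sketch below is an assumption about how such a check might look, not the receiver's actual validation; the function names are hypothetical and only the major and minor components of protocol_version are compared.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorMinor extracts the first two numeric components of a version such as "3.2.1".
func majorMinor(version string) (int, int, error) {
	parts := strings.Split(version, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("invalid protocol_version %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid protocol_version %q: %w", version, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid protocol_version %q: %w", version, err)
	}
	return major, minor, nil
}

// validateInstanceID rejects a group_instance_id when the configured protocol
// version predates Kafka 2.3, the release that introduced static membership.
func validateInstanceID(instanceID, protocolVersion string) error {
	if instanceID == "" {
		return nil // static membership not requested, nothing to check
	}
	major, minor, err := majorMinor(protocolVersion)
	if err != nil {
		return err
	}
	if major < 2 || (major == 2 && minor < 3) {
		return fmt.Errorf("group_instance_id requires Kafka 2.3 or later, got %s", protocolVersion)
	}
	return nil
}

func main() {
	fmt.Println(validateInstanceID("collector-0", "3.2.1"))
	fmt.Println(validateInstanceID("collector-0", "2.2.0"))
}
```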

This would allow consumers to maintain a more consistent partition-to-replica assignment across restarts and reduce the operational load during scaling events.
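The benefit of stickiness can be shown with a deliberately simplified sketch. This is not Sarama's sticky algorithm, which also evens out load among surviving members; the version below only keeps partitions with owners that are still present and hands orphaned partitions to the least-loaded member. All names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// stickyReassign keeps every partition with its previous owner when that owner
// is still in the group, then gives each orphaned partition to the member that
// currently owns the fewest partitions (ties go to the first name in sort order).
func stickyReassign(previous map[int]string, members []string) map[int]string {
	sorted := append([]string(nil), members...)
	sort.Strings(sorted)

	alive := make(map[string]bool, len(sorted))
	load := make(map[string]int, len(sorted))
	for _, m := range sorted {
		alive[m] = true
		load[m] = 0
	}

	next := make(map[int]string, len(previous))
	var orphaned []int
	for p, owner := range previous {
		if alive[owner] {
			next[p] = owner // ownership survives the rebalance
			load[owner]++
		} else {
			orphaned = append(orphaned, p)
		}
	}
	sort.Ints(orphaned) // deterministic order regardless of map iteration

	for _, p := range orphaned {
		if len(sorted) == 0 {
			break // no members left, partitions stay unassigned
		}
		target := sorted[0]
		for _, m := range sorted[1:] {
			if load[m] < load[target] {
				target = m
			}
		}
		next[p] = target
		load[target]++
	}
	return next
}

func main() {
	previous := map[int]string{
		0: "collector-0", 1: "collector-0",
		2: "collector-1", 3: "collector-1",
		4: "collector-2", 5: "collector-2",
	}
	// collector-2 leaves the group; only its partitions should move.
	next := stickyReassign(previous, []string{"collector-0", "collector-1"})
	for p := 0; p < 6; p++ {
		fmt.Printf("partition %d: %s -> %s\n", p, previous[p], next[p])
	}
}
```

Only partitions 4 and 5 change owner in this example, so any per-partition caches held by collector-0 and collector-1 remain valid after the rebalance.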

Describe alternatives you've considered

  • Custom patching the Sarama client config inside a forked Collector build (currently being used as a workaround)
  • Sticky logic at Kafka broker level — not viable; balancing is always determined by clients

Additional context

Sticky balancing in Sarama:
https://github.com/IBM/sarama/blob/main/balance_strategy.go
https://github.com/IBM/sarama/blob/main/consumer_group.go

KafkaReceiver implementation:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/internal/kafka/client.go
https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/kafkareceiver

This enhancement will help large-scale OTel deployments (millions of unique time series) reduce rebalance impact and improve cache and CPU efficiency.

bvsvas added the enhancement and needs triage labels Apr 21, 2025

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

bvsvas changed the title from "Support Sticky Rebalancing Strategy in Kafka Receiver via IBM Sarama Client" to "Support for configuring Kafka consumer rebalance strategy and group instance ID" Apr 21, 2025

bvsvas commented Apr 21, 2025

/label internal/kafka

Pinging code owners for internal/kafka: @pavolloffay @MovieStoreGuy @axw. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.

bvsvas commented Apr 21, 2025

#39517

@crobert-1

A code owner has responded positively to the PR that resolves this issue, removing needs triage.

crobert-1 removed the needs triage label Apr 23, 2025
dmitryax pushed a commit that referenced this issue Apr 25, 2025
… Kafka consumer rebalance strategy and group instance ID (#39517)

As mentioned in issue [39513](#39513)

This enhancement introduces two optional settings:
group_rebalance_strategy and group_instance_id.
These allow users to override the default Range-based rebalance strategy
and optionally provide a static instance ID (as per KIP-345) to enable
static group membership.
This is particularly useful when handling high-cardinality metric
workloads, as it reduces rebalance impact, improves cache reuse, and
boosts CPU efficiency.
Both settings are optional to maintain full backward compatibility.

---------

Co-authored-by: Srinivas Venkata Bevara <[email protected]>
Co-authored-by: Antoine Toulme <[email protected]>
Co-authored-by: Vashistha Kumar Singh <[email protected]>
Co-authored-by: vs667919 <[email protected]>
vincentfree pushed a commit to ing-bank/opentelemetry-collector-contrib that referenced this issue May 6, 2025
vincentfree pushed a commit to ing-bank/opentelemetry-collector-contrib that referenced this issue May 20, 2025