[exporter/kafka] Kafka exporter blocks collector pipeline with continuous retries when a configuration error occurs #38604

Closed
an-mmx opened this issue Mar 13, 2025 · 4 comments
Labels: bug (Something isn't working), exporter/kafka

Comments

an-mmx commented Mar 13, 2025

Component(s)

exporter/kafka

What happened?

Description

The Kafka exporter should not perform retries on Sarama configuration errors. There is no point in retries, as each retry will result in the same configuration error and ultimately block the collector's pipeline.

From the message producer's perspective, as of version [email protected], configuration errors occur in the following cases:

  • When message headers are specified, but the protocol version does not support them.
  • When the message size exceeds Producer.MaxMessageBytes.

Steps to Reproduce

  • Configure the pipeline with the Batch processor set to send_batch_size: 25000 and timeout: 1m.
  • Configure the Kafka exporter with retry_on_failure.enabled: true and producer.max_message_bytes: 100000.
  • Maintain a constant telemetry flow.

Expected Result

The telemetry batch should be dropped, and a corresponding error message should be logged when this issue occurs. The Kafka exporter should still be able to perform retries for other errors, such as network timeouts, etc.

Actual Result

The Kafka exporter performs numerous retries while adhering to the retry_on_failure options. The reason for retry is a configuration error: "Attempt to produce message larger than configured Producer.MaxMessageBytes"

Collector version

v0.121.0

Environment information

No response

OpenTelemetry Collector configuration

receivers:
  otlp:

processors:
  batch:
    send_batch_size: 10000
    timeout: 1m

exporters:
  kafka/metrics:
    brokers:
      - kafka.broker.local
    encoding: otlp_proto
    producer:
      max_message_bytes: 100000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 10m
      max_interval: 30s
    sending_queue:
      enabled: true
    timeout: 5s
    topic: test.topic

service:
  pipelines:
    metrics:
      exporters:
        - kafka/metrics
      processors:
        - batch
      receivers:
        - otlp

Log output

Additional context

No response

@an-mmx added the bug (Something isn't working) and needs triage (New item requiring triage) labels Mar 13, 2025
Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@zou-weidong

Is there a function to calculate the size of Kafka messages after batching? I configured max_message_bytes: 10000000, but the batched message always exceeds 10 MB and fails. How should I configure this so the limit is not exceeded, e.g. by automatically calculating the batch size so it stays under MaxMessageBytes?

2025-03-19T08:25:38.139Z error internal/queue_sender.go:84 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "logs", "name": "kafka/business", "error": "no more retries left: Failed to deliver 1 messages due to kafka: invalid configuration (Attempt to produce message larger than configured Producer.MaxMessageBytes: 12319988 > 10000000)", "dropped_items": 108}
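One workaround, absent size-aware batching, is to estimate serialized payload sizes up front and split a batch greedily so each message stays under the limit. The helper below is a hypothetical sketch, not collector code; the limit value is illustrative, and since Sarama also counts the key, headers, and record overhead against Producer.MaxMessageBytes, real code should leave headroom:

```go
package main

import "fmt"

// maxMessageBytes stands in for producer.max_message_bytes; the value is an
// illustrative assumption, not a collector default.
const maxMessageBytes = 100

// splitPayloads greedily groups payload sizes so each group's total stays
// within limit. A single payload larger than the limit gets its own group:
// it would still fail to produce and must be dropped or re-encoded.
func splitPayloads(sizes []int, limit int) [][]int {
	var groups [][]int
	var cur []int
	total := 0
	for _, s := range sizes {
		if len(cur) > 0 && total+s > limit {
			groups = append(groups, cur)
			cur, total = nil, 0
		}
		cur = append(cur, s)
		total += s
	}
	if len(cur) > 0 {
		groups = append(groups, cur)
	}
	return groups
}

func main() {
	// 120 exceeds the limit on its own and is isolated into its own group.
	fmt.Println(splitPayloads([]int{60, 50, 30, 120, 10}, maxMessageBytes))
}
```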


an-mmx commented Mar 19, 2025

AFAIK there is no such functionality in the OTel Collector yet (#36982); it's under development at the moment (#37176).
In our case, we use a custom implementation of the batch processor that lets us limit batch size in bytes. We haven't shared it upstream so far.

Nevertheless, this issue is not only about that case. As I mentioned in the description, the Sarama client can fail to send a message with a ConfigurationError. This kind of error is deterministic, so there is no point in retrying it.

MovieStoreGuy added a commit that referenced this issue May 7, 2025
…prevent retries (#38608)

#### Description
This fix unifies message send error handling for all types of telemetry.
It is designed to identify whether the error was caused by a
ConfigurationError and then reclassify it as a permanent consumer error
to prevent further retries.

#### Link to tracking issue

#38604

#### Testing
Unit test coverage added

#### Documentation
No changes

Co-authored-by: Antoine Toulme <[email protected]>
Co-authored-by: Sean Marciniak <[email protected]>
@pjanotti removed the needs triage (New item requiring triage) label May 13, 2025
@pjanotti

@an-mmx my understanding is that this one was fixed via #38608 - let me know if I misunderstood something.

dragonlord93 pushed a commit to dragonlord93/opentelemetry-collector-contrib that referenced this issue May 23, 2025
…prevent retries (open-telemetry#38608)
