kgo-verifier: update franz-go to always retry EOF #28854
Tests are flaking when kgo-repeater dies with a non-retriable error right after connecting. This happens when franz-go gets an EOF right after connecting but before receiving any response. As a heuristic, it assumes this is a SASL misconfiguration, since that is how the broker behaves in that case. However, the same EOF can also occur because Redpanda is stopped or killed after the connection is made but before the initial requests can be answered.

This means the producer will fail if a broker dies, is killed, or stops during the critical window between the connection and the first response. This is reasonably likely in stress/chaos tests, where brokers are being started and stopped all the time.

This heuristic is a relatively recent change in franz-go (~6 months ago), which was brought in a few months ago by a franz-go upgrade.

To fix this, we make use of a proposed new option for franz-go, from this PR: twmb/franz-go#1198

That PR is not merged yet, so we pull in its SHA directly. When a franz-go release is made with this change, we can update to that.

Details in CORE-14849.

Fixes CORE-14898.
Pull request overview
This PR updates the franz-go dependency to use a pre-release commit that adds a new AlwaysRetryEOF() option, which is then enabled in the kgo-verifier test tools to work around flaky test failures. The issue occurs when Redpanda is stopped/killed after a connection is made but before the initial request can be responded to, causing franz-go to incorrectly treat the EOF as a non-retriable SASL misconfiguration error.
Key changes:
- Updates franz-go dependency to commit SHA `v1.20.6-0.20251204171952-b7b6b8e44d30`, which includes the new `AlwaysRetryEOF()` option
- Adds `kgo.AlwaysRetryEOF()` option to both the worker and repeater worker configurations
- Updates transitive dependencies (golang.org/x/sync, golang.org/x/crypto, golang.org/x/sys, golang.org/x/text)
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/go/kgo-verifier/pkg/worker/worker.go | Adds kgo.AlwaysRetryEOF() to the Kafka client options in MakeKgoOpts() |
| tests/go/kgo-verifier/pkg/worker/repeater/repeater_worker.go | Adds kgo.AlwaysRetryEOF() to the Kafka client options in the repeater worker's Init() method |
| tests/go/kgo-verifier/go.mod | Updates franz-go to pre-release commit and bumps transitive dependency versions |

    kgo.MaxBufferedRecords(int(wc.MaxBufferedRecords)),
    kgo.RequiredAcks(kgo.AllISRAcks()),
    kgo.AlwaysRetryEOF(), // workaround for CORE-14849

There are many NewClient calls in these binaries, but luckily most defer here to MakeKgoOpts. I only searched for NewClient calls: hopefully that is sufficient.
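To show why auditing the `NewClient` call sites matters, here is a minimal sketch of the functional-options pattern with a shared helper. All names (`config`, `opt`, `withMaxBufferedRecords`, `withAlwaysRetryEOF`, `makeOpts`, `newClient`) are hypothetical stand-ins, not the kgo or kgo-verifier API; the point is only that a call site which bypasses the shared helper does not pick up the new option.

```go
package main

import "fmt"

// config is a hypothetical stand-in for a client's settings.
type config struct {
	maxBufferedRecords int
	alwaysRetryEOF     bool
}

// opt is a hypothetical functional option that mutates a config.
type opt func(*config)

func withMaxBufferedRecords(n int) opt {
	return func(c *config) { c.maxBufferedRecords = n }
}

func withAlwaysRetryEOF() opt {
	return func(c *config) { c.alwaysRetryEOF = true }
}

// makeOpts plays the role of a shared helper: every caller that goes
// through it gets the retry option without further changes.
func makeOpts(maxBuffered int) []opt {
	return []opt{
		withMaxBufferedRecords(maxBuffered),
		withAlwaysRetryEOF(),
	}
}

// newClient applies the options in order and returns the resulting config.
func newClient(opts ...opt) config {
	var c config
	for _, o := range opts {
		o(&c)
	}
	return c
}

func main() {
	viaHelper := newClient(makeOpts(1024)...)
	direct := newClient(withMaxBufferedRecords(1024)) // bypasses the helper

	fmt.Println("via helper:", viaHelper.alwaysRetryEOF)
	fmt.Println("direct:", direct.alwaysRetryEOF)
}
```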
Looked at a few failures; they were all noise, so let's do a retry.

/ci-repeat 1
Note: this effectively replaces #28816, using a commit from upstream franz-go that makes this behavior configurable.
Note also that because of how Bazel works, this makes every Go module in the repo use the updated version when built with Bazel, so this is a "tree-wide" change even if it doesn't look like it.