kgo-verifier: update franz-go to always retry EOF #28854

travisdowns · 2025-12-04T21:04:24Z

Note: this effectively replaces #28816, using a commit from upstream franz-go which makes this behavior configurable.

Note also that because of how bazel works, this effectively makes all go modules in the repo use this updated version when they build with bazel, so this is a "tree wide" change even if it doesn't look like it.

Tests are flaking when kgo-repeater dies with a non-retriable error right after connecting. This happens when franz-go gets EOF right after connecting but before receiving any response: as a heuristic, it assumes this is a SASL misconfiguration, since that is how the broker behaves in that case. However, this can also occur because Redpanda is stopped/killed after the connection is made but before the initial requests can be responded to.

This means the producer will fail if a broker dies/is killed/stops during the critical window between the connection and receiving the first response. This is reasonably likely in stress/chaos tests where producers are being started and stopped all the time.

This is a relatively recent change (~6 months ago) in franz-go, which was brought in a few months ago by a franz-go upgrade.

To fix this, we make use of a proposed new option to franz-go, from this PR:

twmb/franz-go#1198

This is not merged, so we pull in the SHA from this PR directly. When a franz-go release is made with this change, we can update to that.

Details in CORE-14849.

Fixes CORE-14898.

Backports Required

Release Notes

none

Copilot

Pull request overview

This PR updates the franz-go dependency to use a pre-release commit that adds a new AlwaysRetryEOF() option, which is then enabled in the kgo-verifier test tools to work around flaky test failures. The issue occurs when Redpanda is stopped/killed after a connection is made but before the initial request can be responded to, causing franz-go to incorrectly treat the EOF as a non-retriable SASL misconfiguration error.

Key changes:

- Updates the franz-go dependency to pseudo-version v1.20.6-0.20251204171952-b7b6b8e44d30 (commit b7b6b8e44d30), which includes the new AlwaysRetryEOF() option
- Adds the kgo.AlwaysRetryEOF() option to both the worker and repeater worker configurations
- Updates transitive dependencies (golang.org/x/sync, golang.org/x/crypto, golang.org/x/sys, golang.org/x/text)

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.

File	Description
tests/go/kgo-verifier/pkg/worker/worker.go	Adds `kgo.AlwaysRetryEOF()` to the Kafka client options in `MakeKgoOpts()`
tests/go/kgo-verifier/pkg/worker/repeater/repeater_worker.go	Adds `kgo.AlwaysRetryEOF()` to the Kafka client options in the repeater worker's `Init()` method
tests/go/kgo-verifier/go.mod	Updates franz-go to pre-release commit and bumps transitive dependency versions

travisdowns · 2025-12-04T21:07:42Z

tests/go/kgo-verifier/pkg/worker/worker.go

 		kgo.MaxBufferedRecords(int(wc.MaxBufferedRecords)),
 		kgo.RequiredAcks(kgo.AllISRAcks()),
+
+		kgo.AlwaysRetryEOF(), // workaround for CORE-14849


There are many NewClient calls in these binaries, but luckily most defer here to MakeKgoOpts. I only searched for NewClient calls: hopefully that is sufficient.
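
To illustrate why a shared option builder makes this kind of change cheap, here is a small model of the pattern. It does not use franz-go: `Option`, `makeOpts`, and `hasOption` are hypothetical stand-ins, and the option names are plain strings chosen for readability.

```go
package main

import "fmt"

// Option is a hypothetical stand-in for a client option; it is not kgo.Opt.
type Option string

// makeOpts models a shared option builder: every caller starts from the same
// base list, so an option added here reaches all clients built through it.
func makeOpts(extra ...Option) []Option {
	base := []Option{"MaxBufferedRecords", "RequiredAcks", "AlwaysRetryEOF"}
	return append(base, extra...)
}

// hasOption reports whether want appears in opts.
func hasOption(opts []Option, want Option) bool {
	for _, o := range opts {
		if o == want {
			return true
		}
	}
	return false
}

func main() {
	viaBuilder := makeOpts("ConsumeTopics")
	// A client that assembles its own list bypasses the shared builder
	// and has to be updated separately.
	handRolled := []Option{"SeedBrokers", "ConsumeTopics"}

	fmt.Println(hasOption(viaBuilder, "AlwaysRetryEOF")) // true
	fmt.Println(hasOption(handRolled, "AlwaysRetryEOF")) // false
}
```

The hand-rolled case is the one a search for client constructors is meant to catch, since those call sites do not inherit anything from the shared builder.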

vbotbuildovich · 2025-12-04T22:57:49Z

Retry command for Build#77359

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"zstd"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"lz4"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"snappy"}
tests/rptest/transactions/tx_upgrade_test.py::TxUpgradeCompactionTest.upgrade_with_compaction_test
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"gzip"}
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":true}

vbotbuildovich · 2025-12-05T01:13:04Z

CI test results

test results on build#77359

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "gzip"}	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b391-4bec-b8d0-928fe68bb903	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "gzip"}	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-0044-49d1-9592-c1136841c6b4	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "lz4"}	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b393-48dd-a01c-54c4bc35d7c2	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "lz4"}	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-0046-4dd7-a808-5e4a98b5cdce	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "snappy"}	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b396-4f0c-ba73-4202492d5c43	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "snappy"}	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-0047-481e-be6a-71bcbb2eef24	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "zstd"}	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b397-475a-bbfa-2e48d9bed65c	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "zstd"}	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-0049-43e6-a945-820b7ac155e2	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
ControllerUpgradeTest	test_updating_cluster_when_executing_operations	null	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b391-4bec-b8d0-928fe68bb903	FLAKY	16/21	upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControllerUpgradeTest&test_method=test_updating_cluster_when_executing_operations
DataMigrationsApiTest	test_creating_and_listing_migrations	null	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-004a-4fb0-9885-28babcbfa7a9	FLAKY	20/21	upstream reliability is '99.71910112359551'. current run reliability is '95.23809523809523'. drift is 4.48101 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations
RedpandaNodeOperationsSmokeTest	test_node_ops_smoke_test	{"cloud_storage_type": 1, "mixed_versions": true}	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b391-4bec-b8d0-928fe68bb903	FLAKY	12/21	upstream reliability is '89.82300884955751'. current run reliability is '57.14285714285714'. drift is 32.68015 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
RedpandaNodeOperationsSmokeTest	test_node_ops_smoke_test	{"cloud_storage_type": 1, "mixed_versions": true}	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-0044-49d1-9592-c1136841c6b4	FLAKY	8/21	upstream reliability is '90.82568807339449'. current run reliability is '38.095238095238095'. drift is 52.73045 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
TxUpgradeCompactionTest	upgrade_with_compaction_test	null	integration	https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b396-4f0c-ba73-4202492d5c43	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxUpgradeCompactionTest&test_method=upgrade_with_compaction_test

travisdowns · 2025-12-05T01:49:39Z

I looked at a few of the failures and they were all noise, so let's do a retry.

travisdowns · 2025-12-05T01:49:42Z

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"zstd"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"lz4"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"snappy"}
tests/rptest/transactions/tx_upgrade_test.py::TxUpgradeCompactionTest.upgrade_with_compaction_test
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"gzip"}
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":true}

travisdowns requested review from StephanDollberg, Copilot, r-vasquez, rockwotj and twmb December 4, 2025 21:04

Copilot AI reviewed Dec 4, 2025

travisdowns mentioned this pull request Dec 4, 2025

Patch franz-go for EOF-before-first-response behavior #28816

Closed

8 tasks

travisdowns commented Dec 4, 2025

rockwotj approved these changes Dec 4, 2025

twmb approved these changes Dec 4, 2025

StephanDollberg approved these changes Dec 5, 2025

travisdowns merged commit 13a50ee into redpanda-data:dev Dec 5, 2025
18 checks passed

5 participants
