@travisdowns
Member

Note: this effectively replaces #28816, using a commit from upstream franz-go which makes this behavior configurable.

Note also that because of how bazel works, this effectively makes all go modules in the repo use this updated version when they build with bazel, so this is a "tree wide" change even if it doesn't look like it.

Tests are flaking when kgo-repeater dies with a non-retriable error right after connection. This happens when franz-go gets EOF right after connecting but before receiving any response: as a heuristic, it assumes this is a SASL misconfiguration, since that is how the broker behaves in that case. However, this can also occur because Redpanda is stopped/killed after the connection is made but before the initial requests can be responded to.

This means the producer will fail if the broker dies/is killed/stops during a critical window between the connection and receiving the first response. This is reasonably likely in stress/chaos tests where producers are being started and stopped all the time.

This heuristic is a relatively recent change (~6 months ago) in franz-go, which was brought in a few months ago by a franz-go upgrade.

To fix this, we make use of a proposed new option to franz-go, from this PR:

twmb/franz-go#1198

That PR is not merged yet, so we pull in the SHA from it directly. When a franz-go release is made with this change, we can update to that.

Details in CORE-14849.

Fixes CORE-14898.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the franz-go dependency to use a pre-release commit that adds a new AlwaysRetryEOF() option, which is then enabled in the kgo-verifier test tools to work around flaky test failures. The issue occurs when Redpanda is stopped/killed after a connection is made but before the initial request can be responded to, causing franz-go to incorrectly treat the EOF as a non-retriable SASL misconfiguration error.

Key changes:

  • Updates franz-go dependency to pseudo-version v1.20.6-0.20251204171952-b7b6b8e44d30 (commit b7b6b8e44d30), which includes the new AlwaysRetryEOF() option
  • Adds kgo.AlwaysRetryEOF() option to both the worker and repeater worker configurations
  • Updates transitive dependencies (golang.org/x/sync, golang.org/x/crypto, golang.org/x/sys, golang.org/x/text)

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/go/kgo-verifier/pkg/worker/worker.go | Adds kgo.AlwaysRetryEOF() to the Kafka client options in MakeKgoOpts() |
| tests/go/kgo-verifier/pkg/worker/repeater/repeater_worker.go | Adds kgo.AlwaysRetryEOF() to the Kafka client options in the repeater worker's Init() method |
| tests/go/kgo-verifier/go.mod | Updates franz-go to pre-release commit and bumps transitive dependency versions |

kgo.MaxBufferedRecords(int(wc.MaxBufferedRecords)),
kgo.RequiredAcks(kgo.AllISRAcks()),

kgo.AlwaysRetryEOF(), // workaround for CORE-14849
Member Author

There are many NewClient calls in these binaries, but luckily most defer here to MakeKgoOpts. I only searched for NewClient calls: hopefully that is sufficient.

@vbotbuildovich
Collaborator

vbotbuildovich commented Dec 4, 2025

Retry command for Build#77359

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"zstd"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"lz4"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"snappy"}
tests/rptest/transactions/tx_upgrade_test.py::TxUpgradeCompactionTest.upgrade_with_compaction_test
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"gzip"}
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":true}

@vbotbuildovich
Collaborator

CI test results

test results on build#77359
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "gzip"} | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b391-4bec-b8d0-928fe68bb903 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "gzip"} | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-0044-49d1-9592-c1136841c6b4 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "lz4"} | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b393-48dd-a01c-54c4bc35d7c2 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "lz4"} | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-0046-4dd7-a808-5e4a98b5cdce | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "snappy"} | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b396-4f0c-ba73-4202492d5c43 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "snappy"} | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-0047-481e-be6a-71bcbb2eef24 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "zstd"} | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b397-475a-bbfa-2e48d9bed65c | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "zstd"} | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-0049-43e6-a945-820b7ac155e2 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| ControllerUpgradeTest | test_updating_cluster_when_executing_operations | null | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b391-4bec-b8d0-928fe68bb903 | FLAKY | 16/21 | upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControllerUpgradeTest&test_method=test_updating_cluster_when_executing_operations |
| DataMigrationsApiTest | test_creating_and_listing_migrations | null | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-004a-4fb0-9885-28babcbfa7a9 | FLAKY | 20/21 | upstream reliability is '99.71910112359551'. current run reliability is '95.23809523809523'. drift is 4.48101 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b391-4bec-b8d0-928fe68bb903 | FLAKY | 12/21 | upstream reliability is '89.82300884955751'. current run reliability is '57.14285714285714'. drift is 32.68015 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb47-0044-49d1-9592-c1136841c6b4 | FLAKY | 8/21 | upstream reliability is '90.82568807339449'. current run reliability is '38.095238095238095'. drift is 52.73045 and the allowed drift is set to 50. The test should FAIL | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
| TxUpgradeCompactionTest | upgrade_with_compaction_test | null | integration | https://buildkite.com/redpanda/redpanda/builds/77359#019aeb40-b396-4f0c-ba73-4202492d5c43 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxUpgradeCompactionTest&test_method=upgrade_with_compaction_test |

@travisdowns
Member Author

Looked at a few failures; they were all noise. Let's do a retry.

@travisdowns
Member Author

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"zstd"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"lz4"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"snappy"}
tests/rptest/transactions/tx_upgrade_test.py::TxUpgradeCompactionTest.upgrade_with_compaction_test
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"gzip"}
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":true}

@travisdowns travisdowns merged commit 13a50ee into redpanda-data:dev Dec 5, 2025
18 checks passed