lsm: support read-your-own-uncommitted-writes #28856

rockwotj · 2025-12-04T22:24:48Z

lsm: replace write_batch with memtable
lsm: add support for read-your-own-writes of uncommitted data

See: https://github.com/facebook/rocksdb/wiki/Write-Batch-With-Index#write-batch-with-index

Backports Required

Release Notes

none

A write batch is basically the same thing as a memtable. We want to allow iteration over a pending batch merged with the database contents, so unify the write batch and memtable types.

Now that we have unified the memtable and write batch into a single structure, we can provide iteration and `get` semantics over the pending write batch. This supports use cases where you want to explicitly flush pending operations or provide some kind of transactional support with read-your-own-writes guarantees. For prior art, see the following in RocksDB: https://github.com/facebook/rocksdb/wiki/Write-Batch-With-Index#write-batch-with-index
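To make the `get` semantics concrete, here is a minimal sketch of a point lookup that consults the pending batch before the committed data. It is not the code in this PR: `toy_memtable` and `get_with_overlay` are hypothetical names, a `std::map` stands in for the real memtable, and sequence numbers are ignored.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for a memtable: a sorted map in which std::nullopt
// marks a tombstone (a staged delete). Not the real lsm::db::memtable.
using toy_memtable = std::map<std::string, std::optional<std::string>>;

// Hypothetical lookup: the uncommitted batch is checked first, so a staged
// value or a staged tombstone shadows whatever the database holds.
std::optional<std::string> get_with_overlay(
  const toy_memtable& pending,
  const toy_memtable& committed,
  const std::string& key) {
    if (auto it = pending.find(key); it != pending.end()) {
        return it->second; // staged value, or nullopt for a staged delete
    }
    if (auto it = committed.find(key); it != committed.end()) {
        return it->second;
    }
    return std::nullopt; // key unknown to both layers
}

// Usage: reads observe the batch even though nothing has been applied yet.
inline void example_get_with_overlay() {
    toy_memtable committed;
    committed["a"] = "1";
    toy_memtable pending;
    pending["a"] = "10";
    assert(get_with_overlay(pending, committed, "a").value() == "10");
    assert(committed["a"].value() == "1"); // database itself is untouched
}
```

The sketch needs C++17. Checking the batch first is what gives read-your-own-writes: the caller sees its own staged state while other readers of the database still see only committed data.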

Copilot

Pull request overview

This PR refactors the LSM tree implementation to support read-your-own-writes semantics for uncommitted data by replacing the internal write_batch class with direct use of memtable. This enables applications to read staged writes before committing them to the database.

Key changes:

Replaced lsm::internal::write_batch with lsm::db::memtable as the primary write staging mechanism
Added read capabilities (get() and create_iterator()) to write_batch that merge uncommitted writes with committed data
Modified memtable to support direct put()/remove() operations and a merge() operation for combining memtables
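As a rough illustration of the `put()`/`remove()`/`merge()` shape listed above, the sketch below uses a toy class. The signatures are assumptions and do not mirror `lsm::db::memtable`; in particular, the real memtable tracks sequence numbers, which this sketch leaves out.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Toy stand-in for a memtable. NOT lsm::db::memtable: names and signatures
// are hypothetical, and sequence numbers are omitted entirely.
class toy_memtable {
public:
    // Stage a write; a later put() for the same key overwrites it.
    void put(std::string key, std::string value) {
        _entries.insert_or_assign(std::move(key), std::move(value));
    }

    // Stage a delete by recording a tombstone (std::nullopt).
    void remove(std::string key) {
        _entries.insert_or_assign(std::move(key), std::nullopt);
    }

    // Fold another table into this one; entries from `other` win.
    void merge(const toy_memtable& other) {
        for (const auto& [key, value] : other._entries) {
            _entries.insert_or_assign(key, value);
        }
    }

    // Returns nullptr if the key was never touched; otherwise a pointer to
    // the staged entry, where an empty optional means "deleted".
    const std::optional<std::string>* find(const std::string& key) const {
        auto it = _entries.find(key);
        return it == _entries.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, std::optional<std::string>> _entries;
};

// Usage: applying a batch is modeled as merging it into the active table.
inline void example_merge() {
    toy_memtable active;
    active.put("k", "old");
    toy_memtable batch;
    batch.remove("k");
    active.merge(batch);
    assert(active.find("k") != nullptr && !active.find("k")->has_value());
}
```

Keeping the tombstone in the table, rather than erasing the key, is what lets a staged delete hide an older value that lives in a lower layer.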

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
src/v/lsm/lsm.h	Added read operations to `write_batch` API and updated forward declarations
src/v/lsm/lsm.cc	Implemented `write_batch` read operations and refactored to use memtable instead of internal batch
src/v/lsm/db/memtable.h	Replaced `apply()` with `put()`, `remove()`, and `merge()` operations
src/v/lsm/db/memtable.cc	Implemented new memtable operations with proper sequence number validation
src/v/lsm/db/impl.h	Updated `apply()` and `create_iterator()` signatures to work with memtable
src/v/lsm/db/impl.cc	Implemented iterator creation with optional uncommitted memtable overlay
src/v/lsm/core/internal/batch.h	Removed file as functionality moved to memtable
src/v/lsm/db/tests/*.cc	Updated tests to use memtable directly and added read-your-own-writes test
src/v/lsm/*/BUILD	Updated build dependencies to remove batch and add memtable where needed
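The "iterator creation with optional uncommitted memtable overlay" row can be pictured as a two-way merge over sorted inputs. The following is only a sketch under simplifying assumptions: `scan_with_overlay` is a hypothetical helper, both layers are plain `std::map`s, and the result is materialized into a vector instead of being exposed as a lazy iterator.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a memtable; std::nullopt marks a tombstone.
using toy_memtable = std::map<std::string, std::optional<std::string>>;
using row = std::pair<std::string, std::string>;

// Walk both sorted layers in key order. On equal keys the pending entry
// wins and the committed one is skipped; tombstones are never emitted.
std::vector<row> scan_with_overlay(
  const toy_memtable& pending, const toy_memtable& committed) {
    std::vector<row> out;
    auto p = pending.begin();
    auto c = committed.begin();
    while (p != pending.end() || c != committed.end()) {
        const bool take_pending
          = c == committed.end()
            || (p != pending.end() && p->first <= c->first);
        if (take_pending) {
            if (c != committed.end() && c->first == p->first) {
                ++c; // shadowed by the pending entry
            }
            if (p->second) {
                out.emplace_back(p->first, *p->second);
            }
            ++p;
        } else {
            if (c->second) {
                out.emplace_back(c->first, *c->second);
            }
            ++c;
        }
    }
    return out;
}

// Usage: an empty overlay degenerates to a plain scan of committed data.
inline void example_scan() {
    toy_memtable committed;
    committed["a"] = "1";
    const auto rows = scan_with_overlay(toy_memtable{}, committed);
    assert(rows.size() == 1 && rows[0].first == "a");
}
```

A real merging iterator would hold positions in both sources and advance lazily, which is also why it would depend on the batch staying alive and unmodified while it is in use.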

rockwotj · 2025-12-04T22:31:50Z

src/v/lsm/lsm.h

+    // The returned iterator must not be used after the write_batch is applied
+    // to the database.


We can remove this limitation if we want, but it would be more expensive, because it would require sharing or copying the data from one memtable to the other.


vbotbuildovich · 2025-12-04T23:56:55Z

Retry command for Build#77365

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"lz4"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"gzip"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"snappy"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"zstd"}
tests/rptest/transactions/tx_upgrade_test.py::TxUpgradeCompactionTest.upgrade_with_compaction_test

vbotbuildovich · 2025-12-05T01:29:45Z

CI test results

test results on build#77365

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "gzip"}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67a7-4dc2-87a2-be23cd4c361e	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "gzip"}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb93-23da-4b7f-b78e-ce2f348adb16	FLAKY	8/21	upstream reliability is '88.20224719101124'. current run reliability is '38.095238095238095'. drift is 50.10701 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "lz4"}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67a9-4a25-9f25-4195d037b450	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "lz4"}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb93-23db-4291-99e5-31476d14f190	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "snappy"}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67aa-450e-96f6-5842db431aa3	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "snappy"}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb93-23dd-4aa2-b46a-abbad47451d0	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "zstd"}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67ac-409a-9221-bc4ddd5ae9cf	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "zstd"}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb93-23de-446c-868a-6770121503ae	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
NodesDecommissioningTest	test_decommissioning_rebalancing_node	{"shutdown_decommissioned": true}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67ae-4aa6-9757-b9cecfeb33dc	FLAKY	16/21	upstream reliability is '92.44186046511628'. current run reliability is '76.19047619047619'. drift is 16.25138 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_rebalancing_node
RedpandaNodeOperationsSmokeTest	test_node_ops_smoke_test	{"cloud_storage_type": 1, "mixed_versions": true}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb93-23da-4b7f-b78e-ce2f348adb16	FLAKY	12/21	upstream reliability is '95.30516431924883'. current run reliability is '57.14285714285714'. drift is 38.16231 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
WriteCachingFailureInjectionE2ETest	test_crash_all	{"use_transactions": false}	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67ac-409a-9221-bc4ddd5ae9cf	FLAKY	18/21	upstream reliability is '88.71473354231975'. current run reliability is '85.71428571428571'. drift is 3.00045 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
TxUpgradeCompactionTest	upgrade_with_compaction_test	null	integration	https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67aa-450e-96f6-5842db431aa3	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxUpgradeCompactionTest&test_method=upgrade_with_compaction_test

andrwng

Slick! LGTM

andrwng · 2025-12-05T01:39:13Z

src/v/lsm/lsm.h

+    // The returned iterator must not be used after the write_batch is applied
+    // to the database.


It doesn't feel like a big limitation to leave as is. I imagine this being useful to validate some DB conditions before applying to the DB, after which maybe most users wouldn't mind just initializing a new iterator?

andrwng · 2025-12-05T01:46:13Z

Thinking through the train of thought from the original PR:

Just thinking about the db::impl but potentially replicated with Raft and using cloud storage, I'm thinking it makes sense to separate out the deterministic state and updates (moving _mem to _imm, swapping _imm for a SST, and setting new SSTs) from the background work, that maybe we'd drive on a leader with some different policies for when to flush.

This interface isn't quite what I was thinking, but I don't doubt it can still be useful. What I was getting at was exposing deterministic control over the memtable and the version_set, though maybe I need to shift away from that mental model to preserve the LSM public API

rockwotj · 2025-12-05T03:43:39Z

> Thinking through the train of thought from the original PR:
>
> Just thinking about the db::impl but potentially replicated with Raft and using cloud storage, I'm thinking it makes sense to separate out the deterministic state and updates (moving _mem to _imm, swapping _imm for a SST, and setting new SSTs) from the background work, that maybe we'd drive on a leader with some different policies for when to flush.
>
> This interface isn't quite what I was thinking, but I don't doubt it can still be useful. What I was getting at was exposing deterministic control over the memtable and the version_set, though maybe I need to shift away from that mental model to preserve the LSM public API

We can disable automatic flushing of memtables (for example, by setting the write buffer size to 0 or to the max size), accumulate the writes externally first, and then apply them and immediately flush. I guess if we set the write buffer size to max size_t, then we can just flush explicitly and you don't need this interface.
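A toy model of that idea, for illustration only: `toy_db`, its constructor argument, and `flush()` are hypothetical and do not reflect the real LSM options or API. In this sketch only a threshold of max `size_t` disables automatic flushing; whether 0 has the same meaning in the real implementation is not something the sketch assumes.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a DB whose memtable is flushed automatically once
// it reaches a byte threshold, or explicitly via flush().
class toy_db {
public:
    explicit toy_db(std::size_t write_buffer_size)
      : _write_buffer_size(write_buffer_size) {}

    void put(const std::string& key, const std::string& value) {
        _mem[key] = value;
        // Approximate accounting: overwrites are counted again.
        _mem_bytes += key.size() + value.size();
        if (_mem_bytes >= _write_buffer_size) {
            flush(); // automatic flush; unreachable with a max threshold
        }
    }

    // Explicit flush: the memtable becomes one immutable "table".
    void flush() {
        if (_mem.empty()) {
            return;
        }
        _flushed.push_back(std::move(_mem));
        _mem.clear(); // moved-from map is reset to a known empty state
        _mem_bytes = 0;
    }

    std::size_t flushed_tables() const { return _flushed.size(); }
    std::size_t buffered_keys() const { return _mem.size(); }

private:
    std::size_t _write_buffer_size;
    std::size_t _mem_bytes = 0;
    std::map<std::string, std::string> _mem;
    std::vector<std::map<std::string, std::string>> _flushed;
};

// Usage: with the threshold at max size_t, flushing is caller-driven.
inline void example_manual_flush() {
    toy_db db(std::numeric_limits<std::size_t>::max());
    db.put("a", "1");
    assert(db.flushed_tables() == 0);
    db.flush();
    assert(db.flushed_tables() == 1);
}
```

With flushing fully caller-driven, the point at which a memtable turns into a table becomes a deterministic decision of whoever drives the DB.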

lsm: replace write_batch with memtable

e368227


Copilot AI review requested due to automatic review settings December 4, 2025 22:24

github-actions bot added area/build area/redpanda labels Dec 4, 2025

rockwotj requested review from Lazin, andrwng, dotnwat and mmaslankaprv December 4, 2025 22:25

Copilot AI reviewed Dec 4, 2025


rockwotj force-pushed the lsm_followup_memtable branch from c5ed900 to e79ff67 Compare December 4, 2025 22:26


andrwng approved these changes Dec 5, 2025


rockwotj merged commit 7d0abd5 into redpanda-data:dev Dec 6, 2025
16 of 20 checks passed

rockwotj deleted the lsm_followup_memtable branch December 6, 2025 00:19
