Contributor

@rockwotj rockwotj commented Dec 4, 2025

  • lsm: replace write_batch with memtable
  • lsm: add support for read-your-own-writes of uncommitted data

See: https://github.com/facebook/rocksdb/wiki/Write-Batch-With-Index#write-batch-with-index

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

A write batch is basically the same thing as a memtable. We want to
allow iterating over a pending batch together with the database
contents, so unify the write batch and memtable types.
Copilot AI review requested due to automatic review settings December 4, 2025 22:24
Now that we have unified the memtable and write batch to a single
structure, we can provide iteration and `get` semantics over the
pending write batch to support use cases where you want to explicitly
flush pending operations or provide some kind of transactional support
with read-your-own-writes guarantees.

For prior art, see the following in RocksDB:
https://github.com/facebook/rocksdb/wiki/Write-Batch-With-Index#write-batch-with-index
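
To illustrate the read-your-own-writes idea described above, here is a minimal sketch. It is not the code from this PR: `toy_table` and `overlay_get` are hypothetical names, and a plain `std::map` stands in for both the pending batch and the committed database. The only point is the lookup order: uncommitted writes, including removals, shadow committed data.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical toy type, not the real lsm:: classes.
// A nullopt value is a tombstone: the key was removed.
using toy_table = std::map<std::string, std::optional<std::string>>;

// Look up `key`, letting uncommitted writes in `pending` shadow `committed`.
inline std::optional<std::string> overlay_get(
  const toy_table& pending,
  const toy_table& committed,
  const std::string& key) {
    // Check the uncommitted writes first.
    if (auto it = pending.find(key); it != pending.end()) {
        return it->second; // a pending tombstone hides any committed value
    }
    // Fall back to the committed data.
    if (auto it = committed.find(key); it != committed.end()) {
        return it->second;
    }
    return std::nullopt;
}
```

A removal staged in the pending table returns `std::nullopt` even when the committed table still holds a value for that key.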
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the LSM tree implementation to support read-your-own-writes semantics for uncommitted data by replacing the internal write_batch class with direct use of memtable. This enables applications to read staged writes before committing them to the database.

Key changes:

  • Replaced lsm::internal::write_batch with lsm::db::memtable as the primary write staging mechanism
  • Added read capabilities (get() and create_iterator()) to write_batch that merge uncommitted writes with committed data
  • Modified memtable to support direct put()/remove() operations and a merge() operation for combining memtables
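
As a rough illustration of the `put()`/`remove()`/`merge()` shape listed above, the sketch below uses a hypothetical `toy_memtable`. It is not the real `lsm::db::memtable` API: the signatures, the single-version-per-key storage, and the "sequence numbers never go backwards" rule are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a memtable; not the real lsm::db::memtable API.
// Unlike a real memtable, it keeps only the newest version of each key.
class toy_memtable {
public:
    struct entry {
        std::uint64_t seqno = 0;
        std::optional<std::string> value; // nullopt marks a tombstone
    };

    void put(const std::string& key, std::string value, std::uint64_t seqno) {
        write(key, entry{seqno, std::move(value)});
    }

    void remove(const std::string& key, std::uint64_t seqno) {
        write(key, entry{seqno, std::nullopt});
    }

    // Fold every entry of `other` into this table; the newer seqno wins.
    void merge(const toy_memtable& other) {
        for (const auto& [key, e] : other._entries) {
            auto it = _entries.find(key);
            if (it == _entries.end() || it->second.seqno < e.seqno) {
                _entries.insert_or_assign(key, e);
            }
        }
        _last_seqno = std::max(_last_seqno, other._last_seqno);
    }

    // Returns nullopt for missing keys and for tombstones.
    std::optional<std::string> get(const std::string& key) const {
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            return std::nullopt;
        }
        return it->second.value;
    }

    std::uint64_t last_seqno() const { return _last_seqno; }

private:
    void write(const std::string& key, entry e) {
        // Assumed rule: sequence numbers never go backwards within a table.
        assert(e.seqno >= _last_seqno);
        _last_seqno = e.seqno;
        _entries.insert_or_assign(key, std::move(e));
    }

    std::map<std::string, entry> _entries;
    std::uint64_t _last_seqno = 0;
};
```

With this shape, applying a staged batch amounts to `db.merge(batch)`, which is one way to see why the two types can be unified.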

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `src/v/lsm/lsm.h` | Added read operations to write_batch API and updated forward declarations |
| `src/v/lsm/lsm.cc` | Implemented write_batch read operations and refactored to use memtable instead of internal batch |
| `src/v/lsm/db/memtable.h` | Replaced apply() with put(), remove(), and merge() operations |
| `src/v/lsm/db/memtable.cc` | Implemented new memtable operations with proper sequence number validation |
| `src/v/lsm/db/impl.h` | Updated apply() and create_iterator() signatures to work with memtable |
| `src/v/lsm/db/impl.cc` | Implemented iterator creation with optional uncommitted memtable overlay |
| `src/v/lsm/core/internal/batch.h` | Removed file as functionality moved to memtable |
| `src/v/lsm/db/tests/*.cc` | Updated tests to use memtable directly and added read-your-own-writes test |
| `src/v/lsm/*/BUILD` | Updated build dependencies to remove batch and add memtable where needed |

@rockwotj rockwotj force-pushed the lsm_followup_memtable branch from c5ed900 to e79ff67 Compare December 4, 2025 22:26
Comment on lines +182 to +183
// The returned iterator must not be used after the write_batch is applied
// to the database.
Contributor Author

We can remove this limitation if we want, but it will be more expensive, because it will require sharing or copying the data from one memtable to the other.

Contributor

It doesn't feel like a big limitation to leave as is. I imagine this being useful to validate some DB conditions before applying to the DB, after which maybe most users wouldn't mind just initializing a new iterator?
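
For context on the iterator discussed in this thread, the following is a minimal sketch of iterating over pending writes merged with committed data. It is not the iterator from this PR: `toy_table` and `overlay_scan` are hypothetical, and the function eagerly copies results into a vector instead of returning a lazy iterator. A lazy version would hold references into the pending table, which is presumably why the real iterator must not outlive `apply()`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy type; a nullopt value is a tombstone.
using toy_table = std::map<std::string, std::optional<std::string>>;

// Two-way merge of sorted tables. On equal keys the pending entry wins,
// and tombstones are dropped from the output.
inline std::vector<std::pair<std::string, std::string>>
overlay_scan(const toy_table& pending, const toy_table& committed) {
    std::vector<std::pair<std::string, std::string>> out;
    auto p = pending.begin();
    auto c = committed.begin();
    while (p != pending.end() || c != committed.end()) {
        // Take from pending when committed is exhausted or pending's key
        // sorts first (or ties).
        const bool take_pending
          = c == committed.end()
            || (p != pending.end() && p->first <= c->first);
        if (take_pending) {
            if (c != committed.end() && c->first == p->first) {
                ++c; // skip the shadowed committed entry
            }
            if (p->second) {
                out.emplace_back(p->first, *p->second);
            }
            ++p;
        } else {
            if (c->second) {
                out.emplace_back(c->first, *c->second);
            }
            ++c;
        }
    }
    return out;
}
```

This matches the validation use case mentioned above: scan the merged view, check the conditions you care about, then apply the batch and create a fresh iterator if one is still needed.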

Collaborator

vbotbuildovich commented Dec 4, 2025

Retry command for Build#77365

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"lz4"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"gzip"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"snappy"}
tests/rptest/tests/compatibility/java_compression_test.py::JavaCompressionTest.test_upgrade_java_compression@{"compression_type":"zstd"}
tests/rptest/transactions/tx_upgrade_test.py::TxUpgradeCompactionTest.upgrade_with_compaction_test

@vbotbuildovich
Collaborator

CI test results

test results on build#77365
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "gzip"} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67a7-4dc2-87a2-be23cd4c361e | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "gzip"} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb93-23da-4b7f-b78e-ce2f348adb16 | FLAKY | 8/21 | upstream reliability is '88.20224719101124'. current run reliability is '38.095238095238095'. drift is 50.10701 and the allowed drift is set to 50. The test should FAIL | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "lz4"} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67a9-4a25-9f25-4195d037b450 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "lz4"} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb93-23db-4291-99e5-31476d14f190 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "snappy"} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67aa-450e-96f6-5842db431aa3 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "snappy"} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb93-23dd-4aa2-b46a-abbad47451d0 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "zstd"} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67ac-409a-9221-bc4ddd5ae9cf | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "zstd"} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb93-23de-446c-868a-6770121503ae | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| NodesDecommissioningTest | test_decommissioning_rebalancing_node | {"shutdown_decommissioned": true} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67ae-4aa6-9757-b9cecfeb33dc | FLAKY | 16/21 | upstream reliability is '92.44186046511628'. current run reliability is '76.19047619047619'. drift is 16.25138 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_rebalancing_node |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb93-23da-4b7f-b78e-ce2f348adb16 | FLAKY | 12/21 | upstream reliability is '95.30516431924883'. current run reliability is '57.14285714285714'. drift is 38.16231 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67ac-409a-9221-bc4ddd5ae9cf | FLAKY | 18/21 | upstream reliability is '88.71473354231975'. current run reliability is '85.71428571428571'. drift is 3.00045 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |
| TxUpgradeCompactionTest | upgrade_with_compaction_test | null | integration | https://buildkite.com/redpanda/redpanda/builds/77365#019aeb8d-67aa-450e-96f6-5842db431aa3 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxUpgradeCompactionTest&test_method=upgrade_with_compaction_test |

Contributor

@andrwng andrwng left a comment

Slick! LGTM

Contributor

andrwng commented Dec 5, 2025

Thinking through the train of thought from the original PR:

Just thinking about the db::impl but potentially replicated with Raft and using cloud storage, I'm thinking it makes sense to separate out the deterministic state and updates (moving _mem to _imm, swapping _imm for a SST, and setting new SSTs) from the background work, that maybe we'd drive on a leader with some different policies for when to flush.

This interface isn't quite what I was thinking, but I don't doubt it can still be useful. What I was getting at was exposing deterministic control over the memtable and the version_set, though maybe I need to shift away from that mental model to preserve the LSM public API

Contributor Author

rockwotj commented Dec 5, 2025

> Thinking through the train of thought from the original PR:
>
> Just thinking about the db::impl but potentially replicated with Raft and using cloud storage, I'm thinking it makes sense to separate out the deterministic state and updates (moving _mem to _imm, swapping _imm for a SST, and setting new SSTs) from the background work, that maybe we'd drive on a leader with some different policies for when to flush.
>
> This interface isn't quite what I was thinking, but I don't doubt it can still be useful. What I was getting at was exposing deterministic control over the memtable and the version_set, though maybe I need to shift away from that mental model to preserve the LSM public API

We can disable automatic flushing of memtables (for example, by setting the write buffer size to 0 or to the max size), then accumulate the writes externally first, and only then apply them and immediately flush. I guess if we set the write buffer size to max size_t, then we can just explicitly flush and you don't need this interface.
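
A minimal sketch of the "max write buffer size plus explicit flush" idea from this comment, under stated assumptions: `toy_db`, its constructor parameter, and its byte accounting are hypothetical and do not reflect the real LSM options. The toy also does not model treating 0 as a "disabled" sentinel.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy database; names are illustrative, not the lsm:: API.
class toy_db {
public:
    using table = std::map<std::string, std::string>;

    explicit toy_db(std::size_t write_buffer_size)
      : _write_buffer_size(write_buffer_size) {
        assert(write_buffer_size > 0);
    }

    void put(const std::string& key, const std::string& value) {
        _mem[key] = value;
        // Approximate accounting: overwrites are counted again.
        _mem_bytes += key.size() + value.size();
        // Automatic flush once the buffer threshold is reached.
        if (_mem_bytes >= _write_buffer_size) {
            flush();
        }
    }

    // Explicit flush: freeze the current memtable into a "flushed" table.
    void flush() {
        if (_mem.empty()) {
            return;
        }
        _flushed.push_back(std::move(_mem));
        _mem.clear(); // moved-from map is valid but unspecified; reset it
        _mem_bytes = 0;
    }

    std::size_t flushed_tables() const { return _flushed.size(); }
    std::size_t buffered_keys() const { return _mem.size(); }

private:
    std::size_t _write_buffer_size;
    std::size_t _mem_bytes = 0;
    table _mem;
    std::vector<table> _flushed;
};
```

Constructing the toy with `std::numeric_limits<std::size_t>::max()` means the threshold is never reached in practice, so flushes happen only when the caller asks for them.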

@rockwotj rockwotj merged commit 7d0abd5 into redpanda-data:dev Dec 6, 2025
16 of 20 checks passed
@rockwotj rockwotj deleted the lsm_followup_memtable branch December 6, 2025 00:19