-
Notifications
You must be signed in to change notification settings - Fork 704
lsm: support read-your-own-uncommitted-writes #28856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
A write batch is basically the same thing as a memtable. We want to allow iteration over a pending batch with the database contents. So in order to do this, just unify the write batch and memtable types.
Now that we have unified the memtable and write batch to a single structure, we can provide iteration and `get` semantics over the pending write batch to support use cases where you want to explicitly flush pending operations or provide some kind of transactional support with read-your-own-writes guarentees. For prior art, see the following in RocksDB: https://github.com/facebook/rocksdb/wiki/Write-Batch-With-Index#write-batch-with-index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the LSM tree implementation to support read-your-own-writes semantics for uncommitted data by replacing the internal write_batch class with direct use of memtable. This enables applications to read staged writes before committing them to the database.
Key changes:
- Replaced
lsm::internal::write_batchwithlsm::db::memtableas the primary write staging mechanism - Added read capabilities (
get()andcreate_iterator()) towrite_batchthat merge uncommitted writes with committed data - Modified
memtableto support directput()/remove()operations and amerge()operation for combining memtables
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/v/lsm/lsm.h | Added read operations to write_batch API and updated forward declarations |
| src/v/lsm/lsm.cc | Implemented write_batch read operations and refactored to use memtable instead of internal batch |
| src/v/lsm/db/memtable.h | Replaced apply() with put(), remove(), and merge() operations |
| src/v/lsm/db/memtable.cc | Implemented new memtable operations with proper sequence number validation |
| src/v/lsm/db/impl.h | Updated apply() and create_iterator() signatures to work with memtable |
| src/v/lsm/db/impl.cc | Implemented iterator creation with optional uncommitted memtable overlay |
| src/v/lsm/core/internal/batch.h | Removed file as functionality moved to memtable |
| src/v/lsm/db/tests/*.cc | Updated tests to use memtable directly and added read-your-own-writes test |
| src/v/lsm/*/BUILD | Updated build dependencies to remove batch and add memtable where needed |
c5ed900 to
e79ff67
Compare
| // The returned iterator must not be used after the write_batch is applied | ||
| // to the database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this limitation if we want, but it will be more expensive, because it will require sharing/copying the data from the memtable to the second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't feel like a big limitation to leave as is. I imagine this being useful to validate some DB conditions before applying to the DB, after which maybe most users wouldn't mind just initializing a new iterator?
Retry command for Build#77365please wait until all jobs are finished before running the slash command |
andrwng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slick! LGTM
| // The returned iterator must not be used after the write_batch is applied | ||
| // to the database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't feel like a big limitation to leave as is. I imagine this being useful to validate some DB conditions before applying to the DB, after which maybe most users wouldn't mind just initializing a new iterator?
|
Thinking through the train of thought from the original PR:
This interface isn't quite what I was thinking, but I don't doubt it can still be useful. What I was getting at was exposing deterministic control over the memtable and the version_set, though maybe I need to shift away from that mental model to preserve the LSM public API |
We can disable automatic flushing of memtables (we could for example set write buffer size to 0 or max size) then accumulate the writes externally first and only apply then immediately flush. I guess if we set the write buffer size to max size_t then we can just explicitly flush and you don't need this interface. |
See: https://github.com/facebook/rocksdb/wiki/Write-Batch-With-Index#write-batch-with-index
Backports Required
Release Notes