
Conversation

v01dstar (Member) commented on Aug 10, 2024

This PR, together with #323, implements Titan's new GC solution: punch hole.

Titan currently implements two types of GC:

  • GC based on stats (triggered when the discardable ratio of a blob file reaches a certain threshold)
  • GC during RocksDB's compactions (a.k.a. level merge)

The first approach introduces write amplification, since it must update references inside RocksDB to reflect the blobs' new locations; because of this, the threshold has to be set relatively high to limit the impact.
The second approach adds no extra write amplification, but it can waste space: large portions of a blob file may already have been relocated, yet the file has not met the threshold to be entirely replaced.

Punch hole is a file system API that takes a file descriptor, a start offset, and a length, and removes the data from the file at that location. To the application, that part of the file reads as zeros. To the file system, any underlying storage blocks that are fully covered by the removed range can be reclaimed.

This project uses the punch hole API to delete blobs in place during GC, avoiding the need to update references inside RocksDB.
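
For concreteness, here is a minimal sketch of the underlying system call (Linux-specific; the PunchHole wrapper and its error handling are illustrative, not Titan's actual code):

// Requires a Linux file system with hole-punching support (e.g. ext4, XFS).
#include <fcntl.h>   // fallocate, FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE
#include <cstdio>    // perror

// Punch a hole over [offset, offset + length). Afterwards, reads of that
// range return zeros, the file's reported size is unchanged, and the file
// system may reclaim any storage blocks fully covered by the range.
bool PunchHole(int fd, off_t offset, off_t length) {
  // FALLOC_FL_KEEP_SIZE must accompany FALLOC_FL_PUNCH_HOLE, so st_size
  // stays the same while the covered blocks are deallocated.
  if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                offset, length) != 0) {
    std::perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
    return false;
  }
  return true;
}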

// The effective size of current file. This is different from `file_size_`, as
// `file_size_` is the original size of the file, and does not consider space
// reclaimed by punch hole GC.
// We can't use file system's `st_blocks` to get the logical size, because
Member

I think it's okay to use the file size as effective_file_size after restart. The size doesn't have to be that precise. It may produce a false positive for triggering punch hole GC, but it would be updated to the accurate number after the GC scan.
Then we can get rid of updating the manifest for effective_file_size.
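
A rough sketch of that suggestion (field and method names here are assumptions for illustration, not the actual BlobFileMeta API):

#include <cstdint>

// After restart, initialize the effective size to the on-disk size. This can
// over-estimate live data and trigger punch hole GC spuriously, but the next
// GC scan corrects it, so the manifest need not persist effective_file_size.
struct BlobFileMetaSketch {
  uint64_t file_size = 0;            // original on-disk file size
  uint64_t effective_file_size = 0;  // file_size minus hole-punched ranges

  void OnRestart() { effective_file_size = file_size; }
  void OnGcScanFinished(uint64_t live_bytes) { effective_file_size = live_bytes; }
};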

Member Author

Done

@hbisheng (Member) left a comment

It would be great to have a PR description to help future readers. The description could provide some background and highlight key aspects of the implementation, such as the introduction of the PunchHoleGCJob class and how the job is scheduled (when its snapshot becomes the oldest).

v01dstar force-pushed the punch-hole branch 2 times, most recently from 891fb24 to 1f0ecf1 on August 28, 2024 at 22:04
// record by adjusting iterate_offset_, otherwise (not a hole-punch record),
// we should break the loop and return the record, iterate_offset_ is
// already adjusted inside GetBlobRecord() in this case.
if (live || !status().ok()) return;
Member

Just trying to learn: in what cases will status() not be ok? Do we need to set valid_ to false in that case?

Member Author

For example, IO errors. In this case, we should not continue the loop.

As for setting valid_ to false, I don't think it is necessary, since valid() (the function, not the variable) already considers status_. So I didn't change the behavior (the current code does not set valid_ either when encountering IO errors).
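
A minimal sketch of the pattern being described (illustrative names, not the actual Titan iterator):

#include <string>

// Stand-in for rocksdb::Status, for illustration only.
struct Status {
  bool ok() const { return message.empty(); }
  std::string message;
};

// Because Valid() folds in status_, an IO error recorded in status_ already
// makes the iterator invalid; there is no need to also set valid_ = false.
class IteratorSketch {
 public:
  bool Valid() const { return valid_ && status_.ok(); }
  const Status& status() const { return status_; }

 private:
  bool valid_ = false;
  Status status_;
};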

// If not, we can just skip the next round of purging obsolete files.
db_->ReleaseSnapshot(snapshot);
{
MutexLock l(&mutex_);
Member Author

It is not wise to acquire a mutex here, since this is a hot path; almost all read requests go through it (Titan creates a ManagedSnapshot implicitly). I will refactor this with atomic operations, but in a separate PR.
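
A rough sketch of that direction (illustrative names; not the actual refactor):

#include <atomic>
#include <cstdint>

// Replace a mutex-protected counter on the snapshot hot path with a
// lock-free atomic, so creating or releasing a snapshot does not contend
// on mutex_.
class SnapshotCounterSketch {
 public:
  void OnSnapshotCreated() { pending_.fetch_add(1, std::memory_order_relaxed); }
  void OnSnapshotReleased() { pending_.fetch_sub(1, std::memory_order_acq_rel); }
  uint64_t Pending() const { return pending_.load(std::memory_order_acquire); }

 private:
  std::atomic<uint64_t> pending_{0};
};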

Signed-off-by: v01dstar <[email protected]>