-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat(storage): calculate crc32c by default and pass checksum in trailing and per-chunk request #13205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @krishnamd-jkp, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant improvement to data integrity within the storage client by enabling automatic CRC32C checksum calculation for all write operations by default. This ensures that data is verified during transmission, reducing the risk of corruption. Additionally, it provides flexibility for advanced users by allowing them to disable this automatic checksumming if their specific use case requires it, without preventing the use of manually provided checksums. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces default CRC32C checksum calculation for gRPC uploads, which is a great enhancement for data integrity. The implementation includes options to disable this behavior or provide user-defined checksums. The changes are well-structured and include new unit tests for the checksum logic. I have a few suggestions to improve code clarity and reduce duplication, which I've detailed in the comments.
Adhoc benchmarking results:Metrics for 10 files 64MB each with 10 workers running parallel with test execution of 5 minutesWithout warmup PR With 5 min warmup PR |
tritone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple comments... I would also suggest the following:
- Take a look at #12477; do we want to support user-provided trailing checksums as well?
- Can we do a CPU profile of the benchmark to understand the overhead of the automatic checksumming?
storage/writer.go
Outdated
| // point, the checksum will be ignored. | ||
| SendCRC32C bool | ||
|
|
||
| // DisableCRC32C disables the automatic CRC32C checksum calculation and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this only works for gRPC, and does not work for unfinalized writes to appendable objects. I would also rename it something like DisableAutoChecksum maybe?
Also, I think the docs for both this and SendCRC32C are a little unclear how they work together. I would want the user to understand:
- If they provide a checksum via SendCRC32C, we will not do checksum calculations and send their checksum on the first request to GCS.
- If they do not provide a checksum via SendCRC32C, do not set DisableCRC32C, and this is a finalized write via gRPC, we will automatically calculate the checksum and send it on the last message.
- If they set DisableCRC32C we will never calculate or send a checksum
Does that sound right? And does it match the actual behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think DisableAutoChecksum name works better.
The behavior is -
- If DisableAutoChecksum is set, checksum calculation in the writer is disabled i.e., both chunk-wise checksum will be disabled and full object checksum calculation is also disabled in the writer. However, if user configures their checksum, it will be sent on both first and last write
- If DisableAutoChecksum is not set, chunk-wise calculation is sent to GCS by the writer. On the final write, the grpc writer prioritizes user's checksum over auto calculated checksum. So on the last write, if user's checksum is provided, writer sends user's checksum to GCS. If user doesn't specify any checksum, auto calculated checksum will be sent to GCS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, then maybe we should disambiguate between per-message checksums and whole-object checksums? If this is already implemented in other clients, do they offer separate options for each of these, or just one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check if this clarifies the behavior?
|
@tritone -
This PR handles trailing checksums as well. Please check "getObjectChecksums" method. ` |
|
CPU profile benchmarking - The profile was captured over a duration of 300.11 seconds, with a total of 137.21 seconds of CPU time sampled. Top 5 CPU consuming functions -
|
Cool, I would say this is around what I would have expected. We probably should note in the godoc that SDK auto checksumming has some amount of increased CPU overhead. |
storage/writer.go
Outdated
| // checksum will be sent to GCS for validation by the gRPC writer on final write. | ||
| // | ||
| // Note: DisableAutoChecksum must be set to true BEFORE the first call to | ||
| // Writer.Write(). This flag Works only with gRPC writer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is still not super clear. We should communicate that automatic checksumming only works with gRPC, not specifically this flag. The godoc for SendCRC32C should probably be updated as well.
Where did we land on allowing the user to control chunk vs whole object checksums independently only giving one flag? If we want to allow fine-grained control, we could make this field something like *AutoChecksumConfig with separate bools for per-message and per-object.
Also, remove random capitalized words in the godoc (Works, This).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think giving the user an fine-grained option to disable checksum per-chunk and whole object individually would confuse the users given these are default settings. Made some changes. I think it clarifies this a bit.
| case w.writesChan <- cmd: | ||
| // update fullObjectChecksum on every write and send it on finalWrite | ||
| if !w.disableAutoChecksum { | ||
| w.fullObjectChecksum = crc32.Update(w.fullObjectChecksum, crc32cTable, p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to note that you are doing the work twice here for each set of bytes between here and L945. In theory it should be possible to calculate the per-message checksum once per buffer and then use those sums to update the full object checksum as well. It doesn't look like there is an easy interface to do this with in Go, but maybe worth considering if you are trying to save CPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of no retries, yes, it does seem like the same calculation is being done twice. But in case of retries, the buffer in this line and the buffer in L945 will be out of sync. And I cannot use the checksum in L945 to update the global checksum because we could be using same bytes multiple times in case of retries. So I had to separate these two computations.
ccb7bf2 to
17771dd
Compare
PR created by the Librarian CLI to initialize a release. Merging this PR will auto trigger a release. Librarian Version: v0.7.0 Language Image: us-central1-docker.pkg.dev/cloud-sdk-librarian-prod/images-prod/librarian-go@sha256:718167d5c23ed389b41f617b3a00ac839bdd938a6bd2d48ae0c2f1fa51ab1c3d <details><summary>storage: 1.58.0</summary> ## [1.58.0](storage/v1.57.2...storage/v1.58.0) (2025-12-03) ### Features * add object contexts in Go GCS SDK (#13390) ([079c4d9](079c4d96)) * calculate crc32c by default and pass checksum in trailing and per-chunk request (#13205) ([2ab1c77](2ab1c778)) * add support for partial success in ListBuckets (#13320) ([d91e47f](d91e47f2)) ### Bug Fixes * omit empty filter in http list object request (#13434) ([377eb13](377eb13b)) </details> --------- Co-authored-by: Priti Chattopadhyay <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
calculate crc32c by default and pass checksum in trailing and per-chunk request