@krishnamd-jkp
Contributor

calculate crc32c by default and pass checksum in trailing and per-chunk request

@krishnamd-jkp krishnamd-jkp requested review from a team as code owners October 22, 2025 12:13
@product-auto-label product-auto-label bot added the api: storage Issues related to the Cloud Storage API. label Oct 22, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @krishnamd-jkp, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant improvement to data integrity within the storage client by enabling automatic CRC32C checksum calculation for all write operations by default. This ensures that data is verified during transmission, reducing the risk of corruption. Additionally, it provides flexibility for advanced users by allowing them to disable this automatic checksumming if their specific use case requires it, without preventing the use of manually provided checksums.

Highlights

  • Default CRC32C Calculation: The system now automatically calculates CRC32C checksums for data being written by default, enhancing data integrity.
  • Checksum Transmission: Checksums are now consistently passed in both trailing requests and per-chunk requests during data writes.
  • Disable CRC32C Option: A new DisableCRC32C option has been introduced, allowing users to opt-out of the automatic CRC32C calculation and validation if needed, while still supporting user-provided checksums.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces default CRC32C checksum calculation for gRPC uploads, which is a great enhancement for data integrity. The implementation includes options to disable this behavior or provide user-defined checksums. The changes are well-structured and include new unit tests for the checksum logic. I have a few suggestions to improve code clarity and reduce duplication, which I've detailed in the comments.

@krishnamd-jkp krishnamd-jkp added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 22, 2025
@kokoro-team kokoro-team removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 22, 2025
@krishnamd-jkp krishnamd-jkp added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 22, 2025
@kokoro-team kokoro-team removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 22, 2025
@krishnamd-jkp
Contributor Author

Ad hoc benchmarking results:

Metrics for 10 files of 64 MB each, uploaded by 10 workers running in parallel, with a test execution time of 5 minutes.

Without warmup
release 1.57.0
Total throughput extrapolated to 100MB: 0.22 GiB/s
Median throughput extrapolated to 100MB: 0.26 GiB/s
Median upload time: 3.7661s
P10 upload time: 3.1413s
P90 upload time: 6.1835s

PR
Total throughput extrapolated to 100MB: 0.27 GiB/s
Median throughput extrapolated to 100MB: 0.24 GiB/s
Median upload time: 3.633s
P10 upload time: 2.894s
P90 upload time: 6.295s

With 5 min warmup
release 1.57.0
Total throughput extrapolated to 100MB: 0.29 GiB/s
Median throughput extrapolated to 100MB: 0.31 GiB/s
Median upload time: 3.1737s
P10 upload time: 2.8136s
P90 upload time: 4.3056s

PR
Total throughput extrapolated to 100MB: 0.25 GiB/s
Median throughput extrapolated to 100MB: 0.30 GiB/s
Median upload time: 3.2925s
P10 upload time: 2.8041s
P90 upload time: 5.9242s

Contributor

@tritone tritone left a comment

Couple comments... I would also suggest the following:

  • Take a look at #12477; do we want to support user-provided trailing checksums as well?
  • Can we do a CPU profile of the benchmark to understand the overhead of the automatic checksumming?

// point, the checksum will be ignored.
SendCRC32C bool

// DisableCRC32C disables the automatic CRC32C checksum calculation and
Contributor

Note that this only works for gRPC, and does not work for unfinalized writes to appendable objects. I would also rename it something like DisableAutoChecksum maybe?

Also, I think the docs for both this and SendCRC32C are a little unclear how they work together. I would want the user to understand:

  1. If they provide a checksum via SendCRC32C, we will not do checksum calculations and send their checksum on the first request to GCS.
  2. If they do not provide a checksum via SendCRC32C, do not set DisableCRC32C, and this is a finalized write via gRPC, we will automatically calculate the checksum and send it on the last message.
  3. If they set DisableCRC32C, we will never calculate or send a checksum.

Does that sound right? And does it match the actual behavior?

Contributor Author

Yeah, I think the name DisableAutoChecksum works better.
The behavior is:

  1. If DisableAutoChecksum is set, checksum calculation in the writer is disabled: the writer computes neither the chunk-wise checksums nor the full-object checksum. However, if the user configures their own checksum, it is sent on both the first and the last write.
  2. If DisableAutoChecksum is not set, the writer sends a chunk-wise checksum to GCS with each chunk. On the final write, the gRPC writer prioritizes the user's checksum over the auto-calculated one: if the user provided a checksum, the writer sends it to GCS; otherwise it sends the auto-calculated checksum.
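
The two cases above can be sketched as a small decision function. This is only an illustration of the behavior as described in this comment, not the code in the PR: `planChecksums` and its string results are hypothetical names, and the sketch covers only the per-chunk checksums and the final write.

```go
package main

import "fmt"

// planChecksums is a hypothetical helper, not the library's code. It reports
// whether auto-calculated per-chunk checksums are attached, and which
// whole-object checksum ("user", "auto" or "none") goes on the final write.
func planChecksums(disableAutoChecksum, userChecksum bool) (perChunk bool, final string) {
	perChunk = !disableAutoChecksum
	switch {
	case userChecksum:
		// A user-provided checksum always wins on the final write.
		final = "user"
	case !disableAutoChecksum:
		final = "auto"
	default:
		final = "none"
	}
	return perChunk, final
}

func main() {
	for _, disable := range []bool{false, true} {
		for _, user := range []bool{false, true} {
			perChunk, final := planChecksums(disable, user)
			fmt.Printf("DisableAutoChecksum=%-5v userChecksum=%-5v -> perChunk=%-5v final=%s\n",
				disable, user, perChunk, final)
		}
	}
}
```

Running it prints one line per combination of the two settings, which may be a convenient shape for the godoc wording as well.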

Contributor

Ah, then maybe we should disambiguate between per-message checksums and whole-object checksums? If this is already implemented in other clients, do they offer separate options for each of these, or just one?

Contributor Author

Can you check if this clarifies the behavior?

@krishnamd-jkp
Contributor Author

@tritone -

Couple comments... I would also suggest the following:

This PR handles trailing checksums as well. Please check "getObjectChecksums" method.
```go
if !finishWrite {
	return nil
}

// Send the user's checksum on the last write op if available.
if sendCRC32C {
	return toProtoChecksums(sendCRC32C, attrs)
}
```

@krishnamd-jkp
Contributor Author

CPU profile of the benchmark:

The profile was captured over a duration of 300.11 seconds, with a total of 137.21 seconds of CPU time sampled.

Top 5 CPU-consuming functions:

  • System calls (internal/runtime/syscall.Syscall6): 66.58s (48.52%)
  • Encryption (crypto/.../gcm.gcmAesEnc): 15.00s (10.93%)
  • Memory operations (runtime.memmove): 14.66s (10.68%)
  • Concurrency (runtime.futex): 10.08s (7.35%)
  • Checksumming (hash/crc32.castagnoliSSE42Triple): 8.38s (6.11%)
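
As a quick arithmetic check, each percentage above is the function's sampled seconds divided by the 137.21s of total sampled CPU time. The helper name `share` below is made up for this sketch; the figures are the ones listed above.

```go
package main

import "fmt"

// share returns seconds as a percentage of the total sampled CPU time.
func share(seconds, total float64) float64 {
	return seconds / total * 100
}

func main() {
	const total = 137.21 // total sampled CPU seconds from the profile
	samples := []struct {
		name    string
		seconds float64
	}{
		{"internal/runtime/syscall.Syscall6", 66.58},
		{"crypto/.../gcm.gcmAesEnc", 15.00},
		{"runtime.memmove", 14.66},
		{"runtime.futex", 10.08},
		{"hash/crc32.castagnoliSSE42Triple", 8.38},
	}
	for _, s := range samples {
		fmt.Printf("%-36s %6.2fs %6.2f%%\n", s.name, s.seconds, share(s.seconds, total))
	}
}
```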

@tritone
Contributor

tritone commented Oct 24, 2025

Cool, I would say this is around what I would have expected. We probably should note in the godoc that SDK auto checksumming has some amount of increased CPU overhead.

@krishnamd-jkp krishnamd-jkp added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 25, 2025
@kokoro-team kokoro-team removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 25, 2025
// checksum will be sent to GCS for validation by the gRPC writer on final write.
//
// Note: DisableAutoChecksum must be set to true BEFORE the first call to
// Writer.Write(). This flag Works only with gRPC writer.
Contributor

IMO this is still not super clear. We should communicate that automatic checksumming only works with gRPC, not specifically this flag. The godoc for SendCRC32C should probably be updated as well.

Where did we land on allowing the user to control chunk vs. whole-object checksums independently, versus only giving one flag? If we want to allow fine-grained control, we could make this field something like *AutoChecksumConfig with separate bools for per-message and per-object.

Also, remove random capitalized words in the godoc (Works, This).
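
For illustration only, the *AutoChecksumConfig idea could look roughly like the sketch below. The type is not part of the library; the field names, the `enabled` method, and the choice that a nil config means "both on" are assumptions made up for this example.

```go
package main

import "fmt"

// AutoChecksumConfig is a hypothetical type sketching the suggestion above.
// It does not exist in the storage package.
type AutoChecksumConfig struct {
	DisablePerMessage bool // skip auto CRC32C on each message
	DisablePerObject  bool // skip auto CRC32C for the whole object
}

// enabled reports which automatic checksums are on. A nil config keeps both
// on, so the zero-configuration default is full checksumming.
func (c *AutoChecksumConfig) enabled() (perMessage, perObject bool) {
	if c == nil {
		return true, true
	}
	return !c.DisablePerMessage, !c.DisablePerObject
}

func main() {
	var defaults *AutoChecksumConfig
	fmt.Println(defaults.enabled())

	onlyObject := &AutoChecksumConfig{DisablePerMessage: true}
	fmt.Println(onlyObject.enabled())
}
```

Using "disable" flags keeps the zero value equal to the default behavior, at the cost of a second knob that users would have to reason about.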

Contributor Author

@krishnamd-jkp krishnamd-jkp Oct 31, 2025

I think giving users a fine-grained option to disable per-chunk and whole-object checksums individually would confuse them, given that both are on by default. I made some changes that I think clarify this a bit.

case w.writesChan <- cmd:
// update fullObjectChecksum on every write and send it on finalWrite
if !w.disableAutoChecksum {
w.fullObjectChecksum = crc32.Update(w.fullObjectChecksum, crc32cTable, p)
Contributor

Just wanted to note that you are doing the work twice here for each set of bytes between here and L945. In theory it should be possible to calculate the per-message checksum once per buffer and then use those sums to update the full object checksum as well. It doesn't look like there is an easy interface to do this with in Go, but maybe worth considering if you are trying to save CPU.
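
A minimal sketch of the double work, using only the standard hash/crc32 package: every chunk is hashed once for its per-message checksum and once more to advance the running full-object checksum. The function names here are invented for the example and are not the writer's actual code.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

var crc32cTable = crc32.MakeTable(crc32.Castagnoli)

// crc32c returns the one-shot CRC32C of data.
func crc32c(data []byte) uint32 {
	return crc32.Checksum(data, crc32cTable)
}

// checksumChunks returns the per-chunk CRC32Cs and the running full-object
// CRC32C. Note that each chunk p is read twice.
func checksumChunks(chunks [][]byte) (perChunk []uint32, full uint32) {
	for _, p := range chunks {
		perChunk = append(perChunk, crc32c(p))    // first pass over p
		full = crc32.Update(full, crc32cTable, p) // second pass over p
	}
	return perChunk, full
}

func main() {
	chunks := [][]byte{[]byte("hello, "), []byte("cloud "), []byte("storage")}
	perChunk, full := checksumChunks(chunks)
	whole := crc32c([]byte("hello, cloud storage"))
	fmt.Printf("per-chunk: %08x\n", perChunk)
	fmt.Printf("running: %08x one-shot: %08x equal: %v\n", full, whole, full == whole)
}
```

The running value matches the one-shot checksum of the concatenated bytes, which is what makes it usable as the full-object checksum on the final write.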

Contributor Author

In case of no retries, yes, it does seem like the same calculation is being done twice. But in case of retries, the buffer in this line and the buffer in L945 will be out of sync. And I cannot use the checksum in L945 to update the global checksum because we could be using same bytes multiple times in case of retries. So I had to separate these two computations.
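
The retry argument can be sketched with a toy simulation. It is an assumption-laden illustration rather than the writer's real send path: `simulate` and its `attempts` parameter are hypothetical, and the sketch models only the "same bytes sent multiple times" part of the problem, not the buffers drifting out of sync.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

var crc32cTable = crc32.MakeTable(crc32.Castagnoli)

// objectCRC32C returns the one-shot CRC32C of data.
func objectCRC32C(data []byte) uint32 {
	return crc32.Checksum(data, crc32cTable)
}

// simulate feeds chunks through a toy writer. attempts[i] is how many times
// chunk i is sent (1 means no retry). It returns the checksum updated once
// per accepted write, and the checksum obtained by folding every send,
// retries included, into a running value.
func simulate(chunks [][]byte, attempts []int) (oncePerWrite, foldedPerSend uint32) {
	for i, p := range chunks {
		oncePerWrite = crc32.Update(oncePerWrite, crc32cTable, p)
		for n := 0; n < attempts[i]; n++ {
			// Folding p in at send time counts retried bytes again.
			foldedPerSend = crc32.Update(foldedPerSend, crc32cTable, p)
		}
	}
	return oncePerWrite, foldedPerSend
}

func main() {
	chunks := [][]byte{[]byte("hello "), []byte("world")}
	want := objectCRC32C([]byte("hello world"))

	once, folded := simulate(chunks, []int{1, 1})
	fmt.Printf("no retries:   once=%08x folded=%08x want=%08x\n", once, folded, want)

	once, folded = simulate(chunks, []int{2, 1}) // first chunk is sent twice
	fmt.Printf("with a retry: once=%08x folded=%08x want=%08x\n", once, folded, want)
}
```

With a retry, the value folded at send time hashes the first chunk twice, so it generally no longer describes the object, while the value updated once per write still equals the one-shot checksum.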

cpriti-os previously approved these changes Nov 5, 2025
@cpriti-os cpriti-os requested a review from tritone November 10, 2025 04:22
@krishnamd-jkp krishnamd-jkp merged commit 2ab1c77 into googleapis:main Nov 18, 2025
9 of 10 checks passed
krishnamd-jkp added a commit that referenced this pull request Dec 5, 2025
PR created by the Librarian CLI to initialize a release. Merging this PR
will auto trigger a release.

Librarian Version: v0.7.0
Language Image:
us-central1-docker.pkg.dev/cloud-sdk-librarian-prod/images-prod/librarian-go@sha256:718167d5c23ed389b41f617b3a00ac839bdd938a6bd2d48ae0c2f1fa51ab1c3d
<details><summary>storage: 1.58.0</summary>

## [1.58.0](storage/v1.57.2...storage/v1.58.0) (2025-12-03)

### Features

* add object contexts in Go GCS SDK (#13390)
([079c4d9](079c4d96))

* calculate crc32c by default and pass checksum in trailing and
per-chunk request (#13205)
([2ab1c77](2ab1c778))

* add support for partial success in ListBuckets (#13320)
([d91e47f](d91e47f2))

### Bug Fixes

* omit empty filter in http list object request (#13434)
([377eb13](377eb13b))

</details>

---------

Co-authored-by: Priti Chattopadhyay <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>