-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat(storage): send trailing checksums for gRPC resumable uploads #12477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(storage): send trailing checksums for gRPC resumable uploads #12477
Conversation
|
Thanks very much for the contribution. We're in the middle of a large refactor of the gRPC write flow in #12422 . I think we'll try and merge that first and then rebase this change on top if that's okay? |
| stream storagepb.Storage_BidiWriteObjectClient | ||
| flushOffset int64 | ||
| settings *settings | ||
| writer *gRPCWriter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd probably want to give a callback for storage.Writer to set the checksum rather than piping this through as a field.
Also, we should update the docs on storage.Writer to clarify the new contract for providing a checksum.
Thanks for the update and context! I'll keep an eye on that PR and rebase + address comments once that's merged. |
Trailing checksums are most scalable way to perform integrity checks of large uploads. Without them, users are left with two not great options: 1. Pre-calculate the checksum and include it in the initial request. This requires two passes over the data, or an impractical amount of in-memory buffering. 2. Send the object without an initial checksum, but calculate it client-side as data passes through the writer. Then, after the upload is complete, compare the client and server-generated checksums, and try to delete the object or create a new one. This is inherently brittle and may require users to maintain external state of "validated" objects. The GCS gRPC API [supports] setting the checksum on the last request in a resumable upload, but the Go client hadn't supported this so far. This PR modifies the gRPC writer to support trailing checksums. This PR modifies the behavior of `StartResumableWriteRequest` such that it ignores the checksum entirely, and we only ever set checksums in the last message. With the caveat that I haven't modified this code before, this seemed acceptable given the options in the docs: > May only be provided in the first or last request (either with > first_message, or finish_write set). The new integration test fails on `main` but succeeds on this branch. There were some other failures on this branch but they seemed related to me not configuring some test parameters (like KMS) or org policy restrictions. Trailing checksums would greatly simplify a number of our GCS use cases, so I'd love any feedback or changes to get this merged. [supports]: https://cloud.google.com/storage/docs/reference/rpc/storage-operations/google.storage.v2#writeobjectrequest
e115cdb to
4506b66
Compare
|
@schallert I noticed you put this back in draft; is there more that needs to be done before this is reviewable? |
@tritone thanks for checking in. I did an initial rebase after #12422, but haven't had a time to test this more thoroughly since. Once I do I'll mark it for review. |
|
Note that we are also working on automatic trailing checksums in the SDK; see #13205 |
|
Hi @schallert we are closing this PR as there is a PR which got merged #13205. This PR addresses some the comments made here and also adds default auto checksumming. Please let us know if you have any concerns |
Trailing checksums are most scalable way to perform integrity checks of
large uploads. Without them, users are left with two not great options:
Pre-calculate the checksum and include it in the initial request.
This requires two passes over the data, or an impractical amount of
in-memory buffering.
Send the object without an initial checksum, but calculate it
client-side as data passes through the writer. Then, after the upload is
complete, compare the client and server-generated checksums, and try to
delete the object or create a new one. This is inherently brittle and
may require users to maintain external state of "validated" objects.
The GCS gRPC API supports setting the checksum on the last request in
a resumable upload, but the Go client hadn't supported this so far. This
PR modifies the gRPC writer to support trailing checksums.
This PR modifies the behavior of
StartResumableWriteRequestsuch thatit ignores the checksum entirely, and we only ever set checksums in the
last message. With the caveat that I haven't modified this code before,
this seemed acceptable given the options in the docs:
The new integration test fails on
mainbut succeeds on this branch.There were some other failures on this branch but they seemed related to
me not configuring some test parameters (like KMS) or org policy
restrictions.
Trailing checksums would greatly simplify a number of our GCS use cases,
so I'd love any feedback or changes to get this merged.
Fixes #12474.