-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Dataflow Template for Firestore to Firestore Documents Copy #3098
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Dataflow Template for Firestore to Firestore Documents Copy #3098
Conversation
…APSHOT, cleanup flag ordering
Summary of ChangesHello @pacoavila808, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request delivers a new Google Cloud Dataflow template that enables the batch copying of documents from one Firestore database to another. It provides a flexible and robust solution for data migration or replication, allowing users to specify source and destination environments, select specific collections, and maintain data consistency. The template is designed for efficient execution, utilizing Firestore's native querying and batch writing features. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new Dataflow batch template for copying documents between two Firestore databases. The implementation is well-structured, using partitioned reads for scalability and breaking down the logic into reusable transforms. The accompanying unit tests for the transforms are thorough.
I've identified a critical issue where the pipeline will fail at submission if run with default autoscaling settings (maxNumWorkers=0). I've also found some areas for improvement regarding documentation, error handling, and making the code more robust and readable. Please see the detailed comments for suggestions.
...-to-firestore/src/main/java/com/google/cloud/teleport/v2/templates/FirestoreToFirestore.java
Outdated
Show resolved
Hide resolved
| /** | ||
| * Templates for streaming data from DataStream to MongoDB. | ||
| * | ||
| * <p>This package contains a template for streaming data from DataStream to MongoDB using the | ||
| * MongoDB IO connector. The template allows direct writing of BSON documents. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The package documentation appears to be copied from another template and is incorrect. It refers to 'DataStream to MongoDB' instead of describing the Firestore to Firestore template.
/**
* Templates for batch copying data from one Firestore database to another.
*
* <p>This package contains a template for copying documents between Firestore databases.
*/| groupName = "Destination", | ||
| order = 4, | ||
| description = "Destination Project ID", | ||
| helpText = "The destination project to write to. Defaults to the source project if not set", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The help text is missing a period at the end. This causes a formatting issue in the generated README.md file where the example text runs directly into the preceding sentence.
| helpText = "The destination project to write to. Defaults to the source project if not set", | |
| helpText = "The destination project to write to. Defaults to the source project if not set.", |
| ); | ||
| PartitionQueryRequest request = PartitionQueryRequest.newBuilder() | ||
| .setParent(DocumentRootName.of(projectId, databaseId).toString()) | ||
| .setPartitionCount(partitionCount - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic partitionCount - 1 is confusing. The partitionCount in PartitionQueryRequest is documented as 'The desired number of partitions to return'. Subtracting 1 here is not intuitive and makes the code prone to errors if the calling code doesn't ensure partitionCount is greater than 1.
Please add a comment explaining why 1 is subtracted. If there's no strong reason, consider removing the subtraction and adjusting the calling code to pass the desired number of partitions directly. This would make the logic clearer.
| String originalName = doc.getName(); | ||
| String path = originalName.substring(originalName.indexOf("/documents/") + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This string manipulation is brittle. If originalName does not contain /documents/, indexOf will return -1, and substring will be called with 0. This would use the entire originalName as the path, leading to a malformed destination document name.
It's safer to validate that /documents/ exists in the name before proceeding.
| String originalName = doc.getName(); | |
| String path = originalName.substring(originalName.indexOf("/documents/") + 1); | |
| String originalName = doc.getName(); | |
| int documentsPathIndex = originalName.indexOf("/documents/"); | |
| if (documentsPathIndex < 0) { | |
| throw new IllegalArgumentException("Invalid document name format: " + originalName); | |
| } | |
| String path = originalName.substring(documentsPathIndex + 1); |
| // or mocking of FirestoreIO's builders and transforms. | ||
| // This test just checks if the pipeline can be constructed without errors. | ||
| @Test | ||
| public void testPipelineConstruction() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is quite basic and mainly serves as a placeholder. It constructs a partial pipeline but doesn't run it or assert any behavior. While the comment acknowledges that fully testing FirestoreIO is complex, the test could be improved to provide more value.
Consider using the TestPipeline rule and running it to verify that the pipeline can be fully constructed with the provided options without errors. You could also use PAssert to verify the intermediate PCollections before they are passed to FirestoreIO, which would provide more confidence in the pipeline's logic.
…ort/v2/templates/FirestoreToFirestore.java Add default for maxNumWorkers Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Implements a basic Dataflow template for copying documents from one Firestore database to another.