
Conversation

@pacoavila808

Implements a basic Dataflow template for copying documents from one Firestore database to another.

@gemini-code-assist

Summary of Changes

Hello @pacoavila808, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a new Google Cloud Dataflow template that enables the batch copying of documents from one Firestore database to another. It provides a flexible and robust solution for data migration or replication, allowing users to specify source and destination environments, select specific collections, and maintain data consistency. The template is designed for efficient execution, utilizing Firestore's native querying and batch writing features.

Highlights

  • New Dataflow Template: Introduces a new Google Cloud Dataflow template specifically designed for copying documents between two Firestore databases.
  • Configurable Parameters: The template supports various parameters, including source and destination project IDs, database IDs, and optional collection IDs for selective replication. It also allows specifying a read time for consistent data snapshots.
  • Efficient Data Transfer: The pipeline leverages Firestore's partition query capabilities to efficiently read documents in parallel and prepares them for batch writing to the destination, ensuring data consistency at the end of the process.
  • Comprehensive Documentation: Includes a detailed README with instructions for building, staging, and running the template, along with guidance for Terraform integration.
  • Modular Design: The core logic is broken down into dedicated PTransform and DoFn classes for creating partition queries, extracting documents, and preparing write operations.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@pacoavila808
Author

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new Dataflow batch template for copying documents between two Firestore databases. The implementation is well-structured, using partitioned reads for scalability and breaking down the logic into reusable transforms. The accompanying unit tests for the transforms are thorough.

I've identified a critical issue where the pipeline will fail at submission if run with default autoscaling settings (maxNumWorkers=0). I've also found some areas for improvement regarding documentation, error handling, and making the code more robust and readable. Please see the detailed comments for suggestions.

Comment on lines 17 to 22
/**
* Templates for streaming data from DataStream to MongoDB.
*
* <p>This package contains a template for streaming data from DataStream to MongoDB using the
* MongoDB IO connector. The template allows direct writing of BSON documents.
*/


high

The package documentation appears to be copied from another template and is incorrect. It refers to 'DataStream to MongoDB' instead of describing the Firestore to Firestore template.

/**
 * Templates for batch copying data from one Firestore database to another.
 *
 * <p>This package contains a template for copying documents between Firestore databases.
 */

groupName = "Destination",
order = 4,
description = "Destination Project ID",
helpText = "The destination project to write to. Defaults to the source project if not set",


medium

The help text is missing a period at the end. This causes a formatting issue in the generated README.md file where the example text runs directly into the preceding sentence.

Suggested change
- helpText = "The destination project to write to. Defaults to the source project if not set",
+ helpText = "The destination project to write to. Defaults to the source project if not set.",

);
PartitionQueryRequest request = PartitionQueryRequest.newBuilder()
.setParent(DocumentRootName.of(projectId, databaseId).toString())
.setPartitionCount(partitionCount - 1)


medium

The logic partitionCount - 1 is confusing. The partitionCount in PartitionQueryRequest is documented as 'The desired number of partitions to return'. Subtracting 1 here is not intuitive and makes the code prone to errors if the calling code doesn't ensure partitionCount is greater than 1.

Please add a comment explaining why 1 is subtracted. If there's no strong reason, consider removing the subtraction and adjusting the calling code to pass the desired number of partitions directly. This would make the logic clearer.
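If the subtraction is intentional, it is probably because a partition query returns cursors, and N cursors split the document space into N + 1 contiguous ranges, so a caller wanting `partitionCount` ranges should request one fewer cursor. A minimal plain-Java sketch of that arithmetic (hypothetical helper name; not the template's actual code, and no Firestore dependencies):

```java
public class PartitionMath {
    /**
     * A partition query that returns {@code cursorCount} cursors divides the
     * document space into {@code cursorCount + 1} contiguous ranges. So to end
     * up with {@code desiredPartitions} ranges, request one fewer cursor.
     */
    static long cursorsFor(int desiredPartitions) {
        if (desiredPartitions < 1) {
            throw new IllegalArgumentException("desiredPartitions must be >= 1");
        }
        return desiredPartitions - 1L;
    }

    public static void main(String[] args) {
        long cursors = cursorsFor(8);
        System.out.println(cursors + " cursors -> " + (cursors + 1) + " partitions");
    }
}
```

Capturing this relationship in a named helper (or at least a comment) would make the `partitionCount - 1` intent explicit at the call site.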

Comment on lines 37 to 38
String originalName = doc.getName();
String path = originalName.substring(originalName.indexOf("/documents/") + 1);


medium

This string manipulation is brittle. If originalName does not contain /documents/, indexOf will return -1, and substring will be called with 0. This would use the entire originalName as the path, leading to a malformed destination document name.

It's safer to validate that /documents/ exists in the name before proceeding.

Suggested change
- String originalName = doc.getName();
- String path = originalName.substring(originalName.indexOf("/documents/") + 1);
+ String originalName = doc.getName();
+ int documentsPathIndex = originalName.indexOf("/documents/");
+ if (documentsPathIndex < 0) {
+   throw new IllegalArgumentException("Invalid document name format: " + originalName);
+ }
+ String path = originalName.substring(documentsPathIndex + 1);
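To make the failure mode concrete, here is a self-contained sketch of the validated extraction (hypothetical class and method names, plain Java, no Firestore dependencies), showing that a well-formed resource name yields the expected relative path while a malformed one fails fast:

```java
public class DocumentPathExtractor {
    /**
     * Extracts the path starting at the "documents" segment, e.g.
     * "projects/p/databases/d/documents/users/alice" -> "documents/users/alice".
     * Throws instead of silently producing a malformed destination name.
     */
    static String extractPath(String fullName) {
        int documentsPathIndex = fullName.indexOf("/documents/");
        if (documentsPathIndex < 0) {
            throw new IllegalArgumentException("Invalid document name format: " + fullName);
        }
        return fullName.substring(documentsPathIndex + 1);
    }

    public static void main(String[] args) {
        System.out.println(extractPath("projects/p/databases/d/documents/users/alice"));
    }
}
```

With the unguarded `indexOf(...) + 1`, a name lacking "/documents/" would silently become `fullName.substring(0)`, i.e. the entire source name, which is exactly the malformed-destination bug described above.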

// or mocking of FirestoreIO's builders and transforms.
// This test just checks if the pipeline can be constructed without errors.
@Test
public void testPipelineConstruction() {


medium

This test is quite basic and mainly serves as a placeholder. It constructs a partial pipeline but doesn't run it or assert any behavior. While the comment acknowledges that fully testing FirestoreIO is complex, the test could be improved to provide more value.

Consider using the TestPipeline rule and running it to verify that the pipeline can be fully constructed with the provided options without errors. You could also use PAssert to verify the intermediate PCollections before they are passed to FirestoreIO, which would provide more confidence in the pipeline's logic.
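A rough sketch of what such a test could look like (hypothetical class name and a stand-in transform; the real test would apply the template's actual PTransform and would need the Beam SDK and JUnit on the classpath):

```java
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.junit.Rule;
import org.junit.Test;

public class FirestoreToFirestoreTransformTest {

  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void intermediatePCollectionHasExpectedContents() {
    PCollection<String> paths =
        pipeline
            .apply(Create.of("projects/p/databases/d/documents/users/alice"))
            // Stand-in for the template's real transform (e.g. the DoFn that
            // rewrites document names); replace with the actual PTransform.
            .apply(MapElements.into(TypeDescriptors.strings())
                .via(name -> name.substring(name.indexOf("/documents/") + 1)));

    // Assert on the intermediate PCollection before it reaches FirestoreIO.
    PAssert.that(paths).containsInAnyOrder("documents/users/alice");
    pipeline.run().waitUntilFinish();
  }
}
```

Running the pipeline on the direct runner this way exercises the transform logic end to end without needing a live Firestore instance.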
