Skip to content

Conversation

@ashvina
Copy link
Contributor

@ashvina ashvina commented Feb 17, 2025

Fixes #652

The main goal of this pull request is to establish a common structure for storage files and introduces an abstraction for all types of storage files associated to a table. This should facilitate future extensions and ensure stability of APIs. The abstraction will serve as the base for various file types such as data files, delete files, etc.

Different table formats support various storage files, and as these evolve, more file types are expected to be added. The abstraction provided by the base class is helpful for maintaining the stability of core XTable APIs while supporting extensibility. For instance if a new storage file for statistics is introduced, XTable code for detecting file changes between commits need not change.

While the APIs need not change, handling of new types will be needed in the source and target implementations. For e.g., a target may not support deletion type files. By default the target format should throw an error if a deletion file is detected and not handled. This may not be true for in all cases.

Additionally, minor refactoring has been performed to rename certain elements for better clarity and consistency.

This PR does not add any new functionality of change in behavior. Tests have been updated.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 18 out of 33 changed files in this pull request and generated no comments.

Files not reviewed (15)
  • xtable-api/src/test/java/org/apache/xtable/model/storage/TestDataFilesDiff.java: Evaluated as low risk
  • xtable-api/src/main/java/org/apache/xtable/model/stat/Range.java: Evaluated as low risk
  • xtable-core/src/main/java/org/apache/xtable/delta/DeltaPartitionExtractor.java: Evaluated as low risk
  • xtable-core/src/main/java/org/apache/xtable/hudi/HudiDataFileExtractor.java: Evaluated as low risk
  • xtable-api/src/main/java/org/apache/xtable/model/storage/FilesDiff.java: Evaluated as low risk
  • xtable-core/src/main/java/org/apache/xtable/delta/DeltaDataFileUpdatesExtractor.java: Evaluated as low risk
  • xtable-api/src/main/java/org/apache/xtable/spi/sync/ConversionTarget.java: Evaluated as low risk
  • xtable-api/src/main/java/org/apache/xtable/model/storage/PartitionFileGroup.java: Evaluated as low risk
  • xtable-api/src/main/java/org/apache/xtable/model/TableChange.java: Evaluated as low risk
  • xtable-api/src/test/java/org/apache/xtable/spi/extractor/TestExtractFromSource.java: Evaluated as low risk
  • xtable-core/src/main/java/org/apache/xtable/hudi/HudiConversionTarget.java: Evaluated as low risk
  • xtable-core/src/main/java/org/apache/xtable/delta/DeltaConversionTarget.java: Evaluated as low risk
  • xtable-core/src/main/java/org/apache/xtable/hudi/HudiFileStatsExtractor.java: Evaluated as low risk
  • xtable-api/src/test/java/org/apache/xtable/spi/sync/TestTableFormatSync.java: Evaluated as low risk
  • xtable-core/src/main/java/org/apache/xtable/hudi/BaseFileUpdatesExtractor.java: Evaluated as low risk

* @return the set of files that are added
*/
public static <P> FilesDiff<InternalDataFile, P> findNewAndRemovedFiles(
public static <P> FilesDiff<? extends InternalBaseFile, P> findNewAndRemovedFiles(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be less verbose to define some interface for InternalFile that we can use instead of this wildcard?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verbosity of generics doesn't seem like a problem to me; verbosity is something unavoidable in Java. Also, I don't think that would be the good use of an interface and it wouldn't be consistent. That said, there were places where the base class was sufficient, so I used it there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't it be a good use of an interface? Currently we are always casting to a specific instance type instead of leveraging the shared implementation of the base class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't it be a good use of an interface?

InternalFile is a data class. If the data class is simply a container for data and doesn't require any specific behavior or implementation, then using an interface is uncommon

Currently we are always casting to a specific instance

I wouldn't use call it always, but anyway.

Well currently none of the sources and target are handling any other kind of file type. There are pending PRs for that and new ones will be needed in the future.

instead of leveraging the shared implementation of the base class.

That will not happen even with interface. Each target and source will handle each type of file differently.

For now, I’m going to abandon this PR as I don’t see a clear path to resolving this issue. If anyone has suggestions or ideas on how to tackle this, I’m all ears!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this is not a reason to close the PR. I was just trying to understand where you were coming from with your approach so I could try to see where things were going in general with how we can handle the different types of files in the targets.

@ashvina ashvina force-pushed the 652-generic-internal-file-representation branch from bd3f38f to 6a21d14 Compare February 19, 2025 02:05
@ashvina
Copy link
Contributor Author

ashvina commented Feb 19, 2025

@the-other-tim-brown what do you think of name InternalStorageFile. Is it better than InternalBaseFile?

@ashvina ashvina closed this Feb 19, 2025
@ashvina ashvina reopened this Mar 3, 2025
@ashvina ashvina force-pushed the 652-generic-internal-file-representation branch from 6a21d14 to 66a2b86 Compare March 3, 2025 21:42
Copy link
Contributor

@the-other-tim-brown the-other-tim-brown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor nitpick but otherwise LGTM!

@ashvina ashvina force-pushed the 652-generic-internal-file-representation branch from 66a2b86 to 866ba84 Compare March 5, 2025 22:15
@ashvina ashvina merged commit c76692b into main Mar 5, 2025
2 checks passed
@ashvina ashvina deleted the 652-generic-internal-file-representation branch March 5, 2025 22:34
@vinishjail97 vinishjail97 mentioned this pull request Apr 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add an generic internal representation for all physical files

2 participants