Conversation

@Akanksha-kedia
Contributor

7cff2a4

Issue in PR #15473

What the PR Did (The Good Part)

The PR added a new deleteBatch() method to delete multiple files at once instead of one by one. This is much faster, especially for cloud storage like S3.

Example: Instead of deleting 100 files one at a time (100 API calls), you can delete them all together (1 API call).
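The pattern can be sketched as follows. This is an illustrative toy, not the real PinotFS interface; the `FileStore` name and simplified signatures are assumptions for the example. It shows why a default per-file loop exists and where an S3-style override would replace it with one bulk request.

```java
import java.net.URI;
import java.util.List;

// Illustrative sketch, not the real PinotFS interface: the default
// deleteBatch() falls back to one delete() call per file, while an
// S3-style implementation can override it with a single bulk request.
interface FileStore {
  boolean delete(URI uri) throws Exception;

  default boolean deleteBatch(List<URI> uris) throws Exception {
    boolean result = true;
    for (URI uri : uris) {
      result &= delete(uri); // N round trips unless overridden
    }
    return result;
  }
}

public class DeleteBatchDemo {
  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    FileStore perFileStore = uri -> {
      calls[0]++;
      return true;
    };
    perFileStore.deleteBatch(List.of(URI.create("s3://bucket/a"), URI.create("s3://bucket/b")));
    if (calls[0] != 2) {
      throw new AssertionError("expected one delete() call per file");
    }
    System.out.println("default deleteBatch issued " + calls[0] + " per-file calls");
  }
}
```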

The Problems We Found (What Was Missing)

Problem 1: Hadoop Filesystem Was Left Out

  • The PR added optimized deleteBatch() for S3 (Amazon's storage)
  • But Hadoop filesystem (HDFS) was forgotten - it still deletes files one by one
  • This means Hadoop users don't get the performance improvement

Analogy: It's like upgrading all cars to electric except the delivery trucks - they still run on old technology.

Problem 2: No Safety Check for Missing Files

  • The default deleteBatch() in BasePinotFS tries to delete files without checking if they exist first
  • If a file is already deleted or doesn't exist, it throws an error and stops
  • This can break the deletion process
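A minimal sketch of the missing safety check, using `java.nio.file` as a stand-in for the actual PinotFS API (the class name and use of `Files.deleteIfExists` are this example's assumptions, not Pinot code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SafeBatchDelete {
  // Skip files that are already gone instead of failing the whole batch.
  public static boolean deleteBatch(List<Path> paths) {
    boolean result = true;
    for (Path path : paths) {
      try {
        // deleteIfExists() returns false rather than throwing when the
        // file is already absent, so a missing file is not an error.
        Files.deleteIfExists(path);
      } catch (IOException e) {
        result = false; // a real I/O failure still fails the batch
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    Path existing = Files.createTempFile("pinot-segment", ".tmp");
    Path missing = existing.resolveSibling("no-such-segment.tmp");
    boolean ok = deleteBatch(List.of(existing, missing));
    if (!ok || Files.exists(existing)) {
      throw new AssertionError("batch delete should tolerate missing files");
    }
    System.out.println("deleteBatch succeeded despite a missing file");
  }
}
```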

@codecov-commenter

codecov-commenter commented Dec 1, 2025

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.27%. Comparing base (48fcf00) to head (4e5004e).
⚠️ Report is 33 commits behind head on master.

Files with missing lines | Patch % | Lines
...a/org/apache/pinot/spi/filesystem/BasePinotFS.java | 0.00% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17294      +/-   ##
============================================
+ Coverage     63.19%   63.27%   +0.07%     
- Complexity     1432     1474      +42     
============================================
  Files          3131     3135       +4     
  Lines        185838   186479     +641     
  Branches      28397    28496      +99     
============================================
+ Hits         117443   117995     +552     
- Misses        59333    59367      +34     
- Partials       9062     9117      +55     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.22% <0.00%> (+7.67%) ⬆️
java-21 63.25% <0.00%> (+0.08%) ⬆️
temurin 63.27% <0.00%> (+0.07%) ⬆️
unittests 63.27% <0.00%> (+0.07%) ⬆️
unittests1 55.68% <0.00%> (+0.07%) ⬆️
unittests2 33.91% <0.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown.

Contributor

Copilot AI left a comment


Pull request overview

This PR enhances batch deletion capabilities in Pinot's filesystem implementations by addressing two key issues from PR #15473: adding optimized batch deletion support for Hadoop filesystem and implementing safety checks for non-existent files during deletion operations.

  • Adds an optimized deleteBatch() method to HadoopPinotFS that handles directories and files efficiently
  • Implements existence checks before deletion in both BasePinotFS and HadoopPinotFS to prevent errors when files don't exist
  • Fixes a typo in error message ("direactory" → "directory")

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
pinot-spi/src/main/java/org/apache/pinot/spi/filesystem/BasePinotFS.java | Adds existence check before deletion to prevent FileNotFoundException
pinot-plugins/pinot-file-system/pinot-hdfs/src/main/java/org/apache/pinot/plugin/filesystem/HadoopPinotFS.java | Implements optimized batch deletion for Hadoop filesystem with recursive directory handling and corrects spelling error

Comment on lines +136 to +146
for (Path path : pathsToDelete) {
  try {
    if (!_hadoopFS.delete(path, true)) {
      LOGGER.warn("Failed to delete path: {}", path);
      result = false;
    }
  } catch (IOException e) {
    LOGGER.warn("Error deleting path: {}", path, e);
    result = false;
  }
}

Copilot AI Dec 1, 2025


The batch deletion implementation still deletes files one by one in a loop, negating the performance benefit of batch operations. Consider using Hadoop's bulk delete APIs if available, or at least parallelize the deletion operations using ExecutorService to improve performance for large batches.
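The ExecutorService suggestion can be sketched like this. It is a hedged example, not the PR's code: `deleteAll`, the fixed pool size of 8, and the `Predicate` parameter (standing in for a call like `_hadoopFS.delete(path, true)`) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

public class ParallelDelete {
  // Fan per-path deletions out over a thread pool so large batches avoid
  // paying N sequential round trips to the filesystem.
  public static boolean deleteAll(List<String> paths, Predicate<String> deleteOne)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(8);
    try {
      List<Future<Boolean>> futures = new ArrayList<>();
      for (String path : paths) {
        futures.add(pool.submit(() -> deleteOne.test(path)));
      }
      boolean result = true;
      for (Future<Boolean> future : futures) {
        try {
          result &= future.get();
        } catch (ExecutionException e) {
          result = false; // a throwing delete marks the batch as failed
        }
      }
      return result;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    boolean allOk = deleteAll(List.of("/a", "/b", "/c"), p -> true);
    boolean oneBad = deleteAll(List.of("/a", "/b"), p -> p.equals("/a"));
    if (!allOk || oneBad) {
      throw new AssertionError("unexpected batch results");
    }
    System.out.println("parallel deleteBatch aggregated results correctly");
  }
}
```

This preserves the original all-results aggregation (any single failure makes the batch return false) while overlapping the per-path latency.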

LOGGER.warn("Directory {} is not empty and forceDelete is false, skipping", segmentUri);
result = false;
continue;
}

Copilot AI Dec 1, 2025


When forceDelete is false and the directory is empty, the code skips collectFilesRecursively() but should still add the empty directory to pathsToDelete for deletion. Currently, empty directories are not deleted when forceDelete is false.

Suggested change
}
}
// Directory is empty, add it for deletion
pathsToDelete.add(path);
continue;

@Akanksha-kedia
Contributor Author

#17292 issues being solved

@Akanksha-kedia
Contributor Author

Akanksha-kedia commented Dec 4, 2025 via email

@Akanksha-kedia
Contributor Author

@abhishekbafna @Jackie-Jiang please review. Thanks!!!

Contributor

@Jackie-Jiang Jackie-Jiang left a comment


Thanks for the fix. We need to define the contract for delete() and deleteBatch() and make all PinotFS implementations follow the same contract.

boolean result = true;
for (URI segmentUri : segmentUris) {
  // Check if file exists before attempting deletion to avoid FileNotFoundException
  if (!exists(segmentUri)) {
Contributor


This makes the contract for deleteBatch() and delete() different.

We need to decide the behavior for the following scenarios:

  • For delete()
    • File exists
      • File is dir
        • Deleted
        • Not deleted
      • File is ordinary file
        • Deleted
        • Not deleted
    • File doesn't exist
  • For deleteBatch()
    • All files deleted
    • Some files deleted
    • No files deleted

@swaminathanmanish @KKcorps @abhishekbafna Can you please also comment here?

Collaborator


Going by the idempotent property, both a successful deletion and a missing file should return true; otherwise false.

For the batch, return true if all the files are deleted or already absent, otherwise false.
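That proposed contract can be encoded in a toy in-memory store (a sketch, not Pinot code; the class, its `Set`-backed state, and the method names are assumptions for illustration): delete() is idempotent, so a missing file still counts as removed, and deleteBatch() is true iff every entry ends up absent.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IdempotentContract {
  private final Set<String> _files = new HashSet<>();

  public void create(String path) {
    _files.add(path);
  }

  // Idempotent delete: removing a missing file still counts as success.
  public boolean delete(String path) {
    _files.remove(path);
    return true; // only a real I/O error (not modeled here) would be false
  }

  public boolean deleteBatch(Collection<String> paths) {
    boolean result = true;
    for (String path : paths) {
      result &= delete(path);
    }
    return result;
  }

  public static void main(String[] args) {
    IdempotentContract fs = new IdempotentContract();
    fs.create("segment-0");
    // "segment-1" never existed; under this contract the batch still succeeds.
    if (!fs.deleteBatch(List.of("segment-0", "segment-1"))) {
      throw new AssertionError("missing files should not fail the batch");
    }
    System.out.println("idempotent contract: batch returns true");
  }
}
```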

@Akanksha-kedia
Contributor Author

@Jackie-Jiang did we conclude on it?

@Akanksha-kedia
Contributor Author

@abhishekbafna @Jackie-Jiang @xiangfu0 please review and any comments
