Workaround for batch size search on xpu devices #4513
Merged
kprokofi
merged 4 commits into
open-edge-platform:release/2.5
from
kprokofi:kp/workaround_xpu
Aug 12, 2025
See comment: #4508 (comment)
sovrasov
approved these changes
Aug 12, 2025
73ea1b0
into
open-edge-platform:release/2.5
6 of 14 checks passed
kprokofi
added a commit
that referenced
this pull request
Aug 19, 2025
* Merge develop to release/2.5 (#4432) * Update demo requirements (#4421) Fix demo requirements * Cleanup Geti task templates for anomaly task (#4420) * Remove sub task templates for anomaly * Move anomaly classification templates one level up * Update model_template_id for PADIM and STFPM anomaly templates * Restore Engine (#4430) Restore engine.py Signed-off-by: Ashwin Vaidya <[email protected]> --------- Signed-off-by: Ashwin Vaidya <[email protected]> Co-authored-by: Vladislav Sovrasov <[email protected]> Co-authored-by: Rajesh Gangireddy <[email protected]> Co-authored-by: Ashwin Vaidya <[email protected]> * Support OVAnomaly in OVEngine (#4436) * fix anomaly model * almost refactored * refactor AnomalyOV * Add MaskRCNN v2 Rotated Detection task via Instance Segmentation (#4437) * ✨ Add Rotated MaskRCNN v2 model implementation and configuration files * fix: ensure newline at end of file in rotated_det.py * fix: reorder imports and improve error message in convert_masks_to_rotated_predictions * Update src/otx/backend/native/models/instance_segmentation/rotated_det.py Co-authored-by: Ashwin Vaidya <[email protected]> --------- Co-authored-by: Ashwin Vaidya <[email protected]> * Benchmark Refactor for 2.5 (#4435) * Refactor benchmark criteria in performance tests to remove redundant metrics and add GPU memory tracking * Refactor OVEngine logging and streamline benchmark task handling * Refactor dataset info entries to remove unnecessary extra_overrides in performance benchmark tests * Remove performance benchmark tests for anomaly detection, classification, instance segmentation, keypoint detection, semantic segmentation, and tiling instance segmentation. These tests included various model and dataset configurations along with benchmark criteria for performance evaluation. 
* Remove performance benchmark workflow configuration * Refactor benchmark.py to streamline engine initialization and remove unnecessary extra_kwargs handling * Refactor engine initialization in Benchmark class to return engine directly from configuration * Fix end time initialization in IterationTimer to ensure proper timing for each phase * Refactor test assertions in TestIterationTimer to simplify data_time logging checks for batch_idx * Fix model name in MODEL_TEST_CASES for keypoint detection benchmark * Fix kp detection metric name * Update documentation for 2.5 release (#4447) * update documentation * change to additional feature * added edits to the documentation * delete product design * change README * small fix * Provide XPU workarounds (release/2.5) (#4464) * Provide workarounds for the XPU training (#4441) * provide XPU workarounds * add note section to the installation * Update __init__.py * 🐞 Fix 0 image scores in Anomaly OV model (#4469) Bugfix Signed-off-by: Ashwin Vaidya <[email protected]> * Fix regression on release 2.5 (#4468) Update adaptive early stopping configuration across multiple detection and segmentation recipes * Improve EarlyStoppingWithWarmup docs and set check_on_train_epoch_end to False as default (#4473) * Enhance EarlyStoppingWithWarmup functionality and add unit tests - Set default value for check_on_train_epoch_end to False in EarlyStoppingWithWarmup. * Fix formatter * Introduce Classification Factory and Simplify Model Imports (#4456) * add factory for classficaiton * add mising files * minor * fix imports * fix imports in tests 2 * fix ruff * fix unit test * update factory. 
Reply comments * add literal to other backbones * 🐞 Benchmark fixes for 2.5 (#4471) Bug fixes - Max epochs in train overrides the max_epochs value loaded from config when creating the engine - Other fixes for benchmarking script Signed-off-by: Ashwin Vaidya <[email protected]> * Merge 2.4.6 to 2.5.0 (Fix checkpoint loading update) (#4475) * merge changes * fix linter * fix readme * update modules mock * fix unit test * fix tox * create context manager * add snapshot for anomaly * add hlabel snapshot test * minor fix * fix changelog * fix linter * Update ConfigConverter for Geti2.12 (#4477) * add factory for classficaiton * add mising files * minor * fix imports * fix imports in tests 2 * fix ruff * fix unit test * fix paths * change converter * add configurable augmentation and input size * temporary fix * update ConfigConverter: * fix linter * update unit test for ConfigConverter * change integration tests * add missing file * fix unit test * delete templates * update changelog * update recipe * fix linter * return templates back * (release/2.5) Remove duplicate explain() method and consolidate XAI functionality into predict() (#4493) * Refactor XAI utilities and remove deprecated explain method * Fix XPU training and optimization from Geti2.5 (#4486) * apply fix to run xpu, change from_config * fix typing' * add example * fix xai test * fix linte * fix auto batch size for XPU * return max_epochs for atss * add kwargs override for OTXEngine.from_config() * use cache instead * return train kwargs back * minor fixes| * reply comments * Fix overriding train parameters (#4496) * apply param overrides * add additional kwargs to cache| | * fix unit test * add test for overriding epochs * add test for overriding epochs * Fix adaptive batch size to run on CPU (#4499) * add warning instead of raising error * fix unit test * Fix UFLow configuration (#4504) add callbacks for uflow * reimplement Gaussian noise * Fix confidence threshold cache invalidation and filtering logic 
(#4498) * Refactor confidence threshold handling in detection and instance segmentation models * adding stage parameter to model methods for validation and testing * Refactor metric computation in OTX models by removing stage parameter and consolidating test step logic * fix inst-seg _filter_outputs_by_threshold * Remove best_confidence_threshold_list from checkpoint during save and add unit tests for detection model confidence threshold logic. * Fix format * Enhance unit tests for detection threshold logic to ensure compatibility with Python 3.10 * Enhance unit tests for detection threshold logic to ensure compatibility with Python 3.10 * Fix tests * Fix format * fix tests * update unit test * Removing best_confidence_threshold_list and updating related unit tests for checkpoint functionality. * Refactor checkpoint saving in OTXModel to remove unnecessary line and update comments in OTXDetectionModel for clarity on best_confidence_threshold usage. * add RandomGaussianBlur aug * minor fix| * fix unit tests * reply comments * provide workaround for XPU batch search * return back parameters for MaskRCNN * fix unit test * Fix semantic segmentation annotation handling for ExtractedMask type (#4511) * Fix tiling when polygons are given * Fix gaussian noise augmentation and add random gaussian blur (#4508) * reimplement Gaussian noise * add RandomGaussianBlur aug * minor fix| * fix unit tests * reply comments * Filter invalid annotation by task (#4515) * Add task parameter to pre-filtering and enhance annotation validation logic * fix unit test * Workaround for batch size search on xpu devices (#4513) * provide workaround for XPU batch search * return back parameters for MaskRCNN * fix unit test * switch off adaprive_bs by default * fix linter * Fix cache args (#4522) * reimplement Gaussian noise * add RandomGaussianBlur aug * minor fix| * fix unit tests * reply comments * provide workaround for XPU batch search * return back parameters for MaskRCNN * fix unit test * fix 
train args * fix unit tests * add tiling arrow * fix deim recipe * fix test_xai * try self hosted * try pre-commit on Ubuntu * try to bypass unit tests * add installing build tools * remove sudo * fix integration tests * return workflow back * fix pre-commit --------- Signed-off-by: Ashwin Vaidya <[email protected]> Signed-off-by: Ashwin Vaidya <[email protected]> Co-authored-by: Vladislav Sovrasov <[email protected]> Co-authored-by: Rajesh Gangireddy <[email protected]> Co-authored-by: Ashwin Vaidya <[email protected]> Co-authored-by: Eugene Liu <[email protected]> Co-authored-by: Ashwin Vaidya <[email protected]>
Summary
Currently, training on XPU devices may fail even after a successful batch size search. XPU accumulates operations and uses a cache for primitives, so memory consumption keeps growing during training and can exceed what the search observed. This PR is not a reliable solution, but rather an attempt to mitigate failures in Geti.
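To illustrate the kind of mitigation described above, here is a minimal sketch of one possible approach: derating the batch size returned by the search when the device is XPU, so that later memory growth has some headroom. This is not the code from this PR. The function name `derate_batch_size`, the `xpu_factor` default of 0.5, and the `"xpu"` device string are all assumptions made for illustration.

```python
def derate_batch_size(
    found_batch_size: int,
    device_type: str,
    xpu_factor: float = 0.5,  # hypothetical safety factor, not taken from the PR
    minimum: int = 1,
) -> int:
    """Return a batch size with extra memory headroom on XPU devices.

    The batch size search measures memory at search time only. If memory
    keeps growing afterwards (as described for XPU), the found value may be
    too optimistic, so it is scaled down here. Other devices are untouched.
    """
    if found_batch_size < 1:
        raise ValueError("found_batch_size must be a positive integer")
    if not 0.0 < xpu_factor <= 1.0:
        raise ValueError("xpu_factor must be in the interval (0, 1]")

    if device_type != "xpu":
        # No derating for devices where the search result is trusted.
        return found_batch_size

    # Scale down, but never go below the configured minimum.
    return max(minimum, int(found_batch_size * xpu_factor))


if __name__ == "__main__":
    print(derate_batch_size(32, "xpu"))
    print(derate_batch_size(32, "cuda"))
```

A fixed factor like this only lowers the chance of a failure; it cannot guarantee that training stays within memory, which matches the PR's own caveat that the workaround is not a reliable solution.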
How to test
Checklist
License
Feel free to contact the maintainers if that's a concern.