Workaround for batch size search on xpu devices #4513

kprokofi · 2025-08-08T12:22:15Z

Summary

Currently, training on XPU may fail even after a successful batch size search. XPU accumulates operations and caches primitives, so memory consumption keeps growing during training. This PR is not a reliable solution; it is an attempt to mitigate these failures in Geti.
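To illustrate the failure mode described above, here is a minimal sketch of one way a batch size search can leave room for memory that grows after the search finishes. It is not the code from this PR. The function name `search_batch_size`, the `headroom` factor, and the `fake_trial` probe are all hypothetical, and the sketch assumes that a batch size that fits implies every smaller one fits too.

```python
from typing import Callable


def search_batch_size(
    fits_in_memory: Callable[[int], bool],
    max_batch_size: int,
    headroom: float = 0.8,
) -> int:
    """Binary-search the largest batch size that passes a trial, then shrink it.

    `headroom` is a hypothetical safety factor: the trial only measures memory
    at the start of training, so the result is scaled down to leave room for
    growth that happens later (for example, from cached primitives).
    """
    if not 0.0 < headroom <= 1.0:
        raise ValueError("headroom must be in (0, 1]")

    low, high, best = 1, max_batch_size, 0
    while low <= high:
        mid = (low + high) // 2
        if fits_in_memory(mid):
            best = mid       # mid fits, try something larger
            low = mid + 1
        else:
            high = mid - 1   # mid does not fit, try something smaller

    if best == 0:
        raise RuntimeError("no batch size in the search range fits in memory")
    return max(1, int(best * headroom))


def fake_trial(batch_size: int) -> bool:
    # Hypothetical stand-in for a short trial run:
    # each sample "costs" 100 MB out of a 4800 MB budget.
    return batch_size * 100 <= 4800


safe_bs = search_batch_size(fake_trial, max_batch_size=512, headroom=0.8)
print(safe_bs)
```

With `headroom=1.0` the function returns the raw search result, which is the case the summary describes as failing later in training.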

How to test

Checklist

I have added unit tests to cover my changes.
I have added integration tests to cover my changes.
I have run e2e tests and there are no issues.
I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
I have linked related issues.

License

I submit my code changes under the same Apache License that covers the project.
Feel free to contact the maintainers if that's a concern.
I have updated the license header for each file (see an example below).

# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

kprokofi · 2025-08-12T12:44:54Z

See comment: #4508 (comment)

* Merge develop to release/2.5 (#4432)
* Update demo requirements (#4421) Fix demo requirements
* Cleanup Geti task templates for anomaly task (#4420)
* Remove sub task templates for anomaly
* Move anomaly classification templates one level up
* Update model_template_id for PADIM and STFPM anomaly templates
* Restore Engine (#4430) Restore engine.py
Signed-off-by: Ashwin Vaidya <[email protected]>
---------
Signed-off-by: Ashwin Vaidya <[email protected]>
Co-authored-by: Vladislav Sovrasov <[email protected]>
Co-authored-by: Rajesh Gangireddy <[email protected]>
Co-authored-by: Ashwin Vaidya <[email protected]>
* Support OVAnomaly in OVEngine (#4436)
* fix anomaly model
* almost refactored
* refactor AnomalyOV
* Add MaskRCNN v2 Rotated Detection task via Instance Segmentation (#4437)
* ✨ Add Rotated MaskRCNN v2 model implementation and configuration files
* fix: ensure newline at end of file in rotated_det.py
* fix: reorder imports and improve error message in convert_masks_to_rotated_predictions
* Update src/otx/backend/native/models/instance_segmentation/rotated_det.py
Co-authored-by: Ashwin Vaidya <[email protected]>
---------
Co-authored-by: Ashwin Vaidya <[email protected]>
* Benchmark Refactor for 2.5 (#4435)
* Refactor benchmark criteria in performance tests to remove redundant metrics and add GPU memory tracking
* Refactor OVEngine logging and streamline benchmark task handling
* Refactor dataset info entries to remove unnecessary extra_overrides in performance benchmark tests
* Remove performance benchmark tests for anomaly detection, classification, instance segmentation, keypoint detection, semantic segmentation, and tiling instance segmentation. These tests included various model and dataset configurations along with benchmark criteria for performance evaluation.
* Remove performance benchmark workflow configuration
* Refactor benchmark.py to streamline engine initialization and remove unnecessary extra_kwargs handling
* Refactor engine initialization in Benchmark class to return engine directly from configuration
* Fix end time initialization in IterationTimer to ensure proper timing for each phase
* Refactor test assertions in TestIterationTimer to simplify data_time logging checks for batch_idx
* Fix model name in MODEL_TEST_CASES for keypoint detection benchmark
* Fix kp detection metric name
* Update documentation for 2.5 release (#4447)
* update documentation
* change to additional feature
* added edits to the documentation
* delete product design
* change README
* small fix
* Provide XPU workarounds (release/2.5) (#4464)
* Provide workarounds for the XPU training (#4441)
* provide XPU workarounds
* add note section to the installation
* Update __init__.py
* 🐞 Fix 0 image scores in Anomaly OV model (#4469) Bugfix
Signed-off-by: Ashwin Vaidya <[email protected]>
* Fix regression on release 2.5 (#4468) Update adaptive early stopping configuration across multiple detection and segmentation recipes
* Improve EarlyStoppingWithWarmup docs and set check_on_train_epoch_end to False as default (#4473)
* Enhance EarlyStoppingWithWarmup functionality and add unit tests - Set default value for check_on_train_epoch_end to False in EarlyStoppingWithWarmup.
* Fix formatter
* Introduce Classification Factory and Simplify Model Imports (#4456)
* add factory for classficaiton
* add mising files
* minor
* fix imports
* fix imports in tests 2
* fix ruff
* fix unit test
* update factory. Reply comments
* add literal to other backbones
* 🐞 Benchmark fixes for 2.5 (#4471) Bug fixes - Max epochs in train overrides the max_epochs value loaded from config when creating the engine - Other fixes for benchmarking script
Signed-off-by: Ashwin Vaidya <[email protected]>
* Merge 2.4.6 to 2.5.0 (Fix checkpoint loading update) (#4475)
* merge changes
* fix linter
* fix readme
* update modules mock
* fix unit test
* fix tox
* create context manager
* add snapshot for anomaly
* add hlabel snapshot test
* minor fix
* fix changelog
* fix linter
* Update ConfigConverter for Geti2.12 (#4477)
* add factory for classficaiton
* add mising files
* minor
* fix imports
* fix imports in tests 2
* fix ruff
* fix unit test
* fix paths
* change converter
* add configurable augmentation and input size
* temporary fix
* update ConfigConverter:
* fix linter
* update unit test for ConfigConverter
* change integration tests
* add missing file
* fix unit test
* delete templates
* update changelog
* update recipe
* fix linter
* return templates back
* (release/2.5) Remove duplicate explain() method and consolidate XAI functionality into predict() (#4493)
* Refactor XAI utilities and remove deprecated explain method
* Fix XPU training and optimization from Geti2.5 (#4486)
* apply fix to run xpu, change from_config
* fix typing
* add example
* fix xai test
* fix linte
* fix auto batch size for XPU
* return max_epochs for atss
* add kwargs override for OTXEngine.from_config()
* use cache instead
* return train kwargs back
* minor fixes
* reply comments
* Fix overriding train parameters (#4496)
* apply param overrides
* add additional kwargs to cache
* fix unit test
* add test for overriding epochs
* add test for overriding epochs
* Fix adaptive batch size to run on CPU (#4499)
* add warning instead of raising error
* fix unit test
* Fix UFLow configuration (#4504) add callbacks for uflow
* reimplement Gaussian noise
* Fix confidence threshold cache invalidation and filtering logic (#4498)
* Refactor confidence threshold handling in detection and instance segmentation models
* adding stage parameter to model methods for validation and testing
* Refactor metric computation in OTX models by removing stage parameter and consolidating test step logic
* fix inst-seg _filter_outputs_by_threshold
* Remove best_confidence_threshold_list from checkpoint during save and add unit tests for detection model confidence threshold logic.
* Fix format
* Enhance unit tests for detection threshold logic to ensure compatibility with Python 3.10
* Enhance unit tests for detection threshold logic to ensure compatibility with Python 3.10
* Fix tests
* Fix format
* fix tests
* update unit test
* Removing best_confidence_threshold_list and updating related unit tests for checkpoint functionality.
* Refactor checkpoint saving in OTXModel to remove unnecessary line and update comments in OTXDetectionModel for clarity on best_confidence_threshold usage.
* add RandomGaussianBlur aug
* minor fix
* fix unit tests
* reply comments
* provide workaround for XPU batch search
* return back parameters for MaskRCNN
* fix unit test
* Fix semantic segmentation annotation handling for ExtractedMask type (#4511)
* Fix tiling when polygons are given
* Fix gaussian noise augmentation and add random gaussian blur (#4508)
* reimplement Gaussian noise
* add RandomGaussianBlur aug
* minor fix
* fix unit tests
* reply comments
* Filter invalid annotation by task (#4515)
* Add task parameter to pre-filtering and enhance annotation validation logic
* fix unit test
* Workaround for batch size search on xpu devices (#4513)
* provide workaround for XPU batch search
* return back parameters for MaskRCNN
* fix unit test
* switch off adaprive_bs by default
* fix linter
* Fix cache args (#4522)
* reimplement Gaussian noise
* add RandomGaussianBlur aug
* minor fix
* fix unit tests
* reply comments
* provide workaround for XPU batch search
* return back parameters for MaskRCNN
* fix unit test
* fix train args
* fix unit tests
* add tiling arrow
* fix deim recipe
* fix test_xai
* try self hosted
* try pre-commit on Ubuntu
* try to bypass unit tests
* add installing build tools
* remove sudo
* fix integration tests
* return workflow back
* fix pre-commit
---------
Signed-off-by: Ashwin Vaidya <[email protected]>
Signed-off-by: Ashwin Vaidya <[email protected]>
Co-authored-by: Vladislav Sovrasov <[email protected]>
Co-authored-by: Rajesh Gangireddy <[email protected]>
Co-authored-by: Ashwin Vaidya <[email protected]>
Co-authored-by: Eugene Liu <[email protected]>
Co-authored-by: Ashwin Vaidya <[email protected]>

provide workaround for XPU batch search

84fcb03

kprokofi requested review from samet-akcay, eugene123tw, sovrasov, Daankrol, ashwinvaidya17 and rajeshgangireddy as code owners August 8, 2025 12:22

kprokofi changed the base branch from develop to release/2.5 August 8, 2025 12:22

github-actions bot added the DEPENDENCY, BUILD, and DOC labels Aug 8, 2025

return back parameters for MaskRCNN

c680ff1

github-actions bot added the TEST label and removed the DEPENDENCY, BUILD, and DOC labels Aug 8, 2025

sovrasov added this to the 2.5.0 milestone Aug 8, 2025

kprokofi added 2 commits August 11, 2025 09:45

fix unit test

f3326e9

switch off adaprive_bs by default

5c1c2dc

sovrasov approved these changes Aug 12, 2025


kprokofi merged commit 73ea1b0 into open-edge-platform:release/2.5 Aug 12, 2025
6 of 14 checks passed
