Fix race condition in runAtenTest #9306
Open
The logic in runAtenTest has a race condition that can trigger if the "x.to(device)" call is delayed, for example while it waits to acquire a lock on the XLA device. In this case, the host-to-XLA device transfer may not start until after the test function has executed on the host CPU.
For test functions which mutate their inputs, such as test_diagonal_write_transposed_r3, this can result in the test fn being applied twice: once from the invocation of fn(*tensors) on CPU, and then a second time from the invocation of fn(*xla_tensors). Empirically, this happens at a rate of approximately 0.03% (about 1 in 3000 executions).
Adding a full clone on the CPU before calling the test function ensures that the CPU invocation and the XLA invocation will not interfere even if they execute in a different order, eliminating the race condition. This was verified by 1 million successful sequential executions of test_diagonal_write_transposed_r3.
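The ordering problem can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not the actual runAtenTest code and not the torch_xla API: DelayedDeviceCopy, mutating_fn, run_without_clone and run_with_clone are hypothetical names, plain lists stand in for tensors, and the "transfer" is forced to start late so that the failing order is deterministic rather than a 1-in-3000 event.

```python
import copy


class DelayedDeviceCopy:
    """Hypothetical stand-in for a host-to-device transfer that starts late."""

    def __init__(self, host_buffer):
        # Holds a reference to the host buffer, not a snapshot of it.
        self._host = host_buffer
        self._device = None

    def materialize(self):
        # The copy happens only now, so it sees whatever the host holds now.
        if self._device is None:
            self._device = list(self._host)
        return self._device


def mutating_fn(buf):
    # In-place update, analogous to a test function that writes to its input.
    for i in range(len(buf)):
        buf[i] += 1
    return buf


def run_without_clone(data):
    host = list(data)
    pending = DelayedDeviceCopy(host)
    # The CPU run mutates the very buffer the pending transfer will read.
    cpu_result = mutating_fn(host)
    device_result = mutating_fn(pending.materialize())
    return cpu_result, device_result


def run_with_clone(data):
    host = list(data)
    pending = DelayedDeviceCopy(host)
    # The CPU run works on an independent copy; the source stays untouched.
    cpu_result = mutating_fn(copy.deepcopy(host))
    device_result = mutating_fn(pending.materialize())
    return cpu_result, device_result
```

In this model, run_without_clone([0, 0, 0]) returns a device result of [2, 2, 2] because the function is effectively applied twice, while run_with_clone([0, 0, 0]) returns [1, 1, 1] for both the CPU and the device result regardless of when the transfer happens.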