Description
We got the following error on the CI:
___________________ test_classification_model[cuda-vit_h_14] ___________________
Traceback (most recent call last):
File "/work/test/test_models.py", line 732, in test_classification_model
_check_input_backprop(model, x)
File "/work/test/test_models.py", line 226, in _check_input_backprop
out[0].sum().backward()
File "/work/ci_env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/work/ci_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
This error only occurs for the vit_h_14 model on a CUDA device (the CPU run is fine). I also cannot reproduce it on an AWS cluster machine, so it appears to be machine- or environment-dependent and is likely a PyTorch core issue.
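Since the failure appears to depend on the machine or environment, one way to narrow it down could be to diff the basic runtime details of the CI worker against the AWS machine. The snippet below is only a sketch of that idea: `collect_env_summary` is a hypothetical helper name, not something from the test suite, and it assumes PyTorch may or may not be importable where it runs.

```python
import platform


def collect_env_summary():
    """Gather basic runtime details to compare between two machines.

    Hypothetical helper for illustration; not part of the test suite.
    """
    summary = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed here; record that and stop.
        summary["torch"] = None
        return summary

    summary["torch"] = torch.__version__
    # CUDA toolkit version PyTorch was built with (None for CPU-only builds).
    summary["torch_cuda"] = torch.version.cuda
    summary["cuda_available"] = torch.cuda.is_available()
    if summary["cuda_available"]:
        summary["device_name"] = torch.cuda.get_device_name(0)
        summary["device_capability"] = torch.cuda.get_device_capability(0)
        summary["cudnn"] = torch.backends.cudnn.version()
    return summary


if __name__ == "__main__":
    for key, value in collect_env_summary().items():
        print(f"{key}: {value}")
```

Running this on both machines and comparing the output (GPU model, compute capability, CUDA and cuDNN versions) might show what differs. `python -m torch.utils.collect_env` produces a more complete report along the same lines.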
Note: I have tried running the test with CUDA_LAUNCH_BLOCKING=1, but the error trace looks nearly identical (see here):
___________________ test_classification_model[cuda-vit_h_14] ___________________
Traceback (most recent call last):
File "/work/test/test_models.py", line 725, in test_classification_model
_check_input_backprop(model, x)
File "/work/test/test_models.py", line 225, in _check_input_backprop
out[0].sum().backward()
File "/work/ci_env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/work/ci_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
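For anyone who wants to repeat the CUDA_LAUNCH_BLOCKING=1 run on their own machine, here is a rough sketch of how the single failing test could be rerun in isolation. The helper names are hypothetical, and the relative path `test/test_models.py` is an assumption based on the traceback; it presumes the command is launched from the repository root.

```python
import os
import subprocess
import sys

# Assumed from the traceback above; adjust if the repository layout differs.
TEST_FILE = "test/test_models.py"
TEST_ID = "test_classification_model[cuda-vit_h_14]"


def build_blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_pytest_command(test_file=TEST_FILE, test_id=TEST_ID):
    """Build the pytest invocation that selects only the failing test."""
    return [sys.executable, "-m", "pytest", f"{test_file}::{test_id}", "-x"]


def rerun_failing_test():
    """Run the selected test in a subprocess and return its exit code."""
    result = subprocess.run(
        build_pytest_command(),
        env=build_blocking_env(),
        check=False,  # a failing test is the expected outcome here
    )
    return result.returncode
```

Calling `rerun_failing_test()` from the repository root should run only the vit_h_14 CUDA case. Setting the variable through the subprocess environment means it is in place before the child process initializes CUDA.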