"CUDA error: invalid argument" on model test for vit_h_14

We got the following error on the CI:
```
___________________ test_classification_model[cuda-vit_h_14] ___________________
Traceback (most recent call last):
  File "/work/test/test_models.py", line 732, in test_classification_model
    _check_input_backprop(model, x)
  File "/work/test/test_models.py", line 226, in _check_input_backprop
    out[0].sum().backward()
  File "/work/ci_env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/work/ci_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
This error only happens for the `vit_h_14` model on the CUDA device (CPU is fine). I also cannot reproduce it on an AWS cluster machine. The error seems to be machine- or environment-dependent, and is likely a pytorch-core issue.

Note: I have tried running the test with `CUDA_LAUNCH_BLOCKING=1`, but the error trace is essentially the same (see [here](https://github.com/pytorch/vision/actions/runs/4019456977/jobs/6906319059)):
```
___________________ test_classification_model[cuda-vit_h_14] ___________________
Traceback (most recent call last):
  File "/work/test/test_models.py", line 725, in test_classification_model
    _check_input_backprop(model, x)
  File "/work/test/test_models.py", line 225, in _check_input_backprop
    out[0].sum().backward()
  File "/work/ci_env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/work/ci_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
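For anyone who wants to retry this on another machine, here is a minimal sketch of how the single failing test could be rerun with `CUDA_LAUNCH_BLOCKING=1`. The helper names (`blocking_env`, `build_pytest_command`, `run_blocking`) are hypothetical, and the test node ID is inferred from the traceback above, assuming the command is run from the repository root. It is not the exact CI invocation.

```python
import os
import subprocess
import sys

# Node ID inferred from the traceback (assumption: run from the repo root).
TEST_ID = "test/test_models.py::test_classification_model[cuda-vit_h_14]"


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # Makes kernel launches synchronous so errors surface at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_pytest_command(test_id):
    """Build a pytest command that runs only the given test."""
    return [sys.executable, "-m", "pytest", "-x", test_id]


def run_blocking(test_id=TEST_ID):
    """Run the test in a subprocess and return pytest's exit code."""
    # Passing a list (no shell) keeps the brackets in the node ID intact.
    result = subprocess.run(build_pytest_command(test_id), env=blocking_env())
    return result.returncode
```

Calling `run_blocking()` on a CUDA machine should show whether the failure reproduces there. A zero exit code would support the guess that the problem is specific to the CI environment.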

cc @pmeier @seemethere @atalman @osalpekar 
