"CUDA error: invalid argument" on model test for vit_h_14 #7143

Open
@YosuaMichael

We got the following error on the CI:

___________________ test_classification_model[cuda-vit_h_14] ___________________
Traceback (most recent call last):
  File "/work/test/test_models.py", line 732, in test_classification_model
    _check_input_backprop(model, x)
  File "/work/test/test_models.py", line 226, in _check_input_backprop
    out[0].sum().backward()
  File "/work/ci_env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/work/ci_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This error occurs only for the vit_h_14 model on the CUDA device; the same test passes on CPU. I also cannot reproduce it on an AWS cluster machine, so the error appears to be machine- or environment-dependent and is likely a PyTorch core issue.
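Since the failure looks machine- or environment-dependent, one way to narrow it down is to capture an environment report on the CI machine and on a machine where the test passes, then diff the two. The sketch below assumes `python -m torch.utils.collect_env` can be run in both environments; the helper names `collect_env_report` and `diff_env_reports` are hypothetical and not part of the torchvision test suite.

```python
import difflib
import subprocess
import sys


def collect_env_report() -> str:
    """Return PyTorch's environment report for the current interpreter."""
    # Requires torch to be installed in this environment.
    result = subprocess.run(
        [sys.executable, "-m", "torch.utils.collect_env"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def diff_env_reports(ci_report: str, other_report: str) -> list:
    """Return unified-diff lines between two saved environment reports."""
    return list(
        difflib.unified_diff(
            ci_report.splitlines(),
            other_report.splitlines(),
            fromfile="ci",
            tofile="other",
            lineterm="",
        )
    )
```

Saving the output of `collect_env_report()` on each machine and passing both strings to `diff_env_reports` would highlight differences such as driver, CUDA, or library versions, which may or may not turn out to be the cause.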

Note: I also ran the test with CUDA_LAUNCH_BLOCKING=1, but the error trace is nearly identical:

___________________ test_classification_model[cuda-vit_h_14] ___________________
Traceback (most recent call last):
  File "/work/test/test_models.py", line 725, in test_classification_model
    _check_input_backprop(model, x)
  File "/work/test/test_models.py", line 225, in _check_input_backprop
    out[0].sum().backward()
  File "/work/ci_env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/work/ci_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
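For anyone trying to reproduce this locally, here is a minimal sketch of how the single failing parametrization could be run with `CUDA_LAUNCH_BLOCKING=1`. The helper `build_repro` is hypothetical, and the relative path `test/test_models.py` is an assumption based on the `/work/test/test_models.py` path in the traceback.

```python
import os
import sys


def build_repro(test_file: str = "test/test_models.py"):
    """Build the pytest command and environment for the failing test only."""
    env = dict(os.environ)
    # Make CUDA kernel launches synchronous so errors are reported at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = [
        sys.executable, "-m", "pytest", test_file,
        # Select tests whose id contains both substrings, matching the
        # test_classification_model[cuda-vit_h_14] entry from the CI log.
        "-k", "test_classification_model and cuda-vit_h_14",
        # Stop at the first failure.
        "-x",
    ]
    return cmd, env
```

From the repository root, `cmd, env = build_repro()` followed by `subprocess.run(cmd, env=env)` would run just this test. A machine with a CUDA device is required, and as noted above the failure may not reproduce on every machine.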

cc @pmeier @seemethere @atalman @osalpekar
