failed to create cublas handle: the resource allocation failed #4

@blinor

Hey there,
I am trying to run a simple TensorFlow training job in a Docker container with fractional-gpu. No matter which one I use, I always get:
`>>> model.fit(x_train, y_train, epochs=50, batch_size=1000)
Epoch 1/50
2024-06-06 10:53:20.251154: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:185] failed to create cublas handle: the resource allocation failed
2024-06-06 10:53:20.251203: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:188] Failure to initialize cublas may be due to OOM (cublas needs some free memory when you initialize it, and your deep-learning framework may have preallocated more than its fair share), or may be because this binary was not built with support for the GPU in your machine.
2024-06-06 10:53:20.251227: W external/local_xla/xla/stream_executor/stream.cc:1020] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node sequential/dense/MatMul defined at (most recent call last):
File "<stdin>", line 1, in <module>

File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1807, in fit

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1401, in train_function

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1384, in step_function

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1373, in run_step

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1150, in train_step

File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 590, in call

File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/base_layer.py", line 1149, in call

File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/sequential.py", line 398, in call

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/functional.py", line 515, in call

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/functional.py", line 672, in _run_internal_graph

File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/base_layer.py", line 1149, in call

File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler

File "/usr/local/lib/python3.10/dist-packages/keras/src/layers/core/dense.py", line 241, in call

Blas xGEMV launch failed : a.shape=[1,1000,784], b.shape=[1,784,1], m=1000, n=1, k=784
[[{{node sequential/dense/MatMul}}]] [Op:__inference_train_function_932] `
With the official `tensorflow/tensorflow:latest-gpu` image, everything works as expected.
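
The second log line names one cause worth ruling out: TensorFlow may have preallocated most of the GPU memory before cuBLAS asked for its share. A minimal sketch of how to test that, assuming it runs in a fresh Python process before any model is built. `enable_memory_growth` is a hypothetical helper name. Whether this changes the behavior under fractional-gpu has not been verified here.

```python
import os

# Must be set before TensorFlow initializes the GPU: asks TF to grow its
# GPU allocation on demand instead of reserving most of the memory up front.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def enable_memory_growth():
    """Enable per-GPU memory growth (hypothetical helper).

    Returns the number of GPUs TensorFlow can see, or None if
    TensorFlow is not installed in this environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Raises RuntimeError if the GPU was already initialized,
        # so call this before creating any tensors or models.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


gpu_count = enable_memory_growth()
print("GPUs visible to TensorFlow:", gpu_count)
```

If `model.fit(...)` still fails with the same cuBLAS error after this, preallocation is probably not the cause, and the other possibility named in the log (a binary built without support for this GPU) becomes the more likely one.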
