
Update GPU Compute Capacity support to match tensorflow #200

Closed
@rnett


When trying to test on a GPU (on Linux) with 0.3.0-SNAPSHOT, initialization takes a while and then fails with:

2021-01-29 19:15:28.239332: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-01-29 19:15:28.239537: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Exception in thread "main" org.tensorflow.exceptions.TensorFlowException: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
	at org.tensorflow.internal.c_api.AbstractTF_Status.throwExceptionIfNotOK(AbstractTF_Status.java:101)
	at org.tensorflow.EagerSession.allocate(EagerSession.java:357)
	at org.tensorflow.EagerSession.<init>(EagerSession.java:327)

This is with a GTX 1070 (compute capability 6.1), which was recognized successfully earlier in the same log:

2021-01-29 19:15:28.229419: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.797GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s

After some digging, I found tensorflow/tensorflow#41990, tensorflow/tensorflow#41132 (comment), and tensorflow/tensorflow#41892 (comment).

The last two in particular imply that the issue is that our binaries aren't built with support for compute capability 6.1, and sure enough, they aren't: https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/build.sh#L25-L32

export TF_CUDA_COMPUTE_CAPABILITIES="3.5,7.0"
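
To make the gap concrete, here is a rough sketch of a coverage check. It assumes the usual CUDA binary-compatibility rule: a precompiled kernel built for capability X.y runs on a device X.z only when the major version matches and z >= y. The helper name `is_covered` is hypothetical, and the check deliberately ignores PTX JIT fallback, so treat it as an illustration rather than a description of what the TensorFlow runtime actually does.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if a device with compute
# capability $1 can run a precompiled binary built for at least one of the
# comma-separated capabilities in $2.
# Assumption: same major version, built minor <= device minor.
# PTX JIT compilation is NOT considered here.
is_covered() {
  dev_major="${1%%.*}"   # part before the dot, e.g. 6 from 6.1
  dev_minor="${1#*.}"    # part after the dot, e.g. 1 from 6.1
  for built in $(printf '%s' "$2" | tr ',' ' '); do
    built_major="${built%%.*}"
    built_minor="${built#*.}"
    if [ "$built_major" -eq "$dev_major" ] && [ "$built_minor" -le "$dev_minor" ]; then
      return 0
    fi
  done
  return 1
}

# The GTX 1070 (6.1) against the list currently exported in build.sh.
if is_covered 6.1 "3.5,7.0"; then
  echo "6.1 is covered"
else
  echo "6.1 is not covered"
fi
```

Under this rule, neither 3.5 nor 7.0 shares a major version with 6.1, which would be consistent with the "device kernel image is invalid" failure above.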

According to the second and third links, and https://www.tensorflow.org/install/gpu#hardware_requirements, the other TensorFlow binaries (Python, C, C++, etc.) support compute capabilities 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, and higher than 8.0. IMO we should do the same, ideally in a way that doesn't need updating whenever the list changes. Would simply not exporting the variable work? The defaults are specified in https://github.com/tensorflow/tensorflow/blob/master/.bazelrc#L600. This will likely increase build times, though, which I think are already a problem for us.
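
One possible shape for the change in build.sh, as a sketch only: keep whatever value the caller already exported, and otherwise fall back to the list cited above. The `DEFAULT_CAPABILITIES` variable is a name made up for illustration, and I haven't verified that the CUDA toolchain used for our builds accepts every entry. In particular, 8.0 needs CUDA 11 or newer, while the log above loads libcudart.so.10.1, so that entry may have to be dropped for now.

```shell
#!/bin/sh
# Hypothetical fallback list, copied from the capabilities quoted in this
# issue; not taken from the upstream .bazelrc.
DEFAULT_CAPABILITIES="3.5,5.0,6.0,7.0,7.5,8.0"

# Respect a value supplied by the caller (e.g. CI), otherwise use the fallback.
export TF_CUDA_COMPUTE_CAPABILITIES="${TF_CUDA_COMPUTE_CAPABILITIES:-$DEFAULT_CAPABILITIES}"

echo "Building for compute capabilities: $TF_CUDA_COMPUTE_CAPABILITIES"
```

This still hard-codes a list, so it doesn't answer the question of whether leaving the variable unset would pick up the upstream defaults; it only makes the list overridable without editing the script.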
