Description
When I try to run anything on the GPU (Linux, 0.3.0-SNAPSHOT), initialization takes a while and then fails with:
2021-01-29 19:15:28.239332: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-01-29 19:15:28.239537: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Exception in thread "main" org.tensorflow.exceptions.TensorFlowException: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
at org.tensorflow.internal.c_api.AbstractTF_Status.throwExceptionIfNotOK(AbstractTF_Status.java:101)
at org.tensorflow.EagerSession.allocate(EagerSession.java:357)
at org.tensorflow.EagerSession.<init>(EagerSession.java:327)
This is with a GTX 1070 (compute capability 6.1), which was recognized successfully earlier in the log:
2021-01-29 19:15:28.229419: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.797GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
After some digging, I found tensorflow/tensorflow#41990, tensorflow/tensorflow#41132 (comment), and tensorflow/tensorflow#41892 (comment).
The last two in particular imply that the issue is that our binaries aren't being built with support for compute capability 6.1, and sure enough, they aren't: https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/build.sh#L25-L32
export TF_CUDA_COMPUTE_CAPABILITIES="3.5,7.0"
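To make the mismatch concrete, here is a rough bash sketch that checks whether a device's compute capability is covered by a given capability list. The function name `covered_by_sass` is made up for illustration and is not part of build.sh. It assumes the usual CUDA binary-compatibility rule (a kernel image built for X.y runs on devices X.z with z >= y) and deliberately ignores PTX JIT fallback, so treat it as an approximation rather than what TensorFlow actually does at load time.

```shell
#!/usr/bin/env bash

# covered_by_sass DEVICE_CAP CAP_LIST  (hypothetical helper)
# Returns 0 if some entry in the comma-separated CAP_LIST has the same
# major version as DEVICE_CAP and a minor version that is not higher.
# Assumption: same-major binary compatibility only; PTX JIT is ignored.
covered_by_sass() {
  local device="$1" list="$2"
  local dev_major="${device%%.*}" dev_minor="${device##*.}"
  local cap major minor
  local IFS=','
  for cap in $list; do
    major="${cap%%.*}"
    minor="${cap##*.}"
    if [ "$major" -eq "$dev_major" ] && [ "$minor" -le "$dev_minor" ]; then
      return 0
    fi
  done
  return 1
}

# GTX 1070 (6.1) against the list currently exported in build.sh
if covered_by_sass "6.1" "3.5,7.0"; then
  echo "covered"
else
  echo "not covered"   # this branch is taken: no 6.x entry in the list
fi
```

Under the same assumption, a list containing 6.0 would cover the 1070, which matches the reports in the linked issues that the other binaries work on this card.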
As per the 2nd and 3rd links, and https://www.tensorflow.org/install/gpu#hardware_requirements, the other TensorFlow binaries (Python, C, C++, etc.) support 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, and higher than 8.0.
IMO, we should do the same, ideally in a way that we don't have to update when the list changes (would simply not exporting the variable work? The defaults are specified in https://github.com/tensorflow/tensorflow/blob/master/.bazelrc#L600). This will likely increase build times though, which I think we already have issues with.
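One possible shape for the build.sh change, as an untested sketch: only set a default when the caller has not already chosen a list, so CI or local builds can override it to keep build times down. The list below is just the one quoted above for the other binaries. Whether every entry (8.0 in particular) is accepted depends on the CUDA toolkit version used for the build, which I have not checked.

```shell
#!/usr/bin/env bash

# Sketch only: fall back to the capability list used by the other
# TensorFlow binaries unless the environment already provides one.
# Assumption: the toolkit in use can compile for every entry listed.
export TF_CUDA_COMPUTE_CAPABILITIES="${TF_CUDA_COMPUTE_CAPABILITIES:-3.5,5.0,6.0,7.0,7.5,8.0}"

echo "Building for compute capabilities: $TF_CUDA_COMPUTE_CAPABILITIES"
```

With this, something like `TF_CUDA_COMPUTE_CAPABILITIES="6.1" ./build.sh` would still build for a single card during local development.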