Validate setup for GPU allocation
Dr. Jan-Philip Gehrcke edited this page Jul 21, 2025
This document assumes that you have followed the installation instructions [TODO], and that all relevant GPU Operator components are running and in a `Ready` state.
Confirm that all expected nodes run a `*-k8s-dra-driver-kubelet-plugin-*` pod, and that the `READY` column indicates readiness for all listed pods.
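One way to perform that check is to list the plugin pods cluster-wide and inspect the `READY` column. A sketch: the hard-coded lines below stand in for live `kubectl get pods -A` output, and the namespace and pod names are illustrative, not the real ones in your cluster.

```shell
# In practice:  kubectl get pods -A | grep k8s-dra-driver-kubelet-plugin
# Stand-in output for illustration (names will differ in your cluster):
pods='nvidia   nvidia-k8s-dra-driver-kubelet-plugin-4qs2x   2/2   Running   0   5m
nvidia   nvidia-k8s-dra-driver-kubelet-plugin-8kd7p   2/2   Running   0   5m'

# A pod counts as ready when the READY column reads n/n
# (all of its containers are ready).
not_ready=$(printf '%s\n' "$pods" | awk '{split($3, r, "/"); if (r[1] != r[2]) print $2}')
if [ -z "$not_ready" ]; then
  echo "all kubelet-plugin pods ready"
else
  echo "not ready: $not_ready"
fi
```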
Create the spec file:

```bash
cat <<EOF > dra-gpu-share-test.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: dra-gpu-share-test
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: dra-gpu-share-test
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: dra-gpu-share-test
  name: pod
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
EOF
```
Apply the spec:

```bash
kubectl apply -f dra-gpu-share-test.yaml
```
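Before reading logs, it can help to confirm that the pod actually scheduled and started. A sketch: the status line below is stand-in output for `kubectl get pod -n dra-gpu-share-test pod`, not real cluster state.

```shell
# In practice:  kubectl get pod -n dra-gpu-share-test pod
# Stand-in output for illustration:
status='NAME   READY   STATUS    RESTARTS   AGE
pod    2/2     Running   0          30s'

# Pick the STATUS column from the second (data) line.
phase=$(printf '%s\n' "$status" | awk 'NR==2 {print $3}')
if [ "$phase" = "Running" ]; then
  echo "pod is running"
else
  # On failure, `kubectl describe pod -n dra-gpu-share-test pod` shows
  # scheduling or claim-allocation errors.
  echo "pod status: $phase"
fi
```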
Obtain the output of both containers in the pod:

```bash
kubectl logs pod -n dra-gpu-share-test --all-containers --prefix
```
The output is expected to show the same GPU UUID from both containers: both reference the same `shared-gpu` resource claim, so they are allocated the same device. Example:

```
[pod/pod/ctr0] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
[pod/pod/ctr1] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
```
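To check the UUIDs programmatically instead of by eye, a small pipeline over the log output works. A sketch: the sample lines below stand in for the real `kubectl logs` output.

```shell
# In practice, feed in the output of:
#   kubectl logs pod -n dra-gpu-share-test --all-containers --prefix
logs='[pod/pod/ctr0] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
[pod/pod/ctr1] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)'

# Extract every GPU UUID and count distinct values; sharing worked if
# both containers report exactly the same one.
distinct=$(printf '%s\n' "$logs" | grep -o 'GPU-[0-9a-f-]*' | sort -u | wc -l)
if [ "$distinct" -eq 1 ]; then
  echo "OK: both containers see the same GPU"
else
  echo "MISMATCH: $distinct distinct GPU UUIDs"
fi
```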