
Validate setup for GPU allocation

Dr. Jan-Philip Gehrcke edited this page Jul 21, 2025 · 1 revision

This document assumes that you have followed the installation instructions [TODO] and that all relevant GPU Operator components are running and in a Ready state.

Validate that the DRA driver is running

Confirm that every expected node runs a *-k8s-dra-driver-kubelet-plugin-* pod, and that the READY column shows all containers ready (for example 1/1) for each listed pod.
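One way to script this check is sketched below. The helper name check_dra_plugin_ready is hypothetical. The namespace depends on how the driver was installed, so the usage line in the comment lists pods across all namespaces. The helper inspects only the READY column; it does not verify that every expected node is covered, which still needs a look at the node assignment (for example via -o wide).

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and fails
# unless at least one kubelet-plugin pod is listed and every such pod
# reports all of its containers ready (READY column n/n).
check_dra_plugin_ready() {
  awk '
    /k8s-dra-driver-kubelet-plugin/ {
      seen++
      # The READY column is the first field of the form <ready>/<total>.
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^[0-9]+\/[0-9]+$/) {
          split($i, parts, "/")
          if (parts[1] != parts[2]) { print "not ready: " $0; bad++ }
          break
        }
      }
    }
    END { if (seen == 0 || bad > 0) exit 1 }
  '
}

# Usage (assumes kubectl points at the cluster under test):
#   kubectl get pods -A | check_dra_plugin_ready && echo "all plugin pods ready"
```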

Run validation workload

1) Two containers asking for the same GPU

Create spec file:

cat <<EOF > dra-gpu-share-test.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: dra-gpu-share-test
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: dra-gpu-share-test
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: dra-gpu-share-test
  name: pod
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
EOF

Apply spec:

kubectl apply -f dra-gpu-share-test.yaml
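The containers need a moment to start before their logs are available. A minimal sketch for waiting until the pod is Ready follows; the helper name wait_for_test_pod and the default timeout of 120s are assumptions, not part of the original instructions.

```shell
# Hypothetical helper: blocks until the test pod reports Ready, or fails
# once the timeout (first argument, default 120s) has elapsed.
wait_for_test_pod() {
  kubectl wait --for=condition=Ready pod/pod \
    --namespace dra-gpu-share-test --timeout="${1:-120s}"
}

# Usage:
#   wait_for_test_pod        # default timeout
#   wait_for_test_pod 5m     # allow more time, e.g. for a slow image pull
```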

Obtain the output of both containers in the pod:

kubectl logs pod -n dra-gpu-share-test --all-containers --prefix

The output is expected to show the same GPU UUID from both containers. Example:

[pod/pod/ctr0] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
[pod/pod/ctr1] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
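The comparison can also be automated. The sketch below uses a hypothetical helper named same_gpu_uuid and assumes that each container prints exactly one GPU line in the format shown above. It counts log lines rather than containers, so it is a rough check, not a full validation.

```shell
# Hypothetical helper: reads `kubectl logs --prefix` output on stdin and
# succeeds only if at least two GPU lines are present and all of them
# carry the same UUID.
same_gpu_uuid() {
  awk '
    match($0, /UUID: GPU-[0-9a-f-]+/) {
      # Skip the 6-character "UUID: " prefix to keep only the identifier.
      uuid = substr($0, RSTART + 6, RLENGTH - 6)
      total++
      if (!(uuid in seen)) { seen[uuid] = 1; distinct++ }
    }
    END { if (total < 2 || distinct != 1) exit 1 }
  '
}

# Usage:
#   kubectl logs pod -n dra-gpu-share-test --all-containers --prefix \
#     | same_gpu_uuid && echo "both containers see the same GPU"
```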