Validate setup for GPU allocation

This document assumes that you have followed the installation instructions[TODO] and that all relevant GPU Operator components are running and in a Ready state.

Validate that the DRA driver is running

Confirm that every expected node runs a *-k8s-dra-driver-kubelet-plugin-* pod, and that the READY column shows all containers ready (n/n) for each listed pod.
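
One way to run this check is sketched below. It assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, ...); the helper name dra_plugin_not_ready is made up for this example and is not part of any tool. The namespace the driver runs in depends on how it was installed, so the sketch searches all namespaces.

```shell
# Print every kubelet-plugin pod whose READY column is not "n/n".
# Reads `kubectl get pods -A --no-headers` output on stdin
# (assumed columns: NAMESPACE NAME READY STATUS ...).
# Empty output means all matching pods report ready.
dra_plugin_not_ready() {
  awk '$2 ~ /k8s-dra-driver-kubelet-plugin/ {
         split($3, r, "/")
         if (r[1] != r[2]) print $1 "/" $2 " READY=" $3
       }'
}

# Usage:
#   kubectl get pods -A --no-headers | dra_plugin_not_ready
# To see which node each plugin pod runs on:
#   kubectl get pods -A -o wide | grep k8s-dra-driver-kubelet-plugin
```

Empty output alone does not prove coverage: compare the NODE column from the second command against the nodes you expect to serve GPUs.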

Run validation workload

1) Two containers asking for the same GPU

Create spec file:

cat <<EOF > dra-gpu-share-test.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: dra-gpu-share-test
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: dra-gpu-share-test
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: dra-gpu-share-test
  name: pod
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
EOF

Apply spec:

kubectl apply -f dra-gpu-share-test.yaml
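
The containers print the GPU list as soon as they start, so it can help to wait until the pod is Ready before reading its logs. The following is a minimal sketch, not a required step: the function names are invented for this example, and the 120s default timeout is an arbitrary choice.

```shell
# Block until the test pod reports Ready, or give up after the timeout.
# The 120s default is an assumption, not a documented value.
wait_for_test_pod() {
  kubectl wait --for=condition=Ready pod/pod \
    -n dra-gpu-share-test --timeout="${1:-120s}"
}

# List the ResourceClaim generated from the single-gpu template.
show_test_claims() {
  kubectl get resourceclaims -n dra-gpu-share-test
}

# Usage:
#   wait_for_test_pod && show_test_claims
```

If the wait times out, `kubectl describe pod pod -n dra-gpu-share-test` usually shows whether the pod is stuck on scheduling or on claim allocation.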

Obtain output of both containers in the pod:

kubectl logs pod -n dra-gpu-share-test --all-containers --prefix

The output is expected to show the same GPU UUID from both containers. Example:

[pod/pod/ctr0] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
[pod/pod/ctr1] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
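
The comparison can also be scripted. The sketch below assumes the log lines have the `(UUID: GPU-...)` shape shown in the example above; same_gpu_uuid is a hypothetical helper name.

```shell
# Reads `kubectl logs ... --all-containers --prefix` output on stdin.
# Exits 0 only if at least two lines carry a GPU UUID and all of those
# lines carry the same one; exits 1 otherwise.
same_gpu_uuid() {
  awk 'match($0, /UUID: GPU-[0-9a-f-]+/) {
         uuid = substr($0, RSTART + 6, RLENGTH - 6)  # drop "UUID: "
         lines++
         if (!(uuid in seen)) { seen[uuid] = 1; distinct++ }
       }
       END { exit !(lines >= 2 && distinct == 1) }'
}

# Usage:
#   kubectl logs pod -n dra-gpu-share-test --all-containers --prefix \
#     | same_gpu_uuid && echo "both containers see the same GPU"
```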
