Copilot AI commented Dec 22, 2025

Kaito unconditionally deploys the NVIDIA device plugin DaemonSet, causing duplicate deployments when clusters already have it installed via GPU Operator or other means. This PR makes the deployment optional via a feature gate, allowing users to disable it when needed.

Changes

  • Added nvidiaDevicePlugin.enabled flag to values.yaml (default: true)
  • Wrapped device plugin DaemonSet template with conditional rendering
  • Updated chart README with configuration documentation
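
The changes above amount to a small toggle in the chart. A minimal sketch of its shape, assuming the flag is read as `.Values.nvidiaDevicePlugin.enabled` and the whole DaemonSet manifest sits between the opening and closing directives (the `apiVersion`/`kind` lines and the whitespace trimming are assumptions, not copied from the diff):

```yaml
# charts/kaito/workspace/values.yaml (sketch)
nvidiaDevicePlugin:
  enabled: true   # default keeps the existing behavior

# charts/kaito/workspace/templates/nvidia-device-plugin-ds.yaml (sketch)
{{- if .Values.nvidiaDevicePlugin.enabled }}
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: {{ .Release.Namespace }}
# ... rest of the DaemonSet unchanged ...
{{- end }}
```

With `enabled: false`, the template renders to nothing, so Helm creates no DaemonSet from this file.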

Rationale

Defaulting to enabled ensures Kaito works out of the box for users without GPU Operator. Users with existing device plugin installations can explicitly disable it to prevent conflicts.

Usage

# Default: device plugin deployed by Kaito
helm install kaito-workspace kaito/workspace

# For clusters with existing device plugin (e.g., GPU Operator)
helm install kaito-workspace kaito/workspace --set nvidiaDevicePlugin.enabled=false
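
The same override can be kept in a values file instead of a `--set` flag. A minimal sketch, assuming a hypothetical file name `values-gpu-operator.yaml` passed with Helm's `-f` option:

```yaml
# values-gpu-operator.yaml (hypothetical file name)
# Install with: helm install kaito-workspace kaito/workspace -f values-gpu-operator.yaml
nvidiaDevicePlugin:
  enabled: false   # an existing device plugin (e.g., from GPU Operator) is already present
```
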
Original prompt

This section details the original issue you should resolve.

<issue_title>kaito deploys nvidia device plugin when cluster already includes it</issue_title>
<issue_description>Describe the bug

kaito deploys nvidia device plugin when cluster already includes it

Image

They are also deployed on the CPU nodes.

Steps To Reproduce

helm install kaito-workspace kaito/workspace --namespace kaito-workspace --create-namespace

Expected behavior

Logs

Environment

  • Kubernetes version (use kubectl version):
  • OS (e.g: cat /etc/os-release):
  • Install tools:
  • Others:

Additional context</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI changed the title from "[WIP] Fix deployment of NVIDIA device plugin on existing clusters" to "Make NVIDIA device plugin deployment optional, default disabled" Dec 22, 2025
Copilot AI requested a review from andyzhangx December 22, 2025 03:29
Copilot AI changed the title from "Make NVIDIA device plugin deployment optional, default disabled" to "Make NVIDIA device plugin deployment optional via feature gate" Dec 22, 2025
Copilot AI requested a review from andyzhangx December 22, 2025 03:35
andyzhangx marked this pull request as ready for review December 22, 2025 03:38
andyzhangx changed the title from "Make NVIDIA device plugin deployment optional via feature gate" to "feat: Make NVIDIA device plugin deployment optional via feature gate" Dec 22, 2025
andyzhangx requested a review from Copilot December 22, 2025 03:38
@kaito-pr-agent

Title

Make NVIDIA device plugin deployment optional via feature gate


Description

  • Added feature flag to control NVIDIA device plugin deployment

  • Updated Helm chart to conditionally render device plugin DaemonSet

  • Documented new configuration option in README

  • Set default plugin deployment to enabled for backward compatibility


Changes walkthrough 📝

Relevant files

Documentation

README.md: Document NVIDIA device plugin toggle feature
charts/kaito/workspace/README.md (+1/-0)

  • Added documentation for new nvidiaDevicePlugin.enabled configuration parameter
  • Specified default value (true) and usage guidance

Configuration changes

nvidia-device-plugin-ds.yaml: Add conditional rendering for NVIDIA device plugin
charts/kaito/workspace/templates/nvidia-device-plugin-ds.yaml (+2/-0)

  • Wrapped DaemonSet definition in conditional check for nvidiaDevicePlugin.enabled flag
  • Added template directives to enable/disable device plugin deployment

values.yaml: Implement feature flag for NVIDIA device plugin
charts/kaito/workspace/values.yaml (+2/-0)

  • Added nvidiaDevicePlugin.enabled configuration option
  • Set default value to true for backward compatibility

    @kaito-pr-agent

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🎫 Ticket compliance analysis 🔶

    1700 - Partially compliant

    Compliant requirements:

    • Make NVIDIA device plugin deployment optional
    • Provide configuration option to disable plugin deployment

    Non-compliant requirements:

    • Prevent device plugin from being deployed on CPU nodes

    Requires further human verification:

    • Verify duplicate deployments are avoided when plugin is disabled
    • Test behavior on clusters with existing NVIDIA device plugin
    ⏱️ Estimated effort to review: 1 🔵⚪⚪⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Node Selector Missing

    The DaemonSet has no node selector or affinity rule that restricts it to GPU nodes, so its pods can also be scheduled on CPU nodes.

    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: {{ .Release.Namespace }}
      labels:
        {{- include "kaito.labels" . | nindent 4 }}
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          labels:
            name: nvidia-device-plugin-ds
        spec:
          nodeSelector:
            kubernetes.io/os: linux
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  {{- if eq .Values.cloudProviderName "azure" }}
                  - key: kubernetes.azure.com/cluster
                    operator: Exists
                  - key: type
                    operator: NotIn
                    values:
                    - virtual-kubelet
                  {{- else if eq .Values.cloudProviderName "aws" }}
                  - key: "k8s.io/cloud-provider-aws"
                    operator: Exists
                  {{- end }}
          tolerations:
            # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
            # This, along with the annotation above marks this pod as a critical add-on.
            - key: CriticalAddonsOnly
              operator: Exists
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
            - key: "sku"
              operator: "Equal"
              value: "gpu"
              effect: "NoSchedule"
          priorityClassName: "system-node-critical"
          containers:
            - image: mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.17.0
              name: nvidia-device-plugin-ctr
              env:
                - name: FAIL_ON_INIT_ERROR
                  value: "false"
                - name: PASS_DEVICE_SPECS
                  value: "true"
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
              volumeMounts:
                - name: device-plugin
                  mountPath: /var/lib/kubelet/device-plugins
          volumes:
            - name: device-plugin
              hostPath:
                path: /var/lib/kubelet/device-plugins
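
One way to address this observation would be an extra match expression in the existing node affinity term. The sketch below is not part of this PR, and the label key and value are assumptions: `nvidia.com/gpu.present` is used only as an example of a label that GPU discovery tooling may set, and it should be replaced by whatever label actually marks GPU nodes in the target cluster.

```yaml
# Sketch only: not part of this PR. Label key/value are assumptions.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.present   # assumed GPU-node label
          operator: In
          values:
          - "true"
```

Expressions inside a single `matchExpressions` list are ANDed, so adding this entry to the existing term would narrow the current cloud-provider match rather than replace it.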

    Copilot AI left a comment


    Pull request overview

    This PR addresses the issue where Kaito unconditionally deploys the NVIDIA device plugin DaemonSet, causing conflicts when clusters already have it installed via GPU Operator or other means. The solution introduces a feature gate to make the deployment optional while maintaining backward compatibility by defaulting to enabled.

    Key Changes:

    • Added nvidiaDevicePlugin.enabled configuration flag (defaulting to true for backward compatibility)
    • Wrapped the NVIDIA device plugin DaemonSet template with conditional rendering based on the flag
    • Updated the Helm chart README with clear documentation about when to disable the feature

    Reviewed changes

    Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

    File | Description
    charts/kaito/workspace/values.yaml | Adds the nvidiaDevicePlugin.enabled flag with default value of true to maintain backward compatibility
    charts/kaito/workspace/templates/nvidia-device-plugin-ds.yaml | Wraps the entire DaemonSet resource with a conditional check on the new flag
    charts/kaito/workspace/README.md | Documents the new configuration option with usage guidance for users with existing device plugin installations

    The implementation is well-structured and follows existing patterns in the codebase. The conditional rendering syntax matches other feature gates in the repository, the configuration structure is consistent with other top-level settings in values.yaml, and the documentation clearly explains when users should disable this feature. No issues were identified during the review.



    @codecov

    codecov bot commented Dec 22, 2025

    Codecov Report

    ✅ All modified and coverable lines are covered by tests.

    @@           Coverage Diff           @@
    ##             main    #1707   +/-   ##
    =======================================
      Coverage   59.41%   59.41%           
    =======================================
      Files          92       92           
      Lines        8697     8697           
    =======================================
      Hits         5167     5167           
      Misses       3270     3270           
      Partials      260      260           
    Components | Coverage Δ
    workspace | 49.52% <ø> (ø)
    presets | 87.10% <ø> (ø)
    main | ∅ <ø> (∅)

    andyzhangx merged commit 319419b into main Dec 22, 2025
    24 checks passed
    andyzhangx deleted the copilot/fix-nvidia-plugin-deployment branch December 22, 2025 14:21
