Skip to content

Conversation

@zhuangqh
Copy link
Collaborator

@zhuangqh zhuangqh commented Sep 25, 2025

Reason for Change:

  • enable nvme disk acceleration for normal workspace.
    if nvme disk is unavailable, use emptyDir instead.
  • cleanup deployment workload for existing workspace.

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

Notes for Reviewers:

@kaito-pr-agent
Copy link

kaito-pr-agent bot commented Sep 25, 2025

Title

(Describe updated until commit 9f368a7)

Migrate to StatefulSet for all inference workloads with NVMe disk support


Description

  • Unified inference workloads to use StatefulSet exclusively

  • Added local NVMe disk support for model weights storage

  • Updated tests to reflect StatefulSet transition

  • Enhanced workspace controller logic for inference handling


Changes walkthrough 📝

Relevant files
Tests
test_utils.go
Add mock StatefulSet for testing                                                 

pkg/utils/test/test_utils.go

  • Added MockStatefulSetUpdated for testing StatefulSet updates
  • Created complete StatefulSet spec with containers, volumes, and status
  • +54/-0   
    workspace_controller_test.go
    Update controller tests for StatefulSet migration               

    pkg/workspace/controllers/workspace_controller_test.go

  • Updated tests to use StatefulSet instead of Deployment
  • Added storagev1.StorageClass mock for NVMe disk tests
  • Modified test cases to validate StatefulSet behavior
  • +19/-13 
    preset_inferences_test.go
    Refactor inference tests for StatefulSet and NVMe               

    pkg/workspace/inference/preset_inferences_test.go

  • Added storagev1.StorageClass dependency to tests
  • Removed workload type differentiation in tests
  • Updated test validation for StatefulSet properties
  • +21/-45 
    Enhancement
    workspace_controller.go
    Refactor inference handling to use StatefulSet exclusively

    pkg/workspace/controllers/workspace_controller.go

  • Replaced Deployment handling with StatefulSet-only approach
  • Simplified pod spec retrieval by removing type switching
  • Unified existing object handling for StatefulSets
  • +3/-18   
    preset_inferences.go
    Standardize inference generation on StatefulSet                   

    pkg/workspace/inference/preset_inferences.go

  • Unified inference generation to always return StatefulSet
  • Added NVMe disk volume check during StatefulSet creation
  • Removed Deployment generation path completely
  • +16/-26 
    Configuration changes
    Makefile
    Adjust test execution parameters                                                 

    Makefile

  • Modified test configuration to target A100Required label
  • Reduced default test node count to 1
  • +2/-2     

    Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • @kaito-pr-agent
    Copy link

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    StatefulSet Assumption

    The code now assumes workload is always StatefulSet, removing Deployment support. Verify this architectural change aligns with requirements and doesn't break existing functionality.

    // inference parameters.
    
    var workloadObj client.Object
    workloadObj, err = inference.GeneratePresetInference(ctx, wObj, revisionStr, model, c.Client)
    if err != nil {
    	return
    }
    
    existingObj := &appsv1.StatefulSet{}
    desiredPodSpec := workloadObj.(*appsv1.StatefulSet).Spec.Template.Spec
    
    if err = resources.GetResource(ctx, wObj.Name, wObj.Namespace, c.Client, existingObj); err == nil {
    	klog.InfoS("An inference workload already exists for workspace", "workspace", klog.KObj(wObj))
    	annotations := existingObj.GetAnnotations()
    	if annotations == nil {
    		annotations = make(map[string]string)
    	}
    
    	currentRevisionStr, ok := annotations[kaitov1beta1.WorkspaceRevisionAnnotation]
    	// If the current workload revision matches the one in Workspace, we do not need to update it.
    	if ok && currentRevisionStr == revisionStr {
    		err = resources.CheckResourceStatus(workloadObj, c.Client, inferenceParam.ReadinessTimeout)
    		return
    	}
    
    	spec := &existingObj.Spec.Template.Spec
    
    	// Selectively update the pod spec fields that are relevant to inference,
    	// and leave the rest unchanged in case user has customized them.
    	spec.Containers[0].Env = desiredPodSpec.Containers[0].Env
    NVMe Volume Handling

    The conditional logic for NVMe volume claims requires validation to ensure correct behavior when local NVMe storage is unavailable.

    	// Use StatefulSet for all use cases to ensure consistent pod identity and storage management
    	// For multi-node distributed inference with vLLM, we need StatefulSet to ensure pods are
    	// created with individual identities (their ordinal indexes) -
    	// https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-identity
    	if shouldUseDistributedInference(gctx, numNodes) {
    		podOpts = append(podOpts, SetDistributedInferenceProbe)
    	}
    
    	ssOpts := []generator.TypedManifestModifier[generator.WorkspaceGeneratorContext, appsv1.StatefulSet]{
    		manifests.GenerateStatefulSetManifest(revisionNum, numNodes),
    	}
    
    	if checkIfNVMeAvailable(ctx, &gpuConfig, kubeClient) {
    		ssOpts = append(ssOpts, manifests.AddStatefulSetVolumeClaimTemplates(GenerateModelWeightsCacheVolume(ctx, workspaceObj, model)))
    	} else {
    		podOpts = append(podOpts, SetDefaultModelWeightsVolume)
    	}
    
    	podSpec, err := generator.GenerateManifest(gctx, podOpts...)
    	if err != nil {
    		return nil, err
    	}
    	ssOpts = append(ssOpts, manifests.SetStatefulSetPodSpec(podSpec))
    
    	return generator.GenerateManifest(gctx, ssOpts...)
    }
    Ginkgo Configuration

    The default test label changed from exclusion to inclusion of A100 tests. Confirm this intentional change in test strategy.

    GINKGO_LABEL ?= A100Required
    GINKGO_NODES ?= 1

    @kaito-pr-agent
    Copy link

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Safely handle type assertion for StatefulSet

    *The type assertion for workloadObj to appsv1.StatefulSet may panic if workloadObj
    is not of that type. Although the function GeneratePresetInference is expected to
    return a StatefulSet, we should handle the case where it does not to avoid a runtime
    panic. Check the type before asserting.

    pkg/workspace/controllers/workspace_controller.go [528-529]

    -existingObj := &appsv1.StatefulSet{}
    -desiredPodSpec := workloadObj.(*appsv1.StatefulSet).Spec.Template.Spec
    +statefulSet, ok := workloadObj.(*appsv1.StatefulSet)
    +if !ok {
    +    return fmt.Errorf("expected StatefulSet for inference workload, but got %T", workloadObj)
    +}
    +desiredPodSpec := statefulSet.Spec.Template.Spec
    Suggestion importance[1-10]: 7

    __

    Why: The type assertion should be safely handled to prevent runtime panics, though the context suggests StatefulSet is now exclusively used.

    Medium
    Add safe type assertion in test

    The type assertion may panic if createdObject is not a StatefulSet. Add a type check
    to safely handle the assertion and fail the test immediately with a clear error
    message if the type is unexpected.

    pkg/workspace/inference/preset_inferences_test.go [317]

    -statefulset := createdObject.(*appsv1.StatefulSet)
    +statefulset, ok := createdObject.(*appsv1.StatefulSet)
    +if !ok {
    +    t.Fatalf("expected StatefulSet but got %T", createdObject)
    +}
    Suggestion importance[1-10]: 6

    __

    Why: Adding a type check prevents test panics and improves error clarity, though the function should consistently return StatefulSet.

    Low
    General
    Document ephemeral storage behavior

    When NVMe is unavailable, SetDefaultModelWeightsVolume adds an emptyDir volume which
    is ephemeral. For StatefulSets, consider using a persistent volume instead since
    emptyDir doesn't persist across pod restarts. If ephemeral is acceptable, document
    this behavior.

    pkg/workspace/inference/preset_inferences.go [162-166]

     if checkIfNVMeAvailable(ctx, &gpuConfig, kubeClient) {
         ssOpts = append(ssOpts, manifests.AddStatefulSetVolumeClaimTemplates(GenerateModelWeightsCacheVolume(ctx, workspaceObj, model)))
     } else {
    +    // Using emptyDir for model weights (ephemeral storage)
         podOpts = append(podOpts, SetDefaultModelWeightsVolume)
     }
    Suggestion importance[1-10]: 5

    __

    Why: Adding a comment clarifies ephemeral storage behavior, but doesn't address functional concerns in StatefulSet usage.

    Low

    @bfoley13
    Copy link
    Collaborator

    /describe

    @zhuangqh zhuangqh changed the title feat: add local nvme disk support for all workspace feat: [BREAKING] use statefulset for all workspace Dec 18, 2025
    Signed-off-by: zhuangqh <[email protected]>
    @zhuangqh zhuangqh merged commit 3ab3f3d into kaito-project:main Dec 18, 2025
    18 of 20 checks passed
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Projects

    Status: Done

    Development

    Successfully merging this pull request may close these issues.

    4 participants