feat: [BREAKING] use statefulset for all workspace #1523

zhuangqh · 2025-09-25T05:28:00Z

Reason for Change:

Enable NVMe disk acceleration for normal workspaces.
If an NVMe disk is unavailable, use emptyDir instead.
Clean up the Deployment workload for existing workspaces.
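
A minimal sketch of the fallback described above, assuming a simplified model of the decision. `VolumePlan` and `planModelWeightsVolume` are hypothetical names for illustration and are not part of KAITO or the Kubernetes API; the real change works on `appsv1.StatefulSet` manifests.

```go
package main

import "fmt"

// VolumePlan is a hypothetical, simplified stand-in for the storage part of a
// StatefulSet spec. It is not a KAITO or Kubernetes type.
type VolumePlan struct {
	// Source is "volumeClaimTemplate" (per-pod PVC) or "emptyDir".
	Source string
	// OutlivesPod reports whether the stored data survives deletion of the pod.
	// A PVC created from a volumeClaimTemplate is retained by default when its
	// pod is deleted, while an emptyDir is removed together with the pod.
	OutlivesPod bool
}

// planModelWeightsVolume picks the model-weights storage: a per-pod claim when
// local NVMe is available, otherwise an ephemeral emptyDir.
func planModelWeightsVolume(nvmeAvailable bool) VolumePlan {
	if nvmeAvailable {
		return VolumePlan{Source: "volumeClaimTemplate", OutlivesPod: true}
	}
	return VolumePlan{Source: "emptyDir", OutlivesPod: false}
}

func main() {
	for _, nvme := range []bool{true, false} {
		fmt.Printf("nvme=%v -> %+v\n", nvme, planModelWeightsVolume(nvme))
	}
}
```

In the emptyDir case the model weights have to be fetched again after the pod is deleted and recreated, which is the trade-off the fallback accepts.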

Requirements

added unit tests and e2e tests (if applicable).

Issue Fixed:

Notes for Reviewers:

kaito-pr-agent · 2025-09-25T05:28:57Z

Title

(Describe updated until commit `9f368a7`)

Migrate to StatefulSet for all inference workloads with NVMe disk support

Description

Unified inference workloads to use StatefulSet exclusively
Added local NVMe disk support for model weights storage
Updated tests to reflect StatefulSet transition
Enhanced workspace controller logic for inference handling

Changes walkthrough 📝

Relevant files

Tests

test_utils.go: `Add mock StatefulSet for testing` (pkg/utils/test/test_utils.go, +54/-0)
  - Added MockStatefulSetUpdated for testing StatefulSet updates
  - Created complete StatefulSet spec with containers, volumes, and status

workspace_controller_test.go: `Update controller tests for StatefulSet migration` (pkg/workspace/controllers/workspace_controller_test.go, +19/-13)
  - Updated tests to use StatefulSet instead of Deployment
  - Added storagev1.StorageClass mock for NVMe disk tests
  - Modified test cases to validate StatefulSet behavior

preset_inferences_test.go: `Refactor inference tests for StatefulSet and NVMe` (pkg/workspace/inference/preset_inferences_test.go, +21/-45)
  - Added storagev1.StorageClass dependency to tests
  - Removed workload type differentiation in tests
  - Updated test validation for StatefulSet properties

Enhancement

workspace_controller.go: `Refactor inference handling to use StatefulSet exclusively` (pkg/workspace/controllers/workspace_controller.go, +3/-18)
  - Replaced Deployment handling with StatefulSet-only approach
  - Simplified pod spec retrieval by removing type switching
  - Unified existing object handling for StatefulSets

preset_inferences.go: `Standardize inference generation on StatefulSet` (pkg/workspace/inference/preset_inferences.go, +16/-26)
  - Unified inference generation to always return StatefulSet
  - Added NVMe disk volume check during StatefulSet creation
  - Removed Deployment generation path completely

Configuration changes

Makefile: `Adjust test execution parameters` (Makefile, +2/-2)
  - Modified test configuration to target A100Required label
  - Reduced default test node count to 1

kaito-pr-agent · 2025-09-25T05:30:28Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

StatefulSet Assumption

The code now assumes the workload is always a StatefulSet, removing Deployment support. Verify that this architectural change aligns with requirements and doesn't break existing functionality.

    // inference parameters.
    var workloadObj client.Object
    workloadObj, err = inference.GeneratePresetInference(ctx, wObj, revisionStr, model, c.Client)
    if err != nil {
        return
    }

    existingObj := &appsv1.StatefulSet{}
    desiredPodSpec := workloadObj.(*appsv1.StatefulSet).Spec.Template.Spec

    if err = resources.GetResource(ctx, wObj.Name, wObj.Namespace, c.Client, existingObj); err == nil {
        klog.InfoS("An inference workload already exists for workspace", "workspace", klog.KObj(wObj))
        annotations := existingObj.GetAnnotations()
        if annotations == nil {
            annotations = make(map[string]string)
        }

        currentRevisionStr, ok := annotations[kaitov1beta1.WorkspaceRevisionAnnotation]
        // If the current workload revision matches the one in Workspace, we do not need to update it.
        if ok && currentRevisionStr == revisionStr {
            err = resources.CheckResourceStatus(workloadObj, c.Client, inferenceParam.ReadinessTimeout)
            return
        }

        spec := &existingObj.Spec.Template.Spec
        // Selectively update the pod spec fields that are relevant to inference,
        // and leave the rest unchanged in case user has customized them.
        spec.Containers[0].Env = desiredPodSpec.Containers[0].Env

NVMe Volume Handling

The conditional logic for NVMe volume claims requires validation to ensure correct behavior when local NVMe storage is unavailable.

        // Use StatefulSet for all use cases to ensure consistent pod identity and storage management
        // For multi-node distributed inference with vLLM, we need StatefulSet to ensure pods are
        // created with individual identities (their ordinal indexes) -
        // https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-identity
        if shouldUseDistributedInference(gctx, numNodes) {
            podOpts = append(podOpts, SetDistributedInferenceProbe)
        }

        ssOpts := []generator.TypedManifestModifier[generator.WorkspaceGeneratorContext, appsv1.StatefulSet]{
            manifests.GenerateStatefulSetManifest(revisionNum, numNodes),
        }

        if checkIfNVMeAvailable(ctx, &gpuConfig, kubeClient) {
            ssOpts = append(ssOpts, manifests.AddStatefulSetVolumeClaimTemplates(GenerateModelWeightsCacheVolume(ctx, workspaceObj, model)))
        } else {
            podOpts = append(podOpts, SetDefaultModelWeightsVolume)
        }

        podSpec, err := generator.GenerateManifest(gctx, podOpts...)
        if err != nil {
            return nil, err
        }

        ssOpts = append(ssOpts, manifests.SetStatefulSetPodSpec(podSpec))

        return generator.GenerateManifest(gctx, ssOpts...)
    }

Ginkgo Configuration

The default test label changed from excluding A100 tests to including them. Confirm that this change in test strategy is intentional.

    GINKGO_LABEL ?= A100Required
    GINKGO_NODES ?= 1
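
The revision check and selective update quoted in the StatefulSet Assumption excerpt can be illustrated with a small standalone sketch. It assumes simplified stand-in types: `Workload`, `Container`, `syncInferenceFields`, and the annotation key are hypothetical and only mimic the shape of the controller logic. Recording the new revision on the existing object is also an assumption, since the quoted excerpt stops before that point.

```go
package main

import "fmt"

// revisionAnnotation is a hypothetical key used only in this sketch.
const revisionAnnotation = "example.com/workspace-revision"

// Container and Workload are simplified stand-ins for the Kubernetes types.
type Container struct {
	Image string
	Env   map[string]string
}

type Workload struct {
	Annotations map[string]string
	Containers  []Container
}

// syncInferenceFields copies only the inference-related field (Env of the
// first container) from desired into existing, and reports whether existing
// was modified. Other fields, such as Image, are left as the user set them.
func syncInferenceFields(existing *Workload, desired Workload, revision string) bool {
	// Same revision: nothing to update. Reading a nil map is safe in Go.
	if current, ok := existing.Annotations[revisionAnnotation]; ok && current == revision {
		return false
	}
	// Guard the index access that the quoted excerpt performs unconditionally.
	if len(existing.Containers) == 0 || len(desired.Containers) == 0 {
		return false
	}
	existing.Containers[0].Env = desired.Containers[0].Env
	if existing.Annotations == nil {
		existing.Annotations = make(map[string]string)
	}
	// Assumption: the new revision is recorded so the next pass is a no-op.
	existing.Annotations[revisionAnnotation] = revision
	return true
}

func main() {
	existing := &Workload{
		Annotations: map[string]string{revisionAnnotation: "1"},
		Containers:  []Container{{Image: "user/custom:tag", Env: map[string]string{"MODEL": "old"}}},
	}
	desired := Workload{
		Containers: []Container{{Image: "preset/default:tag", Env: map[string]string{"MODEL": "new"}}},
	}
	changed := syncInferenceFields(existing, desired, "2")
	fmt.Println(changed, existing.Containers[0].Image, existing.Containers[0].Env["MODEL"])
}
```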

kaito-pr-agent · 2025-09-25T05:33:24Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue · Impact: Medium

Safely handle type assertion for StatefulSet

The type assertion for `workloadObj` to `appsv1.StatefulSet` may panic if `workloadObj` is not of that type. Although the function `GeneratePresetInference` is expected to return a StatefulSet, we should handle the case where it does not to avoid a runtime panic. Check the type before asserting.

pkg/workspace/controllers/workspace_controller.go [528-529]

    -existingObj := &appsv1.StatefulSet{}
    -desiredPodSpec := workloadObj.(*appsv1.StatefulSet).Spec.Template.Spec
    +statefulSet, ok := workloadObj.(*appsv1.StatefulSet)
    +if !ok {
    +    return fmt.Errorf("expected StatefulSet for inference workload, but got %T", workloadObj)
    +}
    +desiredPodSpec := statefulSet.Spec.Template.Spec

Suggestion importance[1-10]: 7

Why: The type assertion should be safely handled to prevent runtime panics, though the context suggests StatefulSet is now exclusively used.

Category: Possible issue · Impact: Low

Add safe type assertion in test

The type assertion may panic if `createdObject` is not a StatefulSet. Add a type check to safely handle the assertion and fail the test immediately with a clear error message if the type is unexpected.

pkg/workspace/inference/preset_inferences_test.go [317]

    -statefulset := createdObject.(*appsv1.StatefulSet)
    +statefulset, ok := createdObject.(*appsv1.StatefulSet)
    +if !ok {
    +    t.Fatalf("expected StatefulSet but got %T", createdObject)
    +}

Suggestion importance[1-10]: 6

Why: Adding a type check prevents test panics and improves error clarity, though the function should consistently return StatefulSet.

Category: General · Impact: Low

Document ephemeral storage behavior

When NVMe is unavailable, `SetDefaultModelWeightsVolume` adds an emptyDir volume which is ephemeral. For StatefulSets, consider using a persistent volume instead since emptyDir doesn't persist across pod restarts. If ephemeral is acceptable, document this behavior.

pkg/workspace/inference/preset_inferences.go [162-166]

     if checkIfNVMeAvailable(ctx, &gpuConfig, kubeClient) {
         ssOpts = append(ssOpts, manifests.AddStatefulSetVolumeClaimTemplates(GenerateModelWeightsCacheVolume(ctx, workspaceObj, model)))
     } else {
    +    // Using emptyDir for model weights (ephemeral storage)
         podOpts = append(podOpts, SetDefaultModelWeightsVolume)
     }

Suggestion importance[1-10]: 5

Why: Adding a comment clarifies ephemeral storage behavior, but doesn't address functional concerns in StatefulSet usage.
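
The two type-assertion suggestions above rely on Go's comma-ok form. The following is a minimal standalone sketch, assuming stand-in types: `Object`, `StatefulSet`, `Deployment`, and `asStatefulSet` are hypothetical and only mirror the relationship between `client.Object` and `*appsv1.StatefulSet`.

```go
package main

import "fmt"

// Object is a hypothetical stand-in for an interface such as client.Object.
type Object interface {
	GetName() string
}

// StatefulSet and Deployment are simplified stand-ins. Their methods use
// pointer receivers, so only *StatefulSet and *Deployment satisfy Object.
type StatefulSet struct{ Name string }

func (s *StatefulSet) GetName() string { return s.Name }

type Deployment struct{ Name string }

func (d *Deployment) GetName() string { return d.Name }

// asStatefulSet uses the comma-ok form, so an unexpected dynamic type yields
// an error instead of a panic. Asserting to the value type, obj.(StatefulSet),
// would not even compile here because StatefulSet does not implement Object.
func asStatefulSet(obj Object) (*StatefulSet, error) {
	ss, ok := obj.(*StatefulSet)
	if !ok {
		return nil, fmt.Errorf("expected *StatefulSet, got %T", obj)
	}
	return ss, nil
}

func main() {
	if ss, err := asStatefulSet(&StatefulSet{Name: "workspace"}); err == nil {
		fmt.Println("ok:", ss.GetName())
	}
	if _, err := asStatefulSet(&Deployment{Name: "legacy"}); err != nil {
		fmt.Println("error:", err)
	}
}
```

The single-value form `obj.(*StatefulSet)` panics at run time when the dynamic type differs, which is the failure mode both suggestions aim to avoid.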

bfoley13 · 2025-09-29T23:59:28Z

/describe

zhuangqh requested review from Fei-Guo and chewong as code owners September 25, 2025 05:28

github-project-automation bot added this to KAITO Roadmap Sep 25, 2025

zhuangqh had a problem deploying to e2e-test September 25, 2025 05:28 — with GitHub Actions Failure

zhuangqh temporarily deployed to unit-tests September 25, 2025 05:28 — with GitHub Actions Inactive

kaito-pr-agent bot added the Review effort 3/5 label Sep 25, 2025

andyzhangx reviewed Sep 25, 2025

pkg/workspace/controllers/workspace_controller.go (resolved)

zhuangqh force-pushed the sts branch from 9f368a7 to 0e7f87f Compare December 9, 2025 07:36

zhuangqh temporarily deployed to unit-tests December 9, 2025 07:36 — with GitHub Actions Inactive

zhuangqh had a problem deploying to e2e-test December 9, 2025 07:36 — with GitHub Actions Failure

zhuangqh force-pushed the sts branch from 0e7f87f to 042ca03 Compare December 9, 2025 11:18

zhuangqh temporarily deployed to unit-tests December 9, 2025 11:18 — with GitHub Actions Inactive

zhuangqh had a problem deploying to e2e-test December 9, 2025 11:18 — with GitHub Actions Failure

zhuangqh temporarily deployed to unit-tests December 9, 2025 11:18 — with GitHub Actions Inactive

zhuangqh had a problem deploying to e2e-test December 9, 2025 11:18 — with GitHub Actions Error

zhuangqh temporarily deployed to unit-tests December 10, 2025 06:00 — with GitHub Actions Inactive

zhuangqh temporarily deployed to e2e-test December 10, 2025 06:00 — with GitHub Actions Inactive

zhuangqh had a problem deploying to e2e-test December 10, 2025 06:00 — with GitHub Actions Error

zhuangqh added 3 commits December 18, 2025 12:18

feat: add local nvme disk support for all workspace

37f3592

Signed-off-by: zhuangqh <[email protected]>

fix

5cfec91

Signed-off-by: zhuangqh <[email protected]>

remove deployment

a528b64

Signed-off-by: zhuangqh <[email protected]>

zhuangqh force-pushed the sts branch from 2c59cc1 to a528b64 Compare December 18, 2025 01:18

zhuangqh temporarily deployed to unit-tests December 18, 2025 01:18 — with GitHub Actions Inactive

zhuangqh temporarily deployed to e2e-test December 18, 2025 01:18 — with GitHub Actions Inactive

zhuangqh had a problem deploying to e2e-test December 18, 2025 01:18 — with GitHub Actions Error

zhuangqh changed the title ~~feat: add local nvme disk support for all workspace~~ feat: [BREAKING] use statefulset for all workspace Dec 18, 2025

andyzhangx reviewed Dec 18, 2025

pkg/workspace/inference/preset_inferences.go (resolved)

Fei-Guo reviewed Dec 18, 2025

pkg/workspace/controllers/workspace_controller.go (resolved)

address comment

f484523

Signed-off-by: zhuangqh <[email protected]>

zhuangqh temporarily deployed to e2e-test December 18, 2025 03:02 — with GitHub Actions Inactive

zhuangqh temporarily deployed to unit-tests December 18, 2025 03:02 — with GitHub Actions Inactive

zhuangqh had a problem deploying to unit-tests December 18, 2025 03:02 — with GitHub Actions Failure

zhuangqh requested a deployment to e2e-test December 18, 2025 03:02 — with GitHub Actions Waiting

zhuangqh temporarily deployed to unit-tests December 18, 2025 03:40 — with GitHub Actions Inactive

andyzhangx approved these changes Dec 18, 2025

Fei-Guo approved these changes Dec 18, 2025

zhuangqh merged commit 3ab3f3d into kaito-project:main Dec 18, 2025
18 of 20 checks passed

github-project-automation bot moved this to Done in KAITO Roadmap Dec 18, 2025

4 participants
