Skip to content

Conversation

@mszadkow
Copy link
Contributor

@mszadkow mszadkow commented Oct 13, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

As stated here

enabling the feature risks dropping some conditions set on the Workload by a controller, or user.

Thus to lower the risk to minimum we introduce an option that preserves strict setting for SSA.
That keeps the functionality unchanged when used with WorkloadRequestUseMergePatch disabled and for enabled let scheduler be more strict on patch and let it retry on the next scheduling cycle.

Which issue(s) this PR fixes:

Relates to #7035

Special notes for your reviewer:

Does this PR introduce a user-facing change?

WorkloadRequestUseMergePatch: use "strict" mode for admission patches during scheduling which
sends the ResourceVersion of the workload being admitted for comparing by kube-apiserver. 
This fixes the race-condition issue that Workload conditions added concurrently by other controllers
could be removed during scheduling.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 13, 2025
@netlify
Copy link

netlify bot commented Oct 13, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit d64dff1
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68ee35ba2970260007cffd08

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 13, 2025
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, please also open another "test" PR which turns the feature gate as "true" so that we can make sure it works. We could then merge that PR in a follow up.

err := workload.PatchAdmissionStatus(ctx, s.client, origWorkload, s.clock, func() (*kueue.Workload, bool, error) {
return newWorkload, true, nil
}, workload.WithLoose())
}, workload.WithLooseOnApply())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should also be used here:

}, workload.WithLoose()); err != nil {

Do we have an integration or e2e test which covers that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am working on unit tests, but sure I can work on integration test as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we already have an integration test, you can check by enabling the feature gate, and say injecting a panic in that place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found one that fails from time to time because of the change:
Scheduler Fair Sharing Suite: [It] Scheduler when ClusterQueue head has inadmissible workload sticky workload deleted, next workload can admit

Copy link
Contributor

@mimowo mimowo Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one flakes not only on this PR: #7250

I was asking if we have an integration test already which exercises the other use of PatchAdmissionStatus in schedule

}, workload.WithLoose()); err != nil {

And in my first comment I also referenced the other place to update for consistency, as both are from scheduler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, does it flake if you run the tests in a loop with the changes in this PR, plus enabling the feature gate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please summarize the results of the testing.

Copy link
Contributor Author

@mszadkow mszadkow Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A successful with flag enabled run of:
ginkgo --json-report ./ginkgo.report -focus "SchedulerWithWaitForPodsReady" -r --race --procs=4 --repeat=20
No failures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@mimowo
Copy link
Contributor

mimowo commented Oct 13, 2025

Please also add a release note

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 13, 2025
@mszadkow mszadkow force-pushed the feat/7035-WorkloadRequestUseMergePatch-to-beta branch from 952183b to 3013cf0 Compare October 14, 2025 09:43
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. labels Oct 14, 2025
@mszadkow mszadkow force-pushed the feat/7035-WorkloadRequestUseMergePatch-to-beta branch from 3013cf0 to 6a7a46c Compare October 14, 2025 11:28
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2025
@mszadkow mszadkow force-pushed the feat/7035-WorkloadRequestUseMergePatch-to-beta branch from 6a7a46c to 0543121 Compare October 14, 2025 11:31
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2025
@mszadkow mszadkow force-pushed the feat/7035-WorkloadRequestUseMergePatch-to-beta branch from 0543121 to d64dff1 Compare October 14, 2025 11:36
@mimowo
Copy link
Contributor

mimowo commented Oct 15, 2025

Thanks 👍
/lgtm
/approve
/cherrypick release-0.14

@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.14 in a new PR and assign it to you.

In response to this:

Thanks 👍
/lgtm
/approve
/cherrypick release-0.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 15, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, mszadkow

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: a15d5e4b1bb22cc8643db3e0d1042f3cca620f39

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2025
@mimowo
Copy link
Contributor

mimowo commented Oct 15, 2025

We need to adjust the release note to be user-oriented
/release-note-edit

WorkloadRequestUseMergePatch: use "strict" for admission patches during scheduling by comparing 
ResourceVersion of the workload. This prevents occasional dropping of conditions added by other
controllers.

@mimowo
Copy link
Contributor

mimowo commented Oct 15, 2025

/release-note-edit

WorkloadRequestUseMergePatch: use "strict" mode for admission patches during scheduling which
sends the ResourceVersion of the workload being admitted for comparing by kube-apiserver. 
This prevents occasional dropping of conditions added by other controllers.

@mimowo
Copy link
Contributor

mimowo commented Oct 15, 2025

/remove-kind feature
/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Oct 15, 2025
@mimowo
Copy link
Contributor

mimowo commented Oct 15, 2025

/release-note-edit

WorkloadRequestUseMergePatch: use "strict" mode for admission patches during scheduling which
sends the ResourceVersion of the workload being admitted for comparing by kube-apiserver. 
This fixes the race-condition issue that Workload conditions added concurrently by other controllers
could be removed during scheduling.

@k8s-ci-robot k8s-ci-robot merged commit a4a39cf into kubernetes-sigs:main Oct 15, 2025
23 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.15 milestone Oct 15, 2025
@mimowo
Copy link
Contributor

mimowo commented Oct 15, 2025

/cherrypick release-0.14

@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: new pull request created: #7279

In response to this:

Thanks 👍
/lgtm
/approve
/cherrypick release-0.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: new pull request could not be created: failed to create pull request against kubernetes-sigs/kueue#release-0.14 from head k8s-infra-cherrypick-robot:cherry-pick-7246-to-release-0.14: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for k8s-infra-cherrypick-robot:cherry-pick-7246-to-release-0.14."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request","status":"422"}

In response to this:

/cherrypick release-0.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mimowo
Copy link
Contributor

mimowo commented Oct 15, 2025

@mszadkow please prepare the cherrypick manually

@mszadkow mszadkow deleted the feat/7035-WorkloadRequestUseMergePatch-to-beta branch October 15, 2025 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants