Skip to content

Conversation

@IrvingMg
Copy link
Contributor

@IrvingMg IrvingMg commented Nov 20, 2025

What type of PR is this?

Implements cache-based storage for AFS consumed resources, fixing race condition where consumed resources aren't visible to scheduler when entry penalties are deleted.

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #6710
Fixes #7688
Fixes #7693

Special notes for your reviewer:

Does this PR introduce a user-facing change?

AdmissionFairSharing: Fix the bug that occasionally a workload may get admitted from a busy LocalQueue,
bypassing the entry penalties.

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels Nov 20, 2025
@netlify
Copy link

netlify bot commented Nov 20, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 19b4c91
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/69270155998f2300086e9ebd

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 20, 2025
IrvingMg

This comment was marked as duplicate.

@@ -0,0 +1,70 @@
/*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve added a new subpackage for Admission Fair Sharing under the queue package, with the intention of also moving the AfsEntryPenalties struct there.

I think this will help tidy up the code and avoid the circular dependency in CalcLocalQueueFSUsage between the queue and workload packages when using AFS structs. This can be a follow-up

@IrvingMg
Copy link
Contributor Author

I’ve opened draft PR for early review.

@PBundyra PTAL.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 20, 2025
@IrvingMg IrvingMg force-pushed the afs-cache-implementation branch from 58f4cc0 to 7061af4 Compare November 21, 2025 11:36
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 21, 2025
@IrvingMg IrvingMg force-pushed the afs-cache-implementation branch from 7061af4 to 7585112 Compare November 21, 2025 12:23
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 21, 2025
@IrvingMg IrvingMg marked this pull request as ready for review November 21, 2025 14:00
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 21, 2025
@IrvingMg
Copy link
Contributor Author

/test pull-kueue-test-e2e-multikueue-main
/test pull-kueue-test-e2e-main-1-33

Look like network issue:

shasum: standard input: no properly formatted SHA checksum lines found
Failed to install helm-unittest
For support, go to https://github.com/kubernetes/helm
Error: plugin install hook for "unittest" exited with error
make: *** [Makefile-deps.mk:110: helm] Error 1

See: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/7780/pull-kueue-test-e2e-multikueue-main/1992955152935424000

@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 2e5e2b2d75000123f886a0eb1f630e054f142054

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: IrvingMg, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 26, 2025
@IrvingMg
Copy link
Contributor Author

  1. manual test on a live cluster for Kueue restart

Here are the logs with the steps I followed for the manual restart test on a Kind cluster:
afs-restart-results.log

@mimowo
Copy link
Contributor

mimowo commented Nov 26, 2025

Thank you, the raport lgtm

@PBundyra
Copy link
Contributor

Just some nits, other than that lgtm! Thanks @IrvingMg great job!

@PBundyra
Copy link
Contributor

Leaving unholding to @mimowo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also delete from afsConsumedResources here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mimowo
Copy link
Contributor

mimowo commented Nov 26, 2025

Leaving unholding to @mimowo

sg, I think the remaining aspects of the PR:

  1. ensure the Delete is called when the LQ is deleted to make sure no memory leak.
  2. potentially simplify handling of the corner case discussed in [AFS] Implement cache for AFS consumed resources #7780 (comment) (to be checked)

@IrvingMg IrvingMg force-pushed the afs-cache-implementation branch from 1c312e2 to 19b4c91 Compare November 26, 2025 13:32
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 26, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mimowo November 26, 2025 13:32
@mimowo
Copy link
Contributor

mimowo commented Nov 26, 2025

Thank you 👍
/unhold
/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 26, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 0672030c67b03082b7cb1c179a7bf82ce3674ceb

@k8s-ci-robot k8s-ci-robot merged commit b2b82cc into kubernetes-sigs:main Nov 26, 2025
28 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.15 milestone Nov 26, 2025
@mimowo
Copy link
Contributor

mimowo commented Nov 26, 2025

@IrvingMg please check if we can cherrypick. Since AFS is alpha on 0.13 and 0.14 this is not "must", but if we can, then it would be great. It would be great to try to backport to 0.14 at least.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

5 participants