
Conversation

@j-skiba
Contributor

@j-skiba j-skiba commented Oct 21, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This change updates the kueue_cluster_queue_weighted_share and kueue_cohort_weighted_share metrics to report precise fair-sharing weights rather than rounded values, and adds a cohort label to kueue_cluster_queue_weighted_share for better context.

Which issue(s) this PR fixes:

Fixes #7244

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Observability: Adjust the `cluster_queue_weighted_share` and `cohort_weighted_share` metrics to report the precise value for the Weighted share, rather than the value rounded to an integer. Also, expand the `cluster_queue_weighted_share` metric with the "cohort" label.
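
For illustration, here is a minimal sketch of what such a float-valued gauge with a `cohort` label could look like, assuming the Prometheus client_golang library; the package layout, variable names, and helper function are illustrative, not the PR's actual code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// clusterQueueWeightedShare is an illustrative GaugeVec: float-valued and
// keyed by both the ClusterQueue name and its cohort.
var clusterQueueWeightedShare = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Subsystem: "kueue",
		Name:      "cluster_queue_weighted_share",
		Help:      "Precise (non-rounded) fair-sharing weighted share of the ClusterQueue.",
	},
	[]string{"cluster_queue", "cohort"},
)

func init() {
	prometheus.MustRegister(clusterQueueWeightedShare)
}

// ReportClusterQueueWeightedShare records the precise value for one ClusterQueue.
func ReportClusterQueueWeightedShare(cq, cohort string, value float64) {
	clusterQueueWeightedShare.WithLabelValues(cq, cohort).Set(value)
}
```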

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. labels Oct 21, 2025
@netlify

netlify bot commented Oct 21, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | a4916a2 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68ff38044a2b5a0008beb5d6 |

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 21, 2025
@k8s-ci-robot
Contributor

Hi @j-skiba. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 21, 2025
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 21, 2025
@k8s-triage-robot

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla


@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 21, 2025
@mbobrovskyi
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 22, 2025
@mbobrovskyi
Contributor

@j-skiba please rebase

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 22, 2025
@j-skiba j-skiba marked this pull request as ready for review October 22, 2025 06:00
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2025
Comment on lines 694 to 696
if s.PreciseWeightedShare == math.Inf(1) {
return math.MaxInt64
}
Contributor

@mimowo mimowo Oct 22, 2025

This is tricky; we need a test for this scenario. I'm worried this will be cumbersome to visualize in a tool like Grafana. For example, consider visualizing on one graph the DRS from two CQs: one with weight=0 and one with weight=1. The one with weight=0 will have DRS=MaxInt64, flattening the entire plot for the other CQs (IIUC). Maybe Grafana could deal with that somehow, but then it needs to be investigated.

Instead of using MaxInt64 I would rather report MaxRange + 1, MaxRange currently being 1000. wdyt @amy @gabesaba ?

Contributor

cc @pajakd

Contributor Author

I added this just to cover the case; the comment here says that functional branches should never reach this point.

Contributor Author

But nevertheless, it might be worth handling this as you said.

Contributor

Well, I think this comment is a bit tricky: https://github.com/j-skiba/kueue/blob/5cf835a6bf0dd22feee6b9bf650738f71a99cd3e/pkg/cache/scheduler/fair_sharing.go#L72 - it assumed the state of the world as it was before.

Now, with this new feature, this is a new functional branch, so that comment would no longer be accurate; I would like to adjust it.

But nevertheless, it might be worth handling this as you said.

I think so. The only scenario I'm not totally sure about is when someone is reducing the CQ quota: the CQ might then be temporarily running "overcommitted", and thus above 1000. You may want to check experimentally whether this scenario is real. If it is, then assuming 1001 might indeed be tricky.

It would be good to consider what range is actually possible. Using MaxInt64 for the metric is weird.

Contributor

flattening the entire plot for the other CQs (IIUC). Maybe Grafana could deal with that somehow, but then it needs to be investigated.

Yeah... this does not sound great. If this is the case, can you look into whether Grafana can cap the Y axis for the viewing window?

Contributor Author

@j-skiba j-skiba Oct 23, 2025

What about setting the metric value to NaN if the weight equals 0.0, and noting that in the metric's description?

The only scenario I'm not totally sure about is when someone is reducing the CQ quota: the CQ might then be temporarily running "overcommitted", and thus above 1000.

Considering the metric value can theoretically be anything from 0 to over 1000 (especially in the "overcommitted" scenario you mentioned), the NaN approach might be fine. Grafana has a value-mapping feature that can handle special values: https://grafana.com/docs/grafana/latest/panels-visualizations/configure-value-mappings/#special. By default, Grafana skips metrics with NaN values.

That said, I'm not sure whether using NaN like this is a good pattern; just throwing out an idea.
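
As a toy illustration of the NaN idea (a standalone example, not Kueue code; the metric name and port are made up), client_golang gauges accept NaN, and the /metrics text exposition renders it as NaN, which scrapers ingest and Grafana skips by default:

```go
package main

import (
	"math"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Toy gauge for illustration only.
	g := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "demo_weighted_share",
		Help: "Demo gauge that is set to NaN when the fair-sharing weight is zero.",
	})
	prometheus.MustRegister(g)

	weight := 0.0
	if weight == 0 {
		// Rendered as "demo_weighted_share NaN" on /metrics.
		g.Set(math.NaN())
	}

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}
```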

Contributor

Using NaN feels better than MaxInt64 or 1001. I'm OK with that approach if there are no other voices or better ideas. I would also log NaN for consistency.

Contributor Author

Just to clarify my understanding and confirm the logic:

  1. The return value from this WeightedShare() method is only used to set the status.fairSharing.WeightedShare field. This method handles Inf by returning math.MaxInt64, which is fine for the status API.

  2. The Prometheus metric, on the other hand, gets its value from the raw PreciseWeightedShare(). This value can be Inf, which is what could cause the issue with flattened graphs.

Therefore, the change I pushed (in clusterqueue_controller.go) to convert Inf to NaN specifically for the metric seems fine. It solves the graphing problem without impacting the status field's logic.
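
A hedged sketch of that conversion (illustrative names only, not the exact clusterqueue_controller.go code): the status field keeps its MaxInt64 clamp, while the metric path maps +Inf to NaN before the gauge is set.

```go
package clusterqueue

import "math"

// metricValueForWeightedShare maps the precise weighted share to the value
// reported on the metric: +Inf (weight of zero) becomes NaN so dashboards can
// skip the series instead of plotting MaxInt64 and flattening every other CQ.
func metricValueForWeightedShare(preciseWeightedShare float64) float64 {
	if math.IsInf(preciseWeightedShare, 1) {
		return math.NaN()
	}
	return preciseWeightedShare
}
```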

@amy
Contributor

amy commented Oct 22, 2025

Can you also add the scheduling cycle number? We need something to collate the values within a tournament. I'm not sure about the metric cardinality for that, though.

Perhaps the tournament correlation needs to be done via logs. But yeah, the main context that matters is the DRS values grouped within a tournament.

@mimowo
Contributor

mimowo commented Oct 23, 2025

Can you also add the scheduling cycle number? We need something to collate the values within a tournament. I'm not sure about the metric cardinality for that, though.

Yes, I don't think we should be exposing the schedulingCycle counter. This is more of a technical detail.

Also, what exactly would you correlate the schedulingCycle with even if it were exposed? This is very transient state (the state of the tournament) which is only partially recorded in logs. If we are going to correlate it precisely with the logs anyway, we could just log the DRS at a higher logging level while the tournament is happening.

To make this easier, I'm thinking we could even have a small script in the Kueue repo, a kind of "fair sharing log analyzer", which would rely on logs at, say, V4+.

@amy
Contributor

amy commented Oct 23, 2025

Also, what exactly would you correlate the schedulingCycle with even if it were exposed? This is very transient state (the state of the tournament) which is only partially recorded in logs. If we are going to correlate it precisely with the logs anyway, we could just log the DRS at a higher logging level while the tournament is happening.

The context for why we want higher-precision DRS instrumentation (regardless of whether it is via metrics or logging) is so that operators can validate the scheduling logic. When we originally found the rounding errors for fair-share tournaments, we looked at both the workload/CQ with the wrong DRS value and the competitors in the tournament.

A different question would be why expose DRS with higher precision at all, given it's pretty transient and doesn't really make sense outside the context of a tournament.

To make this easier, I'm thinking we could even have a small script in the Kueue repo, a kind of "fair sharing log analyzer", which would rely on logs at, say, V4+.

Sounds like an interesting idea!

@j-skiba
Contributor Author

j-skiba commented Oct 24, 2025

@mimowo should I also change how cohort_weighted_share is reported? It's a similar case to cluster_queue_weighted_share with respect to the max-int value:

If the Cohort has a weight of zero and is borrowing, this will return 9223372036854775807,

@mimowo
Contributor

mimowo commented Oct 24, 2025

The context for why we want higher-precision DRS instrumentation (regardless of whether it is via metrics or logging) is so that operators can validate the scheduling logic. When we originally found the rounding errors for fair-share tournaments, we looked at both the workload/CQ with the wrong DRS value and the competitors in the tournament.

Indeed, the value of the metric and the API field for the CQ is bumped in the cluster_queue controller, which is by design decoupled from the scheduler's value held in the cache; see here.

However, metrics are only scraped by tools like Prometheus at intervals, by default 15s, so this will also not give us a super precise tool for debugging.

A different question would be why expose DRS with higher precision at all, given it's pretty transient and doesn't really make sense outside the context of a tournament.

That is a valid question to ask. As mentioned above, neither the API nor the metric will give us the exact value as used by the scheduler (at least I have no idea how to do that). We can only get as close as possible with an approximation, hence the proposal to increase the precision of the metric.

To make this easier, I'm thinking we could even have a small script in the Kueue repo, a kind of "fair sharing log analyzer", which would rely on logs at, say, V4+.
Sounds like an interesting idea!

Well, this is currently the only idea I have for exposing the precise point-in-time DRS values as used by the scheduler.

@mimowo
Contributor

mimowo commented Oct 24, 2025

cc @PBundyra @mwielgus who are also looking into debuggability of DRS

@mimowo
Contributor

mimowo commented Oct 24, 2025

@mimowo should I also change how cohort_weighted_share is reported? It's a similar case to cluster_queue_weighted_share with respect to the max-int value:

Oh yes, I think if we change it for ClusterQueue, then we should change it for Cohort in sync, so please update the PR.

However, be aware the discussion may continue, as the question of whether this is needed at all was raised in #7338 (comment).

@amy
Contributor

amy commented Oct 24, 2025

However, metrics are only scraped by tools like Prometheus at intervals, by default 15s, so this will also not give us a super precise tool for debugging.

However, be aware the discussion may continue, as the question of whether this is needed at all was raised in #7338 (comment).

Ah, alrighty. This metric without schedulingCycle could still be useful! We can retroactively correlate it with other metrics roughly by time. (For example, at the most basic level, when we expect a CQ to be bursting/using its guarantees, or what the potential values could be when we have high CQ weights. Then we use those clues to dig further into the logs.)

@mimowo
Contributor

mimowo commented Oct 27, 2025

/release-note-edit

Adjust the `cluster_queue_weighted_share` and `cohort_weighted_share` metrics to report the precise value for the Weighted share, rather than the value rounded to an integer. Also, expand the `cluster_queue_weighted_share` metric with the "cohort" label.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Oct 27, 2025
@mimowo
Contributor

mimowo commented Oct 27, 2025

Thanks 👍
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 27, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: j-skiba, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 709e86591d5654964567426a60051bb5ec7018be

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 27, 2025
@k8s-ci-robot k8s-ci-robot merged commit 4d88320 into kubernetes-sigs:main Oct 27, 2025
23 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.15 milestone Oct 27, 2025
mbobrovskyi pushed a commit to epam/kubernetes-kueue that referenced this pull request Oct 27, 2025
…ics (kubernetes-sigs#7338)

* use float instead of int in cluster_queue_weighted_share metric and add cohort label

* don't use two fields for weighted share

* adjust metric test util to the changes

* make ExpectClusterQueueWeightedShareMetric accept float64 as value

* adjust integration test

* report NaN instead of max_int when weight is 0

* remove unused imports in e2e tests

* use float instead of int in cohort_weighted_share metric

* fix format and naming cleanup
Singularity23x0 pushed a commit to Singularity23x0/kueue that referenced this pull request Nov 3, 2025
…ics (kubernetes-sigs#7338)

* use float instead of int in cluster_queue_weighted_share metric and add cohort label

* don't use two fields for weighted share

* adjust metric test util to the changes

* make ExpectClusterQueueWeightedShareMetric accept float64 as value

* adjust integration test

* report NaN instead of max_int when weight is 0

* remove unused imports in e2e tests

* use float instead of int in cohort_weighted_share metric

* fix format and naming cleanup
@mimowo
Contributor

mimowo commented Nov 28, 2025

/release-note-edit

Observability: Adjust the `cluster_queue_weighted_share` and `cohort_weighted_share` metrics to report the precise value for the Weighted share, rather than the value rounded to an integer. Also, expand the `cluster_queue_weighted_share` metric with the "cohort" label.


Successfully merging this pull request may close these issues.

Expose contextualized FairSharing Weights for ClusterQueues as metrics
