
Conversation

@alexandernorth commented Jan 19, 2026

What this PR does / why we need it:
Enables discovery and metric collection for Custom Resources managed by aggregated API servers that do not have a local CRD. It does this by querying non-local APIServices for the resources they handle.

How does this change affect the cardinality of KSM:
By default, no change.

Which issue(s) this PR fixes:
Fixes #2471
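
For illustration, the mechanism described above might look roughly like the following sketch (not the PR's actual code; the function name is hypothetical). It lists APIService objects via the aggregator client, skips local ones (those without a Service reference, which are served by kube-apiserver itself and already covered by CRD discovery), and asks the discovery client which resources each remaining group/version serves:

package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
	aggregator "k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset"
)

// listAggregatedResources sketches the discovery flow: every APIService
// whose spec references a Service is backed by an aggregated API server,
// while local APIServices carry no Service reference.
func listAggregatedResources(cfg *rest.Config) error {
	aggClient, err := aggregator.NewForConfig(cfg)
	if err != nil {
		return err
	}
	discoveryClient, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}

	svcs, err := aggClient.ApiregistrationV1().APIServices().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, svc := range svcs.Items {
		if svc.Spec.Service == nil {
			continue // served locally; CRD discovery already covers it
		}
		gv := fmt.Sprintf("%s/%s", svc.Spec.Group, svc.Spec.Version)
		resourceList, err := discoveryClient.ServerResourcesForGroupVersion(gv)
		if err != nil {
			continue // e.g. the APIService is not Available yet
		}
		for _, r := range resourceList.APIResources {
			fmt.Printf("%s: %s (Kind=%s)\n", gv, r.Name, r.Kind)
		}
	}
	return nil
}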

@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage and cncf-cla: yes labels on Jan 19, 2026
@k8s-ci-robot (Contributor)

Welcome @alexandernorth!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alexandernorth
Once this PR has been reviewed and has the lgtm label, please assign catherinef-dev for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/L label (100-499 lines changed, ignoring generated files) on Jan 19, 2026
@alexandernorth changed the title from "split discovery into apiservice and crd discovery and monitoring" to "feat: split discovery into apiservice and crd discovery and monitoring" on Jan 19, 2026
// Pull group/version out of the APIService spec and ask the discovery
// client which resources that (aggregated) server exposes.
group := serviceSpec["group"].(string)
version := serviceSpec["version"].(string)

resourceList, err := discoveryClient.ServerResourcesForGroupVersion(fmt.Sprintf("%s/%s", group, version))
Contributor

This call runs on every add/update event for an APIService, which can get quite chatty since such updates can occur frequently. Do we want any dedup or backoff per group/version to avoid repeated discovery calls and potential API churn?

Author

Good point - I will look into this

Author

I have improved the logic so that the APIService must be Available before we query the API server for resources, reducing calls that would return no data (and thus hopefully the amount of churn too).

I am under the impression we should always perform this query whenever there is an update: we need to track all available resources, and a change to an APIService could result in new or removed Kinds, so dedup/backoff becomes tricky because we might miss updates. Since resources are queried by group+version, which should be unique within the cluster, I don't see a case where a single update produces multiple calls for the same group/version combination - please correct me if I am missing something here though.
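
For reference, the Available gate described here can be expressed along these lines (a sketch assuming the apiregistration v1 types; the helper name is illustrative, not the PR's actual code):

package sketch

import (
	apiregv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
)

// isAvailable reports whether the APIService currently advertises the
// Available condition; discovery is only worth querying once it does.
func isAvailable(svc *apiregv1.APIService) bool {
	for _, cond := range svc.Status.Conditions {
		if cond.Type == apiregv1.Available {
			return cond.Status == apiregv1.ConditionTrue
		}
	}
	return false
}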

Contributor

Thanks, that makes sense. The availability check sounds like a good improvement and should help reduce chattiness.
My original thought was mainly around repeated updates where the APIService object itself changes (like status churn) without a change in group/version, but I agree dedup/backoff gets tricky if we want to avoid missing updates.

@alexandernorth commented Jan 20, 2026

Yes, that's true. I considered this too, but I didn't find a "nice" solution that would still ensure we receive all updates - although in my (short-term) observations, unless the aggregation service is very unstable, not many updates are triggered.
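
For concreteness, one shape such a filter could take is sketched below. This is not what the PR implements, and the names are illustrative: it re-runs discovery only when the spec changed or the Available condition flipped, which is exactly the kind of scheme that risks masking other meaningful status transitions:

package sketch

import (
	"k8s.io/apimachinery/pkg/api/equality"
	apiregv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
)

// shouldRequery is a hypothetical dedup filter for APIService update events.
func shouldRequery(oldSvc, newSvc *apiregv1.APIService) bool {
	if !equality.Semantic.DeepEqual(oldSvc.Spec, newSvc.Spec) {
		return true // group/version/service changed; the served Kinds may differ
	}
	// Requery on an Available transition, but ignore pure status churn
	// such as heartbeat/timestamp-only updates.
	return available(oldSvc) != available(newSvc)
}

func available(svc *apiregv1.APIService) bool {
	for _, cond := range svc.Status.Conditions {
		if cond.Type == apiregv1.Available {
			return cond.Status == apiregv1.ConditionTrue
		}
	}
	return false
}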

Contributor

Makes sense, thanks

@bhope (Contributor) left a comment

Since runInformer() is now used for both CRDs and APIServices, the CRDsAddEventsCounter and CRDsCacheCountGauge metrics will also count APIService events. That might be a bit confusing from a metrics perspective. Should we consider renaming them or splitting by source (CRD vs APIService)?

@k8s-ci-robot added the size/XL label (500-999 lines changed, ignoring generated files) and removed the size/L label on Jan 20, 2026
@alexandernorth (Author)

As APIServices are dynamic, I realised I could not reuse exactly the system that existed for CRDs: once an APIService is no longer available, it can no longer be queried for its Kinds (in order to remove them from the cache map). I refactored the cache map so it is now keyed by the source of the discovered resource, the benefit being that we can handle the case where an APIService becomes unavailable. It does mean we no longer index by Group/Version, but that could be implemented if it is a requirement.

Regarding the generated metrics, I have consolidated and renamed them to apply to both CRDs and APIServices. I also removed the add metric, since add and update are implemented the same way in the new implementation. I also wonder whether the delete metric is necessary, or whether deletes should simply increment the update counter too. If I should not remove existing metrics, I can add them back and optionally add a consolidated counter covering adds/updates/deletes.

I refactored the GVK->chan map, moving the stop channel into the DiscoveredResource and lifecycling it as part of the Update/Delete process.

The refactor also fixes missing synchronisation: the cache map could previously be read without holding the lock (via the ResolveGVKToGVKPs function called in pkg/customresourcestate/config.go#L191).
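
A rough shape of what is described above (all type and field names here are illustrative, not the PR's actual definitions):

package sketch

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// DiscoveredResource pairs a discovered GVK with the stop channel of the
// machinery started for it, so Update/Delete can tear it down.
type DiscoveredResource struct {
	GVK  schema.GroupVersionKind
	Stop chan struct{}
}

// discoveryCache keys discovered resources by their source (a CRD or an
// APIService), so everything a source contributed can be dropped without
// re-querying it once it becomes unavailable.
type discoveryCache struct {
	mu    sync.RWMutex
	cache map[string][]DiscoveredResource
}

// Replace swaps in a source's current resources, stopping whatever the
// source previously provided.
func (c *discoveryCache) Replace(source string, resources []DiscoveredResource) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, old := range c.cache[source] {
		close(old.Stop)
	}
	c.cache[source] = resources
}

// Delete handles a source going away entirely (e.g. an APIService being
// removed), again stopping everything it contributed.
func (c *discoveryCache) Delete(source string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, old := range c.cache[source] {
		close(old.Stop)
	}
	delete(c.cache, source)
}

// Snapshot reads under the lock, illustrating the fix for the
// unsynchronised read mentioned above.
func (c *discoveryCache) Snapshot() []schema.GroupVersionKind {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var out []schema.GroupVersionKind
	for _, rs := range c.cache {
		for _, r := range rs {
			out = append(out, r.GVK)
		}
	}
	return out
}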

@alexandernorth requested a review from bhope on January 20, 2026 at 17:13
@bhope (Contributor) left a comment

Thanks for those refactors and detailed walkthrough.

On the metrics side, since these are already released (hence probably consumed), I’d prefer we keep the delete metric rather than folding it into update. Having delete counted separately is still useful to understand churn (resources dropping vs being refreshed). Besides, removing it would be a breaking change for existing dashboards.

@alexandernorth (Author)

That makes sense, I have added back the add metric. The metrics have the same names as before, but now count the merged CRD/APIService events (the naming was not CRD-specific). I updated the help text to be more generic, but I think this is fine as it should not change anything for existing consumers unless they now configure aggregation-layer resources.
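
For context, the resulting arrangement of per-event counters plus a cache-size gauge is commonly declared with client_golang along these lines (a sketch; the metric and variable names below are hypothetical, not the actual kube-state-metrics names):

package sketch

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical declarations mirroring the arrangement described above:
// per-event counters (delete kept separate for churn visibility) plus a
// cache-size gauge, all covering both CRD and APIService events.
var (
	addEventsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "sketch_discovery_add_events_total",
		Help: "Add events for discovered custom resources (CRDs and APIServices).",
	})
	deleteEventsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "sketch_discovery_delete_events_total",
		Help: "Delete events for discovered custom resources (CRDs and APIServices).",
	})
	cacheCount = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "sketch_discovery_cache_count",
		Help: "Discovered custom resources currently cached.",
	})
)

func init() {
	prometheus.MustRegister(addEventsTotal, deleteEventsTotal, cacheCount)
}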

@bhope (Contributor) left a comment

Thanks for the update and incorporating the feedback. Overall, looks good to me.

Labels: cncf-cla: yes, needs-triage, size/XL

Successfully merging this pull request may close the issue: CustomResourceState does not produce metrics for the aggregation layer if there is no CustomResourceDefinition defined

3 participants