
Conversation

@alexandernorth commented Jan 19, 2026

What this PR does / why we need it:
Enables discovery and metric collection for Custom Resources managed by aggregated API servers that do not have a local CRD. It does this by querying non-local APIServices for the resources they handle.

How does this change affect the cardinality of KSM:
By default, no change.

Which issue(s) this PR fixes:
Fixes #2471
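
For illustration, the mechanism described above might look roughly like the following sketch (not the PR's actual code; the function name is hypothetical). It lists APIService objects via the aggregator client, skips local ones (those without a Service reference, which are served by kube-apiserver itself and already covered by CRD discovery), and asks the discovery client which resources each remaining group/version serves:

package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
	aggregator "k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset"
)

// listAggregatedResources sketches the discovery flow: every APIService
// whose spec references a Service is backed by an aggregated API server,
// while local APIServices carry no Service reference.
func listAggregatedResources(cfg *rest.Config) error {
	aggClient, err := aggregator.NewForConfig(cfg)
	if err != nil {
		return err
	}
	discoveryClient, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}

	svcs, err := aggClient.ApiregistrationV1().APIServices().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, svc := range svcs.Items {
		if svc.Spec.Service == nil {
			continue // served locally; CRD discovery already covers it
		}
		gv := fmt.Sprintf("%s/%s", svc.Spec.Group, svc.Spec.Version)
		resourceList, err := discoveryClient.ServerResourcesForGroupVersion(gv)
		if err != nil {
			continue // e.g. the APIService is not Available yet
		}
		for _, r := range resourceList.APIResources {
			fmt.Printf("%s: %s (Kind=%s)\n", gv, r.Name, r.Kind)
		}
	}
	return nil
}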

@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage and cncf-cla: yes labels on Jan 19, 2026
@k8s-ci-robot (Contributor)

Welcome @alexandernorth!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alexandernorth
Once this PR has been reviewed and has the lgtm label, please assign catherinef-dev for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/L label (100-499 lines changed, ignoring generated files) on Jan 19, 2026
@alexandernorth changed the title from "split discovery into apiservice and crd discovery and monitoring" to "feat: split discovery into apiservice and crd discovery and monitoring" on Jan 19, 2026
// Pull group/version out of the APIService spec and ask the discovery
// client which resources that (aggregated) server exposes.
group := serviceSpec["group"].(string)
version := serviceSpec["version"].(string)

resourceList, err := discoveryClient.ServerResourcesForGroupVersion(fmt.Sprintf("%s/%s", group, version))
Contributor

This call runs on every add/update event for an APIService, which can get quite chatty since such updates can occur frequently. Do we want any dedup or backoff per group/version to avoid repeated discovery calls and potential API churn?

Author

Good point - I will look into this

Author

I have improved the logic so that the APIService must be Available before we query the API server for resources, reducing calls that would return no data (and thus hopefully the amount of churn too).

I am under the impression we should always perform this query whenever there is an update: we need to track all available resources, and a change to an APIService could result in new or removed Kinds, so dedup/backoff becomes tricky because we might miss updates. Since resources are queried by group+version, which should be unique within the cluster, I don't see a case where a single update produces multiple calls for the same group/version combination - please correct me if I am missing something here though.
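
For reference, the Available gate described here can be expressed along these lines (a sketch assuming the apiregistration v1 types; the helper name is illustrative, not the PR's actual code):

package sketch

import (
	apiregv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
)

// isAvailable reports whether the APIService currently advertises the
// Available condition; discovery is only worth querying once it does.
func isAvailable(svc *apiregv1.APIService) bool {
	for _, cond := range svc.Status.Conditions {
		if cond.Type == apiregv1.Available {
			return cond.Status == apiregv1.ConditionTrue
		}
	}
	return false
}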

Contributor

Thanks, that makes sense. The availability check sounds like a good improvement and should help reduce chattiness.
My original thought was mainly around repeated updates where the APIService object itself changes (like status churn) without a change in group/version, but I agree dedup/backoff gets tricky if we want to avoid missing updates.

@alexandernorth commented Jan 20, 2026

Yes, that's true. I considered this too, but I didn't find a "nice" solution that would still ensure we receive all updates - although in my (short-term) observations, unless the aggregation service is very unstable, not many updates are triggered.
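
For concreteness, one shape such a filter could take is sketched below. This is not what the PR implements, and the names are illustrative: it re-runs discovery only when the spec changed or the Available condition flipped, which is exactly the kind of scheme that risks masking other meaningful status transitions:

package sketch

import (
	"k8s.io/apimachinery/pkg/api/equality"
	apiregv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
)

// shouldRequery is a hypothetical dedup filter for APIService update events.
func shouldRequery(oldSvc, newSvc *apiregv1.APIService) bool {
	if !equality.Semantic.DeepEqual(oldSvc.Spec, newSvc.Spec) {
		return true // group/version/service changed; the served Kinds may differ
	}
	// Requery on an Available transition, but ignore pure status churn
	// such as heartbeat/timestamp-only updates.
	return available(oldSvc) != available(newSvc)
}

func available(svc *apiregv1.APIService) bool {
	for _, cond := range svc.Status.Conditions {
		if cond.Type == apiregv1.Available {
			return cond.Status == apiregv1.ConditionTrue
		}
	}
	return false
}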

Contributor

Makes sense, thanks

@bhope (Contributor) left a comment

Since runInformer() is now used for both CRDs and APIServices, the CRDsAddEventsCounter and CRDsCacheCountGauge metrics will also count APIService events. That might be a bit confusing from a metrics perspective. Should we consider renaming them or splitting by source (CRD vs APIService)?

@k8s-ci-robot added the size/XL label (500-999 lines changed, ignoring generated files) and removed the size/L label on Jan 20, 2026
@alexandernorth (Author)

As APIServices are dynamic, I realised I could not reuse exactly the system that existed for CRDs: once an APIService is no longer available, it can no longer be queried for its Kinds (in order to remove them from the cache map). I refactored the cache map so it is now keyed by the source of the discovered resource, the benefit being that we can handle the case where an APIService becomes unavailable. It does mean we no longer index by Group/Version, but that could be implemented if it is a requirement.

Regarding the generated metrics, I have consolidated and renamed them to apply to both CRDs and APIServices. I also removed the add metric, since add and update are implemented the same way in the new implementation. I also wonder whether the delete metric is necessary, or whether deletes should simply increment the update counter too. If I should not remove existing metrics, I can add them back and optionally add a consolidated counter covering adds/updates/deletes.

I refactored the GVK->chan map, moving the stop channel into the DiscoveredResource and lifecycling it as part of the Update/Delete process.

The refactor also fixes missing synchronisation: the cache map could previously be read without holding the lock (via the ResolveGVKToGVKPs function called in pkg/customresourcestate/config.go#L191).
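
A rough shape of what is described above (all type and field names here are illustrative, not the PR's actual definitions):

package sketch

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// DiscoveredResource pairs a discovered GVK with the stop channel of the
// machinery started for it, so Update/Delete can tear it down.
type DiscoveredResource struct {
	GVK  schema.GroupVersionKind
	Stop chan struct{}
}

// discoveryCache keys discovered resources by their source (a CRD or an
// APIService), so everything a source contributed can be dropped without
// re-querying it once it becomes unavailable.
type discoveryCache struct {
	mu    sync.RWMutex
	cache map[string][]DiscoveredResource
}

// Replace swaps in a source's current resources, stopping whatever the
// source previously provided.
func (c *discoveryCache) Replace(source string, resources []DiscoveredResource) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, old := range c.cache[source] {
		close(old.Stop)
	}
	c.cache[source] = resources
}

// Delete handles a source going away entirely (e.g. an APIService being
// removed), again stopping everything it contributed.
func (c *discoveryCache) Delete(source string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, old := range c.cache[source] {
		close(old.Stop)
	}
	delete(c.cache, source)
}

// Snapshot reads under the lock, illustrating the fix for the
// unsynchronised read mentioned above.
func (c *discoveryCache) Snapshot() []schema.GroupVersionKind {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var out []schema.GroupVersionKind
	for _, rs := range c.cache {
		for _, r := range rs {
			out = append(out, r.GVK)
		}
	}
	return out
}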

@alexandernorth requested a review from bhope on January 20, 2026 at 17:13
@bhope (Contributor) left a comment

Thanks for those refactors and detailed walkthrough.

On the metrics side, since these are already released (hence probably consumed), I’d prefer we keep the delete metric rather than folding it into update. Having delete counted separately is still useful to understand churn (resources dropping vs being refreshed). Besides, removing it would be a breaking change for existing dashboards.

@alexandernorth (Author)

That makes sense, I have added back the add metric. The metrics have the same names as before, but now count the merged CRD/APIService events (the naming was not CRD-specific). I updated the help text to be more generic, but I think this is fine as it should not change anything for existing consumers unless they now configure aggregation-layer resources.
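
For context, the resulting arrangement of per-event counters plus a cache-size gauge is commonly declared with client_golang along these lines (a sketch; the metric and variable names below are hypothetical, not the actual kube-state-metrics names):

package sketch

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical declarations mirroring the arrangement described above:
// per-event counters (delete kept separate for churn visibility) plus a
// cache-size gauge, all covering both CRD and APIService events.
var (
	addEventsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "sketch_discovery_add_events_total",
		Help: "Add events for discovered custom resources (CRDs and APIServices).",
	})
	deleteEventsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "sketch_discovery_delete_events_total",
		Help: "Delete events for discovered custom resources (CRDs and APIServices).",
	})
	cacheCount = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "sketch_discovery_cache_count",
		Help: "Discovered custom resources currently cached.",
	})
)

func init() {
	prometheus.MustRegister(addEventsTotal, deleteEventsTotal, cacheCount)
}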

@bhope (Contributor) left a comment

Thanks for the update and incorporating the feedback. Overall, looks good to me.

Labels: cncf-cla: yes, needs-triage, size/XL

Successfully merging this pull request may close the issue: CustomResourceState does not produce metrics for the aggregation layer if there is no CustomResourceDefinition defined

3 participants