feat: split discovery into apiservice and crd discovery and monitoring #2854
Conversation
This issue is currently awaiting triage. If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Welcome @alexandernorth!

[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: alexandernorth. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
internal/discovery/discovery.go (outdated)

```go
group := serviceSpec["group"].(string)
version := serviceSpec["version"].(string)

resourceList, err := discoveryClient.ServerResourcesForGroupVersion(fmt.Sprintf("%s/%s", group, version))
```
This call runs on every add/update event for an APIService and can make it quite chatty, especially since such updates can occur frequently. Do we want any dedup or backoff per group/version to avoid repeated discovery calls and potential API churn?
Good point - I will look into this
I have improved the logic so that the APIService must be Available before querying the API Server for resources, reducing calls which would return no data (and thus hopefully the amount of churn too).
I am under the impression we should always perform this query whenever there is an update, as we need to track all available resources and a change to an APIService could result in new or removed Kinds. That makes dedup/backoff tricky, since we might miss updates. As resources are queried by group+version, which should be unique within the cluster, I don't see a case where we make multiple calls per update for the same group/version combination, but please correct me if I am missing something here.
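The availability gate described above can be sketched as follows. This is an illustrative, stdlib-only sketch, not the PR's actual code; the `condition` type is a hypothetical, pared-down stand-in for the APIService condition shape in apiregistration.k8s.io.

```go
package main

import "fmt"

// condition is a hypothetical, minimal stand-in for an APIService
// status condition (Type/Status pairs such as Available=True).
type condition struct {
	Type   string
	Status string
}

// isAvailable reports whether the conditions include Available=True,
// the gate applied before querying the API server for resources.
func isAvailable(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "Available" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	ready := []condition{{Type: "Available", Status: "True"}}
	pending := []condition{{Type: "Available", Status: "False"}}
	fmt.Println(isAvailable(ready), isAvailable(pending)) // true false
}
```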
Thanks, that makes sense. The availability check sounds like a good improvement and should help reduce chattiness.
My original thought was mainly around repeated updates where the APIService object itself changes (like status churn) without a change in group/version, but I agree dedup/backoff gets tricky if we want to avoid missing updates.
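One way to skip the status-only churn mentioned above is to re-run discovery only when a field that can affect the served Kinds changes. This is a hedged sketch under that assumption; `apiServiceView` and `shouldRediscover` are illustrative names, and the trade-off from the discussion still applies (availability transitions must also trigger rediscovery, so they are part of the comparison).

```go
package main

import "fmt"

// apiServiceView is a hypothetical projection of an APIService down to
// the fields that can change which Kinds it serves.
type apiServiceView struct {
	Group, Version string
	Available      bool
}

// shouldRediscover reports whether an update event warrants a fresh
// discovery call: group/version changed, or the service transitioned
// between available and unavailable. Pure status churn is skipped.
func shouldRediscover(oldView, newView apiServiceView) bool {
	return oldView != newView
}

func main() {
	a := apiServiceView{Group: "metrics.k8s.io", Version: "v1beta1", Available: true}
	churn := a // status-only update: relevant fields unchanged
	down := a
	down.Available = false
	fmt.Println(shouldRediscover(a, churn), shouldRediscover(a, down)) // false true
}
```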
Yes, that's true. I also considered this, but I didn't find a "nice" solution that would ensure we received all updates. That said, in my (short-term) observations, unless the aggregation service is very unstable there are not that many updates triggered.
Makes sense, thanks
bhope left a comment:
Since runInformer() is now used for both CRDs and APIServices, the CRDsAddEventsCounter and CRDsCacheCountGauge metrics will also count APIService events. That might be a bit confusing from a metrics perspective. Should we consider renaming them or splitting by source (CRD vs APIService)?
… be managed in the same way
As APIServices are dynamic, I realised I could not use exactly the same system as was present for CRDs: if an APIService is no longer available, it cannot be queried for its Kinds (to remove them from the cache map). I refactored the cache map, which is now keyed by the source of the discovered resource; the benefit is that we can handle the case where an APIService becomes unavailable. It does mean that we no longer index by Group/Version, but this could be implemented if it is a requirement.

Regarding the generated metrics, I have consolidated and renamed them to apply to both CRDs and APIServices, and removed the delete metric, folding deletes into updates.

The refactor also fixes a missing synchronisation where the cache map could be read outside of being locked.
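The source-keyed cache described above can be sketched roughly as below. All names here are illustrative, not the PR's actual identifiers; the point is that evicting an unavailable APIService's Kinds needs only its key, with no query to the unreachable service, and that all reads happen under the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// resourceCache is a hypothetical sketch: discovered Kinds are stored
// under the source they came from (a CRD or APIService name).
type resourceCache struct {
	mu    sync.Mutex
	kinds map[string][]string // source -> discovered Kinds
}

func newResourceCache() *resourceCache {
	return &resourceCache{kinds: make(map[string][]string)}
}

// set replaces the Kinds discovered from a given source.
func (c *resourceCache) set(source string, kinds []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.kinds[source] = kinds
}

// dropSource evicts everything from a source, e.g. when an APIService
// becomes unavailable and can no longer be queried.
func (c *resourceCache) dropSource(source string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.kinds, source)
}

// snapshot copies the map under the lock, avoiding the kind of
// unsynchronised read the refactor fixed.
func (c *resourceCache) snapshot() map[string][]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string][]string, len(c.kinds))
	for k, v := range c.kinds {
		out[k] = append([]string(nil), v...)
	}
	return out
}

func main() {
	c := newResourceCache()
	c.set("v1beta1.metrics.k8s.io", []string{"NodeMetrics", "PodMetrics"})
	c.dropSource("v1beta1.metrics.k8s.io") // APIService went unavailable
	fmt.Println(len(c.snapshot())) // 0
}
```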
bhope left a comment:
Thanks for those refactors and detailed walkthrough.
On the metrics side, since these are already released (hence probably consumed), I’d prefer we keep the delete metric rather than folding it into update. Having delete counted separately is still useful to understand churn (resources dropping vs being refreshed). Besides, removing it would be a breaking change for existing dashboards.
That makes sense - I have added back the delete metric.
bhope left a comment:
Thanks for the update and incorporating the feedback. Overall, looks good to me.
What this PR does / why we need it:
Enables discovery and metric collection for Custom Resources managed by aggregated API servers which do not have a local CRD. It does this by querying non-local APIServices for the resources they handle.
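Identifying which APIServices are "non-local" can be sketched as follows: in apiregistration.k8s.io, an APIService with a non-nil service reference is backed by an aggregated API server, while a nil reference means kube-apiserver serves the group/version locally. The trimmed-down types below are illustrative, not the PR's actual code.

```go
package main

import "fmt"

// serviceRef is a minimal stand-in for an APIService's backing-service
// reference (namespace/name of the aggregated API server's Service).
type serviceRef struct{ Namespace, Name string }

type apiService struct {
	Name    string
	Service *serviceRef // nil => served locally by kube-apiserver
}

// nonLocal keeps only APIServices backed by an aggregated API server,
// the set this PR discovers resources from.
func nonLocal(all []apiService) []apiService {
	var out []apiService
	for _, s := range all {
		if s.Service != nil {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	all := []apiService{
		{Name: "v1.apps"}, // local: no backing Service
		{Name: "v1beta1.metrics.k8s.io", Service: &serviceRef{"kube-system", "metrics-server"}},
	}
	for _, s := range nonLocal(all) {
		fmt.Println(s.Name) // v1beta1.metrics.k8s.io
	}
}
```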
How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
By default, no change
Which issue(s) this PR fixes:
Fixes #2471