release-22.2: kvprober: special case node-is-decommissioned errors #106927
Release justification:
kvprober is off by default and undocumented, so it should only be in use in CC prod. Without this change, SRE is paged during decommission operations run in CC.
Backport 1/1 commits from #104365.
/cc https://github.com/orgs/cockroachdb/teams/release
kvprober: special case node-is-decommissioned errors
kvprober runs on decommissioned nodes. In CC, this is generally fine, since
automation fully takes down nodes once they reach the decommissioned state. But
there is a brief period where a node is still running while in the
decommissioned state, and during this period kvprober errors like the one below
show up in metrics. This sometimes leads to false positive kvprober pages in CC
production.
‹rpc error: code = PermissionDenied desc = n1 was permanently removed from...
To be clear, the errors are not wrong per se. They are simply expected once a
node is decommissioned.
This commit adds special handling for errors of the kind above, by doing a
substring match on the error string. To be exact, kvprober now logs such errors
at warning level and does not increment any error counters. This way, an
operation like decommissioning a node does not cause false positive kvprober
pages in CC production.
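
The handling described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this commit: `probeMetrics`, `recordProbeErr`, `isDecommissionedErr`, and the sample errors are hypothetical names, the standard `log` package stands in for CockroachDB's logging, and the exact substring the real code matches on is an assumption taken from the error text quoted above.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"strings"
)

// decommissionedSubstring comes from the error text quoted in the commit
// message; the substring the real code matches on is an assumption.
const decommissionedSubstring = "was permanently removed from"

// Sample errors for illustration only. The first mirrors the truncated
// message from the commit text; the second is an invented unrelated failure.
var (
	errDecommissionedSample = errors.New(
		"rpc error: code = PermissionDenied desc = n1 was permanently removed from...")
	errOtherSample = errors.New(
		"rpc error: code = Unavailable desc = connection refused")
)

// probeMetrics is a hypothetical stand-in for kvprober's error counters.
type probeMetrics struct {
	failures int
}

// isDecommissionedErr reports whether err looks like the expected
// node-is-decommissioned error, using a plain substring match.
func isDecommissionedErr(err error) bool {
	return err != nil && strings.Contains(err.Error(), decommissionedSubstring)
}

// recordProbeErr logs expected decommission errors at warning level and
// leaves the counter alone; any other error is counted as a failure.
func recordProbeErr(m *probeMetrics, err error) {
	if err == nil {
		return
	}
	if isDecommissionedErr(err) {
		log.Printf("WARNING: probe failed on decommissioned node: %v", err)
		return
	}
	log.Printf("ERROR: probe failed: %v", err)
	m.failures++
}

func main() {
	m := &probeMetrics{}
	recordProbeErr(m, errDecommissionedSample) // logged as a warning, not counted
	recordProbeErr(m, errOtherSample)          // counted as a failure
	fmt.Println(m.failures)                    // prints 1
}
```

A substring match is brittle if the error wording ever changes, which is the trade-off of matching on the string rather than on a structured error type.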
Fixes #104367
Release note: None, since kvprober is not used by customers. (It is not
documented.)
Co-authored-by: Josh Carp [email protected]