kvprober: special case node-is-decommissioned errors #104365
Conversation
Force-pushed 1967675 to 4c0d5aa
Force-pushed 4c0d5aa to 90cf080
joshimhoff left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @smg260, and @srosenberg)
pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):
}
// ManualStop stops the prober from sending more probes. It does not
Maybe it should block; I could do it with channels. Otherwise, I worry the test I added will occasionally flake.
Force-pushed 90cf080 to 2f49898
nvb left a comment
Reviewed 3 of 8 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @joshimhoff, @smg260, and @srosenberg)
pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):
On the other hand, if it blocks, we risk kvprober unavailability stalling decommissioning. It feels safer to keep this loosely coupled.
I worry the test I added will occasionally flake.
Because without blocking, a kvprobe may be in-flight when the node is decommissioned? That does seem racy.
pkg/server/decommission.go line 342 at r2 (raw file):
) error {
// Once a node is fully decommissioned, neither kvclient nor kvprober work from
Decommission is configuring the nodes in nodeIDs as decommissioning. There is no guarantee that these are the same nodes that the decommission is running on. We wouldn't want to blindly stop kvprober on s if it is decommissioning s+1. We could manually stop kvprober only if we find ourselves in nodeIDs, but is that sufficient?
I'm curious whether you considered an approach of detecting these errors and not considering them to be kvprober failures. Doing so would avoid the raciness between kvprober termination and decommissioning that you were alluding to above without the need for tighter coupling.
joshimhoff left a comment
Thank you for the feedback!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)
pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
On the other hand, if it blocks, we risk kvprober unavailability stalling decommissioning. It feels safer to keep this loosely coupled.
I worry the test I added will occasionally flake.
Because without blocking, a kvprobe may be in-flight when the node is decommissioned? That does seem racy.
Right. I will fix this if we go with this approach. The rate of errors would probably be low enough to not page SRE, since we don't page on a single error or similar. But I don't want the tests to flake.
pkg/server/decommission.go line 342 at r2 (raw file):
We missed the nodeIDs argument. Thank you for catching this mistake.
We could manually stop kvprober only if we find ourselves in nodeIDs, but is that sufficient?
I think for CC purposes it is sufficient, but I will need to check in with the cloud platform team, as I am not confident.
I'm curious whether you considered an approach of detecting these errors and not considering them to be kvprober failures.
That's an interesting idea. Would you suggest detecting the error by inspecting the error string? Or does the error have more structure? PermissionDenied seems too broad. If you aren't sure, I can dig more into these questions.
We are also considering disabling Prometheus itself on CRDB nodes once decommissioning starts. But that's a bit weird, since the node is part of the cluster until decommissioning completes. So we are somewhat unsure about the infra change.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)
pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):
On the other hand, if it blocks, we risk kvprober unavailability stalling decommissioning. It feels safer to keep this loosely coupled.
I didn't fully parse this on first read. I think you make a good point, and I like your suggestion below to filter out the error.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)
pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
On the other hand, if it blocks, we risk kvprober unavailability stalling decommissioning. It feels safer to keep this loosely coupled.
I didn't fully parse this on first read. I think you make a good point, and I like your suggestion below to filter out the error.
If needed, I think we could keep ManualStop & decommission loosely coupled in prod, with some special test logic to block on kvprober properly stopping to avoid flakes.
I'm going to tack on a commit that filters out the error. I want to try that approach out.
Force-pushed 480efa2 to 9e95508
joshimhoff left a comment
I've tacked on a commit trying out the other approach. I like it. Thanks for suggesting it.
PTAL!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)
pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
If needed, I think we could keep ManualStop & decommission loosely coupled in prod, with some special test logic to block on kvprober properly stopping to avoid flakes.
As per the other thread, I prefer the filter-out-error approach you suggested, which doesn't have this issue.
pkg/server/decommission.go line 342 at r2 (raw file):
I'm curious whether you considered an approach of detecting these errors and not considering them to be kvprober failures. Doing so would avoid the raciness between kvprober termination and decommissioning that you were alluding to above without the need for tighter coupling.
I've implemented this approach. I am doing a substring match on the error string. A little brittle, but there is an integration test that will fail if the error string & the substring check done in kvprober drift out of sync.
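For concreteness, a minimal sketch of that substring check in Go. The constant and helper names below follow naming that comes up later in this thread; the surrounding package and exact message fragment are illustrative, not the literal diff.

```go
package kvprober

import "strings"

// nodeIsDecommissionedErrorSubstring is the fragment of the error message a
// decommissioned node's kvclient returns, e.g.
// "... n1 was permanently removed from the cluster at ...".
const nodeIsDecommissionedErrorSubstring = "was permanently removed from the cluster at"

// errorIsExpectedDuringNormalOperation reports whether a probe error is an
// expected consequence of routine operations such as decommissioning. Such
// errors are logged at warning level and do not increment error counters.
func errorIsExpectedDuringNormalOperation(err error) bool {
	return err != nil &&
		strings.Contains(err.Error(), nodeIsDecommissionedErrorSubstring)
}
```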
I like this approach more than the initial approach. More surgical, which is nice, since SRE wants to backport this change.
Let me know what you think. When we decide on an approach, I will squash the two commits into one.
Force-pushed 1d0c69e to f0c2f75
Interestingly, the test has failed once in CI, with a single error != "node is perm removed" happening after the node was decommissioned. I am digging into the error now. SRE alerts would not fire in case of a single error, so this is not necessarily a problem for the intended use case of this PR, but obviously we cannot merge a flaky test.
Here is the error log:
All entries are warnings, except this one error, which was the first shadow write done since the node was decommissioned, IIUC:
No rush, but @nvanbenschoten, do you have any insight into this error? I don't think we'd want to filter it out like we are doing with "n* was permanently removed from the cluster at", but if we can explain why it happens here, and be confident that only one such error can happen (or some similar bound), we could loosen the test to allow the one error.
Force-pushed f0c2f75 to 81ab2d0
Friendly poke, @nvanbenschoten! I'm hopeful that I can get this backported soon. It's a big cause of pager load right now. Regarding my question above, if you don't know off the top of your head, I can dig in more myself.
Force-pushed 33c7a13 to df2e4e2
joshimhoff left a comment
No need to be sorry. Thanks for helping! I know you are busy.
When we ran out of retries, we returned the first error we saw. If that's true, then kvprober shouldn't see this error more than once.
I have adjusted the tests to pass in case of a single error. I am running the test 100 times and so far have seen no test failures. I slightly prefer allowing one error to filtering out the "use of closed network connection" error, since the latter reduces alerting coverage. I do not feel strongly about this.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)
pkg/kv/kvprober/kvprober.go line 461 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We should push this err != nil && strings.Contains(err.Error(), nodeIsDecommissionedErrorSubstring) condition into a shouldIgnoreError function, so that we can name it (hopefully with a better name), give it some commentary, and add new errors if we decide that we need to.
Good idea. I have now introduced a function called errorIsExpectedDuringNormalOperation.
pkg/server/decommission.go line 342 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Would you suggest detecting the error by inspecting the error string? Or does the error have more structure? PermissionDenied seems too broad. If you aren't sure, I can dig more into these questions.
I wasn't familiar with this, but I do see some structure around these errors, along with an IsDecommissionedStatusErr function. Did you look into whether this function works? Or do we wrap these errors and lose the context in some cases?
I have rebased to pick up #99191. IsDecommissionedStatusErr works well. Thanks for suggesting it. I guess I will need to tweak the code slightly when I backport this commit.
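With that rebase, the helper roughly collapses to the sketch below. IsDecommissionedStatusErr is the function referenced above, but its import path (kvpb here) and func(error) bool signature are my assumptions, so treat this as a shape rather than the actual diff.

```go
// errorIsExpectedDuringNormalOperation reports whether a probe error is an
// expected consequence of routine operations, e.g. the node having been
// decommissioned. The structured check replaces the earlier substring match.
// NOTE: the kvpb import path and the signature of IsDecommissionedStatusErr
// are assumptions, not taken from the actual diff.
func errorIsExpectedDuringNormalOperation(err error) bool {
	return err != nil && kvpb.IsDecommissionedStatusErr(err)
}
```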
nvb left a comment
Reviewed 1 of 8 files at r1, 6 of 6 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jmcarp, @joshimhoff, @smg260, and @srosenberg)
pkg/kv/kvprober/kvprober.go line 368 at r5 (raw file):
return
}
if errorIsExpectedDuringNormalOperation(err) {
I'll defer to you, but I think nesting these checks inside the if err != nil { branches would make this logic slightly easier to read. Doing so keeps the error-handling paths contained to a single branch. As a bonus, you also could share the returns between the two error-handling paths. But I don't feel strongly.
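To illustrate the suggested restructuring, here is a self-contained sketch with stand-in callbacks in place of kvprober's real metrics and logging; it is not the actual kvprober code.

```go
// recordProbeResult nests the expected-error check inside the single
// err != nil branch, so all error handling shares one block and one return.
// The callbacks stand in for kvprober's metrics and log calls.
func recordProbeResult(err error, onExpected, onFailure, onSuccess func()) {
	if err != nil {
		if errorIsExpectedDuringNormalOperation(err) {
			onExpected() // log at warning level; don't bump error counters
		} else {
			onFailure() // bump error counters; log at error level
		}
		return
	}
	onSuccess()
}
```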
pkg/kv/kvprober/kvprober_integration_test.go line 92 at r5 (raw file):
// Once a node is fully decommissioned, neither kvclient nor kvprober work from
// the node. This does not indicate a service health issue; it is expected behavior.
t.Run("decommission doesn't cause errors", func(t *testing.T) {
Nice test case. Thanks for writing it and spending the time to ensure it's not flaky.
Force-pushed df2e4e2 to dd6af28
joshimhoff left a comment
TFTR! I appreciate the help getting this ready to be merged.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)
pkg/kv/kvprober/kvprober.go line 368 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'll defer to you, but I think nesting these checks inside the if err != nil { branches would make this logic slightly easier to read. Doing so keeps the error-handling paths contained to a single branch. As a bonus, you also could share the returns between the two error-handling paths. But I don't feel strongly.
Good idea. Done.
pkg/kv/kvprober/kvprober_integration_test.go line 92 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Nice test case. Thanks for writing it and spending the time to ensure it's not flaky.
Thanks! I like writing tests with the CRDB integration test framework.
Force-pushed 5fbf49c to 0a216b6
Can you take another look, @nvanbenschoten?
nvb left a comment
Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: complete! 2 of 2 files at r6, all commit messages reviewed. 1 of 0 LGTMs obtained (waiting on @jmcarp, @smg260, and @srosenberg)
I hit another error, without even looking aggressively for it. Here is the error:
I'm no longer sure about the filter-out-error approach. Below is my current thinking. It might be worth having a 1:1 instead of going back & forth on this PR. I will schedule one, but feel free to reach out to me if you don't have time for the 1:1, and we can keep talking async.
So, now our goal is to merge the kvprober fix that we think best solves the problem. Two options:
1. Filter out the expected decommission-related errors in kvprober (the current approach in this PR).
2. Go back to having the decommission code call ManualStop on kvprober (the original approach).
Like I said above, I am not sure about solution 1 anymore. The problem is that the set of errors that can be returned from the kvclient of a decommissioned node is not well understood (at least by me), and the errors lack structure. Yes, the majority of the errors are caught by the IsDecommissionedStatusErr check [1], but not all of them are.
As a result of this, I think we should consider going back to solution 2. There were two objections to it raised by Nathan:
I talked to the cloud platform team, and even in CC, we cannot assume that the decommission RPC for node A will always be sent to node A. So I think we would need to do the nodeIDs check.
Agree. We should cancel a context from ManualStop.
Force-pushed 6b3dee6 to e9e2655
As discussed over VC, I have adjusted this PR to only filter out errors via the IsDecommissionedStatusErr function. I have replaced the integration test with a unit test, to avoid test flakes.
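Shape-wise, such a unit test might look like the table-driven sketch below. It exercises the substring-based sketch helper from earlier in this thread (imports: errors, testing, and testify's require); with the IsDecommissionedStatusErr-based version, the decommissioned case would instead be constructed with whatever structured constructor that function recognizes, which I have not assumed here.

```go
func TestErrorIsExpectedDuringNormalOperation(t *testing.T) {
	for _, tc := range []struct {
		name     string
		err      error
		expected bool
	}{
		{name: "nil error", err: nil, expected: false},
		{name: "ordinary failure", err: errors.New("context deadline exceeded"), expected: false},
		{
			name: "decommissioned node",
			err: errors.New("rpc error: code = PermissionDenied desc = " +
				"n1 was permanently removed from the cluster at ..."),
			expected: true,
		},
	} {
		t.Run(tc.name, func(t *testing.T) {
			require.Equal(t, tc.expected, errorIsExpectedDuringNormalOperation(tc.err))
		})
	}
}
```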
Note that I left a small change in kvprober_integration_test.go in place. This change ensures that a single kvprober is running per CRDB instance. That makes for a simpler test, and it was required to get the earlier integration test working.
Note also that before merge, I will run the integration test as a one-off, to ensure the PR is correct.
PTAL!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)
Force-pushed e9e2655 to 2611c74
nvb left a comment
Thanks for iterating on this. The end result feels like a good solution, at least for the time being.
Reviewed 5 of 5 files at r7, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jmcarp, @smg260, and @srosenberg)
TFTR!
bors r+
Build succeeded:
I tested with the (slightly flaky) integration test post-merge. It passed as expected. Will send backports in a second.
kvprober: special case node-is-decommissioned errors
kvprober runs on decommissioned nodes. In CC, this is generally fine, since
automation fully takes down nodes once they reach the decommissioned state. But
there is a brief period where a node is running and in the decommissioned
state, and we see kvprober errors in metrics during this period, as shown
below. This sometimes leads to false positive kvprober pages in CC production.
rpc error: code = PermissionDenied desc = n1 was permanently removed from...
To be clear, the errors are not wrong per se. They are just expected to happen
once a node is decommissioned.
This commit adds special handling for errors of the kind above, by doing a
substring match on the error string. To be exact, kvprober now logs such errors
at warning level and does not increment any error counters. This way, an
operation like decommissioning a node does not cause false positive kvprober
pages in CC production.
Fixes #104367
Release note: None, since kvprober is not used by customers. (It is not
documented.)
Co-authored-by: Josh Carp [email protected]