
Conversation

@joshimhoff
Collaborator

@joshimhoff joshimhoff commented Jun 5, 2023

kvprober: special case node-is-decommissioned errors

kvprober runs on decommissioned nodes. In CC, this is generally fine, since
automation fully takes down nodes once they reach the decommissioned state. But
there is a brief period where a node is running and in the decommissioned
state, and we see kvprober errors in metrics during this period, as shown below.
This sometimes leads to false positive kvprober pages in CC production.

‹rpc error: code = PermissionDenied desc = n1 was permanently removed from...

To be clear, the errors are not wrong per se. They are just expected to happen
once a node is decommissioned.

This commit adds special handling for errors of the kind above, by doing a
substring match on the error string. To be exact, kvprober now logs such errors
at warning level and does not increment any error counters. This way, an
operation like decommissioning a node does not cause false positive kvprober
pages in CC production.

Fixes #104367

Release note: None, since kvprober is not used by customers. (It is not
documented.)

Co-authored-by: Josh Carp [email protected]
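
For illustration only, here is a minimal sketch of the kind of special handling described above, with hypothetical names (the real change lives in pkg/kv/kvprober and uses CRDB's own logging and metrics): expected decommission errors are logged at warning level and skipped, while everything else still increments the error counter.

package main

import (
	"errors"
	"log"
	"strings"
)

// nodeIsDecommissionedErrorSubstring is a hypothetical constant; the real
// error text comes from the KV layer ("... was permanently removed from the
// cluster ...").
const nodeIsDecommissionedErrorSubstring = "was permanently removed from the cluster"

// probeErrorCount stands in for kvprober's real error counter metric.
var probeErrorCount int64

// handleProbeError sketches the special casing: expected decommission errors
// are logged at warning level and do not bump the error counter.
func handleProbeError(err error) {
	if err == nil {
		return
	}
	if strings.Contains(err.Error(), nodeIsDecommissionedErrorSubstring) {
		log.Printf("warning: probe failed with expected error during decommission: %v", err)
		return
	}
	probeErrorCount++
	log.Printf("error: probe failed: %v", err)
}

func main() {
	handleProbeError(errors.New("rpc error: code = PermissionDenied desc = n1 was permanently removed from the cluster"))
	handleProbeError(errors.New("context deadline exceeded"))
	log.Printf("probe error count = %d", probeErrorCount)
}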

@joshimhoff joshimhoff requested a review from jmcarp June 5, 2023 20:55
@joshimhoff joshimhoff requested review from a team as code owners June 5, 2023 20:55
@joshimhoff joshimhoff requested review from smg260 and srosenberg and removed request for a team June 5, 2023 20:55
@cockroach-teamcity
Member

This change is Reviewable

@joshimhoff joshimhoff force-pushed the kvprober_decommission branch 2 times, most recently from 1967675 to 4c0d5aa Compare June 5, 2023 21:02
@joshimhoff joshimhoff force-pushed the kvprober_decommission branch from 4c0d5aa to 90cf080 Compare June 5, 2023 21:09
Collaborator Author

@joshimhoff joshimhoff left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @smg260, and @srosenberg)


pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):

}

// ManualStop stops the prober from sending more probes. It does not

Maybe it should block. I could do it with channels. Otherwise, I worry the test I added will occasionally flake.

@joshimhoff joshimhoff force-pushed the kvprober_decommission branch from 90cf080 to 2f49898 Compare June 5, 2023 21:17
Contributor

@nvb nvb left a comment

Reviewed 3 of 8 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @joshimhoff, @smg260, and @srosenberg)


pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):
On the other hand, if it blocks, we risk kvprober unavailability stalling decommissioning. It feels safer to keep this loosely coupled.

I worry the test I added will occasionally flake.

Because without blocking, a probe may be in-flight when the node is decommissioned? That does seem racy.


pkg/server/decommission.go line 342 at r2 (raw file):

) error {

	// Once a node is fully decommissioned, neither kvclient nor kvprober work from

Decommission is configuring the nodes in nodeIDs as decommissioning. There is no guarantee that these are the same nodes that the decommission is running on. We wouldn't want to blindly stop kvprober on node s if it is decommissioning node s+1. We could manually stop kvprober only if we find ourselves in nodeIDs, but is that sufficient?

I'm curious whether you considered an approach of detecting these errors and not considering them to be kvprober failures. Doing so would avoid the raciness between kvprober termination and decommissioning that you were alluding to above without the need for tighter coupling.
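
To make that concrete, a tiny hypothetical guard (names invented here for illustration, not taken from the PR):

package main

import "fmt"

// shouldStopLocalProber returns true only if the local node is among the
// nodes being decommissioned, per the suggestion above.
func shouldStopLocalProber(localNodeID int32, nodeIDs []int32) bool {
	for _, id := range nodeIDs {
		if id == localNodeID {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldStopLocalProber(1, []int32{2, 3})) // false: decommissioning other nodes
	fmt.Println(shouldStopLocalProber(1, []int32{1, 2})) // true: we are in nodeIDs
}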

@joshimhoff joshimhoff requested a review from nvb June 5, 2023 22:04
Collaborator Author

@joshimhoff joshimhoff left a comment

Thank you for the feedback!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)


pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

On the other hand, if it blocks, we risk kvprober unavailability stalling decommissioning. It feels safer to keep this loosely coupled.

I worry the test I added will occasionally flake.

Because without blocking, a probe may be in-flight when the node is decommissioned? That does seem racy.

Right. I will fix this if we go with this approach. The rate of errors would probably be low enough not to page SRE, since we don't page on a single error or similar. But I don't want the tests to flake.


pkg/server/decommission.go line 342 at r2 (raw file):
We missed the nodeIDs argument. Thank you for catching this mistake.

We could manually stop kvprober only if we find ourself in nodeIDs, but is that sufficient?

I think for CC purposes, it is sufficient, but I will need to check in with the cloud platform team, as I am not confident.

I'm curious whether you considered an approach of detecting these errors and not considering them to be kvprober failures.

That's an interesting idea. Would you suggest detecting the error by inspecting the error string? Or does the error have more structure? PermissionDenied seems too broad. If you aren't sure, I can dig more into these questions.

We are also considering disabling Prometheus itself on CRDB nodes once decommissioning starts. But that's a bit weird, since the node is part of the cluster until decommissioning completes. So we are somewhat unsure about the infra change.

Collaborator Author

@joshimhoff joshimhoff left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)


pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):

On the other hand, if it blocks, we risk kvprober unavailability stalling decommissioning. It feels safer to keep this loosely coupled.

I didn't parse this fully on first read. I think you make a good point. And I like your suggestion below to filter the error out.

Collaborator Author

@joshimhoff joshimhoff left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)


pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

On the other hand, if it blocks, we risk kvprober unavailability stalling decommissioning. It feels safer to keep this loosely coupled.

I didn't parse this fully on first read. I think you make a good point. And I like your suggestion below to filter the error out.

If needed, I think we could keep ManualStop & decommission loosely coupled in prod, with some special test logic to block on kvprober properly stopping to avoid flakes.

@joshimhoff
Collaborator Author

joshimhoff commented Jun 7, 2023

I'm going to tack on a commit that filters out the error. I want to try that approach out.

@joshimhoff joshimhoff changed the title kvprober: stop kvprober at the end of decommissioning kvprober: avoid false positive pages in case of decommissioning Jun 8, 2023
@joshimhoff joshimhoff force-pushed the kvprober_decommission branch from 480efa2 to 9e95508 Compare June 8, 2023 14:45
Collaborator Author

@joshimhoff joshimhoff left a comment

I've tacked on a commit trying out the other approach. I like it. Thanks for suggesting it.

PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)


pkg/kv/kvprober/kvprober.go line 290 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

If needed, I think we could keep ManualStop & decommission loosely coupled in prod, with some special test logic to block on kvprober properly stopping to avoid flakes.

As per the other thread, I prefer the filter-out-error approach you suggested, which doesn't have this issue.


pkg/server/decommission.go line 342 at r2 (raw file):

I'm curious whether you considered an approach of detecting these errors and not considering them to be kvprober failures. Doing so would avoid the raciness between kvprober termination and decommissioning that you were alluding to above without the need for tighter coupling.

I've implemented this approach. I am doing a substring match on the error string. A little brittle, but there is an integration test that will fail if the error string & the substring check done in kvprober drift out of sync.

I like this approach more than the initial approach. More surgical, which is nice, since SRE wants to backport this change.

Let me know what you think. When we decide on an approach, I will squash the two commits into one.
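
For illustration, a hypothetical unit-style version of the drift guard mentioned above (the actual test in this PR is an integration test that decommissions a real node; makeDecommissionedError here is only a stand-in for the error the KV layer produces):

package kvprober // hypothetical sketch, not the shipped test

import (
	"fmt"
	"strings"
	"testing"
)

// nodeIsDecommissionedErrorSubstring mirrors the hypothetical substring that
// kvprober matches on.
const nodeIsDecommissionedErrorSubstring = "was permanently removed from the cluster"

// makeDecommissionedError stands in for the error a decommissioned node's
// kvclient returns; the real integration test obtains it by actually
// decommissioning a node.
func makeDecommissionedError() error {
	return fmt.Errorf("rpc error: code = PermissionDenied desc = n1 was permanently removed from the cluster; it is not allowed to rejoin the cluster")
}

// TestDecommissionErrorSubstringDoesNotDrift fails if the error text and the
// substring check drift out of sync.
func TestDecommissionErrorSubstringDoesNotDrift(t *testing.T) {
	if !strings.Contains(makeDecommissionedError().Error(), nodeIsDecommissionedErrorSubstring) {
		t.Fatal("substring check no longer matches the decommission error")
	}
}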

@joshimhoff joshimhoff force-pushed the kvprober_decommission branch 3 times, most recently from 1d0c69e to f0c2f75 Compare June 8, 2023 15:03
@joshimhoff
Collaborator Author

Interestingly, the test has failed once in CI, with a single error other than "node is permanently removed" occurring after the node was decommissioned. I am digging into the error now. SRE alerts would not fire in case of a single error, so this is not necessarily a problem for the intended use case of this PR, but obviously we cannot merge a flaky test.

https://teamcity.cockroachdb.com/viewLog.html?buildId=10457643&buildTypeId=Cockroach_BazelEssentialCi

@joshimhoff
Collaborator Author

Here is the error log:

W230608 15:43:35.249884 7556 2@kv/kvprober/kvprober.go:398 ⋮ [T1,n1,kvprober] 1059  kv.Get(/Local/Range/Table/24/‹RangeProbe›), r=26 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.268075 7556 2@kv/kvprober/kvprober.go:398 ⋮ [T1,n1,kvprober] 1084  kv.Get(/Local/Range/Table/35/‹RangeProbe›), r=37 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
E230608 15:43:35.291051 7557 2@kv/kvprober/kvprober.go:465 ⋮ [T1,n1,kvprober] 1114  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with: result is ambiguous: error=ba: ‹Put [/Local/Range/Table/11/RangeProbe,/Min), Delete [/Local/Range/Table/11/RangeProbe,/Min), EndTxn(parallel commit) [/Local/Range/Table/11/RangeProbe], [txn: a979d323], [can-forward-ts]› RPC error: grpc: ‹error reading from server: read tcp 127.0.0.1:52417->127.0.0.1:52407: use of closed network connection› [code 14/Unavailable] [propagate] [quarantined=true]
W230608 15:43:35.393723 7558 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1216  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.410173 7557 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1230  kv.Txn(Put(/Local/Range/Table/24/‹RangeProbe›); Del(-)), r=26 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.497431 7558 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1257  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.535974 7557 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1272  kv.Txn(Put(/Local/Range/Table/23/‹RangeProbe›); Del(-)), r=25 failed with expected error during decommission: failed to connect to n5 at ‹127.0.0.1:52422›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.604841 7557 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1279  kv.Txn(Put(/Local/Range/Table/22/‹RangeProbe›); Del(-)), r=24 failed with expected error during decommission: failed to connect to n5 at ‹127.0.0.1:52422›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.616410 7558 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1290  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.719304 7557 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1302  kv.Txn(Put(/Local/Range/Table/28/‹RangeProbe›); Del(-)), r=30 failed with expected error during decommission: failed to connect to n5 at ‹127.0.0.1:52422›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.723915 7558 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1305  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.837714 7558 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1321  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.842290 7557 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1324  kv.Txn(Put(/Local/Range/Table/29/‹RangeProbe›); Del(-)), r=31 failed with expected error during decommission: failed to connect to n5 at ‹127.0.0.1:52422›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:35.957133 7558 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1341  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:36.073087 7558 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1382  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:36.187135 7558 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1457  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]
W230608 15:43:36.287867 7558 2@kv/kvprober/kvprober.go:458 ⋮ [T1,n1,kvprober] 1514  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with expected error during decommission: failed to connect to n4 at ‹127.0.0.1:52407›: grpc: ‹n1 was permanently removed from the cluster at 2023-06-08 15:43:39.145154 +0000 UTC; it is not allowed to rejoin the cluster› [code 7/PermissionDenied]

All are warnings, except this one error, which came from the first shadow write done after the node was decommissioned, IIUC:

E230608 15:43:35.291051 7557 2@kv/kvprober/kvprober.go:465 ⋮ [T1,n1,kvprober] 1114  kv.Txn(Put(/Local/Range/Table/11/‹RangeProbe›); Del(-)), r=13 failed with: result is ambiguous: error=ba: ‹Put [/Local/Range/Table/11/RangeProbe,/Min), Delete [/Local/Range/Table/11/RangeProbe,/Min), EndTxn(parallel commit) [/Local/Range/Table/11/RangeProbe], [txn: a979d323], [can-forward-ts]› RPC error: grpc: ‹error reading from server: read tcp 127.0.0.1:52417->127.0.0.1:52407: use of closed network connection› [code 14/Unavailable] [propagate] [quarantined=true]

No rush, but @nvanbenschoten, do you have any insight into this error? I don't think we'd want to filter it out like we are doing with "n* was permanently removed from the cluster at", but if we can explain why it happens here, and be confident that only one such error can happen (or some similar bound), we could loosen the test to allow the one error.

@joshimhoff joshimhoff force-pushed the kvprober_decommission branch from f0c2f75 to 81ab2d0 Compare June 8, 2023 20:10
@joshimhoff
Collaborator Author

Friendly poke, @nvanbenschoten! I'm hopeful that I can get this backported soon. This is a big cause of pager load right now. Regarding my question above, if you don't know off the top of your head, I can dig more myself.

@joshimhoff joshimhoff changed the title kvprober: avoid false positive pages in case of decommissioning kvprober: special case node-is-decommissioned errors Jun 23, 2023
@joshimhoff joshimhoff force-pushed the kvprober_decommission branch 2 times, most recently from 33c7a13 to df2e4e2 Compare June 23, 2023 20:40
@joshimhoff joshimhoff requested a review from nvb June 23, 2023 20:51
Collaborator Author

@joshimhoff joshimhoff left a comment

No need to be sorry. Thanks for helping! I know you are busy.

When we ran out of retries, we returned the first error we saw. If that's true, then kvprober shouldn't see this error more than once.

I have adjusted the tests to pass in case of a single error. I am running the test 100 times & so far have seen no test failures. I slightly prefer allowing one error over filtering out the "use of closed network connection" error, since the latter reduces alerting coverage. I do not feel strongly about this.
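
A rough sketch of the loosened assertion (readProbeErrorCount is a hypothetical stand-in for reading kvprober's error metric; the real test uses the CRDB test framework):

package kvprober // hypothetical sketch

import "testing"

// readProbeErrorCount stands in for reading kvprober's error counter metric
// after the node has been decommissioned.
func readProbeErrorCount() int64 { return 0 }

// TestDecommissionAllowsAtMostOneProbeError tolerates a single racy error but
// fails on sustained errors.
func TestDecommissionAllowsAtMostOneProbeError(t *testing.T) {
	if n := readProbeErrorCount(); n > 1 {
		t.Fatalf("expected at most 1 probe error after decommission, got %d", n)
	}
}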

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)


pkg/kv/kvprober/kvprober.go line 461 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We should push this err != nil && strings.Contains(err.Error(), nodeIsDecommissionedErrorSubstring) condition into a shouldIgnoreError function, so that we can name it (hopefully with a better name), give it some commentary, and add new errors if we decide that we need to.

Good idea. I have now introduced a function called errorIsExpectedDuringNormalOperation.
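
A sketch of what that helper looks like at this revision (the condition is taken from the review comment above; the constant's exact value is an assumption):

package kvprober // hypothetical sketch

import "strings"

// nodeIsDecommissionedErrorSubstring is the text matched against; the exact
// value used in the PR may differ.
const nodeIsDecommissionedErrorSubstring = "was permanently removed from the cluster"

// errorIsExpectedDuringNormalOperation returns true for errors that are
// expected during normal operations such as decommissioning, and therefore
// should be logged at warning level rather than counted as probe failures.
func errorIsExpectedDuringNormalOperation(err error) bool {
	return err != nil &&
		strings.Contains(err.Error(), nodeIsDecommissionedErrorSubstring)
}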


pkg/server/decommission.go line 342 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would you suggest detecting the error by inspecting the error string? Or does the error have more structure? PermissionDenied seems too broad. If you aren't sure, I can dig more into these questions.

I wasn't familiar with this, but I do see some structure around these errors, along with an IsDecommissionedStatusErr function. Did you look into whether this function works? Or do we wrap these errors and lose the context in some cases?

I have rebased to pick up #99191. IsDecommissionedStatusErr works well. Thanks for suggesting it. I guess I will need to tweak the code slightly when I backport this commit.

Contributor

@nvb nvb left a comment

:lgtm:

Reviewed 1 of 8 files at r1, 6 of 6 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jmcarp, @joshimhoff, @smg260, and @srosenberg)


pkg/kv/kvprober/kvprober.go line 368 at r5 (raw file):

		return
	}
	if errorIsExpectedDuringNormalOperation(err) {

I'll defer to you, but I think nesting these checks inside the if err != nil { branches would make this logic slightly easier to read. Doing so keeps the error-handling paths contained to a single branch. As a bonus, you also could share the returns between the two error-handling paths. But I don't feel strongly.
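
For concreteness, a simplified sketch of the nesting being suggested (the real probe code also records latency and uses real logging and metrics):

package kvprober // hypothetical sketch

// recordProbeResult nests the expected-error check inside the err != nil
// branch, so both error-handling paths live in a single branch and share a
// return.
func recordProbeResult(err error, isExpected func(error) bool, onExpected, onError, onSuccess func()) {
	if err != nil {
		if isExpected(err) {
			onExpected() // e.g. log at warning level, skip error counters
		} else {
			onError() // e.g. log at error level, increment error counters
		}
		return
	}
	onSuccess()
}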


pkg/kv/kvprober/kvprober_integration_test.go line 92 at r5 (raw file):

	// Once a node is fully decommissioned, neither kvclient nor kvprober work from
	// the node. This does not indicate a service health issue; it is expected behavior.
	t.Run("decommission doesn't cause errors", func(t *testing.T) {

Nice test case. Thanks for writing it and spending the time to ensure it's not flaky.

@joshimhoff joshimhoff force-pushed the kvprober_decommission branch from df2e4e2 to dd6af28 Compare June 26, 2023 13:19
@joshimhoff joshimhoff requested a review from nvb June 26, 2023 13:19
Collaborator Author

@joshimhoff joshimhoff left a comment

TFTR! I appreciate the help getting this ready to be merged.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)


pkg/kv/kvprober/kvprober.go line 368 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'll defer to you, but I think nesting these checks inside the if err != nil { branches would make this logic slightly easier to read. Doing so keeps the error-handling paths contained to a single branch. As a bonus, you also could share the returns between the two error-handling paths. But I don't feel strongly.

Good idea. Done.


pkg/kv/kvprober/kvprober_integration_test.go line 92 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Nice test case. Thanks for writing it and spending the time to ensure it's not flaky.

Thanks! I like writing tests with the CRDB integration test framework.

@joshimhoff joshimhoff force-pushed the kvprober_decommission branch 2 times, most recently from 5fbf49c to 0a216b6 Compare June 26, 2023 13:57
@joshimhoff
Collaborator Author

"use of closed network connection" happened twice on a test run. I've updated kvprober to filter out that error too. I've done this with a substring match, as reading IsClosedConnection makes me believe that is necessary. I've not reused IsClosedConnection for reasons described in a code comment.

Can you take another look, @nvanbenschoten?

Contributor

@nvb nvb left a comment

:lgtm:

Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jmcarp, @smg260, and @srosenberg)

@joshimhoff
Collaborator Author

joshimhoff commented Jun 29, 2023

I hit another error, without even looking aggressively for it. Here is the error:

E230626 14:08:10.682742 8021 2@kv/kvprober/kvprober.go:485 ⋮ [T1,n1,kvprober] 1208  kv.Txn(Put(/Local/Range/System/NodeLiveness/‹RangeProbe›); Del(-)), r=2 failed with: result is ambiguous: error=ba: ‹Put [/Local/Range/System/NodeLiveness/RangeProbe,/Min), Delete [/Local/Range/System/NodeLiveness/RangeProbe,/Min), EndTxn(parallel commit) [/Local/Range/System/NodeLiveness/RangeProbe], [txn: 528ac691], [can-forward-ts]› RPC error: grpc: ‹error reading from server: EOF› [code 14/Unavailable] [propagate] [quarantined=true]

I'm no longer sure about the filter-out-error approach. Below is my current thinking. It might be worth having a 1:1 instead of going back & forth on this PR. I will schedule one, but feel free to reach out to me if you don't have time for the 1:1, and we can keep talking async.

  • The fix to this problem belongs in CRDB. Anyone else who uses kvprober down the road will need this fix.
  • At the same time, we need to stop SRE from getting paged ~now.
  • So I have come up with an alerting rule change that I can merge ~now to keep SRE from getting paged. The idea is to count the number of kvprober instances reporting errors. If >= 2, page right away. If 1, require ~30m of errors to page. I've looked at historical data, and this will keep SRE from getting paged, so long as the CC downscale automation doesn't time out before scaling the instances down (this rarely happens).
  • As a result of merging a stop-gap fix on the SRE side, we can take our time with the CRDB fix, and we don't even necessarily need to backport it (though a backport is generally desirable, due to caveats with my stop-gap fix).

So, now our goal is to merge the kvprober fix that we think best solves the problem. Two options:

  1. Filter out errors that are returned from the kvclient of a decommissioned node, like we have here.
  2. Manually shut down kvprober right before a node enters the decommissioned state, as I had initially.

Like I said above, I am not sure about solution 1 anymore. The problem is that the set of errors that can be returned from the kvclient of a decommissioned node is not well understood (at least by me), and the errors lack structure. Yes, the majority of the errors are caught by the IsDecommissionedStatusErr function, but there are other errors. I could use IsClosedConnection to filter out EOF too, but as can be seen below [1], the scope of this function is very wide. I worry both that kvprober will filter out errors that we don't want it to filter out, since they indicate a true positive availability event (e.g. "node unavailable"), and that the test may flake due to some error not being filtered out by IsClosedConnection. It's not a good sign IMHO that I am worried about both of these things. In general, I only favor solution 1 if we can do all filtering via a precise check on the error type, and if the error type is specific to the decommissioned state (as with IsDecommissionedStatusErr).

[1]

// IsClosedConnection returns true if err's Cause is an error produced by gRPC
// on closed connections.
func IsClosedConnection(err error) bool {
	if errors.Is(err, ErrConnectionInterrupted) {
		return true
	}
	err = errors.Cause(err)
	if s, ok := status.FromError(err); ok {
		if s.Code() == codes.Canceled ||
			s.Code() == codes.Unavailable {
			return true
		}
	}
	if errors.Is(err, context.Canceled) ||
		strings.Contains(err.Error(), "is closing") ||
		strings.Contains(err.Error(), "tls: use of closed connection") ||
		strings.Contains(err.Error(), "use of closed network connection") ||
		strings.Contains(err.Error(), io.ErrClosedPipe.Error()) ||
		strings.Contains(err.Error(), io.EOF.Error()) ||
		strings.Contains(err.Error(), "node unavailable") {
		return true
	}
	return netutil.IsClosedConnection(err)
}

As a result of this, I think we should consider going back to solution 2. There were two objections to it raised by Nathan:

Decommission is configuring the nodes in nodeIDs as decommissioning. There is no guarantee that these are the same nodes that the decommission is running on. We wouldn't want to blindly stop kvprober on node s if it is decommissioning node s+1. We could manually stop kvprober only if we find ourselves in nodeIDs, but is that sufficient?

I talked to the cloud platform team, and even in CC, we cannot assume that the decommission RPC for node A will always be sent to node A. So I think we would need to do the ManualStop from somewhere else in the CRDB code. I'm not sure if that is practical. I will dig a bit.

On the other hand, if it blocks, we risk kvprober unavailability stalling decommissioning. It feels safer to keep this loosely coupled.

Agreed. We should only cancel a context from ManualStop, so as to keep the coupling loose. We can allow one error in the tests to avoid flakes.
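
A rough sketch of that loose coupling (all names hypothetical, not the PR's code): ManualStop only cancels a context and returns immediately, so a stuck prober cannot stall decommissioning.

package main

import (
	"context"
	"fmt"
	"time"
)

// Prober holds a context that the probe loop watches.
type Prober struct {
	ctx    context.Context
	cancel context.CancelFunc
}

func NewProber() *Prober {
	ctx, cancel := context.WithCancel(context.Background())
	return &Prober{ctx: ctx, cancel: cancel}
}

// ManualStop asks the probe loop to stop and returns immediately; it never
// waits for in-flight probes.
func (p *Prober) ManualStop() { p.cancel() }

// run is a stand-in for the real probe loop.
func (p *Prober) run() {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-p.ctx.Done():
			fmt.Println("prober stopped")
			return
		case <-ticker.C:
			// issue a probe here
		}
	}
}

func main() {
	p := NewProber()
	go p.run()
	time.Sleep(30 * time.Millisecond)
	p.ManualStop()
	time.Sleep(20 * time.Millisecond)
}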

@joshimhoff joshimhoff force-pushed the kvprober_decommission branch 2 times, most recently from 6b3dee6 to e9e2655 Compare July 11, 2023 14:20
@joshimhoff joshimhoff requested a review from nvb July 11, 2023 14:24
Collaborator Author

@joshimhoff joshimhoff left a comment

As discussed over VC, I have adjusted this PR to only filter out errors via the IsDecommissionedStatusErr function. I have replaced the integration test with a unit test, to avoid test flakes.

Note that I left a small change in kvprober_integration_test.go in place. This change ensures that a single kvprober is running per CRDB instance. That makes for a simpler test, and it was required to get the earlier integration test working.

Note also that before merge, I will run the integration test as a one-off, to ensure the PR is correct.

PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jmcarp, @nvanbenschoten, @smg260, and @srosenberg)
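
For reference, a sketch of the final shape (hypothetical wiring; IsDecommissionedStatusErr is the KV layer's helper mentioned above, and its exact import path is not reproduced here):

package kvprober // hypothetical sketch

// isDecommissionedStatusErr stands in for the KV layer's
// IsDecommissionedStatusErr helper; the real helper inspects the structured
// error rather than its string.
var isDecommissionedStatusErr = func(err error) bool { return false } // placeholder wiring

// errorIsExpectedDuringNormalOperation now delegates to the structured check
// only, so unrelated PermissionDenied or closed-connection errors still count
// as probe failures.
func errorIsExpectedDuringNormalOperation(err error) bool {
	return err != nil && isDecommissionedStatusErr(err)
}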

@joshimhoff joshimhoff force-pushed the kvprober_decommission branch from e9e2655 to 2611c74 Compare July 11, 2023 14:29
Contributor

@nvb nvb left a comment

:lgtm: thanks for iterating on this. The end result feels like a good solution, at least for the time being.

Reviewed 5 of 5 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jmcarp, @smg260, and @srosenberg)

@joshimhoff
Collaborator Author

TFTR!

@joshimhoff
Collaborator Author

bors r+

@craig
Contributor

craig bot commented Jul 13, 2023

Build succeeded:

@craig craig bot merged commit d6665c0 into cockroachdb:master Jul 13, 2023
@joshimhoff
Collaborator Author

I tested with the (slightly flaky) integration test post merge. It passed as expected. Will send backports in a second.
