rpc: track failed connections #96566

koorosh · 2023-02-05T20:23:44Z

This is an attempt to track information about broken connections so that
both healthy and broken connections are visible.
The timestamp at which a connection broke is recorded, and the connection
is periodically re-dialed in an attempt to restore it.

In addition to the existing changes, the following questions should be resolved:

Should we retry restoring a connection indefinitely, or give up
after several attempts?
If a connection fails because of, for instance, a wrong address or node ID,
should we somehow detect such cases or not?

Jira epic: https://cockroachlabs.atlassian.net/browse/CRDB-21710

tbg

Flushing some comments.

Before we get wrapped up in the code changes too much, let's make sure this suggested behavior is reasonable:

keep state for all nodes
dial unreachable nodes periodically
but not if they're decommissioned (source of truth is tombstone storage)
remove tracked state after nobody has tried to dial a given node for 2h
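
The four rules above can be sketched as a small tracker. This is only an illustration of the proposed policy, not the code in this PR: `peerTracker`, `peerState`, `manualClock`, `idleTTL`, and every method name are hypothetical, node IDs are plain ints, and the decommissioned flag is set by hand where the real code would consult tombstone storage.

```go
package main

import (
	"sort"
	"sync"
	"time"
)

// idleTTL is how long a peer may go without an external dial before eviction.
const idleTTL = 2 * time.Hour

// manualClock is a hand-advanced clock so the policy can be exercised
// without sleeping. Hypothetical helper, not a CockroachDB type.
type manualClock struct{ now time.Time }

func (c *manualClock) Now() time.Time          { return c.now }
func (c *manualClock) Advance(d time.Duration) { c.now = c.now.Add(d) }

// peerState is kept for every node, healthy or not.
type peerState struct {
	healthy        bool
	decommissioned bool      // assumed to mirror tombstone storage
	lastDialed     time.Time // last external (caller-initiated) dial attempt
}

type peerTracker struct {
	mu    sync.Mutex
	clock *manualClock
	peers map[int]*peerState // keyed by node ID
}

func newPeerTracker(clock *manualClock) *peerTracker {
	return &peerTracker{clock: clock, peers: map[int]*peerState{}}
}

// noteDial records that a caller asked for nodeID and how the dial went.
func (t *peerTracker) noteDial(nodeID int, healthy bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	p, ok := t.peers[nodeID]
	if !ok {
		p = &peerState{}
		t.peers[nodeID] = p
	}
	p.healthy = healthy
	p.lastDialed = t.clock.Now()
}

// markDecommissioned flags the peer instead of deleting its entry.
func (t *peerTracker) markDecommissioned(nodeID int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if p, ok := t.peers[nodeID]; ok {
		p.decommissioned = true
	}
}

// redialCandidates lists unreachable, non-decommissioned peers in ID order.
func (t *peerTracker) redialCandidates() []int {
	t.mu.Lock()
	defer t.mu.Unlock()
	var ids []int
	for id, p := range t.peers {
		if !p.healthy && !p.decommissioned {
			ids = append(ids, id)
		}
	}
	sort.Ints(ids)
	return ids
}

// evictIdle drops peers nobody has dialed for idleTTL and returns the count.
func (t *peerTracker) evictIdle() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	evicted := 0
	for id, p := range t.peers {
		if t.clock.Now().Sub(p.lastDialed) >= idleTTL {
			delete(t.peers, id)
			evicted++
		}
	}
	return evicted
}
```

Note that background re-dials do not touch `lastDialed` in this sketch; only caller-initiated dials keep an entry alive, which is what lets the 2h rule act as the eviction policy.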

tbg · 2023-02-08T08:20:19Z

pkg/rpc/context.go

Data race. Also, as we add locking, make sure not to hold the lock across the dial.
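
To illustrate the locking shape being asked for, here is a minimal sketch under assumed names (`pool`, `conn`, `get`, and the `dial` field are all hypothetical, not identifiers from pkg/rpc/context.go). The mutex guards only the map and is released before the potentially slow dial, then re-acquired to publish the result.

```go
package main

import "sync"

// conn is a placeholder for an established connection.
type conn struct{ addr string }

// pool caches one connection per address.
type pool struct {
	mu    sync.Mutex
	conns map[string]*conn
	dial  func(addr string) (*conn, error) // may block for a long time
}

// get returns the cached connection for addr, dialing if there is none.
// mu is never held while dial runs, so a slow or hanging dial to one
// address cannot block lookups for other addresses.
func (p *pool) get(addr string) (*conn, error) {
	p.mu.Lock()
	c, ok := p.conns[addr]
	p.mu.Unlock()
	if ok {
		return c, nil
	}

	c, err := p.dial(addr) // no lock held here
	if err != nil {
		return nil, err
	}

	p.mu.Lock()
	defer p.mu.Unlock()
	if existing, ok := p.conns[addr]; ok {
		// Another caller finished dialing first; keep their connection.
		return existing, nil
	}
	p.conns[addr] = c
	return c, nil
}
```

The cost of this shape is that two callers may dial the same address concurrently and one result is discarded; a real implementation would likely also close the losing connection or deduplicate in-flight dials.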

tbg · 2023-02-08T08:22:18Z

pkg/rpc/context.go

This is racy: there might still be an attempt to dial this node shortly afterwards, and then we're reinserting the entry but never evicting it again.

I think a better approach is to mark the peer as decommissioned (i.e. add a new field to c) so that we don't attempt to re-dial it in the connection maintenance loop you added above.

Since nodes can restart, ideally we'd be consulting the nodeTombstoneStorage here, behind a suitable abstraction.

Done. A decommissioned flag is now used instead of removing records from the map.
Now, when a connection fails, we determine whether the node is decommissioned from the error type, which in turn is based on the nodeTombstoneStorage logic.
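
A rough sketch of that flow, with assumed names throughout: `errDecommissioned`, `errUnreachable`, `peer`, `onDialResult`, `shouldRedial`, and `checkTombstone` are hypothetical, and a plain map stands in for nodeTombstoneStorage. The dial path wraps a sentinel error when it finds a tombstone, and the peer sets its flag by inspecting the error rather than by deleting its map entry.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors; the real code would carry this information
// in whatever error type the tombstone check produces.
var (
	errDecommissioned = errors.New("node is decommissioned")
	errUnreachable    = errors.New("node is unreachable")
)

// peer is the tracked state for one remote node.
type peer struct {
	decommissioned bool
	lastErr        error
}

// onDialResult records the outcome of a dial. A decommissioned error is
// sticky: the flag is never cleared once set.
func (p *peer) onDialResult(err error) {
	p.lastErr = err
	if errors.Is(err, errDecommissioned) {
		p.decommissioned = true
	}
}

// shouldRedial reports whether the maintenance loop should retry this peer.
func (p *peer) shouldRedial() bool {
	return p.lastErr != nil && !p.decommissioned
}

// checkTombstone stands in for a tombstone-storage lookup. It wraps the
// sentinel with %w so callers can match it with errors.Is.
func checkTombstone(tombstones map[int]bool, nodeID int) error {
	if tombstones[nodeID] {
		return fmt.Errorf("n%d: %w", nodeID, errDecommissioned)
	}
	return nil
}
```

Because the entry stays in the map with the flag set, a dial that arrives just after decommissioning finds the existing record instead of reinserting a fresh one.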

tbg · 2023-02-08T08:38:07Z

pkg/rpc/context.go

Need to check the error.

Also, and maybe this is making perfect the enemy of good, sort of sad to have a single goroutine that's always there and ticking but can then get stuck on an individual node.

We could try to introduce, on peer, an async probing-based circuit breaker (util/circuit) and make its probe loop responsible for reconnecting. This would be beneficial in its own right, since it would (later, and not related to your work directly) allow us to phase out the circuit breakers we have at the nodedialer level.

@tbg, done. Each peer is now instantiated with its own circuit breaker. However, I now have to pass a Context, a Stopper, and a closure that dials the node into the TryInsert function; it feels overloaded and probably needs to be refactored in some way.
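
For readers unfamiliar with the pattern, here is a deliberately simplified per-peer breaker with an asynchronous probe. It is not the util/circuit API and not the code in this PR; `breaker`, `trip`, `probeOnce`, `Report`, `errUnavailable`, and the `stop` channel (standing in for the Stopper) are all hypothetical names.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// errUnavailable is a hypothetical error that callers report on failure.
var errUnavailable = errors.New("peer unavailable")

// breaker is a minimal probing circuit breaker owned by a single peer.
type breaker struct {
	mu      sync.Mutex
	err     error           // non-nil while the breaker is tripped
	probing bool            // true while a probe goroutine is running
	probe   func() error    // re-dials the peer; nil means healthy again
	every   time.Duration   // probe interval; must be > 0 for Report
	stop    <-chan struct{} // closed on shutdown
}

// Err lets callers fail fast: nil means the peer may be used.
func (b *breaker) Err() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.err
}

// trip records err and reports whether the caller should start a probe loop.
func (b *breaker) trip(err error) (startProbe bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.err = err
	startProbe = !b.probing
	b.probing = true
	return startProbe
}

// probeOnce runs a single probe without holding the lock and reports
// whether the breaker healed.
func (b *breaker) probeOnce() (healed bool) {
	err := b.probe()
	b.mu.Lock()
	defer b.mu.Unlock()
	b.err = err
	if err == nil {
		b.probing = false
	}
	return err == nil
}

// Report trips the breaker and launches at most one probe goroutine, which
// keeps probing until the peer heals or stop is closed.
func (b *breaker) Report(err error) {
	if !b.trip(err) {
		return
	}
	go func() {
		t := time.NewTicker(b.every)
		defer t.Stop()
		for {
			select {
			case <-b.stop:
				return
			case <-t.C:
				if b.probeOnce() {
					return
				}
			}
		}
	}()
}
```

Since each peer owns its probe goroutine, a node that hangs during reconnect only stalls its own probe rather than one shared maintenance loop.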

tbg · 2023-02-08T08:39:09Z

pkg/rpc/context.go

Let's also track the last "external" attempt at dialing the node. When nobody has tried to reach a node for, say, 2h, we remove it from the map, and this is our eviction policy.

blathers-crl · 2023-02-20T12:10:10Z

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

We notice you have more than one commit in your PR. We try to break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
Please ensure your git commit message contains a release note.
When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

tbg · 2023-02-24T15:52:02Z

pkg/rpc/context.go

+						if !inserted {
+							// someone else inserted connection and probably
+							// started dialing.
+							continue


Why does it matter who inserted? We are just trying to make sure there's a Connection, and we want to see it in a healthy state (call Connect). Why the continue here?

tbg · 2023-02-24T15:52:32Z

pkg/rpc/context.go

+					defer done()
+					for {
+						// periodically run probe until connection is healed.
+						// TODO (koorosh): would it be enough to pause for `heartbeatInterval` duration?


https://cockroachlabs.slack.com/archives/C04G3B34PEZ/p1677253494714269

tbg · 2023-02-24T15:55:51Z

pkg/rpc/context.go

+	// TODO (koorosh): breaker is stored per `peer` and shouldn't be reinitialized every time `grpcDialNodeInternal`
+	// is called.
+	if pc.breaker == nil {
+		// TODO (koorosh): Probably this logic should be extracted from here...


+1 maybe to rpc/breaker.go because you'll ultimately have to disable the existing breaker functionality I think! That functionality mostly powers (*nodedialer.Dialer).getBreaker but now components that sit on top of rpc.Context don't need their own breaker any more, they'll just blindly dial and the rpc.Context will fail-fast as appropriate.

pkg/util/grpcutil/grpc_util.go

This will be used in a follow-up commit to more precisely communicate that a connection attempt failed due to a decommissioned peer or source. It will be more important to get this right as we switch to a stateful connection pool. We don't want to attempt to reconnect to decommissioned nodes forever, and we also don't want to misrepresent them as "unavailable" for the purpose of network observability. So far we're relying on the `PermissionDenied` error code but this is already overloaded. It's better to provide an explicit gRPC status payload that leaves no room for interpretation. Epic: none Release note: None

It will be convenient to refer to it instead of typing out the anonymous interface multiple times. I'm not sure why I didn't introduce this much earlier, must have been a long bout of misguided purism. Epic: none Release note: None

tbg · 2023-05-15T15:55:56Z

Closing for #99191, which I'm actively working on.

blathers-crl bot added the O-community and X-blathers-untriaged labels Feb 5, 2023

koorosh force-pushed the rpc-track-failed-connections branch from 09734ed to ecebd25 on February 5, 2023 20:38

koorosh added the do-not-merge label Feb 6, 2023

koorosh requested a review from tbg February 6, 2023 13:15

koorosh force-pushed the rpc-track-failed-connections branch 3 times, most recently from ea2ad56 to 22d5578 on February 8, 2023 15:21

tbg reviewed Feb 8, 2023

koorosh force-pushed the rpc-track-failed-connections branch 2 times, most recently from 2b6c895 to ccbd19d on February 15, 2023 16:41

Santamaura mentioned this pull request Feb 15, 2023

server: validate removing node connections via nodeTombstoneStorage #97188

Closed

koorosh force-pushed the rpc-track-failed-connections branch from 221d212 to 8d22ba5 on February 16, 2023 16:43

koorosh force-pushed the rpc-track-failed-connections branch 6 times, most recently from 8cd1f38 to 43b447b on February 24, 2023 12:23

tbg reviewed Feb 24, 2023

tbg reviewed Feb 24, 2023

tbg mentioned this pull request Feb 27, 2023

kv: convert uni-directional network partitions to bi-directional #94778

Merged

koorosh force-pushed the rpc-track-failed-connections branch from 43b447b to 9aa0162 on February 27, 2023 14:02

j82w requested a review from a team May 3, 2023 15:54

tbg force-pushed the rpc-track-failed-connections branch from 33f44f7 to 0c5c0dc on May 9, 2023 07:28

tbg added 2 commits May 11, 2023 14:16

stop: add OnQuiesce

49ebdc1

I've noticed in the past that this is needed sometimes, including in the next commit. Epic: none Release note: None

server: fix a (harmless) test buglet

0d13335

Can't scan from `/Min`. Epic: none Release note: None

tbg force-pushed the rpc-track-failed-connections branch from 0c5c0dc to 1074a4a on May 11, 2023 21:06

tbg added 22 commits May 12, 2023 17:34

breaker: allow testing override of Signal

c06425e

We'll need this to port existing tests that rely on manually manipulating breaker state. Epic: none Release note: None

rpc: small comment tweak

f17ed4c

Epic: None Release note: None

rpc: fix incorrect comment

9068a18

What it said used to be true, but it caused all kinds of problems and so we've stopped sharing long ago. Epic: None Release note: None

rpc: touch up DialBack functionality

4b0f048

Clarified the semantics a bit and made sure it didn't dig into the lower layers more than it needed to. Left some comments around oddities of the approach that are good to know when working in the area. Epic: none Release note: None

rpc: fix display for remoteNode 0

713e222

It was printing byte '?' instead of "?" previously. Epic: None Release note: None

rpc: rename a package

817edd2

A type peer will be introduced soon. Epic: None Release note: None

server: add comment pointing out TestGraphite slowness

0d71bdb

Epic: none Release note: None

rpc: minor Goland lint appeasement

2fd1d74

Epic: none Release note: None

rpc: avoid shadowing backoff package

9660a4c

Epic: none Release note: None

rpc: make some unused args anonymous

25c4913

Epic: none Release note: None

base: set explicit small RPCHeartbeatInterval

ac5847d

The previous default, zero, effectively implied a tight loop of heartbeats. This can't be good for race/stress builds, so add at least a little breath. Epic: None Release note: None

wip

258dd81

nodedialer: remove test for deprecated breaker from TestConnHealth{,T…

486af0a

…ryDial} Epic: none Release note: None

nodedialer: modernize TestDialNoBreaker

481ea65

Test the functionality via the new circuit breakers, i.e. this test will continue working if we remove the old breakers (which it no longer tests). Epic: none Release note: None

nodedialer: remove TestConcurrentCancellationAndTimeout

f874f1e

It was specific to the nodedialer-level circuit breakers, which will be removed. Epic: none Release note: None

nodedialer: remove breaker check from TestNodedialerPositive

7393f35

This was testing the deprecated breaker. Epic: none Release note: None

nodedialer: remove Test{Disconnects,ResolverErrors}Trip

7746e06

Epic: none Release note: None

importer: use new rpc breaker in TestImportRetriesBreakerOpenFailure

0aded05

Epic: none Release note: None

nodedialer: remove use of old breaker in TestConnHealthInternal

285789e

Epic: none Release note: None

rafttransport use new breakers and rework stuff, still need to take o…

2353b04

…ut one call to old breaker

tbg force-pushed the rpc-track-failed-connections branch from 1074a4a to 2353b04 on May 12, 2023 15:46

tbg closed this May 15, 2023
