-
Notifications
You must be signed in to change notification settings - Fork 4k
rpc: track failed connections #96566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
09734ed to
ecebd25
Compare
This comment was marked as resolved.
This comment was marked as resolved.
ea2ad56 to
22d5578
Compare
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing some comments.
Before we get wrapped up in the code changes too much, let's make sure this suggested behavior is reasonable:
- keep state for all nodes
- dial unreachable nodes periodically
- but not if they're decommissioned (source of truth is tombstone storage)
- remove tracked state after nobody has tried to dial a given node for 2h
pkg/rpc/context.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Data race (also as we add locking make sure not to hold the lock across the dial).
pkg/rpc/context.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is racy - there might still be a attempt to dial this node shortly afterwards, and then we're reinserting the entry but never evicting it again.
I think a better approach is to mark the peer as decommissioned (i.e. add a new field to c) so that we don't attempt to re-dial it in the connection maintenance loop you added above.
Since nodes can restart, ideally we'd be consulting the nodeTombstoneStorage here, behind a suitable abstraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. decommissioned flag is used instead of removing records from map.
Now, when connection fails, we derive information whether node is decommission based on error type that in turn based on nodeTombstoneStorage logic.
pkg/rpc/context.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to check error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, and maybe this is making perfect the enemy of good, sort of sad to have a single goroutine that's always there and ticking but can then get stuck on an individual node.
We could try to introduce, on peer, an async probing-based circuit breaker (util/circuit) and make its probe loop responsible for reconnecting. This would be beneficial in its own right, since it would (later, and not related to your work directly) allow us to phase out the circuit breakers we have at the nodedialer level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tbg , done. Now peer is instantiated with its own circuit breaker, but now I need to provide Context, Stopper, and enclosed function to dial node within TryInsert function and I feel it is a bit overloaded func and maybe has to be refactored in some way.
pkg/rpc/context.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also track the last "external" attempt at dialing the node. When nobody has tried to reach a node for, say, 2h, we remove it from the map, and this is our eviction policy.
2b6c895 to
ccbd19d
Compare
221d212 to
8d22ba5
Compare
|
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
8cd1f38 to
43b447b
Compare
pkg/rpc/context.go
Outdated
| if !inserted { | ||
| // someone else inserted connection and probably | ||
| // started dialing. | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it matter who inserted? We are just trying to make sure there's a Connection, and we want to see it in a healthy state (call Connect). Why the continue here?
pkg/rpc/context.go
Outdated
| defer done() | ||
| for { | ||
| // periodically run probe until connection is healed. | ||
| // TODO (koorosh): would it be enough to pause for `heartbeatInterval` duration? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg/rpc/context.go
Outdated
| // TODO (koorosh): breaker is stored per `peer` and shouldn't be reinitialized every time `grpcDialNodeInternal` | ||
| // is called. | ||
| if pc.breaker == nil { | ||
| // TODO (koorosh): Probably this logic should be extracted from here... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 maybe to rpc/breaker.go because you'll ultimately have to disable the existing breaker functionality I think! That functionality mostly powers (*nodedialer.Dialer).getBreaker but now components that sit on top of rpc.Context don't need their own breaker any more, they'll just blindly dial and the rpc.Context will fail-fast as appropriate.
43b447b to
9aa0162
Compare
33f44f7 to
0c5c0dc
Compare
I've noticed in the past that this is needed sometimes, including in the next commit. Epic: none Release note: None
Can't scan from `/Min`. Epic: none Release note: None
0c5c0dc to
1074a4a
Compare
We'll need this to port existing tests that rely on manually manipulating breaker state. Epic: none Release note: None
Epic: None Release note: None
What it said used to be true, but it caused all kinds of problems and so we've stopped sharing long ago. Epic: None Release note: None
Clarified the semantics a bit and made sure it didn't dig into the lower layers more than it needed to. Left some comments around oddities of the approach that are good to know when working in the area. Epic: none Release note: None
This will be used in a follow-up commit to more precisely communicate that a connection attempt failed due to a decommissioned peer or source. It will be more important to get this right as we switch to a stateful connection pool. We don't want to attempt to reconnect to decommissioned nodes forever, and we also don't want to misrepresent them as "unavailable" for the purpose of network observability. So far we're relying on the `PermissionDenied` error code but this is already overloaded. It's better to provide an explicit gRPC status payload that leaves no room for interpretation. Epic: none Release note: None
It was printing byte '?' instead of "?" previously. Epic: None Release note: None
It will be convenient to refer to it instead of typing out the anonymous interface multiple times. I'm not sure why I didn't introduce this much earlier, must have been a long bout of misguided purism. Epic: none Release note: None
A type peer will be introduced soon. Epic: None Release note: None
Epic: none Release note: None
Epic: none Release note: None
Epic: none Release note: None
Epic: none Release note: None
The previous default, zero, effectively implied a tight loop of heartbeats. This can't be good for race/stress builds, so add at least a little breath. Epic: None Release note: None
…ryDial} Epic: none Release note: None
Test the functionality via the new circuit breakers, i.e. this test will continue working if we remove the old breakers (which it no longer tests). Epic: none Release note: None
It was specific to the nodedialer-level circuit breakers, which will be removed. Epic: none Release note: None
This was testing the deprecated breaker. Epic: none Release note: None
Epic: none Release note: None
Epic: none Release note: None
Epic: none Release note: None
…ut one call to old breaker
1074a4a to
2353b04
Compare
|
Closing for #99191, which I'm actively working on. |
This is an attempt to track information about broken connections to
have visibility of both healthy and broken connections.
Timestamp at which connection was broken is tracked and periodically
it attempts to restore connection.
In addition to existing changes, following should be resolved:
after several attempts?
should we somehow detect such cases or not?
Jira epic: https://cockroachlabs.atlassian.net/browse/CRDB-21710