Conversation

@koorosh
Contributor

@koorosh koorosh commented Feb 5, 2023

This is an attempt to track information about broken connections so that we
have visibility into both healthy and broken connections. The timestamp at
which a connection broke is recorded, and broken connections are periodically
re-dialed in an attempt to restore them.

In addition to the existing changes, the following questions should be resolved:

  • Should we retry indefinitely to restore a connection, or give up
    after several attempts?
  • If a connection fails because of, for instance, a wrong address or node ID,
    should we somehow detect such cases or not?

Jira epic: https://cockroachlabs.atlassian.net/browse/CRDB-21710

@blathers-crl

This comment was marked as resolved.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-untriaged blathers was unable to find an owner labels Feb 5, 2023
@cockroach-teamcity
Member

This change is Reviewable

@koorosh koorosh force-pushed the rpc-track-failed-connections branch from 09734ed to ecebd25 Compare February 5, 2023 20:38
@blathers-crl

This comment was marked as resolved.

@koorosh koorosh added the do-not-merge bors won't merge a PR with this label. label Feb 6, 2023
@koorosh koorosh requested a review from tbg February 6, 2023 13:15
@koorosh koorosh force-pushed the rpc-track-failed-connections branch 3 times, most recently from ea2ad56 to 22d5578 Compare February 8, 2023 15:21
Member

@tbg tbg left a comment

Flushing some comments.

Before we get wrapped up in the code changes too much, let's make sure this suggested behavior is reasonable:

  • keep state for all nodes
  • dial unreachable nodes periodically
  • but not if they're decommissioned (source of truth is tombstone storage)
  • remove tracked state after nobody has tried to dial a given node for 2h
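
To make the proposal concrete, here is a minimal sketch of that behavior, assuming a hypothetical `connTracker` owned by the rpc layer (all names here are illustrative, not the PR's actual code):

```go
package rpc // illustrative placement only

import (
	"context"
	"time"
)

// peerState is a hypothetical per-node record: one entry per known node,
// healthy or broken.
type peerState struct {
	addr           string
	brokenSince    time.Time // zero while healthy
	decommissioned bool      // tombstoned nodes are never re-dialed
	lastExternal   time.Time // last caller-initiated dial attempt
}

// connTracker keeps state for all nodes. Locking is elided here; a later
// sketch shows how to avoid holding a lock across dials.
type connTracker struct {
	peers map[int32]*peerState
	dial  func(ctx context.Context, id int32, addr string) error
}

// maintain periodically walks the tracked peers: decommissioned nodes are
// skipped, entries with no external interest for 2h are evicted, and
// broken-but-wanted peers are re-dialed.
func (t *connTracker) maintain(ctx context.Context) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for id, p := range t.peers {
				switch {
				case time.Since(p.lastExternal) > 2*time.Hour:
					delete(t.peers, id) // eviction policy
				case p.decommissioned || p.brokenSince.IsZero():
					// Permanently gone, or currently healthy: nothing to do.
				default:
					_ = t.dial(ctx, id, p.addr) // best-effort; re-checked next tick
				}
			}
		}
	}
}
```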

Member

Data race (also, as we add locking, make sure not to hold the lock across the dial).
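
For example (a sketch only, assuming the tracker from the earlier sketch gains a `mu sync.Mutex`): snapshot the re-dial targets under the lock, release it, and only then dial.

```go
// redialBroken copies the targets while holding the lock and dials with the
// lock released, so a slow or hung dial cannot block other tracker users.
func (t *connTracker) redialBroken(ctx context.Context) {
	type target struct {
		id   int32
		addr string
	}
	var toDial []target
	t.mu.Lock()
	for id, p := range t.peers {
		if !p.decommissioned && !p.brokenSince.IsZero() {
			toDial = append(toDial, target{id, p.addr})
		}
	}
	t.mu.Unlock() // never hold t.mu across the dials below
	for _, tgt := range toDial {
		_ = t.dial(ctx, tgt.id, tgt.addr)
	}
}
```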

Member

This is racy - there might still be an attempt to dial this node shortly afterwards, and then we're reinserting the entry but never evicting it again.

I think a better approach is to mark the peer as decommissioned (i.e. add a new field to c) so that we don't attempt to re-dial it in the connection maintenance loop you added above.

Since nodes can restart, ideally we'd be consulting the nodeTombstoneStorage here, behind a suitable abstraction.
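
A sketch of that suggestion, with `tombstones` standing in as a hypothetical wrapper around nodeTombstoneStorage:

```go
// markIfDecommissioned flips a flag instead of deleting the entry, so the
// maintenance loop skips the peer from then on. Because a mere dial failure
// doesn't prove decommissioning (nodes can simply be restarting), we consult
// the tombstone abstraction before setting the flag.
func (t *connTracker) markIfDecommissioned(nodeID int32) {
	if !t.tombstones.IsDecommissioned(nodeID) { // hypothetical abstraction
		return
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if p, ok := t.peers[nodeID]; ok {
		p.decommissioned = true
	}
}
```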

Contributor Author

Done. A decommissioned flag is now used instead of removing records from the map.
When a connection fails, we derive whether the node is decommissioned from the error type, which in turn is based on the nodeTombstoneStorage logic.

Member

Need to check the error.

Member

Also, and maybe this is making perfect the enemy of good, it's sort of sad to have a single goroutine that's always there and ticking but can then get stuck on an individual node.

We could try to introduce, on peer, an async probing-based circuit breaker (util/circuit) and make its probe loop responsible for reconnecting. This would be beneficial in its own right, since it would (later, and not related to your work directly) allow us to phase out the circuit breakers we have at the nodedialer level.
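
To make the idea concrete, here is a hand-rolled probe loop in the spirit of util/circuit's async probe; this is not the actual util/circuit API, and `peer.dial` is a hypothetical per-peer dial function:

```go
// probeLoop runs only while the peer's breaker is tripped. It retries the
// dial with capped exponential backoff and reports each outcome; a nil error
// heals the breaker and ends the probe until the next trip. Because every
// peer has its own probe goroutine, one stuck node cannot stall the others.
func (p *peer) probeLoop(ctx context.Context, report func(error)) {
	backoff := time.Second
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
		err := p.dial(ctx)
		report(err)
		if err == nil {
			return
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}
```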

Contributor Author

@tbg, done. Each peer is now instantiated with its own circuit breaker, but I need to pass a Context, a Stopper, and an enclosed node-dial function into TryInsert, which makes the function feel a bit overloaded and it may have to be refactored in some way.

Member

Let's also track the last "external" attempt at dialing the node. When nobody has tried to reach a node for, say, 2h, we remove it from the map, and this is our eviction policy.
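
Concretely, the eviction clock could be a timestamp bumped on every caller-initiated dial, along these lines (names hypothetical):

```go
// touch records an "external" dial attempt, i.e. a caller actually asked for
// this node. The maintenance loop evicts entries whose last touch is more
// than two hours old.
func (t *connTracker) touch(nodeID int32) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if p, ok := t.peers[nodeID]; ok {
		p.lastExternal = time.Now()
	}
}
```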

@koorosh koorosh force-pushed the rpc-track-failed-connections branch 2 times, most recently from 2b6c895 to ccbd19d Compare February 15, 2023 16:41
@koorosh koorosh force-pushed the rpc-track-failed-connections branch from 221d212 to 8d22ba5 Compare February 16, 2023 16:43
@blathers-crl
Copy link

blathers-crl bot commented Feb 20, 2023

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try to break logical changes into separate commits, but commits such as "fix typo" or "address review comments" should be squashed into one commit and pushed with --force
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@koorosh koorosh force-pushed the rpc-track-failed-connections branch 6 times, most recently from 8cd1f38 to 43b447b Compare February 24, 2023 12:23
if !inserted {
	// someone else inserted connection and probably
	// started dialing.
	continue
Member

Why does it matter who inserted? We are just trying to make sure there's a Connection, and we want to see it in a healthy state (call Connect). Why the continue here?
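
In other words, something along these lines, where the `inserted` flag never matters (a sketch; `connPool` and `getOrInsertConn` stand in for the real helpers):

```go
// ensureHealthyConn shows the suggested shape: get-or-insert, then Connect,
// regardless of who inserted the entry.
func ensureHealthyConn(ctx context.Context, pool *connPool, nodeID int32, addr string) error {
	conn := pool.getOrInsertConn(nodeID, addr) // whoever inserted it, it's the same Connection
	_, err := conn.Connect(ctx)                // drive it to a healthy state
	return err // a failure is recorded by the tracker and retried by its probe
}
```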

defer done()
for {
	// periodically run probe until connection is healed.
	// TODO (koorosh): would it be enough to pause for `heartbeatInterval` duration?
Member

// TODO (koorosh): breaker is stored per `peer` and shouldn't be reinitialized every time
// `grpcDialNodeInternal` is called.
if pc.breaker == nil {
	// TODO (koorosh): Probably this logic should be extracted from here...
Member

+1, maybe to rpc/breaker.go, because you'll ultimately have to disable the existing breaker functionality, I think! That functionality mostly powers (*nodedialer.Dialer).getBreaker, but now components that sit on top of rpc.Context don't need their own breaker anymore; they'll just blindly dial and the rpc.Context will fail fast as appropriate.

@j82w j82w requested a review from a team May 3, 2023 15:54
@tbg tbg force-pushed the rpc-track-failed-connections branch from 33f44f7 to 0c5c0dc Compare May 9, 2023 07:28
tbg added 2 commits May 11, 2023 14:16
I've noticed in the past that this is needed sometimes, including
in the next commit.

Epic: none
Release note: None
Can't scan from `/Min`.

Epic: none
Release note: None
@tbg tbg force-pushed the rpc-track-failed-connections branch from 0c5c0dc to 1074a4a Compare May 11, 2023 21:06
tbg added 22 commits May 12, 2023 17:34
We'll need this to port existing tests that rely on manually manipulating
breaker state.

Epic: none
Release note: None
Epic: None
Release note: None
What it said used to be true, but it caused all kinds of problems, so we stopped sharing long ago.

Epic: None
Release note: None
Clarified the semantics a bit and made sure it didn't dig into the lower layers
more than it needed to. Left some comments around oddities of the approach that
are good to know when working in the area.

Epic: none
Release note: None
This will be used in a follow-up commit to more precisely communicate that a
connection attempt failed due to a decommissioned peer or source.

It will be more important to get this right as we switch to a stateful
connection pool. We don't want to attempt to reconnect to decommissioned nodes
forever, and we also don't want to misrepresent them as "unavailable" for the
purpose of network observability.

So far we're relying on the `PermissionDenied` error code but this is already
overloaded. It's better to provide an explicit gRPC status payload that leaves
no room for interpretation.

Epic: none
Release note: None
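
For reference, attaching and checking an explicit status payload as described in the commit message above could look roughly like this, using the generic `errdetails.ErrorInfo` as a stand-in for whatever CRDB-specific proto actually gets added:

```go
package rpcstatus // illustrative only

import (
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errDecommissioned builds a PermissionDenied error that carries an explicit
// "decommissioned" payload, so callers don't have to guess from the code alone.
func errDecommissioned(msg string) error {
	st := status.New(codes.PermissionDenied, msg)
	if withDetails, err := st.WithDetails(&errdetails.ErrorInfo{Reason: "DECOMMISSIONED"}); err == nil {
		return withDetails.Err()
	}
	return st.Err() // fall back to the bare status
}

// isDecommissionedErr checks for the payload rather than the overloaded code.
func isDecommissionedErr(err error) bool {
	st, ok := status.FromError(err)
	if !ok {
		return false
	}
	for _, d := range st.Details() {
		if info, ok := d.(*errdetails.ErrorInfo); ok && info.GetReason() == "DECOMMISSIONED" {
			return true
		}
	}
	return false
}
```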
It was previously printing the byte '?' instead of "?".

Epic: None
Release note: None
It will be convenient to refer to it instead of typing out the anonymous
interface multiple times. I'm not sure why I didn't introduce this much
earlier, must have been a long bout of misguided purism.

Epic: none
Release note: None
A type peer will be introduced soon.

Epic: None
Release note: None
Epic: none
Release note: None
Epic: none
Release note: None
Epic: none
Release note: None
The previous default, zero, effectively implied a tight loop of heartbeats. This can't be good for race/stress builds, so add at least a little breathing room.

Epic: None
Release note: None
Test the functionality via the new circuit breakers, i.e. this test will
continue working if we remove the old breakers (which it no longer tests).

Epic: none
Release note: None
It was specific to the nodedialer-level circuit breakers, which will be
removed.

Epic: none
Release note: None
This was testing the deprecated breaker.

Epic: none
Release note: None
@tbg tbg force-pushed the rpc-track-failed-connections branch from 1074a4a to 2353b04 Compare May 12, 2023 15:46
@tbg
Member

tbg commented May 15, 2023

Closing for #99191, which I'm actively working on.

@tbg tbg closed this May 15, 2023