Skip to content

Conversation

@tbg
Copy link
Member

@tbg tbg commented May 24, 2023

Now that we have #99191, we can simplify VerifyDialback. Storing connection attempts is no longer necessary since the circuit breakers keep track of the health of the remote.

Epic: none
Release note: none

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg force-pushed the dialback-stateful-breaker-simplify branch from 63eb993 to 659f5ab Compare June 12, 2023 12:01
@blathers-crl
Copy link

blathers-crl bot commented Jun 12, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg tbg changed the title rpc: trigger dial in VerifyDialback fast path check [DO NOT REVIEW HERE] rpc: simplify VerifyDialback via circuit breakers Jun 12, 2023
@tbg tbg marked this pull request as ready for review June 12, 2023 12:22
@tbg tbg requested a review from a team as a code owner June 12, 2023 12:22
@tbg
Copy link
Member Author

tbg commented Jun 12, 2023

I think the code change is good to go, but the test has become flaky because it was relying on certain connections being inactive. I'm taking a look.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the simplification.

I suppose this is in principle slightly less reliable: previously, if the reverse connection was unhealthy for NON_BLOCKING we would initiate a connection attempt and not fail until we had its result. Now, we may instead see a faulty state from the circuit breaker's last attempt and fail fast until the circuit breaker initiates its next attempt.

I think in practice this isn't likely to be a problem, because the circuit breaker also uses the heartbeat interval for its retry interval. So if we're recovering after a partition, the BLOCKING reverse connection succeeds, the peer waits for a heartbeat interval, sends a NON_BLOCKING ping, and that should always have given the circuit breaker enough time to initiate the new connection attempt, at which point the reverse connection will see ErrNotHeartbeated and return success until the circuit breaker attempt finalizes.

If this doesn't hold, i.e. if a NON_BLOCKING ping arrives before the circuit breaker retries, then connections will keep erroring out on their second ping until the retry begins.

// connection again, etc, for an infinite loop of blocking connections.
// A throwaway connection keeps it simple.
ctx := rpcCtx.makeDialCtx(target, request.OriginNodeID, SystemClass)
ctx := rpcCtx.wrapCtx(ctx, target, request.OriginNodeID, SystemClass)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this worth a backport?

Copy link
Member Author

@tbg tbg Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, will do. Tracking in our sprint.

@tbg tbg force-pushed the dialback-stateful-breaker-simplify branch from 659f5ab to 854547c Compare June 13, 2023 11:24
@tbg tbg requested a review from a team as a code owner June 13, 2023 11:24
@tbg tbg force-pushed the dialback-stateful-breaker-simplify branch from 854547c to 5655d29 Compare June 13, 2023 12:14
@tbg
Copy link
Member Author

tbg commented Jun 13, 2023

bors single on
bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Jun 13, 2023

Build failed:

@tbg
Copy link
Member Author

tbg commented Jun 13, 2023

Ah, the joy of wrangling bazel.

tbg added 7 commits June 19, 2023 15:06
This will also trigger the fast-path connection if it does not exist.
This allows further refactors in subsequent commits.

Epic: none
Release note: None
With upcoming changes that trigger instantiating the system class connections
proactively, this test becomes flaky in a way that is hard to fix - we'll get
extra pings, but pings don't reflect the connection class, so we can't tell
which pings are on behalf of a connection we just created vs. an earlier one.

Since the dialback functionality is about to become much simpler, this test is
preemptively made simpler as well. We can then add a more targeted unit test to
balance this.

Epic: none
Release note: None
This "just" relies on the RPC circuit breakers for dialbacks.  The (small)
gotcha is that when they are disabled, dialbacks are disabled as well. This is
well worth the reduction in code, in my opinion, and we don't expect either
setting to be disabled as both are escape hatches.

Epic: none
Release note: None
Epic: none
Release note: None
The old mechanism struggled with this, the new one does not.

Epic: none
Release note: None
Epic: none
Release note: None
We didn't want to derive from `MasterCtx` (which has no cancellation) in this
case.

Epic: none
Release note: None
@tbg tbg force-pushed the dialback-stateful-breaker-simplify branch from 5655d29 to 1292998 Compare June 19, 2023 13:27
tbg added 2 commits June 20, 2023 12:22
Extract the rpcContext into an interface, and create mocks for it. Then add
TestVerifyDialback which unit tests the method directly using the mocks.

Epic: none
Release note: None
Release note: None
Epic: None
@tbg tbg force-pushed the dialback-stateful-breaker-simplify branch from 1292998 to 6bc5b45 Compare June 20, 2023 10:22
@tbg
Copy link
Member Author

tbg commented Jun 20, 2023

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Jun 20, 2023

Build succeeded:

@craig craig bot merged commit 41494d1 into cockroachdb:master Jun 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants