-
Notifications
You must be signed in to change notification settings - Fork 4k
rpc: simplify VerifyDialback via circuit breakers #103829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc: simplify VerifyDialback via circuit breakers #103829
Conversation
63eb993 to
659f5ab
Compare
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
I think the code change is good to go, but the test has become flaky because it was relying on certain connections being inactive. I'm taking a look. |
erikgrinaker
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate the simplification.
I suppose this is in principle slightly less reliable: previously, if the reverse connection was unhealthy for NON_BLOCKING we would initiate a connection attempt and not fail until we had its result. Now, we may instead see a faulty state from the circuit breaker's last attempt and fail fast until the circuit breaker initiates its next attempt.
I think in practice this isn't likely to be a problem, because the circuit breaker also uses the heartbeat interval for its retry interval. So if we're recovering after a partition, the BLOCKING reverse connection succeeds, the peer waits for a heartbeat interval, sends a NON_BLOCKING ping, and that should always have given the circuit breaker enough time to initiate the new connection attempt, at which point the reverse connection will see ErrNotHeartbeated and return success until the circuit breaker attempt finalizes.
If this doesn't hold, i.e. if a NON_BLOCKING ping arrives before the circuit breaker retries, then connections will keep erroring out on their second ping until the retry begins.
| // connection again, etc, for an infinite loop of blocking connections. | ||
| // A throwaway connection keeps it simple. | ||
| ctx := rpcCtx.makeDialCtx(target, request.OriginNodeID, SystemClass) | ||
| ctx := rpcCtx.wrapCtx(ctx, target, request.OriginNodeID, SystemClass) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this worth a backport?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, will do. Tracking in our sprint.
659f5ab to
854547c
Compare
854547c to
5655d29
Compare
|
bors single on |
|
Build failed: |
|
Ah, the joy of wrangling bazel. |
This will also trigger the fast-path connection if it does not exist. This allows further refactors in subsequent commits. Epic: none Release note: None
With upcoming changes that trigger instantiating the system class connections proactively, this test becomes flaky in a way that is hard to fix - we'll get extra pings, but pings don't reflect the connection class, so we can't tell which pings are on behalf of a connection we just created vs. an earlier one. Since the dialback functionality is about to become much simpler, this test is preemptively made simpler as well. We can then add a more targeted unit test to balance this. Epic: none Release note: None
This "just" relies on the RPC circuit breakers for dialbacks. The (small) gotcha is that when they are disabled, dialbacks are disabled as well. This is well worth the reduction in code, in my opinion, and we don't expect either setting to be disabled as both are escape hatches. Epic: none Release note: None
Epic: none Release note: None
The old mechanism struggled with this, the new one does not. Epic: none Release note: None
Epic: none Release note: None
We didn't want to derive from `MasterCtx` (which has no cancellation) in this case. Epic: none Release note: None
5655d29 to
1292998
Compare
Extract the rpcContext into an interface, and create mocks for it. Then add TestVerifyDialback which unit tests the method directly using the mocks. Epic: none Release note: None
Release note: None Epic: None
1292998 to
6bc5b45
Compare
|
bors r=erikgrinaker |
|
Build succeeded: |
Now that we have #99191, we can simplify
VerifyDialback. Storing connection attempts is no longer necessary since the circuit breakers keep track of the health of the remote.Epic: none
Release note: none