go.mod: bump raft to pick up https://github.com/etcd-io/raft/pull/52 #103826
Merged
Member
pav-kv
approved these changes
May 24, 2023
Collaborator
pav-kv
left a comment
go mod tidy?
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
May 24, 2023
shouldReplicaQuiesce checks that all followers are fully caught up, but it's still possible to have a follower in StateProbe (because we call `rawNode.ReportUnreachable` when an outgoing message gets dropped). Persistent StateProbe is problematic: we check for it before lease transfers, so despite the follower being fully caught up, we'd refuse the transfer.

Unfortunately, this commit by itself is not enough: even if the range is not quiesced, it needs to replicate a new entry to rectify the situation (i.e. switch the follower back to StateReplicate). This is because at the time of writing, receipt of a heartbeat response from the follower is not enough to move it back to StateReplicate. This was fixed upstream, in cockroachdb#103826.

However, this is still not enough! If the range quiesces successfully, and *then* `ReportUnreachable` is called, we still end up in the same state. TODO file issue about this.

I ran into the above issue on cockroachdb#99191, which adds persistent circuit breakers, when stressing `TestStoreMetrics`. That test happens to restart n2 when it's fully caught up. Because the circuit breakers are persistent, the leader will move n2 into StateProbe when it comes back up (we can end up dropping the first heartbeat sent to it, since the breaker hasn't untripped yet). But I believe that this bug is real even without this breaker re-work, just harder to trigger.

Epic: none

Release note (bug fix): fixed a problem that could lead to erroneously refused lease transfers (error message: "refusing to transfer lease to [...] because target may need a Raft snapshot: replica in StateProbe")
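For readers unfamiliar with the quiescence check, here is a minimal sketch of the rule this commit message describes: refuse to quiesce if any follower is behind or in StateProbe. All names here (`Follower`, `canQuiesce`, the `ProgressState` constants) are hypothetical stand-ins for illustration, not the actual kvserver or raft types.

```go
package main

import "fmt"

// ProgressState is a hypothetical stand-in for the follower state that the
// raft leader tracks; it is not the real raft tracker type.
type ProgressState int

const (
	StateProbe ProgressState = iota
	StateReplicate
	StateSnapshot
)

// Follower is a toy view of what the leader knows about one follower.
type Follower struct {
	ID    int
	Match uint64 // highest log index known to be replicated on the follower
	State ProgressState
}

// canQuiesce is a simplified, hypothetical version of the check described
// above: every follower must be fully caught up AND must not be in StateProbe.
func canQuiesce(lastIndex uint64, followers []Follower) (bool, string) {
	for _, f := range followers {
		if f.Match != lastIndex {
			return false, fmt.Sprintf("follower %d is behind", f.ID)
		}
		if f.State == StateProbe {
			return false, fmt.Sprintf("follower %d is in StateProbe", f.ID)
		}
	}
	return true, ""
}

func main() {
	followers := []Follower{
		{ID: 2, Match: 10, State: StateReplicate},
		{ID: 3, Match: 10, State: StateProbe}, // caught up, but still probing
	}
	ok, reason := canQuiesce(10, followers)
	fmt.Println(ok, reason) // prints "false follower 3 is in StateProbe"
}
```

Note that follower 3 passes the caught-up check and is rejected only by the state check, which is the case the commit message is about.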
This picks up etcd-io/raft#52, which addresses a bug discovered in cockroachdb#99191 that could keep a follower in StateProbe even though it was fully caught up.

Since we bumped raft just a few days ago, this pretty much only picks up that PR and nothing else: etcd-io/raft@4967cff...eb88ac5

Epic: none
Release note: None
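To make the stuck-in-StateProbe behaviour concrete, here is a toy model. It assumes the upstream fix amounts to the leader also sending an (empty) append when a heartbeat response arrives from a follower in StateProbe, so that the append response can move the follower back to StateReplicate. The exact upstream change may differ in detail, and every name below is hypothetical rather than the raft library's API.

```go
package main

import "fmt"

// ProgressState is a hypothetical stand-in for the leader's view of a follower.
type ProgressState int

const (
	StateProbe ProgressState = iota
	StateReplicate
)

// Progress is a toy version of the leader's per-follower bookkeeping.
type Progress struct {
	Match uint64 // highest index known to be replicated on the follower
	State ProgressState
}

// shouldSendAppendOnHeartbeatResp models the leader's decision when a
// heartbeat response arrives. Without the (assumed) fix, only lagging
// followers get an append; with it, probing followers get one too.
func shouldSendAppendOnHeartbeatResp(pr Progress, lastIndex uint64, withFix bool) bool {
	if pr.Match < lastIndex {
		return true
	}
	return withFix && pr.State == StateProbe
}

// handleAppendResp models a successful append response: the match index is
// advanced if needed and a probing follower moves back to StateReplicate.
func handleAppendResp(pr *Progress, index uint64) {
	if index > pr.Match {
		pr.Match = index
	}
	if pr.State == StateProbe {
		pr.State = StateReplicate
	}
}

// simulate runs one heartbeat round trip for a fully caught-up follower that
// is stuck in StateProbe and returns its state afterwards.
func simulate(withFix bool) ProgressState {
	const lastIndex = 10
	pr := Progress{Match: lastIndex, State: StateProbe}
	if shouldSendAppendOnHeartbeatResp(pr, lastIndex, withFix) {
		handleAppendResp(&pr, lastIndex)
	}
	return pr.State
}

func main() {
	fmt.Println("without fix, back in StateReplicate:", simulate(false) == StateReplicate)
	fmt.Println("with fix, back in StateReplicate:", simulate(true) == StateReplicate)
}
```

In this model the follower without the fix only leaves StateProbe once a new entry is replicated, which matches the behaviour described in the referenced commit message.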
Member
Author
bors r=pavelkalinnikov
Contributor
Build succeeded
craig bot
pushed a commit
that referenced
this pull request
May 24, 2023
103738: copy: fix nil pointer in COPY telemetry logging r=rafiss a=rafiss

fixes #102494

Release note (bug fix): Fixed a panic that could occur while a COPY statement is logged for telemetry purposes.

103827: kvserver: don't quiesce with follower in StateProbe r=erikgrinaker a=tbg

shouldReplicaQuiesce checks that all followers are fully caught up, but it's still possible to have a follower in StateProbe (because we call `rawNode.ReportUnreachable` when an outgoing message gets dropped). Persistent StateProbe is problematic: we check for it before lease transfers, so despite the follower being fully caught up, we'd refuse the transfer.

Unfortunately, this commit by itself is not enough: even if the range is not quiesced, it needs to replicate a new entry to rectify the situation (i.e. switch the follower back to StateReplicate). This is because at the time of writing, receipt of a heartbeat response from the follower is not enough to move it back to StateReplicate. This was fixed upstream, in #103826.

However, this is still not enough! If the range quiesces successfully, and *then* `ReportUnreachable` is called, we still end up in the same state; this is now tracked in #103828.

I ran into the above issue on #99191, which adds persistent circuit breakers, when stressing `TestStoreMetrics`. That test happens to restart n2 when it's fully caught up. Because the circuit breakers are persistent, the leader will move n2 into StateProbe when it comes back up (we can end up dropping the first heartbeat sent to it, since the breaker hasn't untripped yet). But I believe that this bug is real even without this breaker re-work, just harder to trigger.

Epic: none

Release note (bug fix): fixed a problem that could lead to erroneously refused lease transfers (error message: "refusing to transfer lease to [...] because target may need a Raft snapshot: replica in StateProbe")

Co-authored-by: Rafi Shamim
Co-authored-by: Tobias Grieger
blathers-crl bot
pushed a commit
that referenced
this pull request
May 24, 2023
tbg
added a commit
that referenced
this pull request
May 25, 2023