Move from StateProbe->StateReplicate on heartbeat response when possible #52
Force-pushed from 4aad73b to 5460636.
raft.go
Is it fine to allow the StateSnapshot here? Maybe test that too?
I made this check for StateProbe explicitly, and added this comment:
// If the follower is fully caught up but also in StateProbe (as can happen
// if ReportUnreachable was called), we also want to send an append (it will
// be empty) to allow the follower to transition back to StateReplicate once
// it responds.
//
// Note that StateSnapshot typically satisfies pr.Match < lastIndex, but
// `pr.Paused()` is always true for StateSnapshot, so sendAppend is a
// no-op.
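For anyone following along, here is a minimal sketch of the check that comment describes. It is a toy model, not the code in raft.go: the `progress` type, its `paused` method, and the helpers `wantsAppend` and `appendSent` are hypothetical names, and `paused` is simplified to "true only for StateSnapshot", which ignores any other flow control the real implementation applies.

```go
package main

import "fmt"

// StateType names the three follower states mentioned in the discussion.
type StateType int

const (
	StateProbe StateType = iota
	StateReplicate
	StateSnapshot
)

// progress is a hypothetical, simplified view of what the leader tracks
// for a single follower.
type progress struct {
	State StateType
	Match uint64
}

// paused is an assumption made for this sketch: only StateSnapshot is
// treated as paused, matching the "always true for StateSnapshot" remark.
func (p progress) paused() bool { return p.State == StateSnapshot }

// wantsAppend is the condition evaluated when a heartbeat response arrives:
// the follower is behind, or it is caught up but still in StateProbe.
func wantsAppend(p progress, lastIndex uint64) bool {
	return p.Match < lastIndex || p.State == StateProbe
}

// appendSent reports whether an append would actually go out, given that
// sending is a no-op while the follower is paused.
func appendSent(p progress, lastIndex uint64) bool {
	return wantsAppend(p, lastIndex) && !p.paused()
}

func main() {
	const lastIndex = 10
	cases := []struct {
		name string
		p    progress
	}{
		{"caught up, StateReplicate", progress{StateReplicate, 10}},
		{"caught up, StateProbe", progress{StateProbe, 10}},
		{"behind, StateReplicate", progress{StateReplicate, 7}},
		{"behind, StateSnapshot", progress{StateSnapshot, 7}},
	}
	for _, c := range cases {
		fmt.Printf("%-26s append sent: %v\n", c.name, appendSent(c.p, lastIndex))
	}
}
```

In this model the only case whose outcome changes is the caught-up follower in StateProbe. The StateSnapshot case passes the first check but is stopped by the pause, which is the point made in the second half of the comment.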
Force-pushed from 2c96d75 to 7d1ecfd.
tbg left a comment:
Updated, PTAL. There is a new commit touching up TestLeaderAppResp (in preparation for changing the outcome of one test case in the main commit).
The main commit has been changed to only change behavior for StateProbe (not StateSnapshot) with commentary in both cases on why StateSnapshot is not affected.
ahrtr left a comment:
LGTM
Force-pushed from 183b700 to c4e2300.
This makes `go test -rewrite ./...` stable. Signed-off-by: Tobias Grieger <[email protected]>
It will be touched in this PR. Signed-off-by: Tobias Grieger <[email protected]>
After a call to `ReportUnreachable`, a fully caught up follower would end up in StateProbe and not leave it despite responding to heartbeats. This is a bug which is going to be fixed in a follow-up commit. Signed-off-by: Tobias Grieger <[email protected]>
See individual commits. Essentially, when a fully caught-up follower was reported unreachable, it'd transition to `StateProbe` but then wouldn't recover from that via heartbeats (once they resumed). This caused some issues in CRDB because we rely on the reported status to reason about the safety of leadership changes, etc. This PR makes it such that StateProbe resolves on its own: when the leader hears back from the follower via a heartbeat, it sends an empty MsgApp, and when the follower responds to that, the leader moves it back into StateReplicate. Signed-off-by: Tobias Grieger <[email protected]>
Force-pushed from c4e2300 to 49bac48.
TFTRs!
This picks up etcd-io/raft#52, which addresses a bug discovered in cockroachdb#99191 that could keep a follower in StateProbe even though it was fully caught up. Since we bumped raft just a few days ago, this pretty much only picks up that PR and nothing else: etcd-io/raft@4967cff...eb88ac5 Epic: none Release note: None
103826: go.mod: bump raft to pick up etcd-io/raft#52 r=pavelkalinnikov a=tbg This picks up etcd-io/raft#52, which addresses a bug discovered in #99191 that could keep a follower in StateProbe even though it was fully caught up. Since we bumped raft just a few days ago, this pretty much only picks up that PR and nothing else: etcd-io/raft@4967cff...eb88ac5 Epic: none Release note: None Co-authored-by: Tobias Grieger <[email protected]>
See individual commits. Essentially, when a fully caught-up follower was
reported unreachable, it'd transition to `StateProbe` but then wouldn't
recover from that via heartbeats (once they resumed).
This caused some issues in CRDB because we rely on the reported status
to reason about the safety of leadership changes, etc.
This PR makes it such that StateProbe resolves on its own: when the
leader hears back from the follower via a heartbeat, it sends an
empty MsgApp, and when the follower responds to that, the leader moves
it back into StateReplicate.
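The recovery path described above can be walked through with a small toy model. This is only a sketch under stated assumptions: `leader`, `follower`, and the methods `reportUnreachable`, `onHeartbeatResp`, and `onAppendResp` are hypothetical names invented for illustration, and message delivery is reduced to direct method calls rather than anything resembling the library's actual message handling.

```go
package main

import "fmt"

// State is the leader's view of a follower; only the two states involved
// in this recovery path are modeled.
type State string

const (
	StateProbe     State = "StateProbe"
	StateReplicate State = "StateReplicate"
)

// follower is a hypothetical stand-in for the leader's per-follower record.
type follower struct {
	state State
	match uint64 // highest log index known to be replicated on the follower
}

// leader is a toy leader tracking a single follower.
type leader struct {
	lastIndex uint64
	f         follower
	sent      []string // messages the leader has queued, by name
}

// reportUnreachable models the application telling the leader that the
// follower could not be reached: a replicating follower drops to probing.
func (l *leader) reportUnreachable() {
	if l.f.state == StateReplicate {
		l.f.state = StateProbe
	}
}

// onHeartbeatResp queues an append if the follower is behind, or if it is
// caught up but still probing (in which case the append carries no entries).
func (l *leader) onHeartbeatResp() {
	if l.f.match < l.lastIndex || l.f.state == StateProbe {
		l.sent = append(l.sent, "MsgApp")
	}
}

// onAppendResp handles a successful append response at the given index.
// A probing follower returns to replicating even when the response only
// confirms the index the leader already knew about.
func (l *leader) onAppendResp(index uint64) {
	updated := false
	if index > l.f.match {
		l.f.match = index
		updated = true
	}
	if (updated || index == l.f.match) && l.f.state == StateProbe {
		l.f.state = StateReplicate
	}
}

func main() {
	l := &leader{lastIndex: 10, f: follower{state: StateReplicate, match: 10}}

	l.reportUnreachable()
	fmt.Println("after ReportUnreachable:", l.f.state)

	l.onHeartbeatResp()
	fmt.Println("appends queued after heartbeat response:", len(l.sent))

	l.onAppendResp(10)
	fmt.Println("after append response:", l.f.state)
}
```

Without the `StateProbe` clause in `onHeartbeatResp`, the caught-up follower in this model would never receive an append, so no append response would arrive and the follower would stay in StateProbe, which is the stuck state described in the PR.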