Skip to content

Conversation

@erikgrinaker
Copy link

@erikgrinaker erikgrinaker commented Jul 7, 2023

Backports etcd-io/raft#52 on behalf of @tbg.

Move from StatePause->StateReplicate on heartbeat response when possible
See individual commits. Essentially, when a fully caught-up follower was
reported unreachable, it'd transition to StateProbe but then wouldn't
recover from that via heartbeats (once they resumed).

This caused some issues in CRDB because we rely on the reported status
to reason about the safety of leadership changes, etc.

This PR makes it such that StateProbe resolves on its own: when the
leader hears back from the follower via a heartbeat, it sends an
empty MsgApp, and as response to this moves the follower back into
StateProbe.

Touches cockroachdb/cockroach#104304.
Touches cockroachdb/cockroach#101624.

Signed-off-by: Tobias Grieger [email protected]

erikgrinaker and others added 4 commits July 7, 2023 09:39
Result of `go mod tidy`.
It will be touched in this PR.

Signed-off-by: Tobias Grieger <[email protected]>
After a call to `ReportUnreachable`, a fully caught up follower would end up in
StateReplicate and not leave it despite responding to heartbeats. This is a bug
which is going to be fixed in a follow-up commit.

Signed-off-by: Tobias Grieger <[email protected]>
See individual commits. Essentially, when a fully caught-up follower was
reported unreachable, it'd transition to `StateProbe` but then wouldn't
recover from that via heartbeats (once they resumed).

This caused some issues in CRDB because we rely on the reported status
to reason about the safety of leadership changes, etc.

This PR makes it such that StateProbe resolves on its own: when the
leader hears back from the follower via a heartbeat, it sends an
empty MsgApp, and as response to this moves the follower back into
StateProbe.

Signed-off-by: Tobias Grieger <[email protected]>
@erikgrinaker erikgrinaker requested a review from tbg July 7, 2023 09:53
@erikgrinaker erikgrinaker self-assigned this Jul 7, 2023
@erikgrinaker erikgrinaker changed the title Move from StatePause->StateReplicate on heartbeat response when possible crdb-release-22.2: Move from StatePause->StateReplicate on heartbeat response when possible Jul 7, 2023
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any trouble merging this in? Have not reviewed in depth.

@erikgrinaker
Copy link
Author

Just annoying since I couldn't get git am to apply the patch properly from the raft repo. Basically had to apply it manually. But otherwise good, tests are passing.

Wonder if there could be any unexpected interactions due to other patches that we've skipped, but this seems reasonable on its own.

@erikgrinaker erikgrinaker merged commit f61509e into crdb-release-22.2 Jul 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants