Skip to content

Conversation

@tbg
Copy link
Contributor

@tbg tbg commented May 17, 2023

See individual commits. Essentially, when a fully caught-up follower was
reported unreachable, it'd transition to StateProbe but then wouldn't
recover from that via heartbeats (once they resumed).

This caused some issues in CRDB because we rely on the reported status
to reason about the safety of leadership changes, etc.

This PR makes it such that StateProbe resolves on its own: when the
leader hears back from the follower via a heartbeat, it sends an
empty MsgApp, and as response to this moves the follower back into
StateProbe.

@tbg tbg requested a review from pav-kv May 17, 2023 15:48
@tbg tbg force-pushed the probe-state-heartbeat branch from 4aad73b to 5460636 Compare May 17, 2023 15:50
raft.go Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fine to allow the StateSnapshot here? Maybe test that too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this check for StateProbe explicitly, and added this comment:

	// If the follower is fully caught up but also in StateProbe (as can happen
	// if ReportUnreachable was called), we also want to send an append (it will
	// be empty) to allow the follower to transition back to StateReplicate once
	// it responds.
	//
	// Note that StateSnapshot typically satisfies pr.Match < lastIndex, but
	// `pr.Paused()` is always true for StateSnapshot, so sendAppend is a
	// no-op.

@tbg tbg force-pushed the probe-state-heartbeat branch 2 times, most recently from 2c96d75 to 7d1ecfd Compare May 19, 2023 22:06
Copy link
Contributor Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, PTAL. There is a new commit touching up TestLeaderAppResp (in preparation for changing the outcome of one test case in the main commit).

The main commit has been changed to only change behavior for StateProbe (not StateSnapshot) with commentary in both cases on why StateSnapshot is not affected.

raft.go Outdated
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this check for StateProbe explicitly, and added this comment:

	// If the follower is fully caught up but also in StateProbe (as can happen
	// if ReportUnreachable was called), we also want to send an append (it will
	// be empty) to allow the follower to transition back to StateReplicate once
	// it responds.
	//
	// Note that StateSnapshot typically satisfies pr.Match < lastIndex, but
	// `pr.Paused()` is always true for StateSnapshot, so sendAppend is a
	// no-op.

@tbg tbg marked this pull request as ready for review May 19, 2023 22:09
@tbg tbg requested a review from pav-kv May 22, 2023 10:03
@ahrtr ahrtr self-requested a review May 24, 2023 07:26
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tbg tbg force-pushed the probe-state-heartbeat branch from 183b700 to c4e2300 Compare May 24, 2023 08:48
tbg added 4 commits May 24, 2023 10:48
This makes `go test -rewrite ./...` stable.

Signed-off-by: Tobias Grieger <[email protected]>
It will be touched in this PR.

Signed-off-by: Tobias Grieger <[email protected]>
After a call to `ReportUnreachable`, a fully caught up follower would end up in
StateReplicate and not leave it despite responding to heartbeats. This is a bug
which is going to be fixed in a follow-up commit.

Signed-off-by: Tobias Grieger <[email protected]>
See individual commits. Essentially, when a fully caught-up follower was
reported unreachable, it'd transition to `StateProbe` but then wouldn't
recover from that via heartbeats (once they resumed).

This caused some issues in CRDB because we rely on the reported status
to reason about the safety of leadership changes, etc.

This PR makes it such that StateProbe resolves on its own: when the
leader hears back from the follower via a heartbeat, it sends an
empty MsgApp, and as response to this moves the follower back into
StateProbe.

Signed-off-by: Tobias Grieger <[email protected]>
@tbg tbg force-pushed the probe-state-heartbeat branch from c4e2300 to 49bac48 Compare May 24, 2023 08:48
@tbg
Copy link
Contributor Author

tbg commented May 24, 2023

TFTRs!

@tbg tbg merged commit eb88ac5 into etcd-io:main May 24, 2023
@tbg tbg deleted the probe-state-heartbeat branch May 24, 2023 08:51
tbg added a commit to tbg/cockroach that referenced this pull request May 24, 2023
This picks up etcd-io/raft#52, which addresses a bug
discovered in cockroachdb#99191 that could
keep a follower in StateProbe even though it was fully caught up.

Since we bumped raft just a few days ago, this pretty much only picks up
that PR and nothing else:

etcd-io/raft@4967cff...eb88ac5

Epic: none
Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request May 24, 2023
103826: go.mod: bump raft to pick up etcd-io/raft#52 r=pavelkalinnikov a=tbg

This picks up etcd-io/raft#52, which addresses a bug
discovered in #99191 that could
keep a follower in StateProbe even though it was fully caught up.

Since we bumped raft just a few days ago, this pretty much only picks up
that PR and nothing else:

etcd-io/raft@4967cff...eb88ac5

Epic: none
Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants