Skip to content

Conversation

@tbg
Copy link
Member

@tbg tbg commented May 24, 2023

This picks up etcd-io/raft#52, which addresses a bug
discovered in #99191 that could
keep a follower in StateProbe even though it was fully caught up.

Since we bumped raft just a few days ago, this pretty much only picks up
that PR and nothing else:

etcd-io/raft@4967cff...eb88ac5

Epic: none
Release note: None

@tbg tbg requested a review from a team as a code owner May 24, 2023 10:14
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg requested a review from pav-kv May 24, 2023 10:14
Copy link
Collaborator

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go mod tidy?

tbg added a commit to tbg/cockroach that referenced this pull request May 24, 2023
shouldReplicaQuiesce checks that all followers are fully caught up, but it's
still possible to have a follower in StateProbe (because we call
`rawNode.ReportUnreachable` when an outgoing message gets dropped).

Persistent StateProbe is problematic: we check for it before lease transfers, so
despite the follower being fully caught up, we'd refuse the transfer.

Unfortunately, this commit itself is not enough: even if the range is not
quiesced, it needs to replicate a new entry to rectify the situation (i.e.
switch follower back to StateReplicate). This is because at the time of writing,
receipt of a heartbeat response from the follower is not enough to move it back
to StateReplicate.
This was fixed upstream, in cockroachdb#103826.

However, this is still not enough! If the range quiesces successfully, and
*then* `ReportUnreachable` is called, we still end up in the same state.

TODO file issue about this.

I ran into the above issue on
cockroachdb#99191, which adds persistent
circuit breakers, when stressing `TestStoreMetrics`. That test happens to
restart n2 when it's fully caught up and due to the persistence of the circuit
breakers when it comes up the leader will move it into StateProbe (since we can
end up dropping the first heartbeat sent to it as it comes up, since the breaker
hasn't untripped yet).

But, I believe that this bug is real even without this breaker re-work, just
harder to trigger.

Epic: none
Release note (bug fix): fixed a problem that could lead to erroneously refused
lease transfers (error message: "refusing to transfer lease to [...] because
target may need a Raft snapshot: replica in StateProbe"
This picks up etcd-io/raft#52, which addresses a bug
discovered in cockroachdb#99191 that could
keep a follower in StateProbe even though it was fully caught up.

Since we bumped raft just a few days ago, this pretty much only picks up
that PR and nothing else:

etcd-io/raft@4967cff...eb88ac5

Epic: none
Release note: None
@tbg tbg force-pushed the bump-raft-stateprobe branch from 3c50753 to 199e12f Compare May 24, 2023 10:35
@tbg
Copy link
Member Author

tbg commented May 24, 2023

bors r=pavelkalinnikov
TFTR!

@craig
Copy link
Contributor

craig bot commented May 24, 2023

Build succeeded:

@craig craig bot merged commit bc69fec into cockroachdb:master May 24, 2023
craig bot pushed a commit that referenced this pull request May 24, 2023
103738: copy: fix nil pointer in COPY telemetry logging r=rafiss a=rafiss

fixes #102494

Release note (bug fix): Fixed a panic that could occur while a COPY statement is logged for telemetry purposes.

103827: kvserver: don't quiesce with follower in StateProbe r=erikgrinaker a=tbg

shouldReplicaQuiesce checks that all followers are fully caught up, but it's
still possible to have a follower in StateProbe (because we call
`rawNode.ReportUnreachable` when an outgoing message gets dropped).

Persistent StateProbe is problematic: we check for it before lease transfers, so
despite the follower being fully caught up, we'd refuse the transfer.

Unfortunately, this commit itself is not enough: even if the range is not
quiesced, it needs to replicate a new entry to rectify the situation (i.e.
switch follower back to StateReplicate). This is because at the time of writing,
receipt of a heartbeat response from the follower is not enough to move it back
to StateReplicate.
This was fixed upstream, in #103826.

However, this is still not enough! If the range quiesces successfully, and
*then* `ReportUnreachable` is called, we still end up in the same state; this is now tracked in #103828.

I ran into the above issue on
#99191, which adds persistent
circuit breakers, when stressing `TestStoreMetrics`. That test happens to
restart n2 when it's fully caught up and due to the persistence of the circuit
breakers when it comes up the leader will move it into StateProbe (since we can
end up dropping the first heartbeat sent to it as it comes up, since the breaker
hasn't untripped yet).

But, I believe that this bug is real even without this breaker re-work, just
harder to trigger.

Epic: none
Release note (bug fix): fixed a problem that could lead to erroneously refused
lease transfers (error message: "refusing to transfer lease to [...] because
target may need a Raft snapshot: replica in StateProbe"


Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request May 24, 2023
shouldReplicaQuiesce checks that all followers are fully caught up, but it's
still possible to have a follower in StateProbe (because we call
`rawNode.ReportUnreachable` when an outgoing message gets dropped).

Persistent StateProbe is problematic: we check for it before lease transfers, so
despite the follower being fully caught up, we'd refuse the transfer.

Unfortunately, this commit itself is not enough: even if the range is not
quiesced, it needs to replicate a new entry to rectify the situation (i.e.
switch follower back to StateReplicate). This is because at the time of writing,
receipt of a heartbeat response from the follower is not enough to move it back
to StateReplicate.
This was fixed upstream, in #103826.

However, this is still not enough! If the range quiesces successfully, and
*then* `ReportUnreachable` is called, we still end up in the same state.

TODO file issue about this.

I ran into the above issue on
#99191, which adds persistent
circuit breakers, when stressing `TestStoreMetrics`. That test happens to
restart n2 when it's fully caught up and due to the persistence of the circuit
breakers when it comes up the leader will move it into StateProbe (since we can
end up dropping the first heartbeat sent to it as it comes up, since the breaker
hasn't untripped yet).

But, I believe that this bug is real even without this breaker re-work, just
harder to trigger.

Epic: none
Release note (bug fix): fixed a problem that could lead to erroneously refused
lease transfers (error message: "refusing to transfer lease to [...] because
target may need a Raft snapshot: replica in StateProbe"
tbg added a commit that referenced this pull request May 25, 2023
shouldReplicaQuiesce checks that all followers are fully caught up, but it's
still possible to have a follower in StateProbe (because we call
`rawNode.ReportUnreachable` when an outgoing message gets dropped).

Persistent StateProbe is problematic: we check for it before lease transfers, so
despite the follower being fully caught up, we'd refuse the transfer.

Unfortunately, this commit itself is not enough: even if the range is not
quiesced, it needs to replicate a new entry to rectify the situation (i.e.
switch follower back to StateReplicate). This is because at the time of writing,
receipt of a heartbeat response from the follower is not enough to move it back
to StateReplicate.
This was fixed upstream, in #103826.

However, this is still not enough! If the range quiesces successfully, and
*then* `ReportUnreachable` is called, we still end up in the same state.

TODO file issue about this.

I ran into the above issue on
#99191, which adds persistent
circuit breakers, when stressing `TestStoreMetrics`. That test happens to
restart n2 when it's fully caught up and due to the persistence of the circuit
breakers when it comes up the leader will move it into StateProbe (since we can
end up dropping the first heartbeat sent to it as it comes up, since the breaker
hasn't untripped yet).

But, I believe that this bug is real even without this breaker re-work, just
harder to trigger.

Epic: none
Release note (bug fix): fixed a problem that could lead to erroneously refused
lease transfers (error message: "refusing to transfer lease to [...] because
target may need a Raft snapshot: replica in StateProbe"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants