@sciascid
Contributor

The Raft implementation was tightly coupled to the server's internal client and send queue for RPC communication. This made it difficult to test scenarios such as network partitions in a deterministic manner.
The primary benefit of this change is improved testability. A new mockTransport is introduced for testing; it allows simulating network partitions and injecting behavior after a message is sent.
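
To illustrate the kind of test double described above, here is a minimal sketch of an in-memory transport that can drop traffic between partitioned nodes and run a hook after each send. It is not the PR's code: the `transport` interface, the `message` type, `mockNetwork`, and every method name are assumptions made up for this example.

```go
package main

import "fmt"

// message is a hypothetical RPC envelope (not the type used in the PR).
type message struct {
	from, to string
	payload  string
}

// transport is an assumed minimal interface a Raft node could send through.
type transport interface {
	send(m message) bool
}

// mockNetwork delivers messages in memory, drops traffic on partitioned
// links, and calls afterSend (if set) once each send has been handled.
type mockNetwork struct {
	inbox     map[string][]message
	blocked   map[[2]string]bool
	afterSend func(m message, delivered bool)
}

// Compile-time check that mockNetwork satisfies the assumed interface.
var _ transport = (*mockNetwork)(nil)

func newMockNetwork() *mockNetwork {
	return &mockNetwork{
		inbox:   make(map[string][]message),
		blocked: make(map[[2]string]bool),
	}
}

// linkKey orders the two endpoints so a partition applies in both directions.
func linkKey(a, b string) [2]string {
	if a > b {
		a, b = b, a
	}
	return [2]string{a, b}
}

func (n *mockNetwork) partition(a, b string) { n.blocked[linkKey(a, b)] = true }
func (n *mockNetwork) heal(a, b string)      { delete(n.blocked, linkKey(a, b)) }

// send queues the message for the receiver unless the link is partitioned,
// then runs the injected hook. It reports whether the message was delivered.
func (n *mockNetwork) send(m message) bool {
	delivered := !n.blocked[linkKey(m.from, m.to)]
	if delivered {
		n.inbox[m.to] = append(n.inbox[m.to], m)
	}
	if n.afterSend != nil {
		n.afterSend(m, delivered)
	}
	return delivered
}

func main() {
	nw := newMockNetwork()
	nw.afterSend = func(m message, delivered bool) {
		fmt.Printf("%s -> %s delivered=%v\n", m.from, m.to, delivered)
	}
	nw.partition("leader", "f1")
	nw.send(message{from: "leader", to: "f1", payload: "append"}) // dropped
	nw.send(message{from: "leader", to: "f2", payload: "append"}) // delivered
	nw.heal("leader", "f1")
	nw.send(message{from: "leader", to: "f1", payload: "append"}) // delivered
}
```

Because delivery is synchronous and held in memory, a test can decide exactly which messages are lost and when, which is what makes partition scenarios reproducible.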

Signed-off-by: Daniele Sciascia [email protected]

@sciascid sciascid requested a review from a team as a code owner December 19, 2025 17:03
@sciascid sciascid force-pushed the raft-transport-refactoring branch from 7776f56 to 2d9baba Compare December 19, 2025 17:09
@sciascid sciascid marked this pull request as draft December 19, 2025 17:32
@sciascid sciascid force-pushed the raft-transport-refactoring branch 3 times, most recently from de169b6 to 7ce35fc Compare December 22, 2025 09:56
@sciascid sciascid force-pushed the raft-transport-refactoring branch from 7ce35fc to 1f309fd Compare December 23, 2025 16:29
…ions

This commit fixes the following bugs:

- Inconsistent Cluster Size: When a leader was partitioned from the
  cluster immediately after proposing an EntryAddPeer, the remaining
  nodes could end up with different views of the cluster size and
  quorum. A follower's cluster size might then not match the number
  of peers in its peer set. A subsequent leader election, electing
  one of the followers, could break the quorum system.

- Incorrect Leader Election: It was possible for a new leader to be
  elected without a proper quorum. This could happen if a partition
  occurred after a new peer was proposed but before that change was
  committed. A follower could add the uncommitted peer to its peer
  set but would not update its cluster size and quorum, leading to
  an invalid election.

Both issues are solved by making sure that when a peer is added to or
removed from the membership, the cluster size and quorum are adjusted
at the same time. Previously, followers added a peer upon receiving
the EntryAddPeer but adjusted the cluster size only after commit.
This patch changes this behavior so that the cluster size and quorum
are recomputed upon receiving the EntryAddPeer / EntryRemovePeer
proposals. This is in line with the membership protocol proposed in
Ongaro's dissertation, section 4.1.

This patch also removes the concept of a "known" peer from the Raft
layer. Previously, a node added a peer to its peer set when first
receiving the corresponding appendEntry, and marked it as "known" on
commit. This distinction no longer applies.
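
The invariant described above (peer set, cluster size, and quorum always change together) can be sketched as follows. This is a simplified illustration, not the code from this patch: the `membership` type, its fields, and its methods are hypothetical, and the sketch assumes the peer set includes the local node and that quorum is a simple majority.

```go
package main

import "fmt"

// membership is a hypothetical stand-in for a node's view of the cluster.
// Assumption: peers includes the local node itself.
type membership struct {
	peers  map[string]struct{}
	size   int // cluster size
	quorum int // votes needed for a majority
}

func newMembership(ids ...string) *membership {
	m := &membership{peers: make(map[string]struct{})}
	for _, id := range ids {
		m.peers[id] = struct{}{}
	}
	m.recompute()
	return m
}

// recompute derives size and quorum from the peer set, so the three values
// can never drift apart.
func (m *membership) recompute() {
	m.size = len(m.peers)
	m.quorum = m.size/2 + 1
}

// addPeer is what a node would do on receiving an add-peer proposal:
// update the peer set and recompute size and quorum in the same step,
// without waiting for the entry to commit.
func (m *membership) addPeer(id string) {
	m.peers[id] = struct{}{}
	m.recompute()
}

// removePeer mirrors addPeer for a remove-peer proposal.
func (m *membership) removePeer(id string) {
	delete(m.peers, id)
	m.recompute()
}

// hasQuorum reports whether the given number of votes is a majority.
func (m *membership) hasQuorum(votes int) bool {
	return votes >= m.quorum
}

func main() {
	m := newMembership("a", "b", "c")
	fmt.Println(m.size, m.quorum) // 3 2

	// A fourth peer is proposed; size and quorum change immediately.
	m.addPeer("d")
	fmt.Println(m.size, m.quorum) // 4 3

	// Two votes were enough with three nodes but are not with four.
	fmt.Println(m.hasQuorum(2)) // false
}
```

With the old two-step behavior, the follower in this example would hold four peers while still computing quorum for three, which is the mismatch both bug descriptions refer to.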

Signed-off-by: Daniele Sciascia <[email protected]>
@sciascid sciascid force-pushed the raft-transport-refactoring branch from 1f309fd to 5650240 Compare December 23, 2025 16:35