This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Conversation

@aeyakovenko
Member

@aeyakovenko aeyakovenko commented Jan 26, 2019

Problem

Need a way for the cluster to receive and send all the votes in the network regardless of leader failures.

Summary of Changes

Push votes into the crds, which will propagate them around the network. Leaders should call ClusterInfo::get_votes periodically (e.g., on every tick) and process the vector like a batch of transactions. The bank will naturally drop any old, invalid, or duplicate votes. The subsequent working leader in the round robin will pick up any pending votes as soon as possible.
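For illustration, the leader-side loop this enables looks roughly like the sketch below; the closure parameters stand in for ClusterInfo::get_votes and the bank's batch-processing call, whose exact signatures are assumed here.

```rust
// Sketch of the per-tick polling loop described above. `get_votes` and
// `process_batch` stand in for ClusterInfo::get_votes and the bank's
// transaction-batch entry point; the real signatures may differ.
fn poll_gossip_votes<T>(
    get_votes: impl Fn(u64) -> (Vec<T>, u64), // returns votes newer than the cursor, plus a new cursor
    process_batch: impl Fn(&[T]),             // hands the votes to the bank like any transaction batch
    cursor: &mut u64,
) {
    let (votes, new_cursor) = get_votes(*cursor);
    *cursor = new_cursor;
    if !votes.is_empty() {
        // The bank naturally drops old, invalid, or duplicate votes, so the
        // whole vector can be submitted without pre-filtering.
        process_batch(&votes);
    }
}
```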

Fixes #2545

Added a book chapter on the design. This PR implements the sub-1k node rollout plan described in the book.

@sagar-solana
Contributor

Oh, this is cool. Why aren't we sticking with sending votes back to the leader directly? (a.k.a. what leader failures? I might be out of the loop here.)

@aeyakovenko
Member Author

aeyakovenko commented Jan 26, 2019

@sagar-solana The next leader might not be aware they are the next leader because their PoH is slow. Someone might have tripped over the power cord, etc... We want the subsequent working leader in the round robin to pick up any pending votes as soon as possible.

@codecov

codecov bot commented Jan 26, 2019

Codecov Report

Merging #2562 into master will decrease coverage by 9.1%.
The diff coverage is 84%.

@@           Coverage Diff            @@
##           master   #2562     +/-   ##
========================================
- Coverage    77.5%   68.4%   -9.2%     
========================================
  Files         111     110      -1     
  Lines       18646   20743   +2097     
========================================
- Hits        14466   14192    -274     
- Misses       4180    6551   +2371

@sagar-solana
Contributor

sagar-solana commented Jan 26, 2019

@aeyakovenko, why was height removed from the vote?
We'll switch to this mechanism in a separate PR?

@aeyakovenko
Member Author

aeyakovenko commented Jan 26, 2019

@sagar-solana the last commit does the integration. Want to take it over? Those huge replay stage tests should be junked; they are too big to be useful.

@aeyakovenko
Member Author

@aeyakovenko, why was height removed from the vote?

It wasn't used; wallclock is used to check the last vote. The height was there before we had finalized the voting/fork/block generation. With a wallclock the network can achieve the same result, since a validator would always use a later time for later heights.

We'll switch to this mechanism in a separate PR?

Feel free to take this over :). Those replay tests are just too big. I think the brains of the stage could be tested in isolation without IO, and the rest of the code should be service boilerplate that we can hopefully factor out into something that is tested in isolation as well.

@garious
Contributor

garious commented Jan 26, 2019

This looks like a good approach. I especially like all the code it deletes. It'd be great to figure out the performance implications before any next steps. Also curious if there are any correctness concerns, like losing previous votes because the leader was down or didn't poll fast enough.

@aeyakovenko
Member Author

@garious, the votes propagate at a fanout of 6 every 100ms. We can tweak this by prioritizing the leaders or increasing the push fanout, depending on the network size.

But the simplest thing would be to do both: send the vote to the next TPU and push it to gossip. Gossip has a really high guarantee of eventual delivery, and the TPU send would be best effort. It shouldn’t be a big deal to skip a few votes, but we can also cache and clear deleted votes in the crds.
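Roughly this, as a sketch (the gossip-push callback and the leader-address lookup here are placeholders, not real APIs):

```rust
use std::net::{SocketAddr, UdpSocket};

// Sketch of the "do both" path: best-effort direct send to the next leader's
// TPU plus a gossip push for eventual delivery. The callback and the leader
// address lookup are illustrative placeholders.
fn submit_vote(
    socket: &UdpSocket,
    vote_bytes: &[u8],
    next_leader_tpu: Option<SocketAddr>,
    push_vote_to_gossip: impl Fn(&[u8]),
) {
    // Best effort: fire the vote at the next leader's TPU if its address is known.
    if let Some(addr) = next_leader_tpu {
        let _ = socket.send_to(vote_bytes, addr);
    }
    // Reliable path: hand the vote to gossip, which keeps propagating it until
    // some working leader in the round robin encodes it into a block.
    push_vote_to_gossip(vote_bytes);
}
```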

@sagar-solana
Contributor

@aeyakovenko, yeah I can take over integrating this.

@garious
Contributor

garious commented Jan 28, 2019

@aeyakovenko, stepping back, after max lockout, the validator needs to get every vote on the ledger to receive a mining reward. Each missing vote is a missed mining reward. The current approach would have two parts:

  1. Validator retries sending the vote to the leader until it observes it in its own ReplayStage.
  2. Not implemented: Push the vote to gossip to help guard against censorship.

This approach looks to be:

  1. Validator pushes to gossip and relies on the eventual consistency guarantees of gossip and CRDTs to know its latest vote will eventually be observed by some leader.
  2. If the validator pushes two votes (one at an early height, one at a later height) and only the second vote is observed, the only way to get the first vote on the ledger would be to customize Crds to intercept the "old" information that it would otherwise discard.

Seems like an unnecessary abuse of a CRDT to intercept those packets before choosing to discard "old" information. Instead, something more similar to the current path seems more natural. In terms of elegance, the only mistake of the current approach was not to leverage the local TPU forwarder (duplicate code).
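For context, the last-write-wins behavior in question is roughly the following (a simplified model keyed by validator pubkey; these are not the actual Crds types):

```rust
use std::collections::HashMap;

// Simplified model of the last-write-wins rule being discussed; not the real Crds API.
struct VoteValue {
    wallclock: u64, // sender-supplied timestamp used for ordering
    data: Vec<u8>,  // serialized vote transaction
}

#[derive(Default)]
struct VoteTable {
    latest: HashMap<[u8; 32], VoteValue>, // one slot per validator pubkey
}

impl VoteTable {
    // Returns false when the incoming value is older than what is stored;
    // that is the "old" information a plain last-write-wins CRDT discards,
    // which is why an earlier vote would need special handling to survive.
    fn upsert(&mut self, pubkey: [u8; 32], value: VoteValue) -> bool {
        match self.latest.get(&pubkey) {
            Some(existing) if existing.wallclock >= value.wallclock => false,
            _ => {
                self.latest.insert(pubkey, value);
                true
            }
        }
    }
}
```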

@aeyakovenko
Member Author

aeyakovenko commented Jan 28, 2019

@garious

It’s easy to add a queue type to the crds that tracks the last N CrdsValues.

How does a validator know if the TPU received the message? Or that the next TPU is actively waiting for messages? The validator-to-leader state transition is flaky.

The votes are per block. The propagation time to 20k nodes with a fanout of 6 is 6 hops, so worst case 600ms.
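As a back-of-the-envelope check of that number (a sketch, not production code):

```rust
// Each push round reaches roughly `fanout` times more nodes, so full
// coverage takes about ceil(log_fanout(nodes)) hops.
fn gossip_worst_case_ms(nodes: f64, fanout: f64, push_interval_ms: u64) -> u64 {
    let hops = nodes.log(fanout).ceil() as u64;
    hops * push_interval_ms
}

fn main() {
    // 20_000 nodes, fanout of 6, one push every 100ms => 6 hops, ~600ms worst case.
    println!("{} ms", gossip_worst_case_ms(20_000.0, 6.0, 100));
}
```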

It’s designed for this use case. Because the next leader can be malicious, there is no way for the validator to ensure the message arrives at some node that will encode it into a block. So gossip is the only practical option for eventual delivery.

  1. Validator retries sending the vote to the leader until it observes it in its own ReplayStage.

How long is the timeout? What do validators do with messages they received before they transitioned? I think you are going to end up designing another gossip network :)

@garious
Contributor

garious commented Jan 28, 2019

It’s easy to add a queue type to the crds that tracks the last N CrdsValues.

This is my big concern. To hold the trajectory of this PR, that's exactly what you'll have to do, and that means that Crds isn't solving the problem of being a CRDT. It's a CRDT for everything else, but now quintupling in RAM size (for queue depth of just 5) to support this one usage.

How does a validator know if the TPU received the message? Or that the next TPU is actively waiting for messages?

The validator would observe the Vote transaction in its ReplayStage.

The validator-to-leader state transition is flaky.

How so?

So gossip is the only practical option for eventual delivery.

Clients, for example, already have a practical option. They use retries. That mechanism, in combination with leader rotation, ensures they'll eventually get their transaction on the ledger (at the time of the first honest leader).

Overall, this architectural change feels distracting, and we can make the codebase simpler with no major architectural changes by simply sending votes to a thread that retries sending them to the current leader until the vote shows up in the ReplayStage.
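Sketched out, that retry thread would look something like this (the channel and the two callbacks are illustrative stand-ins, not actual service interfaces):

```rust
use std::sync::mpsc::Receiver;
use std::thread;
use std::time::Duration;

// Sketch of the retry thread described above; the channel and the two
// callbacks are illustrative stand-ins, not actual service interfaces.
fn spawn_vote_retry_thread(
    votes: Receiver<Vec<u8>>, // serialized vote transactions from the voting path
    send_to_current_leader: impl Fn(&[u8]) + Send + 'static,
    seen_in_replay_stage: impl Fn(&[u8]) -> bool + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut pending: Vec<Vec<u8>> = Vec::new();
        loop {
            // Pick up any newly signed votes.
            while let Ok(vote) = votes.try_recv() {
                pending.push(vote);
            }
            // Stop retrying anything the ReplayStage has already observed on the ledger.
            pending.retain(|v| !seen_in_replay_stage(v));
            // Re-send the rest to whoever the current leader is right now.
            for vote in &pending {
                send_to_current_leader(vote);
            }
            thread::sleep(Duration::from_millis(100));
        }
    })
}
```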

At the very least, I think we need to acknowledge this is a significant architectural change and it should go through an architectural review to debate the implications. You could cite this PR as evidence that it might make the codebase simpler, but if we add the proposed queues to Crds, that's not so obvious.

@aeyakovenko
Member Author

aeyakovenko commented Jan 28, 2019

@sagar-solana @garious

With this approach there is duplicate work between gossip and the ledger broadcast. Once a vote is broadcast, it can be added to the crds, which will shortcut the push message protocol. Push only forwards new data, so adding broadcast votes from the ledger into the data store will terminate the propagation of the message.

For that to work, the CrdsValue for Transaction needs to be a thin wrapper without its own signature. We basically just need to add a "wallclock" to the vote userdata and use the Transaction signature for the first key. @sagar-solana does that make sense?
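Something along these lines (simplified stand-ins, not the actual CrdsValue/Transaction definitions):

```rust
// Simplified stand-ins for the thin-wrapper idea above; not the actual
// CrdsValue or Transaction definitions.
struct Signature([u8; 64]);

struct VoteTransaction {
    signature: Signature, // the transaction's own signature already authenticates the payload
    userdata: Vec<u8>,    // vote userdata, extended to carry a wallclock for ordering
}

// Thin wrapper: no second signature of its own; the wrapped transaction's
// signature doubles as the CRDS key.
enum CrdsValue {
    Vote(VoteTransaction),
    // other variants (ContactInfo, ...) elided for the sketch
}

impl CrdsValue {
    fn key(&self) -> &Signature {
        match self {
            CrdsValue::Vote(tx) => &tx.signature,
        }
    }
}
```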

@aeyakovenko
Member Author

This is my big concern. To hold the trajectory of this PR, that's exactly what you'll have to do, and that means that Crds isn't solving the problem of being a CRDT. It's a CRDT for everything else, but now quintupling in RAM size (for queue depth of just 5) to support this one usage.

Just for the votes. Not all the messages.

The validator would observe the Vote transaction in its ReplayStage.

So it would withhold its next vote until it observes the current one and continue to retransmit? How often should it retransmit?

The validator-to-leader state transition is flaky.
How so?

It's triggered by PoH, and has network propagation delay. The next leader could be behind while it's receiving validation messages. Should it cache them? How big should the cache be?

Clients, for example, already have a practical option. They use retries. That mechanism, in combination with leader rotation, ensures they'll eventually get their transaction on the ledger (at the time of the first honest leader).

This will actually be slower and less reliable, because with the two gossip protocols there is active repair of state happening between all the nodes. So while the client waits for N periods between observing the ledger and voting, the network has a chance to repair the state with the next leader from every node in the network.

I'll add a book section covering these details.

@aeyakovenko
Member Author

@sagar-solana @garious check out the book. This PR implements the Sub 1k network proposal described at the end of the book chapter.

@aeyakovenko
Member Author

@garious any feedback? I would like to not let this bitrot.

@garious
Contributor

garious commented Jan 29, 2019

Yeah, lots. Can you start by 80-char aligning the book content so that all my review comments don't point to the same line?

@sagar-solana
Contributor

@aeyakovenko, sorry I haven't been able to pick this up. Been really hectic here since I booked my flight. I'll be travelling for the next 26 or so hours. I'll get back to this on Thursday :)

@garious
Contributor

garious commented Jan 30, 2019

This work should fit in very nicely with #2594 and #2539. VoteSignerProxy should reduce to almost nothing.

@garious
Contributor

garious commented Jan 30, 2019

@aeyakovenko, can you move the book proposal into a separate PR? Feel free to reference this one from the new one if you think it'll help communicate what needs to be implemented. I'd like to gain consensus on the design and merge it without having to think about the implementation. Say, for example, we need to revert it; we wouldn't want to revert the design as well.

@aeyakovenko
Member Author

design is in #2601

@garious
Contributor

garious commented Jan 31, 2019

@aeyakovenko, can you rebase?

@sagar-solana
Contributor

@aeyakovenko @garious, I'll rebase this and take it over.

@aeyakovenko
Member Author

#2622 took this over
