This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Conversation

@aeyakovenko
Member

@aeyakovenko aeyakovenko commented Jan 26, 2019

Problem

Need a way for the cluster to receive and send all the votes in the network regardless of leader failures.

Summary of Changes

Push votes into the crds, which will propagate them around the network. Leaders should call ClusterInfo::get_votes periodically (e.g., on every tick) and process the vector like a batch of transactions. The bank will naturally drop any old, invalid, or duplicate votes. The subsequent working leader in the round robin will pick up any pending votes as soon as possible.
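For illustration, the leader-side loop this enables looks roughly like the sketch below; the closure parameters stand in for ClusterInfo::get_votes and the bank's batch-processing call, whose exact signatures are assumed here.

```rust
// Sketch of the per-tick polling loop described above. `get_votes` and
// `process_batch` stand in for ClusterInfo::get_votes and the bank's
// transaction-batch entry point; the real signatures may differ.
fn poll_gossip_votes<T>(
    get_votes: impl Fn(u64) -> (Vec<T>, u64), // returns votes newer than the cursor, plus a new cursor
    process_batch: impl Fn(&[T]),             // hands the votes to the bank like any transaction batch
    cursor: &mut u64,
) {
    let (votes, new_cursor) = get_votes(*cursor);
    *cursor = new_cursor;
    if !votes.is_empty() {
        // The bank naturally drops old, invalid, or duplicate votes, so the
        // whole vector can be submitted without pre-filtering.
        process_batch(&votes);
    }
}
```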

Fixes #2545

Added a book chapter on the design. This PR implements the sub-1k node rollout plan described in the book.

@sagar-solana
Contributor

Oh, this is cool. Why aren't we sticking with sending votes back to the leader directly? (a.k.a. what leader failures? I might be out of the loop here.)

@aeyakovenko
Member Author

aeyakovenko commented Jan 26, 2019

@sagar-solana The next leader might not be aware they are the next leader because their PoH is slow. Someone might have tripped over the power cord, etc... We want the subsequent working leader in the round robin to pick up any pending votes as soon as possible.

@codecov

codecov bot commented Jan 26, 2019

Codecov Report

Merging #2562 into master will decrease coverage by 9.1%.
The diff coverage is 84%.

@@           Coverage Diff            @@
##           master   #2562     +/-   ##
========================================
- Coverage    77.5%   68.4%   -9.2%     
========================================
  Files         111     110      -1     
  Lines       18646   20743   +2097     
========================================
- Hits        14466   14192    -274     
- Misses       4180    6551   +2371

@sagar-solana
Contributor

sagar-solana commented Jan 26, 2019

@aeyakovenko, why was height removed from the vote?
We'll switch to this mechanism in a separate PR?

@aeyakovenko
Member Author

aeyakovenko commented Jan 26, 2019

@sagar-solana the last commit does the integration. Want to take it over? Those huge replay stage tests should be junked; they are too big to be useful.

@aeyakovenko
Member Author

@aeyakovenko, why was height removed from the vote?

It wasn't used; wallclock is used to check the last vote. The height was there before we had finalized the voting/fork/block generation. With a wallclock the network can achieve the same result, since a validator would always use a later time for later heights.

We'll switch to this mechanism in a separate PR?

Feel free to take this over :). Those replay tests are just too big. I think the brains of the stage could be tested in isolation without IO, and the rest of the code should be service boilerplate that we can hopefully factor out into something that is tested in isolation as well.

@garious
Contributor

garious commented Jan 26, 2019

This looks like a good approach. I especially like all the code it deletes. It'd be great to figure out the performance implications before any next steps. Also curious if there are any correctness concerns, like losing previous votes because the leader was down or didn't poll fast enough.

@aeyakovenko
Member Author

@garious, the votes propagate at a fanout of 6 every 100ms. We can tweak this by prioritizing the leaders or increasing the push fanout, depending on the network size.

But the simplest thing would be to do both: send the vote to the next TPU and push it to gossip. Gossip has a really high guarantee of eventual delivery, and the TPU send would be best effort. It shouldn’t be a big deal to skip a few votes, but we can also cache and clear deleted votes in the crds.
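Roughly this, as a sketch (the gossip-push callback and the leader-address lookup here are placeholders, not real APIs):

```rust
use std::net::{SocketAddr, UdpSocket};

// Sketch of the "do both" path: best-effort direct send to the next leader's
// TPU plus a gossip push for eventual delivery. The callback and the leader
// address lookup are illustrative placeholders.
fn submit_vote(
    socket: &UdpSocket,
    vote_bytes: &[u8],
    next_leader_tpu: Option<SocketAddr>,
    push_vote_to_gossip: impl Fn(&[u8]),
) {
    // Best effort: fire the vote at the next leader's TPU if its address is known.
    if let Some(addr) = next_leader_tpu {
        let _ = socket.send_to(vote_bytes, addr);
    }
    // Reliable path: hand the vote to gossip, which keeps propagating it until
    // some working leader in the round robin encodes it into a block.
    push_vote_to_gossip(vote_bytes);
}
```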

@sagar-solana
Contributor

@aeyakovenko, yeah I can take over integrating this.

@garious
Contributor

garious commented Jan 28, 2019

@aeyakovenko, stepping back, after max lockout, the validator needs to get every vote on the ledger to receive a mining reward. Each missing vote is a missed mining reward. The current approach would have two parts:

  1. Validator retries sending the vote to the leader until it observes it in its own ReplayStage.
  2. Not implemented: Push the vote to gossip to help guard against censorship.

This approach looks to be:

  1. Validator pushes to gossip and relies on the eventual consistency guarantees of gossip and CRDTs to know its latest vote will eventually be observed by some leader.
  2. If the validator pushes two votes (one at an early height, one at a later height) and only the second vote is observed, the only way to get the first vote on the ledger would be to customize Crds to intercept the "old" information that it would otherwise discard.

Seems like an unnecessary abuse of a CRDT to intercept those packets before choosing to discard "old" information. Instead, something more similar to the current path seems more natural. In terms of elegance, the only mistake of the current approach was not to leverage the local TPU forwarder (duplicate code).
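For context, the last-write-wins behavior in question is roughly the following (a simplified model keyed by validator pubkey; these are not the actual Crds types):

```rust
use std::collections::HashMap;

// Simplified model of the last-write-wins rule being discussed; not the real Crds API.
struct VoteValue {
    wallclock: u64, // sender-supplied timestamp used for ordering
    data: Vec<u8>,  // serialized vote transaction
}

#[derive(Default)]
struct VoteTable {
    latest: HashMap<[u8; 32], VoteValue>, // one slot per validator pubkey
}

impl VoteTable {
    // Returns false when the incoming value is older than what is stored;
    // that is the "old" information a plain last-write-wins CRDT discards,
    // which is why an earlier vote would need special handling to survive.
    fn upsert(&mut self, pubkey: [u8; 32], value: VoteValue) -> bool {
        match self.latest.get(&pubkey) {
            Some(existing) if existing.wallclock >= value.wallclock => false,
            _ => {
                self.latest.insert(pubkey, value);
                true
            }
        }
    }
}
```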

@aeyakovenko
Member Author

aeyakovenko commented Jan 28, 2019

@garious

It’s easy to add a queue type to the crds that tracks the last N CrdsValues.

How does a validator know if the TPU received the message? Or that the next TPU is actively waiting for messages? The validator-to-leader state transition is flaky.

The votes are per block. The propagation time to 20k nodes with a fanout of 6 is 6 hops, so worst case 600ms.
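As a back-of-the-envelope check of that number (a sketch, not production code):

```rust
// Each push round reaches roughly `fanout` times more nodes, so full
// coverage takes about ceil(log_fanout(nodes)) hops.
fn gossip_worst_case_ms(nodes: f64, fanout: f64, push_interval_ms: u64) -> u64 {
    let hops = nodes.log(fanout).ceil() as u64;
    hops * push_interval_ms
}

fn main() {
    // 20_000 nodes, fanout of 6, one push every 100ms => 6 hops, ~600ms worst case.
    println!("{} ms", gossip_worst_case_ms(20_000.0, 6.0, 100));
}
```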

It’s designed for this use case. Because the next leader can be malicious, there is no way for the validator to ensure the message arrives at some node that will encode it into a block. So gossip is the only practical option for eventual delivery.

  1. Validator retries sending the vote to the leader until it observes it in its own ReplayStage.

How long is the timeout? What do validators do with messages they received before they transitioned? I think you are going to end up designing another gossip network :)

@garious
Contributor

garious commented Jan 28, 2019

It’s easy to add a queue type to the crds that tracks the last N CrdsValues.

This is my big concern. To hold the trajectory of this PR, that's exactly what you'll have to do, and that means that Crds isn't solving the problem of being a CRDT. It's a CRDT for everything else, but now quintupling in RAM size (for queue depth of just 5) to support this one usage.

How does a validator know if the TPU received the message? Or that the next TPU is actively waiting for messages?

The validator would observe the Vote transaction in its ReplayStage.

The validator-to-leader state transition is flaky.

How so?

So gossip is the only practical option for eventual delivery.

Clients, for example, already have a practical option. They use retries. That mechanism, in combination with leader rotation, ensures they'll eventually get their transaction on the ledger (at the time of the first honest leader).

Overall, this architectural change feels distracting, and we can make the codebase simpler with no major architectural changes by simply sending votes to a thread that retries sending them to the current leader until the vote shows up in the ReplayStage.
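Sketched out, that retry thread would look something like this (the channel and the two callbacks are illustrative stand-ins, not actual service interfaces):

```rust
use std::sync::mpsc::Receiver;
use std::thread;
use std::time::Duration;

// Sketch of the retry thread described above; the channel and the two
// callbacks are illustrative stand-ins, not actual service interfaces.
fn spawn_vote_retry_thread(
    votes: Receiver<Vec<u8>>, // serialized vote transactions from the voting path
    send_to_current_leader: impl Fn(&[u8]) + Send + 'static,
    seen_in_replay_stage: impl Fn(&[u8]) -> bool + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut pending: Vec<Vec<u8>> = Vec::new();
        loop {
            // Pick up any newly signed votes.
            while let Ok(vote) = votes.try_recv() {
                pending.push(vote);
            }
            // Stop retrying anything the ReplayStage has already observed on the ledger.
            pending.retain(|v| !seen_in_replay_stage(v));
            // Re-send the rest to whoever the current leader is right now.
            for vote in &pending {
                send_to_current_leader(vote);
            }
            thread::sleep(Duration::from_millis(100));
        }
    })
}
```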

At the very least, I think we need to acknowledge this is a significant architectural change and it should go through an architectural review to debate the implications. You could cite this PR as evidence that it might make the codebase simpler, but if we add the proposed queues to Crds, that's not so obvious.

@aeyakovenko
Member Author

aeyakovenko commented Jan 28, 2019

@sagar-solana @garious

With this approach there is duplicate work between gossip and the ledger broadcast. Once a vote is broadcast, it can be added to the crds, which will shortcut the push message protocol. Push only forwards new data, so adding broadcast votes from the ledger into the data store will terminate the propagation of the message.

For that to work, the CrdsValue for Transaction needs to be a thin wrapper without its own signature. We basically just need to add a "wallclock" to the vote userdata and use the Transaction signature for the first key. @sagar-solana does that make sense?
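Something along these lines (simplified stand-ins, not the actual CrdsValue/Transaction definitions):

```rust
// Simplified stand-ins for the thin-wrapper idea above; not the actual
// CrdsValue or Transaction definitions.
struct Signature([u8; 64]);

struct VoteTransaction {
    signature: Signature, // the transaction's own signature already authenticates the payload
    userdata: Vec<u8>,    // vote userdata, extended to carry a wallclock for ordering
}

// Thin wrapper: no second signature of its own; the wrapped transaction's
// signature doubles as the CRDS key.
enum CrdsValue {
    Vote(VoteTransaction),
    // other variants (ContactInfo, ...) elided for the sketch
}

impl CrdsValue {
    fn key(&self) -> &Signature {
        match self {
            CrdsValue::Vote(tx) => &tx.signature,
        }
    }
}
```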

@aeyakovenko
Member Author

This is my big concern. To hold the trajectory of this PR, that's exactly what you'll have to do, and that means that Crds isn't solving the problem of being a CRDT. It's a CRDT for everything else, but now quintupling in RAM size (for queue depth of just 5) to support this one usage.

Just for the votes. Not all the messages.

The validator would observe the Vote transaction in its ReplayStage.

So it would withhold its next vote until it observes the current one and continue to retransmit? How often should it retransmit?

The validator-to-leader state transition is flaky.
How so?

It's triggered by PoH, and has network propagation delay. The next leader could be behind while it's receiving validation messages. Should it cache them? How big should the cache be?

Clients, for example, already have a practical option. They use retries. That mechanism, in combination with leader rotation, ensures they'll eventually get their transaction on the ledger (at the time of the first honest leader).

This will actually be slower and less reliable, because with the two gossip protocols there is active repair of state happening between all the nodes. So while the client waits for N periods between observing the ledger and voting, the network has a chance to repair the state with the next leader from every node in the network.

I'll add a book section covering these details.

@aeyakovenko
Member Author

@sagar-solana @garious check out the book. This PR implements the Sub 1k network proposal described at the end of the book chapter.

@aeyakovenko
Member Author

@garious any feedback? I would like to not let this bitrot.

@garious
Contributor

garious commented Jan 29, 2019

Yeah, lots. Can you start by 80-char aligning the book content so that all my review comments don't point to the same line?

@sagar-solana
Contributor

@aeyakovenko, sorry I haven't been able to pick this up. Been really hectic here since I booked my flight. I'll be travelling for the next 26 or so hours. I'll get back to this on Thursday :)

@garious
Contributor

garious commented Jan 30, 2019

This work should fit in very nicely with #2594 and #2539. VoteSignerProxy should reduce to almost nothing.

@garious
Contributor

garious commented Jan 30, 2019

@aeyakovenko, can you move the book proposal into a separate PR? Feel free to reference this one from the new one if you think it'll help communicate what needs to be implemented. I'd like to gain consensus on the design and merge it without having to think about the implementation. Say, for example, we need to revert it; we wouldn't want to revert the design as well.

@aeyakovenko
Member Author

design is in #2601

@garious
Contributor

garious commented Jan 31, 2019

@aeyakovenko, can you rebase?

@sagar-solana
Contributor

@aeyakovenko @garious, I'll rebase this and take it over.

@aeyakovenko
Member Author

#2622 took this over
