Conversation

@garyschulte (Contributor) commented Jan 9, 2025

PR description

PR adds precompile caching for an MVP set of precompiles that are costly enough to benefit from it. A command-line arg is provided to disable caching (for gas-costing reasons); it is enabled by default in besu, and disabled by default in evmtool and its benchmark subcommand.

Changes:

  • add a static member and setter in AbstractPrecompiledContract used to control whether we want to cache results (see the sketch after this list)
  • add precompile-specific LRU caches with reasonable size limits in each MVP precompile
  • add a cli arg for precompile caching, defaulted to true
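
To make the first two items concrete, here is a minimal sketch of the pattern, assuming illustrative names rather than the exact Besu API: a static toggle on the abstract base class, plus a bounded LRU map factory each precompile can use for its own cache.

import java.util.LinkedHashMap;
import java.util.Map;

abstract class AbstractPrecompiledContractSketch {
  // Static toggle, settable from CLI wiring (hypothetical setter name).
  private static volatile boolean cacheEnabled = true;

  static void setPrecompileCaching(final boolean enabled) {
    cacheEnabled = enabled;
  }

  protected static boolean isCacheEnabled() {
    return cacheEnabled;
  }

  // Bounded, access-ordered LRU map: evicts the eldest entry past maxSize.
  // Note: LinkedHashMap is not thread-safe; a real implementation would
  // need synchronization or a concurrent cache library.
  protected static <K, V> Map<K, V> newLruCache(final int maxSize) {
    return new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(final Map.Entry<K, V> eldest) {
        return size() > maxSize;
      }
    };
  }
}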

MVP precompiles include:

  • altbn128/bn254 precompiles for add, mul and pairing
  • ecrecover precompile
  • blake2 precompile
  • kzg point precompile
  • bls precompiles

Feedback welcome on the design choices:

  • one cache per precompile contract (since each will have different input and output size characteristics)
  • cache entries are <hashCode, (input, result)> tuples so we can verify the input is truly identical rather than matching by hashCode alone (it is trivial to construct requests that have different inputs but the same Bytes hashCode); a sketch follows this list
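
A minimal, self-contained sketch of that second design choice (byte[] stands in for Tuweni Bytes, computePrecompile is a placeholder, and newLruCache comes from the sketch above): the key is the input's hashCode, and the stored input is compared byte-for-byte on a hit, so a hashCode collision can only cause a miss, never a wrong result.

import java.util.Arrays;
import java.util.Map;

final class CachingPrecompileSketch extends AbstractPrecompiledContractSketch {
  private record Entry(byte[] input, byte[] result) {}

  // One bounded cache per precompile instance, sized for its input profile.
  private final Map<Integer, Entry> cache = newLruCache(1_000);

  byte[] compute(final byte[] input) {
    final int key = Arrays.hashCode(input);
    final Entry hit = cache.get(key);
    if (hit != null && Arrays.equals(hit.input(), input)) {
      return hit.result(); // inputs are truly identical, not just equal hashes
    }
    final byte[] result = computePrecompile(input);
    cache.put(key, new Entry(input, result));
    return result;
  }

  // Placeholder for the precompile's real native/Java implementation.
  private byte[] computePrecompile(final byte[] input) {
    return new byte[0];
  }
}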

Parallel transaction execution should benefit from precompile caching when state conflicts are detected. Attached are preliminary results from the nethermind gas-benchmarks suite, which indicate that performance does not take a noticeable hit from cache checks and misses, while the caching itself is effective for repetitive/identical inputs.

Updated:
ecmul_new.pdf
ecrec_new.pdf
blake2f_new.pdf

Fixed Issue(s)

Thanks for sending a pull request! Have you done the following?

  • Checked out our contribution guidelines?
  • Considered documentation and added the doc-change-required label to this PR if updates are required.
  • Considered the changelog and included an update if required.
  • For database changes (e.g. KeyValueSegmentIdentifier) considered compatibility and performed forwards and backwards compatibility tests

Locally, you can run these tests to catch failures early:

  • unit tests: ./gradlew build
  • acceptance tests: ./gradlew acceptanceTest
  • integration tests: ./gradlew integrationTest
  • reference tests: ./gradlew ethereum:referenceTests:referenceTests

@garyschulte garyschulte force-pushed the feature/precompile-caching-part1 branch from 49dd4dc to e9155f3 on January 10, 2025 00:27
@garyschulte garyschulte changed the title Precompile caching part1 Precompile caching MVP Jan 10, 2025
@garyschulte garyschulte changed the title Precompile caching MVP Precompile Caching MVP Jan 10, 2025
@garyschulte garyschulte force-pushed the feature/precompile-caching-part1 branch from 0d95b1d to 67f912f on January 10, 2025 00:34
@ahamlat (Contributor) left a comment

I think it makes sense to have a cache per precompile, as we discussed. Also, you need to change the key to use a hashing function that has no collisions, as the hashCode method returns an int and can have collisions.

@garyschulte garyschulte force-pushed the feature/precompile-caching-part1 branch from d519ece to 73b29e1 on January 23, 2025 00:30
@github-actions (bot)

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the Stale label Feb 22, 2025
@github-actions (bot) commented Mar 9, 2025

This PR was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this Mar 9, 2025
@garyschulte garyschulte reopened this Mar 18, 2025
@garyschulte garyschulte force-pushed the feature/precompile-caching-part1 branch from 5de5681 to 93bb963 on March 18, 2025 14:29
@github-actions github-actions bot removed the Stale label Mar 19, 2025
@ahamlat (Contributor) left a comment

Could you review the way the hashcode is calculated for some precompiles (see below)? It is sometimes calculated twice, where we could store it and reuse it to cache the result. Also, I wonder why you didn't add a cache for KZGPointEvalPrecompiledContract.

@ahamlat (Contributor) commented Mar 19, 2025

Other than the small requested changes above, the PR is great. I think we can decouple it from parallel transaction execution, as it can help even when parallel transaction execution is not enabled.
In terms of performance, in addition to what @garyschulte shared, profiling two nodes running this PR against two nodes running version 25.2.0 showed that the cache works fine, especially for EcRecover, and it reduces precompile execution time, cutting the number of samples over a 300-second sampling period from ~18 to 1 in both cases.
A sample is collected every 11 ms. The profiling was done on the same blocks.

Without this PR: [profiling screenshot]

With this PR: [profiling screenshot]

@ahamlat (Contributor) commented Mar 19, 2025

An interesting idea from this implementation is that you created a cache where the eviction mechanism is based on hashcode collisions. It is like a hashmap where we keep only one node behind each bucket (hashcode index): the new value always replaces the existing one.
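
That single-slot idea can be sketched as a direct-mapped table (illustrative code, not Besu's actual implementation): the hashcode picks the slot, and storing unconditionally overwrites whatever lives there, so eviction falls out of index collisions.

import java.util.Arrays;

final class DirectMappedCacheSketch {
  private record Entry(byte[] input, byte[] result) {}

  private final Entry[] slots;

  DirectMappedCacheSketch(final int size) {
    this.slots = new Entry[size];
  }

  byte[] lookup(final byte[] input) {
    final int idx = Math.floorMod(Arrays.hashCode(input), slots.length);
    final Entry e = slots[idx];
    // Byte-for-byte check so a slot collision reads as a miss, not a hit.
    return (e != null && Arrays.equals(e.input(), input)) ? e.result() : null;
  }

  void store(final byte[] input, final byte[] result) {
    final int idx = Math.floorMod(Arrays.hashCode(input), slots.length);
    slots[idx] = new Entry(input, result); // replaces any prior occupant
  }
}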

@garyschulte (Author) commented

> Could you review the way the hashcode is calculated for some precompiles (see below)? It is sometimes calculated twice, where we could store it and reuse it to cache the result. Also, I wonder why you didn't add a cache for KZGPointEvalPrecompiledContract.

Will do, and will add similar caches for the Pectra BLS precompiles.

@garyschulte garyschulte force-pushed the feature/precompile-caching-part1 branch 2 times, most recently from 08fd69d to 352c9f3 on March 20, 2025 21:14
@garyschulte (Author) commented

OK, I found the issue with the false positives. It appears we are subsequently mutating the input bytes, so the input value in our precompile result tuple was getting mutated after we cached it; that was the source of the false positives.

What I have done for all precompiles is to copy the input into the precompile result tuple IF caching is enabled (sketched below). This seems to be the sweet spot for removing false positives, and there should be little to no impact on precompile performance or overhead when caching is disabled.

Since these changes, I have not seen any false positives:

[benchmark screenshot]
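
A minimal sketch of the guarded-copy fix, reusing the illustrative names from the earlier sketches (isCacheEnabled, computePrecompile, and a cache with the lookup/store shape shown above):

byte[] computeWithCache(final byte[] input) {
  if (!isCacheEnabled()) {
    return computePrecompile(input); // disabled: no copy, no lookup overhead
  }
  final byte[] cached = cache.lookup(input);
  if (cached != null) {
    return cached;
  }
  final byte[] result = computePrecompile(input);
  // Defensive copy: the cached tuple must not alias the caller's buffer.
  // Without it, a later in-place mutation of `input` silently rewrites the
  // stored input we verify against, producing the false positives above.
  cache.store(input.clone(), result);
  return result;
}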

If you would, please re-review at your leisure @ahamlat 🙏

@ahamlat (Contributor) left a comment

LGTM, just a few comments.
I will approve once I have checked the impact of copying the input.

@ahamlat (Contributor) left a comment

Could you update the benchmarks in the description with the results from the last implementation?
The CPU profiling shows an improvement on the nodes running this PR:

Control node 1 / 11 samples for MessageCallProcessor.executePrecompile: [CPU profile screenshot]

Control node 2 / 11 samples for MessageCallProcessor.executePrecompile: [CPU profile screenshot]

Node 1 running this PR / 4 samples for MessageCallProcessor.executePrecompile: [CPU profile screenshot]

Node 2 running this PR / 4 samples for MessageCallProcessor.executePrecompile: [CPU profile screenshot]

@garyschulte (Author) commented

Updated the pdf docs. Input copying doesn't seem to be a big problem in this case. The ecmul point-at-infinity optimization performs better without the cache, interestingly even when we run the optimization check before the cache check. But otherwise the current implementation looks to be better all around.

@garyschulte garyschulte force-pushed the feature/precompile-caching-part1 branch from 352c9f3 to fb18fd6 on March 21, 2025 20:21
@garyschulte (Author) commented Mar 21, 2025

Added BLS precompile caching as well. The gas-benchmarks suite doesn't have BLS precompile tests, but evmtool gives some pretty dramatic results:

➜  besu git:(feature/precompile-caching-part1) ✗ build/install/besu/bin/evmtool benchmark --native --use-precompile-cache bls12      
besu/v25.3-develop-f4df214/osx-aarch_64/corretto-java-22
Benchmarks for Bls12
Bls12 G1 Add    375 avg gas @    1.1 µs /   332.3 MGps
Bls12 G1 MSM   501,672 total gas @   13.2 µs /38,046.9 MGps
Bls12 MapFpToG1  5,500 avg gas @    0.5 µs /10,890.2 MGps
Bls12 G2 Add    600 avg gas @    0.9 µs /   697.4 MGps
Bls12 G2 MSM   991,080 total gas @   16.6 µs /59,835.1 MGps
Bls12 MapFp2G1 23,800 avg gas @    0.1 µs /184,270.0 MGps
Bls12 Pairing 2,209,700 total gas @   20.7 µs /106,567.6 MGps

versus

➜  besu git:(feature/precompile-caching-part1) ✗ build/install/besu/bin/evmtool benchmark --native --use-precompile-cache=false bls12
besu/v25.3-develop-f4df214/osx-aarch_64/corretto-java-22
Benchmarks for Bls12
Bls12 G1 Add    375 avg gas @    5.6 µs /    66.9 MGps
Bls12 G1 MSM   501,672 total gas @4,155.8 µs /   120.7 MGps
Bls12 MapFpToG1  5,500 avg gas @   47.9 µs /   114.7 MGps
Bls12 G2 Add    600 avg gas @    6.3 µs /    95.1 MGps
Bls12 G2 MSM   991,080 total gas @6,888.4 µs /   143.9 MGps
Bls12 MapFp2G1 23,800 avg gas @  226.9 µs /   104.9 MGps
Bls12 Pairing 2,209,700 total gas @22,388.5 µs /    98.7 MGps

@garyschulte garyschulte force-pushed the feature/precompile-caching-part1 branch from 995504b to edeeace on March 24, 2025 17:36
@garyschulte (Author) commented

> Also I don't see metrics for BLS12_G1ADD, BLS12_G1MULTIEXP, BLS12_G2ADD, BLS12_G2MULTIEXP, BLS12_MAP_FIELD_TO_CURVE, BLS12_PAIRING. I guess it is because these precompiles are not called, but it is worth double checking.

BLS doesn't go live until Pectra, thus no stats yet...

@garyschulte (Author) commented

> Sharing the metrics on the configured precompiles on Ethereum mainnet, after ~40 minutes of executions. We're using counters, so this is the hit ratio for the 40 minutes of execution. From the metrics, we can reconsider at least enabling the cache on KZGPointEval, as the hit ratio is 0. It was a suggestion from my side to enable caching on KZGPointEval, but the metrics show that it is not a good candidate for caching.

Over 4 days, I see a pretty low hit ratio:

nuc 8: 198 / 2665
nuc 14: 146 / 2594

How low of a ratio makes it not worth it?

@ahamlat (Contributor) commented Mar 25, 2025

> > Sharing the metrics on the configured precompiles on Ethereum mainnet, after ~40 minutes of executions. We're using counters, so this is the hit ratio for the 40 minutes of execution. From the metrics, we can reconsider at least enabling the cache on KZGPointEval, as the hit ratio is 0. It was a suggestion from my side to enable caching on KZGPointEval, but the metrics show that it is not a good candidate for caching.
>
> Over 4 days, I see a pretty low hit ratio:
>
> nuc 8: 198 / 2665
> nuc 14: 146 / 2594
>
> How low of a ratio makes it not worth it?

It depends on:

  • the cost of checking whether the entry is in the cache and adding it to the cache (i.e. the overhead of caching)
    vs
  • the cost of the precompile execution itself.

So in the case of the metrics you shared, the cache has an overhead on 2665 calls and could avoid the execution of only 198 precompile calls.
The overhead here is calculating the hashcode of the input byte array and checking whether the integer key (the hashcode result) is in the cache. There is also the overhead of adding the new execution result to the cache. So in this case, we're improving 7% of the calls and generating overhead on 93% of the calls.
This kind of cache can still be interesting if the execution of the precompile is slow; let me get more metrics on KZGPointEval.

@garyschulte garyschulte force-pushed the feature/precompile-caching-part1 branch from 1c0ae22 to 4e742ea on March 25, 2025 14:52
@ahamlat (Contributor) commented Mar 25, 2025

> It depends on:
>
>   • the cost of checking whether the entry is in the cache and adding it to the cache (i.e. the overhead of caching)
>     vs
>   • the cost of the precompile execution itself.
>
> So in the case of the metrics you shared, the cache has an overhead on 2665 calls and could avoid the execution of only 198 precompile calls. The overhead here is calculating the hashcode of the input byte array and checking whether the integer key (the hashcode result) is in the cache. There is also the overhead of adding the new execution result to the cache. So in this case, we're improving 7% of the calls and generating overhead on 93% of the calls. This kind of cache can still be interesting if the execution of the precompile is slow; let me get more metrics on KZGPointEval.

So the execution time of a KZGPointEval call is around 500 µs on my laptop, which is pretty slow. I will suggest a PR tomorrow to add it to the existing benchmarks. I think even with only a 7% hit ratio, we can keep it.
As a reference, EcRecover takes around 50 µs, around 10x faster.
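
As a back-of-envelope check of that trade-off, using the numbers in this thread (the ~7% hit ratio above and the ~500 µs execution time here; purely illustrative arithmetic):

// Caching pays off on average when hitRatio * tExec > tOverhead: the
// expected execution time saved must exceed the unconditional cost of
// hashing the input and probing (and occasionally populating) the cache.
double hitRatio = 198.0 / 2665.0;   // ~7.4%, from the mainnet counters above
double tExecMicros = 500.0;         // KZGPointEval execution time (~500 µs)
double overheadBudget = hitRatio * tExecMicros;
System.out.printf("break-even overhead: ~%.0f µs per call%n", overheadBudget);
// ~37 µs/call of headroom: hashing a 192-byte KZG input and one map probe
// cost far less, so even a ~7% hit ratio is worthwhile for this precompile,
// whereas a ~50 µs precompile at the same ratio leaves only ~4 µs of budget.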

@garyschulte garyschulte force-pushed the feature/precompile-caching-part1 branch from 4e742ea to aad1762 Compare March 31, 2025 20:25
@garyschulte garyschulte enabled auto-merge (squash) March 31, 2025 20:27
@garyschulte garyschulte merged commit 2440f6a into hyperledger:main Mar 31, 2025
43 checks passed