
Conversation

@silvanshade (Contributor) commented Jul 22, 2025

This migrates the VAES implementation to the stabilized intrinsics for the upcoming 1.89 release.

CC @tarcieri

Related:

Benchmarks:

The numbers are basically on par with #482.

VAES512

$ RUSTFLAGS="-Ctarget-cpu=native" cargo +nightly bench

running 15 tests
test aes128_decrypt_block  ... bench:         923.58 ns/iter (+/- 31.60) = 17750 MB/s
test aes128_decrypt_blocks ... bench:         227.12 ns/iter (+/- 2.14) = 72176 MB/s
test aes128_encrypt_block  ... bench:         925.93 ns/iter (+/- 23.25) = 17712 MB/s
test aes128_encrypt_blocks ... bench:         226.78 ns/iter (+/- 0.93) = 72495 MB/s
test aes128_new            ... bench:           8.75 ns/iter (+/- 0.06)
test aes192_decrypt_block  ... bench:       1,137.97 ns/iter (+/- 7.48) = 14409 MB/s
test aes192_decrypt_blocks ... bench:         274.65 ns/iter (+/- 3.55) = 59795 MB/s
test aes192_encrypt_block  ... bench:       1,137.29 ns/iter (+/- 9.93) = 14409 MB/s
test aes192_encrypt_blocks ... bench:         274.43 ns/iter (+/- 3.48) = 59795 MB/s
test aes192_new            ... bench:          10.84 ns/iter (+/- 0.04)
test aes256_decrypt_block  ... bench:       1,422.63 ns/iter (+/- 12.13) = 11521 MB/s
test aes256_decrypt_blocks ... bench:         318.35 ns/iter (+/- 4.88) = 51522 MB/s
test aes256_encrypt_block  ... bench:       1,423.10 ns/iter (+/- 16.96) = 11513 MB/s
test aes256_encrypt_blocks ... bench:         318.48 ns/iter (+/- 4.64) = 51522 MB/s
test aes256_new            ... bench:          11.88 ns/iter (+/- 0.10)

VAES256

$ RUSTFLAGS="-Ctarget-cpu=native --cfg aes_avx512_disable" cargo +nightly bench

running 15 tests
test aes128_decrypt_block  ... bench:         916.61 ns/iter (+/- 13.60) = 17886 MB/s
test aes128_decrypt_blocks ... bench:         446.66 ns/iter (+/- 2.20) = 36735 MB/s
test aes128_encrypt_block  ... bench:         918.15 ns/iter (+/- 8.60) = 17847 MB/s
test aes128_encrypt_blocks ... bench:         446.91 ns/iter (+/- 1.37) = 36735 MB/s
test aes128_new            ... bench:           8.70 ns/iter (+/- 0.03)
test aes192_decrypt_block  ... bench:       1,137.01 ns/iter (+/- 10.35) = 14409 MB/s
test aes192_decrypt_blocks ... bench:         533.74 ns/iter (+/- 0.81) = 30739 MB/s
test aes192_encrypt_block  ... bench:       1,136.23 ns/iter (+/- 7.93) = 14422 MB/s
test aes192_encrypt_blocks ... bench:         536.83 ns/iter (+/- 1.60) = 30567 MB/s
test aes192_new            ... bench:          10.77 ns/iter (+/- 0.06)
test aes256_decrypt_block  ... bench:       1,421.61 ns/iter (+/- 27.79) = 11529 MB/s
test aes256_decrypt_blocks ... bench:         625.80 ns/iter (+/- 0.82) = 26214 MB/s
test aes256_encrypt_block  ... bench:       1,421.75 ns/iter (+/- 20.02) = 11529 MB/s
test aes256_encrypt_blocks ... bench:         623.91 ns/iter (+/- 1.41) = 26298 MB/s
test aes256_new            ... bench:          11.87 ns/iter (+/- 0.06)

AES-NI

$ RUSTFLAGS="-Ctarget-cpu=native --cfg aes_avx512_disable --cfg aes_avx256_disable" cargo +nightly bench

running 15 tests
test aes128_decrypt_block  ... bench:         920.21 ns/iter (+/- 34.94) = 17808 MB/s
test aes128_decrypt_blocks ... bench:         906.61 ns/iter (+/- 7.41) = 18083 MB/s
test aes128_encrypt_block  ... bench:         919.36 ns/iter (+/- 14.43) = 17828 MB/s
test aes128_encrypt_blocks ... bench:         907.71 ns/iter (+/- 5.45) = 18063 MB/s
test aes128_new            ... bench:           8.69 ns/iter (+/- 0.09)
test aes192_decrypt_block  ... bench:       1,136.54 ns/iter (+/- 8.26) = 14422 MB/s
test aes192_decrypt_blocks ... bench:       1,100.96 ns/iter (+/- 3.33) = 14894 MB/s
test aes192_encrypt_block  ... bench:       1,135.87 ns/iter (+/- 7.97) = 14435 MB/s
test aes192_encrypt_blocks ... bench:       1,098.96 ns/iter (+/- 2.14) = 14921 MB/s
test aes192_new            ... bench:          10.75 ns/iter (+/- 0.13)
test aes256_decrypt_block  ... bench:       1,422.29 ns/iter (+/- 20.06) = 11521 MB/s
test aes256_decrypt_blocks ... bench:       1,286.62 ns/iter (+/- 21.24) = 12740 MB/s
test aes256_encrypt_block  ... bench:       1,426.20 ns/iter (+/- 35.99) = 11489 MB/s
test aes256_encrypt_blocks ... bench:       1,279.78 ns/iter (+/- 15.21) = 12810 MB/s
test aes256_new            ... bench:          11.87 ns/iter (+/- 0.06)

@silvanshade force-pushed the vaes-intrinsics branch 2 times, most recently from 49a52de to a49e9e0 on July 22, 2025 04:28
@silvanshade (Contributor, Author)

I had to change the cfg flag feature gating to be opt-in rather than opt-out in order for the CI tests to pass without bumping the MSRV.

@newpavlov (Member) commented Jul 22, 2025

Maybe for a cleaner history it's worth reverting #482, rebasing this PR onto the new master, and merging it after the v0.9.0 release? It would also allow us to publish a smaller v0.9.0.

@silvanshade (Contributor, Author)

Maybe for a cleaner history it's worth reverting #482, rebasing this PR onto the new master, and merging it after the v0.9.0 release?

One reason to consider keeping the original in history is that it explains why par block sizes of 30 for VAES256 and 64 for VAES512 were chosen, i.e., due to the register configuration. This detail is absent from the intrinsics implementation.

@newpavlov (Member)

This detail is absent from the intrinsics implementation.

This would be better described in a code comment than left to git history.

@silvanshade (Contributor, Author)

This detail is absent from the intrinsics implementation.

This would be better described in a code comment than left to git history.

True. I can add a comment about that.

@silvanshade force-pushed the vaes-intrinsics branch 3 times, most recently from d9e12be to 230c5f8 on July 27, 2025 17:50
@silvanshade (Contributor, Author)

I added these comments about the block sizes:

        #[cfg(all(target_arch = "x86_64", any(aes_avx256, aes_avx512)))]
        impl<'a> ParBlocksSizeUser for $name_backend::Vaes256<'a> {
            // Block size of 30 is chosen based on AVX2's 16 YMM registers.
            //
            // * 1 register holds 2 keys per round (loads interleaved with rounds)
            // * 15 registers hold 2 data blocks
            //
            // This gives (16 <total> - 1 <round key>) * 2 <data> = 30 <data>.
            type ParBlocksSize = U30;
        }

        #[cfg(all(target_arch = "x86_64", aes_avx512))]
        impl<'a> ParBlocksSizeUser for $name_backend::Vaes512<'a> {
            // Block size of 64 is chosen based on AVX512's 32 ZMM registers.
            //
            // * 11, 13, 15 registers for keys, corresponding to AES-128, AES-192, AES-256
            // * 11, 13, 15 registers hold 4 keys each (no interleaved loading like VAES256)
            // * 16 registers hold 4 data blocks
            // * 1-4 registers remain unused (could use them but probably not worth it)
            //
            // This gives (32 <total> - 15 <AES-256 round keys> - 1 <unused>) * 4 <data> = 64 <data>.
            type ParBlocksSize = U64;
        }

@tarcieri (Member) commented Aug 2, 2025

@newpavlov what's the path forward here?

  1. Open a separate PR to revert #482 (Implement VAES AVX and AVX512 backends for aes), merge that, and rebase on top?
  2. Merge as is? (I'm personally fine with this)

@tarcieri mentioned this pull request on Aug 3, 2025
@newpavlov (Member)

I would prefer to do the following:

  1. Revert the old VAES PR.
  2. Release aes v0.9.0 with MSRV 1.85 and without VAES support.
  3. Merge this PR and release it as part of aes v0.9.1.

aes is relatively widely used, so I think it's worth having a release with a relaxed MSRV.

@tarcieri (Member) commented Aug 3, 2025

Again, we can use cfg gating, both to avoid the MSRV bump and because we still don't have real-world info on the performance impact.

@newpavlov (Member)

I would prefer to have a simpler v0.9.0 release. Since v0.9.1 will probably be released shortly after v0.9.0, I think it's fine to postpone VAES support a bit.

@tarcieri (Member) commented Aug 3, 2025

What is the drawback of an off-by-default feature which requires a cfg flag to enable? It can be labeled experimental if you so desire.

@newpavlov (Member)

5k LoC is not a small amount. It would also require additional work to make it experimental. It will be much easier to revert the old PR and merge this PR with a proper intrinsics-based implementation and autodetection enabled by default. If we encounter any issues with the VAES backend, we will also be able to quickly yank v0.9.1.

@tarcieri (Member) commented Aug 3, 2025

I'm talking about shipping this PR, cfg-gated, after reverting #482, which was ~5k LoC. This one is significantly smaller.

@tarcieri added a commit that referenced this pull request on Aug 3, 2025:
This reverts commit ad83428.

This implementation uses assembly, but the relevant intrinsics will be
stable in Rust 1.89, and we have a PR open to use them: #491

For a cleaner history, this reverts the assembly implementation so the
intrinsics-based implementation can be cleanly applied to an ASM-free
codebase, rather than as a replacement for the ASM.
@tarcieri (Member) commented Aug 3, 2025

I opened a PR to revert #482: #496

@tarcieri (Member) commented Aug 3, 2025

@silvanshade now that #496 has been merged, can you rebase/merge?

@newpavlov changed the title from "Migrate to intrinsics for VAES" to "aes: add VAES support" on Aug 6, 2025
@tarcieri (Member) commented Aug 6, 2025

@silvanshade it looks like the CI config got lost in the rebase

(also as of tomorrow you'll be able to use Rust 1.89)

@silvanshade force-pushed the vaes-intrinsics branch 2 times, most recently from 43acd15 to 2743559 on August 7, 2025 13:57
@silvanshade requested a review from tarcieri on August 7, 2025 13:58
@tarcieri (Member) left a comment

Reapproving with the following notes:

  • MSRV is untouched, since the --cfg aes_avx256 or --cfg aes_avx512 options are required to enable the feature, as we requested
  • This should likewise have no impact on anyone who does not explicitly enable those
  • autodetect.rs is improved/simplified despite a new backend, though the complexity has moved to x86.rs. I still consider that a general win.
  • In a future release, we can potentially bump MSRV and enable this by default, but having a cfg at first allows an initial set of users to experiment with it and help us gather data on whether that's a good idea or not. I think at least for now this is a niche feature for users with Intel Xeon servers?

#[allow(unused)] // TODO: remove once cfg flags are removed
pub(crate) features: Features,
pub(crate) keys: &'a Simd128RoundKeys<$rounds>,
pub(crate) simd_256_keys: OnceCell<Simd256RoundKeys<$rounds>>,
Member:

I don't think we need to cache the broadcast keys. It makes the struct bigger, which will especially be a problem once the backends are enabled by default. I think we should just broadcast the keys during encryption/decryption and let the compiler optimize it out when possible.
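
For illustration, a minimal sketch of the suggested alternative (not code from this PR): broadcast each 128-bit round key into a 256-bit register on the fly and let the compiler hoist the broadcasts out of hot loops. The function name and the ROUNDS parameter are assumptions made for the sketch; _mm256_broadcastsi128_si256 is the relevant stable intrinsic.

    use core::arch::x86_64::{__m128i, __m256i, _mm256_broadcastsi128_si256};

    /// Broadcast each 128-bit round key into both lanes of a 256-bit register.
    ///
    /// # Safety
    /// The caller must ensure AVX2 is available on the current CPU.
    #[target_feature(enable = "avx2")]
    unsafe fn broadcast_keys<const ROUNDS: usize>(keys: &[__m128i; ROUNDS]) -> [__m256i; ROUNDS] {
        // All-zero is a valid bit pattern for __m256i; each element is then
        // overwritten with the corresponding round key duplicated into both lanes.
        let mut out = [unsafe { core::mem::zeroed::<__m256i>() }; ROUNDS];
        for (dst, key) in out.iter_mut().zip(keys.iter()) {
            *dst = unsafe { _mm256_broadcastsi128_si256(*key) };
        }
        out
    }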

Contributor Author:

I think changes to this behavior should be accompanied by benchmarks.


let backend = &$name_backend::Vaes256::from(backend);
if backend.features.has_vaes256() {
while rem >= backend.par_blocks() {
Member:

I think VAES256 availability should be guaranteed if VAES512 is available, no? If it's not guaranteed for some reason, we can be conservative and check VAES256 support in has_vaes512.

Member:

Yep. VAES256 intrinsics are gated on the vaes target feature, while VAES512 intrinsics are gated on vaes + avx512f.

Contributor Author:

This one probably does make sense to change. I don't technically know if the ISA spec guarantees that VAES256 should be available if VAES512 is available (one would assume so), but if the compiler and LLVM believe that to be the case, we might as well follow that convention.
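
For illustration only, a sketch of the conservative check discussed above; the Features struct and its fields here are assumptions for the sketch, not the PR's actual layout:

    // Hedged sketch: have the 512-bit check imply the 256-bit one, mirroring how
    // the intrinsics are gated (vaes for VAES256, vaes + avx512f for VAES512).
    struct Features {
        avx2: bool,
        vaes: bool,
        avx512f: bool,
    }

    impl Features {
        fn has_vaes256(&self) -> bool {
            self.avx2 && self.vaes
        }

        fn has_vaes512(&self) -> bool {
            // Conservative: only report VAES512 when the VAES256 path is also usable.
            self.has_vaes256() && self.avx512f
        }
    }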

backend.encrypt_par_blocks(blocks);
rem -= backend.par_blocks();
iptr = unsafe { iptr.add(backend.par_blocks()) };
optr = unsafe { optr.add(backend.par_blocks()) };
Member:

I would prefer to write this using safe code. See the InOutBuf::into_chunks method.
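
For reference, a hedged sketch of what the safe version might look like. This is not the PR's actual code: it assumes the inout crate's iterator support and the cipher crate's backend trait names, with encrypt_par_blocks and U30 taken from the PR itself.

    use cipher::{consts::U30, inout::InOutBuf, Block, BlockCipherEncBackend, ParBlocksSizeUser};

    /// Encrypt blocks in chunks of 30 via the parallel path, then finish the
    /// remainder with the single-block path.
    fn encrypt_blocks_vaes256<B>(backend: &B, blocks: InOutBuf<'_, '_, Block<B>>)
    where
        B: BlockCipherEncBackend + ParBlocksSizeUser<ParBlocksSize = U30>,
    {
        // Split into full 30-block chunks plus a tail shorter than 30 blocks.
        let (chunks, tail) = blocks.into_chunks::<U30>();
        for chunk in chunks {
            backend.encrypt_par_blocks(chunk);
        }
        for block in tail {
            backend.encrypt_block(block);
        }
    }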

@silvanshade (Contributor, Author)

Regarding the reviews from @newpavlov, I don't have time to work on this again and I'm not sure when I will.

But even if I did, I also don't think it's reasonable to add a bunch of newly requested changes at this point, especially given the long history of this code. I won't object if you want to make those changes yourself, but I don't plan to do it.

@tarcieri (Member)

@silvanshade can you address this one in particular? #491 (comment)

@newpavlov (Member)

I can push fixes to this branch later (I also have a bunch of other nitpicky changes which I would like to add), but it will probably happen after the v0.9.0 release.

@tarcieri (Member)

@newpavlov is there a particular reason you want to block merging this on "nitpicky changes"? Those sound like they could all be done in a completely incremental fashion at any point in time, since they aren't externally-facing.

@newpavlov (Member)

I did not mean that the merge is blocked on them, but that I plan to work on them together with fixing the comments above. At the very least, before merging I would like to see removal of the OnceCell uses and a reduction in unsafe.

I don't see why we should rush the addition of VAES support into v0.9.0. I think it's fine to add it in v0.9.1 together with other new backends.

@tarcieri (Member)

I agree it would be good to get rid of OnceCell; however, I don't think merging before doing that is "rushing" the PR. That seems like a minor "nice to have" improvement.

@tarcieri (Member)

I'd also say there appears to be a time-memory tradeoff here. We've had another request to support a lower memory profile at the cost of performance (#191), and I'm wondering if we should perhaps expose a cfg knob that allows the application to specify whether it prefers performance or lower memory use.

That said, my expectation would be that performance is the top priority for anyone who goes out of their way to enable --cfg aes_avx512. So a tuning knob for performance vs memory use becomes a bigger priority if we were to ship this on-by-default.

@newpavlov (Member) commented Aug 11, 2025

The broadcast should be extremely efficient cycle-wise, cache-friendly, and trivial for the compiler to move outside of the loop. OnceCell not only increases the size of the struct, but is also likely to act as an optimization barrier because of its side-effectful nature.
