
Conversation

@silvanshade (Contributor) commented Jul 22, 2025

This migrates the VAES implementation to the stabilized intrinsics for the upcoming 1.89 release.

CC @tarcieri

Related:

Benchmarks:

The numbers are basically on par with #482.

VAES512

$ RUSTFLAGS="-Ctarget-cpu=native" cargo +nightly bench

running 15 tests
test aes128_decrypt_block  ... bench:         923.58 ns/iter (+/- 31.60) = 17750 MB/s
test aes128_decrypt_blocks ... bench:         227.12 ns/iter (+/- 2.14) = 72176 MB/s
test aes128_encrypt_block  ... bench:         925.93 ns/iter (+/- 23.25) = 17712 MB/s
test aes128_encrypt_blocks ... bench:         226.78 ns/iter (+/- 0.93) = 72495 MB/s
test aes128_new            ... bench:           8.75 ns/iter (+/- 0.06)
test aes192_decrypt_block  ... bench:       1,137.97 ns/iter (+/- 7.48) = 14409 MB/s
test aes192_decrypt_blocks ... bench:         274.65 ns/iter (+/- 3.55) = 59795 MB/s
test aes192_encrypt_block  ... bench:       1,137.29 ns/iter (+/- 9.93) = 14409 MB/s
test aes192_encrypt_blocks ... bench:         274.43 ns/iter (+/- 3.48) = 59795 MB/s
test aes192_new            ... bench:          10.84 ns/iter (+/- 0.04)
test aes256_decrypt_block  ... bench:       1,422.63 ns/iter (+/- 12.13) = 11521 MB/s
test aes256_decrypt_blocks ... bench:         318.35 ns/iter (+/- 4.88) = 51522 MB/s
test aes256_encrypt_block  ... bench:       1,423.10 ns/iter (+/- 16.96) = 11513 MB/s
test aes256_encrypt_blocks ... bench:         318.48 ns/iter (+/- 4.64) = 51522 MB/s
test aes256_new            ... bench:          11.88 ns/iter (+/- 0.10)

VAES256

$ RUSTFLAGS="-Ctarget-cpu=native --cfg aes_avx512_disable" cargo +nightly bench

running 15 tests
test aes128_decrypt_block  ... bench:         916.61 ns/iter (+/- 13.60) = 17886 MB/s
test aes128_decrypt_blocks ... bench:         446.66 ns/iter (+/- 2.20) = 36735 MB/s
test aes128_encrypt_block  ... bench:         918.15 ns/iter (+/- 8.60) = 17847 MB/s
test aes128_encrypt_blocks ... bench:         446.91 ns/iter (+/- 1.37) = 36735 MB/s
test aes128_new            ... bench:           8.70 ns/iter (+/- 0.03)
test aes192_decrypt_block  ... bench:       1,137.01 ns/iter (+/- 10.35) = 14409 MB/s
test aes192_decrypt_blocks ... bench:         533.74 ns/iter (+/- 0.81) = 30739 MB/s
test aes192_encrypt_block  ... bench:       1,136.23 ns/iter (+/- 7.93) = 14422 MB/s
test aes192_encrypt_blocks ... bench:         536.83 ns/iter (+/- 1.60) = 30567 MB/s
test aes192_new            ... bench:          10.77 ns/iter (+/- 0.06)
test aes256_decrypt_block  ... bench:       1,421.61 ns/iter (+/- 27.79) = 11529 MB/s
test aes256_decrypt_blocks ... bench:         625.80 ns/iter (+/- 0.82) = 26214 MB/s
test aes256_encrypt_block  ... bench:       1,421.75 ns/iter (+/- 20.02) = 11529 MB/s
test aes256_encrypt_blocks ... bench:         623.91 ns/iter (+/- 1.41) = 26298 MB/s
test aes256_new            ... bench:          11.87 ns/iter (+/- 0.06)

AES-NI

$ RUSTFLAGS="-Ctarget-cpu=native --cfg aes_avx512_disable --cfg aes_avx256_disable" cargo +nightly bench

running 15 tests
test aes128_decrypt_block  ... bench:         920.21 ns/iter (+/- 34.94) = 17808 MB/s
test aes128_decrypt_blocks ... bench:         906.61 ns/iter (+/- 7.41) = 18083 MB/s
test aes128_encrypt_block  ... bench:         919.36 ns/iter (+/- 14.43) = 17828 MB/s
test aes128_encrypt_blocks ... bench:         907.71 ns/iter (+/- 5.45) = 18063 MB/s
test aes128_new            ... bench:           8.69 ns/iter (+/- 0.09)
test aes192_decrypt_block  ... bench:       1,136.54 ns/iter (+/- 8.26) = 14422 MB/s
test aes192_decrypt_blocks ... bench:       1,100.96 ns/iter (+/- 3.33) = 14894 MB/s
test aes192_encrypt_block  ... bench:       1,135.87 ns/iter (+/- 7.97) = 14435 MB/s
test aes192_encrypt_blocks ... bench:       1,098.96 ns/iter (+/- 2.14) = 14921 MB/s
test aes192_new            ... bench:          10.75 ns/iter (+/- 0.13)
test aes256_decrypt_block  ... bench:       1,422.29 ns/iter (+/- 20.06) = 11521 MB/s
test aes256_decrypt_blocks ... bench:       1,286.62 ns/iter (+/- 21.24) = 12740 MB/s
test aes256_encrypt_block  ... bench:       1,426.20 ns/iter (+/- 35.99) = 11489 MB/s
test aes256_encrypt_blocks ... bench:       1,279.78 ns/iter (+/- 15.21) = 12810 MB/s
test aes256_new            ... bench:          11.87 ns/iter (+/- 0.06)

@silvanshade force-pushed the vaes-intrinsics branch 2 times, most recently from 49a52de to a49e9e0 on July 22, 2025 04:28
@silvanshade (Contributor, Author)

I had to change the cfg flag feature gating to be opt-in rather than opt-out in order for the CI tests to pass without bumping the MSRV.

@newpavlov (Member) commented Jul 22, 2025

Maybe for a cleaner history it's worth reverting #482, rebasing this PR onto the new master, and merging it after the v0.9.0 release? It would also allow us to publish a smaller v0.9.0.

@silvanshade (Contributor, Author)

Maybe for a cleaner history it's worth reverting #482, rebasing this PR onto the new master, and merging it after the v0.9.0 release?

One reason to consider keeping the original in history is that it explains why par block sizes of 30 for VAES256 and 64 for VAES512 were chosen, i.e., due to the register configuration. This detail is absent from the intrinsics implementation.

@newpavlov (Member)

This detail is absent from the intrinsics implementation.

This would be better described in a code comment than left to git history.

@silvanshade (Contributor, Author)

This detail is absent from the intrinsics implementation.

This would be better described in a code comment than left to git history.

True. I can add a comment about that.

@silvanshade force-pushed the vaes-intrinsics branch 3 times, most recently from d9e12be to 230c5f8 on July 27, 2025 17:50
@silvanshade (Contributor, Author)

I added these comments about the block sizes:

        #[cfg(all(target_arch = "x86_64", any(aes_avx256, aes_avx512)))]
        impl<'a> ParBlocksSizeUser for $name_backend::Vaes256<'a> {
            // Block size of 30 is chosen based on AVX2's 16 YMM registers.
            //
            // * 1 register holds 2 keys per round (loads interleaved with rounds)
            // * 15 registers hold 2 data blocks
            //
            // This gives (16 <total> - 1 <round key>) * 2 <data> = 30 <data>.
            type ParBlocksSize = U30;
        }

        #[cfg(all(target_arch = "x86_64", aes_avx512))]
        impl<'a> ParBlocksSizeUser for $name_backend::Vaes512<'a> {
            // Block size of 64 is chosen based on AVX512's 32 ZMM registers.
            //
            // * 11, 13, 15 registers for keys, corresponding to AES-128, AES-192, AES-256
            // * 11, 13, 15 registers hold 4 keys each (no interleaved loading like VAES256)
            // * 16 registers hold 4 data blocks
            // * 1-4 registers remain unused (could use them but probably not worth it)
            //
            // This gives (32 <total> - 15 <AES-256 round keys> - 1 <unused>) * 4 <data> = 64 <data>.
            type ParBlocksSize = U64;
        }

@tarcieri (Member) commented Aug 2, 2025

@newpavlov what's the path forward here?

  1. Open a separate PR to revert #482 (Implement VAES AVX and AVX512 backends for aes), merge that, and rebase on top?
  2. Merge as is? (I'm personally fine with this)

@tarcieri mentioned this pull request on Aug 3, 2025
@newpavlov (Member)

I would prefer to do the following:

  1. Revert the old VAES PR.
  2. Release aes v0.9.0 with MSRV 1.85 and without VAES support.
  3. Merge this PR and release it as part of aes v0.9.1.

aes is relatively widely used, so I think it's worth having a release with a relaxed MSRV.

@tarcieri (Member) commented Aug 3, 2025

Again, we can use cfg gating, both to avoid the MSRV bump and because we still don't have real-world info on the performance impact.

@newpavlov (Member)

I would prefer to have a simpler v0.9.0 release. Since v0.9.1 will probably be released shortly after v0.9.0, I think it's fine to postpone VAES support a bit.

@tarcieri (Member) commented Aug 3, 2025

What is the drawback of an off-by-default feature which requires a cfg flag to enable? It can be labeled experimental if you so desire.

@newpavlov (Member)

5k LoC is not a small amount. It would also require additional work to make it experimental. It will be much easier to revert the old PR and merge this PR with a proper intrinsics-based implementation and autodetection enabled by default. If we encounter any issues with the VAES backend, we will also be able to quickly yank v0.9.1.

@tarcieri (Member) commented Aug 3, 2025

I'm talking about shipping this PR, cfg-gated, after reverting #482, which was ~5k LoC. This one is significantly smaller.

@tarcieri added a commit that referenced this pull request on Aug 3, 2025:
This reverts commit ad83428.

This implementation uses assembly, but the relevant intrinsics will be
stable in Rust 1.89, and we have a PR open to use them: #491

For a cleaner history, this reverts the assembly implementation so the
intrinsics-based implementation can be cleanly applied to an ASM-free
codebase, rather than as a replacement for the ASM.
@tarcieri (Member) commented Aug 3, 2025

I opened a PR to revert #482: #496

@tarcieri (Member) commented Aug 3, 2025

@silvanshade now that #496 has been merged, can you rebase/merge?

@newpavlov changed the title from "Migrate to intrinsics for VAES" to "aes: add VAES support" on Aug 6, 2025
@tarcieri (Member) commented Aug 6, 2025

@silvanshade it looks like the CI config got lost in the rebase

(also as of tomorrow you'll be able to use Rust 1.89)

@silvanshade force-pushed the vaes-intrinsics branch 2 times, most recently from 43acd15 to 2743559 on August 7, 2025 13:57
@silvanshade requested a review from tarcieri on August 7, 2025 13:58
@tarcieri (Member) left a comment

Reapproving with the following notes:

  • MSRV is untouched, since the --cfg aes_avx256 or --cfg aes_avx512 options are required to enable the feature, as we requested
  • This should likewise have no impact on anyone who does not explicitly enable those
  • autodetect.rs is improved/simplified despite a new backend, though the complexity has moved to x86.rs. I still consider that a general win.
  • In a future release, we can potentially bump MSRV and enable this by default, but having a cfg at first allows an initial set of users to experiment with it and help us gather data on whether that's a good idea or not. I think at least for now this is a niche feature for users with Intel Xeon servers?

#[allow(unused)] // TODO: remove once cfg flags are removed
pub(crate) features: Features,
pub(crate) keys: &'a Simd128RoundKeys<$rounds>,
pub(crate) simd_256_keys: OnceCell<Simd256RoundKeys<$rounds>>,
Member:

I don't think we need to cache the broadcast keys. It makes the struct bigger, which will especially be a problem once the backends are enabled by default. I think we should just broadcast the keys during encryption/decryption and let the compiler optimize it out when possible.
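
For illustration, a minimal sketch of the suggested alternative (not code from this PR): broadcast each 128-bit round key into a 256-bit register on the fly and let the compiler hoist the broadcasts out of hot loops. The function name and the ROUNDS parameter are assumptions made for the sketch; _mm256_broadcastsi128_si256 is the relevant stable intrinsic.

    use core::arch::x86_64::{__m128i, __m256i, _mm256_broadcastsi128_si256};

    /// Broadcast each 128-bit round key into both lanes of a 256-bit register.
    ///
    /// # Safety
    /// The caller must ensure AVX2 is available on the current CPU.
    #[target_feature(enable = "avx2")]
    unsafe fn broadcast_keys<const ROUNDS: usize>(keys: &[__m128i; ROUNDS]) -> [__m256i; ROUNDS] {
        // All-zero is a valid bit pattern for __m256i; each element is then
        // overwritten with the corresponding round key duplicated into both lanes.
        let mut out = [unsafe { core::mem::zeroed::<__m256i>() }; ROUNDS];
        for (dst, key) in out.iter_mut().zip(keys.iter()) {
            *dst = unsafe { _mm256_broadcastsi128_si256(*key) };
        }
        out
    }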

Contributor Author:

I think changes to this behavior should be accompanied by benchmarks.


let backend = &$name_backend::Vaes256::from(backend);
if backend.features.has_vaes256() {
while rem >= backend.par_blocks() {
Member:

I think VAES256 availability should be guaranteed if VAES512 is available, no? If it's not guaranteed for some reason, we can be conservative and check VAES256 support in has_vaes512.

Member:

Yep. VAES256 intrinsics are gated on the vaes target feature, while VAES512 intrinsics are gated on vaes + avx512f.

Contributor Author:

This one probably does make sense to change. I don't technically know if the ISA spec guarantees that VAES256 should be available if VAES512 is available (one would assume so), but if the compiler and LLVM believe that to be the case, we might as well follow that convention.
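
For illustration only, a sketch of the conservative check discussed above; the Features struct and its fields here are assumptions for the sketch, not the PR's actual layout:

    // Hedged sketch: have the 512-bit check imply the 256-bit one, mirroring how
    // the intrinsics are gated (vaes for VAES256, vaes + avx512f for VAES512).
    struct Features {
        avx2: bool,
        vaes: bool,
        avx512f: bool,
    }

    impl Features {
        fn has_vaes256(&self) -> bool {
            self.avx2 && self.vaes
        }

        fn has_vaes512(&self) -> bool {
            // Conservative: only report VAES512 when the VAES256 path is also usable.
            self.has_vaes256() && self.avx512f
        }
    }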

backend.encrypt_par_blocks(blocks);
rem -= backend.par_blocks();
iptr = unsafe { iptr.add(backend.par_blocks()) };
optr = unsafe { optr.add(backend.par_blocks()) };
Member:

I would prefer to write this using safe code. See the InOutBuf::into_chunks method.
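
For reference, a hedged sketch of what the safe version might look like. This is not the PR's actual code: it assumes the inout crate's iterator support and the cipher crate's backend trait names, with encrypt_par_blocks and U30 taken from the PR itself.

    use cipher::{consts::U30, inout::InOutBuf, Block, BlockCipherEncBackend, ParBlocksSizeUser};

    /// Encrypt blocks in chunks of 30 via the parallel path, then finish the
    /// remainder with the single-block path.
    fn encrypt_blocks_vaes256<B>(backend: &B, blocks: InOutBuf<'_, '_, Block<B>>)
    where
        B: BlockCipherEncBackend + ParBlocksSizeUser<ParBlocksSize = U30>,
    {
        // Split into full 30-block chunks plus a tail shorter than 30 blocks.
        let (chunks, tail) = blocks.into_chunks::<U30>();
        for chunk in chunks {
            backend.encrypt_par_blocks(chunk);
        }
        for block in tail {
            backend.encrypt_block(block);
        }
    }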

@silvanshade (Contributor, Author)

Regarding the reviews from @newpavlov, I don't have time to work on this again and I'm not sure when I will.

But even if I did, I also don't think it's reasonable to add a bunch of newly requested changes at this point, especially given the long history of this code. I won't object if you want to make those changes yourself, but I don't plan to do it.

@tarcieri (Member)

@silvanshade can you address this one in particular? #491 (comment)

@newpavlov (Member)

I can push fixes to this branch later (I also have a bunch of other nitpicky changes which I would like to add), but it will probably happen after the v0.9.0 release.

@tarcieri (Member)

@newpavlov is there a particular reason you want to block merging this on "nitpicky changes"? Those sound like they could all be done in a completely incremental fashion at any point in time, since they aren't externally-facing.

@newpavlov (Member)

I did not mean that the merge is blocked on them, but that I plan to work on them together with fixing the comments above. At the very least, before merging I would like to see removal of the OnceCell uses and a reduction in unsafe.

I don't see why we should rush the addition of VAES support into v0.9.0. I think it's fine to add it in v0.9.1 together with other new backends.

@tarcieri (Member)

I agree it would be good to get rid of OnceCell; however, I don't think merging before doing that is "rushing" the PR. That seems like a minor "nice to have" improvement.

@tarcieri (Member)

I'd also say there appears to be a time-memory tradeoff here. We've had another request to support a lower memory profile at the cost of performance (#191), and I'm wondering if we should perhaps expose a cfg knob that allows the application to specify whether it prefers performance or lower memory use.

That said, my expectation would be that performance is the top priority for anyone who goes out of their way to enable --cfg aes_avx512. So a tuning knob for performance vs memory use becomes a bigger priority if we were to ship this on-by-default.

@newpavlov (Member) commented Aug 11, 2025

The broadcast should be extremely efficient cycle-wise, cache-friendly, and trivial for the compiler to move outside of the loop. OnceCell not only increases the size of the struct, but is also likely to act as an optimization barrier because of its side-effectful nature.
