LLVM's "optimization" of shuffles penalizes x64 codegen #196
Description
So we've discussed this previously at some point on the WASM call, but I keep hitting this issue and I feel like something must be done.
Briefly, the issue is as follows:
- WASM SIMD shuffle with constant indices is a general byte shuffle
- v8 codegen for the shuffle tries to pattern-match a set of common SSE shuffles and, failing that, emits a really terrible code sequence
- LLVM often rewrites shuffles that do form a recognizable pattern into shuffles that don't. This "optimization" is harmful: it breaks pattern matching in v8 - and, presumably, in other engines.
Here's the most recent example I hit. Given this code, which extracts a 16-bit component from each of the four 64-bit halves of two vectors:
const v128_t zmask = wasm_i32x4_splat(0x7fff);
v128_t z4 = wasmx_shuffle_v32x4(n4_0, n4_1, 1, 3, 1, 3);
v128_t zf = wasm_v128_and(z4, zmask);
With wasmx_shuffle_v32x4 defined as follows to simulate SSE shufps:
#define wasmx_shuffle_v32x4(v, w, i, j, k, l) \
    wasm_v8x16_shuffle(v, w, \
        4 * i, 4 * i + 1, 4 * i + 2, 4 * i + 3, \
        4 * j, 4 * j + 1, 4 * j + 2, 4 * j + 3, \
        16 + 4 * k, 16 + 4 * k + 1, 16 + 4 * k + 2, 16 + 4 * k + 3, \
        16 + 4 * l, 16 + 4 * l + 1, 16 + 4 * l + 2, 16 + 4 * l + 3)
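For reference, expanding the macro for the (1, 3, 1, 3) call above gives the byte shuffle below. Every group of four indices selects a whole 32-bit lane, which is exactly the shape v8 can match to a single shufps (shufps dst, src, 0xDD for these lane indices, if I have the immediate encoding right):

v128_t z4 = wasm_v8x16_shuffle(n4_0, n4_1,
    4, 5, 6, 7,      /* 32-bit lane 1 of n4_0 */
    12, 13, 14, 15,  /* 32-bit lane 3 of n4_0 */
    20, 21, 22, 23,  /* 32-bit lane 1 of n4_1 */
    28, 29, 30, 31); /* 32-bit lane 3 of n4_1 */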
LLVM notices that, because of the subsequent wasm_v128_and with zmask, the top two bytes of each 32-bit lane of the shuffle result are unused, and "optimizes" the shuffle to:
v8x16.shuffle 0x00000504 0x00000d0c 0x00001514 0x00001d1c
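Decoding these immediates (assuming each 32-bit group is printed little-endian), the rewritten shuffle is equivalent to:

v128_t z4 = wasm_v8x16_shuffle(n4_0, n4_1,
    4, 5, 0, 0,    /* dead bytes 6, 7 canonicalized to index 0 */
    12, 13, 0, 0,  /* dead bytes 14, 15 canonicalized to index 0 */
    20, 21, 0, 0,
    28, 29, 0, 0);

The don't-care bytes no longer select whole 32-bit lanes, so the shufps pattern is gone.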
V8 doesn't recognize this as any shuffle it can lower directly, and generates this:
000000E5ED349504 204 4989e1 REX.W movq r9,rsp
000000E5ED349507 207 4883e4f0 REX.W andq rsp,0xf0
000000E5ED34950B 20b c4417810f8 vmovups xmm15,xmm8
000000E5ED349510 210 49ba8080000080800000 REX.W movq r10,0000808000008080
000000E5ED34951A 21a 4152 push r10
000000E5ED34951C 21c 49ba040500000c0d0000 REX.W movq r10,00000D0C00000504
000000E5ED349526 226 4152 push r10
000000E5ED349528 228 c46201003c24 vpshufb xmm15,xmm15,[rsp]
000000E5ED34952E 22e 450f10e1 movups xmm12,xmm9
000000E5ED349532 232 49ba040580800c0d8080 REX.W movq r10,80800D0C80800504
000000E5ED34953C 23c 4152 push r10
000000E5ED34953E 23e 49ba8080808080808080 REX.W movq r10,8080808080808080
000000E5ED349548 248 4152 push r10
000000E5ED34954A 24a c46219002424 vpshufb xmm12,xmm12,[rsp]
000000E5ED349550 250 c44119ebe7 vpor xmm12,xmm12,xmm15
000000E5ED349555 255 498be1 REX.W movq rsp,r9
This code sequence is catastrophically bad: both 16-byte vpshufb masks are rebuilt on the stack with immediate moves and pushes, on every iteration. To put it in perspective, this code runs at 3.2 GB/s without this problem and at 1.1 GB/s with it - even though the loop without this code sequence is actually pretty large, ~70 SSE/AVX instructions plus some amount of loop/branch scaffolding that v8 emits. This shuffle is merely a small piece of a larger transform, and it alone wreaks havoc on the performance.
Now, clearly the instruction sequence doesn't have to be this bad. However, note that this is a shuffle with two distinct source operands - so even if v8 pre-computed the shuffle masks and loaded them from memory, the result would still be something like
vpshufb reg, reg, [rip + off1]
vpshufb reg1, reg1, [rip + off2]
vpor reg, reg, reg1
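The reason two vpshufbs are needed at all: pshufb reads a single source, so a generic two-source byte shuffle has to be emulated by shuffling each source separately - with out-of-range indices mapped to 0x80 so those bytes come out as zero - and merging the halves with por. A minimal sketch in SSE intrinsics (not v8's actual lowering):

#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */

/* mask_a selects bytes from a (0x80 in lanes that come from b);
   mask_b selects bytes from b (0x80 in lanes that come from a);
   pshufb zeroes any byte whose mask byte has the top bit set. */
static inline __m128i shuffle_two_sources(__m128i a, __m128i b,
                                          __m128i mask_a, __m128i mask_b)
{
    return _mm_or_si128(_mm_shuffle_epi8(a, mask_a),
                        _mm_shuffle_epi8(b, mask_b));
}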
So LLVM actively pessimizes the code that the programmer writes. I've hit this issue so many times, and every time it takes a while to figure out how to apply a fragile workaround - in this case I had to mark zmask as volatile, paying an extra load from stack memory just to make sure LLVM doesn't do anything stupid.
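The workaround looks something like this (a sketch - the exact shape needed to defeat the combine may vary by LLVM version, and zmask_v is just a name I picked here):

/* Routing the mask through a volatile local hides the fact that the
   top bytes of each lane are dead, so LLVM leaves the shuffle indices
   alone - at the cost of a reload from the stack. */
volatile v128_t zmask_v = wasm_i32x4_splat(0x7fff);
v128_t zmask = zmask_v; /* forced stack load */
v128_t z4 = wasmx_shuffle_v32x4(n4_0, n4_1, 1, 3, 1, 3);
v128_t zf = wasm_v128_and(z4, zmask);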
I know we've discussed that adding target-specific logic to LLVM transforms seems problematic, since that's a lot of code to maintain. Have we considered not combining shuffles at all? I have yet to see evidence that LLVM combining shuffles without being aware of the target platform can produce beneficial results.