LLVM's "optimization" of shuffles penalizes x64 codegen #196
Description
So we've discussed this previously at some point on the WASM call, but I keep hitting this issue and I feel like something must be done.
Briefly, the issue is as follows:
- WASM SIMD shuffle with constant indices is a general byte shuffle
- v8 codegen for the shuffle tries to pattern-match a set of common SSE shuffles and, failing that, emits a really terrible code sequence
- LLVM often rewrites shuffles that do form a recognizable pattern into shuffles that don't. This "optimization" is harmful: it breaks pattern matching in v8 - and, presumably, in other engines.
Here's the most recent example I hit. Given this code, which extracts a 16-bit component from each of the four 64-bit halves of two vectors:
const v128_t zmask = wasm_i32x4_splat(0x7fff);
v128_t z4 = wasmx_shuffle_v32x4(n4_0, n4_1, 1, 3, 1, 3);
v128_t zf = wasm_v128_and(z4, zmask);
With wasmx_shuffle_v32x4 defined as follows to simulate SSE shufps:
#define wasmx_shuffle_v32x4(v, w, i, j, k, l) \
    wasm_v8x16_shuffle(v, w, \
        4 * i, 4 * i + 1, 4 * i + 2, 4 * i + 3, \
        4 * j, 4 * j + 1, 4 * j + 2, 4 * j + 3, \
        16 + 4 * k, 16 + 4 * k + 1, 16 + 4 * k + 2, 16 + 4 * k + 3, \
        16 + 4 * l, 16 + 4 * l + 1, 16 + 4 * l + 2, 16 + 4 * l + 3)
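For reference, expanding the macro for the (1, 3, 1, 3) call above gives the byte shuffle below. Every group of four indices selects a whole 32-bit lane, which is exactly the shape v8 can match to a single shufps (shufps dst, src, 0xDD for these lane indices, if I have the immediate encoding right):

v128_t z4 = wasm_v8x16_shuffle(n4_0, n4_1,
    4, 5, 6, 7,      /* 32-bit lane 1 of n4_0 */
    12, 13, 14, 15,  /* 32-bit lane 3 of n4_0 */
    20, 21, 22, 23,  /* 32-bit lane 1 of n4_1 */
    28, 29, 30, 31); /* 32-bit lane 3 of n4_1 */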
LLVM notices that, because of the subsequent wasm_v128_and with zmask, the top two bytes of each 32-bit lane of the shuffle result are unused, and "optimizes" the shuffle to:
v8x16.shuffle 0x00000504 0x00000d0c 0x00001514 0x00001d1c
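Decoding these immediates (assuming each 32-bit group is printed little-endian), the rewritten shuffle is equivalent to:

v128_t z4 = wasm_v8x16_shuffle(n4_0, n4_1,
    4, 5, 0, 0,    /* dead bytes 6, 7 canonicalized to index 0 */
    12, 13, 0, 0,  /* dead bytes 14, 15 canonicalized to index 0 */
    20, 21, 0, 0,
    28, 29, 0, 0);

The don't-care bytes no longer select whole 32-bit lanes, so the shufps pattern is gone.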
V8 doesn't recognize this as any shuffle it can lower directly, and generates this:
000000E5ED349504 204 4989e1 REX.W movq r9,rsp
000000E5ED349507 207 4883e4f0 REX.W andq rsp,0xf0
000000E5ED34950B 20b c4417810f8 vmovups xmm15,xmm8
000000E5ED349510 210 49ba8080000080800000 REX.W movq r10,0000808000008080
000000E5ED34951A 21a 4152 push r10
000000E5ED34951C 21c 49ba040500000c0d0000 REX.W movq r10,00000D0C00000504
000000E5ED349526 226 4152 push r10
000000E5ED349528 228 c46201003c24 vpshufb xmm15,xmm15,[rsp]
000000E5ED34952E 22e 450f10e1 movups xmm12,xmm9
000000E5ED349532 232 49ba040580800c0d8080 REX.W movq r10,80800D0C80800504
000000E5ED34953C 23c 4152 push r10
000000E5ED34953E 23e 49ba8080808080808080 REX.W movq r10,8080808080808080
000000E5ED349548 248 4152 push r10
000000E5ED34954A 24a c46219002424 vpshufb xmm12,xmm12,[rsp]
000000E5ED349550 250 c44119ebe7 vpor xmm12,xmm12,xmm15
000000E5ED349555 255 498be1 REX.W movq rsp,r9
This code sequence is catastrophically bad: both 16-byte vpshufb masks are rebuilt on the stack with immediate moves and pushes, on every iteration. To put it in perspective, this code runs at 3.2 GB/s without this problem and at 1.1 GB/s with it - even though the loop without this code sequence is actually pretty large, ~70 SSE/AVX instructions plus some amount of loop/branch scaffolding that v8 emits. This shuffle is merely a small piece of a larger transform, and it alone wreaks havoc on the performance.
Now, clearly the instruction sequence doesn't have to be this bad. However, note that this is a shuffle with two distinct source operands - so even if v8 pre-computed the shuffle masks and loaded them from memory, the result would still be something like
vpshufb reg, reg, [rip + off1]
vpshufb reg1, reg1, [rip + off2]
vpor reg, reg, reg1
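The reason two vpshufbs are needed at all: pshufb reads a single source, so a generic two-source byte shuffle has to be emulated by shuffling each source separately - with out-of-range indices mapped to 0x80 so those bytes come out as zero - and merging the halves with por. A minimal sketch in SSE intrinsics (not v8's actual lowering):

#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */

/* mask_a selects bytes from a (0x80 in lanes that come from b);
   mask_b selects bytes from b (0x80 in lanes that come from a);
   pshufb zeroes any byte whose mask byte has the top bit set. */
static inline __m128i shuffle_two_sources(__m128i a, __m128i b,
                                          __m128i mask_a, __m128i mask_b)
{
    return _mm_or_si128(_mm_shuffle_epi8(a, mask_a),
                        _mm_shuffle_epi8(b, mask_b));
}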
So LLVM actively pessimizes the code that the programmer writes. I've hit this issue so many times, and every time it takes a while to figure out how to apply a fragile workaround - in this case I had to mark zmask as volatile, paying an extra load from stack memory just to make sure LLVM doesn't do anything stupid.
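The workaround looks something like this (a sketch - the exact shape needed to defeat the combine may vary by LLVM version, and zmask_v is just a name I picked here):

/* Routing the mask through a volatile local hides the fact that the
   top bytes of each lane are dead, so LLVM leaves the shuffle indices
   alone - at the cost of a reload from the stack. */
volatile v128_t zmask_v = wasm_i32x4_splat(0x7fff);
v128_t zmask = zmask_v; /* forced stack load */
v128_t z4 = wasmx_shuffle_v32x4(n4_0, n4_1, 1, 3, 1, 3);
v128_t zf = wasm_v128_and(z4, zmask);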
I know we've discussed that adding target-specific logic to LLVM transforms seems problematic, since that's a lot of code to maintain. Have we considered not combining shuffles at all? I have yet to see evidence that LLVM combining shuffles without being aware of the target platform can produce beneficial results.