This repository was archived by the owner on Dec 22, 2021. It is now read-only.

LLVM's "optimization" of shuffles penalizes x64 codegen #196

Closed
@zeux


So we've discussed this previously at some point on the WASM call, but I keep hitting this issue and I feel like something must be done.

Briefly, the issue is as follows:

  • WASM SIMD shuffle with constant indices is a general byte shuffle
  • v8 codegen for the shuffle tries to pattern match a set of common SSE shuffles, and failing that, emits a really terrible code sequence
  • LLVM often rewrites shuffles that do form a recognizable pattern into ones that don't. This optimization is harmful and breaks pattern matching in v8 - and, presumably, in other engines.

Here's a recent example I hit. Given this code that tries to extract a 16-bit component from each of the four 64-bit halves of two vectors:

const v128_t zmask = wasm_i32x4_splat(0x7fff);

v128_t z4 = wasmx_shuffle_v32x4(n4_0, n4_1, 1, 3, 1, 3);
v128_t zf = wasm_v128_and(z4, zmask);

With wasmx_shuffle_v32x4 defined as follows to simulate SSE2 shufps:

#define wasmx_shuffle_v32x4(v, w, i, j, k, l) wasm_v8x16_shuffle(v, w, 4 * i, 4 * i + 1, 4 * i + 2, 4 * i + 3, 4 * j, 4 * j + 1, 4 * j + 2, 4 * j + 3, 16 + 4 * k, 16 + 4 * k + 1, 16 + 4 * k + 2, 16 + 4 * k + 3, 16 + 4 * l, 16 + 4 * l + 1, 16 + 4 * l + 2, 16 + 4 * l + 3)
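To make the index arithmetic concrete, here is a minimal scalar sketch of what the macro expands to. `expand_v32x4` is a hypothetical helper written for illustration only; it is not part of `wasm_simd128.h` and merely mirrors the macro's arithmetic in plain C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): computes the 16 byte indices
   that wasmx_shuffle_v32x4(v, w, i, j, k, l) passes to the byte shuffle.
   Lanes i and j select from the first vector (bytes 0..15),
   lanes k and l select from the second vector (bytes 16..31). */
static void expand_v32x4(int i, int j, int k, int l, uint8_t out[16])
{
    const int lane[4] = { i, j, k, l };
    for (int n = 0; n < 4; ++n) {
        assert(lane[n] >= 0 && lane[n] < 4);
        /* The last two result lanes read from the second vector. */
        int base = 4 * lane[n] + (n >= 2 ? 16 : 0);
        for (int b = 0; b < 4; ++b)
            out[4 * n + b] = (uint8_t)(base + b);
    }
}
```

For the `(1, 3, 1, 3)` call above this gives byte indices 4..7, 12..15, 20..23, 28..31, i.e. four whole 32-bit lanes, which is the shape a shufps-style pattern matcher looks for.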

LLVM notices that the results of the shuffle have two bytes in each 32-bit lane unused and "optimizes" the shuffle to:

v8x16.shuffle 0x00000504 0x00000d0c 0x00001514 0x00001d1c
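A scalar sketch of why LLVM considers the rewritten shuffle equivalent once the mask is applied. It assumes the four immediates above are 32-bit little-endian words (consistent with the mask constants in the V8 output in this issue); all function names here are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Split four 32-bit immediates into 16 byte indices, least significant
   byte first (assumed little-endian layout). */
static void decode_shuffle_imm(const uint32_t imm[4], uint8_t out[16])
{
    for (int n = 0; n < 4; ++n)
        for (int b = 0; b < 4; ++b)
            out[4 * n + b] = (uint8_t)(imm[n] >> (8 * b));
}

/* Scalar model of a two-input byte shuffle:
   indices 0..15 pick from a, indices 16..31 pick from b. */
static void shuffle_bytes(const uint8_t a[16], const uint8_t b[16],
                          const uint8_t idx[16], uint8_t out[16])
{
    for (int n = 0; n < 16; ++n) {
        assert(idx[n] < 32);
        out[n] = idx[n] < 16 ? a[idx[n]] : b[idx[n] - 16];
    }
}

/* Scalar model of AND with i32x4 splat(0x7fff): in each 32-bit lane,
   keep 15 low bits and clear the upper two bytes. */
static void and_zmask(uint8_t v[16])
{
    for (int n = 0; n < 4; ++n) {
        v[4 * n + 1] &= 0x7f;
        v[4 * n + 2] = 0;
        v[4 * n + 3] = 0;
    }
}
```

Decoding the immediates gives indices 4, 5, 0, 0, 12, 13, 0, 0, 20, 21, 0, 0, 28, 29, 0, 0: the upper two bytes of each lane, which the mask discards anyway, are replaced with index 0. The result after the AND is identical, but the shuffle no longer moves whole 32-bit lanes.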

V8 doesn't recognize this as one of the shuffles it pattern-matches, and generates this:

000000E5ED349504   204  4989e1         REX.W movq r9,rsp
000000E5ED349507   207  4883e4f0       REX.W andq rsp,0xf0
000000E5ED34950B   20b  c4417810f8     vmovups xmm15,xmm8
000000E5ED349510   210  49ba8080000080800000 REX.W movq r10,0000808000008080
000000E5ED34951A   21a  4152           push r10
000000E5ED34951C   21c  49ba040500000c0d0000 REX.W movq r10,00000D0C00000504
000000E5ED349526   226  4152           push r10
000000E5ED349528   228  c46201003c24   vpshufb xmm15,xmm15,[rsp]
000000E5ED34952E   22e  450f10e1       movups xmm12,xmm9
000000E5ED349532   232  49ba040580800c0d8080 REX.W movq r10,80800D0C80800504
000000E5ED34953C   23c  4152           push r10
000000E5ED34953E   23e  49ba8080808080808080 REX.W movq r10,8080808080808080
000000E5ED349548   248  4152           push r10
000000E5ED34954A   24a  c46219002424   vpshufb xmm12,xmm12,[rsp]
000000E5ED349550   250  c44119ebe7     vpor xmm12,xmm12,xmm15
000000E5ED349555   255  498be1         REX.W movq rsp,r9

This code sequence is catastrophically bad. To put it in perspective, the code runs at 3.2 GB/s without this problem and at 1.1 GB/s with it. That is even though the loop without this code sequence is already pretty large: ~70 SSE/AVX instructions plus some amount of loop/branch scaffolding that v8 emits. This shuffle is merely a small piece of a larger transform, and it alone wreaks havoc on the performance.

Now, clearly the instruction sequence doesn't have to be this bad. However, note that this is actually a shuffle with two distinct arguments - so even if v8 could pre-compute the shuffle masks, this would still result in something like

vpshufb reg, reg, [rip + off1]
vpshufb reg1, reg1, [rip + off2]
vpor reg, reg, reg1
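As a rough scalar model of that fallback (a sketch, not V8's actual implementation; all names below are hypothetical), a general two-input byte shuffle splits into one pshufb mask per input, with 0x80 marking bytes that belong to the other input, and the two partial results are OR-ed together:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pshufb: a mask byte with bit 7 set produces 0,
   otherwise its low 4 bits index into src. */
static void pshufb_model(const uint8_t src[16], const uint8_t mask[16],
                         uint8_t out[16])
{
    for (int n = 0; n < 16; ++n)
        out[n] = (mask[n] & 0x80) ? 0 : src[mask[n] & 0x0f];
}

/* Split 16 shuffle indices (0..31) into one pshufb mask per input. */
static void split_shuffle(const uint8_t idx[16],
                          uint8_t mask_a[16], uint8_t mask_b[16])
{
    for (int n = 0; n < 16; ++n) {
        assert(idx[n] < 32);
        if (idx[n] < 16) {
            mask_a[n] = idx[n];
            mask_b[n] = 0x80;                      /* zero this byte in b's result */
        } else {
            mask_a[n] = 0x80;                      /* zero this byte in a's result */
            mask_b[n] = (uint8_t)(idx[n] - 16);
        }
    }
}

/* Two pshufb plus one por, matching the three-instruction sequence above. */
static void shuffle_two_pshufb(const uint8_t a[16], const uint8_t b[16],
                               const uint8_t idx[16], uint8_t out[16])
{
    uint8_t mask_a[16], mask_b[16], ta[16], tb[16];
    split_shuffle(idx, mask_a, mask_b);
    pshufb_model(a, mask_a, ta);
    pshufb_model(b, mask_b, tb);
    for (int n = 0; n < 16; ++n)
        out[n] = ta[n] | tb[n];
}
```

For the indices LLVM produced here, `split_shuffle` yields the same mask bytes as the constants V8 pushes onto the stack in the listing above. The two masks are necessarily different, so even with precomputed constants this costs two memory operands and three instructions where the original shuffle needed one.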

So LLVM actively pessimizes the code that the programmer writes. I've hit this issue so many times and every time it takes time to figure out how to apply a fragile workaround - in this case I had to mark zmask as volatile, spending an extra stack memory load just to make sure LLVM doesn't do anything stupid.

I know we discussed that adding target-specific logic to LLVM transforms seems problematic, since that's a lot of code to maintain. Have we considered not optimizing shuffles at all? I've yet to see evidence that LLVM combining shuffles without being aware of the target platform can produce beneficial results.
