Canonical sign-replication operation

Sign-replication is an often-used operation that replicates the sign bit of a SIMD lane into all bits of the lane. There are two reasons why we need to pay attention to sign-replication. First, it can be encoded in WebAssembly SIMD in several ways:
- `i8x16.shr_s(v, -1)`/`i16x8.shr_s(v, -1)`/`i32x4.shr_s(v, -1)`/`i64x2.shr_s(v, -1)`
- `i8x16.shr_s(v, 7)`/`i16x8.shr_s(v, 15)`/`i32x4.shr_s(v, 31)`/`i64x2.shr_s(v, 63)`
- `i8x16.neg(i8x16.shr_u(v, -1))`/`i16x8.neg(i16x8.shr_u(v, -1))`/`i32x4.neg(i32x4.shr_u(v, -1))`/`i64x2.neg(i64x2.shr_u(v, -1))`
- `i8x16.neg(i8x16.shr_u(v, 7))`/`i16x8.neg(i16x8.shr_u(v, 15))`/`i32x4.neg(i32x4.shr_u(v, 31))`/`i64x2.neg(i64x2.shr_u(v, 63))`
- `i8x16.lt_s(v, v128.const(0))`/`i16x8.lt_s(v, v128.const(0))`/`i32x4.lt_s(v, v128.const(0))`/`i64x2.lt_s(v, v128.const(0))`

Second, sign-replication can be lowered in many ways depending on the data type and the target instruction set, as noted by @jan-wassenberg in #124.
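
To make the equivalence of these encodings concrete, here is a minimal scalar sketch in C that models a single 8-bit lane. It assumes the WebAssembly rule that SIMD shift counts are taken modulo the lane width, so a count of `-1` acts like `7`. The helper names (`wasm_count8`, `shr_s8`, `shr_u8`, `neg8`, `lt_s8`) are made up for illustration and are not part of any engine or toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* WebAssembly takes SIMD shift counts modulo the lane width,
   so a count of -1 acts like 7 on 8-bit lanes. */
unsigned wasm_count8(int32_t count) { return (unsigned)((uint32_t)count % 8u); }

/* Reinterpret an 8-bit pattern as a signed lane without relying on
   implementation-defined narrowing conversions. */
int8_t to_i8(uint8_t u) { return (int8_t)(u < 128u ? (int)u : (int)u - 256); }

/* Model of one lane of i8x16.shr_u. */
int8_t shr_u8(int8_t v, int32_t count) {
    return to_i8((uint8_t)((uint8_t)v >> wasm_count8(count)));
}

/* Model of one lane of i8x16.shr_s: logical shift plus sign fill. */
int8_t shr_s8(int8_t v, int32_t count) {
    unsigned n = wasm_count8(count);
    uint8_t u = (uint8_t)v;
    uint8_t fill = (u & 0x80u) ? (uint8_t)(0xFFu << (8u - n)) : (uint8_t)0;
    return to_i8((uint8_t)((u >> n) | fill));
}

/* Model of one lane of i8x16.neg (wrapping negation). */
int8_t neg8(int8_t v) { return to_i8((uint8_t)(0u - (uint8_t)v)); }

/* Model of one lane of i8x16.lt_s: all ones if a < b, else zero. */
int8_t lt_s8(int8_t a, int8_t b) { return (int8_t)(a < b ? -1 : 0); }

/* Checks every possible lane value against all five encodings. */
void check_i8_encodings_agree(void) {
    for (int i = -128; i <= 127; ++i) {
        int8_t v = (int8_t)i;
        int8_t want = lt_s8(v, 0);
        assert(shr_s8(v, -1) == want);
        assert(shr_s8(v, 7) == want);
        assert(neg8(shr_u8(v, -1)) == want);
        assert(neg8(shr_u8(v, 7)) == want);
    }
}
```

Calling `check_i8_encodings_agree()` from a test driver walks all 256 lane values. The wider lane types follow the same pattern with the lane width substituted.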

My suggestion is:
- To standardize `i8x16.shr_s(v, -1)`/`i16x8.shr_s(v, -1)`/`i32x4.shr_s(v, -1)`/`i64x2.shr_s(v, -1)` and `i8x16.shr_s(v, 7)`/`i16x8.shr_s(v, 15)`/`i32x4.shr_s(v, 31)`/`i64x2.shr_s(v, 63)` as the canonical sign-replication instructions, and recommend that WebAssembly engines lower them differently than other arithmetic shift instructions.
- To provide an informative recommendation on optimal lowering depending on the instruction set (see below).
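
As a rough sketch of what "lowering differently" could mean inside an engine, the C fragment below classifies an `shr_s` by its shift count. It assumes the engine can tell that the count operand is a known constant, and that counts are reduced modulo the lane width as in the WebAssembly SIMD semantics. The type `lowering_kind` and both function names are hypothetical and do not refer to any existing engine's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lowering strategies; the names are illustrative only. */
typedef enum {
    LOWER_GENERIC_SHIFT,  /* ordinary arithmetic-shift lowering */
    LOWER_SIGN_REPLICATE  /* dedicated sign-replication lowering */
} lowering_kind;

/* True when the effective count equals lane_bits - 1.
   lane_bits is expected to be 8, 16, 32, or 64. */
bool is_sign_replication(unsigned lane_bits, int32_t count) {
    return ((uint32_t)count % lane_bits) == lane_bits - 1u;
}

/* Picks a lowering for shr_s; only a constant count can be classified. */
lowering_kind select_shr_s_lowering(unsigned lane_bits, bool count_is_const,
                                    int32_t count) {
    if (count_is_const && is_sign_replication(lane_bits, count)) {
        return LOWER_SIGN_REPLICATE;
    }
    return LOWER_GENERIC_SHIFT;
}

/* A few spot checks of the classification. */
void check_lowering_selection(void) {
    assert(select_shr_s_lowering(8, true, -1) == LOWER_SIGN_REPLICATE);
    assert(select_shr_s_lowering(64, true, 63) == LOWER_SIGN_REPLICATE);
    assert(select_shr_s_lowering(32, true, 15) == LOWER_GENERIC_SHIFT);
    assert(select_shr_s_lowering(32, false, 31) == LOWER_GENERIC_SHIFT);
}
```

Both canonical spellings, `-1` and `lane_bits - 1`, land in the same branch, which is why standardizing on them costs the engine only one check.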

Mapping to Common Instruction Sets
===========================

This section illustrates how the canonical sign-replication instructions can be lowered on common instruction sets. These patterns are provided for convenience only; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX512F and AVX512VL instruction sets
--------------------------------------------------

- **i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)**
  - `y = i64x2.shr_s(x, 63)` is lowered to `VPSRAQ xmm_y, xmm_x, 63`

x86/x86-64 processors with AVX instruction set
--------------------------------------------------

- **i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)**
  - `y = i8x16.shr_s(x, 7)` (`y` is **NOT** `x`) is lowered to:
    - `VPXOR xmm_y, xmm_y, xmm_y`
    - `VPCMPGTB xmm_y, xmm_y, xmm_x`
  - `x = i8x16.shr_s(x, 7)` is lowered to:
    - `VPXOR xmm_tmp, xmm_tmp, xmm_tmp`
    - `VPCMPGTB xmm_x, xmm_tmp, xmm_x`
- **i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)**
  - `y = i16x8.shr_s(x, 15)` is lowered to `VPSRAW xmm_y, xmm_x, 15`
- **i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)**
  - `y = i32x4.shr_s(x, 31)` is lowered to `VPSRAD xmm_y, xmm_x, 31`
- **i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)**
  - `y = i64x2.shr_s(x, 63)` is lowered to:
    - `VPSRAD xmm_y, xmm_x, 31`
    - `VPSHUFD xmm_y, xmm_y, 0xF5`
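
The two-instruction `i64x2` sequence above may not be obvious at first glance, so here is a small scalar model in C. It treats the vector as four 32-bit lanes, models `VPSRAD` with count 31 and `VPSHUFD` with an 8-bit immediate, and shows that `0xF5` copies the high dword of each 64-bit lane over the low dword. The type `vec128` and all function names are invented for this sketch; no real intrinsics are used.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit vector viewed as four 32-bit lanes; d[0] is the lowest lane. */
typedef struct { uint32_t d[4]; } vec128;

vec128 from_i64x2(int64_t lo, int64_t hi) {
    uint64_t l = (uint64_t)lo, h = (uint64_t)hi;
    vec128 v = {{ (uint32_t)l, (uint32_t)(l >> 32),
                  (uint32_t)h, (uint32_t)(h >> 32) }};
    return v;
}

/* Reads 64-bit lane i (0 or 1) back as an unsigned value. */
uint64_t lane_u64(vec128 v, int i) {
    return (uint64_t)v.d[2 * i] | ((uint64_t)v.d[2 * i + 1] << 32);
}

/* Model of VPSRAD with count 31: an arithmetic shift by 31 leaves
   32 copies of each dword's sign bit. */
vec128 vpsrad31(vec128 x) {
    vec128 y;
    for (int i = 0; i < 4; ++i) {
        y.d[i] = (x.d[i] >> 31) ? 0xFFFFFFFFu : 0u;
    }
    return y;
}

/* Model of VPSHUFD: destination lane i takes source lane
   (imm >> (2 * i)) & 3. */
vec128 vpshufd(vec128 x, unsigned imm) {
    vec128 y;
    for (int i = 0; i < 4; ++i) {
        y.d[i] = x.d[(imm >> (2 * i)) & 3u];
    }
    return y;
}

/* 0xF5 selects lanes [1, 1, 3, 3]: the high dword of each qword is
   broadcast across that qword. */
vec128 sign_replicate_i64x2(vec128 x) {
    return vpshufd(vpsrad31(x), 0xF5u);
}

void check_i64x2_lowering(void) {
    /* The low qword has bit 31 set but is non-negative as a 64-bit value. */
    vec128 r = sign_replicate_i64x2(from_i64x2(0x80000000LL, INT64_MIN));
    assert(lane_u64(r, 0) == 0);
    assert(lane_u64(r, 1) == UINT64_MAX);
}
```

The first input in `check_i64x2_lowering` is chosen so that a shift without the shuffle would give the wrong answer for the low dword, which is the case the shuffle is there to fix.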

x86/x86-64 processors with SSE2 instruction set
--------------------------------------------------

- **i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)**
  - `y = i8x16.shr_s(x, 7)` (`y` is **NOT** `x`) is lowered to:
    - `PXOR xmm_y, xmm_y`
    - `PCMPGTB xmm_y, xmm_x`
  - `x = i8x16.shr_s(x, 7)` is lowered to:
    - `MOVDQA xmm_tmp, xmm_x`
    - `PXOR xmm_x, xmm_x`
    - `PCMPGTB xmm_x, xmm_tmp`
- **i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)**
  - `y = i16x8.shr_s(x, 15)` (`y` is **NOT** `x`) is lowered to:
    - `PXOR xmm_y, xmm_y`
    - `PCMPGTW xmm_y, xmm_x`
  - `x = i16x8.shr_s(x, 15)` is lowered to `PSRAW xmm_x, 15`
- **i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)**
  - `y = i32x4.shr_s(x, 31)` (`y` is **NOT** `x`) is lowered to:
    - `PXOR xmm_y, xmm_y`
    - `PCMPGTD xmm_y, xmm_x`
  - `x = i32x4.shr_s(x, 31)` is lowered to `PSRAD xmm_x, 31`
- **i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)**
  - `y = i64x2.shr_s(x, 63)` is lowered to:
    - `PSHUFD xmm_y, xmm_x, 0xF5`
    - `PSRAD xmm_y, 31`
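
The SSE2 patterns above use two different lowerings for the same 16-bit operation: a compare against zero when the destination differs from the source, and an in-place shift otherwise. The scalar C sketch below models one 16-bit lane of each and checks that they agree for every input. The function names are made up for illustration, and the shift model only covers counts 0 through 15.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a 16-bit pattern as a signed value without relying on
   implementation-defined narrowing conversions. */
int16_t to_i16(uint16_t u) {
    return (int16_t)(u < 32768u ? (int)u : (int)u - 65536);
}

/* Model of one lane of PCMPGTW dst, src:
   dst = (dst > src) ? 0xFFFF : 0, using a signed compare. */
uint16_t pcmpgtw_lane(int16_t dst, int16_t src) {
    return (uint16_t)(dst > src ? 0xFFFFu : 0u);
}

/* Model of one lane of PSRAW for counts 0..15:
   logical shift plus sign fill. */
uint16_t psraw_lane(uint16_t x, unsigned n) {
    uint16_t fill = (x & 0x8000u) ? (uint16_t)(0xFFFFu << (16u - n))
                                  : (uint16_t)0;
    return (uint16_t)((x >> n) | fill);
}

/* Counts inputs where "PXOR y, y; PCMPGTW y, x" and "PSRAW x, 15"
   would produce different lanes. */
int count_lowering_mismatches16(void) {
    int bad = 0;
    for (uint32_t u = 0; u <= 0xFFFFu; ++u) {
        uint16_t via_compare = pcmpgtw_lane(0, to_i16((uint16_t)u));
        uint16_t via_shift = psraw_lane((uint16_t)u, 15u);
        if (via_compare != via_shift) ++bad;
    }
    return bad;
}

void check_sse2_i16_lowerings(void) {
    assert(count_lowering_mismatches16() == 0);
}
```

The same argument carries over to the 32-bit pair (`PCMPGTD` against zero versus `PSRAD` by 31).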

ARM64 processors
--------------------------------------------------

- **i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)**
  - `y = i8x16.shr_s(x, 7)` is lowered to `CMLT Vy.16B, Vx.16B, #0`
- **i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)**
  - `y = i16x8.shr_s(x, 15)` is lowered to `CMLT Vy.8H, Vx.8H, #0`
- **i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)**
  - `y = i32x4.shr_s(x, 31)` is lowered to `CMLT Vy.4S, Vx.4S, #0`
- **i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)**
  - `y = i64x2.shr_s(x, 63)` is lowered to `CMLT Vy.2D, Vx.2D, #0`

ARMv7 processors with NEON extension
--------------------------------------------------

- **i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)**
  - `y = i8x16.shr_s(x, 7)` is lowered to `VCLT.S8 Qy, Qx, #0`
- **i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)**
  - `y = i16x8.shr_s(x, 15)` is lowered to `VCLT.S16 Qy, Qx, #0`
- **i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)**
  - `y = i32x4.shr_s(x, 31)` is lowered to `VCLT.S32 Qy, Qx, #0`
- **i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)**
  - `y = i64x2.shr_s(x, 63)` is lowered to `VSHR.S64 Qy, Qx, #63`

