JIT: Allow embedded broadcast regardless of intrinsic base type #117700

saucecontrol · 2025-07-16T07:14:38Z

Use the instruction table data to determine whether a given constant can be converted to embedded broadcast, rather than the intrinsic's base type.
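A minimal sketch of the idea in the description, assuming a simplified table; this is not the JIT's actual code. The `Instr` enum, `kBroadcastSize`, `repeatsEvery`, and `canUseEmbeddedBroadcast` are hypothetical names made up for illustration. The point is that the broadcast element size is looked up per instruction, so the intrinsic's base type never enters the decision.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical instruction IDs; these are not the JIT's real enum values.
enum class Instr { Vpdpbusd, Vpandd, Vpandq, Count };

// Hypothetical per-instruction table: element size in bytes loaded by the
// embedded-broadcast form. Values follow the disassembly quoted in this PR.
static const unsigned kBroadcastSize[static_cast<size_t>(Instr::Count)] = {
    4, // Vpdpbusd: dword ptr {1to4}
    4, // Vpandd:   dword ptr {1to4}
    8, // Vpandq:   qword ptr {1to2}
};

// True when the len-byte constant is one elemSize-byte element repeated.
inline bool repeatsEvery(const uint8_t* data, size_t len, size_t elemSize)
{
    if (elemSize == 0 || elemSize >= len || len % elemSize != 0)
        return false;
    // Comparing the buffer with itself shifted by one element checks every
    // element against its predecessor.
    return std::memcmp(data, data + elemSize, len - elemSize) == 0;
}

// The decision depends only on the instruction's table entry and the bytes
// of the constant, not on the base type the intrinsic was declared with.
inline bool canUseEmbeddedBroadcast(Instr ins, const uint8_t* data, size_t len)
{
    return repeatsEvery(data, len, kBroadcastSize[static_cast<size_t>(ins)]);
}
```

Under these assumptions, a 16-byte constant of all `0x01` bytes qualifies for `Vpdpbusd` because it repeats every 4 bytes, even though each logical element is a single byte.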

dotnet-policy-service · 2025-07-16T07:15:30Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

src/coreclr/jit/instrsxarch.h

src/coreclr/jit/lowerxarch.cpp

tannergooding · 2025-07-16T16:21:09Z

Looks like there are some regressions related to certain bitwise instructions (smaller encoding, but larger data size; we want to prefer smaller data size for improved cache locality):

-       vpandq   ymm0, ymm0, qword ptr [reloc @RWD144] {1to4}
+       vpand    ymm0, ymm0, ymmword ptr [reloc @RWD160]

The few bitwise instructions (including vpternlog) are swappable between the 32-bit and 64-bit base types. Either is fine, since we can also switch to the other one if there is a scenario that supports embedded masking.
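To illustrate why the two base types are interchangeable for bitwise operations: AND, OR, and XOR act on each bit independently, so the element size only affects how the broadcast constant is stored and how a mask would be applied. The sketch below uses hypothetical helper names (`tryShrinkToDword`, `expandToQword`) and is not taken from the JIT.

```cpp
#include <cstdint>

// Shrink: a 64-bit broadcast element whose two halves are identical can be
// stored as a single 32-bit element instead. Returns true and writes the
// dword on success; leaves *dword untouched otherwise.
inline bool tryShrinkToDword(uint64_t qword, uint32_t* dword)
{
    const uint32_t lo = static_cast<uint32_t>(qword);
    const uint32_t hi = static_cast<uint32_t>(qword >> 32);
    if (lo != hi)
        return false;
    *dword = lo;
    return true;
}

// Expand: duplicate a 32-bit element into both halves of a 64-bit element,
// producing the same bit pattern once broadcast across the vector.
inline uint64_t expandToQword(uint32_t dword)
{
    return (static_cast<uint64_t>(dword) << 32) | dword;
}
```

For instance, the qword constant `0000000100000001h` shrinks to the dword `00000001h`, and expanding that dword yields the original qword again.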

saucecontrol · 2025-07-16T16:39:48Z

> Looks like there are some regressions related to certain bitwise instructions (smaller encoding, but larger data size;

I was fixing it while you were reviewing 😄.

I think this will end up being zero diff for the SPMI collections, but in addition to allowing for removal of the special-casing for GFNI instructions, this catches a few others where the broadcast size doesn't match the base type. For example:

static Vector128<int> vnni(Vector128<byte> v) =>
    AvxVnni.MultiplyWideningAndAdd(Vector128<int>.Zero, v, Vector128<sbyte>.One);

 G_M42192_IG02:  ;; offset=0x0000
        C5F857C0             vxorps   xmm0, xmm0, xmm0
        C5F8100A             vmovups  xmm1, xmmword ptr [rdx]
-       C4E27150050F000000   vpdpbusd xmm0, xmm1, xmmword ptr [reloc @RWD00]
+       62F2751850050E000000 vpdpbusd xmm0, xmm1, dword ptr [reloc @RWD00] {1to4}
        C5F81101             vmovups  xmmword ptr [rcx], xmm0
        488BC1               mov      rax, rcx
-						;; size=24 bbWeight=1 PerfScore 12.58
+						;; size=25 bbWeight=1 PerfScore 12.58
 
-G_M42192_IG03:  ;; offset=0x0018
+G_M42192_IG03:  ;; offset=0x0019
        C3                   ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
-RWD00  	dq	0101010101010101h, 0101010101010101h
-; Total bytes of code: 25
+RWD00  	dd	01010101h
+; Total bytes of code: 26

saucecontrol · 2025-07-16T19:12:06Z

SPMI shows zero diffs, but as noted above, there are a few intrinsics with base types that don't match the instruction's broadcast size, and those light up with this change.

src/coreclr/jit/lowerxarch.cpp

tannergooding

Changes LGTM and this is a positive simplification, even with zero diffs.

I did note some handling that's "missing" and could be added if we wanted to try and get some positive diffs as part of this.

src/coreclr/jit/lowerxarch.cpp

saucecontrol · 2025-07-22T22:35:34Z

The latest commit implements the suggested optimizations. Diffs show some places where code size is the same but data size is smaller. There are also patterns where we can now use embedded masking that would previously have been blocked.

Example where shrinking the broadcast enables use of a dword-granularity masked op:

static Vector128<uint> mask_andd(Vector128<ulong> v, Vector128<uint> w) => Vector128.ConditionalSelect(
    Vector128.GreaterThan(w, Vector128<uint>.Zero),
    (v & Vector128<uint>.One.AsUInt64()).AsUInt32(),
    Vector128<uint>.Zero
);

 G_M4931_IG02:  ;; offset=0x0000
        vmovups  xmm0, xmmword ptr [r8]
        vxorps   xmm1, xmm1, xmm1
        vpcmpud  k1, xmm0, xmm1, 6
        vmovups  xmm0, xmmword ptr [rdx]
-       vpandq   xmm0, xmm0, qword ptr [reloc @RWD00] {1to2}
-       vpblendmd xmm0 {k1}{z}, xmm0, xmm0
+       vpand    xmm0 {k1}{z}, xmm0, dword ptr [reloc @RWD00] {1to4}
        vmovups  xmmword ptr [rcx], xmm0
        mov      rax, rcx
-						;; size=43 bbWeight=1 PerfScore 15.92
+						;; size=37 bbWeight=1 PerfScore 15.58
 
-G_M4931_IG03:  ;; offset=0x002B
+G_M4931_IG03:  ;; offset=0x0025
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
-RWD00  	dq	0000000100000001h
-; Total bytes of code: 44
+RWD00  	dd	00000001h
+; Total bytes of code: 38

And where expanding the broadcast enables use of a qword-granularity masked op:

static Vector128<ulong> mask_andq(Vector128<uint> v, Vector128<ulong> w) => Vector128.ConditionalSelect(
    Vector128.GreaterThan(w, Vector128<ulong>.Zero),
    (v & Vector128<uint>.One).AsUInt64(),
    Vector128<ulong>.Zero
);

 G_M6891_IG02:  ;; offset=0x0000
        vmovups  xmm0, xmmword ptr [r8]
        vxorps   xmm1, xmm1, xmm1
        vpcmpuq  k1, xmm0, xmm1, 6
        vmovups  xmm0, xmmword ptr [rdx]
-       vpandd   xmm0, xmm0, dword ptr [reloc @RWD00] {1to4}
-       vpblendmq xmm0 {k1}{z}, xmm0, xmm0
+       vpandq   xmm0 {k1}{z}, xmm0, qword ptr [reloc @RWD00] {1to2}
        vmovups  xmmword ptr [rcx], xmm0
        mov      rax, rcx
-						;; size=43 bbWeight=1 PerfScore 15.92
+						;; size=37 bbWeight=1 PerfScore 15.58
 
-G_M6891_IG03:  ;; offset=0x002B
+G_M6891_IG03:  ;; offset=0x0025
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
-RWD00  	dd	00000001h
-; Total bytes of code: 44
+RWD00  	dq	0000000100000001h
+; Total bytes of code: 38
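The two examples above suggest a selection rule along these lines: match the broadcast element size to the mask's granularity when a mask is present, and otherwise prefer the smaller constant. The following is a sketch under that assumption, with a hypothetical helper name; it is not the logic from the commit itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not the JIT's implementation. Picks the broadcast
// element size in bytes (4 or 8) for a bitwise op on a constant vector, or
// returns 0 when the constant is not a repeated 4- or 8-byte pattern.
// maskElemSize is the element size of the mask the result feeds (4 or 8),
// or 0 when no mask is involved.
inline unsigned pickBitwiseBroadcastSize(const uint8_t* data, size_t len, unsigned maskElemSize)
{
    // True when the constant is one n-byte element repeated across len bytes.
    auto repeats = [=](size_t n) {
        return n < len && len % n == 0 && std::memcmp(data, data + n, len - n) == 0;
    };

    // A qword mask needs the qword form, so expand when the data allows it.
    if (maskElemSize == 8 && repeats(8))
        return 8;
    // Otherwise prefer the smallest data, which also suits a dword mask.
    if (repeats(4))
        return 4;
    // Fall back to qword; a dword mask could not be folded in this case.
    if (repeats(8))
        return 8;
    return 0;
}
```

With a constant of repeated `00000001h` dwords, this returns 4 for a dword mask or no mask, and 8 for a qword mask, which mirrors the `mask_andd` and `mask_andq` diffs.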

src/coreclr/jit/gentree.cpp

EgorBo · 2025-09-15T07:36:13Z

/azp list

EgorBo · 2025-09-15T07:37:17Z

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr jitstress-isas-x86, Fuzzlyn

EgorBo · 2025-09-15T07:37:31Z

@MihuBot

azure-pipelines · 2025-09-15T07:37:40Z

Azure Pipelines successfully started running 4 pipeline(s).

github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Jul 16, 2025

dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) Jul 16, 2025

saucecontrol mentioned this pull request Jul 16, 2025

JIT: Allow more containment opts in Tier0 #117622

Merged

tannergooding reviewed Jul 16, 2025

src/coreclr/jit/instrsxarch.h

tannergooding reviewed Jul 16, 2025

src/coreclr/jit/lowerxarch.cpp

allow embedded broadcast regardless of intrinsic base type

e164839

saucecontrol force-pushed the more-broadcast branch from e59b5f4 to e164839 July 16, 2025 16:26

newlines

9fe4954

saucecontrol marked this pull request as ready for review July 16, 2025 19:11

tannergooding reviewed Jul 17, 2025

src/coreclr/jit/lowerxarch.cpp (outdated)

tannergooding approved these changes Jul 17, 2025

tannergooding requested a review from EgorBo July 17, 2025 05:43

saucecontrol added 3 commits July 17, 2025 10:22

Merge remote-tracking branch 'upstream/main' into more-broadcast

3ed331e

try 8-byte where possible

902f913

limit to instructions currently substituted in codegen

5b5db3a

tannergooding reviewed Jul 17, 2025

src/coreclr/jit/lowerxarch.cpp

saucecontrol added 2 commits July 17, 2025 14:57

Revert "limit to instructions currently substituted in codegen"

ec87232

This reverts commit 5b5db3a.

Revert "try 8-byte where possible"

b121de3

This reverts commit 902f913.

build-analysis bot mentioned this pull request Jul 18, 2025

JIT assert in ILC in test runs: ILC: Assertion failed 'operand->isUsedFromReg()' during 'Generate code' #117795

Closed

saucecontrol added 2 commits July 21, 2025 21:26

Merge remote-tracking branch 'upstream/main' into more-broadcast

0c0ab25

optimize for broadcast size and mask use potential

260e28c

build-analysis bot mentioned this pull request Jul 23, 2025

slow macOS - "##[error]The job running on agent Azure Pipelines 9 ran longer than the maximum time of 60 minutes." dotnet/dnceng#1883

Open

3 tasks

saucecontrol added 2 commits August 7, 2025 18:02

Merge branch 'main' into more-broadcast

f114471

Merge branch 'main' into more-broadcast

6383f41

EgorBo reviewed Sep 15, 2025

src/coreclr/jit/gentree.cpp

Merge branch 'main' into more-broadcast

9616b73

MihuBot mentioned this pull request Sep 15, 2025

[JitDiff X64] [saucecontrol] JIT: Allow embedded broadcast regardless of int ... MihuBot/runtime-utils#1492

Open

EgorBo approved these changes Sep 15, 2025

EgorBo enabled auto-merge (squash) September 15, 2025 21:58

EgorBo merged commit c6f83d3 into dotnet:main Sep 15, 2025
195 of 200 checks passed

saucecontrol deleted the more-broadcast branch September 15, 2025 22:02

github-actions bot locked and limited conversation to collaborators Oct 16, 2025
