Skip to content

Conversation

@saucecontrol
Copy link
Member

Use the instruction table data to determine whether a given constant can be converted to embedded broadcast, rather than the intrinsic's base type.

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 16, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jul 16, 2025
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding
Copy link
Member

Looks like there's some regressions related to certain bitwise instructions (smaller encoding, but larger data size; we want to prefer smaller data size for improved cache locality):

-       vpandq   ymm0, ymm0, qword ptr [reloc @RWD144] {1to4}
+       vpand    ymm0, ymm0, ymmword ptr [reloc @RWD160]

The few bitwise instructions (including vpternlog) are swappable between the 32-bit and 64-bit base type. Either is fine since we can also update it to be the other if there is an embedded masking support scenario.

@saucecontrol
Copy link
Member Author

saucecontrol commented Jul 16, 2025

Looks like there's some regressions related to certain bitwise instructions (smaller encoding, but larger data size;

I was fixing it while you were reviewing 😄.

I think this will end up being zero diff for the SPMI collections, but in addition to allowing for removal of the special-casing for GFNI instructions, this catches a few others where broadcast size doesn't match base type. ex:

static Vector128<int> vnni(Vector128<byte> v) =>
    AvxVnni.MultiplyWideningAndAdd(Vector128<int>.Zero, v, Vector128<sbyte>.One);
 G_M42192_IG02:  ;; offset=0x0000
        C5F857C0             vxorps   xmm0, xmm0, xmm0
        C5F8100A             vmovups  xmm1, xmmword ptr [rdx]
-       C4E27150050F000000   vpdpbusd xmm0, xmm1, xmmword ptr [reloc @RWD00]
+       62F2751850050E000000 vpdpbusd xmm0, xmm1, dword ptr [reloc @RWD00] {1to4}
        C5F81101             vmovups  xmmword ptr [rcx], xmm0
        488BC1               mov      rax, rcx
-						;; size=24 bbWeight=1 PerfScore 12.58
+						;; size=25 bbWeight=1 PerfScore 12.58
 
-G_M42192_IG03:  ;; offset=0x0018
+G_M42192_IG03:  ;; offset=0x0019
        C3                   ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
-RWD00  	dq	0101010101010101h, 0101010101010101h
-; Total bytes of code: 25
+RWD00  	dd	01010101h
+; Total bytes of code: 26

@saucecontrol saucecontrol marked this pull request as ready for review July 16, 2025 19:11
@saucecontrol
Copy link
Member Author

SPMI shows zero diffs but as noted above, there are a few intrinsics with base types that don't match the instruction's broadcast size that light up with this change.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM and this is a positive simplification, even with zero diffs.

I did note some handling that's "missing" and could be added if we wanted to try and get some positive diffs as part of this.

@tannergooding tannergooding requested a review from EgorBo July 17, 2025 05:43
@saucecontrol
Copy link
Member Author

saucecontrol commented Jul 22, 2025

Latest commit implements the suggested optimizations. Diffs show some places where code size is the same but data size is smaller. There are also patterns where we can now use embedded masking that would have been blocked.

Example where shrinking the broadcast enables use of a dword-granularity masked op:

static Vector128<uint> mask_andd(Vector128<ulong> v, Vector128<uint> w) => Vector128.ConditionalSelect(
    Vector128.GreaterThan(w, Vector128<uint>.Zero),
    (v & Vector128<uint>.One.AsUInt64()).AsUInt32(),
    Vector128<uint>.Zero
);
 G_M4931_IG02:  ;; offset=0x0000
        vmovups  xmm0, xmmword ptr [r8]
        vxorps   xmm1, xmm1, xmm1
        vpcmpud  k1, xmm0, xmm1, 6
        vmovups  xmm0, xmmword ptr [rdx]
-       vpandq   xmm0, xmm0, qword ptr [reloc @RWD00] {1to2}
-       vpblendmd xmm0 {k1}{z}, xmm0, xmm0
+       vpand    xmm0 {k1}{z}, xmm0, dword ptr [reloc @RWD00] {1to4}
        vmovups  xmmword ptr [rcx], xmm0
        mov      rax, rcx
-						;; size=43 bbWeight=1 PerfScore 15.92
+						;; size=37 bbWeight=1 PerfScore 15.58
 
-G_M4931_IG03:  ;; offset=0x002B
+G_M4931_IG03:  ;; offset=0x0025
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
-RWD00  	dq	0000000100000001h
-; Total bytes of code: 44
+RWD00  	dd	00000001h
+; Total bytes of code: 38
 

And where expanding the broadcast enables use of a qword-granularity masked op:

static Vector128<ulong> mask_andq(Vector128<uint> v, Vector128<ulong> w) => Vector128.ConditionalSelect(
    Vector128.GreaterThan(w, Vector128<ulong>.Zero),
    (v & Vector128<uint>.One).AsUInt64(),
    Vector128<ulong>.Zero
);
 G_M6891_IG02:  ;; offset=0x0000
        vmovups  xmm0, xmmword ptr [r8]
        vxorps   xmm1, xmm1, xmm1
        vpcmpuq  k1, xmm0, xmm1, 6
        vmovups  xmm0, xmmword ptr [rdx]
-       vpandd   xmm0, xmm0, dword ptr [reloc @RWD00] {1to4}
-       vpblendmq xmm0 {k1}{z}, xmm0, xmm0
+       vpandq   xmm0 {k1}{z}, xmm0, qword ptr [reloc @RWD00] {1to2}
        vmovups  xmmword ptr [rcx], xmm0
        mov      rax, rcx
-						;; size=43 bbWeight=1 PerfScore 15.92
+						;; size=37 bbWeight=1 PerfScore 15.58
 
-G_M6891_IG03:  ;; offset=0x002B
+G_M6891_IG03:  ;; offset=0x0025
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
-RWD00  	dd	00000001h
-; Total bytes of code: 44
+RWD00  	dq	0000000100000001h
+; Total bytes of code: 38
 

@EgorBo
Copy link
Member

EgorBo commented Sep 15, 2025

/azp list

@azure-pipelines

This comment was marked as resolved.

@EgorBo
Copy link
Member

EgorBo commented Sep 15, 2025

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr jitstress-isas-x86, Fuzzlyn

@EgorBo
Copy link
Member

EgorBo commented Sep 15, 2025

@MihuBot

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@EgorBo EgorBo enabled auto-merge (squash) September 15, 2025 21:58
@EgorBo EgorBo merged commit c6f83d3 into dotnet:main Sep 15, 2025
195 of 200 checks passed
@saucecontrol saucecontrol deleted the more-broadcast branch September 15, 2025 22:02
@github-actions github-actions bot locked and limited conversation to collaborators Oct 16, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants