[X86][SelectionDAG] - Add support for llvm.canonicalize intrinsic #106370

Merged: 19 commits, Sep 23, 2024

15 changes: 15 additions & 0 deletions llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -508,6 +508,7 @@ namespace {
SDValue visitFSQRT(SDNode *N);
SDValue visitFCOPYSIGN(SDNode *N);
SDValue visitFPOW(SDNode *N);
SDValue visitFCANONICALIZE(SDNode *N);
SDValue visitSINT_TO_FP(SDNode *N);
SDValue visitUINT_TO_FP(SDNode *N);
SDValue visitFP_TO_SINT(SDNode *N);
@@ -1980,6 +1981,7 @@ SDValue DAGCombiner::visit(SDNode *N) {
case ISD::FREEZE: return visitFREEZE(N);
case ISD::GET_FPENV_MEM: return visitGET_FPENV_MEM(N);
case ISD::SET_FPENV_MEM: return visitSET_FPENV_MEM(N);
case ISD::FCANONICALIZE: return visitFCANONICALIZE(N);
case ISD::VECREDUCE_FADD:
case ISD::VECREDUCE_FMUL:
case ISD::VECREDUCE_ADD:
@@ -2090,6 +2092,19 @@ static SDValue getInputChainForNode(SDNode *N) {
return SDValue();
}

SDValue DAGCombiner::visitFCANONICALIZE(SDNode *N) {
SDValue Operand = N->getOperand(0);
EVT VT = Operand.getValueType();
SDLoc dl(N);

// Canonicalize undef to quiet NaN.
if (Operand.isUndef()) {
APFloat CanonicalQNaN = APFloat::getQNaN(VT.getFltSemantics());
return DAG.getConstantFP(CanonicalQNaN, dl, VT);
}
return SDValue();
}
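
A minimal IR case that this undef fold applies to might look like the following (illustrative only; the function name is made up and this is not one of the patch's test files). With the fold above, the call collapses to a quiet-NaN constant of the operand's type (typically 0x7FC00000 for f32), so no runtime canonicalization code is emitted for it.

define float @fold_canonicalize_undef() {
  ; visitFCANONICALIZE folds canonicalize(undef) into a quiet NaN constant,
  ; so this function should simply return a qNaN immediate.
  %canonicalized = call float @llvm.canonicalize.f32(float undef)
  ret float %canonicalized
}

declare float @llvm.canonicalize.f32(float)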

SDValue DAGCombiner::visitTokenFactor(SDNode *N) {
// If N has two operands, where one has an input chain equal to the other,
// the 'other' chain is redundant.
21 changes: 21 additions & 0 deletions llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -2559,6 +2559,7 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
ISD::STRICT_FMA,
ISD::FMINNUM,
ISD::FMAXNUM,
ISD::FCANONICALIZE,
ISD::SUB,
ISD::LOAD,
ISD::LRINT,
@@ -58159,6 +58160,25 @@ static SDValue combineINTRINSIC_VOID(SDNode *N, SelectionDAG &DAG,
return SDValue();
}

static SDValue combineCanonicalize(SDNode *N, SelectionDAG &DAG) {
Contributor: Add static

Contributor: We prefer to use N

SDValue Operand = N->getOperand(0);
EVT VT = Operand.getValueType();
SDLoc dl(N);

// Canonicalize scalar variable FP Nodes.
SDValue One =
DAG.getNode(ISD::SINT_TO_FP, dl, VT, DAG.getConstant(1, dl, MVT::i32));
Collaborator: If you change MVT::i32 to VT.changeTypeToInteger() I think this should work for vectors as well.

Contributor Author: Should I handle it in a follow-up PR, or do you recommend I do it now?

Collaborator: Doing it in this patch would be simpler, thanks.

Contributor Author: Okay, I'll do it here. Thanks for the suggestion.

Contributor Author (@pawan-nirpal-031, Sep 17, 2024): I tried this suggestion, but I ran into a crash for f80 scalar input. While debugging, I realized that changeTypeToInteger may not be required; with the following change, vector inputs are handled seamlessly.

Change:

-  // Canonicalize scalar variable FP Nodes.
-  SDValue One =
-      DAG.getNode(ISD::SINT_TO_FP, dl, VT, DAG.getConstant(1, dl, MVT::i32));
+  SDValue One = DAG.getConstantFP(1.0, dl, VT);
+

Running via gdb, I get a BUILD_VECTOR as such.

t11: v4f32 = BUILD_VECTOR ConstantFP:f32<1.000000e+00>, ConstantFP:f32<1.000000e+00>, ConstantFP:f32<1.000000e+00>, ConstantFP:f32<1.000000e+00>

input

define <4 x float> @canon_fp32_varargsv4f32(<4 x float> %a) {
  %canonicalized = call <4 x float> @llvm.canonicalize.v4f32(<4 x float> %a)
  ret <4 x float> %canonicalized
}

result

.LCPI9_0:
	.long	0x3f800000                      # float 1
	.long	0x3f800000                      # float 1
	.long	0x3f800000                      # float 1
	.long	0x3f800000                      # float 1
	.text
	.globl	canon_fp32_varargsv4f32
	.p2align	4, 0x90
	.type	canon_fp32_varargsv4f32,@function
canon_fp32_varargsv4f32:                # @canon_fp32_varargsv4f32
	.cfi_startproc
# %bb.0:
	mulps	.LCPI9_0(%rip), %xmm0

input

define <4 x double> @canon_fp64_varargsv4f64(<4 x double> %a) {
  %canonicalized = call <4 x double> @llvm.canonicalize.v4f64(<4 x double> %a)
  ret <4 x double> %canonicalized
}

result

.LCPI10_0:
	.quad	0x3ff0000000000000              # double 1
	.quad	0x3ff0000000000000              # double 1
	.text
	.globl	canon_fp64_varargsv4f64
	.p2align	4, 0x90
	.type	canon_fp64_varargsv4f64,@function
canon_fp64_varargsv4f64:                # @canon_fp64_varargsv4f64
	.cfi_startproc
# %bb.0:
	movapd	.LCPI10_0(%rip), %xmm2          # xmm2 = [1.0E+0,1.0E+0]
	mulpd	%xmm2, %xmm0
	mulpd	%xmm2, %xmm1
	retq

input

define void @vec_canonicalize_x86_fp80(<4 x x86_fp80> addrspace(1)* %out) #1 {
  %val = load <4 x x86_fp80>, <4 x x86_fp80> addrspace(1)* %out
  %canonicalized = call <4 x x86_fp80> @llvm.canonicalize.v4f80(<4 x x86_fp80> %val)
  store <4 x x86_fp80> %canonicalized, <4 x x86_fp80> addrspace(1)* %out
  ret void
}

result

# %bb.0:
	fldt	30(%rdi)
	fldt	20(%rdi)
	fldt	10(%rdi)
	fldt	(%rdi)
	fld1
	fmul	%st, %st(1)
	fmul	%st, %st(2)
	fmul	%st, %st(3)
	fmulp	%st, %st(4)
	fxch	%st(3)
	fstpt	30(%rdi)
	fxch	%st(1)
	fstpt	20(%rdi)
	fstpt	10(%rdi)
	fstpt	(%rdi)
	retq

Contributor: Why not just emit a regular getConstantFP instead of emitting this as an integer cast? This can also just go as the generic lowering implementation.

Contributor: This is also lowering, not a combine; it should not be invoked through PerformDAGCombine.

Contributor Author (@pawan-nirpal-031, Sep 17, 2024): Is this the correct place (i.e. the conditions under which setOperationAction is called) and the correct action (Legal, Custom, or Promote?) for handling these data types?

@@ -331,9 +331,11 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
       setOperationAction(ISD::FP_TO_UINT_SAT, VT, Custom);
       setOperationAction(ISD::FP_TO_SINT_SAT, VT, Custom);
     }
+    setOperationAction(ISD::FCANONICALIZE, MVT::f32, Custom);
     if (Subtarget.is64Bit()) {
       setOperationAction(ISD::FP_TO_UINT_SAT, MVT::i64, Custom);
       setOperationAction(ISD::FP_TO_SINT_SAT, MVT::i64, Custom);
+      setOperationAction(ISD::FCANONICALIZE, MVT::f64, Custom);
     }
   }
 
@@ -708,6 +710,7 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
     setOperationAction(ISD::STRICT_FROUNDEVEN, MVT::f16, Promote);
     setOperationAction(ISD::STRICT_FTRUNC, MVT::f16, Promote);
     setOperationAction(ISD::STRICT_FP_ROUND, MVT::f16, Custom);
+    setOperationAction(ISD::FCANONICALIZE, MVT::f16, Custom);
     setOperationAction(ISD::STRICT_FP_EXTEND, MVT::f32, Custom);
     setOperationAction(ISD::STRICT_FP_EXTEND, MVT::f64, Custom);
 
@@ -924,6 +927,7 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
     if (isTypeLegal(MVT::f80)) {
       setOperationAction(ISD::FP_ROUND, MVT::f80, Custom);
       setOperationAction(ISD::STRICT_FP_ROUND, MVT::f80, Custom);
+      setOperationAction(ISD::FCANONICALIZE, MVT::f80, Custom);
     }
 
     setOperationAction(ISD::SETCC, MVT::f128, Custom);
@@ -1042,6 +1046,11 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
     // No operations on x86mmx supported, everything uses intrinsics.
   }
 

   if (!Subtarget.useSoftFloat() && Subtarget.hasSSE1()) {
     addRegisterClass(MVT::v4f32, Subtarget.hasVLX() ? &X86::VR128XRegClass
                                                     : &X86::VR128RegClass);
@@ -1057,9 +1066,11 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
     setOperationAction(ISD::VSELECT,            MVT::v4f32, Custom);
     setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4f32, Custom);
     setOperationAction(ISD::SELECT,             MVT::v4f32, Custom);
+    setOperationAction(ISD::FCANONICALIZE,      MVT::v4f32, Custom);
 
     setOperationAction(ISD::LOAD,               MVT::v2f32, Custom);
     setOperationAction(ISD::STORE,              MVT::v2f32, Custom);
+    setOperationAction(ISD::FCANONICALIZE,      MVT::v2f32, Custom);
 
     setOperationAction(ISD::STRICT_FADD,        MVT::v4f32, Legal);
     setOperationAction(ISD::STRICT_FSUB,        MVT::v4f32, Legal);
@@ -1120,6 +1131,7 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
     setOperationAction(ISD::UMULO,              MVT::v2i32, Custom);
 
     setOperationAction(ISD::FNEG,               MVT::v2f64, Custom);
+    setOperationAction(ISD::FCANONICALIZE,      MVT::v2f64, Custom);
     setOperationAction(ISD::FABS,               MVT::v2f64, Custom);
     setOperationAction(ISD::FCOPYSIGN,          MVT::v2f64, Custom);
 
@@ -1452,6 +1464,7 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
 
       setOperationAction(ISD::FMAXIMUM,          VT, Custom);
       setOperationAction(ISD::FMINIMUM,          VT, Custom);
+      setOperationAction(ISD::FCANONICALIZE,     VT, Custom);
     }
 
     setOperationAction(ISD::LRINT, MVT::v8f32, Custom);
@@ -1796,6 +1809,7 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
       setOperationAction(ISD::FMA,   VT, Legal);
       setOperationAction(ISD::STRICT_FMA, VT, Legal);
       setOperationAction(ISD::FCOPYSIGN, VT, Custom);
+      setOperationAction(ISD::FCANONICALIZE, VT, Custom);
     }
     setOperationAction(ISD::LRINT, MVT::v16f32,
                        Subtarget.hasDQI() ? Legal : Custom);
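
For reference, a scalar x86_fp80 case covered by the MVT::f80 registration above could look like the following (illustrative sketch, not taken from the patch's tests). Given the multiply-by-1.0 lowering, one would expect x87 code along the lines of fld1 followed by an fmulp, similar to the vector f80 output quoted earlier.

define x86_fp80 @canonicalize_fp80(x86_fp80 %a) {
  ; Scalar f80 goes through the Custom FCANONICALIZE action registered above.
  %canonicalized = call x86_fp80 @llvm.canonicalize.f80(x86_fp80 %a)
  ret x86_fp80 %canonicalized
}

declare x86_fp80 @llvm.canonicalize.f80(x86_fp80)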

// TODO: Fix crash for bf16 when generating strict_fmul, as it
// leads to an error: SoftPromoteHalfResult #0: t11: bf16,ch = strict_fmul t0,
// ConstantFP:bf16<APFloat(16256)>, t5 LLVM ERROR: Do not know how to soft
// promote this operator's result!
SDValue Chain = DAG.getEntryNode();
SDValue StrictFmul = DAG.getNode(ISD::STRICT_FMUL, dl, {VT, MVT::Other},
{Chain, One, Operand});
Contributor: Constant operands should canonically be on the RHS.

Contributor Author: @arsenm @RKSimon friendly ping for re-review ;)

return StrictFmul;
// TODO: Handle vectors.
}
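
For context, a minimal scalar f32 case that exercises this multiply-by-1.0 lowering might look like the following (illustrative; the function name is made up, and the patch's actual coverage is in the .ll test files below). Under SSE this is expected to become a mulss of %a with a constant-pool 1.0, which quiets signaling NaNs.

define float @canon_fp32_scalar(float %a) {
  ; Handled by combineCanonicalize above, which rewrites the node as
  ; STRICT_FMUL(1.0, %a).
  %canonicalized = call float @llvm.canonicalize.f32(float %a)
  ret float %canonicalized
}

declare float @llvm.canonicalize.f32(float)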

SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
DAGCombinerInfo &DCI) const {
SelectionDAG &DAG = DCI.DAG;
@@ -58198,6 +58218,7 @@ SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
case ISD::AND: return combineAnd(N, DAG, DCI, Subtarget);
case ISD::OR: return combineOr(N, DAG, DCI, Subtarget);
case ISD::XOR: return combineXor(N, DAG, DCI, Subtarget);
case ISD::FCANONICALIZE: return combineCanonicalize(N, DAG);
case ISD::BITREVERSE: return combineBITREVERSE(N, DAG, DCI, Subtarget);
case ISD::AVGCEILS:
case ISD::AVGCEILU:
273 changes: 273 additions & 0 deletions llvm/test/CodeGen/X86/canonicalize-vars-f16-type.ll
@@ -0,0 +1,273 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --default-march x86_64-unknown-linux-gnu --version 5
; RUN: llc -mattr=+sse2 -mtriple=x86_64 < %s | FileCheck %s -check-prefixes=SSE
; RUN: llc -mattr=+avx -mtriple=x86_64 < %s | FileCheck %s -check-prefixes=AVX1
; RUN: llc -mattr=+avx2 -mtriple=x86_64 < %s | FileCheck %s -check-prefixes=AVX2
; RUN: llc -mattr=+avx512f -mtriple=x86_64 < %s | FileCheck %s -check-prefixes=AVX512F
; RUN: llc -mattr=+avx512bw -mtriple=x86_64 < %s | FileCheck %s -check-prefixes=AVX512BW

define void @v_test_canonicalize__half(half addrspace(1)* %out) nounwind {
; SSE-LABEL: v_test_canonicalize__half:
; SSE: # %bb.0: # %entry
; SSE-NEXT: pushq %rbx
; SSE-NEXT: subq $16, %rsp
; SSE-NEXT: movq %rdi, %rbx
; SSE-NEXT: pinsrw $0, (%rdi), %xmm0
; SSE-NEXT: callq __extendhfsf2@PLT
; SSE-NEXT: movd %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Folded Spill
; SSE-NEXT: pinsrw $0, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
; SSE-NEXT: callq __extendhfsf2@PLT
; SSE-NEXT: mulss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 4-byte Folded Reload
; SSE-NEXT: callq __truncsfhf2@PLT
; SSE-NEXT: pextrw $0, %xmm0, %eax
; SSE-NEXT: movw %ax, (%rbx)
; SSE-NEXT: addq $16, %rsp
; SSE-NEXT: popq %rbx
; SSE-NEXT: retq
;
; AVX1-LABEL: v_test_canonicalize__half:
; AVX1: # %bb.0: # %entry
; AVX1-NEXT: pushq %rbx
; AVX1-NEXT: subq $16, %rsp
; AVX1-NEXT: movq %rdi, %rbx
; AVX1-NEXT: vpinsrw $0, (%rdi), %xmm0, %xmm0
; AVX1-NEXT: callq __extendhfsf2@PLT
; AVX1-NEXT: vmovd %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Folded Spill
; AVX1-NEXT: vpinsrw $0, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
; AVX1-NEXT: callq __extendhfsf2@PLT
; AVX1-NEXT: vmulss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0, %xmm0 # 4-byte Folded Reload
; AVX1-NEXT: callq __truncsfhf2@PLT
; AVX1-NEXT: vpextrw $0, %xmm0, (%rbx)
; AVX1-NEXT: addq $16, %rsp
; AVX1-NEXT: popq %rbx
; AVX1-NEXT: retq
;
; AVX2-LABEL: v_test_canonicalize__half:
; AVX2: # %bb.0: # %entry
; AVX2-NEXT: pushq %rbx
; AVX2-NEXT: subq $16, %rsp
; AVX2-NEXT: movq %rdi, %rbx
; AVX2-NEXT: vpinsrw $0, (%rdi), %xmm0, %xmm0
; AVX2-NEXT: callq __extendhfsf2@PLT
; AVX2-NEXT: vmovd %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Folded Spill
; AVX2-NEXT: vpinsrw $0, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
; AVX2-NEXT: callq __extendhfsf2@PLT
; AVX2-NEXT: vmulss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0, %xmm0 # 4-byte Folded Reload
; AVX2-NEXT: callq __truncsfhf2@PLT
; AVX2-NEXT: vpextrw $0, %xmm0, (%rbx)
; AVX2-NEXT: addq $16, %rsp
; AVX2-NEXT: popq %rbx
; AVX2-NEXT: retq
;
; AVX512F-LABEL: v_test_canonicalize__half:
; AVX512F: # %bb.0: # %entry
; AVX512F-NEXT: movzwl (%rdi), %eax
; AVX512F-NEXT: movzwl {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ecx
; AVX512F-NEXT: vmovd %ecx, %xmm0
; AVX512F-NEXT: vcvtph2ps %xmm0, %xmm0
; AVX512F-NEXT: vmovd %eax, %xmm1
; AVX512F-NEXT: vcvtph2ps %xmm1, %xmm1
; AVX512F-NEXT: vmulss %xmm1, %xmm0, %xmm0
; AVX512F-NEXT: vxorps %xmm1, %xmm1, %xmm1
; AVX512F-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2,3]
; AVX512F-NEXT: vcvtps2ph $4, %xmm0, %xmm0
; AVX512F-NEXT: vmovd %xmm0, %eax
; AVX512F-NEXT: movw %ax, (%rdi)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: v_test_canonicalize__half:
; AVX512BW: # %bb.0: # %entry
; AVX512BW-NEXT: movzwl (%rdi), %eax
; AVX512BW-NEXT: movzwl {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ecx
; AVX512BW-NEXT: vmovd %ecx, %xmm0
; AVX512BW-NEXT: vcvtph2ps %xmm0, %xmm0
; AVX512BW-NEXT: vmovd %eax, %xmm1
; AVX512BW-NEXT: vcvtph2ps %xmm1, %xmm1
; AVX512BW-NEXT: vmulss %xmm1, %xmm0, %xmm0
; AVX512BW-NEXT: vxorps %xmm1, %xmm1, %xmm1
; AVX512BW-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2,3]
; AVX512BW-NEXT: vcvtps2ph $4, %xmm0, %xmm0
; AVX512BW-NEXT: vmovd %xmm0, %eax
; AVX512BW-NEXT: movw %ax, (%rdi)
; AVX512BW-NEXT: retq
entry:
%val = load half, half addrspace(1)* %out
%canonicalized = call half @llvm.canonicalize.f16(half %val)
store half %canonicalized, half addrspace(1)* %out
ret void
}

define half @complex_canonicalize_fmul_half(half %a, half %b) nounwind {
; SSE-LABEL: complex_canonicalize_fmul_half:
; SSE: # %bb.0: # %entry
; SSE-NEXT: pushq %rax
; SSE-NEXT: movss %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
; SSE-NEXT: callq __extendhfsf2@PLT
; SSE-NEXT: movss %xmm0, (%rsp) # 4-byte Spill
; SSE-NEXT: movss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 4-byte Reload
; SSE-NEXT: # xmm0 = mem[0],zero,zero,zero
; SSE-NEXT: callq __extendhfsf2@PLT
; SSE-NEXT: movss %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
; SSE-NEXT: movss (%rsp), %xmm1 # 4-byte Reload
; SSE-NEXT: # xmm1 = mem[0],zero,zero,zero
; SSE-NEXT: subss %xmm0, %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: callq __truncsfhf2@PLT
; SSE-NEXT: callq __extendhfsf2@PLT
; SSE-NEXT: movss %xmm0, (%rsp) # 4-byte Spill
; SSE-NEXT: addss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 4-byte Folded Reload
; SSE-NEXT: callq __truncsfhf2@PLT
; SSE-NEXT: callq __extendhfsf2@PLT
; SSE-NEXT: subss (%rsp), %xmm0 # 4-byte Folded Reload
; SSE-NEXT: callq __truncsfhf2@PLT
; SSE-NEXT: callq __extendhfsf2@PLT
; SSE-NEXT: movss %xmm0, (%rsp) # 4-byte Spill
; SSE-NEXT: pinsrw $0, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
; SSE-NEXT: callq __extendhfsf2@PLT
; SSE-NEXT: mulss (%rsp), %xmm0 # 4-byte Folded Reload
; SSE-NEXT: callq __truncsfhf2@PLT
; SSE-NEXT: callq __extendhfsf2@PLT
; SSE-NEXT: subss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 4-byte Folded Reload
; SSE-NEXT: callq __truncsfhf2@PLT
; SSE-NEXT: popq %rax
; SSE-NEXT: retq
;
; AVX1-LABEL: complex_canonicalize_fmul_half:
; AVX1: # %bb.0: # %entry
; AVX1-NEXT: pushq %rax
; AVX1-NEXT: vmovss %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
; AVX1-NEXT: callq __extendhfsf2@PLT
; AVX1-NEXT: vmovss %xmm0, (%rsp) # 4-byte Spill
; AVX1-NEXT: vmovss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 4-byte Reload
; AVX1-NEXT: # xmm0 = mem[0],zero,zero,zero
; AVX1-NEXT: callq __extendhfsf2@PLT
; AVX1-NEXT: vmovss %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
; AVX1-NEXT: vmovss (%rsp), %xmm1 # 4-byte Reload
; AVX1-NEXT: # xmm1 = mem[0],zero,zero,zero
; AVX1-NEXT: vsubss %xmm0, %xmm1, %xmm0
; AVX1-NEXT: callq __truncsfhf2@PLT
; AVX1-NEXT: callq __extendhfsf2@PLT
; AVX1-NEXT: vmovss %xmm0, (%rsp) # 4-byte Spill
; AVX1-NEXT: vaddss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0, %xmm0 # 4-byte Folded Reload
; AVX1-NEXT: callq __truncsfhf2@PLT
; AVX1-NEXT: callq __extendhfsf2@PLT
; AVX1-NEXT: vsubss (%rsp), %xmm0, %xmm0 # 4-byte Folded Reload
; AVX1-NEXT: callq __truncsfhf2@PLT
; AVX1-NEXT: callq __extendhfsf2@PLT
; AVX1-NEXT: vmovss %xmm0, (%rsp) # 4-byte Spill
; AVX1-NEXT: vpinsrw $0, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
; AVX1-NEXT: callq __extendhfsf2@PLT
; AVX1-NEXT: vmulss (%rsp), %xmm0, %xmm0 # 4-byte Folded Reload
; AVX1-NEXT: callq __truncsfhf2@PLT
; AVX1-NEXT: callq __extendhfsf2@PLT
; AVX1-NEXT: vsubss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0, %xmm0 # 4-byte Folded Reload
; AVX1-NEXT: callq __truncsfhf2@PLT
; AVX1-NEXT: popq %rax
; AVX1-NEXT: retq
;
; AVX2-LABEL: complex_canonicalize_fmul_half:
; AVX2: # %bb.0: # %entry
; AVX2-NEXT: pushq %rax
; AVX2-NEXT: vmovss %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
; AVX2-NEXT: callq __extendhfsf2@PLT
; AVX2-NEXT: vmovss %xmm0, (%rsp) # 4-byte Spill
; AVX2-NEXT: vmovss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 4-byte Reload
; AVX2-NEXT: # xmm0 = mem[0],zero,zero,zero
; AVX2-NEXT: callq __extendhfsf2@PLT
; AVX2-NEXT: vmovss %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
; AVX2-NEXT: vmovss (%rsp), %xmm1 # 4-byte Reload
; AVX2-NEXT: # xmm1 = mem[0],zero,zero,zero
; AVX2-NEXT: vsubss %xmm0, %xmm1, %xmm0
; AVX2-NEXT: callq __truncsfhf2@PLT
; AVX2-NEXT: callq __extendhfsf2@PLT
; AVX2-NEXT: vmovss %xmm0, (%rsp) # 4-byte Spill
; AVX2-NEXT: vaddss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0, %xmm0 # 4-byte Folded Reload
; AVX2-NEXT: callq __truncsfhf2@PLT
; AVX2-NEXT: callq __extendhfsf2@PLT
; AVX2-NEXT: vsubss (%rsp), %xmm0, %xmm0 # 4-byte Folded Reload
; AVX2-NEXT: callq __truncsfhf2@PLT
; AVX2-NEXT: callq __extendhfsf2@PLT
; AVX2-NEXT: vmovss %xmm0, (%rsp) # 4-byte Spill
; AVX2-NEXT: vpinsrw $0, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
; AVX2-NEXT: callq __extendhfsf2@PLT
; AVX2-NEXT: vmulss (%rsp), %xmm0, %xmm0 # 4-byte Folded Reload
; AVX2-NEXT: callq __truncsfhf2@PLT
; AVX2-NEXT: callq __extendhfsf2@PLT
; AVX2-NEXT: vsubss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0, %xmm0 # 4-byte Folded Reload
; AVX2-NEXT: callq __truncsfhf2@PLT
; AVX2-NEXT: popq %rax
; AVX2-NEXT: retq
;
; AVX512F-LABEL: complex_canonicalize_fmul_half:
; AVX512F: # %bb.0: # %entry
; AVX512F-NEXT: vpextrw $0, %xmm1, %eax
; AVX512F-NEXT: vpextrw $0, %xmm0, %ecx
; AVX512F-NEXT: vmovd %ecx, %xmm0
; AVX512F-NEXT: vcvtph2ps %xmm0, %xmm0
; AVX512F-NEXT: vmovd %eax, %xmm1
; AVX512F-NEXT: vcvtph2ps %xmm1, %xmm1
; AVX512F-NEXT: vsubss %xmm1, %xmm0, %xmm0
; AVX512F-NEXT: vcvtps2ph $4, %xmm0, %xmm0
; AVX512F-NEXT: vcvtph2ps %xmm0, %xmm0
; AVX512F-NEXT: vaddss %xmm1, %xmm0, %xmm2
; AVX512F-NEXT: vcvtps2ph $4, %xmm2, %xmm2
; AVX512F-NEXT: vcvtph2ps %xmm2, %xmm2
; AVX512F-NEXT: vsubss %xmm0, %xmm2, %xmm0
; AVX512F-NEXT: vcvtps2ph $4, %xmm0, %xmm0
; AVX512F-NEXT: vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
; AVX512F-NEXT: vcvtph2ps %xmm0, %xmm0
; AVX512F-NEXT: movzwl {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %eax
; AVX512F-NEXT: vmovd %eax, %xmm2
; AVX512F-NEXT: vcvtph2ps %xmm2, %xmm2
; AVX512F-NEXT: vmulss %xmm0, %xmm2, %xmm0
; AVX512F-NEXT: vxorps %xmm2, %xmm2, %xmm2
; AVX512F-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm2[1,2,3]
; AVX512F-NEXT: vcvtps2ph $4, %xmm0, %xmm0
; AVX512F-NEXT: vcvtph2ps %xmm0, %xmm0
; AVX512F-NEXT: vsubss %xmm1, %xmm0, %xmm0
; AVX512F-NEXT: vcvtps2ph $4, %xmm0, %xmm0
; AVX512F-NEXT: vmovd %xmm0, %eax
; AVX512F-NEXT: vpinsrw $0, %eax, %xmm0, %xmm0
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: complex_canonicalize_fmul_half:
; AVX512BW: # %bb.0: # %entry
; AVX512BW-NEXT: vpextrw $0, %xmm1, %eax
; AVX512BW-NEXT: vpextrw $0, %xmm0, %ecx
; AVX512BW-NEXT: vmovd %ecx, %xmm0
; AVX512BW-NEXT: vcvtph2ps %xmm0, %xmm0
; AVX512BW-NEXT: vmovd %eax, %xmm1
; AVX512BW-NEXT: vcvtph2ps %xmm1, %xmm1
; AVX512BW-NEXT: vsubss %xmm1, %xmm0, %xmm0
; AVX512BW-NEXT: vcvtps2ph $4, %xmm0, %xmm0
; AVX512BW-NEXT: vcvtph2ps %xmm0, %xmm0
; AVX512BW-NEXT: vaddss %xmm1, %xmm0, %xmm2
; AVX512BW-NEXT: vcvtps2ph $4, %xmm2, %xmm2
; AVX512BW-NEXT: vcvtph2ps %xmm2, %xmm2
; AVX512BW-NEXT: vsubss %xmm0, %xmm2, %xmm0
; AVX512BW-NEXT: vcvtps2ph $4, %xmm0, %xmm0
; AVX512BW-NEXT: vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
; AVX512BW-NEXT: vcvtph2ps %xmm0, %xmm0
; AVX512BW-NEXT: movzwl {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %eax
; AVX512BW-NEXT: vmovd %eax, %xmm2
; AVX512BW-NEXT: vcvtph2ps %xmm2, %xmm2
; AVX512BW-NEXT: vmulss %xmm0, %xmm2, %xmm0
; AVX512BW-NEXT: vxorps %xmm2, %xmm2, %xmm2
; AVX512BW-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm2[1,2,3]
; AVX512BW-NEXT: vcvtps2ph $4, %xmm0, %xmm0
; AVX512BW-NEXT: vcvtph2ps %xmm0, %xmm0
; AVX512BW-NEXT: vsubss %xmm1, %xmm0, %xmm0
; AVX512BW-NEXT: vcvtps2ph $4, %xmm0, %xmm0
; AVX512BW-NEXT: vmovd %xmm0, %eax
; AVX512BW-NEXT: vpinsrw $0, %eax, %xmm0, %xmm0
; AVX512BW-NEXT: retq
entry:

%mul1 = fsub half %a, %b
%add = fadd half %mul1, %b
%mul2 = fsub half %add, %mul1
%canonicalized = call half @llvm.canonicalize.f16(half %mul2)
%result = fsub half %canonicalized, %b
ret half %result
}

declare half @llvm.canonicalize.f16(half)