Proposal to add fixed-point multiplication instructions #221
Description
(Branched from Issue #175)
It would be useful to have fixed-point multiplication instructions, e.g. 32x32=32 and 16x16=16, similar to ARM SQRDMULH.
Some may think that the availability of a 32x32=64 integer multiplication (Issue #175) would remove the need for that, but that would be sub-optimal: staying within 32bit means doing 4 scalar operations per 128-bit vector operation, and most applications want to use the rounding flavor (SQRDMULH not SQDMULH) which would require a few more instructions to emulate if the instruction is missing, which in practice would result in applications making compromises between accuracy and performance.
(This is critical to integer-quantized neural network applications as performed by TensorFlow Lite using the ruy matrix multiplication library, see e.g. the usage of these instructions here, https://github.com/google/ruy/blob/57e64b4c8f32e813ce46eb495d13ee301826e498/ruy/kernel_arm64.cc#L517 )