
Conversation


@sanchitintel sanchitintel commented May 28, 2025

Summary

  • Line endings were changed to Unix-style, so to view the actual diff without whitespace noise, please use https://github.com/codeplaysoftware/cutlass-sycl/pull/396/files?diff=split&w=1.

  • Even if we were to use shared memory to cache the FP8 -> FP16 converted elements, faster conversion would still be helpful on GPUs that lack native FP8 support.

  • Since the FP8_E5M2 -> FP16 conversion uses 32-bit registers and int32 ALUs, we might as well convert 4 FP8 elements to FP16 per iteration. This implementation uses 3 shift instructions for every 4 FP8 elements (plus 4 and and 2 or instructions, which have considerably lower latency than shifts), so it should outperform the previous approach, which used 1 shift instruction per FP8 element (apart from other lower-latency instructions).

  • Also moved convert_FP8_to_FP16 to fp8_to_fp16.h, since it needs to be reused in FP8 Grouped GEMM.

Performance improvement

Manual loop unrolling can improve performance a bit more. For example, converting 8 elements per loop iteration performs slightly better, and 16 slightly better still.

Converting 8 elements in a loop iteration

With the default problem size in the example, GEMM performance improved by ~18.37% on Intel GPU Max 1550.

On BMG, the performance improvement was ~15.28% for the default input shape.

Converting 4 elements in a loop iteration (current implementation in this PR)

By converting only 4 elements per loop iteration and using a compile-time-selected unroll factor equal to the number of iterations, there is a small regression compared to converting 8 elements per iteration.

(Benchmark screenshots for Intel GPU Max 1550 and BMG.)

DPCPP: https://github.com/intel/llvm/releases/tag/nightly-2025-03-24

cc @jiyang1011 @pengzhao-intel

@sanchitintel sanchitintel changed the title Optimize FP8_E5M2 -> FP16 conversion with fewer assembly instructions Reduce FP8_E5M2 -> FP16 conversion overhead May 28, 2025
@sanchitintel
Copy link
Author

sanchitintel commented May 29, 2025

Hi @t4c1 @jiyang1011,

Could you please review this PR at your convenience? Thanks!
The CI failures are unrelated, by the way.

@sanchitintel sanchitintel marked this pull request as draft May 30, 2025 01:53
@sanchitintel sanchitintel marked this pull request as ready for review May 30, 2025 04:02
@sanchitintel sanchitintel marked this pull request as draft May 30, 2025 09:33
@sanchitintel sanchitintel marked this pull request as ready for review May 30, 2025 19:41
@sanchitintel sanchitintel marked this pull request as draft June 4, 2025 07:29
@sanchitintel sanchitintel marked this pull request as ready for review June 4, 2025 07:46
@sanchitintel sanchitintel marked this pull request as draft June 4, 2025 22:14
@sanchitintel sanchitintel marked this pull request as ready for review June 4, 2025 22:23
@sanchitintel sanchitintel marked this pull request as draft June 9, 2025 09:29
@sanchitintel sanchitintel marked this pull request as ready for review June 9, 2025 09:49
@sanchitintel
Copy link
Author

sanchitintel commented Jun 10, 2025

Hi @aacostadiaz, thank you for triggering the CI!
All CI checks have now passed. Could you please help land this PR? I lack the permissions to do so on my end.

Thanks!

@aacostadiaz aacostadiaz merged commit b7732d5 into intel:sycl-develop Jun 10, 2025
10 checks passed
