Reduce FP8_E5M2 -> FP16 conversion overhead #396
Merged: aacostadiaz merged 22 commits into intel:sycl-develop from sanchitintel:optimize_e5m2_gemm on Jun 10, 2025
Conversation
Hi @t4c1 @jiyang1011, could you please review this PR at your convenience? Thanks!
t4c1 reviewed (May 30, 2025)
t4c1 reviewed (May 30, 2025)
t4c1 reviewed (Jun 2, 2025)
t4c1 approved these changes (Jun 4, 2025)
mehdi-goli reviewed 6 times (Jun 4, 2025)
Branch force-pushed from 780fc02 to d676f39
mehdi-goli approved these changes (Jun 4, 2025)
mehdi-goli reviewed (Jun 9, 2025)
Hi @aacostadiaz, thank you for triggering the CI!
Summary
Line endings were changed to Unix, so to view the actual diff (without the line-ending noise), please go to https://github.com/codeplaysoftware/cutlass-sycl/pull/396/files?diff=split&w=1.
Even if we used shared memory to cache the FP8 -> FP16 converted elements, faster conversion would still be helpful on GPUs that lack native FP8 support.

Since the FP8_E5M2 -> FP16 conversion uses 32-bit registers and int32 ALUs, we might as well convert 4 FP8 elements to FP16 in one iteration. This implementation uses 3 shift instructions for every 4 FP8 elements (plus 4 `and` and 2 `or` instructions, which have considerably lower latency than shifts), so it should perform better than the previous approach, which uses 1 shift instruction for each FP8 element (apart from other lower-latency instructions).

Moved `convert_FP8_to_FP16` to `fp8_to_fp16.h`, since it needs to be reused in FP8 Grouped GEMM.
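As an illustration, here is a minimal sketch of the kind of bit manipulation described above (reconstructed from the instruction counts given; the helper name is hypothetical, not the PR's actual code). E5M2 shares FP16's sign bit and 5-bit exponent field, so each element converts by shifting its byte left by 8 within a 16-bit lane; handling four packed elements in one 32-bit register needs only 3 shifts, 4 `and`s, and 2 `or`s:

```cpp
#include <cstdint>

// Sketch: convert 4 packed FP8_E5M2 bytes [b3 b2 b1 b0] in `packed` into
// 4 FP16 bit patterns, two per 32-bit output word (little-endian lanes).
inline void e5m2x4_to_fp16x4(uint32_t packed, uint32_t& lo, uint32_t& hi) {
  // 3 shifts + 4 ands + 2 ors for all four elements, vs. one shift
  // (plus byte extraction) per element in a scalar conversion.
  lo = ((packed << 8)  & 0x0000FF00u)   // b0 -> low  FP16 lane of `lo`
     | ((packed << 16) & 0xFF000000u);  // b1 -> high FP16 lane of `lo`
  hi = ((packed >> 8)  & 0x0000FF00u)   // b2 -> low  FP16 lane of `hi`
     | ( packed        & 0xFF000000u);  // b3 -> high FP16 lane, no shift needed
}
```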
Performance improvement
Manual loop unrolling can improve performance a bit further: converting 8 elements inside the loop is slightly faster, and 16 slightly faster still.
Converting 8 elements in a loop iteration
With the default problem size in the example, GEMM performance improved by ~18.37% on Intel GPU Max 1550.

On BMG, the performance improvement was ~15.28% for the default input shape.

Converting 4 elements in a loop iteration (current implementation in this PR)
Converting only 4 elements inside the loop and using a compile-time-selected unroll factor equal to the number of iterations shows a small regression compared to converting 8 elements per iteration (a sketch of this loop structure follows the numbers below).
Intel GPU Max 1550:
BMG:

DPCPP: https://github.com/intel/llvm/releases/tag/nightly-2025-03-24
cc @jiyang1011 @pengzhao-intel