Reduce FP8_E5M2 -> FP16 conversion overhead #396
Merged: aacostadiaz merged 22 commits into intel:sycl-develop from sanchitintel:optimize_e5m2_gemm on Jun 10, 2025
Conversation
Hi @t4c1 @jiyang1011, could you please review this PR at your convenience? Thanks!
t4c1 reviewed (May 30, 2025)
t4c1 reviewed (May 30, 2025)
t4c1 reviewed (Jun 2, 2025)
t4c1 approved these changes (Jun 4, 2025)
mehdi-goli reviewed 6 times (Jun 4, 2025)
Branch force-pushed from 780fc02 to d676f39
mehdi-goli approved these changes (Jun 4, 2025)
mehdi-goli reviewed (Jun 9, 2025)
Hi @aacostadiaz, thank you for triggering the CI!
Summary
Line endings were changed to Unix, so to view the actual diff (without the line-ending noise), please go to https://github.com/codeplaysoftware/cutlass-sycl/pull/396/files?diff=split&w=1.
Even if we used shared memory to cache the FP8 -> FP16 converted elements, faster conversion would still be helpful on GPUs that lack native FP8 support.

Since the FP8_E5M2 -> FP16 conversion uses 32-bit registers and int32 ALUs, we might as well convert 4 FP8 elements to FP16 in one iteration. This implementation uses 3 shift instructions for every 4 FP8 elements (plus 4 `and` and 2 `or` instructions, which have considerably lower latency than shifts), so it should perform better than the previous approach, which uses 1 shift instruction for each FP8 element (apart from other lower-latency instructions).

Moved `convert_FP8_to_FP16` to `fp8_to_fp16.h`, since it needs to be reused in FP8 Grouped GEMM.
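As an illustration, here is a minimal sketch of the kind of bit manipulation described above (reconstructed from the instruction counts given; the helper name is hypothetical, not the PR's actual code). E5M2 shares FP16's sign bit and 5-bit exponent field, so each element converts by shifting its byte left by 8 within a 16-bit lane; handling four packed elements in one 32-bit register needs only 3 shifts, 4 `and`s, and 2 `or`s:

```cpp
#include <cstdint>

// Sketch: convert 4 packed FP8_E5M2 bytes [b3 b2 b1 b0] in `packed` into
// 4 FP16 bit patterns, two per 32-bit output word (little-endian lanes).
inline void e5m2x4_to_fp16x4(uint32_t packed, uint32_t& lo, uint32_t& hi) {
  // 3 shifts + 4 ands + 2 ors for all four elements, vs. one shift
  // (plus byte extraction) per element in a scalar conversion.
  lo = ((packed << 8)  & 0x0000FF00u)   // b0 -> low  FP16 lane of `lo`
     | ((packed << 16) & 0xFF000000u);  // b1 -> high FP16 lane of `lo`
  hi = ((packed >> 8)  & 0x0000FF00u)   // b2 -> low  FP16 lane of `hi`
     | ( packed        & 0xFF000000u);  // b3 -> high FP16 lane, no shift needed
}
```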
Performance improvement
Manual loop unrolling can improve performance a bit further: converting 8 elements inside the loop is slightly faster, and 16 slightly faster still.
Converting 8 elements in a loop iteration
With the default problem size in the example, GEMM performance improved by ~18.37% on Intel GPU Max 1550.

On BMG, the performance improvement was ~15.28% for the default input shape.

Converting 4 elements in a loop iteration (current implementation in this PR)
Converting only 4 elements inside the loop and using a compile-time-selected unroll factor equal to the number of iterations shows a small regression compared to converting 8 elements per iteration (a sketch of this loop structure follows the numbers below).
Intel GPU Max 1550:
BMG:

DPCPP: https://github.com/intel/llvm/releases/tag/nightly-2025-03-24
cc @jiyang1011 @pengzhao-intel