Improve kDequantizeBlockwise kernel performance for NF4/FP4 #1747
This PR improves the blockwise dequantization kernel's performance for the 4bit formats (FP4 and NF4). There is no accuracy impact. For end users, this will speed up prefill and batch decoding for inference, as well as QLoRA training. This also complements #1720, as the overhead from dequantization is especially pronounced when applying that experimental feature.
The branching 4bit dequantization functions are replaced with a lookup table (LUT) that is distributed across the threads of each warp. Each lookup is performed with a warp shuffle.
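For illustration, here is a minimal sketch of the warp-shuffle LUT idea. This is not the actual kernel from this PR: the function names, signature, and the `absmax`/`blocksize` scaling layout are assumptions made for the example. Each lane caches one entry of the 16-value code table in a register, and a 4-bit code is dequantized by shuffling the value from the lane whose id matches the code, instead of branching on the code's bits.

```cuda
#include <cuda_runtime.h>

// Look up a 4-bit code by reading the register of the lane selected by `code`.
// Lanes 16..31 hold a replicated copy of entries 0..15, so any 4-bit index
// resolves within the warp without a branch or a memory load.
__device__ __forceinline__ float lut_lookup(unsigned char code, float lane_entry)
{
    return __shfl_sync(0xFFFFFFFFu, lane_entry, code & 0x0F);
}

// Hypothetical kernel: `code_table` is the 16-entry FP4/NF4 value table,
// `packed` holds two 4-bit codes per byte, and `absmax` holds one scale per
// `blocksize` output elements (an assumed layout for this sketch).
__global__ void dequant_4bit_sketch(const unsigned char* packed,
                                    const float* code_table,
                                    const float* absmax,
                                    float* out,
                                    int n_packed, int blocksize)
{
    const int lane = threadIdx.x & 31;
    // Each lane loads its table entry once; lanes 16..31 duplicate entries 0..15.
    const float lane_entry = code_table[lane & 15];

    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Load unconditionally (0 for out-of-range lanes) so the whole warp
    // participates in the shuffles below.
    const unsigned char byte = (i < n_packed) ? packed[i] : 0;
    const float hi = lut_lookup(byte >> 4,   lane_entry);
    const float lo = lut_lookup(byte & 0x0F, lane_entry);

    if (i < n_packed) {
        const float scale = absmax[(2 * i) / blocksize];
        out[2 * i]     = hi * scale;
        out[2 * i + 1] = lo * scale;
    }
}
```

The point of the shuffle is that the 16-entry table lives entirely in registers spread over the warp, so each lookup is a single register exchange rather than a data-dependent branch or shared/global memory access.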
This change was evaluated on the following data center hardware:
CUDA Toolkit 12.8, driver version 575:
CUDA Toolkit 12.5, driver version 560:
Additionally, it was evaluated on the following consumer hardware, with CUDA Toolkit 12.9, driver version 580:
There are significant gains on the P100, A100, and H100 across all problem sizes, where dequantization is 1.36x - 2.07x faster.
On the T4, the gains typically range from 1.25x to 1.80x, with an outlier improvement of 3.05x at the 134M problem size.
On sm86 (e.g. A10) and sm89 (e.g. RTX 4090, L4, L40S), the improvement is a more modest 1.3x - 1.6x, and only for smaller problem sizes, such as those found in the self-attention layers or in smaller models. At larger sizes, performance is unchanged from before this PR.
Additional end-to-end inference benchmarks will follow.