matthewdouglas (Member) commented Sep 5, 2025

This PR improves the blockwise dequantization kernel's performance for the 4bit formats (FP4 and NF4). There is no accuracy impact. For end users, this will speed up prefill and batch decoding for inference, as well as QLoRA training. This also complements #1720, as the overhead from dequantization is especially pronounced when applying that experimental feature.

The branching 4bit dequantization functions are replaced with a lookup table (LUT) distributed across the threads of each warp. Each lookup is performed with a warp shuffle.
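To illustrate the technique, here is a minimal sketch of a warp-distributed LUT dequantizer. This is not the PR's actual kernel: the kernel name, launch shape, and the single scalar `absmax` are assumptions for illustration (the real kernel applies per-block absmax scales and uses vectorized memory access), but the core idea is the same: each warp lane holds one codebook entry in a register, and each 4-bit code is decoded with a `__shfl_sync` instead of a branch tree.

```cuda
// Illustrative sketch only (not the PR's actual kernel): NF4 dequantization
// with the 16-entry codebook held in warp registers and read via warp shuffle.
#include <cuda_runtime.h>

__global__ void dequantize_nf4_sketch(const unsigned char* __restrict__ packed,
                                      float* __restrict__ out,
                                      float absmax,  // assumption: one scale for the whole tensor
                                      int n_bytes)   // each byte packs two 4-bit codes
{
    // 16-entry NF4 codebook; each lane keeps one entry in a register,
    // so no shared or global memory traffic is needed for the table.
    const float codebook[16] = {
        -1.0f, -0.6961928009986877f, -0.5250730514526367f, -0.39491748809814453f,
        -0.28444138169288635f, -0.18477343022823334f, -0.09105003625154495f, 0.0f,
        0.07958029955625534f, 0.16093020141124725f, 0.24611230194568634f,
        0.33791524171829224f, 0.44070982933044434f, 0.5626170039176941f,
        0.7229568362236023f, 1.0f};
    const int lane = threadIdx.x & 31;
    const float my_entry = codebook[lane & 15];

    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // All 32 lanes must participate in __shfl_sync, so out-of-range
    // threads shuffle a dummy byte and simply skip the store below.
    const unsigned char byte = (idx < n_bytes) ? packed[idx] : 0;

    // Branch-free lookup: fetch the codebook entry from the lane whose
    // index equals the 4-bit code, replacing the old if/else decode tree.
    const float hi = __shfl_sync(0xFFFFFFFFu, my_entry, byte >> 4);
    const float lo = __shfl_sync(0xFFFFFFFFu, my_entry, byte & 0x0F);

    if (idx < n_bytes) {
        out[2 * idx]     = hi * absmax;
        out[2 * idx + 1] = lo * absmax;
    }
}
```

Because `__shfl_sync` is a register-to-register exchange, the lookup avoids both branch divergence and any shared-memory table, which is where the speedup comes from.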

This change was evaluated on the following data center hardware:

CUDA Toolkit 12.8, driver version 575:

  • T4 16GB (sm75)
  • A100 40GB (sm80)
  • A10 24GB (sm86)
  • L4 24GB (sm89)
  • H100 80GB (sm90)

CUDA Toolkit 12.5, driver version 560:

  • P100 16GB (sm60)

Additionally, it was evaluated on the following consumer hardware, with CUDA Toolkit 12.9, driver version 580:

  • RTX 4090 24GB (sm89)

There are significant gains on the P100, A100, and H100 across all problem sizes, where dequantization is between 1.36x and 2.07x faster.

On the T4, the gains typically range from 1.25x to 1.80x, with an outlier improvement of 3.05x at the 134M problem size.

On sm86 (e.g. A10) and sm89 (e.g. RTX 4090, L4, L40S), the improvement is a more modest 1.3x - 1.6x, and only at smaller problem sizes, such as those found in self-attention layers or in smaller models. At larger sizes, performance is unchanged from before this PR.

[Chart: Improvement by Problem Size]

Additional end-to-end benchmarking for inference will follow.

github-actions bot commented Sep 5, 2025

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

matthewdouglas added the CUDA label (Issues and PRs related to the CUDA backend, excluding installation/support help) Sep 5, 2025
matthewdouglas added this to the v0.48.0 milestone Sep 5, 2025
Mhmd-Hisham added a commit to Mhmd-Hisham/bitsandbytes that referenced this pull request Sep 11, 2025