Improve kDequantizeBlockwise kernel performance for NF4/FP4 #1747
This PR improves the blockwise dequantization kernel's performance for the 4bit formats (FP4 and NF4). There is no accuracy impact. For end users, this will speed up prefill and batch decoding for inference, as well as QLoRA training. This also complements #1720, as the overhead from dequantization is especially pronounced when applying that experimental feature.
The branching 4bit dequantization functions are replaced with a lookup table (LUT) that is distributed across the threads of each warp. Each lookup is performed with a warp shuffle.
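For illustration, here is a minimal sketch of the warp-shuffle LUT idea. This is not the actual kernel from this PR: the function names, signature, and the `absmax`/`blocksize` scaling layout are assumptions made for the example. Each lane caches one entry of the 16-value code table in a register, and a 4-bit code is dequantized by shuffling the value from the lane whose id matches the code, instead of branching on the code's bits.

```cuda
#include <cuda_runtime.h>

// Look up a 4-bit code by reading the register of the lane selected by `code`.
// Lanes 16..31 hold a replicated copy of entries 0..15, so any 4-bit index
// resolves within the warp without a branch or a memory load.
__device__ __forceinline__ float lut_lookup(unsigned char code, float lane_entry)
{
    return __shfl_sync(0xFFFFFFFFu, lane_entry, code & 0x0F);
}

// Hypothetical kernel: `code_table` is the 16-entry FP4/NF4 value table,
// `packed` holds two 4-bit codes per byte, and `absmax` holds one scale per
// `blocksize` output elements (an assumed layout for this sketch).
__global__ void dequant_4bit_sketch(const unsigned char* packed,
                                    const float* code_table,
                                    const float* absmax,
                                    float* out,
                                    int n_packed, int blocksize)
{
    const int lane = threadIdx.x & 31;
    // Each lane loads its table entry once; lanes 16..31 duplicate entries 0..15.
    const float lane_entry = code_table[lane & 15];

    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Load unconditionally (0 for out-of-range lanes) so the whole warp
    // participates in the shuffles below.
    const unsigned char byte = (i < n_packed) ? packed[i] : 0;
    const float hi = lut_lookup(byte >> 4,   lane_entry);
    const float lo = lut_lookup(byte & 0x0F, lane_entry);

    if (i < n_packed) {
        const float scale = absmax[(2 * i) / blocksize];
        out[2 * i]     = hi * scale;
        out[2 * i + 1] = lo * scale;
    }
}
```

The point of the shuffle is that the 16-entry table lives entirely in registers spread over the warp, so each lookup is a single register exchange rather than a data-dependent branch or shared/global memory access.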
This change was evaluated on the following data center hardware:
CUDA Toolkit 12.8, driver version 575:
CUDA Toolkit 12.5, driver version 560:
Additionally, it was evaluated on the following consumer hardware, with CUDA Toolkit 12.9, driver version 580:
There are significant gains on the P100, A100, and H100 across all problem sizes, where dequantization is 1.36x - 2.07x faster.
On the T4, the gains typically range from 1.25x to 1.80x, with an outlier improvement of 3.05x at the 134M problem size.
On sm86 (e.g. A10) and sm89 (e.g. RTX 4090, L4, L40S), the improvement is a more modest 1.3x - 1.6x, and only for smaller problem sizes, such as those found in the self-attention layers or in smaller models. At larger sizes, performance is unchanged from before this PR.
Additional end-to-end inference benchmarks will follow.