
Conversation

@mayankagarwals
Contributor

@mayankagarwals mayankagarwals commented Oct 18, 2025

Summary

Support for Qwen3-VL models
Solves #897

Details

NA

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@dahwin

dahwin commented Oct 23, 2025

Please add support for Qwen3-VL.
Please do it fast.
We have been waiting for a long time.
It's been a month since the model was released.

@mayankagarwals
Contributor Author

mayankagarwals commented Oct 24, 2025

Hi @dahwin
The branch already supports FLCE for Qwen3-VL. You can go ahead and use it. I'm working on Qwen3-VL MoE.

@dahwin

dahwin commented Oct 24, 2025

> Hi @dahwin The branch already supports FLCE for Qwen3-VL. You can go ahead and use it. I'm working on Qwen3-VL MoE.



Hi @mayankagarwals,

Thanks for working on Qwen3-VL support! I've done extensive benchmarking of PR #911 with **Qwen3-VL-4B-Instruct** and discovered some critical issues.

## Setup
- **Model**: Qwen3-VL-4B-Instruct (dense, non-MoE)
- **Hardware**: 4x NVIDIA L40S (48GB each) 
- **Framework**: MS-Swift with Flash Attention + Liger Kernel (PR #911 branch)
- **Training Config**: batch_size=1, grad_accum=2, bf16, gradient_checkpointing, full parameter training
- **Comparison**: Tested both with and without Liger Kernel using identical hyperparameters

## Test 1: Using `apply_liger_kernel_to_qwen2_vl()` (Initial Attempt)

Since your comment said "The branch already supports FLCE for qwen 3 VL", I initially tried using `apply_liger_kernel_to_qwen2_vl()` as a fallback:

```python
from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl

apply_liger_kernel_to_qwen2_vl(
    fused_linear_cross_entropy=True,
    rms_norm=True,
    rope=True,
    swiglu=True,
)
```

Results:

  • Memory Usage (Per GPU): 38.73 GiB (identical to baseline)
  • Training Speed: 0.68 iter/s (vs 0.66 iter/s baseline)
  • Expected reduction: 60-80% memory savings
  • Actual reduction: ❌ 0% - No memory savings at all

Conclusion: Using qwen2_vl patches on Qwen3-VL did NOT activate FLCE properly.
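
(For reference, this is essentially how I checked which patch entry points the branch actually exposes before Test 2; a minimal sketch using plain module introspection:)

```python
import liger_kernel.transformers as lkt

# List every apply_liger_kernel_to_* entry point exported by this checkout,
# so the patch function can be matched to the actual model architecture.
patch_fns = sorted(name for name in dir(lkt) if name.startswith("apply_liger_kernel_to_"))
print("\n".join(patch_fns))
```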

## Test 2: Using `apply_liger_kernel_to_qwen3_vl()` (Correct Function)

After checking `dir(liger_kernel.transformers)`, I found that `apply_liger_kernel_to_qwen3_vl` does exist in PR #911:

```python
from liger_kernel.transformers import apply_liger_kernel_to_qwen3_vl

# Function signature:
# apply_liger_kernel_to_qwen3_vl(
#     rope: bool = False,
#     cross_entropy: bool = False, 
#     fused_linear_cross_entropy: bool = True,
#     rms_norm: bool = False,
#     swiglu: bool = False,
#     model: PreTrainedModel = None
# )

apply_liger_kernel_to_qwen3_vl(
    rope=True,
    fused_linear_cross_entropy=True,
    rms_norm=True,
    swiglu=True,
)
```

Result:

```
NotImplementedError: Under development
```

Error traceback:

```
File "/Liger-Kernel/src/liger_kernel/transformers/monkey_patch.py", line 1673, in apply_liger_kernel_to_qwen3_vl
    raise NotImplementedError("Under development")
```

## Summary

  1. Function exists: `apply_liger_kernel_to_qwen3_vl` is present in PR #911
  2. Function not implemented: it raises `NotImplementedError("Under development")`
  3. Fallback doesn't work: using `apply_liger_kernel_to_qwen2_vl()` doesn't activate FLCE for Qwen3-VL
  4. No memory savings: zero reduction in memory usage with the current implementation

## Questions & Request

  1. When will apply_liger_kernel_to_qwen3_vl() be fully implemented? The function exists but throws NotImplementedError.

  2. Why doesn't apply_liger_kernel_to_qwen2_vl() work for Qwen3-VL? Are the architectures too different for cross-compatibility?

  3. Your comment said "branch already supports FLCE for qwen 3 VL" - which function should we actually use? Both options failed:

    • apply_liger_kernel_to_qwen3_vl() → NotImplementedError
    • apply_liger_kernel_to_qwen2_vl() → No FLCE activation (0% memory reduction)
  4. Is there a working branch or commit where Qwen3-VL FLCE actually works? I'm happy to test a different branch/commit if available.

## Baseline Performance (Flash Attention Only)

For reference, here's what we're comparing against:

  • Training time: 50 minutes (50m 44s for 2000 steps)
  • Memory per GPU: 38.73 GiB
  • Training speed: 0.66 iter/s
  • Loss convergence: Working perfectly (final loss ~0.16)

We're eager to test FLCE to potentially:

  • Reduce memory by 60-80% → increase batch size → compensate for any speed loss
  • Enable larger models or longer sequences on same hardware

## Environment Details

- Liger Kernel: PR #911 branch (installed via `git checkout pr-911 && pip install -e .`)
- PyTorch: 2.8.0+cu128
- Transformers: Latest
- MS-Swift: Latest
- CUDA: 13.0
- Python: 3.12.3

Would really appreciate clarification on which function to use and when the implementation will be complete. Happy to provide more logs, test different configurations, or help debug! 🚀


---

@mayankagarwals
Contributor Author

  1. When will `apply_liger_kernel_to_qwen3_vl()` be fully implemented? No ETA right now; I'll update the thread when I have one.

  2. Why doesn't `apply_liger_kernel_to_qwen2_vl()` work for Qwen3-VL? Are the architectures too different for cross-compatibility? Yes, they are.

  3. Which function should you use? `apply_liger_kernel_to_qwen3_vl()`, but mark `rope`, `rms_norm`, and `swiglu` as false; these are yet to be implemented (see the snippet below).

  4. Is there a branch where Qwen3-VL FLCE actually works? The current one does, just keep FLCE set to true.
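
Concretely, based on the signature you posted above, something like this is all that should be needed for now (a sketch; only the FLCE path is wired up on this branch):

```python
from liger_kernel.transformers import apply_liger_kernel_to_qwen3_vl

# Only the fused linear cross entropy path is implemented for Qwen3-VL here;
# rope, rms_norm, and swiglu are not wired up yet, so leave them False.
apply_liger_kernel_to_qwen3_vl(
    rope=False,
    rms_norm=False,
    swiglu=False,
    fused_linear_cross_entropy=True,
)
```

Either call it before instantiating the model, or pass an already-created model via the `model` argument from that signature.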

@Tcc0403
Collaborator

Tcc0403 commented Oct 29, 2025

@dahwin Hi, thank you for your testing. There's a detailed analysis of memory with and without Liger FLCE in #517.

TL;DR: it only cuts the memory usage of logits-related tensors, so you won't see much difference with a low batch size/short seqlen.

With the logits memory wall lowered, you can achieve more efficient training by increasing batch size/seqlen as you mentioned, or by disabling gradient checkpointing.
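
As a rough back-of-the-envelope (illustrative numbers only, not measured from your run; Qwen-family vocabularies are on the order of 150k tokens):

```python
# Rough size of the full logits tensor that FLCE avoids materializing.
# All numbers below are assumptions for illustration.
batch_size = 1
seq_len = 4096            # assumed sequence length
vocab_size = 150_000      # roughly Qwen-family vocab size
bytes_per_element = 2     # bf16

logits_bytes = batch_size * seq_len * vocab_size * bytes_per_element
print(f"~{logits_bytes / 2**30:.1f} GiB of logits per forward pass")  # ~1.1 GiB
```

Against ~38.7 GiB per GPU that is only a few percent, which matches the near-identical readings you saw; the savings grow once batch size or seqlen makes the logits term dominant.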

@Tcc0403
Collaborator

Tcc0403 commented Oct 29, 2025

@mayankagarwals
Some info might be helpful:

@mayankagarwals
Contributor Author

mayankagarwals commented Oct 30, 2025

Hi @Tcc0403

Apologies for the delay, running a little low on time.

Current status:

- Qwen3VL: supported for rope, cross entropy, rmsnorm, flce -> all tests passing
- Qwen3VLMoe: supported for rope, cross entropy, rmsnorm, flce -> convergence failing for the following two tests:

```
FAILED test/convergence/fp32/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen3_vl_moe-32-0.0001-dtype5-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AssertionError: [Loss]Number of mismatched elements: 28
FAILED test/convergence/bf16/test_mini_models_with_logits.py::test_mini_model[mini_qwen3_vl_moe-32-1e-05-dtype9-0.01-0.05-0.1-0.01-0.01-0.01] - AssertionError: [Top k logprobs]Number of mismatched elements: 1
```

If you feel this is time sensitive, feel free to bring it home. I'll try to carve out more time soon and close this out.

@mayankagarwals
Contributor Author

Update:
This warrants a deeper analysis, but the issue was numeric instability of `liger_rotary_pos_emb` in bf16. After upcasting (similar to the vision path), RoPE works in bf16.
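
For reference, the upcasting pattern looks roughly like this (a sketch of the idea only, not the actual kernel change):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Standard rotary helper: swap the halves of the last dim and negate the second half.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope_fp32(q, k, cos, sin):
    # Do the rotation in fp32 to avoid bf16 round-off, then cast back.
    orig_dtype = q.dtype
    q32, k32, cos32, sin32 = q.float(), k.float(), cos.float(), sin.float()
    q_out = q32 * cos32 + rotate_half(q32) * sin32
    k_out = k32 * cos32 + rotate_half(k32) * sin32
    return q_out.to(orig_dtype), k_out.to(orig_dtype)
```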

So the following test is currently failing and still needs to be solved:

```
FAILED test/convergence/fp32/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen3_vl_moe-32-0.0001-dtype5-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AssertionError: [Loss]Number of mismatched elements: 28
```

@shimizust shimizust marked this pull request as ready for review November 1, 2025 16:02
@shimizust
Collaborator

Verified the bf16/fp32 convergence tests pass on h100. Fixed the other failing convergence tests. LGTM!

Collaborator

@shimizust shimizust left a comment


Thanks for all the effort in adding support for the Qwen3-VL model, @mayankagarwals @Tcc0403.

@shimizust shimizust merged commit 6cd9fd6 into linkedin:main Nov 1, 2025
Tcc0403 pushed a commit that referenced this pull request Nov 5, 2025
## Summary
Tiny fix re: #930; grad_weight and grad_bias were never set on the no_grad path.

The first commit is my fix; the second commit is from running `make checkstyle`, plus a small import change for `qwen3_vl`, as the import was broken and tests were not passing. I believe this was introduced in #911.

## Testing Done
Repro provided in #930 now passes:
```python
import torch
import torch.nn as nn
from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss


vocab_size, hidden_dim, num_tokens = 1000, 512, 256
device = "cuda" if torch.cuda.is_available() else "cpu"

linear = nn.Linear(hidden_dim, vocab_size, bias=False).to(device)
fused_loss_fn = LigerFusedLinearCrossEntropyLoss()

hidden_states = torch.randn(num_tokens, hidden_dim, device=device)
labels = torch.randint(0, vocab_size, (num_tokens,), device=device)

with torch.no_grad():
    loss = fused_loss_fn(linear.weight, hidden_states, labels)
    print(f"Loss: {loss.item()}")
```

- Hardware Type: 3090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

> 2110 passed, 255 skipped, 41 warnings, 1 rerun in 276.17s (0:04:36)
@matthewdm0816

Hi, is it possible to add SwiGLU support for Qwen3-VL? It seems that currently passing `swiglu=True` in the monkey patch function is a no-op.

@Tcc0403
Collaborator

Tcc0403 commented Nov 28, 2025

@matthewdm0816 Feel free to open an issue for it! I'll work on it.

@Leun9

Leun9 commented Dec 3, 2025

@matthewdm0816 The RoPE operator seems to have some issues.

Trainer + Qwen3-VL-8B:

```
[rank6]: return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
[rank6]: triton.compiler.errors.CompilationError: at 69:27:
[rank6]:     cos_offsets = tl.arange(0, pad_hd // 2)
[rank6]:     cos_mask = cos_offsets < hd // 2
[rank6]:     cos_row = tl.load(cos + cos_offsets, mask=cos_mask, other=0)
[rank6]:     sin_row = tl.load(sin + cos_offsets, mask=cos_mask, other=0)
[rank6]:     # ####################################################################
[rank6]:     # Load the left and right half of q and k for the current
[rank6]:     # program instance (i.e. for the current token) separately
[rank6]:     # ####################################################################
[rank6]:     # left half of the head
[rank6]:     first_half_q_offsets = tl.arange(0, pad_n_qh)[:, None] * hd + tl.arange(0, pad_hd // 2)[None, :]
[rank6]:                            ^
[rank6]: ValueError('numel (2097152) exceeds triton maximum tensor numel (1048576)')
```
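
(For anyone hitting this before the kernel is fixed, one possible stopgap, based on the patch flags discussed earlier in this thread, is to leave the Liger RoPE kernel off and keep only FLCE; a sketch, not a fix for the kernel itself:)

```python
from liger_kernel.transformers import apply_liger_kernel_to_qwen3_vl

# Workaround sketch: skip the Liger RoPE kernel and keep only the
# fused linear cross entropy patch until the Triton numel issue is resolved.
apply_liger_kernel_to_qwen3_vl(
    rope=False,
    fused_linear_cross_entropy=True,
)
```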

@casper-hansen

@Leun9 I got the same issue, so I opened #964.

