
Conversation

@mayankagarwals
Contributor

@mayankagarwals mayankagarwals commented Oct 18, 2025

Summary

Support for Qwen3-VL models
Solves #897

Details

NA

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@dahwin

dahwin commented Oct 23, 2025

Please add support for Qwen3-VL.
Please do it fast.
We have been waiting for a long time.
It's been a month since the model was released.

@mayankagarwals
Contributor Author

mayankagarwals commented Oct 24, 2025

Hi @dahwin
The branch already supports FLCE for Qwen3-VL. You can go ahead and use it. I'm working on Qwen3-VL MoE.

@dahwin

dahwin commented Oct 24, 2025

> Hi @dahwin The branch already supports FLCE for Qwen3-VL. You can go ahead and use it. I'm working on Qwen3-VL MoE.



Hi @mayankagarwals,

Thanks for working on Qwen3-VL support! I've done extensive benchmarking of PR #911 with **Qwen3-VL-4B-Instruct** and discovered some critical issues.

## Setup
- **Model**: Qwen3-VL-4B-Instruct (dense, non-MoE)
- **Hardware**: 4x NVIDIA L40S (48GB each) 
- **Framework**: MS-Swift with Flash Attention + Liger Kernel (PR #911 branch)
- **Training Config**: batch_size=1, grad_accum=2, bf16, gradient_checkpointing, full parameter training
- **Comparison**: Tested both with and without Liger Kernel using identical hyperparameters

## Test 1: Using `apply_liger_kernel_to_qwen2_vl()` (Initial Attempt)

Since your comment said "The branch already supports FLCE for qwen 3 VL", I initially tried using `apply_liger_kernel_to_qwen2_vl()` as a fallback:

```python
from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl

apply_liger_kernel_to_qwen2_vl(
    fused_linear_cross_entropy=True,
    rms_norm=True,
    rope=True,
    swiglu=True,
)
```

Results:

  • Memory Usage (Per GPU): 38.73 GiB (identical to baseline)
  • Training Speed: 0.68 iter/s (vs 0.66 iter/s baseline)
  • Expected reduction: 60-80% memory savings
  • Actual reduction: ❌ 0% - No memory savings at all

Conclusion: Using qwen2_vl patches on Qwen3-VL did NOT activate FLCE properly.
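
(For reference, this is essentially how I checked which patch entry points the branch actually exposes before Test 2; a minimal sketch using plain module introspection:)

```python
import liger_kernel.transformers as lkt

# List every apply_liger_kernel_to_* entry point exported by this checkout,
# so the patch function can be matched to the actual model architecture.
patch_fns = sorted(name for name in dir(lkt) if name.startswith("apply_liger_kernel_to_"))
print("\n".join(patch_fns))
```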

## Test 2: Using `apply_liger_kernel_to_qwen3_vl()` (Correct Function)

After checking `dir(liger_kernel.transformers)`, I found that `apply_liger_kernel_to_qwen3_vl` does exist in PR #911:

```python
from liger_kernel.transformers import apply_liger_kernel_to_qwen3_vl

# Function signature:
# apply_liger_kernel_to_qwen3_vl(
#     rope: bool = False,
#     cross_entropy: bool = False, 
#     fused_linear_cross_entropy: bool = True,
#     rms_norm: bool = False,
#     swiglu: bool = False,
#     model: PreTrainedModel = None
# )

apply_liger_kernel_to_qwen3_vl(
    rope=True,
    fused_linear_cross_entropy=True,
    rms_norm=True,
    swiglu=True,
)
```

Result:

```
NotImplementedError: Under development
```

Error traceback:

```
File "/Liger-Kernel/src/liger_kernel/transformers/monkey_patch.py", line 1673, in apply_liger_kernel_to_qwen3_vl
    raise NotImplementedError("Under development")
```

## Summary

  1. Function exists: `apply_liger_kernel_to_qwen3_vl` is present in PR #911
  2. Function not implemented: it raises `NotImplementedError("Under development")`
  3. Fallback doesn't work: using `apply_liger_kernel_to_qwen2_vl()` doesn't activate FLCE for Qwen3-VL
  4. No memory savings: zero reduction in memory usage with the current implementation

## Questions & Request

  1. When will apply_liger_kernel_to_qwen3_vl() be fully implemented? The function exists but throws NotImplementedError.

  2. Why doesn't apply_liger_kernel_to_qwen2_vl() work for Qwen3-VL? Are the architectures too different for cross-compatibility?

  3. Your comment said "branch already supports FLCE for qwen 3 VL" - which function should we actually use? Both options failed:

    • apply_liger_kernel_to_qwen3_vl() → NotImplementedError
    • apply_liger_kernel_to_qwen2_vl() → No FLCE activation (0% memory reduction)
  4. Is there a working branch or commit where Qwen3-VL FLCE actually works? I'm happy to test a different branch/commit if available.

## Baseline Performance (Flash Attention Only)

For reference, here's what we're comparing against:

  • Training time: 50 minutes (50m 44s for 2000 steps)
  • Memory per GPU: 38.73 GiB
  • Training speed: 0.66 iter/s
  • Loss convergence: Working perfectly (final loss ~0.16)

We're eager to test FLCE to potentially:

  • Reduce memory by 60-80% → increase batch size → compensate for any speed loss
  • Enable larger models or longer sequences on same hardware

## Environment Details

- Liger Kernel: PR #911 branch (installed via `git checkout pr-911 && pip install -e .`)
- PyTorch: 2.8.0+cu128
- Transformers: Latest
- MS-Swift: Latest
- CUDA: 13.0
- Python: 3.12.3

Would really appreciate clarification on which function to use and when the implementation will be complete. Happy to provide more logs, test different configurations, or help debug! 🚀


---

@mayankagarwals
Contributor Author

  1. When will `apply_liger_kernel_to_qwen3_vl()` be fully implemented? No ETA right now; I'll update the thread when I have one.

  2. Why doesn't `apply_liger_kernel_to_qwen2_vl()` work for Qwen3-VL? Are the architectures too different for cross-compatibility? Yes, they are.

  3. Which function should you use? `apply_liger_kernel_to_qwen3_vl()`, but mark `rope`, `rms_norm`, and `swiglu` as false; these are yet to be implemented (see the snippet below).

  4. Is there a branch where Qwen3-VL FLCE actually works? The current one does, just keep FLCE set to true.
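
Concretely, based on the signature you posted above, something like this is all that should be needed for now (a sketch; only the FLCE path is wired up on this branch):

```python
from liger_kernel.transformers import apply_liger_kernel_to_qwen3_vl

# Only the fused linear cross entropy path is implemented for Qwen3-VL here;
# rope, rms_norm, and swiglu are not wired up yet, so leave them False.
apply_liger_kernel_to_qwen3_vl(
    rope=False,
    rms_norm=False,
    swiglu=False,
    fused_linear_cross_entropy=True,
)
```

Either call it before instantiating the model, or pass an already-created model via the `model` argument from that signature.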

@Tcc0403
Collaborator

Tcc0403 commented Oct 29, 2025

@dahwin Hi, thank you for your testing. There's a detailed analysis of memory with and without Liger FLCE in #517.

TL;DR: it only cuts the memory usage of logits-related tensors, so you won't see much difference with a low batch size/short seqlen.

With the logits memory wall lowered, you can achieve more efficient training by increasing batch size/seqlen as you mentioned, or by disabling gradient checkpointing.
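
As a rough back-of-the-envelope (illustrative numbers only, not measured from your run; Qwen-family vocabularies are on the order of 150k tokens):

```python
# Rough size of the full logits tensor that FLCE avoids materializing.
# All numbers below are assumptions for illustration.
batch_size = 1
seq_len = 4096            # assumed sequence length
vocab_size = 150_000      # roughly Qwen-family vocab size
bytes_per_element = 2     # bf16

logits_bytes = batch_size * seq_len * vocab_size * bytes_per_element
print(f"~{logits_bytes / 2**30:.1f} GiB of logits per forward pass")  # ~1.1 GiB
```

Against ~38.7 GiB per GPU that is only a few percent, which matches the near-identical readings you saw; the savings grow once batch size or seqlen makes the logits term dominant.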

@Tcc0403
Collaborator

Tcc0403 commented Oct 29, 2025

@mayankagarwals
Some info might be helpful:

@mayankagarwals
Contributor Author

mayankagarwals commented Oct 30, 2025

Hi @Tcc0403

Apologies for the delay, running a little low on time.

Current status:

- Qwen3VL: supported for rope, cross entropy, rmsnorm, flce -> all tests passing
- Qwen3VLMoe: supported for rope, cross entropy, rmsnorm, flce -> convergence failing for the following two tests:

```
FAILED test/convergence/fp32/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen3_vl_moe-32-0.0001-dtype5-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AssertionError: [Loss]Number of mismatched elements: 28
FAILED test/convergence/bf16/test_mini_models_with_logits.py::test_mini_model[mini_qwen3_vl_moe-32-1e-05-dtype9-0.01-0.05-0.1-0.01-0.01-0.01] - AssertionError: [Top k logprobs]Number of mismatched elements: 1
```

If you feel this is time sensitive, feel free to bring it home. I'll try to carve out more time soon and close this out.

@mayankagarwals
Contributor Author

Update:
This warrants a deeper analysis, but the issue was numeric instability of `liger_rotary_pos_emb` in bf16. After upcasting (similar to the vision path), RoPE works in bf16.
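
For reference, the upcasting pattern looks roughly like this (a sketch of the idea only, not the actual kernel change):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Standard rotary helper: swap the halves of the last dim and negate the second half.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope_fp32(q, k, cos, sin):
    # Do the rotation in fp32 to avoid bf16 round-off, then cast back.
    orig_dtype = q.dtype
    q32, k32, cos32, sin32 = q.float(), k.float(), cos.float(), sin.float()
    q_out = q32 * cos32 + rotate_half(q32) * sin32
    k_out = k32 * cos32 + rotate_half(k32) * sin32
    return q_out.to(orig_dtype), k_out.to(orig_dtype)
```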

So the following test is currently failing and still needs to be solved:

```
FAILED test/convergence/fp32/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen3_vl_moe-32-0.0001-dtype5-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AssertionError: [Loss]Number of mismatched elements: 28
```

@shimizust shimizust marked this pull request as ready for review November 1, 2025 16:02
@shimizust
Collaborator

Verified the bf16/fp32 convergence tests pass on h100. Fixed the other failing convergence tests. LGTM!

Collaborator

@shimizust shimizust left a comment


Thanks for all the effort in adding support for the Qwen3-VL model, @mayankagarwals @Tcc0403.

@shimizust shimizust merged commit 6cd9fd6 into linkedin:main Nov 1, 2025
Tcc0403 pushed a commit that referenced this pull request Nov 5, 2025
## Summary
Tiny fix re: #930; grad_weight and grad_bias were never set on the no_grad path.

The first commit is my fix; the second commit is from running `make checkstyle`, plus a small import change for `qwen3_vl`, as the import was broken and tests were not passing. I believe this was introduced in #911.

## Testing Done
Repro provided in #930 now passes:
```python
import torch
import torch.nn as nn
from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss


vocab_size, hidden_dim, num_tokens = 1000, 512, 256
device = "cuda" if torch.cuda.is_available() else "cpu"

linear = nn.Linear(hidden_dim, vocab_size, bias=False).to(device)
fused_loss_fn = LigerFusedLinearCrossEntropyLoss()

hidden_states = torch.randn(num_tokens, hidden_dim, device=device)
labels = torch.randint(0, vocab_size, (num_tokens,), device=device)

with torch.no_grad():
    loss = fused_loss_fn(linear.weight, hidden_states, labels)
    print(f"Loss: {loss.item()}")
```

- Hardware Type: 3090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

> 2110 passed, 255 skipped, 41 warnings, 1 rerun in 276.17s (0:04:36)
@matthewdm0816

Hi, is it possible to add SwiGLU support for Qwen3-VL? It seems that currently passing `swiglu=True` in the monkey patch function is a no-op.

@Tcc0403
Collaborator

Tcc0403 commented Nov 28, 2025

@matthewdm0816 Feel free to open an issue for it! I'll work on it.

@Leun9

Leun9 commented Dec 3, 2025

@matthewdm0816 The RoPE operator seems to have some issues.

Trainer + Qwen3-VL-8B:

```
[rank6]: return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
[rank6]: triton.compiler.errors.CompilationError: at 69:27:
[rank6]:     cos_offsets = tl.arange(0, pad_hd // 2)
[rank6]:     cos_mask = cos_offsets < hd // 2
[rank6]:     cos_row = tl.load(cos + cos_offsets, mask=cos_mask, other=0)
[rank6]:     sin_row = tl.load(sin + cos_offsets, mask=cos_mask, other=0)
[rank6]:     # ####################################################################
[rank6]:     # Load the left and right half of q and k for the current
[rank6]:     # program instance (i.e. for the current token) separately
[rank6]:     # ####################################################################
[rank6]:     # left half of the head
[rank6]:     first_half_q_offsets = tl.arange(0, pad_n_qh)[:, None] * hd + tl.arange(0, pad_hd // 2)[None, :]
[rank6]:                            ^
[rank6]: ValueError('numel (2097152) exceeds triton maximum tensor numel (1048576)')
```
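
(For anyone hitting this before the kernel is fixed, one possible stopgap, based on the patch flags discussed earlier in this thread, is to leave the Liger RoPE kernel off and keep only FLCE; a sketch, not a fix for the kernel itself:)

```python
from liger_kernel.transformers import apply_liger_kernel_to_qwen3_vl

# Workaround sketch: skip the Liger RoPE kernel and keep only the
# fused linear cross entropy patch until the Triton numel issue is resolved.
apply_liger_kernel_to_qwen3_vl(
    rope=False,
    fused_linear_cross_entropy=True,
)
```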

@casper-hansen

@Leun9 I got the same issue, so I opened #964.

