Memory Optimization with Liger Kernel Shows Limited Effect on Larger Models (More Than 7B) #517

I have been using Liger Kernel to replace the standard operators when training the Qwen2.5 model series with the DeepSpeed ZeRO-3 strategy.
It significantly reduces memory usage on a 7B model (about 36%); however, it yields only a limited memory saving (about 6%) on a 14B model.

Questions:
1. Are there known limitations in Liger Kernel optimizations for larger models like 14B?
2. Is there any recommended configuration or parameter adjustment to improve memory efficiency for larger models?
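
One possible explanation I would like to confirm or rule out: as far as I understand, a large part of the saving comes from not materializing the full logits tensor in the loss computation, and that tensor depends on batch size, sequence length, and vocabulary size rather than on parameter count. If so, the absolute saving would stay roughly constant while the baseline peak grows with model size, so the percentage would shrink. Below is a rough back-of-the-envelope sketch of that reasoning. The batch size, sequence length, vocabulary size, and peak-memory figures are illustrative assumptions, not measurements from my runs.

```python
GIB = 1024 ** 3


def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_elem=4):
    """Bytes needed to hold one full (batch, seq, vocab) logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_elem


def saving_pct(saved_bytes, baseline_peak_bytes):
    """Saved memory as a percentage of the baseline peak."""
    return 100.0 * saved_bytes / baseline_peak_bytes


# Assumed workload: micro-batch 4, sequence length 4096, fp32 logits.
# The vocabulary size is an assumption for illustration; check the model config.
saved = logits_bytes(batch_size=4, seq_len=4096, vocab_size=152064)

# Hypothetical per-GPU baseline peaks (GiB); replace with measured values.
hypothetical_peaks = {"smaller model": 30, "larger model": 70}

for label, peak_gib in hypothetical_peaks.items():
    pct = saving_pct(saved, peak_gib * GIB)
    print(f"{label}: ~{saved / GIB:.1f} GiB saved, {pct:.1f}% of a {peak_gib} GiB peak")
```

With identical workload settings, the same absolute saving is a smaller fraction of a larger baseline peak. This only covers the logits tensor and ignores activations, ZeRO-3 parameter gathering, and optimizer state, so it may not fully explain the gap I am seeing.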
