I have been using Liger Kernel to replace the standard operators when training the Qwen2.5 model series with the DeepSpeed ZeRO-3 strategy.
It significantly reduces memory usage on a 7B model (about 36%). However, it shows only a limited memory saving (about 6%) on a 14B model.
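To make the percentages above unambiguous: I am assuming here that they are relative reductions in peak GPU memory compared with the run without Liger Kernel. A minimal sketch of that calculation follows. The function name `relative_saving` and the numbers in it are hypothetical placeholders for illustration, not actual measurements.

```python
def relative_saving(baseline_gib: float, optimized_gib: float) -> float:
    """Return the fractional reduction in peak memory versus the baseline run."""
    if baseline_gib <= 0:
        raise ValueError("baseline_gib must be positive")
    return (baseline_gib - optimized_gib) / baseline_gib


# Placeholder values only (assumption): baseline peak vs. peak with patched kernels.
saving = relative_saving(baseline_gib=100.0, optimized_gib=64.0)
print(f"relative saving: {saving:.0%}")
```

In practice the two inputs would be the peak memory readings from otherwise identical runs (same batch size, sequence length, and ZeRO-3 settings), so that the kernels are the only difference between them.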
Questions:
Are there known limitations in Liger Kernel optimizations for larger models like 14B?
Is there any recommended configuration or parameter adjustment to improve memory efficiency for larger models?