
Awesome Multimodal Token Compression

License: MIT PRs Welcome arXiv Last Commit

[arXiv] [HuggingFace] [Database]

When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios [arXiv]
Kele Shao*,1,2, Keda Tao*,1,2, Kejia Zhang3, Sicheng Feng2,4, Mu Cai5, Yuzhang Shang6, Haoxuan You7, Can Qin8, Yang Sui9, Huan Wang†,2

1Zhejiang University, 2Westlake University, 3Xiamen University, 4National University of Singapore, 5University of Wisconsin-Madison, 6University of Central Florida, 7Columbia University, 8Salesforce AI Research, 9Rice University

* Equal Contribution. † Corresponding Author ([email protected]).


Important

We welcome your help in improving the repository and paper. Please feel free to submit a pull request or contact us to:

  • Add a relevant paper not yet included.

  • Suggest a more suitable category.

  • Update the information.

  • Ask for clarification about any content.


🔥 News

🎯 Motivation

Awesome Token Compression

Motivation. Upper: Image, video, and audio data can scale in their representation dimensions, which increases the number of tokens accordingly. Lower: Top-performing MLLMs cannot yet meet real-world demands, because the number of tokens for multimodal inputs, especially video, vastly exceeds that of text. Token compression is therefore crucial to address this limitation.

📌 Citation

If you find our paper or this resource helpful, please consider citing:

@article{shao2025tokens,
  title={When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios},
  author={Shao, Kele and Tao, Keda and Zhang, Kejia and Feng, Sicheng and Cai, Mu and Shang, Yuzhang and You, Haoxuan and Qin, Can and Sui, Yang and Wang, Huan},
  journal={arXiv preprint arXiv:2507.20198},
  year={2025}
}

📚 Contents

Please check out all the papers by selecting the sub-area you're interested in. On this main page, only papers released in the past 6 months are shown.


Badge Colors

  • Red badge: arXiv papers
  • Blue badge: conference/journal papers
  • White badge: GitHub repositories
  • Purple badge: research areas
  • Green badge: categories
  • Yellow badge: training cost

Recent Papers (Last 6 Months)

Image
Title & Authors Areas Tags Links
Arxiv
CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning
Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang
Area Type Type
Cost
Paper
Arxiv
AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance
Weichen Zhang, Zhui Zhu, Ningbo Li, Kebin Liu, Yunhao Liu
Area Type
Cost
Paper
Arxiv
Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models
Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin
Area Cost Paper
Publish Star
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang
Area Area Type
Cost
Paper
GitHub
Arxiv Star
A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou
Area Type
Cost
Paper
GitHub
Model
Publish
Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models
Mingyu Fu, Wei Suo, Ji Ma, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
Area Cost Paper
Arxiv
HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models
Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen
Area Type
Cost
Paper
Arxiv
FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning
Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Zhuo Li, Yang Wang, Liyun Li, Xianming Liu, Ming Lu, Shanghang Zhang
Area Cost Paper
Publish Star
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian
Area Type Type
Cost
Paper
GitHub
Arxiv Star
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang
Area Type
Cost
Paper
GitHub
Arxiv Star
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
Area Area Area Area Paper
GitHub
Arxiv
Efficient Whole Slide Pathology VQA via Token Compression
Weimin Lyu, Qingqiao Hu, Kehan Qi, Zhan Shi, Wentao Huang, Saumya Gupta, Chao Chen
Area Type
Cost
Paper
Arxiv
Training-free Token Reduction for Vision Mamba
Qiankun Ma, Ziyao Zhang, Chi Su, Jie Chen, Zhen Song, Hairong Zheng, Wen Gao
Area Cost Paper
Arxiv Star
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
Area Type
Cost
Paper
GitHub
Model
Dataset
Publish
LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models
Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng
Area Area Type
Cost
Paper
Publish
ToSA: Token Merging with Spatial Awareness
Hsiang-Wei Huang, Wenhao Chai, Kuang-Ming Chen, Cheng-Yen Yang, Jenq-Neng Hwang
Area Type
Cost
Paper
Arxiv Star
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
Area Area Type
Cost
Paper
GitHub
Arxiv
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu
Area Area Type
Cost
Paper
Publish Star
EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Visual-Language Models
Zekun Wang, Minghua Ma, Zexin Wang, Rongchuan Mu, Liping Shan, Ming Liu, Bing Qin
Area Area Area Paper
GitHub
Arxiv Star
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu
Area Area Type Type
Cost
Paper
GitHub
Arxiv
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen
Area Type Type
Cost
Paper
Publish Star
CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen
Area Area Type
Cost
Paper
GitHub
Arxiv Star
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
Area Area Area Paper
GitHub
Arxiv Star
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality
Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
Area Area Area Area Paper
GitHub
Arxiv
Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering
Yangfu Li, Hongjian Zhan, Tianyi Chen, Qi Liu, Yue Lu
Area Type
Cost
Paper
Arxiv Star
Seed1.5-VL Technical Report
Seed Team
Area Area Type
Cost
Paper
GitHub
Arxiv Star
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu
Area Area Type
Cost
Paper
GitHub
Publish Star
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou
Area Area Type Type
Cost
Paper
GitHub
Arxiv
QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA
Shuai Li, Jian Xu, Xiao-Hui Li, Chao Deng, Lin-Lin Huang
Area Type
Cost
Paper
Publish Star
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan
Area Type
Cost
Paper
GitHub
Arxiv Star
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, Huo Cao
Area Area Type Type
Cost
Paper
GitHub
Model
Arxiv Star
Qwen2.5-Omni Technical Report
Qwen Team
Area Area Area Type
Cost
Paper
GitHub
Model
Publish
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan
Area Area Type
Cost
Paper
Publish
Growing a Twig to Accelerate Large Vision-Language Models
Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu
Area Type
Cost
Paper
Arxiv Star
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen
Area Type
Cost
Paper
GitHub
Publish Star
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang
Area Area Type
Cost
Paper
GitHub
Publish Star
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang
Area Area Type Type
Cost
Paper
GitHub
Arxiv Star
Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More
Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang
Area Area Type
Cost Cost
Paper
GitHub
Video
Title & Authors Areas Tags Links
Publish Star
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang
Area Area Type
Cost
Paper
GitHub
Arxiv Star
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
Area Area Area Area Paper
GitHub
Arxiv
EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent
Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen
Area Type
Cost
Paper
Publish Star
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video-LLMs
Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim
Area Type
Cost
Paper
GitHub
Arxiv Star
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang
Area Cost Paper
GitHub
Dataset
Publish
AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye, Gaoang Wang
Area Type
Cost
Paper
Publish
LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models
Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng
Area Area Type
Cost
Paper
Arxiv Star
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video-LLMs
Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
Area Type
Cost
Paper
GitHub
Arxiv Star
Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification
Minghao Qin, Xiangrui Liu, Zhengyang Liang, Yan Shu, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu
Area Type Type
Cost
Paper
GitHub
Model
Arxiv Star
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
Area Area Type
Cost
Paper
GitHub
Arxiv
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
Hongzhi Zhang, Jingyuan Zhang, Xingguang Ji, Qi Wang, Fuzheng Zhang
Area Type
Cost
Paper
Arxiv Star
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding
Mengyue Wang, Shuo Chen, Kristian Kersting, Volker Tresp, Yunpu Ma
Area Type Type Type
Cost
Paper
GitHub
Arxiv
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu
Area Area Type
Cost
Paper
Arxiv Star
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
Yunzhu Zhang, Yu Lu, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu
Area Type Type
Cost
Paper
GitHub
Publish Star
EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Visual-Language Models
Zekun Wang, Minghua Ma, Zexin Wang, Rongchuan Mu, Liping Shan, Ming Liu, Bing Qin
Area Area Area Paper
GitHub
Arxiv Star
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu
Area Area Type Type
Cost
Paper
GitHub
Arxiv Star
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
Area Type Type
Cost
Paper
GitHub
Arxiv
AdaTP: Attention-Debiased Token Pruning for Video Large Language Models
Fengyuan Sun, Leqi Shen, Hui Chen, Sicheng Zhao, Jungong Han, Guiguang Ding
Area Type Type
Cost
Paper
Publish Star
CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen
Area Area Type
Cost
Paper
GitHub
Arxiv Star
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
Area Area Area Paper
GitHub
Arxiv Star
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality
Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
Area Area Area Area Paper
GitHub
Arxiv Star
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang
Area Type
Cost
Paper
GitHub
Arxiv Star
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Xuyang Liu, Yiyu Wang, Junpeng Ma, Linfeng Zhang
Area Type
Cost
Paper
GitHub
Arxiv Star
Seed1.5-VL Technical Report
Seed Team
Area Area Type
Cost
Paper
GitHub
Publish Star
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun
Area Type
Cost
Paper
GitHub
Model
Dataset
Arxiv Star
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu
Area Area Type
Cost
Paper
GitHub
Publish Star
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou
Area Area Type Type
Cost
Paper
GitHub
Publish Star
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan
Area Type Type
Cost
Paper
GitHub
Model
Arxiv Star
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, Huo Cao
Area Area Type Type
Cost
Paper
GitHub
Model
Arxiv Star
Qwen2.5-Omni Technical Report
Qwen Team
Area Area Area Type
Cost
Paper
GitHub
Model
Arxiv
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, Afshin Dehghan
Area Type
Cost
Paper
Arxiv Star
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, Bo Zhao
Area Type Type
Cost
Paper
GitHub
Model
Publish
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan
Area Area Type
Cost
Paper
Arxiv
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory
Saket Gurukar, Asim Kadav
Area Type
Cost
Paper
Publish Star
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, Jiaya Jia
Area Area Paper
GitHub
Dataset
Arxiv Star
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding
Area Type Type
Cost
Paper
GitHub
Arxiv
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu, Jingwei Sun, Yueqian Lin, Jingyang Zhang, Ming Yin, Qinsi Wang, Jianyi Zhang, Hai Li, Yiran Chen
Area Paper
Arxiv
Token-Efficient Long Video Understanding for Multimodal LLMs
Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
Area Type
Cost
Paper
Publish Star
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang
Area Area Type
Cost
Paper
GitHub
Publish Star
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang
Area Area Type Type
Cost
Paper
GitHub
Arxiv Star
Qwen2.5-VL Technical Report
Qwen Team
Area Type
Cost
Paper
GitHub
Model
Arxiv Star
Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More
Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang
Area Area Type
Cost Cost
Paper
GitHub
Audio
Title & Authors Areas Tags Links
Arxiv Star
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
Area Area Area Area Paper
GitHub
Arxiv Star
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality
Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
Area Area Area Area Paper
GitHub
Publish Star
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, Xixin Wu, Helen Meng
Area Type
Cost
Paper
GitHub
Publish Star
Token Pruning in Audio-Transformers: Optimizing Performance and Decoding Patch Importance
Taehan Lee, Hyukjun Lee
Area Type
Cost
Paper
GitHub
Model
Arxiv Star
Qwen2.5-Omni Technical Report
Qwen Team
Area Area Area Type
Cost
Paper
GitHub
Model
Publish Star
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro
Area Type
Cost
Paper
GitHub
Arxiv
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
Umberto Cappellazzo, Minsu Kim, Stavros Petridis
Area Type
Cost
Paper
Arxiv Star
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen
Area Type
Cost
Paper
GitHub
Model

Published in Recent Conference/Journal

ICCV 2025
Title & Authors Areas Tags Links
Publish Star
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang
Area Area Type
Cost
Paper
GitHub
Publish Star
Representation Shift: Unifying Token Compression with FlashAttention
Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim
Area Type
Cost
Paper
GitHub
Publish Star
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian
Area Type Type
Cost
Paper
GitHub
Publish Star
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video-LLMs
Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim
Area Type
Cost
Paper
GitHub
Publish
AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye, Gaoang Wang
Area Type
Cost
Paper
Publish Star
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan
Area Type
Cost
Paper
GitHub
Publish
Growing a Twig to Accelerate Large Vision-Language Models
Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu
Area Type
Cost
Paper
Publish Star
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, Jiaya Jia
Area Area Paper
GitHub
Dataset
Publish Star
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, Enzo Tartaglione
Area Area Type
Cost
Paper
GitHub
Publish Star
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
Area Type Type Paper
GitHub
Publish Star
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
Area Type Type
Cost
Paper
GitHub
Publish Star
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia
Area Area Area Type
Cost
Paper
GitHub
Model
Dataset
Publish Star
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
Area Type Type
Cost
Paper
GitHub
Publish Star
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang
Area Area Type Type
Cost
Paper
GitHub
Publish
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
Area Area Type
Cost
Paper
Publish Star
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan
Area Area Type Type
Cost
Paper
GitHub
ACL 2025
Title & Authors Areas Tags Links
Publish Star
EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Visual-Language Models
Zekun Wang, Minghua Ma, Zexin Wang, Rongchuan Mu, Liping Shan, Ming Liu, Bing Qin
Area Area Area Paper
GitHub
Publish Star
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro
Area Type
Cost
Paper
GitHub
Publish Star
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang
Area Area Type Type
Cost
Paper
GitHub
Publish Star
PruneVid: Visual Token Pruning for Efficient Video Large Language Models
Xiaohu Huang, Hao Zhou, Kai Han
Area Type Type
Cost
Paper
GitHub
Publish Star
Prompt Compression for Large Language Models: A Survey
Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier
Area Area Paper
GitHub
ICML 2025
Title & Authors Areas Tags Links
Publish Star
CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen
Area Area Type
Cost
Paper
GitHub
Publish Star
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, Xixin Wu, Helen Meng
Area Type
Cost
Paper
GitHub
Publish Star
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan
Area Type Type
Cost
Paper
GitHub
Model
Publish Star
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra
Area Type Type
Cost
Paper
GitHub
Model
Publish Star
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang
Area Area Type Type
Cost
Paper
GitHub
ACM MM 2025
Title & Authors Areas Tags Links
Publish
Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models
Mingyu Fu, Wei Suo, Ji Ma, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
Area Cost Paper
Publish Star
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun
Area Type
Cost
Paper
GitHub
Model
Dataset

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

This repository is inspired by Awesome-Efficient-Reasoning-Models, Awesome-Efficient-LLM, and Awesome-Context-Engineering.

🧑‍💻 Contributors

👏 Thanks to these contributors for this excellent work!

✉️ Contact

For questions, suggestions, or collaboration opportunities, please feel free to reach out:

✉️ Email: [email protected] / [email protected]