This repository is the official implementation of Folder (ICCV2025) and Turbo (ECCV2024 oral)
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance (ICCV2025) [Paper]
Haicheng Wang*, Zhemeng Yu*, Gabriele Spadaro, Chen Ju, Shuai Xiao, Victor Quétu, Enzo Tartaglione✉️ (*Equal Contribution)
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models (ECCV 2024, Oral) [Paper]
Chen Ju*, Haicheng Wang*, Haozhe Cheng, Xu Chen, Zhonghua Zhai, Weilin Huang, Jinsong Lan, Shuai Xiao✉️, Bo Zheng (*Equal Contribution)
- 🔥 Universal Acceleration for Various VLMs: applicable to various types of VLMs, including CLIP-like VLAs, diffusion models, and MLLMs.
- 🔥 Performance Maintenance: accelerates throughput by 1.6-2.0X with only a minor performance drop.
- 🔥 Plug-and-Play: can be directly applied to most VLMs without retraining, and can also be used for training acceleration. Very easy to implement (10-min-ready).
🚀 [2025/6/26] FOLDER has been accepted by ICCV2025.
🚀 [2025/2/9] We release the code for BLIP and MLLMs (LLaVA1.5, Minigptv2, VITA1.5, VILA1.5, WePOINTs1.5, VideoLLaVA).
🚀 [2024/7/3] Turbo has been accepted by ECCV2024 as an oral presentation.
- Turbo for ViT
- Turbo for Stable Diffusion
- Checkpoints of Folder retrained models
To set up the environment, please follow the instructions of BLIP, VLMEvalKit and the corresponding MLLMs (LLaVA1.5, Minigptv2, Video-LLaVA, VITA1.5, VILA1.5, WePOINTS1.5).
Please first clone our repo from GitHub by running the following commands.
git clone https://github.com/anakin-skywalker-Joseph/Folder.git
cd Folder
The implementation of Turbo on BLIP is in BLIP_turbo, while BLIP_folder contains the Folder version, which better accelerates BLIP on the captioning task. For example, for BLIP_folder, after setting the image folder path in BLIP_folder/configs/caption_coco_base.yaml, go into BLIP_folder and run bash run_caption.sh to reproduce the results. A similar setup can be done in BLIP_turbo for various tasks.
Folder is an upgraded version of Turbo for MLLM acceleration, merging tokens in the last layer. We provide complete implementations for LLaVA1.5, Video-LLaVA and Minigptv2. You can modify the reduction ratio in Lines 33-34 for LLaVA1.5, Lines 71-72 for Minigptv2 and Lines 204-205 for Video-LLaVA. alphavalue is the hyperparameter balancing mutual redundancy against semantic value in Turbo, and rvalue controls the number of tokens reduced (number-of-reduced-tokens = rvalue * num_layer; e.g., for LLaVA1.5, rvalue=16 corresponds to a 66% reduction ratio and rvalue=18 to a 75% reduction ratio); the arithmetic is sketched below.
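As a quick sanity check of these numbers, here is a minimal sketch of the arithmetic, assuming LLaVA1.5's standard CLIP ViT-L/14-336 vision tower (24 transformer layers, 576 patch tokens); the variable names are illustrative only:

```python
# Back-of-the-envelope check of the reduction ratio for LLaVA1.5,
# assuming a CLIP ViT-L/14-336 vision tower: 24 layers, (336/14)**2 = 576 patch tokens.
num_layer = 24
num_visual_tokens = 576

for rvalue in (16, 18):
    reduced = rvalue * num_layer             # number-of-reduced-tokens = rvalue * num_layer
    ratio = reduced / num_visual_tokens      # fraction of visual tokens removed
    kept = num_visual_tokens - reduced       # tokens actually fed to the LLM
    print(f"rvalue={rvalue}: remove {reduced} tokens ({ratio:.1%}), keep {kept}")

# rvalue=16: remove 384 tokens (66.7%), keep 192
# rvalue=18: remove 432 tokens (75.0%), keep 144
```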
Folder can also accelerate training (serving as an alternative to pixel-shuffle/avg-pooling or a regularization term). We offer training code for LLaVA1.5. It is sufficient to replace the llava folder in the LLaVA repo with ours and to set the reduction ratio as before in Lines 33-34.
Although the implementation of Turbo/Folder for MLLMs is rather simple, it still needs to be adapted to different vision encoder architectures (and some models already contain token reduction operations such as pooling/pixel-shuffle, which may cause problems). To minimize the deployment effort, we offer a simplified version of Folder in folder.py, together with several implementation examples in folder_example. It is sufficient to insert the function merge_features at any desired place for token reduction (e.g., before/after the projection layer); a minimal usage sketch is given after the signature below.
merge_features(image_features, metric=None, size=None, r=1, class_token=True)
# image_features: (bs, seq_len, hidden_dim)
# metric: (bs, seq_len, metric_dim) set to image_features itself if not specified
# size: default set to None
# r: number of tokens to be reduced (e.g. 300)
# class_token: whether the visual sequence contains class/cls token
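For illustration, here is a minimal usage sketch on random features, assuming folder.py from this repo is importable and that merge_features returns the reduced token sequence as described above (the dummy shapes are illustrative, not required):

```python
import torch

from folder import merge_features  # simplified Folder implementation in this repo

# Dummy visual features: batch of 2, 576 patch tokens (LLaVA1.5-style), hidden dim 1024.
image_features = torch.randn(2, 576, 1024)

# Reduce 300 tokens in one shot, e.g. right before/after the multimodal projection layer.
# This dummy sequence has no class token, so class_token=False;
# metric defaults to the features themselves when left unspecified.
reduced_features = merge_features(image_features, r=300, class_token=False)

print(image_features.shape)    # torch.Size([2, 576, 1024])
print(reduced_features.shape)  # expected torch.Size([2, 276, 1024]), assuming only the merged features are returned
```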
- We strongly recommend using this simplified version for deployment/comparison.
We leverage VLMEvalKit for evaluation. Please refer to the repo's instructions and replace the related files with ours. Normally, you can go to the corresponding repo and run the following command to build the environment.
pip install -e .
- VLMEvalKit: a fantastic MLLM evaluation toolkit.
- ToMe: our code is based on ToMe. Thanks for this wonderful work.
- Credit to BLIP, LLaVA1.5, Minigptv2, Video-LLaVA, VITA1.5, VILA1.5, WePOINTS1.5 for their open-source VLMs/MLLMs.
If you find our work helpful for your research, please consider citing:
@inproceedings{ju2024turbo,
  title={Turbo: Informativity-driven acceleration plug-in for vision-language large models},
  author={Ju, Chen and Wang, Haicheng and Cheng, Haozhe and Chen, Xu and Zhai, Zhonghua and Huang, Weilin and Lan, Jinsong and Xiao, Shuai and Zheng, Bo},
  booktitle={European Conference on Computer Vision},
  pages={436--455},
  year={2024},
  organization={Springer}
}

@article{wang2025folder,
  title={FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance},
  author={Wang, Haicheng and Yu, Zhemeng and Spadaro, Gabriele and Ju, Chen and Qu{\'e}tu, Victor and Tartaglione, Enzo},
  journal={arXiv preprint arXiv:2501.02430},
  year={2025}
}