This repository is the official implementation of Folder (ICCV2025) and Turbo (ECCV2024 oral)
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance (ICCV2025) [Paper]
Haicheng Wang*, Zhemeng Yu*, Gabriele Spadaro, Chen Ju, Shuai Xiao, Victor Quétu, Enzo Tartaglione✉️ (*Equal Contribution)
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models (ECCV 2024, Oral) [Paper]
Chen Ju*, Haicheng Wang*, Haozhe Cheng, Xu Chen, Zhonghua Zhai, Weilin Huang, Jinsong Lan, Shuai Xiao✉️, Bo Zheng (*Equal Contribution)
- 🔥 Universal Acceleration for Various VLMs: applicable to various types of VLMs, including CLIP-like VLAs, diffusion models, and MLLMs.
- 🔥 Performance Maintenance: accelerates throughput by 1.6-2.0X with only a minor performance drop.
- 🔥 Plug-and-Play: can be directly applied to most VLMs without retraining, and can also be used for training acceleration. Very easy to implement (10-min-ready).
🚀 [2025/6/26] FOLDER has been accepted by ICCV2025.
🚀 [2025/2/9] We release the code for BLIP and MLLMs (LLaVA1.5, Minigptv2, VITA1.5, VILA1.5, WePOINTs1.5, VideoLLaVA).
🚀 [2024/7/3] Turbo has been accepted by ECCV2024 as an oral presentation.
- Turbo for ViT
- Turbo for Stable Diffusion
- Checkpoints of Folder retrained models
To set up the environment, please follow the instructions of BLIP, VLMEvalKit and the corresponding MLLMs (LLaVA1.5, Minigptv2, Video-LLaVA, VITA1.5, VILA1.5, WePOINTS1.5).
Please first clone our repo from GitHub by running the following commands.
git clone https://github.com/anakin-skywalker-Joseph/Folder.git
cd Folder
The implementation of Turbo on BLIP is in BLIP_turbo, while BLIP_folder contains the Folder version, which better accelerates BLIP on the captioning task. For example, for BLIP_folder, after setting the image folder path in BLIP_folder/configs/caption_coco_base.yaml, go into BLIP_folder and run bash run_caption.sh to reproduce the results. A similar setup can be done in BLIP_turbo for various tasks.
Folder is an upgraded version of Turbo for MLLM acceleration, merging tokens in the last layer. We provide complete implementations for LLaVA1.5, Video-LLaVA and Minigptv2. You can modify the reduction ratio in Lines 33-34 for LLaVA1.5, Lines 71-72 for Minigptv2 and Lines 204-205 for Video-LLaVA. alphavalue is the hyperparameter balancing mutual redundancy against semantic value in Turbo, and rvalue controls the number of tokens reduced (number-of-reduced-tokens = rvalue * num_layer; e.g., for LLaVA1.5, rvalue=16 corresponds to a 66% reduction ratio and rvalue=18 to a 75% reduction ratio); the arithmetic is sketched below.
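As a quick sanity check of these numbers, here is a minimal sketch of the arithmetic, assuming LLaVA1.5's standard CLIP ViT-L/14-336 vision tower (24 transformer layers, 576 patch tokens); the variable names are illustrative only:

```python
# Back-of-the-envelope check of the reduction ratio for LLaVA1.5,
# assuming a CLIP ViT-L/14-336 vision tower: 24 layers, (336/14)**2 = 576 patch tokens.
num_layer = 24
num_visual_tokens = 576

for rvalue in (16, 18):
    reduced = rvalue * num_layer             # number-of-reduced-tokens = rvalue * num_layer
    ratio = reduced / num_visual_tokens      # fraction of visual tokens removed
    kept = num_visual_tokens - reduced       # tokens actually fed to the LLM
    print(f"rvalue={rvalue}: remove {reduced} tokens ({ratio:.1%}), keep {kept}")

# rvalue=16: remove 384 tokens (66.7%), keep 192
# rvalue=18: remove 432 tokens (75.0%), keep 144
```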
Folder can also accelerate training (serving as an alternative to pixel-shuffle/avg-pooling or a regularization term). We offer training code for LLaVA1.5. It is sufficient to replace the llava folder in the LLaVA repo with ours and to set the reduction ratio as before in Lines 33-34.
Although the implementation of Turbo/Folder for MLLMs is rather simple, it still needs to be adapted to different vision encoder architectures (and some models already contain token reduction operations such as pooling/pixel-shuffle, which may cause problems). To minimize the deployment effort, we offer a simplified version of Folder in folder.py, together with several implementation examples in folder_example. It is sufficient to insert the function merge_features at any desired place for token reduction (e.g., before/after the projection layer); a minimal usage sketch is given after the signature below.
merge_features(image_features, metric=None, size=None, r=1, class_token=True)
# image_features: (bs, seq_len, hidden_dim)
# metric: (bs, seq_len, metric_dim) set to image_features itself if not specified
# size: default set to None
# r: number of tokens to be reduced (e.g. 300)
# class_token: whether the visual sequence contains class/cls token
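For illustration, here is a minimal usage sketch on random features, assuming folder.py from this repo is importable and that merge_features returns the reduced token sequence as described above (the dummy shapes are illustrative, not required):

```python
import torch

from folder import merge_features  # simplified Folder implementation in this repo

# Dummy visual features: batch of 2, 576 patch tokens (LLaVA1.5-style), hidden dim 1024.
image_features = torch.randn(2, 576, 1024)

# Reduce 300 tokens in one shot, e.g. right before/after the multimodal projection layer.
# This dummy sequence has no class token, so class_token=False;
# metric defaults to the features themselves when left unspecified.
reduced_features = merge_features(image_features, r=300, class_token=False)

print(image_features.shape)    # torch.Size([2, 576, 1024])
print(reduced_features.shape)  # expected torch.Size([2, 276, 1024]), assuming only the merged features are returned
```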
- We strongly recommend using this simplified version for deployment/comparison.
We leverage VLMEvalKit for evaluation. Please refer to the repo's instructions and replace the related files with ours. Normally, you can go to the corresponding repo and run the following command to build the environment.
pip install -e .
- VLMEvalKit: a fantastic MLLM evaluation toolkit.
- ToMe: our code is based on ToMe. Thanks for this wonderful work.
- Credit to BLIP, LLaVA1.5, Minigptv2, Video-LLaVA, VITA1.5, VILA1.5, WePOINTS1.5 for their open-source VLMs/MLLMs.
If you find our work helpful for your research, please consider citing:
@inproceedings{ju2024turbo,
  title={Turbo: Informativity-driven acceleration plug-in for vision-language large models},
  author={Ju, Chen and Wang, Haicheng and Cheng, Haozhe and Chen, Xu and Zhai, Zhonghua and Huang, Weilin and Lan, Jinsong and Xiao, Shuai and Zheng, Bo},
  booktitle={European Conference on Computer Vision},
  pages={436--455},
  year={2024},
  organization={Springer}
}

@article{wang2025folder,
  title={FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance},
  author={Wang, Haicheng and Yu, Zhemeng and Spadaro, Gabriele and Ju, Chen and Qu{\'e}tu, Victor and Tartaglione, Enzo},
  journal={arXiv preprint arXiv:2501.02430},
  year={2025}
}