The official repository for the paper "FlexSelect: Flexible Token Selection for Efficient Long Video Understanding".
Webpage | Paper | Huggingface
We present FlexSelect, a flexible and efficient token selection method that leverages cross-modal attention scores in VideoLLMs to identify query-relevant visual tokens. Our approach combines: (1) training-free attention-based token ranking, and (2) a lightweight selector for fast filtering.
We conduct experiments on three VideoLLMs (LLaVA-Video, Qwen2.5VL, InternVL2.5) across four benchmarks: LongVideoBench, VideoMME, LVBench, and MLVU.
Model | Size | VideoMME (Long) | VideoMME (Overall) | MLVU (M-Avg) | LongVB (Val) | LVBench (Test) |
---|---|---|---|---|---|---|
Proprietary Models | ||||||
GPT-4o | - | 65.3 | 71.9 | 64.6 | 66.7 | 34.7 |
Gemini-1.5-Pro | - | 67.4 | 75.0 | - | 64.0 | 33.1 |
Open-Source VideoLLMs | ||||||
mPLUG-Owl3 | 7B | 50.1 | 59.3 | 63.7 | 52.1 | 43.5 |
Qwen2-VL | 7B | 53.8 | 63.3 | 66.9 | 55.6 | 42.4 |
NVILA | 8B | 54.8 | 64.2 | 70.1 | 57.7 | - |
VideoLLaMA3 | 7B | - | 66.2 | 73.0 | 59.8 | 45.3 |
Aria | 8×3.5B | 58.8 | 67.6 | 70.6 | 65.3 | - |
Oryx-1.5 | 34B | 59.3 | 67.3 | 72.3 | 62.0 | 30.8 |
Video-XL-Pro | 3B | - | 60.0 | 70.6 | 56.7 | - |
SF-LLaVA-1.5 | 7B | - | 63.9 | 71.5 | 62.5 | 45.3 |
TPO | 7B | 55.4 | 65.6 | 71.1 | 60.1 | - |
Quato | 7B | 55.7 | 65.9 | 71.9 | 59.0 | - |
ViLAMP | 7B | 57.8 | 67.5 | 72.6 | 61.2 | 45.2 |
LLaVA-Video | 7B | 52.9 | 64.4 | 68.6 | 58.2 | 43.1 |
+ FlexSelect | 7B | 59.8 (↑6.9) | 68.9 (↑4.5) | 73.2 (↑4.6) | 61.9 (↑3.7) | 52.9 (↑9.8) |
+ FlexSelect-Lite | 7B | 58.3 (↑5.4) | 68.3 (↑3.9) | 71.8 (↑3.2) | 60.7 (↑2.5) | 52.2 (↑9.1) |
InternVL2.5 | 8B | 52.8 | 64.2 | 68.9 | 59.5 | 43.4 |
+ FlexSelect | 8B | 58.1 (↑5.3) | 67.0 (↑2.8) | 71.9 (↑3.0) | 60.1 (↑0.6) | 49.7 (↑6.3) |
+ FlexSelect-Lite | 8B | 57.9 (↑5.1) | 67.2 (↑3.0) | 71.9 (↑3.0) | 61.2 (↑1.7) | 49.9 (↑6.5) |
Qwen2.5-VL | 7B | 55.6 | 65.4 | 70.2 | 59.5 | 45.3 |
+ FlexSelect | 7B | 59.3 (↑3.7) | 68.2 (↑2.8) | 72.5 (↑2.3) | 62.4 (↑2.9) | 51.2 (↑5.9) |
+ FlexSelect-Lite | 7B | 58.6 (↑3.0) | 67.4 (↑2.0) | 70.3 (↑0.1) | 61.9 (↑2.4) | 50.0 (↑4.7) |
LLaVA-Video | 72B | 61.9 | 70.0 | 71.2 | 62.4 | 45.5 |
+ FlexSelect | 72B | 66.1 (↑4.2) | 73.1 (↑3.1) | 76.0 (↑4.8) | 66.9 (↑4.5) | 55.5 (↑10.0) |
Qwen2.5-VL | 72B | 63.9 | 73.4 | 76.3 | 66.2 | 47.3 |
+ FlexSelect | 72B | 66.9 (↑3.0) | 74.4 (↑1.0) | 76.6 (↑0.3) | 66.4 (↑0.2) | 56.6 (↑9.3) |
All four benchmarks can be downloaded from Hugging Face: LongVideoBench, VideoMME, MLVU, and LVBench.
- Download the Video-MME videos.
huggingface-cli download --repo-type dataset --resume-download lmms-lab/Video-MME --local-dir lmms-lab/Video-MME --local-dir-use-symlinks False
- Unzip the videos
cd lmms-lab/Video-MME
unzip 'videos_chunked_*.zip' -d videos/
- Link the data into the eval directory
ln -s lmms-lab/Video-MME/videos flexselect/eval/data/videomme/data
ln -s lmms-lab/Video-MME/videomme/test-00000-of-00001.parquet flexselect/eval/data/videomme/test-00000-of-00001.parquet
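Note: `ln -s` resolves a relative source path against the directory of the link, so the relative paths above assume a matching layout. If the links end up dangling, an absolute-path variant (assuming the dataset was downloaded under the current working directory) looks like this:
# same links as above, but with absolute source paths so they resolve from anywhere
ln -s "$(pwd)/lmms-lab/Video-MME/videos" flexselect/eval/data/videomme/data
ln -s "$(pwd)/lmms-lab/Video-MME/videomme/test-00000-of-00001.parquet" flexselect/eval/data/videomme/test-00000-of-00001.parquet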
- Download the MLVU videos.
huggingface-cli download --repo-type dataset --resume-download sy1998/MLVU_dev --local-dir sy1998/MLVU_dev --local-dir-use-symlinks False
- Unzip the videos
cd sy1998/MLVU_dev
unzip 'video_part_*.zip' -d videos/
- Link the data into the eval directory
ln -s sy1998/MLVU_dev/videos flexselect/eval/data/mlvu_test/data
ln -s sy1998/MLVU_dev/mlvu/test-00000-of-00001.parquet flexselect/eval/data/mlvu_test/test-00000-of-00001.parquet
- Download the LVBench videos and files. Follow the instructions here for downloading the videos: LVBench. The flexselect/eval/data/lvbench/test.jsonl is the test file that we have compiled; it conforms to the lmms-eval supported format.
- Move or link the videos dir under flexselect/eval/data/lvbench (see the example commands after this list).
- We reorganized the test file to support lmms-eval evaluation. You can download it from here and move or link it under the data/lvbench/ dir.
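For example, assuming the LVBench videos were downloaded to /path/to/LVBench/videos and the compiled test.jsonl sits in the current directory (the link name `videos` is an assumption; check the lvbench task config under flexselect/eval for the exact name it expects):
# link the videos and place the compiled test file (adjust both source paths to your setup)
ln -s /path/to/LVBench/videos flexselect/eval/data/lvbench/videos
mv test.jsonl flexselect/eval/data/lvbench/test.jsonl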
- Download the LongVideoBench videos.
huggingface-cli download --repo-type dataset --resume-download longvideobench/LongVideoBench --local-dir longvideobench/LongVideoBench --local-dir-use-symlinks False
- Untar the videos
cd longvideobench/LongVideoBench
cat videos.tar.part.* > videos.tar
mkdir -p videos
tar -xvf videos.tar -C videos/
- Link the data into the eval directory
ln -s longvideobench/LongVideoBench/videos flexselect/eval/data/longvideobench/data
ln -s longvideobench/LongVideoBench/test-00000-of-00001.parquet flexselect/eval/data/longvideobench/test-00000-of-00001.parquet
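Once all four benchmarks are in place, a quick sanity check that every link resolves (the paths below are the ones used in the steps above; a dangling symlink is reported as missing):
# check that each benchmark data path exists and its symlink resolves
for d in videomme/data mlvu_test/data lvbench longvideobench/data; do
  [ -e "flexselect/eval/data/$d" ] && echo "$d: ok" || echo "$d: missing"
done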
The pretrained models can be found in their respective repositories: LLaVA-Video-7B, LLaVA-Video-72B, InternVL2.5-8B, Qwen2.5VL-7B, and Qwen2.5VL-72B.
FlexSelect works in two modes: a training-free mode and a lightweight mode. We evaluate both using LMMS-Eval and follow the environment installation guidelines of LMMS-EVAL. You can set up the environment by running:
sh setup.sh
You should download the token selector weights into flexselect/eval/models from huggingface:
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_llava_video --local-dir flexselect/eval/models/flexselect_llava_video
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_qwen2.5vl --local-dir flexselect/eval/models/flexselect_qwen2.5vl
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_internvl2.5 --local-dir flexselect/eval/models/flexselect_internvl2.5
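Before running the eval scripts, you can confirm the three selector checkpoints landed in the expected directories:
# list the downloaded token selector weights
ls flexselect/eval/models/flexselect_llava_video \
   flexselect/eval/models/flexselect_qwen2.5vl \
   flexselect/eval/models/flexselect_internvl2.5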
Then you can reproduce our results:
cd flexselect/eval
sh scripts/eval_llavavideo.sh
sh scripts/eval_internvl2_5.sh
sh scripts/eval_qwenvl2_5.sh
Here are explanations of the parameters in our eval scripts (see the example invocation after the table):
Parameter | Type | Options / Notes | Default |
---|---|---|---|
`use_token_selector` | boolean | `true`: enable FlexSelect token selection; `false`: disable (standard eval) | `false` |
`token_selector_path` | string | `"self"`: training-free mode; `"path/to/token selector model"`: lightweight mode | `"self"` |
`token_selector_layer` | integer | Reference layer number (only effective in training-free mode) | -1 |
`drop_func_name` | string | How the semantic relevance score is obtained: `"token_selection"` averages over the head and text dimensions; `"token_selection_argmax"` takes the argmax over the head and text dimensions | `"token_selection"` |
`tkn_budget` | integer | Maximum number of selected tokens | 6720 |
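For orientation, these variables are typically forwarded to LMMS-Eval as `--model_args`; the sketch below shows a hypothetical standalone invocation. The pretrained checkpoint id and the exact argument plumbing are assumptions, so check scripts/eval_llavavideo.sh for the exact command the repo uses.
# sketch only: assumes the eval script passes these options as lmms-eval model_args
python3 -m lmms_eval \
  --model llava_vid \
  --model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,use_token_selector=true,token_selector_path=flexselect/eval/models/flexselect_llava_video,tkn_budget=6720 \
  --tasks videomme \
  --batch_size 1 \
  --output_path ./logs/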
Here are explanations of some command-line choices (see the example after the tables):
Specify the evaluation model with the following options:
Value | Model Evaluated |
---|---|
`llava_vid` | LLaVA-Video-7B |
`internvl2` | InternVL2.5 |
`qwen2_5_vl` | Qwen2.5VL |
Value | Task Name | Notes |
---|---|---|
`videomme` | Video-MME | Standard video evaluation |
`mlvu_dev` | MLVU | Multi-task long video understanding |
`lvbench` | LVBench | Long-video benchmark |
`longvideobench_val_v` | LongVideoBench | Default variant (e.g., for LLaVA) |
`longvideobench_val_v_sub` | LongVideoBench | InternVL series only (uses caption) |
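For example, to evaluate Qwen2.5VL on MLVU you would swap in the corresponding model and task values (a sketch; the provided scripts may wrap this in accelerate launch and set additional model arguments):
# sketch only: combine --model / --tasks using the values from the tables above
python3 -m lmms_eval --model qwen2_5_vl --tasks mlvu_dev --batch_size 1 --output_path ./logs/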
FlexSelect trains a 0.5B token selector for LLaVA-Video-7B, Qwen2.5VL-7B, and InternVL2.5-8B.
We follow the environment installation guidelines of the corresponding projects to set up the training environment:
- LLaVA-Video: https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file#2-install-the-inference-package
- Qwen2.5VL: https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-finetune/README.md
- InternVL2.5: https://internvl.readthedocs.io/en/latest/internvl2.5/finetune.html
You should first download the LLaVA-Video-178K dataset.
We randomly sample 5% of its data for training.
You can download our sampled QA pairs from token_selector_train_data and put the data under the right directory.
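For the dataset itself, the same huggingface-cli pattern used for the benchmarks applies; the repo id below is an assumption (taken to be the lmms-lab release of LLaVA-Video-178K), so replace it if the dataset link above points elsewhere:
# assumed dataset repo id; adjust if the LLaVA-Video-178K link above differs
huggingface-cli download --repo-type dataset --resume-download lmms-lab/LLaVA-Video-178K --local-dir lmms-lab/LLaVA-Video-178K --local-dir-use-symlinks False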
# Train LLaVA-Video selector
mv output_rnd05.yaml flexselect/train/LLaVA-Video
cd flexselect/train/LLaVA-Video && sh scripts/train_selector.sh
# Train Qwen2.5-VL selector
mv llava_video_178k.json flexselect/train/Qwen2.5-VL/
cd flexselect/train/Qwen2.5-VL/qwen-vl-finetune && sh scripts/sft_7b.sh
# Train InternVL selector
mv llava_video_178k.jsonl flexselect/train/InternVL/internvl_chat
cd flexselect/train/InternVL/internvl_chat && sh shell/internvl2.5/2nd_finetune/internvl2_5_8b_dynamic_res_2nd_finetune_full.sh
We identify the reference layer in the VideoLLM whose cross-modal attention scores best reflect text-visual semantic relevance. If you are interested in the visualization results in the paper, you can reproduce them with the following commands after setting up the conda environment.
Download the needle data from needle data and unzip it.
python3 flexselect/visualization/LLaVA-NeXT/needle_llava.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
python3 flexselect/visualization/QwenVL/needle_qwen2_5vl.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
python3 flexselect/visualization/InternVL2/needle_internvl2_5.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
python3 flexselect/visualization/LLaVA-NeXT/draw_heatmap_llava.py
python3 flexselect/visualization/InternVL2/draw_heatmap_internvl.py
python3 flexselect/visualization/LLaVA-NeXT/pca_llava.py
This repository is built upon LMMS-EVAL, LLaVA-Video, InternVL2.5, and Qwen2.5VL. Thanks to these well-organized codebases.
FlexSelect is released under the CC BY-NC-SA 4.0 license.
@misc{zhang2025flexselectflexibletokenselection,
      title={FlexSelect: Flexible Token Selection for Efficient Long Video Understanding},
      author={Yunzhu Zhang and Yu Lu and Tianyi Wang and Fengyun Rao and Yi Yang and Linchao Zhu},
      year={2025},
      eprint={2506.00993},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.00993},
}