This is the official implementation of ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

ReVisionLLM

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu, Thomas Seidl, Gedas Bertasius

Accepted to CVPR 2025

[Website] [Paper]



📢 Latest Updates

  • Mar-03: The trained model weights are available here.
  • Mar-03: Released the training and evaluation code.
  • Feb-27: ReVisionLLM is accepted to CVPR 2025! 🔥🔥

ReVisionLLM Overview 💡

ReVisionLLM is a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos.

[Figure: ReVisionLLM model overview]
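To make the recursion concrete, here is a minimal conceptual sketch of the coarse-to-fine search described above. It is not the repository's actual API; the segment count, minimum segment length, and the score_segment placeholder are illustrative assumptions standing in for the hierarchical vision-language model.

from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)

def score_segment(video_id: str, interval: Interval, query: str) -> float:
    # Placeholder for the model call that rates how likely the queried event
    # occurs inside `interval`; in ReVisionLLM this role is played by the
    # recursive vision-language model. Replace with a real scorer.
    return 0.0

def recursive_ground(video_id: str, interval: Interval, query: str,
                     min_len: float = 60.0, top_k: int = 2) -> List[Interval]:
    # Coarse-to-fine localization: score coarse segments, keep the most
    # promising ones, and recurse until segments are short enough to return
    # as candidate temporal boundaries.
    start, end = interval
    if end - start <= min_len:
        return [interval]
    n_segments = 4
    step = (end - start) / n_segments
    segments = [(start + i * step, start + (i + 1) * step) for i in range(n_segments)]
    ranked = sorted(segments, key=lambda seg: score_segment(video_id, seg, query),
                    reverse=True)[:top_k]
    results: List[Interval] = []
    for seg in ranked:
        results.extend(recursive_ground(video_id, seg, query, min_len, top_k))
    return results

With a real scorer plugged in, a hypothetical call such as recursive_ground("movie_01", (0.0, 7200.0), "the hero opens the door") would narrow a two-hour video down to a handful of minute-level candidate intervals.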


Contributions 🏆

  • We extend existing VLMs to enable temporal grounding in hour-long videos.
  • We propose a vision-language model that recursively processes hour-long videos, enabling effective and efficient temporal grounding at this scale.
  • We propose a progressive training strategy, where the model is first trained to identify events in short video segments, then progressively scales to hour-long videos, enabling it to effectively handle longer, more complex video sequences.
  • Our model significantly outperforms previous state-of-the-art approaches, surpassing specialized models and other Vision-Language Models (VLMs) on multiple datasets by a substantial margin.

Installation 🔧

We recommend setting up a conda environment for the project:

conda create --name=revisionllm python=3.10
conda activate revisionllm

git clone https://github.com/Tanveer81/ReVisionLLM.git
cd ReVisionLLM
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
pip install -r requirements.txt

Additionally, install the following packages for training:

pip install ninja
pip install flash-attn==2.5.6 --no-build-isolation
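As a quick, optional sanity check that the environment built correctly (assuming a CUDA-capable GPU; flash-attn is only needed for training):

import torch

print("torch:", torch.__version__)            # expected 1.13.1
print("cuda available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # expected 2.5.6
except ImportError:
    print("flash-attn not installed (only needed for training)")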

Process MAD Dataset:

  • Follow RGNet to download the extracted features for the MAD Dataset and place them in the /data/mad/ folder.
  • Run the following commands to extract additional features:
python revisionllm/data/mad/mad_to_activitynet.py
python revisionllm/data/feature_extraction/mad_clip_text_extractor.py

Process VidChapters-7M Dataset:

  • Follow VidChapters-7M to download the VidChapters-7M Dataset and place it in the /data/chapters/ folder.
  • Run the following commands to extract features:
python revisionllm/data/feature_extraction/chapters_clip_text_extractor.py
python revisionllm/data/vidchap7m/chapters_clip_extractor.py
python revisionllm/data/vidchap7m/chapters_test_to_activitynet.py
python revisionllm/data/vidchap7m/chapters_to_activitynet.py

Download Encoder and LLM Weights

Follow VTimeLLM to download the CLIP, Vicuna-v1.5, Stage-1, and Stage-2 weights and place them in the /checkpoints folder.
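A minimal check that the weights are in place (assuming they sit directly under /checkpoints; the exact subfolder names follow the VTimeLLM download instructions and may differ):

from pathlib import Path

# Lists whatever is currently under checkpoints/; the expected contents are the
# CLIP, Vicuna-v1.5, Stage-1, and Stage-2 weights downloaded per VTimeLLM.
ckpt_dir = Path("checkpoints")
assert ckpt_dir.is_dir(), "create the checkpoints/ folder and download the weights first"
for entry in sorted(ckpt_dir.iterdir()):
    print(entry.name)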

Training on MAD Dataset:

sh scripts/mad/stage1_dense.sh
sh scripts/mad/stage1_sparse.sh
sh scripts/mad/stage2_long_33.sh
sh scripts/mad/stage2_long_100.sh

Inference on MAD Dataset:

bash scripts/mad/eval_stage1_dense.sh
bash scripts/mad/eval_stage2_33.sh
bash scripts/mad/eval_stage2_100.sh
python revisionllm/eval/metric_retrieval_forward.py
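Temporal grounding results of this kind are typically reported as Recall@K at temporal-IoU thresholds. The sketch below shows that generic computation for reference; it is not the exact logic of revisionllm/eval/metric_retrieval_forward.py.

from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)

def temporal_iou(a: Interval, b: Interval) -> float:
    # Intersection over union of two [start, end] intervals in seconds.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions: List[List[Interval]], ground_truth: List[Interval],
                k: int, iou_threshold: float) -> float:
    # Fraction of queries whose top-k predictions contain at least one interval
    # overlapping the ground truth above the IoU threshold.
    hits = sum(
        any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k])
        for preds, gt in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)

# Example: one query whose top-1 prediction overlaps the ground truth with IoU ~0.81.
print(recall_at_k([[(10.0, 25.0)]], [(12.0, 26.0)], k=1, iou_threshold=0.5))  # -> 1.0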

Training on VidChapters7M Dataset:

sh scripts/chapters/stage1_dense.sh
sh scripts/chapters/stage1_sparse.sh
sh scripts/chapters/stage2_long_100.sh

Inference on VidChapters7M Dataset:

bash scripts/chapters/eval_loop_stage1_dense.sh
bash scripts/chapters/eval_stage2_100.sh
python revisionllm/eval/metric_retrieval_forward_chapters.py

Qualitative Analysis 🔍

A comprehensive evaluation of ReVisionLLM's performance on the MAD dataset.

[Figure: Qualitative grounding results on MAD]


Acknowledgements 🙏

We are grateful to the following awesome projects that ReVisionLLM builds upon:

  • LLaVA: Large Language and Vision Assistant
  • FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
  • LLaMA: Open and Efficient Foundation Language Models
  • VTimeLLM: Video moment understanding and reasoning model.
  • VidChapters7M: A large-scale dataset of user-chaptered videos
  • MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

If you're using ReVisionLLM in your research or applications, please cite using this BibTeX:

@article{hannan2024revisionllm,
  title={ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos},
  author={Hannan, Tanveer and Islam, Md Mohaiminul and Gu, Jindong and Seidl, Thomas and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2411.14901},
  year={2024}
}

License 📜


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟
