This is the official implementation of ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

ReVisionLLM

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu, Thomas Seidl, Gedas Bertasius

Accepted to CVPR 2025

[Website] [Paper]



📢 Latest Updates

  • Mar-03: The trained model weights are available here.
  • Mar-03: Released the training and evaluation code.
  • Feb-27: ReVisionLLM is accepted to CVPR 2025! 🔥🔥

ReVisionLLM Overview 💡

ReVisionLLM is a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos.

[Figure: ReVisionLLM model overview]
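To make the recursion concrete, here is a minimal conceptual sketch of the coarse-to-fine search described above. It is not the repository's actual API; the segment count, minimum segment length, and the score_segment placeholder are illustrative assumptions standing in for the hierarchical vision-language model.

from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)

def score_segment(video_id: str, interval: Interval, query: str) -> float:
    # Placeholder for the model call that rates how likely the queried event
    # occurs inside `interval`; in ReVisionLLM this role is played by the
    # recursive vision-language model. Replace with a real scorer.
    return 0.0

def recursive_ground(video_id: str, interval: Interval, query: str,
                     min_len: float = 60.0, top_k: int = 2) -> List[Interval]:
    # Coarse-to-fine localization: score coarse segments, keep the most
    # promising ones, and recurse until segments are short enough to return
    # as candidate temporal boundaries.
    start, end = interval
    if end - start <= min_len:
        return [interval]
    n_segments = 4
    step = (end - start) / n_segments
    segments = [(start + i * step, start + (i + 1) * step) for i in range(n_segments)]
    ranked = sorted(segments, key=lambda seg: score_segment(video_id, seg, query),
                    reverse=True)[:top_k]
    results: List[Interval] = []
    for seg in ranked:
        results.extend(recursive_ground(video_id, seg, query, min_len, top_k))
    return results

With a real scorer plugged in, a hypothetical call such as recursive_ground("movie_01", (0.0, 7200.0), "the hero opens the door") would narrow a two-hour video down to a handful of minute-level candidate intervals.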


Contributions 🏆

  • We extend existing VLMs to enable temporal grounding in hour-long videos.
  • We propose a vision-language model that recursively processes hour-long videos, enabling effective and efficient temporal grounding at this scale.
  • We propose a progressive training strategy, where the model is first trained to identify events in short video segments, then progressively scales to hour-long videos, enabling it to effectively handle longer, more complex video sequences.
  • Our model significantly outperforms previous state-of-the-art approaches, surpassing specialized models and other Vision-Language Models (VLMs) on multiple datasets by a substantial margin.

Installation 🔧

We recommend setting up a conda environment for the project:

conda create --name=revisionllm python=3.10
conda activate revisionllm

git clone https://github.com/Tanveer81/ReVisionLLM.git
cd ReVisionLLM
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
pip install -r requirements.txt

Additionally, install the following packages for training:

pip install ninja
pip install flash-attn==2.5.6 --no-build-isolation
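As a quick, optional sanity check that the environment built correctly (assuming a CUDA-capable GPU; flash-attn is only needed for training):

import torch

print("torch:", torch.__version__)            # expected 1.13.1
print("cuda available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # expected 2.5.6
except ImportError:
    print("flash-attn not installed (only needed for training)")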

Process MAD Dataset:

  • Follow RGNet to download the extracted features for the MAD Dataset and place them in the /data/mad/ folder.
  • Run the following commands to extract additional features:
python revisionllm/data/mad/mad_to_activitynet.py
python revisionllm/data/feature_extraction/mad_clip_text_extractor.py

Process VidChapters-7M Dataset:

  • Follow VidChapters-7M to download the VidChapters-7M Dataset and place it in the /data/chapters/ folder.
  • Run the following commands to extract features:
python revisionllm/data/feature_extraction/chapters_clip_text_extractor.py
python revisionllm/data/vidchap7m/chapters_clip_extractor.py
python revisionllm/data/vidchap7m/chapters_test_to_activitynet.py
python revisionllm/data/vidchap7m/chapters_to_activitynet.py

Download Encoder and LLM Weights

Follow VTimeLLM to download the CLIP, Vicuna-v1.5, Stage-1, and Stage-2 weights and place them in the /checkpoints folder.
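A minimal check that the weights are in place (assuming they sit directly under /checkpoints; the exact subfolder names follow the VTimeLLM download instructions and may differ):

from pathlib import Path

# Lists whatever is currently under checkpoints/; the expected contents are the
# CLIP, Vicuna-v1.5, Stage-1, and Stage-2 weights downloaded per VTimeLLM.
ckpt_dir = Path("checkpoints")
assert ckpt_dir.is_dir(), "create the checkpoints/ folder and download the weights first"
for entry in sorted(ckpt_dir.iterdir()):
    print(entry.name)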

Training on MAD Dataset:

sh scripts/mad/stage1_dense.sh
sh scripts/mad/stage1_sparse.sh
sh scripts/mad/stage2_long_33.sh
sh scripts/mad/stage2_long_100.sh

Inference on MAD Dataset:

bash scripts/mad/eval_stage1_dense.sh
bash scripts/mad/eval_stage2_33.sh
bash scripts/mad/eval_stage2_100.sh
python revisionllm/eval/metric_retrieval_forward.py
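Temporal grounding results of this kind are typically reported as Recall@K at temporal-IoU thresholds. The sketch below shows that generic computation for reference; it is not the exact logic of revisionllm/eval/metric_retrieval_forward.py.

from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)

def temporal_iou(a: Interval, b: Interval) -> float:
    # Intersection over union of two [start, end] intervals in seconds.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions: List[List[Interval]], ground_truth: List[Interval],
                k: int, iou_threshold: float) -> float:
    # Fraction of queries whose top-k predictions contain at least one interval
    # overlapping the ground truth above the IoU threshold.
    hits = sum(
        any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k])
        for preds, gt in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)

# Example: one query whose top-1 prediction overlaps the ground truth with IoU ~0.81.
print(recall_at_k([[(10.0, 25.0)]], [(12.0, 26.0)], k=1, iou_threshold=0.5))  # -> 1.0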

Training on VidChapters7M Dataset:

sh scripts/chapters/stage1_dense.sh
sh scripts/chapters/stage1_sparse.sh
sh scripts/chapters/stage2_long_100.sh

Inference on VidChapters7M Dataset:

bash scripts/chapters/eval_loop_stage1_dense.sh
bash scripts/chapters/eval_stage2_100.sh
python revisionllm/eval/metric_retrieval_forward_chapters.py

Qualitative Analysis 🔍

A comprehensive evaluation of ReVisionLLM's performance on the MAD dataset.

[Figure: Qualitative grounding results on MAD]


Acknowledgements 🙏

We are grateful to the following awesome projects that ReVisionLLM builds upon:

  • LLaVA: Large Language and Vision Assistant
  • FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
  • LLaMA: Open and Efficient Foundation Language Models
  • VTimeLLM: Video moment understanding and reasoning model.
  • VidChapters7M: A large-scale dataset of user-chaptered videos
  • MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

If you're using ReVisionLLM in your research or applications, please cite using this BibTeX:

@article{hannan2024revisionllm,
  title={ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos},
  author={Hannan, Tanveer and Islam, Md Mohaiminul and Gu, Jindong and Seidl, Thomas and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2411.14901},
  year={2024}
}

License 📜


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟
