In this work, we propose VideoDeepResearch, an agentic framework that tackles long video understanding (LVU) using a text-only reasoning model with modular multimodal tools, outperforming MLLM baselines across major LVU benchmarks.
Demo video: example_sports.mov
- 📹 Diverse Long-Video Understanding
  - Single-detail, multi-detail, and multi-hop question answering across various scenes
- 🛠️ Multi-Tool Integration
  - Visual Perceiver, Video Browser, Text/Subtitle/Image Retriever & Extractor
- 🔄 Dynamic Multi-Round Calls
  - Automated tool-call scheduling based on question complexity (see the sketch after this list)
- 🔍 Full Interpretability
  - Detailed trace logs and step-by-step reasoning
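To make these pieces concrete, here is a minimal sketch of how such an agentic tool-call loop can be organized. It is illustrative only: the tool names, the `reason` callable, and the stub tool are hypothetical stand-ins, not the code shipped in this repository.

```python
# Illustrative sketch of an agentic tool-call loop (hypothetical stand-ins,
# not this repository's actual implementation).
from typing import Callable, Dict

def demo_tool(args: Dict) -> str:
    """Stand-in for a real module such as the Visual Perceiver or Video Browser."""
    return f"observation for {args}"

TOOLS: Dict[str, Callable[[Dict], str]] = {
    "visual_perceiver": demo_tool,
    "video_browser": demo_tool,
    "subtitle_retriever": demo_tool,
}

def answer_question(question: str, reason: Callable, max_rounds: int = 8) -> str:
    """`reason` stands in for the text-only reasoning model: given the current
    context and the tool names, it returns either a tool call or a final answer."""
    trace = []  # full interpretability: every tool call and observation is logged
    context = f"Question: {question}"
    for _ in range(max_rounds):
        decision = reason(context, list(TOOLS))  # model schedules the next call
        if decision["action"] == "answer":
            return decision["content"]
        observation = TOOLS[decision["tool"]](decision.get("arguments", {}))
        trace.append((decision["tool"], observation))
        context += f"\nObservation from {decision['tool']}: {observation}"
    return "No final answer within the round budget."
```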
```bash
# Clone repository
git clone https://github.com/yhy-2000/VideoDeepResearch.git
cd VideoDeepResearch

# Install dependencies
pip install -r requirements.txt
```
Project Layout:

```
VideoDeepResearch/
├── streamlit_demo_vlm_local.py  # Streamlit demo script that uses a local vLLM server as the visual module
├── streamlit_demo_vlm_api.py    # Streamlit demo script that uses a proprietary API as the visual module
├── requirements.txt             # Python dependencies
├── eval/                        # Code for evaluating benchmarks
├── asset/                       # Assets used in the demo
├── data/
│   ├── videos/                  # Raw video files
│   ├── clips/                   # Generated video clips
│   ├── dense_frames/            # Extracted key frames
│   └── subtitles/               # Subtitle files (optional)
└── README.md                    # This documentation
```
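If you are starting from a fresh checkout, the data/ subdirectories above may not exist yet. A minimal sketch for creating them, assuming the demo expects exactly this layout:

```python
from pathlib import Path

# Create the data layout shown above: videos, clips, dense_frames, subtitles.
for sub in ("videos", "clips", "dense_frames", "subtitles"):
    Path("data", sub).mkdir(parents=True, exist_ok=True)
```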
Set the following environment variables for the text-only large reasoning model (example for deepseek-reasoner):

```bash
export API_MODEL_NAME=deepseek-r1-250120
export API_BASE_URL=https://ark.cn-beijing.volces.com/api/v3
export API_KEY=YOUR_API_KEY
```
💡 Tip: We recommend using Volcengine (https://www.volcengine.com/product/ark) for faster and more stable responses.
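As a quick sanity check that the key works before launching the demo, you can query the endpoint with the OpenAI-compatible Python client. This is only a sketch and assumes the endpoint speaks the OpenAI chat-completions protocol (the Ark API above does):

```python
import os
from openai import OpenAI

# Uses the same environment variables exported above.
client = OpenAI(base_url=os.environ["API_BASE_URL"], api_key=os.environ["API_KEY"])

resp = client.chat.completions.create(
    model=os.environ["API_MODEL_NAME"],
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(resp.choices[0].message.content)
```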
For the Visual Perceiver & Video Browser, choose one of the following:

- Local Server: run `bash init_vllm_server.sh`, then:

  ```bash
  streamlit run streamlit_demo_vlm_local.py
  ```

- Proprietary API: set the environment variables `API_MODEL_NAME_VLM`, `API_BASE_URL_VLM`, and `API_KEY_VLM` (a sanity-check sketch follows this list), e.g.:

  ```bash
  export API_MODEL_NAME_VLM=doubao-1.5-vision-pro-250328
  export API_BASE_URL_VLM=https://ark.cn-beijing.volces.com/api/v3
  export API_KEY_VLM=YOUR_API_KEY
  ```

  then:

  ```bash
  streamlit run streamlit_demo_vlm_api.py
  ```
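Before launching the API demo, you can likewise sanity-check the VLM credentials with a single vision request. This is a sketch that assumes the endpoint is OpenAI-compatible with image inputs; the image URL is a placeholder you should replace:

```python
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["API_BASE_URL_VLM"], api_key=os.environ["API_KEY_VLM"])

resp = client.chat.completions.create(
    model=os.environ["API_MODEL_NAME_VLM"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this frame in one sentence."},
            # Placeholder URL: point this at any accessible image, e.g. an extracted frame.
            {"type": "image_url", "image_url": {"url": "https://example.com/frame.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```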
After that, you should see terminal output like:

```
Local URL: http://localhost:8501
Network URL: http://192.168.x.x:8501
External URL: http://your_public_ip:8501
```

Open the Local URL in your browser to start.
- Open Browser: navigate to http://localhost:8501.
- Configure Settings:
  - Choose the model and API parameters in the sidebar.
  - Upload or select a video file (`.mp4`) and, optionally, a subtitle file (`.srt`).
- Ask Questions:
  - Type your question regarding the video content.
  - Click Start Processing.
- Review Results:
  - View tool-call logs, extracted frames/clips, and final answers below the video player.
💡 Tip: For faster responses, try a faster reasoning-model API such as gemini-2.5.
The examples we provide are sourced from the LVBench and MLVU test sets. To run these examples, please download the corresponding datasets and replace `video_path` with the appropriate local path.
We also provide the prompts used in `prompt_qwen25vl.py` and `prompt_seed15vl.py`, allowing you to replicate our results with the corresponding configurations.
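For instance, after downloading a benchmark video you might drop it into the demo's data layout like this (a hypothetical sketch; the actual `video_path` handling lives in the demo and eval scripts):

```python
import shutil
from pathlib import Path

# Hypothetical example: copy a downloaded LVBench/MLVU video into data/videos/.
src = Path("~/Downloads/lvbench_example.mp4").expanduser()  # hypothetical filename
dst = Path("data/videos") / src.name
dst.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(src, dst)
print(f"Use {dst} as the video_path")
```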
Encounter issues or have questions? Reach out to:
H.Y. Yuan Email: [email protected]
If you find this work helpful, please cite our paper:
```bibtex
@misc{yuan2025videodeepresearchlongvideounderstanding,
  title={VideoDeepResearch: Long Video Understanding With Agentic Tool Using},
  author={Huaying Yuan and Zheng Liu and Junjie Zhou and Ji-Rong Wen and Zhicheng Dou},
  year={2025},
  eprint={2506.10821},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.10821},
}
```