
bytedance/vidi


Homepage: https://bytedance.github.io/vidi-website/

We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing (VUE) scenarios. The first release focuses on temporal retrieval (TR), i.e., identifying the time ranges in input videos corresponding to a given text query.

Release

  • [06/06/2025] 🔥 Vidi-7B demo released at https://vidi.byteintl.com/. Follow the instructions in the Demo section to run the demo.
  • [04/21/2025] 🔥 The first release of Vidi consists of the tech report and the VUE-TR evaluation benchmark. The 7B model demo and weights are coming soon.


Demo

  1. Click the "Choose File" button, select a local video file (preferably in mp4 format), and click the "Upload" button.

    (Optional) Video files may contain corrupted frames that cause errors when the video is loaded. If the demo raises an error, it is recommended to transcode the video file with the following command before uploading (a scripted version of this step is sketched after this list):

    ffmpeg -i {vpath_in} -vf scale=480:-2 -c:v libx264 -c:a copy -preset ultrafast {vpath_out} -y
    
  2. After the video is uploaded, wait until the video is ready to play in the "Input Video" box.

  3. Enter the text query in the "Input Query" box, then click the "Run Time Retrieval" button.

  4. Wait until the result clips appear in the "Output Clips" box. This can take several minutes for long videos.
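
If you need to transcode many files before uploading, the command from step 1 can be scripted. The following is a minimal sketch that simply wraps that ffmpeg command with Python's subprocess module; it assumes ffmpeg is installed and on the PATH, and the file names are hypothetical placeholders:

import subprocess

def transcode(vpath_in: str, vpath_out: str) -> None:
    """Re-encode a video with the same settings as the ffmpeg command above."""
    cmd = [
        "ffmpeg",
        "-i", vpath_in,
        "-vf", "scale=480:-2",   # scale width to 480 px, height auto (kept even)
        "-c:v", "libx264",       # re-encode video with H.264
        "-c:a", "copy",          # keep the original audio stream
        "-preset", "ultrafast",
        vpath_out,
        "-y",                    # overwrite the output file if it exists
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Hypothetical file names; replace with your own paths.
    transcode("input.mp4", "input_clean.mp4")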

Installation

Run install.sh.

Evaluation

We release the ground-truth annotations and evaluation results in 5 JSON files. Run the following script for a standalone evaluation:

python3 -u qa_eval.py --pred_path results_Vidi.json

The result figures will be saved in the output folder ('./results' by default).

To evaluate new models, first download the videos from YouTube based on the IDs in "video_id.txt" (e.g., with yt-dlp). Then run inference and save the results in the following format (a minimal sketch for writing this file follows the example):

[
    {
        "query_id": 0,
        "video_id": "coPfnSFOXj0",
        "duration": 32.625,
        "query": "transition from storyboards to animation",
        "answer": [
            [
                0.0,
                32.29875
            ]
        ],
        "task": "temporal_retrieval"
    },
    ...
]
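
The following minimal Python sketch shows one way to write predictions in this schema. The entry contents and the output file name (results_MyModel.json) are hypothetical placeholders mirroring the example above; in practice the answer spans come from your model's inference:

import json

def save_predictions(predictions, out_path):
    """Write a list of prediction dicts in the schema shown above."""
    with open(out_path, "w") as f:
        json.dump(list(predictions), f, indent=4)

if __name__ == "__main__":
    # One hypothetical entry mirroring the example above; each "answer" item
    # is a [start, end] time span within the video's duration.
    preds = [
        {
            "query_id": 0,
            "video_id": "coPfnSFOXj0",
            "duration": 32.625,
            "query": "transition from storyboards to animation",
            "answer": [[0.0, 32.29875]],
            "task": "temporal_retrieval",
        }
    ]
    save_predictions(preds, "results_MyModel.json")

The resulting file can then be passed to qa_eval.py via --pred_path, as in the command above.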

Citation

If you find Vidi useful for your research and applications, please cite using this BibTeX:

@article{Vidi2025vidi,
    title={Vidi: Large Multimodal Models for Video 
            Understanding and Editing},
    author={Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, 
            Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang,
            Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, 
            Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, 
            Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, 
            Xueqiong Qu},
    journal={arXiv preprint arXiv:2504.15681},
    year={2025}
}

