[CVPR 2025] SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost

Show Lab @ NUS

Haiyang Mei, Pengyu Zhang, Mike Zheng Shou

[Paper] [arXiv] [BibTeX]

1. Overview

SAM-I2V is a training-efficient method to upgrade the image-based SAM for promptable video segmentation. It achieves over 90% of SAM 2’s performance while requiring only 0.2% of its training cost.

SAM-I2V takes an input video and extracts frame features via an image encoder enhanced by a temporal feature integrator to capture dynamic context. These features are processed by a memory associator and memory prompt generator to manage historical information and generate target prompts. A prompt encoder incorporates optional user inputs (e.g., masks, points, boxes). Finally, the mask decoder produces segmentation masks for each frame, enabling user-guided and memory-conditioned promptable video segmentation.
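
To make this data flow concrete, below is a minimal, purely illustrative Python sketch of the pipeline; the module names mirror the components described above and are not the repository's actual API.

# Illustrative sketch only: component names follow the paper's description,
# not the actual SAM-I2V code API.
def segment_video(frames, user_prompts, model):
    """frames: list of video frames; user_prompts: dict mapping frame index
    to optional user input (mask / points / box); model: SAM-I2V modules."""
    memory = []        # historical (features, mask) pairs from earlier frames
    video_masks = []
    for t, frame in enumerate(frames):
        # Image encoder + temporal feature integrator: per-frame features
        # enriched with dynamic context.
        feats = model.temporal_feature_integrator(model.image_encoder(frame), memory)

        # Memory associator + memory prompt generator: manage historical
        # information and generate target prompts for the current frame.
        memory_prompt = model.memory_prompt_generator(model.memory_associator(feats, memory))

        # Prompt encoder: optional user inputs (masks, points, boxes).
        user_prompt = model.prompt_encoder(user_prompts.get(t))

        # Mask decoder: user-guided, memory-conditioned mask for this frame.
        mask = model.mask_decoder(feats, memory_prompt, user_prompt)

        memory.append((feats, mask))
        video_masks.append(mask)
    return video_masks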

2. Installation

Our implementation uses python==3.11, torch==2.5.0, and torchvision==0.20.0. Please follow the instructions here to install the PyTorch and TorchVision dependencies. You can install SAM-I2V on a GPU machine as follows:

git clone https://github.com/showlab/SAM-I2V.git && cd SAM-I2V
conda create -n sam-i2v python=3.11
conda activate sam-i2v
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121
pip install -e .
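
As a quick, optional sanity check that the environment is set up (the import path follows the demo code in Section 3.2):

# Optional sanity check after installation.
import torch
import i2v  # the SAM-I2V package installed by `pip install -e .`

print(torch.__version__)           # expect 2.5.0
print(torch.cuda.is_available())   # should be True on a GPU machine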

3. Getting Started

3.1 Download Checkpoint

First, we need to download the SAM-I2V checkpoint. It can be downloaded from:

Both models were trained in one day on 24 GB GPUs. The first model (sam-i2v_8gpu.pt) was trained with 8 GPUs, while the second (sam-i2v_32gpu.pt) was trained with 32 GPUs and offers better performance.
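
The demo code in Section 3.2 loads the checkpoint from ./checkpoints/; a quick check that it is in place (the path is taken from the demo code, not a required layout):

# Confirm the downloaded checkpoint is where the demo expects it.
import os
assert os.path.isfile("./checkpoints/sam-i2v_32gpu.pt"), "checkpoint not found, please download it first"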

3.2 Demo Use

SAM-I2V can be used in a few lines for promptable video segmentation. The example below builds a video predictor that exposes APIs to add prompts and propagate masklets throughout a video. As with SAM 2, SAM-I2V supports video inference on multiple objects and uses an inference state to keep track of the interactions in each video.

import torch
from i2v.build_i2v import build_i2v_video_predictor

checkpoint = "./checkpoints/sam-i2v_32gpu.pt"
model_cfg = "./i2v/configs/i2v-infer.yaml"
predictor = build_i2v_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)

    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>)

    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
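
For concreteness, below is a hedged example of filling in <your_prompts> with a single positive point click; the keyword arguments follow SAM 2's video predictor API, which SAM-I2V mirrors, so please verify them against the actual signature of add_new_points_or_box.

# Example prompt: argument names follow SAM 2's video predictor API, which
# SAM-I2V mirrors; verify against the actual add_new_points_or_box signature.
import numpy as np

points = np.array([[300, 250]], dtype=np.float32)  # one (x, y) click
labels = np.array([1], dtype=np.int32)             # 1 = positive, 0 = negative

frame_idx, object_ids, masks = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,   # prompt on the first frame
    obj_id=1,      # use distinct obj_id values to track multiple objects
    points=points,
    labels=labels,
)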

3.3 Testing

We provide instructions for testing on the SAV-Test dataset.

(a) Please refer to the sav_dataset/README.md for detailed instructions on how to download and prepare the SAV-Test dataset before testing.

(b) Prepare the 'mask_info' for ease of testing via:

python tools/save_gt_mask_multiprocess.py

Alternatively, you can directly download the preprocessed 'mask_info' here.

(c) Run the inference script:

cd test_pvs
sh semi_infer.sh

3.4 Evaluation

Run the evaluation script:

sh semi_eval.sh

3.5 Training

(a) Please refer to the sav_dataset/README.md for detailed instructions on how to download and prepare the SAV-Train dataset. In total, there are 50,583 training videos (train/txt/sav_train_list.txt).

(b) Following SAM 2, we train the model on mixed video and image data. Download the SA-1B dataset and sample a subset of images, as the full dataset is too large to use in its entirety. We randomly sample 10k images (train/txt/sa1b_10k_train_list.txt) to train SAM-I2V; a sampling sketch is shown below.
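
The shipped train/txt/sa1b_10k_train_list.txt already contains the sampled subset; the sketch below only illustrates how such a list could be produced, and the SA-1B directory path is a hypothetical placeholder.

# Illustrative only: paths are placeholders; the repository already ships
# train/txt/sa1b_10k_train_list.txt with the sampled subset.
import os
import random

sa1b_dir = "/path/to/sa1b"  # hypothetical SA-1B image directory
images = sorted(f for f in os.listdir(sa1b_dir) if f.endswith(".jpg"))

random.seed(0)
subset = random.sample(images, 10_000)

with open("train/txt/sa1b_10k_train_list.txt", "w") as f:
    f.write("\n".join(subset) + "\n")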

(c) Download the SAM 1 model to be upgraded (i.e., TinySAM) and place it at checkpoints/tinysam.pth.

(d) Train the model:

  • Single node with 8 GPUs:
nohup sh train.sh > txt/8gpu.txt 2>&1 & disown
tail -f txt/8gpu.txt -n 9999999
  • Multi-node training, with 8 GPUs per node (e.g., 4x8=32 GPUs):
sh multi_node_train_4_nodes.sh

4. Acknowledgements

Our implementation builds upon SAM 2 and reuses essential modules from its official codebase.

5. Citation

If you use SAM-I2V in your research, please use the following BibTeX entry.

@InProceedings{Mei_2025_CVPR,
    author    = {Mei, Haiyang and Zhang, Pengyu and Shou, Mike Zheng},
    title     = {SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {3417-3426}
}

6. License

Please see the LICENSE file.

7. Contact

E-Mail: Haiyang Mei ([email protected])
