StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos
Sijie Zhao*
Wenbo Hu*
Xiaodong Cun*
Yong Zhang†
Xiaoyu Li†
Zhe Kong
Xiangjun Gao
Muyao Niu
Ying Shan
* equal contribution † corresponding author
- The red-blue 3D glasses purchased above are used to watch Anaglyph 3D videos, while the 3D phone holder below, paired with a 3D video player app on a smartphone, is used to view Side-by-Side (SBS) 3D videos. The three result pairs below show both formats.
- This project is well suited to generating 3D videos of real-world scenes, or 3D videos whose foreground characters are not tied to a specific scenario (this depends on the quality of the depth estimation model and of the source material; videos containing text require additional preprocessing).
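To see how the two formats relate, here is a minimal sketch that turns one side-by-side frame into a red-cyan anaglyph frame (file names are placeholders; a standard color anaglyph takes the red channel from the left view and the green and blue channels from the right view):

import numpy as np
from PIL import Image

# Split a side-by-side frame into left/right halves and recombine them
# as a red-cyan anaglyph: red from the left eye, green/blue from the right.
sbs = np.array(Image.open("sbs_frame.png").convert("RGB"))  # placeholder file name
h, w, _ = sbs.shape
left, right = sbs[:, : w // 2], sbs[:, w // 2 :]
anaglyph = right.copy()
anaglyph[..., 0] = left[..., 0]
Image.fromarray(anaglyph).save("anaglyph_frame.png")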
- Example 1: source video _fix.mp4
  - Anaglyph: final_anaglyph_audio.1.mp4 (https://huggingface.co/datasets/svjack/Genshin-Impact-RealWorld-Teyvat-Reality-3D-Anaglyph-Video)
  - SBS: final_sbs_audio.1.mp4 (https://huggingface.co/datasets/svjack/Genshin-Impact-RealWorld-Teyvat-Reality-3D-STS-Video)
- Example 2: source video P057.mp4
  - Anaglyph: b1ca707c-c535-4eba-b721-c713376e75e2.mp4 (https://huggingface.co/datasets/svjack/Genshin-Impact-LiYue-QunYuGe-War-3D-Anaglyph-Video)
  - SBS: c3843b97-27d8-4eca-a0c5-84da38916fa0.mp4 (https://huggingface.co/datasets/svjack/Genshin-Impact-LiYue-QunYuGe-War-3D-STS-Video)
- Example 3: source video EP.-.mp4
  - Anaglyph: final_anaglyph_audio.2.mp4 (https://huggingface.co/datasets/svjack/Genshin-Impact-Nahida-EP-3D-Anaglyph-Video)
  - SBS: final_sbs_audio.2.mp4 (https://huggingface.co/datasets/svjack/Genshin-Impact-Nahida-EP-3D-STS-Video)
We propose a novel framework to convert any 2D video into an immersive stereoscopic 3D one that can be viewed on different display devices, such as 3D glasses, Apple Vision Pro, and 3D displays. It can be applied to various video sources, including movies, vlogs, 3D cartoons, and AIGC videos.
- 2024/12/27: We released our inference code and model weights.
- 2024/09/11: We submitted our technical report to arXiv and released our project page.
Here we show some examples of input videos and their corresponding stereo outputs in Anaglyph 3D format.
The upstream code was developed with Python 3.8 and CUDA 11.8; the setup below uses Python 3.10 with CUDA 12.1 PyTorch wheels. You can use Anaconda or Docker to build this base environment.
sudo apt-get update && sudo apt-get install git-lfs ffmpeg cbm
sudo apt-get upgrade ffmpeg
conda create -n stereoCrafter python=3.10
conda activate stereoCrafter
pip install ipykernel
python -m ipykernel install --user --name stereoCrafter --display-name "stereoCrafter"
git clone https://github.com/svjack/StereoCrafter
cd StereoCrafter
pip install -r requirements.txt
pip uninstall -y torch torchvision
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip uninstall -y xformers
pip install xformers
pip install moviepy==1.0.3
pip install pillow==9.0.0
pip install pydub
pip install pandas
# build and install the Forward-Warp package used for depth-based splatting
cd ./dependency/Forward-Warp
chmod a+x install.sh
./install.sh
pip install -r requirements.txt
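After installation, a quick sanity check can confirm the CUDA wheels and the compiled extension are importable (a minimal sketch; the Forward_Warp import name follows the upstream Forward-Warp package and is an assumption here):

import torch
import xformers

# Confirm the CUDA-enabled wheels installed correctly.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
print(xformers.__version__)
import Forward_Warp  # should succeed once install.sh has built the extension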
1. Download the SVD img2vid model for the image encoder and VAE.
# in StereoCrafter project root directory
mkdir weights
cd ./weights
git lfs install
huggingface-cli login
huggingface-cli download stabilityai/stable-video-diffusion-img2vid-xt-1-1 --local-dir stable-video-diffusion-img2vid-xt-1-1
#### OR
git clone https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt-1-1
2. Download the DepthCrafter model for the video depth estimation.
git clone https://huggingface.co/tencent/DepthCrafter
3. Download the StereoCrafter model for the stereo video generation.
git clone https://huggingface.co/TencentARC/StereoCrafter
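Once all three downloads finish, the weights directory should contain the three checkpoints used by the inference scripts below; a small check (paths follow the clone commands above):

from pathlib import Path

# Verify the three checkpoint directories referenced by the inference scripts.
for name in ["stable-video-diffusion-img2vid-xt-1-1", "DepthCrafter", "StereoCrafter"]:
    path = Path("weights") / name
    print(f"{path}: {'OK' if path.is_dir() else 'MISSING'}")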
- Step 0
python depth_splatting_inference_npz.py \
    --pre_trained_path ./weights/stable-video-diffusion-img2vid-xt-1-1 \
    --unet_path ./weights/DepthCrafter \
    --input_video_path ./source_video/camel.mp4 \
    --output_video_path ./outputs/camel_splatting_results.mp4
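This npz variant of the depth-splatting script also writes the intermediate arrays to .npz files next to the video output; you can inspect which keys and shapes were produced (file names and the video_grid / depth keys are taken from the Step 1 examples below):

import numpy as np
from pathlib import Path

# List the arrays saved by the step above (paths from the camel example).
for path in ["outputs/camel_splatting_results_video_grid.npz",
             "outputs/camel_splatting_results.npz"]:
    if Path(path).exists():
        data = np.load(path)
        print(path, {key: data[key].shape for key in data.files})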
- Step 1
import numpy as np
from moviepy.editor import ImageSequenceClip

def npz_to_video(npz_path, output_video_path, fps, key="video_grid"):
    """
    Load a video array from an .npz file and save it as a video file.
    :param npz_path: path to the .npz file
    :param output_video_path: path of the output video file
    :param fps: frame rate of the output video
    :param key: key under which the video data is stored in the .npz file
    """
    # Load the .npz file and pull out the video array.
    data = np.load(npz_path)
    video_grid = data[key]
    # Convert the array to a 0-255 uint8 image sequence.
    video_grid = (video_grid * 255).astype(np.uint8)
    print("Original shape:", video_grid.shape)
    # If the video is single-channel (shape (T, H, W)), expand it to 3 channels.
    if len(video_grid.shape) == 3:
        video_grid = np.stack([video_grid] * 3, axis=-1)  # shape becomes (T, H, W, 3)
        print("Converted shape:", video_grid.shape)
    # Write the video with moviepy.
    clip = ImageSequenceClip(list(video_grid), fps=fps)
    clip.write_videofile(output_video_path, codec="libx264", ffmpeg_params=["-crf", "16"])

# Example call
npz_path = "outputs/camel_splatting_results_video_grid.npz"  # replace with your .npz path
output_video_path = "outputs/camel_splatting_results_video_grid.mp4"  # replace with the output path
fps = 30  # replace with the video frame rate
npz_to_video(npz_path, output_video_path, fps)
#### OR
# Example call for the depth array
npz_path = "outputs/camel_splatting_results.npz"  # replace with your .npz path
output_video_path = "outputs/camel_splatting_results.mp4"  # replace with the output path
fps = 30  # replace with the video frame rate
npz_to_video(npz_path, output_video_path, fps, key="depth")
camel_splatting_results_video_grid.mp4
camel_splatting_results.mp4
cp ./outputs/camel_splatting_results_video_grid.mp4 camel_splatting_results_video_grid.mp4
from moviepy.editor import VideoFileClip

def resize_video(input_video_path, output_video_path, scale_factor=0.5):
    """
    Scale the width and height of a video.
    :param input_video_path: path to the input video
    :param output_video_path: path of the output video
    :param scale_factor: scale factor, e.g. 0.5 halves the width and height, 2 doubles them
    """
    # Load the video.
    clip = VideoFileClip(input_video_path)
    # Compute the new resolution.
    new_width = int(clip.size[0] * scale_factor)
    new_height = int(clip.size[1] * scale_factor)
    # Resize the video.
    resized_clip = clip.resize((new_width, new_height))
    # Write the resized video.
    resized_clip.write_videofile(output_video_path, codec="libx264")

# Example call: halve the splatting grid before the inpainting stage.
input_video_path = "camel_splatting_results_video_grid.mp4"  # replace with the input path
output_video_path = "camel_splatting_results_video_grid_half.mp4"  # replace with the output path
scale_factor = 0.5  # 0.5 halves the width and height, 2 doubles them
resize_video(input_video_path, output_video_path, scale_factor)
camel_splatting_results_video_grid_half.mp4
- Step 2
python inpainting_inference.py \
--pre_trained_path ./weights/stable-video-diffusion-img2vid-xt-1-1 \
--unet_path ./weights/StereoCrafter \
--input_video_path camel_splatting_results_video_grid_half.mp4 \
--save_dir ./outputs
camel_video_grid_half_inpainting_results_sbs.mp4
camel_video_grid_half_inpainting_results_anaglyph.mp4
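If the generated stereo clips lack the source audio track, you can copy it back with moviepy to produce files like the *_audio.mp4 examples at the top (a minimal sketch, assuming the camel example file names):

from moviepy.editor import VideoFileClip

# Mux the original audio track onto the generated SBS video.
source = VideoFileClip("./source_video/camel.mp4")
stereo = VideoFileClip("./outputs/camel_video_grid_half_inpainting_results_sbs.mp4")
if source.audio is not None:
    stereo = stereo.set_audio(source.audio.set_duration(stereo.duration))
stereo.write_videofile("final_sbs_audio.mp4", codec="libx264", audio_codec="aac")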
Script:
# in StereoCrafter project root directory
sh run_inference.sh
There are two main steps in this script for generating stereo video.
Execute the following command:
python depth_splatting_inference.py --pre_trained_path [PATH] --unet_path [PATH] \
    --input_video_path [PATH] --output_video_path [PATH]
Arguments:
- --pre_trained_path: Path to the SVD img2vid model weights (e.g., ./weights/stable-video-diffusion-img2vid-xt-1-1).
- --unet_path: Path to the DepthCrafter model weights (e.g., ./weights/DepthCrafter).
- --input_video_path: Path to the input video (e.g., ./source_video/camel.mp4).
- --output_video_path: Path to the output video (e.g., ./outputs/camel_splatting_results.mp4).
- --max_disp: Maximum disparity, in pixels, between the generated right video and the input left video; larger values strengthen the 3D effect. The default value is 20 pixels (see the sketch after this list).
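To build intuition for --max_disp, here is a toy, pure-NumPy version of disparity-based warping. The actual pipeline uses the Forward-Warp CUDA op installed earlier; this nearest-pixel sketch only illustrates how larger disparities shift near pixels further and leave larger occlusion holes:

import numpy as np

def naive_right_view(left, disparity, max_disp=20):
    # left: (H, W, 3) uint8 frame; disparity: (H, W) in [0, 1], 1.0 = nearest.
    h, w, _ = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    shifts = np.rint(disparity * max_disp).astype(int)
    for y in range(h):
        for x in range(w):
            nx = x - shifts[y, x]  # nearer pixels move further left in the right view
            if 0 <= nx < w:
                right[y, nx] = left[y, x]
                filled[y, nx] = True
    return right, ~filled  # the unfilled pixels form the occlusion mask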
The first step generates a video grid containing the input video, the visualized depth map, the occlusion mask, and the splatted right-view video, as shown below:
Execute the following command:
python inpainting_inference.py --pre_trained_path [PATH] --unet_path [PATH] \
    --input_video_path [PATH] --save_dir [PATH]
Arguments:
- --pre_trained_path: Path to the SVD img2vid model weights (e.g., ./weights/stable-video-diffusion-img2vid-xt-1-1).
- --unet_path: Path to the StereoCrafter model weights (e.g., ./weights/StereoCrafter).
- --input_video_path: Path to the splatting video result generated by the first stage (e.g., ./outputs/camel_splatting_results.mp4).
- --save_dir: Directory for the output stereo videos (e.g., ./outputs).
- --tile_num: Number of tiles along each of the width and height dimensions for tiled processing, which handles high-resolution input without requiring more GPU memory. The default value is 1 (a 1 $\times$ 1 tile). For input videos with 2K resolution or higher, use more tiles to avoid running out of memory (see the example invocation after this list).
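For example, for a 2K input you could split each frame into a 2 $\times$ 2 grid of tiles (an illustrative invocation reusing the paths from the steps above):

python inpainting_inference.py \
    --pre_trained_path ./weights/stable-video-diffusion-img2vid-xt-1-1 \
    --unet_path ./weights/StereoCrafter \
    --input_video_path ./outputs/camel_splatting_results.mp4 \
    --save_dir ./outputs \
    --tile_num 2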
The stereo video inpainting step generates the final stereo video in both side-by-side and anaglyph 3D formats, as shown below:
We would like to express our gratitude to the following open-source projects:
- Stable Video Diffusion: A latent diffusion model trained to generate video clips from an image or text conditioning.
- DepthCrafter: A novel method to generate temporally consistent depth sequences from videos.
@article{zhao2024stereocrafter,
  title={StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos},
  author={Zhao, Sijie and Hu, Wenbo and Cun, Xiaodong and Zhang, Yong and Li, Xiaoyu and Kong, Zhe and Gao, Xiangjun and Niu, Muyao and Shan, Ying},
  journal={arXiv preprint arXiv:2409.07447},
  year={2024}
}