
GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography [ICCV 2025]

Paper | Project page | Video | Data

Mengchen Zhang, Tong Wu✉️, Jing Tan, Ziwei Liu, Gordon Wetzstein, Dahua Lin✉️

Teaser Image

✨ Updates

[2025-07-03] Released inference code and model checkpoints!

[2025-07-03] Released training code.

📦 Install

Make sure PyTorch with CUDA support is correctly installed. Training relies on flash-attn, which requires an Ampere or newer GPU (e.g., A100). Inference also works on older GPUs such as the V100, albeit more slowly.
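
If you are unsure whether your GPU qualifies, a quick check before building flash-attn (a minimal sketch using only standard PyTorch calls, not part of the repo):

# quick GPU check before installing flash-attn
import torch

assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"
major, minor = torch.cuda.get_device_capability()
print(f"GPU: {torch.cuda.get_device_name()} (compute capability {major}.{minor})")
# flash-attn needs Ampere or newer, i.e. compute capability >= 8.0
print("flash-attn supported:", major >= 8)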

# clone
git clone https://github.com/3DTopia/GenDoP.git
cd GenDoP

# environment
conda create --name GenDoP python=3.10
conda activate GenDoP
pip install flash-attn --no-build-isolation
pip install -r requirements.txt

💡 Inference

Pretrained Models

We provide the following pretrained models:

| Model Type       | Description                      | Download Link |
|------------------|----------------------------------|---------------|
| text_motion      | Text (motion)-to-Trajectory      | Download      |
| text_directorial | Text (directorial)-to-Trajectory | Download      |
| text_rgbd        | Text & RGBD-to-Trajectory        | Download      |

Minimal Example

Note: Supply the prompt in one of two ways: pass the text directly with --text, or provide both --text_path and --text_key to read it from a caption file. For more examples, please refer to assets/examples.
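
For orientation, the sketch below shows how --text_path and --text_key presumably interact; the flat JSON layout is our assumption, so check the files under assets/examples for the authoritative format:

# caption lookup sketch -- JSON layout assumed, not taken from the repo
import json

with open("assets/examples/text_rgbd/case1_caption.json") as f:
    captions = json.load(f)  # e.g. {"Movement": "...", "Concise Interaction": "..."}
print(captions["Concise Interaction"])  # the entry selected by --text_key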

Inference Commands

  • Text (motion)-to-Trajectory

    python eval.py ArAE --workspace outputs --name text_motion/case1 --resume "checkpoints/text_motion.safetensors" \
        --cond_mode 'text' \
        --text "The camera remains static, then moves right, followed by moving forward while yawing right, and finally moving left and forward while continuing to yaw right."
  • Text (directorial)-to-Trajectory

    python eval.py ArAE --workspace outputs --name text_directorial/case1 --resume "checkpoints/text_directorial.safetensors" \
        --cond_mode 'text' \
        --text "The camera starts static, moves down to reveal clouds, pitches up to show more formations, and returns to a static position."
  • Text & RGBD-to-Trajectory

    python eval.py ArAE --workspace outputs --name text_rgbd/case1 --resume "checkpoints/text_rgbd.safetensors" \
        --cond_mode 'depth+image+text' \
        --text "The camera moves right and yaws left, highlighting the notebook and cup, then shifts forward to emphasize the subject's expression, before coming to a stop." \
        --text_path assets/examples/text_rgbd/case1_caption.json \
        --text_key 'Concise Interaction' \
        --image_path assets/examples/text_rgbd/case1_rgb.png \
        --depth_path assets/examples/text_rgbd/case1_depth.npy

Visualization

Note: Our default visualization, shown in the *_traj_cleaning.png files in our dataset, displays how the camera moves through the scene from three views: front, top-down, and side. Colors transition from red to purple to indicate the temporal order of the movement. The front view shows vertical and horizontal motion (up, down, left, right), while the top-down view highlights forward and backward motion.
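
To make this convention concrete, the sketch below (our illustration, not the repo's plotting code) draws a toy trajectory in the same three views, with a red-to-purple gradient encoding time:

# three-view trajectory plot on a synthetic camera path
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 1, 100)
# toy path: drifts right (x), bobs vertically (y), pushes forward (z)
pos = np.stack([t, 0.2 * np.sin(4 * np.pi * t), 2 * t], axis=1)

views = {"front (x-y)": (0, 1), "top-down (x-z)": (0, 2), "side (z-y)": (2, 1)}
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
colors = plt.cm.rainbow(1.0 - t)  # reversed so early frames are red, late frames purple
for ax, (name, (i, j)) in zip(axes, views.items()):
    ax.scatter(pos[:, i], pos[:, j], c=colors, s=8)
    ax.set_title(name)
fig.savefig("three_views.png")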

For a clearer visualization, consistent with the figures in our paper, you can use the following method:

Install

Follow the official instructions to install Blender. The Blender version we used is blender-3.3.1-linux-x64.

Then, install the required Python packages:

<path-to-blender>/<version>/python/bin/python3.10 -m pip install trimesh
<path-to-blender>/<version>/python/bin/python3.10 -m pip install matplotlib

Visualize

To visualize the trajectory, run:

<path-to-blender>/blender --background --python Blender_visualization/blender_visualize.py

Modify the traj_p variable in Blender_visualization/blender_visualize.py to specify the JSON file you want to visualize. This JSON file should follow the same format as the *_transforms_cleaning.json files in our dataset, which are the standardized trajectory files.
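
If you prepare such a file yourself, the *_transforms_cleaning.json naming suggests a transforms.json-style layout with per-frame camera-to-world matrices. The quick check below assumes that schema; this is a guess on our part, so verify it against an actual file from the dataset:

# trajectory file sanity check -- "frames"/"transform_matrix" schema is assumed
import json
import numpy as np

traj_p = "DataDoP/test/<scene>_transforms_cleaning.json"  # placeholder path
data = json.load(open(traj_p))
for frame in data["frames"][:3]:
    c2w = np.array(frame["transform_matrix"])  # assumed 4x4 camera-to-world
    print("camera position:", c2w[:3, 3])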

Evaluation

CLaTr checkpoints

We provide the following Contrastive Language-Trajectory embedding (CLaTr) checkpoints:

| Model Type          | Description                                      | Download Link |
|---------------------|--------------------------------------------------|---------------|
| epoch99_motion      | Evaluation for Text (motion)-to-Trajectory       | Download      |
| epoch99_directorial | Evaluation for Text (directorial)-to-Trajectory  | Download      |

Place the downloaded files into ./evaluate/CLaTr/CLaTr_checkpoints.

Evaluation Commands

  • Evaluation for Our Text (motion)-to-Trajectory Results

    Note: Before running, update these keys in the config file ./evaluate/CLaTr/configs/config_eval.yaml:

    • data_dir: the directory containing your Text (motion)-to-Trajectory test-set results
    • key: 'Movement'
    # Extract CLaTr (motion) features
    cd ./evaluate/CLaTr
    export HYDRA_FULL_ERROR=1  
    python -m src.extraction checkpoint_path=CLaTr_checkpoints/epoch99_motion.ckpt
    # A .npy file <path-to-motion-output>-preds.npy containing the CLaTr (motion) features will be saved in ./evaluate/CLaTr/output
    
    # Evaluate CLaTr (motion)
    cd ./evaluate/eval
    python -m src.eval_only --pred_path <path-to-motion-output>-preds.npy
  • Evaluation for Our Text (directorial)-to-Trajectory Results

    Note: Before running, update these keys in the config file ./evaluate/CLaTr/configs/config_eval.yaml:

    • data_dir: the directory containing your Text (directorial)-to-Trajectory test-set results
    • key: 'Concise Interaction'
    # Extract CLaTr (directorial) features
    cd ./evaluate/CLaTr
    export HYDRA_FULL_ERROR=1  
    python -m src.extraction checkpoint_path=CLaTr_checkpoints/epoch99_directorial.ckpt
    # A .npy file <path-to-directorial-output>-preds.npy containing the CLaTr (directorial) features will be saved in ./evaluate/CLaTr/output
    
    # Evaluate CLaTr (directorial)
    cd ./evaluate/eval
    python -m src.eval_only --pred_path <path-to-directorial-output>-preds.npy

    Our paper presents four metrics: captions/fscore, clatr/clatr_score, clatr/coverage, and clatr/fcd.
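
For intuition about these metrics: clatr/clatr_score measures text-trajectory embedding similarity, and clatr/fcd is a Fréchet distance between feature distributions. The sketch below shows generic formulations over extracted feature arrays; it is our illustration, not the repo's src.eval_only implementation, and the exact normalization may differ:

# generic CLaTr-style metrics over feature arrays (sketch)
import numpy as np
from scipy import linalg

def clatr_score(text_feats, traj_feats):
    # mean cosine similarity between paired text and trajectory embeddings
    a = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    b = traj_feats / np.linalg.norm(traj_feats, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

def fcd(real_feats, gen_feats):
    # Frechet distance between Gaussians fitted to real and generated features
    mu1, mu2 = real_feats.mean(0), gen_feats.mean(0)
    s1 = np.cov(real_feats, rowvar=False)
    s2 = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))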

📚 Dataset

Note: We provide DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions covering specific camera movements, interaction with the scene, and directorial intent.

Currently, we are releasing a subset of the dataset for validation purposes. Additional data will be released soon.

🏋️‍♂️ Training

Note: We have released a subset of the DataDoP dataset for training and validation. Please organize your training data in the following structure. If you wish to use your own dataset, refer to our data format and modify the core/provider.py file as needed; a minimal illustrative provider is sketched after the directory layout below.

GenDoP
├── DataDoP
│   ├── train
│   ├── test
│   ├── train_valid.txt
│   ├── test_valid.txt
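
As referenced above, a provider in the spirit of core/provider.py for custom data might look like the minimal PyTorch Dataset below. The file suffixes and JSON keys here are assumptions inferred from the layout above, not the repo's actual code:

# illustrative stand-in for core/provider.py (paths and keys assumed)
import json, os
import numpy as np
from torch.utils.data import Dataset

class TrajTextDataset(Dataset):
    def __init__(self, root="DataDoP", split="train", text_key="Movement"):
        with open(os.path.join(root, f"{split}_valid.txt")) as f:
            self.ids = [line.strip() for line in f if line.strip()]
        self.root, self.split, self.text_key = root, split, text_key

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        base = os.path.join(self.root, self.split, self.ids[idx])
        traj = json.load(open(base + "_transforms_cleaning.json"))  # per-frame camera-to-world (assumed)
        caption = json.load(open(base + "_caption.json"))[self.text_key]  # suffix assumed
        poses = np.array([f["transform_matrix"] for f in traj["frames"]], dtype=np.float32)
        return {"poses": poses, "text": caption}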

Training Commands

  • Text (motion)-to-trajectory
    accelerate launch --config_file acc_configs/gpu1.yaml main.py ArAE --workspace workspace --exp_name 'text_motion' --cond_mode 'text' --text_key 'Movement' --num_cond_tokens 77
  • Text (directorial)-to-trajectory
    accelerate launch --config_file acc_configs/gpu1.yaml main.py ArAE --workspace workspace --exp_name 'text_directorial' --cond_mode 'text' --text_key 'Concise Interaction' --num_cond_tokens 77
  • Text & RGBD-to-trajectory
    accelerate launch --config_file acc_configs/gpu1.yaml main.py ArAE --workspace workspace --exp_name 'text_rgbd' --cond_mode 'depth+image+text' --text_key 'Concise Interaction' --num_cond_tokens 591

Training Details

The model is trained on a single A100 (80GB) GPU for approximately 8 hours, with a batch size of 16, using a dataset of 30k examples for around 100 epochs. Recommended hyperparameters:

--discrete_bins 256 --pose_length 30 --hidden_dim 1024 --num_heads 8 --num_layers 12

You can adjust these parameters in core/options.py according to your specific requirements.
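
One note on --discrete_bins: the auto-regressive model operates on tokenized trajectories, so continuous pose parameters are quantized into a fixed vocabulary of discrete bins. A minimal sketch of such uniform quantization (our illustration; the repo's actual tokenization may normalize and order parameters differently):

# uniform discretization of pose parameters into token bins (sketch)
import numpy as np

def quantize(x, lo, hi, bins=256):
    # map a continuous value in [lo, hi] to an integer token in [0, bins-1]
    x = np.clip((x - lo) / (hi - lo), 0.0, 1.0)
    return np.minimum((x * bins).astype(np.int64), bins - 1)

def dequantize(tok, lo, hi, bins=256):
    # map a token back to its bin-center value
    return lo + (tok.astype(np.float32) + 0.5) / bins * (hi - lo)

xyz = np.array([0.13, -0.42, 0.88])
tokens = quantize(xyz, lo=-1.0, hi=1.0)  # e.g. positions normalized to [-1, 1]
print(tokens, dequantize(tokens, -1.0, 1.0))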

📆 Todo

  • Release Dataset
  • Release Dataset Construction Code
  • Gradio Demo

📚 Acknowledgements

This work builds on many amazing research works and open-source projects. Thanks to all the authors for sharing!

✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citation 📝

@misc{zhang2025gendopautoregressivecameratrajectory,
      title={GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography}, 
      author={Mengchen Zhang and Tong Wu and Jing Tan and Ziwei Liu and Gordon Wetzstein and Dahua Lin},
      year={2025},
      eprint={2504.07083},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.07083}, 
}
