Paper | Project page | Video | Data
Mengchen Zhang, Tong Wu✉️, Jing Tan, Ziwei Liu, Gordon Wetzstein, Dahua Lin✉️
[2025-07-03] Released inference code and model checkpoints!
[2025-07-03] Released training code.
Make sure `torch` with CUDA support is correctly installed. For training, we rely on `flash-attn`, which requires Ampere or newer GPUs (e.g. A100). For inference, older GPUs such as V100 are also supported, although slower.
```bash
# clone
git clone https://github.com/3DTopia/GenDoP.git
cd GenDoP

# environment
conda create --name GenDoP python=3.10
conda activate GenDoP
pip install flash-attn --no-build-isolation
pip install -r requirements.txt
```
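Optionally, you can verify the GPU setup after installation. The snippet below is a minimal sanity-check sketch using standard `torch` APIs: it confirms CUDA is visible and reports the compute capability (`flash-attn` needs Ampere-class hardware, i.e. capability 8.0 or higher).

```python
import torch

# Confirm PyTorch can see a CUDA device.
assert torch.cuda.is_available(), "CUDA is not available; check your torch installation."

# flash-attn (used for training) requires Ampere or newer, i.e. compute capability >= 8.0.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
if (major, minor) < (8, 0):
    print("This GPU predates Ampere: inference should still work, but flash-attn training will not.")
```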
We provide the following pretrained models:
| Model Type | Description | Download Link |
|---|---|---|
| `text_motion` | Text (motion)-to-Trajectory | Download |
| `text_directorial` | Text (directorial)-to-Trajectory | Download |
| `text_rgbd` | Text & RGBD-to-Trajectory | Download |
Note: You may choose one of two options: either pass the text directly with `--text`, or provide both `--text_path` and `--text_key`. For more examples, please refer to `assets/examples`.
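As a rough illustration of the `--text_path`/`--text_key` route, the caption JSON maps caption styles to prompt strings and `--text_key` selects one of them. The sketch below assumes the key names used elsewhere in this README (`Movement`, `Concise Interaction`); check the files in `assets/examples` for the authoritative format.

```python
import json

# Example caption file shipped with the repo (see assets/examples); the key set is assumed here.
caption_path = "assets/examples/text_rgbd/case1_caption.json"
text_key = "Concise Interaction"  # or e.g. "Movement" for motion-style captions

with open(caption_path) as f:
    captions = json.load(f)

# The selected entry plays the same role as a prompt passed directly via --text.
print(captions[text_key])
```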
Inference Commands
- Text (motion)-to-Trajectory

```bash
python eval.py ArAE --workspace outputs --name text_motion/case1 \
    --resume "checkpoints/text_motion.safetensors" \
    --cond_mode 'text' \
    --text "The camera remains static, then moves right, followed by moving forward while yawing right, and finally moving left and forward while continuing to yaw right."
```

- Text (directorial)-to-Trajectory

```bash
python eval.py ArAE --workspace outputs --name text_directorial/case1 \
    --resume "checkpoints/text_directorial.safetensors" \
    --cond_mode 'text' \
    --text "The camera starts static, moves down to reveal clouds, pitches up to show more formations, and returns to a static position."
```

- Text & RGBD-to-Trajectory

```bash
python eval.py ArAE --workspace outputs --name text_rgbd/case1 \
    --resume "checkpoints/text_rgbd.safetensors" \
    --cond_mode 'depth+image+text' \
    --text "The camera moves right and yaws left, highlighting the notebook and cup, then shifts forward to emphasize the subject's expression, before coming to a stop." \
    --text_path assets/examples/text_rgbd/case1_caption.json \
    --text_key 'Concise Interaction' \
    --image_path assets/examples/text_rgbd/case1_rgb.png \
    --depth_path assets/examples/text_rgbd/case1_depth.npy
```
Note: Our default visualization, as shown in the `*_traj_cleaning.png` files in our dataset, displays how the camera moves through the scene. It includes three views: front, top-down, and side perspectives. The colors transition from red to purple to show the sequence of movement. In the front view, you can observe vertical and horizontal movements (up, down, left, right), while the top-down view highlights forward and backward motion.
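For reference, here is a minimal plotting sketch in the same spirit (not the script that produced the dataset PNGs): it draws three 2D views of a trajectory with a red-to-purple gradient over time, assuming you already have the camera centers as an `N x 3` array and that the axis pairings below match your coordinate convention.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder trajectory: replace with an (N, 3) array of camera centers (x, y, z).
positions = np.cumsum(np.random.randn(30, 3) * 0.1, axis=0)

# Red-to-purple gradient encoding temporal order (first pose red, last pose purple).
colors = plt.cm.rainbow(np.linspace(1.0, 0.0, len(positions)))

# Axis pairings are an assumption; adjust them to your world-coordinate convention.
views = [("Front (x-y)", 0, 1), ("Top-down (x-z)", 0, 2), ("Side (z-y)", 2, 1)]
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (title, i, j) in zip(axes, views):
    ax.plot(positions[:, i], positions[:, j], color="gray", linewidth=0.5)
    ax.scatter(positions[:, i], positions[:, j], c=colors, s=12)
    ax.set_title(title)
    ax.set_aspect("equal")
fig.tight_layout()
fig.savefig("trajectory_views.png")
```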
For a clearer visualization, consistent with the figures in our paper, you can use the following method:
Install
Follow the official instructions to install Blender. The Blender version we used is `blender-3.3.1-linux-x64`.
Then, install the required Python packages:
```bash
<path-to-blender>/<version>/python/bin/python3.10 -m pip install trimesh
<path-to-blender>/<version>/python/bin/python3.10 -m pip install matplotlib
```
Visualize
To visualize the trajectory, run:
```bash
<path-to-blender>/blender --background --python Blender_visualization/blender_visualize.py
```
Modify the `traj_p` variable in `Blender_visualization/blender_visualize.py` to specify the JSON file you want to visualize. This JSON file should follow the same format as the `*_transforms_cleaning.json` files in our dataset, which are the standardized trajectory files.
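If you want to inspect such a trajectory file outside Blender, the hedged sketch below assumes a NeRF-style layout with a `frames` list of 4x4 camera-to-world `transform_matrix` entries; open one of the dataset's `*_transforms_cleaning.json` files and adjust the keys if the actual schema differs.

```python
import json
import numpy as np

# Hypothetical path: point this at one of the *_transforms_cleaning.json files.
traj_path = "DataDoP/test/example_transforms_cleaning.json"

with open(traj_path) as f:
    traj = json.load(f)

# Assumed schema: "frames" is a list of dicts, each holding a 4x4 "transform_matrix".
poses = np.array([frame["transform_matrix"] for frame in traj["frames"]])
positions = poses[:, :3, 3]  # camera centers, reusable with the plotting sketch above
print(positions.shape)
```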
CLaTr checkpoints
We provide the following Contrastive Language-Trajectory embedding (CLaTr) checkpoints:
| Model Type | Description | Download Link |
|---|---|---|
| `epoch99_motion` | Evaluation for Text (motion)-to-Trajectory | Download |
| `epoch99_directorial` | Evaluation for Text (directorial)-to-Trajectory | Download |
Place the downloaded files into `./evaluate/CLaTr/CLaTr_checkpoints`.
Evaluation Commands
- Evaluation for Our Text (motion)-to-Trajectory Results

  Note: Modify keys in the config file `./evaluate/CLaTr/configs/config_eval.yaml`:
  - `data_dir`: Text (motion)-to-Trajectory Testset Results
  - `key`: 'Movement'

```bash
# Extract CLaTr (motion) features
cd ./evaluate/CLaTr
export HYDRA_FULL_ERROR=1
python -m src.extraction checkpoint_path=CLaTr_checkpoints/epoch99_motion.ckpt
# A .npy file <<path-to-motion-output>-preds.npy> containing the CLaTr (motion) features will be saved in ./evaluate/CLaTr/output

# Evaluate CLaTr (motion)
cd ./evaluate/eval
python -m src.eval_only --pred_path <<path-to-motion-output>-preds.npy>
```
- Evaluation for Our Text (directorial)-to-Trajectory Results

  Note: Modify keys in the config file `./evaluate/CLaTr/configs/config_eval.yaml`:
  - `data_dir`: Text (directorial)-to-Trajectory Testset Results
  - `key`: 'Concise Interaction'

```bash
# Extract CLaTr (directorial) features
cd ./evaluate/CLaTr
export HYDRA_FULL_ERROR=1
python -m src.extraction checkpoint_path=CLaTr_checkpoints/epoch99_directorial.ckpt
# A .npy file <<path-to-directorial-output>-preds.npy> containing the CLaTr (directorial) features will be saved in ./evaluate/CLaTr/output

# Evaluate CLaTr (directorial)
cd ./evaluate/eval
python -m src.eval_only --pred_path <<path-to-directorial-output>-preds.npy>
```
Our paper presents four metrics: `captions/fscore`, `clatr/clatr_score`, `clatr/coverage`, and `clatr/fcd`.
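As a loose illustration of what a CLaTr-style score measures, the sketch below computes the mean cosine similarity between paired text and trajectory embeddings. This is an assumption about the metric's general form, for intuition only; the actual implementation lives in `./evaluate/eval`.

```python
import numpy as np

def clatr_style_score(text_emb: np.ndarray, traj_emb: np.ndarray) -> float:
    """Mean cosine similarity between paired text and trajectory embeddings (illustrative only)."""
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    traj_emb = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(text_emb * traj_emb, axis=1)))

# Placeholder embeddings; in practice these would come from the CLaTr text and trajectory encoders.
rng = np.random.default_rng(0)
print(clatr_style_score(rng.normal(size=(8, 512)), rng.normal(size=(8, 512))))
```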
Note: We provide DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions describing specific camera movements, interaction with the scene, and directorial intent.
Currently, we are releasing a subset of the dataset for validation purposes. Additional data will be released soon.
Note: We have released a subset of the DataDoP dataset for training and validation. Please organize your training data in the following structure. If you wish to use your own dataset, refer to our data format and modify `core/provider.py` as needed; a minimal layout check is sketched after the directory tree below.
```
GenDoP
├── DataDoP
│   ├── train
│   ├── test
│   ├── train_valid.txt
│   └── test_valid.txt
```
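The following minimal sketch checks that the layout matches what the loader expects. It assumes `train_valid.txt` lists one shot ID per line and that each shot has a `*_transforms_cleaning.json` file somewhere under `DataDoP/train`; adjust the pattern if the released data is organized differently.

```python
from pathlib import Path

root = Path("DataDoP")
split = "train"  # or "test"

# Assumed convention: <split>_valid.txt lists one shot ID per line.
ids = [line.strip() for line in (root / f"{split}_valid.txt").read_text().splitlines() if line.strip()]

# Look for the standardized trajectory file of each listed shot.
missing = [sid for sid in ids if not list((root / split).rglob(f"{sid}*_transforms_cleaning.json"))]
print(f"{len(ids)} shots listed, {len(missing)} without a *_transforms_cleaning.json file")
```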
Training Commands
- Text (motion)-to-Trajectory

```bash
accelerate launch --config_file acc_configs/gpu1.yaml main.py ArAE --workspace workspace --exp_name 'text_motion' --cond_mode 'text' --text_key 'Movement' --num_cond_tokens 77
```

- Text (directorial)-to-Trajectory

```bash
accelerate launch --config_file acc_configs/gpu1.yaml main.py ArAE --workspace workspace --exp_name 'text_directorial' --cond_mode 'text' --text_key 'Concise Interaction' --num_cond_tokens 77
```

- Text & RGBD-to-Trajectory

```bash
accelerate launch --config_file acc_configs/gpu1.yaml main.py ArAE --workspace workspace --exp_name 'text_rgbd' --cond_mode 'depth+image+text' --text_key 'Concise Interaction' --num_cond_tokens 591
```
Training Details
The model is trained on a single A100 (80GB) GPU for approximately 8 hours, with a batch size of 16, using a dataset of 30k examples for around 100 epochs. Recommended hyperparameters:
```bash
--discrete_bins 256 --pose_length 30 --hidden_dim 1024 --num_heads 8 --num_layers 12
```

You can adjust these parameters in `core/options.py` according to your specific requirements.
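For orientation, here is a hedged sketch of how these options might be grouped; the real definitions and defaults are in `core/options.py`, the field names mirror the flags above, and the comments are assumed meanings rather than documentation of the actual code.

```python
from dataclasses import dataclass

@dataclass
class TrainOptions:
    # Field names mirror the command-line flags above; defaults follow the recommended values.
    discrete_bins: int = 256    # presumably the number of bins for discretizing pose parameters
    pose_length: int = 30       # presumably the number of camera poses per trajectory
    hidden_dim: int = 1024      # transformer hidden size
    num_heads: int = 8          # attention heads per layer
    num_layers: int = 12        # transformer depth
    num_cond_tokens: int = 77   # 77 for text-only conditioning, 591 for text + RGBD
```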
- Release Dataset
- Release Dataset Construction Code
- Gradio Demo
This work builds on many amazing research works and open-source projects; thanks a lot to all the authors for sharing!

If you find our work helpful for your research, please consider giving us a star ⭐ and a citation 📝:
```bibtex
@misc{zhang2025gendopautoregressivecameratrajectory,
  title={GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography},
  author={Mengchen Zhang and Tong Wu and Jing Tan and Ziwei Liu and Gordon Wetzstein and Dahua Lin},
  year={2025},
  eprint={2504.07083},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.07083},
}
```