
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait



1 University of Liverpool   2 Ant Group   3 Xi’an Jiaotong-Liverpool University  
4 Duke Kunshan University   5 Ricoh Software Research Center  

News

[2025.09.03] Our paper was accepted by the International Journal of Computer Vision (IJCV).

[2025.07.30] Training and evaluation codes have been released.

[2025.07.03] Our demo KDTalker++ was accepted by the 2025 ACM Multimedia Demo and Video Track.

[2025.05.26] Important update! New models and new features have been added to the local deployment of KDTalker, including background replacement and expression editing.

[2025.04.13] A more powerful TTS has been added to our local KDTalker deployment.

[2025.03.14] Released the paper-version demo and inference code.

Comparative videos

Comparative.mp4

Demo

Demo of the local KDTalker deployment (RTX 4090).

You can also visit the demo deployed on Hugging Face, where inference is slower due to ZeroGPU.


Environment

KDTalker can run on a single RTX 4090 or RTX 3090.

1. Clone the code and prepare the environment

Note: Make sure your system has git, conda, and FFmpeg installed.

git clone https://github.com/chaolongy/KDTalker
cd KDTalker

# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker

conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia

pip install -r requirements.txt
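
Optionally, verify that the pinned PyTorch build sees your GPU (a quick sanity check, assuming the versions installed above):

# Optional sanity check: confirm the pinned PyTorch build can use CUDA.
import torch
print(torch.__version__)           # expected: 2.3.0
print(torch.cuda.is_available())   # should print True on an RTX 3090/4090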

2. Download pretrained weights

First, you can download all LivePortrait pretrained weights from Google Drive. Unzip and place them in ./pretrained_weights, ensuring the directory structure is as follows:

pretrained_weights
├── insightface
│   └── models
│       └── buffalo_l
│           ├── 2d106det.onnx
│           └── det_10g.onnx
└── liveportrait
    ├── base_models
    │   ├── appearance_feature_extractor.pth
    │   ├── motion_extractor.pth
    │   ├── spade_generator.pth
    │   └── warping_module.pth
    ├── landmark.onnx
    └── retargeting_models
        └── stitching_retargeting_module.pth

You can download the weights for the face detector, audio extractor and KDTalker from Google Drive. Put them in ./ckpts.

Alternatively, you can download all of the above weights from Hugging Face.
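
Before running anything, a quick check like the sketch below (a hypothetical helper, not part of the repo) can confirm the weights landed in the expected locations from the tree above:

# Hypothetical helper to verify the directory tree shown above is in place.
from pathlib import Path

expected = [
    "pretrained_weights/insightface/models/buffalo_l/2d106det.onnx",
    "pretrained_weights/insightface/models/buffalo_l/det_10g.onnx",
    "pretrained_weights/liveportrait/base_models/appearance_feature_extractor.pth",
    "pretrained_weights/liveportrait/base_models/motion_extractor.pth",
    "pretrained_weights/liveportrait/base_models/spade_generator.pth",
    "pretrained_weights/liveportrait/base_models/warping_module.pth",
    "pretrained_weights/liveportrait/landmark.onnx",
    "pretrained_weights/liveportrait/retargeting_models/stitching_retargeting_module.pth",
]
missing = [p for p in expected if not Path(p).is_file()]
print("All weights found." if not missing else f"Missing files: {missing}")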

Training

1. Data processing

python ./dataset_process/extract_motion_dataset.py -mp4_root ./path_to_your_video_root
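
If your videos are spread across several roots, a small wrapper such as the sketch below can batch the extraction; it relies only on the -mp4_root flag shown above, and the paths are illustrative:

# Hypothetical wrapper: run the extraction script over several video roots.
import subprocess

for root in ["./data/train_videos", "./data/test_videos"]:  # example paths
    subprocess.run(
        ["python", "./dataset_process/extract_motion_dataset.py", "-mp4_root", root],
        check=True,
    )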

2. Calculate data norm

python ./dataset_process/cal_norm.py
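
For reference, a dataset "norm" conventionally means per-dimension statistics over the extracted motion vectors, saved and reused to normalize training data. The sketch below illustrates the idea with assumed file names and array shapes, not the repo's exact format:

# A minimal sketch of computing a data norm: per-dimension mean and std.
import numpy as np

motions = np.load("motions.npy")                 # hypothetical (num_samples, dim)
np.savez("norm.npz",
         mean=motions.mean(axis=0),
         std=motions.std(axis=0) + 1e-8)         # epsilon avoids divide-by-zero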

3. Configure wandb and train

Please configure your own "WANDB_API_KEY" in ./config/structured.py, then run ./main.py:

python main.py
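
As an alternative to editing ./config/structured.py, wandb also reads the key from the standard WANDB_API_KEY environment variable; a hypothetical one-step launcher:

# Hypothetical launcher: export the key and start training in one step.
# The child process inherits the environment variable set here.
import os
import subprocess

os.environ["WANDB_API_KEY"] = "<your-key>"   # placeholder, not a real key
subprocess.run(["python", "main.py"], check=True)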

Inference

python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
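
To drive several portraits in one go, a hypothetical batch loop over the same flags (-source_image, -driven_audio, -output) might look like this; the pair list and output naming are illustrative:

# Hypothetical batch-inference loop over image/audio pairs.
import subprocess
from pathlib import Path

pairs = [
    ("./example/source_image/WDA_BenCardin1_000.png",
     "./example/driven_audio/WDA_BenCardin1_000.wav"),
]
for image, audio in pairs:
    output = f"./results/{Path(image).stem}.mp4"
    subprocess.run(["python", "inference.py", "-source_image", image,
                    "-driven_audio", audio, "-output", output], check=True)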

Evaluation

1. Diversity

First, please download the Hopenet pretrained weights from Google Drive and put them in ./evaluation/deep-head-pose/. Then run ./evaluation/deep-head-pose/test_on_video_dlib.py:

python test_on_video_dlib.py -video ./path_to_your_video_root

Finally, calculate the standard deviation:

python cal_std.py
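
For intuition, the diversity metric boils down to the spread of head poses over a video. A minimal sketch, assuming Hopenet yields per-frame (yaw, pitch, roll) angles (file name and layout are illustrative):

# Sketch of pose diversity: per-angle std over a video, averaged over angles.
import numpy as np

poses = np.load("poses.npy")      # hypothetical (num_frames, 3): yaw, pitch, roll
diversity = poses.std(axis=0).mean()
print(f"pose diversity (std): {diversity:.3f}")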

2. Beat align

python cal_beat_align_score.py -video_root ./path_to_your_video_root
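
For reference, a Bailando-style beat-alignment score detects audio beats with librosa, takes local minima of head-motion velocity as motion beats, and averages a Gaussian reward over each audio beat's nearest motion beat. The sketch below illustrates this; paths, fps, and sigma are assumptions, not the repo's exact settings:

# Sketch of a beat-alignment score in the style of Bailando.
import librosa
import numpy as np

y, sr = librosa.load("audio.wav")
_, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
audio_beats = librosa.frames_to_time(beat_frames, sr=sr)   # beat times in seconds

poses = np.load("poses.npy")                       # hypothetical (num_frames, 3)
fps = 25.0                                         # assumed video frame rate
vel = np.linalg.norm(np.diff(poses, axis=0), axis=1)
is_min = (vel[1:-1] < vel[:-2]) & (vel[1:-1] < vel[2:])
motion_beats = (np.where(is_min)[0] + 1) / fps     # frame indices -> seconds
assert motion_beats.size > 0, "no motion beats detected"

sigma = 0.1                                        # tolerance in seconds
score = np.mean([np.exp(-np.min((motion_beats - t) ** 2) / (2 * sigma**2))
                 for t in audio_beats])
print(f"beat align score: {score:.3f}")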

3. LSE-C and LSE-D

Please follow the setup in Wav2Lip.

Contact

Our code is under the CC-BY-NC 4.0 license and intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at [email protected]

Citation

If you find this code helpful for your research, please cite:

@misc{yang2025kdtalker,
      title={Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait}, 
      author={Chaolong Yang and Kai Yao and Yuyao Yan and Chenru Jiang and Weiguang Zhao and Jie Sun and Guangliang Cheng and Yifei Zhang and Bin Dong and Kaizhu Huang},
      year={2025},
      eprint={2503.12963},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.12963}, 
}

Acknowledgements

We thank these works for their public code and generous help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, deep-head-pose, Bailando, etc.

