
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
Jie Sun 3† Guangliang Cheng 1 Yifei Zhang 5 Bin Dong 4 Kaizhu Huang 4†
4 Duke Kunshan University 5 Ricoh Software Research Center
[2025.09.03] Our paper was accepted by the International Journal of Computer Vision (IJCV).
[2025.07.30] Training and evaluation codes have been released.
[2025.07.03] Our demo KDTalker++ was accepted to the ACM Multimedia 2025 Demo and Video Track.
[2025.05.26] Important update! New models and new features have been added to the local deployment of KDTalker, including background replacement and expression editing.
[2025.04.13] A more powerful TTS has been added to our local deployment of KDTalker.
[2025.03.14] Released the paper-version demo and inference code.
Demo video: Comparative.mp4
Local deployment (RTX 4090) demo: KDTalker. You can also try the demo deployed on Hugging Face, where inference is slower due to ZeroGPU.

KDTalker can run on a single RTX 4090 or RTX 3090.
Note: Make sure your system has git, conda, and FFmpeg installed.
git clone https://github.com/chaolongy/KDTalker
cd KDTalker
# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
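To verify the environment, a quick sanity check that PyTorch sees the GPU (a minimal sketch, not a script shipped with the repository):

```python
# check_env.py -- hypothetical helper, not part of KDTalker
import torch

print("PyTorch:", torch.__version__)                 # expected 2.3.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))     # e.g. an RTX 4090 or RTX 3090
```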
First, download all the LivePortrait pretrained weights from Google Drive, then unzip and place them in ./pretrained_weights.
Make sure the directory structure is as follows:
pretrained_weights
├── insightface
│   └── models
│       └── buffalo_l
│           ├── 2d106det.onnx
│           └── det_10g.onnx
└── liveportrait
    ├── base_models
    │   ├── appearance_feature_extractor.pth
    │   ├── motion_extractor.pth
    │   ├── spade_generator.pth
    │   └── warping_module.pth
    ├── landmark.onnx
    └── retargeting_models
        └── stitching_retargeting_module.pth
Next, download the weights for the face detector, audio extractor, and KDTalker from Google Drive and put them in ./ckpts. Alternatively, you can download all of the above weights from Hugging Face.
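If you prefer to script the Hugging Face download, a minimal sketch using huggingface_hub is shown below; the repository id is a placeholder, so substitute the actual one from the project page:

```python
# download_weights.py -- hypothetical helper; repo_id is a placeholder
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/KDTalker",  # placeholder: replace with the actual model repo id
    local_dir="./",               # assumed layout: files land under ./pretrained_weights and ./ckpts
)
```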
python ./dataset_process/extract_motion_dataset.py -mp4_root ./path_to_your_video_root
python ./dataset_process/cal_norm.py
Please configure your own "WANDB_API_KEY" in ./config/structured.py, then run ./main.py:
python main.py
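If you prefer not to hard-code the key in ./config/structured.py, Weights & Biases can also pick it up at runtime; a minimal sketch (an assumption about your setup, not the repository's default flow):

```python
# sketch: supply the W&B key at runtime instead of editing structured.py
import os
import wandb

# either export WANDB_API_KEY=<your key> in the shell, or log in explicitly:
wandb.login(key=os.environ["WANDB_API_KEY"])
```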
python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
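Inference handles one image/audio pair per call; to drive a whole folder, a minimal batch sketch that simply loops over inference.py with the flags shown above (the file layout below is an assumption):

```python
# batch_inference.py -- hypothetical wrapper around inference.py
import subprocess
from pathlib import Path

images = sorted(Path("./example/source_image").glob("*.png"))   # assumed input layout
audios = sorted(Path("./example/driven_audio").glob("*.wav"))

for img, wav in zip(images, audios):
    out = Path("./results") / f"{img.stem}.mp4"
    subprocess.run([
        "python", "inference.py",
        "-source_image", str(img),
        "-driven_audio", str(wav),
        "-output", str(out),
    ], check=True)
```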
First, download the Hopenet pretrained weights from Google Drive and put them in ./evaluation/deep-head-pose/, then run ./evaluation/deep-head-pose/test_on_video_dlib.py:
python test_on_video_dlib.py -video ./path_to_your_video_root
Finally, calculate the standard deviation:
python cal_std.py
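For reference, the diversity metric here is the standard deviation of the head-pose angles (yaw, pitch, roll) predicted by Hopenet over the generated videos. A minimal sketch of that computation, assuming the angles have already been collected into an array (cal_std.py may differ in details such as averaging order):

```python
# pose_std_sketch.py -- illustrative only; see cal_std.py for the official script
import numpy as np

def pose_std(angles: np.ndarray) -> float:
    """angles: (num_frames, 3) yaw/pitch/roll in degrees for one video (assumed shape)."""
    per_axis_std = angles.std(axis=0)   # std over time for yaw, pitch, roll
    return float(per_axis_std.mean())   # average the three axes into one diversity score

# example usage with random stand-in data
print(pose_std(np.random.randn(125, 3) * 10.0))
```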
python cal_beat_align_score.py -video_root ./path_to_your_video_root
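The beat-align score follows the formulation popularized by Bailando: each audio beat is matched to its nearest motion (head-pose) beat under a Gaussian kernel, and the kernel values are averaged. A minimal sketch under that assumption (cal_beat_align_score.py may differ, e.g. in the choice of sigma or which beat set is iterated):

```python
# beat_align_sketch.py -- illustrative only; see cal_beat_align_score.py for the official script
import numpy as np

def beat_align_score(audio_beats, motion_beats, sigma=0.1):
    """Average Gaussian agreement between each audio beat and its nearest motion beat.

    audio_beats, motion_beats: 1-D sequences of beat times in seconds (assumed input format).
    """
    audio_beats = np.asarray(audio_beats, dtype=float)
    motion_beats = np.asarray(motion_beats, dtype=float)
    scores = []
    for t in audio_beats:
        nearest = np.min(np.abs(motion_beats - t))                 # distance to the closest motion beat
        scores.append(np.exp(-(nearest ** 2) / (2 * sigma ** 2)))  # Gaussian kernel on that distance
    return float(np.mean(scores))

# example usage with toy beat times
print(beat_align_score([0.5, 1.0, 1.5], [0.48, 1.1, 1.52]))
```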
For lip-sync evaluation, please set it up following Wav2Lip.
Our code is under the CC-BY-NC 4.0 license and intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at [email protected]
If you find this code helpful for your research, please cite:
@misc{yang2025kdtalker,
title={Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait},
author={Chaolong Yang and Kai Yao and Yuyao Yan and Chenru Jiang and Weiguang Zhao and Jie Sun and Guangliang Cheng and Yifei Zhang and Bin Dong and Kaizhu Huang},
year={2025},
eprint={2503.12963},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.12963},
}
We acknowledge these works for their public code and selfless help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, deep-head-pose, Bailando, etc.