
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
Jie Sun 3† Guangliang Cheng 1 Yifei Zhang 5 Bin Dong 4 Kaizhu Huang 4†
4 Duke Kunshan University 5 Ricoh Software Research Center
[2025.09.03] Our paper was accepted by the International Journal of Computer Vision (IJCV).
[2025.07.30] Training and evaluation codes have been released.
[2025.07.03] Our demo KDTalker++ was accepted to the ACM Multimedia 2025 Demo and Video Track.
[2025.05.26] Important update! New models and new features have been added to the local deployment of KDTalker, including background replacement and expression editing.
[2025.04.13] A more powerful TTS has been added to our local deployment of KDTalker.
[2025.03.14] Released the paper-version demo and inference code.
Demo video: Comparative.mp4
Local deployment (RTX 4090) demo: KDTalker. You can also try the demo deployed on Hugging Face, where inference is slower due to ZeroGPU.

KDTalker can run on a single RTX 4090 or RTX 3090.
Note: Make sure your system has git, conda, and FFmpeg installed.
git clone https://github.com/chaolongy/KDTalker
cd KDTalker
# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
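To verify the environment, a quick sanity check that PyTorch sees the GPU (a minimal sketch, not a script shipped with the repository):

```python
# check_env.py -- hypothetical helper, not part of KDTalker
import torch

print("PyTorch:", torch.__version__)                 # expected 2.3.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))     # e.g. an RTX 4090 or RTX 3090
```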
First, download all the LivePortrait pretrained weights from Google Drive, then unzip and place them in ./pretrained_weights.
Make sure the directory structure is as follows:
pretrained_weights
├── insightface
│   └── models
│       └── buffalo_l
│           ├── 2d106det.onnx
│           └── det_10g.onnx
└── liveportrait
    ├── base_models
    │   ├── appearance_feature_extractor.pth
    │   ├── motion_extractor.pth
    │   ├── spade_generator.pth
    │   └── warping_module.pth
    ├── landmark.onnx
    └── retargeting_models
        └── stitching_retargeting_module.pth
Next, download the weights for the face detector, audio extractor, and KDTalker from Google Drive and put them in ./ckpts. Alternatively, you can download all of the above weights from Hugging Face.
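If you prefer to script the Hugging Face download, a minimal sketch using huggingface_hub is shown below; the repository id is a placeholder, so substitute the actual one from the project page:

```python
# download_weights.py -- hypothetical helper; repo_id is a placeholder
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/KDTalker",  # placeholder: replace with the actual model repo id
    local_dir="./",               # assumed layout: files land under ./pretrained_weights and ./ckpts
)
```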
python ./dataset_process/extract_motion_dataset.py -mp4_root ./path_to_your_video_root
python ./dataset_process/cal_norm.py
Please configure your own "WANDB_API_KEY" in ./config/structured.py, then run ./main.py:
python main.py
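If you prefer not to hard-code the key in ./config/structured.py, Weights & Biases can also pick it up at runtime; a minimal sketch (an assumption about your setup, not the repository's default flow):

```python
# sketch: supply the W&B key at runtime instead of editing structured.py
import os
import wandb

# either export WANDB_API_KEY=<your key> in the shell, or log in explicitly:
wandb.login(key=os.environ["WANDB_API_KEY"])
```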
python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
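Inference handles one image/audio pair per call; to drive a whole folder, a minimal batch sketch that simply loops over inference.py with the flags shown above (the file layout below is an assumption):

```python
# batch_inference.py -- hypothetical wrapper around inference.py
import subprocess
from pathlib import Path

images = sorted(Path("./example/source_image").glob("*.png"))   # assumed input layout
audios = sorted(Path("./example/driven_audio").glob("*.wav"))

for img, wav in zip(images, audios):
    out = Path("./results") / f"{img.stem}.mp4"
    subprocess.run([
        "python", "inference.py",
        "-source_image", str(img),
        "-driven_audio", str(wav),
        "-output", str(out),
    ], check=True)
```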
First, download the Hopenet pretrained weights from Google Drive and put them in ./evaluation/deep-head-pose/, then run ./evaluation/deep-head-pose/test_on_video_dlib.py:
python test_on_video_dlib.py -video ./path_to_your_video_root
Finally, calculate the standard deviation:
python cal_std.py
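For reference, the diversity metric here is the standard deviation of the head-pose angles (yaw, pitch, roll) predicted by Hopenet over the generated videos. A minimal sketch of that computation, assuming the angles have already been collected into an array (cal_std.py may differ in details such as averaging order):

```python
# pose_std_sketch.py -- illustrative only; see cal_std.py for the official script
import numpy as np

def pose_std(angles: np.ndarray) -> float:
    """angles: (num_frames, 3) yaw/pitch/roll in degrees for one video (assumed shape)."""
    per_axis_std = angles.std(axis=0)   # std over time for yaw, pitch, roll
    return float(per_axis_std.mean())   # average the three axes into one diversity score

# example usage with random stand-in data
print(pose_std(np.random.randn(125, 3) * 10.0))
```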
python cal_beat_align_score.py -video_root ./path_to_your_video_root
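The beat-align score follows the formulation popularized by Bailando: each audio beat is matched to its nearest motion (head-pose) beat under a Gaussian kernel, and the kernel values are averaged. A minimal sketch under that assumption (cal_beat_align_score.py may differ, e.g. in the choice of sigma or which beat set is iterated):

```python
# beat_align_sketch.py -- illustrative only; see cal_beat_align_score.py for the official script
import numpy as np

def beat_align_score(audio_beats, motion_beats, sigma=0.1):
    """Average Gaussian agreement between each audio beat and its nearest motion beat.

    audio_beats, motion_beats: 1-D sequences of beat times in seconds (assumed input format).
    """
    audio_beats = np.asarray(audio_beats, dtype=float)
    motion_beats = np.asarray(motion_beats, dtype=float)
    scores = []
    for t in audio_beats:
        nearest = np.min(np.abs(motion_beats - t))                 # distance to the closest motion beat
        scores.append(np.exp(-(nearest ** 2) / (2 * sigma ** 2)))  # Gaussian kernel on that distance
    return float(np.mean(scores))

# example usage with toy beat times
print(beat_align_score([0.5, 1.0, 1.5], [0.48, 1.1, 1.52]))
```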
For lip-sync evaluation, please set it up following Wav2Lip.
Our code is under the CC-BY-NC 4.0 license and intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at [email protected]
If you find this code helpful for your research, please cite:
@misc{yang2025kdtalker,
title={Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait},
author={Chaolong Yang and Kai Yao and Yuyao Yan and Chenru Jiang and Weiguang Zhao and Jie Sun and Guangliang Cheng and Yifei Zhang and Bin Dong and Kaizhu Huang},
year={2025},
eprint={2503.12963},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.12963},
}
We acknowledge these works for their public code and selfless help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, deep-head-pose, Bailando, etc.