
Full-Duplex-Bench v1 & v1.5: A Benchmark for Evaluating Turn-Taking and Overlap Handling in Full-Duplex Spoken Dialogue Models

v1.0 Authors: Guan-Ting Lin, Jiachen Lian*, Tingle Li*, Qirui Wang*, Gopala Anumanchipalli, Alexander H. Liu, Hung-yi Lee

v1.5 Authors: Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian*, Tingle Li, Hung-yi Lee

TL;DR

Benchmark for full-duplex spoken dialogue models: v1.0 evaluates turn-taking; v1.5 adds overlap handling with richer metrics.


News 🔥

  • (2025/08/22) v1.5 Server-Client Model Inference Code Release: Added server-client inference scripts under model_inference/.
  • (2025/08/15) v1.5 Data Release: Added the v1.5 dataset with overlap scenarios and metadata annotations under dataset/.
  • (2025/08/14) v1.5 Evaluation Code Release: Added support for overlap handling with new metrics in Full-Duplex-Bench v1.5 under evaluation/.
  • (2025/06/05) Paper & ASR Model Update: Replaced the ASR model with nvidia/parakeet-tdt-0.6b-v2, which offers more reliable time-aligned transcriptions for evaluation. The paper has been updated to reflect this change.
  • (2025/04/30) Dataset Release: See the dataset/ folder.
  • (2025/04/30) Evaluation Code Release: See the evaluation/ folder.

Stay tuned for upcoming releases!

Highlights 💡

Full-Duplex-Bench v1.0

  • Provides an open and standardized benchmark to assess interactive behaviors systematically.
  • Evaluates four key turn-taking dimensions: Pause Handling, Backchanneling, Smooth Turn-Taking, and User Interruption Management.
  • Leverages automatic metrics for reproducible evaluation across models.

Full-Duplex-Bench v1.5

  • Extends the benchmark with four simulated overlap scenarios: user interruption, listener backchannel, side conversation, and ambient speech.
  • Supports both open-source and commercial models.
  • Introduces a comprehensive metric suite (categorical dialogue behaviors, stop and response latency, prosodic adaptation, and perceived speech quality), customizable to application needs.

Repository Structure 📂

This repository is organized into three main components. Please refer to the respective folders for details:

  • dataset/: Dataset release and detailed description of v1.0 and v1.5 benchmark data.
  • evaluation/: Evaluation code for running benchmark tasks and metrics.
  • model_inference/: Server–client inference setup for running full-duplex models in a streaming manner.

Each subfolder contains its own README with more detailed instructions.

📊 Evaluation Results

Full-Duplex-Bench (v1.0)

| Model | Pause: Synthetic TOR ↓ | Pause: Candor TOR ↓ | Backchannel: TOR ↓ | Backchannel: Freq ↑ | Backchannel: JSD ↓ | Smooth Turn: Candor TOR ↑ | Smooth Turn: Latency ↓ | Interruption: TOR ↑ | Interruption: GPT-4o ↑ | Interruption: Latency ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| dGSLM | 0.934 | 0.935 | 0.691 | 0.015 | 0.934 | 0.975 | 0.352 | 0.917 | 0.201 | 2.531 |
| Moshi | 0.985 | 0.980 | 1.000 | 0.001 | 0.957 | 0.941 | 0.265 | 1.000 | 0.765 | 0.257 |
| Freeze-Omni | 0.642 | 0.481 | 0.636 | 0.001 | 0.997 | 0.336 | 0.953 | 0.867 | 3.615 | 1.409 |
| Gemini Live | 0.255 | 0.310 | 0.091 | 0.012 | 0.896 | 0.655 | 1.301 | 0.891 | 3.376 | 1.183 |
  • TOR: Turn-Over Rate (↓: lower is better for Pause/Backchannel, ↑ for Smooth Turn/User Interruption)
  • Freq: Frequency of backchannels (↑ better)
  • JSD: Jensen-Shannon Divergence (↓ better)
  • Latency: Response latency (↓ better)
  • GPT-4o: GPT-4o-assessed contextual relevance (↑ better)
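
For illustration, here is one way metrics of this kind can be computed from per-clip outcomes and onset timestamps. This is a minimal sketch with assumed input formats, not the repository's implementation; the official metric code lives under evaluation/.

```python
# Illustrative sketch only; official metric code lives under evaluation/.
# The input formats (per-clip booleans, onsets in seconds) are assumptions.
import numpy as np
from scipy.spatial.distance import jensenshannon

def turn_over_rate(took_turn):
    """TOR: fraction of test clips in which the model took over the turn."""
    return sum(took_turn) / len(took_turn)

def timing_jsd(model_onsets, human_onsets, bins=20, span=(0.0, 5.0)):
    """JSD between model and human backchannel-timing histograms
    (onsets in seconds relative to the cue); lower is more human-like."""
    p, _ = np.histogram(model_onsets, bins=bins, range=span)
    q, _ = np.histogram(human_onsets, bins=bins, range=span)
    # scipy's jensenshannon normalizes the inputs and returns the
    # JS *distance*, so square it to obtain the divergence.
    return jensenshannon(p, q) ** 2
```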

Getting Started 🏁

Installation

conda create -n full-duplex-bench python=3.10
conda activate full-duplex-bench
pip install -r requirements.txt

Step-by-Step Instructions

1. Model Inference

The goal of model inference is to have the model generate a time-synchronous output.wav given the audio stream of user speech (input.wav). You can use your own model to generate the output speech for evaluation.

Example inference code for Freeze-Omni is provided under model_inference/freeze-omni for the different tasks.
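
As a rough sketch of what a model wrapper must do, the loop below streams input.wav to a model chunk by chunk and writes exactly one output chunk per input chunk, so output.wav stays time-aligned with the input. The process_chunk interface, chunk size, and sample rate are illustrative assumptions, not the repository's actual server-client API.

```python
# Sketch of time-synchronous inference. process_chunk(), the 80 ms chunk
# size, and the 16 kHz sample rate are assumptions, not the repo's API.
import numpy as np
import soundfile as sf

SR = 16000               # assumed model sample rate
CHUNK = SR * 80 // 1000  # 80 ms streaming chunks (assumed)

def run_time_synchronous(model, in_path="input.wav", out_path="output.wav"):
    """Feed user speech chunk by chunk; collect the model's simultaneous output."""
    user, sr = sf.read(in_path, dtype="float32")
    assert sr == SR, "resample the input to the model's sample rate first"
    out = []
    for start in range(0, len(user), CHUNK):
        # One output chunk per input chunk keeps output.wav aligned with input.wav.
        out.append(model.process_chunk(user[start:start + CHUNK]))
    sf.write(out_path, np.concatenate(out), SR)
```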

⚠️ Known Issue

We have observed an issue with Gemini Live inference, which we suspect is due to recent internal changes in Gemini. We are investigating and will share updates once a solution is found.

2. Prepare for Evaluation with Time-Aligned Transcription

Under the get_transcript/ folder, asr.py obtains time-aligned transcriptions for the model-generated audio. For more details, please see the README in that folder.
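
Conceptually, asr.py does something like the following with NVIDIA NeMo and the nvidia/parakeet-tdt-0.6b-v2 model mentioned in the news above; treat this as a sketch and use the provided script for the exact file formats.

```python
# Sketch: word-level time-aligned transcription with nvidia/parakeet-tdt-0.6b-v2.
# See get_transcript/asr.py for the script actually used by the benchmark.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)
output = asr_model.transcribe(["output.wav"], timestamps=True)
for stamp in output[0].timestamp["word"]:
    print(f"{stamp['start']:.2f}s - {stamp['end']:.2f}s : {stamp['word']}")
```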

3. Running Evaluations

Under the evaluation/ folder, please see the README for detailed instructions on running the evaluation for each task.

Citation 📖

If you have any questions, please feel free to submit an issue or contact Guan-Ting Lin ([email protected]).

If you found this research helpful, please consider citing our work:

@article{lin2025full,
  title={Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities},
  author={Lin, Guan-Ting and Lian, Jiachen and Li, Tingle and Wang, Qirui and Anumanchipalli, Gopala and Liu, Alexander H and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2503.04721},
  year={2025}
}

@article{lin2025fullv15,
  title={Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models},
  author={Lin, Guan-Ting and Kuan, Shih-Yun Shan and Wang, Qirui and Lian, Jiachen and Li, Tingle and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2507.23159},
  year={2025}
}
