Full-Duplex-Bench v1 & v1.5: A Benchmark for Evaluating Turn-Taking and Overlap Handling in Full-Duplex Spoken Dialogue Models
v1.0 Authors: Guan-Ting Lin, Jiachen Lian*, Tingle Li*, Qirui Wang*, Gopala Anumanchipalli, Alexander H. Liu, Hung-yi Lee
v1.5 Authors: Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian*, Tingle Li, Hung-yi Lee
Benchmark for full-duplex spoken dialogue models: v1.0 evaluates turn-taking, and v1.5 adds overlap handling with richer metrics.
- (2025/8/22) v1.5 Server-Client Model Inference Code Release: Added server-client inference scripts under `model_inference/`.
- (2025/8/15) v1.5 Data Release: Added the v1.5 dataset with overlap scenarios and metadata annotations under `dataset/`.
- (2025/8/14) v1.5 Evaluation Code Release: Added support for overlap handling with new metrics in Full-Duplex-Bench v1.5 under `evaluation/`.
- (2025/6/05) Paper & ASR Model Update: Replaced the ASR model with nvidia/parakeet-tdt-0.6b-v2, which offers more reliable time-aligned transcriptions for evaluation. The paper has been updated accordingly.
- (2025/4/30) Dataset Released: see under the `dataset/` folder.
- (2025/4/30) Evaluation Code Released: see under the `evaluation/` folder.
Stay tuned for upcoming releases!
- Provides an open and standardized benchmark to assess interactive behaviors systematically.
- Evaluates four key turn-taking dimensions: Pause Handling, Backchanneling, Smooth Turn-Taking, and User Interruption Management.
- Leverages automatic metrics for reproducible evaluation across models.
- Extends the benchmark with four simulated overlap scenarios: user interruption, listener backchannel, side conversation, and ambient speech.
- Supports both open-sourced and commercial models.
- Introduces a comprehensive metric suite (categorical dialogue behaviors, stop and response latency, prosodic adaptation, and perceived speech quality) that can be customized to application needs.
This repository is organized into three main components. Please refer to the respective folders for details:
- `dataset/`: Dataset release and detailed description of the v1.0 and v1.5 benchmark data.
- `evaluation/`: Evaluation code for running benchmark tasks and metrics.
- `model_inference/`: Server-client inference setup for running full-duplex models in a streaming manner.
Each subfolder contains its own README with more detailed instructions.
| Model | Pause Handling: Synthetic TOR ↓ | Pause Handling: Candor TOR ↓ | Backchannel: TOR ↓ | Backchannel: Freq ↑ | Backchannel: JSD ↓ | Smooth Turn-Taking: Candor TOR ↑ | Smooth Turn-Taking: Latency ↓ | User Interruption: TOR ↑ | User Interruption: GPT-4o ↑ | User Interruption: Latency ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| dGSLM | 0.934 | 0.935 | 0.691 | 0.015 | 0.934 | 0.975 | 0.352 | 0.917 | 0.201 | 2.531 |
| Moshi | 0.985 | 0.980 | 1.000 | 0.001 | 0.957 | 0.941 | 0.265 | 1.000 | 0.765 | 0.257 |
| Freeze-Omni | 0.642 | 0.481 | 0.636 | 0.001 | 0.997 | 0.336 | 0.953 | 0.867 | 3.615 | 1.409 |
| Gemini Live | 0.255 | 0.310 | 0.091 | 0.012 | 0.896 | 0.655 | 1.301 | 0.891 | 3.376 | 1.183 |
- TOR: Turn-Over Rate (↓ lower is better for Pause Handling/Backchannel, ↑ higher is better for Smooth Turn-Taking/User Interruption)
- Freq: Frequency of backchannels (↑ better)
- JSD: Jensen-Shannon Divergence (↓ better)
- Latency: Response latency (↓ better)
- GPT-4o: GPT-4o-assessed contextual relevance (↑ better)
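As a rough illustration of how metrics like TOR and latency can be derived from time-aligned transcripts, here is a minimal sketch. The function names, data layout, and the simple "any model speech inside the evaluation window" rule are illustrative assumptions, not the exact code shipped in `evaluation/`.

```python
# Illustrative sketch (not the benchmark implementation): compute Turn-Over Rate
# (TOR) and response latency from time-aligned model transcripts. Each model
# output is a list of (word, start_sec, end_sec) tuples, e.g. as produced by a
# time-aligned ASR pass over output.wav.

from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (word, start_sec, end_sec)

def took_turn(words: List[Word], window_start: float, window_end: float) -> bool:
    """True if the model produced any speech overlapping the evaluation window
    (e.g. during a user pause, or right after the user finishes a turn)."""
    return any(start < window_end and end > window_start for _, start, end in words)

def turn_over_rate(samples: List[Tuple[List[Word], float, float]]) -> float:
    """Fraction of samples in which the model took the turn inside its window."""
    if not samples:
        return 0.0
    taken = sum(took_turn(words, start, end) for words, start, end in samples)
    return taken / len(samples)

def response_latency(words: List[Word], user_turn_end: float) -> Optional[float]:
    """Seconds from the end of the user's turn to the model's first word after it."""
    onsets = [start for _, start, _ in words if start >= user_turn_end]
    return min(onsets) - user_turn_end if onsets else None
```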
conda create -n full-duplex-bench python=3.10
conda activate full-duplex-bench
pip install -r requirements.txt
The goal of model inference is to have the model generate a time-synchronous `output.wav` given the audio stream of user speech (`input.wav`). You can use your own model to generate the output speech for evaluation. We provide example inference code for Freeze-Omni under `model_inference/freeze-omni` for different tasks.
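If you plug in your own model, the sketch below shows the general streaming pattern: feed `input.wav` to the model chunk by chunk and write whatever the model emits into a time-synchronous `output.wav`. The `StreamingModel` interface and chunk size are placeholder assumptions; adapt them to your model's actual API (see `model_inference/freeze-omni` for the server-client variant).

```python
# Illustrative sketch: stream a user audio file through a full-duplex model and
# record a time-synchronous output track. `StreamingModel` is a placeholder for
# your own model's streaming interface, not an API provided by this repository.

import numpy as np
import soundfile as sf

CHUNK_SEC = 0.08  # feed audio in small chunks to mimic real-time streaming

class StreamingModel:
    """Placeholder: consumes one chunk of user audio and returns the model's
    audio for the same time span (silence while the model is listening)."""
    def step(self, user_chunk: np.ndarray, sample_rate: int) -> np.ndarray:
        return np.zeros_like(user_chunk)  # replace with your model's output

def run_streaming_inference(model: StreamingModel, input_path: str, output_path: str) -> None:
    user_audio, sr = sf.read(input_path, dtype="float32")
    chunk = int(CHUNK_SEC * sr)
    out = []
    for start in range(0, len(user_audio), chunk):
        user_chunk = user_audio[start:start + chunk]
        model_chunk = model.step(user_chunk, sr)
        # keep the output aligned with the input so output.wav stays time-synchronous
        out.append(model_chunk[:len(user_chunk)])
    sf.write(output_path, np.concatenate(out), sr)

if __name__ == "__main__":
    run_streaming_inference(StreamingModel(), "input.wav", "output.wav")
```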
We have observed the same issue and suspect it is due to recent internal changes in Gemini.
We are investigating and will share updates once a solution is found.
Under the `get_transcript` folder, you can find `asr.py`, which produces time-aligned transcriptions for the model-generated audio. For more details, please see the README in that folder.
Under the `evaluation` folder, please see the README file for detailed instructions on running the evaluation for each task.
If you have any questions, please feel free to submit an issue or contact Guan-Ting Lin ([email protected])
If you found this research helpful, please consider citing our work:
@article{lin2025full,
title={Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities},
author={Lin, Guan-Ting and Lian, Jiachen and Li, Tingle and Wang, Qirui and Anumanchipalli, Gopala and Liu, Alexander H and Lee, Hung-yi},
journal={arXiv preprint arXiv:2503.04721},
year={2025}
}
@article{lin2025fullduplexbench15,
title={Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models},
author={Lin, Guan-Ting and Kuan, Shih-Yun Shan and Wang, Qirui and Lian, Jiachen and Li, Tingle and Lee, Hung-yi},
journal={arXiv preprint arXiv:2507.23159},
year={2025}
}