Full-Duplex-Bench v1 & v1.5: A Benchmark for Evaluating Turn-Taking and Overlap Handling in Full-Duplex Spoken Dialogue Models
v1.0 Authors: Guan-Ting Lin, Jiachen Lian*, Tingle Li*, Qirui Wang*, Gopala Anumanchipalli, Alexander H. Liu, Hung-yi Lee
v1.5 Authors: Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian*, Tingle Li, Hung-yi Lee
Benchmark for full-duplex spoken dialogue models: v1.0 evaluates turn-taking, and v1.5 adds overlap handling with richer metrics.
- (2025/8/22) v1.5 Server-Client Model Inference Code Release: Added server-client inference scripts under `model_inference/`.
- (2025/8/15) v1.5 Data Release: Added the v1.5 dataset with overlap scenarios and metadata annotations under `dataset/`.
- (2025/8/14) v1.5 Evaluation Code Release: Added support for overlap handling with new metrics in Full-Duplex-Bench v1.5 under `evaluation/`.
- (2025/6/05) Paper & ASR Model Update: Replaced the ASR model with nvidia/parakeet-tdt-0.6b-v2, which offers more reliable time-aligned transcriptions for evaluation. The paper has been updated accordingly.
- (2025/4/30) Dataset Released: see under the `dataset/` folder.
- (2025/4/30) Evaluation Code Released: see under the `evaluation/` folder.
Stay tuned for upcoming releases!
- Provides an open and standardized benchmark to assess interactive behaviors systematically.
- Evaluates four key turn-taking dimensions: Pause Handling, Backchanneling, Smooth Turn-Taking, and User Interruption Management.
- Leverages automatic metrics for reproducible evaluation across models.
- Extends the benchmark with four simulated overlap scenarios: user interruption, listener backchannel, side conversation, and ambient speech.
- Supports both open-sourced and commercial models.
- Introduces a comprehensive metric suite (categorical dialogue behaviors, stop and response latency, prosodic adaptation, and perceived speech quality) that can be customized to application needs.
This repository is organized into three main components. Please refer to the respective folders for details:
- `dataset/`: Dataset release and detailed description of the v1.0 and v1.5 benchmark data.
- `evaluation/`: Evaluation code for running benchmark tasks and metrics.
- `model_inference/`: Server-client inference setup for running full-duplex models in a streaming manner.
Each subfolder contains its own README with more detailed instructions.
| Model | Pause Handling: Synthetic TOR ↓ | Pause Handling: Candor TOR ↓ | Backchannel: TOR ↓ | Backchannel: Freq ↑ | Backchannel: JSD ↓ | Smooth Turn-Taking: Candor TOR ↑ | Smooth Turn-Taking: Latency ↓ | User Interruption: TOR ↑ | User Interruption: GPT-4o ↑ | User Interruption: Latency ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| dGSLM | 0.934 | 0.935 | 0.691 | 0.015 | 0.934 | 0.975 | 0.352 | 0.917 | 0.201 | 2.531 |
| Moshi | 0.985 | 0.980 | 1.000 | 0.001 | 0.957 | 0.941 | 0.265 | 1.000 | 0.765 | 0.257 |
| Freeze-Omni | 0.642 | 0.481 | 0.636 | 0.001 | 0.997 | 0.336 | 0.953 | 0.867 | 3.615 | 1.409 |
| Gemini Live | 0.255 | 0.310 | 0.091 | 0.012 | 0.896 | 0.655 | 1.301 | 0.891 | 3.376 | 1.183 |
- TOR: Turn-Over Rate (↓ lower is better for Pause Handling/Backchannel, ↑ higher is better for Smooth Turn-Taking/User Interruption)
- Freq: Frequency of backchannels (↑ better)
- JSD: Jensen-Shannon Divergence (↓ better)
- Latency: Response latency (↓ better)
- GPT-4o: GPT-4o-assessed contextual relevance (↑ better)
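As a rough illustration of how metrics like TOR and latency can be derived from time-aligned transcripts, here is a minimal sketch. The function names, data layout, and the simple "any model speech inside the evaluation window" rule are illustrative assumptions, not the exact code shipped in `evaluation/`.

```python
# Illustrative sketch (not the benchmark implementation): compute Turn-Over Rate
# (TOR) and response latency from time-aligned model transcripts. Each model
# output is a list of (word, start_sec, end_sec) tuples, e.g. as produced by a
# time-aligned ASR pass over output.wav.

from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (word, start_sec, end_sec)

def took_turn(words: List[Word], window_start: float, window_end: float) -> bool:
    """True if the model produced any speech overlapping the evaluation window
    (e.g. during a user pause, or right after the user finishes a turn)."""
    return any(start < window_end and end > window_start for _, start, end in words)

def turn_over_rate(samples: List[Tuple[List[Word], float, float]]) -> float:
    """Fraction of samples in which the model took the turn inside its window."""
    if not samples:
        return 0.0
    taken = sum(took_turn(words, start, end) for words, start, end in samples)
    return taken / len(samples)

def response_latency(words: List[Word], user_turn_end: float) -> Optional[float]:
    """Seconds from the end of the user's turn to the model's first word after it."""
    onsets = [start for _, start, _ in words if start >= user_turn_end]
    return min(onsets) - user_turn_end if onsets else None
```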
conda create -n full-duplex-bench python=3.10
conda activate full-duplex-bench
pip install -r requirements.txt
The goal of model inference is to have the model generate a time-synchronous `output.wav` given the audio stream of user speech (`input.wav`). You can use your own model to generate the output speech for evaluation. We provide example inference code for Freeze-Omni under `model_inference/freeze-omni` for different tasks.
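If you plug in your own model, the sketch below shows the general streaming pattern: feed `input.wav` to the model chunk by chunk and write whatever the model emits into a time-synchronous `output.wav`. The `StreamingModel` interface and chunk size are placeholder assumptions; adapt them to your model's actual API (see `model_inference/freeze-omni` for the server-client variant).

```python
# Illustrative sketch: stream a user audio file through a full-duplex model and
# record a time-synchronous output track. `StreamingModel` is a placeholder for
# your own model's streaming interface, not an API provided by this repository.

import numpy as np
import soundfile as sf

CHUNK_SEC = 0.08  # feed audio in small chunks to mimic real-time streaming

class StreamingModel:
    """Placeholder: consumes one chunk of user audio and returns the model's
    audio for the same time span (silence while the model is listening)."""
    def step(self, user_chunk: np.ndarray, sample_rate: int) -> np.ndarray:
        return np.zeros_like(user_chunk)  # replace with your model's output

def run_streaming_inference(model: StreamingModel, input_path: str, output_path: str) -> None:
    user_audio, sr = sf.read(input_path, dtype="float32")
    chunk = int(CHUNK_SEC * sr)
    out = []
    for start in range(0, len(user_audio), chunk):
        user_chunk = user_audio[start:start + chunk]
        model_chunk = model.step(user_chunk, sr)
        # keep the output aligned with the input so output.wav stays time-synchronous
        out.append(model_chunk[:len(user_chunk)])
    sf.write(output_path, np.concatenate(out), sr)

if __name__ == "__main__":
    run_streaming_inference(StreamingModel(), "input.wav", "output.wav")
```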
We have observed the same issue and suspect it is due to recent internal changes in Gemini.
We are investigating and will share updates once a solution is found.
Under the `get_transcript` folder, you can find `asr.py`, which produces time-aligned transcriptions for the model-generated audio. For more details, please see the README in that folder.
Under the `evaluation` folder, please see the README file for detailed instructions on running the evaluation for each task.
If you have any questions, please feel free to submit an issue or contact Guan-Ting Lin ([email protected])
If you found this research helpful, please consider citing our work:
@article{lin2025full,
title={Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities},
author={Lin, Guan-Ting and Lian, Jiachen and Li, Tingle and Wang, Qirui and Anumanchipalli, Gopala and Liu, Alexander H and Lee, Hung-yi},
journal={arXiv preprint arXiv:2503.04721},
year={2025}
}
@article{lin2025fullduplexbench15,
title={Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models},
author={Lin, Guan-Ting and Kuan, Shih-Yun Shan and Wang, Qirui and Lian, Jiachen and Li, Tingle and Lee, Hung-yi},
journal={arXiv preprint arXiv:2507.23159},
year={2025}
}