Feature/spec decode draft model #24322
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces support for speculative decoding using a draft model. The changes are comprehensive, touching configuration, model loading, scheduling, and the core speculative decoding logic. New tests and benchmark modifications are also included to validate and measure the new feature. The overall implementation appears solid. However, I've identified a critical issue in a refactoring of the `bind_kv_cache` utility function, which removes an important safety check and could lead to incorrect behavior for certain model architectures.
@tomasruizt - Thank you for the PR!
What is the TP you are using for Qwen3-32B? By default, the draft model TP is equal to the target model TP. Since Qwen3-1.7B is a small model, running it at high TP might incur NCCL communication cost. Try setting the draft TP to 1.
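For reference, a minimal sketch of what a lower draft TP could look like on the CLI. This assumes the `draft_tensor_parallel_size` field of the speculative config; the model pair and TP values are just the ones discussed above, not a verified setup from this PR.

```bash
# Sketch only: the target model runs at TP=4 while the small draft model runs at TP=1
# (assumes draft_tensor_parallel_size is the relevant speculative-config field).
vllm serve Qwen/Qwen3-32B \
  --tensor-parallel-size 4 \
  --speculative-config '{"model": "Qwen/Qwen3-1.7B", "method": "draft_model", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1}'
```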
I ran the benchmarks with TP=1 and num_draft_tokens=3. So we can rule out TP communication issues.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 7de2ae1 to 2e0fb65
@benchislett fair. I'll factor out the duplicated code. In terms of the extra decode: I'll try to eliminate it and reach out for help if needed.
Signed-off-by: Tomas Ruiz <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Tomas Ruiz <[email protected]>
Force-pushed from e3b85dd to 86d8040
Signed-off-by: Tomas Ruiz <[email protected]>
As we discussed in our call, I moved the
Numbers 3 to 5 will become unnecessary if we implement the optimal prefill we discussed, which would reduce the forward passes by one and improve runtimes. Nevertheless, it's useful to remember that the major factor affecting DraftModel speed is CUDA graph usage, which is now conveniently a single flag. At the moment the EAGLE file diff looks horrible, I guess because of the combination of extracting a superclass and introducing flags to the constructor. Let me know if you would prefer to review the EAGLE refactor as a separate PR to main (or as multiple small PRs) 👍 Edit: I managed to get a nice git diff for the EAGLE file by minimizing changes.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Tomas Ruiz <[email protected]>
Signed-off-by: Tomas Ruiz <[email protected]>
Signed-off-by: Tomas Ruiz <[email protected]>
The EAGLE code is frequently changed on main, so it is difficult to move EAGLE code around without painful merge conflicts.
Signed-off-by: Benjamin Chislett <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
fix next_token_ids issue
Signed-off-by: Tomas Ruiz <[email protected]>
Signed-off-by: Tomas Ruiz <[email protected]>
Signed-off-by: Tomas Ruiz <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Tomas Ruiz <[email protected]>
Purpose
Enabling draft models for speculative decoding (SD).
E.g. `Qwen3-1.7B` as the draft model and `Qwen3-32B` as the target model. This type of SD requires no specially trained heads (like EAGLE or Medusa).
Example usage:
```bash
vllm serve \
  --model=Qwen/Qwen3-4B \
  --speculative-config '{"model": "Qwen/Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 3, "max_model_len": 2000}' \
  --max-model-len 2000
```
Get a generation:
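The exact request isn't reproduced here; below is a minimal sketch against the OpenAI-compatible endpoint started by the command above. The prompt and sampling values are placeholders.

```bash
# Query the server started above; prompt and parameters are illustrative.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-4B",
        "prompt": "The capital of France is",
        "max_tokens": 64,
        "temperature": 0
      }'
```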
Status
Acceptance Length
As suggested by @ekagra-ranjan, I benchmarked acceptance length (AL) with the command below:
The AL values within the Qwen3 family seem good, both with temperatures of 0.0 (greedy) and 1.0.
As a sanity check, I benchmarked Llama-3.2-1B as both target and draft, which had an almost perfect AL (3.97/4), suggesting it's working as intended.
I have not run the default model `meta-llama/Llama-3.1-8B-Instruct`, because I didn't find a good draft model for it, but feel free to suggest one and I can run the benchmarks.

Temperature t=0:
Temperature t=1.0:
Using t=1.0, the AL metric degrades. However, spec-decode with probabilities, which is needed for lossless rejection sampling, is not yet implemented. This is being addressed in #20459. After that PR is merged, the AL for non-greedy spec-decode should improve.
All scripts and logs used for the benchmarks can be found in this Google Drive.
Online Throughput Metrics
I measured online throughput metrics using the commands below. Hardware was an RTX PRO 6000 96GB. After making sure the draft model also uses CUDA graphs, SD has higher throughput than not using SD. See the tables below.
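The exact commands are in the linked Google Drive; as a rough sketch, a typical server/load-generator pairing could look like the following. The dataset, prompt counts, and lengths are placeholders, and the benchmark flags are assumptions based on the stock `benchmarks/benchmark_serving.py` script rather than the exact invocation used for the tables below.

```bash
# Server under test with speculative decoding enabled; values are illustrative.
vllm serve Qwen/Qwen3-32B \
  --speculative-config '{"model": "Qwen/Qwen3-1.7B", "method": "draft_model", "num_speculative_tokens": 3}'

# Load generator reporting throughput and TPOT; flags assume the stock serving benchmark.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model Qwen/Qwen3-32B \
  --dataset-name random \
  --num-prompts 100 \
  --max-concurrency 100 \
  --random-input-len 1024 \
  --random-output-len 256
```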
The metrics (lower is better) are:
Batch Size = 1
For Temperature = 0.0: Using SD, runtimes and TPOT are shorter by ~50%.
Batch Size = 100
For Temperature = 0.0:
For Temperature = 1.0:
This scenario with batch size 100 is a more realistic inference case.
Using SD, runtimes and TPOT are shorter.
Profiling
This section was removed, since using CUDA graphs on the draft model significantly improved its speed.
Profiling script
I used the command below to profile the generation process and identify that the draft model was previously running too slowly. Note: the command uses the `--profile` flag, which I introduce in this PR: #24575

Test Plan
The added unit test checks the correctness metrics. To run it:

```bash
cd tests/v1/e2e/
pytest test_spec_decode.py -k test_draft_model_correctness
```
EAGLE testing
I tested that the EAGLE implementation stays unaffected using the command below:
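The command itself is not preserved in this excerpt. For orientation only, a run of roughly this shape exercises the EAGLE path; the draft checkpoint and settings here are assumptions, not the exact setup behind the measurements below.

```bash
# Illustrative only: the EAGLE head and token count are assumptions.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "method": "eagle", "num_speculative_tokens": 3}'
```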
The results are in line with previous measurements like #17504 (comment)
Follow-up Optimizations
Feed `next_token_ids` together with `target_token_ids` in the first forward pass of the draft model. This reduces the number of forward passes needed in each drafting phase by one, speeding up drafting.

(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist

Update `supported_models.md` and `examples` for a new model.