Gonçalo Faria, Noah A. Smith
Paper: https://arxiv.org/abs/2504.03790
TL;DR: QAlign is a new test-time alignment approach that uses Markov chain Monte Carlo to sample better-aligned outputs from a language model, improving performance without any finetuning.
Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical solution to aligning language models at test time using additional computation without degradation, our approach expands the limits of the capability that can be obtained from off-the-shelf language models without further training.
Average error rate across multiple evaluation datasets (GSM8K, MATH500, MMLU-Redux, TruthfulQA, and IFEval) as a function of floating point operations (FLOPs), shown on a log scale. We compare QAlign applied to Tülu3-8B-SFT against four baselines: majority vote (MV) with Tülu3-8B-DPO, and best-of-n (BoN), MV, and weighted MV (WMV) applied to Tülu3-8B-SFT. All experiments use temperature 1.0 with reasoning included in model outputs. Note that the Tülu3-8B-DPO model is the result of preference finetuning Tülu3-8B-SFT on 271k preference pairs; the costs of that finetuning are not accounted for in this plot.
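To make the sampling view concrete, below is a minimal sketch of the idea behind such a chain: a Metropolis-Hastings sampler whose stationary distribution is the reward-tilted distribution π_β(y|x) ∝ p(y|x) · exp(r(x, y)/β). For simplicity it uses an independence proposal (a fresh completion from the base model at every step) rather than QAlign's actual suffix-resampling proposal, and `generate` and `reward` are hypothetical stand-ins, not part of this repository's API.

```python
# A minimal sketch (not the library's API): independence Metropolis-Hastings
# targeting the reward-tilted distribution pi(y|x) ∝ p_lm(y|x) * exp(r(x, y) / beta).
# `generate` and `reward` are hypothetical stand-ins for a base-model sampler and a
# reward model; QAlign's actual proposal resamples suffixes rather than whole
# completions (see the paper for the exact transition kernel).
import math
import random


def mcmc_align(generate, reward, prompt, steps=8, beta=1.0):
    """Return a chain of completions that approximately samples the aligned distribution."""
    current = generate(prompt)            # initial state: one sample from the base model
    current_r = reward(prompt, current)
    chain = [current]
    for _ in range(steps):
        proposal = generate(prompt)       # propose a fresh completion from the base model
        proposal_r = reward(prompt, proposal)
        # With proposal q(y') = p_lm(y'|x), the base-model likelihoods cancel in the
        # Metropolis-Hastings ratio, so acceptance depends only on the reward gap.
        if proposal_r >= current_r:
            accept = True
        else:
            accept = random.random() < math.exp((proposal_r - current_r) / beta)
        if accept:
            current, current_r = proposal, proposal_r
        chain.append(current)
    return chain
```

Because the base-model terms cancel, spending more steps only ever moves the chain toward higher-reward regions of the model's own distribution rather than searching outside it, which is the intuition behind the method's robustness to reward over-optimization.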
This project relies heavily on the following external libraries:

```bash
pip install quest-decoding
pip install expkit-core
pip install literegistry
```

Install the required packages:

```bash
pip install -r requirements.txt
```
Replicating the work:

- Create Configuration Files

```bash
# Create configs for general experiments
scripts/create_all_general_experiments.sh

# Create configs for task-specific experiments
scripts/create_all_task_experiments.sh
```

- Execute Experiments

```bash
# Run experiments locally
scripts/run_local_experiments.sh

# Run experiments on a remote server
scripts/run_remote_experiments.sh
```

- Evaluate Results

```bash
# Compare responses against ground-truth answers
scripts/run_eval_experiment.sh

# Evaluate the reward model on ancestral predictions (remote by default)
scripts/run_rm_eval.sh
```

- Generate Final Predictions

```bash
# Run the WMV, BoN, and MV final prediction methods
scripts/run_pred.sh
```
The following minimal example will help you get started running QAlign:

```python
from qalign.remote import RemoteVLLM
from qalign.reward import ConstantReward
from qalign.base import QAlign

# Point to a remote vLLM server hosting the base model.
model = RemoteVLLM(
    server_url="http://g3090.hyak.local:8080",
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    max_prompt_length=1000,
    max_new_tokens=1000,
)

# A constant reward scores every completion identically.
reward = ConstantReward(1.0)

# Build the QAlign Markov chain over the base model and reward.
chain = QAlign(
    model=model,
    reward=reward,
    beta=1.0,
)

# Format the prompt with the model's chat template.
t = model.tokenizer.apply_chat_template(
    [{"role": "user", "content": "What district is Guimarães in?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Run the chain for 8 steps on the prompt.
results = chain.run(
    prompts=[t],
    steps=8,
)
```
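Note that `ConstantReward(1.0)` assigns the same reward to every completion, so this example is essentially a sanity check in which the chain should behave like plain sampling from the base model; plugging in a learned reward model (and adjusting `beta` and `steps`) is what steers the chain toward better-aligned outputs. The `server_url` above refers to a specific host, so replace it with the address of an endpoint you are running yourself.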
For bugs and feature requests, please visit GitHub Issues. For business inquiries or professional support requests, please send an e-mail.
```bibtex
@misc{faria2025sampledontsearchrethinking,
  title={Sample, Don't Search: Rethinking Test-Time Alignment for Language Models},
  author={Gonçalo Faria and Noah A. Smith},
  year={2025},
  eprint={2504.03790},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.03790},
}
```