- Added results for an additional model size (7B) under more metrics (F1, Cover EM).
- Added a quick start for the Gradio demo and quick inference. Refer to Quick Start.
- The homepage is available [Here].
- The paper is available on [arXiv].
- Checkpoints are released on [🤗HuggingFace].
Official implementation of the paper "Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs".
AutoRefine is an RL post-training framework that adopts a new "search-and-refine-during-think" paradigm. It introduces:
- explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer;
- tailored retrieval-specific rewards alongside answer-correctness rewards to guide the model's search behavior (a conceptual sketch of the resulting rollout loop is shown below).
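To make the paradigm concrete, here is a minimal, illustrative sketch of the inference-time rollout loop. The tag names (`<search>`, `<documents>`, `<refine>`, `<answer>`) and the `generate`/`retrieve` helpers are assumptions for illustration only, not the exact implementation in this repository.

```python
import re

def generate(prompt: str) -> str:
    """Placeholder for an LLM call that stops after </search> or </answer>."""
    raise NotImplementedError

def retrieve(query: str) -> str:
    """Placeholder for a call to the local retrieval server (see retrieval_launch.sh)."""
    raise NotImplementedError

def rollout(question: str, max_turns: int = 4) -> str:
    trajectory = question
    for _ in range(max_turns):
        step = generate(trajectory)  # think, then either search or answer
        trajectory += step
        answer = re.search(r"<answer>(.*?)</answer>", step, re.DOTALL)
        if answer:  # terminal answer reached
            return answer.group(1).strip()
        query = re.search(r"<search>(.*?)</search>", step, re.DOTALL)
        if query:
            docs = retrieve(query.group(1).strip())
            # Retrieved evidence is appended to the context; the next generation
            # step is expected to open with an explicit <refine> block that
            # distills this evidence before searching again or answering.
            trajectory += f"<documents>{docs}</documents>"
    return ""
```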
Main Environment
The environment for training/testing AutoRefine can be built by running:
conda create -n autorefine python=3.9
conda activate autorefine
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip3 install vllm==0.5.4
# build verl
pip install -e .
# flash attention 2
pip install flash-attn==2.7.0.post2
pip install wandb
Retrieval Environment
This environment is for the local retrieval server.
conda create -n faiss_env python=3.10
conda activate faiss_env
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers datasets pyserini
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
pip install uvicorn fastapi
To quickly test the model, you can run the demo script:
- Start the retrieval server:
conda activate faiss_env
bash retrieval_launch.sh
Please refer to the Retrieval Corpus section below to prepare the retrieval corpus; the download should not take long with a good internet connection.
- Run the demo script:
conda activate autorefine
python demo.py
This will start a Gradio interface where you can input questions and see the model's responses.
If you prefer local inference without the Gradio interface, you can run the inference script directly:
conda activate autorefine
python infer.py
This will print the model's response to the console. You may modify the infer.py script to change the input question or adjust the model parameters.
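For example, the kind of edits you might make are sketched below; the variable names are hypothetical placeholders, and the actual names used in infer.py may differ.

```python
# Hypothetical example of edits to infer.py; the real variable names may differ.
question = "Who proposed the theory of general relativity?"  # input question
sampling_params = {
    "temperature": 0.7,       # decoding randomness
    "max_new_tokens": 1024,   # generation length cap
}
```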
Retrieval Corpus
Download the prebuilt index and the Wikipedia corpus, then assemble them:
save_path=./data
python preprocess/download.py --save_path $save_path
cat $save_path/part_* > $save_path/e5_Flat.index
gzip -d $save_path/wiki-18.jsonl.gz
We download the data for model training/evaluation from the FlashRAG Collection.
To download and build the dataset, run:
bash preprocess/scripts/data_process.sh
This will merge the training sets of NQ and HotpotQA as the training data, and merge the test/dev sets of nq, triviaqa, popqa, hotpotqa, 2wikimultihopqa, musique, and bamboogle as the test set.
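Conceptually, the merge amounts to something like the sketch below; the file paths are hypothetical placeholders rather than the script's actual layout.

```python
# Conceptual sketch of the merge performed by preprocess/scripts/data_process.sh.
# The jsonl paths below are hypothetical placeholders.
from datasets import load_dataset, concatenate_datasets

train = concatenate_datasets([
    load_dataset("json", data_files="data/nq/train.jsonl", split="train"),
    load_dataset("json", data_files="data/hotpotqa/train.jsonl", split="train"),
])
train.to_parquet("data/train.parquet")
```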
Before running the code for training/evaluation, you need to launch the retrieval server first:
conda activate faiss_env
bash retrieval_launch.sh
This will start a server listening on http://127.0.0.1:8000/retrieve.
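You can sanity-check the server with a small request. The payload fields below ("queries", "topk", "return_scores") follow a Search-R1-style retrieval API and are assumptions; adjust them if the server expects a different schema.

```python
import requests

# Query the local retrieval server started by retrieval_launch.sh.
resp = requests.post(
    "http://127.0.0.1:8000/retrieve",
    json={"queries": ["Who wrote The Origin of Species?"], "topk": 3, "return_scores": True},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```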
To reproduce the results in the paper (Table 1), run the following command for training:
conda activate autorefine
bash cmd/train.sh
The script above will train the model for 300 steps while saving the checkpoints with (1) the highest reward and (2) the highest evaluation accuracy.
If you want to log the results to wandb, you may set the wandb_token and WAND_PROJECT variables in the scripts to your wandb token and preferred project name.
For evaluation, run:
conda activate autorefine
bash cmd/eval.sh
This project is built upon the foundational work of VeRL and Search-R1. We sincerely thank the authors of these projects for their valuable contributions, which have significantly supported and inspired our work.
Thanks to Search-R1 for mentioning our work Here.
@article{AutoRefine,
title={Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs},
author={Shi, Yaorui and Li, Shihan and Wu, Chang and Liu, Zhiyuan and Fang, Junfeng and Cai, Hengxing and Zhang, An and Wang, Xiang},
journal={arXiv preprint arXiv:2505.11277},
year={2025}
}