This repository provides the code and environment to reproduce the experiments in our paper (preprint). It supports meta‑bandit LLM training with multi‑turn agents, asynchronous vLLM rollouts for parallel environments, and LoRA fine‑tuning, built on top of verl and other open-source tools.
For the key innovations in reward design (meta‑bandit feedback shaping, scoring, and evaluation) and optimization (multi‑turn PPO adaptations, credit assignment), please refer to our paper.
Twitter thread: https://x.com/sanxing_chen/status/1973078286898176345
- Async multi‑turn PPO training: Implements parallel rollouts via a vLLM async server for high‑throughput multi‑turn interactions (e.g., 50 turns).
- Bandit environments and rewards: Adds various bandit tasks and scoring utilities (e.g., UCB), with multi‑turn chat wrappers for agent rollouts; a minimal UCB sketch appears after this list.
- LoRA end‑to‑end: Trains with LoRA adapters and serves via vLLM with per‑request LoRA selection; both FSDP1 and FSDP2 are supported. See async_lora.md for details; a per‑request LoRA example is also sketched below.
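For illustration, here is a minimal sketch of the kind of bandit task and UCB baseline that reward scoring can compare an agent against. The names `BernoulliBandit` and `ucb1_action` are illustrative only, not the repository's actual APIs:

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli bandit: each arm pays 1 with a fixed hidden probability."""

    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm: int) -> int:
        return int(self.rng.random() < self.probs[arm])

def ucb1_action(counts, values, t, c=2.0):
    """UCB1: play each untried arm once, then pick argmax of mean + confidence bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    bonus = np.sqrt(c * np.log(t) / counts)
    return int(np.argmax(values + bonus))

# Roll out the UCB baseline for T turns; its cumulative reward can serve as a
# reference score when evaluating an LLM agent on the same bandit instance.
env = BernoulliBandit([0.2, 0.5, 0.8])
T, K = 50, 3
counts, values, total = np.zeros(K), np.zeros(K), 0
for t in range(1, T + 1):
    arm = ucb1_action(counts, values, t)
    r = env.pull(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]
    total += r
print(f"UCB baseline return over {T} turns: {total}")
```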
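Likewise, a minimal sketch of per‑request LoRA selection using vLLM's offline API (the async server path described above works analogously); the model name and adapter path here are placeholders, not values used by this repository:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora must be set when the engine is constructed; adapters are then
# attached per request, so different rollouts can use different adapters.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_lora=True)

outputs = llm.generate(
    ["You are playing a 3-armed bandit. Which arm do you pull first?"],
    SamplingParams(temperature=0.7, max_tokens=64),
    lora_request=LoRARequest("bandit_adapter", 1, "/path/to/lora_adapter"),  # placeholder adapter path
)
print(outputs[0].outputs[0].text)
```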
conda create -n bandit python==3.10
conda activate bandit
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
pip install gymnasium==1.2.0
- PPO with async multi‑turn rollout + LoRA: Use the Slurm templates and the shared runner described in scripts/slurm/README.md. Example:
sbatch scripts/slurm/rl_ucb.sbatch
sbatch scripts/slurm/rl_lora_ucb.sbatch
- Supervised finetuning (SFT): Use scripts/slurm/sft_template.sbatch and scripts/slurm/sft_runner.sh.
If you find this work useful, please cite the preprint:
@article{chen2025greedy,
  title   = {When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training},
  author  = {Sanxing Chen and Xiaoyin Chen and Yukun Huang and Roy Xie and Bhuwan Dhingra},
  year    = {2025},
  journal = {arXiv preprint arXiv:2509.24923}
}
- We thank the verl authors and community for providing modern LLM RL training infrastructure.
- Our experiments are mainly based on the Qwen 2.5 series of models.