sanxing-chen/meta-bandit-llm

Meta-Bandit LLM

When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training

This repository provides the code and environment to reproduce the experiments in our paper (preprint). It supports meta‑bandit LLM training with multi‑turn agents, asynchronous vLLM rollouts for parallel environments, and LoRA fine‑tuning. It is built on top of verl and other open-source tools.

For the key innovations in reward design (meta‑bandit feedback shaping, scoring, and evaluation) and optimization (multi‑turn PPO adaptations, credit assignment), please refer to our paper.

Twitter thread: https://x.com/sanxing_chen/status/1973078286898176345

System components

  • Async multi‑turn PPO training: Implements parallel rollouts through a vLLM async server for high‑throughput multi‑turn interactions (e.g., 50 turns).
  • Bandit environments and rewards: Adds various bandit tasks and scoring utilities (e.g., UCB), with multi‑turn chat wrappers for agent rollouts.
  • LoRA end‑to‑end: Trains with LoRA adapters and serves via vLLM with per‑request LoRA selection; both FSDP1 and FSDP2 are supported. See async_lora.md for details.
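For readers unfamiliar with the UCB scoring mentioned above, the following is a minimal, self-contained sketch of the textbook UCB1 rule on a Bernoulli bandit. It is an illustration only, not the repository's implementation: the function names `ucb1_scores` and `run_ucb1`, the exploration constant `c`, and the arm probabilities are all hypothetical, and the 50-step horizon merely mirrors the 50-turn example.

```python
import math
import random


def ucb1_scores(counts, reward_sums, t, c=2.0):
    """Return the UCB1 index of each arm at step t (hypothetical helper).

    Arms that have never been pulled get +inf so they are tried first.
    """
    scores = []
    for n, s in zip(counts, reward_sums):
        if n == 0:
            scores.append(math.inf)
        else:
            # Empirical mean plus an exploration bonus that shrinks with n.
            scores.append(s / n + math.sqrt(c * math.log(t) / n))
    return scores


def run_ucb1(arm_probs, horizon=50, seed=0):
    """Play a Bernoulli bandit for `horizon` steps with UCB1 (illustrative)."""
    rng = random.Random(seed)
    k = len(arm_probs)
    counts = [0] * k
    reward_sums = [0.0] * k
    history = []
    for t in range(1, horizon + 1):
        scores = ucb1_scores(counts, reward_sums, t)
        # Pick the arm with the highest index; ties go to the lowest arm id.
        arm = max(range(k), key=scores.__getitem__)
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        reward_sums[arm] += reward
        history.append((arm, reward))
    return counts, history


if __name__ == "__main__":
    counts, history = run_ucb1([0.2, 0.5, 0.8], horizon=50)
    print("pulls per arm:", counts)
```

In a multi‑turn setting, each `(arm, reward)` pair in `history` would correspond to one agent turn and one environment reply. A reference policy of this kind is one way to score or compare an agent's choices, under the assumptions stated above.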

Install

conda create -n bandit python==3.10
conda activate bandit
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
pip install gymnasium==1.2.0

Quick start

Citation

If you find this work useful, please cite the preprint:

@article{chen2025greedy,
  title   = {When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training},
  author  = {Sanxing Chen and Xiaoyin Chen and Yukun Huang and Roy Xie and Bhuwan Dhingra},
  year    = {2025},
  journal = {arXiv preprint arXiv:2509.24923}
}

Acknowledgments

  • We thank the verl authors and community for their modern LLM RL training infrastructure.
  • Our experiments are mainly based on the Qwen 2.5 series of models.
