This repository provides the code and environment to reproduce the experiments in our paper (preprint). It supports meta‑bandit LLM training with multi‑turn agents, asynchronous vLLM rollouts for parallel environments, and LoRA fine‑tuning, built on top of verl and other open-source tools.
For the key innovations in reward design (meta‑bandit feedback shaping, scoring, and evaluation) and optimization (multi‑turn PPO adaptations, credit assignment), please refer to our paper.
Twitter thread: https://x.com/sanxing_chen/status/1973078286898176345
- Async multi‑turn PPO training: Implements parallel rollouts via a vLLM async server for high‑throughput multi‑turn interactions (e.g., 50 turns).
- Bandit environments and rewards: Adds various bandit tasks and scoring utilities (e.g., UCB), with multi‑turn chat wrappers for agent rollouts; a minimal UCB sketch appears after this list.
- LoRA end‑to‑end: Trains with LoRA adapters and serves via vLLM with per‑request LoRA selection; both FSDP1 and FSDP2 are supported. See async_lora.md for details; a per‑request LoRA example is also sketched below.
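For illustration, here is a minimal sketch of the kind of bandit task and UCB baseline that reward scoring can compare an agent against. The names `BernoulliBandit` and `ucb1_action` are illustrative only, not the repository's actual APIs:

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli bandit: each arm pays 1 with a fixed hidden probability."""

    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm: int) -> int:
        return int(self.rng.random() < self.probs[arm])

def ucb1_action(counts, values, t, c=2.0):
    """UCB1: play each untried arm once, then pick argmax of mean + confidence bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    bonus = np.sqrt(c * np.log(t) / counts)
    return int(np.argmax(values + bonus))

# Roll out the UCB baseline for T turns; its cumulative reward can serve as a
# reference score when evaluating an LLM agent on the same bandit instance.
env = BernoulliBandit([0.2, 0.5, 0.8])
T, K = 50, 3
counts, values, total = np.zeros(K), np.zeros(K), 0
for t in range(1, T + 1):
    arm = ucb1_action(counts, values, t)
    r = env.pull(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]
    total += r
print(f"UCB baseline return over {T} turns: {total}")
```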
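Likewise, a minimal sketch of per‑request LoRA selection using vLLM's offline API (the async server path described above works analogously); the model name and adapter path here are placeholders, not values used by this repository:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora must be set when the engine is constructed; adapters are then
# attached per request, so different rollouts can use different adapters.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_lora=True)

outputs = llm.generate(
    ["You are playing a 3-armed bandit. Which arm do you pull first?"],
    SamplingParams(temperature=0.7, max_tokens=64),
    lora_request=LoRARequest("bandit_adapter", 1, "/path/to/lora_adapter"),  # placeholder adapter path
)
print(outputs[0].outputs[0].text)
```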
conda create -n bandit python==3.10
conda activate bandit
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
pip install gymnasium==1.2.0
- PPO with async multi‑turn rollout + LoRA: Use the Slurm templates and the shared runner described in scripts/slurm/README.md. Example:
sbatch scripts/slurm/rl_ucb.sbatch
sbatch scripts/slurm/rl_lora_ucb.sbatch
- Supervised finetuning (SFT): Use scripts/slurm/sft_template.sbatch and scripts/slurm/sft_runner.sh.
If you find this work useful, please cite the preprint:
@article{chen2025greedy,
  title   = {When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training},
  author  = {Sanxing Chen and Xiaoyin Chen and Yukun Huang and Roy Xie and Bhuwan Dhingra},
  year    = {2025},
  journal = {arXiv preprint arXiv:2509.24923}
}
- We thank the verl authors and community for providing modern LLM RL training infrastructure.
- Our experiments are mainly based on the Qwen 2.5 series of models.