Welcome to the comprehensive documentation for OneShotGRPO! This guide will help you train, monitor, and deploy small language models using GRPO (Group Relative Policy Optimization).
- Educational GRPO Notebook: Start here! A comprehensive, step-by-step notebook that teaches you GRPO from scratch.
- CLAUDE.md: Quick reference guide for the codebase structure and core concepts.
- README2.md: Original project documentation with additional context.
| Guide | Description | Level |
|---|---|---|
| Prime Intellect Integration | Scale training across multiple GPUs/nodes with Prime RL | Advanced |
| Google Cloud Storage | Persistent checkpoint storage and management | Intermediate |
| Weights & Biases Visualization | Advanced monitoring with 3D charts | Intermediate |
| Gradio Deployment | Deploy chat interfaces to HF Spaces | Beginner |
I want to...
- Learn GRPO from scratch → Start with Educational Notebook
- Scale to multiple GPUs → See Prime Intellect Guide
- Save checkpoints reliably → Read GCS Guide
- Monitor training deeply → Check W&B Guide
- Deploy a chat demo → Follow Gradio Guide
- Understand the code → Review CLAUDE.md
Beginner:
- Educational Notebook (Start here!)
- Gradio Deployment (Deploy your model)
- README2.md (Learn about the project)
Intermediate:
- Google Cloud Storage (Better checkpointing)
- Weights & Biases (Advanced monitoring)
- CLAUDE.md (Code deep dive)
Advanced:
- Prime Intellect (Distributed training)
- Source code in `src/oneshot_grpo/` (custom environments and reward functions)
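To give a flavor of what lives there, below is a minimal sketch of a rule-based reward function in the style TRL's `GRPOTrainer` accepts (one float per completion). The `<answer>` tag convention is an assumption for illustration, not necessarily the repo's exact format.

```python
# Minimal sketch of a rule-based reward function in the style TRL's
# GRPOTrainer expects: it receives a batch of completions and returns
# one float score per completion. The <answer> tag format is an
# illustrative assumption, not necessarily what this repo uses.
import re

def format_reward(completions, **kwargs):
    """Reward completions that wrap their final answer in <answer> tags."""
    pattern = re.compile(r"<answer>.*?</answer>", re.DOTALL)
    rewards = []
    for completion in completions:
        # With conversational datasets each completion is a list of
        # messages; with plain-text datasets it is a string.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        rewards.append(1.0 if pattern.search(text) else 0.0)
    return rewards
```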
File: PRIME_INTELLECT.md
Learn how to use Prime Intellect's distributed RL framework:
- Installation and setup
- Using pre-built environments (AQuA-RAT)
- Creating custom environments
- Multi-GPU training configuration
- Fault-tolerant training at scale
Best for: Teams needing to scale training beyond a single GPU, or those wanting access to pre-built RL environments.
File: GOOGLE_CLOUD_STORAGE.md
Everything about checkpoint persistence:
- GCS setup and authentication
- Automatic checkpoint uploading
- Resuming from saved checkpoints
- Cost optimization strategies
- Lifecycle management
Best for: Anyone training for >2 hours or needing guaranteed checkpoint persistence.
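As a taste of what the guide automates, here is a minimal sketch of checkpoint upload and resume using the official `google-cloud-storage` client; the bucket name and paths are placeholders.

```python
# Minimal sketch of checkpoint upload/download with the official
# google-cloud-storage client. Bucket name and paths are placeholders;
# the guide covers authentication and automation around this.
from google.cloud import storage

def upload_checkpoint(local_path: str, bucket_name: str, remote_path: str) -> None:
    client = storage.Client()  # uses Application Default Credentials
    bucket = client.bucket(bucket_name)
    bucket.blob(remote_path).upload_from_filename(local_path)

def download_checkpoint(bucket_name: str, remote_path: str, local_path: str) -> None:
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(remote_path).download_to_filename(local_path)

# e.g. upload_checkpoint("outputs/checkpoint-500/model.safetensors",
#                        "my-grpo-checkpoints", "runs/demo/checkpoint-500/model.safetensors")
```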
File: WANDB_VISUALIZATION.md
Advanced experiment tracking and visualization:
- Real-time metric logging
- 3D reward landscape plots
- Policy evolution visualization
- Custom dashboards
- Hyperparameter sweeps
Best for: Researchers wanting deep insights into training dynamics or comparing multiple runs.
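For context, real-time metric logging boils down to a loop like the following sketch; the project name and metric keys are illustrative, not the repo's exact choices.

```python
# Minimal sketch of real-time metric logging with wandb. The project
# name and metric keys are illustrative placeholders.
import random
import wandb

run = wandb.init(project="oneshot-grpo", config={"learning_rate": 1e-5})
for step in range(100):
    mean_reward = random.random()  # stand-in for the real per-step reward
    run.log({"train/mean_reward": mean_reward}, step=step)
run.finish()
```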
File: GRADIO_DEPLOYMENT.md
Build and deploy chat interfaces:
- Quick chat interface creation
- HuggingFace Hub integration
- Deploying to HF Spaces
- Production considerations
- Custom themes and features
Best for: Anyone wanting to demo their trained model with a user-friendly interface.
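At its simplest, the chat interface the guide builds looks roughly like this sketch; the model id is a placeholder, and the guide adds themes, examples, and Spaces deployment on top.

```python
# Minimal chat demo sketch with gradio's ChatInterface.
# The model id below is a placeholder for your trained checkpoint.
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="your-username/your-grpo-model")

def respond(message, history):
    output = generator(message, max_new_tokens=256, return_full_text=False)
    return output[0]["generated_text"]

gr.ChatInterface(fn=respond, title="OneShotGRPO Math Demo").launch()
```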
Perfect for getting your first GRPO model trained and deployed:
1. Setup (30 min)
   - Open Educational Notebook in Colab
   - Get a GPU runtime (A100 recommended)
   - Install dependencies
2. Training (1-2 hours)
   - Follow notebook sections 1-7
   - Train on 1,000 GSM8K examples
   - Monitor with basic W&B
3. Deployment (30 min)
   - Push model to HuggingFace Hub (see the sketch below)
   - Create Gradio interface (Section 11)
   - Test with sample questions

Result: A working math reasoning model with chat interface!
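The Hub push in step 3 is a one-liner per artifact, roughly as sketched below; the checkpoint path and repo id are placeholders, and you should authenticate first with `huggingface-cli login`.

```python
# Minimal sketch of pushing a trained model to the HuggingFace Hub.
# Checkpoint path and repo id are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("outputs/checkpoint-final")
tokenizer = AutoTokenizer.from_pretrained("outputs/checkpoint-final")

model.push_to_hub("your-username/oneshot-grpo-math")
tokenizer.push_to_hub("your-username/oneshot-grpo-math")
```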
For serious projects requiring robust infrastructure:
1. Day 1 Morning: Core Training
   - Complete Educational Notebook (full dataset)
   - Set up Google Cloud Storage
   - Configure automatic checkpoint backups
2. Day 1 Afternoon: Monitoring
   - Implement W&B visualization
   - Set up custom dashboards
   - Create 3D reward landscapes
3. Day 2 Morning: Scaling (Optional)
   - Set up Prime Intellect
   - Configure multi-GPU training
   - Test fault tolerance
4. Day 2 Afternoon: Deployment
   - Create production Gradio app
   - Deploy to HF Spaces with GPU
   - Set up monitoring and rate limiting

Result: Production-ready GRPO training pipeline!
For researchers extending GRPO or exploring RL:
1. Week 1: Understanding
   - Study the GRPO paper and implementation
   - Read CLAUDE.md thoroughly
   - Examine source code in `src/`
   - Run experiments with different rewards
2. Week 2: Experimentation
   - Implement custom reward functions
   - Try different datasets
   - Use W&B sweeps (see the sketch below)
   - Compare with baselines
3. Week 3: Scaling
   - Set up Prime Intellect
   - Create custom environments
   - Scale to larger models
   - Optimize hyperparameters
4. Week 4: Publication
   - Write model cards
   - Create visualizations
   - Deploy demo apps
   - Share results
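The W&B sweeps mentioned in Week 2 follow the standard `wandb.sweep` / `wandb.agent` pattern; this sketch uses illustrative parameter names and a stub `train()` entry point.

```python
# Minimal sketch of a W&B hyperparameter sweep. The swept parameters
# and the train() entry point are illustrative assumptions.
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "train/mean_reward", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-6, "max": 1e-4},
        "num_generations": {"values": [4, 8, 16]},
    },
}

def train():
    run = wandb.init()
    config = run.config  # sampled hyperparameters for this trial
    # ... build the GRPO trainer with config.learning_rate, etc. ...
    run.log({"train/mean_reward": 0.0})  # placeholder metric

sweep_id = wandb.sweep(sweep_config, project="oneshot-grpo")
wandb.agent(sweep_id, function=train, count=10)
```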
Each guide has a dedicated troubleshooting section:
Issue: CUDA out of memory
- Solution: Reduce batch size, use gradient accumulation, or enable 8-bit quantization
Issue: Training too slow
- Solution: Enable vLLM, use bf16 precision, or scale to multiple GPUs with Prime Intellect
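Several of these fixes are single flags on TRL's `GRPOConfig`, roughly as sketched below; exact option availability depends on your TRL version, and the values shown are illustrative.

```python
# Sketch of memory/speed knobs on TRL's GRPOConfig; exact availability
# of options depends on your TRL version.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs",
    per_device_train_batch_size=2,   # shrink batch size to dodge CUDA OOM
    gradient_accumulation_steps=8,   # recover the effective batch size
    bf16=True,                       # bf16 precision for speed and memory
    use_vllm=True,                   # serve generations through vLLM
)
# For 8-bit weights, load the base model with
# transformers.BitsAndBytesConfig(load_in_8bit=True).
```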
Issue: Checkpoints lost after disconnect
- Solution: Set up Google Cloud Storage integration
Issue: Can't monitor training well
- Solution: Enable Weights & Biases with custom dashboards
- GitHub Issues: Report bugs or request features
- Discussions: Ask questions or share ideas
- Email: Contact maintainers
We welcome contributions! Here's how:
- Documentation: Found a typo or want to clarify something? Edit the docs!
- Code: Improved a reward function? Created a new environment? Submit a PR!
- Examples: Trained a cool model? Share your notebook!
- Guides: Found a better way to do something? Write a guide!
See CONTRIBUTING.md for details.
If you use OneShotGRPO in your research, please cite:
```bibtex
@misc{oneshotgrpo,
  title={OneShotGRPO: Educational Framework for GRPO Training},
  author={Your Name},
  year={2025},
  publisher={GitHub},
  howpublished={\url{https://github.com/HarleyCoops/OneShotGRPO}}
}
```

This project is licensed under the terms described in LICENSE.
Base model (Qwen) and other components have their own licenses. See individual files for details.
- Base Model: Qwen Team
- Dataset: OpenAI (GSM8K)
- Frameworks: HuggingFace (TRL, Transformers), vLLM, Gradio, W&B
- Inspiration: Will Brown's GRPO demo
- Community: All contributors and users!
**Happy Training!**
Start with the Educational Notebook and build amazing math reasoning models!