Welcome to the comprehensive documentation for OneShotGRPO! This guide will help you train, monitor, and deploy small language models using GRPO (Group Relative Policy Optimization).
- Educational GRPO Notebook: Start here! A comprehensive, step-by-step notebook that teaches you GRPO from scratch.
- CLAUDE.md: Quick reference guide for the codebase structure and core concepts.
- README2.md: Original project documentation with additional context.
| Guide | Description | Level |
|---|---|---|
| Prime Intellect Integration | Scale training across multiple GPUs/nodes with Prime RL | Advanced |
| Google Cloud Storage | Persistent checkpoint storage and management | Intermediate |
| Weights & Biases Visualization | Advanced monitoring with 3D charts | Intermediate |
| Gradio Deployment | Deploy chat interfaces to HF Spaces | Beginner |
I want to...
- Learn GRPO from scratch → Start with Educational Notebook
- Scale to multiple GPUs → See Prime Intellect Guide
- Save checkpoints reliably → Read GCS Guide
- Monitor training deeply → Check W&B Guide
- Deploy a chat demo → Follow Gradio Guide
- Understand the code → Review CLAUDE.md
Beginner:
- Educational Notebook (Start here!)
- Gradio Deployment (Deploy your model)
- README2.md (Learn about the project)
Intermediate:
- Google Cloud Storage (Better checkpointing)
- Weights & Biases (Advanced monitoring)
- CLAUDE.md (Code deep dive)
Advanced:
- Prime Intellect (Distributed training)
- Source code in `src/oneshot_grpo/` (custom environments and reward functions)
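To give a flavor of what lives there, below is a minimal sketch of a rule-based reward function in the style TRL's `GRPOTrainer` accepts (one float per completion). The `<answer>` tag convention is an assumption for illustration, not necessarily the repo's exact format.

```python
# Minimal sketch of a rule-based reward function in the style TRL's
# GRPOTrainer expects: it receives a batch of completions and returns
# one float score per completion. The <answer> tag format is an
# illustrative assumption, not necessarily what this repo uses.
import re

def format_reward(completions, **kwargs):
    """Reward completions that wrap their final answer in <answer> tags."""
    pattern = re.compile(r"<answer>.*?</answer>", re.DOTALL)
    rewards = []
    for completion in completions:
        # With conversational datasets each completion is a list of
        # messages; with plain-text datasets it is a string.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        rewards.append(1.0 if pattern.search(text) else 0.0)
    return rewards
```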
File: PRIME_INTELLECT.md
Learn how to use Prime Intellect's distributed RL framework:
- Installation and setup
- Using pre-built environments (AQuA-RAT)
- Creating custom environments
- Multi-GPU training configuration
- Fault-tolerant training at scale
Best for: Teams needing to scale training beyond a single GPU, or those wanting access to pre-built RL environments.
File: GOOGLE_CLOUD_STORAGE.md
Everything about checkpoint persistence:
- GCS setup and authentication
- Automatic checkpoint uploading
- Resuming from saved checkpoints
- Cost optimization strategies
- Lifecycle management
Best for: Anyone training for >2 hours or needing guaranteed checkpoint persistence.
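As a taste of what the guide automates, here is a minimal sketch of checkpoint upload and resume using the official `google-cloud-storage` client; the bucket name and paths are placeholders.

```python
# Minimal sketch of checkpoint upload/download with the official
# google-cloud-storage client. Bucket name and paths are placeholders;
# the guide covers authentication and automation around this.
from google.cloud import storage

def upload_checkpoint(local_path: str, bucket_name: str, remote_path: str) -> None:
    client = storage.Client()  # uses Application Default Credentials
    bucket = client.bucket(bucket_name)
    bucket.blob(remote_path).upload_from_filename(local_path)

def download_checkpoint(bucket_name: str, remote_path: str, local_path: str) -> None:
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(remote_path).download_to_filename(local_path)

# e.g. upload_checkpoint("outputs/checkpoint-500/model.safetensors",
#                        "my-grpo-checkpoints", "runs/demo/checkpoint-500/model.safetensors")
```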
File: WANDB_VISUALIZATION.md
Advanced experiment tracking and visualization:
- Real-time metric logging
- 3D reward landscape plots
- Policy evolution visualization
- Custom dashboards
- Hyperparameter sweeps
Best for: Researchers wanting deep insights into training dynamics or comparing multiple runs.
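For context, real-time metric logging boils down to a loop like the following sketch; the project name and metric keys are illustrative, not the repo's exact choices.

```python
# Minimal sketch of real-time metric logging with wandb. The project
# name and metric keys are illustrative placeholders.
import random
import wandb

run = wandb.init(project="oneshot-grpo", config={"learning_rate": 1e-5})
for step in range(100):
    mean_reward = random.random()  # stand-in for the real per-step reward
    run.log({"train/mean_reward": mean_reward}, step=step)
run.finish()
```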
File: GRADIO_DEPLOYMENT.md
Build and deploy chat interfaces:
- Quick chat interface creation
- HuggingFace Hub integration
- Deploying to HF Spaces
- Production considerations
- Custom themes and features
Best for: Anyone wanting to demo their trained model with a user-friendly interface.
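At its simplest, the chat interface the guide builds looks roughly like this sketch; the model id is a placeholder, and the guide adds themes, examples, and Spaces deployment on top.

```python
# Minimal chat demo sketch with gradio's ChatInterface.
# The model id below is a placeholder for your trained checkpoint.
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="your-username/your-grpo-model")

def respond(message, history):
    output = generator(message, max_new_tokens=256, return_full_text=False)
    return output[0]["generated_text"]

gr.ChatInterface(fn=respond, title="OneShotGRPO Math Demo").launch()
```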
Perfect for getting your first GRPO model trained and deployed:
1. Setup (30 min)
   - Open Educational Notebook in Colab
   - Get a GPU runtime (A100 recommended)
   - Install dependencies
2. Training (1-2 hours)
   - Follow notebook sections 1-7
   - Train on 1,000 GSM8K examples
   - Monitor with basic W&B
3. Deployment (30 min)
   - Push model to HuggingFace Hub (see the sketch below)
   - Create Gradio interface (Section 11)
   - Test with sample questions

Result: A working math reasoning model with chat interface!
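The Hub push in step 3 is a one-liner per artifact, roughly as sketched below; the checkpoint path and repo id are placeholders, and you should authenticate first with `huggingface-cli login`.

```python
# Minimal sketch of pushing a trained model to the HuggingFace Hub.
# Checkpoint path and repo id are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("outputs/checkpoint-final")
tokenizer = AutoTokenizer.from_pretrained("outputs/checkpoint-final")

model.push_to_hub("your-username/oneshot-grpo-math")
tokenizer.push_to_hub("your-username/oneshot-grpo-math")
```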
For serious projects requiring robust infrastructure:
1. Day 1 Morning: Core Training
   - Complete Educational Notebook (full dataset)
   - Set up Google Cloud Storage
   - Configure automatic checkpoint backups
2. Day 1 Afternoon: Monitoring
   - Implement W&B visualization
   - Set up custom dashboards
   - Create 3D reward landscapes
3. Day 2 Morning: Scaling (Optional)
   - Set up Prime Intellect
   - Configure multi-GPU training
   - Test fault tolerance
4. Day 2 Afternoon: Deployment
   - Create production Gradio app
   - Deploy to HF Spaces with GPU
   - Set up monitoring and rate limiting

Result: Production-ready GRPO training pipeline!
For researchers extending GRPO or exploring RL:
1. Week 1: Understanding
   - Study the GRPO paper and implementation
   - Read CLAUDE.md thoroughly
   - Examine source code in `src/`
   - Run experiments with different rewards
2. Week 2: Experimentation
   - Implement custom reward functions
   - Try different datasets
   - Use W&B sweeps (see the sketch below)
   - Compare with baselines
3. Week 3: Scaling
   - Set up Prime Intellect
   - Create custom environments
   - Scale to larger models
   - Optimize hyperparameters
4. Week 4: Publication
   - Write model cards
   - Create visualizations
   - Deploy demo apps
   - Share results
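The W&B sweeps mentioned in Week 2 follow the standard `wandb.sweep` / `wandb.agent` pattern; this sketch uses illustrative parameter names and a stub `train()` entry point.

```python
# Minimal sketch of a W&B hyperparameter sweep. The swept parameters
# and the train() entry point are illustrative assumptions.
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "train/mean_reward", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-6, "max": 1e-4},
        "num_generations": {"values": [4, 8, 16]},
    },
}

def train():
    run = wandb.init()
    config = run.config  # sampled hyperparameters for this trial
    # ... build the GRPO trainer with config.learning_rate, etc. ...
    run.log({"train/mean_reward": 0.0})  # placeholder metric

sweep_id = wandb.sweep(sweep_config, project="oneshot-grpo")
wandb.agent(sweep_id, function=train, count=10)
```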
Each guide has a dedicated troubleshooting section:
Issue: CUDA out of memory
- Solution: Reduce batch size, use gradient accumulation, or enable 8-bit quantization
Issue: Training too slow
- Solution: Enable vLLM, use bf16 precision, or scale to multiple GPUs with Prime Intellect
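Several of these fixes are single flags on TRL's `GRPOConfig`, roughly as sketched below; exact option availability depends on your TRL version, and the values shown are illustrative.

```python
# Sketch of memory/speed knobs on TRL's GRPOConfig; exact availability
# of options depends on your TRL version.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs",
    per_device_train_batch_size=2,   # shrink batch size to dodge CUDA OOM
    gradient_accumulation_steps=8,   # recover the effective batch size
    bf16=True,                       # bf16 precision for speed and memory
    use_vllm=True,                   # serve generations through vLLM
)
# For 8-bit weights, load the base model with
# transformers.BitsAndBytesConfig(load_in_8bit=True).
```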
Issue: Checkpoints lost after disconnect
- Solution: Set up Google Cloud Storage integration
Issue: Can't monitor training well
- Solution: Enable Weights & Biases with custom dashboards
- GitHub Issues: Report bugs or request features
- Discussions: Ask questions or share ideas
- Email: Contact maintainers
We welcome contributions! Here's how:
- Documentation: Found a typo or want to clarify something? Edit the docs!
- Code: Improved a reward function? Created a new environment? Submit a PR!
- Examples: Trained a cool model? Share your notebook!
- Guides: Found a better way to do something? Write a guide!
See CONTRIBUTING.md for details.
If you use OneShotGRPO in your research, please cite:
```bibtex
@misc{oneshotgrpo,
  title={OneShotGRPO: Educational Framework for GRPO Training},
  author={Your Name},
  year={2025},
  publisher={GitHub},
  howpublished={\url{https://github.com/HarleyCoops/OneShotGRPO}}
}
```

This project is licensed under the terms described in LICENSE.
Base model (Qwen) and other components have their own licenses. See individual files for details.
- Base Model: Qwen Team
- Dataset: OpenAI (GSM8K)
- Frameworks: HuggingFace (TRL, Transformers), vLLM, Gradio, W&B
- Inspiration: Will Brown's GRPO demo
- Community: All contributors and users!
**Happy Training!**
Start with the Educational Notebook and build amazing math reasoning models!