
OneShotGRPO Documentation

One click away from a locally downloaded, fine-tuned model, hosted on Hugging Face, with inference built in. In two hours.

Welcome to the comprehensive documentation for OneShotGRPO! This guide will help you train, monitor, and deploy small language models using GRPO (Group Relative Policy Optimization).

Documentation Index

Getting Started

  • Educational GRPO Notebook: Start here! A comprehensive, step-by-step notebook that teaches you GRPO from scratch.
  • CLAUDE.md: Quick reference guide for the codebase structure and core concepts.
  • README2.md: Original project documentation with additional context.

Integration Guides

| Guide | Description | Level |
|-------|-------------|-------|
| Prime Intellect Integration | Scale training across multiple GPUs/nodes with Prime RL | Advanced |
| Google Cloud Storage | Persistent checkpoint storage and management | Intermediate |
| Weights & Biases Visualization | Advanced monitoring with 3D charts | Intermediate |
| Gradio Deployment | Deploy chat interfaces to HF Spaces | Beginner |

Quick Navigation


By Experience Level

Beginner:

  1. Educational Notebook (Start here!)
  2. Gradio Deployment (Deploy your model)
  3. README2.md (Learn about the project)

Intermediate:

  1. Google Cloud Storage (Better checkpointing)
  2. Weights & Biases (Advanced monitoring)
  3. CLAUDE.md (Code deep dive)

Advanced:

  1. Prime Intellect (Distributed training)
  2. Source code in src/oneshot_grpo/
  3. Custom environments and reward functions (a reward-function sketch follows this list)
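
Reward shaping is where most GRPO experimentation happens. Below is a minimal sketch of a custom correctness reward in the style accepted by TRL's GRPOTrainer; the `answer` column name and the number-matching heuristic are assumptions for illustration, not the repo's actual reward:

```python
import re

def correctness_reward(completions, answer, **kwargs):
    """Return one score per completion: 1.0 if the last number in the
    completion matches the reference answer, else 0.0. `answer` is
    assumed to be a dataset column forwarded by the trainer."""
    rewards = []
    for completion, ref in zip(completions, answer):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(ref) else 0.0)
    return rewards
```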

Documentation Overview

1. Prime Intellect Integration

File: PRIME_INTELLECT.md

Learn how to use Prime Intellect's distributed RL framework:

  • Installation and setup
  • Using pre-built environments (AQuA-RAT)
  • Creating custom environments
  • Multi-GPU training configuration
  • Fault-tolerant training at scale

Best for: Teams needing to scale training beyond a single GPU, or those wanting access to pre-built RL environments.
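
To get a feel for the data behind the pre-built AQuA-RAT environment, you can inspect it directly with HF `datasets`. The Hub ID `deepmind/aqua_rat` is an assumption; adjust if your environment pins a different mirror:

```python
from datasets import load_dataset

# Peek at the AQuA-RAT data that backs the pre-built environment.
ds = load_dataset("deepmind/aqua_rat", split="train")
example = ds[0]
print(example["question"])  # algebra word problem
print(example["options"])   # multiple-choice options A-E
print(example["correct"])   # gold option letter
```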

2. Google Cloud Storage Integration

File: GOOGLE_CLOUD_STORAGE.md

Everything about checkpoint persistence:

  • GCS setup and authentication
  • Automatic checkpoint uploading
  • Resuming from saved checkpoints
  • Cost optimization strategies
  • Lifecycle management

Best for: Anyone training for >2 hours or needing guaranteed checkpoint persistence.
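
A checkpoint upload can be as small as the sketch below, using the official `google-cloud-storage` client. The bucket, prefix, and helper name are hypothetical; see the guide for the repo's actual utilities:

```python
from pathlib import Path
from google.cloud import storage  # pip install google-cloud-storage

def upload_checkpoint(local_dir: str, bucket_name: str, prefix: str) -> None:
    """Mirror a local checkpoint directory into gs://<bucket>/<prefix>/."""
    client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS
    bucket = client.bucket(bucket_name)
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            blob = bucket.blob(f"{prefix}/{path.relative_to(local_dir)}")
            blob.upload_from_filename(str(path))

# Hypothetical names for illustration:
# upload_checkpoint("outputs/checkpoint-500", "my-grpo-bucket", "runs/exp1")
```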

3. Weights & Biases Visualization

File: WANDB_VISUALIZATION.md

Advanced experiment tracking and visualization:

  • Real-time metric logging
  • 3D reward landscape plots
  • Policy evolution visualization
  • Custom dashboards
  • Hyperparameter sweeps

Best for: Researchers wanting deep insights into training dynamics or comparing multiple runs.
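
The core logging loop is simple; everything else (3D plots, custom dashboards) builds on `wandb.log`. A minimal sketch with hypothetical project and metric names:

```python
import wandb

run = wandb.init(project="oneshot-grpo", config={"lr": 1e-5, "group_size": 8})

for step in range(100):
    mean_reward = 0.01 * step  # placeholder; log your real GRPO metrics here
    wandb.log({"train/mean_reward": mean_reward, "train/kl": 0.02}, step=step)

run.finish()
```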

4. Gradio Deployment

File: GRADIO_DEPLOYMENT.md

Build and deploy chat interfaces:

  • Quick chat interface creation
  • HuggingFace Hub integration
  • Deploying to HF Spaces
  • Production considerations
  • Custom themes and features

Best for: Anyone wanting to demo their trained model with a user-friendly interface.
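
A chat interface over a pushed model fits in a few lines. The sketch below uses a hypothetical Hub repo ID; the guide covers themes, Spaces deployment, and production hardening:

```python
import gradio as gr
from transformers import pipeline

# Hypothetical repo ID; replace with your pushed model.
generator = pipeline("text-generation", model="your-username/oneshot-grpo-qwen")

def respond(message, history):
    out = generator(message, max_new_tokens=256, return_full_text=False)
    return out[0]["generated_text"]

gr.ChatInterface(respond, title="OneShotGRPO Math Tutor").launch()
```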


Learning Paths

Path 1: Quick Start (2-4 hours)

Perfect for getting your first GRPO model trained and deployed (a minimal training sketch follows the steps below):

  1. Setup (30 min)

  2. Training (1-2 hours)

    • Follow notebook sections 1-7
    • Train on 1,000 GSM8K examples
    • Monitor with basic W&B
  3. Deployment (30 min)

    • Push model to HuggingFace Hub
    • Create Gradio interface (Section 11)
    • Test with sample questions
  4. Result: A working math-reasoning model with a chat interface!
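
For orientation, here is a minimal training sketch in the spirit of the TRL quickstart. The model ID, placeholder reward, and hyperparameters are assumptions, not the notebook's exact settings:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = (
    load_dataset("openai/gsm8k", "main", split="train")
    .select(range(1000))                  # the 1,000-example quick start
    .rename_column("question", "prompt")  # GRPOTrainer expects a "prompt" column
)

def brevity_reward(completions, **kwargs):
    # Placeholder reward; swap in a correctness reward for real runs.
    return [-abs(200 - len(c)) / 200 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=brevity_reward,
    args=GRPOConfig(output_dir="checkpoints/grpo-gsm8k", report_to="wandb"),
    train_dataset=dataset,
)
trainer.train()
```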

Path 2: Production Setup (1-2 days)

For serious projects requiring robust infrastructure:

  1. Day 1 Morning: Core Training

    • Complete Educational Notebook (full dataset)
    • Set up Google Cloud Storage
    • Configure automatic checkpoint backups
  2. Day 1 Afternoon: Monitoring

  3. Day 2 Morning: Scaling (Optional)

  4. Day 2 Afternoon: Deployment

    • Create production Gradio app
    • Deploy to HF Spaces with GPU
    • Set up monitoring and rate limiting
  5. Result: A production-ready GRPO training pipeline!

Path 3: Research Deep Dive (Ongoing)

For researchers extending GRPO or exploring RL:

  1. Week 1: Understanding

    • Study GRPO paper and implementation
    • Read CLAUDE.md thoroughly
    • Examine source code in src/
    • Run experiments with different rewards
  2. Week 2: Experimentation

    • Implement custom reward functions
    • Try different datasets
    • Use W&B sweeps
    • Compare with baselines
  3. Week 3: Scaling

    • Set up Prime Intellect
    • Create custom environments
    • Scale to larger models
    • Optimize hyperparameters
  4. Week 4: Publication

    • Write model cards
    • Create visualizations
    • Deploy demo apps
    • Share results


Getting Help

Troubleshooting

Each integration guide includes a dedicated troubleshooting section. The most common issues are summarized below, followed by a memory-friendly config sketch.

Common Issues

Issue: CUDA out of memory

  • Solution: Reduce batch size, use gradient accumulation, or enable 8-bit quantization

Issue: Training too slow

  • Solution: Enable vLLM, use bf16 precision, or scale to multiple GPUs with Prime Intellect

Issue: Checkpoints lost after disconnect

  • Solution: Set up Google Cloud Storage integration

Issue: Can't monitor training well

  • Solution: Enable Weights & Biases with custom dashboards
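
For the memory and speed issues above, most fixes are config flags. Here is a hedged sketch of memory-friendly settings; the field names follow TRL's GRPOConfig and HF TrainingArguments, and the values are illustrative:

```python
from trl import GRPOConfig

args = GRPOConfig(
    output_dir="checkpoints/grpo",
    per_device_train_batch_size=2,   # smaller batches to fit memory
    gradient_accumulation_steps=8,   # preserve the effective batch size
    bf16=True,                       # cheaper precision on Ampere+ GPUs
    gradient_checkpointing=True,     # trade compute for activation memory
    use_vllm=True,                   # faster generation during rollouts
)
```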

Support Channels

  1. GitHub Issues: Report bugs or request features
  2. Discussions: Ask questions or share ideas
  3. Email: Contact maintainers

Contributing

We welcome contributions! Here's how:

  1. Documentation: Found a typo or want to clarify something? Edit the docs!
  2. Code: Improved a reward function? Created a new environment? Submit a PR!
  3. Examples: Trained a cool model? Share your notebook!
  4. Guides: Found a better way to do something? Write a guide!

See CONTRIBUTING.md for details.


Citation

If you use OneShotGRPO in your research, please cite:

@misc{oneshotgrpo,
  title={OneShotGRPO: Educational Framework for GRPO Training},
  author={Your Name},
  year={2025},
  publisher={GitHub},
  howpublished={\url{https://github.com/HarleyCoops/OneShotGRPO}}
}

License

This project is licensed under the terms described in the LICENSE file.

Base model (Qwen) and other components have their own licenses. See individual files for details.


Acknowledgments

  • Base Model: Qwen Team
  • Dataset: OpenAI (GSM8K)
  • Frameworks: HuggingFace (TRL, Transformers), vLLM, Gradio, W&B
  • Inspiration: Will Brown's GRPO demo
  • Community: All contributors and users!

**Happy Training!**

Start with the Educational Notebook and build amazing math reasoning models!
