Code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement"
Updated Jul 24, 2025 - Python
RLHF and Verifiable Reward Models - Post-training Research