
0.5.0

@Myhs-phz Myhs-phz released this 01 Sep 06:20
· 7 commits to main since this release
a88f268

OpenCompass v0.5.0 Release Notes

🌟 Highlights

Comprehensive Scientific Benchmarks: Integrated 10+ specialized datasets (MedXpertQA, ClimaQA, SmolInstruct, etc.), covering multiple scientific fields such as chemistry, physics, biology, and earth sciences.
Cascade Evaluator: Added support for cascading evaluation methods, from rule-based checks to LLM judgments.
New Runner: Support for the Rjob Runner is now complete.
OpenAISDK Streaming: Added streaming to OpenAISDK, providing a more stable way to call the OpenAI API.
New Evaluation Examples: Published the real-time evaluation config for the CompassAcademic Leaderboard and the evaluation config for the Intern-S1 related benchmarks.
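
To illustrate the idea behind cascading evaluation, here is a minimal sketch in plain Python. It is not the OpenCompass API: `rule_check`, `cascade_evaluate`, and `fake_judge` are hypothetical names, and the judge is a stand-in for a real LLM call. The sketch assumes the cascade runs a cheap rule-based check first and escalates to the judge only when the rule does not match.

```python
from typing import Callable, Optional


def rule_check(prediction: str, reference: str) -> bool:
    """Cheap first stage (hypothetical): case-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()


def cascade_evaluate(
    prediction: str,
    reference: str,
    llm_judge: Optional[Callable[[str, str], bool]] = None,
) -> dict:
    """Run the rule stage first; call the judge only on a rule miss."""
    if rule_check(prediction, reference):
        return {"correct": True, "stage": "rule"}
    if llm_judge is None:
        # No judge configured: the rule verdict is final.
        return {"correct": False, "stage": "rule"}
    return {"correct": bool(llm_judge(prediction, reference)), "stage": "llm_judge"}


def fake_judge(prediction: str, reference: str) -> bool:
    """Stand-in for an LLM judge: accepts if the reference appears in the prediction."""
    return reference.strip().lower() in prediction.lower()


results = [
    cascade_evaluate("Paris", "paris", fake_judge),
    cascade_evaluate("The answer is Paris.", "Paris", fake_judge),
    cascade_evaluate("London", "Paris", fake_judge),
]
for r in results:
    print(r)
```

In this sketch only the second and third samples reach the judge, which is the usual motivation for a cascade: the expensive stage is spent only on cases the rules cannot settle.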


🚀 New Features

🔧 Cascade Evaluator (#1992)
🔧 Rjob Runner (#2144)
🔧 OpenAISDK Streaming (#2208)
🔧 Evaluation Example for CompassAcademic Leaderboard (#2202)
🔧 Evaluation Example for Intern-S1 and Scientific Benchmarks (#2220)
🔧 So Many New Scientific Datasets!

  1. MedXpertQA for expert-level medical knowledge evaluation (#2002)
  2. ClimaQA for climate question evaluation (#2017)
  3. HealthBench for better measuring capabilities of AI systems for health (#2099)
  4. ProteinLMBench for protein related tasks (#2064)
    ...

📖 Documentation

📝 Fixed 404 links between Chinese/English docs (#2001)
📝 Added CompassAcademic Leaderboard task tutorial (#2202)
📝 Added Intern-S1 evaluation task tutorial (#2220)
📝 Fixed format problems of the dataset statistics page (#2170)
📝 Aligned the NIAH CLI command guide with the actual CLI argument parser (#2194)
📝 Set correct paths for the examples (#2198)


🐛 Bug Fixes

🔧 Fixed comparison error in base_evaluator (#2010)
🔧 Fixed OpenICL Math Evaluator Config (#2007)
🔧 Added Error Case for content filter (#2167)
🔧 Fixed the OpenAI SDK to adapt to gpt-5 (#2236)
🔧 Fixed dataset repetition by concatenating (#2039)
🔧 Concatenated OpenAISDK reasoning content (#2041)


⚙ Enhancements and Refactors

Infrastructure Refactors:

  • Set dump-eval-details as default behavior (#1999)
  • Refactored the OpenICL eval task (#1990)
  • Added openai_extra_kwargs for API customization (#2210)

CI/CD Improvements:

  • Fixed baseline score (#2000)
  • Updated baseline for kernel change of vllm and lmdeploy (#2011)
  • Updated baseline and fixed lmdeploy version (#2098)
  • Added check rule (#2101)
  • Updated testcases' baseline (#2184)
    ...

🎉 Welcome New Contributors

A warm welcome to our newest contributors:


Full Changelog: 0.4.2...0.5.0

Thank you for using OpenCompass! These updates empower deeper insights and more reliable evaluations. Keep exploring, and stay tuned for future innovations! 🌟