Picking the cheapest/best open-weight models for the job #21
TomLucidor
started this conversation in
General
Thank you very much for the recommendation. I have been paying attention to the Qwen3-Next 80B A3B model recently, and it may become a strong core model for agent-driven work in the future.
Task work
On BFCL (v4 is the latest): https://gorilla.cs.berkeley.edu/leaderboard.html
On LiveBench (they have not updated their tasks for a while): https://livebench.ai/#/
On other benchmarks
Social thinking
On Longform Writing (in case we want Obsidian to output something new) https://eqbench.com/creative_writing_longform.html
The coherence likely stems from the generalization ability of DeepSeek (and GLM 4.5) https://github.com/lechmazur/generalization
On JudgeMark https://eqbench.com/judgemark-v2.html
On EQ-Bench-3 https://eqbench.com/
Not surprising, since Kimi K2 is also good at customer service (Tau2-Bench; not sure whether this is rigged or not) https://artificialanalysis.ai/evaluations/tau2-bench
Idea
I am thinking about how mixed-model operation (one model for thinking/planning, one for writing/function calling, one for fast proofreading) could do better than a single model handling every step.
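To make the mixed-model idea concrete, here is a minimal sketch in Python of a three-stage pipeline. Everything in it is an assumption for illustration: the `Pipeline` class, the prompt wording, and the `stub` models are hypothetical, and a real setup would wrap actual API clients instead of stubs.

```python
from dataclasses import dataclass
from typing import Callable, List

# A "model" here is just a function from prompt text to output text.
# Real code would wrap an API client; this type is a stand-in.
ModelFn = Callable[[str], str]


@dataclass
class Pipeline:
    planner: ModelFn      # slower model used for thinking/planning
    writer: ModelFn       # model used for writing / function calling
    proofreader: ModelFn  # small, fast model used for a final pass

    def run(self, task: str) -> str:
        # Each stage only sees the output of the stage before it.
        plan = self.planner(f"Plan the steps for: {task}")
        draft = self.writer(f"Follow this plan:\n{plan}\n\nTask: {task}")
        return self.proofreader(f"Fix errors without changing meaning:\n{draft}")


# Hypothetical stub models so the sketch runs without network access.
calls: List[str] = []


def stub(name: str) -> ModelFn:
    def model(prompt: str) -> str:
        calls.append(name)  # record which stage ran, in order
        return f"[{name}] output"
    return model


pipeline = Pipeline(
    planner=stub("planner"),
    writer=stub("writer"),
    proofreader=stub("proofreader"),
)
result = pipeline.run("summarize my notes")
```

Keeping each stage behind the same `ModelFn` signature means any stage can be swapped for a cheaper or stronger model without touching the others, which is the point of comparing the benchmarks per task type.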
But why obsess over code as the example task? Qwen3-Coder (and to a lesser extent Kimi K2 + GLM-4.5) are known to have issues doing function calls (AKA "tool calls"). Once a model is fine-tuned to use a certain format, it breaks compatibility with other formats/tools. All-Hands-AI/OpenHands#10112 Kilo-Org/kilocode#2107
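One way to picture the format problem: a client that only understands one tool-call format drops calls emitted in another. Below is a rough sketch of a normalizer that accepts two formats and maps both to the same dict. Both formats are illustrative assumptions, not the exact template of any model named above, and the function names are hypothetical.

```python
import json
import re
from typing import Any, Dict, Optional

ToolCall = Dict[str, Any]


def parse_json_call(text: str) -> Optional[ToolCall]:
    """Parse a JSON-style call: {"name": ..., "arguments": {...}}."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and "name" in obj:
        return {"name": obj["name"], "arguments": obj.get("arguments", {})}
    return None


# Illustrative tag-based format (an assumption, not a real model template):
# <function=NAME><parameter=KEY>VALUE</parameter></function>
_FUNC = re.compile(r"<function=(\w+)>(.*?)</function>", re.S)
_PARAM = re.compile(r"<parameter=(\w+)>(.*?)</parameter>", re.S)


def parse_tag_call(text: str) -> Optional[ToolCall]:
    """Parse the tag-style call; all argument values come back as strings."""
    match = _FUNC.search(text)
    if match is None:
        return None
    args = {key: value.strip() for key, value in _PARAM.findall(match.group(2))}
    return {"name": match.group(1), "arguments": args}


def normalize(text: str) -> Optional[ToolCall]:
    """Try each known parser in turn; return None if none of them match."""
    for parser in (parse_json_call, parse_tag_call):
        call = parser(text.strip())
        if call is not None:
            return call
    return None
```

For example, `normalize('{"name": "read_file", "arguments": {"path": "a.md"}}')` and `normalize('<function=read_file><parameter=path>a.md</parameter></function>')` both give the same dict. Note that the tag parser cannot recover argument types, so a real adapter would still need the tool's schema to cast values.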
Addendum
GPT-OSS is definitely cheaper than R1 when it comes to self-optimization tasks (ALE-Bench), hyper-constrained tasks (IFBench!?), and maybe competitive programming (LiveCodeBench-Pro, but tough questions only), but never for anything like agentic coding or any sort of writing task. It is 10x cheaper than DeepSeek R1 as a "creativity guru". https://sakanaai.github.io/ALE-Bench-Leaderboard/ https://artificialanalysis.ai/evaluations/ifbench