Picking the cheapest/best open-weight models for the job #21

TomLucidor · 2025-09-15T04:11:31Z


Task work

On BFCL (v4 is the latest): https://gorilla.cs.berkeley.edu/leaderboard.html

GLM 4.5 is the best at agentic and multi-turn tasks, on par with Claude 4 and the GPT o-series
SLMs like Phi-4 (and some Qwen3 models) are good at avoiding hallucinations, on par with the Gemini 2.5 and Claude 4 series

On LiveBench (they have not updated their tasks for a while) https://livebench.ai/#/

For reasoning (common-sense work and basic thinking), Qwen3 and DeepSeek are on par with the mid-range Gemini 2.5, Claude 4, and GPT o-series models => Qwen3 is better for reasoning since it is closer to human thinking than to work-related thinking
For agentic coding (might apply to other agentic or multi-turn tasks), DeepSeek (as well as Qwen3 Coder and GLM 4.5) at least attempted to reach the mid-range variants of the GPT o-series and Claude 4 => agentic coding is a hard area for open-weight models to catch up in; DeepSeek is slightly better at coding than at reasoning
For data analysis (in case we need systematic thinking) and "IF" (instruction following is very important), Qwen3 (as well as DeepSeek R1 for data and v3.1 for IF) managed to reach GPT-5 and most high-end proprietary models (!?) => either Qwen3 benchmark-hacked the results, or most models underestimate the difficulty of human desires and their relation to data work
Note: DeepSeek SOTA refers to both the R1 updates and v3.1; mid-range variants refers to models like GPT-5 mini, o3 medium, Claude 4 Sonnet, and occasionally Gemini 2.5 Pro

On other benchmarks

DeepSeek R1 is on par with mid-range models for algorithm writing https://livecodebench.github.io/leaderboard.html
Qwen3 is on par with mid-range models for "competitive programming" https://livecodebenchpro.com/
R1 is good at code editing but DeepSeek v3.1 is good at tool use (low model coverage) https://aider.chat/docs/leaderboards/
SWE-Bench-bash is similar to LiveBench agentic coding but for debugging (low model coverage) https://www.swebench.com/
SWE-Rebench prefers GLM-4.5 (and also Qwen3-Coder), which are slightly behind mid-tier models https://swe-rebench.com/

Social thinking

On Longform Writing (in case we want Obsidian to output something new) https://eqbench.com/creative_writing_longform.html

DeepSeek v3.1 (and to a lesser extent DeepSeek R1 and GLM 4.5) are more likely to be on par with high-end proprietary models when it comes to general quality of writing, based on coherence and depth of writing
Kimi K2 is on-par with mid-range proprietary models when it comes to not biasing towards "AI vocabulary"
In terms of long-output degradation, GLM 4.5 and DeepSeek are on par with the high-end GPT and Gemini 2.5 Pro models, while Kimi K2 is closer to the mid-range models
Note: Qwen3 is not even close for "AI vocabulary", it is strictly trained for technical work rather than creative writing

The coherence likely stems from the generalization ability of DeepSeek (and GLM 4.5) https://github.com/lechmazur/generalization

On JudgeMark https://eqbench.com/judgemark-v2.html

Qwen3 (also Kimi) has strong alignment with humanity, on par with most mid-range proprietary models
Kimi K2, and no other open model (not even GLM 4.5), is on par with the mid-range models when it comes to quality control

On EQ-Bench-3 https://eqbench.com/

Kimi K2 beats most high-end models, while GLM 4.5 is on par with most models
The strongest differentiators for Kimi K2 are "insight" (this includes GLM 4.5), empathy, social cognition, and assertiveness
Qwen3 is the exception among open-weight models when it comes to "warmth", where it is on par with proprietary models; everything else, like empathy and social cognition, is dominated by Kimi K2
SLMs like Qwen3 and Mistral Small are more likely to be "compliant" (follows orders)
Note: The major drawback from Kimi K2 is "safety" which is a euphemism for censorship, and "pragmatism" which is a euphemism for non-compliance to the spirit of the work in favor of the letter of the work

Not even surprised since Kimi K2 is also good at customer service (Tau2-Bench, not sure if this is rigged or not) https://artificialanalysis.ai/evaluations/tau2-bench

Idea

I am thinking about how mixed-model operations (thinking/planning, writing/FC, fast proofreading) could do better than a single model

Planning: Qwen3-235B-A22B for deep thinking and compliance to human preferences
Engineering: GLM-4.5 for insights and continuous agentic workloads with FC-compliance
Subagents: Qwen3-30B-A3B for simple function calling and instruction following
Troubleshooting: DeepSeek for both code editing and quality technical writing
Documentation: Kimi K2 as the prose editor for writing social blogs and FAQs
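
A minimal sketch of the role split above as a routing table. The model strings are just the names from the list, not provider-specific IDs, and `Route`, `ROUTES`, `pick_model`, and the fallback to the engineering role are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # display name from the list above, not a real API model ID
    reason: str  # why this role is assigned to this model


# Role -> model mapping, copied from the list above.
ROUTES = {
    "planning": Route("Qwen3-235B-A22B", "deep thinking, compliance to human preferences"),
    "engineering": Route("GLM-4.5", "insights, continuous agentic workloads, FC-compliance"),
    "subagents": Route("Qwen3-30B-A3B", "simple function calling, instruction following"),
    "troubleshooting": Route("DeepSeek", "code editing, quality technical writing"),
    "documentation": Route("Kimi K2", "prose editing for social blogs and FAQs"),
}


def pick_model(role: str, default: str = "engineering") -> Route:
    """Return the route for a role; unknown roles fall back to `default` (an assumption)."""
    return ROUTES.get(role.strip().lower(), ROUTES[default])


if __name__ == "__main__":
    for role in ("planning", "documentation", "something-else"):
        route = pick_model(role)
        print(f"{role}: {route.model} ({route.reason})")
```

Keeping the table as plain data means a model can be swapped when a leaderboard changes without touching the calling code.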

But why obsess over code as the example task? Qwen3-Coder (and to a lesser extent Kimi K2 and GLM-4.5) are known to have issues with function calls (AKA "tool calls"): once a model is fine-tuned to use a certain format, it breaks compatibility with other formats/tools. All-Hands-AI/OpenHands#10112 Kilo-Org/kilocode#2107
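
To illustrate the format problem, here is a rough sketch of a shim that accepts two tool-call shapes and normalizes them to one. Both shapes are made up for this example (a JSON object and an XML-like tag) and are not the exact formats any of the models above emit; `normalize_tool_call` is a hypothetical helper.

```python
import json
import re

# Hypothetical XML-style format, invented for this sketch:
#   <call name="read_file"><arg name="path">notes.md</arg></call>
_XML_CALL = re.compile(r'<call name="([^"]+)">(.*?)</call>', re.DOTALL)
_XML_ARG = re.compile(r'<arg name="([^"]+)">(.*?)</arg>', re.DOTALL)


def normalize_tool_call(raw: str) -> dict:
    """Convert either illustrative format into {"name": str, "arguments": dict}."""
    text = raw.strip()
    if text.startswith("{"):
        # JSON-style call: {"name": "...", "arguments": {...}}
        data = json.loads(text)
        return {"name": data["name"], "arguments": data.get("arguments", {})}
    match = _XML_CALL.search(text)
    if match is None:
        raise ValueError("unrecognized tool-call format")
    name, body = match.groups()
    # XML-style arguments always arrive as strings in this sketch.
    return {"name": name, "arguments": dict(_XML_ARG.findall(body))}


if __name__ == "__main__":
    print(normalize_tool_call('{"name": "read_file", "arguments": {"path": "notes.md"}}'))
    print(normalize_tool_call('<call name="read_file"><arg name="path">notes.md</arg></call>'))
```

A client that only understands one of these shapes would drop or mis-parse the other, which is the kind of breakage the linked issues describe.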

Addendum

GPT-OSS is definitely cheaper than R1 when it comes to self-optimization tasks (ALE-Bench), hyper-constrained tasks (IFBench!?), and maybe competitive programming (LiveCodeBench-Pro, but tough questions only), but never for anything like agentic coding or any sort of writing task. It is 10x cheaper than DeepSeek R1 as a "creativity guru". https://sakanaai.github.io/ALE-Bench-Leaderboard/ https://artificialanalysis.ai/evaluations/ifbench

Lapis0x0 · 2025-09-16T14:32:43Z

Maintainer

Thank you very much for the recommendation. I have been paying attention to the Qwen3-Next 80B A3B model recently, and it may become a great core model for agent-driven work in the future.

1 reply

TomLucidor Sep 17, 2025
Author

Something tells me that Qwen3-Next is just a pilot test for mixed attention and ultra-sparse techniques, and considering that their track record is a bit iffy with Qwen3-Coder, that would need to be looked into as well.
