AI Models
Tiered ranking of large language models optimized for agentic workflows. Updated continuously.
Complex reasoning · Strategy · Planning · External dev only
| Model | Cost /1M tokens (in / out) | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Claude Opus 4.7 (Anthropic · Apr 2026) | $5 / $25 (flat pricing) | 1M (128k out) | SWE-Verified 87.6% · SWE-Pro 64.3% · MCP-Atlas 77.3% | Best publicly available model as of Apr 2026. 3× vision resolution. 87.6% SWE-Verified leads all non-preview models. Same price as 4.6. |
| GLM-5.1 (Z.AI · Apr 2026) | $1.40 / $4.40 | 200K (131K max output) | SWE-Bench Pro 58.4% · SWE-Verified 77.8% · GPQA Diamond 83.9% | #1 SWE-Pro globally at launch (58.4%). Open weights on HuggingFace. About 1/3 the cost of Opus. Self-reported benchmarks; independent verification pending. |
| Kimi K2.6 (Moonshot AI · Apr 2026) | $0.95 / $4 (via Kimi API) | 256K | SWE-Verified 80.2% · SWE-Pro 58.6% · Terminal-Bench 2.0 66.7% | Open-weight at near-frontier level. 300 sub-agents in parallel. Competitive with Opus 4.6 at a fraction of the cost. Native video input. |
| Claude Opus 4.6 (Anthropic · Feb 2026) | $5 / $25 (flat pricing) | 1M (128k out) | SWE-Verified 80.8% · Terminal-Bench 2.0 65.4% · ARC-AGI-2 68.9% | #1 agentic terminal coding. Handles its own errors in a loop over long sessions. API only; no subscription tier for agents. Superseded by 4.7. |
| GPT-5.4 (OpenAI · Mar 2026) | $2.50 / $15 (2+ above $72k/day) | 1.05M | SWE-Pro 57.7% · OSWorld 75.0% · GPQA Diamond 92.8% | Multi-hour autonomous execution with real planning. Only worth the premium when task complexity genuinely needs frontier judgment. |
| GPT-5.5 (OpenAI · Apr 2026) | $5.00 / $30 (Pro variant significantly higher) | 1M | SWE-Pro 58.6% · Terminal-Bench 2.0 82.7% · GDPval 84.9% | OpenAI's latest agentic model (codenamed Spud). 82.7% Terminal-Bench 2.0 is best-in-class for CLI/shell workflows. Strong on autonomous multi-step tasks. Pro variant adds test-time compute for deeper reasoning. |
| DeepSeek V4 Pro (DeepSeek · Apr 2026) | $1.74 / $3.48 (cached input $0.145/1M; self-host = free) | 1M | SWE-Verified ~80.6% · MMLU-Pro ~73.5 · HumanEval ~76.8% | Top-tier open-weights model released Apr 24 2026. 1.6T MoE with 49B active params. Matches frontier performance on SWE-Verified at open-weight pricing. 1M context window with hybrid attention for efficiency. Independent verification pending. |
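
Taken together, the tiers above imply a routing policy: send only genuinely hard work to Tier 1 and default everything else downward. A minimal sketch in Python, assuming illustrative model IDs and a hand-rolled complexity score (neither comes from any provider's API):

```python
# Minimal tier router: pick the cheapest adequate model for a task.
# Model IDs below are illustrative placeholders, not official API identifiers.

TIERS = {
    1: "claude-opus-4.7",    # complex reasoning, strategy, planning (assumed ID)
    2: "gemini-3.1-pro",     # tool calls, long task chains (assumed ID)
    3: "claude-sonnet-4.6",  # day-to-day content and code (assumed ID)
}

def pick_model(complexity: int) -> str:
    """Map a 1-10 complexity score onto the cheapest adequate tier."""
    if complexity >= 8:
        return TIERS[1]  # frontier judgment genuinely needed
    if complexity >= 5:
        return TIERS[2]  # multi-step pipelines, agent loops
    return TIERS[3]      # routine work: don't pay frontier prices

print(pick_model(9))  # -> claude-opus-4.7
```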
Tool calls · Long task chains · Multi-step pipelines
| Model | Cost /1M tokens (in / out) | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Gemini 3.1 Pro (Google · Feb 2026) | $2 / $12 | 1M (84k out) | ARC-AGI-2 77.1% · GPQA Diamond 94.3% · SWE-Verified 80.6% | 2.5× cheaper than Opus on input. Leads most vision benchmarks. No separate pipeline for media. |
| MiniMax M2.7 (MiniMax · Mar 2026) | $0.30 / $1.20 ($19/mo = 1500 calls/5h) | 205K (131k out) | SWE-Pro 56.2% · Terminal-Bench 2.0 57.0% · Vibe-Pro 55.6% | Best price-to-agent-capability in the stack. 97% skill adherence, critical for OpenClaw's skill ecosystems. The $19 plan is absurdly good value. |
| Kimi K2.5 (Moonshot · Feb 2026) | $0.60 / $3.00 | 256K | HLE w/ tools 50.2% · BrowseComp 79.4% · SWE-Verified 76.8% | Best long-context stability for extended tasks. ~6× more output tokens than peers; budget carefully. |
| DeepSeek V3.2 (DeepSeek · Dec 2025) | $0.27 / $0.41 | 164K | SWE-Verified 70.0% · Aider polyglot 74.2% | 90% of GPT-5.4 performance at a fraction of the cost (roughly 1/9 on input, 1/37 on output). Best price-performance in this tier via OpenRouter. |
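
Several Tier 2 entries are reachable through OpenRouter's OpenAI-compatible endpoint, which the DeepSeek row calls out. A minimal sketch using the `openai` SDK; the model slug `deepseek/deepseek-v3.2` is an assumption, so check openrouter.ai/models for the real identifier:

```python
# Calling a Tier 2 model through OpenRouter's OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-v3.2",  # assumed slug; verify before use
    messages=[{"role": "user", "content": "Summarize this diff: ..."}],
)
print(resp.choices[0].message.content)
```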
Content · Code · Research · Day-to-day tasks
| Model | Cost /1M tokens (in / out) | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Claude Sonnet 4.6 (Anthropic · Feb 2026) | $3 / $15 (API only) | 1M (64k out) | SWE-Verified 79.9% · Computer use 94.0% · AI Index 52/100 | 98% of Opus coding at 3/5 the cost. API only; no $10/mo plan exists for this model. |
| GPT-5.4 mini (OpenAI · various) | $0.75 / $4.50 (OAuth via subscription) | 400K | SWE-Pro 54.4% · Tool-call reliability 93.4% · OSWorld 72.1% | Smart enough to run an entire system. 93.4% tool-call reliability. ChatGPT OAuth means no API billing needed. |
| Qwen3.6 Plus (Alibaba · Apr 2026) | $0 / $0 (free during preview) | 1M | SWE-Verified 78.8% | Best free model available. 1M context. Near-frontier coding. Free until the preview window closes. |
| Llama 4 Maverick (Meta · 2026) | $0.19–$0.49 ($0 if self-hosted) | 1M | MMLU 85.5% · SWE-Verified ~68% | Only serious open-weight option at this level. Self-hosting means zero ongoing cost. Best for data sovereignty needs. |
| Mistral Small 4 (Mistral · Mar 2026) | $0.15 / $0.60 | 256K | AA Intelligence Index 27/100 · AA LCR score 0.72 · MATH-500 ~93.6% | Replaces three separate models with one Apache 2.0 weights file. 162 tokens/s output. Best simplicity-to-capability ratio in Tier 3. |
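
Because Tier 3 is where day-to-day volume lives, the per-million prices above translate directly into monthly spend. A worked example in Python; the token volumes are hypothetical, the prices come straight from the table:

```python
# Back-of-envelope monthly cost from the per-million-token prices above.

def monthly_cost(in_tokens_m: float, out_tokens_m: float,
                 price_in: float, price_out: float) -> float:
    """Cost in USD for a month of usage; token counts in millions."""
    return in_tokens_m * price_in + out_tokens_m * price_out

# Hypothetical agent burning 50M input / 5M output tokens per month:
print(monthly_cost(50, 5, 3.00, 15.00))  # Claude Sonnet 4.6 -> 225.0
print(monthly_cost(50, 5, 0.75, 4.50))   # GPT-5.4 mini     -> 60.0
print(monthly_cost(50, 5, 0.15, 0.60))   # Mistral Small 4  -> 10.5
```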
Local inference · Zero API cost · Full privacy
Maximum quality local inference for the 64 GB RAM class. Frontier-grade reasoning without leaving your machine.
| Model | Params | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Qwen3.6-27B (Alibaba) | 27B dense | 256K | SWE-Verified 80.0% · GPQA Diamond 89.0% · MMLU-Pro 88.5% | Next-gen Qwen with 256K context. Frontier-level coding at local speeds. The dense 27B punches far above its weight class. |
| Qwen3.6-35B-A3B (Alibaba) | 35B (MoE, 3B active) | 256K | SWE-Verified 82.0% · GPQA Diamond 88.5% · MMLU-Pro 87.0% | MoE efficiency with a 256K context. Barely uses more VRAM than the 27B but hits harder on coding benchmarks. Best value in the 64 GB tier. |
| Gemma 4 31B (Google) | 31B dense | 256K | MMLU-Pro 85.2% · GPQA Diamond 84.3% · AIME 2026 89.2% | Top-ranked open model on Arena AI. Google's dense flagship with genuine frontier reasoning. Best raw quality for local agent deployment. |
| Gemma 4 26B (Google) | 26B (MoE, 3.8B active) | 256K | MMLU-Pro 78.5% · LiveCodeBench 68.0% · Arena AI ELO 1380 | Also viable on 32 GB, but on 64 GB it runs with headroom for concurrent tools and larger context batches. Best efficient choice. |
| Qwen3.5-27B (Alibaba) | 27B | 256K | SWE-Verified 72.4% · GPQA Diamond 85.8% · MMLU-Pro 86.1% | On 64 GB it runs with massive headroom. You can crank context length, run multiple agents, or keep other apps open. Overkill but buttery smooth. |
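
All of these ship open weights, so the table translates into a local setup directly. A minimal sketch with `llama-cpp-python`; the GGUF filename and quantization level are placeholders that depend on which community build you download:

```python
# Running a 64 GB-class model locally with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-35b-a3b.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=32768,      # a slice of the 256K window; raise if RAM allows
    n_gpu_layers=-1,  # offload all layers to GPU/Metal if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a binary search in Python."}]
)
print(out["choices"][0]["message"]["content"])
```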
Mid-size models with serious reasoning power for the 32 GB RAM class. The sweet spot for power users.
| Model | Params | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Qwen3.5-27B (Alibaba) | 27B | 256K | SWE-Verified 72.4% · GPQA Diamond 85.8% · MMLU-Pro 86.1% | Near-frontier coding at local speeds. Matches GPT-5-mini on SWE-Verified. The best 32 GB RAM investment for agentic workflows. |
| Qwen3.5-35B-A3B (Alibaba) | 35B (MoE, 3B active) | 256K | SWE-Verified 74.0% · GPQA Diamond 83.5% · MMLU-Pro 84.8% | MoE efficiency means 35B quality at 10B speed. Only 3B parameters active per token. Ideal for fast agentic tool calling on 32 GB rigs. |
| Gemma 4 26B (Google) | 26B (MoE, 3.8B active) | 256K | MMLU-Pro 78.5% · LiveCodeBench 68.0% · Arena AI ELO 1380 | Google's efficient MoE flagship. Faster than a dense 26B with near-equivalent quality. 256K context for long-document agent tasks. |
| KO Carnice-27B (Kai OS) | 27B | 128K | n/a | Scaled-up Carnice with deeper reasoning. Built by the community for Hermes-style agent workflows. Solid 32 GB RAM choice. |
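
The RAM math behind this tier is simple: weight memory scales with parameter count times bytes per weight, and 4-bit quantization is what makes 27-35B models fit. A rough calculation (KV cache and runtime overhead come on top); note that MoE's 3B active parameters cut compute per token, not resident weight memory:

```python
# Approximate resident weight memory for a given parameter count and
# quantization level. Ignores KV cache, activations, and runtime overhead.

def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Gigabytes of weights for params_b billion parameters."""
    return params_b * (bits_per_weight / 8)

for name, params in [("Qwen3.5-27B", 27), ("Qwen3.5-35B-A3B", 35)]:
    print(f"{name}: {weight_gb(params, 16):.0f} GB @ fp16, "
          f"{weight_gb(params, 4.5):.1f} GB @ ~Q4")
# Qwen3.5-27B: 54 GB fp16 (won't fit 32 GB) vs ~15.2 GB at 4-bit (fits).
# Qwen3.5-35B-A3B: 70 GB fp16 vs ~19.7 GB at 4-bit.
```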
Lightweight models for laptops and everyday 16 GB machines. Fast inference, low memory footprint.
| Model | Params | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Qwen3.5-9B (Alibaba) | 9B | 256K | SWE-Verified 76.2% · GPQA Diamond 81.7% · MMLU-Pro 82.5% | Best beginner local model. Strong coding and reasoning at a tiny footprint. Runs comfortably on 16 GB RAM with room for OS and apps. |
| Gemma 4 E4B (Google) | 4B effective | 128K | MMLU-Pro 72.0% · LiveCodeBench 58.0% · Arena AI ELO 1280 | Google's desktop sweet spot. Natively multimodal with minimal resource use. Best for coding and everyday agent tasks on modest hardware. |
| KO Carnice-9B (Kai OS) | 9B | 128K | n/a | Community-built for agent tool use. Smaller but aggressively optimized for function calling and multi-step reasoning. Good 16 GB fallback. |
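
The function calling that Carnice-9B is tuned for follows the standard OpenAI-style tool-call loop, which local servers such as llama.cpp's expose on an OpenAI-compatible endpoint. A minimal sketch; the tool definition, port, and model ID are illustrative assumptions:

```python
# One turn of a tool-calling loop against a local OpenAI-compatible server.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool
        "description": "Read a text file from disk.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="carnice-9b",  # assumed local model ID
    messages=[{"role": "user", "content": "What is in ./notes.txt?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
# A full agent would execute the tool and feed the result back as a
# "tool" role message, repeating until the model answers in plain text.
```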