AI Models
Tiered ranking of large language models optimized for agentic workflows. Updated continuously.
Complex reasoning · Strategy · Planning · External dev only
| Model | Cost /1M tokens (in / out) | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Claude Opus 4.7 (Anthropic · Apr 2026) | $5 / $25 (flat pricing) | 1M (128k out) | SWE-Verified 87.6% · SWE-Pro 64.3% · MCP-Atlas 77.3% | Best publicly available model as of Apr 2026. 3× vision resolution. 87.6% SWE-Verified leads all non-preview models. Same price as 4.6. |
| GLM-5.1 (Z.AI · Apr 2026) | $1.40 / $4.40 | 200K (131K max output) | SWE-Bench Pro 58.4% · SWE-Verified 77.8% · GPQA Diamond 83.9% | #1 SWE-Pro globally at launch (58.4%). Open weights on HuggingFace. About 1/3 the cost of Opus. Self-reported benchmarks; independent verification pending. |
| Kimi K2.6 (Moonshot AI · Apr 2026) | $0.95 / $4 (via Kimi API) | 256K | SWE-Verified 80.2% · SWE-Pro 58.6% · Terminal-Bench 2.0 66.7% | Open-weight at near-frontier level. 300 sub-agents in parallel. Competitive with Opus 4.6 at a fraction of the cost. Native video input. |
| Claude Opus 4.6 (Anthropic · Feb 2026) | $5 / $25 (flat pricing) | 1M (128k out) | SWE-Verified 80.8% · Terminal-Bench 2.0 65.4% · ARC-AGI-2 68.9% | #1 agentic terminal coding. Handles its own errors in a loop over long sessions. API only; no subscription tier for agents. Superseded by 4.7. |
| GPT-5.4 (OpenAI · Mar 2026) | $2.50 / $15 (2+ above $72k/day) | 1.05M | SWE-Pro 57.7% · OSWorld 75.0% · GPQA Diamond 92.8% | Multi-hour autonomous execution with real planning. Only worth the premium when task complexity genuinely needs frontier judgment. |
| GPT-5.5 (OpenAI · Apr 2026) | $5.00 / $30 (Pro variant significantly higher) | 1M | SWE-Pro 58.6% · Terminal-Bench 2.0 82.7% · GDPval 84.9% | OpenAI's latest agentic model (codenamed Spud). 82.7% Terminal-Bench 2.0 is best-in-class for CLI/shell workflows. Strong on autonomous multi-step tasks. Pro variant adds test-time compute for deeper reasoning. |
| DeepSeek V4 Pro (DeepSeek · Apr 2026) | $1.74 / $3.48 (cached input $0.145/1M; self-host = free) | 1M | SWE-Verified ~80.6% · MMLU-Pro ~73.5 · HumanEval ~76.8% | Top-tier open-weights model released Apr 24 2026. 1.6T MoE with 49B active params. Matches frontier performance on SWE-Verified at open-weight pricing. 1M context window with hybrid attention for efficiency. Independent verification pending. |
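
Taken together, the tiers above imply a routing policy: send only genuinely hard work to Tier 1 and default everything else downward. A minimal sketch in Python, assuming illustrative model IDs and a hand-rolled complexity score (neither comes from any provider's API):

```python
# Minimal tier router: pick the cheapest adequate model for a task.
# Model IDs below are illustrative placeholders, not official API identifiers.

TIERS = {
    1: "claude-opus-4.7",    # complex reasoning, strategy, planning (assumed ID)
    2: "gemini-3.1-pro",     # tool calls, long task chains (assumed ID)
    3: "claude-sonnet-4.6",  # day-to-day content and code (assumed ID)
}

def pick_model(complexity: int) -> str:
    """Map a 1-10 complexity score onto the cheapest adequate tier."""
    if complexity >= 8:
        return TIERS[1]  # frontier judgment genuinely needed
    if complexity >= 5:
        return TIERS[2]  # multi-step pipelines, agent loops
    return TIERS[3]      # routine work: don't pay frontier prices

print(pick_model(9))  # -> claude-opus-4.7
```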
Tool calls · Long task chains · Multi-step pipelines
| Model | Cost /1M tokens (in / out) | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Gemini 3.1 Pro (Google · Feb 2026) | $2 / $12 | 1M (84k out) | ARC-AGI-2 77.1% · GPQA Diamond 94.3% · SWE-Verified 80.6% | 2.5× cheaper than Opus on input. Leads most vision benchmarks. No separate pipeline for media. |
| MiniMax M2.7 (MiniMax · Mar 2026) | $0.30 / $1.20 ($19/mo = 1500 calls/5h) | 205K (131k out) | SWE-Pro 56.2% · Terminal-Bench 2.0 57.0% · Vibe-Pro 55.6% | Best price-to-agent-capability in the stack. 97% skill adherence, critical for OpenClaw's skill ecosystems. The $19 plan is absurdly good value. |
| Kimi K2.5 (Moonshot · Feb 2026) | $0.60 / $3.00 | 256K | HLE w/ tools 50.2% · BrowseComp 79.4% · SWE-Verified 76.8% | Best long-context stability for extended tasks. ~6× more output tokens than peers; budget carefully. |
| DeepSeek V3.2 (DeepSeek · Dec 2025) | $0.27 / $0.41 | 164K | SWE-Verified 70.0% · Aider polyglot 74.2% | 90% of GPT-5.4 performance at a fraction of the cost (roughly 1/9 on input, 1/37 on output). Best price-performance in this tier via OpenRouter. |
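
Several Tier 2 entries are reachable through OpenRouter's OpenAI-compatible endpoint, which the DeepSeek row calls out. A minimal sketch using the `openai` SDK; the model slug `deepseek/deepseek-v3.2` is an assumption, so check openrouter.ai/models for the real identifier:

```python
# Calling a Tier 2 model through OpenRouter's OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-v3.2",  # assumed slug; verify before use
    messages=[{"role": "user", "content": "Summarize this diff: ..."}],
)
print(resp.choices[0].message.content)
```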
Content · Code · Research · Day-to-day tasks
| Model | Cost /1M tokens (in / out) | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Claude Sonnet 4.6 (Anthropic · Feb 2026) | $3 / $15 (API only) | 1M (64k out) | SWE-Verified 79.9% · Computer use 94.0% · AI Index 52/100 | 98% of Opus coding at 3/5 the cost. API only; no $10/mo plan exists for this model. |
| GPT-5.4 mini (OpenAI · various) | $0.75 / $4.50 (OAuth via subscription) | 400K | SWE-Pro 54.4% · Tool-call reliability 93.4% · OSWorld 72.1% | Smart enough to run an entire system. 93.4% tool-call reliability. ChatGPT OAuth means no API billing needed. |
| Qwen3.6 Plus (Alibaba · Apr 2026) | $0 / $0 (free during preview) | 1M | SWE-Verified 78.8% | Best free model available. 1M context. Near-frontier coding. Free until the preview window closes. |
| Llama 4 Maverick (Meta · 2026) | $0.19–$0.49 ($0 if self-hosted) | 1M | MMLU 85.5% · SWE-Verified ~68% | Only serious open-weight option at this level. Self-hosting means zero ongoing cost. Best for data sovereignty needs. |
| Mistral Small 4 (Mistral · Mar 2026) | $0.15 / $0.60 | 256K | AA Intelligence Index 27/100 · AA LCR score 0.72 · MATH-500 ~93.6% | Replaces three separate models with one Apache 2.0 weights file. 162 tokens/s output. Best simplicity-to-capability ratio in Tier 3. |
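
Because Tier 3 is where day-to-day volume lives, the per-million prices above translate directly into monthly spend. A worked example in Python; the token volumes are hypothetical, the prices come straight from the table:

```python
# Back-of-envelope monthly cost from the per-million-token prices above.

def monthly_cost(in_tokens_m: float, out_tokens_m: float,
                 price_in: float, price_out: float) -> float:
    """Cost in USD for a month of usage; token counts in millions."""
    return in_tokens_m * price_in + out_tokens_m * price_out

# Hypothetical agent burning 50M input / 5M output tokens per month:
print(monthly_cost(50, 5, 3.00, 15.00))  # Claude Sonnet 4.6 -> 225.0
print(monthly_cost(50, 5, 0.75, 4.50))   # GPT-5.4 mini     -> 60.0
print(monthly_cost(50, 5, 0.15, 0.60))   # Mistral Small 4  -> 10.5
```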
Local inference · Zero API cost · Full privacy
Maximum quality local inference for the 64 GB RAM class. Frontier-grade reasoning without leaving your machine.
| Model | Params | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Qwen3.6-27B (Alibaba) | 27B dense | 256K | SWE-Verified 80.0% · GPQA Diamond 89.0% · MMLU-Pro 88.5% | Next-gen Qwen with 256K context. Frontier-level coding at local speeds. The dense 27B punches far above its weight class. |
| Qwen3.6-35B-A3B (Alibaba) | 35B (MoE, 3B active) | 256K | SWE-Verified 82.0% · GPQA Diamond 88.5% · MMLU-Pro 87.0% | MoE efficiency with a 256K context. Barely uses more VRAM than the 27B but hits harder on coding benchmarks. Best value in the 64 GB tier. |
| Gemma 4 31B (Google) | 31B dense | 256K | MMLU-Pro 85.2% · GPQA Diamond 84.3% · AIME 2026 89.2% | Top-ranked open model on Arena AI. Google's dense flagship with genuine frontier reasoning. Best raw quality for local agent deployment. |
| Gemma 4 26B (Google) | 26B (MoE, 3.8B active) | 256K | MMLU-Pro 78.5% · LiveCodeBench 68.0% · Arena AI ELO 1380 | Also viable on 32 GB, but on 64 GB it runs with headroom for concurrent tools and larger context batches. Best efficient choice. |
| Qwen3.5-27B (Alibaba) | 27B | 256K | SWE-Verified 72.4% · GPQA Diamond 85.8% · MMLU-Pro 86.1% | On 64 GB it runs with massive headroom. You can crank context length, run multiple agents, or keep other apps open. Overkill but buttery smooth. |
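
All of these ship open weights, so the table translates into a local setup directly. A minimal sketch with `llama-cpp-python`; the GGUF filename and quantization level are placeholders that depend on which community build you download:

```python
# Running a 64 GB-class model locally with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-35b-a3b.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=32768,      # a slice of the 256K window; raise if RAM allows
    n_gpu_layers=-1,  # offload all layers to GPU/Metal if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a binary search in Python."}]
)
print(out["choices"][0]["message"]["content"])
```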
Mid-size models with serious reasoning power for the 32 GB RAM class. The sweet spot for power users.
| Model | Params | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Qwen3.5-27B (Alibaba) | 27B | 256K | SWE-Verified 72.4% · GPQA Diamond 85.8% · MMLU-Pro 86.1% | Near-frontier coding at local speeds. Matches GPT-5-mini on SWE-Verified. The best 32 GB RAM investment for agentic workflows. |
| Qwen3.5-35B-A3B (Alibaba) | 35B (MoE, 3B active) | 256K | SWE-Verified 74.0% · GPQA Diamond 83.5% · MMLU-Pro 84.8% | MoE efficiency means 35B quality at 10B speed. Only 3B parameters active per token. Ideal for fast agentic tool calling on 32 GB rigs. |
| Gemma 4 26B (Google) | 26B (MoE, 3.8B active) | 256K | MMLU-Pro 78.5% · LiveCodeBench 68.0% · Arena AI ELO 1380 | Google's efficient MoE flagship. Faster than a dense 26B with near-equivalent quality. 256K context for long-document agent tasks. |
| KO Carnice-27B (Kai OS) | 27B | 128K | n/a | Scaled-up Carnice with deeper reasoning. Built by the community for Hermes-style agent workflows. Solid 32 GB RAM choice. |
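
The RAM math behind this tier is simple: weight memory scales with parameter count times bytes per weight, and 4-bit quantization is what makes 27-35B models fit. A rough calculation (KV cache and runtime overhead come on top); note that MoE's 3B active parameters cut compute per token, not resident weight memory:

```python
# Approximate resident weight memory for a given parameter count and
# quantization level. Ignores KV cache, activations, and runtime overhead.

def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Gigabytes of weights for params_b billion parameters."""
    return params_b * (bits_per_weight / 8)

for name, params in [("Qwen3.5-27B", 27), ("Qwen3.5-35B-A3B", 35)]:
    print(f"{name}: {weight_gb(params, 16):.0f} GB @ fp16, "
          f"{weight_gb(params, 4.5):.1f} GB @ ~Q4")
# Qwen3.5-27B: 54 GB fp16 (won't fit 32 GB) vs ~15.2 GB at 4-bit (fits).
# Qwen3.5-35B-A3B: 70 GB fp16 vs ~19.7 GB at 4-bit.
```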
Lightweight models for laptops and everyday 16 GB machines. Fast inference, low memory footprint.
| Model | Params | Context | Benchmarks | Why This Model |
|---|---|---|---|---|
| Qwen3.5-9B (Alibaba) | 9B | 256K | SWE-Verified 76.2% · GPQA Diamond 81.7% · MMLU-Pro 82.5% | Best beginner local model. Strong coding and reasoning at a tiny footprint. Runs comfortably on 16 GB RAM with room for OS and apps. |
| Gemma 4 E4B (Google) | 4B effective | 128K | MMLU-Pro 72.0% · LiveCodeBench 58.0% · Arena AI ELO 1280 | Google's desktop sweet spot. Natively multimodal with minimal resource use. Best for coding and everyday agent tasks on modest hardware. |
| KO Carnice-9B (Kai OS) | 9B | 128K | n/a | Community-built for agent tool use. Smaller but aggressively optimized for function calling and multi-step reasoning. Good 16 GB fallback. |
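
The function calling that Carnice-9B is tuned for follows the standard OpenAI-style tool-call loop, which local servers such as llama.cpp's expose on an OpenAI-compatible endpoint. A minimal sketch; the tool definition, port, and model ID are illustrative assumptions:

```python
# One turn of a tool-calling loop against a local OpenAI-compatible server.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool
        "description": "Read a text file from disk.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="carnice-9b",  # assumed local model ID
    messages=[{"role": "user", "content": "What is in ./notes.txt?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
# A full agent would execute the tool and feed the result back as a
# "tool" role message, repeating until the model answers in plain text.
```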