AI Model Benchmarks
Compare the latest AI models across coding, reasoning, math, and knowledge benchmarks. Data sourced from official releases and independent evaluations.
Overall Rankings
Averaged across all benchmarks using normalized scores (a sketch of the computation follows the list)
1. Gemini 3.1 Pro (Google)
2. Claude Opus 4.6 (Anthropic)
3. GPT-5.3 Codex (OpenAI)
4. Claude Opus 4.5 (Anthropic)
5. GPT-5.2 (OpenAI)
6. Claude Sonnet 4.6 (Anthropic)
7. DeepSeek V4 (DeepSeek)
8. Grok 4 (xAI)
9. Gemini 3 Flash (Google)
10. Llama 4 405B (Meta)
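The page does not specify the normalization method, so the sketch below assumes per-benchmark min-max normalization to [0, 1] before averaging. The model names and scores are placeholders, not real results.

```python
# Sketch of the overall ranking: min-max normalize each benchmark's scores,
# then average the normalized scores per model. Min-max is an assumption;
# the page only says "normalized scores".

# Placeholder scores: {benchmark: {model: raw score}}
scores = {
    "SWE-Bench Verified": {"Model A": 74.5, "Model B": 68.0},
    "MMLU": {"Model A": 90.1, "Model B": 88.7},
}

def normalize(by_model: dict[str, float]) -> dict[str, float]:
    """Rescale one benchmark's scores to [0, 1]."""
    lo, hi = min(by_model.values()), max(by_model.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
    return {m: (s - lo) / span for m, s in by_model.items()}

def overall_ranking(scores: dict) -> list[tuple[str, float]]:
    """Average each model's normalized scores and sort best-first."""
    totals: dict[str, list[float]] = {}
    for by_model in scores.values():
        for model, norm in normalize(by_model).items():
            totals.setdefault(model, []).append(norm)
    averages = {m: sum(v) / len(v) for m, v in totals.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

print(overall_ranking(scores))  # [('Model A', 1.0), ('Model B', 0.0)]
```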
Coding Benchmarks
Real-world software engineering and code generation tasks
SWE-Bench Verified
Real-world software engineering tasks from GitHub issues
HumanEval
Python code generation with unit test verification
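HumanEval scores are conventionally reported as pass@k. The unbiased estimator below comes from the original HumanEval paper (Chen et al., 2021); the sample counts in the usage example are illustrative, not measured results.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples drawn from n total passes the unit tests,
    given that c of the n samples passed."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: any k-draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 40 of them passing
print(round(pass_at_k(200, 40, 1), 3))   # 0.2 (exactly c/n for k=1)
print(round(pass_at_k(200, 40, 10), 3))  # ≈ 0.899
```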
Knowledge Benchmarks
General knowledge and understanding across domains
MMLU
Massive Multitask Language Understanding across 57 subjects
Reasoning Benchmarks
Abstract reasoning and problem-solving capabilities
ARC-AGI-2
Abstract reasoning tasks designed to be easy for humans but hard for AI
GPQA Diamond
Graduate-level science questions (physics, chemistry, biology)
Math Benchmarks
Mathematical reasoning from basic to competition level
MATH-500
Competition-level mathematics problems
AIME 2024
American Invitational Mathematics Examination problems
Human Preference Benchmarks
Head-to-head model comparisons rated by human voters
Chatbot Arena Elo
Crowdsourced human preference ratings from 6M+ votes
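The rating itself follows the classic Elo scheme: each vote is a pairwise comparison, and the winner takes rating points from the loser in proportion to how unexpected the result was. Below is a minimal sketch of a single update; the live leaderboard's exact fitting procedure may differ.

```python
def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One pairwise Elo update. score_a is 1.0 if model A wins the vote,
    0.0 if model B wins, 0.5 for a tie. k controls the update size."""
    # Expected score for A given the current rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# An upset: the 1200-rated model beats the 1300-rated one in a single vote
print(elo_update(1200.0, 1300.0, 1.0))  # ≈ (1220.48, 1279.52)
```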
Model Specifications
Context windows, pricing, and release information
| Model | Company | Context | Max Output | Input $/1M tokens | Output $/1M tokens | Released |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 1.0M | 128K | $15.00 | $75.00 | 2026-02 |
| Claude Opus 4.5 | Anthropic | 200K | 32K | $15.00 | $75.00 | 2025-10 |
| Claude Sonnet 4.6 | Anthropic | 200K | 64K | $3.00 | $15.00 | 2026-02 |
| GPT-5.3 Codex | OpenAI | 256K | 32K | $5.00 | $20.00 | 2026-02 |
| GPT-5.2 | OpenAI | 128K | 16K | $2.50 | $10.00 | 2025-12 |
| Gemini 3.1 Pro | Google | 2.0M | 64K | $1.25 | $5.00 | 2026-02 |
| Gemini 3 Flash | Google | 1.0M | 32K | $0.07 | $0.30 | 2026-01 |
| Grok 4 | xAI | 256K | 32K | $3.00 | $15.00 | 2026-02 |
| DeepSeek V4 | DeepSeek | 128K | 16K | $0.14 | $0.28 | 2026-02 |
| Llama 4 405B | Meta | 128K | 16K | Free | Free | 2026-01 |
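To read the pricing columns: a request's cost is linear in tokens, billed separately for input and output. A minimal sketch, ignoring provider-specific adjustments such as prompt caching or batch discounts:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# Claude Sonnet 4.6 rates from the table: $3.00 in / $15.00 out per 1M tokens
print(f"${request_cost(50_000, 2_000, 3.00, 15.00):.2f}")  # $0.18
```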
Methodology & Sources
Benchmark scores are collected from official model releases, research papers, and independent evaluation platforms. We prioritize verified, reproducible results.
Key Benchmarks Explained
- SWE-Bench Verified: Real software engineering tasks from GitHub, testing end-to-end coding ability
- HumanEval: Python code generation with unit test verification (164 problems)
- MMLU: 57-subject knowledge test covering STEM, humanities, and social sciences
- ARC-AGI-2: Abstract reasoning tasks designed to be easy for humans but hard for AI
- GPQA Diamond: Graduate-level science questions requiring deep understanding
- MATH-500: Competition-level mathematics from AMC to IMO difficulty
- Chatbot Arena: Elo ratings from 6M+ crowdsourced human preference votes
Limitations
- Benchmarks don't capture all real-world capabilities
- Scores can vary based on prompting and evaluation setup
- Some benchmarks may be saturated by frontier models
- Pricing and capabilities change frequently
Want to stay updated on AI models?
We update benchmarks as new models are released and tested.