Best LLM for Coding in 2026

Models are ranked by a composite coding score: the average of their HumanEval, SWE-bench Verified, and LiveCodeBench results. All scores are taken from official model cards, technical reports, or the HuggingFace Open LLM Leaderboard.

Updated automatically as new models are released. Full benchmark leaderboard →

[Leaderboard: 48 models ranked by composite coding score, from 93.9% at rank 1 down to 44.0% at rank 48.]

How we rank coding models

We average scores from three industry-standard benchmarks (a worked example follows the list):

  • HumanEval — Function-completion tasks in Python. Tests whether a model can write correct, working code from a docstring description. Scored as pass@1 accuracy.
  • SWE-bench Verified — Real GitHub issues from popular open-source repos. Tests autonomous software engineering: read the issue, write a fix, pass the test suite. Of the three benchmarks, it is the closest to day-to-day software engineering work.
  • LiveCodeBench — Competitive programming problems from LeetCode, Codeforces, and AtCoder, collected after model training cutoffs to reduce contamination. Harder than HumanEval.
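
The composite for each model is the unweighted mean of these three scores. Here is a minimal Python sketch of that calculation; the benchmark values are illustrative placeholders for a hypothetical model, not real results:

    from statistics import mean

    # Illustrative placeholder scores for a hypothetical model, in percent.
    # Real values come from official model cards and technical reports.
    scores = {
        "HumanEval": 92.0,           # pass@1 on function-completion tasks
        "SWE-bench Verified": 74.5,  # share of GitHub issues resolved
        "LiveCodeBench": 68.3,       # pass@1 on post-cutoff contest problems
    }

    # Composite coding score: unweighted mean of the three benchmark scores.
    composite = mean(scores.values())
    print(f"Composite coding score: {composite:.1f}%")  # -> 78.3%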

All scores are from official model cards, technical reports, or the HuggingFace Open LLM Leaderboard. Rankings update automatically as new models are released.

Also see: Best Reasoning LLM, Best Cheap LLM, Compare any two models.