model release

GLM-5.1 achieves 58.4% on SWE-Bench Pro with sustained agentic reasoning over hundreds of iterations

TL;DR

Zhipu AI has released GLM-5.1, a 754-billion parameter model designed for agentic engineering with significantly improved coding capabilities over its predecessor. The model achieves 58.4% on SWE-Bench Pro and demonstrates sustained performance improvement over hundreds of tool calls and iterations, unlike earlier models that plateau quickly.


GLM-5.1: Zhipu AI Releases 754B Model Optimized for Sustained Agentic Reasoning

Zhipu AI has released GLM-5.1, a 754-billion parameter model designed to maintain performance over extended agentic tasks. The model represents a departure from earlier approaches that achieve quick wins and then plateau; instead, it sustains optimization over hundreds of rounds and thousands of tool calls.

Core Performance Metrics

GLM-5.1 demonstrates strong performance across agentic and mathematical benchmarks:

Coding & Software Engineering:

  • SWE-Bench Pro: 58.4% (vs GLM-5's 55.1%)
  • NL2Repo (repository generation): 42.7% (vs GLM-5's 35.9%)
  • Terminal-Bench 2.0: 63.5% (vs GLM-5's 56.2%)

Mathematical Reasoning:

  • AIME 2026: 95.3%
  • HMMT November 2025: 94.0%
  • HMMT February 2026: 82.6%
  • GPQA-Diamond: 86.2%

Tool Use & Navigation:

  • CyberGym: 68.7% (vs GLM-5's 48.3%)
  • BrowseComp: 68.0% (vs GLM-5's 62.0%)
  • BrowseComp (with context management): 79.3% (vs GLM-5's 75.9%)
  • MCP-Atlas (public set): 71.8% (vs GLM-5's 69.2%)

Key Architectural Differences

The distinguishing characteristic of GLM-5.1 is its ability to remain productive over extended reasoning horizons. According to Zhipu AI's technical report, earlier models—including GLM-5—exhaust effective techniques early and fail to improve with additional computation. GLM-5.1 instead demonstrates improved judgment on ambiguous problems and maintains productivity across longer sessions.

The model accomplishes this through better handling of complex problem decomposition, experimental iteration, result interpretation, and blocker identification. It revises strategy through repeated self-reflection across hundreds of rounds, with performance continuing to improve rather than stalling.
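The iterate-observe-reflect cycle described above can be pictured as a minimal loop. The sketch below is purely illustrative and calls no real model; a toy numeric search stands in for strategy revision across rounds, and every name in it is hypothetical.

```python
# Illustrative agent loop: propose an action, observe feedback, revise
# strategy, repeat. A binary-search-style guessing game stands in for
# the model's self-reflection across rounds.

def run_agent(target: int, max_rounds: int = 100) -> tuple[int, int]:
    """Return (answer, rounds_used) for a toy search task."""
    lo, hi = 0, 1 << 20
    guess = 0
    for round_no in range(1, max_rounds + 1):
        guess = (lo + hi) // 2                            # propose an action
        feedback = (guess > target) - (guess < target)    # observe the result
        if feedback == 0:                                 # task solved: stop
            return guess, round_no
        if feedback < 0:                                  # reflect, revise up
            lo = guess + 1
        else:                                             # reflect, revise down
            hi = guess - 1
    return guess, max_rounds
```

The point of the sketch is only that each round's feedback narrows the next round's strategy, rather than the loop committing to its first guess.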

Deployment & Availability

GLM-5.1 is available through:

  • Z.ai API Platform for inference services
  • Local deployment via SGLang (v0.5.10+), vLLM (v0.19.0+), xLLM (v0.8.0+), Transformers (v0.5.3+), and KTransformers (v0.5.3+)
  • Hugging Face model repository (zai-org/GLM-5.1)
  • chat.z.ai web interface (launching in the coming days)

The model is distributed in BF16 and F32 tensor formats. Pricing and context window information have not been disclosed.
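For API access, a minimal request sketch, assuming Z.ai exposes an OpenAI-compatible chat-completions endpoint; the endpoint URL, model id `glm-5.1`, and `ZAI_API_KEY` variable are assumptions, so check the Z.ai API Platform documentation for the actual values:

```python
# Sketch of calling GLM-5.1 through an assumed OpenAI-compatible
# chat-completions endpoint. URL, model id, and env var are guesses.
import json
import os
import urllib.request

API_URL = "https://api.z.ai/v1/chat/completions"  # assumed endpoint

def build_request(prompt: str, model: str = "glm-5.1") -> dict:
    """Assemble a standard chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature suits coding tasks
    }

def call_glm(prompt: str) -> str:
    """POST the payload and return the first completion's text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['ZAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Local serving through SGLang or vLLM typically exposes the same chat-completions schema, so a payload like this would usually carry over with only the base URL changed.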

Benchmark Context

GLM-5.1 leads or matches comparable models on most agentic benchmarks. On SWE-Bench Pro, it ranks among the highest performers alongside Claude Code (57.3%) and GPT-4o (57.7%). On NL2Repo, it substantially leads Claude Code (49.8%) and other competitors. However, on pure mathematics (HMMT Feb 2026: 82.6%), it trails specialized reasoning models like Gemini 3.1 Pro (87.3%) and GPT-4o (91.8%).

The model ranked last among tested competitors on Tool-Decathlon (40.7%) and scored lower on Vending Bench 2 ($5,634), suggesting domain-specific limitations despite strong general agentic performance.

What This Means

GLM-5.1 represents a meaningful architectural shift in how models handle extended agentic tasks. The emphasis on sustained reasoning over longer horizons addresses a genuine limitation in current models: the ability to iteratively refine solutions rather than committing to early strategies. For software engineering and web automation tasks specifically, the performance gains over GLM-5 are substantial (3-6 points on most benchmarks). However, the model's ability to scale reasoning with additional computation remains bounded—it still underperforms specialized reasoning models on pure mathematics. The practical value depends heavily on whether real-world agentic tasks benefit from this sustained-reasoning architecture versus quick accuracy gains.

