model release

GLM-5.1 achieves 58.4% on SWE-Bench Pro with sustained agentic reasoning over hundreds of iterations

TL;DR

Zhipu AI has released GLM-5.1, a 754-billion parameter model designed for agentic engineering with significantly improved coding capabilities over its predecessor. The model achieves 58.4% on SWE-Bench Pro and demonstrates sustained performance improvement over hundreds of tool calls and iterations, unlike earlier models that plateau quickly.


GLM-5.1: Zhipu AI Releases 754B Model Optimized for Sustained Agentic Reasoning

Zhipu AI has released GLM-5.1, a 754-billion parameter model designed to maintain performance over extended agentic tasks. The model represents a departure from earlier approaches that achieve quick wins and then plateau; instead, it sustains optimization over hundreds of rounds and thousands of tool calls.

Core Performance Metrics

GLM-5.1 demonstrates strong performance across agentic and mathematical benchmarks:

Coding & Software Engineering:

  • SWE-Bench Pro: 58.4% (vs GLM-5's 55.1%)
  • NL2Repo (repository generation): 42.7% (vs GLM-5's 35.9%)
  • Terminal-Bench 2.0: 63.5% (vs GLM-5's 56.2%)

Mathematical Reasoning:

  • AIME 2026: 95.3%
  • HMMT November 2025: 94.0%
  • HMMT February 2026: 82.6%
  • GPQA-Diamond: 86.2%

Tool Use & Navigation:

  • CyberGym: 68.7% (vs GLM-5's 48.3%)
  • BrowseComp: 68.0% (vs GLM-5's 62.0%)
  • BrowseComp (with context management): 79.3% (vs GLM-5's 75.9%)
  • MCP-Atlas (public set): 71.8% (vs GLM-5's 69.2%)

Key Architectural Differences

The distinguishing characteristic of GLM-5.1 is its ability to remain productive over extended reasoning horizons. According to Zhipu AI's technical report, earlier models—including GLM-5—exhaust effective techniques early and fail to improve with additional computation. GLM-5.1 instead demonstrates improved judgment on ambiguous problems and maintains productivity across longer sessions.

The model accomplishes this through better handling of complex problem decomposition, experimental iteration, result interpretation, and blocker identification. It revises strategy through repeated self-reflection across hundreds of rounds, with performance continuing to improve rather than stalling.
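The iterate-observe-reflect cycle described above can be pictured as a minimal loop. The sketch below is purely illustrative and calls no real model; a toy numeric search stands in for strategy revision across rounds, and every name in it is hypothetical.

```python
# Illustrative agent loop: propose an action, observe feedback, revise
# strategy, repeat. A binary-search-style guessing game stands in for
# the model's self-reflection across rounds.

def run_agent(target: int, max_rounds: int = 100) -> tuple[int, int]:
    """Return (answer, rounds_used) for a toy search task."""
    lo, hi = 0, 1 << 20
    guess = 0
    for round_no in range(1, max_rounds + 1):
        guess = (lo + hi) // 2                            # propose an action
        feedback = (guess > target) - (guess < target)    # observe the result
        if feedback == 0:                                 # task solved: stop
            return guess, round_no
        if feedback < 0:                                  # reflect, revise up
            lo = guess + 1
        else:                                             # reflect, revise down
            hi = guess - 1
    return guess, max_rounds
```

The point of the sketch is only that each round's feedback narrows the next round's strategy, rather than the loop committing to its first guess.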

Deployment & Availability

GLM-5.1 is available through:

  • Z.ai API Platform for inference services
  • Local deployment via SGLang (v0.5.10+), vLLM (v0.19.0+), xLLM (v0.8.0+), Transformers (v0.5.3+), and KTransformers (v0.5.3+)
  • Hugging Face model repository (zai-org/GLM-5.1)
  • chat.z.ai web interface (launching in the coming days)

The model is distributed in BF16 and F32 tensor formats. Pricing and context window information have not been disclosed.
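For API access, a minimal request sketch, assuming Z.ai exposes an OpenAI-compatible chat-completions endpoint; the endpoint URL, model id `glm-5.1`, and `ZAI_API_KEY` variable are assumptions, so check the Z.ai API Platform documentation for the actual values:

```python
# Sketch of calling GLM-5.1 through an assumed OpenAI-compatible
# chat-completions endpoint. URL, model id, and env var are guesses.
import json
import os
import urllib.request

API_URL = "https://api.z.ai/v1/chat/completions"  # assumed endpoint

def build_request(prompt: str, model: str = "glm-5.1") -> dict:
    """Assemble a standard chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature suits coding tasks
    }

def call_glm(prompt: str) -> str:
    """POST the payload and return the first completion's text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['ZAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Local serving through SGLang or vLLM typically exposes the same chat-completions schema, so a payload like this would usually carry over with only the base URL changed.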

Benchmark Context

GLM-5.1 leads or matches comparable models on most agentic benchmarks. On SWE-Bench Pro, it ranks among the highest performers alongside Claude Code (57.3%) and GPT-4o (57.7%). On NL2Repo, it substantially leads Claude Code (49.8%) and other competitors. However, on pure mathematics (HMMT Feb 2026: 82.6%), it trails specialized reasoning models like Gemini 3.1 Pro (87.3%) and GPT-4o (91.8%).

The model ranked last among tested competitors on Tool-Decathlon (40.7%) and scored lower on Vending Bench 2 ($5,634), suggesting domain-specific limitations despite strong general agentic performance.

What This Means

GLM-5.1 represents a meaningful architectural shift in how models handle extended agentic tasks. The emphasis on sustained reasoning over longer horizons addresses a genuine limitation in current models: the ability to iteratively refine solutions rather than committing to early strategies. For software engineering and web automation tasks specifically, the performance gains over GLM-5 are substantial (3-6 points on most benchmarks). However, the model's ability to scale reasoning with additional computation remains bounded—it still underperforms specialized reasoning models on pure mathematics. The practical value depends heavily on whether real-world agentic tasks benefit from this sustained-reasoning architecture versus quick accuracy gains.

