
GLM-5.1 achieves 58.4% on SWE-Bench Pro with sustained agentic reasoning over hundreds of iterations

TL;DR

Zhipu AI has released GLM-5.1, a 754-billion-parameter model designed for agentic engineering, with significantly improved coding capabilities over its predecessor, GLM-5. The model achieves 58.4% on SWE-Bench Pro and keeps improving over hundreds of rounds and thousands of tool calls, unlike earlier models that plateau quickly.


GLM-5.1: Zhipu AI Releases 754B Model Optimized for Sustained Agentic Reasoning

Zhipu AI has released GLM-5.1, a 754-billion-parameter model designed to maintain performance over extended agentic tasks. The model departs from earlier approaches that achieve quick wins and then plateau: it sustains optimization over hundreds of rounds and thousands of tool calls.

Core Performance Metrics

GLM-5.1 demonstrates strong performance across agentic and mathematical benchmarks:

Coding & Software Engineering:

  • SWE-Bench Pro: 58.4% (vs GLM-5's 55.1%)
  • NL2Repo (repository generation): 42.7% (vs GLM-5's 35.9%)
  • Terminal-Bench 2.0: 63.5% (vs GLM-5's 56.2%)
  • BrowseComp (with context management): 79.3% (vs GLM-5's 75.9%)

Mathematical & Scientific Reasoning:

  • AIME 2026: 95.3%
  • HMMT November 2025: 94.0%
  • HMMT February 2026: 82.6%
  • GPQA-Diamond: 86.2%

Tool Use & Navigation:

  • CyberGym: 68.7% (vs GLM-5's 48.3%)
  • BrowseComp: 68.0% (vs GLM-5's 62.0%)
  • MCP-Atlas (public set): 71.8% (vs GLM-5's 69.2%)
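
To make the generation-over-generation comparison easier to read, the short Python sketch below computes the gain of GLM-5.1 over GLM-5 in percentage points for every benchmark above that lists both scores. The numbers are copied from this article; the dictionary layout and the `gains` helper are illustrative names, not part of any Zhipu AI tooling.

```python
# Scores as listed in this article, in percent: (GLM-5, GLM-5.1).
SCORES = {
    "SWE-Bench Pro": (55.1, 58.4),
    "NL2Repo": (35.9, 42.7),
    "Terminal-Bench 2.0": (56.2, 63.5),
    "BrowseComp (with context management)": (75.9, 79.3),
    "CyberGym": (48.3, 68.7),
    "BrowseComp": (62.0, 68.0),
    "MCP-Atlas (public set)": (69.2, 71.8),
}


def gains(scores):
    """Return {benchmark: GLM-5.1 minus GLM-5} in percentage points, rounded to 0.1."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


if __name__ == "__main__":
    # Print benchmarks from largest to smallest gain.
    ranked = sorted(gains(SCORES).items(), key=lambda kv: kv[1], reverse=True)
    for name, delta in ranked:
        print(f"{name:<40} +{delta:.1f} pts")
```

On these figures the gains range from 2.6 points (MCP-Atlas) to 20.4 points (CyberGym), with most benchmarks falling between roughly 3 and 7 points.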

Key Architectural Differences

The distinguishing characteristic of GLM-5.1 is its ability to remain productive over extended reasoning horizons. According to Zhipu AI's technical report, earlier models, including GLM-5, exhaust their effective techniques early and fail to improve with additional computation. GLM-5.1 instead shows better judgment on ambiguous problems and stays productive across longer sessions.

The model accomplishes this through better handling of complex problem decomposition, experimental iteration, result interpretation, and blocker identification. It revises strategy through repeated self-reflection across hundreds of rounds, with performance continuing to improve rather than stalling.
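
One simple way to quantify the difference between "plateaus quickly" and "keeps improving" is to look at a per-round score trace and find the last round at which the running best score went up. The sketch below is only an illustration of that idea: the function name, the `min_gain` parameter, and both example traces are hypothetical and are not taken from Zhipu AI's report or from any real GLM run.

```python
def last_improvement_round(scores, min_gain=0.0):
    """Return the 1-based round at which the running best score last improved.

    A round counts as an improvement only if it beats the best score so far
    by more than `min_gain`. Returns 0 for an empty trace.
    """
    best = float("-inf")
    last = 0
    for round_no, score in enumerate(scores, start=1):
        if score > best + min_gain:
            best = score
            last = round_no
    return last


# Hypothetical traces, invented purely for illustration (not measured data).
PLATEAU_TRACE = [1.0, 2.0, 2.5, 2.5, 2.4, 2.5, 2.5, 2.5]
SUSTAINED_TRACE = [1.0, 1.5, 1.4, 2.0, 2.2, 2.1, 2.6, 3.0]

if __name__ == "__main__":
    print(last_improvement_round(PLATEAU_TRACE))    # 3: no gains after round 3
    print(last_improvement_round(SUSTAINED_TRACE))  # 8: still improving at the end
```

A session whose last improvement lands near the final round is still making progress; one whose last improvement lands early has effectively stalled, which is the behavior the report attributes to earlier models.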

Deployment & Availability

GLM-5.1 is available through:

  • Z.ai API Platform for inference services
  • Local deployment via SGLang (v0.5.10+), vLLM (v0.19.0+), xLLM (v0.8.0+), Transformers (v0.5.3+), and KTransformers (v0.5.3+)
  • Hugging Face model repository (zai-org/GLM-5.1)
  • chat.z.ai web interface (launching in coming days)

The model is distributed in BF16 and F32 tensor formats. Pricing and context window information have not been disclosed.
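
For a rough sense of what local deployment involves, the weight storage alone can be estimated from the parameter count and the tensor format. The sketch below assumes that all 754 billion parameters are stored at the stated precision (2 bytes each for BF16, 4 bytes each for F32) and ignores KV cache, activations, and runtime overhead, so it is a lower bound rather than a hardware requirement. The helper name is illustrative.

```python
# Bytes per parameter for the two tensor formats mentioned in the article.
BYTES_PER_PARAM = {"BF16": 2, "F32": 4}

# Parameter count as reported in the article.
NUM_PARAMS = 754_000_000_000


def weight_footprint_tb(num_params, dtype):
    """Approximate weight storage in decimal terabytes (1 TB = 1e12 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e12


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_footprint_tb(NUM_PARAMS, dtype):.2f} TB of weights")
```

Under these assumptions the weights come to roughly 1.5 TB in BF16 and 3.0 TB in F32, which suggests that local deployment is a multi-GPU, likely multi-node, undertaking.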

Benchmark Context

GLM-5.1 leads or matches comparable models on most agentic benchmarks. On SWE-Bench Pro, its 58.4% ranks among the highest, ahead of Claude Code (57.3%) and GPT-4o (57.7%). On NL2Repo, however, its 42.7% trails Claude Code (49.8%). On pure mathematics (HMMT Feb 2026: 82.6%), it also trails specialized reasoning models like Gemini 3.1 Pro (87.3%) and GPT-4o (91.8%).

The model ranked last among tested competitors on Tool-Decathlon (40.7%) and posted a lower result than competitors on Vending Bench 2 ($5,634), suggesting domain-specific limitations despite strong general agentic performance.

What This Means

GLM-5.1 represents a meaningful shift in how models handle extended agentic tasks. The emphasis on sustained reasoning over longer horizons addresses a genuine limitation in current models: the ability to iteratively refine solutions rather than committing to early strategies. For software engineering and web automation tasks specifically, the gains over GLM-5 are substantial: roughly 3 to 7 points on most benchmarks listed above, and about 20 points on CyberGym. However, the model's ability to scale reasoning with additional computation remains bounded, and it still underperforms specialized reasoning models on pure mathematics. The practical value depends heavily on whether real-world agentic tasks benefit more from sustained reasoning than from quick accuracy gains.
