model releaseZhipu AI

Zhipu AI's GLM-5.1 outperforms GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro through iterative strategy refinement

TL;DR

Zhipu AI has released GLM-5.1, a freely available open-weight model designed for long-running programming tasks that achieves 58.4% on SWE-Bench Pro, edging out GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%). The model's core capability is iterative strategy refinement—it rethinks its approach across hundreds of iterations and thousands of tool calls, recognizing dead ends and shifting tactics without human intervention. However, GLM-5.1 trails on reasoning and knowledge benchmarks, scoring 31% on Humanity's Last Exam compared to Gemini 3.1 Pro's 45%.

3 min read
0

Zhipu AI Releases GLM-5.1 with Iterative Long-Horizon Coding Capabilities

Zhipu AI has introduced GLM-5.1, an open-weight model available under MIT license on Hugging Face and ModelScope. The model is purpose-built for extended programming tasks where iterative strategy refinement, rather than raw parameter count, determines success on complex problems.

SWE-Bench Pro Leadership

On the SWE-Bench Pro software engineering benchmark, GLM-5.1 scores 58.4%—the highest among freely available models tested. This edges out:

  • GPT-5.4: 57.7%
  • Claude Opus 4.6: 57.3%

On CyberGym (cybersecurity), GLM-5.1 leads with 68.7%, though Zhipu AI notes that Gemini 3.1 Pro and GPT-5.4 refused some tasks for safety reasons, potentially affecting their scores.

Iterative Refinement Across Hundreds of Rounds

The model's defining feature is its ability to repeatedly review and revise its own strategy without external guidance. Zhipu AI demonstrates this through three internal evaluations:

Vector Database Optimization: GLM-5.1 improved query performance from Claude Opus 4.6's baseline of 3,547 queries per second to 21,500 queries per second—a 6.1x improvement. This required 600+ iterations and 6,000+ tool calls. The model initiated six major structural shifts:

  • Iteration 90: Switched from exhaustive search to clustering
  • Iteration 240: Introduced two-stage pipeline with pre-sorting and filtering

GPU Optimization: On KernelBench Level 3, GLM-5.1 achieved 3.6x speedup on baseline ML code versus Claude Opus 4.6's 4.2x. The model sustained progress longer than GLM-5 but remains behind the strongest competitor.

Linux Desktop Construction: When tasked with building a complete Linux desktop environment from a single prompt, GLM-5.1 delivered a functional system with file browser, terminal, text editor, system monitor, calculator, and games after eight hours of iterative refinement.

Mixed Results on Reasoning Tasks

GLM-5.1 shows clear weaknesses in non-coding domains:

  • Humanity's Last Exam (knowledge): 31% (vs. Gemini 3.1 Pro: 45%, GPT-5.4: 39.8%)
  • GPQA-Diamond (scientific reasoning): 86.2% (vs. Gemini 3.1 Pro: 94.3%, GPT-5.4: 92%)
  • Vending Bench 2 (agent business simulation): $5,634 balance (vs. Claude Opus 4.6: $8,018)
  • NL2Repo (repository generation): 42.7% (vs. Claude Opus 4.6: 49.8%)

On the Artificial Analysis Intelligence Index, GLM-5.1 ranks just behind Claude 4.6 Sonnet.

Acknowledged Limitations

Zhipu AI openly identifies remaining challenges: the model needs to recognize dead ends sooner, maintain coherence across thousands of tool calls, and reliably self-assess performance on tasks without clear success metrics. The company explicitly describes GLM-5.1 as a "first step."

Availability and Integration

GLM-5.1 is accessible via:

  • Hugging Face and ModelScope repositories
  • api.z.ai and BigModel.cn API platforms
  • Z.ai chat interface (launching in coming days)
  • Local deployment via vLLM and SGLang inference frameworks

The model integrates with coding agents including Claude Code and OpenClaw.

Market Context

GLM-5.1 represents Zhipu AI's expansion into autonomous coding—the company previously released GLM-5 (744B parameters) in February 2026 and GLM-5V-Turbo (multimodal coding) more recently. Competitors include Moonshot AI's Kimi K2.5 and Alibaba's Qwen3.5, both targeting the same agent-based coding market.

What This Means

GLM-5.1 demonstrates that extended iteration on coding tasks can outperform single-pass approaches from larger proprietary models—but only in specialized domains. The benchmarks reveal a critical trade-off: models optimized for iterative refinement on engineering problems lose generality on reasoning and knowledge tasks. The lack of independent verification for the three internal demonstrations leaves claims about iteration counts and strategy shifts unconfirmed. For coding-specific workloads, the open-weight availability and MIT license make GLM-5.1 worth evaluation; for general-purpose reasoning, leading proprietary models retain clear advantages.

Related Articles

model release

Google releases Gemini 3.5 Flash with autonomous coding and agent capabilities, claims 4x speed boost

Google released Gemini 3.5 Flash, positioning it as an agent-first model designed for autonomous coding and multi-hour workflows. The company claims the model outperforms its 3.1 Pro predecessor on coding and agentic benchmarks while running 4x faster than competing frontier models, with an optimized version achieving 12x speed gains.

model release

Alibaba Releases Qwen3.7 Max with 1M Token Context Window for Agent and Coding Tasks

Alibaba has released Qwen3.7 Max, the flagship model in its Qwen3.7 series, featuring a 1 million token context window. The text-only model is designed for agent-centric workloads with strengths in coding, office productivity, and long-horizon autonomous execution, and includes explicit prompt caching support.

model release

Google releases Gemini 3.5 Flash with 4x faster output and agentic capabilities, 3.5 Pro coming June

Google released Gemini 3.5 Flash today with 4x faster output token generation than competing frontier models while surpassing Gemini 3.1 Pro on coding, agentic, and multimodal benchmarks. The company announced Gemini 3.5 Pro will launch next month and introduced Gemini Omni, a new multimodal series that outputs video.

model release

Google Releases Gemini 3.5 Flash with 1M Token Context and Configurable Thinking Modes at $1.50/$9 Per Million Tokens

Google has released Gemini 3.5 Flash, a multimodal model with a 1 million token context window priced at $1.50 per million input tokens and $9 per million output tokens. The model supports text, image, video, audio, and PDF inputs with configurable thinking effort levels from minimal to high.

Comments

Loading...