Zhipu AI's GLM-5.1 outperforms GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro through iterative strategy refinement
Zhipu AI has released GLM-5.1, a freely available open-weight model designed for long-running programming tasks that achieves 58.4% on SWE-Bench Pro, edging out GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%). The model's core capability is iterative strategy refinement—it rethinks its approach across hundreds of iterations and thousands of tool calls, recognizing dead ends and shifting tactics without human intervention. However, GLM-5.1 trails on reasoning and knowledge benchmarks, scoring 31% on Humanity's Last Exam compared to Gemini 3.1 Pro's 45%.
Zhipu AI Releases GLM-5.1 with Iterative Long-Horizon Coding Capabilities
Zhipu AI has introduced GLM-5.1, an open-weight model available under MIT license on Hugging Face and ModelScope. The model is purpose-built for extended programming tasks where iterative strategy refinement, rather than raw parameter count, determines success on complex problems.
SWE-Bench Pro Leadership
On the SWE-Bench Pro software engineering benchmark, GLM-5.1 scores 58.4%, the highest result in Zhipu AI's comparison and the best score for a freely available model. It narrowly edges out two proprietary systems:
- GPT-5.4: 57.7%
- Claude Opus 4.6: 57.3%
On CyberGym (cybersecurity), GLM-5.1 leads with 68.7%, though Zhipu AI notes that Gemini 3.1 Pro and GPT-5.4 refused some tasks for safety reasons, potentially affecting their scores.
Iterative Refinement Across Hundreds of Rounds
The model's defining feature is its ability to repeatedly review and revise its own strategy without external guidance. Zhipu AI demonstrates this through three internal evaluations:
Vector Database Optimization: GLM-5.1 raised query throughput from 3,547 queries per second, Claude Opus 4.6's result on the same task, to 21,500 queries per second, a 6.1x improvement. This required more than 600 iterations and 6,000 tool calls, during which the model initiated six major structural shifts, including:
- Iteration 90: Switched from exhaustive search to clustering
- Iteration 240: Introduced two-stage pipeline with pre-sorting and filtering
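Zhipu AI has not published the code the model produced. The two shifts above describe a standard approximate-nearest-neighbor pattern (cluster-based indexing with a pre-sort-then-filter pipeline, as in IVF indexes), which can be sketched as follows. All names, sizes, and parameters here are illustrative, not taken from the actual system:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 32)).astype(np.float32)  # toy vector database
query = rng.normal(size=32).astype(np.float32)

def exhaustive_nn(q, vecs):
    # Baseline strategy: compare the query against every stored vector.
    dists = np.linalg.norm(vecs - q, axis=1)
    return int(np.argmin(dists))

def build_clusters(vecs, k=16, iters=10):
    # Shift 1 (clustering): a tiny k-means partitions the database so that
    # queries only need to touch a few partitions.
    centroids = vecs[:k].copy()
    for _ in range(iters):
        assign = np.argmin(
            np.linalg.norm(vecs[:, None] - centroids[None], axis=2), axis=1
        )
        for c in range(k):
            members = vecs[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def clustered_nn(q, vecs, centroids, assign, nprobe=4):
    # Shift 2 (two-stage pipeline): stage one pre-sorts clusters by centroid
    # distance and filters down to the nprobe closest; stage two runs the
    # exhaustive comparison only inside those clusters.
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    candidates = np.where(np.isin(assign, order))[0]
    dists = np.linalg.norm(vecs[candidates] - q, axis=1)
    return int(candidates[np.argmin(dists)])
```

The throughput gain comes from the filter step: with 16 clusters and 4 probed, stage two scans roughly a quarter of the database per query, at the cost of occasionally missing the true nearest neighbor.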
GPU Optimization: On KernelBench Level 3, GLM-5.1 achieved a 3.6x speedup over baseline ML code, versus Claude Opus 4.6's 4.2x. The model sustained progress for longer than its predecessor GLM-5 but still trails the strongest competitor.
Linux Desktop Construction: When tasked with building a complete Linux desktop environment from a single prompt, GLM-5.1 delivered a functional system with file browser, terminal, text editor, system monitor, calculator, and games after eight hours of iterative refinement.
Mixed Results on Reasoning Tasks
GLM-5.1 shows clear weaknesses in non-coding domains:
- Humanity's Last Exam (knowledge): 31% (vs. Gemini 3.1 Pro: 45%, GPT-5.4: 39.8%)
- GPQA-Diamond (scientific reasoning): 86.2% (vs. Gemini 3.1 Pro: 94.3%, GPT-5.4: 92%)
- Vending Bench 2 (agent business simulation): $5,634 balance (vs. Claude Opus 4.6: $8,018)
- NL2Repo (repository generation): 42.7% (vs. Claude Opus 4.6: 49.8%)
On the Artificial Analysis Intelligence Index, GLM-5.1 ranks just behind Claude Sonnet 4.6.
Acknowledged Limitations
Zhipu AI openly identifies remaining challenges: the model needs to recognize dead ends sooner, maintain coherence across thousands of tool calls, and reliably self-assess performance on tasks without clear success metrics. The company explicitly describes GLM-5.1 as a "first step."
Availability and Integration
GLM-5.1 is accessible via:
- Hugging Face and ModelScope repositories
- api.z.ai and BigModel.cn API platforms
- Z.ai chat interface (launching in coming days)
- Local deployment via vLLM and SGLang inference frameworks
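For local deployment, a vLLM launch would look roughly like the following. The Hugging Face repository ID and the parallelism setting are assumptions; check the official model card for the exact name and hardware requirements:

```shell
# Install vLLM and serve the model behind an OpenAI-compatible API.
pip install vllm

# Repository ID is a guess -- verify against the official Hugging Face page.
vllm serve zai-org/GLM-5.1 --tensor-parallel-size 8

# Query the local server like any OpenAI-compatible endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "zai-org/GLM-5.1",
       "messages": [{"role": "user", "content": "Refactor this function."}]}'
```

The OpenAI-compatible endpoint is what lets coding agents treat the local model as a drop-in backend.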
The model integrates with coding agents including Claude Code and OpenClaw.
Market Context
GLM-5.1 represents Zhipu AI's expansion into autonomous coding—the company previously released GLM-5 (744B parameters) in February 2026 and GLM-5V-Turbo (multimodal coding) more recently. Competitors include Moonshot AI's Kimi K2.5 and Alibaba's Qwen3.5, both targeting the same agent-based coding market.
What This Means
GLM-5.1 demonstrates that extended iteration on coding tasks can outperform single-pass approaches from larger proprietary models—but only in specialized domains. The benchmarks reveal a critical trade-off: models optimized for iterative refinement on engineering problems lose generality on reasoning and knowledge tasks. The lack of independent verification for the three internal demonstrations leaves claims about iteration counts and strategy shifts unconfirmed. For coding-specific workloads, the open-weight availability and MIT license make GLM-5.1 worth evaluating; for general-purpose reasoning, leading proprietary models retain clear advantages.
Related Articles
Alibaba's Qwen3.6 Plus reaches 78.8 on SWE-bench with 1M context window
Alibaba released Qwen3.6 Plus on April 2, 2026, featuring a 1 million token context window at $0.50 per million input tokens and $3 per million output tokens. The model combines linear attention with sparse mixture-of-experts routing to achieve a 78.8 score on SWE-bench Verified, with significant improvements in agentic coding, front-end development, and reasoning tasks.
Meta launches Muse Spark model with private API preview and 16 integrated tools
Meta announced Muse Spark today, its first model release since Llama 4 a year ago. The hosted model is available in private API preview and on meta.ai with Instant and Thinking modes, benchmarking competitively against Anthropic's Opus 4.6 and Google's Gemini 3.1 Pro, though behind on Terminal-Bench 2.0.
Meta launches Muse Spark, its first model from revamped AI labs
Meta Superintelligence Labs has launched Muse Spark, its first model since Mark Zuckerberg restructured the company's AI division. The multimodal model now powers Meta AI's app and website in the US, with rollout planned for WhatsApp, Instagram, Facebook, Messenger, and Meta's smart glasses in coming weeks.
Arcee AI releases Trinity-Large-Thinking: 398B sparse MoE model with chain-of-thought reasoning
Arcee AI released Trinity-Large-Thinking, a 398B-parameter sparse Mixture-of-Experts model with approximately 13B active parameters per token, post-trained with extended chain-of-thought reasoning for agentic workflows. The model achieves 94.7% on τ²-Bench, 91.9% on PinchBench, and 98.2% on LiveCodeBench, generating explicit reasoning traces in <think>...</think> blocks before producing responses.