Zhipu AI's GLM-5.1 outperforms GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro through iterative strategy refinement
Zhipu AI has released GLM-5.1, a freely available open-weight model designed for long-running programming tasks that achieves 58.4% on SWE-Bench Pro, edging out GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%). The model's core capability is iterative strategy refinement—it rethinks its approach across hundreds of iterations and thousands of tool calls, recognizing dead ends and shifting tactics without human intervention. However, GLM-5.1 trails on reasoning and knowledge benchmarks, scoring 31% on Humanity's Last Exam compared to Gemini 3.1 Pro's 45%.
Zhipu AI Releases GLM-5.1 with Iterative Long-Horizon Coding Capabilities
Zhipu AI has introduced GLM-5.1, an open-weight model available under MIT license on Hugging Face and ModelScope. The model is purpose-built for extended programming tasks where iterative strategy refinement, rather than raw parameter count, determines success on complex problems.
SWE-Bench Pro Leadership
On the SWE-Bench Pro software engineering benchmark, GLM-5.1 scores 58.4%, the highest result in Zhipu AI's comparison and the best score for a freely available model. It narrowly edges out two proprietary systems:
- GPT-5.4: 57.7%
- Claude Opus 4.6: 57.3%
On CyberGym (cybersecurity), GLM-5.1 leads with 68.7%, though Zhipu AI notes that Gemini 3.1 Pro and GPT-5.4 refused some tasks for safety reasons, potentially affecting their scores.
Iterative Refinement Across Hundreds of Rounds
The model's defining feature is its ability to repeatedly review and revise its own strategy without external guidance. Zhipu AI demonstrates this through three internal evaluations:
Vector Database Optimization: GLM-5.1 raised query throughput from 3,547 queries per second, Claude Opus 4.6's result on the same task, to 21,500 queries per second, a 6.1x improvement. This required more than 600 iterations and 6,000 tool calls, during which the model initiated six major structural shifts, including:
- Iteration 90: Switched from exhaustive search to clustering
- Iteration 240: Introduced two-stage pipeline with pre-sorting and filtering
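Zhipu AI has not published the code the model produced. The two shifts above describe a standard approximate-nearest-neighbor pattern (cluster-based indexing with a pre-sort-then-filter pipeline, as in IVF indexes), which can be sketched as follows. All names, sizes, and parameters here are illustrative, not taken from the actual system:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 32)).astype(np.float32)  # toy vector database
query = rng.normal(size=32).astype(np.float32)

def exhaustive_nn(q, vecs):
    # Baseline strategy: compare the query against every stored vector.
    dists = np.linalg.norm(vecs - q, axis=1)
    return int(np.argmin(dists))

def build_clusters(vecs, k=16, iters=10):
    # Shift 1 (clustering): a tiny k-means partitions the database so that
    # queries only need to touch a few partitions.
    centroids = vecs[:k].copy()
    for _ in range(iters):
        assign = np.argmin(
            np.linalg.norm(vecs[:, None] - centroids[None], axis=2), axis=1
        )
        for c in range(k):
            members = vecs[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def clustered_nn(q, vecs, centroids, assign, nprobe=4):
    # Shift 2 (two-stage pipeline): stage one pre-sorts clusters by centroid
    # distance and filters down to the nprobe closest; stage two runs the
    # exhaustive comparison only inside those clusters.
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    candidates = np.where(np.isin(assign, order))[0]
    dists = np.linalg.norm(vecs[candidates] - q, axis=1)
    return int(candidates[np.argmin(dists)])
```

The throughput gain comes from the filter step: with 16 clusters and 4 probed, stage two scans roughly a quarter of the database per query, at the cost of occasionally missing the true nearest neighbor.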
GPU Optimization: On KernelBench Level 3, GLM-5.1 achieved a 3.6x speedup over baseline ML code, versus Claude Opus 4.6's 4.2x. The model sustained progress for longer than its predecessor GLM-5 but still trails the strongest competitor.
Linux Desktop Construction: When tasked with building a complete Linux desktop environment from a single prompt, GLM-5.1 delivered a functional system with file browser, terminal, text editor, system monitor, calculator, and games after eight hours of iterative refinement.
Mixed Results on Reasoning Tasks
GLM-5.1 shows clear weaknesses in non-coding domains:
- Humanity's Last Exam (knowledge): 31% (vs. Gemini 3.1 Pro: 45%, GPT-5.4: 39.8%)
- GPQA-Diamond (scientific reasoning): 86.2% (vs. Gemini 3.1 Pro: 94.3%, GPT-5.4: 92%)
- Vending Bench 2 (agent business simulation): $5,634 balance (vs. Claude Opus 4.6: $8,018)
- NL2Repo (repository generation): 42.7% (vs. Claude Opus 4.6: 49.8%)
On the Artificial Analysis Intelligence Index, GLM-5.1 ranks just behind Claude Sonnet 4.6.
Acknowledged Limitations
Zhipu AI openly identifies remaining challenges: the model needs to recognize dead ends sooner, maintain coherence across thousands of tool calls, and reliably self-assess performance on tasks without clear success metrics. The company explicitly describes GLM-5.1 as a "first step."
Availability and Integration
GLM-5.1 is accessible via:
- Hugging Face and ModelScope repositories
- api.z.ai and BigModel.cn API platforms
- Z.ai chat interface (launching in coming days)
- Local deployment via vLLM and SGLang inference frameworks
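For local deployment, a vLLM launch would look roughly like the following. The Hugging Face repository ID and the parallelism setting are assumptions; check the official model card for the exact name and hardware requirements:

```shell
# Install vLLM and serve the model behind an OpenAI-compatible API.
pip install vllm

# Repository ID is a guess -- verify against the official Hugging Face page.
vllm serve zai-org/GLM-5.1 --tensor-parallel-size 8

# Query the local server like any OpenAI-compatible endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "zai-org/GLM-5.1",
       "messages": [{"role": "user", "content": "Refactor this function."}]}'
```

The OpenAI-compatible endpoint is what lets coding agents treat the local model as a drop-in backend.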
The model integrates with coding agents including Claude Code and OpenClaw.
Market Context
GLM-5.1 represents Zhipu AI's expansion into autonomous coding—the company previously released GLM-5 (744B parameters) in February 2026 and GLM-5V-Turbo (multimodal coding) more recently. Competitors include Moonshot AI's Kimi K2.5 and Alibaba's Qwen3.5, both targeting the same agent-based coding market.
What This Means
GLM-5.1 demonstrates that extended iteration on coding tasks can outperform single-pass approaches from larger proprietary models—but only in specialized domains. The benchmarks reveal a critical trade-off: models optimized for iterative refinement on engineering problems lose generality on reasoning and knowledge tasks. The lack of independent verification for the three internal demonstrations leaves claims about iteration counts and strategy shifts unconfirmed. For coding-specific workloads, the open-weight availability and MIT license make GLM-5.1 worth evaluating; for general-purpose reasoning, leading proprietary models retain clear advantages.
Related Articles
Alibaba's Qwen3.6 Plus reaches 78.8 on SWE-bench with 1M context window
Alibaba released Qwen3.6 Plus on April 2, 2026, featuring a 1 million token context window at $0.50 per million input tokens and $3 per million output tokens. The model combines linear attention with sparse mixture-of-experts routing to achieve a 78.8 score on SWE-bench Verified, with significant improvements in agentic coding, front-end development, and reasoning tasks.
Meta launches Muse Spark model with private API preview and 16 integrated tools
Meta announced Muse Spark today, its first model release since Llama 4 a year ago. The hosted model is available in private API preview and on meta.ai with Instant and Thinking modes, benchmarking competitively against Anthropic's Opus 4.6 and Google's Gemini 3.1 Pro, though behind on Terminal-Bench 2.0.
Meta launches Muse Spark, its first model from revamped AI labs
Meta Superintelligence Labs has launched Muse Spark, its first model since Mark Zuckerberg restructured the company's AI division. The multimodal model now powers Meta AI's app and website in the US, with rollout planned for WhatsApp, Instagram, Facebook, Messenger, and Meta's smart glasses in coming weeks.
Arcee AI releases Trinity-Large-Thinking: 398B sparse MoE model with chain-of-thought reasoning
Arcee AI released Trinity-Large-Thinking, a 398B-parameter sparse Mixture-of-Experts model with approximately 13B active parameters per token, post-trained with extended chain-of-thought reasoning for agentic workflows. The model achieves 94.7% on τ²-Bench, 91.9% on PinchBench, and 98.2% on LiveCodeBench, generating explicit reasoning traces in <think>...</think> blocks before producing responses.