GLM-5.1 released: 754B agentic model outperforms Claude on coding benchmarks
Zhipu AI released GLM-5.1, a 754-billion-parameter model optimized for agentic engineering tasks. The model scores 58.4% on SWE-Bench Pro, outperforming Claude 3.5 Sonnet (57.3%), and demonstrates sustained reasoning over hundreds of iterations.
Key Performance Metrics
GLM-5.1 achieves state-of-the-art performance across multiple agentic benchmarks:
- SWE-Bench Pro: 58.4% (vs. Claude 57.3%, Gemini 3.1 Pro 54.2%)
- NL2Repo (repository generation): 42.7%, trailing Claude (49.8%) but up sharply from GLM-5's 35.9%
- Terminal-Bench 2.0: 63.5% on Terminus-2 suite
- CyberGym: 68.7% (vs. Claude 66.6%)
- BrowseComp with context management: 79.3% (vs. Gemini 84.0%, Claude 75.9%)
Mathematical and scientific reasoning is more mixed: GLM-5.1 scores 95.3% on AIME 2026 and 86.2% on GPQA-Diamond, trailing GPT-5.4 (98.7% on AIME) and Gemini 3.1 Pro (94.3% on GPQA).
Distinctive Agentic Capability
Unlike earlier models, including GLM-5, whose effectiveness plateaus after initial progress on a task, GLM-5.1 is designed to sustain performance over extended problem-solving horizons. According to the developers, the model handles ambiguous problems with improved judgment and stays productive across long sessions: it breaks complex tasks into experiments, reads results, identifies blockers, and revises its strategy over hundreds of iterations and thousands of tool calls.
This iterative reasoning approach distinguishes it from models optimized for single-pass performance.
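The loop the developers describe can be sketched as a minimal plan-act-observe cycle. The model and tool interfaces below are stand-in stubs for illustration, not GLM-5.1's actual API:

```python
# Hypothetical sketch of the iterative agentic loop: plan an experiment,
# run a tool, read the result, and revise until the task is solved or
# the iteration budget is exhausted.
def run_agent(task, call_model, run_tool, max_iters=500):
    history = [("task", task)]
    for _ in range(max_iters):
        action = call_model(history)      # model decides the next experiment
        if action["type"] == "finish":
            return action["result"]       # model judges the task complete
        observation = run_tool(action)    # execute one tool call
        # the full history (including blockers) is fed back each iteration,
        # which is what lets the model revise a failing strategy
        history.append((action["type"], observation))
    return None
```

The key design point is that the entire trajectory, not just the last observation, conditions each decision, so a dead end early in the session can still trigger a tactic change hundreds of steps later.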
Deployment and Quantization
Unsloth released GGUF quantized versions with 17 quant options ranging from 206 GB (1-bit UD-IQ1_M) to 1.51 TB (16-bit BF16). The releases implement Unsloth Dynamic 2.0 quantization, which the developers claim achieves superior accuracy compared to other quantization methods.
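The published sizes are consistent with back-of-envelope size ≈ parameters × bits-per-weight / 8 arithmetic (decimal GB/TB, as the release lists them):

```python
# Sanity check on the quantized file sizes for a 754B-parameter model.
PARAMS = 754e9  # 754 billion weights

def size_gb(bits_per_weight):
    """Approximate file size in decimal GB: params * bpw / 8 bytes."""
    return PARAMS * bits_per_weight / 8 / 1e9

# 16-bit BF16: 754e9 * 16 / 8 bytes = 1508 GB, i.e. about 1.51 TB,
# matching the largest release
bf16_tb = size_gb(16) / 1000

# The 206 GB "1-bit" UD-IQ1_M works out to roughly 2.2 effective bits
# per weight, since sub-2-bit quants keep some tensors at higher precision
effective_bpw = 206e9 * 8 / PARAMS
```

This is why "1-bit" GGUF files are larger than a literal one bit per weight would suggest: embedding and attention tensors are typically stored at higher precision than the quant's nominal bit width.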
Supported inference frameworks include:
- SGLang (v0.5.10+)
- vLLM (v0.19.0+)
- xLLM (v0.8.0+)
- Transformers (v4.5.3+)
- KTransformers (v0.5.3+)
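SGLang and vLLM both expose an OpenAI-compatible chat-completions endpoint once a model is served, so the request body is the same regardless of backend. A minimal sketch, where the model identifier "GLM-5.1" and the local port are assumptions to check against the model card:

```python
import json

# Build an OpenAI-compatible /v1/chat/completions request body.
# The model id "GLM-5.1" is an assumption, not a confirmed identifier.
payload = {
    "model": "GLM-5.1",
    "messages": [
        {"role": "user", "content": "Run the test suite and fix the first failure."}
    ],
    "temperature": 0.6,
    "max_tokens": 1024,
}
# POST this to e.g. http://localhost:8000/v1/chat/completions
body = json.dumps(payload)
```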
The model received 13,329 downloads on Hugging Face in its first month.
Availability
GLM-5.1 is available for inference through the Z.ai API Platform. The developers announced that access via chat.z.ai would follow in the coming days. A technical report and a GitHub repository were published alongside the release.
What This Means
GLM-5.1 represents a shift in agentic model design: instead of pursuing raw benchmark scores on isolated tasks, the focus is extended-horizon reasoning and iterative refinement. Its SWE-Bench Pro lead over Claude positions it as the strongest open-access model for software engineering tasks, though Gemini 3.1 Pro and GPT-5.4 maintain mathematical reasoning advantages. The quantized GGUF versions enable local deployment at scale, with memory requirements scaling from 206 GB to 1.51 TB depending on precision needs.
Related Articles
Zhipu AI's GLM-5.1 outperforms GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro through iterative strategy refinement
Zhipu AI has released GLM-5.1, a freely available open-weight model designed for long-running programming tasks that achieves 58.4% on SWE-Bench Pro, edging out GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%). The model's core capability is iterative strategy refinement—it rethinks its approach across hundreds of iterations and thousands of tool calls, recognizing dead ends and shifting tactics without human intervention. However, GLM-5.1 trails on reasoning and knowledge benchmarks, scoring 31% on Humanity's Last Exam compared to Gemini 3.1 Pro's 45%.
Alibaba's Qwen3.6 Plus reaches 78.8 on SWE-bench with 1M context window
Alibaba released Qwen3.6 Plus on April 2, 2026, featuring a 1 million token context window at $0.50 per million input tokens and $3 per million output tokens. The model combines linear attention with sparse mixture-of-experts routing to achieve a 78.8 score on SWE-bench Verified, with significant improvements in agentic coding, front-end development, and reasoning tasks.
Z.ai releases GLM-5.1, 754B parameter open-weight model with improved code generation
Z.ai has released GLM-5.1, a 754-billion parameter open-weight model matching the size of its predecessor GLM-5. The model demonstrates improved ability to generate complex, multi-part outputs like HTML pages with SVG graphics and CSS animations, available via Hugging Face and OpenRouter.
Meta AI app jumps to No. 5 on App Store following Muse Spark launch
Meta's AI app surged from No. 57 to No. 5 on the U.S. App Store within 24 hours of launching Muse Spark, Meta's new multimodal AI model. The model accepts voice, text, and image inputs and features reasoning capabilities for science and math tasks, visual coding, and multi-agent functionality.