Zhipu's GLM-5.2 matches Anthropic's Claude Opus 4.8 on agentic benchmark at one-fifth the cost

TL;DR

Zhipu AI's open-source GLM-5.2 model scores within one percentage point of Anthropic's Claude Opus 4.8 on a key agentic benchmark while costing approximately one-fifth as much. The release comes as U.S. government restrictions limit access to Anthropic's Fable and OpenAI's GPT-5.6 models.

June 26, 2026 · 10:20 PM · 2 min read

Chinese open-source model challenges frontier labs on price-performance

Zhipu AI's GLM-5.2 scores within one percentage point of Anthropic's Claude Opus 4.8 on a key agentic benchmark while costing roughly one-fifth as much, according to the Chinese AI startup. The open-source model, released last week, has surpassed all other open releases on that closely watched benchmark.

OpenRouter token traffic for GLM-5.2 is climbing faster than DeepSeek's did after its V4 launch in April, suggesting rapid developer adoption. Unlike DeepSeek, which focused primarily on chat applications, GLM-5.2 demonstrates strength in agentic tasks including planning, coding, testing, and task looping, the capabilities enterprises are prioritizing for automation.

Enterprise cost pressures drive adoption

"I've been consistently surprised by how quickly the open source has caught up," Gabe Pereyra, co-founder of legal AI company Harvey, told CNBC. "GLM-5.2, you're seeing the first model where it's really competitive with some of these closed-source frontier models."

As token spend strains enterprise AI budgets, "intelligence per dollar" is emerging as the critical metric. GLM-5.2's combination of competitive performance and significantly lower costs addresses this pressure directly. The model is free to download, fine-tune, and deploy on enterprise infrastructure, which removes recurring per-token API fees.
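To make the "intelligence per dollar" idea concrete, here is a minimal sketch of how such a ratio might be computed. The scores below are placeholders, because the report discloses neither the benchmark nor exact numbers. Only the one-point gap and the roughly one-fifth cost ratio come from the reporting. The names `ModelResult`, `score_per_cost`, and `advantage` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    benchmark_score: float  # percent of benchmark tasks passed, 0-100
    relative_cost: float    # cost normalized so the baseline model is 1.0


def score_per_cost(result: ModelResult) -> float:
    """Benchmark points obtained per unit of normalized cost."""
    if result.relative_cost <= 0:
        raise ValueError("relative_cost must be positive")
    return result.benchmark_score / result.relative_cost


def advantage(candidate: ModelResult, baseline: ModelResult) -> float:
    """How many times more score-per-cost the candidate delivers."""
    return score_per_cost(candidate) / score_per_cost(baseline)


# Placeholder scores (NOT reported figures): only the one-point gap and the
# roughly one-fifth cost ratio are taken from the article.
closed = ModelResult("closed frontier model", benchmark_score=70.0, relative_cost=1.0)
open_weights = ModelResult("open model", benchmark_score=69.0, relative_cost=0.2)

print(f"{advantage(open_weights, closed):.2f}x score per unit cost")
```

Under these assumed numbers the open model delivers a little under five times the score per unit of cost. The result is driven almost entirely by the cost ratio, so it is only as reliable as the undisclosed pricing behind the "one-fifth" claim.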

U.S. restrictions create opening for open source

The timing coincides with increased U.S. government oversight of frontier models. Anthropic pulled Fable, its Mythos-class model, following a Trump administration order. OpenAI announced Friday that it is limiting access to GPT-5.6 models "at the request of the U.S. government."

These restrictions make models that "no one can revoke" more attractive to enterprises concerned about deployment stability, according to the report. Self-hosted open-source models remove the dependency on a third-party provider's API access and on that provider's regulatory obligations.

Specific benchmark numbers not disclosed

While the CNBC report states that GLM-5.2 lands "within a percentage point" of Claude Opus 4.8 on "a key agentic benchmark," it discloses neither the benchmark's name nor the exact scores. The pricing comparison, "roughly a fifth of the cost," also lacks the per-token figures needed for verification.

Zhipu AI, based in Beijing, has been building foundation models since 2019. The company previously released GLM-4 and other models in its series.

What this means

If verified, GLM-5.2's performance represents a significant compression in the gap between open-source and frontier closed models, particularly for agentic workflows. The combination of regulatory uncertainty around U.S. models and budget pressures on token spend could accelerate enterprise adoption of open-source alternatives. However, the lack of disclosed benchmark specifics makes it difficult to independently verify the claimed parity with Claude Opus 4.8. The shift toward "intelligence per dollar" as the primary evaluation metric reflects a maturing market moving beyond pure capability races.

Source: cnbc.com

