Augment Code's agent matches Claude Code quality at 33% lower cost on Opus 4.7

TL;DR

Augment Code benchmarked its Auggie agent against Claude Code on Claude Opus 4.7, reporting a 67.4% pass rate versus 66.3% while cutting costs by 33%. The company attributes savings to a semantic context engine that reduces cache read tokens by 32% and output tokens by 37% compared to Claude Code's keyword-based retrieval.

Augment Code benchmarked its Auggie CLI agent against Anthropic's Claude Code on Claude Opus 4.7, reporting a 67.4% pass rate compared to Claude Code's 66.3% on Terminal Bench 2.0, while reducing total cost by 33%.

Token efficiency drives cost reduction

According to Augment Code, the savings come from lower token usage in every category. On Terminal Bench 2.0 with Opus 4.7, Auggie consumed 367.6 million total tokens versus Claude Code's 543.1 million, a 32% reduction. Cache read tokens dropped 32%, output tokens fell 37%, and cache write tokens decreased 29%. Total cost per benchmark run was $463.04 for Auggie versus $694.50 for Claude Code.
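The headline percentages can be rechecked from the reported totals. The sketch below is only arithmetic on the figures quoted above; the helper name `pct_reduction` is made up for illustration and is not part of any Augment or Anthropic tooling.

```python
def pct_reduction(baseline: float, value: float) -> float:
    """Return how much lower `value` is than `baseline`, as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - value) / baseline * 100


# Figures as reported by Augment for Terminal Bench 2.0 on Opus 4.7.
CLAUDE_CODE_TOKENS = 543.1e6
AUGGIE_TOKENS = 367.6e6
CLAUDE_CODE_COST_USD = 694.50
AUGGIE_COST_USD = 463.04

token_reduction = pct_reduction(CLAUDE_CODE_TOKENS, AUGGIE_TOKENS)
cost_reduction = pct_reduction(CLAUDE_CODE_COST_USD, AUGGIE_COST_USD)

print(f"total tokens: {token_reduction:.1f}% lower")
print(f"total cost:   {cost_reduction:.1f}% lower")
```

Both values round to the 32% and 33% that Augment reports, so the published totals and percentages are internally consistent.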

The company ran the tests on a GCP n4-highcpu-16 VM using the Harbor framework with default settings, with five attempts per task and four tasks running in parallel.

SWE-Bench Pro results show similar pattern

On SWE-Bench Pro, Augment reports that Auggie achieved a higher pass rate than Claude Code while costing 23% less. Total tokens came to 1.65 billion for Auggie versus 2.35 billion for Claude Code, a 30% reduction. Cache reads dropped 30% and cache writes fell 17%. Total cost was $1,448.63 for Auggie versus $1,869.97 for Claude Code.

Semantic indexing versus keyword search

Augment Code attributes the efficiency gains to its Context Engine, which maintains a semantic index of codebases rather than using grep and keyword search. The company claims this approach reduces unnecessary file crawling and irrelevant code retrieval that require additional model turns.

"Most coding agents assemble context through grep and keyword search," according to Augment's blog post. "Agents burn turns crawling files, reading large spans of code, and pulling in irrelevant matches just to find the few lines that actually matter."

Model-agnostic approach and routing

Augment also tested Auggie with alternative models on Terminal Bench 2.0. According to the company, Auggie with GPT-4.5 (likely referring to an internal designation) achieved a 9.3% higher pass rate than the Claude Code baseline at 54% lower cost. Auggie with GPT-4.4 matched the baseline pass rate at 73% lower cost.

The company also introduced Prism, a model router that selects models per task. Augment claims Prism provides an additional 20-30% cost reduction on top of per-task efficiency gains.
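Augment does not say how the routing savings combine with the per-task savings. As an illustration only, the sketch below assumes the two reductions compound multiplicatively and uses the 33% Opus 4.7 figure as the per-task gain; both choices are assumptions, the function name `combined_reduction` is hypothetical, and the resulting numbers are not figures Augment reports.

```python
def combined_reduction(*reductions: float) -> float:
    """Combine fractional cost reductions, assuming each applies to what remains."""
    remaining = 1.0
    for r in reductions:
        if not 0.0 <= r < 1.0:
            raise ValueError("each reduction must be in [0, 1)")
        remaining *= 1.0 - r
    return 1.0 - remaining


PER_TASK_GAIN = 0.33              # assumption: reuse the reported Opus 4.7 saving
ROUTING_GAIN_RANGE = (0.20, 0.30)  # Prism range as claimed by Augment

low = combined_reduction(PER_TASK_GAIN, ROUTING_GAIN_RANGE[0])
high = combined_reduction(PER_TASK_GAIN, ROUTING_GAIN_RANGE[1])

print(f"illustrative combined saving: {low:.1%} to {high:.1%}")
```

Under a different reading, for example if the 20-30% is measured against the original baseline rather than the already reduced cost, the combined figure would differ.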

Internal testing on private repositories

On an internal evaluation suite of private repositories, Augment reports Claude Code passed 62 tasks at $6.49 per passing task ($402 total), while Auggie passed 61 tasks at $3.90 per passing task ($238 total).
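The per-task and total figures in this comparison can be cross-checked against each other. The snippet below is a minimal sketch using only the numbers quoted above; the variable and function names are illustrative, and small gaps against the reported totals are expected because the published values are rounded.

```python
def total_from_per_pass(cost_per_pass_usd: float, passed: int) -> float:
    """Reconstruct total spend from cost per passing task and pass count."""
    return cost_per_pass_usd * passed


# Figures as reported by Augment for its internal private-repository suite.
claude_total = total_from_per_pass(6.49, 62)
auggie_total = total_from_per_pass(3.90, 61)

# Relative saving per passing task, from the reported per-pass costs.
per_pass_saving = (6.49 - 3.90) / 6.49

print(f"Claude Code total: ${claude_total:.2f} (reported: $402)")
print(f"Auggie total:      ${auggie_total:.2f} (reported: $238)")
print(f"saving per passing task: {per_pass_saving:.1%}")
```

On these figures Auggie passed one fewer task but spent roughly 40% less per passing task.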

What this means

This benchmark is a vendor-run comparison rather than independent third-party testing. The 1.1 percentage point quality difference falls within normal benchmark variance, making cost the primary differentiator. If verified independently, the token efficiency gains would be significant for organizations running coding agents at scale. However, the results reflect Augment's optimized retrieval system paired with the same underlying model, not a fundamentally different AI capability. The model routing approach could provide additional savings, though sending different tasks to different models introduces complexity that may affect consistency.
