
Augment Code's agent matches Claude Code quality at 33% lower cost on Opus 4.7

TL;DR

Augment Code benchmarked its Auggie agent against Claude Code on Claude Opus 4.7, reporting a 67.4% pass rate versus 66.3% while cutting costs by 33%. The company attributes savings to a semantic context engine that reduces cache read tokens by 32% and output tokens by 37% compared to Claude Code's keyword-based retrieval.

Augment Code benchmarked its Auggie CLI agent against Anthropic's Claude Code on Claude Opus 4.7, reporting a 67.4% pass rate compared to Claude Code's 66.3% on Terminal Bench 2.0, while reducing total cost by 33%.

Token efficiency drives cost reduction

According to Augment Code, the savings come from reduced token usage across all categories. On Terminal Bench 2.0 using Opus 4.7, Auggie consumed 367.6 million total tokens versus Claude Code's 543.1 million — a 32% reduction. Cache read tokens dropped 32%, output tokens fell 37%, and cache write tokens decreased 29%. Total cost per benchmark run: $463.04 for Auggie versus $694.50 for Claude Code.
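As a sanity check, the headline percentages follow directly from the reported totals. The snippet below recomputes them using only the figures quoted above; the helper name `pct_reduction` is our own:

```python
def pct_reduction(baseline, value):
    """Percent reduction of `value` relative to `baseline`, rounded to whole percent."""
    return round((baseline - value) / baseline * 100)

# Terminal Bench 2.0 totals as reported by Augment Code
tokens = pct_reduction(543.1e6, 367.6e6)  # Claude Code vs Auggie total tokens
cost = pct_reduction(694.50, 463.04)      # total cost per benchmark run, USD
print(tokens, cost)  # → 32 33
```

Both rounded values match the article's claimed 32% token and 33% cost reductions.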

The company ran tests on a GCP n4-highcpu-16 VM with five attempts per task and four parallel tasks, using the Harbor framework with default settings.

SWE-Bench Pro results show a similar pattern

On SWE-Bench Pro, Augment reports Auggie achieved a higher pass rate than Claude Code while costing 23% less per task. Total tokens: 1.65 billion for Auggie versus 2.35 billion for Claude Code, a 30% reduction. Cache reads dropped 30%, cache writes fell 17%. Total cost: $1,448.63 for Auggie versus $1,869.97 for Claude Code.

Semantic indexing versus keyword search

Augment Code attributes the efficiency gains to its Context Engine, which maintains a semantic index of codebases rather than using grep and keyword search. The company claims this approach reduces unnecessary file crawling and irrelevant code retrieval that require additional model turns.

"Most coding agents assemble context through grep and keyword search," according to Augment's blog post. "Agents burn turns crawling files, reading large spans of code, and pulling in irrelevant matches just to find the few lines that actually matter."
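To illustrate the distinction Augment is drawing, here is a toy contrast between literal keyword matching and similarity-based ranking. This is purely illustrative: the file names, snippets, and bag-of-words "index" below are invented stand-ins, not Augment's Context Engine.

```python
from collections import Counter
import math

# Invented example codebase: three files with one-line descriptions
SNIPPETS = {
    "auth/session.py": "def refresh_token(session): renew the login credential",
    "utils/strings.py": "def tokenize(text): split text on whitespace",
    "billing/invoice.py": "def charge_card(token): submit payment token",
}

def keyword_hits(query_word):
    # grep-style retrieval: every file containing the literal string matches
    return [path for path, text in SNIPPETS.items() if query_word in text]

def cosine(a, b):
    # cosine similarity between two word-count vectors
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_rank(query):
    # crude stand-in for a semantic index: rank files by word-overlap similarity
    q = Counter(query.lower().split())
    scored = ((cosine(q, Counter(t.lower().split())), p) for p, t in SNIPPETS.items())
    return [p for score, p in sorted(scored, reverse=True) if score > 0]

print(keyword_hits("token"))   # matches all three files, two of them irrelevant
print(semantic_rank("renew login credential"))  # only the auth file ranks
```

The grep-style query for "token" pulls in every file, which in an agent loop means extra reads and extra turns; the similarity ranking surfaces only the relevant file. A real semantic index would use learned embeddings rather than word counts, but the noise-reduction argument is the same.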

Model-agnostic approach and routing

Augment tested Auggie with alternative models on Terminal Bench 2.0. According to the company, Auggie with GPT-4.5 (likely an internal designation) achieved a 9.3% higher pass rate than the Claude Code baseline at 54% lower cost, while Auggie with GPT-4.4 matched the baseline pass rate at 73% lower cost.

The company also introduced Prism, a model router that selects models per task. Augment claims Prism provides an additional 20-30% cost reduction on top of per-task efficiency gains.

Internal testing on private repositories

On an internal evaluation suite of private repositories, Augment reports Claude Code passed 62 tasks at $6.49 per passing task ($402 total), while Auggie passed 61 tasks at $3.90 per passing task ($238 total).
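The reported per-task figures reconcile with the rounded totals. A quick check, with `total_spend` as our own helper:

```python
def total_spend(tasks_passed, cost_per_passing_task):
    """Rounded total spend: tasks passed x cost per passing task."""
    return round(tasks_passed * cost_per_passing_task)

print(total_spend(62, 6.49))  # Claude Code: 402
print(total_spend(61, 3.90))  # Auggie: 238
```

Both results match the $402 and $238 totals Augment reports.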

What this means

This benchmark represents a vendor-run comparison rather than independent third-party testing. The 1.1 percentage point quality difference falls within normal benchmark variance, making cost the primary differentiator. If verified independently, the token efficiency gains would be significant for organizations running coding agents at scale. However, the results reflect Augment's optimized retrieval system paired with the same underlying model, not a fundamentally different AI capability. The model routing approach could provide additional savings, though switching between models mid-task introduces complexity that may affect consistency.
