Mistral's Leanstral code verification agent outperforms Claude Sonnet at 15% of the cost

TL;DR

Mistral has released Leanstral, a 120B-parameter code verification agent for the Lean programming language, and claims it outperforms larger open-source models while costing far less to run than Anthropic's Claude models. On Mistral's own FLTEval benchmark, the model scores 26.3 at pass@2, 2.6 points ahead of Claude Sonnet, at a reported cost of $36 versus Sonnet's $549.

Mistral has released Leanstral, a 120-billion-parameter coding agent designed for formal code verification using the open-source Lean programming language. The release includes open weights under the Apache 2.0 license, integration with Mistral Vibe, and a free API endpoint.

Performance Claims vs. Claude

Mistral's figures come from FLTEval, its new internal evaluation framework for proof engineering, which remains unreleased. On that benchmark, Mistral reports that Leanstral-120B-A6B is competitive with Anthropic's models while costing far less to run.

At pass@2, Leanstral scores 26.3, exceeding Claude Sonnet's 23.7 by 2.6 points, at a cost of $36 versus Sonnet's $549. At pass@16, Leanstral scores 31.9, about 8 points ahead of Sonnet, at a cost of $290 against the $549 quoted for Sonnet.
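For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled attempts succeeds. The article does not say how FLTEval computes it, so the sketch below is only an assumption: it uses the unbiased estimator that is standard in code-generation evaluation, where `n` attempts are sampled per problem and `c` of them pass. The function name and parameters are illustrative, not taken from Mistral's announcement.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts of which c were correct.

    Uses 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    subset of k attempts contains no correct attempt at all.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 4 attempts sampled, 1 of them correct.
    print(pass_at_k(n=4, c=1, k=2))
```

With 4 attempts and 1 success, `pass_at_k(4, 1, 2)` is 0.5. A larger k can only raise the score for the same attempts, and it requires more inference, which is why the pass@16 run is quoted at a higher cost than the pass@2 run.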

Anthropic's Claude Opus 4.6 still scores higher at 39.6 on pass@16, though it costs $1,650 compared to Leanstral's $290—representing a 5.7x price premium for 7.7 additional points.
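The ratios behind these comparisons can be checked directly. The snippet below is a minimal sketch that uses only the scores and dollar figures quoted in this article; the dictionary keys and helper names are illustrative, and it assumes Sonnet's $549 and 23.7 describe the same run.

```python
# (FLTEval score, reported run cost in USD), as quoted in the article.
RESULTS = {
    "Leanstral pass@2": (26.3, 36),
    "Leanstral pass@16": (31.9, 290),
    "Claude Sonnet": (23.7, 549),
    "Claude Opus 4.6": (39.6, 1650),
}


def cost_ratio(a: str, b: str) -> float:
    """Cost of run `a` expressed as a multiple of the cost of run `b`."""
    return RESULTS[a][1] / RESULTS[b][1]


def score_gap(a: str, b: str) -> float:
    """Score of run `a` minus score of run `b`, rounded to one decimal."""
    return round(RESULTS[a][0] - RESULTS[b][0], 1)


if __name__ == "__main__":
    pairs = [
        ("Leanstral pass@2", "Claude Sonnet"),
        ("Leanstral pass@16", "Claude Sonnet"),
        ("Leanstral pass@16", "Claude Opus 4.6"),
    ]
    for a, b in pairs:
        print(f"{a} vs {b}: {cost_ratio(a, b):.1%} of the cost, "
              f"{score_gap(a, b):+.1f} points")
```

On these figures, Leanstral's pass@2 run costs about 6.6% of Sonnet's, and its pass@16 run costs about 17.6% of Opus 4.6's, the inverse of the 5.7x premium cited above.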

Mistral claims Leanstral outperforms several larger open-source competitors including GLM5-744B-A40B, Kimi-K2.5-1T-32B, and Qwen3.5-397B-A17B on FLTEval, despite having substantially fewer parameters than these models.

Formal Verification as a Solution

Leanstral targets a fundamental limitation of AI code generation: correctness cannot be reliably checked without human review. Mistral argues that formal proof systems, through specifications, proofs, tests, and linting, can ground AI agents in verifiable correctness and reduce the need for time-consuming human code review.

Mistral demonstrated this by deploying Leanstral against a real question from the Proof Assistant Stack Exchange involving a bug in Lean 4 code. According to the company, Leanstral successfully generated test code to reproduce the failure and correctly identified and fixed the underlying flaw.
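To make "verifiable correctness" concrete, here is a toy Lean 4 example, unrelated to the Stack Exchange question and not taken from Mistral's announcement. The function `double` and the theorem name are hypothetical, and the sketch assumes a recent Lean 4 toolchain in which the `omega` tactic is available without extra imports.

```lean
-- Hypothetical function under verification: doubles a natural number.
def double (n : Nat) : Nat := n + n

-- Specification: `double n` equals `2 * n` for every natural number `n`.
-- Lean accepts the theorem only if the proof checks completely.
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega

-- An ordinary test checks a single input; the theorem above covers all inputs.
example : double 21 = 42 := rfl
```

If `double` were changed to something incorrect, `double_spec` would fail to compile, so the checker rather than a human reviewer flags the error.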

Broader Product Releases

Mistral simultaneously released Mistral Small 4, positioned as a unified model handling reasoning, coding, and instruction-following tasks without requiring users to switch between specialized models.

Critical Considerations

FLTEval has not been publicly released, making independent verification of these benchmarks impossible at present. Comparisons rest entirely on Mistral's claims. The FLTEval framework specifically targets formal proof engineering—a specialized domain not representative of general code generation tasks where Claude excels. Pricing comparisons use Mistral's stated cost per pass attempt; actual deployment costs depend on context window usage, which is not disclosed.

Cost-per-token pricing for Leanstral is not specified in Mistral's announcement, preventing direct technical cost comparison.

What This Means

Mistral is positioning Leanstral as a specialized alternative for formal code verification and proof engineering, a narrower but increasingly important use case as organizations prioritize code correctness. The cost structure targets teams running multiple inference passes for verification, where the reported gap is meaningful: on Mistral's figures, about 7% of Sonnet's cost at pass@2 and about 18% of Opus 4.6's cost at pass@16. However, any advantage is limited to formal verification workflows and remains unverified until independent FLTEval results emerge; general-purpose coding likely remains Claude's domain.
