Mistral's Leanstral code verification agent outperforms Claude Sonnet at 15% of the cost

TL;DR

Mistral has released Leanstral, a 120B-parameter code verification agent built with the Lean programming language, claiming it outperforms larger open-source models and offers significant cost advantages over Anthropic's Claude suite. The model achieves a pass@2 score of 26.3—beating Claude Sonnet by 2.6 points—while costing $36 to run compared to Sonnet's $549.

Mistral has released Leanstral, a 120-billion-parameter coding agent designed for formal code verification using the open-source Lean programming language. The release includes open weights under the Apache 2.0 license, integration with Mistral Vibe, and a free API endpoint.

Performance Claims vs. Claude

According to Mistral's internal FLTEval benchmark—a new, as-yet-unreleased evaluation framework for proof engineering—Leanstral-120B-A6B delivers competitive performance while significantly undercutting Anthropic's pricing.

On pass@2, Leanstral scores 26.3, exceeding Claude Sonnet's 23.7 by 2.6 points, at a cost of $36 versus Sonnet's $549. On pass@16, Leanstral reaches 31.9, beating Sonnet by 8 points, at a cost of $290.

Anthropic's Claude Opus 4.6 still scores higher at 39.6 on pass@16, though it costs $1,650 compared to Leanstral's $290—representing a 5.7x price premium for 7.7 additional points.
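The price-to-performance trade-offs above reduce to simple ratios. The snippet below is an illustrative back-of-the-envelope calculation using the dollar figures and scores reported in Mistral's announcement; the derived multiples are our arithmetic, not part of the release.

```python
# Benchmark figures as reported in the article: (score, cost in USD)
# per evaluation setting. The ratios printed below are derived, not quoted.
runs = {
    "pass@2":  {"Leanstral": (26.3, 36),  "Claude Sonnet": (23.7, 549)},
    "pass@16": {"Leanstral": (31.9, 290), "Claude Opus 4.6": (39.6, 1650)},
}

for setting, models in runs.items():
    # Dicts preserve insertion order, so Leanstral is always the baseline.
    (name_a, (score_a, cost_a)), (name_b, (score_b, cost_b)) = models.items()
    print(f"{setting}: {name_b} costs {cost_b / cost_a:.1f}x {name_a} "
          f"for a score delta of {score_b - score_a:+.1f}")
```

Running this reproduces the article's 5.7x premium for Opus 4.6's 7.7 extra points at pass@16, and shows Sonnet costing roughly 15x Leanstral at pass@2 while scoring lower.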

Mistral claims Leanstral outperforms several larger open-source competitors including GLM5-744B-A40B, Kimi-K2.5-1T-32B, and Qwen3.5-397B-A17B on FLTEval, despite having substantially fewer parameters than these models.

Formal Verification as a Solution

The core appeal of Leanstral addresses a fundamental limitation of AI code generation: the inability to reliably verify correctness without human review. By leveraging formal proof systems, Mistral argues that specifications, proofs, tests, and linting can ground AI agents in verifiable correctness, reducing the need for time-consuming human code review.
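To make the idea concrete, here is a minimal Lean 4 sketch (our illustration, not from Mistral's release) of what "grounding in verifiable correctness" looks like: the theorem statement is the specification, and Lean's kernel machine-checks the proof, so no human review is needed to trust it.

```lean
-- Hypothetical example: a definition paired with a machine-checked
-- specification. If the proof compiles, the property holds.
def double (n : Nat) : Nat := n + n

-- Specification: `double n` equals `2 * n`. After unfolding the
-- definition, the `omega` tactic discharges the linear-arithmetic goal.
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double; omega
```

An agent working in this setting can propose both code and proof, and the proof checker—not a reviewer—decides whether the claim stands.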

Mistral demonstrated this by deploying Leanstral against a real question from the Proof Assistant Stack Exchange involving a bug in Lean 4 code. According to the company, Leanstral successfully generated test code to reproduce the failure and correctly identified and fixed the underlying flaw.

Broader Product Releases

Mistral simultaneously released Mistral Small 4, positioned as a unified model handling reasoning, coding, and instruction-following tasks without requiring users to switch between specialized models.

Critical Considerations

FLTEval has not been publicly released, making independent verification of these benchmarks impossible at present. Comparisons rest entirely on Mistral's claims. The FLTEval framework specifically targets formal proof engineering—a specialized domain not representative of general code generation tasks where Claude excels. Pricing comparisons use Mistral's stated cost per pass attempt; actual deployment costs depend on context window usage, which is not disclosed.

Cost-per-token pricing for Leanstral is not specified in Mistral's announcement, preventing direct technical cost comparison.

What This Means

Mistral is positioning Leanstral as a specialized alternative for formal code verification and proof engineering—a narrower but increasingly important use case as organizations prioritize code correctness. The cost structure targets teams running multiple inference passes for verification purposes, where 15-20% of Claude's price becomes meaningful. However, the advantage is limited to formal verification workflows; general-purpose coding likely remains Claude's domain until independent FLTEval results emerge.
