Mistral's Leanstral code verification agent outperforms Claude Sonnet at 15% of the cost
Mistral has released Leanstral, a 120B-parameter code verification agent built with the Lean programming language, claiming it outperforms larger open-source models and offers significant cost advantages over Anthropic's Claude suite. The model achieves a pass@2 score of 26.3, beating Claude Sonnet by 2.6 points, while costing $36 to run compared to Sonnet's $549.
Mistral has released Leanstral, a 120-billion-parameter coding agent designed for formal code verification using the open-source Lean programming language. The release includes open weights under Apache 2.0 license, integration within Mistral Vibe, and a free API endpoint.
Performance Claims vs. Claude
On Mistral's internal FLTEval benchmark, a new and as-yet-unreleased evaluation framework for proof engineering, Leanstral-120B-A6B posts competitive scores while significantly undercutting Anthropic's pricing.
On pass@2, Leanstral scores 26.3, exceeding Claude Sonnet's 23.7 by 2.6 points, at a cost of $36 versus Sonnet's $549. On pass@16, Leanstral reaches 31.9, beating Sonnet by 8 points, at a cost of $290.
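For context on what these metrics mean: pass@k is the probability that at least one of k sampled attempts succeeds. Benchmarks typically compute it with the unbiased estimator introduced in the HumanEval paper; whether FLTEval follows the same protocol is not disclosed, so the sketch below is illustrative only.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n total samples of which
    c are correct, the probability that a random size-k subset
    contains at least one correct sample."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 16 samples drawn, 4 of them correct.
print(round(pass_at_k(16, 4, 2), 3))  # pass@2  -> 0.45
print(pass_at_k(16, 4, 16))           # pass@16 -> 1.0
```

Note that pass@16 is always at least pass@2 for the same model, which is why the 26.3 and 31.9 figures are not directly comparable to each other.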
Anthropic's Claude Opus 4.6 still scores higher at 39.6 on pass@16, though it costs $1,650 compared to Leanstral's $290, a 5.7x price premium for 7.7 additional points.
Mistral claims Leanstral outperforms several larger open-source competitors including GLM5-744B-A40B, Kimi-K2.5-1T-32B, and Qwen3.5-397B-A17B on FLTEval, despite having substantially fewer parameters than these models.
Formal Verification as a Solution
The core appeal of Leanstral addresses a fundamental limitation of AI code generation: the inability to reliably verify correctness without human review. By leveraging formal proof systems, Mistral argues that specifications, proofs, tests, and linting can ground AI agents in verifiable correctness, reducing the time-consuming need for human code review.
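To illustrate the distinction (this example is not from Mistral's materials): a specification proved in Lean holds for every possible input, whereas a test suite only checks the inputs it happens to exercise.

```lean
-- A proved specification covers all inputs, unlike a finite test suite:
-- reversing a list never changes its length.
theorem reverse_preserves_length {α : Type} (xs : List α) :
    xs.reverse.length = xs.length := by
  simp  -- closed by the core simp lemma List.length_reverse
```

This is the sense in which a proof "grounds" an agent: the Lean kernel either accepts the proof or rejects it, leaving no room for a plausible-looking but wrong answer.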
Mistral demonstrated this by deploying Leanstral against a real question from the Proof Assistant Stack Exchange involving a bug in Lean 4 code. According to the company, Leanstral successfully generated test code to reproduce the failure and correctly identified and fixed the underlying flaw.
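Mistral has not published the details of that exchange, but the reproduce-then-fix workflow it describes might look like the following hypothetical Lean 4 sketch, with an invented `sumTo` function standing in for the real bug.

```lean
-- Hypothetical buggy code: `List.range n` is [0, ..., n-1], so the
-- final term n is silently dropped.
def sumTo (n : Nat) : Nat :=
  (List.range n).foldl (· + ·) 0

#eval sumTo 3  -- reproduces the failure: returns 3, spec expects 6

-- The fix extends the range to include n itself.
def sumToFixed (n : Nat) : Nat :=
  (List.range (n + 1)).foldl (· + ·) 0

#guard sumToFixed 3 = 6  -- 0 + 1 + 2 + 3
```

The `#guard` command makes the regression test part of the file itself: compilation fails if the check ever stops holding.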
Broader Product Releases
Mistral simultaneously released Mistral Small 4, positioned as a unified model handling reasoning, coding, and instruction-following tasks without requiring users to switch between specialized models.
Critical Considerations
FLTEval has not been publicly released, making independent verification of these benchmarks impossible at present. Comparisons rest entirely on Mistral's claims. The FLTEval framework specifically targets formal proof engineering, a specialized domain not representative of the general code generation tasks where Claude excels. Pricing comparisons use Mistral's stated cost per pass attempt; actual deployment costs depend on context window usage, which is not disclosed.
Cost-per-token pricing for Leanstral is not specified in Mistral's announcement, preventing direct technical cost comparison.
What This Means
Mistral is positioning Leanstral as a specialized alternative for formal code verification and proof engineering—a narrower but increasingly important use case as organizations prioritize code correctness. The cost structure targets teams running multiple inference passes for verification purposes, where 15-20% of Claude's price becomes meaningful. However, the advantage is limited to formal verification workflows; general-purpose coding likely remains Claude's domain until independent FLTEval results emerge.