Mistral Releases Leanstral 1.5: Model with 6B Active Parameters Achieves 100% on miniF2F, Solves 587/672 PutnamBench Problems

TL;DR

Mistral AI released Leanstral 1.5, a free, Apache-2.0-licensed model with 119B total parameters and 6B active parameters, specialized for formal verification in Lean 4. The model achieves 100% on the miniF2F benchmark, solves 587 of 672 PutnamBench problems at approximately $4 per problem (versus an estimated $300+ for Seed-Prover 1.5), and reaches what Mistral AI describes as state-of-the-art results of 87% on FATE-H and 34% on FATE-X.

Benchmark Performance

Leanstral 1.5 saturates the miniF2F benchmark, achieving 100% on both the validation and test sets. On PutnamBench, the model solves 587 of 672 problems from the Putnam Mathematical Competition, 7 more than Seed-Prover 1.5, at far lower cost: about $4 per problem versus an estimated $300+ for Seed-Prover's high setting, which uses a budget of 10 H20-days per problem.

On graduate and PhD-level abstract algebra benchmarks, Leanstral 1.5 achieves 87% on FATE-H and 34% on FATE-X, which Mistral AI says are new state-of-the-art results. On FLTEval, which is based on real pull requests from the Fermat's Last Theorem repository, the model reaches 28.9% pass@1 (up from 21.9%) and 43.2% pass@8 (up from 31.9%). According to the company, the pass@8 score surpasses Claude Opus 4.6's 39.6% at one-seventh the cost.
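The article does not say how Mistral AI computes pass@1 and pass@8. As a point of reference, the sketch below shows the commonly used unbiased pass@k estimator, which assumes n independent attempts per problem of which c succeed. The function name `pass_at_k` and the sample numbers are illustrative assumptions, not values from the release.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled attempts succeeds.

    n: attempts generated for a problem
    c: attempts among those n that were verified correct
    k: number of attempts the metric allows (k <= n)
    """
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    # 1 minus the probability that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 8 attempts, 2 of them correct.
single_attempt = pass_at_k(n=8, c=2, k=1)
eight_attempts = pass_at_k(n=8, c=2, k=8)
print(single_attempt, eight_attempts)
```

A benchmark score is then the mean of this estimate over all problems, which is why pass@8 can sit well above pass@1 for the same model.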

Training and Architecture

The model underwent three training stages: mid-training, supervised fine-tuning, and reinforcement learning with CISPO. Training involved two RL environments: a multiturn environment where the model proves or disproves theorem statements through iterative compiler feedback, and a code agent environment where it operates like a developer in a filesystem, editing files, running bash commands, and using the Lean language server.
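The multiturn environment can be pictured as a loop in which each failed attempt feeds compiler output back into the next one. The sketch below is a simplified illustration only: `prove_with_feedback`, `CheckResult`, and the toy `propose`/`check` stand-ins are hypothetical names, it covers proving but not disproving, and it does not call a real model or a real Lean toolchain.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    ok: bool
    message: str  # compiler output; empty when the proof checks


def prove_with_feedback(
    statement: str,
    propose: Callable[[str, List[str]], str],
    check: Callable[[str], CheckResult],
    max_turns: int = 4,
) -> Optional[str]:
    """Return the first candidate proof that checks, or None if turns run out."""
    feedback: List[str] = []
    for _ in range(max_turns):
        candidate = propose(statement, feedback)  # model sees all prior errors
        result = check(candidate)                 # compiler verdict
        if result.ok:
            return candidate
        feedback.append(result.message)
    return None


# Toy stand-ins so the sketch runs without a model or Lean installed.
def toy_propose(statement: str, feedback: List[str]) -> str:
    return "by decide" if feedback else "sorry"


def toy_check(candidate: str) -> CheckResult:
    if candidate == "by decide":
        return CheckResult(True, "")
    return CheckResult(False, "declaration uses 'sorry'")


proof = prove_with_feedback("theorem demo : 2 + 2 = 4", toy_propose, toy_check)
print(proof)
```

The key design point is that the verifier, not the model, decides success, so a reward signal for RL can be derived from the checker's verdict alone.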

Test-Time Scaling

Mistral AI reports that Leanstral 1.5 demonstrates strong test-time scaling on PutnamBench. Under pass@8 evaluation, the number of problems solved rises from 44 at 50k tokens per attempt to 244 at 200k, 493 at 1M, and 587 at 4M tokens. One AVL-tree proof ran for over 2.7 million tokens across 22 compactions.
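To put the scaling figures in proportion, the short calculation below converts the reported solve counts into percentages of the 672 PutnamBench problems and shows the gain at each budget step. It uses only numbers quoted in the article; the variable names are illustrative.

```python
TOTAL_PROBLEMS = 672

# Problems solved at each per-attempt token budget, as reported.
solved_by_budget = {"50k": 44, "200k": 244, "1M": 493, "4M": 587}

# Share of the benchmark solved at each budget, in percent.
solve_rate = {
    budget: round(100 * solved / TOTAL_PROBLEMS, 1)
    for budget, solved in solved_by_budget.items()
}

# Additional problems solved when moving up one budget step.
counts = list(solved_by_budget.values())
gains = [later - earlier for earlier, later in zip(counts, counts[1:])]

for budget, rate in solve_rate.items():
    print(f"{budget:>4} tokens: {solved_by_budget[budget]:>3} solved ({rate}%)")
print("gain per step:", gains)
```

The largest absolute jump comes between 200k and 1M tokens, while the final 4x increase in budget still adds 94 problems.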

Code Verification Capabilities

While primarily trained for mathematics, Leanstral 1.5 verified time complexity guarantees for AVL tree implementations, proving O(log n) insertion and deletion in a run that used over 2.7 million tokens of reasoning. In an automated bug-finding pipeline run on 57 repositories, the model flagged 47 violated properties, of which 11 were genuine bugs; 5 of those had not previously been reported on GitHub. One of the discovered bugs was an overflow issue in the sign function used for zigzag decoding in the datrs/varinteger library.
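The article does not show the faulty code, so the following is only an illustration of this class of bug, not the datrs/varinteger implementation. It simulates 64-bit arithmetic in Python, implements the standard zigzag encoding, and contrasts a correct decoder with a hypothetical naive one (`naive_decode`, an invented example) whose `u + 1` step wraps around at the largest input. A round-trip property of this kind is the sort of statement a verifier can check for every input rather than a sampled few.

```python
MASK64 = (1 << 64) - 1  # simulate unsigned 64-bit wraparound


def to_signed64(x: int) -> int:
    """Reinterpret the low 64 bits of x as a signed 64-bit integer."""
    x &= MASK64
    return x - (1 << 64) if x >= (1 << 63) else x


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(u: int) -> int:
    """Standard decoder: shift out the sign bit, then flip bits if it was set."""
    return to_signed64((u >> 1) ^ -(u & 1))


def naive_decode(u: int) -> int:
    """Hypothetical buggy decoder: u + 1 overflows when u is the max value."""
    if u & 1:
        return -(((u + 1) & MASK64) >> 1)  # wraps to 0 at u = 2**64 - 1
    return u >> 1


I64_MIN, I64_MAX = -(1 << 63), (1 << 63) - 1
edge_cases = [0, -1, 1, I64_MIN, I64_MAX]

round_trip_ok = all(zigzag_decode(zigzag_encode(n)) == n for n in edge_cases)
naive_at_min = naive_decode(zigzag_encode(I64_MIN))
print(round_trip_ok, naive_at_min)
```

The naive decoder agrees with the standard one on small inputs and fails only at the extreme value, which is the kind of edge case random or example-based tests rarely hit.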

Availability

Leanstral 1.5 is available under the Apache-2.0 license on Hugging Face and as a free API endpoint identified as "leanstral-1-5". Mistral AI recommends using the model through Mistral Vibe, its proof engineering interface for Lean 4. No pricing information was disclosed for commercial API usage beyond the free tier.

What This Means

Leanstral 1.5's combination of a small active parameter count (6B) and strong formal verification performance challenges the assumption that mathematical reasoning requires massive models, although the full model still has 119B parameters. At roughly $4 per problem against an estimated $300+ for a competing system, the cost on PutnamBench-level problems is at least 75x lower, which could make formal verification economically viable for broader applications. The bug discoveries in real codebases, including edge cases such as overflow bugs that traditional testing missed, demonstrate practical utility beyond academic benchmarks. Still, only 11 of the 47 flagged properties (about 23%) were genuine bugs, found across 57 repositories, which suggests the technology still requires human oversight in production verification workflows.
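The ratios in this section follow directly from the reported figures. The sketch below reproduces them, treating the "$300+" estimate as a lower bound (an assumption, so the cost ratio is a minimum) and separating two different rates that are easy to conflate: confirmed bugs per repository and confirmed bugs per flagged property.

```python
# Figures quoted in the article.
leanstral_cost_usd = 4        # approximate cost per PutnamBench problem
competitor_cost_usd = 300     # lower bound of the "$300+" estimate

repositories = 57
flagged_properties = 47
genuine_bugs = 11
previously_unreported = 5

# Minimum cost advantage, since the competitor figure is a lower bound.
min_cost_ratio = competitor_cost_usd / leanstral_cost_usd

# Confirmed bugs relative to repositories scanned.
bugs_per_repo_pct = round(100 * genuine_bugs / repositories)

# Share of flags that turned out to be genuine bugs (precision of the flags).
flag_precision_pct = round(100 * genuine_bugs / flagged_properties)

print(min_cost_ratio, bugs_per_repo_pct, flag_precision_pct)
```

The second rate is the one that matters for review workload: roughly three out of four flags still needed a human to rule them out.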
