benchmark
9 articles tagged with benchmark
AI2 releases MolmoWeb, open web agent matching proprietary systems with 8B parameters
The Allen Institute for AI has released MolmoWeb, a fully open web agent that operates websites from screenshots alone, with no access to the underlying page source. The 8B-parameter model achieves 78.2% success on WebVoyager, nearly matching OpenAI's o3 at 79.3%, and was trained on one of the largest public web task datasets ever released.
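Screenshot-only operation means the agent's entire interface to the browser is pixels in, coordinates and keystrokes out. Below is a minimal sketch of such a control loop, using Playwright for the browser side; the propose_action stub is hypothetical and stands in for the model call, whose real interface the release does not specify here.

```python
from playwright.sync_api import sync_playwright

def propose_action(screenshot_png: bytes, goal: str) -> dict:
    """Hypothetical policy stub. A real agent would send the screenshot
    and goal to the model and parse an action back, e.g.
    {"type": "click", "x": 412, "y": 198} or {"type": "type", "text": "..."}.
    Stubbed to terminate immediately so the sketch runs as written."""
    return {"type": "done"}

def run_agent(start_url: str, goal: str, max_steps: int = 30) -> None:
    with sync_playwright() as pw:
        page = pw.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            shot = page.screenshot()  # pixels only: no DOM, no page source
            action = propose_action(shot, goal)
            if action["type"] == "done":
                break
            if action["type"] == "click":
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])

run_agent("https://example.com", "find the contact page")
```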
Cursor releases Composer 2 at $0.50/$2.50 per 1M tokens, undercutting Claude Opus 4.6 and GPT-5.4 on pricing
Cursor released Composer 2, a code-specialized model priced at $0.50 per million input tokens and $2.50 per million output tokens, roughly 90% cheaper than Claude Opus 4.6 ($5.00/$25.00) and roughly 80% cheaper than GPT-5.4 ($2.50/$15.00). The model scores 61.3 on Cursor's internal CursorBench, competitive with Claude Opus 4.6 (58.2) but below GPT-5.4 Thinking (63.9).
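As a quick sanity check on the quoted discounts, the per-token price ratios follow directly from the figures above; this treats input and output rates separately rather than as a blended, usage-weighted cost.

```python
# Per-1M-token prices from the summary above: (input $, output $).
prices = {
    "Composer 2":      (0.50, 2.50),
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.4":         (2.50, 15.00),
}

base_in, base_out = prices["Composer 2"]
for model, (p_in, p_out) in prices.items():
    if model == "Composer 2":
        continue
    print(f"vs {model}: input {1 - base_in / p_in:.0%} cheaper, "
          f"output {1 - base_out / p_out:.0%} cheaper")
# vs Claude Opus 4.6: input 90% cheaper, output 90% cheaper
# vs GPT-5.4: input 80% cheaper, output 83% cheaper
```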
Mistral's Leanstral code verification agent outperforms Claude Sonnet at roughly 7% of the cost
Mistral has released Leanstral, a 120B-parameter code verification agent built with the Lean programming language, claiming it outperforms larger open-source models and offers significant cost advantages over Anthropic's Claude suite. The model achieves a pass@2 score of 26.3—beating Claude Sonnet by 2.6 points—while costing $36 to run compared to Sonnet's $549.
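A pass@2 score is conventionally computed with the unbiased pass@k estimator from Chen et al. (2021); whether Mistral uses exactly this estimator is an assumption, but it is the standard convention for the metric. A minimal sketch:

```python
# Given n sampled attempts per problem, of which c pass verification,
# the unbiased estimator is pass@k = 1 - C(n - c, k) / C(n, k),
# averaged over problems.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement)
    from n attempts, c of them correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 attempts per problem, 1 verified solution -> pass@2 of 0.2
print(pass_at_k(n=10, c=1, k=2))  # 0.2
```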
Grok 4.20 trails GPT-5.4 and Gemini 3.1 but achieves record 78% non-hallucination rate
xAI's Grok 4.20 scores 48 on Artificial Analysis' Intelligence Index, 6 points ahead of Grok 4 but trailing Gemini 3.1 Pro Preview and GPT-5.4, both at 57. The model distinguishes itself with a 78% non-hallucination rate on the AA Omniscience test, the highest of any model tested.
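The article does not spell out how AA Omniscience scores responses; one plausible shape for a non-hallucination rate credits abstentions as well as correct answers, penalizing only confidently wrong ones. A toy sketch under that assumption:

```python
def non_hallucination_rate(correct: int, wrong: int, abstained: int) -> float:
    """Fraction of prompts on which the model did NOT assert a wrong
    answer; abstentions count as safe. An assumed definition, not AA's
    published formula."""
    total = correct + wrong + abstained
    return (correct + abstained) / total

# A model that abstains when unsure beats one that always guesses:
print(non_hallucination_rate(correct=60, wrong=10, abstained=30))  # 0.9
print(non_hallucination_rate(correct=70, wrong=30, abstained=0))   # 0.7
```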
Video AI models hit reasoning ceiling despite 1000x larger dataset, researchers find
An international research team released the largest video reasoning dataset to date—roughly 1,000 times larger than previous alternatives. Testing reveals that state-of-the-art models including Sora 2 and Veo 3.1 substantially underperform humans on reasoning tasks, suggesting the limitation isn't data scarcity but architectural constraints.
Google benchmarks AI models for Android development; names top performers
Google has benchmarked AI models to determine which perform best for Android app development, and has published results identifying the top performers on Android-specific coding tasks.
ElevenLabs and Google lead Artificial Analysis speech-to-text benchmark
Artificial Analysis has released an updated speech-to-text benchmark showing ElevenLabs and Google as top performers. The benchmark provides a comparative ranking of current speech recognition systems.
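Speech-to-text leaderboards of this kind are typically ranked by word error rate (WER): the word-level edit distance between a system transcript and a human reference, normalized by reference length. The standard computation, as a minimal sketch; lower is better:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```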
Arcada Labs benchmark tests five AI models as autonomous X agents
Arcada Labs, an AI benchmarking startup, has created a benchmark that pits five leading AI models against one another as autonomous social media agents on X, measuring how well each can operate independently on the platform.
New benchmark reveals AI models struggle with personal photo retrieval tasks
A new benchmark evaluating AI models on photo retrieval reveals significant limitations in their ability to find specific images in personal collections. The test gives models a seemingly simple task, locating a particular photo, yet the results expose the gap between general image recognition and practical personal photo search.
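The baseline such a benchmark implicitly probes is embedding-based retrieval: embed every photo and the text query in a shared space (here assumed precomputed by a CLIP-style encoder, which is out of scope) and rank by cosine similarity. A minimal sketch:

```python
import numpy as np

def top_k_photos(query_emb: np.ndarray, photo_embs: np.ndarray, k: int = 5):
    """Indices and scores of the k library photos whose embeddings are
    most cosine-similar to the text query's embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = photo_embs / np.linalg.norm(photo_embs, axis=1, keepdims=True)
    sims = p @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy 10,000-photo library with 512-dim embeddings (random stand-ins
# for real encoder outputs):
rng = np.random.default_rng(0)
library = rng.normal(size=(10_000, 512))
query = rng.normal(size=512)
indices, scores = top_k_photos(query, library)
```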