benchmark
22 articles tagged with benchmark
Xiaomi Releases MiMo-V2.5-Pro: 1.02T Parameter MoE Model with 1M Context Window
Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active parameters. The model supports a context length of up to 1 million tokens and claims 99.6% on GSM8K and 86.2% on the MATH benchmark.
ChatGPT Images 2.0 scores 97% in head-to-head image generation benchmark against Google's Gemini Nano Banana at 85%
OpenAI's ChatGPT Images 2.0 scored 97% versus Google's Gemini Nano Banana at 85% in a nine-test image generation benchmark conducted by ZDNET. The tests measured capabilities including image restoration, text rendering, and prompt adherence, with Nano Banana losing points primarily for fabricating details and text errors.
OpenAI GPT-5.5 scores 93/100 in benchmark test, loses points for ignoring instructions
OpenAI's GPT-5.5 scored 93 out of 100 points in a 10-round benchmark test covering summarization, reasoning, coding, and creative tasks. The model lost points primarily for ignoring specific instructions, such as using unauthorized sources when asked to summarize from a single news outlet.
DeepSeek Releases V4-Pro: 1.6T Parameter MoE Model with 1M Token Context
DeepSeek released two new Mixture-of-Experts models: DeepSeek-V4-Pro with 1.6 trillion parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting a one-million-token context length. At 1M context the models require only 27% of the inference FLOPs and 10% of the KV cache of DeepSeek-V3.2, thanks to a hybrid attention architecture combining Compressed Sparse Attention and Heavily Compressed Attention.
QIMMA Arabic Leaderboard Discards 3.1% of ArabicMMLU Samples After Quality Validation
TII UAE released QIMMA, an Arabic LLM leaderboard that validates benchmark quality before evaluating models. The validation pipeline, using Qwen3-235B and DeepSeek-V3 plus human review, discarded 3.1% of ArabicMMLU samples and found systematic quality issues across 14 benchmarks.
Qwen3.6-35B-A3B Outperforms Claude Opus 4.7 on SVG Generation Test
In an informal SVG generation benchmark, Alibaba's Qwen3.6-35B-A3B model running locally via a 20.9GB quantized version outperformed Anthropic's newly released Claude Opus 4.7. The test, which asked models to generate SVG illustrations of pelicans and flamingos on bicycles, showed the smaller local model producing more accurate bicycle frames and more creative outputs.
Claude Mythos achieves 73% success rate on expert-level hacking challenges, completes full network takeover in 3 of 10 attempts
The UK's AI Safety Institute reports Claude Mythos Preview achieved a 73% success rate on expert-level capture-the-flag cybersecurity challenges and became the first AI model to complete a full 32-step simulated corporate network takeover, succeeding in 3 out of 10 attempts. The testing occurred in environments without active security monitoring or defenders.
Arcee AI releases Trinity-Large-Thinking, open reasoning model matching Claude Opus on agent tasks
Arcee AI has released Trinity-Large-Thinking, a 400-billion-parameter open-weight reasoning model with a mixture-of-experts architecture that activates only 13 billion parameters per token. The model matches Claude Opus 4.6 on agent benchmarks like Tau2 and PinchBench but lags on general reasoning tasks. The company spent approximately $20 million—roughly half its total venture capital—to train the model on 2,048 Nvidia B300 GPUs over 33 days.
OpenAI's GPT 5.4 ties Gemini 3.1 Pro at 72.4% on Google's Android coding benchmark
Google's Android Bench—a benchmark measuring AI model performance for Android app development—shows OpenAI's GPT 5.4 and Google's Gemini 3.1 Pro Preview tied at 72.4% in the latest April 2026 update. OpenAI's GPT 5.3-Codex ranks third at 67.7%, while Anthropic's Claude Opus 4.6 scores 66.6%.
Meta launches Muse Spark, its first frontier model and first closed-weight AI system
Meta Superintelligence Labs has launched Muse Spark, a native multimodal reasoning model that scores 52 on the Artificial Analysis Intelligence Index, placing it in the top 5 frontier models. This marks Meta's first frontier-class model and its first AI system without open weights, representing a strategic shift from its open-source Llama strategy. The model achieves comparable efficiency to Gemini 3.1 Pro while matching Llama 4 Maverick capabilities with over an order of magnitude less compute.
Google AI Overviews reach 91% accuracy with Gemini 3, but 56% of answers lack verifiable sources
An independent study by AI startup Oumi found that Google's AI Overviews answered correctly 91% of the time with Gemini 3, up from 85% with Gemini 2, based on 4,326 searches using the SimpleQA benchmark. However, 56% of correct answers in Gemini 3 could not be verified through the linked sources—a significant increase from 37% in Gemini 2—and at Google's scale, a 9% error rate still translates to millions of wrong answers per hour.
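The "millions of wrong answers per hour" claim is easy to sanity-check with back-of-the-envelope arithmetic. In the sketch below, the daily search volume and the share of queries that trigger an AI Overview are illustrative assumptions, not figures from the Oumi study; only the 9% error rate comes from the article.

# Back-of-the-envelope scale check (assumed volumes, not figures from the study)
searches_per_day = 8.5e9      # assumption: rough global Google search volume
overview_share = 0.15         # assumption: fraction of searches showing an AI Overview
error_rate = 0.09             # 9% error rate reported for Gemini 3

overviews_per_hour = searches_per_day / 24 * overview_share
wrong_per_hour = overviews_per_hour * error_rate
print(f"~{wrong_per_hour / 1e6:.1f} million wrong answers per hour")   # ~4.8 million under these assumptions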
Nvidia claims 291 MLPerf wins with 288-GPU setup; AMD MI355X crosses 1M tokens/sec
MLCommons published MLPerf Inference v6.0 results on April 1, 2026, with Nvidia, AMD, and Intel each claiming top spots in different configurations. Nvidia's 288-GPU GB300-NVL72 system achieved 2.49 million tokens per second on DeepSeek-R1, while AMD's MI355X crossed one million tokens per second for the first time. Direct comparisons remain difficult as each chipmaker targets different market segments and benchmarks.
ARC-AGI-3 benchmark: frontier AI models score below 1%, humans solve all 135 tasks
The ARC Prize Foundation released ARC-AGI-3, an interactive benchmark requiring AI agents to explore environments, form hypotheses, and execute plans without instructions. All 135 environments were solved by untrained humans, yet frontier models—including Gemini 3.1 Pro Preview (0.37%), GPT 5.4 (0.26%), Opus 4.6 (0.25%), and Grok-4.20 (0.00%)—scored below 1%.
AI2 releases MolmoWeb, open web agent matching proprietary systems with 8B parameters
The Allen Institute for AI has released MolmoWeb, a fully open web agent that operates websites using only screenshots without access to source code. The 8B-parameter model achieves 78.2% success on WebVoyager—nearly matching OpenAI's o3 at 79.3%—while being trained on one of the largest public web task datasets ever released.
Cursor releases Composer 2 at $0.50/$2.50 per 1M tokens, undercutting Claude and GPT-5.4 on pricing
Cursor released Composer 2, a code-specialized model priced at $0.50 per million input tokens and $2.50 per million output tokens, roughly 90% cheaper than Claude Opus 4.6 ($5.00/$25.00) and roughly 80% cheaper than GPT-5.4 ($2.50/$15.00). The model scores 61.3 on Cursor's internal CursorBench, competitive with Claude Opus 4.6 (58.2) but below GPT-5.4 Thinking (63.9).
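The percentages can be reproduced directly from the quoted per-token prices; in the sketch below, only the 3:1 input-to-output token mix used for the blended figure is an arbitrary assumption.

# Savings calculation from the per-million-token prices quoted above.
prices = {                      # (input $/M tokens, output $/M tokens)
    "Composer 2": (0.50, 2.50),
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.4": (2.50, 15.00),
}

def blended_cost(model, input_m=3.0, output_m=1.0):
    """Cost of a hypothetical workload of input_m / output_m million tokens (3:1 mix assumed)."""
    in_price, out_price = prices[model]
    return input_m * in_price + output_m * out_price

base = blended_cost("Composer 2")
for rival in ("Claude Opus 4.6", "GPT-5.4"):
    saving = 1 - base / blended_cost(rival)
    print(f"Composer 2 vs {rival}: {saving:.0%} cheaper")
# -> 90% cheaper than Claude Opus 4.6, ~82% cheaper than GPT-5.4 under this mix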
Mistral's Leanstral code verification agent outperforms Claude Sonnet at roughly 1/15th the cost
Mistral has released Leanstral, a 120B-parameter code verification agent built with the Lean programming language, claiming it outperforms larger open-source models and offers significant cost advantages over Anthropic's Claude suite. The model achieves a pass@2 score of 26.3—beating Claude Sonnet by 2.6 points—while costing $36 to run compared to Sonnet's $549.
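The headline cost ratio and the implied Claude Sonnet score both follow from the figures in the summary:

# Quick arithmetic on the quoted figures
leanstral_cost, sonnet_cost = 36, 549
cost_ratio = sonnet_cost / leanstral_cost        # ~15.3, i.e. roughly 1/15th of Sonnet's cost
implied_sonnet_pass_at_2 = 26.3 - 2.6            # Leanstral's 26.3 minus the 2.6-point margin
print(f"~1/{cost_ratio:.0f} of Sonnet's cost; implied Sonnet pass@2 ~ {implied_sonnet_pass_at_2:.1f}")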
Grok 4.20 trails GPT-5.4 and Gemini 3.1 but achieves record 78% non-hallucination rate
xAI's Grok 4.20 scores 48 on Artificial Analysis' Intelligence Index, 6 points ahead of Grok 4 but trailing Gemini 3.1 Pro Preview and GPT-5.4 at 57. The model distinguishes itself with a 78% non-hallucination rate on the AA Omniscience test, the highest of any model tested.
Video AI models hit reasoning ceiling despite 1000x larger dataset, researchers find
An international research team released the largest video reasoning dataset to date—roughly 1,000 times larger than previous alternatives. Testing reveals that state-of-the-art models including Sora 2 and Veo 3.1 substantially underperform humans on reasoning tasks, suggesting the limitation isn't data scarcity but architectural constraints.
Google benchmarks AI models for Android development; names top performers
Google has published benchmark results evaluating which AI models perform best for Android app development, identifying the top performers across coding tasks specific to the Android platform.
ElevenLabs and Google lead Artificial Analysis speech-to-text benchmark
Artificial Analysis has released an updated speech-to-text benchmark showing ElevenLabs and Google as top performers. The benchmark provides a comparative analysis of current speech recognition models from multiple providers.
Arcada Labs benchmark tests five AI models as autonomous X agents
Arcada Labs, an AI benchmarking startup, has created a new benchmark that pits five leading AI models against each other as autonomous social media agents on X. The test measures how well different models can operate independently on the platform.
New benchmark reveals AI models struggle with personal photo retrieval tasks
A new benchmark evaluating AI models on photo retrieval reveals significant limitations in their ability to find specific images from personal collections. The test presents models with what appears to be a simple task—locating a particular photo—yet results demonstrate the gap between general image recognition and practical personal image search.