benchmark

33 articles tagged with benchmark

June 18, 2026
model release, Mistral AI

Mistral releases Leanstral, open-source model with 6B active parameters for the Lean 4 proof assistant, under Apache 2.0

Mistral AI has released Leanstral, a sparse 120B model with 6B active parameters designed specifically for the Lean 4 proof assistant. The model is available under the Apache 2.0 license with free API access and achieves a 26.3 FLTEval score at pass@2, outperforming Claude Sonnet 4.6 while costing $36 to run versus $549.
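
For readers unfamiliar with the pass@2 notation, here is a minimal sketch of the widely used unbiased pass@k estimator. How FLTEval actually aggregates its score is not described here, so the function names, the per-task `(n, c)` input format, and the averaging step are illustrative assumptions rather than the benchmark's own method.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n attempts of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of failed attempts;
    # it is 0 whenever fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 2) -> float:
    """Mean pass@k over tasks, as a percentage.

    `results` holds one hypothetical (attempts, correct) pair per task.
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Four made-up tasks, not real FLTEval data.
    print(benchmark_score([(2, 1), (2, 0), (4, 1), (4, 0)], k=2))
```

With exactly two attempts per task, pass@2 reduces to "at least one of the two attempts succeeded".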

June 17, 2026
model release

Z.AI releases GLM-5.2 with 1M token context, outperforms GPT-5.5 on long-horizon coding benchmarks

Z.AI has released GLM-5.2, an open-source model with a 1M-token context window under an MIT license. On FrontierSWE, a long-horizon coding benchmark, GLM-5.2 trails Claude Opus 4.8 by 1% while outperforming GPT-5.5 by 1%, and achieves 81.0 on Terminal-Bench 2.1 compared to Opus 4.8's 85.0.

June 12, 2026
benchmark

Gemini 3.5 Flash ranks 6th in Android coding benchmark at 3x the cost of Gemini 3.1 Pro

Google's latest Android Bench results show Gemini 3.5 Flash ranking 6th with a 63.7% success rate, despite averaging $147.10 per benchmark run compared to Gemini 3.1 Pro Preview's $47.90. The newer model used 355.9 tokens per run versus 73.3 for its predecessor, while GPT 5.5 leads the benchmark at 74% success rate.

June 9, 2026
benchmark

ServiceNow Releases First Code-Switching ASR Benchmark: ElevenLabs Scribe V2 Leads with Lowest WER Across Four Language Pairs

ServiceNow released AU-Harness, the first comprehensive benchmark for code-switched speech recognition in enterprise voice agents, testing seven ASR systems including ElevenLabs, Gemini, and AssemblyAI. The benchmark covers 918 utterances across Spanish-English, French-English, Canadian French-English, and German-English, measuring Word Error Rate (WER), Semantic WER (SWER), and Answer Error Rate (AER). ElevenLabs Scribe V2 achieved the lowest WER across all language pairs, followed closely by AssemblyAI Universal-3 Pro.
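
As background on the headline metric, Word Error Rate is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below illustrates that standard definition. The lowercasing and whitespace tokenization are simplifying assumptions, not AU-Harness's scoring pipeline, and SWER and AER are left out because their definitions are not given here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up Spanish-English utterance with one substituted word.
    print(word_error_rate("necesito reset my password",
                          "necesito reset my passport"))  # 0.25
```

Because the denominator is the reference length, WER can exceed 1.0 when the output contains many inserted words.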

May 29, 2026
model release

StepFun releases Step-3.7-Flash: 198B-parameter MoE model with 256K context at $0.20/M input tokens

StepFun has released Step-3.7-Flash, a 198B-parameter sparse Mixture-of-Experts vision-language model that activates 11B parameters per token and delivers up to 400 tokens per second. The model supports a 256K context window, three selectable reasoning levels, and is priced at $0.20 per million input tokens (cache miss) and $1.15 per million output tokens.
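
To put the per-token prices in concrete terms, here is a rough cost calculation for a single request. It assumes every input token is billed at the cache-miss rate, since no cache-hit price is stated. The function name and the token counts in the example are hypothetical.

```python
INPUT_USD_PER_MILLION = 0.20   # cache-miss input price quoted above
OUTPUT_USD_PER_MILLION = 1.15  # output price quoted above


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request, billing all input as cache misses."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens * INPUT_USD_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MILLION / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical long-context request: 100k tokens in, 2k tokens out.
    print(f"${request_cost_usd(100_000, 2_000):.4f}")  # $0.0223
```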

May 28, 2026
model release, Anthropic

Anthropic releases Claude Opus 4.8 with 69.2% agentic coding score, 2.5x faster performance

Anthropic released Claude Opus 4.8 on May 28, 2026, six weeks after version 4.7. The model scores 69.2% on agentic coding benchmarks (up from 64.3%) and runs 2.5 times faster in fast mode at one-third the cost, with pricing unchanged from version 4.7.

May 27, 2026
benchmark

Frontier AI Models Score Below 50% on First Enterprise IT Benchmark for Kubernetes Incident Response

Artificial Analysis and IBM Research have released ITBench-AA, the first benchmark evaluating AI models on enterprise Site Reliability Engineering tasks. Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%—all frontier models score below 50% on Kubernetes incident response tasks requiring root-cause diagnosis across complex infrastructure.

May 23, 2026
model release, Tencent

Tencent Releases Hy-MT2 Translation Models: 1.8B, 7B, and 30B-A3B Support 33 Languages

Tencent released Hy-MT2, a family of multilingual translation models available in 1.8B, 7B, and 30B-A3B (MoE) sizes. All models support translation among 33 languages and follow translation instructions in multiple languages. The 1.8B model can be compressed to 440MB using 1.25-bit AngelSlim quantization.

May 22, 2026
model release, Tencent

Tencent Releases Hy-MT2: 1.8B Translation Model Compressed to 440MB With 1.25-Bit Quantization

Tencent has open-sourced Hy-MT2, a family of multilingual translation models available in 1.8B, 7B, and 30B-A3B parameter sizes. The models support translation across 33 languages and include extreme quantization down to 1.25-bit, reducing the 1.8B model to 440MB storage while increasing inference speed by 1.5x.

May 18, 2026
benchmark

IBM Research launches Open Agent Leaderboard, showing same models achieve different results based on agent architecture

IBM Research has launched the Open Agent Leaderboard, the first open benchmark that evaluates complete AI agent systems rather than just underlying models. The leaderboard reveals that agents using identical models can achieve significantly different success rates and costs depending on system architecture, with failed runs costing 20-54% more than successful ones.

May 11, 2026
benchmark

Gemini handles video analysis across YouTube and 1.65GB local files, Claude fails entirely

In direct testing, Google's Gemini successfully analyzed video content from YouTube links and local files up to 1.65GB, accurately understanding context without audio or metadata. Anthropic's Claude cannot process video at all, while OpenAI's ChatGPT faces a 500MB file size limit without Codex assistance.

April 27, 2026
model release, Xiaomi

Xiaomi Releases MiMo-V2.5-Pro: 1.02T Parameter MoE Model with 1M Context Window

Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active parameters. The model supports a context length of up to 1 million tokens and claims 99.6% on GSM8K and 86.2% on MATH benchmarks.

benchmark, OpenAI

ChatGPT Images 2.0 scores 97% in head-to-head image generation benchmark against Google's Gemini Nano Banana at 85%

OpenAI's ChatGPT Images 2.0 scored 97% versus Google's Gemini Nano Banana at 85% in a nine-test image generation benchmark conducted by ZDNET. The tests measured capabilities including image restoration, text rendering, and prompt adherence, with Nano Banana losing points primarily for fabricating details and text errors.

April 24, 2026
model release, OpenAI

OpenAI GPT-5.5 scores 93/100 in benchmark test, loses points for ignoring instructions

OpenAI's GPT-5.5 scored 93 out of 100 points in a 10-round benchmark test covering summarization, reasoning, coding, and creative tasks. The model lost points primarily for ignoring specific instructions, such as using unauthorized sources when asked to summarize from a single news outlet.

model release, DeepSeek

DeepSeek Releases V4-Pro: 1.6T Parameter MoE Model with 1M Token Context

DeepSeek released two new Mixture-of-Experts models: DeepSeek-V4-Pro with 1.6 trillion parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting a context length of one million tokens. At 1M context, the models require 27% of the inference FLOPs and 10% of the KV cache of DeepSeek-V3.2, thanks to a hybrid attention architecture combining Compressed Sparse Attention and Heavily Compressed Attention.

April 21, 2026
benchmark, TII UAE

QIMMA Arabic Leaderboard Discards 3.1% of ArabicMMLU Samples After Quality Validation

TII UAE released QIMMA, an Arabic LLM leaderboard that validates benchmark quality before evaluating models. The validation pipeline, using Qwen3-235B and DeepSeek-V3 plus human review, discarded 3.1% of ArabicMMLU samples and found systematic quality issues across 14 benchmarks.

April 16, 2026
benchmark

Qwen3.6-35B-A3B Outperforms Claude Opus 4.7 on SVG Generation Test

In an informal SVG generation benchmark, Alibaba's Qwen3.6-35B-A3B model running locally via a 20.9GB quantized version outperformed Anthropic's newly released Claude Opus 4.7. The test, which asked models to generate SVG illustrations of pelicans and flamingos on bicycles, showed the smaller local model producing more accurate bicycle frames and more creative outputs.

April 14, 2026
benchmark, Anthropic

Claude Mythos achieves 73% success rate on expert-level hacking challenges, completes full network takeover in 3 of 10 attempts

The UK's AI Safety Institute reports Claude Mythos Preview achieved a 73% success rate on expert-level capture-the-flag cybersecurity challenges and became the first AI model to complete a full 32-step simulated corporate network takeover, succeeding in 3 out of 10 attempts. The testing occurred in environments without active security monitoring or defenders.

April 12, 2026
model release, Arcee AI

Arcee AI releases Trinity-Large-Thinking, open reasoning model matching Claude Opus on agent tasks

Arcee AI has released Trinity-Large-Thinking, a 400-billion-parameter open-weight reasoning model with a mixture-of-experts architecture that activates only 13 billion parameters per token. The model matches Claude Opus 4.6 on agent benchmarks like Tau2 and PinchBench but lags on general reasoning tasks. The company spent approximately $20 million—roughly half its total venture capital—to train the model on 2,048 Nvidia B300 GPUs over 33 days.

April 9, 2026
benchmark

OpenAI's GPT 5.4 ties Gemini 3.1 Pro at 72.4% on Google's Android coding benchmark

Google's Android Bench—a benchmark measuring AI model performance for Android app development—shows OpenAI's GPT 5.4 and Google's Gemini 3.1 Pro Preview tied at 72.4% in the latest April 2026 update. OpenAI's GPT 5.3-Codex ranks third at 67.7%, while Anthropic's Claude Opus 4.6 scores 66.6%.

April 8, 2026
model release

Meta launches Muse Spark, its first frontier model and first closed-weight AI system

Meta Superintelligence Labs has launched Muse Spark, a native multimodal reasoning model that scores 52 on the Artificial Analysis Intelligence Index, placing it in the top 5 frontier models. This marks Meta's first frontier-class model and its first AI system without open weights, representing a strategic shift from its open-source Llama strategy. The model achieves comparable efficiency to Gemini 3.1 Pro while matching Llama 4 Maverick capabilities with over an order of magnitude less compute.

April 7, 2026
benchmark

Google AI Overviews reach 91% accuracy with Gemini 3, but 56% of answers lack verifiable sources

An independent study by AI startup Oumi found that Google's AI Overviews answered correctly 91% of the time with Gemini 3, up from 85% with Gemini 2, based on 4,326 searches using the SimpleQA benchmark. However, 56% of correct answers in Gemini 3 could not be verified through the linked sources—a significant increase from 37% in Gemini 2—and at Google's scale, a 9% error rate still translates to millions of wrong answers per hour.

April 2, 2026
benchmark, NVIDIA

Nvidia claims 291 MLPerf wins with 288-GPU setup; AMD MI355X crosses 1M tokens/sec

MLCommons published MLPerf Inference v6.0 results on April 1, 2026, with Nvidia, AMD, and Intel each claiming top spots in different configurations. Nvidia's 288-GPU GB300-NVL72 system achieved 2.49 million tokens per second on DeepSeek-R1, while AMD's MI355X crossed one million tokens per second for the first time. Direct comparisons remain difficult as each chipmaker targets different market segments and benchmarks.

March 26, 2026
benchmark, OpenAI

ARC-AGI-3 benchmark: frontier AI models score below 1%, humans solve all 135 tasks

The ARC Prize Foundation released ARC-AGI-3, an interactive benchmark requiring AI agents to explore environments, form hypotheses, and execute plans without instructions. All 135 environments were solved by untrained humans, yet frontier models—including Gemini 3.1 Pro Preview (0.37%), GPT 5.4 (0.26%), Opus 4.6 (0.25%), and Grok-4.20 (0.00%)—scored below 1%.

March 25, 2026
model release

AI2 releases MolmoWeb, open web agent matching proprietary systems with 8B parameters

The Allen Institute for AI has released MolmoWeb, a fully open web agent that operates websites using only screenshots without access to source code. The 8B-parameter model achieves 78.2% success on WebVoyager—nearly matching OpenAI's o3 at 79.3%—while being trained on one of the largest public web task datasets ever released.

March 19, 2026
model release

Cursor releases Composer 2 at $0.50/$2.50 per 1M tokens, undercutting Claude and GPT-5.4 on pricing

Cursor released Composer 2, a code-specialized model priced at $0.50 per million input tokens and $2.50 per million output tokens, roughly 90% cheaper than Claude Opus 4.6 ($5.00/$25.00) and roughly 80% cheaper than GPT-5.4 ($2.50/$15.00). The model scores 61.3 on Cursor's internal CursorBench, ahead of Claude Opus 4.6 (58.2) but below GPT-5.4 Thinking (63.9).
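
The discounts can be recomputed directly from the listed per-million-token prices. The short sketch below does that arithmetic. The dictionary layout and the function name are illustrative, and only the prices quoted in this entry are used.

```python
# (input, output) prices in USD per million tokens, as quoted above.
PRICES = {
    "Composer 2": (0.50, 2.50),
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.4": (2.50, 15.00),
}


def percent_cheaper(price: float, baseline: float) -> float:
    """How much cheaper `price` is than `baseline`, in percent (1 decimal)."""
    if baseline <= 0:
        raise ValueError("baseline price must be positive")
    return round((baseline - price) / baseline * 100, 1)


if __name__ == "__main__":
    ours_in, ours_out = PRICES["Composer 2"]
    for rival in ("Claude Opus 4.6", "GPT-5.4"):
        rival_in, rival_out = PRICES[rival]
        print(rival,
              percent_cheaper(ours_in, rival_in),    # input-token discount
              percent_cheaper(ours_out, rival_out))  # output-token discount
    # Claude Opus 4.6 90.0 90.0
    # GPT-5.4 80.0 83.3
```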

March 17, 2026
analysis

Mistral's Leanstral code verification agent outperforms Claude Sonnet at about 7% of the cost

Mistral has released Leanstral, a 120B-parameter code verification agent built with the Lean programming language, claiming it outperforms larger open-source models and offers significant cost advantages over Anthropic's Claude suite. The model achieves a pass@2 score of 26.3—beating Claude Sonnet by 2.6 points—while costing $36 to run compared to Sonnet's $549.

March 14, 2026
benchmark, xAI

Grok 4.20 trails GPT-5.4 and Gemini 3.1 but achieves record 78% non-hallucination rate

xAI's Grok 4.20 scores 48 on Artificial Analysis' Intelligence Index, 6 points ahead of Grok 4 but behind Gemini 3.1 Pro Preview and GPT-5.4, which both score 57. The model distinguishes itself with a 78% non-hallucination rate on the AA Omniscience test, the highest of any model tested.

March 7, 2026
benchmark, OpenAI

Video AI models hit reasoning ceiling despite 1000x larger dataset, researchers find

An international research team released the largest video reasoning dataset to date—roughly 1,000 times larger than previous alternatives. Testing reveals that state-of-the-art models including Sora 2 and Veo 3.1 substantially underperform humans on reasoning tasks, suggesting the limitation isn't data scarcity but architectural constraints.

March 6, 2026
benchmark

Google benchmarks AI models for Android development; names top performers

Google has completed benchmarking tests to evaluate which AI models perform best for Android app development. The company released results identifying top-performing models across coding tasks specific to the Android platform.

March 1, 2026
benchmark

ElevenLabs and Google lead Artificial Analysis speech-to-text benchmark

Artificial Analysis has released an updated speech-to-text benchmark showing ElevenLabs and Google as top performers. The benchmark provides comparative analysis of current speech recognition systems across multiple models.

February 28, 2026
benchmark

Arcada Labs benchmark tests five AI models as autonomous X agents

Arcada Labs, an AI benchmarking startup, has created a new benchmark that pits five leading AI models against each other as autonomous social media agents on X. The test measures how well different models can operate independently on the platform.

February 22, 2026
benchmark

New benchmark reveals AI models struggle with personal photo retrieval tasks

A new benchmark evaluating AI models on photo retrieval reveals significant limitations in their ability to find specific images from personal collections. The test presents models with what appears to be a simple task—locating a particular photo—yet results demonstrate the gap between general image recognition and practical personal image search.