benchmarks
15 articles tagged with benchmarks
Apple researchers combine diffusion and autoregressive techniques to improve LLM reasoning accuracy
Apple researchers, alongside UC San Diego, have published LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning, a framework that combines diffusion models with autoregressive generation. The system runs multiple reasoning paths in parallel during inference, each exploring different possibilities before generating a final answer.
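The summary doesn't reproduce LaDiR's latent-diffusion machinery, but the parallel-path idea it describes can be sketched generically: sample several independent reasoning traces concurrently, then select a final answer by consensus. A minimal sketch follows; `generate_reasoning_path` is a hypothetical placeholder for any sampling-enabled model call, not an API from the paper.

```python
import asyncio
from collections import Counter

# Hypothetical stand-in for a sampling-enabled model call; NOT an API from the
# LaDiR paper. It should return (reasoning_text, final_answer).
async def generate_reasoning_path(question: str, seed: int) -> tuple[str, str]:
    raise NotImplementedError("plug in a real model call here")

async def answer_with_parallel_paths(question: str, n_paths: int = 8) -> str:
    # Explore several reasoning paths concurrently.
    paths = await asyncio.gather(
        *(generate_reasoning_path(question, seed=i) for i in range(n_paths))
    )
    # Pick the final answer by simple majority vote across paths
    # (self-consistency-style selection).
    votes = Counter(answer for _reasoning, answer in paths)
    return votes.most_common(1)[0][0]
```

In LaDiR the exploration happens via latent diffusion rather than plain resampling of text, so treat this purely as an illustration of the parallel reasoning-path pattern.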
Meta releases Muse Spark, its first closed-source AI model with paid developer access
In early April, Meta released Muse Spark, its first closed-source AI model, which will eventually offer paid developer access. According to Arena.AI rankings, Muse Spark trails Anthropic's Claude and Google's Gemini in text capabilities but beats OpenAI's GPT in vision tasks.
OpenAI releases GPT-5.5 with improved reasoning and agentic capabilities
OpenAI released GPT-5.5 on April 23, 2026, positioning it as a step toward agentic computing and a unified 'superapp' combining ChatGPT, Codex, and browser capabilities. The company claims the model outperforms GPT-5.4, Google's Gemini 3.1 Pro, and Anthropic's Claude Opus 4.5 across multiple benchmarks.
Xiaomi launches MiMo-V2.5-Pro with 1M context window for complex agentic tasks
Xiaomi released its flagship model MiMo-V2.5-Pro on April 22, 2026, featuring a 1,048,576-token context window and pricing of $1 per million input tokens and $3 per million output tokens. The model is designed for autonomous completion of professional tasks requiring thousands of tool calls, and according to Xiaomi it ranks highly on the ClawEval, GDPVal, and SWE-bench Pro benchmarks.
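As a quick sanity check on what those rates mean in practice, the sketch below multiplies the quoted per-million-token prices by illustrative token counts; the usage figures are assumptions, not numbers from the announcement.

```python
# Prices quoted in the announcement; the token counts further down are
# illustrative assumptions, not benchmark figures.
INPUT_PRICE_PER_M = 1.00   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 3.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the quoted MiMo-V2.5-Pro rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: an agentic session that fills most of the 1,048,576-token window
# with tool-call context and produces 50,000 output tokens.
print(f"${request_cost(1_000_000, 50_000):.2f}")  # $1.15
```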
Open-weight models closing gap with frontier AI, but struggle looms in specialized domains
Open-weight AI models are narrowing the performance gap with closed frontier models on current benchmarks focused on coding and terminal tasks, but industry analysts predict they will struggle to keep pace as the field shifts toward specialized knowledge work in accounting, law, and healthcare. The narrowing gap also masks a more complex dynamic: the correlation between benchmark scores and real-world performance is weakening.
Anthropic releases Claude Opus 4.7 with improved coding and vision, confirms it trails unreleased Mythos model
Anthropic released Claude Opus 4.7 with improved coding capabilities, higher-resolution vision, and a new reasoning level. The company publicly acknowledged the model underperforms its unreleased Mythos system, which remains restricted due to safety concerns.
Anthropic's Claude experiences outage as GitHub issues citing quality concerns surge 3.5× since January
Anthropic's Claude.ai and Claude Code suffered a 48-minute outage on April 13, 2026, from 15:31 to 16:19 UTC, with elevated error rates. GitHub quality complaints have increased 3.5× from the January-February baseline, though SWE-bench Pro scores show no substantive change since February.
AI agent skills fail in real-world conditions, researchers find after testing 34,000 skills
A large-scale study testing 34,198 real-world skills reveals that AI agent performance drops drastically when moving from curated benchmarks to realistic conditions. Claude Opus 4.6 saw pass rates fall from 55.4% with hand-selected skills to 38.4% in truly realistic scenarios, while weaker models such as Kimi K2.5 actually performed below their no-skill baseline.
AI models guess instead of asking for help, ProactiveBench study shows
Researchers introduced ProactiveBench, a benchmark testing whether multimodal language models ask for help when visual information is missing. Out of 22 models tested—including GPT-4.1, GPT-5.2, and o4-mini—almost none proactively request clarification, instead hallucinating or refusing to respond. A reinforcement learning approach showed models can be trained to ask for help, improving performance from 17.5% to 37-38%, though significant gaps remain.
Alibaba releases Qwen3.6-Plus with 1M token context, claims performance near Claude 4.5 Opus
Alibaba has released Qwen3.6-Plus, its third proprietary AI model in a matter of days, featuring a 1 million token context window available via the Alibaba Cloud Model Studio API. Alibaba claims improved agentic coding capabilities, and in its own benchmarks the model partially outperforms Anthropic's Claude 4.5 Opus, though it trails Claude 4.6 Opus, released in December 2025.
Gemini 3.1 Flash Live, Google's fastest voice model, scores 95.9% on Big Bench Audio
Google has released Gemini 3.1 Flash Live, its new voice and audio AI model, scoring 95.9% on the Big Bench Audio Benchmark at high thinking levels—second only to Step-Audio R1.1 Realtime at 97.0%. Response times range from 0.96 seconds at minimal thinking to 2.98 seconds at high thinking, with pricing held at $0.35 per hour of audio input and $1.40 per hour of audio output.
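For a sense of scale, the sketch below converts the quoted hourly audio rates into a per-session cost; the session lengths are illustrative assumptions, not figures from Google.

```python
# Hourly rates quoted above; the session split below is an assumption.
AUDIO_IN_PER_HOUR = 0.35   # USD per hour of audio input
AUDIO_OUT_PER_HOUR = 1.40  # USD per hour of audio output

def session_cost(user_minutes: float, model_minutes: float) -> float:
    """USD cost of one voice session at the quoted Gemini 3.1 Flash Live rates."""
    return (user_minutes / 60) * AUDIO_IN_PER_HOUR \
         + (model_minutes / 60) * AUDIO_OUT_PER_HOUR

# Example: a 10-minute conversation, roughly half user speech, half model speech.
print(f"${session_cost(5, 5):.4f}")  # about $0.1458
```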
MiniMax M2.7 used autonomous loops to optimize its own training process
MiniMax released M2.7, a model that autonomously participated in its own development through self-optimization loops. The model ran over 100 optimization rounds on internal coding tasks, achieving a 30% performance boost, and scored 66.6% on OpenAI's MLE-bench Lite, a result competitive with Gemini 3.1 Pro and GPT-5.4.
OpenAI releases GPT-5.4 mini and nano with 3-4x price increases but major performance gains
OpenAI has released GPT-5.4 mini and GPT-5.4 nano, compact models optimized for coding and subagent tasks. The new models deliver significant performance improvements (GPT-5.4 mini reaches 54.4% on SWE-bench Pro versus 45.7% for GPT-5 mini) but cost 3-4x more per input token than their predecessors.
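To put the trade-off in rough numbers, the sketch below compares the relative score gain against the price increase; the baseline price and the 3.5x multiplier are hypothetical placeholders, since the article quotes only a 3-4x range with no absolute prices.

```python
# Scores come from the summary above; the baseline price and the 3.5x
# multiplier are hypothetical placeholders (the article gives only a 3-4x
# range and no absolute prices).
OLD_SCORE, NEW_SCORE = 45.7, 54.4   # SWE-bench Pro pass rates (%)
OLD_INPUT_PRICE = 0.25              # hypothetical USD per 1M input tokens
PRICE_MULTIPLIER = 3.5              # midpoint of the quoted 3-4x increase

new_input_price = OLD_INPUT_PRICE * PRICE_MULTIPLIER
relative_score_gain = (NEW_SCORE - OLD_SCORE) / OLD_SCORE

print(f"new input price: ${new_input_price:.2f} per 1M tokens")  # $0.88 (hypothetical)
print(f"relative score gain: {relative_score_gain:.1%}")          # ~19.0%
print(f"input-price increase: {PRICE_MULTIPLIER - 1:.0%}")        # 250%
```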
Frontier LLMs lose up to 33% accuracy in long conversations, study finds
Frontier language models including GPT-5.2 and Claude 4.6 experience accuracy degradation of up to 33% as conversations lengthen, according to new research. The finding suggests that extended context use within a single conversation introduces performance challenges even in state-of-the-art models.
OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions
OpenAI is calling for the retirement of SWE-bench Verified, the widely used AI coding benchmark, claiming that most of its tasks are flawed in ways that cause correct solutions to be rejected. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.