benchmarks
27 articles tagged with benchmarks
Loka Achieves 87% Speech Reasoning Accuracy Using Amazon Nova 2 Sonic, Outperforming GPT Realtime and Gemini
Loka built a conversational voice agent using Amazon Nova 2 Sonic that achieved 87.0% speech reasoning accuracy on Big Bench Audio, surpassing GPT Realtime at 83.0% and Gemini 2.5 Flash Native Audio at 71.0%. The system delivers Time to First Audio of 1.39 seconds at approximately $0.27 per hour of input audio.
OpenAI releases GPT-5.5-Cyber with 85.6% CyberGym score, surpassing restricted Anthropic model
OpenAI released an updated GPT-5.5-Cyber model that scores 85.6% on CyberGym, surpassing Anthropic's Mythos 5 (83.8%) — the same model that triggered Trump administration export controls. The release proceeds without the political pushback that forced Anthropic to restrict foreign national access.
Z.ai's GLM-5.2 Matches Claude Opus 4.8 in Agent Tasks, First Open Model to Compete in Coding
Z.ai released GLM-5.2 on June 16, 2026, the first open-weight model to match proprietary models like Claude Opus 4.8 on agent benchmarks. The MIT-licensed model closes the performance gap to 6.8 months behind frontier labs, down from expected 9+ months as compute scales.
Mistral Releases Codestral Embed, Code-Specialized Embedding Model at $0.15 Per Million Tokens
Mistral AI has released Codestral Embed, its first code-specialized embedding model, priced at $0.15 per million tokens. The model features an 8192-token context window and claims to outperform Voyage Code 3, Cohere Embed v4.0, and OpenAI's large embedding model on code retrieval benchmarks.
Mistral Launches OCR API at $1 Per 1,000 Pages, Claims 94.89% Accuracy on Document Benchmarks
Mistral AI has released Mistral OCR, an API for extracting text and images from documents at $1 per 1,000 pages (approximately $0.50 with batch inference). The company claims 94.89% overall accuracy on its internal test set, comparing favorably to GPT-4o (89.77%), Gemini 2.0 Flash (88.69%), and Azure OCR (89.52%).
NVIDIA Shows Task-Seeded Synthetic Data Boosts Nemotron-3 Nano by +11.1 on GPQA
NVIDIA demonstrated that task-seeded synthetic Q&A data improves model performance across multiple benchmarks in a 100B-token continuation experiment on Nemotron-3 Nano. The approach improved GPQA scores by +11.1 points, MMLU-Pro by +1.8, average code by +1.9, and commonsense understanding by +1.6.
Ideogram 4: 9.3B parameter open-weight text-to-image model with native 2K resolution and structured JSON prompting
Ideogram has released Ideogram 4, its first open-weight text-to-image model with 9.3 billion parameters. The model supports native 2K resolution, structured JSON prompting with bounding-box layout controls, and is available in nf4 and fp8 quantizations under a non-commercial license.
Ideogram Releases First Open-Weight Image Model With 9.3B Parameters and 2K Native Resolution
Ideogram has released Ideogram 4, a 9.3B parameter open-weight text-to-image model trained from scratch. The model features structured JSON prompting, native 2K resolution output, and ranks as the top open-weight model on Design Arena. Available in fp8 and nf4 quantizations under a non-commercial license.
Mistral Medium 3 launches at $0.4/$2 per million tokens, matching 90% of Claude 3.7 Sonnet performance
Mistral AI launched Mistral Medium 3 on May 7, 2025, priced at $0.4 per million input tokens and $2 per million output tokens. The company claims the model performs at or above 90% of Claude Sonnet 3.7 on benchmarks while being significantly less expensive, and surpasses Llama 4 Maverick and Cohere Command A.
Mistral Releases OCR API at $1 per 1,000 Pages, Claims 94.89% Accuracy on Document Benchmarks
Mistral AI has released an OCR API priced at $1 per 1,000 pages with batch inference costs approximately half that rate. The company claims 94.89% overall accuracy on internal benchmarks, ahead of GPT-4o (89.77%), Gemini 2.0 Flash (88.69%), and Azure OCR (89.52%). The model processes up to 2,000 pages per minute on a single node.
Google releases Gemini 3.5 Flash with autonomous coding and agent capabilities, claims 4x speed boost
Google released Gemini 3.5 Flash, positioning it as an agent-first model designed for autonomous coding and multi-hour workflows. The company claims the model outperforms its 3.1 Pro predecessor on coding and agentic benchmarks while running 4x faster than competing frontier models, with an optimized version achieving 12x speed gains.
Augment Code's agent matches Claude Code quality at 33% lower cost on Opus 4.7
Augment Code benchmarked its Auggie agent against Claude Code on Claude Opus 4.7, reporting a 67.4% pass rate versus 66.3% while cutting costs by 33%. The company attributes savings to a semantic context engine that reduces cache read tokens by 32% and output tokens by 37% compared to Claude Code's keyword-based retrieval.
Apple researchers combine diffusion and autoregressive techniques to improve LLM reasoning accuracy
Apple researchers, alongside UC San Diego, have published LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning, a framework that combines diffusion models with autoregressive generation. The system runs multiple reasoning paths in parallel during inference, each exploring different possibilities before generating a final answer.
Meta releases Muse Spark, its first closed-source AI model with paid developer access
Meta released Muse Spark in early April, its first closed-source AI model that will eventually offer paid developer access. According to Arena.AI rankings, Muse Spark trails Anthropic's Claude and Google's Gemini in text capabilities but beats OpenAI's GPT in vision tasks.
OpenAI releases GPT-5.5 with improved reasoning and agentic capabilities
OpenAI released GPT-5.5 on April 23, 2026, positioning it as a step toward agentic computing and a unified 'superapp' combining ChatGPT, Codex, and browser capabilities. The company claims the model outperforms GPT-5.4, Google's Gemini 3.1 Pro, and Anthropic's Claude Opus 4.5 across multiple benchmarks.
Xiaomi Launches MiMo-V2.5-Pro with 1M Context Window for Complex Agentic Tasks
Xiaomi released MiMo-V2.5-Pro on April 22, 2026, its flagship model featuring a 1,048,576 token context window and pricing at $1 per million input tokens and $3 per million output tokens. According to Xiaomi, the model ranks highly on ClawEval, GDPVal, and SWE-bench Pro benchmarks, designed for autonomous completion of professional tasks requiring thousands of tool calls.
Open-weight models closing gap with frontier AI, but struggle looms in specialized domains
Open-weight AI models are narrowing the performance gap with closed frontier models in current benchmarks focused on coding and terminal tasks, but industry analysts predict they'll struggle to keep pace as the field shifts toward specialized knowledge work in accounting, law, and healthcare. The gap reduction masks a more complex dynamic where benchmark correlation with real-world performance is weakening.
Anthropic releases Claude Opus 4.7 with improved coding and vision, confirms it trails unreleased Mythos model
Anthropic released Claude Opus 4.7 with improved coding capabilities, higher-resolution vision, and a new reasoning level. The company publicly acknowledged the model underperforms its unreleased Mythos system, which remains restricted due to safety concerns.
Anthropic's Claude experiences outage as GitHub issues citing quality concerns surge 3.5× since January
Anthropic's Claude.ai and Claude Code suffered a 48-minute outage on April 13, 2026, from 15:31 to 16:19 UTC with elevated error rates. GitHub quality complaints have increased 3.5× from the January-February baseline, though SWE-Bench-Pro scores show no substantive change since February.
AI agent skills fail in real-world conditions, researchers find testing 34,000 skills
A large-scale study testing 34,198 real-world skills reveals that AI agent performance drops drastically when moving from curated benchmarks to realistic conditions. Claude Opus 4.6 saw pass rates fall from 55.4% with hand-selected skills to 38.4% in truly realistic scenarios, while weaker models like Kimi K2.5 actually perform below their no-skill baseline.
AI models guess instead of asking for help, ProactiveBench study shows
Researchers introduced ProactiveBench, a benchmark testing whether multimodal language models ask for help when visual information is missing. Out of 22 models tested—including GPT-4.1, GPT-5.2, and o4-mini—almost none proactively request clarification, instead hallucinating or refusing to respond. A reinforcement learning approach showed models can be trained to ask for help, improving performance from 17.5% to 37-38%, though significant gaps remain.
Alibaba releases Qwen3.6-Plus with 1M token context, claims performance near Claude 4.5 Opus
Alibaba has released Qwen3.6-Plus, its third proprietary AI model in days, featuring a 1 million token context window available via Alibaba Cloud Model Studio API. The model claims improved agentic coding capabilities and partially outperforms Anthropic's Claude 4.5 Opus in Alibaba-conducted benchmarks, though trails Claude 4.6 Opus released in December 2025.
Gemini 3.1 Flash Live scores 95.9% on Big Bench Audio, Google's fastest voice model
Google has released Gemini 3.1 Flash Live, its new voice and audio AI model, scoring 95.9% on the Big Bench Audio Benchmark at high thinking levels—second only to Step-Audio R1.1 Realtime at 97.0%. Response times range from 0.96 seconds at minimal thinking to 2.98 seconds at high thinking, with pricing held at $0.35 per hour of audio input and $1.40 per hour of audio output.
MiniMax M2.7 used autonomous loops to optimize its own training process
MiniMax released M2.7, a model that autonomously participated in its own development through self-optimization loops. The model ran over 100 optimization rounds on internal coding tasks, achieving a 30% performance boost, and scored 66.6% on OpenAI's MLE Bench Lite—competitive with Gemini 3.1 Pro and GPT-5.4.
OpenAI releases GPT-5.4 mini and nano with 3-4x price increases but major performance gains
OpenAI has released GPT-5.4 mini and GPT-5.4 nano, compact models optimized for coding and subagent tasks. The new models deliver significant performance improvements—GPT-5.4 mini reaches 54.4% on SWE-Bench Pro versus 45.7% for GPT-5 mini—but cost 3-4x more per input token than their predecessors.
Frontier LLMs lose up to 33% accuracy in long conversations, study finds
Frontier language models including GPT-5.2 and Claude 4.6 experience accuracy degradation of up to 33% as conversations lengthen, according to new research. The finding suggests that extended context use within a single conversation introduces performance challenges even in state-of-the-art models.
OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions
OpenAI is calling for the retirement of SWE-bench Verified, the widely-used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.