benchmarks

27 articles tagged with benchmarks

June 24, 2026
product updateAmazon Web Services

Loka Achieves 87% Speech Reasoning Accuracy Using Amazon Nova 2 Sonic, Outperforming GPT Realtime and Gemini

Loka built a conversational voice agent using Amazon Nova 2 Sonic that achieved 87.0% speech reasoning accuracy on Big Bench Audio, surpassing GPT Realtime at 83.0% and Gemini 2.5 Flash Native Audio at 71.0%. The system delivers Time to First Audio of 1.39 seconds at approximately $0.27 per hour of input audio.

June 23, 2026
product updateOpenAI+1

OpenAI releases GPT-5.5-Cyber with 85.6% CyberGym score, surpassing restricted Anthropic model

OpenAI released an updated GPT-5.5-Cyber model that scores 85.6% on CyberGym, surpassing Anthropic's Mythos 5 (83.8%) — the same model that triggered Trump administration export controls. The release proceeds without the political pushback that forced Anthropic to restrict foreign national access.

June 22, 2026
model release

Z.ai's GLM-5.2 Matches Claude Opus 4.8 in Agent Tasks, First Open Model to Compete in Coding

Z.ai released GLM-5.2 on June 16, 2026, the first open-weight model to match proprietary models like Claude Opus 4.8 on agent benchmarks. The MIT-licensed model closes the performance gap to 6.8 months behind frontier labs, down from expected 9+ months as compute scales.

June 18, 2026
model releaseMistral AI

Mistral Releases Codestral Embed, Code-Specialized Embedding Model at $0.15 Per Million Tokens

Mistral AI has released Codestral Embed, its first code-specialized embedding model, priced at $0.15 per million tokens. The model features an 8192-token context window and claims to outperform Voyage Code 3, Cohere Embed v4.0, and OpenAI's large embedding model on code retrieval benchmarks.

product updateMistral AI

Mistral Launches OCR API at $1 Per 1,000 Pages, Claims 94.89% Accuracy on Document Benchmarks

Mistral AI has released Mistral OCR, an API for extracting text and images from documents at $1 per 1,000 pages (approximately $0.50 with batch inference). The company claims 94.89% overall accuracy on its internal test set, comparing favorably to GPT-4o (89.77%), Gemini 2.0 Flash (88.69%), and Azure OCR (89.52%).

June 4, 2026
researchNVIDIA

NVIDIA Shows Task-Seeded Synthetic Data Boosts Nemotron-3 Nano by +11.1 on GPQA

NVIDIA demonstrated that task-seeded synthetic Q&A data improves model performance across multiple benchmarks in a 100B-token continuation experiment on Nemotron-3 Nano. The approach improved GPQA scores by +11.1 points, MMLU-Pro by +1.8, average code by +1.9, and commonsense understanding by +1.6.

model release

Ideogram 4: 9.3B parameter open-weight text-to-image model with native 2K resolution and structured JSON prompting

Ideogram has released Ideogram 4, its first open-weight text-to-image model with 9.3 billion parameters. The model supports native 2K resolution, structured JSON prompting with bounding-box layout controls, and is available in nf4 and fp8 quantizations under a non-commercial license.

June 3, 2026
model release

Ideogram Releases First Open-Weight Image Model With 9.3B Parameters and 2K Native Resolution

Ideogram has released Ideogram 4, a 9.3B parameter open-weight text-to-image model trained from scratch. The model features structured JSON prompting, native 2K resolution output, and ranks as the top open-weight model on Design Arena. Available in fp8 and nf4 quantizations under a non-commercial license.

May 28, 2026
model releaseMistral AI

Mistral Medium 3 launches at $0.4/$2 per million tokens, matching 90% of Claude 3.7 Sonnet performance

Mistral AI launched Mistral Medium 3 on May 7, 2025, priced at $0.4 per million input tokens and $2 per million output tokens. The company claims the model performs at or above 90% of Claude Sonnet 3.7 on benchmarks while being significantly less expensive, and surpasses Llama 4 Maverick and Cohere Command A.

product updateMistral AI

Mistral Releases OCR API at $1 per 1,000 Pages, Claims 94.89% Accuracy on Document Benchmarks

Mistral AI has released an OCR API priced at $1 per 1,000 pages with batch inference costs approximately half that rate. The company claims 94.89% overall accuracy on internal benchmarks, ahead of GPT-4o (89.77%), Gemini 2.0 Flash (88.69%), and Azure OCR (89.52%). The model processes up to 2,000 pages per minute on a single node.

May 19, 2026
model release

Google releases Gemini 3.5 Flash with autonomous coding and agent capabilities, claims 4x speed boost

Google released Gemini 3.5 Flash, positioning it as an agent-first model designed for autonomous coding and multi-hour workflows. The company claims the model outperforms its 3.1 Pro predecessor on coding and agentic benchmarks while running 4x faster than competing frontier models, with an optimized version achieving 12x speed gains.

May 15, 2026
benchmark

Augment Code's agent matches Claude Code quality at 33% lower cost on Opus 4.7

Augment Code benchmarked its Auggie agent against Claude Code on Claude Opus 4.7, reporting a 67.4% pass rate versus 66.3% while cutting costs by 33%. The company attributes savings to a semantic context engine that reduces cache read tokens by 32% and output tokens by 37% compared to Claude Code's keyword-based retrieval.

April 29, 2026
researchApple

Apple researchers combine diffusion and autoregressive techniques to improve LLM reasoning accuracy

Apple researchers, alongside UC San Diego, have published LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning, a framework that combines diffusion models with autoregressive generation. The system runs multiple reasoning paths in parallel during inference, each exploring different possibilities before generating a final answer.

April 28, 2026
model release+1

Meta releases Muse Spark, its first closed-source AI model with paid developer access

Meta released Muse Spark in early April, its first closed-source AI model that will eventually offer paid developer access. According to Arena.AI rankings, Muse Spark trails Anthropic's Claude and Google's Gemini in text capabilities but beats OpenAI's GPT in vision tasks.

April 23, 2026
model releaseOpenAI

OpenAI releases GPT-5.5 with improved reasoning and agentic capabilities

OpenAI released GPT-5.5 on April 23, 2026, positioning it as a step toward agentic computing and a unified 'superapp' combining ChatGPT, Codex, and browser capabilities. The company claims the model outperforms GPT-5.4, Google's Gemini 3.1 Pro, and Anthropic's Claude Opus 4.5 across multiple benchmarks.

April 22, 2026
model releaseXiaomi

Xiaomi Launches MiMo-V2.5-Pro with 1M Context Window for Complex Agentic Tasks

Xiaomi released MiMo-V2.5-Pro on April 22, 2026, its flagship model featuring a 1,048,576 token context window and pricing at $1 per million input tokens and $3 per million output tokens. According to Xiaomi, the model ranks highly on ClawEval, GDPVal, and SWE-bench Pro benchmarks, designed for autonomous completion of professional tasks requiring thousands of tool calls.

April 20, 2026
analysis

Open-weight models closing gap with frontier AI, but struggle looms in specialized domains

Open-weight AI models are narrowing the performance gap with closed frontier models in current benchmarks focused on coding and terminal tasks, but industry analysts predict they'll struggle to keep pace as the field shifts toward specialized knowledge work in accounting, law, and healthcare. The gap reduction masks a more complex dynamic where benchmark correlation with real-world performance is weakening.

April 16, 2026
model releaseAnthropic

Anthropic releases Claude Opus 4.7 with improved coding and vision, confirms it trails unreleased Mythos model

Anthropic released Claude Opus 4.7 with improved coding capabilities, higher-resolution vision, and a new reasoning level. The company publicly acknowledged the model underperforms its unreleased Mythos system, which remains restricted due to safety concerns.

April 13, 2026
product updateAnthropic

Anthropic's Claude experiences outage as GitHub issues citing quality concerns surge 3.5× since January

Anthropic's Claude.ai and Claude Code suffered a 48-minute outage on April 13, 2026, from 15:31 to 16:19 UTC with elevated error rates. GitHub quality complaints have increased 3.5× from the January-February baseline, though SWE-Bench-Pro scores show no substantive change since February.

April 12, 2026
researchAnthropic

AI agent skills fail in real-world conditions, researchers find testing 34,000 skills

A large-scale study testing 34,198 real-world skills reveals that AI agent performance drops drastically when moving from curated benchmarks to realistic conditions. Claude Opus 4.6 saw pass rates fall from 55.4% with hand-selected skills to 38.4% in truly realistic scenarios, while weaker models like Kimi K2.5 actually perform below their no-skill baseline.

April 11, 2026
benchmark

AI models guess instead of asking for help, ProactiveBench study shows

Researchers introduced ProactiveBench, a benchmark testing whether multimodal language models ask for help when visual information is missing. Out of 22 models tested—including GPT-4.1, GPT-5.2, and o4-mini—almost none proactively request clarification, instead hallucinating or refusing to respond. A reinforcement learning approach showed models can be trained to ask for help, improving performance from 17.5% to 37-38%, though significant gaps remain.

April 2, 2026
model release

Alibaba releases Qwen3.6-Plus with 1M token context, claims performance near Claude 4.5 Opus

Alibaba has released Qwen3.6-Plus, its third proprietary AI model in days, featuring a 1 million token context window available via Alibaba Cloud Model Studio API. The model claims improved agentic coding capabilities and partially outperforms Anthropic's Claude 4.5 Opus in Alibaba-conducted benchmarks, though trails Claude 4.6 Opus released in December 2025.

March 26, 2026
model release

Gemini 3.1 Flash Live scores 95.9% on Big Bench Audio, Google's fastest voice model

Google has released Gemini 3.1 Flash Live, its new voice and audio AI model, scoring 95.9% on the Big Bench Audio Benchmark at high thinking levels—second only to Step-Audio R1.1 Realtime at 97.0%. Response times range from 0.96 seconds at minimal thinking to 2.98 seconds at high thinking, with pricing held at $0.35 per hour of audio input and $1.40 per hour of audio output.

March 23, 2026
model release

MiniMax M2.7 used autonomous loops to optimize its own training process

MiniMax released M2.7, a model that autonomously participated in its own development through self-optimization loops. The model ran over 100 optimization rounds on internal coding tasks, achieving a 30% performance boost, and scored 66.6% on OpenAI's MLE Bench Lite—competitive with Gemini 3.1 Pro and GPT-5.4.

March 17, 2026
model releaseOpenAI

OpenAI releases GPT-5.4 mini and nano with 3-4x price increases but major performance gains

OpenAI has released GPT-5.4 mini and GPT-5.4 nano, compact models optimized for coding and subagent tasks. The new models deliver significant performance improvements—GPT-5.4 mini reaches 54.4% on SWE-Bench Pro versus 45.7% for GPT-5 mini—but cost 3-4x more per input token than their predecessors.

February 28, 2026
benchmarkOpenAI

Frontier LLMs lose up to 33% accuracy in long conversations, study finds

Frontier language models including GPT-5.2 and Claude 4.6 experience accuracy degradation of up to 33% as conversations lengthen, according to new research. The finding suggests that extended context use within a single conversation introduces performance challenges even in state-of-the-art models.

February 23, 2026
benchmarkOpenAI

OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions

OpenAI is calling for the retirement of SWE-bench Verified, the widely-used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.