AI models guess instead of asking for help, ProactiveBench study shows
Researchers have introduced ProactiveBench, a benchmark testing whether multimodal language models ask for help when visual information is missing. Of 22 models tested, including GPT-4.1, GPT-5.2, and o4-mini, almost none proactively requested clarification, instead hallucinating answers or refusing to respond. A reinforcement learning approach showed that models can be trained to ask for help, improving accuracy from 17.5% to 37-38%, though significant gaps remain.
Researchers have exposed a fundamental failure mode in multimodal language models: when visual information is unavailable or obstructed, they hallucinate answers instead of requesting clarification. The new ProactiveBench benchmark quantifies this problem across 22 production AI models.
What ProactiveBench Tests
The benchmark combines datasets from seven sources, creating 18,000 test samples with over 108,000 images designed to be impossible to solve without human input. Tasks include identifying hidden objects, cleaning noisy images, interpreting sketches, and requesting different camera angles. The benchmark automatically filters out any task a model can solve on the first attempt—passing requires proactively asking for more information.
ProactiveBench covers six scenario types, drawn from seven datasets:
- Occluded objects (ROD, VSOD datasets)
- Uninformative viewpoints (MVP-N)
- Noisy images (ImageNet-C)
- Sketches (QuickDraw)
- Temporal ambiguities (ChangeIt)
- Camera movements (MS-COCO)
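The filtering step described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not the authors' pipeline: `model_answer` and `is_correct` stand in for whatever inference and scoring code the benchmark actually uses.

```python
# Hypothetical sketch of ProactiveBench's filtering step: tasks a model
# already solves on the first attempt are discarded, so every remaining
# task can only be passed by proactively asking for more information.
# `model_answer` and `is_correct` are illustrative stand-ins.

def filter_tasks(tasks, model_answer, is_correct):
    """Keep only tasks the model cannot solve without human help."""
    kept = []
    for task in tasks:
        first_try = model_answer(task)       # single-shot attempt, no help
        if not is_correct(task, first_try):  # unsolvable alone -> keep it
            kept.append(task)
    return kept

# Toy example: the model only knows the answer to task "a".
tasks = ["a", "b", "c"]
answers = {"a": "cat"}
kept = filter_tasks(
    tasks,
    model_answer=lambda t: answers.get(t),
    is_correct=lambda t, pred: pred == "cat",
)
print(kept)  # tasks "b" and "c" survive the filter
```

The point of the filter is that a model cannot score on the surviving tasks by guessing well; the only winning move is to ask.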
The Results: Catastrophic Performance Drop
When objects are clearly visible, tested models average 79.8% accuracy. On ProactiveBench tasks requiring help-seeking, accuracy plummets to 17.5%, a drop of more than 60 percentage points.
The ROD dataset reveals the starkest failure: accuracy crashes from 98.3% (visible objects) to 8.2% when objects are hidden behind blocks. Models recognize objects fine; they simply never consider asking someone to uncover them.
Model size provides no advantage. InternVL3-1B actually outperforms InternVL3-8B (27.1% vs 12.7%). Older LLaVA-1.5-7B beats much newer LLaVA-OV-72B (24.8% vs 13%). The underlying language model matters significantly: LLaVA-NeXT with Vicuna scores 19.3%, while the same architecture with Mistral manages just 4.5%.
Closed-source models performed better, with GPT-4.1 and GPT-5.2 posting the highest accuracy numbers, though researchers flag unusually high COCO scores as possible data contamination.
"Proactivity" Is Actually Guessing
When researchers replaced valid help-seeking suggestions with nonsensical ones (e.g., "Rewind the video" for sketching tasks), models selected the bogus options just as readily. LLaVA-NeXT Vicuna increased selection rates from 37% to 49% when given invalid choices.
What looks like proactivity is merely a lower threshold for guessing, not genuine comprehension: even comparatively proactive models like LLaVA-OV-0.5B and InternVL3-1B pick nonsensical suggestions just as readily as valid ones.
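The control described above, swapping valid help-seeking options for mismatched ones and measuring how often models still select them, can be sketched as a simple evaluation loop. `choose_option` is a hypothetical stand-in for the model under test; a model that genuinely understands when help is useful should pick invalid options close to 0% of the time.

```python
# Hedged sketch of the invalid-suggestion control: replace the valid
# help request (e.g. "Remove the occluder") with a mismatched one
# (e.g. "Rewind the video" on a sketching task) and count how often
# the model still selects it. `choose_option` is hypothetical.

def invalid_selection_rate(samples, choose_option):
    """Fraction of samples where the model picks a nonsensical suggestion."""
    picks = 0
    for sample in samples:
        options = ["answer directly"] + sample["invalid_suggestions"]
        if choose_option(sample["image"], options) != "answer directly":
            picks += 1
    return picks / len(samples)

# Toy model that always grabs a suggestion, valid or not.
samples = [
    {"image": None, "invalid_suggestions": ["Rewind the video"]},
    {"image": None, "invalid_suggestions": ["Rewind the video"]},
]
always_guesses = lambda image, options: options[-1]
rate = invalid_selection_rate(samples, always_guesses)  # 1.0 for this toy model
```

A selection rate that stays high (or, as with LLaVA-NeXT Vicuna, rises) under invalid options indicates the model is pattern-matching the presence of suggestions, not reasoning about their usefulness.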
Prompt hints and conversation histories offered little help. Hints pushed accuracy to 25.8%, but in 16% of cases models spammed proactive suggestions until hitting the maximum number of allowed steps. Conversation histories actually degraded performance: models parroted earlier proactive actions rather than learning from them.
Reinforcement Learning Provides a Path Forward
The study's only bright spot: proactivity can be trained in. Researchers fine-tuned LLaVA-NeXT-Mistral-7B and Qwen2.5-VL-3B using Group-Relative Policy Optimization (GRPO) on ~27,000 examples, with reward functions prioritizing correct predictions over help requests.
After training, both models beat every previously tested model, including o4-mini:
- LLaVA-NeXT-Mistral-7B: 37.4%
- Qwen2.5-VL-3B: 38.6%
- o4-mini (baseline): 34.0%
The trained behavior generalized beyond the training data: on ChangeIt, Qwen2.5-VL-3B jumped from 12.4% to 55.6% accuracy. Reward balance proved critical, however: when proactive suggestions earned the same reward as correct answers, models spammed help requests and accuracy collapsed to 5.4%.
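The reward asymmetry the study depends on can be made concrete with a small sketch. The exact values here are hypothetical, not the paper's; the property that matters is that a correct answer must earn strictly more reward than merely issuing a help request, otherwise a reward-maximizing policy degenerates into spamming requests.

```python
# Illustrative reward shaping for the GRPO training described above.
# The specific numbers are assumptions; only the ordering
# (correct answer > help request > wrong answer) reflects the study.

def reward(outcome, help_bonus=0.3):
    """Asymmetric reward: correct answers must dominate help requests."""
    if outcome == "correct_answer":
        return 1.0
    if outcome == "help_request":
        return help_bonus  # useful, but never as good as solving the task
    return 0.0             # wrong answer or refusal

# With help_bonus=1.0 (equal reward), asking for help is never worse than
# answering, so the policy learns to spam help requests instead of solving.
assert reward("correct_answer") > reward("help_request")
```

This matches the study's failure case: equalizing the two rewards made help requests a risk-free action, and task accuracy collapsed.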
Even trained models lag significantly behind reference performance (40.7% vs 75.1%).
Broader Pattern: Uncertainty Handling Failure
ProactiveBench exposes a recurring problem in recent multimodal research. Moonshot AI's WorldVQA benchmark found top-tier models top out around 50% in visual object recognition. Stanford research documented the "Mirage effect": GPT-5 and Gemini 3 Pro confidently described visual details and provided medical diagnoses even when no image was supplied, achieving 70-80% of normal performance from text patterns alone.
Other studies confirm models cannot reliably gauge their own limits, while research using the "Spilled Energy" method suggests hallucinations leave measurable traces in model computations.
What This Means
ProactiveBench establishes that scaling model size does not fix uncertainty handling. Closed-source models show marginal advantages but remain far from human-level performance. The reinforcement learning results prove proactivity is learnable, but the approach requires precise reward calibration and remains computationally expensive relative to the gains achieved.
The research suggests the gap between current multimodal models and human reasoning on partial information represents a fundamental architectural challenge rather than a simple training data problem. Without dedicated training approaches, deployed models will continue hallucinating under uncertainty rather than flagging knowledge limitations.