AI models guess instead of asking for help, ProactiveBench study shows
Researchers have introduced ProactiveBench, a benchmark testing whether multimodal language models ask for help when visual information is missing. Of 22 models tested, including GPT-4.1, GPT-5.2, and o4-mini, almost none proactively requested clarification, instead hallucinating answers or refusing to respond. A reinforcement learning approach showed that models can be trained to ask for help, improving accuracy from 17.5% to 37-38%, though significant gaps remain.
Researchers have exposed a fundamental failure mode in multimodal language models: when visual information is unavailable or obstructed, they hallucinate answers instead of requesting clarification. The new ProactiveBench benchmark quantifies this problem across 22 production AI models.
What ProactiveBench Tests
The benchmark combines datasets from seven sources, creating 18,000 test samples with over 108,000 images designed to be impossible to solve without human input. Tasks include identifying hidden objects, cleaning noisy images, interpreting sketches, and requesting different camera angles. The benchmark automatically filters out any task a model can solve on the first attempt—passing requires proactively asking for more information.
ProactiveBench covers six scenario types, drawn from seven datasets:
- Occluded objects (ROD, VSOD datasets)
- Uninformative viewpoints (MVP-N)
- Noisy images (ImageNet-C)
- Sketches (QuickDraw)
- Temporal ambiguities (ChangeIt)
- Camera movements (MS-COCO)
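The filtering step described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not the authors' pipeline: `model_answer` and `is_correct` stand in for whatever inference and scoring code the benchmark actually uses.

```python
# Hypothetical sketch of ProactiveBench's filtering step: tasks a model
# already solves on the first attempt are discarded, so every remaining
# task can only be passed by proactively asking for more information.
# `model_answer` and `is_correct` are illustrative stand-ins.

def filter_tasks(tasks, model_answer, is_correct):
    """Keep only tasks the model cannot solve without human help."""
    kept = []
    for task in tasks:
        first_try = model_answer(task)       # single-shot attempt, no help
        if not is_correct(task, first_try):  # unsolvable alone -> keep it
            kept.append(task)
    return kept

# Toy example: the model only knows the answer to task "a".
tasks = ["a", "b", "c"]
answers = {"a": "cat"}
kept = filter_tasks(
    tasks,
    model_answer=lambda t: answers.get(t),
    is_correct=lambda t, pred: pred == "cat",
)
print(kept)  # tasks "b" and "c" survive the filter
```

The point of the filter is that a model cannot score on the surviving tasks by guessing well; the only winning move is to ask.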
The Results: Catastrophic Performance Drop
When objects are clearly visible, tested models average 79.8% accuracy. On ProactiveBench tasks requiring help-seeking, accuracy plummets to 17.5%, a drop of more than 60 percentage points.
The ROD dataset reveals the starkest failure: accuracy crashes from 98.3% (visible objects) to 8.2% when objects are hidden behind blocks. Models recognize objects fine; they simply never consider asking someone to uncover them.
Model size provides no advantage. InternVL3-1B actually outperforms InternVL3-8B (27.1% vs 12.7%). Older LLaVA-1.5-7B beats much newer LLaVA-OV-72B (24.8% vs 13%). The underlying language model matters significantly: LLaVA-NeXT with Vicuna scores 19.3%, while the same architecture with Mistral manages just 4.5%.
Closed-source models performed better, with GPT-4.1 and GPT-5.2 posting the highest accuracy numbers, though researchers flag unusually high COCO scores as possible data contamination.
"Proactivity" Is Actually Guessing
When researchers replaced valid help-seeking suggestions with nonsensical ones (e.g., "Rewind the video" for sketching tasks), models selected the bogus options just as readily. LLaVA-NeXT Vicuna increased selection rates from 37% to 49% when given invalid choices.
What looks like proactivity is merely a lower threshold for guessing, not genuine comprehension: even comparatively proactive models like LLaVA-OV-0.5B and InternVL3-1B pick nonsensical suggestions just as readily as valid ones.
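The control described above, swapping valid help-seeking options for mismatched ones and measuring how often models still select them, can be sketched as a simple evaluation loop. `choose_option` is a hypothetical stand-in for the model under test; a model that genuinely understands when help is useful should pick invalid options close to 0% of the time.

```python
# Hedged sketch of the invalid-suggestion control: replace the valid
# help request (e.g. "Remove the occluder") with a mismatched one
# (e.g. "Rewind the video" on a sketching task) and count how often
# the model still selects it. `choose_option` is hypothetical.

def invalid_selection_rate(samples, choose_option):
    """Fraction of samples where the model picks a nonsensical suggestion."""
    picks = 0
    for sample in samples:
        options = ["answer directly"] + sample["invalid_suggestions"]
        if choose_option(sample["image"], options) != "answer directly":
            picks += 1
    return picks / len(samples)

# Toy model that always grabs a suggestion, valid or not.
samples = [
    {"image": None, "invalid_suggestions": ["Rewind the video"]},
    {"image": None, "invalid_suggestions": ["Rewind the video"]},
]
always_guesses = lambda image, options: options[-1]
rate = invalid_selection_rate(samples, always_guesses)  # 1.0 for this toy model
```

A selection rate that stays high (or, as with LLaVA-NeXT Vicuna, rises) under invalid options indicates the model is pattern-matching the presence of suggestions, not reasoning about their usefulness.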
Prompt hints and conversation histories offered little help. Hints pushed accuracy to 25.8%, but in 16% of cases models spammed proactive suggestions until hitting the maximum number of allowed steps. Conversation histories actually degraded performance: models parroted earlier proactive actions rather than learning from them.
Reinforcement Learning Provides a Path Forward
The study's only bright spot: proactivity can be trained in. Researchers fine-tuned LLaVA-NeXT-Mistral-7B and Qwen2.5-VL-3B using Group-Relative Policy Optimization (GRPO) on ~27,000 examples, with reward functions prioritizing correct predictions over help requests.
After training, both models beat every previously tested model, including o4-mini:
- LLaVA-NeXT-Mistral-7B: 37.4%
- Qwen2.5-VL-3B: 38.6%
- o4-mini (baseline): 34.0%
The trained behavior generalized beyond the training data: on ChangeIt, Qwen2.5-VL-3B jumped from 12.4% to 55.6% accuracy. Reward balance proved critical, however: when proactive suggestions earned the same reward as correct answers, models spammed help requests and accuracy collapsed to 5.4%.
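The reward asymmetry the study depends on can be made concrete with a small sketch. The exact values here are hypothetical, not the paper's; the property that matters is that a correct answer must earn strictly more reward than merely issuing a help request, otherwise a reward-maximizing policy degenerates into spamming requests.

```python
# Illustrative reward shaping for the GRPO training described above.
# The specific numbers are assumptions; only the ordering
# (correct answer > help request > wrong answer) reflects the study.

def reward(outcome, help_bonus=0.3):
    """Asymmetric reward: correct answers must dominate help requests."""
    if outcome == "correct_answer":
        return 1.0
    if outcome == "help_request":
        return help_bonus  # useful, but never as good as solving the task
    return 0.0             # wrong answer or refusal

# With help_bonus=1.0 (equal reward), asking for help is never worse than
# answering, so the policy learns to spam help requests instead of solving.
assert reward("correct_answer") > reward("help_request")
```

This matches the study's failure case: equalizing the two rewards made help requests a risk-free action, and task accuracy collapsed.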
Even trained models lag significantly behind reference performance (40.7% vs 75.1%).
Broader Pattern: Uncertainty Handling Failure
ProactiveBench exposes a recurring problem in recent multimodal research. Moonshot AI's WorldVQA benchmark found top-tier models top out around 50% in visual object recognition. Stanford research documented the "Mirage effect": GPT-5 and Gemini 3 Pro confidently described visual details and provided medical diagnoses even when no image was supplied, achieving 70-80% of normal performance from text patterns alone.
Other studies confirm models cannot reliably gauge their own limits, while research using the "Spilled Energy" method suggests hallucinations leave measurable traces in model computations.
What This Means
ProactiveBench establishes that scaling model size does not fix uncertainty handling. Closed-source models show marginal advantages but remain far from human-level performance. The reinforcement learning results prove proactivity is learnable, but the approach requires precise reward calibration and remains computationally expensive relative to the gains achieved.
The research suggests the gap between current multimodal models and human reasoning on partial information represents a fundamental architectural challenge rather than a simple training data problem. Without dedicated training approaches, deployed models will continue hallucinating under uncertainty rather than flagging knowledge limitations.