Perception-R1 uses visual reward signals to improve multimodal AI reasoning

Researchers propose Perception-R1, a method that adds visual perception reward signals to reinforcement learning training for multimodal AI models. The approach achieves state-of-the-art results on multiple reasoning benchmarks using just 1,442 training examples by explicitly teaching models to accurately perceive visual content before reasoning about it.

Researchers have identified and addressed a fundamental gap in how multimodal large language models (MLLMs) learn to reason: existing reinforcement learning approaches fail to improve the visual perception capabilities that form the foundation for complex reasoning tasks.

The new method, Perception-R1, introduces a visual perception reward signal during training. Rather than only rewarding correct final answers, the system explicitly rewards MLLMs for accurately perceiving and describing visual content, treating perception as a prerequisite skill that must be mastered before reasoning.

How It Works

The approach operates in two stages:

  1. Annotation collection: The researchers extract textual descriptions of visual content from chain-of-thought (CoT) trajectories in existing multimodal problems. These descriptions serve as reference annotations for what the model should perceive.

  2. Reward assignment: During training, a judging LLM compares the visual annotations from the model's responses against the reference annotations. Consistency between the two determines the visual perception reward score.
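The two-stage scheme above amounts to combining a perception score with the usual answer reward. A minimal, runnable sketch of that idea follows; the function names, the token-overlap stand-in for the judging LLM, and the 0.5 weighting are all assumptions for illustration, not details from the paper:

```python
# Hypothetical sketch of a dual-signal reward, not the authors' implementation.
# In Perception-R1 a judging LLM scores consistency between the model's visual
# description and the reference annotation; here that judge is stubbed with
# simple token overlap so the example runs standalone.

def judge_consistency(model_description: str, reference: str) -> float:
    """Stand-in for the judging LLM: fraction of reference tokens
    that also appear in the model's visual description."""
    ref_tokens = set(reference.lower().split())
    model_tokens = set(model_description.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & model_tokens) / len(ref_tokens)

def total_reward(model_description: str, reference: str,
                 answer_correct: bool, w_perception: float = 0.5) -> float:
    """Combine the visual perception reward with a verifiable answer reward."""
    r_perception = judge_consistency(model_description, reference)
    r_answer = 1.0 if answer_correct else 0.0
    return w_perception * r_perception + (1.0 - w_perception) * r_answer

reward = total_reward(
    model_description="a red cube left of a blue sphere",
    reference="red cube to the left of a blue sphere",
    answer_correct=True,
)
```

The key property is that a model which answers correctly while misdescribing the image still loses reward, so accurate perception is reinforced as its own objective rather than only indirectly through final answers.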

This dual-signal approach, rewarding both accurate perception and correct reasoning, proved more effective than existing reinforcement learning with verifiable rewards (RLVR) methods, an advantage the researchers validated with McNemar's statistical test.
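McNemar's test compares two methods evaluated on the same benchmark items by looking only at the cases where they disagree. A self-contained sketch of the standard chi-square form of the test (the disagreement counts below are invented for illustration and do not come from the paper):

```python
# McNemar's test on paired benchmark results (invented counts, for illustration).
from math import sqrt
from statistics import NormalDist

def mcnemar_statistic(b: int, c: int) -> float:
    """Chi-square statistic for McNemar's test.
    b: items method A answered correctly and method B missed.
    c: items method B answered correctly and method A missed.
    Concordant items (both right or both wrong) do not enter the test."""
    return (b - c) ** 2 / (b + c)

# Hypothetical disagreement counts between a new method and a baseline:
b, c = 40, 15
chi2 = mcnemar_statistic(b, c)

# p-value for a chi-square variable with 1 degree of freedom, computed via
# the normal distribution (chi2(1) is the square of a standard normal):
p_value = 2.0 * (1.0 - NormalDist().cdf(sqrt(chi2)))
```

With these made-up counts the statistic is about 11.4 and the p-value falls well below 0.01, which is the kind of evidence the test provides that one method's accuracy gain over another is not due to chance on a shared test set.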

Performance Results

Perception-R1 achieves state-of-the-art performance on multiple multimodal reasoning benchmarks using only 1,442 training examples. The researchers note this is a minimal training set, suggesting the method is data-efficient and the visual perception reward signal is particularly effective at guiding model improvement.

The study directly challenges the assumption that reasoning models only need final-answer rewards. By separating perception from reasoning as distinct training objectives, Perception-R1 provides a more granular learning signal that appears to unlock better performance across benchmark tasks.

What This Means

The research suggests that multimodal reasoning improvements require explicitly training perception skills first. This two-stage cognitive framework mirrors how humans process visual information: we perceive details before reasoning about them.

For practitioners building multimodal systems, the result indicates that reward design matters significantly. Adding perception-specific signals to training pipelines may outperform simpler end-to-end reward approaches. The small training dataset size also suggests the method could scale efficiently to improve existing models without massive computational investment.

Code and datasets are planned for release on GitHub.