Research proposes MoD-DPO to reduce cross-modal hallucinations in multimodal LLMs

Researchers have introduced Modality-Decoupled Direct Preference Optimization (MoD-DPO), a framework designed to reduce cross-modal hallucinations in omni-modal large language models. The method adds modality-aware regularization to enforce sensitivity to relevant modalities while reducing reliance on spurious correlations, showing consistent improvements across audiovisual benchmarks.

New Approach Targets Cross-Modal Hallucinations in Multimodal AI Models

A research paper published on arXiv presents MoD-DPO (Modality-Decoupled Direct Preference Optimization), a training framework designed to reduce hallucinations in omni-modal large language models—systems that process audio, visual, and textual information simultaneously.

The Problem

Omni-modal LLMs, while achieving strong performance on audiovisual understanding tasks, suffer from cross-modal hallucinations: factually incorrect outputs caused by spurious correlations between modalities or by over-reliance on language priors. For example, a model might generate an inaccurate description when the audio contradicts the visual input, or simply default to text-based reasoning regardless of what the other modalities show.

The Solution

MoD-DPO addresses this through two primary mechanisms:

  1. Modality-Aware Regularization: The framework introduces explicit penalty terms that enforce two behaviors simultaneously: invariance to corruptions in irrelevant modalities (the model should ignore noise in unrelated data streams) and sensitivity to perturbations in relevant modalities (the model should respond to meaningful changes in needed data streams).

  2. Language-Prior Debiasing Penalty: A secondary component discourages the model from relying too heavily on text-only responses, directly penalizing hallucination-prone behavior that ignores audio or visual inputs.
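The two mechanisms can be sketched in plain Python over per-example log-probabilities of the preferred answer under different input conditions. This is an illustrative reading of the behaviors described above, not the paper's exact loss; the function name, the hinge formulation, and the margin value are assumptions.

```python
def modality_penalties(logp_clean: float,
                       logp_irrelevant_corrupt: float,
                       logp_relevant_corrupt: float,
                       logp_text_only: float,
                       margin: float = 1.0) -> dict:
    """Illustrative per-example regularization terms (hypothetical names).

    Each argument is the model's log-probability for the preferred answer
    under a different input condition: clean inputs, inputs with an
    irrelevant modality corrupted, inputs with a relevant modality
    corrupted, and text-only inputs.
    """
    # Invariance: corrupting an irrelevant modality should not move the
    # model's score, so penalize any squared deviation from the clean score.
    invariance = (logp_clean - logp_irrelevant_corrupt) ** 2

    # Sensitivity: corrupting a relevant modality SHOULD lower the score
    # by at least `margin`, so hinge on the size of the score drop.
    drop = logp_clean - logp_relevant_corrupt
    sensitivity = max(0.0, margin - drop)

    # Language-prior debiasing: the full multimodal input should score the
    # answer noticeably higher than text alone; penalize a small gain.
    gain = logp_clean - logp_text_only
    debias = max(0.0, margin - gain)

    return {"invariance": invariance,
            "sensitivity": sensitivity,
            "debias": debias}
```

A well-behaved model scores zero on all three terms: it is unmoved by irrelevant corruption, loses confidence when a relevant modality is corrupted, and relies on more than text alone.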

Results and Evaluation

The researchers tested MoD-DPO across multiple audiovisual hallucination benchmarks. According to the paper, the method:

  • Consistently improved perception accuracy across tested benchmarks
  • Demonstrated stronger hallucination resistance compared to previous preference optimization approaches
  • Achieved these improvements under training budgets comparable to the baselines, suggesting the method is computationally efficient
  • Outperformed baseline methods for preference optimization

The approach operates within the direct preference optimization (DPO) framework, an established technique for aligning models with human preferences without requiring reinforcement learning.
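For reference, the standard DPO objective that MoD-DPO builds on scores a preference pair by the gap in log-probability ratios between the trained policy and a frozen reference model. The minimal sketch below shows only that base loss for a single pair; the paper's added regularization terms are not included, and the function name and choice of beta are illustrative.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair (base objective only).

    Arguments are log-probabilities of the chosen and rejected responses
    under the policy being trained and under a frozen reference model.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected

    # The loss is -log(sigmoid(beta * reward gap)); it shrinks as the
    # policy prefers the chosen response more strongly than the reference.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy matches the reference on both responses, the loss is log 2 (the model is indifferent); it decreases as the policy pulls the chosen response ahead of the rejected one.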

Broader Implications

The research emphasizes the importance of "modality-faithful alignment"—ensuring that multimodal models weight each data stream appropriately rather than defaulting to dominant modalities. This appears particularly relevant as omni-modal systems become more common in AI development. The scalability of the approach suggests it could be applied to larger foundation models beyond the experimental scope.

What This Means

Cross-modal hallucinations are a tangible reliability problem for multimodal AI systems deployed in production, and MoD-DPO offers a practical, computationally efficient training technique to address them. The work suggests that careful regularization during preference optimization, rather than fundamental architectural changes, can meaningfully improve model reliability. As multimodal models see wider deployment in applications requiring accurate perception across multiple data types (accessibility tools, robotics, autonomous systems), techniques like this become increasingly important for trustworthiness.