Crab+: New audio-visual model solves negative transfer problem in multimodal learning
A new audio-visual large language model called Crab+ addresses a critical problem in multimodal learning: negative transfer, where training on multiple tasks simultaneously causes performance degradation on nearly 55% of tasks. The model uses a new dataset of 222K samples and a technique called Interaction-aware LoRA to coordinate different audio-visual tasks, reversing the degradation trend to achieve positive transfer on 88% of tasks.
Researchers have published a new audio-visual large language model that directly addresses negative transfer—a phenomenon where training models on multiple tasks simultaneously causes performance to degrade on nearly 55% of tasks compared to single-task training.
The model, called Crab+, uses two key innovations to solve this problem. First, it introduces AV-UIE v2, a comprehensive audio-visual instruction-tuning dataset containing approximately 222K samples spanning 17 existing datasets and 7 distinct tasks. This dataset explicitly includes reasoning processes to help the model understand relationships between different audio-visual tasks at varying levels of granularity.
Second, the model implements Interaction-aware LoRA (I-LoRA), a technique that dynamically routes information to coordinate different audio-visual interaction patterns. Rather than using identical parameters across all tasks, I-LoRA explicitly models inter-task relationships and mitigates parameter interference—the root cause of negative transfer in joint training.
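The article does not publish I-LoRA's exact mechanism, but the description — dynamic routing across task-specific low-rank adapters over a shared frozen weight — matches a mixture-of-LoRA-experts design. The sketch below illustrates that general idea in plain Python; all names (`ILoRALayer`, `num_experts`, the softmax router) are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(r * x for r, x in zip(row, v)) for row in m]

class ILoRALayer:
    """Sketch of a routed-LoRA layer: one frozen base weight W shared by
    all tasks, plus one low-rank adapter (B @ A) per interaction pattern,
    mixed by an input-dependent router. Illustrative only."""

    def __init__(self, dim, rank, num_experts, seed=0):
        rng = random.Random(seed)
        # Frozen base weight, shared across every task.
        self.W = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
        # Per-expert low-rank factors; B is zero-initialized (standard LoRA),
        # so adapters contribute nothing before fine-tuning.
        self.A = [[[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(rank)]
                  for _ in range(num_experts)]
        self.B = [[[0.0] * rank for _ in range(dim)] for _ in range(num_experts)]
        # Router: maps the input to one logit per expert.
        self.R = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_experts)]

    def forward(self, x):
        # Softmax over router logits decides how much each adapter contributes,
        # letting different inputs (tasks) use different parameter subsets.
        logits = matvec(self.R, x)
        mx = max(logits)
        exps = [math.exp(l - mx) for l in logits]
        total = sum(exps)
        gates = [e / total for e in exps]

        out = matvec(self.W, x)
        for g, A, B in zip(gates, self.A, self.B):
            delta = matvec(B, matvec(A, x))  # low-rank update: B @ A @ x
            out = [o + g * d for o, d in zip(out, delta)]
        return out, gates
```

Because each task leans on its own adapter rather than identical shared parameters, gradient interference between dissimilar tasks is reduced — the property the article credits with reversing negative transfer.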
The Problem: Task Heterogeneity
Audio-visual scene understanding involves tasks with very different characteristics—some require detecting sounds in images, others involve temporal reasoning across video frames, and still others demand fine-grained spatial understanding. These disparate task granularities and capability demands cause tasks to interfere with each other during joint training, a problem conventional multi-task methods fail to address.
Crab+ tackles this through a unified interface that standardizes how heterogeneous tasks are formulated while allowing the model to maintain task-specific behavior through dynamic routing.
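The AV-UIE v2 schema is not published in this article, so the record below is a hypothetical illustration of what such a unified interface could look like: every task, whatever its granularity, is rendered into one text template that carries the reasoning trace the dataset is said to include. All field names and the clip identifiers are invented for the example.

```python
def format_prompt(sample):
    """Render a heterogeneous task record into a single instruction string,
    so sound localization, temporal reasoning, and spatial tasks all share
    one text interface. Template is illustrative, not the authors' format."""
    return (
        f"[TASK: {sample['task']}]\n"
        f"<video: {sample['video']}> <audio: {sample['audio']}>\n"
        f"Q: {sample['instruction']}\n"
        f"Reasoning: {sample['reasoning']}\n"
        f"A: {sample['answer']}"
    )

# Hypothetical dataset record with an explicit reasoning step.
sample = {
    "task": "audio_visual_event_localization",
    "video": "clip_0001.mp4",
    "audio": "clip_0001.wav",
    "instruction": "Which visible object is producing the sound, and when?",
    "reasoning": "The barking aligns with the dog's mouth movement at 3-5 s.",
    "answer": "A dog, between 3 and 5 seconds.",
}
```

Standardizing the surface form this way lets a single model train jointly across all seven tasks while the routing mechanism handles their differing capability demands.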
Results
According to the researchers, Crab+ successfully reverses the negative transfer trend. Multi-task learning now surpasses single-task baselines in nearly 88% of tasks, a dramatic reversal from the 55% degradation rate observed in conventional approaches. The model covers a broader range of audio-visual tasks than existing unified models while outperforming specialized models across various benchmarks.
The improvements hold across diverse audio-visual LLM architectures, suggesting the approach is generalizable rather than specific to a particular model family. Researchers validated results through in-depth visualizations of how the model coordinates different tasks.
What this means
This work addresses a fundamental challenge in multimodal AI: most practical applications require models to handle multiple related tasks, but scaling to multiple tasks has historically hurt performance. Crab+ demonstrates that explicit modeling of task relationships, combined with careful dataset design, can achieve positive transfer, where adding tasks improves overall performance. This is a practical step toward audio-visual systems that can handle real-world scene understanding at scale.