Crab+: New audio-visual model solves negative transfer problem in multimodal learning
A new audio-visual large language model called Crab+ addresses a critical problem in multimodal learning: negative transfer, where training on multiple tasks simultaneously causes performance degradation on nearly 55% of tasks. The model uses a new dataset of 222K samples and a technique called Interaction-aware LoRA to coordinate different audio-visual tasks, reversing the degradation trend to achieve positive transfer on 88% of tasks.
Researchers have published a new audio-visual large language model that directly addresses negative transfer—a phenomenon where training models on multiple tasks simultaneously causes performance to degrade on nearly 55% of tasks compared to single-task training.
The model, called Crab+, uses two key innovations to solve this problem. First, it introduces AV-UIE v2, a comprehensive audio-visual instruction-tuning dataset containing approximately 222K samples spanning 17 existing datasets and 7 distinct tasks. This dataset explicitly includes reasoning processes to help the model understand relationships between different audio-visual tasks at varying levels of granularity.
Second, the model implements Interaction-aware LoRA (I-LoRA), a technique that dynamically routes information to coordinate different audio-visual interaction patterns. Rather than using identical parameters across all tasks, I-LoRA explicitly models inter-task relationships and mitigates parameter interference—the root cause of negative transfer in joint training.
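The article does not publish I-LoRA's exact mechanism, but the description — dynamic routing across task-specific low-rank adapters over a shared frozen weight — matches a mixture-of-LoRA-experts design. The sketch below illustrates that general idea in plain Python; all names (`ILoRALayer`, `num_experts`, the softmax router) are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(r * x for r, x in zip(row, v)) for row in m]

class ILoRALayer:
    """Sketch of a routed-LoRA layer: one frozen base weight W shared by
    all tasks, plus one low-rank adapter (B @ A) per interaction pattern,
    mixed by an input-dependent router. Illustrative only."""

    def __init__(self, dim, rank, num_experts, seed=0):
        rng = random.Random(seed)
        # Frozen base weight, shared across every task.
        self.W = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
        # Per-expert low-rank factors; B is zero-initialized (standard LoRA),
        # so adapters contribute nothing before fine-tuning.
        self.A = [[[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(rank)]
                  for _ in range(num_experts)]
        self.B = [[[0.0] * rank for _ in range(dim)] for _ in range(num_experts)]
        # Router: maps the input to one logit per expert.
        self.R = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_experts)]

    def forward(self, x):
        # Softmax over router logits decides how much each adapter contributes,
        # letting different inputs (tasks) use different parameter subsets.
        logits = matvec(self.R, x)
        mx = max(logits)
        exps = [math.exp(l - mx) for l in logits]
        total = sum(exps)
        gates = [e / total for e in exps]

        out = matvec(self.W, x)
        for g, A, B in zip(gates, self.A, self.B):
            delta = matvec(B, matvec(A, x))  # low-rank update: B @ A @ x
            out = [o + g * d for o, d in zip(out, delta)]
        return out, gates
```

Because each task leans on its own adapter rather than identical shared parameters, gradient interference between dissimilar tasks is reduced — the property the article credits with reversing negative transfer.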
The Problem: Task Heterogeneity
Audio-visual scene understanding involves tasks with very different characteristics—some require detecting sounds in images, others involve temporal reasoning across video frames, and still others demand fine-grained spatial understanding. These disparate task granularities and capability demands cause tasks to interfere with each other during joint training, a problem conventional multi-task methods fail to address.
Crab+ tackles this through a unified interface that standardizes how heterogeneous tasks are formulated while allowing the model to maintain task-specific behavior through dynamic routing.
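The AV-UIE v2 schema is not published in this article, so the record below is a hypothetical illustration of what such a unified interface could look like: every task, whatever its granularity, is rendered into one text template that carries the reasoning trace the dataset is said to include. All field names and the clip identifiers are invented for the example.

```python
def format_prompt(sample):
    """Render a heterogeneous task record into a single instruction string,
    so sound localization, temporal reasoning, and spatial tasks all share
    one text interface. Template is illustrative, not the authors' format."""
    return (
        f"[TASK: {sample['task']}]\n"
        f"<video: {sample['video']}> <audio: {sample['audio']}>\n"
        f"Q: {sample['instruction']}\n"
        f"Reasoning: {sample['reasoning']}\n"
        f"A: {sample['answer']}"
    )

# Hypothetical dataset record with an explicit reasoning step.
sample = {
    "task": "audio_visual_event_localization",
    "video": "clip_0001.mp4",
    "audio": "clip_0001.wav",
    "instruction": "Which visible object is producing the sound, and when?",
    "reasoning": "The barking aligns with the dog's mouth movement at 3-5 s.",
    "answer": "A dog, between 3 and 5 seconds.",
}
```

Standardizing the surface form this way lets a single model train jointly across all seven tasks while the routing mechanism handles their differing capability demands.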
Results
According to the researchers, Crab+ successfully reverses the negative transfer trend. Multi-task learning now surpasses single-task baselines in nearly 88% of tasks, a dramatic reversal from the 55% degradation rate observed in conventional approaches. The model covers a broader range of audio-visual tasks than existing unified models while outperforming specialized models across various benchmarks.
The improvements hold across diverse audio-visual LLM architectures, suggesting the approach is generalizable rather than specific to a particular model family. Researchers validated results through in-depth visualizations of how the model coordinates different tasks.
What this means
This work addresses a fundamental challenge in multimodal AI: most practical applications require models to handle multiple related tasks, but scaling to multiple tasks has historically hurt performance. Crab+ demonstrates that explicit modeling of task relationships, combined with careful dataset design, can achieve positive transfer, where adding tasks improves overall performance. This is a practical step toward audio-visual systems that can handle real-world scene understanding at scale.