
New benchmark evaluates music reward models trained on text, lyrics, and audio

Researchers have released CMI-RewardBench, a comprehensive evaluation framework for music reward models that handle mixed text, lyrics, and audio inputs. The benchmark includes 110,000 pseudo-labeled samples and human-annotated data, along with publicly available reward models designed for fine-grained music generation alignment.



A new research paper introduces CMI-RewardBench, a unified evaluation framework for assessing music reward models that process compositional multimodal inputs—combinations of text descriptions, lyrics, and reference audio.

The benchmark addresses a critical gap in music AI evaluation. While music generation models have advanced to handle complex multimodal conditioning, systematic evaluation mechanisms have not kept pace. CMI-RewardBench fills this gap by providing standardized metrics and test sets.

Dataset and Resources

The research introduces two key datasets:

  • CMI-Pref-Pseudo: 110,000 pseudo-labeled preference samples for large-scale training
  • CMI-Pref: High-quality, human-annotated corpus for fine-grained alignment evaluation

All training data, benchmarks, and trained reward models are publicly available.
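Preference datasets like these typically pair each multimodal prompt with a preferred and a rejected generation, and reward models are commonly trained with a pairwise (Bradley-Terry) loss. The sketch below illustrates that setup; the field names and the loss choice are assumptions for illustration, not details taken from the paper.

```python
import math

# Hypothetical schema for one preference sample (field names are
# assumptions, not the dataset's actual format).
sample = {
    "text": "upbeat jazz piano",          # text description condition
    "lyrics": "la la la",                 # optional lyric condition
    "ref_audio": "ref_001.wav",           # optional reference-audio condition
    "chosen": "gen_a.wav",                # preferred generation
    "rejected": "gen_b.wav",              # dispreferred generation
}

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the chosen sample
    above the rejected one, and grows as the ranking inverts.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the chosen sample outranks the rejected one.
good = bradley_terry_loss(2.0, 0.5)
bad = bradley_terry_loss(0.5, 2.0)
```

This is the same pairwise objective used throughout the RLHF reward-modeling literature; whether CMI-RMs use exactly this loss is not stated in the summary above.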

Benchmark Design

CMI-RewardBench evaluates reward models across three dimensions:

  1. Musicality: Overall music quality and coherence
  2. Text-music alignment: Fidelity to text-based descriptions
  3. Compositional instruction alignment: Adherence to combined text, lyric, and audio conditions

The benchmark's test set contains heterogeneous samples spanning these three categories, enabling a comprehensive assessment of reward model performance under each conditioning type.

Reward Model Architecture

The team developed CMI reward models (CMI-RMs), a parameter-efficient family capable of processing heterogeneous multimodal inputs. These models correlate strongly with human judgment scores on both musicality and alignment tasks when evaluated on CMI-Pref and existing datasets.

Additionally, experiments demonstrate that CMI-RM enables effective inference-time scaling through top-k filtering—a technique for selecting higher-quality outputs during generation.
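Inference-time scaling of this kind is usually a best-of-N loop: sample several candidates from the generator, score each with the reward model, and keep only the top-k. The sketch below shows the control flow with stand-in functions; the generator and scorer are placeholders, not the paper's actual models.

```python
def generate_candidates(prompt: str, n: int) -> list[str]:
    # Stand-in for a music generation model sampling n candidates.
    # (A real system would call the generator with the multimodal prompt.)
    return [f"{prompt}-candidate-{i}" for i in range(n)]

def reward(candidate: str) -> float:
    # Stand-in for a CMI-RM score; a deterministic toy score for the demo.
    return sum(ord(c) for c in candidate) % 100

def top_k_filter(prompt: str, n: int = 16, k: int = 4) -> list[str]:
    """Inference-time scaling via top-k filtering: sample n candidates,
    score each with the reward model, and keep the k highest-scoring."""
    candidates = generate_candidates(prompt, n)
    ranked = sorted(candidates, key=reward, reverse=True)
    return ranked[:k]

best = top_k_filter("jazz piano, upbeat", n=16, k=4)
```

The trade-off is straightforward: larger n spends more generator compute per output in exchange for higher expected reward among the survivors, which is why a well-calibrated reward model makes this kind of scaling effective.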

What This Means

As music generation models become more sophisticated in handling mixed conditioning signals, systematic evaluation becomes essential for development and deployment. This benchmark provides researchers and practitioners with standardized metrics to assess alignment between generated music and user intent across multiple modalities. The public release of datasets and models removes evaluation barriers for the research community, potentially accelerating progress in multimodal music generation.
