
Meta researchers show flattened speech tokens outperform hierarchical models in Llama-Mimi

Meta researchers propose Llama-Mimi, a speech language model that flattens the multi-level RVQ tokens from a neural audio codec into a single sequence processed by a standard Transformer decoder. The approach outperforms hierarchical models on most tasks and achieves the strongest acoustic consistency among evaluated models.



Meta researchers have published findings showing that flattening speech tokens into single sequences outperforms hierarchical architectures for speech language modeling, challenging conventional approaches to handling multi-level acoustic data.

The Core Innovation

Llama-Mimi processes audio through the Mimi neural audio codec, which uses Residual Vector Quantization (RVQ) to produce multiple discrete tokens per time step. Rather than using a hierarchical architecture to manage this multi-level structure (the standard approach in speech language modeling), Llama-Mimi flattens the tokens into a single sequence and models them autoregressively with a standard Transformer decoder.
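To make the flattening concrete, here is a minimal NumPy sketch. The shapes, vocabulary size, and the per-level id offset are illustrative assumptions for a generic RVQ codec, not details taken from the paper:

```python
import numpy as np

# Hypothetical RVQ token grid: T time steps x Q quantizer levels.
# An RVQ codec emits one coarse-to-fine token per level per frame.
T, Q, vocab_size = 4, 3, 1024
rng = np.random.default_rng(0)
tokens = rng.integers(0, vocab_size, size=(T, Q))  # (time, level)

# Give each quantizer level its own id range so one Transformer
# vocabulary can tell the levels apart after flattening (an assumed
# convention, not necessarily what Llama-Mimi does).
offset_tokens = tokens + np.arange(Q) * vocab_size  # broadcast per level

# Flatten time-major: all Q tokens of frame t precede those of frame
# t+1, yielding a single autoregressive sequence of length T * Q.
flat = offset_tokens.reshape(T * Q)
```

Row-major reshaping keeps the per-frame ordering intact, so the decoder sees every level of a frame before moving to the next frame.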

This design choice mirrors a broader trend in NLP toward reducing architectural complexity: simple single-Transformer architectures often match or outperform specialized hierarchical designs while scaling better.

Performance Results

The researchers report that Llama-Mimi outperforms baseline models using Cascaded Speech Modeling (CSM) approaches on most evaluated tasks. Most significantly, the model achieves the best performance on acoustic consistency—a critical metric for maintaining audio fidelity during generation and processing.

The flattened sequence appears to give the standard Transformer attention mechanism enough context without any explicit hierarchical structure, suggesting that RVQ tokens carry the information needed for effective modeling when processed as a single unified sequence.
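A small sketch of why no extra hierarchy is needed: under a plain causal mask over the flattened sequence, each token already attends to the coarser levels of its own frame and to every earlier frame. The frame/level indexing below is an illustrative assumption, matching the time-major flattening described above:

```python
import numpy as np

# Assumed setup: T frames, Q RVQ levels, flattened time-major into one
# sequence of length T * Q.
T, Q = 4, 3
L = T * Q
causal = np.tril(np.ones((L, L), dtype=bool))  # position i sees j <= i

def pos(t, q):
    """Index of (frame t, level q) in the flattened sequence."""
    return t * Q + q

# The level-2 token of frame 1 can attend to frame 1's coarser levels...
assert causal[pos(1, 2), pos(1, 0)] and causal[pos(1, 2), pos(1, 1)]
# ...and to every token of frame 0, but not to any future frame.
assert causal[pos(1, 2), pos(0, Q - 1)]
assert not causal[pos(1, 2), pos(2, 0)]
```

So the ordinary causal mask encodes both the coarse-to-fine dependency within a frame and the temporal dependency across frames, which is the context a hierarchical model would otherwise supply architecturally.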

Research Implications

The work indicates that speech language models may have carried unnecessary architectural complexity. By exploiting the sequential properties of flattened RVQ tokens, Llama-Mimi demonstrates that a simpler, unified architecture can handle multi-level acoustic information effectively.

This aligns with observations from text modeling where architectural innovations have consistently moved toward unification rather than specialization. The results suggest this principle extends to multimodal and speech domains.

Availability and Reproducibility

Meta has released the paper, model code, and speech sample outputs publicly, enabling the research community to verify findings and build on the approach.

What This Means

Llama-Mimi demonstrates that speech language modeling doesn't require specialized hierarchical designs to handle multi-level token representations. This could simplify future speech model development and reduce the gap between speech and text language model architectures. The approach may prove particularly valuable for scaling speech capabilities alongside text in unified multimodal systems, though practical deployment implications remain to be explored.
