Vevo2 unifies speech and singing voice generation with prosody and style control

Researchers introduce Vevo2, a unified framework for controllable speech and singing voice generation that addresses data scarcity and enables flexible control over prosody, style, and timbre. The system uses two specialized audio tokenizers and combines auto-regressive and flow-matching models to handle both synthesis and voice conversion tasks.

Vevo2: A Unified and Controllable Framework for Speech and Singing Voice Generation

Researchers have introduced Vevo2, a framework designed to generate controllable speech and singing voices while maintaining high expressiveness and flexibility. The system addresses a core challenge in voice synthesis: the scarcity of annotated singing data and the difficulty of enabling independent control over multiple voice characteristics.

Technical Architecture

Vevo2 employs a dual-tokenizer approach to decompose voice generation into controllable components:

Audio Tokenizers:

  • A unified prosody tokenizer that captures prosody and melody from speech, singing, and instrumental audio without requiring music notation
  • A unified content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing while enabling timbre disentanglement
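
The article doesn't describe the tokenizers at the code level, but the core idea, discretizing frame-level audio features against a codebook, with a coarse codebook for prosody/melody and a finer one for content and style, can be sketched as a toy vector quantizer. The class name, codebook sizes, and feature dimensions below are illustrative assumptions, not Vevo2's actual design:

```python
import numpy as np

class ToyTokenizer:
    """Toy discrete tokenizer: maps each frame-level feature vector to the
    index of its nearest codebook entry (plain vector quantization)."""

    def __init__(self, codebook: np.ndarray):
        self.codebook = codebook  # shape (K, D): K entries of dimension D

    def encode(self, frames: np.ndarray) -> np.ndarray:
        # frames: (T, D) -> token ids: (T,)
        dists = np.linalg.norm(frames[:, None, :] - self.codebook[None, :, :], axis=-1)
        return dists.argmin(axis=1)

rng = np.random.default_rng(0)
# Hypothetical setup: a coarse codebook for prosody/melody and a larger one
# for content + style (sizes chosen arbitrarily for the sketch)
prosody_tok = ToyTokenizer(rng.normal(size=(32, 8)))
content_style_tok = ToyTokenizer(rng.normal(size=(512, 8)))

frames = rng.normal(size=(100, 8))  # 100 frames of 8-dim features
prosody_ids = prosody_tok.encode(frames)
cs_ids = content_style_tok.encode(frames)
print(prosody_ids.shape)  # (100,): one discrete token per frame
```

Because the prosody tokenizer operates directly on audio features, the same code path handles speech, singing, and instrumental input without music notation, which is the property the paper emphasizes.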

Generation Pipeline: The framework consists of two stages. An auto-regressive (AR) content-style modeling stage enables controllability over text, prosody, and style. A flow-matching acoustic modeling stage allows for independent timbre control, enabling users to modify voice characteristics after initial synthesis.
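
The two-stage wiring can be sketched with toy stand-ins for both models. The real AR stage would sample from a trained transformer and the real acoustic stage would integrate a learned flow; the functions below are assumptions that only mimic the data flow and the independent-timbre property:

```python
import numpy as np

rng = np.random.default_rng(0)

def ar_content_style(text_ids, prosody_ids, vocab=512):
    """Toy stand-in for the AR stage: combines text and prosody tokens into
    content-style tokens (a real model would sample autoregressively)."""
    T = len(prosody_ids)
    text_rep = np.resize(text_ids, T)             # stretch text to frame rate
    return (text_rep * 31 + prosody_ids) % vocab  # (T,) content-style tokens

def flow_matching_acoustic(cs_tokens, timbre_embed, steps=8):
    """Toy stand-in for the flow-matching stage: Euler-integrates from noise
    toward a target determined by the tokens and a timbre embedding."""
    target = np.outer(cs_tokens / 512.0, timbre_embed)  # (T, D) toy target
    x = rng.normal(size=target.shape)                   # start from noise
    for i in range(steps):                              # Euler steps along the flow
        x = x + (target - x) / (steps - i)              # straight path to target
    return x

text_ids = np.array([5, 9, 2])
prosody_ids = rng.integers(0, 32, size=20)
timbre_a = rng.normal(size=4)
timbre_b = rng.normal(size=4)

cs = ar_content_style(text_ids, prosody_ids)
# Same content-style tokens rendered with two different timbre embeddings:
# the timbre swap happens entirely in the second stage.
mel_a = flow_matching_acoustic(cs, timbre_a)
mel_b = flow_matching_acoustic(cs, timbre_b)
print(mel_a.shape, bool(np.allclose(mel_a, mel_b)))  # (20, 4) False
```

The point of the sketch is the factorization: stage one fixes *what* is said and *how* (text, prosody, style); stage two alone decides *who* says it, which is why timbre can be changed after initial synthesis.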

Cross-Domain Learning Strategy

A critical contribution involves bridging speech and singing during joint training. Vevo2 implements both explicit and implicit prosody learning strategies to leverage the complementary nature of speech and singing data. The researchers designed a multi-objective post-training task that integrates intelligibility and prosody similarity alignment, improving the model's ability to follow both text and prosody specifications.
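
A multi-objective post-training loss of this shape can be illustrated as a weighted sum of an intelligibility term and a prosody-similarity term. The specific losses and weights below (cross-entropy on text tokens, cosine distance on pitch contours) are plausible stand-ins, not the paper's exact formulation:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Intelligibility term: token-level cross-entropy against reference text."""
    logits = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def prosody_distance(pred_f0, ref_f0):
    """Prosody term: 1 - cosine similarity of predicted vs. reference pitch contour."""
    cos = np.dot(pred_f0, ref_f0) / (np.linalg.norm(pred_f0) * np.linalg.norm(ref_f0))
    return 1.0 - cos

def post_training_loss(logits, text_ids, pred_f0, ref_f0, w_text=1.0, w_prosody=0.5):
    # Weighted multi-objective loss; the weights are illustrative.
    return w_text * cross_entropy(logits, text_ids) + w_prosody * prosody_distance(pred_f0, ref_f0)

rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 50))          # 10 positions, 50-token vocab
text_ids = rng.integers(0, 50, size=10)
pred_f0 = rng.normal(size=20) + 5.0         # toy pitch contour
loss = post_training_loss(logits, text_ids, pred_f0, pred_f0)
print(loss > 0)  # identical contours: prosody term vanishes, cross-entropy remains
```

Training on both terms at once is what pushes the model to follow text and prosody specifications simultaneously rather than trading one off against the other.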

Demonstrated Capabilities

Experimental results show Vevo2 generalizes across multiple tasks:

  • Voice synthesis from text with controllable prosody and style
  • Voice conversion between speakers
  • Voice editing for specific characteristics
  • Both speech and singing applications
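
One way to see why a single framework covers all of these tasks: each task is just a different wiring of the same control signals (content, prosody, timbre). The router below is a hypothetical illustration of that idea, not Vevo2's API:

```python
def route_task(task, *, text=None, source_tokens=None, prosody_ref=None, timbre_ref=None):
    """Hypothetical task router: every task selects where each control
    signal comes from, then runs the same two-stage generation pipeline."""
    if task == "tts":
        # Text drives content; prosody and timbre come from references.
        return {"content": text, "prosody": prosody_ref, "timbre": timbre_ref}
    if task == "voice_conversion":
        # Keep the source utterance's content-style tokens; swap only timbre.
        return {"content": source_tokens, "prosody": source_tokens, "timbre": timbre_ref}
    if task == "editing":
        # Replace a span of content with new text; keep the rest from the source.
        return {"content": (source_tokens, text), "prosody": prosody_ref, "timbre": timbre_ref}
    raise ValueError(f"unknown task: {task}")

out = route_task("voice_conversion", source_tokens="cs_tokens", timbre_ref="target_speaker")
print(out["timbre"])  # target_speaker
```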

The unified modeling approach reportedly brings mutual benefits to speech and singing voice generation, with the framework demonstrating strong performance on synthesis, conversion, and editing tasks. Audio samples are available through the project's demo page.

What This Means

Vevo2 represents progress in controllable voice synthesis by unifying speech and singing under a single framework. The prosody tokenization without music notation is particularly significant: it reduces the annotation burden that typically constrains singing voice datasets. The explicit separation of timbre control also enables more granular voice manipulation than systems that require end-to-end fine-tuning. However, this is research-stage work; deployment considerations around computational requirements, real-time performance, and practical dataset constraints remain unaddressed in the paper.