Vevo2 unifies speech and singing voice generation with controllable prosody and style

Researchers have introduced Vevo2, a unified framework that handles both controllable speech and singing voice generation through two specialized audio tokenizers. The approach enables fine-grained control over prosody, style, and timbre while addressing data scarcity in singing synthesis through joint speech-singing training.

A new research paper introduces Vevo2, a framework designed to handle both controllable speech and singing voice generation—a task that has proven difficult due to limited annotated singing data and the complexity of capturing expressive vocal characteristics.

Core Architecture

Vevo2 operates through two main components:

Two Specialized Audio Tokenizers:

  1. A unified music-notation-free prosody tokenizer that captures prosody and melody directly from speech, singing, and instrumental audio without requiring music notation
  2. A unified content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing while enabling timbre disentanglement

Two-Stage Generation Pipeline:

  1. An auto-regressive (AR) content-style modeling stage that provides controllability over text, prosody, and style
  2. A flow-matching acoustic modeling stage that enables fine-grained timbre control
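To make the data flow concrete, here is a minimal sketch of how the two tokenizers and two generation stages could fit together. All names and codebook details are illustrative placeholders, not the authors' API; the real tokenizers and models are learned neural components.

```python
# Hypothetical sketch of Vevo2's two-stage inference flow.
# Stand-ins only: the actual tokenizers, AR model, and flow-matching
# model are learned; here we use toy functions to show the interfaces.

def tokenize_prosody(pitch_frames):
    # Stand-in prosody tokenizer: bucket a per-frame pitch proxy (Hz)
    # into a small discrete codebook, with no music notation required.
    return [min(int(f // 50), 7) for f in pitch_frames]

def ar_content_style_model(text, prosody_tokens, style="neutral"):
    # Stage 1: an autoregressive model emits content-style tokens
    # conditioned on text, prosody tokens, and a style tag.
    return [(ch, p, style) for ch, p in zip(text, prosody_tokens)]

def flow_matching_acoustic_model(cs_tokens, timbre_ref):
    # Stage 2: flow matching maps content-style tokens plus a timbre
    # reference to acoustic features (represented here as tagged tuples).
    return [(tok, timbre_ref) for tok in cs_tokens]

# Example: two frames of "hi", sung, rendered in a reference timbre.
prosody = tokenize_prosody([120.0, 310.0])
cs = ar_content_style_model("hi", prosody, style="singing")
acoustic = flow_matching_acoustic_model(cs, timbre_ref="speaker_A")
```

The key design point this illustrates is the separation of concerns: prosody and content-style live in discrete token streams controlled in stage 1, while timbre enters only in stage 2.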

Key Technical Innovations

The framework addresses the singing data scarcity problem through joint speech-singing training. During this process, the researchers implement both explicit and implicit prosody learning strategies to transfer knowledge between speech and singing domains.
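One common way to realize joint training of this kind is to mix the scarce singing data into speech batches at a fixed ratio so both domains update the same model. The sketch below is a generic illustration of that idea under assumed parameters, not the paper's actual training recipe.

```python
import random

def mixed_batch(speech_items, singing_items,
                singing_ratio=0.3, batch_size=8, seed=0):
    # Hypothetical joint speech-singing batching: oversample the scarce
    # singing data at a fixed ratio (singing_ratio is an assumption)
    # so knowledge transfers between the two domains.
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        pool = singing_items if rng.random() < singing_ratio else speech_items
        batch.append(rng.choice(pool))
    return batch

batch = mixed_batch(["sp1", "sp2", "sp3"], ["sing1", "sing2"])
```

Without such mixing, the handful of annotated singing examples would be swamped by speech data; the ratio trades off singing coverage against speech quality.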

Additionally, Vevo2 incorporates a multi-objective post-training task that integrates intelligibility and prosody similarity alignment, allowing the model to follow both textual input and prosody specifications more accurately.
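A multi-objective post-training stage of this sort is typically implemented as a weighted combination of losses. The snippet below is a generic sketch with assumed weights; the paper's exact objectives and weighting scheme are not reproduced here.

```python
def post_training_loss(base_loss, intelligibility_loss, prosody_sim_loss,
                       w_intell=0.5, w_prosody=0.5):
    # Hypothetical weighted sum for multi-objective post-training:
    # the base generation loss plus alignment terms that reward
    # following the text (intelligibility) and the prosody spec.
    # The weights are illustrative assumptions.
    return (base_loss
            + w_intell * intelligibility_loss
            + w_prosody * prosody_sim_loss)

loss = post_training_loss(1.0, 0.4, 0.2)
```

Tuning the two weights lets the model trade off faithfulness to the input text against faithfulness to the requested prosody.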

Capabilities and Results

According to the research, Vevo2 demonstrates strong performance across multiple tasks:

  • Speech synthesis with controllable prosody and style
  • Singing voice synthesis
  • Voice conversion for both speech and singing
  • Voice editing
  • Style transfer

The unified modeling approach reportedly brings mutual benefits to both speech and singing generation, with the framework showing generalization ability across diverse synthesis, conversion, and editing applications.

Audio samples demonstrating Vevo2's output are available at the project website.

What This Means

Vevo2 represents a meaningful advancement in controllable voice synthesis by treating speech and singing as related problems rather than isolated domains. The dual-tokenizer approach and joint training strategy could reduce the data requirements typically needed for high-quality singing synthesis. The framework's ability to separately control prosody, style, and timbre suggests practical applications in music production, content creation, and accessibility tools. However, as with most research-stage systems, real-world performance and robustness across diverse speakers and languages remain to be demonstrated at production scale.
