MeanFlowSE enables single-step speech enhancement by learning mean velocity fields instead of instantaneous flows

Researchers introduced MeanFlowSE, a generative speech enhancement model that eliminates the computational bottleneck of multistep inference by learning average velocity over finite intervals rather than instantaneous velocity fields. The single-step approach achieves comparable quality to multistep baselines on VoiceBank-DEMAND while requiring substantially lower computational cost and no knowledge distillation.

Single-Step Speech Enhancement Replaces Iterative Solvers

A new research paper introduces MeanFlowSE, a conditional generative model that accelerates speech enhancement by replacing multistep inference with single-step generation. The key innovation sidesteps the computational bottleneck that has limited real-time applications of flow- and diffusion-based audio systems.

The Problem: Multistep Inference Bottleneck

Existing flow- and diffusion-based speech enhancement systems learn instantaneous velocity fields, requiring iterative ordinary differential equation (ODE) solvers during inference. Each audio sample must pass through multiple solver steps, making real-time processing computationally expensive.

The Solution: Mean Velocity Learning

MeanFlowSE reformulates the problem by learning the average velocity over finite intervals along a trajectory instead of the instantaneous velocity. The researchers use a Jacobian-vector product (JVP) to derive a local training objective that supervises finite-interval displacement while remaining consistent with the instantaneous field on the diagonal, where the interval endpoints coincide and the average velocity reduces to the instantaneous one.
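The relation behind this objective can be sketched as follows. The notation here follows the general mean-flow literature and is an assumption about this paper's exact symbols: u is the average velocity over an interval [r, t], v the instantaneous velocity, and z_t the state along the trajectory.

```latex
% Average velocity over [r, t] along the trajectory z_\tau:
u(z_t, r, t) \;=\; \frac{1}{t - r}\int_{r}^{t} v(z_\tau, \tau)\,\mathrm{d}\tau
% Differentiating (t - r)\,u(z_t, r, t) with respect to t gives a local
% identity that can serve as a training target, where the total
% derivative along the flow is computed with a single JVP:
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\,u(z_t, r, t),
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\,u \;=\; v\,\partial_{z} u \;+\; \partial_t u .
```

On the diagonal r = t the interval collapses and u reduces to v, which is the consistency constraint with the instantaneous field mentioned above.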

At inference, the model performs single-step generation via backward-in-time displacement, eliminating the need for multistep ODE solvers. An optional few-step variant provides additional refinement if needed.
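The single-step displacement itself is simple once the average velocity is learned. Below is a minimal NumPy sketch; `u_theta` is a hypothetical stand-in for the trained conditional mean-velocity network (the linear toy model is not the paper's architecture, only a placeholder so the script runs end-to-end).

```python
import numpy as np

def u_theta(z, r, t, y):
    """Hypothetical stand-in for the trained mean-velocity network.

    In MeanFlowSE this would be a neural network conditioned on the
    noisy speech features y; the linear map below is purely
    illustrative.
    """
    return z - y

def one_step_enhance(z1, y):
    """Single-step generation: one backward-in-time displacement from
    t = 1 to r = 0 using the learned average velocity over the whole
    interval, with no ODE solver loop."""
    t, r = 1.0, 0.0
    return z1 - (t - r) * u_theta(z1, r, t, y)

rng = np.random.default_rng(0)
y = rng.standard_normal(256)      # conditioning (noisy) features
z1 = rng.standard_normal(256)     # initial sample at t = 1
x_hat = one_step_enhance(z1, y)   # enhanced estimate in a single step
```

The few-step variant mentioned in the paper would simply split [0, 1] into a handful of sub-intervals and apply the same displacement per sub-interval.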

Experimental Results

Tested on VoiceBank-DEMAND, a standard benchmark for speech enhancement, MeanFlowSE's single-step model achieves:

  • Strong intelligibility scores
  • High fidelity metrics
  • Comparable perceptual quality to multistep baselines
  • Substantially lower computational cost

The method requires no knowledge distillation from teacher models or external supervision, reducing implementation complexity.

Open Source Release

The authors released MeanFlowSE as open-source code at https://github.com/liduojia1/MeanFlowSE, enabling adoption by the research and engineering communities.

What This Means

MeanFlowSE demonstrates that flow-based generative models can be dramatically accelerated by learning different mathematical representations of model dynamics. The single-step inference approach makes generative speech enhancement practical for real-time applications—voice calls, live transcription, hearing aids—that previously required expensive multistep computation. The contribution is methodological rather than architectural, suggesting similar velocity-field reformulations could apply to other generative tasks beyond speech.

MeanFlowSE: One-Step Generative Speech Enhancement