MeanFlowSE enables single-step speech enhancement by learning mean velocity fields instead of instantaneous flows
Researchers introduced MeanFlowSE, a generative speech enhancement model that removes the computational bottleneck of multistep inference by learning the average velocity over finite intervals rather than an instantaneous velocity field. The single-step approach matches the perceptual quality of multistep baselines on VoiceBank-DEMAND at substantially lower computational cost, with no knowledge distillation required.
Single-Step Speech Enhancement Replaces Iterative Solvers
A new research paper introduces MeanFlowSE, a conditional generative model that accelerates speech enhancement by replacing multistep inference with single-step generation. The key innovation sidesteps the computational bottleneck that has limited real-time applications of flow- and diffusion-based audio systems.
The Problem: Multistep Inference Bottleneck
Existing flow- and diffusion-based speech enhancement systems learn instantaneous velocity fields, requiring iterative ordinary differential equation (ODE) solvers during inference. Each audio sample must pass through multiple solver steps, making real-time processing computationally expensive.
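To make the bottleneck concrete, here is a minimal sketch of multistep Euler integration of an instantaneous velocity field. The toy `velocity` function stands in for the learned network, and the straight-line path is an illustrative assumption, not the paper's actual model:

```python
import numpy as np

def velocity(x, t, x1):
    """Toy instantaneous velocity for a straight-line path toward x1.
    In a real system this is a neural network conditioned on noisy speech."""
    return (x1 - x) / (1.0 - t) if t < 1.0 else np.zeros_like(x)

def euler_sample(x0, x1, n_steps=32):
    """Multistep Euler ODE solve: one network evaluation per step."""
    x, dt = x0.astype(float).copy(), 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity(x, t, x1)
    return x
```

Each of the `n_steps` iterations costs a full forward pass through the model; this per-sample loop is exactly the cost that single-step methods aim to eliminate.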
The Solution: Mean Velocity Learning
MeanFlowSE reformulates the problem: instead of the instantaneous velocity, it learns the average velocity over finite intervals along a trajectory. The researchers use a Jacobian-vector product (JVP) to derive a local training objective that supervises finite-interval displacement while remaining consistent with the instantaneous field on the diagonal, where the interval endpoints coincide.
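A rough sketch of how such a JVP-based objective can be computed. The model `u`, the numerical JVP, and the straight-line conditional path are illustrative assumptions; the paper trains a neural network and computes the JVP exactly with forward-mode autodiff:

```python
import numpy as np

def jvp(f, primals, tangents, eps=1e-4):
    """Numerical Jacobian-vector product (central difference).
    Autodiff frameworks provide this exactly via forward mode."""
    plus = [p + eps * v for p, v in zip(primals, tangents)]
    minus = [p - eps * v for p, v in zip(primals, tangents)]
    return (f(*plus) - f(*minus)) / (2 * eps)

def meanflow_loss(u, x0, x1, r, t):
    """Sketch of a mean-velocity training objective for one sample.
    u(x, r, t): model of the average velocity over [r, t].
    v = x1 - x0: instantaneous velocity of a straight-line path."""
    xt = (1 - t) * x0 + t * x1   # point on the trajectory at time t
    v = x1 - x0                  # instantaneous velocity at that point
    # Total time derivative of u along the trajectory, via a JVP with
    # tangent (v, 0, 1); this is what supervises the displacement.
    du_dt = jvp(u, (xt, r, t), (v, 0.0, 1.0))
    # On the diagonal (r == t) the target reduces to v itself, keeping
    # u consistent with the instantaneous field there.
    target = v - (t - r) * du_dt  # treated as stop-gradient in training
    return np.mean((u(xt, r, t) - target) ** 2)
```

For a model that already equals the true average velocity of the straight-line path (a constant, `x1 - x0`), the JVP term vanishes and the loss is zero, which is the self-consistency the objective enforces.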
At inference, the model performs single-step generation via backward-in-time displacement, eliminating the need for multistep ODE solvers. An optional few-step variant provides additional refinement if needed.
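Single-step inference then amounts to one displacement with the learned average velocity. This is a sketch under the same toy assumptions as above; in MeanFlowSE the state is a spectral representation conditioned on the noisy input:

```python
import numpy as np

def one_step_generate(u, x_t, r=0.0, t=1.0):
    """Backward-in-time displacement: x_r = x_t - (t - r) * u(x_t, r, t).
    A single model evaluation replaces the entire ODE solve."""
    return x_t - (t - r) * u(x_t, r, t)

def few_step_generate(u, x_t, times=(1.0, 0.5, 0.0)):
    """Optional few-step variant: chain displacements over subintervals
    for additional refinement."""
    x = x_t
    for t_hi, t_lo in zip(times[:-1], times[1:]):
        x = x - (t_hi - t_lo) * u(x, t_lo, t_hi)
    return x
```

With an ideal average-velocity model, both routines recover the clean endpoint; the few-step variant simply trades a handful of extra evaluations for refinement.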
Experimental Results
Tested on VoiceBank-DEMAND, a standard benchmark for speech enhancement, the single-step MeanFlowSE model achieves:
- Strong intelligibility scores
- High fidelity metrics
- Comparable perceptual quality to multistep baselines
- Substantially lower computational cost
The method requires no knowledge distillation from teacher models or external supervision, reducing implementation complexity.
Open Source Release
The authors released MeanFlowSE as open-source code at https://github.com/liduojia1/MeanFlowSE, enabling adoption by the research and engineering communities.
What This Means
MeanFlowSE demonstrates that flow-based generative models can be dramatically accelerated by learning a different mathematical representation of the same dynamics. Single-step inference makes generative speech enhancement practical for real-time applications (voice calls, live transcription, hearing aids) that could not previously afford expensive multistep computation. The contribution is methodological rather than architectural, suggesting similar velocity-field reformulations could apply to other generative tasks beyond speech.