MeanFlowSE enables single-step speech enhancement by learning mean velocity fields instead of instantaneous flows

Researchers introduced MeanFlowSE, a generative speech enhancement model that eliminates the computational bottleneck of multistep inference by learning average velocity over finite intervals rather than instantaneous velocity fields. The single-step approach achieves comparable quality to multistep baselines on VoiceBank-DEMAND while requiring substantially lower computational cost and no knowledge distillation.

Single-Step Speech Enhancement Replaces Iterative Solvers

A new research paper introduces MeanFlowSE, a conditional generative model that accelerates speech enhancement by replacing multistep inference with single-step generation. The key innovation sidesteps the computational bottleneck that has limited real-time applications of flow- and diffusion-based audio systems.

The Problem: Multistep Inference Bottleneck

Existing flow- and diffusion-based speech enhancement systems learn instantaneous velocity fields, requiring iterative ordinary differential equation (ODE) solvers during inference. Each audio sample must pass through multiple solver steps, making real-time processing computationally expensive.

The Solution: Mean Velocity Learning

MeanFlowSE reformulates the problem by learning the average velocity over finite intervals along a trajectory instead of the instantaneous velocity. The researchers use a Jacobian-vector product (JVP) to derive a local training objective that supervises finite-interval displacement while remaining consistent with the instantaneous field on the diagonal, where the interval endpoints coincide and the average velocity reduces to the instantaneous one.
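The relation behind this objective can be sketched as follows. The notation here follows the general mean-flow literature and is an assumption about this paper's exact symbols: u is the average velocity over an interval [r, t], v the instantaneous velocity, and z_t the state along the trajectory.

```latex
% Average velocity over [r, t] along the trajectory z_\tau:
u(z_t, r, t) \;=\; \frac{1}{t - r}\int_{r}^{t} v(z_\tau, \tau)\,\mathrm{d}\tau
% Differentiating (t - r)\,u(z_t, r, t) with respect to t gives a local
% identity that can serve as a training target, where the total
% derivative along the flow is computed with a single JVP:
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\,u(z_t, r, t),
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\,u \;=\; v\,\partial_{z} u \;+\; \partial_t u .
```

On the diagonal r = t the interval collapses and u reduces to v, which is the consistency constraint with the instantaneous field mentioned above.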

At inference, the model performs single-step generation via backward-in-time displacement, eliminating the need for multistep ODE solvers. An optional few-step variant provides additional refinement if needed.
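The single-step displacement itself is simple once the average velocity is learned. Below is a minimal NumPy sketch; `u_theta` is a hypothetical stand-in for the trained conditional mean-velocity network (the linear toy model is not the paper's architecture, only a placeholder so the script runs end-to-end).

```python
import numpy as np

def u_theta(z, r, t, y):
    """Hypothetical stand-in for the trained mean-velocity network.

    In MeanFlowSE this would be a neural network conditioned on the
    noisy speech features y; the linear map below is purely
    illustrative.
    """
    return z - y

def one_step_enhance(z1, y):
    """Single-step generation: one backward-in-time displacement from
    t = 1 to r = 0 using the learned average velocity over the whole
    interval, with no ODE solver loop."""
    t, r = 1.0, 0.0
    return z1 - (t - r) * u_theta(z1, r, t, y)

rng = np.random.default_rng(0)
y = rng.standard_normal(256)      # conditioning (noisy) features
z1 = rng.standard_normal(256)     # initial sample at t = 1
x_hat = one_step_enhance(z1, y)   # enhanced estimate in a single step
```

The few-step variant mentioned in the paper would simply split [0, 1] into a handful of sub-intervals and apply the same displacement per sub-interval.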

Experimental Results

Tested on VoiceBank-DEMAND, a standard benchmark for speech enhancement, MeanFlowSE's single-step model achieves:

  • Strong intelligibility scores
  • High fidelity metrics
  • Comparable perceptual quality to multistep baselines
  • Substantially lower computational cost

The method requires no knowledge distillation from teacher models or external supervision, reducing implementation complexity.

Open Source Release

The authors released MeanFlowSE as open-source code at https://github.com/liduojia1/MeanFlowSE, enabling adoption by the research and engineering communities.

What This Means

MeanFlowSE demonstrates that flow-based generative models can be dramatically accelerated by learning different mathematical representations of model dynamics. The single-step inference approach makes generative speech enhancement practical for real-time applications—voice calls, live transcription, hearing aids—that previously required expensive multistep computation. The contribution is methodological rather than architectural, suggesting similar velocity-field reformulations could apply to other generative tasks beyond speech.

MeanFlowSE: One-Step Generative Speech Enhancement