Microsoft's MAI-Transcribe-1 achieves lowest word error rate on FLEURS, costs $0.36/audio hour
Microsoft has released MAI-Transcribe-1, a speech-to-text model that achieves the lowest word error rate on the FLEURS benchmark across 25 languages, outperforming Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite. The model runs 2.5 times faster than Microsoft's previous Azure Fast offering and costs $0.36 per audio hour.
Microsoft has introduced MAI-Transcribe-1, a multilingual speech-to-text model supporting 25 languages that outperforms competing transcription systems on the FLEURS benchmark.
Performance and Capabilities
MAI-Transcribe-1 achieves the lowest word error rate among tested models, beating Scribe v2, Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite across the FLEURS evaluation suite. Microsoft says the model is optimized for challenging recording conditions, including background noise, poor audio quality, and overlapping speech.
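For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition only. It is not Microsoft's evaluation code, the function name and sample sentences are made up for illustration, and it skips the text normalization (casing, punctuation) that benchmark harnesses usually apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no normalization,
    which is an assumption and not how any particular benchmark scores.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when a transcript is much longer than its reference. Lower is better.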
The model delivers 2.5x faster inference than Microsoft's previous Azure Fast transcription offering. When combined with MAI-Voice-1 (Microsoft's text-to-speech model) and a language model, MAI-Transcribe-1 can power voice agents, according to Microsoft.
Pricing and Availability
MAI-Transcribe-1 is priced at $0.36 per audio hour. The model is rolling out across Copilot Voice and Microsoft Teams. Developers can access it through a public preview on Microsoft Foundry and the Microsoft AI Playground.
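To put the rate in practical terms, the arithmetic can be sketched as below. This assumes billing scales linearly with audio duration, with no minimum charge, rounding increment, or volume discount. The article does not describe the billing rules, so that is an assumption, and the function name is illustrative.

```python
PRICE_PER_AUDIO_HOUR_USD = 0.36  # rate quoted in the article
SECONDS_PER_HOUR = 3600


def estimate_cost_usd(audio_seconds: float) -> float:
    """Estimate transcription cost, assuming purely linear per-hour billing."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / SECONDS_PER_HOUR * PRICE_PER_AUDIO_HOUR_USD


if __name__ == "__main__":
    for label, seconds in [
        ("45-minute meeting", 45 * 60),
        ("1,000 hours of call recordings", 1000 * SECONDS_PER_HOUR),
    ]:
        print(f"{label}: ${estimate_cost_usd(seconds):,.2f}")
```

Under that assumption, a 45-minute meeting would come to about $0.27 and 1,000 hours of audio to about $360.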
Market Context
The release comes as open-source alternatives gain traction. Cohere and Mistral recently released open-source speech-to-text models of comparable quality, which carry no per-hour API fee for organizations willing to run their own hosting infrastructure.
What This Means
MAI-Transcribe-1 makes Microsoft competitive in speech recognition on both accuracy and speed, the two requirements that matter most for enterprise voice applications. The $0.36/hour pricing sits in the mid-market range for commercial transcription APIs. However, the emergence of capable open-source alternatives means Microsoft must justify the API model through deployment convenience and integration with the Copilot and Teams ecosystems rather than technological superiority alone. The 2.5x speed improvement over Azure Fast suggests meaningful optimization work, which is particularly relevant for real-time voice agent applications.