Microsoft's MAI-Transcribe-1 achieves lowest word error rate on FLEURS, costs $0.36/audio hour
Microsoft has released MAI-Transcribe-1, a speech-to-text model that achieves the lowest word error rate on the FLEURS benchmark across 25 languages, outperforming Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite. The model runs 2.5 times faster than Microsoft's previous Azure Fast offering and costs $0.36 per audio hour.
Microsoft has introduced MAI-Transcribe-1, a multilingual speech-to-text model supporting 25 languages that outperforms competing transcription systems on the FLEURS benchmark.
Performance and Capabilities
MAI-Transcribe-1 achieves the lowest word error rate among tested models, beating Scribe v2, Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite across the FLEURS evaluation suite. Microsoft says the model is optimized for challenging recording conditions, including background noise, poor audio quality, and overlapping speech.
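For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not Microsoft's or the FLEURS evaluation code; the function name `word_error_rate` is hypothetical, and it assumes simple whitespace tokenization, whereas benchmark evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative only: splits on whitespace and applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference
    # prefix processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A lower value is better, and because insertions count as errors the value can exceed 1.0 when the transcript is much longer than the reference.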
The model delivers 2.5x faster inference than Microsoft's previous Azure Fast transcription offering. When combined with MAI-Voice-1 (Microsoft's text-to-speech model) and a language model, MAI-Transcribe-1 can power voice agents, according to Microsoft.
Pricing and Availability
MAI-Transcribe-1 is priced at $0.36 per audio hour. The model is rolling out across Copilot Voice and Microsoft Teams. Developers can access it through a public preview on Microsoft Foundry and the Microsoft AI Playground.
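To put the per-hour price in concrete terms, the sketch below estimates the cost of a given amount of audio. Only the $0.36 per audio hour figure comes from the announcement; the function name `transcription_cost_usd` is hypothetical, and the calculation assumes simple pro-rata billing with no minimum charge, rounding increment, or volume discount, none of which are stated in the source.

```python
PRICE_PER_AUDIO_HOUR_USD = 0.36  # figure quoted in the announcement


def transcription_cost_usd(
    audio_seconds: float,
    price_per_hour: float = PRICE_PER_AUDIO_HOUR_USD,
) -> float:
    """Estimate cost assuming linear, pro-rata billing (an assumption)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / 3600 * price_per_hour


if __name__ == "__main__":
    one_minute = transcription_cost_usd(60)
    thousand_hours = transcription_cost_usd(1000 * 3600)
    print(f"1 minute of audio:    ${one_minute:.4f}")
    print(f"1,000 hours of audio: ${thousand_hours:.2f}")
```

Under these assumptions, one minute of audio works out to $0.006 and 1,000 hours to $360.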
Market Context
The release comes as open-source alternatives gain traction. Cohere and Mistral recently released open-source speech-to-text models that perform at comparable quality levels, giving organizations willing to run their own infrastructure an option with no per-hour API fee.
What This Means
MAI-Transcribe-1 positions Microsoft competitively in speech recognition, addressing both accuracy and speed requirements for enterprise voice applications. The $0.36 per audio hour pricing sits in the mid-market range for commercial transcription APIs. However, the emergence of capable open-source alternatives means Microsoft must justify the API model through deployment convenience and integration with the Copilot and Teams ecosystems rather than technical superiority alone. The 2.5x speed improvement over Azure Fast suggests meaningful optimization work, which matters for real-time voice agent applications.