Microsoft's MAI-Transcribe-1 achieves lowest word error rate on FLEURS, costs $0.36/audio hour
Microsoft has released MAI-Transcribe-1, a speech-to-text model that achieves the lowest word error rate on the FLEURS benchmark across 25 languages, outperforming Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite. The model runs 2.5 times faster than Microsoft's previous Azure Fast offering and costs $0.36 per audio hour.
Microsoft has introduced MAI-Transcribe-1, a multilingual speech-to-text model supporting 25 languages that outperforms competing transcription systems on the FLEURS benchmark.
Performance and Capabilities
MAI-Transcribe-1 achieves the lowest word error rate among tested models, beating Scribe v2, Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite across the FLEURS evaluation suite. Microsoft says the model is optimized for challenging recording conditions, including background noise, poor audio quality, and overlapping speech.
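For readers unfamiliar with the metric, word error rate is the minimum number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that definition, not Microsoft's or FLEURS's evaluation code. The function name is illustrative, and the sketch assumes plain whitespace tokenization; benchmark evaluations typically normalize casing and punctuation first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.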
The model delivers 2.5x faster inference than Microsoft's previous Azure Fast transcription offering. When combined with MAI-Voice-1 (Microsoft's text-to-speech model) and a language model, MAI-Transcribe-1 can power voice agents, according to Microsoft.
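The voice-agent arrangement described above is a three-stage chain: speech-to-text, then a language model, then text-to-speech. The sketch below shows only that shape. It calls no real Microsoft API; the type aliases, `make_voice_agent`, and the stand-in functions are all hypothetical names introduced for illustration.

```python
from typing import Callable

# Hypothetical stage signatures; real services would sit behind these.
Transcriber = Callable[[bytes], str]   # audio in, text out (speech-to-text)
Responder = Callable[[str], str]       # user text in, reply text out (language model)
Synthesizer = Callable[[str], bytes]   # reply text in, audio out (text-to-speech)


def make_voice_agent(
    transcribe: Transcriber,
    respond: Responder,
    synthesize: Synthesizer,
) -> Callable[[bytes], bytes]:
    """Compose the three stages into a single audio-in, audio-out turn handler."""

    def handle_turn(audio_in: bytes) -> bytes:
        user_text = transcribe(audio_in)
        reply_text = respond(user_text)
        return synthesize(reply_text)

    return handle_turn


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without any external service.
    agent = make_voice_agent(
        transcribe=lambda audio: audio.decode("utf-8"),
        respond=lambda text: f"You said: {text}",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(agent(b"hello"))
```

Since the stages run in sequence, the latency of each one adds to the total response time, which is why transcription speed matters for this use case.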
Pricing and Availability
MAI-Transcribe-1 is priced at $0.36 per audio hour. The model is rolling out across Copilot Voice and Microsoft Teams. Developers can access it through a public preview on Microsoft Foundry and the Microsoft AI Playground.
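To put the per-hour rate in concrete terms, the sketch below converts an audio duration into an estimated charge. It assumes billing is strictly proportional to audio length with no minimum charge or rounding increment, which the article does not confirm; the function and constant names are illustrative.

```python
from decimal import Decimal

# USD per audio hour, the figure quoted in the article.
PRICE_PER_AUDIO_HOUR = Decimal("0.36")


def transcription_cost(audio_seconds: float) -> Decimal:
    """Estimated cost in USD, assuming purely linear per-second billing."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    return (hours * PRICE_PER_AUDIO_HOUR).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    print(transcription_cost(30 * 60))        # a 30-minute meeting
    print(transcription_cost(10_000 * 3600))  # a 10,000-hour archive
```

Under that assumption the rate works out to $0.006 per audio minute, so a 30-minute meeting costs $0.18 and a 10,000-hour archive costs $3,600.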
Market Context
The release comes as open-source alternatives gain traction. Cohere and Mistral recently released open-source speech-to-text models of comparable quality, which carry no per-hour API fee for organizations willing to run and pay for their own hosting infrastructure.
What This Means
MAI-Transcribe-1 makes Microsoft competitive in speech recognition on both accuracy and speed, the two main requirements for enterprise voice applications. The $0.36 per audio hour pricing sits in the mid-market range for commercial transcription APIs. However, the emergence of capable open-source alternatives means Microsoft must justify a paid API through deployment convenience and integration with the Copilot and Teams ecosystems rather than technological superiority alone. The 2.5x speed improvement over Azure Fast suggests meaningful optimization work, and it matters most for real-time voice agents, where transcription time adds directly to response latency.