Microsoft releases three multimodal AI models to compete with OpenAI and Google
Microsoft AI released three foundational models on April 2: MAI-Transcribe-1 for speech-to-text across 25 languages, MAI-Voice-1 for audio generation, and MAI-Image-2 for video generation. The company positions these models as cheaper alternatives to Google and OpenAI offerings. The models are available on Microsoft Foundry, with pricing starting at $0.36 per audio hour for transcription.
Microsoft AI announced the release of three foundational models on April 2, 2026: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. All three are now available on Microsoft Foundry, with the transcription and voice models also available in MAI Playground, a new large language model testing platform launched March 19.
The models were developed by Microsoft's MAI Superintelligence team, which is led by Microsoft AI CEO Mustafa Suleyman and was announced in November 2025.
Model Capabilities and Performance
MAI-Transcribe-1 converts speech to text across 25 languages and is 2.5 times faster than Microsoft's Azure Fast offering, according to the company. Pricing starts at $0.36 per audio hour.
MAI-Voice-1 is an audio-generation model that produces 60 seconds of audio output in one second and supports custom voice creation. Pricing begins at $22 per 1 million characters.
MAI-Image-2 is a video-generation model. Pricing starts at $5 per 1 million input tokens and $33 per 1 million output tokens.
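The per-unit rates quoted above can be turned into rough cost estimates. A minimal sketch, using the article's published rates with hypothetical workload volumes:

```python
# Per-unit rates as quoted in this article; the workload
# volumes below are hypothetical examples.
TRANSCRIBE_PER_AUDIO_HOUR = 0.36    # USD, MAI-Transcribe-1
VOICE_PER_MILLION_CHARS = 22.00     # USD, MAI-Voice-1
IMAGE2_PER_M_INPUT_TOKENS = 5.00    # USD, MAI-Image-2
IMAGE2_PER_M_OUTPUT_TOKENS = 33.00  # USD, MAI-Image-2

def transcription_cost(audio_hours: float) -> float:
    """Cost of transcribing the given number of audio hours."""
    return audio_hours * TRANSCRIBE_PER_AUDIO_HOUR

def voice_cost(characters: int) -> float:
    """Cost of synthesizing the given number of input characters."""
    return characters / 1_000_000 * VOICE_PER_MILLION_CHARS

def image2_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a generation job billed by input and output tokens."""
    return (input_tokens / 1_000_000 * IMAGE2_PER_M_INPUT_TOKENS
            + output_tokens / 1_000_000 * IMAGE2_PER_M_OUTPUT_TOKENS)

# Hypothetical monthly workload: 500 audio hours, 2M synthesized
# characters, and 1M input / 0.5M output generation tokens.
print(f"Transcription: ${transcription_cost(500):.2f}")          # $180.00
print(f"Voice:         ${voice_cost(2_000_000):.2f}")            # $44.00
print(f"Generation:    ${image2_cost(1_000_000, 500_000):.2f}")  # $21.50
```

At these rates, transcription volume dominates the hypothetical bill, which is consistent with Microsoft pitching per-hour pricing as its main lever against competitors.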
Microsoft claims these models are cheaper than comparable offerings from Google and OpenAI, positioning cost as a primary competitive advantage in an increasingly crowded generative AI market.
Microsoft's Dual Strategy
The release reinforces Microsoft's strategy of building proprietary AI capabilities while maintaining its partnership with OpenAI. Microsoft has invested more than $13 billion in OpenAI through a multi-year agreement and integrates OpenAI models across its product portfolio.
According to Suleyman, a recent renegotiation of the Microsoft-OpenAI partnership enabled Microsoft to pursue independent superintelligence research. The company applies the same dual approach to semiconductors, both manufacturing its own chips and purchasing from external suppliers.
"We're building Humanist AI," Suleyman wrote in a blog post. "We have a distinct view when creating our AI models — putting humans at the center, optimizing for how people actually communicate, training for practical use."
Suleyman told VentureBeat that additional models from Microsoft AI will launch soon on Foundry and integrate directly into Microsoft products.
What This Means
Microsoft's move signals confidence in its ability to develop competitive foundation models independently while preserving strategic partnerships. The pricing structure, particularly the lower-cost transcription and voice-generation offerings, directly targets enterprises evaluating alternatives to established vendors. However, the company's continued reliance on OpenAI demonstrates that even with substantial internal AI capabilities, Microsoft views OpenAI's technology as complementary rather than redundant. The success of these models depends on how quickly they are adopted and on whether real-world performance matches the company's efficiency claims.
Related Articles
Microsoft's MAI-Transcribe-1 achieves lowest word error rate on FLEURS, costs $0.36/audio hour
Microsoft has released MAI-Transcribe-1, a speech-to-text model that achieves the lowest word error rate on the FLEURS benchmark across 25 languages, outperforming Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite. The model runs 2.5 times faster than Microsoft's previous Azure Fast offering and costs $0.36 per audio hour.
Microsoft releases three in-house AI models for speech and images, signaling independence from OpenAI
Microsoft released public preview versions of three proprietary AI models: MAI-Transcribe-1 for speech recognition across 25 languages at 50% lower GPU cost than alternatives, MAI-Voice-1 for speech synthesis generating 60 seconds of audio in under a second, and MAI-Image-2 for text-to-image generation. The models are available exclusively through Microsoft Azure AI Foundry and already power Copilot, Bing, and PowerPoint.
Google launches Veo 3.1 Lite, cutting video generation costs by half
Google announced Veo 3.1 Lite, a cost-reduced video generation model priced at less than 50% of Veo 3.1 Fast's cost. The model supports text-to-video and image-to-video generation at 720p or 1080p resolution with customizable durations of 4s, 6s, or 8s, rolling out today on the Gemini API and Google AI Studio.
Google DeepMind releases Gemma 4 with 4 model sizes, 256K context, and multimodal reasoning
Google DeepMind released Gemma 4, a family of open-weights multimodal models in four sizes: E2B (2.3B effective), E4B (4.5B effective), 26B A4B (3.8B active), and 31B (30.7B parameters). All models support text and image input with 128K-256K context windows, while E2B and E4B add native audio capabilities and reasoning modes across 140+ languages.