Microsoft releases three multimodal AI models to compete with OpenAI and Google
Microsoft AI released three foundational models on April 2: MAI-Transcribe-1 for speech-to-text across 25 languages, MAI-Voice-1 for audio generation, and MAI-Image-2 for video generation. The company positions these models as cheaper alternatives to Google and OpenAI offerings. Models are available on Microsoft Foundry with pricing starting at $0.36 per hour for transcription.
Microsoft AI announced the release of three foundational models on April 2, 2026: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. All three are now available on Microsoft Foundry, with the transcription and voice models also available in MAI Playground, a new large language model testing platform launched March 19.
The models were developed by Microsoft's MAI Superintelligence team, led by Microsoft AI CEO Mustafa Suleyman. Microsoft announced the team's formation in November 2025.
Model Capabilities and Performance
MAI-Transcribe-1 converts speech to text across 25 languages and is 2.5 times faster than Microsoft's Azure Fast offering, according to the company. Pricing starts at $0.36 per hour.
MAI-Voice-1 generates 60 seconds of audio in one second and supports custom voice creation. Pricing begins at $22 per 1 million characters.
MAI-Image-2 is a video-generation model. Pricing starts at $5 per 1 million input tokens and $33 per 1 million output tokens.
Microsoft claims these models are cheaper than comparable offerings from Google and OpenAI, positioning cost as a primary competitive advantage in an increasingly crowded generative AI market.
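To put the quoted starting prices in concrete terms, the short Python sketch below estimates costs from the figures in this article. It is only an illustration: the function names and the sample workloads are hypothetical, and it assumes the "starting at" rates apply linearly with no tiers, minimums, or discounts, which actual Foundry billing may not follow.

```python
# Starting prices as quoted in this article (USD). Actual billing may differ.
TRANSCRIBE_USD_PER_HOUR = 0.36               # MAI-Transcribe-1
VOICE_USD_PER_MILLION_CHARS = 22.0           # MAI-Voice-1
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0     # MAI-Image-2
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0   # MAI-Image-2


def transcription_cost(audio_hours: float) -> float:
    """Estimated cost of transcribing the given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of synthesizing speech for the given character count."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_model_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost for the given input and output token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical monthly workloads, chosen only for illustration.
    print(f"1,000 hours of transcription: ${transcription_cost(1_000):.2f}")
    print(f"2 million characters of speech: ${voice_cost(2_000_000):.2f}")
    print(f"1M input + 1M output tokens: ${image_model_cost(1_000_000, 1_000_000):.2f}")
```

Under these assumptions, transcription cost scales with audio duration, voice cost with the length of the text being spoken, and MAI-Image-2 cost is dominated by output tokens, which are priced at more than six times the input rate.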
Microsoft's Dual Strategy
The release reinforces Microsoft's strategy of building proprietary AI capabilities while maintaining its partnership with OpenAI. Microsoft has invested more than $13 billion in OpenAI through a multi-year agreement and integrates OpenAI models across its product portfolio.
According to Suleyman, a recent renegotiation of the Microsoft-OpenAI partnership enabled Microsoft to pursue independent superintelligence research. The company applies the same dual approach to semiconductors, designing its own chips while also purchasing from external suppliers.
"We're building Humanist AI," Suleyman wrote in a blog post. "We have a distinct view when creating our AI models — putting humans at the center, optimizing for how people actually communicate, training for practical use."
Suleyman told VentureBeat that additional models from Microsoft AI will launch soon on Foundry and integrate directly into Microsoft products.
What This Means
Microsoft's move signals confidence that it can develop competitive foundation models independently while preserving its strategic partnerships. The pricing, particularly for transcription and voice generation, directly targets enterprises evaluating alternatives to established vendors. However, the company's continued reliance on OpenAI suggests that, even with substantial internal AI capabilities, Microsoft views OpenAI's technology as complementary rather than redundant. The success of these models will depend on how quickly they are adopted and on whether real-world performance matches the company's efficiency claims.