multimodal AI

7 articles tagged with multimodal AI

May 24, 2026
model release · Stability AI

Stability AI Releases Stable Audio 3 Medium: 2B-Parameter Audio Generation Model with 180-Second Output in Under 2 Secon

Stability AI has released Stable Audio 3 Medium, a 2-billion-parameter latent diffusion model capable of generating variable-length audio up to 180 seconds. Trained on 1.28 million licensed and Creative Commons audio recordings, the model generates music and sound effects in less than 2 seconds on an H200 GPU.

May 22, 2026
model release

Google releases Gemini 3.5 Flash and autonomous agent Gemini Spark at I/O 2026

Google announced Gemini 3.5 Flash and Gemini Spark at I/O 2026. Gemini 3.5 Flash now powers Google's AI Mode search, while Spark is a cloud-based autonomous agent that can monitor credit card statements, track emails, and interact with third-party services like OpenTable and Instacart.

May 12, 2026
product update

Meta AI app adds voice conversations and live camera features powered by Muse Spark

Meta has added voice conversation capabilities and live camera analysis to its Meta AI app, both powered by the company's Muse Spark model. The features, previously available only on Meta's AI glasses, are now rolling out to the standalone app, which ranks fourth on the U.S. iPhone App Store.

April 28, 2026
product update · Amazon Web Services

Amazon Nova 2 Sonic Unifies Speech Recognition, Reasoning, and TTS in Single Streaming Model

Amazon Web Services released technical guidance for migrating text agents to voice assistants using Amazon Nova 2 Sonic, a native speech-to-speech model that combines automatic speech recognition, reasoning, tool calling, and text-to-speech in a single bidirectional streaming interface. The model supports asynchronous tool calling and built-in voice activity detection for handling interruptions.

April 24, 2026
product update · OpenAI

OpenAI releases ChatGPT Images 2.0 with accurate text rendering and brand-style matching

OpenAI launched ChatGPT Images 2.0, upgrading from decorative images to full-page graphics with detailed text rendering. The update is available to all ChatGPT tiers, though advanced features require a paid subscription with access to the Thinking model. Hands-on testing shows significant improvements in text accuracy and brand-style replication, though factual errors still occur.

April 21, 2026
model release · OpenAI

OpenAI releases ChatGPT Images 2.0 with integrated reasoning and text-image composition

OpenAI has released ChatGPT Images 2.0, which integrates reasoning capabilities to generate complex visual compositions combining text and images. The model supports aspect ratios from 3:1 to 1:3 and outputs up to 2K resolution, with advanced features available to Plus, Pro, Business, and Enterprise users.

April 8, 2026
model release

Meta launches proprietary Muse Spark model, abandoning open-source commitment

Meta has released Muse Spark, a proprietary AI model with restricted access via API and portal invite only, a striking reversal from CEO Mark Zuckerberg's 2024 manifesto championing open-source AI. Meta claims the model matches the performance of top competitors from OpenAI, Anthropic, and Google while being trained with an order of magnitude less compute than Llama 4.