multimodal AI
7 articles tagged with multimodal AI
Stability AI Releases Stable Audio 3 Medium: 2B-Parameter Audio Generation Model with 180-Second Output in Under 2 Secon
Stability AI has released Stable Audio 3 Medium, a 2 billion parameter latent diffusion model capable of generating variable-length audio up to 380 seconds. The model generates music and sound effects in less than 2 seconds on an H200 GPU, trained on 1.28 million licensed and Creative Commons audio recordings.
Google releases Gemini 3.5 Flash and autonomous agent Gemini Spark at I/O 2026
Google announced Gemini 3.5 Flash and Gemini Spark at I/O 2026. Gemini 3.5 Flash now powers Google's AI Mode search, while Spark is a cloud-based autonomous agent that can monitor credit card statements, track emails, and interact with third-party services like OpenTable and Instacart.
Meta AI app adds voice conversations and live camera features powered by Muse Spark
Meta has added voice conversation capabilities and live camera analysis to its Meta AI app, both powered by the company's Muse Spark model. The features previously available only on Meta's AI glasses are now rolling out to the standalone app, which ranks fourth on the U.S. iPhone App Store.
Amazon Nova 2 Sonic Unifies Speech Recognition, Reasoning, and TTS in Single Streaming Model
Amazon Web Services released technical guidance for migrating text agents to voice assistants using Amazon Nova 2 Sonic, a native speech-to-speech model that combines automatic speech recognition, reasoning, tool calling, and text-to-speech in a single bidirectional streaming interface. The model supports asynchronous tool calling and built-in voice activity detection for handling interruptions.
OpenAI releases ChatGPT Images 2.0 with accurate text rendering and brand-style matching
OpenAI launched ChatGPT Images 2.0, upgrading from decorative images to full-page graphics with detailed text rendering. The update is available to all ChatGPT tiers, with advanced features requiring paid subscriptions that access the Thinking model. Hands-on testing shows significant improvements in text accuracy and brand-style replication, though factual errors still occur.
OpenAI releases ChatGPT Images 2.0 with integrated reasoning and text-image composition
OpenAI has released ChatGPT Images 2.0, which integrates reasoning capabilities to generate complex visual compositions combining text and images. The model supports aspect ratios from 3:1 to 1:3 and outputs up to 2K resolution, with advanced features available to Plus, Pro, Business, and Enterprise users.
Meta launches proprietary Muse Spark model, abandoning open-source commitment
Meta has released Muse Spark, a proprietary AI model with restricted access via API and portal invite only—a striking reversal from CEO Mark Zuckerberg's 2024 manifesto championing open-source AI. The model claims performance matching top competitors from OpenAI, Anthropic, and Google, trained with an order of magnitude less compute than Llama 4.