multimodal-ai

5 articles tagged with multimodal-ai

March 26, 2026
product update

Google rolls out Search Live globally with Gemini 3.1 Flash Live model

Google has begun rolling out Search Live globally, enabling users in 200+ countries and territories to point their phone camera at objects and ask questions about what they see. The expansion is powered by Google's Gemini 3.1 Flash Live model, which is designed to be natively multilingual and to deliver faster, more reliable performance.

February 24, 2026
research, Apple

Apple Research Identifies 'Text-Speech Understanding Gap' Limiting LLM Speech Performance

Apple researchers have identified a fundamental limitation in speech-adapted large language models: they consistently underperform their text-based counterparts on language understanding tasks. The team terms this the 'text-speech understanding gap' and documents that speech-adapted LLMs lag behind both their original text versions and cascaded speech-to-text pipelines.

February 22, 2026
benchmark

New benchmark reveals AI models struggle with personal photo retrieval tasks

A new benchmark evaluating AI models on photo retrieval reveals significant limitations in their ability to find specific images from personal collections. The test presents models with what appears to be a simple task—locating a particular photo—yet results demonstrate the gap between general image recognition and practical personal image search.

February 20, 2026
product update

Google integrates Lyria 3 music generation into Gemini with text-to-music and cover art

Google DeepMind has integrated its Lyria 3 model into Gemini, enabling users to generate 30-second music tracks with vocals, lyrics, and cover art from text prompts or uploaded media. The model represents an expansion of Google's multimodal AI capabilities into creative audio generation.

product update

Google rolls out Lyria 3 music generation to all Gemini app users

Google is rolling out Lyria 3, its music generation model, to all Gemini app users. The expansion follows recent releases of audio overviews, image generation, and video capabilities in the Gemini ecosystem.