multimodal-ai
5 articles tagged with multimodal-ai
Google rolls out Search Live globally with Gemini 3.1 Flash Live model
Google has begun globally rolling out Search Live, enabling users in 200+ countries and territories to point their phone camera at objects and ask questions about what they see. The expansion is powered by Google's Gemini 3.1 Flash Live model, designed to be natively multilingual with faster, more reliable performance.
Apple Research Identifies 'Text-Speech Understanding Gap' Limiting LLM Speech Performance
Apple researchers have identified a fundamental limitation in speech-adapted large language models: they consistently underperform their text-based counterparts on language understanding tasks. The team terms this the 'text-speech understanding gap' and documents that speech-adapted LLMs lag behind both their original text versions and cascaded speech-to-text pipelines.
New benchmark reveals AI models struggle with personal photo retrieval tasks
A new benchmark evaluating AI models on photo retrieval reveals significant limitations in their ability to find specific images from personal collections. The test presents models with a seemingly simple task: locating a particular photo. Yet the results demonstrate the gap between general image recognition and practical personal image search.
Google integrates Lyria 3 music generation into Gemini with text-to-music and cover art
Google DeepMind has integrated its Lyria 3 model into Gemini, enabling users to generate 30-second music tracks with vocals, lyrics, and cover art from text prompts or uploaded media. The model represents an expansion of Google's multimodal AI capabilities into creative audio generation.
Google rolls out Lyria 3 music generation to all Gemini app users
Google is rolling out Lyria 3, its music generation model, to all Gemini app users. The expansion follows recent releases of audio overviews, image generation, and video capabilities in the Gemini ecosystem.