multimodal-ai
9 articles tagged with multimodal-ai
Meta launches proprietary Muse Spark, abandoning open-source strategy after $14.3B rebuild
On April 8, 2026, Meta launched Muse Spark, a natively multimodal reasoning model with tool-use and visual chain-of-thought capabilities. Unlike Llama, it is entirely proprietary with no open weights. The model scores 52 on AI Index v4.0 and excels on health benchmarks, but it marks Meta's departure from its open-source identity.
Meta replaces Llama with Muse Spark AI, launches Contemplating mode for complex reasoning
Meta has discontinued its Llama model line and launched Muse Spark as the foundation of its new AI strategy under Meta Superintelligence Labs. The model features a Contemplating mode for complex reasoning tasks and specializes in multimodal perception, health applications, and agentic tasks. Muse Spark is available today in Meta AI apps, with a private API preview for select partners.
Meta launches Muse Spark, proprietary AI model built by Wang's Superintelligence Labs
Meta announced Muse Spark, its first major large language model since it brought in Scale AI's Alexandr Wang nine months ago through a $14.3 billion deal. The proprietary model emphasizes efficiency and multimodal reasoning over top-tier benchmark performance, marking a strategic shift from Meta's previous open-source Llama approach. Muse Spark will begin powering Meta's AI assistant across Facebook, Instagram, WhatsApp, Messenger, and Ray-Ban glasses in the coming weeks.
Meta's TRIBE v2 AI predicts brain activity from images, audio, and speech with 70,000-voxel fMRI mapping
Meta's FAIR lab released TRIBE v2, an AI model that predicts human brain activity from images, audio, and text. Trained on more than 1,000 hours of fMRI data from 720 subjects, the model maps its predictions onto 70,000 voxels, and those predictions often match group-average brain responses more closely than individual brain scans do.
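For context, a minimal sketch of how an fMRI encoding model of this kind is typically scored: compute a Pearson correlation per voxel between predicted and measured responses. The function, array shapes, and random data below are illustrative assumptions, not Meta's code.

```python
# Voxelwise scoring sketch for an fMRI encoding model (illustrative, not TRIBE v2 code).
import numpy as np

def voxelwise_correlation(predicted: np.ndarray, measured: np.ndarray) -> np.ndarray:
    """Pearson r per voxel for (timepoints, voxels) arrays."""
    p = predicted - predicted.mean(axis=0)
    m = measured - measured.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (m ** 2).sum(axis=0)) + 1e-8
    return (p * m).sum(axis=0) / denom

# Toy example: 200 fMRI timepoints, 7,000 voxels (the real mapping covers 70,000).
rng = np.random.default_rng(0)
measured = rng.standard_normal((200, 7_000))
predicted = measured + rng.standard_normal((200, 7_000))  # stand-in for model output
scores = voxelwise_correlation(predicted, measured)
print(f"median voxel correlation: {np.median(scores):.3f}")
```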
Google rolls out Search Live globally with Gemini 3.1 Flash Live model
Google has begun globally rolling out Search Live, enabling users in 200+ countries and territories to point their phone camera at objects and ask questions about what they see. The expansion is powered by Google's Gemini 3.1 Flash Live model, designed to be natively multilingual with faster, more reliable performance.
Apple Research Identifies 'Text-Speech Understanding Gap' Limiting LLM Speech Performance
Apple researchers have identified a fundamental limitation in speech-adapted large language models: they consistently underperform their text-based counterparts on language understanding tasks. The team terms this the 'text-speech understanding gap' and documents that speech-adapted LLMs lag behind both their original text versions and cascaded speech-to-text pipelines.
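To make the comparison concrete, here is a minimal sketch of the three conditions such a study contrasts: the original text LLM, a cascaded speech-to-text pipeline feeding that same LLM, and the speech-adapted LLM. All callables and names are hypothetical placeholders, not Apple's code or any real API.

```python
# Hypothetical harness for measuring a 'text-speech understanding gap' (illustrative only).
from typing import Callable, Dict, List, Tuple

def accuracy(model: Callable, examples: List[Tuple]) -> float:
    """Fraction of examples where the model's answer matches the reference."""
    return sum(model(x) == y for x, y in examples) / len(examples)

def evaluate_gap(
    text_llm: Callable[[str], str],       # original text-only LLM
    speech_llm: Callable[[bytes], str],   # speech-adapted LLM (audio in, answer out)
    asr: Callable[[bytes], str],          # speech-to-text front end
    text_set: List[Tuple[str, str]],      # (question text, reference answer)
    audio_set: List[Tuple[bytes, str]],   # (spoken question, reference answer)
) -> Dict[str, float]:
    def cascaded(audio: bytes) -> str:
        # Cascaded pipeline: transcribe first, then query the original text LLM.
        return text_llm(asr(audio))

    return {
        "text_llm": accuracy(text_llm, text_set),
        "cascaded": accuracy(cascaded, audio_set),
        "speech_llm": accuracy(speech_llm, audio_set),
        # The reported gap is the drop from the first two rows to the third
        # on the same underlying questions.
    }
```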
New benchmark reveals AI models struggle with personal photo retrieval tasks
A new benchmark evaluating AI models on photo retrieval reveals significant limitations in their ability to find specific images within personal collections. The test presents models with what appears to be a simple task, locating a particular photo, yet the results demonstrate the gap between general image recognition and practical personal image search.
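As a rough illustration, retrieval benchmarks of this kind are often scored with recall@k over embedding similarity: rank every photo in the library against the query and check whether the target photo lands in the top k. The embeddings, names, and toy data below are stand-ins, not the benchmark's actual setup.

```python
# Recall@k sketch for personal photo retrieval (illustrative, not the benchmark's code).
import numpy as np

def recall_at_k(query_emb: np.ndarray, photo_embs: np.ndarray, target_idx: int, k: int = 5) -> bool:
    """True if the target photo is among the k photos most similar to the query."""
    sims = photo_embs @ query_emb / (
        np.linalg.norm(photo_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    top_k = np.argsort(-sims)[:k]
    return target_idx in top_k

# Toy library of 1,000 photos with 512-dim embeddings and one query near photo #42.
rng = np.random.default_rng(0)
photos = rng.standard_normal((1_000, 512))
query = photos[42] + 0.1 * rng.standard_normal(512)
print(recall_at_k(query, photos, target_idx=42, k=5))  # likely True in this toy setup
```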
Google integrates Lyria 3 music generation into Gemini with text-to-music and cover art
Google DeepMind has integrated its Lyria 3 model into Gemini, enabling users to generate 30-second music tracks with vocals, lyrics, and cover art from text prompts or uploaded media. The model represents an expansion of Google's multimodal AI capabilities into creative audio generation.
Google rolls out Lyria 3 music generation to all Gemini app users
Google is rolling out Lyria 3, its music generation model, to all Gemini app users. The expansion follows recent releases of audio overviews, image generation, and video capabilities in the Gemini ecosystem.