Meta launches Muse Spark, its first frontier model and first closed-weight AI system

TL;DR

Meta Superintelligence Labs has launched Muse Spark, a native multimodal reasoning model that scores 52 on the Artificial Analysis Intelligence Index, placing it among the top 5 frontier models. It is Meta's first frontier-class model and its first AI system released without open weights, a break from the open-weight approach of its Llama family. Artificial Analysis measured its token usage as on par with Gemini 3.1 Pro Preview, and Meta says it matches Llama 4 Maverick's capabilities with over an order of magnitude less compute.

Meta Superintelligence Labs has unveiled Muse Spark, a native multimodal reasoning model that marks two significant departures from the company's AI strategy: it's Meta's first frontier-class model and its first system without open weights.

Benchmark Performance

Muse Spark scored 52 on the Artificial Analysis Intelligence Index, landing in the top 5 across all tested models. Only Gemini 3.1 Pro Preview (the top performer), GPT-5.4, and Claude Opus 4.6 scored higher. For context, Meta's previous models, Llama 4 Maverick and Scout, scored only 18 and 13 points, respectively, when they launched in April 2025.

Independent testing by Artificial Analysis shows the model closing the frontier gap in a single release. However, Artificial Analysis flagged a weakness in agentic tasks: on the GDPval-AA work-task benchmark, Muse Spark scored 1,427 points versus 1,648 for Claude Sonnet 4.6 and 1,676 for GPT-5.4.
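To make the size of these gaps concrete, the short Python sketch below recomputes them from the figures quoted above. The variable names are illustrative only, and the numbers are simply those reported in this article.

```python
# Scores as quoted in this article; the variable names are illustrative.
index_scores = {"Muse Spark": 52, "Llama 4 Maverick": 18, "Llama 4 Scout": 13}
gdpval_scores = {"Muse Spark": 1427, "Claude Sonnet 4.6": 1648, "GPT-5.4": 1676}

# Intelligence Index: how many times higher Muse Spark scores than each Llama 4 model.
index_multiples = {
    name: round(index_scores["Muse Spark"] / score, 1)
    for name, score in index_scores.items()
    if name != "Muse Spark"
}

# GDPval-AA: how many points Muse Spark trails each competitor by.
gdpval_gaps = {
    name: score - gdpval_scores["Muse Spark"]
    for name, score in gdpval_scores.items()
    if name != "Muse Spark"
}

print(index_multiples)  # {'Llama 4 Maverick': 2.9, 'Llama 4 Scout': 4.0}
print(gdpval_gaps)      # {'Claude Sonnet 4.6': 221, 'GPT-5.4': 249}
```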

In Meta's internal testing, Muse Spark achieved 58% on Humanity's Last Exam and 38% on FrontierScience Research. In extended thinking mode without tools, it scored 50.2 on Humanity's Last Exam (No Tools), outperforming both Gemini 3.1 and GPT-5.4 Pro on that benchmark.

Key Capabilities and Architecture

Muse Spark operates as a native multimodal model with three core capabilities: tool usage, visual chain-of-thought reasoning, and multi-agent orchestration. The model includes a "Contemplating Mode" designed to rival the deep reasoning features of other frontier models, such as Gemini Deep Think and GPT Pro.

Meta rebuilt the pretraining stack from the ground up over nine months, changing the model architecture, optimization, and data curation. Meta claims Muse Spark matches Llama 4 Maverick's capabilities using over an order of magnitude less compute, which would make it substantially more efficient than competing base models.

The company employs two approaches to test-time compute. The first uses thought-time penalties that optimize token consumption. Meta observed a phenomenon it calls "thought compression," where the model initially improves by thinking longer, then compresses reasoning to solve problems with fewer tokens before expanding solutions again for stronger results. The second approach uses multi-agent orchestration—deploying multiple parallel agents on difficult problems simultaneously—to boost performance without adding latency.
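Meta has not published implementation details for either approach, so the following Python sketch is only a generic illustration of the two ideas, not Meta's method. It assumes a linear penalty on reasoning tokens beyond a budget, and a majority vote to combine the answers of agents run in parallel. The names `penalized_reward`, `orchestrate`, `token_budget`, and `penalty_per_token` are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def penalized_reward(correct: bool, tokens_used: int,
                     token_budget: int = 4000,
                     penalty_per_token: float = 1e-4) -> float:
    """Task reward minus a linear charge for tokens beyond the budget.

    Assumption: a linear overage penalty; the article does not say what
    form Meta's thought-time penalty takes.
    """
    base = 1.0 if correct else 0.0
    overage = max(0, tokens_used - token_budget)
    return base - penalty_per_token * overage


def orchestrate(agents: Sequence[Callable[[str], str]], problem: str) -> str:
    """Run every agent on the same problem concurrently, then majority-vote.

    Assumption: majority voting is a stand-in aggregation rule chosen for
    illustration only.
    """
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        answers = list(pool.map(lambda agent: agent(problem), agents))
    return Counter(answers).most_common(1)[0][0]


# Toy agents standing in for independent model calls.
toy_agents = [lambda p: "42", lambda p: "42", lambda p: "41"]
consensus = orchestrate(toy_agents, "What is 6 * 7?")
print(consensus)  # 42

# A correct answer within budget keeps full reward; a longer one is docked.
print(penalized_reward(True, tokens_used=3000))  # 1.0
print(penalized_reward(True, tokens_used=6000) < 1.0)  # True
```

Because the agents run concurrently, wall-clock time for I/O-bound model calls is roughly that of the slowest agent rather than the sum of all of them, which is the sense in which parallel agents can add capability without adding latency.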

Artificial Analysis's measurements support the efficiency claims: Muse Spark consumed 58 million output tokens for the full Intelligence Index run, roughly matching Gemini 3.1 Pro Preview (57 million) and well below Claude Opus 4.6 (157 million) and GPT-5.4 (120 million).
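As a quick check on those figures, the sketch below expresses each model's output-token count as a multiple of Muse Spark's. The dictionary name is illustrative, and the counts are the ones reported above.

```python
# Output tokens (in millions) for the full Intelligence Index run, as reported above.
output_tokens_millions = {
    "Muse Spark": 58,
    "Gemini 3.1 Pro Preview": 57,
    "Claude Opus 4.6": 157,
    "GPT-5.4": 120,
}

baseline = output_tokens_millions["Muse Spark"]

# Token usage of each other model relative to Muse Spark (1.0 = same usage).
relative_usage = {
    name: round(tokens / baseline, 2)
    for name, tokens in output_tokens_millions.items()
    if name != "Muse Spark"
}

print(relative_usage)
# {'Gemini 3.1 Pro Preview': 0.98, 'Claude Opus 4.6': 2.71, 'GPT-5.4': 2.07}
```

On these numbers, Claude Opus 4.6 used about 2.7 times as many output tokens as Muse Spark and GPT-5.4 about 2.1 times as many, while Gemini 3.1 Pro Preview used slightly fewer.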

Closed Weights Mark Strategic Shift

Unlike the Llama family, Muse Spark is not open-weight and cannot run locally. This is a sharp break from the open-source playbook Meta has championed for years. Meta's AI chief Alexandr Wang stated the company has "plans to open-source future versions," suggesting closed weights may not be a permanent policy. The company is also reportedly planning to open-source parts of its new AI models.

Meta justified the shift by noting its enormous spending on AI infrastructure and specialized talent "has to start paying for itself eventually."

Health and Multimodal Focus

Meta partnered with over 1,000 doctors to curate high-quality, factually accurate training data for health applications. The model can generate interactive displays that break down the nutritional value of food or show which muscles activate during specific exercises. Meta emphasized multimodal perception and health as primary use cases, though interactive applications like mini-game generation are also possible.

Meta acknowledged performance gaps in long-horizon agentic systems and coding workflows. The company also flagged that Muse Spark frequently labeled test scenarios as "alignment traps" during security evaluation, demonstrating "evaluation awareness"—a phenomenon where models appear to recognize they're being tested.

Availability and Future Plans

Muse Spark is live on meta.ai and in the Meta AI app, with private API preview access going to select users. Pricing has not been disclosed.

Meta frames Muse Spark as "the first step on our scaling ladder and the first product of a ground-up overhaul of our AI efforts" toward "personal superintelligence." The company stated "bigger models are already in development with infrastructure scaling to match." This release follows a rough period for Meta's AI efforts after Llama 4 Maverick and Scout drew criticism in April 2025 for underwhelming benchmark results and internal accusations of benchmark manipulation.

What This Means

Muse Spark demonstrates Meta can compete at the frontier in a single leap, closing a gap that seemed substantial just months ago. However, persistent weaknesses in agentic tasks and the company's admission of gaps in coding workflows suggest the model may not be immediately ready for autonomous agent deployment. The shift to closed weights is pragmatic—Meta's infrastructure spending demands commercial revenue—but the stated commitment to open-sourcing future versions leaves the door open to returning to its original strategy. Real-world performance across extended reasoning tasks will be the critical test; benchmark scores alone may not reflect usability in production environments.
