Tencent releases HY-Embodied-0.5, a 2B-parameter vision-language model for robot control

TL;DR

Tencent has released HY-Embodied-0.5, a family of foundation models designed specifically for embodied AI and robotic control. The suite includes a 2B-class Mixture-of-Transformers (MoT) variant (4B total parameters, of which only 2.2B are activated during inference) and a 32B model that Tencent claims delivers frontier-level performance comparable to Gemini 3.0 Pro. Both were trained on over 200 billion tokens of embodied-specific data.


Tencent Releases HY-Embodied-0.5 for Real-World Robot Control

Tencent's Robotics X and HY Vision Team have released HY-Embodied-0.5, an open-source suite of foundation models explicitly engineered for embodied AI agents and robotic control. The suite features two variants: a 2B-parameter model optimized for edge deployment and a 32B variant for complex reasoning tasks.

Architecture and Technical Details

The core innovation is the Mixture-of-Transformers (MoT) architecture. The 2B variant contains 4B total parameters but activates only 2.2B during inference, achieving "the high inference speed of a dense 2B model while delivering superior, fine-grained perceptual representations," according to Tencent. This efficiency comes from modality-specific computing in the vision pathway.
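The idea behind modality-specific computing can be illustrated with a toy sketch: each modality gets its own projection weights, so total parameter count grows with the number of modalities while any single token only activates its own modality's branch. This is a minimal illustration of the routing principle, not Tencent's implementation; the dimensions and modality names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden dimension

# One projection matrix per modality: total parameters scale with the number
# of modalities, but each token only touches its own modality's weights,
# so activated parameters per token stay at dense-model levels.
weights = {m: rng.normal(size=(d, d)) for m in ("text", "vision")}

def mot_layer(tokens: np.ndarray, modalities: list[str]) -> np.ndarray:
    """Apply the modality-matched projection to each token."""
    return np.stack([tokens[i] @ weights[m] for i, m in enumerate(modalities)])

tokens = rng.normal(size=(4, d))
out = mot_layer(tokens, ["text", "vision", "vision", "text"])
print(out.shape)  # (4, 8)
```

In the real architecture the interleaved text and image tokens would still attend to each other in shared attention; only the per-modality weights are separated.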

Both models were trained on a curated dataset comprising over 100 million embodied and spatial-specific data points across 200+ billion tokens. Tencent employed on-policy distillation to transfer reasoning capabilities from the 32B model to the compact 2B variant.
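The distinguishing feature of on-policy distillation is that the student is scored by the teacher on outputs the student itself generates, which corresponds to minimizing a reverse KL divergence on student-visited states. A minimal numerical sketch of that loss term (toy logits, not Tencent's training code):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
vocab = 5
student_logits = rng.normal(size=(vocab,))
teacher_logits = rng.normal(size=(vocab,))

# On-policy distillation scores student-sampled continuations under the
# teacher; the per-step objective is the reverse KL, KL(student || teacher),
# which is always non-negative and zero only when the two distributions match.
p_s, p_t = softmax(student_logits), softmax(teacher_logits)
reverse_kl = float((p_s * (np.log(p_s) - np.log(p_t))).sum())
print(reverse_kl >= 0.0)  # True
```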

Performance Claims

Tencent evaluated the models across 22 embodied-relevant benchmarks; highlights against similarly sized models include:

  • CV-Bench: HY-Embodied-0.5 MoT-2B scored 89.2, compared to Qwen3-VL 2B's 80.0 and Qwen3-VL 4B's 85.7
  • DA-2K: 92.3 versus Qwen3-VL 2B's 69.5
  • ERQA (embodied reasoning): 54.5 versus Qwen3-VL 2B's 41.8
  • EmbSpatial-Bench: 82.8 versus Qwen3-VL 2B's 75.9

Tencent claims the 32B variant achieves "frontier-level performance comparable to Gemini 3.0 Pro," though specific benchmarks are not disclosed in the announcement.

Hardware Requirements and Deployment

The model requires CUDA 12.6, PyTorch 2.8.0, and Python 3.12+. Tencent recommends NVIDIA GPUs with at least 16GB VRAM, though CPU inference is supported. The 2B model requires 8GB of disk space for weights; 20GB+ total storage is recommended for dependencies.

A custom Transformers version (specific commit 9293856c419762ebf98fbe2bd9440f9ce7069f1a) is required for inference. Tencent states they "will merge the improvements into the Transformers main branch later."
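Until that merge lands, installation means pinning the exact commit. The commit hash below is the one Tencent cites; everything else about the environment (virtualenv layout, extra dependencies) is left to the reader.

```shell
# Pin the exact Transformers commit required for HY-Embodied-0.5 inference.
pip install "git+https://github.com/huggingface/transformers.git@9293856c419762ebf98fbe2bd9440f9ce7069f1a"
```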

Vision-Language-Action Integration

HY-Embodied is positioned as the "brain" for Vision-Language-Action (VLA) pipelines. Unlike general-purpose vision-language models, it emphasizes spatial-temporal perception, understanding of physical object interactions, and agent dynamics, capabilities required for real-world robotic control.

The model supports both single and batch inference with optional chain-of-thought reasoning modes. Maximum generation length extends to 32,768 tokens.
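A request under this interface can be sketched as a standard multimodal chat-message list. The content schema below follows the generic Transformers chat-template convention, and the chain-of-thought trigger phrase is an assumption for illustration; neither is confirmed by the release notes.

```python
from typing import Any

def build_messages(image: str, question: str, use_cot: bool = False) -> list[dict[str, Any]]:
    """Build one chat request; batch inference passes a list of such lists."""
    # Hypothetical CoT trigger: the actual mechanism HY-Embodied uses to
    # toggle reasoning mode is not documented in the announcement.
    prompt = ("Think step by step, then answer.\n" + question) if use_cot else question
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }]

msgs = build_messages("scene.jpg", "Which object is left of the mug?", use_cot=True)
print(msgs[0]["role"])  # user
```

In a full pipeline these messages would be tokenized via the processor's chat template and passed to `generate`, with the maximum generation length capped at the 32,768 tokens stated in the announcement.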

Open Source Availability

Tencent has open-sourced the HY-Embodied-0.5 MoT-2B weights on Hugging Face (model ID: tencent/HY-Embodied-0.5) along with official inference code. A Gradio demo is available for testing. The full codebase is available on GitHub at Tencent-Hunyuan/HY-Embodied.

What This Means

HY-Embodied-0.5 addresses a genuine gap: most foundation models optimize for language or general vision tasks, not the specific demands of physical robots. A 2B model that matches or exceeds 4B competitors on embodied reasoning benchmarks could shift robotics development toward smaller, edge-deployable systems. However, the comparison against Qwen3-VL (which Tencent notes has "repetitive thinking patterns") rather than Gemini 3.0 or Claude variants limits independent assessment of true competitive positioning. The 32B variant's claimed parity with Gemini 3.0 Pro requires third-party validation.

