
Tencent releases HY-Embodied-0.5, a 2B-parameter vision-language model for robot control

TL;DR

Tencent has released HY-Embodied-0.5, a family of foundation models designed specifically for embodied AI and robotic control. The suite includes a compact Mixture-of-Transformers (MoT) variant, MoT-2B, which holds 4B total parameters but activates only 2.2B during inference, and a 32B model that Tencent claims delivers frontier-level performance comparable to Gemini 3.0 Pro. The models were trained on over 200 billion tokens of embodied-specific data.

April 10, 2026 · 2:50 PM · 2 min read

HY-Embodied-0.5 MoT-2B — Quick Specs

Context window: 33K tokens


Tencent Releases HY-Embodied-0.5 for Real-World Robot Control

Tencent's Robotics X and HY Vision Team have released HY-Embodied-0.5, an open-source suite of foundation models explicitly engineered for embodied AI agents and robotic control. The suite features two variants: a 2B-parameter model optimized for edge deployment and a 32B variant for complex reasoning tasks.

Architecture and Technical Details

The core innovation is the Mixture-of-Transformers (MoT) architecture. The MoT-2B variant contains 4B total parameters but activates only 2.2B during inference, achieving "the high inference speed of a dense 2B model while delivering superior, fine-grained perceptual representations," according to Tencent. Tencent attributes this efficiency to modality-specific computation in the vision pathway.
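The idea of modality-specific computation can be illustrated with a toy sketch. It assumes, without confirmation from the announcement, that each token is tagged as vision or text and passes only through the weights belonging to its own modality, so fewer parameters are used per token than are stored. The layer functions, names, and sizes below are hypothetical stand-ins, not Tencent's implementation.

```python
# Toy illustration of modality-specific computation (not Tencent's code).
# Assumption: each token carries a modality tag and is processed only by
# the weights that belong to that modality.

def scale_layer(weight):
    """Return a stand-in 'layer' that multiplies every feature by weight."""
    return lambda features: [weight * x for x in features]

# Hypothetical per-modality layers; a real model would use full transformer blocks.
LAYERS = {
    "vision": scale_layer(2.0),
    "text": scale_layer(0.5),
}

def route(tokens):
    """Send each (modality, features) token through the layer for its modality."""
    return [(modality, LAYERS[modality](features)) for modality, features in tokens]

def active_fraction(total_params_b, active_params_b):
    """Share of stored parameters that are used during inference."""
    return active_params_b / total_params_b

tokens = [("vision", [1.0, 2.0]), ("text", [4.0, 8.0])]
routed = route(tokens)
fraction = active_fraction(4.0, 2.2)  # totals quoted by Tencent, in billions
print(routed)
print(f"{fraction:.0%} of parameters active")
```

With the quoted figures, a little over half of the stored parameters are active for any given inference pass.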

Both models were trained on a curated dataset comprising over 100 million embodied and spatial-specific data points across 200+ billion tokens. Tencent employed on-policy distillation to transfer reasoning capabilities from the 32B model to the compact 2B variant.
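On-policy distillation is commonly described as follows: the student generates its own continuation, and the teacher scores every step of that continuation so the student can be pulled toward the teacher's distribution. The sketch below uses a reverse KL loss, which is one common choice. The announcement does not specify Tencent's loss or training loop, so the loss, the toy models, and all names here are assumptions for illustration.

```python
import math
import random

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for a single next-token distribution."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )

def on_policy_loss(student, teacher, prompt, steps, rng):
    """Average per-step divergence along a continuation sampled from the student."""
    context, total = list(prompt), 0.0
    for _ in range(steps):
        s_probs = student(context)
        t_probs = teacher(context)
        total += reverse_kl(s_probs, t_probs)
        # The next token is drawn from the student, which makes this on-policy.
        token = rng.choices(range(len(s_probs)), weights=s_probs)[0]
        context.append(token)
    return total / steps

# Hypothetical two-token "models" that ignore their context.
def toy_student(context):
    return [0.5, 0.5]

def toy_teacher(context):
    return [0.9, 0.1]

loss = on_policy_loss(toy_student, toy_teacher, [0], steps=4, rng=random.Random(0))
print(f"distillation loss: {loss:.4f}")
```

In real training the loss would be backpropagated through the student; here it is only computed, to show where the student's own samples enter the procedure.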

Performance Claims

Tencent evaluated HY-Embodied-0.5 MoT-2B on 22 embodied-relevant benchmarks against similarly sized models. Selected scores:

CV-Bench: 89.2, versus 80.0 for Qwen3-VL 2B and 85.7 for Qwen3-VL 4B
DA-2K: 92.3, versus 69.5 for Qwen3-VL 2B
ERQA (embodied reasoning): 54.5, versus 41.8 for Qwen3-VL 2B
EmbSpatial-Bench: 82.8, versus 75.9 for Qwen3-VL 2B
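The margins over Qwen3-VL 2B follow directly from the reported scores. The short script below only does that subtraction. It covers the four benchmarks quoted here, so it says nothing about the other 18.

```python
# Scores as reported in the announcement:
# (HY-Embodied-0.5 MoT-2B, Qwen3-VL 2B).
SCORES = {
    "CV-Bench": (89.2, 80.0),
    "DA-2K": (92.3, 69.5),
    "ERQA": (54.5, 41.8),
    "EmbSpatial-Bench": (82.8, 75.9),
}

def margins(scores):
    """Point difference between the two models, rounded to one decimal."""
    return {
        name: round(ours - baseline, 1)
        for name, (ours, baseline) in scores.items()
    }

result = margins(SCORES)
for name, delta in result.items():
    print(f"{name}: +{delta} points")
```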

Tencent claims the 32B variant achieves "frontier-level performance comparable to Gemini 3.0 Pro," though specific benchmarks are not disclosed in the announcement.

Hardware Requirements and Deployment

The model requires CUDA 12.6, PyTorch 2.8.0, and Python 3.12+. Tencent recommends NVIDIA GPUs with at least 16GB VRAM, though CPU inference is supported. The 2B model requires 8GB of disk space for weights; 20GB+ total storage is recommended for dependencies.
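The 8GB figure is consistent with storing all 4B parameters at 2 bytes each, although the announcement does not state the weight precision, so the 16-bit assumption below is exactly that. The same sketch includes a small check for the stated Python 3.12+ requirement; both helper names are hypothetical.

```python
import sys

def weights_gb(params_billion, bytes_per_param=2):
    """Approximate weight size in GB.

    The default of 2 bytes per parameter assumes 16-bit weights,
    which the announcement does not confirm.
    """
    return params_billion * bytes_per_param

def python_ok(version=sys.version_info, minimum=(3, 12)):
    """True if the interpreter meets the stated Python 3.12+ requirement."""
    return tuple(version[:2]) >= minimum

print(f"estimated weights: {weights_gb(4)} GB")
print(f"Python version OK: {python_ok()}")
```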

Inference requires a custom Transformers version pinned to a specific commit (9293856c419762ebf98fbe2bd9440f9ce7069f1a). Tencent says it "will merge the improvements into the Transformers main branch later."
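Pinning a package to a single commit is usually done with pip's `git+URL@commit` syntax. The sketch below builds such a command for the commit named above. It assumes the commit lives in the upstream Hugging Face repository, which the article does not state, so check the repository URL against Tencent's instructions before running anything.

```python
import sys

COMMIT = "9293856c419762ebf98fbe2bd9440f9ce7069f1a"
# Assumption: the commit is hosted in the upstream Hugging Face repository.
REPO_URL = "https://github.com/huggingface/transformers"

def pip_command(repo_url=REPO_URL, commit=COMMIT):
    """Build a pip invocation that installs the package at one exact commit."""
    return [sys.executable, "-m", "pip", "install", f"git+{repo_url}@{commit}"]

command = pip_command()
print(" ".join(command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```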

Vision-Language-Action Integration

HY-Embodied is positioned as the "brain" for Vision-Language-Action (VLA) pipelines. Unlike general vision-language models, the architecture emphasizes spatial-temporal perception, an understanding of physical object interaction, and agent dynamics, all of which are required for real-world robotic control.

The model supports both single and batch inference with optional chain-of-thought reasoning modes. Maximum generation length extends to 32,768 tokens.

Open Source Availability

Tencent has open-sourced the HY-Embodied-0.5 MoT-2B weights on Hugging Face (model ID: tencent/HY-Embodied-0.5) along with official inference code. A Gradio demo is available for testing. The full codebase is available on GitHub at Tencent-Hunyuan/HY-Embodied.
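For readers who want the weights locally, one generic route is the `snapshot_download` function from the `huggingface_hub` package, sketched below with the model ID given above. This is not Tencent's official inference code. The local directory name is a hypothetical choice, and the package must be installed separately.

```python
MODEL_ID = "tencent/HY-Embodied-0.5"

def download_weights(local_dir="hy-embodied-weights"):
    """Download the model repository and return the local path.

    local_dir is a hypothetical default; any writable directory works.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=MODEL_ID, local_dir=local_dir)
```

Calling `download_weights()` fetches the full repository, so the disk space guidance above applies.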

What This Means

HY-Embodied-0.5 addresses a genuine gap: most foundation models are optimized for language or general vision tasks, not for the specific demands of physical robots. A compact model that matches or exceeds 4B competitors on embodied reasoning benchmarks could shift robotics development toward smaller, edge-deployable systems. However, the published comparison is against Qwen3-VL (which Tencent notes has "repetitive thinking patterns") rather than Gemini 3.0 or Claude variants, which limits independent assessment of its true competitive position. The 32B variant's claimed parity with Gemini 3.0 Pro requires third-party validation.

Source: huggingface.co



