model releaseTencent

Tencent releases HY-Embodied-0.5, a 2B-parameter vision-language model for robot control

TL;DR

Tencent has released HY-Embodied-0.5, a family of foundation models designed specifically for embodied AI and robotic control. The suite includes a 2B-parameter MoT (Mixture-of-Transformers) variant with only 2.2B activated parameters during inference, and a 32B model that claims frontier-level performance comparable to Gemini 3.0 Pro, trained on over 200 billion tokens of embodied-specific data.

2 min read
0

Tencent Releases HY-Embodied-0.5 for Real-World Robot Control

Tencent's Robotics X and HY Vision Team have released HY-Embodied-0.5, an open-source suite of foundation models explicitly engineered for embodied AI agents and robotic control. The suite features two variants: a 2B-parameter model optimized for edge deployment and a 32B variant for complex reasoning tasks.

Architecture and Technical Details

The core innovation is the Mixture-of-Transformers (MoT) architecture. The 2B variant contains 4B total parameters but activates only 2.2B during inference, achieving "the high inference speed of a dense 2B model while delivering superior, fine-grained perceptual representations," according to Tencent. This efficiency comes from modality-specific computing in the vision pathway.

Both models were trained on a curated dataset comprising over 100 million embodied and spatial-specific data points across 200+ billion tokens. Tencent employed on-policy distillation to transfer reasoning capabilities from the 32B model to the compact 2B variant.

Performance Claims

Across 22 embodied-relevant benchmarks against similarly-sized models:

  • CV-Bench: HY-Embodied-0.5 MoT-2B scored 89.2, compared to Qwen3-VL 2B's 80.0 and Qwen3-VL 4B's 85.7
  • DA-2K: 92.3 versus Qwen3-VL 2B's 69.5
  • ERQA (embodied reasoning): 54.5 versus Qwen3-VL 2B's 41.8
  • EmbSpatial-Bench: 82.8 versus Qwen3-VL 2B's 75.9

Tencent claims the 32B variant achieves "frontier-level performance comparable to Gemini 3.0 Pro," though specific benchmarks are not disclosed in the announcement.

Hardware Requirements and Deployment

The model requires CUDA 12.6, PyTorch 2.8.0, and Python 3.12+. Tencent recommends NVIDIA GPUs with at least 16GB VRAM, though CPU inference is supported. The 2B model requires 8GB of disk space for weights; 20GB+ total storage is recommended for dependencies.

A custom Transformers version (specific commit 9293856c419762ebf98fbe2bd9440f9ce7069f1a) is required for inference. Tencent states they "will merge the improvements into the Transformers main branch later."

Vision-Language-Action Integration

HY-Embodied is positioned as the "brain" for Vision-Language-Action (VLA) pipelines. Unlike general vision-language models, the architecture emphasizes spatial-temporal perception, physical object interaction understanding, and agent dynamics—capabilities required for real-world robotic control.

The model supports both single and batch inference with optional chain-of-thought reasoning modes. Maximum generation length extends to 32,768 tokens.

Open Source Availability

Tencent has open-sourced the HY-Embodied-0.5 MoT-2B weights on Hugging Face (model ID: tencent/HY-Embodied-0.5) along with official inference code. A Gradio demo is available for testing. The full codebase is available on GitHub at Tencent-Hunyuan/HY-Embodied.

What This Means

HY-Embodied-0.5 addresses a genuine gap: most foundation models optimize for language or general vision tasks, not the specific demands of physical robots. A 2B model that matches or exceeds 4B competitors on embodied reasoning benchmarks could shift robotics development toward smaller, edge-deployable systems. However, the comparison against Qwen3-VL (which Tencent notes has "repetitive thinking patterns") rather than Gemini 3.0 or Claude variants limits independent assessment of true competitive positioning. The 32B variant's claimed parity with Gemini 3.0 Pro requires third-party validation.

Related Articles

model release

Tencent Releases Hy-MT2 Translation Models: 1.8B, 7B, and 30B-A3B Support 33 Languages

Tencent released Hy-MT2, a family of multilingual translation models available in 1.8B, 7B, and 30B-A3B (MoE) sizes. All models support translation among 33 languages and follow translation instructions in multiple languages. The 1.8B model can be compressed to 440MB using 1.25-bit AngelSlim quantization.

model release

Tencent Releases Hy-MT2: 1.8B Translation Model Compressed to 440MB With 1.25-Bit Quantization

Tencent has open-sourced Hy-MT2, a family of multilingual translation models available in 1.8B, 7B, and 30B-A3B parameter sizes. The models support translation across 33 languages and include extreme quantization down to 1.25-bit, reducing the 1.8B model to 440MB storage while increasing inference speed by 1.5x.

model release

Cohere Releases Command A+ Open Source Model with 25B Active Parameters, 128K Context

Cohere has released Command A+ as an open source model under Apache 2.0 license. The sparse mixture-of-experts architecture features 25 billion active parameters out of 218B total parameters, supports 128K input context length, and includes vision capabilities alongside tool use and reasoning features.

model release

Cohere Releases Command A+: 218B-Parameter MoE Model With 4-Bit Quantization Runs on Single B200 GPU

Cohere has released Command A+, an open-source sparse mixture-of-experts model with 218 billion total parameters and 25 billion active parameters. The model features W4A4 quantization allowing deployment on a single Nvidia B200 GPU, supports 128K input context, and includes built-in chain-of-thought reasoning with vision capabilities.

Comments

Loading...