Tencent Releases HY-World 2.0: Open-Source Multi-Modal Model Generates 3D Worlds from Text and Images
Tencent has released HY-World 2.0, an open-source multi-modal world model that generates navigable 3D environments from text prompts, single images, multi-view images, or video. The model produces editable 3D assets, including meshes and 3D Gaussian Splatting (3DGS) representations, that can be imported directly into game engines such as Unity and Unreal Engine.
Technical Specifications
The system centers on WorldMirror 2.0, a 1.2-billion-parameter feed-forward model that outputs depth maps, surface normals, camera parameters, 3D point clouds, and 3D Gaussian Splatting (3DGS) attributes in a single forward pass. The model supports flexible-resolution inference on inputs ranging from 50,000 to 500,000 pixels.
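To make the pixel range concrete, here is a minimal sketch of how an input image could be resized to land inside a 50,000 to 500,000 pixel budget while keeping its aspect ratio. The function name and the rounding strategy are assumptions for illustration; the article does not describe how WorldMirror 2.0 actually preprocesses its inputs.

```python
import math

MIN_PIXELS = 50_000   # lower bound quoted in the article
MAX_PIXELS = 500_000  # upper bound quoted in the article


def fit_to_pixel_budget(width, height, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Hypothetical helper: return (new_width, new_height) whose area lies
    within the pixel budget, keeping the aspect ratio as closely as integer
    sizes allow. Extremely elongated images are not handled specially."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    area = width * height
    if area > max_pixels:
        scale = math.sqrt(max_pixels / area)
        # Round down so the resized area stays at or below the upper bound.
        return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))
    if area < min_pixels:
        scale = math.sqrt(min_pixels / area)
        # Round up so the resized area reaches at least the lower bound.
        return math.ceil(width * scale), math.ceil(height * scale)
    # Already inside the supported range: leave the size unchanged.
    return width, height
```

For example, a 1920x1080 frame (about 2.07 million pixels) would be scaled down to fit under the upper bound, while a 640x480 image would pass through unchanged.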
HY-World 2.0 operates through a four-stage pipeline:
- HY-Pano 2.0: Generates 360-degree panoramas from text or images
- WorldNav: Plans camera trajectories through the scene
- WorldStereo 2.0: Expands the world from panoramic views
- WorldMirror 2.0 + 3DGS learning: Composes final 3D assets
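The ordering of the four stages can be sketched as a simple sequential pipeline. Everything in the snippet below is a hypothetical placeholder: the function names, the `WorldState` container, and the string artifacts only illustrate how each stage builds on the previous one's output, and none of it reflects Tencent's actual code or API.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical container carrying intermediate artifacts between stages."""
    prompt: str
    artifacts: dict = field(default_factory=dict)


def generate_panorama(state):
    # Stage 1 (HY-Pano 2.0): 360-degree panorama from text or an image.
    state.artifacts["panorama"] = f"panorama for: {state.prompt}"
    return state


def plan_trajectories(state):
    # Stage 2 (WorldNav): camera trajectories through the scene.
    state.artifacts["trajectories"] = "planned camera paths"
    return state


def expand_world(state):
    # Stage 3 (WorldStereo 2.0): expand the world from the panoramic views.
    state.artifacts["expanded_views"] = "views along the planned paths"
    return state


def compose_assets(state):
    # Stage 4 (WorldMirror 2.0 + 3DGS learning): compose the final 3D assets.
    state.artifacts["assets"] = "meshes and 3DGS"
    return state


# Stage order as listed in the article.
PIPELINE = [generate_panorama, plan_trajectories, expand_world, compose_assets]


def run_pipeline(prompt):
    state = WorldState(prompt)
    for stage in PIPELINE:
        state = stage(state)
    return state
```

Calling `run_pipeline("a medieval village")` returns a state whose `artifacts` dictionary holds one placeholder entry per stage, in the order the stages ran.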
Pricing has not been disclosed.
What Sets It Apart
According to Tencent, HY-World 2.0 differs from existing video-based world models like Genie 3 and Cosmos by producing persistent 3D assets rather than temporary video sequences. The generated meshes and 3D Gaussian Splatting representations can be imported directly into Blender, Unity, Unreal Engine, and Isaac Sim.
The company claims the model achieves state-of-the-art accuracy and produces results comparable to closed-source methods such as Marble. Unlike video world models that require per-frame inference, HY-World 2.0 performs one-time generation with near-zero rendering cost after creation.
Capabilities
The system supports two core functions:
- World Generation: Converts text or single images into navigable 3D scenes
- World Reconstruction: Transforms multi-view images or video into 3D representations
The model handles diverse visual styles including realistic, cartoon, and game aesthetics. It enables first-person navigation and third-person character exploration with physics-based collision detection.
Open Source Release
Tencent has released the WorldMirror 2.0 inference code and model weights on Hugging Face. The company plans to release additional components including full world generation code, HY-Pano 2.0, WorldNav, and WorldStereo 2.0 at unspecified future dates.
The model requires CUDA 12.4 and supports both single-GPU and multi-GPU inference via PyTorch 2.4.0. In multi-GPU mode, the number of input images must equal or exceed the number of GPUs used.
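The multi-GPU constraint can be illustrated with a small pre-flight check. The sketch below assumes, purely for illustration, that input images are split round-robin across GPUs, which would explain why every GPU needs at least one image; the article does not say how the released code actually distributes work, and `shard_images` is a hypothetical helper.

```python
def shard_images(image_paths, num_gpus):
    """Hypothetical helper: validate the image-count constraint, then split
    the inputs round-robin so that every GPU receives at least one image."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if len(image_paths) < num_gpus:
        raise ValueError(
            f"multi-GPU mode needs at least {num_gpus} input images, "
            f"got {len(image_paths)}"
        )
    # GPU `rank` takes every num_gpus-th image, starting at index `rank`.
    return [image_paths[rank::num_gpus] for rank in range(num_gpus)]
```

With five images and two GPUs, the first GPU would receive three images and the second two; with one image and two GPUs, the check fails before any inference starts.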
What This Means
HY-World 2.0 represents a shift from video-based world models to asset-based generation, addressing persistent issues with video models such as temporal inconsistency and limited reusability. By outputting standard 3D formats compatible with major game engines, the model could accelerate 3D content creation workflows for game development and simulation. However, the staggered release schedule means the complete end-to-end pipeline remains unavailable, which limits immediate practical deployment. The 1.2B parameter count suggests efficient inference compared to larger multimodal models, though Tencent's performance claims have not yet been independently verified.