Tencent Releases HY-World 2.0: Open-Source Multi-Modal Model Generates 3D Worlds from Text and Images
Tencent has released HY-World 2.0, an open-source multi-modal world model that generates navigable 3D environments from text prompts, single images, multi-view images, or video. The model produces editable 3D assets, including meshes and 3D Gaussian splats, that can be imported directly into game engines such as Unity and Unreal Engine.
Technical Specifications
The system centers on WorldMirror 2.0, a 1.2 billion parameter feed-forward model that outputs depth maps, surface normals, camera parameters, 3D point clouds, and 3D Gaussian Splatting (3DGS) attributes in a single forward pass. The model supports flexible-resolution inference from 50,000 to 500,000 pixels.
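The single-pass, multi-output design can be pictured with a minimal structural sketch. Everything below (the class name, output keys, and tensor shapes) is illustrative rather than the repository's actual API; only the set of outputs and the one-forward-pass behavior come from the announcement.

```python
import torch
import torch.nn as nn

# Structural sketch only, NOT the WorldMirror 2.0 architecture: it
# illustrates the "many outputs from one forward pass" design.
class WorldMirrorSketch(nn.Module):
    def forward(self, views: torch.Tensor) -> dict[str, torch.Tensor]:
        b, n, c, h, w = views.shape
        return {
            "depth":     torch.empty(b, n, 1, h, w),     # per-view depth maps
            "normals":   torch.empty(b, n, 3, h, w),     # per-view surface normals
            "cameras":   torch.empty(b, n, 3, 4),        # per-view camera matrices
            "points":    torch.empty(b, n * h * w, 3),   # fused 3D point cloud
            "gaussians": torch.empty(b, n * h * w, 14),  # 3DGS attributes
        }

model = WorldMirrorSketch().eval()

# Resolution is flexible: roughly 50,000 to 500,000 pixels per view.
views = torch.rand(1, 4, 3, 518, 518)  # (batch, views, channels, H, W)

with torch.no_grad():
    out = model(views)  # one forward pass yields every modality at once
```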
HY-World 2.0 operates through a four-stage pipeline (a composition sketch in code follows the list):
- HY-Pano 2.0: Generates 360-degree panoramas from text or images
- WorldNav: Plans camera trajectories through the scene
- WorldStereo 2.0: Expands the world from panoramic views
- WorldMirror 2.0 + 3DGS learning: Composes final 3D assets
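A minimal sketch of how the four stages might compose, with stub stand-ins for each component. The function names and return values are hypothetical; only the stage order comes from Tencent's description.

```python
from typing import Any

# All four stage functions are stand-ins: the real components are
# separate models that Tencent plans to release individually.
def hy_pano_generate(prompt: str) -> dict:          # HY-Pano 2.0
    return {"panorama": prompt}

def worldnav_plan(pano: dict) -> list[str]:         # WorldNav
    return ["pose_0", "pose_1"]

def worldstereo_expand(pano: dict, traj: list[str]) -> list[str]:  # WorldStereo 2.0
    return [f"view@{p}" for p in traj]

def worldmirror_reconstruct(views: list[str]) -> dict[str, Any]:   # WorldMirror 2.0 + 3DGS
    return {"mesh": ..., "gaussians": ..., "views": views}

def generate_world(prompt: str) -> dict[str, Any]:
    pano = hy_pano_generate(prompt)          # 1. text/image -> 360-degree panorama
    traj = worldnav_plan(pano)               # 2. plan a camera trajectory
    views = worldstereo_expand(pano, traj)   # 3. expand the world beyond the panorama
    return worldmirror_reconstruct(views)    # 4. compose the final 3D assets

assets = generate_world("a cobblestone village square at dusk")
```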
Pricing has not been disclosed.
What Sets It Apart
According to Tencent, HY-World 2.0 differs from existing video-based world models like Genie 3 and Cosmos by producing persistent 3D assets rather than transient video sequences. The generated meshes and Gaussian splats can be imported directly into Blender, Unity, Unreal Engine, and Isaac Sim.
The company claims the model achieves state-of-the-art accuracy and produces results comparable to closed-source methods such as Marble. Unlike video world models, which must run model inference for every rendered frame, HY-World 2.0 generates a scene once; navigating it afterward carries near-zero rendering cost.
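The difference in cost models can be made concrete with a toy comparison. The timings below are placeholders, not benchmarks of either system; they only illustrate where inference cost is paid.

```python
import time

def video_model_frame() -> str:
    time.sleep(0.05)          # stand-in for one diffusion/transformer pass
    return "frame"

def rasterize(scene: str, pose: int) -> str:
    return f"{scene}@{pose}"  # stand-in for real-time 3DGS rasterization

# Video world model: every frame of navigation is a fresh model inference.
video_frames = [video_model_frame() for _ in range(10)]

# Asset-based model: pay the generation cost once...
scene = "generated_3dgs_scene"  # stand-in for one HY-World 2.0 run
# ...then each subsequent frame is a cheap re-render of the same assets.
gs_frames = [rasterize(scene, pose) for pose in range(10)]
```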
Capabilities
The system supports two core functions:
- World Generation: Converts text or single images into navigable 3D scenes
- World Reconstruction: Transforms multi-view images or video into 3D representations
The model handles diverse visual styles including realistic, cartoon, and game aesthetics. It enables first-person navigation and third-person character exploration with physics-based collision detection.
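Collision-constrained navigation can be sketched as a simple occupancy check per movement step. The occupancy query below is a toy stand-in; in practice, collision would run against the exported meshes inside the host engine.

```python
import numpy as np

def try_move(pos: np.ndarray, step: np.ndarray, is_blocked) -> np.ndarray:
    candidate = pos + step
    return pos if is_blocked(candidate) else candidate  # reject moves into geometry

def is_blocked(p: np.ndarray) -> bool:
    return p[0] >= 5.0  # toy scene: a wall at x = 5

pos = np.zeros(3)
for _ in range(10):
    pos = try_move(pos, np.array([1.0, 0.0, 0.0]), is_blocked)
print(pos)  # [4. 0. 0.] -- the camera stops at the wall
```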
Open Source Release
Tencent has released the WorldMirror 2.0 inference code and model weights on Hugging Face. The company plans to release additional components including full world generation code, HY-Pano 2.0, WorldNav, and WorldStereo 2.0 at unspecified future dates.
The model requires CUDA 12.4 and supports both single-GPU and multi-GPU inference via PyTorch 2.4.0. In multi-GPU mode, the number of input images must equal or exceed the number of GPUs used.
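The images-must-equal-or-exceed-GPUs requirement suggests input views are distributed across devices. A hedged sketch of what such a guard and shard could look like follows; the actual repository code may differ.

```python
import torch.distributed as dist

# Assumes the process group is already initialized,
# e.g. launched with: torchrun --nproc_per_node=4 infer.py
def shard_views(image_paths: list[str]) -> list[str]:
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    if len(image_paths) < world_size:
        raise ValueError(
            f"{len(image_paths)} input images < {world_size} GPUs; "
            "multi-GPU mode needs at least one image per GPU"
        )
    return image_paths[rank::world_size]  # round-robin shard per rank
```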
What This Means
HY-World 2.0 represents a shift from video-based world models to asset-based generation, addressing persistent issues with video models including temporal inconsistency and limited reusability. By outputting standard 3D formats compatible with major game engines, the model could accelerate 3D content creation workflows for game development and simulation. However, the staggered release schedule means the complete end-to-end pipeline remains unavailable, limiting immediate practical deployment. The 1.2B parameter count suggests efficient inference compared to larger multimodal models, though real-world performance benchmarks beyond company claims have not been independently verified.