
Tencent Releases HY-World 2.0: Open-Source Multi-Modal Model Generates 3D Worlds from Text and Images

TL;DR

Tencent has released HY-World 2.0, an open-source multi-modal world model that generates navigable 3D environments from text prompts, single images, multi-view images, or video. The model produces editable 3D assets, including meshes and 3D Gaussian Splatting (3DGS) scenes, that can be imported directly into game engines such as Unity and Unreal Engine.

April 16, 2026 · 6:21 AM · 2 min read



Tencent has released HY-World 2.0, an open-source multi-modal world model that generates navigable 3D environments from text prompts, single images, multi-view images, or video.

Technical Specifications

The system centers on WorldMirror 2.0, a 1.2-billion-parameter feed-forward model that outputs depth maps, surface normals, camera parameters, 3D point clouds, and 3D Gaussian Splatting (3DGS) attributes in a single forward pass. The model supports flexible-resolution inference on inputs ranging from 50,000 to 500,000 pixels.
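To make the 50,000 to 500,000 pixel range concrete, here is a minimal sketch of how an input image could be rescaled to fit that budget while keeping its aspect ratio. The function name and the rounding strategy are assumptions for illustration; the source does not describe how WorldMirror 2.0 actually resizes its inputs.

```python
import math


def fit_to_pixel_budget(width, height, min_pixels=50_000, max_pixels=500_000):
    """Return a (width, height) whose pixel count falls inside the budget.

    Hypothetical helper: the real preprocessing is not documented in the
    article. Only the 50,000 and 500,000 pixel bounds come from the source.
    """
    pixels = width * height
    if min_pixels <= pixels <= max_pixels:
        return width, height  # already inside the supported range

    if pixels > max_pixels:
        # Shrink: round down so the result never exceeds the upper bound.
        scale = math.sqrt(max_pixels / pixels)
        return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))

    # Enlarge: round up so the result never falls below the lower bound.
    scale = math.sqrt(min_pixels / pixels)
    return math.ceil(width * scale), math.ceil(height * scale)
```

Under these assumptions, a 1920x1080 frame (about 2.07 million pixels) would be scaled down by roughly half on each side, while a 640x480 image would pass through unchanged.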

HY-World 2.0 operates through a four-stage pipeline:

HY-Pano 2.0: Generates 360-degree panoramas from text or images
WorldNav: Plans camera trajectories through the scene
WorldStereo 2.0: Expands the world from panoramic views
WorldMirror 2.0 + 3DGS learning: Composes final 3D assets
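
The data flow through the four stages can be sketched as a chain of functions. Everything below is a placeholder: the function names, the dictionary keys, and the string values are hypothetical and do not reflect Tencent's code, which has so far been released only in part. The sketch shows only the order in which each stage consumes the previous stage's output.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Stage = Callable[[State], State]


# Each stub stands in for one stage and records what that stage would produce.
def hy_pano(state: State) -> State:
    return {**state, "panorama": f"360-degree panorama for {state['prompt']!r}"}


def world_nav(state: State) -> State:
    return {**state, "trajectory": "camera path planned through the panorama"}


def world_stereo(state: State) -> State:
    return {**state, "expanded_views": "new views rendered along the trajectory"}


def world_mirror_3dgs(state: State) -> State:
    return {**state, "assets": "meshes and 3DGS composed from the expanded views"}


# Stage order as listed in the article.
PIPELINE: List[Tuple[str, Stage]] = [
    ("HY-Pano 2.0", hy_pano),
    ("WorldNav", world_nav),
    ("WorldStereo 2.0", world_stereo),
    ("WorldMirror 2.0 + 3DGS learning", world_mirror_3dgs),
]


def run_pipeline(prompt: str) -> State:
    """Pass a prompt through every stage in order and return the final state."""
    state: State = {"prompt": prompt}
    for _name, stage in PIPELINE:
        state = stage(state)
    return state
```

Calling `run_pipeline("a quiet harbor town")` returns a dictionary whose keys appear in stage order: prompt, panorama, trajectory, expanded_views, assets.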

Pricing has not been disclosed.

What Sets It Apart

According to Tencent, HY-World 2.0 differs from existing video-based world models such as Genie 3 and Cosmos by producing persistent 3D assets rather than transient video sequences. The generated meshes and 3DGS scenes can be imported directly into Blender, Unity, Unreal Engine, and Isaac Sim.

The company claims the model achieves state-of-the-art accuracy and produces results comparable to closed-source methods such as Marble. Unlike video world models that require per-frame inference, HY-World 2.0 performs one-time generation with near-zero rendering cost after creation.

Capabilities

The system supports two core functions:

World Generation: Converts text or single images into navigable 3D scenes
World Reconstruction: Transforms multi-view images or video into 3D representations
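
The split between the two functions depends only on the kind of input supplied. A minimal sketch of that routing follows; the enum, the input labels, and the function name are hypothetical and are not part of any released HY-World API.

```python
from enum import Enum


class Mode(Enum):
    GENERATION = "world generation"
    RECONSTRUCTION = "world reconstruction"


# Mapping taken from the article: text and single images go to generation,
# multi-view images and video go to reconstruction. The keys are made-up labels.
_INPUT_TO_MODE = {
    "text": Mode.GENERATION,
    "single_image": Mode.GENERATION,
    "multi_view_images": Mode.RECONSTRUCTION,
    "video": Mode.RECONSTRUCTION,
}


def select_mode(input_kind: str) -> Mode:
    """Return the core function that handles the given kind of input."""
    try:
        return _INPUT_TO_MODE[input_kind]
    except KeyError:
        raise ValueError(f"unsupported input kind: {input_kind!r}") from None
```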

The model handles diverse visual styles including realistic, cartoon, and game aesthetics. It enables first-person navigation and third-person character exploration with physics-based collision detection.

Open Source Release

Tencent has released the WorldMirror 2.0 inference code and model weights on Hugging Face. The company plans to release additional components including full world generation code, HY-Pano 2.0, WorldNav, and WorldStereo 2.0 at unspecified future dates.

The model requires CUDA 12.4 and supports both single-GPU and multi-GPU inference via PyTorch 2.4.0. In multi-GPU mode, the number of input images must equal or exceed the number of GPUs used.
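
The multi-GPU constraint is easy to check before launching a job. The helpers below are a sketch under the assumption that the only rule is the one stated above (at least as many input images as GPUs); the function names are hypothetical and do not come from the released inference code.

```python
def check_multi_gpu_inputs(num_images: int, num_gpus: int) -> None:
    """Raise ValueError if the image count is too small for the GPU count."""
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    if num_gpus > 1 and num_images < num_gpus:
        raise ValueError(
            f"multi-GPU mode needs at least {num_gpus} input images, got {num_images}"
        )


def usable_gpus(num_images: int, available_gpus: int) -> int:
    """Return the largest GPU count that satisfies the constraint."""
    if num_images < 1 or available_gpus < 1:
        raise ValueError("need at least one image and one GPU")
    # Never use more GPUs than there are input images.
    return min(num_images, available_gpus)
```

In a PyTorch environment, the value passed as `available_gpus` could come from `torch.cuda.device_count()`. With three input images on an eight-GPU machine, for example, the job would be limited to three GPUs.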

What This Means

HY-World 2.0 represents a shift from video-based world models to asset-based generation, addressing persistent issues with video models such as temporal inconsistency and limited reusability. By outputting standard 3D formats compatible with major game engines, the model could accelerate 3D content creation workflows for game development and simulation. However, the staggered release schedule means the complete end-to-end pipeline remains unavailable, which limits immediate practical deployment. The 1.2B parameter count of WorldMirror 2.0 suggests efficient inference compared to larger multimodal models, though Tencent's performance claims have not been independently verified.

Source: huggingface.co

