
Tencent releases OmniWeaving, open-source video generation model with reasoning and multimodal composition

TL;DR

Tencent's Hunyuan team released OmniWeaving on April 3, 2026, an open-source video generation model designed to compete with proprietary systems like Seedance-2.0. The model combines multimodal composition with reasoning-informed generation and supports eight video generation tasks, including text-to-video, image-to-video, video editing, and compositional generation.


Tencent's Hunyuan team released OmniWeaving on April 3, 2026, positioning it as an open-source alternative to closed proprietary video generation systems. The model represents a significant step toward unified video generation capabilities, supporting eight distinct task configurations.

Architecture and Technical Foundation

OmniWeaving is built on HunyuanVideo-1.5 as its backbone, integrating an MLLM (Multimodal Large Language Model) + MMDiT (Multimodal Diffusion Transformer) + VAE framework. The architecture incorporates two key improvements:

Thinking Mode: The MLLM activates a reasoning mode that generates intermediate reasoning steps before video generation, translating abstract user intent into semantically precise prompts that condition the diffusion model.
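The two-stage flow described above can be sketched as follows. This is an illustrative sketch, not Tencent's code: the function names (`reason_about_intent`, `sample_video`) are hypothetical stubs standing in for the MLLM reasoning pass and the diffusion sampler.

```python
# Hypothetical sketch of the "thinking mode" flow: the MLLM first expands
# the user's abstract intent into a detailed prompt, and only that refined
# prompt conditions the diffusion model.

def reason_about_intent(user_intent: str) -> str:
    """Stand-in for the MLLM's reasoning pass: turn abstract intent
    into a semantically precise generation prompt."""
    # A real system would run chain-of-thought decoding here.
    return f"Detailed scene description derived from: {user_intent}"

def sample_video(prompt: str) -> dict:
    """Stand-in for the MMDiT diffusion sampler."""
    return {"conditioning_prompt": prompt, "frames": []}

def generate(user_intent: str, think: bool = False) -> dict:
    # With thinking enabled, the intermediate reasoning step runs first;
    # otherwise the raw user prompt conditions the sampler directly.
    prompt = reason_about_intent(user_intent) if think else user_intent
    return sample_video(prompt)

video = generate("a cat chasing autumn leaves", think=True)
```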

Hidden States DeepStacking: Following mechanisms in Qwen3-VL, the model extracts hidden states from multiple intermediate MLLM layers, capturing semantic information ranging from fine-grained details to high-level abstractions. These multi-level features are injected into the first three layers of the MMDiT conditioning branch.
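A minimal sketch of that multi-level conditioning idea, assuming made-up layer indices and dimensions (the actual tap layers, widths, and projection scheme are not disclosed in the release):

```python
import numpy as np

# Illustrative sketch (not Tencent's code) of hidden-state DeepStacking:
# features from several intermediate MLLM layers are projected to a shared
# width and handed to the first three conditioning layers of the MMDiT.

rng = np.random.default_rng(0)
seq_len, mllm_dim, cond_dim = 16, 64, 32
tap_layers = [8, 16, 24]  # hypothetical intermediate MLLM layers to tap

# Pretend per-layer hidden states from the MLLM forward pass.
hidden_states = {l: rng.normal(size=(seq_len, mllm_dim)) for l in tap_layers}

# One learned projection per tapped layer (random weights here).
projections = {l: rng.normal(size=(mllm_dim, cond_dim)) for l in tap_layers}

# Each tapped layer conditions one of the first three MMDiT layers,
# pairing fine-to-coarse semantics with successive early blocks.
mmdit_conditioning = [hidden_states[l] @ projections[l] for l in tap_layers]
```

The point of the sketch is the shape of the interface: three feature tensors, one per tapped depth, rather than a single final-layer embedding.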

Supported Tasks

OmniWeaving supports eight video generation configurations:

  • Text-to-Video: Generate videos from text prompts
  • First-Frame-to-Video: Animate static images with text guidance
  • Key-Frames-to-Video: Interpolate videos between start and end frames
  • Video-to-Video Editing: Instruction-based manipulation and stylization
  • Reference-to-Video: Single-subject reference-driven generation
  • Compositional Multi-Image-to-Video: Multi-subject generation from 2–4 images
  • Text-Image-Video-to-Video: Generation conditioned on combined text, image, and video inputs
  • Reasoning-Augmented Generation: Reasoning over user intent before video generation

The reasoning and composition tasks can be enabled via an optional --think flag during inference.
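A hypothetical sketch of how such a CLI toggle might be wired; only the --think flag name comes from the release, while the task choices and defaults here are illustrative:

```python
import argparse

# Illustrative inference CLI: --think switches on the MLLM reasoning
# pass before generation. Task names and defaults are assumptions.
parser = argparse.ArgumentParser(prog="omniweaving-infer")
parser.add_argument("--task", default="t2v",
                    choices=["t2v", "i2v", "v2v", "ref2v", "multi-i2v"])
parser.add_argument("--prompt", required=True)
parser.add_argument("--think", action="store_true",
                    help="run the reasoning pass before video generation")

# Example invocation with reasoning enabled.
args = parser.parse_args(["--prompt", "a sunrise timelapse", "--think"])
```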

Benchmarking

Tencent introduced IntelligentVBench, described as the first comprehensive benchmark for assessing unified video generation with reasoning capabilities. According to the team, OmniWeaving achieves state-of-the-art performance among open-source unified video generation models, though specific benchmark scores were not disclosed in the release announcement.

Availability and Deployment

Code and model weights were released on April 3, 2026. The model requires installation of attention libraries for optimized inference:

  • Flash Attention for faster inference and reduced GPU memory
  • Flex-Block-Attention for sparse attention optimization
  • SageAttention as an alternative optimization layer
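The library names above come from the release; one plausible way to prefer the fastest available backend and degrade gracefully is sketched below. The module names and fallback order are assumptions, not the repository's actual logic:

```python
# Hypothetical backend selection: try the optimized attention libraries
# in order of preference, falling back to PyTorch's default attention
# if none is installed. Module names here are assumed.

def pick_attention_backend() -> str:
    """Return the name of the first importable attention backend."""
    candidates = [
        ("flash_attn", "flash-attention"),
        ("flex_block_attn", "flex-block-attention"),
        ("sageattention", "sage-attention"),
    ]
    for module, name in candidates:
        try:
            __import__(module)
            return name
        except ImportError:
            continue
    return "torch-sdpa"  # dense scaled-dot-product attention fallback

backend = pick_attention_backend()
```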

The inference pipeline requires 8 GPUs by default but can be adapted for limited GPU memory environments through configuration adjustments and memory expansion settings. The codebase is available on GitHub with detailed checkpoint download instructions.
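The memory-adaptation idea can be sketched as a configuration transform; the key names (`num_gpus`, `cpu_offload`, `tile_vae`) are hypothetical, standing in for whatever knobs the actual codebase exposes:

```python
# Illustrative configuration sketch for reduced-GPU inference.
# All option names are assumptions, not the repository's real settings.

default_cfg = {"num_gpus": 8, "cpu_offload": False, "tile_vae": False}

def adapt_for_limited_memory(cfg: dict, available_gpus: int) -> dict:
    """Trade speed for memory when fewer than 8 GPUs are available."""
    cfg = dict(cfg)  # leave the default config untouched
    cfg["num_gpus"] = available_gpus
    if available_gpus < 8:
        cfg["cpu_offload"] = True  # stream weights from host RAM
        cfg["tile_vae"] = True     # decode the VAE output in tiles
    return cfg

cfg = adapt_for_limited_memory(default_cfg, 2)
```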

Research Background

OmniWeaving is the result of collaboration between Tencent Hunyuan, Zhejiang University, and Nanyang Technological University. The research was authored by Kaihang Pan, Qi Tian, and others, with the paper published on arXiv on March 26, 2026. The team trained the model on massive-scale pretraining datasets encompassing diverse compositional and reasoning-augmented scenarios.

What this means

OmniWeaving addresses a significant gap in open-source video generation by offering reasoning-aware composition capabilities previously limited to proprietary systems. The explicit integration of intermediate reasoning steps and multi-level semantic conditioning represents a technical approach to bridging user intent and pixel-level generation. For practitioners, this means access to a production-ready model supporting complex video generation workflows without closed-source dependencies. The IntelligentVBench benchmark provides a standardized evaluation framework for next-generation video models, though its usefulness depends on broader community adoption and on the reproducibility of the claimed performance gains.
