
Tencent releases OmniWeaving, open-source video generation model with reasoning and multi-modal composition

TL;DR

Tencent's Hunyuan team released OmniWeaving on April 3, 2026. It is an open-source video generation model designed to compete with proprietary systems such as Seedance-2.0. The model combines multimodal composition with reasoning-informed generation and supports eight video generation tasks, including text-to-video, image-to-video, video editing, and compositional generation.


Tencent Releases OmniWeaving, Open-Source Video Generation Model

Tencent's Hunyuan team released OmniWeaving on April 3, 2026, positioning it as an open-source alternative to closed, proprietary video generation systems. The model handles eight distinct task configurations within a single unified system rather than requiring a separate model per task.

Architecture and Technical Foundation

OmniWeaving uses HunyuanVideo-1.5 as its backbone and combines three components in one framework: an MLLM (Multimodal Large Language Model), an MMDiT (Multimodal Diffusion Transformer), and a VAE (variational autoencoder). The architecture incorporates two key improvements:

Thinking Mode: The MLLM activates a reasoning mode that generates intermediate reasoning steps before video generation, translating abstract user intent into semantically precise prompts that condition the diffusion model.

Hidden States DeepStacking: Following a mechanism used in Qwen3-VL, the model extracts hidden states from multiple intermediate MLLM layers, capturing semantic information that ranges from fine-grained details to high-level abstractions. These multi-level features are injected into the first three layers of the MMDiT conditioning branch.
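
The DeepStacking idea can be illustrated with a toy sketch. This is a minimal illustration under stated assumptions, not OmniWeaving's actual code: the function names, the additive injection, the halving "block", and the two-element vectors are all hypothetical. The only detail taken from the description above is that features from several MLLM layers are fed into the first three layers of the conditioning branch.

```python
from typing import List, Tuple

Vector = List[float]


def toy_block(hidden: Vector) -> Vector:
    """Hypothetical stand-in for one conditioning block: halves every value."""
    return [h * 0.5 for h in hidden]


def deepstack_condition(
    cond: Vector,
    mllm_features: List[Vector],
    num_blocks: int = 4,
    inject_layers: int = 3,
) -> Tuple[Vector, List[int]]:
    """Run toy blocks, adding one MLLM feature vector before each early block.

    Assumption: injection is modeled as element-wise addition. The article
    does not say how the real model merges the features.
    """
    hidden = list(cond)
    injected_at: List[int] = []
    for i in range(num_blocks):
        # Only the first `inject_layers` blocks receive multi-level features.
        if i < inject_layers and i < len(mllm_features):
            hidden = [h + f for h, f in zip(hidden, mllm_features[i])]
            injected_at.append(i)
        hidden = toy_block(hidden)
    return hidden, injected_at


# Three feature vectors standing in for shallow, middle, and deep MLLM layers.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, injected = deepstack_condition([1.0, 1.0], features)
print(output, injected)
```

The point of the sketch is the control flow: later blocks see no new MLLM features, so the multi-level semantics have to be absorbed early in the conditioning branch.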

Supported Tasks

OmniWeaving supports eight video generation configurations:

  • Text-to-Video: Generate videos from text prompts
  • First-Frame-to-Video: Animate static images with text guidance
  • Key-Frames-to-Video: Interpolate videos between start and end frames
  • Video-to-Video Editing: Instruction-based manipulation and stylization
  • Reference-to-Video: Single-subject reference-driven generation
  • Compositional Multi-Image-to-Video: Multi-subject generation from 2–4 images
  • Text-Image-Video-to-Video: Generation conditioned on combined text, image, and video inputs
  • Reasoning-Augmented Generation: Reasoning over user intent before video generation
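
The image requirements implied by this list can be written down as a small lookup table. The sketch below is an assumption-laden illustration, not part of the OmniWeaving codebase: the task keys and the function name are hypothetical, only the image-conditioned tasks whose counts the list states or implies are covered, and treating Reference-to-Video as exactly one image is a guess based on the "single-subject" wording.

```python
# Allowed number of input images per task as (minimum, maximum).
# Keys are hypothetical identifiers, not names used by the real code.
TASK_IMAGE_RANGE = {
    "text_to_video": (0, 0),
    "first_frame_to_video": (1, 1),
    "key_frames_to_video": (2, 2),  # start frame and end frame
    "reference_to_video": (1, 1),  # assumption: one reference image
    "compositional_multi_image_to_video": (2, 4),
}


def image_count_is_valid(task: str, num_images: int) -> bool:
    """Return True if `num_images` fits the range listed for `task`."""
    if task not in TASK_IMAGE_RANGE:
        raise ValueError(f"unknown task: {task}")
    low, high = TASK_IMAGE_RANGE[task]
    return low <= num_images <= high


print(image_count_is_valid("compositional_multi_image_to_video", 3))
print(image_count_is_valid("key_frames_to_video", 1))
```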

The reasoning and composition tasks can be optionally enabled via a --think flag during inference.
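
How an optional reasoning stage might hang off a single switch can be sketched with a toy command-line parser. Only the --think flag name comes from the release. The --prompt argument, the function names, and the stage labels are hypothetical, and this is not the repository's real inference script.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Toy parser; it does not mirror the real OmniWeaving command line."""
    parser = argparse.ArgumentParser(description="Toy inference entry point")
    # Hypothetical argument, included only so the example has an input.
    parser.add_argument("--prompt", required=True, help="text prompt")
    # Flag name taken from the release; its behaviour here is an assumption.
    parser.add_argument(
        "--think",
        action="store_true",
        help="run a reasoning stage before generation",
    )
    return parser


def plan_stages(args: argparse.Namespace) -> List[str]:
    """List the stages that would run, with reasoning first when enabled."""
    stages = []
    if args.think:
        stages.append("reason")
    stages.append("generate")
    return stages


args = build_parser().parse_args(["--prompt", "a cat surfing", "--think"])
print(plan_stages(args))
```

Using a store_true flag keeps reasoning off by default, which matches the description of the mode as optional.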

Benchmarking

Tencent introduced IntelligentVBench, described as the first comprehensive benchmark for assessing unified video generation with reasoning capabilities. According to the team, OmniWeaving achieves state-of-the-art performance among open-source unified video generation models, though specific benchmark scores were not disclosed in the release announcement.

Availability and Deployment

Code and model weights were released on April 3, 2026. The model requires installation of attention libraries for optimized inference:

  • Flash Attention for faster inference and reduced GPU memory
  • Flex-Block-Attention for sparse attention optimization
  • SageAttention as an alternative optimization layer

The inference pipeline requires 8 GPUs by default but can be adapted for limited GPU memory environments through configuration adjustments and memory expansion settings. The codebase is available on GitHub with detailed checkpoint download instructions.

Research Background

OmniWeaving is the result of collaboration between Tencent Hunyuan, Zhejiang University, and Nanyang Technological University. The research was authored by Kaihang Pan, Qi Tian, and others, with the paper published on arXiv on March 26, 2026. The team trained the model on massive-scale pretraining datasets encompassing diverse compositional and reasoning-augmented scenarios.

What this means

OmniWeaving addresses a gap in open-source video generation by offering reasoning-aware composition capabilities previously limited to proprietary systems. Its explicit intermediate reasoning steps and multi-level semantic conditioning are a concrete technical approach to bridging user intent and pixel-level generation. For practitioners, this means access to a model that supports complex video generation workflows without closed-source dependencies. The IntelligentVBench benchmark offers a standardized evaluation framework for next-generation video models, though its value depends on broader community uptake and on independent reproduction of the claimed performance gains.
