Tencent releases OmniWeaving, open-source video generation model with reasoning and multi-modal composition

TL;DR

On April 3, 2026, Tencent's Hunyuan team released OmniWeaving, an open-source video generation model designed to compete with proprietary systems like Seedance-2.0. The model combines multimodal composition with reasoning before generation and supports eight video generation tasks, including text-to-video, image-to-video, video editing, and compositional generation.

Tencent's Hunyuan team released OmniWeaving on April 3, 2026, positioning it as an open-source alternative to proprietary video generation systems. The model targets unified video generation, handling eight distinct task configurations in a single framework.

Architecture and Technical Foundation

OmniWeaving is built on HunyuanVideo-1.5 as its backbone, integrating an MLLM (Multimodal Large Language Model) + MMDiT (Multimodal Diffusion Transformer) + VAE framework. The architecture incorporates two key improvements:

Thinking Mode: The MLLM activates a reasoning mode that generates intermediate reasoning steps before video generation, translating abstract user intent into semantically precise prompts that condition the diffusion model.

Hidden States DeepStacking: Following a mechanism used in Qwen3-VL, the model extracts hidden states from multiple intermediate MLLM layers, capturing semantic information that ranges from fine-grained details to high-level abstractions. These multi-level features are injected into the first three layers of the MMDiT conditioning branch.
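The multi-level injection described above can be illustrated with a small, dependency-free sketch. It is not Tencent's implementation: the function names, the evenly spaced choice of tapped layers, the use of plain addition as the injection step, and the omission of any real transformer block are all assumptions made for illustration. Only the idea of feeding features from several MLLM layers into the first three conditioning layers comes from the release.

```python
from typing import List

Vector = List[float]


def select_layers(num_mllm_layers: int, num_taps: int) -> List[int]:
    """Pick evenly spaced MLLM layer indices (assumed strategy, not from the release)."""
    step = num_mllm_layers // num_taps
    return [step * (i + 1) - 1 for i in range(num_taps)]


def deepstack_inject(
    mllm_hidden: List[Vector],
    cond_stream: Vector,
    num_inject_layers: int = 3,
) -> List[Vector]:
    """Add one tapped MLLM feature to the conditioning stream per early layer.

    Returns the conditioning stream as seen by each of the first
    `num_inject_layers` layers. A real MMDiT block would also transform
    the stream; that step is left out here.
    """
    taps = select_layers(len(mllm_hidden), num_inject_layers)
    stream = list(cond_stream)
    per_layer_inputs = []
    for tap in taps:
        feature = mllm_hidden[tap]
        # Residual-style injection: element-wise sum (hypothetical choice).
        stream = [s + f for s, f in zip(stream, feature)]
        per_layer_inputs.append(stream)
    return per_layer_inputs


# Toy example: 12 MLLM layers, each producing a 2-dimensional hidden state.
toy_hidden = [[float(i), float(i)] for i in range(12)]
layer_inputs = deepstack_inject(toy_hidden, cond_stream=[0.0, 0.0])
```

In this toy setup the shallow, middle, and final MLLM layers each contribute to one of the first three conditioning layers, which mirrors the stated goal of exposing both low-level and high-level semantics to the diffusion model.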

Supported Tasks

OmniWeaving supports eight video generation configurations:

  • Text-to-Video: Generate videos from text prompts
  • First-Frame-to-Video: Animate static images with text guidance
  • Key-Frames-to-Video: Interpolate videos between start and end frames
  • Video-to-Video Editing: Instruction-based manipulation and stylization
  • Reference-to-Video: Single-subject reference-driven generation
  • Compositional Multi-Image-to-Video: Multi-subject generation from 2–4 images
  • Text-Image-Video-to-Video: Generation conditioned on combined text, image, and video inputs
  • Reasoning-Augmented Generation: Reasoning over user intent before video generation

The reasoning and composition tasks can be optionally enabled via a --think flag during inference.
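As a rough picture of how an optional reasoning step could be gated by such a flag, consider the sketch below. Only the --think flag name comes from the release; the --prompt argument, the function names, and the stubbed reasoning pass are hypothetical and do not reflect the actual OmniWeaving command-line interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical launcher; only the --think flag name is taken from the release."""
    parser = argparse.ArgumentParser(description="Illustrative video generation launcher")
    parser.add_argument("--prompt", required=True, help="user prompt (hypothetical argument)")
    parser.add_argument("--think", action="store_true",
                        help="run a reasoning pass before generation")
    return parser


def reason_about_intent(prompt: str) -> str:
    """Stand-in for the MLLM reasoning pass; a real model would generate this text."""
    return f"[reasoned] {prompt.strip()}"


def prepare_conditioning_prompt(prompt: str, think: bool) -> str:
    """Return the prompt that would condition the diffusion model."""
    return reason_about_intent(prompt) if think else prompt


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--prompt", "a cat learns to skate", "--think"])
final_prompt = prepare_conditioning_prompt(args.prompt, args.think)
```

Without the flag, the prompt passes through unchanged, so the reasoning pass adds cost only when a request actually needs it.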

Benchmarking

Tencent introduced IntelligentVBench, described as the first comprehensive benchmark for assessing unified video generation with reasoning capabilities. According to the team, OmniWeaving achieves state-of-the-art performance among open-source unified video generation models, though specific benchmark scores were not disclosed in the release announcement.

Availability and Deployment

Code and model weights were released on April 3, 2026. The model requires installation of attention libraries for optimized inference:

  • Flash Attention for faster inference and reduced GPU memory
  • Flex-Block-Attention for sparse attention optimization
  • SageAttention as an alternative optimization layer

The inference pipeline requires 8 GPUs by default but can be adapted for limited GPU memory environments through configuration adjustments and memory expansion settings. The codebase is available on GitHub with detailed checkpoint download instructions.

Research Background

OmniWeaving is the result of collaboration between Tencent Hunyuan, Zhejiang University, and Nanyang Technological University. The research was authored by Kaihang Pan, Qi Tian, and others, with the paper published on arXiv on March 26, 2026. The team trained the model on massive-scale pretraining datasets encompassing diverse compositional and reasoning-augmented scenarios.

What this means

OmniWeaving addresses a gap in open-source video generation by offering reasoning-aware composition capabilities previously limited to proprietary systems. Its intermediate reasoning steps and multi-level semantic conditioning are intended to bridge user intent and pixel-level generation. For practitioners, this means access to an openly available model that supports complex video generation workflows without closed-source dependencies. IntelligentVBench offers a standardized evaluation framework for unified video models, though its value depends on broader community uptake and on independent reproduction of the claimed performance gains.
