ByteDance Open-Sources Bernini-R Video Diffusion Model With Semantic Planning Architecture

TL;DR

ByteDance released Bernini-R, an open-source video generation and editing model that combines an MLLM-based semantic planner with a DiT-based renderer. Hopper-class GPUs (H100/H800/H200) are recommended for FlashAttention-3 support, and the model handles multiple tasks including text-to-video, video editing, and reference-guided generation.

June 3, 2026 · 8:51 AM

ByteDance released Bernini-R, an open-source video generation and editing model that combines an MLLM-based semantic planner with a DiT-based renderer. The company made the inference code and model weights available on June 1, 2025, following a research paper published May 22.

Technical Architecture

Bernini-R is built on the Wan2.2 base model (Wan-AI/Wan2.2-T2V-A14B) and uses a dual-checkpoint architecture with separate high-noise and low-noise transformer weights. The documented environment is:

Python 3.11.2
CUDA 12.4 (minimum 12.3)
PyTorch 2.5.1+cu124
Hopper GPU (H100/H800/H200) recommended for FlashAttention-3 support
Other CUDA GPUs fall back to FlashAttention-2 or PyTorch SDPA
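
The fallback order in the list above can be sketched as a small selection function. This is an illustration rather than the repository's actual logic: the function name, the backend labels, and the two availability flags are hypothetical, and the thresholds rest on the assumptions that Hopper GPUs report CUDA compute capability 9.x and that FlashAttention-2 needs Ampere (8.x) or newer.

```python
def pick_attention_backend(compute_capability, has_fa3=False, has_fa2=False):
    """Choose an attention backend label from a (major, minor) capability tuple.

    Hypothetical helper: names and labels are illustrative, not Bernini-R APIs.
    """
    major, _minor = compute_capability
    # Assumption: Hopper (H100/H800/H200) reports compute capability 9.x.
    if major == 9 and has_fa3:
        return "flash-attn-3"
    # Assumption: FlashAttention-2 kernels need Ampere (8.x) or newer.
    if major >= 8 and has_fa2:
        return "flash-attn-2"
    # PyTorch scaled dot-product attention is the last resort.
    return "sdpa"
```

On a machine with PyTorch installed, the capability tuple could come from `torch.cuda.get_device_capability()`, which returns `(major, minor)` for the current device.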

The system uses pinned dependencies: diffusers 0.35.2, accelerate 0.34.2, and transformers 4.57.3.
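
Because the versions are pinned exactly, a quick pre-flight check can catch a mismatched environment before a long run. The sketch below is one possible way to do that with the standard library; the helper names are hypothetical and are not part of the Bernini-R code.

```python
from importlib import metadata

# Pins as reported in the article.
PINS = {"diffusers": "0.35.2", "accelerate": "0.34.2", "transformers": "4.57.3"}


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, pins=PINS):
    """Return {package: (expected, found)} for every pin that is not met exactly."""
    return {
        name: (want, installed.get(name))
        for name, want in pins.items()
        if installed.get(name) != want
    }
```

Calling `find_mismatches(installed_versions(PINS))` returns an empty dict when the environment matches all three pins.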

Capabilities and Performance

The model supports seven task types:

Text-to-image (t2i)
Image editing (i2i)
Text-to-video (t2v)
Video editing (v2v)
Motion-aware video editing (mv2v)
Reference + video editing (rv2v)
Reference-to-video (r2v)

According to ByteDance, Bernini reaches "the first tier among leading closed-source commercial models" on video editing tasks. This claim is based on a self-built arena platform where human annotators blindly vote on paired edits, aggregated into Bradley-Terry scores and win-rate matrices. No specific numerical benchmarks were disclosed.
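
For readers unfamiliar with the aggregation step, here is a minimal sketch of how pairwise votes can be turned into Bradley-Terry scores and a win-rate matrix. It uses the standard iterative (minorization-maximization) update and is not ByteDance's implementation; the function names and the toy vote counts in the usage note are made up for illustration. It assumes every model has at least one recorded win and that `wins[i][i]` is 0.

```python
def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a square matrix of win counts.

    wins[i][j] is the number of times model i was preferred over model j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each opponent contributes (comparisons with j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [x / scale for x in new_p]
    return p


def win_rate_matrix(wins):
    """Empirical win rate of i over j, or None where no comparisons exist."""
    n = len(wins)
    rates = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            games = wins[i][j] + wins[j][i]
            if i != j and games > 0:
                rates[i][j] = wins[i][j] / games
    return rates
```

With two hypothetical models where the first wins 8 of 10 blind votes, `bradley_terry([[0, 8], [2, 0]])` converges to strengths of about 0.8 and 0.2, matching the observed win rate.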

Default inference outputs 480p video at 16fps (848px max image size). The system supports higher resolutions, with examples showing 720p/24fps output at 1280px max size.

Deployment Requirements

Single-GPU inference works for image tasks (t2i, i2i with --num_frames 1). Video tasks use 8-GPU configurations launched via torchrun with Ulysses sequence parallelism. The --ulysses flag sets the N-way sequence parallelism applied to each sample, and the remaining GPUs are used for data parallelism over the task list.
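
The split between sequence parallelism and data parallelism comes down to simple arithmetic: with 8 GPUs and a 4-way Ulysses setting, there are 8 / 4 = 2 groups, each working on a different sample. The sketch below illustrates that layout. Grouping ranks contiguously and handing out tasks round-robin are assumptions made for the example, and the function names are hypothetical; the actual scheduling in the repository may differ.

```python
def parallel_layout(world_size, ulysses):
    """Partition ranks into sequence-parallel groups of size `ulysses`.

    Assumption: ranks are grouped contiguously, e.g. [0..3] and [4..7].
    """
    if ulysses < 1 or world_size % ulysses != 0:
        raise ValueError("world_size must be a positive multiple of ulysses")
    return [
        list(range(start, start + ulysses))
        for start in range(0, world_size, ulysses)
    ]


def assign_tasks(tasks, num_groups):
    """Distribute tasks across data-parallel groups round-robin (an assumption)."""
    return [tasks[g::num_groups] for g in range(num_groups)]
```

For example, `parallel_layout(8, 4)` yields two groups of four ranks, while `parallel_layout(8, 8)` yields a single group in which all eight GPUs cooperate on one sample.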

Multi-GPU setups require Open-VeOmni (Apache-2.0, Python 3.11) for sequence parallelism, though single-GPU inference does not need this dependency.

Model Access

Two distribution methods are available:

Diffusers format (recommended): ByteDance/Bernini-R-Diffusers on Hugging Face — self-contained directory bundling VAE, UMT5 text encoder, tokenizer, and Bernini-R weights
Separate checkpoints: Base Wan2.2 model plus Bernini-R high-noise/low-noise weights from ByteDance/Bernini-R

ByteDance recommends using a prompt enhancer (--use_pe flag) through an OpenAI-compatible endpoint for best generation quality. The enhancer requires configuring BERNINI_PE_API_KEY, BERNINI_PE_BASE_URL, and BERNINI_PE_MODEL environment variables.
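
Since the enhancer depends on three environment variables, it can be useful to verify them before launching a job with --use_pe. The following is a minimal sketch of such a check; the helper name is hypothetical and the check is not taken from the Bernini-R code.

```python
import os

# Variable names as listed in the article.
PE_VARS = ("BERNINI_PE_API_KEY", "BERNINI_PE_BASE_URL", "BERNINI_PE_MODEL")


def missing_pe_vars(env=None):
    """Return the prompt-enhancer variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in PE_VARS if not env.get(name)]
```

Calling `missing_pe_vars()` with no arguments inspects the current process environment; passing a plain dict makes the check easy to exercise in isolation.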

What This Means

Bernini-R represents ByteDance's entry into open-source video generation, competing with models from Stability AI and others in the video diffusion space. The recommended Hopper GPUs (H100/H800/H200) set a high barrier to entry: these GPUs cost $25,000-40,000 each and are primarily available through cloud providers, although other CUDA GPUs can run the model through the FlashAttention-2 or PyTorch SDPA fallback. The dual-checkpoint architecture and the 8-GPU configuration used for video tasks suggest the release is aimed at research labs and companies with substantial compute budgets rather than individual developers. ByteDance's self-reported performance claims still need independent verification, because the company provided no standardized benchmark scores on public datasets.

Source: huggingface.co
