
ByteDance releases Lance, 3B-parameter unified multimodal model handling image and video generation, editing, and understanding

TL;DR

ByteDance has released Lance, a 3-billion parameter multimodal model that performs image and video generation, editing, and understanding within a single framework. The model was trained entirely from scratch using 128 A100 GPUs and achieves 84.67% on DPG-Bench and 74% on GenEval, competing with larger models despite its compact size.

ByteDance Releases Lance, 3B Unified Multimodal Model

ByteDance has released Lance, a 3-billion parameter model that handles text-to-image generation, text-to-video generation, image editing, video editing, and visual question answering in a single unified framework. The model was trained entirely from scratch using 128 A100 GPUs.

Technical Specifications

Lance operates with 3 billion active parameters and supports video generation up to 121 frames at 768×768 resolution (480p preset). According to ByteDance, the model uses flow matching scheduling with a default timestep shift of 3.5 and 30 denoising steps. The architecture requires at least 40GB VRAM for inference.
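
To make the scheduling numbers concrete, here is a minimal sketch of a shifted flow-matching schedule with 30 steps and a shift of 3.5. It assumes the shift formula popularized by SD3-style samplers, sigma' = shift * sigma / (1 + (shift - 1) * sigma); ByteDance has not confirmed that Lance uses this exact formula, and the names shifted_sigmas, euler_sample, and velocity_fn are hypothetical.

```python
def shifted_sigmas(num_steps: int = 30, shift: float = 3.5) -> list[float]:
    """Return num_steps + 1 noise levels, from 1.0 (pure noise) to 0.0 (clean).

    Assumption: the SD3-style shift formula, not a confirmed Lance detail.
    """
    sigmas = []
    for i in range(num_steps + 1):
        s = 1.0 - i / num_steps  # uniform grid from 1.0 down to 0.0
        sigmas.append(shift * s / (1.0 + (shift - 1.0) * s))
    return sigmas


def euler_sample(velocity_fn, x, num_steps: int = 30, shift: float = 3.5):
    """Integrate dx/dsigma = v(x, sigma) from sigma = 1 to sigma = 0 with Euler steps.

    velocity_fn is a stand-in for the model: it takes (x, sigma) and returns
    a velocity with the same length as x.
    """
    sigmas = shifted_sigmas(num_steps, shift)
    for cur, nxt in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, cur)
        x = [xi + (nxt - cur) * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    sigmas = shifted_sigmas()
    print(len(sigmas), round(sigmas[15], 4))
```

A shift above 1 keeps the sampler at high noise levels for more of its steps: with a shift of 3.5, the halfway point of the schedule still sits near sigma = 0.78 rather than 0.5.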

The model's training used a "staged multi-task recipe," though ByteDance has not disclosed the training dataset size, training duration, or data cutoff date. Pricing information has not been announced.

Benchmark Performance

On DPG-Bench, a benchmark that scores how faithfully generated images follow long, dense prompts, Lance scores 84.67% overall, with particularly strong results in the relation (93.38%) and entity (91.07%) categories. The model trails larger unified models like TUNA-27B (86.54%) and InternVL-U (85.18%) but outperforms the 7B BAGEL model.

On GenEval, which tests compositional image generation across attributes like object count and spatial positioning, Lance achieves 74% overall. This matches SD3-Medium (2B parameters) and sits one point behind FLUX.1-dev (75%), which uses 12B parameters.

ByteDance reports competitive scores on specific categories: 99% for single-object generation, 94% for two-object generation, and 72% for counting accuracy.

Capabilities

The model handles six distinct task types through a unified interface: text-to-image, text-to-video, image editing, video editing, image understanding (visual question answering), and video understanding (video captioning and analysis). ByteDance demonstrates video understanding capabilities including counting actions, spatial reasoning, and temporal analysis.

For generation tasks, Lance uses classifier-free guidance with a default scale of 4.0 for text conditioning. The model supports multi-turn consistency editing, maintaining coherent changes across sequential edit operations.
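
As an illustration of what the guidance scale does, the sketch below applies the standard classifier-free guidance combination to two toy predictions. The function name apply_cfg and the toy values are hypothetical; how Lance wires guidance into its own sampler is not described in the announcement.

```python
def apply_cfg(cond: list[float], uncond: list[float], scale: float = 4.0) -> list[float]:
    """Standard classifier-free guidance: push the prediction away from the
    unconditional output, in the direction of the text-conditioned one.

    cond   -- model output with the text prompt
    uncond -- model output with an empty (null) prompt
    scale  -- guidance strength; 4.0 is the default reported for Lance
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Toy two-element "predictions", not real model outputs.
    print(apply_cfg([1.0, 2.0], [0.0, 1.0]))
```

A scale of 1.0 reproduces the conditional prediction unchanged, and larger values trade sample diversity for closer prompt adherence. Each guided step needs two forward passes, one with the prompt and one without.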

Availability

Model weights are available on Hugging Face under the bytedance-research organization. ByteDance provides a command-line inference tool and Gradio interface. The system requires Python 3.10+ and CUDA 12.4+.
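
A quick preflight check against the stated requirements might look like the sketch below. It assumes the inference code is PyTorch-based, which the announcement does not state, and the helper names are hypothetical. The CUDA check reads torch.version.cuda, which reports the CUDA version PyTorch was built against.

```python
import sys


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '3.10' or '12.4.1' into a comparable (major, minor) tuple."""
    return tuple(int(part) for part in text.split(".")[:2])


def meets_minimum(found: str, minimum: str) -> bool:
    """Compare versions numerically, so that '3.9' correctly fails '3.10'."""
    return parse_version(found) >= parse_version(minimum)


def check_environment() -> dict:
    """Report whether Python and CUDA meet the stated minimums.

    The 'cuda' entry is None when PyTorch is not installed (unknown),
    and False on CPU-only PyTorch builds.
    """
    python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    report = {"python": meets_minimum(python_version, "3.10")}
    try:
        import torch  # assumption: Lance inference runs on PyTorch

        cuda = torch.version.cuda  # None on CPU-only builds
        report["cuda"] = cuda is not None and meets_minimum(cuda, "12.4")
    except ImportError:
        report["cuda"] = None
    return report


if __name__ == "__main__":
    print(check_environment())
```

Comparing versions as integer tuples matters here: as plain strings, "3.9" would sort above "3.10" and pass the check incorrectly.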

What This Means

Lance represents ByteDance's entry into unified multimodal AI, directly competing with models like DeepSeek-Janus, Show-o, and OmniGen. At 3B parameters, it's significantly smaller than most unified models while maintaining competitive performance on standard benchmarks. The efficiency suggests progress in model architecture design, though the lack of disclosed training details makes it difficult to assess reproducibility or training costs beyond the stated 128-GPU budget. The model's commercial viability will depend on pricing, which ByteDance has not yet announced.
