ByteDance releases Lance, 3B-parameter unified multimodal model handling image and video generation, editing, and unders
ByteDance has released Lance, a 3-billion parameter multimodal model that performs image and video generation, editing, and understanding within a single framework. The model was trained entirely from scratch using 128 A100 GPUs and achieves 84.67% on DPG-Bench and 74% on GenEval, competing with larger models despite its compact size.
ByteDance Releases Lance, 3B Unified Multimodal Model
ByteDance has released Lance, a 3-billion parameter model that handles text-to-image generation, text-to-video generation, image editing, video editing, and visual question answering in a single unified framework. The model was trained entirely from scratch using 128 A100 GPUs.
Technical Specifications
Lance has 3 billion active parameters and generates video of up to 121 frames at 768×768 resolution (480p preset). According to ByteDance, the model uses flow-matching scheduling with a default timestep shift of 3.5 and 30 denoising steps. Inference requires at least 40 GB of VRAM.
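To make the sampler settings concrete, here is a minimal Python sketch of how a timestep shift of 3.5 and 30 denoising steps could combine in a flow-matching sampler. It assumes the shift formula popularized by Stable Diffusion 3 and a plain Euler integrator; ByteDance has not confirmed that Lance uses either, and the names `shifted_sigmas` and `euler_sample` are hypothetical.

```python
from typing import Callable, List


def shifted_sigmas(num_steps: int = 30, shift: float = 3.5) -> List[float]:
    """Return num_steps + 1 noise levels running from 1.0 (pure noise) to 0.0 (clean).

    Assumption: the SD3-style shift sigma' = shift * s / (1 + (shift - 1) * s).
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    # Start from an evenly spaced grid over [1, 0].
    uniform = [1.0 - i / num_steps for i in range(num_steps + 1)]
    # A shift above 1 keeps the sampler at high noise levels for more of its steps.
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in uniform]


def euler_sample(velocity_fn: Callable[[float, float], float],
                 x: float, sigmas: List[float]) -> float:
    """Integrate dx/dsigma = v(x, sigma) with Euler steps along the schedule."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (sigma_next - sigma) * velocity_fn(x, sigma)
    return x


if __name__ == "__main__":
    sigmas = shifted_sigmas()
    print(len(sigmas) - 1, "steps; first levels:", [round(s, 3) for s in sigmas[:4]])

    # Toy scalar example: with x_sigma = (1 - sigma) * data + sigma * noise,
    # the ideal velocity is the constant (noise - data).
    data, noise = 2.0, -1.0
    print("recovered:", euler_sample(lambda x, s: noise - data, noise, sigmas))
```

In a real model, `velocity_fn` would be the network's prediction on a latent tensor rather than a scalar constant; the schedule logic is the same.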
The model's training used a "staged multi-task recipe," though ByteDance has not disclosed the training dataset size, training duration, or data cutoff date. Pricing information has not been announced.
Benchmark Performance
On DPG-Bench, a comprehensive image generation evaluation, Lance scores 84.67% overall, with particularly strong performance in relation understanding (93.38%) and entity recognition (91.07%). The model trails larger unified models like TUNA-27B (86.54%) and InternVL-U (85.18%) but outperforms the 7B BAGEL model.
On GenEval, which tests compositional image generation across attributes such as object count and spatial positioning, Lance achieves 74% overall. This matches SD3-Medium (2B parameters) and falls one point short of FLUX.1-dev's 75%, even though FLUX.1-dev has 12B parameters.
ByteDance reports competitive scores on specific categories: 99% for single-object generation, 94% for two-object generation, and 72% for counting accuracy.
Capabilities
The model handles six distinct task types through a unified interface: text-to-image, text-to-video, image editing, video editing, image understanding (visual question answering), and video understanding (video captioning and analysis). ByteDance demonstrates video understanding capabilities including counting actions, spatial reasoning, and temporal analysis.
For generation tasks, Lance uses classifier-free guidance with a default scale of 4.0 for text conditioning. The model supports multi-turn consistency editing, maintaining coherent changes across sequential edit operations.
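As a rough illustration of what a guidance scale of 4.0 does, the sketch below applies the standard classifier-free guidance formula to two toy prediction vectors. The function name `apply_cfg` and the list-based inputs are assumptions made for this example; Lance's actual implementation operates on model outputs and may parameterize guidance differently.

```python
from typing import List


def apply_cfg(cond: List[float], uncond: List[float], scale: float = 4.0) -> List[float]:
    """Blend text-conditioned and unconditioned predictions.

    Standard formula: guided = uncond + scale * (cond - uncond).
    scale = 1.0 returns the conditioned prediction unchanged; larger values
    push the result further in the direction the text prompt suggests.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    conditioned = [1.0, 2.0]    # toy prediction with the text prompt
    unconditioned = [0.0, 1.0]  # toy prediction with an empty prompt
    print(apply_cfg(conditioned, unconditioned))
```

Each denoising step under this scheme needs two forward passes, one with the prompt and one without, which roughly doubles the per-step compute compared with unguided sampling.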
Availability
Model weights are available on Hugging Face under the bytedance-research organization. ByteDance provides a command-line inference tool and Gradio interface. The system requires Python 3.10+ and CUDA 12.4+.
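A quick way to check a machine against the stated requirements (Python 3.10+, CUDA 12.4+, and at least 40 GB of VRAM) is sketched below. It assumes the inference code runs on PyTorch, which the announcement does not state here; the helper names are hypothetical and this is not part of ByteDance's tooling.

```python
import sys
from typing import List, Optional, Tuple

MIN_PYTHON = (3, 10)
MIN_CUDA = (12, 4)
MIN_VRAM_GB = 40.0


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a string such as '12.4' into (12, 4), keeping major and minor only."""
    return tuple(int(part) for part in text.split(".")[:2])


def meets_requirements(python_version: Tuple[int, ...],
                       cuda_version: Optional[str],
                       vram_gb: float) -> List[str]:
    """Return a list of unmet requirements; an empty list means all checks pass."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10 or newer is required")
    if cuda_version is None or parse_version(cuda_version) < MIN_CUDA:
        problems.append("CUDA 12.4 or newer is required")
    if vram_gb < MIN_VRAM_GB:
        problems.append("at least 40 GB of VRAM is required")
    return problems


def detect_gpu() -> Tuple[Optional[str], float]:
    """Report (CUDA version, VRAM in decimal GB) via PyTorch, if it is installed."""
    try:
        import torch  # assumption: the runtime is PyTorch-based
    except ImportError:
        return None, 0.0
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return None, 0.0
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return torch.version.cuda, total_bytes / 1e9


if __name__ == "__main__":
    cuda, vram = detect_gpu()
    issues = meets_requirements(sys.version_info[:2], cuda, vram)
    print("OK" if not issues else "\n".join(issues))
```

The check uses decimal gigabytes because a nominal 40 GB card reports slightly less than 40 GiB of usable memory.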
What This Means
Lance represents ByteDance's entry into unified multimodal AI, directly competing with models like DeepSeek-Janus, Show-o, and OmniGen. At 3B parameters, it's significantly smaller than most unified models while maintaining competitive performance on standard benchmarks. The efficiency suggests progress in model architecture design, though the lack of disclosed training details makes it difficult to assess reproducibility or training costs beyond the stated 128-GPU budget. The model's commercial viability will depend on pricing, which ByteDance has not yet announced.