Baidu releases ERNIE-Image, an 8B parameter text-to-image model with strong text rendering capabilities
Baidu has released ERNIE-Image, an 8B parameter text-to-image generation model that claims state-of-the-art performance among open-weight models in its size class. Built on a single-stream Diffusion Transformer (DiT) architecture, the model targets complex instruction following, text rendering, and structured image generation, and includes a lightweight Prompt Enhancer component that expands user inputs into structured descriptions. According to Baidu, it can run on consumer GPUs with 24GB VRAM.
Technical specifications
ERNIE-Image uses 8 billion DiT parameters and can generate images at multiple resolutions, including 1024×1024, 848×1264, and 1264×848 pixels. The base model uses 50 inference steps with a guidance scale of 4.0. According to Baidu, the model can run on consumer GPUs with 24GB VRAM.
Baidu has also released ERNIE-Image-Turbo, a faster variant optimized with Distribution Matching Distillation (DMD) and reinforcement learning that generates images in 8 inference steps.
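The documented settings above can be collected into a small helper for building generation arguments. This is an illustrative sketch, not part of Baidu's API: the `generation_kwargs` function and the aspect-ratio names are invented here, and the article does not state a guidance scale for the Turbo variant.

```python
# Generation settings for ERNIE-Image as stated in the article.
# The helper below is a hypothetical convenience, not Baidu's API.

SUPPORTED_RESOLUTIONS = {
    "square":    (1024, 1024),
    "portrait":  (848, 1264),
    "landscape": (1264, 848),
}
BASE_STEPS = 50       # base model
BASE_GUIDANCE = 4.0   # base model
TURBO_STEPS = 8       # ERNIE-Image-Turbo (DMD-distilled)

def generation_kwargs(aspect: str, turbo: bool = False) -> dict:
    """Build keyword arguments for a Diffusers-style pipeline call."""
    width, height = SUPPORTED_RESOLUTIONS[aspect]
    kwargs = {"width": width, "height": height}
    if turbo:
        # Guidance scale for the Turbo variant is not documented here.
        kwargs["num_inference_steps"] = TURBO_STEPS
    else:
        kwargs["num_inference_steps"] = BASE_STEPS
        kwargs["guidance_scale"] = BASE_GUIDANCE
    return kwargs
```

The helper makes the base/Turbo trade-off explicit: same resolutions, but 50 steps versus 8.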
Benchmark performance
On the GenEval benchmark, ERNIE-Image with Prompt Enhancer scored 0.8728 overall, outperforming FLUX.2-klein-9B (0.8481) and Z-Image (0.8400). The model scored particularly well on single-object generation (0.9906) and two-object generation (0.9596).
For text rendering specifically, ERNIE-Image achieved an average score of 0.9733 on LongTextBench across English and Chinese, trailing only Seedream 4.5 (0.9882) but ahead of GLM-Image (0.9656) and Nano Banana 2.0 (0.9650).
On the OneIG-EN benchmark measuring alignment, text, reasoning, style, and diversity, ERNIE-Image with Prompt Enhancer scored 0.5750 overall, ranking third behind Nano Banana 2.0 (0.5780) and Seedream 4.5 (0.5760).
Intended use cases
Baidu positions ERNIE-Image for commercial applications requiring precise control over generated content, including posters, comics, multi-panel layouts, infographics, and UI mockups. The model supports multiple visual styles including realistic photography, design-oriented imagery, and stylized aesthetic outputs.
The model is available on Hugging Face with both Diffusers and SGLang inference support. Baidu has not disclosed pricing for commercial API access.
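Since the model ships with Diffusers support, inference presumably follows the standard pipeline pattern. The sketch below is a guess at that usage: the repo ID, the generic `DiffusionPipeline` loader, and the dtype choice are all assumptions, so consult the Hugging Face model card for the exact class and settings.

```python
# Hypothetical sketch of ERNIE-Image inference via Hugging Face Diffusers.
# The repo ID ("baidu/ERNIE-Image") and the generic DiffusionPipeline loader
# are assumptions; the step count, guidance scale, and resolution follow the
# base-model settings reported above.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "baidu/ERNIE-Image",         # assumed repo ID
    torch_dtype=torch.bfloat16,  # half-precision; Baidu cites 24GB VRAM as sufficient
).to("cuda")

image = pipe(
    prompt="A minimalist poster that reads 'GRAND OPENING' in bold red type",
    num_inference_steps=50,      # base-model setting
    guidance_scale=4.0,
    width=1024,
    height=1024,
).images[0]
image.save("ernie_image_sample.png")
```

For the Turbo variant, the step count would drop to 8, per the distillation setup described above.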
What this means
ERNIE-Image represents a strategic release from Baidu targeting practical commercial applications rather than purely aesthetic generation. The 8B parameter count makes it computationally accessible, while the benchmark scores suggest performance competitive with larger models. The emphasis on text rendering and instruction following addresses specific pain points in text-to-image generation, where models often struggle with accurate text and complex layouts. The availability of a Turbo variant with 8-step inference indicates Baidu's focus on deployment efficiency alongside quality.
Related Articles
Baidu releases ERNIE-Image-Turbo, a distilled text-to-image model generating in 8 inference steps
Baidu has released ERNIE-Image-Turbo, a distilled text-to-image diffusion transformer that generates images in 8 inference steps. The model runs on consumer GPUs with 24GB VRAM and supports resolutions up to 1376×768, with claimed strengths in text rendering and structured generation tasks.
MiniMax releases M2.7, a 229B parameter model with self-evolving capabilities and agent teams
MiniMax has released MiniMax-M2.7, a 229-billion parameter model that uniquely participates in its own evolution during development. The model achieves 66.6% medal rate on MLE Bench Lite and 56.22% on SWE-Pro benchmarks, with native support for multi-agent collaboration and complex tool orchestration.
Google releases Gemma 4, open-source on-device AI with agentic tool use for phones
Google released Gemma 4, an open-source multimodal model that runs entirely on smartphones without sending data to the cloud. The E2B and E4B variants require just 6GB and 8GB of RAM respectively and can autonomously use tools like Wikipedia, maps, and QR code generators through built-in agent skills. The model is available free via the Google AI Edge Gallery app for Android and iOS.
Liquid AI releases LFM2.5-VL-450M, improved 450M-parameter vision-language model with multilingual support
Liquid AI has released LFM2.5-VL-450M, a refreshed 450M-parameter vision-language model built on an updated LFM2.5-350M backbone. The model features a 32,768-token context window, supports 9 languages, handles native 512×512 pixel images, and adds bounding box prediction and function calling capabilities. Performance improvements span both vision and language benchmarks compared to its predecessor.