Baidu releases ERNIE-Image, an 8B parameter text-to-image model with strong text rendering capabilities

TL;DR

Baidu has released ERNIE-Image, an 8B parameter text-to-image generation model built on a single-stream Diffusion Transformer architecture. The model is designed for complex instruction following, text rendering, and structured image generation, and can run on consumer GPUs with 24GB VRAM.

Baidu has released ERNIE-Image, an 8B parameter text-to-image generation model that the company says delivers state-of-the-art performance among open-weight models in its size class. The model is built on a single-stream Diffusion Transformer (DiT) architecture and includes a lightweight Prompt Enhancer component that expands user inputs into structured descriptions.

Technical specifications

ERNIE-Image has 8 billion DiT parameters and can generate images at multiple resolutions, including 1024x1024, 848x1264, and 1264x848 pixels. The base version uses 50 inference steps with a guidance scale of 4.0. According to Baidu, the model can run on consumer GPUs with 24GB VRAM.
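
As a rough sanity check on the 24GB figure, the sketch below estimates how much memory the DiT weights alone would occupy at different numeric precisions. The precisions listed are assumptions for illustration; the article does not state which one Baidu uses, and the estimate ignores the text encoder, the VAE, the Prompt Enhancer, and activation memory.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights only, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# From the article: 8 billion DiT parameters.
DIT_PARAMS = 8e9

# Assumed precisions, for illustration only (not confirmed by Baidu).
PRECISIONS = [("fp32", 4), ("bf16/fp16", 2), ("int8", 1)]

for name, nbytes in PRECISIONS:
    gib = weight_memory_gib(DIT_PARAMS, nbytes)
    print(f"{name:>9}: {gib:5.1f} GiB for DiT weights alone")
```

Under these assumptions, 16-bit weights come to roughly 15 GiB, which leaves headroom on a 24GB card, while 32-bit weights would not fit.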

Baidu has also released ERNIE-Image-Turbo, a faster variant optimized with Distribution Matching Distillation (DMD) and reinforcement learning that generates images in 8 inference steps.
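
To put the step counts in perspective, the following sketch compares the number of denoising steps and, under a stated assumption, the number of DiT forward passes per image. The assumption that the base model runs classifier-free guidance with two passes per step and that the distilled Turbo variant runs one pass per step is ours, not Baidu's; the article only gives the step counts.

```python
BASE_STEPS = 50   # from the article: base version
TURBO_STEPS = 8   # from the article: ERNIE-Image-Turbo


def forward_passes(steps: int, uses_cfg: bool) -> int:
    """DiT forward passes per image.

    Assumption: classifier-free guidance costs two passes per step
    (conditional + unconditional); without it, one pass per step.
    """
    return steps * (2 if uses_cfg else 1)


step_ratio = BASE_STEPS / TURBO_STEPS
print(f"Turbo uses {step_ratio:.2f}x fewer denoising steps")

# Hypothetical configuration: base with CFG, Turbo without.
base_passes = forward_passes(BASE_STEPS, uses_cfg=True)
turbo_passes = forward_passes(TURBO_STEPS, uses_cfg=False)
print(f"Forward passes (assumed): base={base_passes}, turbo={turbo_passes}")
```

The step ratio follows directly from the published numbers; the forward-pass comparison holds only if the guidance assumption does.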

Benchmark performance

On the GenEval benchmark, ERNIE-Image with Prompt Enhancer scored 0.8728 overall, outperforming FLUX.2-klein-9B (0.8481) and Z-Image (0.8400). The model scored particularly well on single object generation (0.9906) and two object generation (0.9596).

For text rendering specifically, ERNIE-Image achieved an average score of 0.9733 on LongTextBench across English and Chinese, trailing only Seedream 4.5 (0.9882) but ahead of GLM-Image (0.9656) and Nano Banana 2.0 (0.9650).

On the OneIG-EN benchmark measuring alignment, text, reasoning, style, and diversity, ERNIE-Image with Prompt Enhancer scored 0.5750 overall, ranking third behind Nano Banana 2.0 (0.5780) and Seedream 4.5 (0.5760).
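
The margins in these results are small, and a short script makes them easier to compare. The sketch below uses only the scores quoted in this article; the dictionary layout and helper names are illustrative, and the lists cover only the models named here, not the full leaderboards.

```python
# Scores as quoted in the article (higher is better).
SCORES = {
    "GenEval": {
        "ERNIE-Image + Prompt Enhancer": 0.8728,
        "FLUX.2-klein-9B": 0.8481,
        "Z-Image": 0.8400,
    },
    "LongTextBench": {
        "Seedream 4.5": 0.9882,
        "ERNIE-Image": 0.9733,
        "GLM-Image": 0.9656,
        "Nano Banana 2.0": 0.9650,
    },
    "OneIG-EN": {
        "Nano Banana 2.0": 0.5780,
        "Seedream 4.5": 0.5760,
        "ERNIE-Image + Prompt Enhancer": 0.5750,
    },
}


def ranked(scores: dict) -> list:
    """Return (model, score) pairs sorted from best to worst."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def gap_to_leader(scores: dict, model: str) -> float:
    """Absolute score difference between the leader and `model`."""
    return round(max(scores.values()) - scores[model], 4)


for bench, scores in SCORES.items():
    ernie = next(name for name in scores if name.startswith("ERNIE-Image"))
    order = [name for name, _ in ranked(scores)]
    print(f"{bench}: rank {order.index(ernie) + 1} of {len(order)} listed, "
          f"gap to leader {gap_to_leader(scores, ernie):.4f}")
```

On these numbers, ERNIE-Image leads the listed GenEval entries and sits within a few thousandths of the leader on OneIG-EN.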

Intended use cases

Baidu positions ERNIE-Image for commercial applications requiring precise control over generated content, including posters, comics, multi-panel layouts, infographics, and UI mockups. The model supports multiple visual styles including realistic photography, design-oriented imagery, and stylized aesthetic outputs.

The model is available on Hugging Face with both Diffusers and SGLang inference support. Baidu has not disclosed pricing for commercial API access.

What this means

ERNIE-Image represents a strategic release from Baidu targeting practical commercial applications rather than purely aesthetic generation. The 8B parameter count makes it computationally accessible, while the benchmark scores suggest it is competitive with larger models. The emphasis on text rendering and instruction following addresses specific pain points in text-to-image generation, where models often struggle with accurate text and complex layouts. The availability of a Turbo variant with 8-step inference indicates Baidu's focus on deployment efficiency alongside quality.
