Baidu releases ERNIE-Image-Turbo, a distilled text-to-image model generating in 8 inference steps

TL;DR

Baidu has released ERNIE-Image-Turbo, a distilled text-to-image diffusion transformer that generates images in 8 inference steps. The model runs on consumer GPUs with 24GB VRAM and supports seven output resolutions of roughly one megapixel each, from 1024×1024 to 1376×768, with claimed strengths in text rendering and structured generation tasks.

Baidu has released ERNIE-Image-Turbo, a distilled version of its ERNIE-Image text-to-image model that generates images in 8 inference steps. The model is built on a single-stream Diffusion Transformer (DiT) architecture and runs on consumer GPUs with 24GB VRAM.

Technical specifications

ERNIE-Image-Turbo supports multiple resolutions: 1024×1024, 848×1264, 1264×848, 768×1376, 896×1200, 1376×768, and 1200×896. The model uses a guidance scale of 1.0 and operates with bfloat16 precision. Pricing has not been disclosed.
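
To make the resolution list concrete, here is a minimal sketch of how a caller might snap a requested aspect ratio to the nearest supported size. The helper name `closest_resolution` is hypothetical, and reading each pair as width×height is an assumption, since the source does not state the order.

```python
import math

# Resolutions listed for ERNIE-Image-Turbo, read here as (width, height).
# The width-first order is an assumption; the source does not specify it.
SUPPORTED_RESOLUTIONS = [
    (1024, 1024),
    (848, 1264),
    (1264, 848),
    (768, 1376),
    (896, 1200),
    (1376, 768),
    (1200, 896),
]


def closest_resolution(aspect_ratio: float) -> tuple[int, int]:
    """Return the supported (width, height) whose ratio is nearest to aspect_ratio.

    Distance is measured in log space so that 2:1 and 1:2 are treated symmetrically.
    """
    if aspect_ratio <= 0:
        raise ValueError("aspect_ratio must be positive")
    target = math.log(aspect_ratio)
    return min(
        SUPPORTED_RESOLUTIONS,
        key=lambda wh: abs(math.log(wh[0] / wh[1]) - target),
    )


if __name__ == "__main__":
    for name, ratio in [("square", 1.0), ("16:9", 16 / 9), ("9:16", 9 / 16), ("4:3", 4 / 3)]:
        width, height = closest_resolution(ratio)
        print(f"{name}: {width}x{height} ({width * height / 1e6:.2f} MP)")
```

Every listed size works out to between 1.0 and 1.1 megapixels, so the choice of aspect ratio changes the shape of the image rather than its pixel budget.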

The distillation process used Distribution Matching Distillation (DMD) and reinforcement learning to reduce the 50-step inference requirement of the base ERNIE-Image model to 8 steps while maintaining generation quality, according to Baidu.
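
As a rough sense of scale, the step counts above translate into the following reduction in denoising passes. This is back-of-the-envelope arithmetic, not a measured speedup. It assumes each step costs one forward pass of the same network. The `uses_cfg` flag, which doubles the passes when classifier-free guidance is active, is a hypothetical knob, since the source does not say how the base model is sampled.

```python
def forward_passes(steps: int, uses_cfg: bool = False) -> int:
    """Count network forward passes for one image.

    Assumes one pass per step, doubled when classifier-free guidance
    runs a conditional and an unconditional branch (an assumption here).
    """
    return steps * (2 if uses_cfg else 1)


BASE_STEPS = 50   # base ERNIE-Image, per the article
TURBO_STEPS = 8   # ERNIE-Image-Turbo, per the article

turbo = forward_passes(TURBO_STEPS)
base_no_cfg = forward_passes(BASE_STEPS)
base_with_cfg = forward_passes(BASE_STEPS, uses_cfg=True)

print(f"step reduction: {BASE_STEPS / TURBO_STEPS:.2f}x")              # 6.25x
print(f"passes if base skips CFG: {base_no_cfg / turbo:.2f}x fewer")    # 6.25x
print(f"passes if base uses CFG:  {base_with_cfg / turbo:.2f}x fewer")  # 12.50x
```

Actual latency will also depend on work outside the denoising loop, such as text encoding and image decoding, so the real speedup is likely smaller than these ratios.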

Benchmark performance

On GenEval, ERNIE-Image-Turbo with prompt enhancement scored 0.8510 overall, compared to 0.8728 for the base ERNIE-Image model and 0.8481 for FLUX.2-klein-9B. Its per-category results included 0.9938 on single-object prompts and 0.8375 on counting.

On LongTextBench, which measures text rendering, ERNIE-Image-Turbo averaged 0.9655 across the English and Chinese subsets, trailing Seedream 4.5 (0.9882) and the base ERNIE-Image model (0.9733) but outperforming FLUX.2-klein-9B (0.5413).

On the OneIG-EN benchmark measuring alignment, text, reasoning, style, and diversity, ERNIE-Image-Turbo scored 0.5656 overall. Nano Banana 2.0 led with 0.5780, while the base ERNIE-Image achieved 0.5750.
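
To put the three comparisons with the base model on a common footing, the short calculation below expresses each Turbo score as a fraction of the corresponding base score. It uses only the overall numbers quoted above. The `retention` helper is illustrative, and ratios across different benchmarks are not strictly comparable because each benchmark has its own scale.

```python
# Overall scores quoted in the article: (ERNIE-Image-Turbo, base ERNIE-Image).
SCORES = {
    "GenEval": (0.8510, 0.8728),
    "LongTextBench": (0.9655, 0.9733),
    "OneIG-EN": (0.5656, 0.5750),
}


def retention(turbo: float, base: float) -> float:
    """Fraction of the base model's score that the distilled model keeps."""
    return turbo / base


for name, (turbo, base) in SCORES.items():
    kept = retention(turbo, base)
    print(f"{name}: keeps {kept:.1%} of base score (gap {base - turbo:.4f})")
```

By this measure the distilled model keeps roughly 97.5% of the base score on GenEval, 99.2% on LongTextBench, and 98.4% on OneIG-EN.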

Implementation details

The model is available through Hugging Face's diffusers library and SGLang for deployment. Baidu states the model is designed for "posters, comics, multi-panel layouts, and other content creation tasks" requiring text rendering and structured generation.
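
For readers who want to try the diffusers route, the following is a minimal sketch rather than Baidu's documented usage. The repository id `baidu/ERNIE-Image-Turbo` is an assumption, as is the idea that the pipeline's call accepts the usual text-to-image arguments (`width`, `height`, `num_inference_steps`, `guidance_scale`). Check the model card for the exact identifiers. The 8 steps, guidance scale of 1.0, and bfloat16 precision come from the specifications above.

```python
from typing import Any

# Hypothetical repository id; check the actual model card before use.
MODEL_ID = "baidu/ERNIE-Image-Turbo"


def build_call_kwargs(prompt: str, width: int = 1024, height: int = 1024) -> dict[str, Any]:
    """Collect the sampling settings reported in the article."""
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_inference_steps": 8,  # distilled step count
        "guidance_scale": 1.0,     # value reported for the Turbo model
    }


def generate(prompt: str, width: int = 1024, height: int = 1024):
    """Load the pipeline and render one image (needs a CUDA GPU, torch, diffusers)."""
    # Imported lazily so the module can be inspected without a GPU stack installed.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    pipe = pipe.to("cuda")
    # Assumes the pipeline's __call__ accepts these common text-to-image arguments.
    result = pipe(**build_call_kwargs(prompt, width, height))
    return result.images[0]
```

Assuming the pipeline returns PIL images, as diffusers image pipelines do by default, a call such as `generate("A two-panel comic with a caption").save("out.png")` would write the result to disk.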

Two versions are available: ERNIE-Image-Turbo with and without prompt enhancement (PE). The PE version scores higher on most benchmark metrics.

What this means

ERNIE-Image-Turbo is Baidu's entry into fast text-to-image generation, prioritizing deployment efficiency over maximum quality. The 8-step generation and 24GB VRAM requirement make it accessible on consumer hardware, though the benchmark scores show a modest cost relative to the base model: 0.8510 versus 0.8728 on GenEval, 0.9655 versus 0.9733 on LongTextBench, and 0.5656 versus 0.5750 on OneIG-EN. The focus on text rendering and structured layouts positions it for specific use cases like poster and comic generation rather than general-purpose image synthesis. Whether the speed gains justify the quality reduction will depend on application requirements.
