model release

Baidu releases ERNIE-Image-Turbo, a distilled text-to-image model generating in 8 inference steps

TL;DR

Baidu has released ERNIE-Image-Turbo, a distilled text-to-image diffusion transformer that generates images in 8 inference steps. The model runs on consumer GPUs with 24GB VRAM and supports seven resolutions, from 1024×1024 square to 1376×768 widescreen, with claimed strengths in text rendering and structured generation tasks.

April 15, 2026 · 3:21 AM

Baidu has released ERNIE-Image-Turbo, a distilled version of its ERNIE-Image text-to-image model that generates images in 8 inference steps. The model is built on a single-stream Diffusion Transformer (DiT) architecture and runs on consumer GPUs with 24GB VRAM.

Technical specifications

ERNIE-Image-Turbo supports multiple resolutions: 1024×1024, 848×1264, 1264×848, 768×1376, 896×1200, 1376×768, and 1200×896. The model uses a guidance scale of 1.0 and operates with bfloat16 precision. Pricing has not been disclosed.
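Because the model accepts only a fixed set of sizes, a caller with an arbitrary target size has to map it to one of them. The sketch below is one way to do that, picking the supported size with the nearest aspect ratio. The helper name `closest_resolution` is hypothetical, and reading each listed pair as width×height is an assumption the article does not confirm.

```python
# Settings quoted in the article. The helper below is a hypothetical convenience,
# not part of any Baidu or diffusers API.
SUPPORTED_RESOLUTIONS = [
    (1024, 1024), (848, 1264), (1264, 848), (768, 1376),
    (896, 1200), (1376, 768), (1200, 896),
]  # assumed to be (width, height) pairs
GUIDANCE_SCALE = 1.0
DTYPE_NAME = "bfloat16"


def closest_resolution(target_width: int, target_height: int) -> tuple[int, int]:
    """Return the supported (width, height) whose aspect ratio is nearest the target."""
    target_ratio = target_width / target_height
    return min(
        SUPPORTED_RESOLUTIONS,
        key=lambda wh: abs(wh[0] / wh[1] - target_ratio),
    )


print(closest_resolution(1920, 1080))  # 16:9 request maps to (1376, 768)
```

Comparing ratios rather than pixel counts keeps the composition close to what was requested; the output can then be resized to the exact target if needed.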

The distillation process used Distribution Matching Distillation (DMD) and reinforcement learning to reduce the 50-step inference requirement of the base ERNIE-Image model to 8 steps while maintaining generation quality, according to Baidu.
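As a quick check on what the step reduction means, the arithmetic can be written out as below. This counts denoising steps only. Actual wall-clock speedup is not reported in the article and would also depend on factors such as per-step cost, text encoding, and image decoding, so the ratio is best read as a rough upper-level indicator rather than a measured result.

```python
BASE_STEPS = 50   # base ERNIE-Image, per the article
TURBO_STEPS = 8   # ERNIE-Image-Turbo, per the article


def step_reduction(base_steps: int, distilled_steps: int) -> float:
    """How many times fewer denoising steps the distilled model runs."""
    return base_steps / distilled_steps


print(step_reduction(BASE_STEPS, TURBO_STEPS))  # 6.25
```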

Benchmark performance

On GENEval, ERNIE-Image-Turbo with prompt enhancement scored 0.8510 overall, compared to 0.8728 for the base ERNIE-Image model and 0.8481 for FLUX.2-klein-9B. The model achieved 0.9938 on single object detection and 0.8375 on counting tasks.

For text rendering measured on LongTextBench, ERNIE-Image-Turbo scored 0.9655 average across English and Chinese benchmarks, trailing Seedream 4.5 (0.9882) and the base ERNIE-Image model (0.9733) but outperforming FLUX.2-klein-9B (0.5413).

On the OneIG-EN benchmark measuring alignment, text, reasoning, style, and diversity, ERNIE-Image-Turbo scored 0.5656 overall. Nano Banana 2.0 led with 0.5780, while the base ERNIE-Image achieved 0.5750.
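To put the three overall scores on a common footing, the short sketch below expresses each Turbo score as a percentage of the base model's score. It uses only the figures quoted above and assumes the Turbo and base numbers for each benchmark were measured under comparable settings, which the article does not state explicitly.

```python
# Overall scores quoted in the article: (ERNIE-Image-Turbo, base ERNIE-Image)
SCORES = {
    "GENEval": (0.8510, 0.8728),
    "LongTextBench": (0.9655, 0.9733),
    "OneIG-EN": (0.5656, 0.5750),
}


def retained_percent(turbo: float, base: float) -> float:
    """Turbo score as a percentage of the base score, rounded to one decimal."""
    return round(100 * turbo / base, 1)


RETAINED = {name: retained_percent(t, b) for name, (t, b) in SCORES.items()}

for name, pct in RETAINED.items():
    print(f"{name}: {pct}% of base score")
```

On these figures the distilled model keeps roughly 97.5% to 99.2% of the base model's overall score, with the smallest gap on text rendering.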

Implementation details

The model is available through Hugging Face's diffusers library and SGLang for deployment. Baidu states the model is designed for "posters, comics, multi-panel layouts, and other content creation tasks" requiring text rendering and structured generation.

Two versions are available: ERNIE-Image-Turbo with and without prompt enhancement (PE). The PE version scores higher on most benchmark metrics.
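A minimal sketch of the diffusers route might look like the following. Several things here are assumptions rather than facts from the article: the repository id `baidu/ERNIE-Image-Turbo`, the use of the generic `DiffusionPipeline` loader, and the idea that the pipeline accepts the common `width`, `height`, `num_inference_steps`, and `guidance_scale` arguments. The model card should be treated as the authority on the actual loading code.

```python
def generation_kwargs(prompt: str, width: int = 1024, height: int = 1024) -> dict:
    """Collect call arguments that mirror the settings reported in the article."""
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_inference_steps": 8,  # distilled step count from the article
        "guidance_scale": 1.0,     # guidance scale from the article
    }


def generate(prompt: str, repo_id: str = "baidu/ERNIE-Image-Turbo"):
    """Load the pipeline and render one image. Needs a CUDA GPU and a model download."""
    # Imported lazily so generation_kwargs works without torch or diffusers installed.
    import torch
    from diffusers import DiffusionPipeline

    # repo_id is an assumed Hugging Face id, not confirmed by the article.
    pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
    pipe = pipe.to("cuda")
    # Assumes the pipeline's __call__ accepts these common diffusers arguments.
    return pipe(**generation_kwargs(prompt)).images[0]
```

Calling `generate("a four-panel comic about a robot learning to paint").save("comic.png")` would then write the result to disk, provided the assumptions above hold for the released pipeline.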

What this means

ERNIE-Image-Turbo represents Baidu's entry into fast text-to-image generation, prioritizing deployment efficiency over maximum quality. The 8-step generation and 24GB VRAM requirement make it accessible for consumer hardware, though it trails the base model on all three reported benchmarks (for example, 0.8510 versus 0.8728 on GENEval). The focus on text rendering and structured layouts positions it for specific use cases like poster and comic generation rather than general-purpose image synthesis. Whether the speed gains justify the quality reduction will depend on application requirements.

Source: huggingface.co
