Ideogram 4: 9.3B parameter open-weight text-to-image model with native 2K resolution and structured JSON prompting
Ideogram has released Ideogram 4, its first open-weight text-to-image model with 9.3 billion parameters. The model supports native 2K resolution, structured JSON prompting with bounding-box layout controls, and is available in nf4 and fp8 quantizations under a non-commercial license.
Ideogram 4 ships in two quantizations: nf4 (CUDA-only) and fp8 (all hardware). Both are released under the Ideogram 4 Non-Commercial license.
Technical specifications
Ideogram 4 is built on a fully single-stream Diffusion Transformer (DiT) architecture with 34 layers. It was trained from scratch rather than fine-tuned from an existing model. Its text encoder is Qwen3-VL-8B-Instruct, a vision-language model, in place of a traditional text-only encoder such as CLIP or T5. Text and image tokens are concatenated into a single sequence and processed by the same transformer, so the two modalities can attend to each other at every layer.
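To make the single-stream idea concrete, here is a toy sketch in plain Python. It is not Ideogram's implementation: it uses one attention head, no learned projections, and tiny made-up token vectors. The function name and inputs are assumptions for illustration. The point is only that text and image tokens sit in one sequence, so every token attends to every other token regardless of modality.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def joint_self_attention(text_tokens, image_tokens):
    """Toy single-head self-attention over one concatenated sequence.

    Hypothetical helper: queries, keys and values are the raw token
    vectors themselves (no learned weights), purely for illustration.
    """
    sequence = text_tokens + image_tokens  # single stream: one shared sequence
    dim = len(sequence[0])
    outputs = []
    for query in sequence:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * value[j] for w, value in zip(weights, sequence))
            for j in range(dim)
        ]
        outputs.append(mixed)
    return outputs

# Two made-up text tokens and three made-up image tokens, 4 dimensions each.
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.0, 0.0]]
mixed_tokens = joint_self_attention(text, image)
```

The output has one vector per input token, text and image alike. A dual-stream design would instead keep separate weights or separate sequences for the two modalities for some or all layers.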
The model supports flexible resolutions from 256 to 2048 pixels (in multiples of 16) with aspect ratios up to 6:1, and automatically adjusts the noise schedule for each resolution. Both quantizations contain the same 9.3B-parameter model. The nf4 version requires CUDA and supports Diffusers integration.
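The resolution limits can be expressed as a small check. This is a sketch under two stated assumptions: that the 256 to 2048 range applies to each side independently, and that "up to 6:1" means the long side divided by the short side is at most 6. The function names are hypothetical and not part of any Ideogram tooling.

```python
def is_supported_resolution(width, height, lo=256, hi=2048, step=16, max_ratio=6.0):
    """Return True if (width, height) fits the limits quoted in the article.

    Assumption: limits apply per side; ratio is long side over short side.
    """
    for side in (width, height):
        if side < lo or side > hi or side % step != 0:
            return False
    return max(width, height) / min(width, height) <= max_ratio

def snap_down(side, lo=256, hi=2048, step=16):
    """Round a side length down to the nearest multiple of `step`, clamped to [lo, hi]."""
    snapped = (side // step) * step
    return max(lo, min(hi, snapped))

checks = {
    (2048, 2048): is_supported_resolution(2048, 2048),  # square at the 2K limit
    (1536, 256): is_supported_resolution(1536, 256),    # exactly 6:1
    (2048, 256): is_supported_resolution(2048, 256),    # 8:1, too wide
    (1000, 1000): is_supported_resolution(1000, 1000),  # not a multiple of 16
}
```

Under these assumptions, a request for 1000 by 1000 pixels would need to be adjusted first, for example with `snap_down(1000)`, which gives 992.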
Benchmark performance
According to Ideogram, the model ranks as the top open-weight model on Design Arena, a third-party image Elo leaderboard focused on design-oriented generation. On the overall Design Arena leaderboard, Ideogram 4 trails only proprietary GPT and Gemini models.
In a blind typography evaluation by ContraLabs, in which ten professional designers from Contra judged the outputs, Ideogram 4 achieved a 47.9% first-place win rate, ahead of Gemini 3.1 Flash Image Preview (30.0%), FLUX.2 [max] (15.5%), and Grok Imagine 1.0 (15.0%). The same designers rated it 3.55/5 for practical usability in client work.
On LMArena's general-purpose text-to-image leaderboard, Ideogram ranks as a top-5 lab overall and the highest-ranked open-weight lab. In Ideogram's internal human-preference benchmark focused on graphic design and photography, the model scored second overall behind GPT Image 2 medium.
On standard open-source benchmarks, Ideogram claims best-in-class layout control for Ideogram 4 (7Bench), outperforming all closed-source models tested. For text rendering (X-Omni OCR), it reportedly exceeds larger models including Qwen-Image (20B), FLUX.2 [dev] (32B), and HunyuanImage 3.0 (80B MoE).
Key capabilities
The model introduces structured JSON prompting, allowing explicit control over composition, style, lighting, color palettes, and spatial layout through bounding-box coordinates. It supports multilingual text rendering with what Ideogram claims is state-of-the-art in-image text generation for signage, logos, and multi-line text.
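A structured prompt of this kind might look roughly like the sketch below. The schema is hypothetical: the field names (`composition`, `elements`, `bbox`, and so on) and the normalized `[x0, y0, x1, y1]` box convention are assumptions made for illustration, not Ideogram's documented format. Consult the model card for the real schema.

```python
import json

# Hypothetical schema: every field name here is illustrative only.
prompt = {
    "composition": "centered poster layout",
    "style": "flat vector illustration",
    "lighting": "soft, even",
    "color_palette": ["#0B3D91", "#F2A900", "#FFFFFF"],
    "elements": [
        {"type": "text", "content": "SUMMER SALE", "bbox": [0.10, 0.08, 0.90, 0.30]},
        {"type": "image", "content": "a beach umbrella", "bbox": [0.25, 0.35, 0.75, 0.90]},
    ],
}

def bbox_is_valid(bbox):
    """Check an assumed normalized [x0, y0, x1, y1] box with a top-left origin."""
    x0, y0, x1, y1 = bbox
    return 0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0

def build_payload(structured_prompt):
    """Validate every bounding box, then serialize the prompt to a JSON string."""
    for element in structured_prompt["elements"]:
        if not bbox_is_valid(element["bbox"]):
            raise ValueError(f"bad bounding box: {element['bbox']}")
    return json.dumps(structured_prompt, indent=2)

payload = build_payload(prompt)
```

Compared with a free-text prompt, this form states where each element goes instead of leaving placement for the model to infer from prose.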
Inference requires accepting a license gate on Hugging Face and authentication via an access token. The model uses dual-branch classifier-free guidance, enabling independent refinement of conditional (positive) and unconditional (negative) branches. Safety screening is performed via Hive's text and visual moderation APIs.
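For readers unfamiliar with classifier-free guidance, the standard formulation combines two model predictions per denoising step: one conditioned on the prompt and one on an empty or negative prompt. The sketch below shows only that textbook combination. How Ideogram's dual-branch variant refines each branch is not described here, so this should not be read as its implementation. The function name and sample values are assumptions.

```python
def cfg_combine(cond_pred, uncond_pred, guidance_scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    larger values push further in the conditional direction.
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_pred, uncond_pred)
    ]

# Made-up two-element predictions, standing in for full noise tensors.
cond = [1.0, 2.0]
uncond = [0.0, 1.0]
guided = cfg_combine(cond, uncond, guidance_scale=3.0)
```

In practice the two predictions are full latent tensors, and the negative branch is where a negative prompt takes effect.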
What this means
Ideogram 4 represents a significant open-weight release in the text-to-image space, particularly for design-focused applications. The structured JSON prompting and bounding-box controls address a key limitation in many image models: precise compositional control. At 9.3B parameters, it's considerably smaller than competitors like FLUX.2 [dev] (32B) while claiming superior performance on design-specific benchmarks. However, the non-commercial license limits its use cases compared to fully open models. The choice to use a vision-language model (Qwen3-VL) as the text encoder rather than standard CLIP or T5 is architecturally notable and may explain its strong performance on visual concept understanding and text rendering.
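As a rough guide to what the parameter counts mean for hardware, the sketch below estimates weight storage alone. It assumes 4 bits per weight for nf4 and 8 bits per weight for fp8, and that the quoted parameter counts cover the weights being stored. It ignores quantization metadata, the text encoder, and activation memory, so real requirements will be higher. The function name is hypothetical.

```python
def estimate_weight_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width, with no
    overhead for scales or other quantization metadata.
    """
    return num_params * bits_per_weight / 8 / 1e9

ideogram_nf4_gb = estimate_weight_gb(9.3e9, 4)   # nf4: 4 bits per weight
ideogram_fp8_gb = estimate_weight_gb(9.3e9, 8)   # fp8: 8 bits per weight
flux2_dev_fp8_gb = estimate_weight_gb(32e9, 8)   # 32B model at the same precision
```

Under these assumptions the 9.3B weights come to about 4.65 GB in nf4 and 9.3 GB in fp8, against 32 GB for a 32B model at 8 bits.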