Microsoft Releases Lens: 3.8B-Parameter Text-to-Image Model Trained on 800M Image Dataset
Microsoft released Lens, a 3.8B-parameter foundational text-to-image model trained on Lens-800M, an 800 million image-text corpus with GPT-4.1 captions. The model uses a 48-block MMDiT denoiser with FLUX.2 latents and supports generation up to 1440×1440 resolution across aspect ratios from 1:2 to 2:1.
Architecture and Training
Lens combines several techniques to reach what Microsoft describes as competitive quality with "substantially less training compute than larger T2I models." The architecture uses:
- 48-block MMDiT (Multimodal Diffusion Transformer) denoiser
- FLUX.2 semantic VAE for latent representations
- Concatenated multi-layer GPT-OSS text features for prompt encoding
- Mixed-resolution training enabling flexible aspect ratios
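The release does not spell out how the multi-layer text features are combined, so the following is only a minimal sketch of one plausible reading: hidden states from several encoder layers are joined along the feature dimension for each token. The function name, the tapped layer indices, and the toy tensor shapes are all hypothetical, and random arrays stand in for real GPT-OSS activations.

```python
import numpy as np

def concat_layer_features(hidden_states, layer_ids):
    """Concatenate per-token features from selected encoder layers.

    hidden_states: list of arrays, one per layer, each (num_tokens, hidden_dim).
    layer_ids: indices of the layers to tap (a hypothetical choice).
    Returns an array of shape (num_tokens, hidden_dim * len(layer_ids)).
    """
    selected = [hidden_states[i] for i in layer_ids]
    # Assumption: features are joined along the last (feature) axis,
    # so the token count stays the same and the width grows.
    return np.concatenate(selected, axis=-1)

# Toy stand-in for a text encoder: 6 layers, 5 tokens, 8 features per token.
rng = np.random.default_rng(0)
hidden_states = [rng.standard_normal((5, 8)) for _ in range(6)]

features = concat_layer_features(hidden_states, layer_ids=[1, 3, 5])
print(features.shape)  # (5, 24)
```

Under this reading, the denoiser would see one wider feature vector per prompt token rather than a longer token sequence. Concatenating along the sequence axis instead is an equally possible design that the release does not rule out.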
The training dataset, Lens-800M, consists of 800 million image-text pairs with dense captions generated by GPT-4.1, which Microsoft describes as "maximizing information density per training batch."
Technical Capabilities
The model supports:
- Resolution range: up to 1440×1440 pixels
- Aspect ratios: 1:2 to 2:1
- Multilingual prompt following via GPT-OSS features
- Mixed-resolution inference across different aspect ratios
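To make the stated limits concrete, here is a small sketch that picks an output size for a requested aspect ratio. It assumes that "up to 1440×1440" means neither side may exceed 1440 pixels and that sides are rounded down to a multiple of 16. Both are illustrative assumptions, not details from the release, and `fit_resolution` is a hypothetical helper rather than part of the published inference code.

```python
def fit_resolution(aspect_w, aspect_h, max_side=1440, multiple=16):
    """Return (width, height) for an aspect ratio with the longer side capped.

    Assumptions (not from the release): the cap applies per side, and
    each side is rounded down to a multiple of `multiple`.
    """
    # Reject ratios outside the supported 1:2 to 2:1 range.
    if aspect_w > 2 * aspect_h or aspect_h > 2 * aspect_w:
        raise ValueError("aspect ratio must lie between 1:2 and 2:1")
    if aspect_w >= aspect_h:
        width = max_side
        height = max_side * aspect_h // aspect_w
    else:
        height = max_side
        width = max_side * aspect_w // aspect_h
    # Round each side down so it never exceeds max_side.
    snap = lambda v: v // multiple * multiple
    return snap(width), snap(height)

for w, h in [(1, 1), (2, 1), (1, 2), (16, 9)]:
    print(f"{w}:{h} ->", fit_resolution(w, h))
```

With these assumptions a 2:1 request maps to 1440×720 and a 1:2 request to 720×1440. If the limit is instead a total pixel budget, the extreme ratios would allow a longer side than this sketch returns.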
Microsoft also released post-trained variants including an RL-tuned version for improved visual quality and artifact suppression, and Lens-Turbo, a distilled variant supporting 4-step generation.
Release Details
The model is available on Hugging Face with minimal inference code for generating images from Lens DiT checkpoints. The training cutoff date was not disclosed. The release did not include pricing information, a parameter count breakdown, or benchmark comparisons against established models such as DALL-E 3, Midjourney, or Stable Diffusion.
The sample gallery demonstrates capabilities across multiple styles including photorealistic scenes (landscapes, wildlife, architecture), artistic styles (oil painting, watercolor), text rendering (signs, typography), and multilingual prompts (French, Chinese references).
Project Team
The project is led by Dong Chen, Fangyun Wei, and Ziyu Wan, with core contributors including Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, and Zhiyang Liang. The full team includes 24 researchers from Microsoft.
What This Means
Lens is Microsoft's entry into the competitive foundational text-to-image model space, with training efficiency positioned as its key differentiator. At 3.8B parameters it is larger than the 2.6B-parameter UNet in Stability AI's SDXL, so the efficiency claim rests on training compute relative to larger T2I models rather than on parameter count alone. The use of GPT-4.1 for caption generation and the FLUX.2 VAE for latents suggests Microsoft is building on existing components rather than developing every part of the pipeline from scratch. The lack of disclosed benchmarks, pricing, or training compute figures makes direct comparison difficult, though the 4-step Turbo variant suggests Microsoft is targeting real-time generation use cases. This release puts Microsoft in direct competition with Stability AI, Midjourney, and its partner OpenAI in the text-to-image foundation model market.