Microsoft Releases Lens: 3.8B-Parameter Text-to-Image Model Trained on 800M Image Dataset

TL;DR

Microsoft released Lens, a 3.8B-parameter foundational text-to-image model trained on Lens-800M, a corpus of 800 million image-text pairs captioned with GPT-4.1. The model uses a 48-block MMDiT denoiser with FLUX.2 latents and supports generation at up to 1440×1440 resolution across aspect ratios from 1:2 to 2:1.

Microsoft released Lens, a 3.8B-parameter foundational text-to-image model trained on Lens-800M, a corpus of 800 million image-text pairs captioned with GPT-4.1. The model uses a 48-block MMDiT denoiser architecture with FLUX.2 latents and supports generation at up to 1440×1440 resolution.

Architecture and Training

Lens combines several techniques to reach what Microsoft describes as competitive quality with "substantially less training compute than larger T2I models." The architecture uses:

  • 48-block MMDiT (Multimodal Diffusion Transformer) denoiser
  • FLUX.2 semantic VAE for latent representations
  • Concatenated multi-layer GPT-OSS text features for prompt encoding
  • Mixed-resolution training enabling flexible aspect ratios
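
As a rough illustration of the "concatenated multi-layer" idea in the list above, the sketch below joins hidden states from several encoder layers along the feature axis, token by token. It is not Microsoft's code: the function name, the toy values, and the choice of layers are all hypothetical, since the release summary does not say which GPT-OSS layers are used or how the result is projected before it reaches the denoiser.

```python
def concat_layer_features(hidden_states, layer_indices):
    """Concatenate per-token features from several encoder layers.

    hidden_states[layer][token] is a list of floats (one feature vector).
    Returns one vector per token whose length is the sum of the selected
    layers' feature sizes. Layer selection here is purely illustrative.
    """
    selected = [hidden_states[i] for i in layer_indices]
    num_tokens = len(selected[0])
    if any(len(layer) != num_tokens for layer in selected):
        raise ValueError("all selected layers must cover the same tokens")
    return [
        [value for layer in selected for value in layer[token]]
        for token in range(num_tokens)
    ]


# Toy example: 3 layers, 2 tokens, 2 features per token (made-up numbers).
hidden = [
    [[0.0, 0.1], [0.2, 0.3]],  # layer 0
    [[1.0, 1.1], [1.2, 1.3]],  # layer 1
    [[2.0, 2.1], [2.2, 2.3]],  # layer 2
]
features = concat_layer_features(hidden, [0, 2])  # hypothetical layer choice
print(len(features), len(features[0]))  # prints "2 4"
```

The point of the technique is that each token's conditioning vector carries information from more than one depth of the text encoder, at the cost of a wider feature dimension.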

The training dataset, Lens-800M, consists of 800 million image-text pairs with dense captions generated by GPT-4.1, which Microsoft describes as "maximizing information density per training batch."

Technical Capabilities

The model supports:

  • Resolution range: up to 1440×1440 pixels
  • Aspect ratios: 1:2 to 2:1
  • Multilingual prompt following via GPT-OSS features
  • Mixed-resolution inference across different aspect ratios
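
The stated limits can be turned into a simple request validator. The sketch below is one possible reading, not the model's actual rule: it assumes "up to 1440×1440" caps each side at 1440 pixels, whereas the release may instead define the limit as a total pixel budget. The names `MAX_SIDE`, `is_supported`, and `largest_size` are hypothetical.

```python
from fractions import Fraction
from math import gcd

MAX_SIDE = 1440              # assumption: the limit applies to each side
MIN_RATIO = Fraction(1, 2)   # 1:2 (width:height), from the article
MAX_RATIO = Fraction(2, 1)   # 2:1, from the article


def is_supported(width, height):
    """Check a requested size against the assumed side and ratio limits."""
    if width <= 0 or height <= 0:
        return False
    if max(width, height) > MAX_SIDE:
        return False
    return MIN_RATIO <= Fraction(width, height) <= MAX_RATIO


def largest_size(ratio_w, ratio_h):
    """Largest (width, height) with exactly this ratio under MAX_SIDE."""
    if ratio_w <= 0 or ratio_h <= 0:
        raise ValueError("ratio terms must be positive")
    if not MIN_RATIO <= Fraction(ratio_w, ratio_h) <= MAX_RATIO:
        raise ValueError("aspect ratio outside 1:2 to 2:1")
    divisor = gcd(ratio_w, ratio_h)
    a, b = ratio_w // divisor, ratio_h // divisor
    scale = MAX_SIDE // max(a, b)
    if scale == 0:
        raise ValueError("ratio cannot be represented within MAX_SIDE")
    return a * scale, b * scale


print(largest_size(16, 9))      # prints "(1440, 810)"
print(is_supported(1440, 480))  # prints "False" (3:1 is too wide)
```

Exact fractions are used instead of floating-point division so that boundary ratios such as 2:1 compare cleanly.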

Microsoft also released post-trained variants including an RL-tuned version for improved visual quality and artifact suppression, and Lens-Turbo, a distilled variant supporting 4-step generation.

Release Details

The model is available on Hugging Face with minimal inference code for generating images from Lens DiT checkpoints. The training cutoff date was not disclosed, and the release did not include pricing information, a parameter count breakdown, or benchmark comparisons against established models such as DALL-E 3, Midjourney, or Stable Diffusion.

The sample gallery demonstrates capabilities across multiple styles including photorealistic scenes (landscapes, wildlife, architecture), artistic styles (oil painting, watercolor), text rendering (signs, typography), and multilingual prompts (French, Chinese references).

Project Team

The project is led by Dong Chen, Fangyun Wei, and Ziyu Wan, with core contributors including Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, and Zhiyang Liang. The full team includes 24 researchers from Microsoft.

What This Means

Lens is Microsoft's entry into the competitive foundational text-to-image model space, with efficiency positioned as its key differentiator. At 3.8B parameters it is larger than the 2.6B-parameter UNet in Stability AI's SDXL, but Microsoft pitches it against larger T2I models, including proprietary ones. The use of GPT-4.1 for caption generation and the FLUX.2 VAE for latents suggests Microsoft is building on existing components rather than developing every stage from scratch. The lack of disclosed benchmarks, pricing, or training compute figures makes direct comparison difficult, though the 4-step Turbo variant suggests Microsoft is targeting real-time generation use cases. The release puts Microsoft in direct competition with Stability AI, Midjourney, and its own partner OpenAI in the text-to-image foundation model market.
