Microsoft Releases Lens-Turbo: 3.8B-Parameter Text-to-Image Model Trained on 800M GPT-4.1-Captioned Images

TL;DR

Microsoft has released Lens-Turbo, a 3.8B-parameter foundational text-to-image model designed for efficient training and fast generation. The model was trained on Lens-800M, a corpus of 800 million image-text pairs with GPT-4.1-generated captions. It supports resolutions up to 1440×1440, and the distilled variant generates images in 4 steps.

Microsoft has released Lens-Turbo, a 3.8-billion-parameter text-to-image model trained on 800 million GPT-4.1-captioned images. The model uses a 48-block MMDiT (multi-modal diffusion transformer) architecture and supports generation at resolutions up to 1440×1440 pixels.

Technical Architecture

Lens is built from the following components:

  • Training corpus: Lens-800M dataset containing 800 million image-text pairs with long-form GPT-4.1 captions
  • Architecture: 48-block MMDiT denoiser with 3.8B parameters
  • Latent encoding: Uses FLUX.2 semantic VAE for image encoding
  • Text encoding: Concatenated multi-layer GPT-OSS features for prompt following and multilingual support
  • Resolution handling: Mixed-resolution training enables aspect ratios from 1:2 to 2:1
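
The "concatenated multi-layer" text encoding in the list above can be pictured with a small NumPy sketch. It is illustrative only: the layer indices, sequence length, and hidden size are hypothetical, random arrays stand in for real encoder outputs, and the article does not say which GPT-OSS layers are used or how the concatenated features are projected.

```python
import numpy as np

def concat_layer_features(hidden_states, layer_ids):
    """Concatenate per-token features from selected layers along the feature axis.

    hidden_states: array of shape (num_layers, seq_len, hidden_dim)
    layer_ids: indices of the layers to keep (hypothetical choice)
    returns: array of shape (seq_len, len(layer_ids) * hidden_dim)
    """
    selected = [hidden_states[i] for i in layer_ids]
    return np.concatenate(selected, axis=-1)

# Stand-in for text-encoder outputs: 12 layers, 7 tokens, 16-dim features.
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((12, 7, 16))

# Hypothetical selection of an early, a middle, and a late layer.
features = concat_layer_features(hidden_states, layer_ids=[3, 7, 11])
print(features.shape)
```

Each token keeps one row, and its feature vector grows by one hidden-size chunk per selected layer, so the denoiser sees information from several depths of the text encoder at once.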

Inference Speed

The distilled Lens-Turbo variant supports 4-step generation, according to Microsoft. The base model went through reinforcement learning post-training for improved visual quality and artifact suppression before distillation.
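
To make "4-step generation" concrete, the sketch below runs a generic Euler sampler that calls a denoiser exactly four times. Everything here is an assumption for illustration: the article does not say which sampler or noise schedule Lens-Turbo uses, and `toy_velocity` is a hypothetical stand-in that points straight at a known target, not a trained model.

```python
import numpy as np

def sample_euler(velocity_fn, noise, num_steps=4):
    """Integrate from t=0 (noise) to t=1 (sample) with equal Euler steps.

    Returns the final sample and the number of velocity_fn evaluations.
    """
    x = noise.copy()
    dt = 1.0 / num_steps
    calls = 0
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
        calls += 1
    return x, calls

# Toy "latent" target standing in for what a real model would generate.
target = np.full((4, 4), 0.5)

def toy_velocity(x, t):
    # Ideal straight-line velocity toward the target; hypothetical stand-in.
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 4))
sample, calls = sample_euler(toy_velocity, noise, num_steps=4)
print(calls, np.allclose(sample, target))
```

In a real pipeline each call is a full forward pass through the denoiser, so the step count is the main lever on generation latency.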

Resolution and Aspect Ratio Support

The model supports flexible output resolutions:

  • Maximum resolution: 1440×1440 pixels
  • Aspect ratio range: 1:2 to 2:1
  • Multiple resolution presets: 1248×1664, 1664×1248, and square formats

Microsoft states the mixed-resolution training approach enables inference across different aspect ratios without quality degradation.
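
A quick arithmetic check on the figures above, as a sketch. The 1248×1664 and 1664×1248 presets exceed 1440 pixels on their long side, yet their total pixel count is within about 0.2% of 1440×1440. One reading, which is an inference from the numbers and not something Microsoft states, is that the 1440×1440 maximum describes a pixel budget, not a per-side limit.

```python
from fractions import Fraction

# Width x height values quoted in the article.
PRESETS = [(1440, 1440), (1248, 1664), (1664, 1248)]
# Reported aspect-ratio range (width:height), 1:2 to 2:1.
MIN_RATIO, MAX_RATIO = Fraction(1, 2), Fraction(2, 1)

def describe(width, height):
    """Return pixel count, reduced aspect ratio, and a range check."""
    ratio = Fraction(width, height)  # Fraction reduces to lowest terms
    return {
        "pixels": width * height,
        "aspect": f"{ratio.numerator}:{ratio.denominator}",
        "in_range": MIN_RATIO <= ratio <= MAX_RATIO,
    }

summary = {size: describe(*size) for size in PRESETS}
for size, info in summary.items():
    print(size, info)
```

The two non-square presets reduce to 3:4 and 4:3, both well inside the reported 1:2 to 2:1 range.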

Training Efficiency Claims

Microsoft claims Lens reaches "competitive quality with substantially less training compute than larger T2I models" through dense-caption pre-training that maximizes information density per training batch. The company has not disclosed specific benchmark scores, training compute requirements, or comparisons to specific competing models.

Model Availability

The model is available on Hugging Face under the repository microsoft/Lens-Turbo. Microsoft has released minimal inference code for generating images from Lens DiT checkpoints. Pricing information for API access has not been disclosed.

What This Means

Lens-Turbo is Microsoft's entry into the sub-4B-parameter text-to-image category, and it bets on caption quality over dataset scale to reach training efficiency. The 4-step distilled inference and flexible resolution support suit applications that need fast generation across varied aspect ratios. The reliance on GPT-4.1 for captioning suggests Microsoft is using its existing LLM infrastructure to improve training data quality. How the model actually performs against Stable Diffusion 3 or FLUX.1 remains unverified until benchmarks are published.
