Merlin: Stanford releases 3D CT vision-language model trained on 6M images
Researchers at Stanford have released Merlin, a 3D vision-language model designed specifically for abdominal CT scan interpretation. Trained on 6+ million CT images, 1.8 million diagnosis codes, and 6+ million report tokens from 15,331 scans, Merlin outperforms 2D medical vision-language models on diagnostic classification, phenotyping, and semantic segmentation across internal and external validation sets.
Merlin is purpose-built for automated analysis of abdominal computed tomography scans and targets a critical bottleneck in medical imaging: the shortage of radiologists relative to growing scan volume.
Training Data and Architecture
Merlin was trained on a clinical dataset of 15,331 CT scans comprising 6+ million images, 1.8+ million diagnosis codes extracted from electronic health records, and 6+ million tokens from the paired radiology reports. The model uses a multistage pretraining framework that eliminates the need for manual annotation of individual CT slices.
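One plausible stage of such image-text pretraining is a CLIP-style contrastive objective that pulls each scan's embedding toward its own report and away from other reports in the batch. The sketch below is illustrative only, not Merlin's published training code; the embeddings, batch size, and temperature are placeholder assumptions:

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    scan and report embeddings; matched pairs share a batch index."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    idx = np.arange(len(img))                 # diagonal = positive pairs

    def xent(lg):
        # numerically stable log-softmax, then pick the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the image-to-text and text-to-image directions
    return (xent(logits) + xent(logits.T)) / 2

# toy batch: report embeddings are small perturbations of scan embeddings
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 128))
txt = img + 0.01 * rng.normal(size=(4, 128))
matched_loss = info_nce_loss(img, txt)
```

With near-identical pairs the diagonal dominates and the loss approaches zero; shuffling the pairing drives it up, which is the training signal.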
Unlike existing medical vision-language models limited to 2D images and short text, Merlin processes full 3D volumetric CT scans alongside comprehensive clinical narratives and structured diagnostic data.
Evaluation Results
The researchers comprehensively evaluated Merlin across 752 individual tasks spanning six task categories:
Zero-shot tasks (no model fine-tuning required):
- Classification of 30 common CT findings
- Phenotype classification across 692 clinical phenotypes
- Cross-modal retrieval (image-to-findings and image-to-impression matching)
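Zero-shot classification in contrastively trained vision-language models is typically done by scoring a scan embedding against text embeddings of candidate findings. The sketch below shows the mechanism only; the embeddings and finding names are toy placeholders, not Merlin's actual encoders or API:

```python
import numpy as np

def zero_shot_classify(volume_emb, finding_embs, findings, threshold=0.5):
    """Score one CT-volume embedding against per-finding text embeddings;
    return (finding, cosine similarity) pairs that clear the threshold."""
    v = volume_emb / np.linalg.norm(volume_emb)
    t = finding_embs / np.linalg.norm(finding_embs, axis=1, keepdims=True)
    sims = t @ v
    return [(f, float(s)) for f, s in zip(findings, sims) if s >= threshold]

# toy embeddings standing in for real image/text encoder outputs
rng = np.random.default_rng(1)
volume = rng.normal(size=64)
findings = ["hepatic steatosis", "cholelithiasis", "splenomegaly"]
texts = rng.normal(size=(3, 64))
texts[0] = volume + 0.1 * rng.normal(size=64)   # simulate one present finding
detected = zero_shot_classify(volume, texts, findings)
```

No fine-tuning is involved: adding a new finding only requires embedding its text prompt.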
Model-adapted tasks (fine-tuned on task-specific data):
- 5-year chronic disease prediction for 6 diseases
- Automated radiology report generation
- 3D semantic segmentation of 20 abdominal organs
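Segmentation benchmarks like the 20-organ task are commonly scored with the Dice coefficient, which compares predicted and reference label maps per organ. A minimal sketch (the label maps and organ labels here are hypothetical):

```python
import numpy as np

def dice_score(pred, truth, label):
    """Dice coefficient for one organ label in two 3D label maps:
    2*|P ∩ T| / (|P| + |T|), ranging from 0 (no overlap) to 1 (exact)."""
    p = (pred == label)
    t = (truth == label)
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0  # organ absent from both volumes: perfect agreement
    return 2.0 * np.logical_and(p, t).sum() / denom

# toy 4x4x4 volumes; label 1 stands in for one organ
truth = np.zeros((4, 4, 4), dtype=int)
truth[:2] = 1               # organ occupies 2 slices (32 voxels)
pred = np.zeros((4, 4, 4), dtype=int)
pred[:3] = 1                # prediction over-segments by one slice
score = dice_score(pred, truth, 1)  # 2*32 / (48 + 32) = 0.8
```

Per-organ Dice scores are then averaged to summarize a model's segmentation quality across the anatomy.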
Internal validation used 5,137 CT scans. External validation spanned 44,098 scans from three independent hospital sites and two public datasets, demonstrating generalization across institutions and anatomical variations.
Merlin outperformed baseline methods, including 2D vision-language models, CT-specific foundation models, and off-the-shelf radiology software, across the evaluated tasks.
Public Release
Stanford has released trained Merlin models, source code, and the underlying dataset via GitHub at https://github.com/StanfordMIMI/Merlin, enabling further research and clinical development.
What this means
Merlin represents a step toward automating routine CT interpretation tasks, potentially reducing radiologist workload for high-volume screening and diagnosis. The 3D architecture and the integration of structured clinical data (diagnosis codes, reports) distinguish it from prior medical VLMs. The scale of external validation across multiple sites strengthens the evidence for real-world applicability, though clinical deployment would still require regulatory approval and integration with hospital workflows. The public release of models and data accelerates reproducibility and follow-on research in medical imaging AI.