LLM News

Every LLM release, update, and milestone.

Filtered by: vision-language-models
research

New framework improves VLM spatial reasoning through minimal information selection

A new research paper introduces MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that improves Vision-Language Models' ability to reason about 3D spatial relationships. The method addresses two key bottlenecks: inadequate 3D understanding from 2D-centric training and reasoning failures from redundant information.
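The announcement doesn't reproduce MSSR's agent design, but the core idea of answering from a minimal sufficient subset of spatial evidence can be sketched roughly as below. The perception and reasoning agents, the SpatialFact structure, and the relevance threshold are all hypothetical stand-ins for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a dual-agent "minimal sufficient information" loop.
# Agent internals are stand-ins; MSSR's actual prompting/selection differs.

from dataclasses import dataclass

@dataclass
class SpatialFact:
    text: str         # e.g. "the mug is left of the laptop"
    relevance: float   # perception agent's confidence that the fact matters

def perception_agent(image_desc: str) -> list[SpatialFact]:
    """Stand-in: propose candidate 3D spatial facts about the scene."""
    return [
        SpatialFact("the mug is left of the laptop", 0.9),
        SpatialFact("the laptop is on the desk", 0.8),
        SpatialFact("the wall is white", 0.1),  # redundant for spatial QA
    ]

def reasoning_agent(question: str, facts: list[SpatialFact], threshold: float = 0.5) -> str:
    """Stand-in: keep only facts above a relevance threshold (the 'minimal
    sufficient' subset), then answer from that reduced context."""
    kept = [f.text for f in facts if f.relevance >= threshold]
    context = "; ".join(kept)
    return f"Answer derived from minimal context: [{context}]"

if __name__ == "__main__":
    facts = perception_agent("a desk scene")
    print(reasoning_agent("Is the mug left of the laptop?", facts))
```

The point of the filtering step is that the answering agent never sees the redundant facts that the paper identifies as a source of reasoning failures.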

research

Merlin: Stanford releases 3D CT vision-language model trained on 6M images

Researchers at Stanford have released Merlin, a 3D vision-language model designed specifically for abdominal CT scan interpretation. Trained on 6+ million CT images, 1.8 million diagnosis codes, and 6+ million report tokens from 15,331 scans, Merlin outperforms 2D medical vision-language models on diagnostic classification, phenotyping, and semantic segmentation across internal and external validation sets.
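The announcement doesn't detail Merlin's architecture or objectives; as a rough, assumption-laden sketch, a 3D CT encoder could be jointly supervised by a CLIP-style image-report contrastive loss and a multi-label diagnosis-code loss, as below. Every module, dimension, and loss term here is a placeholder for illustration, not Merlin's actual code.

```python
# Illustrative sketch (not Merlin's code): joint supervision of a 3D CT encoder
# with (1) a CLIP-style image-report contrastive loss and
# (2) a multi-label diagnosis-code classification loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Toy3DCTVLM(nn.Module):
    def __init__(self, embed_dim: int = 256, num_codes: int = 1000):
        super().__init__()
        # Placeholder 3D image encoder over CT volumes of shape (B, 1, D, H, W).
        self.image_encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        # Placeholder projection over pre-computed report embeddings.
        self.text_proj = nn.Linear(768, embed_dim)
        # Multi-label head over structured diagnosis codes.
        self.code_head = nn.Linear(embed_dim, num_codes)

    def forward(self, volume, report_emb):
        feat = self.image_encoder(volume)
        img = F.normalize(feat, dim=-1)
        txt = F.normalize(self.text_proj(report_emb), dim=-1)
        return img, txt, self.code_head(feat)

def joint_loss(img, txt, code_logits, code_labels, temperature: float = 0.07):
    # Symmetric image-report contrastive loss plus diagnosis-code BCE.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2
    code_loss = F.binary_cross_entropy_with_logits(code_logits, code_labels)
    return contrastive + code_loss

if __name__ == "__main__":
    model = Toy3DCTVLM()
    vol = torch.randn(4, 1, 32, 64, 64)          # tiny fake CT volumes
    rep = torch.randn(4, 768)                     # fake report embeddings
    codes = torch.randint(0, 2, (4, 1000)).float()
    img, txt, logits = model(vol, rep)
    print(joint_loss(img, txt, logits, codes))
```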

research

VC-STaR: Researchers use visual contrast to reduce hallucinations in VLM reasoning

Researchers propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a self-improving framework that addresses a fundamental challenge in vision-language models: hallucinations in visual reasoning. The approach uses contrastive VQA pairs (visually similar images paired with equivalent questions) to improve how VLMs identify relevant visual cues and generate more accurate reasoning paths.
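The announcement doesn't spell out the training loop; one plausible STaR-style reading of "contrastive VQA pairs" is sketched below, where rationales survive only if the model answers both images of a pair correctly, so the kept traces must rely on image-specific cues rather than language priors. All functions and the toy model are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of a contrastive self-taught reasoning loop
# (stand-in functions; not VC-STaR's actual code).

def generate_reasoning(model, image, question):
    """Stand-in: sample a rationale and an answer for one image."""
    rationale = model(image, question)      # e.g. "the cup is red because ..."
    answer = rationale.split()[-1]           # toy answer extraction
    return rationale, answer

def collect_contrastive_traces(model, pairs):
    """Keep traces only when the model answers BOTH images of a contrastive
    pair correctly; surviving rationales are then used for fine-tuning."""
    kept = []
    for (img_a, img_b, question, ans_a, ans_b) in pairs:
        rat_a, pred_a = generate_reasoning(model, img_a, question)
        rat_b, pred_b = generate_reasoning(model, img_b, question)
        if pred_a == ans_a and pred_b == ans_b:
            kept.append((img_a, question, rat_a))
            kept.append((img_b, question, rat_b))
    return kept  # fine-tune the VLM on these grounded rationales, then repeat

if __name__ == "__main__":
    # Toy "model": its rationale ends with the image's dominant color.
    toy_model = lambda image, question: f"because the cup in {image} looks {image}"
    pairs = [("red", "blue", "What color is the cup?", "red", "blue")]
    print(collect_contrastive_traces(toy_model, pairs))
```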

benchmark

UniG2U-Bench reveals unified multimodal models underperform VLMs in most tasks

A new comprehensive benchmark called UniG2U-Bench evaluates whether generation capabilities improve multimodal understanding across 30+ models. The findings show unified multimodal models generally underperform specialized Vision-Language Models, with generation-then-answer inference degrading performance in most cases—though spatial reasoning and multi-round tasks show consistent improvements.
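As an illustration of what "generation-then-answer inference" means in practice, the sketch below compares direct answering with a mode where the model first synthesizes an intermediate image and then answers with it in context. The model interface and evaluation harness are placeholders, not the benchmark's actual code.

```python
# Illustrative sketch: direct answering vs. generation-then-answer inference
# for a unified multimodal model. Interfaces are placeholders, not the
# benchmark's actual harness.

def evaluate(model, dataset, generate_first: bool):
    correct = 0
    for image, question, gold in dataset:
        if generate_first:
            # The unified model first synthesizes an intermediate image
            # conditioned on the question, then answers with it in context.
            generated = model.generate_image(image, question)
            pred = model.answer(question, images=[image, generated])
        else:
            pred = model.answer(question, images=[image])
        correct += int(pred == gold)
    return correct / max(len(dataset), 1)

class ToyUnifiedModel:
    """Stand-in unified model: generation is a no-op, answers are fixed."""
    def generate_image(self, image, question):
        return image  # pretend we re-rendered the scene
    def answer(self, question, images):
        return "yes"

if __name__ == "__main__":
    data = [("img0", "Is there a cat?", "yes"), ("img1", "Is there a dog?", "no")]
    m = ToyUnifiedModel()
    print("direct:", evaluate(m, data, generate_first=False))
    print("gen-then-answer:", evaluate(m, data, generate_first=True))
```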