NVIDIA releases LocateAnything-3B vision-language model with 2.5× faster object detection via parallel box decoding

NVIDIA released LocateAnything-3B, a 3-billion-parameter vision-language model that predicts bounding boxes in parallel rather than token by token, achieving up to 2.5× higher throughput than autoregressive approaches. The model was trained on 12M images with 138M+ queries and 785M bounding boxes, and it supports object detection, GUI element grounding, and robotics perception.
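To put the training-set figures in proportion, the short sketch below derives per-image and per-query averages from the three numbers quoted above. It is only back-of-the-envelope arithmetic: the constant and function names are illustrative, "138M+" is treated as exactly 138M (a lower bound), and the resulting averages are not figures reported by NVIDIA.

```python
# Training-set figures as quoted in the article.
# "138M+" is a lower bound; it is treated as exactly 138M here (assumption).
IMAGES = 12_000_000
QUERIES = 138_000_000
BOXES = 785_000_000


def per_unit(total: int, units: int) -> float:
    """Return the average number of `total` items per one of `units`."""
    return total / units


queries_per_image = per_unit(QUERIES, IMAGES)  # at least this many
boxes_per_query = per_unit(BOXES, QUERIES)     # at most this many
boxes_per_image = per_unit(BOXES, IMAGES)

print(f"queries per image: {queries_per_image:.1f}")
print(f"boxes per query:   {boxes_per_query:.1f}")
print(f"boxes per image:   {boxes_per_image:.1f}")
```

Under these assumptions the data averages roughly 11.5 queries and 65 boxes per image, or a little under 6 boxes per query. Because the query count is a lower bound, the true queries-per-image figure may be higher and the boxes-per-query figure lower.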

May 28, 2026 · 3:06 AM · 2 min read

nvidia vision-language-model object-detection
