visual-grounding
1 article tagged with visual-grounding
May 28, 2026
model release · NVIDIA
NVIDIA releases LocateAnything-3B vision-language model with up to 2.5× faster object detection via parallel box decoding
NVIDIA released LocateAnything-3B, a 3-billion-parameter vision-language model that predicts bounding boxes in parallel rather than token by token, achieving up to 2.5× higher throughput than autoregressive approaches. The model was trained on 12M images with more than 138M queries and 785M bounding boxes, and it supports object detection, GUI element grounding, and robotics perception.