LocateAnything-3B

NVIDIA, 🇺🇸 United States
active
Context window: 24K tokens

Version History

3B (major)

Initial release of LocateAnything-3B with Parallel Box Decoding architecture, trained on 12M images across natural scenes, robotics, driving, GUI, and document domains.

Coverage

Model release, NVIDIA

NVIDIA releases LocateAnything-3B vision-language model with up to 2.5× higher object detection throughput via parallel box decoding

NVIDIA released LocateAnything-3B, a 3-billion-parameter vision-language model that predicts bounding boxes in parallel rather than token by token, achieving up to 2.5× higher throughput than autoregressive approaches. The model was trained on 12M images with more than 138M queries and 785M bounding boxes, and it supports object detection, GUI element grounding, and robotics perception.
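
To make the contrast between token-by-token and parallel box prediction concrete, here is a toy step-counting sketch. It is not NVIDIA's implementation. The function names, the figure of 4 coordinate tokens per box, and the `boxes_per_pass` parameter are all assumptions made for illustration. The sketch only counts sequential decoder passes under those assumptions.

```python
def autoregressive_steps(num_boxes: int, tokens_per_box: int = 4) -> int:
    """Sequential passes when each coordinate token is emitted one at a time.

    tokens_per_box=4 is an assumed layout (x1, y1, x2, y2), not a documented one.
    """
    return num_boxes * tokens_per_box


def parallel_steps(num_boxes: int, boxes_per_pass: int) -> int:
    """Sequential passes when up to boxes_per_pass whole boxes are emitted per pass.

    boxes_per_pass is a hypothetical knob used only for this illustration.
    """
    if boxes_per_pass <= 0:
        raise ValueError("boxes_per_pass must be positive")
    return -(-num_boxes // boxes_per_pass)  # ceiling division


if __name__ == "__main__":
    n = 10
    print("token-by-token passes:", autoregressive_steps(n))
    print("parallel passes:", parallel_steps(n, boxes_per_pass=4))
```

Fewer sequential passes do not translate directly into the same throughput gain, because each parallel pass may cost more. The reported figure of up to 2.5× is a measured result, not something this sketch derives.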
