NVIDIA releases LocateAnything-3B vision-language model with 2.5× faster object detection via parallel box decoding
NVIDIA released LocateAnything-3B, a 3-billion-parameter vision-language model that predicts bounding boxes in parallel rather than token by token, achieving up to 2.5× higher throughput than autoregressive approaches. The model, trained on 12M images with 138M+ queries and 785M bounding boxes, supports object detection, GUI element grounding, and robotics perception.
NVIDIA released LocateAnything-3B, a 3-billion-parameter vision-language model designed for fast visual grounding tasks, including object detection, GUI element localization, and robotics perception.
Technical specifications
The model uses Qwen2.5-3B-Instruct as its language backbone and MoonViT-SO-400M as its vision encoder. Training data spans 12 million images, 138 million queries, and 785 million bounding boxes across natural scenes, robotics, driving, GUI interaction, and document understanding domains.
Input supports images up to 2.5K resolution and text prompts up to 24K tokens. The model can generate up to 8,192 new tokens during inference and uses BF16 precision with KV cache.
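As a rough illustration of what BF16 precision implies for memory, the sketch below estimates the size of the weights alone. It assumes a nominal count of 3 billion parameters and the standard 2 bytes per BF16 value; the helper name is hypothetical, and the KV cache, activations, and framework overhead all add to this figure.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights alone, in GiB.

    bytes_per_param defaults to 2 because BF16 stores each value in 16 bits.
    """
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # Nominal parameter count (assumption: the "3B" in the model name).
    nominal_params = 3e9
    print(f"BF16 weights: ~{weight_memory_gib(nominal_params):.2f} GiB")
    # For comparison only: a hypothetical 1-byte-per-weight quantized copy.
    print(f"8-bit weights: ~{weight_memory_gib(nominal_params, 1):.2f} GiB")
```

The estimate excludes the KV cache, which grows with prompt length and with the number of generated tokens.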
Parallel Box Decoding innovation
LocateAnything's core technical contribution is Parallel Box Decoding (PBD), which predicts complete bounding box coordinates in a single parallel step instead of generating coordinates token-by-token autoregressively. According to NVIDIA, this approach delivers up to 2.5× higher throughput while preserving geometric consistency.
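To make the contrast concrete, here is a toy sketch that counts sequential model calls under each strategy. It is not NVIDIA's implementation: the two stand-in "models" are hypothetical functions that return fixed numbers, and the sketch assumes four coordinate values per box. Fewer sequential calls does not translate one-to-one into throughput, which NVIDIA reports as up to 2.5× higher.

```python
from typing import Callable, List, Tuple

Box = List[int]  # [x1, y1, x2, y2]
COORDS_PER_BOX = 4  # assumption for this toy example


def decode_autoregressive(
    next_token: Callable[[List[int]], int], num_boxes: int
) -> Tuple[List[Box], int]:
    """Emit one coordinate per model call; return the boxes and the call count."""
    tokens: List[int] = []
    calls = 0
    for _ in range(num_boxes * COORDS_PER_BOX):
        tokens.append(next_token(tokens))
        calls += 1
    boxes = [tokens[i:i + COORDS_PER_BOX] for i in range(0, len(tokens), COORDS_PER_BOX)]
    return boxes, calls


def decode_parallel(
    next_box: Callable[[List[Box]], Box], num_boxes: int
) -> Tuple[List[Box], int]:
    """Emit a whole box per model call; return the boxes and the call count."""
    boxes: List[Box] = []
    calls = 0
    for _ in range(num_boxes):
        boxes.append(next_box(boxes))
        calls += 1
    return boxes, calls


# Hypothetical stand-ins for a real model: both produce the same coordinates.
def toy_next_token(prefix: List[int]) -> int:
    return 10 * len(prefix)


def toy_next_box(previous: List[Box]) -> Box:
    start = 10 * COORDS_PER_BOX * len(previous)
    return [start + 10 * k for k in range(COORDS_PER_BOX)]


if __name__ == "__main__":
    ar_boxes, ar_calls = decode_autoregressive(toy_next_token, num_boxes=3)
    pbd_boxes, pbd_calls = decode_parallel(toy_next_box, num_boxes=3)
    print(ar_boxes == pbd_boxes, ar_calls, pbd_calls)
```

Both paths yield identical boxes in this toy setup; only the number of sequential calls differs.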
The architecture outputs structured coordinate tokens in fixed-length blocks of 6 tokens, including semantic labels, box coordinates, and control tokens. The model supports three decoding modes: Fast Mode (parallel prediction), Slow Mode (autoregressive), and Hybrid Mode (parallel with autoregressive fallback).
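The following sketch shows one way a consumer might parse 6-token blocks and emulate a hybrid fallback. Only the block length of 6 comes from the description above. The field order (label, four coordinates, control token), the validity check, and every function name are assumptions made for illustration, not the model's documented output format or fallback criterion.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

BLOCK_LEN = 6  # block length stated in the article; the layout below is assumed


@dataclass
class Detection:
    label_id: int
    x1: int
    y1: int
    x2: int
    y2: int
    control: int


def parse_blocks(tokens: Sequence[int]) -> List[Detection]:
    """Split a flat token sequence into fixed-length blocks (assumed layout)."""
    if len(tokens) % BLOCK_LEN != 0:
        raise ValueError("token count is not a multiple of the block length")
    return [
        Detection(*tokens[i:i + BLOCK_LEN])
        for i in range(0, len(tokens), BLOCK_LEN)
    ]


def is_geometrically_valid(det: Detection) -> bool:
    """A box is usable only if its corners are correctly ordered."""
    return det.x1 < det.x2 and det.y1 < det.y2


def decode_hybrid(
    fast: Callable[[], Sequence[int]], slow: Callable[[], Sequence[int]]
) -> List[Detection]:
    """Try the parallel path first; fall back to the slow path if any box is invalid."""
    detections = parse_blocks(fast())
    if all(is_geometrically_valid(d) for d in detections):
        return detections
    return parse_blocks(slow())


if __name__ == "__main__":
    # Hypothetical decoder outputs: the fast path yields an inverted box.
    result = decode_hybrid(
        fast=lambda: [7, 30, 20, 10, 40, 0],
        slow=lambda: [7, 10, 20, 30, 40, 0],
    )
    print(result)
```

A fixed block length keeps parsing trivial, since each detection occupies a known number of positions and a malformed sequence is caught by a length check.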
Supported use cases
NVIDIA lists these supported applications:
- Open-set and long-tail object detection
- Dense multi-object detection in cluttered scenes
- Phrase and referring-expression grounding
- Automated dataset labeling
- GUI element grounding for agentic systems
- Robotics and autonomous driving perception
- Document understanding and OCR localization
- Industrial inspection and surveillance
The model has been integrated into NVIDIA's Nemotron 3 Nano Omni production models for grounding and multimodal capabilities.
Availability and licensing
LocateAnything-3B is available on Hugging Face and GitHub as of May 26, 2026. The model is released under NVIDIA's non-commercial license, permitting use only for academic and non-profit research. Commercial use is prohibited except by NVIDIA and its affiliates.
The model runs on NVIDIA Ampere, Hopper, Blackwell, and Lovelace architectures. TensorRT and TensorRT-LLM support is not yet available. Deployment on embedded platforms like NVIDIA Thor requires additional optimization including quantization or distillation.
What this means
Parallel Box Decoding is a practical efficiency gain for visual grounding tasks that require detecting many objects or UI elements in a single image. The reported throughput improvement of up to 2.5× matters for real-time applications such as robotics and GUI automation, where latency compounds across repeated calls. However, the non-commercial license limits the model to research settings, and its integration into production Nemotron systems suggests NVIDIA views visual grounding as infrastructure for agentic AI systems rather than as a standalone capability. The 3B-parameter size makes local deployment feasible on consumer GPUs, while the multi-domain training data indicates the model is positioned as a generalist localization model rather than a domain-specific detector.