NVIDIA releases LocateAnything-3B vision-language model with 2.5× faster object detection via parallel box decoding

TL;DR

NVIDIA released LocateAnything-3B, a 3-billion-parameter vision-language model that predicts bounding boxes in parallel rather than token-by-token, achieving up to 2.5× higher throughput than autoregressive approaches. The model, trained on 12M images with 138M+ queries and 785M bounding boxes, supports object detection, GUI element grounding, and robotics perception.

May 28, 2026 · 3:06 AM · 2 min read

LocateAnything-3B — Quick Specs

Context window: 24K tokens

NVIDIA released LocateAnything-3B, a 3-billion-parameter vision-language model designed for fast visual grounding tasks, including object detection, GUI element localization, and robotics perception.

Technical specifications

The model uses Qwen2.5-3B-Instruct as its language backbone and MoonViT-SO-400M as its vision encoder. Training data spans 12 million images, 138 million queries, and 785 million bounding boxes across natural scenes, robotics, driving, GUI interaction, and document understanding domains.

The model accepts images up to 2.5K resolution and text prompts up to 24K tokens. It can generate up to 8,192 new tokens during inference and runs in BF16 precision with a KV cache.

Parallel Box Decoding innovation

LocateAnything's core technical contribution is Parallel Box Decoding (PBD), which predicts complete bounding box coordinates in a single parallel step instead of generating coordinates token-by-token autoregressively. According to NVIDIA, this approach delivers up to 2.5× higher throughput while preserving geometric consistency.
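To make the difference concrete, here is a toy step count in Python. It is only a sketch under stated assumptions: autoregressive decoding is modeled as one sequential step per token, parallel decoding as one step per box, and the function names are hypothetical. The 6-token block length and the 8,192-token generation limit are the figures quoted in this article.

```python
def sequential_steps(num_boxes: int, tokens_per_box: int, parallel: bool) -> int:
    """Count sequential decoding steps in a toy model of box generation.

    Assumption: autoregressive decoding spends one step per token, while
    parallel decoding emits all tokens of one box in a single step.
    """
    if num_boxes < 0 or tokens_per_box <= 0:
        raise ValueError("num_boxes must be >= 0 and tokens_per_box must be > 0")
    return num_boxes if parallel else num_boxes * tokens_per_box


def max_boxes(max_new_tokens: int, tokens_per_box: int) -> int:
    """Upper bound on boxes if every generated token belonged to a box block."""
    if tokens_per_box <= 0:
        raise ValueError("tokens_per_box must be > 0")
    return max_new_tokens // tokens_per_box


if __name__ == "__main__":
    boxes = 50  # illustrative scene size, not a figure from NVIDIA
    print(sequential_steps(boxes, 6, parallel=False))  # 300
    print(sequential_steps(boxes, 6, parallel=True))   # 50
    print(max_boxes(8192, 6))                          # 1365
```

The step ratio in this toy count is not a throughput prediction. End-to-end throughput also depends on costs the count ignores, such as image encoding and prompt processing, which is consistent with NVIDIA reporting a gain of up to 2.5× rather than a gain equal to the block length.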

The architecture outputs structured coordinate tokens in fixed-length blocks of 6 tokens, including semantic labels, box coordinates, and control tokens. The model supports three decoding modes: Fast Mode (parallel prediction), Slow Mode (autoregressive), and Hybrid Mode (parallel with autoregressive fallback).
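The control flow of a hybrid mode can be sketched in Python as follows. This is an illustration, not NVIDIA's implementation: the per-block layout `[label, x1, y1, x2, y2, control]`, the corner-ordering check used to trigger the fallback, and every name in the block are assumptions, since the article does not specify the token layout or the fallback criterion. The two decoders are stand-in callables that return flat token lists.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

BLOCK_LEN = 6  # fixed block length stated in the article


@dataclass(frozen=True)
class Box:
    label: int
    x1: int
    y1: int
    x2: int
    y2: int


def parse_blocks(tokens: Sequence[int]) -> List[Box]:
    """Split a flat token list into 6-token blocks.

    Hypothetical layout per block: [label, x1, y1, x2, y2, control].
    """
    if len(tokens) % BLOCK_LEN != 0:
        raise ValueError("token count is not a multiple of the block length")
    boxes = []
    for i in range(0, len(tokens), BLOCK_LEN):
        label, x1, y1, x2, y2, _control = tokens[i:i + BLOCK_LEN]
        boxes.append(Box(label, x1, y1, x2, y2))
    return boxes


def is_consistent(box: Box) -> bool:
    """Geometric sanity check (assumed criterion): corners must be ordered."""
    return box.x1 < box.x2 and box.y1 < box.y2


Decoder = Callable[[], Sequence[int]]


def hybrid_decode(fast: Decoder, slow: Decoder) -> List[Box]:
    """Try the parallel path first; use the autoregressive path if the
    fast output is malformed or fails the geometric check."""
    try:
        boxes = parse_blocks(fast())
        if all(is_consistent(b) for b in boxes):
            return boxes
    except ValueError:
        pass  # malformed fast output, fall through to the slow path
    return parse_blocks(slow())


if __name__ == "__main__":
    fast = lambda: [7, 40, 10, 20, 60, 0]  # x1 > x2, fails the check
    slow = lambda: [7, 20, 10, 40, 60, 0]
    print(hybrid_decode(fast, slow))
```

In this sketch the fallback re-decodes the whole image. A finer-grained design could re-decode only the blocks that fail validation, but the article does not say which granularity the model uses.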

Supported use cases

NVIDIA lists these supported applications:

Open-set and long-tail object detection
Dense multi-object detection in cluttered scenes
Phrase and referring-expression grounding
Automated dataset labeling
GUI element grounding for agentic systems
Robotics and autonomous driving perception
Document understanding and OCR localization
Industrial inspection and surveillance

The model has been integrated into NVIDIA's Nemotron 3 Nano Omni production models for grounding and multimodal capabilities.

Availability and licensing

LocateAnything-3B is available on Hugging Face and GitHub as of May 26, 2026. The model is released under NVIDIA's non-commercial license, permitting use only for academic and non-profit research. Commercial use is prohibited except by NVIDIA and its affiliates.

The model runs on NVIDIA Ampere, Hopper, Blackwell, and Lovelace architectures. TensorRT and TensorRT-LLM support is not yet available. Deployment on embedded platforms such as NVIDIA Thor requires additional optimization, such as quantization or distillation.

What this means

Parallel Box Decoding is a practical efficiency gain for visual grounding tasks that require detecting many objects or UI elements in a single image. The throughput improvement of up to 2.5× matters for real-time applications such as robotics and GUI automation, where latency compounds across multiple calls.

However, the non-commercial license limits the model to research settings. Its integration into production Nemotron systems suggests NVIDIA views visual grounding as infrastructure for agentic AI systems rather than a standalone capability. The 3B parameter size makes local deployment feasible on consumer GPUs, while the multi-domain training data indicates the model is positioned as a generalist localization model rather than a domain-specific detector.

Source: huggingface.co
