Meta Releases SAM 3.1 with 7x Faster Multi-Object Tracking
Meta has released SAM 3.1, an updated version of its Segment Anything Model that introduces Object Multiplex, a shared-memory architecture designed for efficient joint multi-object tracking in videos.
Key Improvements
SAM 3.1 builds on SAM 3, Meta's unified foundation model for promptable segmentation that handles both images and videos. The new version maintains backward compatibility with SAM 3's core capabilities—detecting, segmenting, and tracking objects via text or visual prompts (points, boxes, masks)—while adding significant performance gains for tracking scenarios.
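For orientation, here is a hypothetical sketch of what point- and box-prompted image segmentation might look like. The `sam3` import, predictor builder, and method names are assumptions modeled loosely on SAM 2's public predictor interface, not the actual SAM 3 API; the real interface is documented in the SAM 3 GitHub repository.

```python
# Hypothetical sketch of visually prompted segmentation with SAM 3.1.
# The import path, builder, and method names below are assumptions,
# not the official API; see the SAM 3 GitHub repo for the real interface.
import numpy as np
from PIL import Image

from sam3 import build_sam3_image_predictor  # hypothetical import

predictor = build_sam3_image_predictor(checkpoint="sam3.1_checkpoint.pt")

image = np.array(Image.open("frame.jpg").convert("RGB"))
predictor.set_image(image)

# Visual prompts: one foreground click plus a bounding box around the target.
masks, scores = predictor.predict(
    point_coords=np.array([[420, 310]]),  # (x, y) pixel of a positive click
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    box=np.array([350, 250, 520, 400]),   # xyxy box prompt
)
print(masks.shape, scores)
```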
The Object Multiplex approach delivers approximately 7x faster inference when tracking 128 objects on a single H100 GPU, according to Meta, which says the speedup comes without sacrificing accuracy. SAM 3.1 also improves video object segmentation (VOS) performance on 6 of the 7 benchmarks tested.
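Meta has not detailed the Object Multiplex implementation here, but the stated idea of a shared memory for joint tracking can be illustrated with a toy sketch: encode each frame once and update all object memories in a single batched attention call, rather than running a full pass per object. The code below is a conceptual illustration, not Meta's method.

```python
# Toy illustration (not Meta's code) of the shared-memory idea behind
# Object Multiplex: encode each frame once and update all object memories
# in one batched attention call, instead of one full pass per object.
import torch

torch.manual_seed(0)
D, T, N = 64, 256, 128                      # feature dim, tokens/frame, objects
frames = [torch.randn(T, D) for _ in range(8)]
memories = torch.randn(N, D)                # one memory vector per tracked object

def attend(feats, mem):
    # Cross-attention stand-in: each object's memory reads the frame features.
    weights = torch.softmax(mem @ feats.T / D**0.5, dim=-1)  # (N, T)
    return weights @ feats                                    # (N, D)

# Per-object baseline: N separate attention calls per frame.
mem_a = memories.clone()
for feats in frames:
    mem_a = torch.stack([attend(feats, mem_a[i:i+1]).squeeze(0) for i in range(N)])

# Multiplexed variant: one batched call per frame covering all N objects.
mem_b = memories.clone()
for feats in frames:
    mem_b = attend(feats, mem_b)

assert torch.allclose(mem_a, mem_b, atol=1e-5)  # same result, far fewer calls
```

In a real tracker the per-frame encoder cost dominates, so sharing that work across 128 objects and batching the memory updates is a plausible source of the claimed speedup; the actual mechanism may differ.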
Concept Vocabulary Scale
SAM 3.1 retains SAM 3's open-vocabulary capability: the model can exhaustively segment all instances of a concept specified by a short text phrase, and Meta says it handles over 50x more unique concepts than existing benchmarks cover.
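As a hypothetical usage sketch (the function and field names below are illustrative assumptions, not the repository's real API), text-prompted concept segmentation on video would look roughly like this:

```python
# Hypothetical text-prompt interface; names are illustrative, not official.
from sam3 import build_sam3_video_predictor  # assumed import path

predictor = build_sam3_video_predictor(checkpoint="sam3.1_checkpoint.pt")

# A short noun phrase prompts exhaustive segmentation of every matching
# instance across the clip, each with a stable track identity.
results = predictor.segment_video("clip.mp4", phrases=["yellow school bus"])
for track in results:
    print(track.object_id, track.phrase, len(track.masks))
```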
Deployment Details
The SAM 3.1 model checkpoints are available on Hugging Face at facebook/sam3.1. Meta notes that there is no Hugging Face Transformers integration for this release; installation instructions, code examples, and full documentation live in the SAM 3 GitHub repository. Access is gated: users must share contact information under Meta's privacy framework before they can download the model.
At the time of writing, the model had recorded 1,865 downloads on Hugging Face over the previous month. No inference providers currently serve SAM 3.1, so users must run the model locally or on their own infrastructure.
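With no hosted provider, a typical first step is pulling the checkpoints locally with the `huggingface_hub` client. The snippet below assumes access has already been granted on the model page and a token has been stored via `huggingface-cli login`.

```python
# Download the gated SAM 3.1 checkpoints for local inference.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="facebook/sam3.1",  # gated: requires accepting Meta's terms first
    token=True,                 # reuse the token stored by `huggingface-cli login`
)
print("Checkpoints saved to:", local_dir)
```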
What This Means
SAM 3.1 positions Meta's segmentation foundation model for production use cases in video understanding and multi-object tracking—domains where inference speed directly impacts real-world feasibility. The 7x speedup at scale (128 objects) suggests the model targets enterprise applications in video analysis, autonomous systems, and robotics rather than single-object or lightweight use cases. The lack of inference provider deployment currently limits accessibility to developers with GPU infrastructure, though the open availability of checkpoints aligns with Meta's broader strategy of releasing foundational models for research and production adoption.
Related Articles
Baidu Releases Qianfan-OCR-Fast Model with 66K Context at $0.68 Per 1M Input Tokens
Baidu has released Qianfan-OCR-Fast, a multimodal model specialized for optical character recognition tasks. The model offers a 66,000 token context window and is priced at $0.68 per 1M input tokens and $2.81 per 1M output tokens.
DeepSeek Releases V4 Flash: 284B-Parameter MoE Model with 1M Context Window, Free via OpenRouter
DeepSeek has released V4 Flash, a Mixture-of-Experts model with 284B total parameters and 13B activated parameters per forward pass. The model supports a 1M-token context window and is available free through OpenRouter, targeting high-throughput coding and chat applications.
Perceptron Launches Mk1 Vision-Language Model with Video Reasoning at $0.15/$1.50 per 1M Tokens
Perceptron has released Perceptron Mk1, a vision-language model designed for video understanding and embodied reasoning tasks. The model accepts image and video inputs with a 33K context window, is priced at $0.15 per 1M input tokens and $1.50 per 1M output tokens, and supports structured spatial annotations on demand.
Mira Murati's Thinking Machines announces full-duplex AI model with 0.40-second response time
Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, announced TML-Interaction-Small, a full-duplex AI model that processes input while simultaneously generating responses. The company claims a 0.40-second response time, matching natural human conversation speed.