video-understanding
6 articles tagged with video-understanding
NVIDIA Nemotron 3 Nano Omni: 30B-parameter multimodal model launches on AWS SageMaker with 131K token context
NVIDIA has launched Nemotron 3 Nano Omni on Amazon SageMaker JumpStart, a multimodal model with 30 billion total parameters (3 billion active) that processes video, audio, images, and text in a single inference pass. The model features a 131K token context window and uses a hybrid Mamba2-Transformer mixture-of-experts (MoE) architecture that combines three specialized encoders.
Nvidia releases Nemotron 3 Nano Omni: 30B-parameter multimodal model with 256K context, free on OpenRouter
Nvidia has released Nemotron 3 Nano Omni, a 30-billion-parameter multimodal model available free on OpenRouter. The model features a 256,000-token context window, accepts text, image, video, and audio inputs, and claims 2× higher throughput for video reasoning compared to separate vision and speech pipelines.
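Because the model is exposed through OpenRouter's OpenAI-compatible chat endpoint, a mixed text-plus-image request can be sketched as below. This is a minimal illustration, not official sample code: the model slug `nvidia/nemotron-3-nano-omni:free` and the helper name are assumptions, and the payload is only constructed, not sent.

```python
import json

# Hypothetical model slug -- check OpenRouter's model catalog for the real id.
MODEL = "nvidia/nemotron-3-nano-omni:free"

def build_request(prompt, image_url=None):
    """Build an OpenAI-compatible chat payload for OpenRouter's
    /api/v1/chat/completions endpoint, mixing text and image parts."""
    content = [{"type": "text", "text": prompt}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"model": MODEL, "messages": [{"role": "user", "content": content}]}

payload = build_request("Describe this frame.", "https://example.com/frame.jpg")
print(json.dumps(payload, indent=2))
```

Sending the payload would additionally require an `Authorization: Bearer <key>` header against `https://openrouter.ai/api/v1/chat/completions`.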
Alibaba releases Qwen3.5 Plus with 1M token context window at $0.40 per million input tokens
Alibaba released an updated version of Qwen3.5 Plus on April 27, 2026, with a 1 million token context window. The multimodal model accepts text, image, and video input and is priced at $0.40 per million input tokens and $2.40 per million output tokens, with tiered pricing above 256K tokens.
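At the published base rates, per-request cost is simple arithmetic; a small sketch (the exact surcharge tiers above 256K input tokens are not given in the announcement, so the helper below only covers the base tier):

```python
# Published base rates in USD per million tokens.
INPUT_RATE = 0.40
OUTPUT_RATE = 2.40

def base_cost(input_tokens, output_tokens):
    """Cost in USD at the base tier (input <= 256K tokens).
    Tiered surcharges above 256K are not modeled here."""
    assert input_tokens <= 256_000, "tiered pricing applies above 256K tokens"
    return input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE

# e.g. a 100K-token video prompt producing a 2K-token answer:
print(round(base_cost(100_000, 2_000), 4))  # → 0.0448
```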
Meta releases SAM 3.1, adding 7x faster multi-object tracking to vision foundation model
Meta has released SAM 3.1, an update to its Segment Anything Model that adds Object Multiplex, a shared-memory approach for joint multi-object tracking. The new version achieves approximately 7x faster inference when tracking 128 objects on a single H100 GPU while improving video object segmentation (VOS) performance on 6 out of 7 benchmarks.
Amazon Bedrock adds three video analysis workflows for multimodal understanding at scale
Amazon Bedrock has introduced three distinct video analysis workflows that leverage multimodal foundation models to extract insights from video content at scale. The three approaches (frame-based, shot-based, and multimodal embedding) target different use cases and cost-performance trade-offs, with open-source reference implementations available on GitHub.
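The frame-based approach works by sampling a bounded number of still frames from the video and passing them to a multimodal model. A minimal sketch of the sampling step, assuming an evenly spaced strategy (this is an illustrative helper, not the AWS reference implementation):

```python
def sample_timestamps(duration_s, max_frames=20):
    """Pick evenly spaced timestamps (in seconds) at which to extract
    still frames for a frame-based analysis pass. Sampling at the
    midpoint of each interval avoids duplicate endpoint frames."""
    n = max(1, max_frames)
    step = duration_s / n
    return [round(step * (i + 0.5), 2) for i in range(n)]

# Six frames spread across a 90-second clip:
print(sample_timestamps(90.0, 6))  # → [7.5, 22.5, 37.5, 52.5, 67.5, 82.5]
```

`max_frames` is the cost lever: fewer frames means fewer image tokens per request, at the price of temporal resolution; shot-based workflows instead cut at scene boundaries.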
Video AI models hit reasoning ceiling despite 1000x larger dataset, researchers find
An international research team released the largest video reasoning dataset to date, roughly 1,000 times larger than previous alternatives. Testing reveals that state-of-the-art models including Sora 2 and Veo 3.1 substantially underperform humans on reasoning tasks, suggesting the limitation isn't data scarcity but architectural constraints.