product update | Amazon Web Services

Amazon Launches Nova Multimodal Embeddings for Video Semantic Search Across Visual, Audio, and Text Signals

TL;DR

Amazon released Nova Multimodal Embeddings on Amazon Bedrock, a unified embedding model that processes text, documents, images, video, and audio into a shared 1024-dimensional semantic vector space. The model supports up to 30 seconds of video per embedding and enables semantic search across all modalities simultaneously without converting video to text first.

Amazon released Nova Multimodal Embeddings on Amazon Bedrock, a unified embedding model that processes text, documents, images, video, and audio into a shared 1024-dimensional semantic vector space for video search applications.

The model supports up to 30 seconds of video per embedding and processes all modalities directly, with no intermediate text conversion. According to Amazon, this approach preserves temporal understanding and avoids the information loss that occurs when video signals are reduced to text through transcription or manual tagging.

Technical Architecture

The reference implementation uses a two-phase architecture. The ingestion pipeline processes uploaded videos through:

  • FFmpeg scene detection to segment video at natural boundaries (targeting 10-second segments, with an allowed range of 5 to 15 seconds)
  • Parallel processing generating separate 1024-dimensional embeddings for visual and audio content
  • Amazon Transcribe for speech-to-text conversion with timestamp alignment
  • Amazon Rekognition for celebrity detection
  • Amazon Nova 2 Lite for caption and genre generation
  • Indexing into Amazon OpenSearch Service
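The scene-detection step at the top of this list can be sketched with FFmpeg's `select` and `showinfo` filters, which log a `pts_time` value for each frame whose scene score exceeds a threshold. This is a minimal sketch, not the reference implementation: the function names and the 0.4 threshold are assumptions chosen for illustration, and the article does not state which FFmpeg options Amazon uses.

```python
import re
import subprocess

# Matches the presentation timestamp that FFmpeg's showinfo filter logs per frame.
PTS_TIME = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def scene_detect_command(path, threshold=0.4):
    """Build an FFmpeg command that logs frames whose scene score exceeds threshold.

    The 0.4 default is an assumed value, not one taken from the article.
    """
    return [
        "ffmpeg", "-hide_banner", "-i", path,
        "-filter:v", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",
    ]


def parse_scene_times(ffmpeg_log):
    """Extract scene-change timestamps (in seconds) from showinfo log lines."""
    times = []
    for line in ffmpeg_log.splitlines():
        if "showinfo" not in line:
            continue  # skip progress lines and other filter output
        match = PTS_TIME.search(line)
        if match:
            times.append(float(match.group(1)))
    return sorted(times)


def detect_scenes(path, threshold=0.4):
    """Run FFmpeg and return scene-change timestamps; requires ffmpeg on PATH."""
    result = subprocess.run(
        scene_detect_command(path, threshold),
        capture_output=True, text=True, check=True,
    )
    return parse_scene_times(result.stderr)  # FFmpeg writes filter logs to stderr
```

Calling `detect_scenes("match.mp4")` on a local file would return a sorted list of candidate cut points in seconds, which a later step can snap segment boundaries to.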

The search pipeline executes parallel operations:

  • Intent analysis using Claude Haiku to assign relevance weights (0.0-1.0) across visual, audio, transcription, and metadata modalities
  • Embedding the query three times, once each for visual, audio, and transcription similarity search
  • Hybrid search combining semantic and lexical signals
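One way to picture how the intent weights could combine per-modality results is a weighted sum of cosine similarities. The sketch below is illustrative only: the fusion formula, the function names, and the toy 3-dimensional vectors (standing in for 1024-dimensional embeddings) are assumptions, and the article does not describe how the reference implementation merges scores or handles the metadata and lexical signals.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse_score(weights, query_vectors, segment_vectors):
    """Weighted sum of per-modality similarities (assumed fusion rule)."""
    total = 0.0
    for modality, weight in weights.items():
        q = query_vectors.get(modality)
        s = segment_vectors.get(modality)
        if q is None or s is None:
            continue  # modality without an embedding on either side contributes nothing
        total += weight * cosine(q, s)
    return total


def rank_segments(weights, query_vectors, segments):
    """Return segment ids ordered from highest to lowest fused score."""
    scored = [
        (fuse_score(weights, query_vectors, seg["vectors"]), seg["id"])
        for seg in segments
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg_id for _, seg_id in scored]


# Hypothetical weights an intent-analysis step might assign to
# "a tense car chase with sirens": mostly visual and audio.
weights = {"visual": 0.8, "audio": 0.6, "transcription": 0.1}
query = {"visual": [1, 0, 0], "audio": [0, 1, 0], "transcription": [0, 0, 1]}
segments = [
    {"id": "interview", "vectors": {"visual": [0, 1, 0], "audio": [1, 0, 0], "transcription": [0, 0, 1]}},
    {"id": "chase", "vectors": {"visual": [1, 0, 0], "audio": [0, 1, 0], "transcription": [1, 0, 0]}},
]
ranking = rank_segments(weights, query, segments)
```

In this toy data the "chase" segment matches the query on the two heavily weighted modalities, so it ranks ahead of the "interview" segment, which matches only on transcription.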

Segmentation Strategy

The system uses adaptive scene-based segmentation rather than fixed-length chunks. FFmpeg's scene detection identifies natural visual boundaries, and the algorithm snaps cuts to the nearest scene change within an acceptable window. This produces segments like 8.3s, 11.1s, 9.8s, 12.4s, 7.6s aligned to actual scene boundaries.

According to Amazon, fixed-length segmentation can split scenes mid-action or sentences mid-thought, degrading embedding quality and retrieval precision. The scene-based approach maintains semantic continuity where each segment represents a coherent unit of meaning.
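The snapping behavior described in this section can be sketched as a short greedy loop: aim for a cut about 10 seconds after the previous one, and move it to the nearest detected scene change if one falls inside the 5 to 15 second window. This is a simplified sketch under stated assumptions; the function name, the fixed-length fallback when no scene change is in range, and the handling of the final segment are illustrative choices, not details confirmed by the article.

```python
def snap_segments(scene_changes, duration, target=10.0, min_len=5.0, max_len=15.0):
    """Split [0, duration] into (start, end) segments aligned to scene changes.

    Each cut is snapped to the scene change closest to `target` seconds after
    the previous cut, provided it lies between min_len and max_len.
    """
    changes = sorted(t for t in scene_changes if 0.0 < t < duration)
    boundaries = [0.0]
    start = 0.0
    while duration - start > max_len:
        ideal = start + target
        window = [t for t in changes if start + min_len <= t <= start + max_len]
        # Assumed fallback: cut at the target length when no scene change is in range.
        cut = min(window, key=lambda t: abs(t - ideal)) if window else ideal
        boundaries.append(cut)
        start = cut
    boundaries.append(duration)  # the final segment may be shorter than min_len
    return list(zip(boundaries[:-1], boundaries[1:]))


# Scene changes chosen to reproduce the segment lengths quoted above.
segments = snap_segments([8.3, 19.4, 29.2, 41.6], duration=49.2)
lengths = [end - start for start, end in segments]
```

With these inputs the cuts land at 8.3, 19.4, 29.2, and 41.6 seconds, giving segment lengths of roughly 8.3, 11.1, 9.8, 12.4, and 7.6 seconds. With no scene changes at all, the same function degrades to fixed 10-second cuts.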

Use Cases

Amazon targets three primary applications:

  • Sports broadcasters surfacing exact moments when players scored for instant highlight delivery
  • Studios finding every scene with specific actors across thousands of archived hours
  • News organizations retrieving footage by mood, location, or event for breaking stories

The model handles complex queries like "a tense car chase with sirens" that require simultaneous visual and audio understanding, or searches for athletes who appear on screen but are never mentioned in dialogue.

Availability

Pricing for Nova Multimodal Embeddings was not disclosed. A complete reference implementation is available on GitHub for deployment on AWS infrastructure including Lambda, Fargate, Step Functions, S3, DynamoDB, OpenSearch Service, and CloudFront.

What This Means

The release targets a common limitation in video search: many existing systems convert signals to text before indexing, losing temporal context and visual information that text cannot capture. By embedding visual, audio, and text content natively in a shared space, the model enables retrieval on any combination of signals without first reducing video to transcripts or tags. The 30-second limit per embedding and scene-aware segmentation suggest Amazon is prioritizing semantic coherence over simple throughput, though the lack of disclosed pricing makes cost comparison with text-based approaches difficult.
