Amazon Nova Multimodal Embeddings adds audio search capabilities to Bedrock

TL;DR

Amazon Nova Multimodal Embeddings, announced October 28, 2025, now supports audio content for semantic search alongside text, images, and video. The model offers four embedding dimension options (3,072, 1,024, 384, 256) and uses Matryoshka Representation Learning to balance accuracy with storage efficiency.

Amazon has expanded Nova Multimodal Embeddings to support audio content, enabling semantic search across audio libraries through unified cross-modal retrieval. The model, available in Amazon Bedrock, processes audio alongside text, documents, images, and video through a single model architecture.

Audio Embedding Architecture

Amazon Nova generates audio embeddings as float32 arrays in four dimension sizes: 3,072 (default), 1,024, 384, and 256. The model uses Matryoshka Representation Learning (MRL), which orders information hierarchically so that an embedding can be truncated without reprocessing the source audio. A full 3,072-dimension embedding contains information at all scales: users can keep just the first 256 dimensions and retain most of the retrieval accuracy, trading a small loss in precision for lower storage and compute costs.
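The truncation step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Amazon's implementation: the helper name truncate_embedding and the toy vector are hypothetical, and rescaling the truncated vector to unit length is a common convention for cosine or dot-product search rather than something the announcement specifies.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale in an all-zero vector
    return [x / norm for x in head]

# Toy stand-in for a 3,072-dimension embedding (not real model output).
full = [math.sin(i + 1) for i in range(3072)]
small = truncate_embedding(full, 256)

# float32 storage per vector: 4 bytes per dimension.
bytes_full = len(full) * 4    # 12,288 bytes
bytes_small = len(small) * 4  # 1,024 bytes
```

At four bytes per float32 value, a 256-dimension vector takes 1,024 bytes against 12,288 bytes for the full 3,072 dimensions, a twelvefold reduction per stored embedding.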

Audio embeddings encode both acoustic and semantic features: rhythm, pitch, timbre, emotional tone, and semantic meaning. The model processes audio as mel-spectrograms or learned audio features rather than raw waveforms, using temporal convolutional networks or transformer architectures to capture spectro-temporal patterns. Individual audio segments up to 30 seconds preserve temporal context and long-range acoustic dependencies.

Two API Modes

Amazon Nova provides synchronous and asynchronous embedding generation:

Synchronous API (invoke_model): For real-time queries. Users submit search text such as "upbeat jazz piano" or an audio clip and receive an embedding in milliseconds, ready to use in a k-nearest neighbor (k-NN) search against a vector database.

Asynchronous API: For batch processing. Audio files are uploaded to Amazon S3, and the model automatically splits files longer than 30 seconds into segments with temporal metadata. The resulting embeddings are stored in a vector database with metadata (filename, duration, genre) as a one-time indexing step.
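To make the segmentation idea concrete, here is a rough sketch of the temporal metadata that could accompany each segment. The service performs segmentation itself; the helper segment_bounds, the record field names, and the choice of fixed, non-overlapping windows are all assumptions made for illustration and may not match what the asynchronous API actually returns.

```python
def segment_bounds(duration_s, segment_s=30.0):
    """Return (start, end) pairs, in seconds, covering the whole file
    in consecutive windows of at most `segment_s` seconds."""
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration_s and segment_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds

# Hypothetical index records for a 75-second file: one per segment.
records = [
    {"filename": "interview.mp3", "segment": i, "start_s": s, "end_s": e}
    for i, (s, e) in enumerate(segment_bounds(75.0))
]
```

Storing the start and end offsets next to each embedding lets a search result point to the matching moment inside a long recording rather than to the file as a whole.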

Requests specify taskType (SINGLE_EMBEDDING or SEGMENTED_EMBEDDING), embeddingPurpose (GENERIC_INDEX for content, GENERIC_RETRIEVAL for queries, DOCUMENT_RETRIEVAL for documents), embeddingDimension, and truncationMode.
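A request body built from these fields might look like the sketch below. Only the field names taskType, embeddingPurpose, embeddingDimension, and truncationMode come from the description above; the nesting under singleEmbeddingParams, the text/value layout, the "END" truncation value, and the helper name build_embedding_request are assumptions, and a real request may require additional fields. Check the Bedrock documentation for the exact schema before relying on it.

```python
import json

ALLOWED_DIMENSIONS = {3072, 1024, 384, 256}

def build_embedding_request(text, dimension=1024, purpose="GENERIC_RETRIEVAL"):
    """Serialize a text-query embedding request as a JSON string."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimension must be one of {sorted(ALLOWED_DIMENSIONS)}")
    body = {
        "taskType": "SINGLE_EMBEDDING",
        # Assumed nesting: parameters grouped under a per-task key.
        "singleEmbeddingParams": {
            "embeddingPurpose": purpose,
            "embeddingDimension": dimension,
            # Assumed layout and value for the text input and truncationMode.
            "text": {"truncationMode": "END", "value": text},
        },
    }
    return json.dumps(body)

request_json = build_embedding_request("upbeat jazz piano", dimension=256)
```

The resulting string would be sent as the body of a synchronous invoke_model call. Content being indexed would use GENERIC_INDEX as the purpose instead, so that queries and stored items are embedded for their respective roles.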

Search Mechanism

Similarity measurement uses cosine similarity between embedding vectors:

similarity = (v₁ · v₂) / (||v₁|| × ||v₂||)

Values range from -1 to 1, with higher values indicating greater semantic similarity. Vector databases typically convert this to a distance (1 − similarity) for k-NN searches and return the k embeddings closest to the query.
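The formula and the distance conversion can be sketched as a brute-force search in plain Python. The three-dimensional vectors and file names are made-up toy data, and top_k is a hypothetical helper; production vector databases generally use approximate nearest-neighbor indexes rather than scanning every vector as this example does.

```python
import math

def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (n1 * n2)

def top_k(query, index, k=2):
    """Return the k (name, distance) pairs with the smallest cosine distance."""
    scored = [(1.0 - cosine_similarity(query, vec), name)
              for name, vec in index.items()]
    scored.sort()
    return [(name, dist) for dist, name in scored[:k]]

# Toy index: real embeddings would have 256 to 3,072 dimensions.
index = {
    "jazz_piano.wav": [0.9, 0.1, 0.0],
    "rain_ambience.wav": [0.0, 0.2, 0.9],
    "swing_trio.wav": [0.8, 0.3, 0.1],
}
results = top_k([1.0, 0.1, 0.0], index, k=2)
```

With this toy data the two jazz-like vectors rank ahead of the rain recording, because they point in nearly the same direction as the query.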

The approach captures acoustic similarity beyond text transcription. While traditional speech-to-text and metadata tagging focus on linguistic content, audio embeddings encode tone, emotion, musical characteristics, and environmental sounds—enabling users to find audio by acoustic properties rather than spoken words alone.

What This Means

Amazon positions Nova Multimodal Embeddings as a unified solution for cross-modal retrieval, removing the need for a separate embedding model per modality. Audio search addresses a gap in content libraries where manual transcription and speech-to-text methods miss acoustic nuance. Matryoshka Representation Learning reduces operational costs because embeddings can be shortened to a smaller dimension without reprocessing the source content, a practical advantage for large-scale deployments. The synchronous and asynchronous modes separate low-latency query embedding from batch indexing, so each workload uses an API pattern suited to it. Organizations building audio search now have production-ready infrastructure within Bedrock's managed environment.
