Amazon Nova Multimodal Embeddings adds audio search capabilities to Bedrock
Amazon Nova Multimodal Embeddings, announced October 28, 2025, now supports audio content for semantic search alongside text, images, and video. The model offers four embedding dimension options (3,072, 1,024, 384, 256) and uses Matryoshka Representation Learning to balance accuracy with storage efficiency.
Amazon has expanded Nova Multimodal Embeddings to support audio content, enabling semantic search across audio libraries through unified cross-modal retrieval. The model, available in Amazon Bedrock, processes audio alongside text, documents, images, and video through a single model architecture.
Audio Embedding Architecture
Amazon Nova generates audio embeddings as float32 arrays in four dimension sizes: 3,072 (default), 1,024, 384, and 256. The model uses Matryoshka Representation Learning (MRL), a hierarchical training scheme that allows embeddings to be truncated without reprocessing. A full 3,072-dimension embedding contains information at all scales, so users can keep just the first 256 dimensions and retain most of the accuracy, trading a small accuracy loss for lower storage and compute costs.
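The truncation property can be illustrated with a minimal sketch. The vector here is random stand-in data, not real model output; the key point is that an MRL prefix slice is itself a usable embedding once re-normalized:

```python
import numpy as np

def truncate_embedding(vec, dim):
    # Matryoshka-trained embeddings keep the coarsest information in the
    # leading dimensions, so a prefix slice is itself a usable embedding.
    small = vec[:dim]
    # Re-normalize so cosine similarity stays comparable across sizes.
    return small / np.linalg.norm(small)

# Hypothetical full-size embedding at the default 3,072 dimensions.
rng = np.random.default_rng(0)
full = rng.standard_normal(3072).astype(np.float32)

for dim in (1024, 384, 256):
    e = truncate_embedding(full, dim)
    print(dim, e.shape)
```

Because the slice happens client-side, switching from 3,072 to 256 dimensions requires no new API calls against already-indexed content.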
Audio embeddings encode both acoustic and semantic features: rhythm, pitch, timbre, emotional tone, and semantic meaning. The model processes audio as mel-spectrograms or learned audio features rather than raw waveforms, using temporal convolutional networks or transformer architectures to capture spectro-temporal patterns. Individual audio segments up to 30 seconds preserve temporal context and long-range acoustic dependencies.
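To make the mel-spectrogram front end concrete, here is a self-contained sketch of the standard transform (framing, windowing, power spectrum, triangular mel filters). This is a generic illustration of the technique, not Amazon's actual preprocessing pipeline, and all parameter values (16 kHz sample rate, 1,024-sample frames, 40 mel bands) are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=1024, hop=512, n_mels=40):
    # Slice the signal into overlapping windowed frames.
    frames = np.stack([signal[s:s + n_fft]
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    return power @ mel_filterbank(sr, n_fft, n_mels).T  # (frames, n_mels)

sr = 16000
t = np.arange(sr) / sr                      # one second of audio
spec = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(spec.shape)                           # 30 frames x 40 mel bands
```

The resulting time-frequency grid is the kind of input a convolutional or transformer encoder consumes to capture the spectro-temporal patterns described above.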
Two API Modes
Amazon Nova provides synchronous and asynchronous embedding generation:
Synchronous API (invoke_model): For real-time queries. Users submit search text like "upbeat jazz piano" or an audio clip and receive embeddings within milliseconds for k-nearest-neighbor searches against a vector database.
Asynchronous API: For batch processing. Audio files are uploaded to Amazon S3, and the model automatically segments files over 30 seconds, attaching temporal metadata to each segment. Embeddings are stored in a vector database with metadata (filename, duration, genre) for one-time indexing.
Requests specify taskType (SINGLE_EMBEDDING or SEGMENTED_EMBEDDING), embeddingPurpose (GENERIC_INDEX for content, GENERIC_RETRIEVAL for queries, DOCUMENT_RETRIEVAL for documents), embeddingDimension, and truncationMode.
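A request body for a synchronous query might be assembled as below. The field names follow the announcement, but the exact nesting of the real request schema and the model ID are assumptions; check the Bedrock documentation before use:

```python
import json

# Hypothetical model ID -- verify the actual value in the Bedrock console.
MODEL_ID = "amazon.nova-multimodal-embeddings-v1:0"

def build_query_request(text: str, dimension: int = 1024) -> str:
    body = {
        "taskType": "SINGLE_EMBEDDING",
        "embeddingPurpose": "GENERIC_RETRIEVAL",  # marks this as a search query
        "embeddingDimension": dimension,
        "truncationMode": "END",                  # assumed enum value
        "text": text,
    }
    return json.dumps(body)

# With boto3, this body would be passed to
# bedrock_runtime.invoke_model(modelId=MODEL_ID, body=req).
req = build_query_request("upbeat jazz piano")
print(req)
```

For indexing content rather than querying, the same structure would use GENERIC_INDEX (or DOCUMENT_RETRIEVAL for documents) as the embeddingPurpose.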
Search Mechanism
Similarity measurement uses cosine similarity between embedding vectors:
similarity = (v₁ · v₂) / (||v₁|| × ||v₂||)
Values range from -1 to 1, with higher values indicating greater semantic similarity. Vector databases convert this to distance (1 − similarity) for k-NN searches, retrieving top-k most similar embeddings.
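The similarity and distance computations above are a few lines of NumPy:

```python
import numpy as np

def cosine_similarity(v1, v2):
    # similarity = (v1 . v2) / (||v1|| * ||v2||), in [-1, 1]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def cosine_distance(v1, v2):
    # Distance form used by vector databases for k-NN retrieval, in [0, 2].
    return 1.0 - cosine_similarity(v1, v2)

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction, different magnitude
c = np.array([-3.0, 0.0])  # opposite direction

print(cosine_similarity(a, b))  # 1.0
print(cosine_distance(a, c))    # 2.0
```

Because cosine similarity ignores vector magnitude, a short query embedding and a long indexed clip's embedding compare on direction alone, which is what makes cross-modal retrieval between text queries and audio content workable.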
The approach captures acoustic similarity beyond text transcription. While traditional speech-to-text and metadata tagging focus on linguistic content, audio embeddings encode tone, emotion, musical characteristics, and environmental sounds—enabling users to find audio by acoustic properties rather than spoken words alone.
What This Means
Amazon positions Nova Multimodal Embeddings as a unified solution for cross-modal retrieval, removing the need for separate embedding models per modality. The inclusion of audio search addresses a gap in content libraries where manual transcription and speech-to-text methods miss acoustic nuance. Matryoshka learning reduces operational costs by avoiding reprocessing when adjusting embedding dimensions—a practical advantage for large-scale deployments. The synchronous/asynchronous dual-mode design separates real-time search latency from batch indexing, aligning API patterns with actual workload requirements. Organizations building audio search now have production-ready infrastructure within Bedrock's managed environment.