Stability AI Releases Stable Audio 3 Medium: 2B-Parameter Audio Generation Model with 380-Second Output in Under 2 Seconds
Stability AI has released Stable Audio 3 Medium, a 2-billion-parameter latent diffusion model capable of generating variable-length audio up to 380 seconds. The model, trained on about 1.28 million licensed and Creative Commons audio recordings, generates music and sound effects in less than 2 seconds on an H200 GPU.
Stability AI has released Stable Audio 3 Medium, a 2 billion parameter latent diffusion model that generates music and sound effects in variable lengths up to 380 seconds (6+ minutes). According to Stability AI, the model produces audio in under 2 seconds on an H200 GPU and "a few seconds" on a MacBook Pro M4.
Stable Audio 3 Medium is the mid-sized model in the three-tier Stable Audio 3 family (small, medium, large) of fast latent diffusion models designed for deployment on consumer-grade hardware.
Technical Architecture
The model operates on a novel semantic-acoustic autoencoder that compresses audio into a compact latent space, enabling efficient generation while preserving audio fidelity. Stability AI claims the architecture encourages semantic structure in the latent representation.
The model underwent adversarial post-training to reduce inference steps while improving generation quality and prompt adherence. It requires 8 diffusion steps at inference time using a "pingpong" sampler, with a CFG scale of 1.0.
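For context on the CFG scale of 1.0: in the standard classifier-free guidance formulation, the guided prediction is computed as uncond + scale * (cond - uncond), so a scale of 1.0 reduces to the conditional prediction alone. The sketch below illustrates that general formula with plain Python lists. The function name `cfg_combine` and the toy values are illustrative assumptions and are not taken from Stability AI's inference code.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Toy values standing in for a model's conditional and unconditional predictions.
cond = [0.5, -1.0, 2.0]
uncond = [0.0, 0.25, 1.0]

# At scale 1.0 the unconditional term cancels out.
guided_1 = cfg_combine(cond, uncond, scale=1.0)

# A larger scale pushes the result further away from the unconditional prediction.
guided_3 = cfg_combine(cond, uncond, scale=3.0)

print(guided_1)
print(guided_3)
```

Because the unconditional term cancels at a scale of 1.0, a sampler can in principle skip the unconditional forward pass. Whether this model's inference code does so is not stated in the announcement.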
Text conditioning uses Google's pre-trained T5Gemma model (t5gemma-b-b-ul2), which is redistributed under separate Gemma Terms of Use.
Training Data
The model was trained on 1,278,902 audio recordings:
- 806,284 recordings licensed from AudioSparx
- 472,618 recordings from Freesound (266,324 CC-0, 194,840 CC-BY, 11,454 CC-Sampling+)
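As a quick consistency check, the reported counts can be summed to confirm that the per-license Freesound figures and the two source totals add up to the stated 1,278,902 recordings. The snippet below uses only the numbers quoted above; the variable names and the `share` helper are illustrative.

```python
# Counts as reported in the article.
audiosparx = 806_284
freesound_by_license = {
    "CC-0": 266_324,
    "CC-BY": 194_840,
    "CC-Sampling+": 11_454,
}

freesound_total = sum(freesound_by_license.values())
total = audiosparx + freesound_total

def share(part, whole):
    """Percentage of `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(f"Freesound total: {freesound_total:,}")
print(f"Overall total:   {total:,}")
print(f"AudioSparx share: {share(audiosparx, total)}%")
print(f"Freesound share:  {share(freesound_total, total)}%")
```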
Stability AI reports that music recordings in the Freesound portion were identified using PANNs tagging and sent to a content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed.
Key Capabilities
The model supports:
- Variable-length audio generation (outputs of up to 380 seconds demonstrated)
- Audio inpainting for targeted editing
- Continuation of short recordings
- BPM-specific music generation
- Style and mood control through text prompts
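The announcement does not describe how inpainting and continuation are implemented. One common way to frame both tasks is with a keep-mask over time frames: frames marked 1 are kept from the source recording, and frames marked 0 are filled in by the model. The sketch below shows that generic idea on toy data. The helper names `keep_mask` and `blend` are hypothetical, and the `generated` list is a stand-in for model output, not a call into any Stable Audio library.

```python
def keep_mask(num_frames, regenerate_start, regenerate_end):
    """Return 1 for frames to keep and 0 for frames the model should regenerate."""
    return [0 if regenerate_start <= i < regenerate_end else 1
            for i in range(num_frames)]

def blend(original, generated, mask):
    """Take kept frames from `original` and the remaining frames from `generated`."""
    return [o if m else g for o, g, m in zip(original, generated, mask)]

original = [10, 11, 12, 13, 14, 15]
generated = [90, 91, 92, 93, 94, 95]  # stand-in for model output

# Inpainting: regenerate frames 2 and 3 in the middle of the clip.
inpainted = blend(original, generated, keep_mask(6, 2, 4))

# Continuation: keep the first 3 frames and generate everything after them.
continued = blend(original, generated, keep_mask(6, 3, 6))

print(inpainted)
print(continued)
```

Under this framing, continuation is inpainting with the regenerated region placed at the end of the clip.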
Availability and Licensing
Stable Audio 3 Medium is available on Hugging Face under the Stability AI Community License. Commercial use requires a separate license from Stability AI. Users must accept both the Stability AI license and the Gemma Terms of Use, including the use restrictions in Section 3.2.
Inference code is available through two libraries: the stable-audio-3 inference library and the stable-audio-tools research library. The model weights are distributed in FP32 format.
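To give a rough sense of what FP32 weights imply for memory, the back-of-the-envelope estimate below multiplies a parameter count by the bytes per parameter. It assumes exactly 2,000,000,000 parameters (the nominal "2B" figure; the exact count, and which components it covers, are not given in the source). The FP16 row is a hypothetical comparison only, since the source mentions only FP32 weights. Activations and runtime overhead are not included.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def weight_size_gib(num_params, dtype):
    """Approximate size of the raw weights in GiB (weights only, no activations)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# Assumption: the nominal "2B" figure taken as exactly 2e9 parameters.
num_params = 2_000_000_000

fp32_gib = weight_size_gib(num_params, "fp32")
fp16_gib = weight_size_gib(num_params, "fp16")  # hypothetical half-precision copy

print(f"FP32 weights: ~{fp32_gib:.2f} GiB")
print(f"FP16 weights: ~{fp16_gib:.2f} GiB")
```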
What This Means
Stable Audio 3 Medium's claimed sub-2-second generation times could enable near-real-time workflows for sound design and music production. At 2B parameters, the model is positioned for deployment on consumer hardware, though actual performance will depend on available GPU memory and compute. Variable-length generation addresses a key limitation of fixed-length audio models, which spend compute on a full-length clip even when only a short sound effect is needed. Commercial users should note the dual licensing requirement and review the Section 3.2 use restrictions in the Gemma Terms of Use before deployment.