Hume AI open-sources TADA: speech model 5x faster than rivals with zero hallucination
Hume AI has open-sourced TADA, a speech generation model that maps exactly one audio signal to each text token, achieving 5x faster processing than comparable systems. The model produced zero transcription hallucinations across 1,000+ test samples and is compact enough to run on smartphones; it is available in 1B and 3B parameter versions under the MIT license.
Hume AI has open-sourced TADA, an AI system for speech generation that synchronizes text and audio processing without the overhead of previous approaches.
Key Technical Specifications
The core innovation: TADA maps exactly one audio signal to each text token. This contrasts with previous systems that generate multiple audio frames per text token, introducing latency and complexity.
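The one-to-one mapping can be sketched in a few lines. This is our illustration of the general idea, not TADA's actual API: `predict_audio_token` is a hypothetical stand-in for the model's decoding step.

```python
# Hypothetical sketch of one-to-one text/audio interleaving (not TADA's real
# interface). Each decoding step emits exactly one audio token per text token,
# so audio length is locked to text length by construction.

def synthesize(text_tokens, predict_audio_token):
    """Pair every text token with exactly one audio token."""
    audio_tokens = []
    for t in text_tokens:
        # The model conditions on the current text token and the audio so far.
        audio_tokens.append(predict_audio_token(t, audio_tokens))
    assert len(audio_tokens) == len(text_tokens)  # one-to-one by construction
    return audio_tokens
```

Frame-per-token systems, by contrast, emit a variable number of audio frames per text token, and that variable-length inner loop is where both latency and skipped or duplicated words can creep in.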
The model comes in two sizes:
- 1B parameters: English only
- 3B parameters: English plus seven additional languages (specific languages not detailed in announcement)
Both versions are based on Llama and released under the MIT license, with code and models available on GitHub and Hugging Face.
Performance Claims
According to Hume AI:
- Speed: Over 5x faster than comparable systems
- Hallucination rate: Zero transcription hallucinations across 1,000+ test samples (no made-up or skipped words compared to source text)
- Naturalness: 3.78 out of 5 in human evaluations
- Device compatibility: Compact enough to run on smartphones
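The hallucination claim is defined as no made-up or skipped words relative to the source text. A check of that kind could be run roughly as follows (our illustration, not Hume AI's published evaluation protocol): transcribe the generated audio, then count inserted and deleted words against the input text.

```python
# Minimal sketch of a zero-hallucination check: align source text against a
# transcript of the generated audio and count inserted (made-up) and deleted
# (skipped) words. Substitutions count toward both tallies in this sketch.
import difflib

def hallucination_counts(source_text: str, transcript: str) -> dict:
    src = source_text.lower().split()
    hyp = transcript.lower().split()
    inserted = deleted = 0
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=hyp).get_opcodes():
        if op in ("insert", "replace"):
            inserted += j2 - j1  # words in the transcript not in the source
        if op in ("delete", "replace"):
            deleted += i2 - i1   # source words missing from the transcript
    return {"inserted": inserted, "deleted": deleted}
```

A test sample would pass the zero-hallucination bar when both counts are zero.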
Known Limitations
Hume AI notes that on longer texts the voice can occasionally drift, indicating potential stability issues in extended audio generation.
Availability
All code, models, and technical details are publicly available. The full technical paper has been published alongside the release.
What this means
TADA addresses two critical issues in speech synthesis: latency and hallucination. The one-to-one token-to-audio mapping is architecturally simpler than existing approaches, explaining both the speed advantage and the zero hallucination rate. The smartphone compatibility removes a practical barrier for deployment. However, the voice drift issue on longer texts suggests the model works best for short-form speech generation. The MIT license and open-source release position this as infrastructure for downstream applications rather than a consumer product, and its Llama foundation means it inherits that ecosystem's community tools and fine-tuning approaches.
Related Articles
Nvidia releases Nemotron 3 Super: 120B MoE model with 1M token context
Nvidia has released Nemotron 3 Super, a 120-billion parameter hybrid Mamba-Transformer Mixture-of-Experts model that activates only 12 billion parameters during inference. The open-weight model features a 1-million token context window, multi-token prediction capabilities, and pricing at $0.10 per million input tokens and $0.50 per million output tokens.
Stability AI releases Stable Audio 2.5 for enterprise sound production
Stability AI released Stable Audio 2.5, positioned as the first audio generation model built specifically for enterprise sound production. The model introduces improvements in quality and control for creating dynamic compositions adaptable to custom brand needs.
Stable Video 4D 2.0 generates 4D assets from single videos with improved quality
Stability AI has released Stable Video 4D 2.0 (SV4D 2.0), an upgraded version of its multi-view video diffusion model designed to generate 4D assets from single object-centric videos. The update claims to deliver higher-quality outputs on real-world video footage.
Stability AI releases Stable Audio Open Small for on-device audio generation with Arm
Stability AI has open-sourced Stable Audio Open Small in partnership with Arm, a smaller and faster variant of its text-to-audio model designed for on-device deployment. The model maintains output quality and prompt adherence while reducing computational requirements for real-world edge deployment on devices powered by Arm's technology, which runs on 99% of smartphones globally.