Hume AI open-sources TADA: speech model 5x faster than rivals with zero hallucination
Hume AI has open-sourced TADA, a speech generation model that maps exactly one audio signal to each text token, achieving 5x faster processing than comparable systems. The model produced zero transcription hallucinations across 1,000+ test samples, is compact enough to run on smartphones, and is available in 1B and 3B parameter versions under an MIT license.
Hume AI has open-sourced TADA, an AI system for speech generation that synchronizes text and audio processing without the overhead of previous approaches.
Key Technical Specifications
The core innovation: TADA maps exactly one audio signal to each text token. This contrasts with previous systems that generate multiple audio frames per text token, introducing latency and complexity.
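A minimal sketch of what that difference looks like in a decode loop may help; everything here is illustrative, not Hume AI's actual API:

```python
# Illustrative sketch only: "model" and its .step() method are hypothetical
# stand-ins, not TADA's real interface.

def decode_one_to_one(model, text_tokens):
    """TADA-style loop: each step consumes one text token and emits
    exactly one audio token, so the two streams stay aligned."""
    audio_tokens = []
    for t in text_tokens:
        audio_tokens.append(model.step(t))  # one audio output per text token
    return audio_tokens

def decode_multi_frame(model, text_tokens, frames_per_token=4):
    """Conventional loop: each text token expands into several audio
    frames, so alignment must be tracked and more steps are needed."""
    audio_frames = []
    for t in text_tokens:
        for _ in range(frames_per_token):
            audio_frames.append(model.step(t))
    return audio_frames
```

With one output per input token, the first loop does a quarter of the work of the second at `frames_per_token=4`, which is the intuition behind the claimed speed advantage.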
The model comes in two sizes:
- 1B parameters: English only
- 3B parameters: English plus seven additional languages (the specific languages are not detailed in the announcement)
Both versions are based on Llama and released under an MIT license, with code and models available on GitHub and Hugging Face.
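Since the weights are hosted on Hugging Face, fetching a checkpoint locally should work with the standard hub tooling. The repository ID below is a guess for illustration, not a confirmed path:

```python
# Hypothetical repo id; check Hume AI's Hugging Face page for the real one.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="HumeAI/tada-1b")  # downloads all model files
print(local_dir)
```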
Performance Claims
According to Hume AI:
- Speed: Over 5x faster than comparable systems
- Hallucination rate: Zero transcription hallucinations across 1,000+ test samples, meaning no made-up or skipped words compared to the source text (see the measurement sketch after this list)
- Naturalness: 3.78 out of 5 in human evaluations
- Device compatibility: Compact enough to run on smartphones
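The hallucination claim is the kind of thing that can be checked with a simple word-level alignment: transcribe the generated audio with an ASR system, then count words that were inserted or dropped relative to the source text. A minimal sketch, assuming you already have the ASR transcript (Hume AI's exact evaluation protocol is not described in the announcement):

```python
import difflib

def hallucination_counts(source_text: str, asr_transcript: str):
    """Count words the transcript adds (made up) or misses (skipped)
    relative to the source text, via word-level sequence alignment."""
    src = source_text.lower().split()
    hyp = asr_transcript.lower().split()
    matcher = difflib.SequenceMatcher(a=src, b=hyp)
    skipped = inserted = 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            skipped += i2 - i1      # source words missing from the audio
        if op in ("insert", "replace"):
            inserted += j2 - j1     # words in the audio not in the source
    return {"inserted": inserted, "skipped": skipped}

# A clean sample scores zero on both counts.
print(hallucination_counts("the quick brown fox", "the quick brown fox"))
# -> {'inserted': 0, 'skipped': 0}
```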
Known Limitations
Hume AI notes that longer texts can cause the voice to occasionally drift, indicating potential stability issues with extended audio generation.
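A common workaround for this kind of drift is to split long inputs at sentence boundaries and synthesize each chunk separately. In the sketch below, `synthesize` is a hypothetical stand-in for whatever TADA's real inference call turns out to be, stubbed with silence so the example runs on its own:

```python
import re
import numpy as np

SAMPLE_RATE = 24_000  # assumed output rate; check the released model config

def synthesize(sentence: str) -> np.ndarray:
    """Hypothetical stand-in for TADA inference; returns silence here
    so the sketch runs without the real model."""
    return np.zeros(int(0.5 * SAMPLE_RATE), dtype=np.float32)

def synthesize_long(text: str) -> np.ndarray:
    """Generate long-form audio sentence by sentence to limit voice drift,
    then concatenate the chunks into one waveform."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return np.concatenate([synthesize(s) for s in sentences])

audio = synthesize_long("First sentence. Second sentence! Third sentence?")
print(audio.shape)
```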
Availability
All code, models, and technical details are publicly available. The full technical paper has been published alongside the release.
What this means
TADA addresses two critical issues in speech synthesis: latency and hallucination. The one-to-one token-to-audio mapping is architecturally simpler than existing approaches, which plausibly accounts for both the speed advantage and the zero hallucination rate. Smartphone compatibility removes a practical barrier to deployment. However, the voice drift on longer texts suggests the model works best for short-form speech generation. The MIT license and open-source release position this as infrastructure for downstream applications rather than a consumer product, and its Llama foundation means it inherits that ecosystem's community tools and fine-tuning approaches.