Supertone releases Supertonic 3: 99M-parameter on-device TTS model supporting 31 languages
Supertone has released Supertonic 3, a 99M-parameter text-to-speech model that runs entirely on-device using ONNX Runtime. The model expands language support from 5 languages in Supertonic 2 to 31, requires no GPU, and, according to Supertone, delivers competitive accuracy against models 7-20x larger.
Technical Specifications
- Parameters: 99 million across ONNX assets
- Languages: 31 (expanded from 5 in Supertonic 2)
- Inference: CPU-only via ONNX Runtime, no cloud calls required
- Model type: Text-to-speech
- License: OpenRAIL-M for model weights, MIT for sample code
Performance Claims
According to Supertone, Supertonic 3 achieves competitive word error rates (WER) and character error rates (CER) against larger open-source TTS models such as VoxCPM2, with the compared models ranging from 0.7B to 2B parameters. The company's benchmark comparisons also show the model running faster on CPU than the larger baselines do on an A100 GPU.
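For readers unfamiliar with the metrics, the sketch below shows how WER and CER are conventionally computed: the edit distance between a reference and a hypothesis, divided by the reference length. For TTS, the hypothesis is usually an ASR transcript of the synthesized audio and the reference is the input text. The article does not describe Supertone's evaluation setup, so the function names and the choice to count spaces as characters are illustrative assumptions, not the company's method.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces count as characters here (conventions vary)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

A transcript that drops one word from a six-word reference gives a WER of 1/6. Both metrics can exceed 1.0 when the hypothesis contains many insertions.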
Supertone claims improvements over version 2 in three areas: fewer repeat and skip failures during reading, higher speaker similarity across the languages both versions share, and a roughly sixfold expansion in language coverage (from 5 to 31).
New Features
- Expression tags: Supports <laugh>, <breath>, and <sigh> tags for expressive synthesis
- Improved stability: Fewer reading errors on both short and long text inputs
- 31 languages: English, Korean, Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Spanish, Estonian, Finnish, French, Hindi, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, Vietnamese
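The article names the three expression tags but not how they are placed in the input. The sketch below assumes they appear inline in the text string, written exactly as listed, and checks a prompt for tags outside that set before it is sent to the model. The helper name `unsupported_tags` and the case-sensitive matching are hypothetical choices for illustration, not part of the Supertonic SDK.

```python
import re

# Tag names taken from the feature list above.
SUPPORTED_TAGS = {"laugh", "breath", "sigh"}

# Assumption: tags are written inline as <name>, with no attributes or closing tag.
TAG_PATTERN = re.compile(r"<([A-Za-z_]+)>")


def unsupported_tags(text):
    """Return tag names in text that are not supported, in order of first appearance."""
    seen = []
    for name in TAG_PATTERN.findall(text):
        if name not in SUPPORTED_TAGS and name not in seen:
            seen.append(name)
    return seen
```

A pre-check like this catches typos such as `<cough>` early, rather than leaving it to the model to read the tag aloud or silently ignore it.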
Deployment
The model ships as ONNX assets and runs through a Python SDK. Users can install via pip install supertonic and generate speech locally. The SDK auto-downloads model assets from Hugging Face on first run.
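```python
from supertonic import TTS

tts = TTS(auto_download=True)  # fetches model assets from Hugging Face on first run
style = tts.get_voice_style(voice_name="M1")
text = "Hello from Supertonic 3."  # example input
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
```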
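The snippet above stops once the waveform is in memory. As a minimal sketch of the next step, the helper below writes mono samples to a 16-bit PCM WAV file using only the Python standard library. It assumes the samples are a flat sequence of floats in the range [-1.0, 1.0]; the article does not state the shape, dtype, or sample rate of the `wav` value the SDK returns, so those must be confirmed against the SDK documentation, and `save_wav` is a hypothetical name rather than an SDK function.

```python
import struct
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against out-of-range values
        frames += struct.pack("<h", int(clipped * 32767))  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))
```

If the SDK returns a multi-dimensional array, it would need to be flattened to one dimension before being passed in.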
What This Means
Supertonic 3 targets the growing demand for privacy-preserving, on-device AI inference. At 99M parameters, the model is 7-20x smaller than the 0.7B to 2B open TTS systems it is compared against, making it practical for browser and edge deployment where GPU access is limited or unavailable. The CPU-only requirement and compact model size address real constraints in mobile and embedded applications.
The 31-language support positions Supertonic 3 as a lightweight alternative to larger multilingual TTS systems. However, without independent benchmarks, it remains unclear how the model's accuracy-size tradeoff compares to cloud-based alternatives or other on-device TTS solutions across different hardware profiles and use cases.
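The 7-20x size figure follows directly from the parameter counts quoted in the article, as the short calculation below shows. It also estimates weights-only storage at a few common numeric precisions. The article does not say which precision the shipped ONNX assets use, so the bytes-per-parameter values are assumptions for illustration, and the estimates ignore graph metadata and runtime memory.

```python
SUPERTONIC_PARAMS = 99e6        # parameter count stated in the article
BASELINE_PARAMS = (0.7e9, 2e9)  # range cited for the larger compared models


def size_ratio(baseline_params, model_params=SUPERTONIC_PARAMS):
    """How many times larger a baseline is than the model."""
    return baseline_params / model_params


def weight_megabytes(params, bytes_per_param):
    """Weights-only storage in MB (10^6 bytes)."""
    return params * bytes_per_param / 1e6


ratios = [round(size_ratio(p), 1) for p in BASELINE_PARAMS]
print(ratios)  # [7.1, 20.2]

# Assumed precisions: 4 bytes (float32), 2 bytes (float16), 1 byte (int8).
for label, nbytes in (("float32", 4), ("float16", 2), ("int8", 1)):
    print(label, weight_megabytes(SUPERTONIC_PARAMS, nbytes), "MB")
```

The same parameter count can therefore mean anywhere from about 99 MB to about 396 MB of weights depending on precision, which is worth checking against the actual asset sizes before planning a mobile or browser deployment.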