on-device

5 articles tagged with on-device

May 10, 2026
model release

Supertone releases Supertonic 3: 99M-parameter on-device TTS model supporting 31 languages

Supertone has released Supertonic 3, a 99M-parameter text-to-speech model that runs entirely on-device using ONNX Runtime. It expands language support from 5 languages in Supertonic 2 to 31, requires no GPU, and, according to Supertone, is competitive in accuracy with models 7-20x larger.

model release, Google DeepMind

Google DeepMind Releases Gemma 4 E4B with Multi-Token Prediction for 2x Faster Inference

Google DeepMind has released the Gemma 4 E4B assistant model, which uses a Multi-Token Prediction (MTP) architecture to accelerate inference by up to 2x through speculative decoding. The model has 4.5B effective parameters, supports a 128K context window, and accepts text, image, and audio input. Pricing has not yet been disclosed.

May 7, 2026
model release

Zyphra Releases ZAYA1-8B: 8.4B Parameter MoE Model with 760M Active Parameters Matches 80B+ Models on Math Benchmarks

Zyphra has released ZAYA1-8B, a mixture-of-experts language model with 760M active parameters and 8.4B total parameters. The model scores 89.1% on AIME 2026, competitive with models exceeding 100B parameters, while maintaining efficiency for on-device deployment.

May 6, 2026
model release, Google DeepMind

Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters

Google DeepMind has released Gemma 4, a family of open-weight multimodal models that includes a 31B dense model with a 256K context window and spans four size variants, from 2.3B to 30.7B effective parameters. The release also includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.
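
For readers unfamiliar with why speculative decoding can leave output unchanged, here is a minimal sketch of the greedy variant. It is not Gemma 4's MTP implementation: `target_next`, `draft_next`, and the toy models are hypothetical stand-ins, and a real system would verify all drafted positions in one batched forward pass of the large model rather than one call per token.

```python
def greedy_decode(target_next, prompt, num_tokens):
    """Baseline: the large (target) model generates one token per call."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding with a cheap draft model (illustrative only)."""
    out = list(prompt)
    goal = len(prompt) + num_tokens
    while len(out) < goal:
        # 1. The draft model cheaply proposes k tokens in a row.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))

        # 2. The target model checks each proposed position in order.
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            accepted.append(expected)  # always keep the target's own token
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every drafted token matched, so the target adds one more.
            accepted.append(target_next(out + accepted))

        out.extend(accepted)
    return out[:goal]


# Hypothetical toy "models" over integer tokens.
def toy_target(seq):
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq):
    # Agrees with the target except after token 5.
    return 0 if seq[-1] == 5 else toy_target(seq)


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [1], 10))
    print(greedy_decode(toy_target, [1], 10))
```

Because only tokens the target model itself would have chosen are kept, the two functions return the same sequence. Any speedup comes from how often the draft's guesses are accepted.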

April 4, 2026
model release

PrismML releases 1-bit Bonsai 8B model, claims 14x smaller and 5x more energy efficient than full-precision peers

PrismML, a Caltech-founded startup, has released Bonsai 8B, a 1-bit quantized large language model that the company claims is 14x smaller and 5x more energy efficient than full-precision counterparts while remaining competitive with standard 8B models. The model fits into 1.15 GB of memory and uses a novel 1-bit weight representation (one binary sign per weight, with a scale factor shared across each weight group) instead of traditional 16-bit or 32-bit precision.
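
The "binary signs with shared scale factors" idea can be sketched as follows. This is an illustration under stated assumptions, not PrismML's actual method: the function names, the group size, and the choice of the mean absolute value as the group scale are all hypothetical.

```python
def quantize_1bit(weights, group_size=4):
    """Keep one sign per weight and one shared scale per group of weights."""
    signs, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Assumed scale choice: mean absolute value of the group.
        scales.append(sum(abs(w) for w in group) / len(group))
        signs.extend(1 if w >= 0 else -1 for w in group)
    return signs, scales


def dequantize_1bit(signs, scales, group_size=4):
    """Reconstruct approximate weights as sign * group scale."""
    return [s * scales[i // group_size] for i, s in enumerate(signs)]


def bits_per_weight(group_size, scale_bits=16):
    """Storage cost: 1 sign bit per weight plus the amortized group scale."""
    return 1 + scale_bits / group_size


if __name__ == "__main__":
    w = [0.5, -1.5, 1.0, -1.0, 0.25, 0.25, -0.25, -0.25]
    signs, scales = quantize_1bit(w)
    print(signs, scales)
    print(dequantize_1bit(signs, scales))
    print(bits_per_weight(128))
```

The storage overhead of the shared scales shrinks as groups grow. For example, a hypothetical group size of 128 with 16-bit scales costs 1.125 bits per weight, compared with 16 bits per weight at half precision.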