inference-speed
3 articles tagged with inference-speed
Google releases DiffusionGemma 26B, open-weight model generates 500+ tokens/second
Google has released DiffusionGemma 26B, an open-weight text generation model under the Apache 2.0 license. In testing on NVIDIA's free NIM API, the model produced 2,409 tokens in 4.4 seconds, a rate of over 500 tokens/second.
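The throughput figure follows directly from the two measurements quoted in the summary. A minimal sketch of that arithmetic, where the helper name `tokens_per_second` is only illustrative and not part of any API:

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation throughput over a single request."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Figures quoted in the summary: 2,409 tokens in 4.4 seconds.
rate = tokens_per_second(2409, 4.4)
print(f"{rate:.1f} tokens/second")
```

This is an end-to-end average for one request, so it presumably includes any time spent before the first token arrived; sustained decoding speed could differ.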
Google DeepMind's Gemini 3.1 Flash-Lite generates websites in real time, 2.5x faster than predecessor
Google DeepMind released Gemini 3.1 Flash-Lite, a model that generates functional websites in real time through a new pseudo-browser demo. The model delivers its first response token 2.5 times faster than Gemini 2.5 Flash and outputs over 360 tokens per second, though output pricing has risen from $0.40 to $1.50 per million tokens, a 3.75x increase.
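To put the price change in concrete terms, the sketch below computes the ratio between the two quoted output prices and the cost of a sample workload. The 10-million-token workload and the helper name `output_cost` are assumptions chosen for illustration, and the calculation covers output tokens only.

```python
OLD_PRICE_PER_MILLION = 0.40  # USD per million output tokens, as quoted
NEW_PRICE_PER_MILLION = 1.50  # USD per million output tokens, as quoted


def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of generating `tokens` output tokens."""
    return tokens / 1_000_000 * price_per_million


ratio = NEW_PRICE_PER_MILLION / OLD_PRICE_PER_MILLION
print(f"price ratio: {ratio:.2f}x")

# Hypothetical workload: 10 million output tokens.
workload = 10_000_000
old_cost = output_cost(workload, OLD_PRICE_PER_MILLION)
new_cost = output_cost(workload, NEW_PRICE_PER_MILLION)
print(f"old: ${old_cost:.2f}, new: ${new_cost:.2f}")
```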
Inception's Mercury 2 uses diffusion for language reasoning, claims 5x speed over autoregressive models
Inception has released Mercury 2, positioning it as the first diffusion-based language reasoning model. Rather than generating text sequentially, word by word, as standard language models do, Mercury 2 refines entire passages in parallel, according to the company.