product update

Taalas serves Llama 3.1 8B at 17,000 tokens/second with custom silicon

TL;DR

Taalas, a new Canadian hardware startup, announced its first product: a custom silicon implementation of Meta's Llama 3.1 8B model running at 17,000 tokens/second. The startup uses aggressive quantization combining 3-bit and 6-bit parameters. The system is accessible via chatjimmy.ai.

Canadian hardware startup Taalas has launched its first product: a custom silicon implementation of Meta's Llama 3.1 8B model capable of generating 17,000 tokens per second.

The company describes its "Silicon Llama" as "aggressively quantized, combining 3-bit and 6-bit parameters." This aggressive quantization allows the model to run at extreme speeds while keeping inference quality usable. The implementation is available to test at chatjimmy.ai.

Speed Claims vs. Reality

The 17,000 tokens/second figure is roughly an order of magnitude faster than typical cloud-based inference. For context, leading inference providers typically deliver 500-2,000 tokens/second for comparable models. Taalas achieves this through custom hardware purpose-built for the Llama 3.1 8B architecture, rather than relying on general-purpose GPUs.
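To put those throughput figures in perspective, a rough back-of-the-envelope calculation shows the wall-clock time to decode a 500-token reply at each speed. The reply length is an illustrative assumption; the tokens/second figures come from the article.

```python
# Wall-clock decode time for a 500-token reply at various throughputs.
# Reply length is an assumed, illustrative figure.
REPLY_TOKENS = 500

speeds = {
    "typical cloud (low end)": 500,     # tokens/second
    "typical cloud (high end)": 2_000,
    "Taalas claim": 17_000,
}

for name, tok_per_s in speeds.items():
    ms = REPLY_TOKENS / tok_per_s * 1000
    print(f"{name}: {ms:.0f} ms")
```

At the claimed rate, a full 500-token response lands in under 30 milliseconds, versus a quarter-second to a full second at typical cloud speeds.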

Technical Approach

The quantization strategy merits technical scrutiny. While 3-bit and 6-bit quantization can preserve model functionality in smaller models like the 8B class, output quality at these bit depths depends heavily on implementation details. The startup has not yet published benchmarks comparing output quality against standard Llama 3.1 8B inference.
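Why bit depth matters can be seen in a toy round-trip experiment. The sketch below applies simple symmetric uniform quantization, an assumed scheme for illustration (Taalas has not disclosed its actual method), to a random weight tensor at 3-bit and 6-bit precision and compares the reconstruction error.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Round-trip a weight tensor through symmetric uniform quantization
    at the given bit depth; returns the dequantized approximation."""
    qmax = 2 ** (bits - 1) - 1              # 3 for 3-bit, 31 for 6-bit
    scale = np.abs(w).max() / qmax          # map largest weight to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                        # dequantize back to floats

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=10_000)        # toy stand-in for a weight tensor

for bits in (3, 6):
    err = np.abs(quantize_symmetric(w, bits) - w).mean()
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

Each extra bit roughly halves the quantization step, so 3-bit storage trades substantially more reconstruction error for memory and bandwidth savings, which is why mixed schemes reserve higher precision for the most sensitive parameters.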

Taalas indicated its next generation will shift to 4-bit quantization, suggesting the company believes this offers a better quality-speed tradeoff. Given the long lead times of custom silicon production, these next-generation chips likely reached design freeze months ago.

What This Means

Taalas represents a broader trend: specialized inference hardware becoming commercially viable for specific model architectures. Unlike general-purpose accelerators, custom silicon for fixed models like Llama 3.1 8B can optimize at every level—memory hierarchy, dataflow, quantization scheme—yielding speed improvements cloud providers cannot match.

The company's focus on 8B models is strategic. These models occupy a sweet spot: small enough for custom hardware optimization, large enough to be commercially useful. By democratizing extreme-speed inference, Taalas could shift applications from cloud-dependent to edge-deployable, though its output quality claims require independent verification.

