Product update

Taalas serves Llama 3.1 8B at 17,000 tokens/second with custom silicon

Taalas, a new Canadian hardware startup, announced its first product: a custom silicon implementation of Meta's Llama 3.1 8B model running at 17,000 tokens/second. The startup uses aggressive quantization combining 3-bit and 6-bit parameters. The system is accessible via chatjimmy.ai.

2 min read

Canadian hardware startup Taalas has launched its first product: a custom silicon implementation of Meta's Llama 3.1 8B model capable of generating 17,000 tokens per second.

The company describes its "Silicon Llama" as "aggressively quantized, combining 3-bit and 6-bit parameters." This mixed-precision quantization keeps the model small enough to run at extreme speed while, the company says, preserving usable output quality. A public demo is available at chatjimmy.ai.

Speed Claims vs. Reality

The 17,000 tokens/second figure is roughly an order of magnitude above typical cloud-based inference speeds: leading inference providers generally deliver 500-2,000 tokens/second for comparable models. Taalas achieves this with custom hardware purpose-built for the Llama 3.1 8B architecture, rather than relying on general-purpose GPUs.
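To make the throughput gap concrete, a back-of-the-envelope calculation (using only the figures cited above; the 500-token response length is an assumed example) shows what these rates mean for a single chat reply:

```python
# Illustrative latency comparison using the throughput figures from the
# article: ~17,000 tok/s (Taalas) vs. 500-2,000 tok/s (typical cloud GPUs).

def generation_time_ms(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a sustained decode rate, in milliseconds."""
    return num_tokens / tokens_per_second * 1000.0

response_tokens = 500  # assumed: a medium-length chat reply

for label, tps in [("cloud (low)", 500), ("cloud (high)", 2000), ("Taalas", 17000)]:
    print(f"{label:12s}: {generation_time_ms(response_tokens, tps):8.1f} ms")
```

At 17,000 tokens/second a 500-token reply streams in under 30 ms, versus a quarter second to a full second at typical cloud rates; for short responses, the output effectively appears instantaneously.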

Technical Approach

The quantization strategy merits scrutiny. While 3-bit and 6-bit quantization can preserve functionality in models of this size, output quality at such low bit depths depends heavily on the implementation, and the startup has not yet published benchmarks comparing its output quality against standard Llama 3.1 8B inference.
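Taalas has not disclosed its quantization scheme, but a toy experiment with a plain symmetric uniform quantizer (an assumption for illustration, not the company's method; the `quantize_symmetric` helper and the Gaussian toy weights are hypothetical) shows why the choice of bit depth matters so much:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> np.ndarray:
    """Round-trip w through a symmetric uniform k-bit integer grid."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 3 levels/side at 3-bit, 31 at 6-bit
    scale = np.abs(w).max() / qmax        # one scale per tensor (simplest scheme)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)    # toy stand-in for a weight tensor

for bits in (3, 4, 6):
    err = np.abs(w - quantize_symmetric(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

Error shrinks roughly by half per extra bit, which is why real low-bit deployments lean on finer-grained scales, outlier handling, or mixed precision (such as the 3-bit/6-bit split described here) rather than one tensor-wide scale.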

Taalas indicated its next generation will shift to 4-bit quantization, suggesting the company sees better quality-speed tradeoffs there. Given the long lead times of custom silicon production, those next-generation chips likely reached design freeze months ago.

What This Means

Taalas represents a broader trend: specialized inference hardware becoming commercially viable for specific model architectures. Unlike general-purpose accelerators, custom silicon for fixed models like Llama 3.1 8B can optimize at every level—memory hierarchy, dataflow, quantization scheme—yielding speed improvements cloud providers cannot match.

The company's focus on 8B models is strategic. These models occupy a sweet spot: small enough for custom hardware optimization, large enough to be commercially useful. Democratizing extreme-speed inference could shift applications from cloud-dependent to edge-deployable, though the output-quality claims still require independent verification.