product update

Taalas serves Llama 3.1 8B at 17,000 tokens/second with custom silicon

TL;DR

Taalas, a new Canadian hardware startup, announced its first product: a custom silicon implementation of Meta's Llama 3.1 8B model running at 17,000 tokens/second. The startup uses aggressive quantization combining 3-bit and 6-bit parameters. The system is accessible via chatjimmy.ai.

Canadian hardware startup Taalas has launched its first product: a custom silicon implementation of Meta's Llama 3.1 8B model capable of generating 17,000 tokens per second.

The company describes its "Silicon Llama" as "aggressively quantized, combining 3-bit and 6-bit parameters." Shrinking each weight to a few bits helps the model run at extreme speeds while remaining usable for inference. The implementation can be tested at chatjimmy.ai.

Speed Claims vs. Reality

The 17,000 tokens/second figure is far above typical cloud-based inference speeds. For context, leading inference providers typically deliver 500 to 2,000 tokens/second for comparable models. Taalas achieves its figure with custom hardware purpose-built for the Llama 3.1 8B architecture, rather than relying on general-purpose GPUs.
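
To put the quoted numbers side by side, here is a minimal arithmetic sketch. It assumes the 17,000 tokens/second figure and the 500 to 2,000 tokens/second range are measured on the same basis, which the announcement does not specify, and the 1,000-token response length is a hypothetical value chosen only for illustration.

```python
# Figures quoted in the article; the baseline range is the article's
# "typical" provider estimate, not a measured benchmark.
TAALAS_TPS = 17_000
BASELINE_TPS_RANGE = (500, 2_000)


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two throughput figures."""
    return fast_tps / slow_tps


def seconds_to_generate(tokens: int, tps: float) -> float:
    """Time to emit `tokens` tokens at a steady `tps` tokens/second."""
    return tokens / tps


# Compare against the fast end and the slow end of the baseline range.
min_speedup = speedup(TAALAS_TPS, BASELINE_TPS_RANGE[1])
max_speedup = speedup(TAALAS_TPS, BASELINE_TPS_RANGE[0])
microseconds_per_token = 1e6 / TAALAS_TPS

if __name__ == "__main__":
    print(f"speedup: {min_speedup:.1f}x to {max_speedup:.1f}x")
    print(f"per-token time: {microseconds_per_token:.1f} microseconds")
    # 1,000 tokens is a hypothetical response length (assumption).
    print(f"1,000 tokens at 17,000 tok/s: {seconds_to_generate(1_000, TAALAS_TPS):.3f} s")
    print(f"1,000 tokens at 500 tok/s: {seconds_to_generate(1_000, 500):.3f} s")
```

Under these assumptions, the quoted figure works out to 8.5 to 34 times the baseline range, or about 59 microseconds per token.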

Technical Approach

The quantization strategy merits technical scrutiny. While 3-bit and 6-bit quantization can preserve the functionality of smaller models in the 8B-parameter range, output quality at these bit depths depends heavily on implementation. The startup has not yet published benchmarks comparing output quality against standard Llama 3.1 8B inference.
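
To illustrate why bit depth matters, the sketch below applies plain symmetric round-to-nearest quantization to synthetic values at 3 and 6 bits and compares the round-trip error. This is a generic textbook scheme, not Taalas's method, which has not been published. The Gaussian "weights" and the function names are assumptions made for the example, not real Llama 3.1 8B parameters.

```python
import random


def quantize_dequantize(values, bits):
    """Symmetric uniform quantization: snap each value to a signed integer
    grid with 2**(bits-1) - 1 levels per side, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 3 for 3-bit, 31 for 6-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax, min(qmax, q))  # clamp to the representable range
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Synthetic stand-in for a weight tensor (assumption, seeded for repeatability).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]

err_3bit = mean_abs_error(weights, quantize_dequantize(weights, 3))
err_6bit = mean_abs_error(weights, quantize_dequantize(weights, 6))

if __name__ == "__main__":
    print(f"3-bit mean abs error: {err_3bit:.6f}")
    print(f"6-bit mean abs error: {err_6bit:.6f}")
```

In this toy setup the 3-bit grid has at most 7 distinct values, so its error is much larger than the 6-bit grid's. Production schemes typically reduce that gap with per-group scales and calibration, which is why implementation details matter so much at these bit depths.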

Taalas indicated its next generation will shift to 4-bit quantization, suggesting the company believes 4-bit offers a better quality-speed tradeoff. The long lead times for custom silicon production mean these next-generation chips likely reached design freeze months ago.
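
For a rough sense of where 4-bit sits relative to the current 3-bit and 6-bit mix, the sketch below estimates raw weight storage at each bit width. It assumes a nominal 8 billion parameters and ignores scales, metadata, and activations. The 50/50 split between 3-bit and 6-bit parameters is purely hypothetical, since the announcement does not state the actual ratio.

```python
PARAMS = 8_000_000_000  # nominal "8B" parameter count (assumption)


def weight_gigabytes(params: int, bits_per_param: float) -> float:
    """Raw weight storage in decimal gigabytes, ignoring any overhead."""
    return params * bits_per_param / 8 / 1e9


def mixed_bits(fraction_low: float, low_bits: int = 3, high_bits: int = 6) -> float:
    """Average bits per parameter when `fraction_low` of the parameters
    use `low_bits` and the remainder use `high_bits`."""
    return fraction_low * low_bits + (1 - fraction_low) * high_bits


if __name__ == "__main__":
    for bits in (3, 4, 6, 16):
        print(f"{bits:>2}-bit: {weight_gigabytes(PARAMS, bits):.1f} GB")
    # Hypothetical 50/50 mix; the real split is not public.
    avg = mixed_bits(0.5)
    print(f"50/50 mix of 3-bit and 6-bit: {avg:.1f} bits/param, "
          f"{weight_gigabytes(PARAMS, avg):.1f} GB")
```

Under these assumptions a uniform 4-bit format stores the weights in about 4 GB, between the 3 GB and 6 GB bounds of the current formats, so the footprint of the existing mix relative to 4-bit depends entirely on the undisclosed ratio.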

What This Means

Taalas is part of a broader trend: specialized inference hardware becoming commercially viable for specific model architectures. Unlike general-purpose accelerators, custom silicon for fixed models like Llama 3.1 8B can optimize at every level (memory hierarchy, dataflow, quantization scheme), yielding speed improvements cloud providers cannot match.

The company's focus on 8B models is strategic. These models occupy a sweet spot: small enough for custom hardware optimization, large enough to be commercially useful. If extreme-speed inference becomes widely available, applications could shift from cloud-dependent to edge-deployable, though the output quality claims require independent verification.
