inference

7 articles tagged with inference

March 23, 2026
product update · NVIDIA

NVIDIA Nemotron 3 Super now available on Amazon Bedrock with 256K context window

NVIDIA Nemotron 3 Super, a hybrid Mixture-of-Experts model with 120B total parameters (12B active), is now available as a fully managed model on Amazon Bedrock. The model supports a context length of up to 256K tokens and claims 5x higher throughput efficiency than the previous Nemotron Super, along with 2x higher accuracy on reasoning tasks.
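Because the model is fully managed, it can be called through Bedrock's standard Converse API. A minimal sketch with boto3, assuming AWS credentials are configured; the model ID below is a placeholder, not the confirmed identifier (check the Bedrock model catalog for the real one):

```python
# Placeholder model ID -- look up the actual Nemotron 3 Super
# identifier in the Amazon Bedrock model catalog.
MODEL_ID = "nvidia.nemotron-3-super-v1:0"

def build_request(prompt: str, max_tokens: int = 512) -> dict:
    """Assemble a single-turn payload for the Bedrock Converse API."""
    return {
        "modelId": MODEL_ID,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }

def ask(prompt: str) -> str:
    """Send the prompt to Bedrock and return the model's text reply."""
    import boto3  # AWS SDK; credentials and region come from the environment

    client = boto3.client("bedrock-runtime")
    response = client.converse(**build_request(prompt))
    return response["output"]["message"]["content"][0]["text"]
```

The same call shape works for any Bedrock-hosted model; only `modelId` changes.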

March 12, 2026
product update

Meta unveils four custom AI inference chips to cut costs and reduce Nvidia dependency

Meta has unveiled four generations of custom-designed AI chips focused on inference workloads, aiming to reduce inference costs across its platforms serving billions of users. The move represents a significant step toward reducing Meta's dependence on GPU manufacturers like Nvidia and AMD.

March 11, 2026
product update

Meta develops four custom AI chips to reduce Nvidia dependence

Meta has developed four new custom AI chips called MTIA (Meta Training and Inference Accelerator) processors designed to power its AI models and recommendation systems. The move represents the company's ongoing effort to reduce dependence on Nvidia's expensive processors while managing massive compute requirements.

March 9, 2026
product update · NVIDIA

NVIDIA Nemotron 3 Nano now available on Amazon Bedrock as serverless model

Amazon Bedrock now offers NVIDIA's Nemotron 3 Nano as a fully managed serverless model, expanding its Nemotron portfolio alongside previously available Nemotron 2 Nano 9B and Nemotron 2 Nano VL 12B variants. The addition enables developers to deploy NVIDIA's smallest inference-optimized model without managing infrastructure.

product update · Amazon Web Services

Anthropic Claude models now available in India via Amazon Bedrock with cross-region inference

Amazon Bedrock now enables access to Anthropic Claude models in India with global cross-region inference support. Developers can build generative AI applications on Claude while Bedrock routes requests across AWS Regions for capacity and availability.
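Cross-region inference in Bedrock is addressed through inference-profile IDs, which prefix a geography code onto a base model ID so the service can route the request across Regions. A hedged sketch, assuming boto3 and configured credentials; the base model ID is a placeholder, not a confirmed identifier (list the available profiles via the Bedrock console or API):

```python
def profile_id(geo: str, base_model_id: str) -> str:
    """Form a cross-region inference-profile ID by prefixing a
    geography code (e.g. "global" or "apac") onto a base model ID."""
    return f"{geo}.{base_model_id}"

def invoke_claude(prompt: str, region: str = "ap-south-1") -> str:
    """Call a Claude model from an Indian Region through a global
    inference profile; Bedrock routes the request across Regions."""
    import boto3  # AWS SDK; credentials come from the environment

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.converse(
        # Placeholder base model ID -- substitute a Claude model
        # actually offered through cross-region inference.
        modelId=profile_id("global", "anthropic.claude-sonnet-4-v1:0"),
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```

The geography prefix is the only change from a single-Region call, so existing Converse-based code picks up cross-region routing by swapping the model ID for a profile ID.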

March 3, 2026
model release

Google releases Gemini 3.1 Flash-Lite, fastest model in the Gemini 3 series

Google has released Gemini 3.1 Flash-Lite, positioning it as the fastest and most cost-efficient model in its Gemini 3 series. The release targets deployment scenarios requiring high-speed inference at reduced computational cost.


February 20, 2026
product update

Taalas serves Llama 3.1 8B at 17,000 tokens/second with custom silicon

Taalas, a new Canadian hardware startup, announced its first product: a custom silicon implementation of Meta's Llama 3.1 8B model running at 17,000 tokens/second. The design uses aggressive quantization, mixing 3-bit and 6-bit weights. The system is accessible via chatjimmy.ai.