benchmark | xAI

Grok 4.20 trails GPT-5.4 and Gemini 3.1 but achieves record 78% non-hallucination rate

TL;DR

xAI's Grok 4.20 scores 48 on Artificial Analysis' Intelligence Index—6 points ahead of Grok 4 but trailing Gemini 3.1 Pro Preview and GPT-5.4 at 57. The model distinguishes itself with a 78% non-hallucination rate on the AA Omniscience test, the highest recorded across any model tested.

Grok 4.20 — Quick Specs

Context window: 2,000K tokens (2M)
Input: $2 / 1M tokens
Output: $6 / 1M tokens

xAI's Grok 4.20, released today, trails the industry's leading models on general benchmarks while setting a new high mark for factual reliability.

Benchmark Performance Gap

According to Artificial Analysis, Grok 4.20 Beta scores 48 on the Intelligence Index with reasoning enabled, placing it 9 points behind Gemini 3.1 Pro Preview and GPT-5.4, both at 57. That deficit separates xAI's latest offering from the current generation of frontier models from Google DeepMind and OpenAI.

The result does show incremental improvement: Grok 4.20 gains 6 points over its predecessor, Grok 4, though not enough to close the gap with top-tier competitors.

Factual Reliability Breakthrough

Where Grok 4.20 distinguishes itself is factual accuracy. On Artificial Analysis' Omniscience test, the model achieved a 78% non-hallucination rate, the highest score recorded for any model tested to date. The Omniscience test measures two things: factual recall accuracy, and how often a model fabricates an answer rather than admitting a knowledge gap.

In practical terms, when faced with questions outside its training knowledge, Grok 4.20 hallucinates or answers incorrectly roughly one time in five (about 22%), a lower rate than any other model tested.
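The "one in five" figure can be checked with a short sketch. It assumes the hallucination rate is simply the complement of the reported non-hallucination rate; Artificial Analysis' exact scoring formula is not given in this article, so treat the variable names and the calculation as illustrative.

```python
# Assumption: hallucination rate = 100% - non-hallucination rate.
non_hallucination_pct = 78  # figure reported by Artificial Analysis

hallucination_pct = 100 - non_hallucination_pct
one_in_n = 100 / hallucination_pct  # express the rate as "one in N"

print(f"Hallucination rate: {hallucination_pct}%")
print(f"Roughly one in {one_in_n:.1f} responses")
```

Under that assumption the rate works out to 22%, or about one response in 4.5, which the article rounds to one in five.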

Technical Specifications and Pricing

xAI released three API variants: a standard mode with reasoning disabled, a reasoning-enabled mode, and a multi-agent configuration. The model supports a 2-million-token context window.

Pricing is competitive: $2 per million input tokens and $6 per million output tokens. This undercuts Grok 4's prior pricing and places the model in the mid-tier price range among Western models, below OpenAI's GPT-5.4 but comparable to other similarly capable offerings.
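To make the listed rates concrete, here is a minimal cost sketch. The function name `estimate_cost` and the sample workload are hypothetical. It assumes flat per-token billing at the quoted prices and ignores any caching discounts, tiering, or separate billing for reasoning tokens, none of which are described here.

```python
# Prices quoted in the article (USD per 1M tokens).
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 6.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at the listed flat rates."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical workload: a 200K-token prompt with a 4K-token answer.
print(f"${estimate_cost(200_000, 4_000):.3f}")
```

At these rates, filling the entire 2M-token context window once would cost about $4 in input tokens alone, so long-context workloads are dominated by the input price rather than the output price.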

What This Means

Grok 4.20 represents a deliberate trade-off: xAI appears to have prioritized factual reliability over raw benchmark performance. The Intelligence Index gap suggests that users who need maximum capability should still prefer GPT-5.4 or Gemini 3.1, but enterprises that value factual accuracy over general reasoning may find Grok 4.20's combination of a 78% non-hallucination rate and a 2M-token context window compelling at its price point. The result also calls into question the utility of benchmark-centric model comparisons: a superior Intelligence Index score does not guarantee reliability on factual claims.
