Grok 4.20 trails GPT-5.4 and Gemini 3.1 but achieves record 78% non-hallucination rate
xAI's Grok 4.20 scores 48 on Artificial Analysis' Intelligence Index—6 points ahead of Grok 4 but trailing Gemini 3.1 Pro Preview and GPT-5.4 at 57. The model distinguishes itself with a 78% non-hallucination rate on the AA Omniscience test, the highest recorded across any model tested.
xAI's Grok 4.20, released today, trails the industry's leading models on overall benchmark performance while setting a new high mark for factual reliability.
Benchmark Performance Gap
According to Artificial Analysis, Grok 4.20 Beta scores 48 on the Intelligence Index with reasoning enabled, well behind Gemini 3.1 Pro Preview and GPT-5.4, both at 57. That 9-point deficit separates xAI's latest offering from the current generation of frontier models from Google DeepMind and OpenAI.
The result does show incremental improvement: Grok 4.20 gains 6 points over its predecessor, Grok 4, indicating progress on benchmark performance, though not enough to close the gap with top-tier competitors.
Factual Reliability Breakthrough
Where Grok 4.20 distinguishes itself is factual accuracy. On Artificial Analysis' Omniscience test, the model achieved a 78% non-hallucination rate, the highest score recorded for any model tested to date. The Omniscience test measures two things: factual recall accuracy, and how often a model fabricates an answer rather than admitting a knowledge gap.
In practical terms, when faced with questions outside its training knowledge, Grok 4.20 produces a hallucinated or incorrect response about 22% of the time, or roughly one in five, a lower rate than any other model Artificial Analysis has tested.
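The "one in five" figure follows directly from the reported 78% rate. A minimal sketch of that arithmetic, assuming the hallucination rate is simply the complement of the non-hallucination rate (the helper name `one_in_n` is illustrative, not something Artificial Analysis publishes):

```python
# Figure reported in the article for the AA Omniscience test.
NON_HALLUCINATION_RATE = 0.78


def one_in_n(rate: float) -> float:
    """Express a rate as 'one in N' occurrences (hypothetical helper)."""
    return 1 / rate


# Assumption: hallucination rate is the complement of the reported rate.
hallucination_rate = 1 - NON_HALLUCINATION_RATE

print(f"{hallucination_rate:.0%}, about one in {one_in_n(hallucination_rate):.1f}")
# prints "22%, about one in 4.5"
```

Strictly, then, the rate is slightly worse than one in five, closer to two in nine.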
Technical Specifications and Pricing
xAI released three API variants: a standard mode with reasoning disabled, a reasoning-enabled mode, and a multi-agent configuration. The model supports a 2-million-token context window, matching current enterprise standards.
Pricing is competitive: $2 per million input tokens and $6 per million output tokens. This undercuts Grok 4's prior pricing and places the model in the mid-tier price range among Western models, below OpenAI's GPT-5.4 and comparable to other similarly capable offerings.
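To put the per-token prices in concrete terms, here is a rough cost estimate for a single request at the quoted rates. It is a sketch only: the function name and the example token counts are hypothetical, and it assumes flat per-token billing with no tiering, caching discounts, or separate charges for reasoning tokens.

```python
# Prices quoted in the article, in USD per million tokens.
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 6.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at the quoted rates.

    Assumes flat billing; real invoices may differ.
    """
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical request: a 500,000-token prompt with a 4,000-token reply.
cost = estimate_cost(500_000, 4_000)
print(f"${cost:.3f}")  # prints "$1.024"
```

At these rates, a prompt that fills the full 2-million-token context would cost about $4 in input tokens alone.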
What This Means
Grok 4.20 reflects an apparent trade-off: xAI seems to have prioritized factual reliability over raw benchmark performance. The Intelligence Index gap suggests that users who need maximum capability should still prefer GPT-5.4 or Gemini 3.1, but enterprises that value factual accuracy over general reasoning may find Grok 4.20's combination of a 78% non-hallucination rate and a 2M-token context window compelling at its price point. The result also calls into question the utility of benchmark-centric model comparisons: a superior Intelligence Index score does not guarantee reliability on factual claims.