Grok 4.20 trails GPT-5.4 and Gemini 3.1 but achieves record 78% non-hallucination rate
xAI's Grok 4.20 scores 48 on Artificial Analysis' Intelligence Index, 6 points ahead of Grok 4 but trailing Gemini 3.1 Pro Preview and GPT-5.4, both at 57. The model distinguishes itself with a 78% non-hallucination rate on the AA Omniscience test, the highest recorded for any model tested.
xAI's Grok 4.20, released today, trails the industry's leading models on aggregate benchmarks while setting a new standard for factual reliability.
Benchmark Performance Gap
According to Artificial Analysis, Grok 4.20 Beta scores 48 on the Intelligence Index with reasoning enabled, placing it well behind Gemini 3.1 Pro Preview and GPT-5.4, both at 57. That 9-point deficit separates xAI's latest offering from the current generation of frontier models from Google DeepMind and OpenAI.
The result does show incremental improvement: Grok 4.20 gains 6 points over its predecessor, Grok 4, indicating progress on benchmark performance, though not enough to close the gap with top-tier competitors.
Factual Reliability Breakthrough
Where Grok 4.20 distinguishes itself is factual accuracy. On Artificial Analysis' Omniscience test, the model achieved a 78% non-hallucination rate, the highest score recorded for any model tested to date. The Omniscience test measures how often a model fabricates an answer rather than admitting a knowledge gap, as well as its factual recall accuracy.
In practical terms, Grok 4.20 hallucinates or answers incorrectly in roughly one of every five responses when pressed on questions at the edge of its training knowledge, a measurably lower rate than existing models.
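The arithmetic behind that "one in five" framing can be sketched with a toy calculation. The grading labels below are assumptions for illustration, not Artificial Analysis' actual scoring methodology:

```python
# Toy illustration of a non-hallucination rate: each response is graded
# "correct", "abstained" (the model admits it doesn't know), or
# "hallucinated". Only fabricated answers count against the model.

def non_hallucination_rate(grades):
    """Fraction of responses that are not hallucinations."""
    hallucinated = sum(1 for g in grades if g == "hallucinated")
    return 1 - hallucinated / len(grades)

# Hypothetical distribution over 100 test questions.
grades = ["correct"] * 60 + ["abstained"] * 18 + ["hallucinated"] * 22
print(f"{non_hallucination_rate(grades):.0%}")  # 78%
```

Note that abstentions do not count as hallucinations here, which is why a model can raise this metric by declining to answer rather than by knowing more.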
Technical Specifications and Pricing
xAI released three API variants: a standard non-reasoning mode, a reasoning-enabled mode, and a multi-agent configuration. The model supports a 2-million-token context window, matching current enterprise standards.
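A minimal sketch of how a client might select among the three variants, assuming an OpenAI-style chat-completion payload; the model identifiers and field names here are hypothetical, not xAI's documented API:

```python
# Hypothetical request payloads for the three API variants described above.
# The model naming scheme ("grok-4.20-<variant>") is an assumption for
# illustration only.

def build_request(prompt, variant="reasoning"):
    """Assemble a chat-completion payload for the chosen variant."""
    if variant not in {"standard", "reasoning", "multi-agent"}:
        raise ValueError(f"unknown variant: {variant}")
    return {
        "model": f"grok-4.20-{variant}",  # hypothetical identifier
        "messages": [{"role": "user", "content": prompt}],
        # The 2M-token context window applies regardless of variant.
    }

payload = build_request("Summarize this contract.", variant="standard")
print(payload["model"])  # grok-4.20-standard
```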
Pricing is competitive: $2 per million input tokens and $6 per million output tokens. This undercuts Grok 4's prior pricing and positions the model in the mid-tier range among Western models, below OpenAI's GPT-5.4 but comparable to other similarly capable offerings.
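At those rates, per-request cost is straightforward to estimate. The rates are the published $2/$6 per million tokens; the example workload is hypothetical:

```python
# Estimate API cost at Grok 4.20's published rates:
# $2 per million input tokens, $6 per million output tokens.
INPUT_RATE = 2.00 / 1_000_000   # dollars per input token
OUTPUT_RATE = 6.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the published rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical workload: a 50k-token prompt with a 2k-token completion.
print(f"${request_cost(50_000, 2_000):.3f}")  # $0.112
```

Even a prompt that fills a meaningful fraction of the 2M-token context stays in the low single-digit dollars per request at these rates.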
What This Means
Grok 4.20 represents a deliberate trade-off: xAI appears to have prioritized factual reliability over raw benchmark performance. While the Intelligence Index gap suggests users who need maximum capability should still prefer GPT-5.4 or Gemini 3.1, enterprises that value factual accuracy over general reasoning may find Grok 4.20's combination of a 78% non-hallucination rate and a 2M-token context window compelling at its price point. The result also calls into question the value of benchmark-centric model comparisons: superior Intelligence Index scores don't guarantee reliability on factual claims.