benchmark · xAI

Grok 4.20 trails GPT-5.4 and Gemini 3.1 but achieves record 78% non-hallucination rate

TL;DR

xAI's Grok 4.20 scores 48 on Artificial Analysis' Intelligence Index—6 points ahead of Grok 4 but trailing Gemini 3.1 Pro Preview and GPT-5.4 at 57. The model distinguishes itself with a 78% non-hallucination rate on the AA Omniscience test, the highest recorded across any model tested.

2 min read

Grok 4.20 — Quick Specs

Context window: 2M tokens
Input: $2 / 1M tokens
Output: $6 / 1M tokens

xAI's Grok 4.20, released today, demonstrates a clear performance gap against the industry's leading models while establishing a new benchmark for factual reliability.

Benchmark Performance Gap

According to Artificial Analysis, Grok 4.20 Beta scores 48 on the Intelligence Index with reasoning enabled, placing it well behind Gemini 3.1 Pro Preview and GPT-5.4, both at 57. That 9-point deficit separates xAI's latest offering from the current generation of frontier models from Google DeepMind and OpenAI.

The result does show incremental improvement: Grok 4.20 gains 6 points over its predecessor, Grok 4, indicating progress on benchmark performance, though not enough to close the gap with top-tier competitors.

Factual Reliability Breakthrough

Where Grok 4.20 distinguishes itself is factual accuracy. On Artificial Analysis' Omniscience test, the model achieved a 78% non-hallucination rate, the highest score recorded across any model tested to date. The Omniscience test measures both factual recall accuracy and how often a model fabricates an answer rather than admitting a knowledge gap.

In practical terms, Grok 4.20 fabricates or gets wrong roughly one answer in five (a 22% error rate) when pressed on questions at the edge of its knowledge, a measurably lower failure rate than any other model tested.
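To make that rate concrete, here's a back-of-the-envelope sketch. Note the 78% figure is an aggregate benchmark score, not a per-query guarantee, and the query volume below is hypothetical:

```python
# Back-of-the-envelope: expected hallucinated answers at a 78% non-hallucination rate.
non_hallucination_rate = 0.78   # Grok 4.20's AA Omniscience score
queries = 10_000                # hypothetical volume of knowledge-edge questions

expected_errors = queries * (1 - non_hallucination_rate)
print(f"Expected hallucinated/incorrect answers: {expected_errors:,.0f} of {queries:,}")
# -> 2,200 of 10,000, i.e. roughly one in five
```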

Technical Specifications and Pricing

xAI released three API variants: standard reasoning-disabled mode, reasoning-enabled mode, and a multi-agent configuration. The model supports a 2-million-token context window, matching current enterprise standards.
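xAI's API has historically been OpenAI-compatible, so a reasoning-enabled call might look like the sketch below. The model identifier grok-4.20 and the reasoning_effort toggle are assumptions for illustration; consult xAI's API reference for the shipped names.

```python
# Minimal sketch of a reasoning-enabled request against xAI's
# OpenAI-compatible endpoint. Model ID and reasoning toggle are assumed.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_XAI_API_KEY",      # placeholder credential
    base_url="https://api.x.ai/v1",  # xAI's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="grok-4.20",               # hypothetical model identifier
    reasoning_effort="high",         # assumed switch for reasoning-enabled mode
    messages=[{"role": "user", "content": "What does the AA Omniscience test measure?"}],
)
print(response.choices[0].message.content)
```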

Pricing is competitive: $2 per million input tokens and $6 per million output tokens. This undercuts Grok 4's prior pricing and positions the model in the mid-tier range among Western models, below OpenAI's GPT-5.4 but comparable to other similarly capable offerings.
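For a sense of what those rates mean per request, a quick cost estimate (the token counts are hypothetical; a long-context prompt dominates the bill):

```python
# Rough per-request cost at Grok 4.20's listed rates.
INPUT_PRICE = 2.00 / 1_000_000    # dollars per input token  ($2 / 1M)
OUTPUT_PRICE = 6.00 / 1_000_000   # dollars per output token ($6 / 1M)

input_tokens, output_tokens = 500_000, 2_000   # hypothetical long-context request
cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"Estimated cost: ${cost:.2f}")          # $1.00 input + $0.01 output = $1.01
```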

What This Means

Grok 4.20 represents a deliberate trade-off: xAI appears to have prioritized factual reliability over raw benchmark performance. While the Intelligence Index gap suggests users requiring maximum capability should still prefer GPT-5.4 or Gemini 3.1, enterprises that value factual accuracy over general reasoning may find Grok 4.20's combination of a 78% non-hallucination rate and a 2M-token context window compelling at its price point. The result also calls into question benchmark-centric model comparisons; a superior Intelligence Index score doesn't guarantee reliability on factual claims.

Related Articles

model release

xAI releases Grok 4.20 with 2M context window and native reasoning capabilities

xAI released Grok 4.20 on March 31, 2026, its flagship model featuring a 2 million token context window, pricing of $2 per million input tokens and $6 per million output tokens, and toggleable reasoning capabilities. The model includes web search functionality at $5 per 1,000 queries and claims industry-leading speed with low hallucination rates.

benchmark

ChatGPT Images 2.0 scores 97% in head-to-head image generation benchmark against Google's Gemini Nano Banana at 85%

OpenAI's ChatGPT Images 2.0 scored 97% versus Google's Gemini Nano Banana at 85% in a nine-test image generation benchmark conducted by ZDNET. The tests measured capabilities including image restoration, text rendering, and prompt adherence, with Nano Banana losing points primarily for fabricating details and text errors.

benchmark

QIMMA Arabic Leaderboard Discards 3.1% of ArabicMMLU Samples After Quality Validation

TII UAE released QIMMA, an Arabic LLM leaderboard that validates benchmark quality before evaluating models. The validation pipeline, using Qwen3-235B and DeepSeek-V3 plus human review, discarded 3.1% of ArabicMMLU samples and found systematic quality issues across 14 benchmarks.

benchmark

Qwen3.6-35B-A3B Outperforms Claude Opus 4.7 on SVG Generation Test

In an informal SVG generation benchmark, Alibaba's Qwen3.6-35B-A3B model running locally via a 20.9GB quantized version outperformed Anthropic's newly released Claude Opus 4.7. The test, which asked models to generate SVG illustrations of pelicans and flamingos on bicycles, showed the smaller local model producing more accurate bicycle frames and more creative outputs.
