
QIMMA Arabic Leaderboard Discards 3.1% of ArabicMMLU Samples After Quality Validation

TL;DR

TII UAE released QIMMA, an Arabic LLM leaderboard that validates benchmark quality before evaluating models. The validation pipeline, using Qwen3-235B and DeepSeek-V3 plus human review, discarded 3.1% of ArabicMMLU samples and found systematic quality issues across 14 benchmarks.

April 21, 2026 · 10:20 AM · 2 min read


TII UAE released QIMMA (Arabic for "summit"), an Arabic LLM evaluation platform that validates benchmark quality before running model evaluations. The platform found systematic errors across widely used Arabic benchmarks, with discard rates of 3.1% for ArabicMMLU samples and as high as 12.3% for MizanQA questions.

Validation Pipeline Details

QIMMA applies a two-stage validation process to 52,164 samples across 109 benchmark subsets:

Stage 1: Two LLMs (Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B) independently score each sample against a 10-point quality rubric. Samples scoring below 7/10 from either model are flagged.

Stage 2: Native Arabic speakers review flagged samples for cultural context, dialectal nuance, and subtle quality issues.
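
The flagging rule in Stage 1 can be sketched in a few lines of Python. This is a minimal illustration, not QIMMA's code: the `JudgedSample` type, its field names, and the helper functions are hypothetical, and the only detail taken from the article is the threshold (a score below 7/10 from either model flags the sample).

```python
from dataclasses import dataclass

# From the article: a sample is flagged if either judge scores it below 7/10.
FLAG_THRESHOLD = 7


@dataclass
class JudgedSample:
    """Hypothetical container for one benchmark sample and its two rubric scores."""
    sample_id: str
    score_a: float  # rubric score from the first judge model (0-10)
    score_b: float  # rubric score from the second judge model (0-10)


def needs_human_review(sample: JudgedSample, threshold: float = FLAG_THRESHOLD) -> bool:
    # "Either model" means the lower of the two scores decides.
    return min(sample.score_a, sample.score_b) < threshold


def split_for_review(samples):
    """Return (passed, flagged); flagged samples would go on to Stage 2."""
    passed = [s for s in samples if not needs_human_review(s)]
    flagged = [s for s in samples if needs_human_review(s)]
    return passed, flagged


samples = [
    JudgedSample("q1", 8, 9),  # both judges at or above threshold
    JudgedSample("q2", 7, 7),  # exactly 7 is not "below 7"
    JudgedSample("q3", 6, 9),  # one low score is enough to flag
]
passed, flagged = split_for_review(samples)
```

Using the minimum of the two scores makes the filter conservative: a single low score sends the sample to human reviewers, at the cost of a larger review queue.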

Quality Issues Found

The validation revealed four categories of systematic problems:

Answer Quality: Mismatched gold indices, factually wrong answers
Text & Formatting: Corrupt text, spelling errors, duplicate samples
Cultural Sensitivity: Stereotype reinforcement, monolithic generalizations
Gold Answer Compliance: Misalignment with evaluation protocols

Discard rates by benchmark:

Benchmark	Total Samples	Discarded	Rate
ArabicMMLU	14,163	436	3.1%
MizanQA	1,769	412	12.3%
PalmX	3,001	25	0.8%
MedAraBench	4,960	33	0.7%
FannOrFlop	6,984	43	0.6%
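
The rates in the table can be rechecked from the counts. The sketch below recomputes each rate as discarded / total and lists rows where the printed rate differs; the function names and the tolerance are illustrative choices, and the figures are copied from the table as printed.

```python
# (benchmark, total samples, discarded, reported rate in %), as printed in the table
ROWS = [
    ("ArabicMMLU", 14163, 436, 3.1),
    ("MizanQA", 1769, 412, 12.3),
    ("PalmX", 3001, 25, 0.8),
    ("MedAraBench", 4960, 33, 0.7),
    ("FannOrFlop", 6984, 43, 0.6),
]


def discard_rate(total: int, discarded: int) -> float:
    """Discard rate as a percentage, rounded to one decimal place."""
    return round(100.0 * discarded / total, 1)


def mismatches(rows, tolerance: float = 0.05):
    """Rows whose reported rate differs from the rate implied by the counts."""
    out = []
    for name, total, discarded, reported in rows:
        computed = discard_rate(total, discarded)
        if abs(computed - reported) > tolerance:
            out.append((name, reported, computed))
    return out


inconsistent = mismatches(ROWS)
```

With the figures as printed, four rows reconcile. The MizanQA row does not: 412 of 1,769 is about 23.3%, not 12.3%, so either the count or the rate appears to be misprinted, and the article alone does not show which.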

Code Benchmark Modifications

For code evaluation, QIMMA refined Arabic problem statements in 3LM's adaptations of HumanEval+ and MBPP+ without changing task logic:

3LM HumanEval+: 145 of 164 prompts modified (88%)
3LM MBPP+: 308 of 378 prompts modified (81%)

Modifications addressed linguistic refinement, clarity improvements, consistency normalization, structural corrections, and semantic refinements.

Coverage

QIMMA covers 7 domains with 14 benchmarks:

Cultural: AraDiCE-Culture, ArabCulture, PalmX
STEM: ArabicMMLU, GAT, 3LM STEM
Legal: ArabLegalQA, MizanQA
Medical: MedArabiQ, MedAraBench
Safety: AraTrust
Poetry & Literature: FannOrFlop
Coding: 3LM HumanEval+, 3LM MBPP+
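
As a data structure, the coverage list is a mapping from domain to benchmarks. The sketch below is only a convenient representation of the list above, not a QIMMA configuration file; the names `COVERAGE`, `summarize`, and `domain_of` are hypothetical.

```python
# Domain -> benchmarks, transcribed from the coverage list in the article.
COVERAGE = {
    "Cultural": ["AraDiCE-Culture", "ArabCulture", "PalmX"],
    "STEM": ["ArabicMMLU", "GAT", "3LM STEM"],
    "Legal": ["ArabLegalQA", "MizanQA"],
    "Medical": ["MedArabiQ", "MedAraBench"],
    "Safety": ["AraTrust"],
    "Poetry & Literature": ["FannOrFlop"],
    "Coding": ["3LM HumanEval+", "3LM MBPP+"],
}


def summarize(coverage):
    """Return (number of domains, number of benchmarks)."""
    return len(coverage), sum(len(benchmarks) for benchmarks in coverage.values())


def domain_of(benchmark: str, coverage=COVERAGE) -> str:
    """Look up the domain a benchmark belongs to; raises KeyError if unknown."""
    for domain, benchmarks in coverage.items():
        if benchmark in benchmarks:
            return domain
    raise KeyError(benchmark)


n_domains, n_benchmarks = summarize(COVERAGE)
```

Counting the entries confirms the stated totals of 7 domains and 14 benchmarks.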

The platform uses the LightEval, EvalPlus, and FannOrFlop frameworks, with metrics including normalized log-likelihood accuracy for multiple-choice questions (MCQ), BERTScore F1 for generative QA, and pass@1 for code.
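
Two of these metrics are simple enough to sketch. The code below is an illustration under stated assumptions, not the LightEval or EvalPlus implementation: it assumes the per-choice log-likelihoods have already been computed by a model, it assumes normalization by the character length of each answer choice (the article does not say which normalization QIMMA uses), and it treats pass@1 as the share of problems whose single generated solution passes all tests. All function and field names are hypothetical.

```python
def pick_choice(choice_logprobs, choices):
    """Index of the choice with the highest length-normalized log-likelihood.

    Assumption: normalize the summed log-probability by character length,
    so that longer answer choices are not penalized simply for being long.
    """
    scores = [lp / len(text) for lp, text in zip(choice_logprobs, choices)]
    return max(range(len(scores)), key=scores.__getitem__)


def normalized_accuracy(items):
    """Share of MCQ items where the normalized pick matches the gold index."""
    correct = sum(
        pick_choice(item["logprobs"], item["choices"]) == item["gold"]
        for item in items
    )
    return correct / len(items)


def pass_at_1(passed_flags):
    """pass@1 with one sample per problem: the fraction of problems solved."""
    return sum(passed_flags) / len(passed_flags)


# Made-up log-likelihoods for illustration only.
items = [
    {"choices": ["yes", "definitely not"], "logprobs": [-3.0, -7.0], "gold": 1},
    {"choices": ["a", "bb"], "logprobs": [-1.0, -4.0], "gold": 1},
]
mcq_accuracy = normalized_accuracy(items)
code_score = pass_at_1([True, False, True, True])
```

In the first item the raw log-likelihood favors "yes" (-3.0 against -7.0), but after dividing by length the longer choice wins (-0.5 against -1.0). That reversal is the reason for normalizing at all.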

What This Means

QIMMA is the first Arabic leaderboard to combine open-source code, predominantly native Arabic content (99%), systematic quality validation, code evaluation, and public per-sample outputs. The validation results indicate that widely used Arabic benchmarks contain systematic quality issues that can corrupt evaluation results, with discard rates ranging from under 1% to 12.3%. Existing Arabic LLM rankings may therefore rest partly on flawed ground-truth data. Validating benchmarks before evaluation sets a new standard for non-English LLM assessment.

Source: huggingface.co

