QIMMA Arabic Leaderboard Discards 3.1% of ArabicMMLU Samples After Quality Validation

TL;DR

TII UAE released QIMMA, an Arabic LLM leaderboard that validates benchmark quality before evaluating models. The validation pipeline, using Qwen3-235B and DeepSeek-V3 plus human review, discarded 3.1% of ArabicMMLU samples and found systematic quality issues across 14 benchmarks.

TII UAE released QIMMA (Arabic for "summit"), an Arabic LLM evaluation platform that validates benchmark quality before running model evaluations. The platform found systematic errors across widely used Arabic benchmarks, discarding 3.1% of ArabicMMLU samples and up to 12.3% of MizanQA questions.

Validation Pipeline Details

QIMMA applies a two-stage validation process to 52,164 samples across 109 benchmark subsets:

Stage 1: Two LLMs (Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B) independently score each sample against a 10-point quality rubric. Samples scoring below 7/10 from either model are flagged.

Stage 2: Native Arabic speakers review flagged samples for cultural context, dialectal nuance, and subtle quality issues.
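Assuming a simple either-judge threshold, the two-stage flow above can be sketched as follows. The judge labels, scores, and the `Sample`/`needs_human_review` helpers are illustrative, not QIMMA's actual code:

```python
from dataclasses import dataclass

THRESHOLD = 7  # per the article: a score below 7/10 from either judge flags the sample

@dataclass
class Sample:
    question: str
    judge_scores: dict  # judge model name -> rubric score out of 10

def needs_human_review(sample: Sample) -> bool:
    """Stage 1: flag the sample if either LLM judge scores it below the threshold."""
    return any(score < THRESHOLD for score in sample.judge_scores.values())

# Hypothetical samples; the real pipeline scores with Qwen3-235B-A22B-Instruct
# and DeepSeek-V3-671B against a 10-point quality rubric.
samples = [
    Sample("q1", {"qwen3": 9, "deepseek": 8}),
    Sample("q2", {"qwen3": 6, "deepseek": 8}),  # one judge below 7 -> flagged
]
flagged = [s for s in samples if needs_human_review(s)]  # Stage 2: native-speaker review
```

Only flagged samples reach the human reviewers, which keeps the expensive native-speaker pass focused on the suspect minority of the 52,164 samples.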

Quality Issues Found

The validation revealed four categories of systematic problems:

  • Answer Quality: Mismatched gold indices, factually wrong answers
  • Text & Formatting: Corrupt text, spelling errors, duplicate samples
  • Cultural Sensitivity: Stereotype reinforcement, monolithic generalizations
  • Gold Answer Compliance: Misalignment with evaluation protocols

Discard rates by benchmark:

Benchmark      Total Samples   Discarded   Rate
ArabicMMLU     14,163          436         3.1%
MizanQA        1,769           412         12.3%
PalmX          3,001           25          0.8%
MedAraBench    4,960           33          0.7%
FannOrFlop     6,984           43          0.6%

Code Benchmark Modifications

For code evaluation, QIMMA refined Arabic problem statements in 3LM's adaptations of HumanEval+ and MBPP+ without changing task logic:

  • 3LM HumanEval+: 145 of 164 prompts modified (88%)
  • 3LM MBPP+: 308 of 378 prompts modified (81%)

The modifications covered linguistic refinement, clarity improvements, consistency normalization, structural corrections, and semantic fixes.

Coverage

QIMMA evaluates 7 domains across 14 benchmarks:

  • Cultural: AraDiCE-Culture, ArabCulture, PalmX
  • STEM: ArabicMMLU, GAT, 3LM STEM
  • Legal: ArabLegalQA, MizanQA
  • Medical: MedArabiQ, MedAraBench
  • Safety: AraTrust
  • Poetry & Literature: FannOrFlop
  • Coding: 3LM HumanEval+, 3LM MBPP+

The platform uses LightEval, EvalPlus, and FannOrFlop frameworks with metrics including normalized log-likelihood accuracy for MCQ, F1 BERTScore for generative QA, and pass@1 for code.
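As a rough illustration of two of these metrics (helper names and values are hypothetical, not QIMMA's implementation): normalized log-likelihood MCQ scoring picks the option whose average per-token log-probability is highest, and pass@1 with a single sample per problem reduces to the fraction of problems whose one generated solution passes all tests:

```python
def pick_mcq_answer(option_logprobs: dict) -> str:
    """MCQ scoring: choose the option with the highest length-normalized
    log-likelihood (sum of per-token log-probs divided by token count)."""
    return max(option_logprobs,
               key=lambda o: sum(option_logprobs[o]) / len(option_logprobs[o]))

def pass_at_1(results: list) -> float:
    """Code scoring: with one sample per problem, pass@1 is simply the
    fraction of problems whose single generated solution passes all tests."""
    return sum(results) / len(results)

# Illustrative per-token log-probs; normalization keeps longer options from
# being penalized merely for having more tokens.
answer = pick_mcq_answer({"A": [-0.2, -0.3], "B": [-0.1, -0.4, -0.9]})
rate = pass_at_1([True, False, True, True])  # 3 of 4 problems solved -> 0.75
```

F1 BERTScore, the generative-QA metric, additionally requires an embedding model and is omitted here for brevity.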

What This Means

QIMMA is the first Arabic leaderboard to combine open-source code, predominantly native Arabic content (99%), systematic quality validation, code evaluation, and public per-sample outputs. The validation results show that widely used Arabic benchmarks contain systematic quality issues that can corrupt evaluation results, with discard rates ranging from near zero to 12.3%. This suggests existing Arabic LLM rankings may rest partly on flawed ground-truth data. By validating benchmarks before evaluating models, the platform sets a new standard for non-English LLM assessment.
