QIMMA Arabic Leaderboard Discards 3.1% of ArabicMMLU Samples After Quality Validation
TII UAE released QIMMA, an Arabic LLM leaderboard that validates benchmark quality before evaluating models. The validation pipeline, using Qwen3-235B and DeepSeek-V3 plus human review, discarded 3.1% of ArabicMMLU samples and found systematic quality issues across 14 benchmarks.
TII UAE released QIMMA (Arabic for "summit"), an Arabic LLM evaluation platform that validates benchmark quality before running model evaluations. The platform found systematic errors across widely used Arabic benchmarks, discarding 3.1% of ArabicMMLU samples and up to 12.3% of MizanQA questions.
Validation Pipeline Details
QIMMA applies a two-stage validation process to 52,164 samples across 109 benchmark subsets:
Stage 1: Two LLMs (Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B) independently score each sample against a 10-point quality rubric. Samples scoring below 7/10 from either model are flagged.
Stage 2: Native Arabic speakers review flagged samples for cultural context, dialectal nuance, and subtle quality issues.
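The Stage 1 flagging rule can be sketched in a few lines of Python. This is a minimal illustration and not QIMMA's actual code: the `JudgeScores` class, its field names, and the helper functions are hypothetical, and the only details taken from the article are the two judge models, the 10-point rubric, and the below-7 threshold.

```python
from dataclasses import dataclass

THRESHOLD = 7  # per the article: a score below 7/10 from either model flags the sample


@dataclass
class JudgeScores:
    """Rubric scores (out of 10) from the two LLM judges for one sample (hypothetical layout)."""
    sample_id: str
    qwen_score: int
    deepseek_score: int


def needs_human_review(scores: JudgeScores, threshold: int = THRESHOLD) -> bool:
    """Return True if either judge scored the sample below the threshold."""
    return scores.qwen_score < threshold or scores.deepseek_score < threshold


def split_by_flag(batch):
    """Partition a batch into (flagged, passed) lists, preserving input order."""
    flagged = [s for s in batch if needs_human_review(s)]
    passed = [s for s in batch if not needs_human_review(s)]
    return flagged, passed
```

Under this sketch, only the flagged list would be forwarded to the Stage 2 reviewers; a sample scored exactly 7 by both models passes.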
Quality Issues Found
The validation revealed four categories of systematic problems:
- Answer Quality: Mismatched gold indices, factually wrong answers
- Text & Formatting: Corrupt text, spelling errors, duplicate samples
- Cultural Sensitivity: Stereotype reinforcement, monolithic generalizations
- Gold Answer Compliance: Misalignment with evaluation protocols
Discard rates by benchmark:
| Benchmark | Total Samples | Discarded | Rate |
|---|---|---|---|
| ArabicMMLU | 14,163 | 436 | 3.1% |
| MizanQA | 1,769 | 412 | 12.3% |
| PalmX | 3,001 | 25 | 0.8% |
| MedAraBench | 4,960 | 33 | 0.7% |
| FannOrFlop | 6,984 | 43 | 0.6% |
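The Rate column is the discarded count divided by the total, expressed as a percentage. A small sketch of that arithmetic, using the ArabicMMLU row as the worked example (the function name `discard_rate` is illustrative, and rounding to one decimal place is an assumption based on how the table is printed):

```python
def discard_rate(discarded: int, total: int) -> float:
    """Return the discard rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * discarded / total, 1)


# ArabicMMLU row: 436 discarded out of 14,163 samples
print(discard_rate(436, 14_163))  # prints 3.1
```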
Code Benchmark Modifications
For code evaluation, QIMMA refined Arabic problem statements in 3LM's adaptations of HumanEval+ and MBPP+ without changing task logic:
- 3LM HumanEval+: 145 of 164 prompts modified (88%)
- 3LM MBPP+: 308 of 378 prompts modified (81%)
Modifications addressed linguistic refinement, clarity improvements, consistency normalization, structural corrections, and semantic refinements.
Coverage
QIMMA covers 14 benchmarks across 7 domains:
- Cultural: AraDiCE-Culture, ArabCulture, PalmX
- STEM: ArabicMMLU, GAT, 3LM STEM
- Legal: ArabLegalQA, MizanQA
- Medical: MedArabiQ, MedAraBench
- Safety: AraTrust
- Poetry & Literature: FannOrFlop
- Coding: 3LM HumanEval+, 3LM MBPP+
The platform uses the LightEval, EvalPlus, and FannOrFlop evaluation frameworks. Metrics include normalized log-likelihood accuracy for multiple-choice questions (MCQ), BERTScore F1 for generative QA, and pass@1 for code.
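To make two of these metrics concrete, here is a hedged Python sketch. The article does not say how the log-likelihoods are normalized, so dividing by the character length of each answer choice is an assumption, and the function names are illustrative rather than the LightEval or EvalPlus API. With one generated solution per problem, pass@1 reduces to the fraction of problems whose solution passes all tests.

```python
from typing import Sequence


def pick_choice(loglikelihoods: Sequence[float], choices: Sequence[str]) -> int:
    """Return the index of the choice with the highest per-character log-likelihood.

    Assumption: normalization divides each choice's summed log-likelihood
    by its character length, so longer answers are not penalized for length.
    """
    normalized = [ll / max(len(c), 1) for ll, c in zip(loglikelihoods, choices)]
    return max(range(len(normalized)), key=normalized.__getitem__)


def mcq_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions where the predicted index matches the gold index."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def pass_at_1(passed: Sequence[bool]) -> float:
    """Fraction of problems whose single generated solution passed every test."""
    if not passed:
        raise ValueError("need at least one problem")
    return sum(passed) / len(passed)
```

Length normalization matters when answer choices differ a lot in length: a raw log-likelihood comparison tends to favor short options, because every additional token lowers the summed score.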
What This Means
QIMMA is the first Arabic leaderboard to combine open-source code, predominantly native Arabic content (99%), systematic quality validation, code evaluation, and public per-sample outputs. The validation results show that widely used Arabic benchmarks contain systematic quality issues that can distort evaluation results, with discard rates ranging from near zero to 12.3%. This suggests that existing Arabic LLM rankings may rest partly on flawed ground-truth data. The platform's approach of validating benchmarks before evaluation sets a new standard for non-English LLM assessment.