evaluation

4 articles tagged with evaluation

May 6, 2026
research · GitHub

GitHub develops dominance analysis method to validate AI coding agent outputs without deterministic correctness

GitHub has published research on validating agentic AI behavior when there's no single "correct" answer. The company proposes dominance analysis as an alternative to brittle scripts or opaque LLM-as-judge approaches for building a trust layer in GitHub Copilot coding agents.

April 21, 2026
benchmark · TII UAE

QIMMA Arabic Leaderboard Discards 3.1% of ArabicMMLU Samples After Quality Validation

TII UAE released QIMMA, an Arabic LLM leaderboard that validates benchmark quality before evaluating models. The validation pipeline, which combines Qwen3-235B and DeepSeek-V3 with human review, discarded 3.1% of ArabicMMLU samples and found systematic quality issues across 14 benchmarks.
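A judge-plus-human pipeline like the one described can be sketched as follows. This is a hypothetical illustration, not TII's actual implementation: two model judges (stand-ins for Qwen3-235B and DeepSeek-V3) flag low-quality samples; unanimous flags are discarded, disagreements are escalated to human review.

```python
# Hypothetical sketch of judge-based benchmark sample validation.
# judge_a / judge_b are toy stand-ins for LLM quality judges.

def judge_a(sample):
    # Stand-in for one LLM judge; returns True if the sample looks broken.
    return "ERROR" in sample["question"]

def judge_b(sample):
    # Stand-in for a second, independent LLM judge.
    return len(sample["question"].strip()) == 0 or "ERROR" in sample["question"]

def validate(samples):
    kept, discarded, needs_review = [], [], []
    for s in samples:
        a, b = judge_a(s), judge_b(s)
        if a and b:
            discarded.append(s)      # both judges flag: drop the sample
        elif a or b:
            needs_review.append(s)   # judges disagree: send to human review
        else:
            kept.append(s)           # both judges accept: keep
    return kept, discarded, needs_review

samples = [
    {"question": "What is the capital of Egypt?", "answer": "Cairo"},
    {"question": "ERROR: untranslated source text", "answer": "?"},
    {"question": "", "answer": "B"},
]
kept, discarded, review = validate(samples)
print(len(kept), len(discarded), len(review))  # → 1 1 1
```

The escalate-on-disagreement rule is the key design choice: it keeps the expensive human reviewers focused on exactly the samples where automated judges are least reliable.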

March 31, 2026
product update · Amazon Web Services

Amazon Bedrock AgentCore Evaluations now generally available for testing AI agents

Amazon Bedrock AgentCore Evaluations, a fully managed service for assessing AI agent performance, is now generally available following its public preview debut at AWS re:Invent 2025. The service addresses a core challenge: LLMs are non-deterministic, so the same user query can produce different tool selections and outputs across runs, making traditional single-pass testing inadequate for reliable agent deployment.
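The non-determinism problem can be made concrete with a small sketch. This is a hypothetical illustration (not the AgentCore API): an agent whose tool choice varies run to run must be evaluated by its pass rate over many trials, since a single pass or fail tells you almost nothing.

```python
import random

def run_agent(query, rng):
    # Toy stand-in for an agent call: the tool chosen varies between runs,
    # and the two tools produce differently formatted answers.
    tool = rng.choice(["search_db", "web_lookup"])
    return {"tool": tool, "answer": "Cairo" if tool == "search_db" else "cairo"}

def pass_rate(query, expected, trials=100, seed=0):
    # Run the same query many times and measure how often it passes,
    # instead of trusting any single run.
    rng = random.Random(seed)
    passes = sum(run_agent(query, rng)["answer"] == expected
                 for _ in range(trials))
    return passes / trials

rate = pass_rate("Capital of Egypt?", "Cairo")
print(f"pass rate over 100 trials: {rate:.2f}")  # roughly 0.5, not 0 or 1
```

A single-pass test here would randomly report either total success or total failure; only the repeated-trial rate exposes that the agent succeeds about half the time.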

February 23, 2026
benchmark · OpenAI

OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions

OpenAI is calling for the retirement of SWE-bench Verified, the widely used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.