evaluation
6 articles tagged with evaluation
AWS releases four multimodal evaluators for image-to-text AI tasks in Strands Evals SDK
AWS has added four multimodal evaluators to its Strands Evals SDK that judge image-to-text AI outputs by directly analyzing source images. The evaluators—Overall Quality, Correctness, Faithfulness, and Instruction Following—use multimodal large language models to detect visual hallucinations, factual errors, and instruction violations that text-only judges miss.
IBM Research launches Open Agent Leaderboard, showing same models achieve different results based on agent architecture
IBM Research has launched the Open Agent Leaderboard, the first open benchmark that evaluates complete AI agent systems rather than just underlying models. The leaderboard reveals that agents using identical models can achieve significantly different success rates and costs depending on system architecture, with failed runs costing 20-54% more than successful ones.
GitHub develops dominance analysis method to validate AI coding agent outputs without deterministic correctness
GitHub has published research on validating agentic AI behavior when there's no single "correct" answer. The company proposes dominance analysis as an alternative to brittle scripts or opaque LLM-as-judge approaches for building a trust layer in GitHub Copilot coding agents.
QIMMA Arabic Leaderboard Discards 3.1% of ArabicMMLU Samples After Quality Validation
TII UAE released QIMMA, an Arabic LLM leaderboard that validates benchmark quality before evaluating models. The validation pipeline, using Qwen3-235B and DeepSeek-V3 plus human review, discarded 3.1% of ArabicMMLU samples and found systematic quality issues across 14 benchmarks.
Amazon Bedrock AgentCore Evaluations now generally available for testing AI agents
Amazon Bedrock AgentCore Evaluations, a fully managed service for assessing AI agent performance, is now generally available following its public preview debut at AWS re:Invent 2025. The service addresses the core challenge that LLMs are non-deterministic—the same user query can produce different tool selections and outputs across runs—making traditional single-pass testing inadequate for reliable agent deployment.
OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions
OpenAI is calling for the retirement of SWE-bench Verified, the widely-used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.