
Frontier AI Models Score Below 50% on First Enterprise IT Benchmark for Kubernetes Incident Response

TL;DR

Artificial Analysis and IBM Research have released ITBench-AA, the first benchmark evaluating AI models on enterprise Site Reliability Engineering tasks. Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%—all frontier models score below 50% on Kubernetes incident response tasks requiring root-cause diagnosis across complex infrastructure.



Artificial Analysis and IBM Research have launched ITBench-AA, the first benchmark series evaluating AI models on agentic enterprise IT tasks. The series starts with Site Reliability Engineering (SRE), where no frontier model exceeds 50% accuracy on Kubernetes incident response tasks.

Benchmark Results

Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. Among open weights models, GLM-5.1 (Reasoning) scores 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) achieves 38%, and Gemma 4 31B (Reasoning) reaches 37%—ahead of Gemini 3.1 Pro Preview at 30%.

The benchmark includes 59 SRE tasks: 40 public and 19 held-out. Each task presents a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. Models must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident.

Methodology and Scoring

Models run in Artificial Analysis's open-source Stirrup reference harness with shell access to sandboxed file systems containing logs and snapshots. Each task allows 100 turns maximum, with 3 repeats per task.

Scoring uses average precision at full recall: models must identify all ground-truth root causes to receive any points. If successful, they score based on precision—the share of submitted entities that are actual root causes (true positives / total submissions). Missing any root cause results in a 0.0 score for that repeat.
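The scoring rule can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the benchmark's actual scorer: the function names are hypothetical, duplicate submissions are assumed to collapse into a set, and the task score is assumed to be the plain mean over the 3 repeats. The second entity in the usage example is invented for illustration.

```python
from statistics import mean


def score_repeat(submitted, ground_truth):
    """Precision at full recall for a single repeat (hypothetical helper)."""
    submitted = set(submitted)        # assumption: duplicates count once
    ground_truth = set(ground_truth)
    # Missing any ground-truth root cause, or submitting nothing, scores 0.0.
    if not submitted or not ground_truth <= submitted:
        return 0.0
    # Full recall reached: true positives / total submissions.
    return len(ground_truth) / len(submitted)


def score_task(repeats, ground_truth):
    """Assumed aggregation: mean of the per-repeat scores."""
    return mean(score_repeat(r, ground_truth) for r in repeats)


truth = {"otel-demo/NetworkPolicy/frontend-block-all-ports"}
repeats = [
    # Exact match: precision 1/1.
    ["otel-demo/NetworkPolicy/frontend-block-all-ports"],
    # Root cause plus one extra (made-up) entity: precision 1/2.
    ["otel-demo/NetworkPolicy/frontend-block-all-ports",
     "otel-demo/Deployment/frontend"],
    # Root cause missed: 0.0 regardless of what else was submitted.
    ["otel-demo/Deployment/frontend"],
]
task_score = score_task(repeats, truth)
```

Under these assumptions, one extra entity halves the score for a repeat, while a single missed root cause zeroes it, so the metric penalizes incomplete diagnoses more heavily than over-reporting.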

Key Findings

Turn counts vary nearly 3x across models, but longer trajectories don't correlate with accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46% accuracy, while Gemini 3.1 Pro Preview averages 83 turns at 30%. Models that over-investigate tend to report upstream fault-injection mechanisms or co-occurring symptoms as root causes, which count as false positives.

In one public task, agents must diagnose user-facing failures by inspecting alerts, traces, and logs to narrow failures to frontend traffic, then use topology and Kubernetes manifests to identify a network policy blocking the frontend. The correct diagnosis identifies the root-cause entity: otel-demo/NetworkPolicy/frontend-block-all-ports.

Cost Analysis

Open weights models occupy the cost frontier. Gemma 4 31B (Reasoning) scores 37% at $0.14 per task, outperforming Gemini 3.1 Pro Preview ($2.23 per task, 30%) on both accuracy and cost. GLM-5.1 (Reasoning) matches Gemini 3.5 Flash (high) at 40% while costing $1.23 versus $1.70 per task. Claude Opus 4.7 leads at 47% but costs $5.38 per task.
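To make the cost frontier concrete, here is a minimal Python sketch that finds the non-dominated models among the five for which this article quotes both an accuracy and a per-task cost. The other models are left out only because no cost is quoted for them. The helper names are hypothetical, and the dominance rule (at least as accurate and at least as cheap, strictly better on one) is a standard Pareto definition rather than anything the benchmark authors specify.

```python
# (model, accuracy in %, cost per task in USD), as quoted in this article.
results = [
    ("Claude Opus 4.7", 47, 5.38),
    ("GLM-5.1 (Reasoning)", 40, 1.23),
    ("Gemini 3.5 Flash (high)", 40, 1.70),
    ("Gemma 4 31B (Reasoning)", 37, 0.14),
    ("Gemini 3.1 Pro Preview", 30, 2.23),
]


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    _, acc_a, cost_a = a
    _, acc_b, cost_b = b
    no_worse = acc_a >= acc_b and cost_a <= cost_b
    strictly_better = acc_a > acc_b or cost_a < cost_b
    return no_worse and strictly_better


def pareto_frontier(rows):
    """Rows not dominated by any other row, sorted from cheapest to priciest."""
    frontier = [r for r in rows if not any(dominates(o, r) for o in rows)]
    return sorted(frontier, key=lambda r: r[2])


frontier_names = [name for name, _, _ in pareto_frontier(results)]
for name, acc, cost in pareto_frontier(results):
    print(f"{name}: {acc}% at ${cost:.2f} per task")
```

On these five data points, the two Gemini models are dominated (Gemini 3.1 Pro Preview by Gemma 4 31B, Gemini 3.5 Flash by GLM-5.1), leaving the two open weights models plus Claude Opus 4.7 at the high-accuracy, high-cost end.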

What This Means

ITBench-AA represents the first systematic evaluation of AI models on real enterprise IT operations tasks, and it reveals a significant capability gap. The sub-50% scores indicate that frontier models still struggle with multi-step diagnostic reasoning across complex distributed systems, a critical limitation for enterprise deployment. Artificial Analysis and IBM Research developed the benchmark over six months, drawing on IBM's enterprise IT expertise, and it will expand to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks. The underlying ITBench dataset and methodology are documented in a February 2025 arXiv paper.
