ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
Original reporting by Hugging Face

Artificial Analysis, in partnership with IBM Research, has released ITBench-AA, a benchmark that tests large language models on complex, agentic enterprise IT tasks. Its first domain is Site Reliability Engineering (SRE): models are placed in realistic Kubernetes incident response scenarios and must diagnose live systems by reading logs, tracing dependencies, and identifying the root-cause entities across the infrastructure.
Initial ITBench-AA SRE results show that even the most advanced frontier models struggle. The top performers, Claude Opus 4.7 at 47% and GPT-5.5 at 46%, both fall short of 50%, which makes this one of the least saturated agentic benchmarks to date and points to a substantial gap between current AI capabilities and the demands of enterprise IT operations.
Precision over verbosity
A counterintuitive finding concerns problem-solving strategy: models with longer, more verbose trajectories did not achieve higher accuracy. Over-investigation often produced false positives, which lowered scores. For instance, Gemini 3.1 Pro Preview averaged 83 turns but scored 30%, while Gemma 4 31B (Reasoning) reached 37% in just 58 turns. Precise diagnosis mattered more than exhaustive investigation. Open-weight models such as GLM-5.1 are also proving to be cost-effective contenders, delivering strong performance at a fraction of the cost.
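To see why extra false positives can drag a score down, here is a minimal sketch. It assumes root-cause predictions are graded with a set-based F1 score; the article does not specify ITBench-AA's actual metric, so the `f1_score` helper and the entity names below are purely hypothetical illustrations.

```python
def f1_score(predicted: set[str], actual: set[str]) -> float:
    """Set-based F1 over entity names (an assumed metric, not ITBench-AA's documented one)."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of reported entities that are correct
    recall = true_positives / len(actual)        # share of real root causes that were reported
    return 2 * precision * recall / (precision + recall)


# Hypothetical incident with a single true root-cause entity.
actual = {"payments-db"}

# A focused agent reports only the real root cause.
focused = {"payments-db"}

# An over-investigating agent reports the real cause plus two false positives.
over_investigated = {"payments-db", "checkout-svc", "ingress-gateway"}

focused_score = f1_score(focused, actual)
over_score = f1_score(over_investigated, actual)
print(f"focused: {focused_score:.2f}, over-investigated: {over_score:.2f}")
```

Under this assumed metric, both agents find the true root cause, but the two extra guesses cut the second agent's score in half (1.00 versus 0.50), which is consistent with the reported pattern that more turns did not mean higher accuracy.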
With ITBench-AA, Artificial Analysis and IBM Research offer a way to evaluate how agentic AI performs on real enterprise IT work, starting with SRE. Because top scores remain below 50%, the benchmark is far from saturated, and the results show a substantial gap between current AI capabilities and the demands of incident response, where precisely identifying root causes across complex Kubernetes infrastructure is paramount. The data also indicates that more investigative turns did not translate into higher accuracy: models that over-investigated were often penalized for false positives. Several open-weight models performed competitively at a fraction of the cost, which suggests a path toward more accessible and efficient AI solutions in this domain.