Printing PressAI

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Original reporting by Hugging Face

Artificial Analysis, in partnership with IBM Research, has released ITBench-AA, a benchmark that tests large language models on agentic enterprise IT tasks. Its first domain is Site Reliability Engineering (SRE): models are placed in realistic Kubernetes incident response scenarios and must diagnose live systems by reading logs, tracing dependencies, and identifying the root-cause entities across the infrastructure.

The initial ITBench-AA SRE results show that even the most advanced frontier models struggle. Top performers such as Claude Opus 4.7 (47%) and GPT-5.5 (46%) stay below 50%, and no model crosses that threshold, making this one of the least saturated agentic benchmarks to date. The results point to a substantial gap between current model capabilities and the demands of enterprise IT operations.

Precision over verbosity

One counterintuitive finding concerns problem-solving strategy: models with longer trajectories did not achieve higher accuracy. Over-investigation often produced false positives, which lowered scores. For instance, Gemini 3.1 Pro Preview averaged 83 turns but scored 30%, while Gemma 4 31B (Reasoning) reached 37% in just 58 turns. The result rewards precise diagnosis over exhaustive exploration. Open-weight models such as GLM-5.1 also emerge as cost-effective contenders, delivering strong performance at a fraction of the cost.
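To make the false-positive penalty concrete, here is a minimal sketch of how a set-based score drops when an agent names extra root-cause entities. The source does not describe how ITBench-AA actually scores answers, so the precision/recall/F1 metric, the function `score_diagnosis`, and the entity names below are all illustrative assumptions, not the benchmark's method.

```python
from __future__ import annotations


def score_diagnosis(predicted: set[str], actual: set[str]) -> dict[str, float]:
    """Hypothetical scorer: compare named root-cause entities with ground truth.

    This is NOT the ITBench-AA scoring rule, only a common set-based metric
    used here to illustrate why extra guesses can hurt.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up incident with a single true root cause.
actual = {"checkout-service"}

# A focused agent names only the true root cause.
focused = score_diagnosis({"checkout-service"}, actual)

# An over-investigating agent also flags two unrelated entities.
over_investigated = score_diagnosis(
    {"checkout-service", "redis-cache", "ingress-gateway"}, actual
)

print(focused)
print(over_investigated)
```

Under this assumed metric both agents find the true root cause, so recall is identical, but the two extra entities cut the second agent's precision to one third and its F1 to one half.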

With ITBench-AA, Artificial Analysis and IBM Research offer a way to measure how well agentic AI handles real enterprise IT work. The first domain, Site Reliability Engineering (SRE), shows that even leading frontier models struggle, with top scores below 50%. That makes the benchmark one of the least saturated of its kind and points to a substantial gap between current model capabilities and the demands of incident response, where root causes must be identified precisely across complex Kubernetes infrastructure. The data also indicate that more investigatory turns do not translate into higher accuracy; models that over-investigate are often penalized for false positives. Several open-weight models perform competitively at a fraction of the cost, which suggests a route to more accessible and efficient AI tooling in this domain.

A Path Forward

These initial findings show both the current limits of AI in complex IT operations and the room for improvement. The benchmark gives developers a yardstick for improving diagnostic reasoning, contextual understanding, and efficient problem-solving in high-stakes enterprise environments. Its planned expansion into Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks signals a broader ambition to bring intelligent agents into other critical operational workflows. ITBench-AA thus points to the specific areas where foundational research and applied work need to converge if AI is to make enterprise IT operations more resilient, secure, and cost-effective while augmenting human expertise.

Intro and outro generated by Printing Press AI from the source article above. Always consult the original reporting for verbatim quotes and primary sources.