NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark
Original reporting by NVIDIA Blog

The rise of agentic AI marks a shift beyond simple conversational interfaces. Where a single-turn chat triggers one large language model (LLM) call, an AI agent works more like a relay team, breaking a complex goal into many steps. It chains together dozens, even hundreds, of LLM calls and tool interactions, from code execution to web browsing, to observe, reason, and act until a task is complete. This multiplicative workload places demands on accelerated computing systems that differ fundamentally from the single-request focus of existing AI inference benchmarks. For companies building and deploying agents at scale, that leaves a critical gap: how to accurately measure and compare infrastructure performance for these multi-step applications.
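To make the multiplicative nature of that workload concrete, here is a minimal Python sketch of an observe-reason-act loop. It is an illustration only: `fake_llm`, `fake_tool`, `AgentRun`, and `run_agent` are hypothetical stand-ins invented for this example, not part of AgentPerf or any NVIDIA software, and the model and tool are simulated so the block runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Counters for one agent task (all names here are illustrative)."""
    llm_calls: int = 0
    tool_calls: int = 0
    history: list = field(default_factory=list)


def fake_llm(history, steps_needed):
    """Stand-in for a model call: requests a tool until enough steps have run."""
    if len(history) < steps_needed:
        return {"action": "tool", "name": "run_code"}
    return {"action": "finish"}


def fake_tool(name):
    """Stand-in for a tool such as code execution or web browsing."""
    return f"result of {name}"


def run_agent(steps_needed, max_steps=500):
    """One LLM call per step, plus a tool call whenever the model asks for
    one, repeated until the model decides the task is done."""
    run = AgentRun()
    for _ in range(max_steps):
        decision = fake_llm(run.history, steps_needed)  # reason
        run.llm_calls += 1
        if decision["action"] == "finish":
            break
        observation = fake_tool(decision["name"])       # act
        run.tool_calls += 1
        run.history.append(observation)                 # observe
    return run


if __name__ == "__main__":
    run = run_agent(steps_needed=40)
    print(run.llm_calls, run.tool_calls)
```

In this toy setup, a single task that needs 40 tool steps already costs 41 model calls, which is why per-request inference numbers say little about how many agents a system can sustain.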
A New Benchmark Emerges
Artificial Analysis has addressed this gap with the launch of AgentPerf, the industry's first dedicated agentic AI benchmark. Built on real-world coding agent trajectories, AgentPerf evaluates how efficiently systems handle these complex workloads. In the first round of published results, the NVIDIA Blackwell Ultra NVL72 platform delivered leading performance. Running DeepSeek V4 Pro, a frontier mixture-of-experts model powering today's most capable agents, the NVIDIA GB300 NVL72 system ran up to 20x more agents per megawatt than NVIDIA Hopper systems. This efficiency gain, which stems from extreme full-stack co-design, gives developers and enterprises a clear view of how much productive work their AI infrastructure can deliver for agentic workloads.
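For readers who want to see how an agents-per-megawatt comparison works arithmetically, the sketch below assumes the metric is simply the number of concurrently running agents divided by system power draw. That definition is an assumption for illustration; the benchmark's actual methodology is not described here. The function names and every number in the example are hypothetical placeholders, not AgentPerf results.

```python
def agents_per_megawatt(concurrent_agents, power_kw):
    """Concurrent agents divided by power draw, with kW converted to MW.
    Assumed definition for illustration; not the benchmark's stated method."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    return concurrent_agents / (power_kw / 1000.0)


def efficiency_ratio(new_system, baseline_system):
    """How many times more agents per megawatt the new system sustains."""
    return new_system / baseline_system


if __name__ == "__main__":
    # Placeholder inputs only, chosen to make the arithmetic easy to follow.
    baseline = agents_per_megawatt(concurrent_agents=50, power_kw=500)
    newer = agents_per_megawatt(concurrent_agents=300, power_kw=250)
    print(baseline, newer, efficiency_ratio(newer, baseline))
```

Normalizing by power rather than by GPU count or rack lets systems of different sizes be compared on the resource that typically limits a data center.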
AgentPerf arrives at a pivotal moment for agentic AI. By benchmarking these multi-step workloads directly, Artificial Analysis has set a standard for evaluating what accelerated computing systems can actually deliver. The lead shown by the NVIDIA Blackwell platform, and the GB300 NVL72 in particular, in concurrent agents per megawatt over previous generations is more than a performance upgrade: it underscores the need for full-stack co-design to meet the demands of chained LLM calls, tool integration, and growing context windows.
Shaping AI's Future
The implications of AgentPerf extend beyond immediate performance metrics. The benchmark gives enterprises and developers the clarity to make informed infrastructure investments, so that AI deployments are not just powerful but also efficient and scalable in real-world use. It shifts the focus from theoretical LLM throughput to the practical execution of complex tasks, fostering a more mature and accountable ecosystem for agentic AI. Looking ahead, a shared standard should accelerate innovation by pushing hardware and software developers alike to optimize for agentic workloads. The ability to reliably measure and compare systems will help unlock new applications, from autonomous assistants to enterprise automation, changing how businesses operate and how people interact with AI. As the NVIDIA Vera Rubin architecture and ongoing optimizations continue to push boundaries, AgentPerf will remain a useful compass for the evolution of AI toward increasingly capable agents.