NVIDIA AgentPerf Benchmark Shows 20x Blackwell Speedup

What AgentPerf Measures

NVIDIA and Artificial Analysis jointly developed AgentPerf, the first benchmark designed specifically for agentic AI workloads. Its defining feature is that the primary metric is “agents per megawatt” rather than the conventional tokens per second. Power efficiency has become a leading criterion in infrastructure selection because power budgets and rack capacity in AI data centers are increasingly constrained. For an enterprise running many agents concurrently, the number of agents each megawatt can sustain feeds directly into total cost of ownership (TCO).
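As a rough illustration of the metric, the sketch below normalizes a concurrent agent count to one megawatt. The function names and the sample figures (300 agents, 120 kW) are assumptions made up for this example, not AgentPerf's published methodology or measured data.

```python
def agents_per_megawatt(concurrent_agents: int, power_kw: float) -> float:
    """Concurrent agents sustained at the target speed, normalized to 1 MW."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    # 1 MW = 1000 kW, so scale the agent count by 1000 / power_kw.
    return concurrent_agents * 1000.0 / power_kw


def agents_within_budget(agents_per_mw: float, budget_mw: float) -> int:
    """Whole agents a fixed power budget can host at that efficiency."""
    return int(agents_per_mw * budget_mw)


# Hypothetical system: 300 concurrent agents while drawing 120 kW.
density = agents_per_megawatt(300, 120.0)
print(density)                             # 2500.0
print(agents_within_budget(density, 2.5))  # 6250
```

Because the power draw is the denominator, two systems with the same raw throughput can differ widely on this metric if one of them draws more power.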

Table of Contents

What AgentPerf Measures
Blackwell GB300 NVL72 Performance
Benchmark Workload Details
Infrastructure Impact
Hopper vs Blackwell Comparison
FAQ
Summary

Blackwell GB300 NVL72 Performance

The Blackwell GB300 NVL72 sustained up to 20 times more agents per megawatt than the Hopper H200. The figure appears in both the official NVIDIA blog and the data measured by Artificial Analysis. The GB300 NVL72 is a 72-GPU rack-scale system, and its power-normalized gains come largely from disaggregated prefill/decode serving and TensorRT-LLM optimizations, with KV cache reuse and speculative decoding also contributing.

Benchmark Workload Details

The benchmark workload consists of real-world coding-agent trajectories. Running DeepSeek V4 Pro, a mixture-of-experts (MoE) model, it replays multi-turn tool-calling sessions of more than 200 turns and more than 100K context tokens. Unlike conventional single-shot inference, an agentic workload chains dozens to hundreds of LLM calls, which makes KV cache efficiency and disaggregated inference particularly important.
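To see why KV cache efficiency matters so much for long trajectories, the toy model below counts how many prompt tokens must be processed over a whole session, with and without reuse of the cached prefix. It is a simplified sketch under stated assumptions: every turn appends a fixed number of tokens, the cache is never evicted, and the function name and the 500-token figure are hypothetical rather than taken from the benchmark.

```python
def prefill_tokens(new_tokens_per_turn: list[int], reuse_prefix: bool) -> int:
    """Total prompt tokens processed across one agent trajectory."""
    total = 0
    context = 0
    for new_tokens in new_tokens_per_turn:
        context += new_tokens
        # With prefix reuse only the newly appended tokens are prefilled;
        # without it the whole accumulated context is processed again.
        total += new_tokens if reuse_prefix else context
    return total


# Hypothetical trajectory: 200 turns, 500 new tokens each (100K final context).
turns = [500] * 200
print(prefill_tokens(turns, reuse_prefix=False))  # 10050000
print(prefill_tokens(turns, reuse_prefix=True))   # 100000
```

In this toy case, reprocessing the full context on every turn costs roughly 100 times more prefill work than reusing the cached prefix, and the gap widens as the number of turns grows.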

Infrastructure Impact

Agents per megawatt has significant implications for enterprise infrastructure. In power-cost terms, Blackwell’s advantage is most pronounced in rack-scale deployments. The benchmark defines two service-level tiers, 20 tokens/s and 60 tokens/s, so readers can refer to the tier that matches their workload. Support for 1M-token contexts is planned.
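A power cost calculation of this kind can be sketched as follows. The electricity price and the baseline of 100 agents per megawatt are placeholder assumptions chosen for round numbers, not measured values, and the function covers electricity only (no hardware, cooling, or facility costs).

```python
def power_cost_per_agent_hour(price_per_kwh: float, agents_per_mw: float) -> float:
    """Electricity cost of running one agent for one hour."""
    if agents_per_mw <= 0:
        raise ValueError("agents_per_mw must be positive")
    # 1 MW drawn for one hour is 1000 kWh.
    return price_per_kwh * 1000.0 / agents_per_mw


PRICE = 0.10      # assumed electricity price in $/kWh
BASELINE = 100.0  # assumed baseline agents per MW (not a measured value)

baseline_cost = power_cost_per_agent_hour(PRICE, BASELINE)
improved_cost = power_cost_per_agent_hour(PRICE, BASELINE * 20)  # "up to 20x" case
print(f"{baseline_cost:.2f} -> {improved_cost:.2f} $/agent-hour")
# prints "1.00 -> 0.05 $/agent-hour"
```

Since cost per agent is inversely proportional to agents per megawatt, a 20x gain in the metric corresponds to one twentieth of the electricity cost per agent, whatever the actual price.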

Hopper vs Blackwell Comparison

The performance comparison between Hopper and Blackwell is as follows.

Item	Hopper H200	Blackwell GB300 NVL72	Improvement
Agents per MW (20 tok/s)	Baseline	20x	20x
Agents per MW (60 tok/s)	Baseline	~18x	~18x
Rack-scale efficiency	Low	High (NVL72)	Significant
KV cache efficiency	Standard	Optimized	Improved

FAQ

Here are answers to frequently asked questions.

Q: Which model was AgentPerf measured on?

AgentPerf was run primarily on DeepSeek V4 Pro (MoE) and validated with real coding workloads. The evaluation focused on multi-turn tool calls and on behavior as the context grows.

Q: Is the 20x figure sustainable?

The result is a snapshot. Further improvements are possible as optimizations progress, so treat any published figure as one that can go out of date. The benchmark is scheduled to be updated regularly.

Q: What precautions should enterprises take when referencing AgentPerf?

The benchmark is measured at specific SLO tiers, so judge the results against your own workload’s token-speed requirements. Differences between rack-scale environments and the cloud should also be taken into account.
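One simple way to map a workload onto the published tiers is to pick the slowest tier that still meets the required generation speed. The helper below is a hypothetical sketch of that rule, using only the two tier values named in this article.

```python
from typing import Optional

SLO_TIERS_TOK_S = (20, 60)  # the two tiers named in this article


def matching_tier(required_tok_s: float) -> Optional[int]:
    """Slowest published tier that still meets the required speed, else None."""
    for tier in sorted(SLO_TIERS_TOK_S):
        if tier >= required_tok_s:
            return tier
    return None


print(matching_tier(15))  # 20
print(matching_tier(45))  # 60
print(matching_tier(80))  # None
```

A `None` result means neither tier covers the requirement, in which case the published figures do not describe that workload directly.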

Q: Does Blackwell’s power efficiency also affect cloud providers?

Yes. Providers such as Together AI and DeepInfra are already adopting Blackwell, and power efficiency feeds directly into their cost competitiveness. Better power efficiency can translate into lower prices.

Q: Are additional metrics planned for future AgentPerf?

TCO-related metrics such as agents per $/hr, along with tool execution performance, are planned additions. Support for 1M-token contexts and additional models is also scheduled.

Summary

AgentPerf sets out a new yardstick for choosing agentic AI infrastructure. Blackwell’s result of up to 20x more agents per megawatt matters in an era that prioritizes power efficiency. Enterprises weighing adoption should consider their own workload characteristics and refer to the SLO tier that matches them.

Author

krona23

Over 20 years in the IT industry, serving as Division Head and CTO at multiple companies running large-scale web services in Japan. Experienced across Windows, iOS, Android, and web development. Currently focused on AI-native transformation. At DevGENT, sharing practical guides on AI code editors, automation tools, and LLMs in three languages.
