What AgentPerf Measures
NVIDIA and Artificial Analysis jointly developed the AgentPerf benchmark, the first dedicated evaluation metric specialized for agentic AI workloads. The biggest feature is adopting “agents per megawatt” as the primary indicator instead of the conventional tokens per second. The reason power efficiency has become the most important criterion in infrastructure selection is that power costs and rack capacity in AI data centers are rapidly becoming constrained. When enterprises operate large numbers of agents simultaneously, the number of agents per megawatt directly impacts TCO.
📑Table of Contents
Blackwell GB300 NVL72 Performance
Blackwell GB300 NVL72 achieved up to 20 times more agents per megawatt compared to Hopper H200. This result has been confirmed in both the official NVIDIA blog and Artificial Analysis measured data. GB300 NVL72 adopts a 72-GPU rack-scale configuration, significantly improving power-normalized performance through disaggregated prefill/decode and TensorRT-LLM optimizations. KV cache reuse and speculative decoding also contribute.
Benchmark Workload Details
The workload used in the benchmark consists of real-world coding agent trajectories. Using DeepSeek V4 Pro (MoE), it reproduces multi-turn tool calls with over 200 turns and over 100K context tokens. Unlike conventional single-shot inference, agentic workloads involve dozens to hundreds of chained LLM calls, making KV cache efficiency and disaggregated inference particularly important.
Infrastructure Impact
The impact of Agents per Megawatt on enterprise infrastructure is significant. Power cost calculations show Blackwell’s advantage is pronounced in rack-scale deployments. Two Service Level Tiers are defined: 20 tokens/s and 60 tokens/s, allowing selection based on your workload requirements. Plans for 1M token context support are also in place for the future.
Hopper vs Blackwell Comparison
The performance comparison between Hopper and Blackwell is as follows.
| Item | Hopper H200 | Blackwell GB300 NVL72 | Improvement |
|---|---|---|---|
| Agents per MW (20 tok/s) | Baseline | 20x | 20x |
| Agents per MW (60 tok/s) | Baseline | ~18x | ~18x |
| Rack-scale efficiency | Low | High (NVL72) | Significant |
| KV cache efficiency | Standard | Optimized | Improved |
FAQ
Here are answers to frequently asked questions.
Related articles:
- AI Agentjacking攻撃:Sentry MCP経由でClaude Code・Cursor・Codexが乗っ取り被害
- 東急「車内コンセントでモバイルバッテリー充電しないで」 注意喚起を更新
- Databricks、AI Agent向けメタハーネス「Omnigent」をオープンソース公開 — Claude Code / Codex横断でmulti-agent制御
Summary
In summary, AgentPerf has established a new standard for agentic AI infrastructure selection. Blackwell’s 20x performance holds great significance in an era that prioritizes power efficiency. Enterprises should consider adoption while taking into account their workload characteristics and referencing the SLO tiers.
Author
krona23
Over 20 years in the IT industry, serving as Division Head and CTO at multiple companies running large-scale web services in Japan. Experienced across Windows, iOS, Android, and web development. Currently focused on AI-native transformation. At DevGENT, sharing practical guides on AI code editors, automation tools, and LLMs in three languages.







Leave a Reply