What is the AgentPerf Benchmark?
NVIDIA, in collaboration with Artificial Analysis, has released AgentPerf — the first public benchmark specifically designed for agentic AI workloads. Unlike traditional LLM benchmarks, it uses real multi-turn agent trajectories with tool calls and growing context to measure the maximum number of concurrent agents per megawatt.
📑Table of Contents
From my operational experience, “agents per MW” is the most critical metric for production AI agents. After shifting from MCP to CLI for better context efficiency, I have seen firsthand how power-efficient infrastructure directly impacts development productivity.
NVIDIA Blackwell GB300 Performance Results
The Blackwell GB300 NVL72 achieved up to 20× higher agents/MW compared to Hopper HGX H200. Measurements used DeepSeek V4 Pro (MoE) with realistic coding agent trajectories across 12+ languages (average 27K tokens, up to 200 turns).
| Tier | Output Speed (P25) | P95 TTFT |
|---|---|---|
| 1 | 20 tokens/s | ≤10s |
| 2 | 60 tokens/s | ≤5s |
| 3 | 180 tokens/s | ≤3s |
Source: Artificial Analysis (June 2026)
Rack-scale NVL72 configuration and TensorRT-LLM disaggregated prefill/decode optimizations contributed significantly to the power efficiency gains. My own experience confirms that full-stack co-design dramatically increases concurrent agent capacity.
Impact on Enterprise Agentic AI Infrastructure Selection
When enterprises select infrastructure for AI coding agents, the new “agents per MW” metric becomes essential alongside traditional tokens/s and cost/token. Adopting Blackwell allows running more agents within the same power budget, directly boosting team productivity.
In my experience operating agents with Zed and external service integrations, choosing power-efficient hardware while maintaining high throughput has proven key to long-term cost control.
Why Agents per MW Matters
Production agentic workloads involve hundreds of LLM calls per task. Traditional single-shot inference benchmarks fail to capture power efficiency, which directly affects TCO. AgentPerf is the first benchmark to quantify this critical dimension.
Limitations and Caveats
While AgentPerf uses realistic trajectories, it may not perfectly match every agent framework. Results vary significantly based on chosen SLO tiers. Enterprises should re-evaluate using their own SLO definitions.
Frequently Asked Questions
Conclusion
The release of AgentPerf marks a shift in how agentic AI infrastructure is evaluated. Blackwell GB300 demonstrates overwhelming power efficiency advantages. Enterprises should actively assess it for future agentic workload expansion.
From my experience, combining CLI-centric agent operations with power-efficient hardware is the key to sustainable long-term development productivity.
For full details, refer to the NVIDIA official blog and Artificial Analysis website.
Related articles: Cursor Bugbotが3倍高速化・新/reviewコマンド追加 — コードレビューが90秒に短縮、Claude Opus 4.8 リリース:Claude CodeのDynamic Workflowsと高速・低コスト化を解説、Codex app 26.609:リセット貯金・Developer mode・Browser Use高速化が追加。
Author
krona23
Over 20 years in the IT industry, serving as Division Head and CTO at multiple companies running large-scale web services in Japan. Experienced across Windows, iOS, Android, and web development. Currently focused on AI-native transformation. At DevGENT, sharing practical guides on AI code editors, automation tools, and LLMs in three languages.







Leave a Reply