What is the AgentPerf Benchmark?

NVIDIA, in collaboration with Artificial Analysis, has released AgentPerf, the first public benchmark specifically designed for agentic AI workloads. Unlike traditional LLM benchmarks, it uses real multi-turn agent trajectories with tool calls and growing context, and it reports the maximum number of concurrent agents a system can sustain per megawatt (agents/MW).

📑Table of Contents
  1. What is the AgentPerf Benchmark?
  2. NVIDIA Blackwell GB300 Performance Results
  3. Impact on Enterprise Agentic AI Infrastructure Selection
  4. Why Agents per MW Matters
  5. Limitations and Caveats
  6. Frequently Asked Questions
  7. Conclusion

From my operational experience, “agents per MW” is the most critical metric for production AI agents. After shifting from MCP to CLI for better context efficiency, I have seen firsthand how power-efficient infrastructure directly impacts development productivity.


NVIDIA Blackwell GB300 Performance Results

The Blackwell GB300 NVL72 achieved up to 20× higher agents/MW compared to Hopper HGX H200. Measurements used DeepSeek V4 Pro (MoE) with realistic coding agent trajectories across 12+ languages (average 27K tokens, up to 200 turns).

Tier   Output Speed (P25)   P95 TTFT
1      20 tokens/s          ≤10 s
2      60 tokens/s          ≤5 s
3      180 tokens/s         ≤3 s

Source: Artificial Analysis (June 2026)
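
To make the tiers concrete, here is a minimal Python sketch that encodes the three thresholds from the table and reports the strictest tier a measurement satisfies. It assumes that the P25 output speed is a lower bound and the P95 TTFT is an upper bound. The names SloTier and strictest_tier_met are hypothetical and are not part of AgentPerf.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SloTier:
    tier: int
    min_output_speed_p25: float  # tokens/s, treated as a lower bound (assumption)
    max_ttft_p95: float          # seconds, treated as an upper bound (assumption)


# Thresholds copied from the table above.
TIERS = [
    SloTier(1, 20.0, 10.0),
    SloTier(2, 60.0, 5.0),
    SloTier(3, 180.0, 3.0),
]


def strictest_tier_met(output_speed_p25: float, ttft_p95: float) -> Optional[int]:
    """Return the highest tier whose two thresholds are both satisfied, or None."""
    met = [
        t.tier
        for t in TIERS
        if output_speed_p25 >= t.min_output_speed_p25 and ttft_p95 <= t.max_ttft_p95
    ]
    return max(met) if met else None


if __name__ == "__main__":
    # Illustrative measurement, not a benchmark result.
    print(strictest_tier_met(output_speed_p25=75.0, ttft_p95=4.0))
```

A system that is fast on output but slow on first token drops to a lower tier, which is why both thresholds are checked together.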

Rack-scale NVL72 configuration and TensorRT-LLM disaggregated prefill/decode optimizations contributed significantly to the power efficiency gains. My own experience confirms that full-stack co-design dramatically increases concurrent agent capacity.


Impact on Enterprise Agentic AI Infrastructure Selection

When enterprises select infrastructure for AI coding agents, the new “agents per MW” metric becomes essential alongside traditional tokens/s and cost/token. Adopting Blackwell allows running more agents within the same power budget, directly boosting team productivity.

In my experience operating agents with Zed and external service integrations, choosing power-efficient hardware while maintaining high throughput has proven key to long-term cost control.


Why Agents per MW Matters

Production agentic workloads involve hundreds of LLM calls per task. Traditional single-shot inference benchmarks evaluate isolated requests, so they fail to capture power efficiency under that sustained load, which directly affects total cost of ownership (TCO). AgentPerf is the first benchmark to quantify this critical dimension.
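
The arithmetic behind the metric is simple, and a short Python sketch may help. It assumes agents/MW is the number of concurrent agents sustained at a given SLO tier divided by the system's power draw in megawatts. The function names and all numbers are hypothetical illustrations, not AgentPerf results.

```python
import math


def agents_per_mw(concurrent_agents: int, power_kw: float) -> float:
    """Concurrent agents sustained per megawatt of power draw."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    return concurrent_agents / (power_kw / 1000.0)


def agents_within_budget(agents_per_mw_value: float, budget_mw: float) -> int:
    """Whole number of agents that fit in a power budget at a given efficiency."""
    return math.floor(agents_per_mw_value * budget_mw)


if __name__ == "__main__":
    # Hypothetical system: 500 concurrent agents at 125 kW.
    efficiency = agents_per_mw(500, 125.0)
    print(efficiency)
    # Hypothetical facility budget of 2.5 MW.
    print(agents_within_budget(efficiency, 2.5))
```

Because the power budget of a facility is usually fixed, any gain in agents/MW translates directly into more agents served from the same site.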


Limitations and Caveats

While AgentPerf uses realistic trajectories, it may not perfectly match every agent framework. Results vary significantly based on chosen SLO tiers. Enterprises should re-evaluate using their own SLO definitions.
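
For teams that want to re-evaluate against their own SLOs, the following Python sketch computes a P25 output speed and a P95 TTFT from per-request samples and checks them against custom thresholds. It uses the nearest-rank percentile definition, which is an assumption; the source does not say which percentile method AgentPerf uses. The function names are hypothetical.

```python
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def meets_slo(
    output_speeds: Sequence[float],  # tokens/s per request
    ttfts: Sequence[float],          # seconds per request
    min_speed_p25: float,
    max_ttft_p95: float,
) -> bool:
    """True if P25 output speed and P95 TTFT both satisfy the custom thresholds."""
    return (
        percentile(output_speeds, 25) >= min_speed_p25
        and percentile(ttfts, 95) <= max_ttft_p95
    )
```

Passing your own measurements and thresholds to meets_slo gives a yes/no answer per configuration, which can then be repeated at increasing agent concurrency to find the point where the SLO breaks.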


Frequently Asked Questions

Which model was used for AgentPerf measurements?

Primarily DeepSeek V4 Pro (MoE). Additional models such as gpt-oss-120b are planned for future updates.

How many times more efficient is Blackwell vs Hopper?

Up to 20× agents/MW improvement was recorded, driven by rack-scale design and optimizations.

How should enterprises use these results?

Define your agentic workload SLOs and evaluate rack-scale systems like Blackwell NVL72 to maximize agents within your power budget.

Is the benchmark reproducible?

It uses a private test set and is open for vendor submissions. Live results are available at artificialanalysis.ai/benchmarks/hardware.

How does it differ from other benchmarks?

The key difference is the use of real multi-turn agent trajectories with realistic tool-call delays, rather than single-shot inference.
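
As a rough illustration of why multi-turn trajectories load a system differently, the Python sketch below tracks how the context grows when each turn's model output and tool result are appended before the next call. This is a simplified model built on stated assumptions, not the AgentPerf harness, and the token counts are hypothetical.

```python
from typing import List, Sequence, Tuple


def context_sizes(initial_tokens: int, turns: Sequence[Tuple[int, int]]) -> List[int]:
    """Context length seen at the start of each LLM call.

    Each turn is (output_tokens, tool_result_tokens); both are appended
    to the context before the next call (assumption of this sketch).
    """
    sizes = []
    context = initial_tokens
    for output_tokens, tool_result_tokens in turns:
        sizes.append(context)
        context += output_tokens + tool_result_tokens
    return sizes


if __name__ == "__main__":
    # Hypothetical trajectory: 2,000-token prompt, three turns that each add
    # 300 output tokens and a 700-token tool result.
    sizes = context_sizes(2000, [(300, 700)] * 3)
    print(sizes)       # context per call
    print(sum(sizes))  # total input tokens presented across all calls
```

A single-shot benchmark would only ever see the first entry in that list, while an agent run presents every later, larger context as well.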

What is an SLO?

A Service Level Objective (SLO) is a performance target. In AgentPerf, each SLO sets thresholds for time to first token (TTFT) and output speed, and the benchmark defines three tiers of increasing strictness.


Conclusion

The release of AgentPerf marks a shift in how agentic AI infrastructure is evaluated. Blackwell GB300 NVL72 recorded up to 20× higher agents/MW than Hopper HGX H200, a substantial power efficiency advantage. Enterprises planning to expand their agentic workloads should actively assess it.

From my experience, combining CLI-centric agent operations with power-efficient hardware is the key to sustainable long-term development productivity.

For full details, refer to the NVIDIA official blog and Artificial Analysis website.


Author: krona23

Over 20 years in the IT industry, serving as Division Head and CTO at multiple companies running large-scale web services in Japan. Experienced across Windows, iOS, Android, and web development. Currently focused on AI-native transformation. At DevGENT, sharing practical guides on AI code editors, automation tools, and LLMs in three languages.
