Arbor: Hypothesis-Tree AI Optimization Framework Beats Claude Code & Codex by 2.5x [2026]

Arbor is a long-horizon autonomous optimization framework developed by Renmin University of China and Microsoft Research. At its core is Hypothesis Tree Refinement (HTR), a method that maintains a persistent hypothesis tree so that AI agents can conduct research and optimization autonomously. Under the same compute budget, it achieves average held-out performance gains over 2.5x those of Claude Code and Codex.

📑Table of Contents

What is Arbor? The Basics of Hypothesis Tree Refinement (HTR)
Coordinator and Executor Role Separation and Worktree Utilization
Accumulating and Propagating Failures with the Persistent Hypothesis Tree
Benchmark Results: Comparison with Claude Code/Codex and 2.5x Gains
BrowseComp Task Improvement Example (45.33% → 67.67%)
GitHub Repository and arXiv Paper Public Content
Practical Implications for AI Agent Developers
Frequently Asked Questions (FAQ)
Comparison Table
Summary

What is Arbor? The Basics of Hypothesis Tree Refinement (HTR)

The heart of Arbor is the Hypothesis Tree Refinement (HTR) method. It manages hypotheses persistently in a tree structure, accumulating experimental results and evidence and refining the hypotheses as that evidence comes in. Rather than executing one-off tasks, it updates and propagates hypotheses over the long term to enhance AI agent autonomy. The arXiv paper (https://arxiv.org/abs/2606.11926) demonstrates its effectiveness on Autonomous Optimization tasks such as model training and data synthesis.

HTR’s advantage lies in linking each hypothesis to the artifacts that tested it and distilling the resulting insights for future use. This addresses two limitations of existing tools: context loss and reward hacking.
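To make the idea of a persistent hypothesis tree concrete, here is a minimal Python sketch. It is not Arbor's actual data model: the `Hypothesis` class, its field names, and the JSON file format are all assumptions chosen for illustration. It shows only the two properties described above, namely that each node links a claim to evidence and artifacts, and that the tree survives between sessions.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class Hypothesis:
    """One node in a hypothesis tree (all names here are illustrative)."""
    claim: str
    status: str = "open"  # "open", "supported", or "refuted"
    evidence: list[str] = field(default_factory=list)   # notes from experiments
    artifacts: list[str] = field(default_factory=list)  # e.g. run paths or commit ids
    children: list[Hypothesis] = field(default_factory=list)

    def refine(self, claim: str) -> Hypothesis:
        """Attach a more specific child hypothesis and return it."""
        child = Hypothesis(claim)
        self.children.append(child)
        return child

    def record(self, note: str, artifact: str, supported: bool) -> None:
        """Link an experimental result and its artifact to this node."""
        self.evidence.append(note)
        self.artifacts.append(artifact)
        self.status = "supported" if supported else "refuted"


def save(root: Hypothesis, path: str) -> None:
    """Persist the whole tree as JSON so a later session can resume from it."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(root), f, indent=2)


def load(path: str) -> Hypothesis:
    """Rebuild the tree from the JSON written by save()."""
    def build(d: dict) -> Hypothesis:
        kids = [build(c) for c in d.pop("children")]
        return Hypothesis(children=kids, **d)

    with open(path, encoding="utf-8") as f:
        return build(json.load(f))
```

Because refuted nodes stay in the saved tree along with their evidence, a later session can see what was already tried instead of rediscovering it.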

Coordinator and Executor Role Separation and Worktree Utilization

Arbor clearly separates two roles: Coordinator and Executor. The Coordinator handles long-term strategy and manages the overall hypothesis tree. The Executor runs experiments in isolated worktree environments and safely feeds back results. This separation prevents experimental failures from affecting the entire system, enabling stable operation.
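The separation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Arbor's implementation: `Result`, `execute`, `coordinate`, and `worktree_add_cmd` are hypothetical names, and the experiment is any callable you supply. When no repository is given, a temporary directory stands in for the worktree so the sketch runs anywhere. When one is given, the standard `git worktree add` and `git worktree remove` commands create and clean up the isolated checkout.

```python
from __future__ import annotations

import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Result:
    hypothesis: str
    ok: bool
    detail: str


def worktree_add_cmd(repo: Path, path: Path, branch: str) -> list[str]:
    """argv that creates an isolated checkout of `repo` on a new branch."""
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def execute(hypothesis: str, experiment: Callable[[str, Path], str],
            repo: Path | None = None, branch: str = "exp") -> Result:
    """Executor: run one experiment in its own directory and report back."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "wt"
        try:
            if repo is None:
                workdir.mkdir()  # stand-in for a worktree
            else:
                subprocess.run(worktree_add_cmd(repo, workdir, branch), check=True)
            return Result(hypothesis, True, experiment(hypothesis, workdir))
        except Exception as exc:
            # A failed experiment becomes data for the coordinator, not a crash.
            return Result(hypothesis, False, f"{type(exc).__name__}: {exc}")
        finally:
            if repo is not None:
                subprocess.run(["git", "-C", str(repo), "worktree", "remove",
                                "--force", str(workdir)], check=False)


def coordinate(hypotheses: list[str],
               experiment: Callable[[str, Path], str]) -> list[Result]:
    """Coordinator: dispatch every hypothesis and collect every outcome."""
    return [execute(h, experiment) for h in hypotheses]
```

The design point is that the Executor always returns a `Result`, so one diverging run cannot take down the loop that manages the tree.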

The GitHub repository (https://github.com/RUC-NLPIR/Arbor) publishes examples using worktrees, and setup procedures can be verified there. The project page (https://ruc-nlpir.github.io/Arbor/) is also useful.

Accumulating and Propagating Failures with the Persistent Hypothesis Tree

Arbor’s strength is treating failures not as mere errors to discard but as constraints to accumulate and propagate. The hypothesis tree is persistent, so past failures and evidence influence future decisions. This allows AI agents to adjust strategies more intelligently and achieve long-term optimization.
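One way to picture failures acting as constraints is the sketch below. It is an assumption-laden illustration rather than the paper's algorithm: the `Node` class, the choice to store a lesson on the failing node's parent, and the tag-matching rule in `allowed` are all hypothetical. A failed branch leaves a lesson behind, later proposals in the same part of the tree inherit it, and proposals that repeat the known-bad choice are filtered out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    claim: str
    # parent is excluded from repr and comparison to avoid parent/child cycles
    parent: Node | None = field(default=None, repr=False, compare=False)
    constraints: set[str] = field(default_factory=set)  # lessons recorded here
    children: list[Node] = field(default_factory=list)

    def add_child(self, claim: str) -> Node:
        child = Node(claim, parent=self)
        self.children.append(child)
        return child

    def fail(self, lesson: str) -> None:
        """Record the lesson on the parent so sibling branches inherit it."""
        scope = self.parent or self
        scope.constraints.add(lesson)

    def active_constraints(self) -> set[str]:
        """Union of the constraints on this node and all of its ancestors."""
        seen: set[str] = set()
        node: Node | None = self
        while node is not None:
            seen |= node.constraints
            node = node.parent
        return seen


def allowed(node: Node, proposal_tags: set[str]) -> bool:
    """A proposal is allowed only if it avoids every known-bad tag."""
    return not (proposal_tags & node.active_constraints())
```

For example, if a child of "Tune learning rate" fails with the lesson `"lr>=1e-2"`, any later sibling proposal tagged `"lr>=1e-2"` is rejected, while proposals in unrelated branches are unaffected.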

VentureBeat coverage also highlights this mechanism as the reason it surpasses Claude Code and Codex. Utilizing failures differentiates it from conventional approaches where insights dissipate easily.

Benchmark Results: Comparison with Claude Code/Codex and 2.5x Gains

Benchmarks show Arbor’s clear superiority over Claude Code and Codex. Under the same compute budget, it recorded average held-out performance gains of over 2.5x. On MLE-Bench Lite, it achieved 86.36% Any Medal with GPT-5.5, the highest among comparators.
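As a sanity check on the 86.36% figure, the short Python sketch below asks which medal counts would round to that percentage. The competition count of 22 is an assumption about the size of the MLE-Bench Lite split that is not stated in this article, and `medal_counts` is a hypothetical helper.

```python
def medal_counts(rate_percent: float, n_competitions: int) -> list[int]:
    """Return every medal count k for which k / n rounds to the reported rate."""
    return [
        k
        for k in range(n_competitions + 1)
        if round(100 * k / n_competitions, 2) == rate_percent
    ]


# Assumption: the Lite split has 22 competitions.
print(medal_counts(86.36, 22))  # [19]
```

Under that assumption, the reported rate corresponds to medals in 19 of 22 competitions.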

See the Comparison Table section below. Arbor excels in long-term strategy management and failure utilization.

BrowseComp Task Improvement Example (45.33% → 67.67%)

On the BrowseComp task, the score rose from a 45.33% baseline to 67.67%, a gain of 22.34 percentage points. This task involves complex optimization with web browsing, where HTR’s hypothesis management proved effective. The result is backed by detailed experimental results in the arXiv paper.
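The two BrowseComp figures can be read as either an absolute or a relative improvement. The small Python sketch below computes both from the numbers quoted above; `gain` is a hypothetical helper name.

```python
def gain(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = absolute / baseline * 100
    return absolute, relative


absolute, relative = gain(45.33, 67.67)
print(f"{absolute:.2f} points, {relative:.1f}% relative")
# 22.34 points, 49.3% relative
```

Keeping the two readings apart matters when comparing against the headline 2.5x figure, which is a ratio of average held-out gains rather than a ratio of raw scores.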

GitHub Repository and arXiv Paper Public Content

Arbor’s code is public on GitHub (https://github.com/RUC-NLPIR/Arbor), and the arXiv paper (https://arxiv.org/abs/2606.11926) details the theoretical background and evaluation results. The project page (https://ruc-nlpir.github.io/Arbor/) provides demos and additional materials. VentureBeat (https://venturebeat.com/orchestration/new-ai-optimization-framework-beats-claude-code-and-codex-by-2-5x-on-the-same-compute-budget) also reports from a practical perspective.

Practical Implications for AI Agent Developers

For engineers developing AI agents, Arbor provides valuable insights for next-generation workflow design. Introducing Coordinator/Executor separation and HTR improves stability and performance on long-horizon tasks. We recommend cloning the GitHub repository and trying it in a local environment first.

Frequently Asked Questions (FAQ)

Is Arbor a replacement for Claude Code or Codex?
No. Arbor is a framework that complements and extends existing tools. By combining the Coordinator and Executor roles, it further enhances the performance of Claude Code and Codex.

How does Hypothesis Tree Refinement (HTR) manage hypotheses?
It holds hypotheses persistently in a tree structure and refines them by linking experimental results and evidence to each one. Failures are also accumulated as constraints.

How do Coordinator and Executor collaborate?
The Coordinator formulates the overall strategy, while the Executor runs experiments in isolated worktrees. The Executor's results are fed back into the tree, where they inform the Coordinator's next decisions.

On which tasks was the 2.5x performance gain confirmed?
The average gain of over 2.5x was measured across Autonomous Optimization tasks in general. Specific figures are reported for MLE-Bench Lite and BrowseComp.

What does the 86.36% Any Medal on MLE-Bench Lite mean?
Any Medal is the share of competitions in which the agent earned a medal of any kind. Arbor's 86.36%, achieved in combination with GPT-5.5, was the highest among the compared systems.

Comparison Table

Item	Claude Code / Codex	Arbor (HTR)
Long-term strategy management	Limited	Persistent tree by Coordinator
Failure utilization	Prone to dissipation	Accumulated and propagated as constraints
Held-out performance gain	Baseline	Average 2.5x or more
BrowseComp improvement example	–	45.33% → 67.67%
Isolated execution environment	Standard	Executor worktree separation

Source: arXiv:2606.11926 (June 2026), VentureBeat coverage, official GitHub and project page (as of June 2026)

Summary

Arbor opens new possibilities for AI agent development as a long-horizon autonomous optimization framework leveraging Hypothesis Tree Refinement. The separation of Coordinator and Executor and persistent failure management deliver performance far exceeding Claude Code and Codex. Check the details via GitHub and arXiv, and try incorporating it into your projects.

Author

krona23

Over 20 years in the IT industry, serving as Division Head and CTO at multiple companies running large-scale web services in Japan. Experienced across Windows, iOS, Android, and web development. Currently focused on AI-native transformation. At DevGENT, sharing practical guides on AI code editors, automation tools, and LLMs in three languages.
