Project Lifelong Agents

Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

A self-play RL framework for training general-purpose tool-calling agents from scratch, without any human data.

Emre Can Acikgoz1, Cheng Qian1, Jonas Hübotter2, Heng Ji1, Gokhan Tur1, Dilek Hakkani-Tür1

1UIUC  2ETH Zurich

Zero Data. Zero Labels.

Define your domains. Let the agents train themselves. Tool-R0 takes a lightweight task specification and builds a full self-evolving curriculum from it. No datasets, no annotations, no human supervision.

Abstract

TL;DR

Large language models are increasingly deployed as autonomous agents that interact with external tools to solve complex tasks. Reinforcement learning has become the go-to approach for building these agentic capabilities, but it typically relies on carefully constructed task–solution pairs and substantial human supervision.

We propose Tool-R0, a framework for training general-purpose tool-calling agents from scratch with self-play RL, under a zero data assumption. Starting from the same base LLM, Tool-R0 co-evolves a Generator and a Solver with complementary rewards: one proposes challenging tasks right at the other's competence frontier, and the other learns to solve them through real-world tool calls.

How It Works

The Self-Evolution Loop

🟑 Generator Synthesizes tasks via RLVR
β†’
πŸ“‹ Curriculum Filter, deduplicate, rank
β†’
πŸ”΅ Solver Learns tool calls via RLVR

Both agents are initialized from the same base LLM and trained independently with GRPO.
The Generator is rewarded by the frozen Solver's answer uncertainty to target its competence frontier.
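As a concrete sketch, the Generator's uncertainty reward can be approximated by sampling several answers from the frozen Solver and measuring their disagreement. This is a minimal illustration under assumed hyperparameters (`target`, `sigma`), not the paper's exact reward:

```python
import math
from collections import Counter

def generator_reward(solver_answers, target=0.5, sigma=0.2):
    """Reward a generated task by the frozen Solver's uncertainty.

    solver_answers: final answers sampled from the frozen Solver for one
    candidate task. The majority-vote agreement rate is a cheap proxy for
    difficulty: agreement near 1.0 means the task is too easy, agreement
    near chance means it sits at the Solver's competence frontier.
    """
    counts = Counter(solver_answers)
    agreement = counts.most_common(1)[0][1] / len(solver_answers)
    # Peak reward when agreement is near `target`, decaying smoothly
    # (Gaussian) for tasks that are far too easy or far too hard.
    return math.exp(-((agreement - target) ** 2) / (2 * sigma ** 2))

# A task the Solver always answers the same way earns little reward,
# while a task that splits its samples earns the maximum.
easy = generator_reward(["tool_a"] * 8)                        # agreement = 1.0
frontier = generator_reward(["tool_a"] * 4 + ["tool_b"] * 4)   # agreement = 0.5
```

A disagreement proxy of this kind needs no ground-truth labels for the generated tasks, which is what makes the zero-data setting workable.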

Results

Main Results

We evaluate Tool-R0 on five diverse tool-calling benchmarks. It turns out that self-play alone is enough to produce substantial gains across all of them, covering single-turn API selection, multi-step tool composition, conversational tool use, and intent tracking. On our primary model (Qwen2.5-1.5B), Tool-R0 yields a 92.5% relative improvement over the base model on average.

| Model | ToolAlpaca | SealTool | NexusRaven | API-Bank | SNIPS | Avg |
|---|---|---|---|---|---|---|
| Qwen2.5-0.5B | 20.17 | 37.07 | 4.71 | 13.85 | 1.57 | 15.47 |
| + Tool-R0 | 31.58 | 63.95 | 17.61 | 28.00 | 14.29 | 30.57 |
| Δ | +56.6% | +72.5% | +273.9% | +102.2% | +810.2% | ↑101.0% |
| Qwen2.5-1.5B | 35.96 | 47.27 | 17.61 | 19.13 | 4.29 | 24.85 |
| + Tool-R0 | 47.36 | 83.00 | 34.59 | 50.62 | 20.86 | 47.84 |
| Δ | +31.7% | +75.6% | +86.4% | +164.6% | +386.3% | ↑92.5% |
| Qwen2.5-3B | 45.61 | 69.72 | 44.33 | 44.95 | 14.28 | 43.97 |
| + Tool-R0 | 53.51 | 78.23 | 47.80 | 47.94 | 15.57 | 48.50 |
| Δ | +17.3% | +12.2% | +7.8% | +6.7% | +9.0% | ↑10.3% |
| Llama-3.2-3B | 35.96 | 68.70 | 45.60 | 27.08 | 12.29 | 36.12 |
| + Tool-R0 | 43.86 | 77.21 | 46.86 | 30.24 | 14.42 | 40.47 |
| Δ | +22.0% | +12.4% | +2.8% | +11.7% | +17.3% | ↑12.0% |

An interesting pattern here: after training with Tool-R0, the 0.5B model surpasses the 1.5B base, and the 1.5B surpasses the 3B base. Self-play appears to unlock latent tool-use capabilities even in very small models that otherwise show limited tool-calling performance.

Comparison

Surpassing Supervised Baselines

To put these numbers in context, we compare Tool-R0 against models fine-tuned on existing curated tool-calling datasets ranging from 4k to 210k examples. For a fair comparison, all models are re-trained on the same Qwen2.5-1.5B backbone.

| Method | Data | ToolAlpaca | SealTool | NexusRaven | API-Bank | SNIPS | Avg |
|---|---|---|---|---|---|---|---|
| Base model | – | 35.96 | 47.27 | 17.61 | 19.13 | 4.29 | 24.85 |
| xLAM | 60k | 51.75 | 69.05 | 38.68 | 34.65 | 23.85 | 43.60 |
| Hammer | 210k | 45.61 | 68.70 | 51.88 | 33.10 | 19.42 | 43.74 |
| ToolACE | 12k | 45.61 | 67.01 | 43.08 | 53.71 | 14.14 | 44.71 |
| ToolRL | 4k | 46.49 | 72.78 | 34.14 | 62.04 | 14.86 | 46.06 |
| Tool-R0 | 0 | 47.36 | 83.00 | 34.59 | 50.62 | 20.86 | 47.84 |

With zero curated data, Tool-R0 reaches the highest average accuracy at 47.84%, outperforming all supervised baselines. We think the key reason is that the self-generated curriculum adaptively targets the model's evolving weaknesses rather than being locked to a fixed human-designed distribution. In other words, the model itself knows best what data it needs.

Analysis

Ablation Studies

We run a series of ablations to understand which design choices actually matter.

| Configuration | Avg Accuracy | Δ (pp) | Relative Drop |
|---|---|---|---|
| Tool-R0 (full) | 47.84 | – | – |
| w/ shared weights | 30.42 | −17.42 | ↓ 36.4% |
| w/ frozen Generator | 41.65 | −6.19 | ↓ 12.9% |
| w/o difficulty reward | 43.54 | −4.30 | ↓ 9.0% |
| w/o Gaussian falloff | 44.10 | −3.74 | ↓ 7.8% |

Parameter separation is critical. Sharing weights between Generator and Solver causes a 17.4 pp drop. We attribute this to gradient interference: the exploration-driven Generator and the execution-driven Solver pull the shared representation in conflicting directions. The Generator also needs to actively learn, not just generate; freezing it costs 6.2 pp. Finally, difficulty calibration matters: both the band-pass reward and its smooth Gaussian transitions contribute to stable training.
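The band-pass reward with Gaussian falloff ablated above can be sketched as follows. The band edges (`low`, `high`) and falloff width (`sigma`) are illustrative assumptions, not the paper's values:

```python
import math

def difficulty_reward(success_rate, low=0.2, high=0.8, sigma=0.15):
    """Band-pass difficulty reward over the Solver's success rate.

    Tasks solved at a rate inside [low, high] get full reward; outside
    the band, the reward decays as a Gaussian of the distance to the
    nearest edge rather than dropping to zero at a hard cutoff, which
    keeps the Generator's reward signal smooth near the boundaries.
    """
    if low <= success_rate <= high:
        return 1.0
    edge = low if success_rate < low else high
    return math.exp(-((success_rate - edge) ** 2) / (2 * sigma ** 2))
```

Replacing the Gaussian tails with a hard zero outside the band corresponds to the "w/o Gaussian falloff" row in the ablation table.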

Insights

Key Findings

Finding 1

Self-play can teach complex tool-calling from zero data. Even starting from weak priors, Tool-R0 produces consistent gains across model scales, architectures, and benchmarks. Self-play RL on its own is sufficient to learn non-trivial tool-calling capabilities.

Finding 2

Self-generated curricula outperform static human supervision. The curriculum that emerges from self-play achieves the highest average similarity to the benchmark task distributions and the most uniform coverage across benchmarks, without ever seeing test data. It produces broader, more balanced training distributions than any of the curated datasets we compared against.

Finding 3

Role separation is essential for stable co-evolution. Training Generator and Solver with separate parameters turns out to be necessary, especially when both roles operate over high-entropy action spaces with different reward objectives. Sharing weights leads to catastrophic forgetting.

Finding 4

Difficulty-aware rewards are what drive learning. Self-play only works well when the Generator is actively learning and guided by a band-pass difficulty signal with smooth Gaussian transitions. Without calibrated difficulty, or with a frozen Generator, the system fails to produce the kind of targeted challenges that keep the Solver improving.

Finding 5

Tool-R0 works as effective mid-training for post-training. Running self-play first and then doing supervised fine-tuning outperforms both SFT alone and standalone Tool-R0, reaching 48.1% accuracy. Self-play builds a stronger foundation that lets the model extract more from the same human-curated data.

Finding 6

Larger models keep improving for longer. Smaller models tend to converge within about 3 iterations, settling near what looks like a Nash-like equilibrium. The 3B model, on the other hand, shows continuous improvement with no clear sign of saturation, suggesting that higher capacity delays convergence and leaves room for further gains from additional self-play rounds.

Citation

If you find Tool-R0 useful in your research, please consider citing our work.

@article{acikgoz2026toolr0,
  title   = {Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data},
  author  = {Acikgoz, Emre Can and Qian, Cheng and H{\"u}botter, Jonas and Ji, Heng and Tur, Gokhan and Hakkani-T{\"u}r, Dilek},
  journal = {arXiv preprint},
  year    = {2026}
}