A self-play RL framework for training general-purpose tool-calling agents from scratch, without any human data.
¹UIUC ²ETH Zurich
Large language models are increasingly used as autonomous agents that interact with external tools to solve complex tasks. Reinforcement learning has become a standard approach for building these agentic capabilities, but it usually relies on carefully constructed task-solution pairs and substantial human supervision.
We propose Tool-R0, a framework for training general-purpose tool-calling agents from scratch with self-play RL, under a zero data assumption. Starting from the same base LLM, Tool-R0 co-evolves a Generator and a Solver with complementary rewards: one proposes challenging tasks right at the other's competence frontier, and the other learns to solve them through real-world tool calls.
Both agents are initialized from the same base LLM and trained independently with GRPO.
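GRPO dispenses with a learned critic and instead normalizes rewards within a group of rollouts sampled for the same prompt. A minimal sketch of that advantage computation (the function name and `eps` are our illustration, not the paper's code):

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages as in GRPO: each rollout's reward is
    standardized against the mean and std of its own sampling group,
    so no separate value network is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Rollouts that beat their group average get positive advantages and are reinforced; below-average rollouts are pushed down, which is what makes a group of sampled tool-call trajectories self-normalizing.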
The Generator is rewarded by the frozen Solver's answer uncertainty to target its competence frontier.
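One natural way to operationalize "answer uncertainty" is the binary entropy of the frozen Solver's empirical success rate on a candidate task: it peaks when the Solver succeeds about half the time, i.e. right at the competence frontier. A hypothetical sketch (the entropy proxy and function names are our illustration, not necessarily the paper's exact signal):

```python
import math

def solver_uncertainty(success_rate: float) -> float:
    """Binary entropy of the frozen Solver's success rate on a task:
    maximal near p = 0.5, near zero for trivial or hopeless tasks."""
    p = min(max(success_rate, 1e-6), 1 - 1e-6)  # clamp away from 0/1
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def generator_reward(solver_success_rate: float) -> float:
    # Reward the Generator most for tasks at the Solver's frontier,
    # steering it away from tasks that are too easy or too hard.
    return solver_uncertainty(solver_success_rate)
```

Under this proxy, a task the Solver cracks 50% of the time earns the maximum reward, while tasks it always or never solves earn almost none.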
We evaluate Tool-R0 on five diverse tool-calling benchmarks covering single-turn API selection, multi-step tool composition, conversational tool use, and intent tracking. Self-play alone produces substantial gains across all of them: on our primary model (Qwen2.5-1.5B), Tool-R0 yields a 92.5% relative improvement over the base model on average.
| Model | ToolAlpaca | SealTool | NexusRaven | API-Bank | SNIPS | Avg |
|---|---|---|---|---|---|---|
| Qwen2.5-0.5B | 20.17 | 37.07 | 4.71 | 13.85 | 1.57 | 15.47 |
| + Tool-R0 | 31.58 | 63.95 | 17.61 | 28.00 | 14.29 | 30.57 |
| Δ | +56.6% | +72.5% | +273.9% | +102.2% | +810.2% | +101.0% |
| Qwen2.5-1.5B | 35.96 | 47.27 | 17.61 | 19.13 | 4.29 | 24.85 |
| + Tool-R0 | 47.36 | 83.00 | 34.59 | 50.62 | 20.86 | 47.84 |
| Δ | +31.7% | +75.6% | +86.4% | +164.6% | +386.3% | +92.5% |
| Qwen2.5-3B | 45.61 | 69.72 | 44.33 | 44.95 | 14.28 | 43.97 |
| + Tool-R0 | 53.51 | 78.23 | 47.80 | 47.94 | 15.57 | 48.50 |
| Δ | +17.3% | +12.2% | +7.8% | +6.7% | +9.0% | +10.3% |
| Llama-3.2-3B | 35.96 | 68.70 | 45.60 | 27.08 | 12.29 | 36.12 |
| + Tool-R0 | 43.86 | 77.21 | 46.86 | 30.24 | 14.42 | 40.47 |
| Δ | +22.0% | +12.4% | +2.8% | +11.7% | +17.3% | +12.0% |
An interesting pattern here: after training with Tool-R0, the 0.5B model surpasses the 1.5B base, and the 1.5B surpasses the 3B base. Self-play appears to unlock latent tool-use capabilities even in very small models that otherwise show limited tool-calling performance.
To put these numbers in context, we compare Tool-R0 against models fine-tuned on existing curated tool-calling datasets ranging from 4k to 210k examples. For a fair comparison, all models are re-trained on the same Qwen2.5-1.5B backbone.
| Method | Data | ToolAlpaca | SealTool | NexusRaven | API-Bank | SNIPS | Avg |
|---|---|---|---|---|---|---|---|
| Base model | – | 35.96 | 47.27 | 17.61 | 19.13 | 4.29 | 24.85 |
| xLAM 60k | 60k | 51.75 | 69.05 | 38.68 | 34.65 | 23.85 | 43.60 |
| Hammer 210k | 210k | 45.61 | 68.70 | 51.88 | 33.10 | 19.42 | 43.74 |
| ToolACE 12k | 12k | 45.61 | 67.01 | 43.08 | 53.71 | 14.14 | 44.71 |
| ToolRL 4k | 4k | 46.49 | 72.78 | 34.14 | 62.04 | 14.86 | 46.06 |
| Tool-R0 (ours) | 0 | 47.36 | 83.00 | 34.59 | 50.62 | 20.86 | 47.84 |
With zero curated data, Tool-R0 reaches the highest average accuracy at 47.84%, outperforming all supervised baselines. We think the key reason is that the self-generated curriculum adaptively targets the model's evolving weaknesses rather than being locked to a fixed human-designed distribution. In other words, the model itself knows best what data it needs.
We run a series of ablations to understand which design choices actually matter.
| Configuration | Avg Accuracy | Δ (pp) | Relative Drop |
|---|---|---|---|
| Tool-R0 (full) | 47.84 | – | – |
| w/ shared weights | 30.42 | −17.42 | −36.4% |
| w/ frozen Generator | 41.65 | −6.19 | −12.9% |
| w/o difficulty reward | 43.54 | −4.30 | −9.0% |
| w/o Gaussian falloff | 44.10 | −3.74 | −7.8% |
Parameter separation is critical. Sharing weights between Generator and Solver leads to a −17.4 pp drop. We attribute this to gradient interference: the exploration-driven Generator and the execution-driven Solver pull the shared representation in conflicting directions. The Generator also needs to actively learn, not just generate; freezing it costs 6.2 pp. Finally, difficulty calibration matters: both the band-pass reward and its smooth Gaussian transitions contribute to stable training.
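The band-pass difficulty reward with Gaussian falloff can be sketched as a bell curve over the Solver's success rate; `target` and `sigma` below are illustrative placeholders, not the paper's actual hyperparameters:

```python
import math

def bandpass_difficulty_reward(success_rate: float,
                               target: float = 0.5,
                               sigma: float = 0.2) -> float:
    """Gaussian band-pass reward: peaks when the Solver's success rate
    hits the target difficulty and decays smoothly (rather than with a
    hard cutoff) for tasks that are too easy or too hard."""
    return math.exp(-((success_rate - target) ** 2) / (2 * sigma ** 2))
```

The smooth falloff is what the "w/o Gaussian falloff" ablation removes: replacing it with a hard window would give the Generator zero gradient signal for tasks just outside the band, which is consistent with the observed −3.7 pp drop.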
Self-play can teach complex tool-calling from zero data. Even starting from weak priors, Tool-R0 produces consistent gains across model scales, architectures, and benchmarks: self-play RL on its own suffices to learn non-trivial tool-calling capabilities.
Self-generated curricula outperform static human supervision. The curriculum that emerges from self-play achieves the highest average similarity to the benchmark task distributions and the most uniform coverage across benchmarks, without ever seeing test data. It produces broader, more balanced training distributions than any of the curated datasets we compared against.
Role separation is essential for stable co-evolution. Training Generator and Solver with separate parameters turns out to be necessary, especially when both roles operate over high-entropy action spaces with different reward objectives. Sharing weights leads to catastrophic forgetting.
Difficulty-aware rewards are what drive learning. Self-play only works well when the Generator is actively learning and guided by a band-pass difficulty signal with smooth Gaussian transitions. Without calibrated difficulty, or with a frozen Generator, the system fails to produce the kind of targeted challenges that keep the Solver improving.
Tool-R0 works as effective mid-training for post-training. Running self-play first and then doing supervised fine-tuning outperforms both SFT alone and standalone Tool-R0, reaching 48.1% accuracy. Self-play builds a stronger foundation that lets the model extract more from the same human-curated data.
Larger models keep improving for longer. Smaller models tend to converge within about 3 iterations, settling near what looks like a Nash-like equilibrium. The 3B model, on the other hand, shows continuous improvement with no clear sign of saturation, suggesting that higher capacity delays convergence and leaves room for further gains from additional self-play rounds.
If you find Tool-R0 useful in your research, please consider citing our work.
@article{acikgoz2026toolr0,
  title   = {Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data},
  author  = {Acikgoz, Emre Can and Qian, Cheng and H{\"u}botter, Jonas and Ji, Heng and Tur, Gokhan and Hakkani-T{\"u}r, Dilek},
  journal = {arXiv preprint},
  year    = {2026}
}