Large Language Models (LLMs) with API-calling capabilities have enabled the development of Language Agents (LAs), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs, requiring fresh data to maintain their quality when interfacing with new services, while LAs often struggle to maintain user intent over multi-turn conversations. Because both robust multi-turn management and advanced function calling are crucial for effective Conversational Agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA). Our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this gap, we introduce CoALM (Conversational Agentic Language Model), a unified approach that integrates both conversational and agentic capabilities. We created CoALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CoALM-IT, we train three models, CoALM-8B, CoALM-70B, and CoALM-405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks. This demonstrates the feasibility of a single-model approach for both TOD and LA, setting a new standard for Conversational Agents. We release all model weights, datasets, and training artifacts to support further research.
Imagine chatting with an AI that not only understands your every question across multiple turns but can also seamlessly call external services when needed, whether that means booking a hotel, checking flight tickets, or retrieving product information. This vision has long been the holy grail of conversational AI, and the emergence of Large Language Models (LLMs) has brought us closer than ever. However, a significant trade-off remains: traditional Task-Oriented Dialogue (TOD) systems excel at carefully orchestrated, multi-turn conversations but lack the adaptability to use new tools or APIs. Language Agents (LAs), on the other hand, can dynamically invoke APIs but often fumble through multi-turn scenarios, losing track of what the user really wants. Our paper tackles this dilemma head-on.
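To make this concrete, below is a minimal sketch of what a single interleaved training example in the spirit of CoALM-IT might look like: a user turn, an explicit ReAct-style thought, a structured function call, and a grounded response. The field names and the `search_hotels` API are illustrative assumptions for this sketch, not the released schema.

```python
# Illustrative sketch of an interleaved multi-turn ReAct + function-calling
# sample. Field names and the `search_hotels` API are hypothetical; the
# actual CoALM-IT schema is defined in the paper and released dataset.
sample = {
    "dialogue": [
        {"role": "user",
         "content": "Find me a cheap hotel in Cambridge for Friday."},
        {"role": "assistant",
         # ReAct-style reasoning precedes the tool call.
         "thought": "The user wants a budget hotel in Cambridge; query the hotel API.",
         "action": {"name": "search_hotels",
                    "arguments": {"city": "Cambridge", "price_range": "cheap"}}},
        {"role": "tool",  # simulated API observation returned to the model
         "content": [{"name": "Alexander B&B", "price_range": "cheap"}]},
        {"role": "assistant",
         "thought": "One matching hotel returned; confirm before booking.",
         "content": "I found the Alexander B&B, a cheap option in Cambridge. "
                    "Shall I book it for Friday?"},
    ]
}
```

Keeping dialogue state, explicit reasoning, and structured API calls in a single training sequence is what lets one model exercise both the TOD and LA skill sets at once.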
Our release includes:
- Model weights for CoALM-8B, CoALM-70B, and CoALM-405B
- The CoALM-IT multi-task training dataset
- All training artifacts, to support further research
If you find our work useful, please cite:

```bibtex
@misc{acikgoz2025singlemodelmastermultiturn,
  title={Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model},
  author={Emre Can Acikgoz and Jeremiah Greer and Akul Datta and Ze Yang and William Zeng and Oussama Elachqar and Emmanouil Koukoumidis and Dilek Hakkani-Tür and Gokhan Tur},
  year={2025},
  eprint={2502.08820},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2502.08820},
}
```