Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model

University of Illinois Urbana-Champaign, Oumi,
Conversational AI Lab

Abstract

Large Language Models (LLMs) with API-calling capabilities have enabled the development of Language Agents (LAs), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs and require new data to maintain quality when interfacing with new services, while LAs often struggle to maintain user intent over multi-turn conversations. Because both robust multi-turn management and advanced function calling are crucial for effective Conversational Agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA). Our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this chasm, we introduce CoALM (Conversational Agentic Language Model), a unified approach that integrates both conversational and agentic capabilities. We also create CoALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CoALM-IT, we train three models, CoALM-8B, CoALM-70B, and CoALM-405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks. This demonstrates the feasibility of a single model for both TOD and LA tasks, setting a new standard for Conversational Agents. We release all model weights, datasets, and training artifacts to support further research.

Why do we need CoALM?


Imagine chatting with an AI that not only understands your every question across multiple turns but can also seamlessly call external services—like booking a hotel, checking flight availability, or retrieving product information—when needed. This vision has long been the holy grail of conversational AI, and the emergence of Large Language Models (LLMs) has brought us closer than ever. However, there remains a significant trade-off: traditional Task-Oriented Dialogue (TOD) systems excel at carefully orchestrated, multi-turn conversations but lack the adaptability to use new tools or APIs. Language Agents (LAs), on the other hand, can dynamically invoke APIs but often fumble through multi-turn scenarios, losing track of what the user really wants. Our paper tackles this dilemma head-on.

Our release includes:

  • CoALM, a family of models at different scales: CoALM 8B, CoALM 70B, and the largest open-source conversational agent, CoALM 405B, all combining robust multi-turn dialogue skills with advanced function-calling capabilities. Our larger models, CoALM 70B and CoALM 405B, outperform GPT-4o and GPT-4o-mini on both TOD and LA tasks, narrowing the gap between closed-source and open-source models.
  • We introduce CoALM-IT, a hybrid dataset for conversational agents featuring unique ReAct reasoning steps in multi-turn settings, encompassing 312K samples across diverse domains, tasks, and abilities.
  • To foster further research within the open-source community, we publicly release all model weights, datasets, intermediate checkpoints, and wandb reports.

Confronting the TOD vs. LA Dilemma


  • TOD systems typically rely on domain-specific data and rigid pipelines. They can accomplish booking tasks well in a controlled setting (e.g., "reserve a table at 6 PM"), but adding a new service—say, a flight API—requires new training data or extensive manual modifications. LAs, powered by advanced LLMs, handle a broad range of APIs on the fly, yet they may go off track in longer dialogues: user intentions get muddled, or the conversation derails. As user demands expand beyond narrowly defined tasks, we need an agent that can juggle wide-ranging domains, maintain conversation flow, and call upon a varied set of tools without skipping a beat.

Introducing CoALM Framework


  • CoALM. To bridge this gap, we propose CoALM (Conversational Agentic Language Model), a unified approach that combines the robust multi-turn dialogue management of TOD with the dynamic function-calling prowess of LAs. Drawing on diverse data sources, we interleave standard booking flows, open-ended dialogues, and complex function calls, all unified under the umbrella of a single system. The result is an agent that adapts to new services without exhaustive retraining, while also preserving the clarity of multi-turn user interactions.
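
To make this interaction pattern concrete, below is a minimal, hypothetical Python sketch of how a unified conversational agent can interleave ordinary dialogue turns with ReAct-style tool calls. The tool schema, message format, and stubbed model call are illustrative assumptions, not CoALM's actual prompt or training format.

```python
import json

# Hypothetical tool registry; in practice these would be calls to real service APIs.
TOOLS = {
    "find_hotel": lambda city, nights: {"name": "Hamilton Lodge", "price": 89 * nights},
}

def call_model(messages):
    """Stub standing in for a CoALM checkpoint; returns a canned ReAct step."""
    if messages[-1]["role"] == "user":
        return {"thought": "The user wants a hotel; search before answering.",
                "action": {"tool": "find_hotel", "args": {"city": "Cambridge", "nights": 2}}}
    return {"content": "Hamilton Lodge is available for 2 nights at $178 total. Shall I book it?"}

def agent_turn(history, user_msg):
    """One conversational turn: reason, optionally call a tool, then respond."""
    history.append({"role": "user", "content": user_msg})
    step = call_model(history)
    if "action" in step:  # ReAct: Thought -> Action -> Observation -> Response
        result = TOOLS[step["action"]["tool"]](**step["action"]["args"])
        history.append({"role": "tool", "content": json.dumps(result)})
        step = call_model(history)  # turn the observation into a user-facing reply
    history.append({"role": "assistant", "content": step["content"]})
    return step["content"]

print(agent_turn([], "I need a hotel in Cambridge for two nights."))
```

The key point is that the same model decides, turn by turn, whether to reply directly or to reason about and invoke an external API, which is precisely the combination of skills CoALM is trained for.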

CoALM-IT Dataset


  • CoALM-IT. At the heart of CoALM is our CoALM-IT dataset—a carefully constructed multi-task corpus that blends state-tracking tasks, comprehensive TOD dialogues, and ReAct-based function usage. We apply this dataset to train models of various scales: CoALM-8B, CoALM-70B, and CoALM-405B. Unlike most TOD datasets that focus on a limited set of actions, CoALM-IT introduces scenarios where the AI must pick and choose from an array of potential APIs, reason out the user’s need (sometimes over multiple turns), and decide how best to respond. By consistently interleaving standard dialogue tasks with tool-calling scenarios, we foster a system that simultaneously learns to maintain context and interact with external services.
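
For illustration, here is a hypothetical example of the kind of multi-task sample CoALM-IT blends: a multi-turn dialogue in which ReAct-style reasoning and API calls are interleaved with ordinary conversational turns. The field names, schema, and placeholder values below are assumptions made for readability; refer to the released dataset for the actual format.

```python
# Hypothetical CoALM-IT-style sample (schema, field names, and values are illustrative only).
coalm_it_style_sample = {
    "task": "tod_with_function_calling",
    "messages": [
        {"role": "user",
         "content": "Find me a cheap Italian restaurant in the centre."},
        {"role": "assistant",
         "content": "Thought: I should query the restaurant API with area=centre, "
                    "food=italian, pricerange=cheap.\n"
                    "Action: find_restaurant(area='centre', food='italian', pricerange='cheap')"},
        {"role": "tool",
         "content": '{"name": "Example Trattoria", "phone": "01223 000000"}'},
        {"role": "assistant",
         "content": "Example Trattoria is a cheap Italian place in the centre. "
                    "Would you like me to book a table?"},
        {"role": "user",
         "content": "Yes, for 4 people at 18:00 on Friday."},
        {"role": "assistant",
         "content": "Thought: All booking slots are filled, so I can call the booking API.\n"
                    "Action: book_restaurant(name='Example Trattoria', people=4, "
                    "time='18:00', day='friday')"},
    ],
}
```

Mixing samples like this with plain state-tracking dialogues and single-turn function-calling data is what lets a single model practice context maintenance and tool use within the same training run.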

Results


  • Results on MultiWOZ. We compare CoALM against specialized approaches and baseline LLMs. Traditional dialogue systems outperformed LAs on MultiWOZ's core metrics such as Inform Rate and Success Rate, often surpassing 40–50% success when well trained. LAs with advanced API-calling capabilities, however, saw their success rates plummet, sometimes below 20%, revealing their struggle to track user needs in multi-turn dialogues. In contrast, CoALM soared: even our smallest CoALM-8B model doubled the success rates of typical LAs, and our larger models rivaled or exceeded top proprietary solutions.

  • Results on BFCL and API-Bank. Next, we turn to two popular benchmarks for function calling: BFCL V3 and API-Bank. Here, specialized LAs shine, adeptly generating syntactically correct API calls with high success rates on short, single-turn prompts; when extended to multi-turn settings, however, they often stumble. Our results show how CoALM closes this gap. CoALM-8B significantly outperforms TOD-only models, revealing that domain-restricted architectures cannot pivot to new tools. More impressively, our larger CoALM-70B and CoALM-405B not only handle unseen APIs but also maintain coherent conversations, besting strong baselines like GPT-4o on both TOD and tool-calling tasks across key metrics.

The Path Forward

  • By unifying TOD strengths with LA flexibility, we believe CoALM sets a new paradigm for Conversational Agents. Beyond the promising numbers, our open-source release of model weights, datasets, and training artifacts offers the community an unprecedented opportunity to explore and refine these capabilities further. Whether you’re looking to build a chat assistant that can manage a business meeting or an AI travel agent that can handle flight, hotel, and taxi bookings in one conversation, the CoALM blueprint is designed to adapt. We look forward to the innovations that researchers and developers will create, pushing us closer to a future where AI can truly converse, reason, and act in one unified framework.

License and BibTeX

This model is licensed under the Creative Commons NonCommercial license (CC BY-NC 4.0).
Please cite our paper if you use our models, data, code, or results:


        @misc{acikgoz2025singlemodelmastermultiturn,
          title={Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model},
          author={Emre Can Acikgoz and Jeremiah Greer and Akul Datta and Ze Yang and William Zeng and Oussama Elachqar and Emmanouil Koukoumidis and Dilek Hakkani-Tür and Gokhan Tur},
          year={2025},
          eprint={2502.08820},
          archivePrefix={arXiv},
          primaryClass={cs.AI},
          url={https://arxiv.org/abs/2502.08820},
        }
      

Acknowledgements

We would like to acknowledge the Oumi AI team for their assistance in training and scaling the larger CoALM models. We would also like to thank Together AI for providing the cluster resources necessary to enable CoALM-405B training. This project has also benefited from the Microsoft Accelerate Foundation Models Research (AFMR) grant program, through which leading foundation models hosted by Microsoft Azure and access to Azure credits were provided to conduct this research.