Large Language Models (LLMs) have brought significant advancements in Natural Language Processing, transforming our interactions with AI by enhancing understanding, reasoning, and task-solving. Traditionally, before generative LLMs, task-oriented dialogue (TOD) systems guided AI-based assistants to perform predefined tasks by following a structured flow. But with LLMs' ability to understand complex language, interpret instructions, and generate high-quality responses, we are moving towards a new breed of AI, Conversational AI Agents, that engage users in more dynamic, context-rich conversations. This blog post scopes Conversational Agents in the era of Large Language Models: it clarifies key terminology, explores current challenges, and discusses future directions in this domain, with a particular focus on conversational task-completion agents.

Figure 1. Example of a dialogue between a user and an agent with support from tool calls.

Paradigm Shift in the Era of LLMs

The backbone of the LLM-driven shift in language understanding and generation began with the large-scale self-supervision trend, in which LLMs are pre-trained on vast datasets. Paired with advances in GPU technology, this approach has enabled progressively larger models trained on ever-growing datasets. However, LLMs are initially pre-trained simply to complete text, which limits their ability to follow complex human instructions. This led to instruction tuning, where models are fine-tuned specifically to understand and follow instructions, a process that further calibrates LLM responses to user intent and preferences. Parallel to these parameter-update strategies, in-context learning emerged with GPT-3 (Brown et al., 2020), enabling LLMs to learn from a few examples provided within the prompt. Further methods like Chain-of-Thought (CoT) prompting (Wei et al., 2022) let LLMs tackle complex queries by breaking them down into smaller subtasks for step-by-step reasoning.

Despite these advances, when interacting with users, LLMs remain fundamentally limited to their parametric knowledge during response generation and lack agentic abilities such as reasoning, planning, decision-making, and acting.

Figure 2. Illustration of LLM-oriented agent research growth over time, categorized into three key areas: Tools and LLMs, Reasoning-Planning-Acting, and Datasets and Benchmarks. Each category is color-coded to highlight advancements according to their release timelines.

Conversational AI Agents

In this blog post, we want to motivate the emerging field of Conversational Agents, in which an LLM-based agent conducts multi-turn interactions with users by integrating reasoning and planning capabilities with action execution.

We categorize recent developments in LLM Agents into three main areas: (i) tool usage, (ii) thinking, planning, and acting, and (iii) evaluation. Figure 2 presents an overview of LLM Agent advancements in these categories over the past few years.

Tool Usage. While LLMs are good at solving language tasks with their own reasoning, they struggle with real-time tasks that require performing actions, like checking live weather or executing specific commands. Integrating tools such as APIs can help LLMs perform these tasks by letting them call functions directly (e.g., via OpenAI Function Calling). Studies such as Toolformer (Schick et al., 2023), Gorilla (Patil et al., 2023), and ToolLLM (Qin et al., 2023) have shown that enabling LLMs to use tools improves their ability to complete tasks.
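
To make this concrete, here is a minimal sketch of the tool-calling pattern: the model emits a structured call (a tool name plus JSON arguments), and a thin dispatcher routes it to a local function whose result is handed back to the model. The tool names, the registry, and the payload shape below are illustrative assumptions, not any specific provider's API.

import json

# Illustrative local "tools" the agent may call (hypothetical examples).
def get_weather(city: str) -> str:
    # A real system would query a live weather service here.
    return f"Sunny, 22°C in {city}"

def book_table(restaurant: str, time: str) -> str:
    return f"Booked a table at {restaurant} for {time}"

TOOL_REGISTRY = {"get_weather": get_weather, "book_table": book_table}

def dispatch_tool_call(tool_call_json: str) -> str:
    """Route a model-emitted tool call (name + JSON arguments) to a local function."""
    call = json.loads(tool_call_json)
    fn = TOOL_REGISTRY.get(call["name"])
    if fn is None:
        # Guard against hallucinated tools that do not exist.
        return f"Error: unknown tool '{call['name']}'"
    return fn(**call["arguments"])

# Example payload in the shape many function-calling APIs emit.
print(dispatch_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))

The observation returned by the dispatcher is appended back into the conversation so the model can compose its final, grounded response.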

Reasoning, Planning, and Acting. Adding tools helps, but LLMs still need the skill to think through tasks and decide when and how to take the right actions. ReAct (Yao et al., 2022) allows LLM agents to think and act by interleaving reasoning with execution for effective outcomes. Proprietary models like GPT-4 are ahead in these agent skills, but methods like AgentTuning (Zeng et al., 2023) and FireAct (Chen et al., 2023) work on training open-source models for agent-style thinking and acting. Agents can also follow planned steps toward their goals in diverse settings, from embodied tasks such as HELPER-X (Sarch et al., 2024) and executable code actions such as CodeAct (Wang et al., 2024) to web navigation such as WebArena (Zhou et al., 2023), sharing progress with users for real-time feedback and smoother interactions.
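
The sketch below shows a ReAct-style control loop in its simplest form: the model alternates Thought and Action lines, the environment answers with an Observation, and the loop ends when the model emits Finish[...]. The text format, the llm callable, and the tools dictionary are placeholders standing in for a real model and toolset, not the exact prompt format used in the paper.

import re

def react_loop(llm, question: str, tools: dict, max_steps: int = 5) -> str:
    """Interleave reasoning ("Thought"), tool use ("Action"), and feedback ("Observation")."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model continues the transcript with either an Action or a final Finish.
        step = llm(transcript)
        transcript += step + "\n"
        done = re.search(r"Finish\[(.*)\]", step)
        if done:
            return done.group(1)
        action = re.search(r"Action: (\w+)\[(.*)\]", step)
        if action:
            name, arg = action.groups()
            observation = tools[name](arg) if name in tools else f"unknown tool: {name}"
            transcript += f"Observation: {observation}\n"  # fed back into the next "Thought"
    return "Stopped without a final answer."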

Evaluation. Agent evaluation has evolved from checking basic commands (Tur et al., 2011) to more advanced tests, like tracking multi-turn dialogue state (Rastogi et al., 2019) and completing complex tasks in real time (Hudecek et al., 2023). New benchmarks like AgentBench (Liu et al., 2023) and GAIA (Mialon et al., 2023) assess agents on tougher challenges, including maintaining consistency over multiple interactions and adapting to changing environments. Recent work, like $\tau$-bench (Yao et al., 2024) and TravelPlanner (Xie et al., 2024), pushes the frontier by measuring both interactive consistency and environmental adaptability in real-world scenarios.
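
As a rough illustration of what these benchmarks measure, the sketch below scores an agent in the spirit of tool-agent-user benchmarks such as $\tau$-bench: an episode counts as solved only if the final environment state matches the annotated goal state, not merely if the last reply sounds plausible. The episode fields and the run_agent callable are assumptions for illustration.

def evaluate(episodes, run_agent) -> float:
    """Task success rate over multi-turn episodes, judged on the final environment state."""
    solved = 0
    for ep in episodes:
        final_state = run_agent(ep["user_turns"], ep["initial_state"])
        solved += int(final_state == ep["goal_state"])
    return solved / len(episodes)

# Illustrative usage with a trivial stand-in "agent" that applies the requested change.
demo = [{"user_turns": ["Move my flight to Friday."],
         "initial_state": {"flight_day": "Monday"},
         "goal_state": {"flight_day": "Friday"}}]
print(evaluate(demo, lambda turns, state: {**state, "flight_day": "Friday"}))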

Figure 3. A modern flowchart for an LLM-based Conversational Agent that integrates explicit reasoning (or "thinking"), real-time API calls, acting, and continuous learning from environmental observations.

Next-Generation Conversational Task-Completion Agents

Beyond the general agent domain, LLMs have reshaped dialogue systems by transitioning from rigid, modular architectures (Young et al., 2002) to adaptive, agent-based frameworks that can handle complex, multi-turn interactions through refined prompting and fine-tuning methods (Gupta et al., 2022). For example, Figure 3 shows the new generation of dialogue structures powered by Conversational Agents, where a single agent manages the entire process (reasoning over the user query, planning, tool calls when needed, and response generation), replacing the separate modules found in traditional dialogue frameworks, such as Steve Young's well-known architecture (Young et al., 2002). In the remainder of this post, we examine the capabilities Conversational Agents most urgently need on the path toward better systems and, eventually, AGI.
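
To make the contrast with the classic modular stack concrete, here is a minimal sketch, with hypothetical component and function names, of the two designs side by side: the traditional pipeline wires separate NLU, state-tracking, policy, and NLG modules together, whereas the agent-based design folds all of these roles into a single prompted LLM call.

# Classic modular TOD pipeline: each stage is a separately built component.
def modular_turn(user_msg, state, nlu, dst, policy, nlg):
    intent = nlu(user_msg)      # natural language understanding
    state = dst(state, intent)  # dialogue state tracking
    action = policy(state)      # dialogue policy: which system act / API call to take
    return nlg(action), state   # natural language generation

# Agent-based replacement: one LLM call covers understanding, planning, tool use, and response.
AGENT_PROMPT = (
    "You are a task-completion assistant. Reason about the user's goal, "
    "decide whether any tool must be called, and reply in natural language.\n"
    "Conversation so far:\n{history}\nUser: {user_msg}"
)

def agent_turn(llm, user_msg, history):
    return llm(AGENT_PROMPT.format(history=history, user_msg=user_msg))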

Memory and Personalization. Effective Conversational Agents integrate short- and long-term memory (Huang et al., 2023), which enables them to create personalized interactions that recall user preferences. This integration fosters a natural connection, such as remembering a favorite coffee order or special days like a birthday or Valentine's Day (Park et al., 2023).
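
A minimal sketch of this idea, assuming a sliding window for short-term context and a simple key-value store for long-term user facts (both data structures and field names are illustrative):

from collections import deque

class AgentMemory:
    """Short-term memory: a sliding window of recent turns.
    Long-term memory: persistent user facts recalled across sessions."""

    def __init__(self, window: int = 6):
        self.short_term = deque(maxlen=window)  # recent dialogue turns
        self.long_term = {}                     # e.g. {"favorite_coffee": "flat white"}

    def remember_turn(self, speaker: str, text: str) -> None:
        self.short_term.append(f"{speaker}: {text}")

    def remember_fact(self, key: str, value: str) -> None:
        self.long_term[key] = value

    def build_context(self) -> str:
        profile = "; ".join(f"{k}: {v}" for k, v in self.long_term.items())
        return f"Known user preferences: {profile}\n" + "\n".join(self.short_term)

memory = AgentMemory()
memory.remember_fact("favorite_coffee", "flat white")
memory.remember_turn("user", "The usual, please.")
print(memory.build_context())  # the agent can now resolve "the usual" from long-term memory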

Policy Alignment and Control. Traditional dialogue systems use structured, rule-based policies to ensure controlled responses. Modern LLM-based frameworks like LangGraph and DSPy provide emerging solutions for approximate policy control, but they remain limited in complex scenarios.
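
One pragmatic pattern, sketched below under the assumption of a rule-based gate sitting between the agent and its tools, is to check every proposed action against explicit business rules before execution; frameworks such as LangGraph or DSPy let similar constraints be expressed as graph or program structure, but the rules and action names here are purely illustrative.

from typing import Callable, Optional

# Illustrative business rules: each returns a reason string if the action violates policy.
POLICY_RULES: list[Callable[[dict], Optional[str]]] = [
    lambda a: "Refunds above $100 need human approval."
    if a["name"] == "issue_refund" and a["arguments"].get("amount", 0) > 100 else None,
    lambda a: "Account deletion is never automated." if a["name"] == "delete_account" else None,
]

def gated_execute(action: dict, execute: Callable[[dict], str]) -> str:
    """Execute a proposed agent action only if no policy rule rejects it."""
    for rule in POLICY_RULES:
        violation = rule(action)
        if violation:
            return f"Policy stop: {violation}"  # surfaced to the agent/user instead of executing
    return execute(action)

print(gated_execute({"name": "issue_refund", "arguments": {"amount": 250}},
                    execute=lambda a: "refund issued"))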

Interactivity. Recent advances, like ReSpAct (Vardhan et al., 2024), advocate for user-guided interactions to clarify and adapt agent behavior in real time, addressing ambiguities and obstacles for more natural and controllable dialogues.
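
In spirit, this behavior reduces to a simple decision at every turn: act when the request is fully specified, otherwise speak and ask. The slot names and the ambiguity check below are illustrative placeholders for what a real model would infer.

def next_step(request: dict) -> dict:
    """Decide whether to act or to ask the user a clarifying question first."""
    missing = [slot for slot in ("date", "destination") if slot not in request]
    if missing:
        # Speak instead of acting: surface the ambiguity rather than guessing.
        return {"type": "ask_user",
                "text": f"Before I book anything, could you tell me the {' and the '.join(missing)}?"}
    return {"type": "act", "tool": "book_flight",
            "arguments": {"date": request["date"], "destination": request["destination"]}}

print(next_step({"destination": "Tokyo"}))  # the agent asks for the missing date instead of acting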

Multi-Agent Collaboration. Finally, multi-agent frameworks like AutoGen (Wu et al., 2023) allow specialized agents to collaborate, with a central "concierge" agent orchestrating tasks. This collaborative interaction between different agents can enhance accuracy and user experience in complex settings such as customer service.
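
A stripped-down sketch of the concierge pattern, with plain Python functions standing in for the specialist agents that a framework like AutoGen would wrap around LLMs and tools (the routing rule and agent names are illustrative):

# Hypothetical specialist "agents"; in practice each would wrap its own LLM, tools, and memory.
def billing_agent(message: str) -> str:
    return "Billing agent: I have checked your latest invoice."

def tech_support_agent(message: str) -> str:
    return "Tech support agent: please try restarting the router."

SPECIALISTS = {"billing": billing_agent, "support": tech_support_agent}

def concierge(message: str) -> str:
    """Central agent that routes each user message to the most relevant specialist."""
    topic = "billing" if any(w in message.lower() for w in ("invoice", "charge", "refund")) else "support"
    return SPECIALISTS[topic](message)

print(concierge("Why was I charged twice this month?"))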

Figure 4. LLM-oriented agent publications at highly reputable venues (ACL, NAACL, EMNLP, AAAI, ICML, ICLR, NeurIPS) from 2021 to 2024. Publication counts were determined through simple string matching, searching for "agent" in paper titles and categorizing each match as LLM-based or not. Ambiguous titles were checked by further reading.

Challenges and Future Directions

Conversational AI Agents have made remarkable progress, but challenges remain. Ensuring controllability, managing context, and avoiding hallucinations (where the agent generates inaccurate responses or tries to call tools that do not exist) are key areas for improvement. Agents will also need personalized interactions, where memory-based systems track user preferences for more tailored responses, enhancing trust and user satisfaction. Finally, proper benchmarks focused on these agentic abilities, serving as an "ImageNet" for agents, are essential to significantly accelerate research in Conversational Agents. Furthermore, it is crucial to streamline these benchmarks, as they currently require significant manual setup. Ideal agent benchmarks should be engineered for easy deployment and allow researchers to set up and run evaluation environments within minutes.

The rising prominence of LLM-based agents has been evident in academic research in the last couple of years, with Figure 4 illustrating their accelerating presence across some popular conferences. Looking forward, this shift presents both exciting opportunities and important challenges. While significant progress has been made in agentic features like tool usage, memory, policy control, and collaboration, there remains much to explore in creating intelligent, adaptive, and reliable agents.

Citation

@article{acikgoz2024convagents,
  title   = "The Rise of Conversational AI Agents with Large Language Models",
  author  = "Emre Can Acikgoz and Dilek Hakkani-Tur and Gokhan Tur",
  journal = "emrecanacikgoz.github.io",
  year    = "2024",
  month   = "November",
  url     = "https://emrecanacikgoz.github.io/Conversational-Agents/"
}

References

[1] Brown et al. “Language Models are Few-Shot Learners.” NeurIPS 2020.

[2] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022.

[3] Schick et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” NeurIPS 2023.

[4] Patil et al. “Gorilla: Large Language Model Connected with Massive APIs.” NeurIPS 2024.

[5] Qin et al. “ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.” ICLR 2024.

[6] Yao et al. “ReAct: Synergizing reasoning and acting in language models.” ICLR 2023.

[7] Zeng et al. “AgentTuning: Enabling Generalized Agent Abilities for LLMs.” arXiv preprint arXiv:2310.12823 (2023).

[8] Chen et al. “FireAct: Toward Language Agent Fine-tuning.” arXiv preprint arXiv:2310.05915 (2023).

[9] Sarch et al. “HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models.” ICLR 2024 Workshop LLMAgents.

[10] Wang et al. “Executable Code Actions Elicit Better LLM Agents.” ICML 2024.

[11] Zhou et al. “WebArena: A Realistic Web Environment for Building Autonomous Agents.” ICLR 2024.

[12] Tur et al. “Spoken Language Understanding: Systems for Extracting Semantic Information from Speech.” Wiley 2011.

[13] Rastogi et al. “Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset.” AAAI 2020.

[14] Hudecek et al. “Are LLMs All You Need for Task-Oriented Dialogue?” ACL 2023.

[15] Liu et al. “AgentBench: Evaluating LLMs as Agents.” ICLR 2024.

[16] Mialon et al. “GAIA: a benchmark for General AI Assistants.” ICLR 2024.

[17] Yao et al. “$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.” arXiv preprint arXiv:2406.12045 (2024).

[18] Xie et al. “TravelPlanner: A Benchmark for Real-World Planning with Language Agents.” ICML 2024.

[19] Young et al. “Talking to machines (statistically speaking).” ICSLP 2002.

[20] Gupta et al. “InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning.” ACL 2022.

[21] Huang et al. “Memory Sandbox: Transparent and Interactive Memory Management for Conversational Agents.” ACM Symposium on User Interface Software and Technology 2023.

[22] Park et al. “Generative Agents: Interactive Simulacra of Human Behavior.” arXiv preprint arXiv:2304.03442 (2023).

[23] Vardhan et al. “ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents.” arXiv preprint arXiv:2411.00927 (2024).

[24] Wu et al. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.” ICLR 2024 Workshop LLMAgents.