Research
My research focuses on Self-Evolving Agents. To achieve this, I work on Agent Learning and Test-Time Training. My high-level goal is to build agents that can continuously learn from few samples with minimal updates, regardless of the task, domain, or environment.
In parallel, I like to explore the capabilities of LLMs, identify their limitations, and extend them to new tasks and domains. My previous work spans a wide range of topics in language modeling and post-training, including multi-modality.
|
SIGDial 2025 (Oral)
|
TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons
Emre Can Acikgoz*, Carl Guo*, Suvodip Dey*, Akul Datta, Takyoung Kim, Gokhan Tur, Dilek Hakkani-Tür
SIGDial (Oral), 2025
arxiv / website / code
We propose TD-EVAL, a two-step evaluation framework for TOD systems that combines fine-grained turn-level analysis—focusing on conversation cohesion, knowledge consistency, and policy compliance—with dialogue-level comparisons via a pairwise TOD Agent Arena. This unified approach captures both local and global errors missed by traditional metrics.
|
NeurIPS 2025
|
ToolRL: Reward is All Tool Learning Needs
Cheng Qian*, Emre Can Acikgoz*, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji
NeurIPS, 2025
arxiv / code / huggingface
In ToolRL, we explore different reward strategies for tool learning and present a new approach for tool utilization tasks that achieves a 17% improvement over base models and a 15% improvement over SFT versions.
|
|
A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions
Emre Can Acikgoz*, Cheng Qian*, Hongru Wang*, Vardhan Dongre, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
arXiv, 2025
arxiv / code
Our proposed taxonomy systematically analyzes Conversational Agents along three essential dimensions: (i) Reasoning: logical and structured thinking for decision-making, (ii) Monitoring: self-awareness and continuous user intention tracking, and (iii) Control: effective tool utilization and policy adherence. Each dimension is accompanied by a representative list of works.
|
ACL 2025
|
SMART: Self-Aware Agent for Tool Overuse Mitigation
Cheng Qian*, Emre Can Acikgoz*, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji
Findings of ACL 2025
arxiv / code / huggingface
Inspired by human metacognition, SMART enhances an LLM's self-awareness to reduce tool overuse while boosting performance. Our experiments show that SMARTAgent reduces tool use by 24% while improving performance by 37%.
|
ACL 2025
|
Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model
Emre Can Acikgoz, Jeremiah Greer, Akul Datta, Ze Yang, William Zeng, Oussama Elachqar, Emmanouil Koukoumidis, Gokhan Tur, Dilek Hakkani-Tür
ACL 2025 Main
arxiv / website / code
CoALM unifies multi-turn dialogue management and complex API usage in a single model. Trained on the CoALM-IT multi-task dataset, CoALM (8B, 70B, 405B) outperforms domain-specific models like GPT-4o on MultiWOZ 2.4, BFCL V3, and API-Bank benchmarks.
|
IEEE SPM 2025
|
Conversational Agents in the Era of Large Language Models [Perspectives]
Emre Can Acikgoz, Dilek Hakkani-Tür, Gokhan Tur
IEEE SPM, 2025
IEEEXplore
Large language models (LLMs) have driven a paradigm shift in task-oriented dialogue by enabling AI agents with stronger reasoning, tool use, and instruction-following abilities. These developments give rise to conversational AI agents: systems that merge advanced language understanding with agentic decision-making to achieve dynamic, context-aware, and task-oriented interactions. This work covers ongoing challenges, including multi-turn context management, controllability, personalization, and user alignment.
|
|
PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents
Takyoung Kim, Janvijay Singh, Shuhaib Mehri, Emre Can Acikgoz, Sagnik Mukherjee, Nimet Beyza Bozdag, Sumuk Shashidhar, Gokhan Tur, Dilek Hakkani-Tür
arXiv, 2025
arxiv
PIPA is a unified evaluation protocol for task planning agents that models their behavior within a POMDP framework, enabling fine-grained assessment across the entire agentic process. Unlike traditional task completion metrics, PIPA uses atomic evaluation criteria to diagnose strengths and weaknesses in context understanding, tool use, and decision-making, aligning evaluation more closely with user satisfaction.
|
IWSDS 2025
|
ReSpAct: Harmonizing Reasoning, Speaking, and Acting
Vardhan Dongre, Xiaocheng Yang, Emre Can Acikgoz, Suvodip Dey, Gokhan Tur, Dilek Hakkani-Tür
IWSDS, 2025
arxiv / website / code
ReSpAct is a framework that enables LLM agents to engage in interactive, user-aligned task-solving. It enhances agents' ability to clarify, adapt, and act on feedback.
|
NeurIPS GenAI4Health (Oral)
|
Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare
Emre Can Acikgoz, Osman Batur İnce, Rayene Bech, Arda Anıl Boz, Ilker Kesen, Aykut Erdem, Erkut Erdem
NeurIPS GenAI4Health (Oral), 2024
arxiv / website / poster
We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. We also introduce Hippo, a family of 7B models tailored for the medical domain, fine-tuned from Mistral and LLaMA2 through continual pre-training, instruction tuning, and reinforcement learning from human and AI feedback.
|
ICLR 2024
|
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
Ilker Kesen, Andrea Pedrotti, Mustafa Dogan, Michele Cafagna, Emre Can Acikgoz, Letitia Parcalabescu, Iacer Calixto, Anette Frank, Albert Gatt, Aykut Erdem, Erkut Erdem
ICLR, 2024
arxiv / website / code
ViLMA (Video Language Model Assessment) presents a comprehensive benchmark for Video-Language Models, starting with a fundamental comprehension test and followed by a more advanced evaluation for temporal reasoning skills.
|
Academic Service
Reviewer: AACL 2025, EMNLP 2025, NeurIPS 2025, ACL 2025, NeurIPS 2024
|
Two Truths and a Lie
Two truths and one fun fact about me: (i) I won national math olympiads twice, (ii) I made my own custom Harley Davidson when I was 16, (iii) I used to play college football.
|
|