TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

University of Illinois Urbana-Champaign,
Conversational AI Lab

Abstract

Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At the turn level, we evaluate each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. At the dialogue level, we design the TOD Agent Arena, which uses pairwise comparisons to provide a measure of overall dialogue quality. Through experiments on MultiWOZ 2.4 and τ-Bench, we demonstrate that TD-EVAL effectively identifies conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both the turn and dialogue levels with a plug-and-play framework for future research.

Why We Need TD-Eval


  • Traditional evaluation metrics in task-oriented dialogue (TOD) systems, like Inform and Success rates, often provide an incomplete picture, focusing solely on final outcomes and ignoring intermediate errors during conversations. With the rise of sophisticated Large Language Models (LLMs), these outdated metrics fail to capture subtle yet significant mistakes such as incorrect information provided mid-dialogue or inconsistencies in responses. For example, a system might hallucinate a restaurant's existence and provide false information to a user, but still receive a perfect score if it eventually corrects itself later in the conversation. This failure to detect turn-level errors creates an accountability gap in TOD research. TD-EVAL addresses this problem by combining fine-grained turn-level analysis—evaluating conversation cohesion, backend knowledge accuracy, and policy adherence—with holistic dialogue-level comparisons to ensure comprehensive evaluation aligned closely with real-world conversational quality.

TD-Eval


  • TD-Eval adopts a structured two-step evaluation protocol, illustrated in Figure 2. In the first step (left), an LLM-based judge assesses each response within a dialogue at the turn level across three key dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. The judge evaluates each response in the context of the dialogue history, user queries, and backend database results, assigning scores from 1 to 5 along with detailed justifications. The second step (right) is a holistic dialogue-level assessment using the TOD Agent Arena, where entire dialogues from different agents compete and are ranked through pairwise Elo-based comparisons. This two-step design yields both detailed error analysis and broad performance comparisons; a sketch of the turn-level scoring step is shown below.
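
The turn-level step can be summarized in a few lines of code. The sketch below is a minimal illustration under stated assumptions: the call_llm callable, the prompt template, and the JSON output schema are hypothetical stand-ins, not the authors' actual implementation.

        # Minimal sketch of TD-Eval's turn-level judging step (Step 1).
        # call_llm, the prompt template, and the JSON schema are illustrative assumptions.
        import json
        from typing import Callable, Dict, List

        DIMENSIONS = ["conversation_cohesion", "backend_knowledge_consistency", "policy_compliance"]

        JUDGE_PROMPT = (
            "You are evaluating one system response in a task-oriented dialogue.\n"
            "Dialogue history:\n{history}\n"
            "Backend/database results:\n{db_results}\n"
            "System response:\n{response}\n"
            "For each dimension ({dims}), give a score from 1 to 5 and a short justification.\n"
            'Answer as JSON: {{"<dimension>": {{"score": <int>, "justification": "<text>"}}, ...}}'
        )

        def score_turn(call_llm: Callable[[str], str], history: List[str],
                       db_results: str, response: str) -> Dict[str, Dict]:
            """Ask the LLM judge to score one response on the three TOD dimensions."""
            prompt = JUDGE_PROMPT.format(
                history="\n".join(history),
                db_results=db_results,
                response=response,
                dims=", ".join(DIMENSIONS),
            )
            return json.loads(call_llm(prompt))  # expects the judge to return valid JSON

        def score_dialogue(call_llm, turns):
            """Average each dimension's 1-5 turn scores over a whole dialogue."""
            totals = {d: 0.0 for d in DIMENSIONS}
            for t in turns:  # each turn: {"history": [...], "db_results": "...", "response": "..."}
                judged = score_turn(call_llm, t["history"], t["db_results"], t["response"])
                for d in DIMENSIONS:
                    totals[d] += judged[d]["score"]
            return {d: totals[d] / len(turns) for d in DIMENSIONS}

In practice, call_llm can wrap any judge model's API; per-dimension scores can then be averaged over turns and dialogues to obtain the summary numbers reported below.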

Human Evaluation


  • Our human evaluation study with 10 annotators (PhD students in NLP with advanced English proficiency) demonstrated that TD-EVAL aligns with human judgments significantly better than traditional metrics. Compared against the traditional Success rate, τ-Bench reward, and LMUnit metrics, TD-EVAL achieved higher agreement with human ratings (Gwet's AC1 of 0.56 for turn-level and 0.57 for dialogue-level evaluations; a sketch of this agreement coefficient is given below). This strong alignment validates TD-EVAL's effectiveness in capturing the nuanced aspects of conversational quality that matter to users. Notably, human annotators tended to rate high-quality TOD interactions in the 4-5 range on our 5-point Likert scale, indicating that our framework captures the subtle distinctions between good and excellent conversations that traditional metrics miss entirely.
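
For reference, the sketch below shows how Gwet's AC1 (the first-order agreement coefficient) is typically computed for two raters labeling the same items; the two-rater setup and the toy ratings are illustrative assumptions rather than the study's actual annotation data.

        # Minimal sketch of Gwet's AC1 for two raters labeling the same items.
        # The two-rater setup and the toy 5-point ratings are illustrative only.
        from collections import Counter
        from typing import Sequence

        def gwet_ac1(ratings_a: Sequence, ratings_b: Sequence) -> float:
            assert len(ratings_a) == len(ratings_b) and ratings_a, "need paired, non-empty ratings"
            n = len(ratings_a)
            categories = sorted(set(ratings_a) | set(ratings_b))
            q = len(categories)

            # Observed agreement: fraction of items on which both raters agree.
            p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

            # Chance agreement: based on the average proportion of ratings per category.
            counts = Counter(ratings_a) + Counter(ratings_b)
            pi = {k: counts[k] / (2 * n) for k in categories}
            p_e = sum(pi[k] * (1 - pi[k]) for k in categories) / (q - 1)

            return (p_a - p_e) / (1 - p_e)

        # Toy example on a 5-point Likert scale.
        print(round(gwet_ac1([5, 4, 4, 3, 5], [5, 4, 3, 3, 5]), 2))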

Main Results


  • Our evaluation of several state-of-the-art LLMs on MultiWOZ 2.4 and τ-Bench revealed clear differences in model capabilities. At the turn level, the o1 model achieved the highest overall score (4.49/5.00), with strong performance across all dimensions, particularly policy compliance. Claude-3.5-Sonnet ranked second overall, while Llama-3.1-405B emerged as the top open-source model, with particularly strong backend knowledge consistency. Our TOD Agent Arena (dialogue-level evaluation) produced somewhat different rankings: Claude-3.5-Sonnet led with an Elo rating of 1279.66, winning most head-to-head matchups (a sketch of the Elo update rule appears below). Interestingly, models such as Mistral-Large performed better in dialogue-level comparisons than their turn-level scores would suggest, indicating an ability to recover from localized errors in end-to-end conversations. These results highlight the importance of evaluating TOD systems from multiple perspectives.
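
The Arena's ranking relies on standard Elo updates over pairwise, LLM-judged matchups. The sketch below illustrates the update rule; the K-factor, initial rating, and toy matchups are illustrative assumptions, not the paper's exact configuration.

        # Minimal sketch of Elo-style rating updates over pairwise dialogue comparisons.
        # K-factor, initial rating, and the toy matchups are illustrative assumptions.

        def expected_score(r_a: float, r_b: float) -> float:
            """Win probability of agent A over agent B under the Elo model."""
            return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

        def update_elo(ratings: dict, agent_a: str, agent_b: str, winner,
                       k: float = 32, init: float = 1000.0) -> dict:
            """Update ratings in place after one judged comparison (winner=None means tie)."""
            r_a, r_b = ratings.get(agent_a, init), ratings.get(agent_b, init)
            s_a = 1.0 if winner == agent_a else 0.5 if winner is None else 0.0
            e_a = expected_score(r_a, r_b)
            ratings[agent_a] = r_a + k * (s_a - e_a)
            ratings[agent_b] = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
            return ratings

        # Toy example: accumulate ratings over judged matchups (agent names are placeholders).
        ratings = {}
        for a, b, winner in [("agent_a", "agent_b", "agent_a"), ("agent_b", "agent_c", None)]:
            update_elo(ratings, a, b, winner)
        print(sorted(ratings.items(), key=lambda kv: -kv[1]))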

Conclusion and Limitations

  • We present TD-Eval, a simple yet powerful framework for TOD evaluation that combines fine-grained turn-level checks with a holistic dialogue-level ranking. By adopting an LLM-as-judge paradigm, TD-Eval goes beyond standard metrics to reveal subtle yet critical errors, such as inconsistent database usage and policy violations, which often remain undetected by final-turn or dialogue-level summaries. Through Elo-based ranking and targeted turn-level scoring, our experiments on MultiWOZ 2.4 and τ-Bench demonstrate TD-Eval's alignment with human judgments. This work opens a new path for LLM-driven TOD evaluation, one that is both flexible and transparent, ensuring greater accountability and accuracy in developing next-generation dialogue systems. We intend to release our framework, system responses, and human evaluations to foster reproducibility and community adoption.

  • While the metrics in TD-Eval cover the core aspects of general TOD scenarios, designing more flexible, fine-grained evaluation metrics that cover the diverse situations arising in multi-turn interactions remains an open question. Furthermore, practitioners should note that LLM-based evaluation can be improved by adding high-quality few-shot demonstrations or scoring rubrics tailored to specific service domains. Lastly, conventional evaluation metrics remain useful for assessing specific aspects of TOD, so TD-Eval should be used alongside existing metrics in a complementary fashion.

License and BibTeX

Please cite our paper if you use our models, data, code, or results:


        @article{acikgoz2025td,
          title={TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons},
          author={Acikgoz, Emre Can and Guo, Carl and Dey, Suvodip and Datta, Akul and Kim, Takyoung and Tur, Gokhan and Hakkani-T{\"u}r, Dilek},
          journal={arXiv preprint arXiv:2504.19982},
          year={2025}
        }
      
This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).

Ethics Statement

We conduct our experiments using the publicly available MultiWOZ and τ-Bench datasets, adhering fully to their terms of use. Since we employ LLMs to generate evaluations with justifications, the risk of producing harmful, biased, or discriminatory statements is minimal. However, we acknowledge the potential ethical concerns associated with this work.