Evaluating small talk using language models: Language model transformers as evaluators for open-domain dialogues

|© Uni Bonn / Fraunhofer IAIS|© Uni Bonn / Fraunhofer IAIS|© Irina/stock.adobe.com & ML2R|© Irina/stock.adobe.com & ML2R|© Uni Bonn / Fraunhofer IAIS

Dialogue systems, which are nowadays referred to as chatbots, have existed since the 1960s. One of the earliest known examples is ELIZA by Joseph Weizenbaum, utilizing keyword matching and rules to mimic simple Rogerian psychotherapy. Since then, the research field has significantly evolved, and dialogue systems are now prevalent in everyday life. They are integrated into voice assistants like Siri or Alexa, as well as chatbots on social media platforms, aiding in tasks such as restaurant reservations or providing support for various issues. They also play an increasingly vital role in industrial dialogue environments. But how do we know if a custom-developed chatbot actually works?  

Task-oriented chatbots (like those assisting in flight bookings) are typically component-based: their task is broken down into sub-tasks that can each be evaluated automatically. Even so, the final stage of developing such systems still involves rigorous testing by real people. Chatbots that are not task-oriented (such as those focused on small talk) need to be evaluated as well. The best approach available so far is to compare their response against a reference. In informal conversations, however, there can be more than one acceptable response. To date, no one has succeeded in developing a tool, method, or algorithm to measure how well these programs engage in conversation.

For many, this might sound like an automated Turing test (also referred to as the “imitation game” by Alan Turing himself). However, such evaluation requires two essential abilities:  

  1. Understanding whether a dialogue meets certain quality criteria, such as fluency (correct language use) or coherence (contextually relevant response).  
  2. Being able to engage in a conversation, meaning providing a response that adheres to the mentioned criteria.

Since we don’t have a system that can perform the latter correctly (at least not yet), we cannot automate the Turing test. Instead, we focus on measuring the fluency and coherence of a dialogue.

Why do language models have a “sense” for fluent and coherent dialogues?

A question quickly arises: To know what a fluent and coherent dialogue is, don’t you need to know how to engage in a good conversation?

Not quite! Anyone can appreciate that reading books helps in mastering a language, whether as a native or a non-native speaker. This is more or less what language models like BERT (Devlin et al., 2018), GPT-2 (Radford et al., 2019), or XLNet (Yang et al., 2019) do. They “read” numerous articles from online news or Wikipedia, thereby “acquiring knowledge” about the language they consume. However, none of them has learned to actively participate in a dialogue (for more information, see “abilities and limitations of language models”).


In our paper, “Language Model Transformers as Evaluators for Open-domain Dialogues”, we demonstrate that language models possess a “sense” of what a coherent and fluent dialogue might be. They have acquired this “sense” solely through “reading books.” Simply put, language models have learned to guess the most likely word or words in a given context. Each of the three approaches mentioned above accomplishes this in its own way. We aimed to find out whether their “ability” could be a good indicator of conversation quality.
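
To make the idea of “guessing the most likely word in a given context” concrete, here is a minimal sketch. It does not use BERT, GPT-2, or XLNet. Instead it swaps in a toy bigram model built from four made-up sentences, and it averages the per-token log-probabilities into one score per response. The corpus, the function names, the add-one smoothing, and the averaging step are all assumptions made for illustration, not the method from the paper.

```python
import math
from collections import Counter

# Hypothetical toy "reading material"; a real language model reads far more text.
CORPUS = [
    "how are you today",
    "i am fine thank you",
    "how is the weather today",
    "the weather is fine",
]

def train_bigram(sentences):
    """Count single words and word pairs, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def mean_log_likelihood(response, unigrams, bigrams):
    """Average per-token log-probability of a response, read left to right."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + response.split()
    log_probs = []
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen word pairs from getting probability zero.
        prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_probs.append(math.log(prob))
    # Averaging is one simple (assumed) way to get a single score per response.
    return sum(log_probs) / len(log_probs)

unigrams, bigrams = train_bigram(CORPUS)
fluent = mean_log_likelihood("the weather is fine", unigrams, bigrams)
scrambled = mean_log_likelihood("fine the is weather", unigrams, bigrams)
print(f"fluent: {fluent:.3f}  scrambled: {scrambled:.3f}")
```

Even this tiny model assigns a higher score to the well-ordered sentence than to the scrambled one, which is the kind of “sense” for fluency the post refers to, only at a vastly smaller scale.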

Thus, we asked language models how “likely” the responses in dialogues are. For our tests, we used the systems participating in the ConvAI1 and ConvAI2 challenges. We then examined whether there is a correlation between the “likelihood score” given by the language models and the ratings provided by human annotators. It turned out that there was (some)! Depending on the language model and dialogue dataset used, we found positive correlation coefficients (Pearson’s and Spearman’s) ranging from 0.13 to 0.49 with high statistical significance. BERT’s Next Sentence Prediction (NSP) performed best, as it operates at the utterance level rather than the token level. It is followed by XLNet, which uses positional information for each target word. GPT-2, which performs standard word-by-word prediction from left to right, comes last.
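
The comparison between likelihood scores and human ratings can be sketched in a few lines of plain Python. The snippet below is only an illustration: the scores and ratings are invented, and the helper names `pearson`, `average_ranks`, and `spearman` are assumptions made for this example, not code from the paper's repository. Spearman's coefficient is computed here as Pearson's coefficient on ranks, with tied values sharing their average rank.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Invented example data: one likelihood score and one human rating per response.
lm_scores = [-1.9, -2.7, -2.1, -3.0, -2.4]
human_ratings = [4, 2, 5, 1, 3]

print(f"Pearson:  {pearson(lm_scores, human_ratings):.2f}")
print(f"Spearman: {spearman(lm_scores, human_ratings):.2f}")
```

Pearson's coefficient measures linear agreement, while Spearman's only asks whether the two rankings agree, which is why reporting both gives a fuller picture.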



That’s amazing! So, if language models “have their own opinions”, did we only ask them to evaluate the dialogue, or did we also ask them what, from their perspective, constitutes a good response?

Yes, indeed! Two of them, GPT-2 and XLNet, are capable of generating complete sentences. So, we asked both to continue the conversations from ConvAI1 and ConvAI2. While their responses weren’t entirely fluent, they were understandable and made sense in context. Moreover, the likelihood scores of these hypothetical responses showed an even higher correlation with human annotator scores than the previously mentioned likelihood score. Depending on the language model and dataset, the correlation increased by about 0.05 on average. Does that mean that language models are better than the ConvAI1 and ConvAI2 systems? Possibly! However, both competitions took place before the advent of transformer LMs, making such a comparison unfair.

Figure 3 (© Uni Bonn / Fraunhofer IAIS): At first glance, the scanned generated answers seem anything but good. However, if we remove the first and last token, we get a perfect answer.


Currently, the approach only works with utterance pairs. It needs improvement to consider the entire context and not just the last utterance. We have observed that a more comprehensive approach like BERT’s Next Sentence Prediction (NSP) is advantageous. Therefore, further exploration is needed to obtain a dialogue evaluation that does not require an aggregation step.

For more information, you can refer to the associated publication:

Language Model Transformers as Evaluators for Open-domain Dialogues
Nedelchev, Rostislav, Jens Lehmann, and Ricardo Usbeck. Proceedings of the 28th International Conference on Computational Linguistics, 2020.

Link to the code: https://github.com/SmartDataAnalytics/transformers_dialogue_evaluators 

This was a guest post from the SDA Blog.

Rostislav Nedelchev,

23. February 2022



Rostislav Nedelchev

Rostislav Nedelchev is a PhD student at the Lamarr Institute (University of Bonn) and works as a Senior Machine Learning Engineer at Alexander Thamm GmbH. Rostislav’s research interests lie in the areas of Natural Language Processing, Machine Learning and Data Science. In particular, he works on the automatic evaluation of dialogue systems.
