Conversational AI 17-EVALUATING OPEN-DOMAIN DIALOGUE SYSTEMS
EVALUATION AT THE LEVEL OF THE EXCHANGE
Evaluation at the level of the exchange involves judging some aspect of the system’s response to the user’s utterance. This approach, which is known as single-turn pairwise evaluation, provides a more fine-grained evaluation than a dialogue-level evaluation, but it tends to overlook issues that arise in the longer flow of the dialogue.
Using Metrics from Machine Translation
In Machine Translation (MT) using the Seq2Seq approach, an utterance in one language is encoded into a representation that is then decoded as an utterance in another language. This idea has been extended to dialogue, where the encoded utterance comes from one of the dialogue participants, typically the user, and the decoded utterance is the response from the other participant, typically the system. Building on this, dialogue researchers have investigated whether evaluation methods that have been used successfully in MT could be extended to the evaluation of dialogue.
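BLEU is one widely used MT metric of this kind. The following is a minimal sketch of a sentence-level BLEU-2 score (clipped unigram and bigram precision combined with a brevity penalty). It assumes whitespace tokenization, a single reference response, and no smoothing; the function names `ngrams` and `bleu2` are illustrative and are not taken from any toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu2(candidate, reference):
    """Sentence-level BLEU-2 against one reference (assumption: no smoothing)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in (1, 2):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives zero
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the unigram and bigram precisions.
    return bp * math.exp(sum(log_precisions) / 2)


if __name__ == "__main__":
    reference = "pretty good today"
    print(bleu2("pretty good today", reference))
    print(bleu2("i am fine thanks", reference))
```

In this sketch a perfectly reasonable response such as "i am fine thanks" scores zero because it shares no words with the single reference, which illustrates why word-overlap metrics from MT may not transfer well to open-domain dialogue, where many different responses can be acceptable.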
Next Utterance Classification (NUC)
Next Utterance Classification (NUC) has been proposed as a metric for evaluating the performance of a dialogue system. NUC involves selecting the best response to a previous utterance from a candidate list, as in retrieval-based response generation. The advantage of NUC is that it is easy to compute automatically. However, it is important to determine whether this automatic classification correlates with the ratings of human evaluators.
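NUC performance is commonly reported as Recall@k: the fraction of test items for which the correct response appears among the model's top k ranked candidates. The sketch below assumes that formulation. The scoring function `word_overlap` is a toy stand-in for a trained model, and the example data and all names are hypothetical.

```python
def word_overlap(context, candidate):
    """Toy scorer (assumption): number of word types shared with the context."""
    return len(set(context.lower().split()) & set(candidate.lower().split()))


def mean_recall_at_k(examples, score, k=1):
    """Fraction of examples whose correct response is ranked in the top k.

    Each example is a (context, candidates, correct_response) triple.
    """
    if not examples:
        return 0.0
    hits = 0
    for context, candidates, correct in examples:
        # Rank candidates from highest to lowest score.
        ranked = sorted(candidates, key=lambda c: score(context, c), reverse=True)
        hits += correct in ranked[:k]
    return hits / len(examples)


if __name__ == "__main__":
    examples = [
        ("do you like jazz music",
         ["i like jazz a lot", "the train is late", "it is raining"],
         "i like jazz a lot"),
        ("what time is it",
         ["i had pasta", "almost noon", "it is a dog"],
         "almost noon"),
    ]
    for k in (1, 3):
        print(k, mean_recall_at_k(examples, word_overlap, k=k))
```

With three candidates per item, Recall@3 is trivially 1.0, so k is normally chosen to be much smaller than the size of the candidate list.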
EVALUATION AT THE LEVEL OF THE DIALOGUE
Evaluation at the level of the dialogue is less fine-grained than evaluation at the level of the exchange, but it has the advantage that it can capture important aspects of the conversation flow that occur in multi-turn dialogues and that would otherwise be missed in an exchange-based approach. The following are some approaches to dialogue-level evaluation.
CHATEVAL: A TOOLKIT FOR CHATBOT EVALUATION
ChatEval is a framework for the evaluation of open-domain Seq2Seq chatbots that aims to address the problem of the wide variety of evaluation procedures currently in use. ChatEval provides an open-source codebase for automatic and human evaluations based on the dataset from the Dialogue Breakdown Detection Challenge (DBDC). The automatic evaluation metrics provided in the ChatEval toolkit include lexical diversity, average cosine similarity, sentence-average BLEU-2 score, and response perplexity. Human evaluation involves A/B comparison tests in which the evaluator is shown a prompt and two possible responses from the models that are being compared.
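Two of these ideas can be sketched briefly. The code below is not ChatEval's implementation: it assumes that lexical diversity is measured as distinct-n (unique n-grams divided by total n-grams over a set of responses), which is one common definition, and that each A/B judgment is recorded as "A", "B", or "tie". All function names and data are hypothetical.

```python
from collections import Counter


def distinct_n(responses, n):
    """Distinct-n: unique n-grams divided by total n-grams over all responses."""
    total = 0
    unique = set()
    for response in responses:
        tokens = response.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


def ab_win_rates(votes):
    """Share of A/B judgments won by each model (assumed labels: A, B, tie)."""
    counts = Counter(votes)
    n = len(votes)
    if n == 0:
        return {"A": 0.0, "B": 0.0, "tie": 0.0}
    return {label: counts[label] / n for label in ("A", "B", "tie")}


if __name__ == "__main__":
    generic = ["i do not know", "i do not know"]
    varied = ["i like jazz", "the train is late"]
    print(distinct_n(generic, 1), distinct_n(varied, 1))
    print(ab_win_rates(["A", "A", "B", "tie"]))
```

A model that repeats the same generic response gets a low distinct-n score, which is the kind of behaviour a lexical diversity metric is meant to expose.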
EVALUATIONS IN CHALLENGES AND COMPETITIONS
Evaluation in the Loebner Prize:
The Loebner Prize is an annual competition in which judges engage simultaneously in text-based interactions with a dialogue system (or chatbot) and a human interlocutor with the aim of determining which interlocutor is the human and which is the chatbot. The Loebner Prize is intended as an implementation of the Turing test.
Evaluation in the Amazon Alexa Challenge:
The main aim of the Amazon Alexa Challenge is to advance the state of the art in Conversational AI by inviting teams of university researchers to develop dialogue systems, known as socialbots, that users can interact with and evaluate.
Evaluation in the ConvAI Competitions:
ConvAI is a competition that aims to measure the quality of open-domain dialogue systems. The first ConvAI competition was held in 2017 and involved interactions between humans and dialogue systems about news articles. The evaluations were performed by human assessors. The second competition, which was held in 2018, focused on general chit-chat about people’s interests.