Conversational AI 16-EVALUATING TASK-ORIENTED DIALOGUE SYSTEMS
Dialogue systems developed in research laboratories and for commercial deployment have traditionally been task-oriented, so they are evaluated with metrics that measure performance on the task, for example task completion, dialogue duration, and user satisfaction. Task-oriented dialogue systems can be viewed as supervised systems, since they have an objective evaluation metric, whereas non-task-oriented dialogue systems are unsupervised, as they lack such a metric.
QUANTITATIVE METRICS FOR OVERALL DIALOGUE SYSTEM EVALUATION
Quantitative metrics are computed from the logs of interactions with users. Some metrics can be retrieved automatically while others have to be calculated by annotators of the logs. The following are some commonly used metrics.
- Time-to-task: measures the amount of time that it takes to start engaging in a task after any instructions and other messages provided by the system.
- Correct transfer rate: measures the proportion of transferred customers who are redirected to the appropriate human agent.
- Containment rate: measures the percentage of calls that are handled by the system without being transferred to a human agent. This metric is useful in determining how successfully the system has reduced the costs of customer care through automation.
- Abandonment rate: this metric is the converse of the containment rate. It measures the percentage of callers who hang up before completing a task with an automated system.
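As a rough illustration, the four metrics above could be computed from interaction logs along the following lines. This is a minimal sketch, not a standard implementation: the `CallLog` record and all of its field names are hypothetical, and the sketch assumes that any call not transferred to a human agent counts as contained, which a real deployment might define more strictly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallLog:
    # All field names are hypothetical; real logs will look different.
    seconds_to_task: float            # time from call start until the task begins
    transferred: bool                 # call was handed to a human agent
    transfer_correct: Optional[bool]  # None when no transfer happened
    abandoned: bool                   # caller hung up before completing the task


def summarize(calls):
    """Compute the overall metrics for a non-empty list of CallLog records."""
    n = len(calls)
    transfers = [c for c in calls if c.transferred]
    return {
        "mean_time_to_task": sum(c.seconds_to_task for c in calls) / n,
        # Assumption: every call that was not transferred counts as contained.
        "containment_rate": sum(not c.transferred for c in calls) / n,
        "abandonment_rate": sum(c.abandoned for c in calls) / n,
        # Only defined when at least one call was transferred.
        "correct_transfer_rate": (
            sum(bool(c.transfer_correct) for c in transfers) / len(transfers)
            if transfers
            else None
        ),
    }


calls = [
    CallLog(12.0, False, None, False),
    CallLog(20.0, True, True, False),
    CallLog(8.0, False, None, True),
    CallLog(16.0, True, False, False),
]
metrics = summarize(calls)
print(metrics)
```

In practice, fields such as `transfer_correct` are the kind of information that usually has to be supplied by annotators of the logs rather than retrieved automatically.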
QUANTITATIVE METRICS FOR THE EVALUATION OF THE SUB-COMPONENTS OF DIALOGUE SYSTEMS
Automatic Speech Recognition (ASR)
The standard evaluation measure in speech recognition is the Word Error Rate (WER), which is calculated by comparing the recognized string against a reference string, such as a transcription by a human annotator.
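WER is conventionally defined as (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words in the minimum-cost alignment of the recognized string with the reference, and N is the number of words in the reference. The sketch below computes it with a word-level edit distance. The function name `word_error_rate` and the example utterances are illustrative, and the sketch assumes whitespace tokenization with no text normalization (case, punctuation), which real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("a"), one substitution ("boston" -> "austin"),
# one insertion ("please"): 3 errors over 5 reference words.
wer = word_error_rate("book a flight to boston", "book flight to austin please")
print(wer)
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the recognizer outputs many more words than were spoken.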
Natural Language Understanding (NLU)
Evaluation of the NLU component involves comparing the component’s output with a reference representation. Various metrics have been used, depending on how the output of the NLU component is represented. Earlier work used sentence accuracy where the output was a syntactic parse tree, or concept accuracy where the output was a semantic frame.
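To make the semantic-frame case concrete, the sketch below scores NLU output against reference frames. It assumes one common formulation of concept accuracy, analogous to WER: one minus the number of substituted, deleted, and inserted concepts divided by the number of concepts in the reference. Frames are represented as plain dictionaries of slot-value pairs; the slot names, the example data, and the helper `exact_match_rate` are hypothetical and are not taken from any particular toolkit.

```python
def concept_errors(reference: dict, hypothesis: dict):
    """Count substituted, deleted, and inserted slots for one utterance."""
    substitutions = sum(
        1 for slot in reference
        if slot in hypothesis and hypothesis[slot] != reference[slot]
    )
    deletions = sum(1 for slot in reference if slot not in hypothesis)
    insertions = sum(1 for slot in hypothesis if slot not in reference)
    return substitutions, deletions, insertions


def concept_accuracy(pairs):
    """1 - (S + D + I) / N over all (reference, hypothesis) frame pairs."""
    errors = sum(sum(concept_errors(ref, hyp)) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 1 - errors / total


def exact_match_rate(pairs):
    """Fraction of utterances whose whole frame matches the reference."""
    return sum(ref == hyp for ref, hyp in pairs) / len(pairs)


pairs = [
    # Wrong destination: one substitution out of three reference concepts.
    ({"intent": "book_flight", "destination": "boston", "date": "friday"},
     {"intent": "book_flight", "destination": "austin", "date": "friday"}),
    # Fully correct frame.
    ({"intent": "book_flight", "destination": "denver"},
     {"intent": "book_flight", "destination": "denver"}),
]
ca = concept_accuracy(pairs)
em = exact_match_rate(pairs)
print(ca, em)
```

The two numbers answer different questions: the concept-level score gives partial credit for a frame that is mostly right, while the exact-match rate treats any slot error as a failure of the whole utterance.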
The overall performance of a modularised dialogue system is a result of the combined performance of its various sub-components. For example, task success depends on the system being able to accurately recognize the user’s spoken input (ASR), create a meaning representation from the recognized string (NLU), and engage in an appropriate dialogue to achieve the task, including where necessary requesting clarification when the user’s input is unclear or underspecified (DM).
Natural Language Generation (NLG)
The NLG component takes the output from the dialogue manager (DM) in the form of an abstract meaning representation and converts it into text.
Text-to-Speech Synthesis (TTS)
Assessing TTS systems has generally involved subjective measures such as intelligibility, naturalness, likeability, and human likeness. Typically, the TTS systems under evaluation synthesize a prescribed set of test sentences, and a panel of listeners rates the outputs.
QUALITATIVE EVALUATION
Qualitative evaluation involves collecting users' judgments of a system's quality through questionnaires in which they rate various statements on a Likert scale. Subjective Assessment of Speech System Interfaces (SASSI) is a widely used tool for the evaluation of spoken dialogue systems. SASSI consists of 34 items distributed across six scales: System Response Accuracy, Likeability, Cognitive Demand, Annoyance, Habitability, and Speed.
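Questionnaire data of this kind is usually summarized as one score per scale. The sketch below shows one plausible way to do that, under loudly stated assumptions: the item identifiers, the assignment of items to scales, and the set of reversed items are all hypothetical placeholders (the actual SASSI questionnaire defines its own 34 items and their scales), and a 7-point Likert scale is assumed.

```python
from statistics import mean

# Hypothetical item-to-scale assignment, for illustration only. Only three
# of the six scales are shown, with made-up item identifiers.
SCALE_ITEMS = {
    "System Response Accuracy": ["q1", "q2"],
    "Likeability": ["q3", "q4"],
    "Speed": ["q5"],
}

# Hypothetical negatively worded items, reversed before averaging so that
# a higher score is always the more favourable one.
REVERSED = {"q5"}

LIKERT_MAX = 7  # assumed 7-point scale


def scale_scores(responses):
    """Average each scale over all respondents' (possibly reversed) ratings."""
    scores = {}
    for scale, items in SCALE_ITEMS.items():
        ratings = []
        for response in responses:
            for item in items:
                value = response[item]
                if item in REVERSED:
                    value = LIKERT_MAX + 1 - value
                ratings.append(value)
        scores[scale] = mean(ratings)
    return scores


responses = [
    {"q1": 6, "q2": 5, "q3": 7, "q4": 6, "q5": 2},
    {"q1": 4, "q2": 5, "q3": 5, "q4": 6, "q5": 3},
]
scores = scale_scores(responses)
print(scores)
```

Reporting per-scale scores rather than a single overall number keeps distinct aspects of the user experience, such as perceived accuracy and perceived speed, from cancelling each other out.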