Conversational AI 18-EVALUATION FRAMEWORKS
PARAdigm for DIalogue System Evaluation (PARADISE) is a framework for evaluating task-oriented and multimodal dialogue systems. PARADISE starts from the premise that the overall goal of a dialogue system is to maximize User Satisfaction (US). This goal is divided into two sub-goals: maximizing task success and minimizing costs. Costs are in turn divided into efficiency measures and qualitative measures. The main strength of the framework is that US can be predicted automatically from objective measures: those that capture task success, together with the efficiency and qualitative measures that capture costs. US is modeled using multiple linear regression, in which US is the dependent variable and the objective measures are the independent variables.
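The regression step can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the measure names (task_success, num_turns, asr_rejections) and the ratings are made-up toy values, and the z-score normalization of each measure is an assumption added here so that the fitted weights are comparable across measures with different scales.

```python
from statistics import fmean, pstdev


def zscores(column):
    """Normalize one measure to zero mean and unit variance."""
    mu, sigma = fmean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_paradise(measures, satisfaction):
    """Least-squares fit of US = b0 + sum_i w_i * N(measure_i).

    measures: dict mapping a measure name to its per-dialogue values.
    satisfaction: per-dialogue user satisfaction ratings.
    Returns {"intercept": b0, name: w_i, ...}.
    """
    names = list(measures)
    cols = [[1.0] * len(satisfaction)] + [zscores(measures[k]) for k in names]
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * y for p, y in zip(ci, satisfaction)) for ci in cols]
    return dict(zip(["intercept"] + names, solve(xtx, xty)))


# Made-up toy data for six dialogues (not taken from any real study).
measures = {
    "task_success": [1, 1, 0, 1, 0, 1],    # task success measure
    "num_turns": [8, 12, 20, 10, 18, 9],   # efficiency measure
    "asr_rejections": [0, 1, 3, 0, 2, 1],  # qualitative measure
}
satisfaction = [4.5, 3.5, 1.5, 4.0, 2.0, 4.0]

model = fit_paradise(measures, satisfaction)
for name, weight in model.items():
    print(f"{name}: {weight:+.3f}")
```

Once the weights have been fitted on dialogues with collected user ratings, US for a new dialogue can be estimated from its objective measures alone, which is what makes automatic prediction possible.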
QUALITY OF EXPERIENCE (QOE)
The focus has shifted to QoE, which describes the user’s perception of quality in terms of usability aspects such as effectiveness, efficiency, and user satisfaction. User-perceived quality is a subjective measure of how users perceive their interactions with the system relative to what they expect or desire from those interactions. Much of the work on QoE has taken the form of developing taxonomies of the aspects of perceived quality.
The term Interaction Quality (IQ) was introduced to describe ratings made by experts, as opposed to US, which is based on ratings made by users. As mentioned earlier, an investigation comparing expert and user ratings found a high correlation between the two, leading to the suggestion that expert ratings, which are easier and less costly to obtain, could replace user ratings in evaluations of dialogue systems. Whereas evaluations of interactions with dialogue systems are usually carried out at the end of an interaction, the innovation in the IQ approach is that evaluation is carried out at the exchange level during the ongoing dialogue. Assessing quality at the exchange level enables a more fine-grained analysis of an interaction, as problematic situations can be identified when they occur. Furthermore, if quality ratings can be estimated automatically during the ongoing dialogue, the dialogue manager (DM) can use this information to adapt the dialogue strategy dynamically.
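The idea of exchange-level assessment feeding back into the dialogue strategy can be sketched as follows. Everything named here is hypothetical: the QualityMonitor class, the 1 to 5 rating scale, the threshold of 3.0, and the two confirmation strategies are assumptions chosen for illustration, and the per-exchange estimates are supplied by hand where a real system would use an automatic estimator.

```python
from dataclasses import dataclass, field

# Hypothetical strategy labels; a real DM would define its own.
CAUTIOUS = "explicit_confirmation"
DEFAULT = "implicit_confirmation"


@dataclass
class QualityMonitor:
    """Tracks per-exchange quality estimates and flags problematic exchanges."""

    threshold: float = 3.0  # assumed cut-off on an assumed 1-5 scale
    history: list = field(default_factory=list)

    def update(self, iq_estimate: float) -> str:
        """Record the estimate for the latest exchange and pick a strategy."""
        self.history.append(iq_estimate)
        # Switch to the more cautious strategy as soon as quality drops.
        return CAUTIOUS if iq_estimate < self.threshold else DEFAULT

    def problem_exchanges(self) -> list:
        """Return the 1-based indices of exchanges rated below the threshold."""
        return [i for i, q in enumerate(self.history, start=1) if q < self.threshold]


monitor = QualityMonitor()
for exchange, estimate in enumerate([5, 4, 2, 3, 4], start=1):
    strategy = monitor.update(estimate)
    print(f"exchange {exchange}: IQ={estimate} -> {strategy}")

print("problematic exchanges:", monitor.problem_exchanges())
```

Because the estimate is available after every exchange, the problematic exchange is identified at the point where it occurs rather than only in an end-of-dialogue rating.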