Conversational AI 13-REINFORCEMENT LEARNING (RL)
RL has been widely used as a means of optimizing dialogue strategies. RL operates as follows. An agent explores an environment consisting of a set of states with transitions between the states. At each time step the agent is in a particular state, chooses an action from the options available in that state, and moves to another state, receiving a reward. The exploration continues until a final state is reached, resulting in a final reward. The aim is to find an optimal path, that is, one that maximizes the expected cumulative reward.
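The loop described above can be sketched in a few lines of Python. This is only an illustration: the three-step environment, the action names "ask" and "wait", and the reward values are assumptions made for the example, and the agent simply explores at random rather than learning anything.

```python
import random

# Hypothetical toy environment: states 0..3, where state 3 is the final state.
# TRANSITIONS[state][action] gives the next state.
TRANSITIONS = {
    0: {"ask": 1, "wait": 0},
    1: {"ask": 2, "wait": 1},
    2: {"ask": 3, "wait": 2},
}
FINAL_STATE = 3
STEP_REWARD = -1    # assumed small cost for every non-final step
FINAL_REWARD = 10   # assumed reward for reaching the final state


def run_episode(rng, max_steps=100):
    """Explore with a random policy until the final state is reached."""
    state, total, path = 0, 0, [0]
    for _ in range(max_steps):
        if state == FINAL_STATE:
            break
        # Choose one of the actions available in the current state.
        action = rng.choice(sorted(TRANSITIONS[state]))
        state = TRANSITIONS[state][action]
        total += FINAL_REWARD if state == FINAL_STATE else STEP_REWARD
        path.append(state)
    return path, total


path, total = run_episode(random.Random(0))
print(path, total)
```

A learning agent would replace the random choice with one informed by the rewards seen so far, so that over many episodes it comes to prefer "ask", which reaches the final state in the fewest steps.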
REPRESENTING DIALOGUE AS A MARKOV DECISION PROCESS
In the first applications of RL to spoken dialogue systems, dialogue was represented as a Markov Decision Process (MDP). An MDP can be defined as a tuple (S, A, T, R, γ), where S is a set of system states; A is a set of actions that the system can take; T is a set of transition probabilities P(s_t | s_t-1, a_t-1), i.e., the probability of the next state given the previous state and the previous system action; R is an immediate reward that is associated with taking a particular action in a given state; and γ is a geometric discount factor that makes more distant rewards worth less than more immediate rewards.
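As a rough sketch, the five components of the MDP tuple can be written as a small Python data structure, together with the discounted sum of rewards that the discount factor implies. The class name, field names, and helper functions are illustrative assumptions, not part of any particular dialogue toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    states: frozenset    # set of system states
    actions: frozenset   # set of actions the system can take
    transitions: dict    # (state, action) -> {next_state: probability}
    rewards: dict        # (state, action) -> immediate reward
    gamma: float         # geometric discount factor, between 0 and 1


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, so later rewards count for less."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def is_well_formed(mdp, tol=1e-9):
    """Check that each transition distribution sums to 1."""
    return all(
        abs(sum(dist.values()) - 1.0) < tol
        for dist in mdp.transitions.values()
    )


# Three rewards of 1 with gamma = 0.5 are worth 1 + 0.5 + 0.25.
print(discounted_return([1, 1, 1], 0.5))  # 1.75
```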
FROM MDPS TO POMDPS
One of the main problems with using MDPs to model dialogue is that an MDP assumes that the contents of the system state are fully observable. However, given the various uncertainties inherent in dialogue interactions, it cannot be assumed that the system’s belief state is correct. For example, the system cannot be certain that it has correctly interpreted the user’s intentions, given the uncertainties associated with speech recognition and natural language understanding. There may also be ambiguities and uncertainties related more generally to the user’s goals and intentions, even when speech recognition and natural language understanding are perfect. For these reasons a Partially Observable Markov Decision Process (POMDP) is preferred over the standard MDP model, since it can represent a probability distribution over all the different states that the system might be in. The cost is a much larger state space, which raises problems of tractability.
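To make the idea of a probability distribution over states concrete, here is a minimal sketch of the standard Bayesian belief update used with POMDPs: the new belief in a state is proportional to the probability of the observation in that state times the predicted probability of reaching it. The two user goals, the action name, and the 0.8 / 0.2 recognition probabilities are invented for the example.

```python
def update_belief(belief, action, observation, transition, observation_model):
    """b'(s') is proportional to O(o | s', a) * sum over s of T(s' | s, a) * b(s)."""
    new_belief = {}
    for s_next in belief:
        # Prediction step: how likely is s_next before seeing the observation?
        predicted = sum(
            transition[(s, action)].get(s_next, 0.0) * p
            for s, p in belief.items()
        )
        # Correction step: weight by how well s_next explains the observation.
        likelihood = observation_model[(s_next, action)].get(observation, 0.0)
        new_belief[s_next] = likelihood * predicted
    norm = sum(new_belief.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / norm for s, p in new_belief.items()}


# Assumed example: the user's goal does not change during the dialogue,
# and the recognizer hears the correct city 80% of the time.
TRANSITION = {
    ("wants_london", "ask"): {"wants_london": 1.0},
    ("wants_boston", "ask"): {"wants_boston": 1.0},
}
OBSERVATION = {
    ("wants_london", "ask"): {"london": 0.8, "boston": 0.2},
    ("wants_boston", "ask"): {"london": 0.2, "boston": 0.8},
}

belief = {"wants_london": 0.5, "wants_boston": 0.5}
belief = update_belief(belief, "ask", "london", TRANSITION, OBSERVATION)
print(belief)
```

After one noisy observation of "london" the belief shifts towards that goal without ruling out the alternative, which is the behaviour that a fully observable MDP state cannot express.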
DIALOGUE STATE TRACKING
Dialogue management consists of two sub-components: dialogue state tracking and dialogue policy. The dialogue state tracking sub-component interprets the user’s utterances and updates the dialogue state accordingly, while the dialogue policy sub-component determines the next system action on the basis of the current dialogue state. The dialogue state, also known as the belief state, contains information about the dialogue from the viewpoint of the system. In slot-filling dialogues the dialogue state records which slots have been filled, which slots still have to be filled, and possibly the system’s level of confidence in the filled slots.
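The division of labour between the two sub-components can be sketched as follows for a slot-filling dialogue. The slot names, the confidence threshold, and the action labels are assumptions chosen for illustration, and the policy shown here is a handcrafted rule rather than one learned by RL.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

SLOTS = ("origin", "destination", "date")  # hypothetical slots


@dataclass
class DialogueState:
    # slot -> (value, confidence); a slot that is absent is still unfilled
    filled: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def unfilled(self):
        return [s for s in SLOTS if s not in self.filled]


def track_state(state, nlu_result):
    """State tracking: merge interpreted slot values into the state,
    keeping whichever value has the higher confidence."""
    for slot, (value, conf) in nlu_result.items():
        current = state.filled.get(slot)
        if current is None or conf > current[1]:
            state.filled[slot] = (value, conf)
    return state


def policy(state, confirm_below=0.6):
    """Policy: choose the next system action from the current state."""
    for slot, (_, conf) in state.filled.items():
        if conf < confirm_below:
            return ("confirm", slot)
    missing = state.unfilled()
    if missing:
        return ("request", missing[0])
    return ("book", None)


state = DialogueState()
track_state(state, {"destination": ("London", 0.9)})
print(policy(state))  # ('request', 'origin')
```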
Given the representation of the dialogue state space as an MDP or POMDP, the task of RL is to find the optimal path through the state space, that is, the path that maximizes an objective function. The optimal path is known as a policy. The policy determines the system’s next action in the current state, i.e., it maps states to actions. However, at the moment it acts, the system cannot be sure whether the action it plans to take is good, as it can only estimate the action’s long-term value. For example, taking an action such as seeking an explicit confirmation will result in a longer dialogue but might be preferable to allowing a misunderstanding to persist. The system will only know if the action chosen was optimal when it reaches the end of the dialogue. In this way RL operates in an environment with a delayed final reward.
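The confirmation trade-off can be illustrated with a tiny MDP solved by value iteration, a standard dynamic-programming method that assumes the transition model is known (an RL agent would instead have to estimate these values from experience). All states, probabilities, and rewards below are invented for the example.

```python
GAMMA = 0.95

# (state, action) -> list of (probability, next_state, reward).
# Assumed numbers: proceeding without confirmation fails 40% of the time,
# while asking for confirmation costs one extra turn (reward -1).
MODEL = {
    ("unconfirmed", "proceed"): [(0.6, "success", 10.0), (0.4, "failure", -10.0)],
    ("unconfirmed", "confirm"): [(1.0, "confirmed", -1.0)],
    ("confirmed", "proceed"): [(1.0, "success", 10.0)],
}
TERMINAL = {"success", "failure"}


def q_value(state, action, values):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in MODEL[(state, action)])


def actions_in(state):
    return [a for (s, a) in MODEL if s == state]


def value_iteration(sweeps=50):
    states = {s for s, _ in MODEL} | TERMINAL
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states - TERMINAL:
            values[s] = max(q_value(s, a, values) for a in actions_in(s))
    # The policy maps each non-terminal state to its best action.
    policy = {
        s: max(actions_in(s), key=lambda a: q_value(s, a, values))
        for s in states - TERMINAL
    }
    return values, policy


values, policy = value_iteration()
print(policy["unconfirmed"])  # confirm
```

Under these assumed numbers, confirming is worth -1 + 0.95 × 10 = 8.5, whereas proceeding immediately is worth 0.6 × 10 + 0.4 × (-10) = 2, so the longer dialogue is preferred even though the confirmation itself carries a cost.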
PROBLEMS AND ISSUES WITH REINFORCEMENT LEARNING AND POMDPS
One solution to the tractability problems of POMDPs is to create a summary space in which the probability mass of the highest-ranking value is represented, along with the combined mass of the other hypotheses, but the value itself is disregarded. This allows scaling up to a larger number of slots in a slot-filling dialogue. Another solution, the Hidden Information State model, involves partitioning the state space so that partition beliefs are computed instead of state beliefs, and the master state space is reduced to a summary space. The Hidden Information State model addresses the dialogue-modeling problem of how to track multiple hypotheses efficiently, while the summary space addresses the problem of how to keep the learning of the optimal dialogue policy tractable.
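A minimal sketch of the summary-space idea, assuming the belief for a single slot is held as a dictionary from candidate values to probabilities: the mapping keeps the mass of the top hypothesis and the combined mass of the rest, and drops the value itself. The function name and the example cities are hypothetical.

```python
def summarize_slot_belief(belief):
    """Map a belief over slot values to (top_mass, rest_mass).

    The identity of the top-ranked value is deliberately discarded, so
    beliefs that differ only in which value is on top share one summary point.
    """
    if not belief:
        return (0.0, 0.0)
    top = max(belief.values())
    return (top, sum(belief.values()) - top)


belief_a = {"london": 0.5, "boston": 0.25, "austin": 0.25}
belief_b = {"boston": 0.5, "london": 0.25, "austin": 0.25}

# Both beliefs collapse to the same summary point.
print(summarize_slot_belief(belief_a))  # (0.5, 0.5)
print(summarize_slot_belief(belief_b))  # (0.5, 0.5)
```

Because the policy is learned over summary points such as (0.5, 0.5) rather than over every possible slot value, the number of distinct situations it has to cover no longer grows with the size of the slot's vocabulary.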