Conversational AI 8-A TYPICAL DIALOGUE SYSTEMS ARCHITECTURE
The architecture applies to text-based as well as spoken dialogue systems. The main difference is that a spoken dialogue system has a speech recognition component to process the user’s spoken input and a text-to-speech component to render the system’s output as a spoken message. Text-based dialogue systems do not have these components.
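The flow through these components can be sketched as a single function. This is a minimal illustration, not a real system: `run_turn` and the stub components below are hypothetical names, and the stand-in `asr` and `tts` simply convert between bytes and text to show where the speech components would sit.

```python
def run_turn(user_input, nlu, dm, nlg, asr=None, tts=None):
    """Process one user turn. asr and tts are supplied only for spoken systems."""
    text = asr(user_input) if asr else user_input   # speech -> words (spoken only)
    meaning = nlu(text)                             # words -> meaning
    action = dm(meaning)                            # meaning -> next system action
    reply = nlg(action)                             # action -> words
    return tts(reply) if tts else reply             # words -> speech (spoken only)


# Stub components, assumed purely for illustration.
def nlu(text):
    return {"intent": "greet" if "hello" in text.lower() else "unknown"}

def dm(meaning):
    return "say_hello" if meaning["intent"] == "greet" else "ask_repeat"

def nlg(action):
    return {"say_hello": "Hello! How can I help?",
            "ask_repeat": "Sorry, could you repeat that?"}[action]

def asr(audio):
    return audio.decode("utf-8")    # stand-in: treats the bytes as already-recognized text

def tts(text):
    return text.encode("utf-8")     # stand-in: returns bytes instead of audio


print(run_turn("hello", nlu, dm, nlg))                      # text-based system
print(run_turn(b"hello", nlu, dm, nlg, asr=asr, tts=tts))   # spoken system
```

The same NLU, DM and NLG stages serve both kinds of system; only the outer two stages differ.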
AUTOMATIC SPEECH RECOGNITION (ASR)
NATURAL LANGUAGE UNDERSTANDING (NLU)
Given a string of words from ASR, the NLU component analyzes the string to determine its meaning. NLU can take several forms. In syntax-driven semantic analysis, the input is first analyzed syntactically to determine how the words and phrases group together as constituents, and then the meaning is derived using semantic rules attached to the constituents. Logic is often used to represent the meaning of the string, which has the advantage of allowing standard inference mechanisms to be applied to provide a deeper level of understanding.
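As a rough illustration of syntax-driven semantic analysis, the sketch below uses a toy grammar with two rules (S -> NP VP and VP -> V NP) and attaches a semantic rule to each. The lexicon, the function names and the predicate notation are all assumptions made for this example; a real NLU component would use a full parser and a much richer grammar.

```python
# Toy lexicon: each word maps to (syntactic category, semantic value).
# Verbs are curried functions that take the object first, then the subject.
LEXICON = {
    "john": ("NP", "john"),
    "mary": ("NP", "mary"),
    "likes": ("V", lambda obj: lambda subj: f"likes({subj}, {obj})"),
    "sees": ("V", lambda obj: lambda subj: f"sees({subj}, {obj})"),
}


def lookup(word, category):
    """Return the semantic value of a word, checking its syntactic category."""
    entry = LEXICON.get(word.lower())
    if entry is None or entry[0] != category:
        raise ValueError(f"expected {category}, got {word!r}")
    return entry[1]


def interpret(sentence):
    """Derive a logical form for a three-word 'NP V NP' sentence."""
    words = sentence.split()
    if len(words) != 3:
        raise ValueError("toy grammar only covers 'NP V NP' sentences")
    subj_sem = lookup(words[0], "NP")
    verb_sem = lookup(words[1], "V")
    obj_sem = lookup(words[2], "NP")
    vp_sem = verb_sem(obj_sem)   # rule VP -> V NP, semantics: V.sem(NP.sem)
    return vp_sem(subj_sem)      # rule S -> NP VP, semantics: VP.sem(NP.sem)


print(interpret("John likes Mary"))  # likes(john, mary)
```

The meaning of the whole sentence is built only from the meanings of its constituents, in the order dictated by the syntactic rules.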
DIALOGUE MANAGEMENT (DM)
The Dialogue Manager (DM) is the central component of a spoken dialogue system. It accepts interpreted input from the ASR and NLU components, interacts with external knowledge sources, produces messages to be output to the user, and generally controls the dialogue flow. The DM consists of two components:
- The Dialogue Context Model – keeps track of information relevant to the dialogue in order to support the process of dialogue management. This may include what has been said so far in the dialogue and the extent to which this information is grounded, that is, confirmed as mutually understood by the user and the system.
- The Dialogue Decision Model – determines the next system action given the user’s input utterance and the information in the Dialogue Context Model. Decisions may include prompting the user for more input, clarifying or grounding the user’s previous input, or outputting some information to the user.
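The division of labour between the two models can be sketched as follows. This is a hand-written, rule-based illustration under stated assumptions: the class and function names, the `origin` and `destination` slots, and the dictionary format of the NLU result and of the returned action are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueContext:
    """Context model: what has been said so far and what has been grounded."""
    slots: dict = field(default_factory=dict)    # values the user has supplied
    grounded: set = field(default_factory=set)   # slot names the user has confirmed

    def update(self, nlu_result):
        for name, value in nlu_result.get("slots", {}).items():
            self.slots[name] = value
            self.grounded.discard(name)          # a new value must be confirmed again
        for name in nlu_result.get("confirmed", []):
            if name in self.slots:
                self.grounded.add(name)


REQUIRED = ("origin", "destination")  # assumed task: a simple trip query


def decide(context):
    """Decision model: choose the next system action from the context."""
    for name in REQUIRED:
        if name not in context.slots:
            return {"act": "request", "slot": name}        # prompt for more input
    for name in REQUIRED:
        if name not in context.grounded:
            return {"act": "confirm", "slot": name,
                    "value": context.slots[name]}          # ground previous input
    return {"act": "inform", **context.slots}              # output information


ctx = DialogueContext()
ctx.update({"slots": {"origin": "Boston"}})
print(decide(ctx))  # {'act': 'request', 'slot': 'destination'}
```

Keeping the context separate from the decision logic means the decision rules could later be replaced, for example by a learned policy, without changing how the dialogue history is stored.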
NATURAL LANGUAGE GENERATION (NLG)
The Natural Language Generation (NLG) component produces the text to be output to the user. It takes the output of the dialogue manager and converts it into words.
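One simple way to convert DM output into words is to fill templates. The sketch below assumes that approach, which the text does not prescribe; the action format (a dictionary with an `act` key) and the template wording are hypothetical.

```python
# One template per dialogue act; placeholders are filled from the action's fields.
TEMPLATES = {
    "request": "What is your {slot}?",
    "confirm": "Did you say {value} for your {slot}?",
    "inform": "I found a trip from {origin} to {destination}.",
}


def generate(action):
    """Turn a dialogue manager action (a dict with an 'act' key) into text."""
    template = TEMPLATES.get(action["act"])
    if template is None:
        raise ValueError(f"no template for act {action['act']!r}")
    fields = {key: value for key, value in action.items() if key != "act"}
    return template.format(**fields)


print(generate({"act": "request", "slot": "destination"}))
# What is your destination?
```

Templates are easy to write and control, but every new kind of message needs a new template.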
TEXT-TO-SPEECH SYNTHESIS (TTS)
Once the output from NLG has been determined, the next step is to transform it into a spoken utterance. Commercial systems often use pre-recorded prompts in cases where the output can be predicted at the design stage. The alternative is to use text-to-speech synthesis (TTS), in which the text to be spoken is synthesized. TTS is required for messages that are constructed dynamically and cannot be predicted in advance, such as up-to-the-minute news and weather reports or emails read aloud. The quality of TTS engines has improved considerably over the past decade, so that TTS output is not only easy to comprehend but also pleasant for the listener.
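The choice between a pre-recorded prompt and synthesis can be sketched as a lookup with a fallback. Everything here is an assumption for illustration: the prompt table, the file paths, and the `synthesize` callable, which stands in for whatever TTS engine a system actually uses.

```python
# Prompts whose wording is known at design time, mapped to recorded audio files.
# The texts and paths are made up for this example.
PRERECORDED = {
    "Welcome. How can I help you?": "prompts/welcome.wav",
    "Goodbye.": "prompts/goodbye.wav",
}


def render_speech(text, synthesize):
    """Use a recording when one exists for this exact text, else synthesize it."""
    path = PRERECORDED.get(text)
    if path is not None:
        return ("recording", path)
    return ("synthesized", synthesize(text))   # dynamic text needs TTS


def fake_tts(text):
    """Stand-in for a real TTS engine: returns bytes instead of audio."""
    return text.encode("utf-8")


print(render_speech("Goodbye.", fake_tts)[0])                    # recording
print(render_speech("Rain is expected today.", fake_tts)[0])     # synthesized
```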