February 14, 2022

Conversational AI 26: Challenges and Future Directions


Interactions with dialogue systems have been mostly text-based or speech-based. However, many human-machine interactions make use of other modalities. For example, interacting with a smartphone can involve input that combines text, speech, and touch, while output may combine text, speech, images, audio, and video. Multimodal dialogue systems offer some advantages over systems limited to speech and text:

  1. They are more flexible, since they let users choose the input and output modes they prefer, which can help reduce the user's cognitive load.
  2. They can recover from speech recognition errors and related problems by offering visual feedback, in contrast to the limited options available in a speech-only interface.


A system that can process a variety of multimodal inputs can provide a richer conversational experience. Engagement is a key indicator of conversation quality, and a system that detects a problem with engagement can take steps to address it. Multimodal fusion involves integrating input from different modalities into a single meaning representation. Several methods have been used for multimodal integration, including unifying the input elements, representing them as elements of a lattice, and handling them in a state chart. Multimodal dialogue systems in the early 2000s used handcrafted rules to process the input. More recently, statistical and machine learning methods have been used.
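To make the idea of fusion by unification more concrete, here is a minimal sketch in Python. It assumes that each modality has already been parsed into a partial feature structure (a nested dictionary) and that `None` marks a slot the modality leaves unspecified. The function name `unify`, the slot names, and the example frames are all hypothetical and chosen for illustration only; real systems use richer typed feature structures and also take timing into account.

```python
from typing import Any, Dict, Optional

Frame = Dict[str, Any]


def unify(a: Frame, b: Frame) -> Optional[Frame]:
    """Merge two partial feature structures.

    Returns the combined frame, or None if the two frames
    assign conflicting values to the same slot.
    """
    result = dict(a)
    for key, b_val in b.items():
        a_val = result.get(key)
        if a_val is None:
            # Slot missing or unspecified in `a`: take the value from `b`.
            result[key] = b_val
        elif b_val is None:
            # Slot unspecified in `b`: keep the value from `a`.
            continue
        elif isinstance(a_val, dict) and isinstance(b_val, dict):
            # Both sides are nested structures: unify them recursively.
            merged = unify(a_val, b_val)
            if merged is None:
                return None
            result[key] = merged
        elif a_val != b_val:
            # Atomic values that disagree: unification fails.
            return None
    return result


# Hypothetical input: the user says "move this file there" while
# touching a file icon and then a folder icon.
speech = {"act": "move", "object": {"type": "file"}, "destination": None}
gesture = {
    "object": {"type": "file", "id": "report.txt"},
    "destination": {"type": "folder", "id": "archive"},
}

fused = unify(speech, gesture)
print(fused)
```

Speech supplies the action and the gestures fill in what "this" and "there" refer to, so the fused frame is a single meaning representation. If the two modalities disagreed, for instance speech saying "file" while the gesture selected a folder as the object, `unify` would return `None` and the system could ask for clarification.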


Multimodal output is useful in cases where purely textual or spoken output would be less effective. In one approach, a fission module created plans to determine the content of the multimodal output, which comprised facial expressions, gaze, lip movements, and nodding, as well as visual channels such as drawings and graphics. The execution of the plans was coordinated to avoid delays in the output of elements that took longer to process.
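One way to picture this kind of coordination is to prepare all output elements concurrently, so that a slow element such as a graphic does not hold up preparation of the others. The sketch below is an assumption-laden illustration using Python's `asyncio`, not a reconstruction of any particular fission module. The `OutputElement` class, the channel names, and the preparation times are hypothetical, and `asyncio.sleep` stands in for real rendering work.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import List


@dataclass
class OutputElement:
    channel: str         # e.g. "speech", "face", "graphics" (hypothetical)
    content: str
    prep_seconds: float  # simulated rendering cost, an assumption


async def prepare(element: OutputElement) -> str:
    # Stand-in for real work such as speech synthesis or drawing.
    await asyncio.sleep(element.prep_seconds)
    return f"{element.channel}: {element.content}"


async def execute_plan(plan: List[OutputElement]) -> List[str]:
    # Prepare every element concurrently; gather returns the results
    # in plan order once the slowest element is ready.
    return await asyncio.gather(*(prepare(e) for e in plan))


plan = [
    OutputElement("speech", "Here is the route.", 0.05),
    OutputElement("face", "nod", 0.01),
    OutputElement("graphics", "route map", 0.10),
]

start = time.perf_counter()
rendered = asyncio.run(execute_plan(plan))
elapsed = time.perf_counter() - start

for line in rendered:
    print(line)
print(f"ready after about {elapsed:.2f}s")
```

Because the elements are prepared in parallel, the total waiting time is governed by the slowest element rather than by the sum of all preparation times. A fuller design would also schedule the presentation itself, for example aligning lip movements with the synthesized speech.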


















