VISUAL DIALOGUE AND VISUALLY GROUNDED LANGUAGE
Visual dialogue is a new and rapidly developing area that combines Computer Vision and Conversational AI. Building on advances in Computer Vision, such as image classification, scene and object recognition, and question answering about images, the aim of visual dialogue is to enable AI agents to engage with humans in a dialogue about visual content. Visual dialogue has the potential to contribute to several application areas, including aids for visually impaired users, surveillance, robotics (e.g., in search and rescue missions), and tourist navigation.
In the evaluation protocol of the Visual Dialog task, the AI agent ranks a list of candidate answers and is evaluated on metrics that compare its responses with human responses. Quantitative analysis of the dialogues included a measure of coreference, which showed that nearly all the dialogues contained at least one pronoun, and a measure of topic continuity, which showed that there was little topic change during the dialogues. Overall, the results indicated that, although the performance of the AI agent was far from optimal, the Visual Dialog task could serve as a useful testbed for measuring progress toward visual intelligence.
In visual dialogue, an important task is to correctly identify objects and their relationships in a visual context, as in intelligent scene understanding. This involves visually grounded language. In dialogue, grounding is the process by which participants work to achieve mutual understanding about a concept, a reference to a person, object, or event, or some idea or proposition that has been discussed. A dialogue about objects in a visual world is more complex, as it requires a combination of image understanding, spatial reasoning, and language grounding to accurately identify the objects being referenced.
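To make the ranking-based evaluation protocol mentioned earlier more concrete, the following is a minimal sketch of how an agent that sorts candidate answers might be scored against the human response. The text does not name the specific metrics, so the ones used here (mean rank, mean reciprocal rank, and recall@k) are assumptions drawn from common practice in retrieval-style evaluation. The function names `rank_of_human_answer` and `retrieval_metrics` and the example scores are hypothetical.

```python
def rank_of_human_answer(scores, human_index):
    """Return the 1-based rank of the human answer among the candidates.

    `scores` holds the agent's score for each candidate answer (higher is
    better). Ties are resolved in the agent's favour, which is an assumption
    of this sketch.
    """
    human_score = scores[human_index]
    # Count the candidates the agent scored strictly higher than the human answer.
    return 1 + sum(1 for s in scores if s > human_score)


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Aggregate per-question ranks into retrieval metrics (assumed metric set)."""
    n = len(ranks)
    metrics = {
        "mean_rank": sum(ranks) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank
    }
    for k in ks:
        # Fraction of questions where the human answer is in the top k.
        metrics[f"recall@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


# Hypothetical agent scores for three questions, with the position of the
# human answer in each candidate list.
examples = [
    ([0.1, 0.7, 0.2], 1),
    ([0.5, 0.3, 0.9, 0.4], 0),
    ([0.2, 0.9, 0.8, 0.6, 0.1], 0),
]
ranks = [rank_of_human_answer(scores, idx) for scores, idx in examples]
results = retrieval_metrics(ranks)
print(ranks)
print(results)
```

A lower mean rank and a higher mean reciprocal rank both indicate that the agent tends to place the human response near the top of its sorted list, which is the sense in which its responses are compared with human responses.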