Conversational AI 25-DIALOGUE SYSTEMS: DATASETS
DATASETS AND CORPORA
Training a neural dialogue system requires very large datasets. A survey of corpora suitable for training data-driven dialogue systems distinguishes between different types of corpus, for example: written vs. spoken vs. multimodal; human-human vs. human-machine interaction; and spontaneous vs. constrained (where the participants had to talk about a particular task). Each type has its advantages and disadvantages. For example, corpora of human-human dialogues may not be suitable for training human-machine dialogue systems because they have a different distribution of errors. Spoken dialogue systems need to account for the effect of speech recognition errors on performance and to handle uncertainty in the belief state, whereas such errors are less frequent in human-human dialogue.
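The corpus dimensions described above can be made concrete as a small metadata record. The following is a minimal sketch, not part of any survey's own tooling: the names `CorpusInfo`, `select`, and the entries `corpus_a` to `corpus_c` are hypothetical and chosen only for illustration. It shows how corpora might be tagged by modality, interaction type, and setting, then filtered to find candidates for a given system.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    WRITTEN = "written"
    SPOKEN = "spoken"
    MULTIMODAL = "multimodal"


class Interaction(Enum):
    HUMAN_HUMAN = "human-human"
    HUMAN_MACHINE = "human-machine"


class Setting(Enum):
    SPONTANEOUS = "spontaneous"
    CONSTRAINED = "constrained"  # participants talk about a particular task


@dataclass(frozen=True)
class CorpusInfo:
    name: str  # hypothetical label, not a real corpus
    modality: Modality
    interaction: Interaction
    setting: Setting


def select(corpora, **criteria):
    """Return the corpora whose attributes match every given criterion."""
    return [
        c for c in corpora
        if all(getattr(c, key) == value for key, value in criteria.items())
    ]


# Hypothetical entries, for illustration only.
CORPORA = [
    CorpusInfo("corpus_a", Modality.SPOKEN, Interaction.HUMAN_MACHINE, Setting.CONSTRAINED),
    CorpusInfo("corpus_b", Modality.WRITTEN, Interaction.HUMAN_HUMAN, Setting.SPONTANEOUS),
    CorpusInfo("corpus_c", Modality.SPOKEN, Interaction.HUMAN_HUMAN, Setting.CONSTRAINED),
]

# Candidates for a spoken human-machine system.
spoken_hm = select(
    CORPORA,
    modality=Modality.SPOKEN,
    interaction=Interaction.HUMAN_MACHINE,
)
print([c.name for c in spoken_hm])
```

Filtering on explicit attributes makes the trade-off in the text visible: a spoken human-human corpus such as the hypothetical `corpus_c` matches on modality but not on interaction type, so its error distribution may differ from what a deployed system would encounter.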