Full Record


Title The Effect of Data Quantity on Dialog System Input Classification Models
Publication Date
Discipline/Department Health Informatics and Logistics
University/Publisher KTH

This paper investigates how the amount of training data affects different word vector models for classifying dialog system user input. The first hypothesis is that dense vector models need a threshold amount of data to reach the state-of-the-art performance shown in recent research, and that character-level n-gram word vector classifiers are especially suited to Swedish, because of compounding and because character-level n-gram models can vectorize out-of-vocabulary words. A second hypothesis is that models trained on single statements are more suitable for classifying chat user input than models trained on full conversations. The results support neither hypothesis, but show that sparse vector models perform very well on the binary classification tasks used. Further, the results show that 799,544 words of data are insufficient for training dense vector models, but that training on full conversations is sufficient for single-statement classification, as the models trained on single statements show no improvement in classifying single statements.

This work investigates how different amounts of data affect different kinds of word vector models for classifying dialog system input. It seeks to confirm the hypothesis that there is a threshold for the amount of training data at which dense word vector models reach the state of the art, and that character-level n-gram word vector classifiers are particularly well suited to Swedish classifiers, on the grounds that compounding is especially productive in Swedish and that character-level granularity in the models allows previously unseen words to be classified. In addition, it evaluates the hypothesis that classifiers trained on single statements are better suited to classifying input in chat conversations than classifiers trained on full chat conversations. The results support neither hypothesis, but instead show that sparse vector models perform very well in the classification tests carried out. Furthermore, the results show that 799,544 words of data are not enough to train dense word vector models well, but that full conversations are more than sufficient for training models to classify questions and statements in chat conversations, since the models trained on user input statement by statement, rather than on full chat conversations, do not yield better classifiers for chat statements.
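The abstract's point about compounding and out-of-vocabulary words can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the n-gram range of 3 to 5, the `<` and `>` boundary markers, and the example words are all assumptions chosen for illustration. It shows only that a compound absent from a training vocabulary still shares character n-grams with its parts, which is what lets a character-level model assign it a vector.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the set of character n-grams of a word.

    The word is lowercased and wrapped in '<' and '>' so that prefixes
    and suffixes are distinguishable from word-internal n-grams.
    The range 3..5 is an assumption, not a value taken from the thesis.
    """
    marked = f"<{word.lower()}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


def overlap(a, b):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical vocabulary assumed to have been seen during training.
vocabulary = ["mobil", "abonnemang", "faktura"]

# Swedish compound assumed to be absent from the training data.
unseen = "mobilabonnemang"

# N-grams the unseen compound shares with each known word.
shared = {
    word: sorted(char_ngrams(unseen) & char_ngrams(word))
    for word in vocabulary
}

for word in vocabulary:
    print(word, len(shared[word]), round(overlap(unseen, word), 3))
```

A word-level model would treat the compound as a single unknown token, whereas here it overlaps with both of its parts and not with the unrelated word.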

Subjects/Keywords Chatbot; Chatterbot; Virtual Assistant; Dialog System; Natural Language Understanding; Word Embedding; Word Vector Models; Text Classification; Chattbot; Virtuell Assistent; Dialogsystem; Naturlig språkbehandling; Ordinbäddning; Ordvektormodeller; Textklassificering; Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling)
Language en
Country of Publication se
Record ID oai:DiVA.org:kth-237282
Repository diva
Date Indexed 2020-01-03

Sample Search Hits

…Text; 2.2.1 Intents; 2.2.2 Word Vectors (Bag of Words, TF-IDF, N-grams); 2.2.3 Supervised Machine Learning and Text Classification; 2.2.4 Two or More Classes; 2.2.5 Analyzing Classification Results; Confusion…

…models and implementing them as part of a dialog system? 1.1 Problem Statement At the outset of 2018, Telenor customer services were available via phone, email, online free text forms and social media. Each of these communication channels has resulted…

…in a log of analyzable text data which, for the purpose of information extraction, were used to train language models for binary classification. Different training methods and word representations had been tested for these models and the resulting…

…future work move on from binary classification into the multi-labelling classification which is needed for a dialog system. Jurafsky and Martin [2] state that “[b]efore almost any natural language processing of a text, the text has to…

…be normalized [and that] three tasks are commonly applied as part of any normalization process: - Segmenting/tokenizing words from running text - Normalizing word formats - Segmenting sentences in running text” These first two tasks…

…of the project was to evaluate the impact of different word embeddings on classification, and not to test all the techniques that NLP has to offer. Therefore, the data preprocessing has been limited to extracting text from chat logs; among other…

…tested the performance of different text classification methods to further the understanding of how much data may be needed for reliable classifications using n-gram word representations. We begin by turning to the former of those two aspects of dialog…

…CS, are divided into Consumer and Small and Medium Enterprises and all chat services on social media channels are treated as one category. 2.2 Understanding Text The Natural Language Understanding of a dialog system often has the goal of identifying…