An Evaluation of Information Composition in Dementia Detection Based on Speech

In recent years, scientists have been paying much attention to research on automatic dementia detection applied to the speech samples of dementia patients. In a related context, recent years have also seen the fast development of Deep Learning (DL) and Natural Language Processing (NLP), and techniques developed for text classification or sentiment analysis have been applied to early dementia detection by many researchers. However, text classification and sentiment analysis are tasks different from dementia detection, which leads us to believe that task-specific adjustments would help improve the performance of machine learning models for dementia detection. In this work, we implemented experiments with various language models, including traditional $n$-gram language models, Averaged Stochastic Gradient Descent Weight-Dropped Long Short-Term Memory (AWD-LSTM) models, and attention-based models, to evaluate the speech data of dementia patients. Unlike traditional works where the text is stripped of stop words, we propose to exploit the stop words themselves, since they offer non-context information that helps to identify dementia. As a result, three different models are prepared in this work: a model processing only context words, a model processing stop words and Part-of-Speech (PoS) tag sequences, and a model processing both. Through these experiments, we show that both grammar and vocabulary contribute to classification: the three models achieve accuracies of 70.00%, 76.16%, and 81.54%, respectively.

gradually over the long term. Dementia usually has a severe influence on language ability, memory, and executive functions. It also leads to a lack of motivation, motor problems, and emotional distress. As the disease develops, these symptoms become increasingly severe, which reduces the autonomy of the patients as well as their well-being and that of their caregivers [1].
With age being the main risk factor for Alzheimer's disease (AD), which accounts for the majority of dementia cases, the number of dementia patients is expected to increase in the following years, because the population over 65 years old is predicted to triple between 2000 and 2050 [2]. As such, dementia is expected to have an ever-growing impact on society. In 2015, the estimated number of dementia patients worldwide was over 47.5 million. According to the World Health Organization (WHO) [3], a longitudinal study, in which researchers keep tracking the status of the subjects through the years, found an annual incidence of dementia between 10 and 15 cases per thousand people. Patients who develop dementia have on average 7 years of life expectancy, and less than 3% of dementia patients live longer than 14 years [3].

This severe situation calls on institutions and researchers to put more effort into dementia prevention and early detection. Cost-effective and scalable methods for the detection of dementia that can capture its subtle symptoms [...] assessment. Language patterns are related to cognitive status and reflect the decline of cognitive functioning; thus, they can be used in the design of assistive technologies [8]. For one thing, dementia usually causes language impairment, which shows in difficulties in word-finding, understanding, and accuracy, and in a lack of coherence in speech [1]. Furthermore, language also relies on other cognitive functions, including executive functions, so that communication happens in a sound and meaningful way. Cognitive functions also play important roles in decision making, strategy planning, and problem solving, all of which are significant to communication [9]. Speech data are also common and easy to collect. In the past few years, using Natural Language Processing (NLP) and machine learning techniques to detect dementia based on speech and language data has been receiving attention from researchers around the world [9].

Language is a good indicator for early dementia detection. However, analyzing language is difficult and time-consuming because it requires manual analysis performed by professionals. Advances in speech and language analysis techniques bring three advantages. First, they could help to develop reliable tools for detecting the differences between dementia and non-dementia speech samples. Second, they can quantify the stages of dementia. Third, they can distinguish between different types of dementia [9], [10], [11].

From a medical perspective, dementia is not a single disease; the term applies to a wide spectrum of medical disorders. AD accounts for more than 60% of dementia cases [12], [13]. Even though certain dementia disorders may be cured if discovered early enough, the vast majority of dementia diseases are incurable. Expert evaluation and early diagnosis of dementia symptoms, however, may help to halt the advancement of the disease. Another merit of early detection is that it largely helps the people around the patient better understand the patient's previously puzzling behavior. Because of this importance, scientists are paying increasing attention to dementia diagnosis and to developing novel ways of identifying it.
As a consequence, various research works in the past have focused on dementia detection [14], [15], [16]. Dementia testing may take several forms, ranging from cognitive tests and brain imaging to laboratory testing and brain scans [17], [18]. However, these techniques are usually expensive and time-consuming. Automating this process could reduce the cost and make testing more accessible to the general population.

As a result, the scientific community has been looking at numerous ways to perform dementia diagnosis automatically. In particular, automating dementia diagnosis with cutting-edge Artificial Intelligence (AI) technologies might make this activity considerably more economical and accessible. This is because AI technologies have made considerable advances in recent years, allowing them to recognize subtle patterns in a range of data formats while also being substantially less expensive [19], [20], [21].

Concerning the topic of dementia detection, a few data sets are publicly available for experiments, such as DementiaBank and Dem@Care. They present data in different formats, notably audio, video, and transcribed text of dementia patients and control subjects. Among these, the speech modality, whether in audio or text format, is a very informative type of data and has attracted the most attention. Many works have addressed the idea of processing the text in its transcribed format or as an audio signal for dementia detection [11], [22], [23], thanks to the advances in the fields of NLP and audio processing. For instance, a wide variety of techniques related to text classification have been proposed in the literature [11], [24]. Whether the task is sentiment analysis, hate speech detection, or automated bot identification [25], the overall way to perform it is roughly the same: extract clues from the text itself and use Artificial Intelligence (AI), namely machine learning and deep learning, to identify the target class. Applying these techniques to dementia detection has led to some promising results [26]. However, we believe that the distinctions between opinion [...]

[...] includes recordings of interviews with dementia patients and healthy people. By using these features, they composed an additive logistic regression model that could distinguish the speech of healthy subjects from that of dementia patients.

In some studies, signal processing and NLP techniques are used to detect signs of dementia that may be imperceptible to human professionals. For example, Tóth et al. [30] discovered that, even though human annotators could not recognize filled pauses (sounds like ''hmmm,'' etc.) reliably, these features are easy to collect with an Automatic Speech Recognition (ASR) system. In this research, several acoustic parameters (hesitation ratio, speech speed, length, number of silent and filled pauses, and duration of utterance) were extracted from the recorded speech of 38 healthy control subjects and 48 patients with Mild Cognitive Impairment (MCI) talking about two short films. They found that ASR-extracted features outperformed manually computed features (69.1% accuracy) when combined with machine learning approaches, notably with a Random Forest classifier (75% accuracy). König et al. [31] employed a similar machine learning approach and reported an accuracy of 79% when discriminating MCI individuals from healthy counterparts, 94% for AD vs. healthy, and 80% for MCI vs. AD. Their tests, on the other hand, were conducted on non-spontaneous speech data collected under controlled settings as part of a neuropsychological evaluation that also included mechanically transcribed text.
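To make this kind of pipeline concrete, the sketch below shows how such temporal speech features could feed a Random Forest classifier. It is only an illustration: the feature names and the load_features() helper are hypothetical placeholders, not the actual setup of [30] or [31].

```python
# Illustrative sketch: classifying subjects from ASR-derived temporal speech
# features with a Random Forest. Feature names and load_features() are
# hypothetical placeholders; real features would come from an ASR pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

FEATURES = ["hesitation_ratio", "speech_rate", "utterance_length",
            "n_silent_pauses", "n_filled_pauses", "utterance_duration"]

def load_features():
    """Placeholder: return an (n_subjects, n_features) array and 0/1 labels."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(86, len(FEATURES)))   # e.g., 38 controls + 48 MCI subjects
    y = np.array([0] * 38 + [1] * 48)
    return X, y

X, y = load_features()
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("Cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```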

The idea of combining two perplexity values, one from a language model trained on speech samples of dementia patients and one from a language model trained on speech samples of healthy subjects, was proposed by Wankerl et al. [32]. Perplexity estimates the fit between a probabilistic language model and a sample of text unseen during training. The $n$-gram language model is a method widely used in processing spoken or written language [33]. $N$-gram language models build a probability distribution from the training text by counting the frequencies of word sequences. In the simplest uni-gram (1-gram) language model, each sequence contains only one word: the model counts the words in the training data and assigns a probability to each of them. Consider a sentence $S = (w_1, w_2, \ldots, w_k)$, where $w_1, w_2, \ldots, w_k$ represent the 1st, 2nd, ..., $k$-th word of $S$. For any sentence in the test data, the model estimates its probability of existence based on the training data. In the case of the uni-gram language model, the sentence probability $p(S)$ equals the product of the individual word probabilities, $p(S) = \prod_{i=1}^{k} p(w_i)$.

The uni-gram language model cannot capture any contextual information because it only gives the probability distribution of individual words. On the other hand, calculating the probability distribution of entire sentences leads to probabilities that are essentially unique to each sentence, which makes it hard for the model to make proper predictions on new, unseen data. Therefore, the length of the sequences is limited to a small number. A model that calculates the probability distribution of sequences composed of $n$ words is called the $n$-gram language model; for example, the tri-gram language model works with sequences of 3 words. For a sentence $S = (w_1, w_2, \ldots, w_k)$ of $k$ words, when the sequence length is $n$, the probability is evaluated by [32]:

$$p(S) = \prod_{i=1}^{k} p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \qquad (1)$$
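As an illustration of Eq. (1), the following is a minimal count-based tri-gram model with maximum-likelihood estimates; the toy corpus is invented for the example, and no smoothing is applied yet.

```python
# Minimal count-based tri-gram language model implementing Eq. (1) with n = 3.
from collections import Counter

def train_trigram(sentences):
    """Count tri-grams and their 2-word histories from tokenized sentences."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(tokens)):
            tri[tuple(tokens[i - 2:i + 1])] += 1
            bi[tuple(tokens[i - 2:i])] += 1
    return tri, bi

def sentence_prob(sent, tri, bi):
    """p(S) as the product of p(w_i | w_{i-2}, w_{i-1})."""
    tokens = ["<s>", "<s>"] + sent + ["</s>"]
    prob = 1.0
    for i in range(2, len(tokens)):
        history = tuple(tokens[i - 2:i])
        if bi[history] == 0:
            return 0.0                      # unseen history: MLE probability collapses to zero
        prob *= tri[tuple(tokens[i - 2:i + 1])] / bi[history]
    return prob

corpus = [["the", "boy", "is", "stealing", "cookies"],
          ["the", "girl", "is", "laughing"]]
tri, bi = train_trigram(corpus)
print(sentence_prob(["the", "boy", "is", "laughing"], tri, bi))   # 0.0: an unseen tri-gram
```

The zero probability returned for the unseen tri-gram is exactly the problem addressed by the smoothing discussed later.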

In training the $n$-gram language model, each sentence is [...]. This procedure is repeated for all the subjects in the data set. In addition to $p_{\text{other}}$ and $p_{\text{own}}$ (the perplexities obtained from the model trained on the other group and on the subject's own group, respectively), the difference between them, $p_{\text{other}} - p_{\text{own}}$, is added as another feature.

Cohen et al. [36] interrogated the two perplexities methods by using artificially synthesized speech data created to simulate progressive dementia. Bird et al. [37] created synthetic narratives by creating a baseline sample and removing and/or replacing the nouns and verbs with words of higher lexical frequency (mother vs. woman vs. person). Lexical frequency indicates how specific a word is in describing the context information. In Fig. 1, we give two groups of examples; in both, the words in the outer circles have broader meanings that include the meanings of the words shown in the inner circles. Cohen et al. [36] followed the work of Bird et al. [37] and implemented the two perplexities methods by comparing the original data (words are neither replaced nor removed) and the modified data (some words are replaced with higher-lexical-frequency words). By doing so, they noticed that the perplexity distribution is highly influenced by the words' lexical frequency. Their research confirmed that the lexical frequency of the vocabulary is effective in detecting dementia.
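The decision rule behind the two perplexities methods can be sketched as follows; the per-token log-probability functions below are stand-ins for the trained group-specific language models, not the actual models of [32] or [36].

```python
# Sketch of the two-perplexities idea: score a transcript with a model trained
# on dementia speech and one trained on control speech, then use the gap
# between the two perplexities as the decision feature.
import math

def perplexity(logprob, tokens):
    """Perplexity = exp of the negative mean per-token log-probability."""
    return math.exp(-sum(logprob(t) for t in tokens) / len(tokens))

# Hypothetical per-token log-probabilities; a real system would query n-gram or
# neural language models trained on dementia and on control transcripts.
logprob_dementia = lambda tok: math.log(0.05 if tok in {"uh", "thing"} else 0.01)
logprob_control = lambda tok: math.log(0.05 if tok in {"cookie", "sink"} else 0.01)

sample = "the boy takes a cookie from the jar while the sink overflows".split()
p_dem = perplexity(logprob_dementia, sample)
p_con = perplexity(logprob_control, sample)
delta = p_dem - p_con            # difference feature used alongside the two perplexities
print(p_dem, p_con, delta)       # a threshold on delta yields the binary decision
```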

Previous works analyzed language models and data; however, the analysis does not answer all questions about the topic. They found that language models' perplexities are associated with lexical frequency, but is lexical frequency the primary information in the detection? Which aspect of language, syntax or semantics, contributes the most to the neural network classifiers? What kind of information composition or format allows the neural networks to improve their accuracy in dementia detection? In the medical domain, data are often limited and tied to the personal privacy of the patients; therefore, not all the data are available in the desired amount and form, and deep learning often cannot reach its best performance. Hence, answering these questions helps us develop more reliable, explainable, and accurate models by manually manipulating the data we have. Besides, using language as a source for manual dementia diagnosis is common in traditional methods. This research also aims to provide a new viewpoint for such manual analysis, for example, which part or which component of the sentence deserves more attention.

In this work, we first explored whether the richness, specificity, or variety of the vocabulary, along with the difficulty of predicting the next word, should be the primary indicator in the task of dementia detection, or whether the text's grammatical structure may be a better indicator. We re-implemented the two perplexities methods with [...] words and PoS tags, which contain the sentence-pattern information but with finer details.

Lastly, we discussed whether vocabulary variety or richness should be the primary indicator for dementia detection. The overall pipeline of the research is shown in Fig. 2. In [36], the authors evaluated why the two perplexities methods work and found that the perplexities of neural network models are associated with lexical frequencies. We investigated this topic further with the aforementioned methods and showed experimentally that, despite the importance of the context words, they are not necessarily the most valuable indicators for dementia detection.

The major contributions of our work are summarized as follows:

• We re-implemented the two perplexities methods with PoS tags and stop word sequences, which requires less computation and generalizes better.

[...]

The DementiaBank data set provided in TalkBank [38] is used in this work to evaluate the performance of the different introduced models in dementia detection. TalkBank is a [...] learning models. Following the fact mentioned above, for each speech sample, 4 instances composed of different information (i.e., original texts, PoS tag sequences, and stop word lists) are created. These instances are generated as described below (Fig. 3).

• An instance with only context words: all of the words in this case are context words. As previously noted, this is a typical method for deleting ''noisy'' text parts, and it enhances classification in a variety of natural language processing applications, including sentiment analysis [4]. Previous studies, such as [39], have used this method in the field of dementia and CI detection. The speech samples processed in this manner are used to create a data set we refer to as C.

• An instance without context words: the context words in the speech samples are replaced with their PoS tags, while the stop words are kept as they are. Although it is counterintuitive, we process the data in this way because it allows us to notice when a phrase or paragraph does not follow the natural flow of language and to reveal common language patterns regardless of the context. Despite its lack of value in tasks such as sentiment analysis or hate speech detection, we believe that this information is highly useful when dealing with the issue of dementia diagnosis. The speech samples processed in this manner are used to create a data set we refer to as P.

[...] word sequences that never occur in the training data may nevertheless happen. Hence, a smoothing method that avoids this issue is necessary to ensure the system works well on the test data. Additive smoothing simply assigns a fixed value to the sequences that do not appear in the training data [34].
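To make this concrete, the following is a minimal sketch of additive (add-$\alpha$) smoothing applied to tri-gram counts; the counts and vocabulary size are toy placeholders rather than values from the actual data.

```python
# Sketch of additive smoothing for a tri-gram model: every tri-gram, seen or
# not, receives a pseudo-count of alpha, so unseen test sequences keep a
# non-zero probability. Counts and vocabulary size below are toy placeholders.
from collections import Counter

trigram_counts = Counter({("the", "boy", "is"): 3, ("boy", "is", "stealing"): 2})
history_counts = Counter({("the", "boy"): 3, ("boy", "is"): 2})
vocab_size = 1000          # |V|: number of distinct words (or PoS tags and stop words)

def smoothed_prob(trigram, alpha=1.0):
    """p(w | h) = (count(h, w) + alpha) / (count(h) + alpha * |V|)."""
    history = trigram[:2]
    return (trigram_counts[trigram] + alpha) / (history_counts[history] + alpha * vocab_size)

print(smoothed_prob(("the", "boy", "is")))        # seen tri-gram: relatively high probability
print(smoothed_prob(("the", "boy", "cries")))     # unseen tri-gram: small but non-zero
```

Setting $\alpha = 1$ gives Laplace smoothing.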

In this experiment, we use Laplace smoothing, which is a kind of additive smoothing. [...] Following [32], we train two language models, one on dementia data and one on non-dementia data, and use the two language models to calculate the perplexities of the test samples. By checking the perplexity difference between the two language models against a threshold, we decide which category the test data belongs to. The flowchart of this method is shown in Fig. 4.

We employ two neural networks for classification, as previously described. The first neural network is based on the Averaged Stochastic Gradient Descent Weight-Dropped LSTM (AWD-LSTM) as proposed in [39]. The second neural network is trained from scratch using a standard attention network architecture. We present the details of these networks in the following parts.

[...] thoroughly in [39]. This language model was trained on the WikiText-103 corpus, as described in [39].

Target Task Language Model Fine-Tuning. We fine-tune the pre-trained language model using the dementia data set at hand, using all of the data in the data set. In this stage, the labels are not utilized at all, because our aim at this point is for the language model to learn specific linguistic characteristics, i.e., how words are related to one another, the hidden meanings of slang, etc. The size of the embedding matrix is $N \times 400$, where $N$ represents the number of different words in the data set, which is also the size of the first layer of the network. To fine-tune the model, the Universal Language Model Fine-Tuning (ULMFiT) technique proposed in [39] is employed, which involves progressively unfreezing the layers and adjusting the learning rates. The softmax layer is the first to be unfrozen, enabling its parameters to be fine-tuned with a learning rate of 0.1 for the first epoch. The remaining layers are then unfrozen and adjusted with a learning rate of 0.001 for 5 epochs, after which we continue training while lowering the learning rate. The dropout rate for all the layers is set to 0.3, while the Adam optimizer's parameters $\beta_1$ and $\beta_2$ are set to 0.90 and 0.99, respectively.
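A rough sketch of this language-model fine-tuning stage using the fastai library (which implements ULMFiT) is given below; the file name, column name, and exact schedule are illustrative assumptions rather than the paper's exact configuration.

```python
# Stage 1 of ULMFiT, sketched with fastai: fine-tune the pre-trained AWD-LSTM
# language model on the (unlabeled) transcripts. File and column names are
# hypothetical placeholders.
from fastai.text.all import *

df = pd.read_csv("transcripts.csv")            # hypothetical file with a 'text' column
dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True, valid_pct=0.1)

learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=accuracy)
learn.fit_one_cycle(1, 1e-1)                   # only the new head is trainable at first
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)                   # then fine-tune all layers at a lower rate
learn.save_encoder("dementia_ft_encoder")      # reused by the classification stage
```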

Target Task Classifier. The last step in the language model adjustment process is the classification. We replace the final softmax layer of the original network with two linear blocks, one with Rectified Linear Unit (ReLU) activation and the other with softmax activation. In other words, in this model they are inserted after the three LSTM layers, because the model is no longer employed as a language model that predicts the next word, but as a classification model that predicts the text's class. The model is fine-tuned using gradual unfreezing, discriminative learning rates, and slanted triangular learning rates. Unlike the previous phase, we utilize [...]

The learning rate of the LSTM layer is set following [39], which states that if the final layer's learning rate is $\eta_l$, the prior layer should have a learning rate of $\eta_{l-1} = \eta_l / 2.6$. We then unfreeze the second LSTM layer, following the same rule but with a smaller learning rate.

Finally, the whole network is unfrozen and trained using the progressively decreasing learning rates described above.

[...] network is given in Fig. 6.
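The classifier fine-tuning stage described above, with gradual unfreezing and discriminative learning rates, can be sketched with fastai as follows; file names and hyperparameters are again illustrative, and the 1/2.6 ratio between layer groups follows [39].

```python
# Stage 2 of ULMFiT, sketched with fastai: build the classifier on the
# fine-tuned encoder and unfreeze it gradually with discriminative learning
# rates. File and column names are hypothetical placeholders.
from fastai.text.all import *

df = pd.read_csv("transcripts.csv")            # hypothetical file with 'text' and 'label' columns
# In practice, pass text_vocab=<LM DataLoaders>.vocab so the vocabulary matches the encoder.
dls_clas = TextDataLoaders.from_df(df, text_col="text", label_col="label", valid_pct=0.2)

learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.3, metrics=accuracy)
learn.load_encoder("dementia_ft_encoder")      # encoder saved by the fine-tuning stage

base_lr = 2e-2
learn.fit_one_cycle(1, base_lr)                          # train only the new classifier head
learn.freeze_to(-2)                                      # also unfreeze the last LSTM layer
learn.fit_one_cycle(1, slice(base_lr / 2.6**4, base_lr))
learn.freeze_to(-3)                                      # then the next LSTM layer
learn.fit_one_cycle(1, slice(base_lr / 2 / 2.6**4, base_lr / 2))
learn.unfreeze()                                         # finally the whole network
learn.fit_one_cycle(2, slice(base_lr / 10 / 2.6**4, base_lr / 10))
```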

The main differences between the proposed methods and the previous works are summarized in Table 1.

[...] between the two language models. The X-axis is the patient index manually assigned to each subject. Each held-out test subject contributes multiple samples, because the subjects visit the doctor more than once during their treatment. There are two optional schemes for the machine learning model to perform classification: it can decide the category of each speech sample individually, or it can decide the category of each patient after concatenating his or her samples into one big sample.

We determine whether a test sample comes from a dementia patient or a healthy control subject by setting a threshold on the perplexity difference. We reached an accuracy of 75.3% in the task of classification per patient, and an accuracy of 71.5% in the task of classification per sample using the equal error rate as the threshold.

We then implemented the two perplexities methods using two tri-gram language models on the data set instance P, where the context words are replaced by PoS tags and the stop words are kept as they are. The perplexity difference results are presented in Fig. 8. The range of the Y-axis differs from the previous figure because the stop words are added in the training process and the training vocabulary is different, which changes the complexity of the models.

We reached an accuracy of 80.8% in the task of classification per patient, and 72.8% in the task of classification per sample using the equal error rate as the threshold. The accuracy improved when the stop words were included, approaching that of [32]. The Receiver Operating Characteristic (ROC) curve of this experiment is shown in Fig. 9. The ROC curve shows the true positive rate and the false positive rate of our method under different thresholds, i.e., the trade-off between sensitivity and specificity. We achieved an AUC of 0.78. To diagnose dementia, we utilized only 36 PoS tags and 127 stop words, which drastically lowers the cost of calculation and annotation while maintaining high accuracy.
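For reference, the equal-error-rate threshold and the AUC reported above can be derived from the per-sample perplexity differences as sketched below; the label and score arrays are placeholders for the actual values.

```python
# Sketch: choosing an equal-error-rate (EER) threshold on the perplexity
# difference and computing the AUC with scikit-learn. Labels and scores below
# are placeholders for the actual per-sample perplexity differences.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

labels = np.array([0, 0, 0, 1, 1, 1, 0, 1])                       # 1 = dementia
scores = np.array([-1.2, -0.5, 0.1, 0.8, 1.5, 0.3, -0.9, 1.1])    # perplexity differences

fpr, tpr, thresholds = roc_curve(labels, scores)
eer_idx = np.argmin(np.abs(fpr - (1 - tpr)))          # point where FPR is closest to FNR
print("EER threshold:", thresholds[eer_idx])
print("AUC:", roc_auc_score(labels, scores))

predictions = (scores >= thresholds[eer_idx]).astype(int)
print("Accuracy at EER threshold:", (predictions == labels).mean())
```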

In addition, we implemented experiments to see how much influence the high-frequency words have on the [...] where each word is ranked based on its frequency of appearance.

Using this list, we pick the top $N$ words to form the actual lists of words to keep. We composed the keep-lists by strictly following the order of English word frequency.

In the experiments, we create 100 sub-lists following this logic. The 1st list includes the 20 most used words in English.

The 2nd list includes the 40 most used words in English, and so on. [...] the $n$-gram language model based on the two perplexities technique. To identify dementia, we assume that the $n$-gram language model captures particular sentence patterns from the sequences of PoS tags. Dementia symptoms are reflected not just in vocabulary choice, but also in syntax and sentence structure.
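A sketch of how such frequency-ranked keep-lists can be built and applied is shown below; the ranked word-list file is a hypothetical placeholder, and NLTK's tagger stands in for whatever tagger was actually used.

```python
# Sketch: build nested keep-lists from a frequency-ranked English word list
# (top 20, 40, ..., 2000 words) and replace every other word with its PoS tag.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

ranked_words = [w.strip() for w in open("word_frequency_ranked.txt")]   # hypothetical ranked list
keep_lists = [set(ranked_words[:20 * (i + 1)]) for i in range(100)]     # 100 nested sub-lists

def apply_keep_list(transcript, keep):
    """Keep listed words as-is; replace every other word with its PoS tag."""
    tagged = nltk.pos_tag(nltk.word_tokenize(transcript.lower()))
    return [w if w in keep else tag for w, tag in tagged]

sample = "the boy is stealing cookies from the cookie jar"
print(apply_keep_list(sample, keep_lists[0]))    # only the 20 most frequent English words survive
```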

Furthermore, the words employed are heavily influenced by the data collection procedure. Subjects in the Pitt Corpus are requested to complete a picture description task. As a result, the words used are restricted to the contents that are visible in the image. When the subject is asked to explain what he or she is seeing, the available vocabulary has already been limited to a small number of terms related to the image. This implies that, regardless of how large the stop word list is, the majority of its terms are not used in the description. Due to the small size of the Pitt Corpus, training an $n$-gram language model on PoS tags with stop words saves time and prevents over-fitting: on the one hand, it merely handles 36 PoS tags and a few stop words; on the other hand, converting words into their PoS tags reduces the likelihood of seeing unexpected words or sequences in the test data, which helps minimize over-fitting.

B. AWD-LSTM NETWORK

We display the performance of the classification using the AWD-LSTM network in Table 3. [...]

In Table 2, we show that the $n$-gram language model-based two perplexities methods can detect dementia with an accuracy of 72.78% without any context words. Fig. 10 shows that keeping more high-frequency words can improve the performance of the system; yet, after the number of kept words exceeds 260, keeping more words does not necessarily improve the performance. We believe that this is because, among the most used English words, words that carry no context information but contribute to a proper grammatical structure account for the main part. By keeping these words in the data, more grammatical information can be conveyed by the sequence of PoS tags and kept words. Beyond 260 kept words, the additionally included words mostly convey specific context information but do not add much information in terms of grammar.

We showed in Tables 3 and 4 that keeping only the part of the texts that conveys their context and subject (keeping only context words) results in a significant decline in classification performance for both classifiers. Keeping just the syntactic component of the information, on the other hand, results in higher performance, although it is still inferior to utilizing the complete text. This indicates that a person's ability to construct grammatically accurate phrases may be used to detect dementia. This is consistent with our findings in the first experiment using $n$-gram language models, in which we applied perplexity to the syntactic part of the text to diagnose dementia using multiple language models. Using the complete text without discarding any information, on the other hand, yields the greatest accuracy, indicating that both the syntactic and semantic components of the texts are required for better classification.

In this paper, we utilized a data set of transcribed texts obtained from dementia patients and control people to con-