Transformer-based deep neural network language models for Alzheimer’s disease detection from targeted speech

Background: We developed transformer-based deep learning models based on natural language processing for early diagnosis of Alzheimer's disease from the picture description test. Methods: The lack of large datasets poses the most important limitation for using complex models that do not require feature engineering. Transformer-based pre-trained deep language models have recently made a large leap in NLP research and application. These models are pre-trained on available large datasets to understand natural language texts appropriately, and have been shown to subsequently perform well on classification tasks with small training sets. The overall classification model is a simple classifier on top of the pre-trained deep language model. Results: The models are evaluated on picture description test transcripts of the Pitt corpus, which contains data of 170 AD patients with 257 interviews and 99 healthy controls with 243 interviews. The large bidirectional encoder representations from transformers (BERT-Large) embedding with a logistic regression classifier achieves a classification accuracy of 88.08%, which improves the state-of-the-art by 2.48%. Conclusions: Using pre-trained language models can improve AD prediction. This not only addresses the lack of sufficiently large datasets, but also reduces the need for expert-defined features.

Unlike most earlier studies, the features in our approach are extracted by the model itself in an unsupervised manner. As a result, more complex features are discovered and used for diagnosis. More precisely, the models are pre-trained on a large dataset to learn a good high-dimensional (e.g. 1024-dimensional) vector representation of the input sentence or text, which is then used as input to AD versus healthy control (HC) classifiers. Another approach taken in this study to address the problem of insufficiently-sized datasets is text augmentation. Similar to most related works, the methods are evaluated on the Cookie-Theft picture description test transcripts of the Pitt corpus [10] from the DementiaBank [10] dataset. As mentioned earlier, the overall classification framework takes raw interview text as input. Our evaluation shows that pre-trained deep transformer-based language models with a simple logistic regression classifier work well in AD prediction, and the results generally outperform those of existing methods, while the proposed method does not require any hand-crafted features for training the classifier.

In all earlier works, in order to automatically diagnose the disease using speech, information content units were introduced by human experts, and a classifier used these features.

In languages other than English, Khodabakhsh et al. [24] and Weiner et al. [25] examined the subject in Turkish and German, respectively. Also, Li et al. [26] and Fraser et al. [27] both focused on a multilingual approach to diagnosing AD using targeted speech. They tried to improve AD prediction in Chinese and French, respectively (languages for which the existing datasets were insufficient), using an English classifier trained on a larger English dataset.
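The overall pipeline described above can be sketched as follows. This is only a sketch: random vectors stand in for the 1024-dimensional transcript embeddings that a pre-trained model such as BERT-Large would produce, and the sample counts are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the overall pipeline: a pre-trained language model maps each
# transcript to a fixed-size vector (e.g. 1024 dims for BERT-Large), and a
# simple logistic regression separates AD from HC. The random vectors below
# are placeholders for real transcript embeddings.
rng = np.random.default_rng(0)
n_ad, n_hc, dim = 40, 40, 1024

X = rng.normal(size=(n_ad + n_hc, dim))   # placeholder transcript embeddings
y = np.array([1] * n_ad + [0] * n_hc)     # 1 = AD, 0 = healthy control

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)[:, 1]        # per-transcript AD probability
print(probs.shape)
```

In the real system, only the logistic regression on top is trained on the target data; the embedding model stays fixed.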

The most challenging problem in developing techniques for recognizing Alzheimer's patients from speech transcripts is the lack of a large dataset. Currently, the largest available dataset is the Pitt corpus from the DementiaBank dataset, which contains 500 picture description interviews from the AD and control groups. For this reason, most of the earlier work was based on features designed by experts, as it was not possible to use models capable of learning informative features by themselves. In this study, we simultaneously employ two ideas, a highly pre-trained language model and dataset augmentation, to address this issue and enhance the classification accuracy. Our implementation of these ideas is described next.

Pre-trained deep language model

Every model that defines a probability distribution over a sequence of words is called a language model. For a computational model to implement a language model, it must have a good understanding of the syntactic and semantic structures of that language. Therefore, using a model that has already learned a probability distribution that correlates with these structures almost eliminates the need for large target-specific datasets when performing classification. The transfer of knowledge from one model to another with a similar purpose is called transfer learning. We use transformer-based language models, which have offered a breakthrough in many language understanding tasks in recent years [32]. The general flow of using a pre-trained language model for a classification task consists of three steps.

To address the problems facing recurrent models, such as the issue of short-term memory and the challenges of parallelizing training, Vaswani et al. [33] introduced transformers, which make extreme use of the attention mechanism that underpins many NLP models. The paper argues that the attention mechanism allows the model to focus on certain parts of the text for decision making.

In both of these models, there is no attention mechanism.
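As a minimal illustration of the attention mechanism referred to above, the scaled dot-product attention of Vaswani et al. can be written in plain NumPy (toy dimensions and random inputs; a real transformer adds learned projections, multiple heads, and masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core operation from Vaswani et al.: each output position is a weighted
    average of the values V, with weights given by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 token queries of dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # each output row attends over all 4 keys
```

The softmax weights make explicit which parts of the input each position "focuses on" when producing its output.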

Dataset augmentation

Another approach to overcoming the lack of large training data is dataset augmentation, which means increasing the number of labeled samples in the dataset using probabilistic or even heuristic algorithms. For example, the word "beautiful" in a sentence such as "What a beautiful car!" can be replaced with the word "nice" without changing the meaning of the sentence much. Augmentation in NLP can be done at the character, word, and sentence levels; in this study, the word and sentence levels are used for augmenting the dataset. The most crucial challenge of augmentation in text classification is preserving the text's class during augmentation. For example, a probabilistic model can replace "beautiful" with "dirty" in the mentioned sentence, which is grammatically and semantically correct but changes the sentence's category. Two general approaches to augmentation have been used in this study, which are described below.

[41]. In the mentioned methods, there is no guarantee of correct grammar in the output sentence. It is also possible that the output sentence's category changes under augmentation. For example, one of the markers of Alzheimer's disease is a reduction in the vocabulary used in conversation, so replacing a simple word like "delicious" with a sophisticated synonym like "scrumptious" can change the sentence's category from patient to healthy and mislead the classifier. Another method, which considers grammatical correctness along with the sentence context, was introduced by Kobayashi [42] and is called contextual augmentation. In the contextual augmentation method, a language model takes both the word's context (i.e. the sentence that contains the word) and the whole sentence's category as input. Later work extended the approach by using BERT as an underlying model.
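A toy word-level synonym substitution can be sketched as follows. The synonym table here is purely illustrative (the study used WordNet-based substitution via NLPAug and contextual augmentation), and punctuation handling is deliberately simplified:

```python
import random

# Illustrative synonym table; real systems use WordNet or a language model.
# Class-preservation caveat: swapping a simple word for a rarer synonym could
# shift a transcript's apparent vocabulary level and thus its label.
SYNONYMS = {"beautiful": ["nice", "lovely"], "car": ["automobile"]}

def synonym_substitute(sentence, p=1.0, seed=0):
    """Replace each word found in SYNONYMS with probability p.
    Punctuation is stripped during lookup and not restored (sketch only)."""
    rng = random.Random(seed)
    out = []
    for w in sentence.split():
        key = w.lower().strip("!?.,")
        if key in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(w)
    return " ".join(out)

augmented = synonym_substitute("What a beautiful car!")
print(augmented)
```

This also makes the class-preservation risk concrete: nothing in the substitution step knows whether the replacement word changes the lexical simplicity that the classifier relies on.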

All the mentioned methods were evaluated in this study, and the implementation was done using the NLPAug library [44], except for contextual augmentation, for which the code released by the authors of [42] was used.

Sentence removal augmentation
Another ad-hoc approach, which changes neither the sentence category nor the grammatical correctness, is sentence removal. In this approach, one sentence is removed from the transcript, and the output is expected to still be a valid transcript in the same category. Although it can be argued that the label may change when the text is shortened, given the results of using this idea, it is appropriate in models that process the entire text at once (rather than sentence by sentence).

If the classifier layer outputs multiple labels (which may happen when working on sentences), the voter makes the final decision using a majority voting mechanism.
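Sentence removal can be sketched in a few lines. Splitting on periods is a simplification of real sentence segmentation:

```python
# Sentence-removal augmentation: drop one sentence from a transcript to get a
# new training sample that (presumably) keeps the same AD/HC label.
def sentence_removal_augment(transcript):
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    samples = []
    for i in range(len(sentences)):
        kept = sentences[:i] + sentences[i + 1:]   # remove sentence i
        samples.append(". ".join(kept) + ".")
    return samples

t = "The boy is on the stool. The stool is tipping. Mother dries the dishes."
samples = sentence_removal_augment(t)
print(len(samples))   # one new transcript per removed sentence
```

A transcript of n sentences thus yields n additional training samples, each one sentence shorter than the original.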

In this study, two different approaches for classifying a transcript are implemented.

In the first approach, the entire transcript is passed to an embedder, and the embedded transcript is directly classified. In this approach, the splitter and voter layers are disabled. In the second approach, the transcript is first split into sentences; these sentences are embedded and subsequently classified. Finally, the label of the entire transcript is decided by majority voting on the labels of all sentences in the transcript. The second approach is more compliant with pre-trained embedders, since they are mostly pre-trained on single- or two-sentence inputs.
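The voting step of the second approach can be sketched as follows; the label list stands in for the per-sentence outputs of the embedder and classifier:

```python
from collections import Counter

# Majority voting over per-sentence labels to decide the transcript label.
# Ties are broken by the first-seen label in this sketch.
def majority_vote(sentence_labels):
    counts = Counter(sentence_labels)
    return counts.most_common(1)[0][0]

label = majority_vote(["AD", "HC", "AD", "AD", "HC"])
print(label)  # -> AD
```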

The embedding models (which are used in this study as the embedder layer) only pass through Phases 1 and 3 of the flow described in Section Pre-trained deep language model.

The reason for this is that the dataset used is insufficient for unsupervised fine-tuning, even when using vast augmentation methods.

Detailed demographics of the data are specified in Table 1.

The first problem with using the entire corpus is that the corpus is highly unbalanced; as a result, a naïve classifier that always outputs the AD label can achieve a classification accuracy of 80% on such a dataset.

The second problem is that, except for the Cookie-Theft picture description test,

Table 3 reports precision, recall, accuracy, and F1 scores of the compared methods, as well as those of the proposed methods in the framework introduced in this paper.

The reported scores are averaged over a 10-fold cross-validation procedure. Furthermore, the CNN method is used with the synonym substitution augmentation (SSA) and contextual augmentation (CA) methods separately. The CA and SSA augmentations had almost no effect on the methods that used pre-trained language models, so they are not reported in Table 3.

The other advantage is that our models are highly pre-trained on large datasets, which enables them to start training on new, smaller datasets with good initialization parameters and also to avoid overfitting.
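The evaluation protocol can be sketched as follows; random vectors again stand in for real transcript embeddings, so the resulting numbers are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 10-fold cross-validation: each transcript appears in the test fold exactly
# once, and the reported score is the mean over the 10 folds. Stratification
# keeps the AD/HC ratio similar in every fold.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))            # placeholder embeddings
y = np.array([1] * 50 + [0] * 50)         # 1 = AD, 0 = HC

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())                      # mean accuracy over the 10 folds
```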

Among the methods evaluated in this study, on average, the models based on the BERT family of embedders worked better than the others. Although XLNet was designed to address problems with BERT, BERT and its derivatives still perform better in many tasks [32]. One important point to note is that pre-trained deep language models are unaffected by augmentation, because these models are highly pre-trained on a large dataset (hence the evaluation of their versions with augmentation is not reported in Table 3).

In this study, neural network interpretation methods were not used, but Table 4 lists two transcripts misclassified by the best-performing model.

As mentioned in Section Pre-trained deep language model, the proposed approach takes advantage of powerful pre-trained language models that attempt to learn the structure and features of the language from a large dataset, and only uses the target dataset to learn how to use these features for AD prediction. This not only reduces the need for expert-defined language features, but also makes it possible for more complex features to be extracted from the data. Another advantage of sentence embedding models is that they consider the entire raw text; there is no out-of-context word embedding layer that would convert each word to a representation vector without considering its context.

As mentioned earlier, even using augmentation methods, the largest currently available dataset for AD prediction is still insufficient in size for unsupervised fine-tuning.

Figure 1 Overall classification procedure. The overall classification procedure contains the steps of augmentation, splitting, embedding, classification, and voting, where augmentation is only used in the training phase. Also, when passing the entire transcript to the embedding layer, the splitting and voting layers are disabled. The underlined models are trainable here, and the others are fixed.

Table 4 Two incorrectly predicted transcripts by the model with the best accuracy score (S-BERT-Large-LR).

Transcript | Actual Label | Predicted Label | Predicted AD Probability
"And the boy in the cookie jar. And the girl reaching up to him. The stool slanting ready to topple. And the cookie jar is open. And the lid's in there. And the door's open. And mother's drying the dishes and standing in a pool of water it looks water running down from the sink. ..." | AD | HC | 0.483
"Okay. It was summertime and mother and the children were working in the kitchen. And the window was open and there was a slight breeze blowing in. Mother was daydreaming and forgot and left the water in the sink running and it was overflowing. The children were hungry and ..." | HC | AD | 0.532