Detection of Alzheimer Disease on Online Handwriting Using 1D Convolutional Neural Network

Building upon the recent advances and successes in the application of deep learning to the medical field, we propose in this work a new approach to detect and classify early-stage Alzheimer patients using online handwriting (HW) loop patterns. To cope with the lack of training data prevalent in the tasks of classification of neuro-degenerative diseases from behavioral data, we investigate several data augmentation techniques. In this respect, compared to the traditional data augmentation techniques proposed for HW-based Parkinson detection, we investigate a variant of Generative Adversarial Networks (GANs), DoppelGANger, especially tailored for times series and hence suitable for synthesizing realistic online handwriting sequences. Based on a 1D-Convolutional Neural Network (1D-CNN) to perform Alzheimer classification, we show, on a real dataset related to HW and Alzheimer, that our DoppelGANger-based augmentation model allow the CNN to significantly outperform both the current state of the art and the other data augmentation techniques.


I. INTRODUCTION
Alzheimer's Disease (AD), a progressive neuro-degenerative disorder, is the cause of memory loss as well as a decline in cognitive functioning [1]. It is the most common type of dementia, affecting typically people of advanced ages. Currently there is no cure to reverse its symptoms, and only medications to delay the progression are available. As a matter of fact, the earlier the patient is diagnosed, the more likely the treatment will be effective.
Because people with AD are significantly impacted by episodic memory impairment, it comes as no surprise that a large number of studies have focused on language disorders involving spelling, grammatical, syntactic or semantic errors, etc. [2], [3]. It has been shown, however, that AD can be predicted by non-cognitive symptoms, in particular by The associate editor coordinating the review of this manuscript and approving it for publication was Prakasam Periasamy . motor impairment [4]. In this respect, several studies have assessed gait impairment, mild parkinsonian signs, fatigue and frailty [5], [6], [7], [8]. Other studies have investigated fine motor impairment, especially handwriting (HW) changes due to AD [9], [10], [11]. Such studies make sense given that AD induces cognitive and visuospatial impairment that makes the physical act of writing difficult, thereby triggering HW impairment [12].
As a result, there have been numerous studies and researches into applying machine learning methods to build a system that can reliably detect Alzheimer at an early stage, so that the patient can receive a timely and successful treatment. These researches work with a wide range of input data recorded from medical experiments and examinations, with handwriting and graphical gestures being ones of the most prominent sources. Traditionally, research on handwriting analysis applies either statistical tests or traditional classification models. Only recently have deep learning approaches been explored, albeit these were mostly for the detection of Parkinson and not Alzheimer. The goal of our work is to fill this gap and to explore this approach by building a deep learning model that can classify whether a subject is an Alzheimer patient, using online handwriting data. We also look to address one of the most prominent problems in deep learning medical applications, the lack of training data, by investigating the latest data augmentation techniques. Concretely, we introduce, in this work, the following key contributions: • Application of deep learning architecture for Alzheimer detection from online handwriting. We show how our models achieves a new state-of-the-art result on the proposed real dataset. This is the first deep learning based model applied for Alzheimer's disease classification from handwriting in the context of a very limited training dataset.
• Implementation of both traditional and recent data augmentation methods for time series. Especially, we propose a variant of Generative Adversarial Networks (GANs), DoppelGANger, specifically tailored for times series and hence suitable for generating synthetic online handwriting data. Our DoppelGANger-based data augmentation scheme outperforms classical data augmentation methods proposed for Parkinson's disease classification, and lead to new state of the art results for Alzheimer's classification, that are dramatically above previous state of the art.

II. RELATED WORK
Traditionally, online handwritings are analyzed using statistical tests such as ANOVA [9], [10], [11], [13], [14], [15], [16]. Some works also implement classical classification methods based on Linear Discriminant Analysis (LDA) [9], [10], [13]. More recently, [17] propose a semi-supervised learning approach to discover homogenuous cognitive profiles. This work addresses the problem of encoding spatiotemporal dynamics with the full online handwriting trajectory, by applying representation learning for the treatment of sequential data, and shows how such encoding outperforms global or semi-global parameters. [18] follow a similar approach, extracting loop-like patterns from handwriting and modeling velocity trajectory through unsupervised learning.
With the rapid development of deep learning recently, there have been a lot of research and experiments focused on applying deep learning architectures, as they provide the possibility to learn meaningful features and extract intricacies by themselves just from the raw data, thereby eliminating the need for extensive hand-crafting of features. For problems similar to ours, that is, to deal with medical data that is limited in available samples, and in the form of different channels of long time series, specifically online handwriting, there have been a lot of works on the popular dataset PaHaW [19] for Parkinson detection, notably [20], [21], [22]. Another notable work on Parkinson online handwriting is [23], in which the authors experiment with different data augmentation methods and achieve significant improvement compared to using only original data. For Alzheimer's, however, there is no equivalent publicly available dataset. For our work, we use the same dataset presented in the study [18] based on K-medoids clustering and Bayesian classifier. This work will serve at the benchmark for our results.

A. DATASET
The dataset, the same that was used in [18], was collected at the Broca Hospital in Paris. It consists of 54 subjects, from which 27 are patients who have early stage Alzheimer disease (AD) and 27 healthy control (HC). The Alzheimer patients are diagnosed based on DSM-5 criteria. The selected subjects were all able to understand French fluently and agreed to sign a consent form. The subjects perform multiple handwriting tasks that are recorded on a WACOM Intuos Pro Large tablet with an inking pen. The sampling rate that the tablet records pen position is 125 Hz. The data are recorded in the form of channels of time series, containing static and dynamic information of the pen movement during which the subjects perform the tasks. There are six channels in total: timestamps, x coordinates, y coordinates, pressure, azimuth and altitude.
To compare with the results in [20] and [21], we focus in this work on the task of drawing four series of repetitive llll letters, as shown on the left of Figure 1.

B. LOOP SEGMENTATION
Because our dataset consists of only 54 subjects, it is necessary to come up with a method that increases the number of samples, especially for the training of deep learning models. Inspired by [18], we notice the repetitive pattern of the l letter loops that can thus be extracted into individual training samples. As we usually have 16 l letters for each subject, we first split the sample into strokes and then keep only the 16 loops, by discarding the ligatures between them, as illustrated on the right of Figure 1. This gives us about 16 times more data to train with as we now treat the loops and not the subjects as individual training samples.

C. DATA PREPROCESSING
In order to achieve good performance on deep learning models, researches have shown that it is necessary to standardize the input data. For this project we decide to standardize each feature of each subject individually, so that they all have a mean of 0 and standard deviation of 1.
As we choose to split the training samples into loops and consider training based on individual loops, we also subtract the timestamps of each loop by its first timestamps, so that every loop's first timestamp is 0.
To design an adequate CNN architecture that is welldimensioned for effective model training, it is necessary that the loops have lengths that are not too scattered. We notice in Figure 2 that the maximum length of the loops, i.e. 281 time-steps, is actually due to three loop outliers, while the length of the remaining loops follows roughly a Gaussian  distribution with a mean around 60 and a maximum value of 176. We set, therefore, the maximum loop length to 176, trim the three outlier loops to this length and add zero-padding to all the loops that are shorter than the defined maximum length. This considerably reduces the amount of zero-padding that would be created if the length was kept to the original maximum (i.e. 281). This means the length of the feature vector taken as input by the CNN is significantly optimized, with minimal information loss.

D. VELOCITY FEATURES
Beside the available raw features directly provided by the recording tablet, another approach is to create new features that represents better the difference between how a healthy subject and an Alzheimer patient performs the task. The result from [18] shows that velocity features calculated from the coordinates and timestamps are good indicators and can help improve classification performance. Therefore, we create two new channels: x-velocity (Vx(n)) and y-velocity (Vy(n)), with Combined with the six channels already available in the dataset, we now have eight time series channels to represent every loop as shown in Figure 3.

E. DATA AUGMENTATION FOR TIME SERIES
In addition to the segmentation of loops, it is also possible to increase the amount of training data by generating new synthetic samples. There are two principal ways with which a new time series sample can be generated: slight random transformation of real data (jittering, scaling, warping), and data synthesis from the distribution of the real data (pattern mixing, generative models, pattern decomposition methods) [24].
In this work, we experiment with both traditional augmentation techniques as well as implement the latest innovations in synthetic time series generation using Generative Adversarial Networks.

1) TRADITIONAL AUGMENTATION METHODS
As our baseline, we consider different time series augmentation techniques as follows. For the cases of Jittering, Scaling and Time Warping, the parameters are defined as suggested in [25].

2) TIME SERIES DATA AUGMENTATION WITH DoppelGANger
Generative Adversarial Networks (GANs) [28] are a type of generative model that aim to train two neural networks -a generator G and a discriminator D -using an adversarial training workflow. The generator tries to create synthetic samples that the discriminator is unable to distinguish from real data. The goal of GANs is to create models that can draw samples from the same distribution of the training data. While data augmentation techniques using GANs have been actively researched for the application on images, they are not as often considered for time series. Some notable GANs architectures that have been developed for time series data augmentation are RCGAN [29], TimeGAN [30] and DoppelGANger [31]. After experimentation, we decide to implement DoppelGANger as our GAN model to generate online handwriting time series, as it addresses the weaknesses of the previous works [31].
DoppelGANger attempts to capture the relationship between attributes A i (metadata) and time series values T i by using two separated generators, one to generate metadata and one to generate time series conditioned on metadata: For the metadata generator, a multi-layer perceptron network is used. The time series generator is a recurrent neural network (RNN). At every time step the generated metadata A i is added to the RNN as input. In order to retain the long-term correlations within time series, the RNN is combined with a novel idea called batch generation. The goal of batch generation is to reduce the number of RNN passes by generating S records for each pass instead of one.
Furthermore, an auxiliary discriminator is also introduced to discriminate only atributes. The final loss function is a combination of the losses from the two discriminators, with a weighting parameter α: min G max D 1 ,D 2 L 1 (G, D 1 ) + αL 2 (G, D 2 ) with L 1 as the Wasserstein loss of the original discriminator and L 2 of the auxiliary.
As GANs are notoriously difficult to train, and according to the loop distribution in Figure 2 the number of loops with length greater than 100 is very small. As the more time steps we have the more parameters there are to optimize, we decide to train the model only on loops with maximum length 96. The batch generation parameter S is set as 8. The training labels of our dataset (AD or HC) are considered as attributes and are trained together with the time series. This allows us to later conditionally generate synthetic samples of each class by specifying the attributes. The GAN model is trained for 1000 epochs with learning rate 0.001.

F. MODEL SELECTION
As the data we work with are time series channels, it is necessary to select a model architecture that is suitable for the task. Using recurrent neural networks (RNNs) is a possible solution as it is designed to deal with sequential data. However it has been shown that for very long sequences RNN struggles to capture the temporal correlation [31]. Considering the promising results that has been obtained in other works using 1D Convolutional Neural Network (1DCNN) on medical time series, especially EEG signals [32], [33], we decide to select it as the architecture to our model.
To figure out the best 1DCNN architecture for our task, we perform random search on a range of different hyperparameters such as: number of convolutional layers, number of convolutional filters, convolutional filter size, pool filter size, etc. In order to reduce overfitting, dropout [34] is also applied.

A. TOOLS
The experiments are coded in Python, and the process of building and training the model performed with the help of the libraries Pytorch and sklearn. In order to have access to GPUs and improve the traning speed, we make use of the resources available on Google Colab.

B. EVALUATION METRICS
In order to maintain consistency and make comparisons with previous works, we will evaluate our results by three main metrics: accuracy, sensitivity and specificity, which are calculated from the percentage of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn): a Accuracy: the overall ability of the model to make correct classification ((tp + tn)/(tp + tn + fp + fn)). b Sensitivity: the ability to correctly classify Alzheimer patients (tp/(tp + fn)). VOLUME 11, 2023    c Specificity: the ability to correctly classify healthy subjects (tn/(tn + fp)). In all our experiments except for the final, the model is trained using 10-fold stratified cross validation. For each fold, the model will be trained for 40 epochs, with a learning rate of 0.001. For training, each of the loops is considered as an individual training sample, and for validation we average the output score of all the loops that belonged to one subject to get the final classification for that subject.

C. HYPERPARAMETER OPTIMIZATION
As indicated in Section III-F, we perform random search in order to find the best hyperparameter combination. To avoid experimental bias, we decide to use, as training/validation data for the process of hyperparameter optimization, the similarly recorded online handwriting dataset PaHaW [19] associated with Parkinson disease. The second task of PaHaW, as shown in Figure 4 is also to write repetitive l letters, which then can be splitted into loops using our segmentation process.
We show in Figure 5 the 1D CNN architecture that is obtained as the result. The model consists of two 1D convolutional layers, each followed by ReLU activation and Max Pooling. There are 128 1D filters in the first convolutional layer and 64 in the second, all of them have size 4 × 1. The Max Pooling filters for both layers have size 2 × 1. Dropout is applied after both poolings with a dropout rate set as 0.2. The output is then flattened and put into a fully connected layer (FC) and finally softmax for classification. This model achieves 78% accuracy, 85% sensitivity and 73% specificity when training with our dataset using all eight available channels as the input. As we work with limited data, such a simple architecture can be enough to learn well without overfitting, in contrast to deeper models.

D. FEATURE SELECTION
By adding the two extra velocity features, we get eight time series channels in total: x-coordinate, y-coordinate, timestamps, pressure, azimuth, altitude, x-velocity, y-velocity. However, not all of these channels contain equal information and some of them may even introduce noise. Therefore, it is necessary to train the model with different feature combinations in order to figure out the one that works best. We start out with the two basic combinations (x-coordinate, y-coordinate) and (x-velocity, y-velocity) then try to improve the result by combining them with the rest. From the results obtained in Table 1, we observe that the best performance is achieved by using only the two velocity features. Adding additional features to them does not further improve accuracy. Another notable observation is that the pressure feature is very noisy and reduces significantly the performance of the combinations that include it.

E. DATA AUGMENTATION
We apply data augmentation on input data that are time series of two channels: x-velocity and y-velocity, according to the results of the experiment in Section IV-D. Table 2 gives us the classification results comparing different data augmentation techniques. For each test fold of cross validation, the training data are augmented once while the validation data are kept intact. This means we perform training with twice the original amount of data. We observe that using DoppelGANger [31], we obtain an accuracy of 89%, the highest among all the augmentation schemes. With Jittering and SPAWNER, the accuracy also improves but only slightly (2%). Scaling does not affect the performance in terms of accuracy, while the data generated with warping methods are noisy, which is detriment to classification. We also attempt to combine different augmentation methods with DoppelGANger, but this does not further improve the results.

F. ANALYSIS OF GAN-GENERATED SYNTHETIC DATA
In order to evaluate the fidelity of the time series data generated by DoppelGANger compared to the original training data, we compare the length distribution of real and synthetic data ( Figure 6). Note that as we mentioned in Section III-E2, the maximum length of the time series that are used for GAN training is 96. As observed in Figure 6, apart from a few outliers that have too short lengths, the synthetic data that  we generate match closely the length distribution of the real training data.
We also perform a qualitative assessment, by comparing the generated series by our DoppelGANger model with the original ones (Figure 7). The comparison shows basically that even though the generated data are not as smooth, they do approximate the velocity curves of the original data at different lengths. This shows that DoppelGANger are suitable GAN models for generating good quality online handwriting time series, which ultimately significantly improves the

G. FINAL EVALUATION WITH LEAVE-ONE-OUT CROSS VALIDATION
In order to better evaluate the generalization ability of the model, as well as to make comparison with the previous state of the art on the dataset obtained in [18], we decide to perform leave-on-out cross validation (LOOCV) in addition to stratified 10-fold cross validation that is used in previous experiments. This means that from the dataset of 54 subjects, we have 54 folds, and for each fold we train on 53 subjects and test on the one left out. For each fold, we also randomly initialize the weights of the model. The model architecture as well as the feature combination (x-velocity, y-velocity) and augmentation method (DoppelGANger) are chosen from the previous best results. We illustrate this finalized training procedure in Figure 8.
We are also able to compare our final results in Table 3 with the results of [18], which are also evaluated using LOOCV. We observe that using the same data and features, our approach with deep learning using 1DCNN and data augmentation with DoppelGANger is able to improve the result significantly. This confirms the viability of applying deep learning solutions to this problem, even if the data available is very limited, thanks to our GAN-based augmentation scheme, specifically tailored for time series.

V. CONCLUSION
In this work, we have been able to develop an effective deep learning approach for the problem of classifying Alzheimer patients. Despite the lack of sufficient training data, the result obtained from training 1D Convolutional Neural Network model is the new state of the art on the dataset, at 87.04% accuracy, 85.19% sensitivity and 88.89% specificity. This is very promising and opens up further possibilities for research.
The newly achieved state of the art has been made possible by tackling the problem of limited data through synthetic data generation, based on the application of of Generative Adversarial Network adapted to time series, namely Doppel-GANger. The generated data have been shown to approximate the quality and distribution of the real data, which helped significantly improve the classification performance.
For future improvements, several research directions can be taken. [20] has shown that it is possible to apply transfer learning from unrelated data such as ImageNet [35] and perform classification based on the images created by the handwriting coordinates. Another possibility is to combine online handwriting with other biomarkers such as voice and facial expressions in order to make better detection.
QUANG DAO received the master's degree in data analysis and pattern classification from Telecom SudParis, University Paris Saclay. He is currently a Data Scientist and a Computer Vision Engineer. His expertise includes the application of machine learning on medical data, such as signals and images.
MOUNÎM A. EL-YACOUBI (Member, IEEE) received the Ph.D. degree from the University of Rennes, France, in 1996. He was with the Service de Recherche Technique de la Poste (SRTP), France, in 1992 and 1996. He was a Visiting Scientist with the Centre for Pattern Recognition and Machine Intelligence (CENPARMI), Montreal, Canada, and an Associate Professor (1998)(1999)(2000) with the Catholic University of Parana, Curitiba, Brazil. From 2001 to 2008, he was a Senior Software Engineer at Parascript, Boulder, Colorado. At SRTP and Parascript, he has developed handwriting recognition software for real-life automatic mail sorting, bank check reading, and form processing. Since June 2008, he has been a Professor at Telecom SudParis, Institut Polytechnique de Paris. His research interests include artificial intelligence, data science, machine learning, and modeling human user data, especially behavioral signals like handwriting, voice, gesture and activity recognition, biometrics, e-health, smart agriculture, and smart cities.
ANNE-SOPHIE RIGAUD received the M.D. and Ph.D. degrees in neurosciences. She is currently a Geriatrician and a Psychiatrist. She is also a Professor in geriatrics with University Paris 5. She heads the Geriatric Department, Broca Hospital, and the Research Unit EA 4468, including LUSAGE Laboratory. She has expertise in dementia and gerontechnology. VOLUME 11, 2023