EOG-Based Reading Detection in the Wild Using Spectrograms and Nested Classification Approach

Electrooculography (EOG) is a technique for measuring the corneo-retinal standing potential that exists between the cornea and the retina of the human eye. Applications of EOG include eye disease diagnosis and eye movement tracking. There has been extensive research on reading activity detection from EOG signals in controlled laboratory settings. However, determining reading behaviours from data collected in real-world environments remains a challenging problem. Detecting reading in practical scenarios would allow us to track our daily reading activity, thereby improving our learning experience and even workplace productivity. Tracking regular reading behaviour can also support further research in cognitive psychology, literacy development, reading motivation, and reading comprehension. In this study, we investigated an electrooculogram dataset collected in the field from 10 users who were engaged in their daily activities on two separate days. We propose a pipeline combining statistical features with deep learning features from pre-trained ImageNet models. To detect the fine-grained reading activities, we adopted a nested classification approach: we first differentiate between reading and not reading, and then employ an additional classification step to discriminate among three distinct types of reading activities. With our pipeline, we achieved 66.56% accuracy in detecting the reading activities, whereas the original dataset publication reported a baseline performance of only 32%.


I. INTRODUCTION
Human Activity Recognition (HAR) has become one of the major application areas of computer vision. Activities like walking, running, and sitting can now be easily detected from data collected by various sensors, such as accelerometers and gyroscopes, using different machine learning and deep learning techniques [1], [2]. As the number of ambient and body-worn sensors has grown, other modes of sensing have emerged, especially in the field of ubiquitous computing.
The associate editor coordinating the review of this manuscript and approving it for publication was Junhua Li.
With the advancement of these sensing technologies, different modalities like body movement, posture, sound, and interactions between people are being used to detect human activities [3], [4]. However, most of these sensors can only detect physical activities and cannot easily recognize tasks that are mostly visual, such as reading, surfing the web, or watching a movie. Eye movement can provide a lot of information for recognizing such activities, where body movements are barely involved [5]. Detecting these activities through eye movement could provide valuable insight into human behaviour and cognitive processes, leading to a better understanding of human perception, attention, and decision-making. Detecting activities during human-computer interaction, such as reading on a screen or interacting with virtual environments, could enable the development of more intuitive and user-friendly interfaces, thereby enhancing the overall usability and accessibility of technology. In safety-critical industries, such as transportation or healthcare, monitoring eye movements could help ensure individuals remain attentive and focused on critical tasks, reducing the risk of accidents and errors [6], [7].
Reading detection can greatly enhance one's learning experience and is essential for developing learning strategies [8]. During COVID-19, as remote classes and remote work became more common, people started to spend more time on their digital devices (e.g. laptops, smartphones) [9]. This resulted in excessive screen time, which could affect their mental health [10]. Therefore, monitoring and understanding our reading or office activities has become paramount to managing screen time and promoting a healthier digital lifestyle [11], [12]. By accurately detecting reading activities, individuals can take proactive steps to strike a balance between digital engagement and well-being, leading to more productive and healthier learning and working environments. Moreover, in education, detecting reading patterns and eye movements while studying can help students and educators assess comprehension, attention, and learning difficulties.
Reading activity detection can refer to different aspects of reading. The key step is first to detect whether a person is reading or not [13], [14]. Reading can then be further classified, for example to detect reading in different languages, or reading plain text versus problem-solving tasks such as answering multiple-choice questions [15], [16]. Different reading activities can also mean reading, re-reading, and skipping lines [17], [18]. In this study, we concentrate on classifying reading against not reading, and on distinguishing reading in different languages. We explore two languages and three reading styles: reading English, and reading Japanese both horizontally and vertically.
Reading activity recognition can be divided into two steps: (a) obtaining eye movement information and (b) classifying the reading activity. For the first step, camera-based eye tracking and EOG are the two applicable methods. Even though camera-based eye trackers offer more accuracy, they can be bulky and expensive for reading activity recognition [19]. They are also susceptible to varying lighting conditions and head movements [20]. On the other hand, EOG can be a simple and cheap alternative to the camera-based approach. With the appearance of smart eyeglasses, the collection of EOG signals has become much easier and more user-friendly, paving the way for tracking eye movement in the wild [21], [22]. For the second step, classifying the reading activity, the two conventional supervised approaches are feature-based machine learning and deep learning [18], [23]. Some researchers have also moved to semi-supervised learning approaches to address the lack of large labelled datasets [24].
In our study, we explored the dataset proposed by Ishimaru et al., namely the electrooculography dataset for reading detection in the wild [15], which, to the best of our knowledge, is the only publicly available EOG dataset for reading detection in the wild. After typical preprocessing steps such as bandpass filtering, we applied the Savitzky-Golay filter [25] to further smooth the signals. We employed variational mode decomposition (VMD) [26] to identify and remove the motion artefact component of the signal. Our proposed pipeline extracts features from the signal both statistically and with deep learning models, converting 1D EOG signals into 2D spectrograms. After that, we use different machine learning classifiers (XGBoost, Extra Trees, Random Forest, SVM, KNN) in a nested two-step approach to first discriminate the reading activities from not reading and then classify the reading activities into three different categories: i) reading English, ii) reading Japanese horizontal, and iii) reading Japanese vertical. We obtained promising classification accuracies, surpassing the baseline performance reported in the original dataset paper.
We have organized the rest of the paper in the following manner: in Section II we discuss the related works on EOG-based reading detection and EOG-based reading detection in the wild. Section III describes our methodology for this study in detail. The obtained results are presented and discussed in Section IV. Finally, Section V contains the conclusion and remarks for the future.

II. RELATED WORKS
Our work relates to EOG-based reading detection and, more specifically, EOG-based reading detection in the wild. In this section, we discuss the related works in these fields.

A. EOG-BASED READING DETECTION
Over the years, researchers have relied on eye movement information for reading detection. While some researchers obtained eye-movement information using camera-based eye trackers [17], [27], [28], others relied on the EOG signal [29], [30], [31]. Researchers have explored reading detection in different controlled environments: in the office, in transit, and also during social interactions [32], [33], [34], [35]. These studies mainly focused on extracting important eye-movement features from EOG signals and used different algorithms or machine learning (ML) classifiers to detect reading activity.
In one of the early works on eye-based activity recognition (EAR), Bulling et al. [32] used a feature-based machine learning method to detect reading activities, showing that eye movement information can help detect activities which are very difficult, sometimes impossible, to detect using other common sensing modalities. They calculated 90 different features by analyzing the repetitive patterns of three key types of eye movement: saccades, fixations, and blinks. They used a support vector machine (SVM) for classification and obtained a precision of 76.1% and a recall of 70.5%.
Huda et al. [36] performed a similar experiment where they detected reading activity based on EOG signal features and claimed a perfect score of 100% accuracy. It is apparent from their work that in laboratories, when users are instructed to stay still or body movements are minimized, reading activity can be detected from EOG signals very accurately.
Ou Yang et al. [37] focused on reading activity detection based on EOG signal encoding. They used wavelet transformation and applied thresholds to the transformed signal to encode the eye movements into blocks of strings, where each string is a sequence of characters. Then, they performed string matching between the recorded signal code and a template code using the Levenshtein distance, a string metric for measuring the difference between two sequences, to decide whether the person was in a reading state or resting. They obtained an accuracy of 90.7%, showing how wavelet transformation and thresholds can help extract specific eye movement information from EOG signals.
As the popularity of deep learning (DL) increased in recent times, researchers also focused on DL applications in EOG signal analysis. F. Latifoglu et al. [18] presented a DL-based system in which dyslexic children could be assessed using EOG signals, focusing on retrieving words/re-reading and skipping lines. They applied a sliding-window technique to segment the signals into different windows and detected the retrieving-words/re-reading and skipping-line windows using the zero-crossing points of the normalized signals. After that, applying the Short-Time Fourier Transform (STFT), each window was converted to a spectrogram image, which was used as input to a convolutional neural network (CNN) model to classify the activities. Their proposed method could detect word retrieval/re-reading with a 99% success rate. All studies mentioned in this section focused on reading detection in controlled environments where users were instructed to perform particular activities, limiting their applicability in the real world.

B. EOG-BASED READING DETECTION IN THE WILD
In this section, we extend the analysis of reading detection from EOG signals to real-life scenarios. Pointing out the limitations of the previous studies of reading detection, Ishimaru et al. [38] first showed that methods that perform well for reading detection using EOG in controlled environments do not perform well in real-life scenarios, using EOG data collected from commercial EOG glasses (JINS MEME). They used both a classical feature-based approach and a deep learning approach, and showed that deep learning models are more accurate for detecting reading activity in the wild whereas feature-based approaches are better in a controlled environment, achieving 80.7% and 69.6% accuracy respectively. Later on, they introduced a publicly available dataset on reading detection with four different activity classes (reading English, reading Japanese horizontally, reading Japanese vertically, and not reading) [15]. They reported a baseline performance, using a statistical feature-based approach with an SVM classifier, of only 32% accuracy in user-independent testing. In more recent work on EOG-based reading detection in the wild, Islam et al. [24] proposed semi-supervised learning (SSL) methods to tackle the problem, showing the weakness of supervised deep learning models. Their approach achieved better accuracies compared to SVM and fully supervised DL, but the accuracy was only close to 60%.
In another work, they presented a more efficient network called the Simple Framework for Contrastive Learning of Visual Representations (SimCLR), claiming that it outperforms the previous SSL method with a maximum gain of 3.02%, achieving an accuracy of 60.38%. It also shows that SimCLR can match the previous SSL method's results with a smaller amount of data (80%) [39].
Overall, all these studies indicate that differentiating reading activity from other tasks is a challenging problem as all our daily activities involve some amount of eye movement.Consequently, the current body of research in this particular domain is not yet sufficiently developed to be effectively used in practical contexts.

III. METHODOLOGY
In this study, we used a combination of feature-based machine learning and deep learning to classify reading activities. At first, we performed signal processing, where we filtered the EOG and accelerometer signals to remove noise, baseline drift and motion artefacts. After that, the signals were divided into 30s non-overlapping windows. The window length was chosen to be 30s because, in the original dataset, the activities were annotated by observing the images taken from a body-worn camera at each 30s interval. Each of the windows was used to calculate 68 different statistical features and was converted to spectrogram images for the deep learning method.
In the next step, we use state-of-the-art ''ImageNet''-pretrained deep learning models to extract features from the spectrograms. After that, we combine the statistical and deep learning features for the classification process. The whole process is shown in Figure 1. At first, we classified the signals to identify reading vs. not reading, which is denoted as binary classification in the figure. Machine learning models trained in this step learn to classify only the reading activity and not the different classes of reading (reading English, reading Japanese horizontal and reading Japanese vertical). After that, we trained another model to differentiate the different categories of reading activity. The signals which are classified as reading activity are passed to the second model, which then identifies which type of reading activity it is. This scheme is represented as nested classification in Figure 1. We also performed different oversampling steps using the Synthetic Minority Over-sampling TEchnique (SMOTE), Support Vector Machine based SMOTE (SVM-SMOTE), etc. to balance the features from different classes.

A. DATASET DESCRIPTION
The dataset employed in this study was originally published by Ishimaru et al. [15] and is publicly available on GitHub. The dataset contains EOG, accelerometer and gyroscope data collected from JINS MEME eyeglasses. For this study, we only considered the EOG and accelerometer signals. The eyeglasses were connected to an Android Nexus 5X via Bluetooth, and a mobile app was used to start and stop the recording. The sampling frequency was set at 100 Hz, and a body-worn camera was used to annotate the signals at each 30-second interval. The dataset contains almost 220 h of data in total, collected from 10 different people on two different days, as shown in Figure 2. There were four different activity classes: reading English (EN), reading Japanese horizontal (JH), reading Japanese vertical (JV) and not reading (NR).

B. PRE-PROCESSING
EOG signals have a bandwidth of 0 to 50 Hz due to the limited speed of eye movements. Studies suggest that almost 90% of the EOG signal energy is concentrated in the frequency range of 0-12 Hz [40]. Following the study of Barea et al. [41], we used a low-pass filter (LPF) with a cut-off frequency of 35 Hz to remove the high-frequency interference on the EOG. A high-pass filter (HPF) with a 0.1 Hz cut-off frequency was also used to nullify the slow noise. Both of these filters were of the Butterworth type.
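A minimal sketch of this band-limiting step with SciPy, applied to a synthetic signal; the text does not state the filter order, so order 4 with zero-phase filtering is an assumption here:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100  # sampling frequency of the recordings (Hz)

def bandlimit_eog(signal, fs=FS, low_cut=0.1, high_cut=35.0, order=4):
    """Apply the 0.1 Hz high-pass and 35 Hz low-pass Butterworth filters
    described in the text (order 4 and filtfilt are assumptions)."""
    # Low-pass at 35 Hz removes high-frequency interference.
    b_lp, a_lp = butter(order, high_cut / (fs / 2), btype="low")
    filtered = filtfilt(b_lp, a_lp, signal)
    # High-pass at 0.1 Hz suppresses slow drift.
    b_hp, a_hp = butter(order, low_cut / (fs / 2), btype="high")
    return filtfilt(b_hp, a_hp, filtered)

# Example: a 1 Hz eye-movement-like component survives the pass band,
# while a 45 Hz interference tone is strongly attenuated.
t = np.arange(0, 10, 1 / FS)
eog = np.sin(2 * np.pi * 1 * t) + 0.5 * np.sin(2 * np.pi * 45 * t)
clean = bandlimit_eog(eog)
```

Using `filtfilt` (forward-backward filtering) avoids introducing phase distortion into the saccade and fixation shapes that later feature extraction depends on.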
For further noise reduction and to achieve a smoother signal, we evaluated two different filtering techniques: the median filter and the Savitzky-Golay (SG) filter [25]. The primary objective of this step was to refine the data and improve the quality of our results by minimizing unwanted and random variations in the signals. The SG filter was chosen for our noise reduction and signal smoothing process based on its superior performance and its ability to maintain the essential characteristics of the data. By applying the SG filter, we visually observed a significant improvement in signal quality, leading to more robust and reliable outcomes in our analysis.
Based on observational results, we used a window length of 0.09s (nine data points at 100 Hz) and polynomial order of two for the SG filter.Figure 3 shows an example signal before filtration, after applying the LPF and HPF filters and after the SG filter along with LPF and HPF filters.
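The SG smoothing step is a one-liner with SciPy; a 0.09 s window at 100 Hz corresponds to 9 samples. The synthetic step signal below stands in for a saccade in real EOG data:

```python
import numpy as np
from scipy.signal import savgol_filter

FS = 100
t = np.arange(0, 2, 1 / FS)

# Synthetic EOG-like step (a saccade at t = 1 s) with additive noise.
rng = np.random.default_rng(0)
noisy = np.where(t > 1, 100.0, 0.0) + rng.normal(0, 5, t.size)

# Window of 9 samples (0.09 s at 100 Hz) and polynomial order 2,
# matching the parameters chosen in the text.
smoothed = savgol_filter(noisy, window_length=9, polyorder=2)
```

Because the SG filter fits a low-order polynomial inside each short window, it suppresses noise while keeping the sharp step edge largely intact, which is the property motivating its selection here.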
To remove the long-term baseline shift, the signals were fitted to polynomials of degree 20 and the fitted baselines were subtracted; this was applied to EOGl and EOGr, the EOG signals collected from the left and right electrodes, respectively. To remove the motion artefact from the EOG signal, we used variational mode decomposition (VMD) as suggested by Shuzan et al. [43]. First, to identify the motion artefact component of the signal, we decomposed the signal into 5 intrinsic mode function (IMF) signals. Then, observing the decomposed signals, the IMF signal containing the motion artefact component, typically the low-frequency one, was identified, as shown in Figure 4. The signal was then reconstructed without the corrupted IMF mode to obtain an artefact-free signal. After all the pre-processing steps, the signals were segmented into 30s non-overlapping windows for the feature extraction steps.
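The polynomial baseline-removal step can be sketched with NumPy as below; the synthetic slow drift stands in for real electrode drift, and the VMD stage is omitted because it relies on a separate package (for example the `vmdpy` implementation):

```python
import numpy as np

def remove_baseline_drift(signal, degree=20):
    """Fit a degree-20 polynomial to the signal and subtract it,
    as described for long-term baseline shift removal."""
    x = np.linspace(-1, 1, signal.size)  # scaled axis for numerical stability
    coeffs = np.polynomial.polynomial.polyfit(x, signal, degree)
    baseline = np.polynomial.polynomial.polyval(x, coeffs)
    return signal - baseline

# Example: very slow drift plus a fast eye-movement-like component.
fs = 100
t = np.arange(0, 30, 1 / fs)
drift = 50 * np.sin(2 * np.pi * 0.01 * t)      # slow baseline wander
eog = drift + np.sin(2 * np.pi * 2 * t)        # 2 Hz "eye movement"
detrended = remove_baseline_drift(eog)
```

A degree-20 polynomial can track the slow wander over a 30 s stretch but cannot follow the much faster eye-movement oscillations, so the subtraction removes drift while leaving the signal of interest.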

C. STATISTICAL FEATURE EXTRACTION
After the signal had been processed and segmented into separate 30s windows, we calculated different statistical features from the signals. For each time window, we had five different signals: EOGh, EOGv and the accelerometer signals for the three axes. We added an additional signal by calculating the magnitude of the accelerometer signal using the formula in Equation 3: |A| = sqrt(Ax^2 + Ay^2 + Az^2), where |A| is the magnitude and Ax, Ay, and Az represent the accelerometer values for the corresponding axes. After that, we took the derivative of these six signals with respect to time, resulting in a total of twelve signals. From these 12 time-domain signals, we extracted typical statistical features used for activity recognition such as the mean, median, variance, total zero-crossing points, and entropy [44], [45]. For each of the EOG signal windows, we also calculated the area under the curve (AUC) and the total positive threshold crossing points at −30 µV and +30 µV. Additionally, we calculated the correlation between the EOG signal and the accelerometer magnitude signal, as head movements can create motion artefacts in the signal. Furthermore, we applied a Fast Fourier Transform to each window to derive frequency-domain features.
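A sketch of the per-window feature computation on synthetic data; only a representative subset of the 68 features is shown, and the histogram-based entropy estimate and 20-bin choice are assumptions:

```python
import numpy as np
from scipy.stats import entropy

def accel_magnitude(ax, ay, az):
    """Magnitude of the accelerometer vector (Equation 3 in the text)."""
    return np.sqrt(ax**2 + ay**2 + az**2)

def window_features(x, fs=100):
    """A subset of the per-window statistics named in the text;
    the full 68-feature list is not reproduced here."""
    zero_crossings = int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:])))
    hist, _ = np.histogram(x, bins=20, density=True)
    return {
        "mean": np.mean(x),
        "median": np.median(x),
        "variance": np.var(x),
        "zero_crossings": zero_crossings,
        "entropy": entropy(hist + 1e-12),      # histogram-based estimate
        "auc": np.sum(np.abs(x)) / fs,         # area under |EOG|, rectangle rule
    }

rng = np.random.default_rng(1)
window = rng.normal(0, 30, 3000)                    # one 30 s window at 100 Hz
feats = window_features(window)                     # time-domain channel
deriv_feats = window_features(np.gradient(window))  # its derivative channel
```

Computing the same statistics on the derivative channels, as the text describes, captures eye-movement velocity information alongside the raw amplitudes.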

D. DEEP LEARNING FEATURE EXTRACTION
In this step, we extract features utilizing deep learning models. We transformed all the 30-second windows of the preprocessed signals (as described in Section III-B) from the five sources, namely EOGh, EOGv, and the accelerometer signals for the three axes, into spectrogram images and used them as inputs to pre-trained deep learning models.

1) SPECTROGRAM GENERATION
Each of the 30s signal windows was converted to a spectrogram image using the Fast Fourier Transform (FFT). We used blocks of 16 data points to calculate the FFT, with an overlap of 8 data points, sliding through the 30s window to generate the image. The maximum and minimum power range of each spectrogram was set to 0 dB and −350 dB to keep the colour map uniform across all images. Each signal window was converted to a 224px × 224px image using the matplotlib.pyplot.specgram function. Figure 5 shows sample spectrograms for each of the five signals (EOGh, EOGv, acc_x, acc_y, and acc_z) for a window of reading activity.
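The spectrogram rendering can be reproduced with the stated parameters as follows; the figure-size arithmetic (2.24 inches at 100 dpi for a 224-pixel image) and margin-free axes are our assumptions about how the fixed image size was obtained:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

FS = 100

def window_to_spectrogram(signal, out_path, fs=FS):
    """Render one 30 s window as a 224x224 spectrogram image using the
    parameters from the text: NFFT=16, 8-point overlap, and a fixed
    -350 dB..0 dB colour range for a uniform colour map."""
    fig = plt.figure(figsize=(2.24, 2.24), dpi=100)  # 2.24 in * 100 dpi = 224 px
    ax = fig.add_axes([0, 0, 1, 1])                  # fill the figure, no margins
    ax.specgram(signal, NFFT=16, noverlap=8, Fs=fs, vmin=-350, vmax=0)
    ax.axis("off")                                   # image only, no ticks/labels
    fig.savefig(out_path, dpi=100)
    plt.close(fig)

t = np.arange(0, 30, 1 / FS)
demo = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 10 * t)
window_to_spectrogram(demo, "eog_window.png")
```

Pinning `vmin`/`vmax` is what keeps the colour map comparable across windows, so the downstream CNN features reflect signal power rather than per-image normalization.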

2) FEATURE EXTRACTION
For the feature extraction process, we used different state-of-the-art (SOTA) models from the image domain (DenseNet, ResNet, VGG and Vision Transformers). All of these models were pre-trained on the 'ImageNet' dataset [46], which contains 3.2 million images in total. In the feature extraction process, we exclude the standard fully connected layers located at the final stage of these models. The purpose of this elimination is to obtain feature values rather than the probabilities linked to particular classes. By removing the last fully connected layers, which largely handle the transformation of learnt features into class probabilities, the emphasis is placed on capturing the fundamental traits or patterns present in the image data. This modification enables the deep learning model to provide a concise representation of the input data, describing its intrinsic properties. These models were used to extract features from the spectrograms of all five channels (2 EOG and 3 accelerometer channels) separately, and the features were combined afterwards.

3) DEEP LEARNING MODELS
For the feature extraction process, we employed four different state-of-the-art (SOTA) models from the image domain: DenseNet-121 [47], ResNet-18 [48], VGG-19 [49] and the Vision Transformer (ViT) base-sized model [50]. For our study, we used the PyTorch implementations of the models provided by fastai. DenseNet is a convolutional neural network architecture characterized by dense connectivity, whereby each layer is directly linked to all subsequent layers that reside deeper in the network structure. The architecture of DenseNet-121 starts with an initial convolutional layer with 64 filters, each of size 7 × 7 with a stride of 2. This is followed by a 3 × 3 max pooling layer. The structure afterwards has four dense blocks, in which each block contains a series of 1 × 1 convolutions and 3 × 3 convolutions, repeated in varying quantities (6, 12, 24, and 16). To conduct the down-sampling operation of standard convolutional networks, transition layers are positioned between the dense blocks. These layers consist of a 1 × 1 convolutional layer and a 2 × 2 average pooling layer with a stride of 2 [47].
ResNet-18 is a member of the ResNet model family, a collection of deep convolutional neural network (CNN) architectures designed to mitigate the vanishing gradient problem and enhance convergence. ResNet-18 has five convolutional stages. Except for the first one, which is similar to that of DenseNet-121, each stage contains two series of 3 × 3 convolutions, repeated twice, with skip connections after every pair of 3 × 3 convolutions. ResNet-18 exhibits a favourable balance between depth, which directly impacts computing time, and performance [48].
The VGG-19 architecture is also a convolutional neural network (CNN) developed for the purpose of classifying images. The key characteristic of VGG networks lies in their straightforward and consistent architectural design. The VGG-19 architecture has a total of 16 convolutional layers arranged in a sequential manner, followed by three fully connected layers. The second, fourth, eighth, twelfth, and sixteenth convolutional layers are each succeeded by a max-pooling layer, which serves to decrease the spatial dimensions of the feature maps [49].
The Vision Transformer (ViT) is a deep learning architecture that applies the transformer model [51], first developed for natural language processing, to image data. We used vit_base_patch16, which denotes the base version of the Vision Transformer framework, whereby input images are partitioned into patches of size 16 × 16. The patches are embedded linearly, and absolute position embeddings are added to the sequence prior to giving it as input to the Transformer's encoder layers. For classification tasks, the model incorporates a [CLS] token at the beginning of the sequence [50].

E. CLASSIFICATION
We divided the classification into two different steps: binary classification (reading vs. not reading) and four-class classification (reading English vs. reading Japanese horizontal vs. reading Japanese vertical vs. not reading).

1) BINARY CLASSIFICATION
Initially, we classify reading against not reading, which is a two-class (binary) problem. As the dataset contains 4-class annotations with the labels Reading English (RE), Reading Japanese Horizontal (RJH), Reading Japanese Vertical (RJV) and Not Reading (NR), we convert them to 2 classes by merging the three reading classes (RE+RJH+RJV) and classify against not reading (NR). The features from Section III-C and Section III-D were combined, and principal component analysis (PCA) [52] was performed on the feature set, keeping a sufficient number of principal components to explain 99% of the variance in the data. As PCA can sometimes result in information loss and hamper classification performance, the experiments were done both with and without PCA. However, the effect of PCA on model accuracy was insignificant, and it was kept as a part of the pipeline for faster classification.
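Keeping 99% of the variance is expressed directly in scikit-learn by passing a float to `n_components`; the random matrix below is a stand-in for the combined statistical and deep-learning feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the combined statistical + deep-learning feature matrix
# (200 windows, 50 correlated features).
features = rng.normal(size=(200, 50)) @ rng.normal(size=(50, 50))

# A float n_components keeps the smallest number of principal components
# whose cumulative explained variance reaches 99%.
pca = PCA(n_components=0.99)
reduced = pca.fit_transform(features)
```

In a leave-one-person-out setting, the PCA would be fitted on the training folds only and then applied to the held-out person's features to avoid leakage.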
From Figure 2, we can see that there is a huge data imbalance among the four classes in the dataset. For this step, even though we combined the three reading classes, there was still an imbalance in the number of samples per class, and we used different SMOTE [53] techniques to balance them. Among the various SMOTE techniques, SVM-SMOTE [54] yielded better results and was selected for further experimentation. Finally, for classification, five different ML models (XGBoost, Extra Trees, Random Forest, SVM, KNN) were explored, which are discussed in Section III-F.

2) NESTED 4-CLASS CLASSIFICATION
To classify the four classes presented in the dataset, Reading English (RE), Reading Japanese Horizontal (RJH), Reading Japanese Vertical (RJV) and Not Reading (NR), we employ a nested classification scheme. Here, two different models (model-1 and model-2) are trained as shown in Figure 6. The figure illustrates that model-1 is exposed to the entire training set, enabling it to discern between the reading (REJ) and not reading (NR) instances. Conversely, model-2 is trained exclusively on the reading cases, allowing it to distinguish between the three reading activities: Reading English (RE), Reading Japanese Horizontal (RJH), and Reading Japanese Vertical (RJV). In the testing phase, the features are fed into model-1, which predicts REJ vs. NR, and the predicted REJ cases are then given to model-2 to further classify the RE, RJH and RJV cases. All the evaluation metrics are calculated after combining the results of these two models. Similar to the binary classification, PCA is performed after combining the statistical and deep-learning features, and SVM-SMOTE is applied to balance the number of samples per class for each of the models.
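The nested train/predict flow can be sketched as follows; random features and labels stand in for the real windows, Extra Trees is used as an arbitrary stand-in classifier, and the integer label encoding (0 = NR, 1 = RE, 2 = RJH, 3 = RJV) is our assumption:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 16))
y_train = rng.integers(0, 4, 400)   # 0 = NR, 1 = RE, 2 = RJH, 3 = RJV
X_test = rng.normal(size=(100, 16))

# Model-1: reading (any of RE/RJH/RJV, i.e. REJ) vs. not reading,
# trained on the entire training set.
model_1 = ExtraTreesClassifier(random_state=0)
model_1.fit(X_train, (y_train > 0).astype(int))

# Model-2: trained only on the reading windows, 3-way reading type.
reading_mask = y_train > 0
model_2 = ExtraTreesClassifier(random_state=0)
model_2.fit(X_train[reading_mask], y_train[reading_mask])

# Nested prediction: model-1 first; model-2 only on predicted-reading windows.
pred = model_1.predict(X_test)           # 0 = NR, 1 = REJ
is_reading = pred == 1
final = np.zeros(len(X_test), dtype=int)
final[is_reading] = model_2.predict(X_test[is_reading])
# `final` now holds the combined 4-class prediction for every test window.
```

Evaluation is then run on `final` against the true 4-class labels, matching the text's point that the metrics are computed after combining the two models' outputs.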
F. MACHINE LEARNING MODELS
XGBoost is a tree boosting method that exhibits great scalability, making it a popular choice among data scientists for reaching cutting-edge outcomes in various machine learning endeavours. Finding the most optimal split locations is a major obstacle in tree learning. In the conventional approach, the task is accomplished using the exact greedy algorithm, which systematically explores all potential splits across all features. The computing cost increases significantly when dealing with continuous features, making it impractical to enumerate all feasible splits. To address this issue effectively, the XGBoost algorithm sorts the data by feature values and then analyses the data in sorted order, enabling the aggregation of gradient statistics for the structure score [55], [64].
The Random Forest (RF) classifier is a machine learning technique that focuses on classification and employs an ensemble method combining several decision trees. The phrase ''random forest'' refers to the forest of decision trees that the algorithm generates using arbitrary branching patterns. Building decision trees involves choosing significant division points in the data based on criteria such as Gini impurity or information gain. However, rather than taking all features into account, in Random Forest the selection of split points is restricted to a random subset of features at each node [56], [65], [66].
The Support Vector Machine (SVM), also known as a support vector network, is a machine learning model that can be used to solve classification and regression problems. It can handle both linear and non-linear problems and is helpful for a wide range of applications. SVM is based on a basic concept: the algorithm generates a line or hyperplane that divides the data into categories. However, there can be many hyperplanes separating the classes. SVM tries to find the best possible hyperplane with the help of support vectors, which are the data points from each class that are closest to the separating boundary. The margin, the distance between the hyperplane and the support vectors, is calculated, and the optimal hyperplane is identified as the one with the maximum margin [57].
Similar to the Random Forest (RF) classifier, Extra Trees (ET) is also an ensemble-based decision tree classifier, but it differs slightly in how the decision trees are trained. While RF creates multiple bootstrapped versions of the dataset to make sure the decision trees are different, ET trains the decision trees on the original data. Additionally, RF uses a greedy search to select where to split a feature, whereas ET selects the split values randomly [58], [67].
The K-Nearest Neighbours (KNN) method is considered a fundamental classification technique in machine learning. The key idea is that data points exhibiting similarities are likely to possess matching labels or values. The technique is very applicable in practical situations due to its non-parametric nature, which means it does not rely on any specific assumptions about the data distribution. In the training phase, the KNN method preserves the whole training dataset as a reference. When formulating predictions, the algorithm computes the distance between the input data point and all the training instances, using a specified distance metric such as the Euclidean distance. Subsequently, the method identifies the K closest neighbours to the input data point by considering their respective distances. For classification, the method selects the class label that occurs most often among the K nearest neighbours and designates it as the predicted label for the input data point [59], [68].

G. EXPERIMENTAL SETUP
At first, we conducted an auxiliary experiment to select the best deep learning model among DenseNet-121, ResNet-18, VGG-19 and vit_base_patch16 for this study. For this, we only used the deep learning features from the corresponding models (without statistical features) and trained the machine learning models explained in Section III-F. We conducted a 5-fold cross-validation experiment where we randomly divided the deep learning feature set into training and testing sets with an 80:20 ratio. We performed binary classification (reading vs. not reading) and, observing the results of this step, chose the best deep-learning feature set.
For the core experiment, we used Leave One Out Cross Validation (LOOCV), where each person's data forms one fold, as shown in Figure 7. As the figure shows, there were 10 experiments; in each, one fold (one person's data) was held out for testing and the rest were used for training. Thus, for each experiment, models were trained on nine persons' data and tested on the remaining person's data. This ensured that no data leaked between the training and test sets, so the results reflect model performance on truly unseen data. For the classification models (model-1 and model-2), five different machine learning models were explored: XGBoost Classifier (XGB), Random Forest Classifier (RF), Support Vector Machine Classifier (SVC), Extra Trees Classifier (ET) and K-Nearest Neighbours Classifier (KNN).
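This leave-one-person-out protocol corresponds to scikit-learn's LeaveOneGroupOut splitter with the person ID as the group; the sketch below uses random placeholder features (the 70-dimension count follows Table 1, while the window counts and labels are illustrative):

```python
# Sketch of leave-one-person-out CV: each person's windows form one fold.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 70))          # placeholder feature windows
y = rng.integers(0, 2, size=100)        # reading vs. not reading
persons = np.repeat(np.arange(10), 10)  # 10 persons, 10 windows each

accs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=persons):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])  # train on nine persons
    accs.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
print(f"{len(accs)} folds, mean accuracy = {np.mean(accs):.3f}")
```

Splitting by group rather than by row is what guarantees the held-out person never contributes windows to the training set.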
For XGBoost, the classifier was configured with 400 estimators, 500 iterations, a learning rate of 0.001, and log loss as the loss function. The Random Forest classifier used a maximum tree depth of 10, the 'gini' criterion for node splitting, a total of 100 decision trees, and default values (from scikit-learn 1.3 6) for the other parameters. Additionally, repeatability was ensured by setting a fixed random seed of 0. Default parameters were also used for the SVC classifier, except that probability estimates were enabled. The ET classifier had bootstrap set to true and default values for the other parameters. The KNN classifier was trained with the number of neighbours set to 3.
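A sketch of these scikit-learn configurations is given below. XGBoost is omitted because its stated settings ("400 estimators, 500 iterations") do not map one-to-one onto a single xgboost parameter; roughly, they would correspond to `XGBClassifier(n_estimators=..., learning_rate=0.001, eval_metric="logloss")` in the xgboost package. The remaining settings follow the text:

```python
# Sketch of the four scikit-learn classifier configurations described above.
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

models = {
    # max depth 10, 'gini' splitting, 100 trees, fixed seed for repeatability
    "RF": RandomForestClassifier(n_estimators=100, max_depth=10,
                                 criterion="gini", random_state=0),
    # defaults except probability estimates enabled
    "SVC": SVC(probability=True, random_state=0),
    # bootstrap enabled, defaults otherwise
    "ET": ExtraTreesClassifier(bootstrap=True, random_state=0),
    # 3 nearest neighbours
    "KNN": KNeighborsClassifier(n_neighbors=3),
}
for name, model in models.items():
    print(name, type(model).__name__)
```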

H. EVALUATION METRICS
We primarily used two evaluation metrics for our experiment: accuracy and the f1-score. Accuracy represents the overall correctness of the model and is calculated using the formula shown in Equation 4:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (4)

where TP represents true positives, TN true negatives, FP false positives, and FN false negatives. TP is the number of cases in which the model correctly predicts that a sample belongs to a class, while TN is the number of cases in which the model correctly predicts that a sample does not belong to a class.
Consequently, FP and FN are the cases where the model fails at those tasks, respectively. Precision is the ratio of true positives to the sum of true positives and false positives, whereas recall is the ratio of true positives to the sum of true positives and false negatives.
We also used a weighted f1-score, which is particularly useful when dealing with imbalanced datasets and is calculated using Equation 5:

F1 = 2 × (Precision × Recall) / (Precision + Recall) (5)

where the weighted score is obtained by averaging the per-class f1-scores weighted by each class's support. 6 https://scikit-learn.org/stable/
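The two metrics can be computed directly with scikit-learn; the toy labels below are illustrative, and `average="weighted"` performs exactly the support-weighted averaging of per-class f1-scores described above:

```python
# Worked sketch of the two evaluation metrics on toy predictions.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0, 2]

acc = accuracy_score(y_true, y_pred)               # correct / total = 8/10
f1w = f1_score(y_true, y_pred, average="weighted")  # support-weighted F1
print(f"accuracy = {acc:.2f}, weighted F1 = {f1w:.2f}")
# → accuracy = 0.80, weighted F1 = 0.80
```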

IV. RESULTS
This section contains the experimental outcomes of each step of the study and a concise analysis of the results.

A. SELECTION OF THE DEEP LEARNING MODEL
To select the best DL model from DenseNet121, ResNet18, Vgg19 and vit_base_patch16, we trained five different machine learning models using only the DL features in a random 5-fold cross-validation setting. The results from the best machine learning model for each feature set are presented in Table 2, with accuracy and f1-scores averaged across all five folds. The table shows that the vit_base features give the best result, with 75.62% accuracy and a 75.27% f1-score. The Vision Transformer's patch-based processing is advantageous for capturing global dependencies and multi-scale patterns, which may have helped it extract better features from the spectrograms.

B. BINARY CLASSIFICATION
We combined the features from vit_base_patch16 with the statistical features and split the feature set using the LOOCV method for binary classification (reading vs. not reading). The performance of the five machine learning models is presented in Table 3. The best result is obtained by the Random Forest classifier, with an accuracy of 67.02% and an f1-score of 66.18%. Persons 1 and 4 showed slightly better results, while the accuracies for all other persons were quite similar.
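The fusion step amounts to concatenating the two feature matrices per window; the sketch below uses random placeholders, assuming the 768-dimensional embedding of vit_base_patch16 and the 70 statistical features of Table 1:

```python
# Sketch of the feature fusion: deep features + statistical features.
import numpy as np

n_windows = 50
rng = np.random.default_rng(0)
deep_feats = rng.normal(size=(n_windows, 768))  # ViT-Base embedding per window
stat_feats = rng.normal(size=(n_windows, 70))   # Table 1 features per window

X = np.concatenate([deep_feats, stat_feats], axis=1)  # fused feature matrix
print(X.shape)  # → (50, 838)
```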

C. NESTED CLASSIFICATION
To detect the fine-grained reading activities, two models were trained: one to distinguish reading from not reading, and another to distinguish the three types of reading (RE, RJH, and RJV). The combined results of the nested classification using these two models are presented in Table 4. Random forest performs best in terms of accuracy, with an average score of 66.56%, while the XGBoost model achieves the best f1-score of 65.10%. For all models other than these two, performance decreases significantly compared with binary classification.
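The two-step prediction can be sketched as follows; the label coding (0 = NR, 1/2/3 = RE/RJH/RJV), the random data, and the choice of Random Forest for both stages are illustrative assumptions, not the study's exact setup:

```python
# Sketch of nested classification: model_1 gates reading vs. not-reading,
# model_2 labels the reading windows with one of three reading types.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def nested_predict(model_1, model_2, X):
    out = np.zeros(len(X), dtype=int)        # default: not reading (0)
    is_reading = model_1.predict(X) == 1     # step 1: reading vs. not
    if is_reading.any():
        # step 2: classify only the windows flagged as reading
        out[is_reading] = model_2.predict(X[is_reading])
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 70))
y = rng.integers(0, 4, size=200)             # 0=NR, 1=RE, 2=RJH, 3=RJV
model_1 = RandomForestClassifier(random_state=0).fit(X, (y > 0).astype(int))
model_2 = RandomForestClassifier(random_state=0).fit(X[y > 0], y[y > 0])
print(nested_predict(model_1, model_2, X).shape)  # one label per window
```

Because model_2 only ever sees windows that passed the first gate, the three reading classes never compete directly against the dominant not-reading class.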
Our proposed method demonstrates superior performance in reading detection, achieving an accuracy of 66.56%. A comparative analysis of our results against other studies on this dataset is shown in Table 5; all of these works classify four reading activities (RE, RJH, RJV and NR) in a LOOCV or user-independent configuration. Note that our work used only EOG and accelerometer data, whereas the other works also included gyroscope signals. In the dataset paper [15], the authors reported a baseline performance of only 32% accuracy using a classical machine learning approach with 10 statistical features and an SVM classifier. Islam et al. [24] utilized Self-Supervised Learning (SSL) and achieved an accuracy of 57.37% when using all available training samples; their findings showed that the SSL method outperformed SVM-based Machine Learning (ML) and supervised Deep Learning (DL) methods. More recently, Islam et al. [39] reported accuracies of 56.42% and 60.38% using supervised DL and SimCLR (Contrastive Learning of Visual Representations), respectively. Our proposed method outperformed all these existing methods, attaining the highest accuracy of 66.56% for reading detection on this dataset.

V. DISCUSSION AND CONCLUSION
We performed classification of a range of reading and non-reading activities from EOG and accelerometer signals using a hybrid approach combining deep learning and machine learning techniques. Our methodology leveraged state-of-the-art deep learning models with spectrogram images as inputs, motivated by the rapid advancements in image-based models, which have outpaced progress in models designed for 1-D signals. By fusing deep learning features with carefully crafted statistical features, we trained standard machine learning models to classify reading activity. Despite achieving state-of-the-art results on this dataset, substantial room for improvement remains.
The dataset posed significant challenges due to the scarcity of reading samples and a substantial imbalance between reading and non-reading classes. Considering all the data in the dataset, for a total of 17348 not-reading windows of 30 s, there were only 3207, 3141, and 2948 windows for reading English, Japanese horizontal, and Japanese vertical, respectively. The two-stage nested classification strategy played a pivotal role in mitigating this class imbalance and substantially improved the 4-class classification results, approaching the accuracy achieved in binary classification. In addition to the EOG and accelerometer data we used, incorporating gyroscope values could further refine the outcomes. While reducing the window size could increase the number of samples, its feasibility was limited by the 30-second annotation intervals, which introduce potential mislabelling if users were not reading for the full duration. Furthermore, the dataset lacked annotations for activities such as reading billboards or quickly perusing leaflets, which may have confused the model when distinguishing reading from non-reading instances. We also observed relatively better model performance for persons 1 and 4, possibly attributable to individual-specific variations in the data, including dissimilar reading patterns, electrode contact differences, focus on reading, and other environmental factors.
In some of the reading windows, the user may have done other things or been distracted for a period, which may have introduced noise and plausibly affected model performance. For practical implementations, a more precisely and thoroughly annotated dataset collected in an uncontrolled environment could boost model performance. The dataset contained data from only 10 users, so data from a larger number of diverse users in different environments would increase the model's generalizability and robustness. Also, because the data were collected on only 2 different days rather than over a longer period, bias may be introduced by changes in a user's habits. Even though the dataset authors state that they categorized as reading those activities which required frequent line breaks, a clearer and more consistent definition of reading, with proper documentation of which activities were included, is needed. It is also pertinent to acknowledge that the features extracted by the deep learning models lack interpretability, leaving us uncertain about what the model is learning. Our work sheds light on the challenges of detecting reading activities in day-to-day life using EOG and accelerometer signals. While we propose potential avenues for improvement, addressing the mentioned limitations will be essential for future research in this area.

FIGURE 1. Overview of the reading activity detection pipeline.

FIGURE 2. Available data for all the persons doing four different activities.
from the original signals. The same filtering was also applied to the accelerometer signals. The dataset contains left and right EOG signals collected from the two nose pads of the eyeglasses, where the left and right electrodes were connected. However, traditional EOG is represented by two different signals, horizontal and vertical EOG (EOGh, EOGv), denoting eye movements along the two axes. When the eye moves up and down, a voltage change occurs in the vertical EOG signal; in contrast, when the eye moves side to side, the voltage changes in the horizontal EOG signal. We calculated the horizontal and vertical EOG (EOGh, EOGv) signals in accordance with the method proposed by Kanoh et al. [42], given as Equations 1 and 2 below: EOGh = EOGl − EOGr (1)

FIGURE 6. Overview of the two-step classification process.

TABLE 1. Features calculated from different signals for each window. We computed the Fast Fourier Transform (FFT) of the EOG signals to obtain the frequency-domain signal, from which we calculated two features for each signal: skewness and kurtosis. Table 1 lists all the features, 70 in total.

TABLE 2. Result comparison of binary classification with different deep learning (ImageNet pre-trained) features for random 5-fold CV.

TABLE 3. Leave-one-person-out test results for binary classification.

TABLE 4. Results of 4-class nested classification using LOOCV.

TABLE 5. Comparison of reading detection results with existing studies on the dataset.