Depression Level Classification Using Machine Learning Classifiers Based on Actigraphy Data

Estimating and classifying depression status are critical in the clinical and psychological domains to map the course of treatment. Prior researchers used biosignal time-series data to reflect the variation in factors associated with depression. In addition, machine learning algorithms were applied to determine the underlying relationships between depressive symptoms and these factors. In this study, we introduce a classification framework for depression levels using actigraphy data based on machine learning algorithms. Fourteen circadian rhythm features (minimum, amplitude, alpha, beta, acrotime, upmesor, downmesor, mesor, f_pseudo, interdaily stability (IS), intradaily variability (IV), relative amplitude (RA), M10, and L5) extracted from accelerometer-based actigraphy data were used to model depression status with survey variables. Six evaluation metrics (accuracy, precision, recall, F1-score, receiver operating characteristic curve, and area under the curve) were applied to validate the performance of the proposed framework. Among the four candidate classifiers (XGBoost classifier, support vector classifier, multilayer perceptron, and logistic regression), the XGBoost classifier was the best at classifying depression levels. Moreover, we confirmed that the actigraphy data of two days were optimal for feature extraction and classification. The results of this study provide novel insights into the relationship between depression and physical activity in terms of both identification of depression and application of actigraphy data.


I. INTRODUCTION
Depression is commonly recognized as the main factor in psychiatric and psychophysiological disorders [1]. In addition, depressive symptoms affect daily life, leading to feelings of helplessness, anxiety, sleep disturbances, and decreased concentration [2]. To identify factors associated with related disorders, many researchers have focused on both internal and external elements of the patient group. For example, chronic stress affects the onset of major depressive disorder [3], acute stroke is associated with the occurrence of major depression [4], and socio-environmental factors (e.g., family members' health and relationships) affect unipolar depression [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Wentao Fan.
Classifying the depression level is critical in both the clinical and psychological domains. In the clinical domain, continuously monitoring depressive symptoms is important for mental health management. In addition, it is beneficial for all patient groups, including the treatment group and the candidate group expected to receive treatment. In the psychological domain, it is important to consider individual factors when confirming a subject's depression level [6]-[9].
Many previous researchers utilized various modalities to identify factors associated with depression, such as electroencephalogram (EEG), brain magnetic resonance imaging (MRI), and self-report physical activity [10]-[13]. However, these methods have several limitations. In the case of EEG and brain MRI, expensive equipment is required, and data can only be collected in a limited laboratory environment. In terms of self-report physical activity, many researchers collected the intensity or duration of activity from participants through self-report questionnaires. However, self-report measures can be influenced by self-report bias or response bias [14], [15]. To overcome these limitations, we used accelerometer-based actigraphy data (i.e., actigraphy) in our study. These time-series data capture activity continuously and are relatively inexpensive to collect.
Considering previous studies, biosignal and life-log time-series data collected from participants (e.g., electrocardiogram (ECG), accelerometer-based actigraphy) have been widely used to screen for depression [16]-[18]. Heart rate variability (HRV) parameters extracted from ECG signals were used with machine learning (ML) algorithms to diagnose major depressive disorder [19]. In addition, multimodal data (steps, energy expenditure, body movement, sleep time, heart rate (HR), and skin temperature), including several life-log time series collected from wristband-type wearable devices, were analyzed using an ML model to evaluate depression [20].
In related studies using self-report physical activity, researchers focused on comparing the intensity level (e.g., weak, moderate, or vigorous) and activity duration (e.g., once a week or five hours a day) of participants in study groups [21]-[23]. Moreover, actigraphy time-series data collected using wearable devices were used to analyze activity variation [24]-[26]. To deduce the activity patterns from actigraphy, the characteristics of actigraphy time series (e.g., peak, slope, amplitude) were calculated from cosinor modeling as indices [27]-[30]. These indices represent the circadian rhythm cycles of the participants, which reveal major symptoms of depression [31]-[33].
Circadian rhythm patterns from actigraphy data reflect the stability or fluctuation of the participants' activity [34], [35]. These patterns can be summarized using parametric and nonparametric metrics [36]-[39]. In previous studies, circadian indices related to activity, sleep disturbances, and daily mood variations were compared between patient groups and control groups to identify differences in the circadian patterns of depression patients [40]. Disturbances of the circadian cycle distinguished depression patients from the healthy group [41], [42]. Moreover, ML algorithms have been used to characterize daily sleep-activity cycles using actigraphy data [43].
Based on previous studies identifying the association between physical activity and depression, we hypothesized that actigraphy data have sufficient potential to classify depression levels when an adequate analysis is conducted. To test our hypothesis, we constructed an experimental design using ML classification algorithms. In this study, we propose a classification framework for depression levels (e.g., 'mild,' 'moderate,' or 'severe' depression) through ML algorithms based on actigraphy data. We collected the actigraphy data (Fig. 1), as well as demographic, physical activity, subjective health status, and mental health variables of the same participants. Fourteen circadian rhythm indices based on parametric and non-parametric metrics were extracted to verify the validity of these data as features. Five non-parametric features were used: interdaily stability (IS), intradaily variability (IV), relative amplitude (RA), total activity of the ten most active hours (M10), and total activity of the five least active hours (L5). The nine parametric features used were minimum, alpha, beta, acrotime, amplitude, mesor, upmesor, downmesor, and f_pseudo. The distribution of all features, including the survey variables and extracted circadian features, was verified; subsequently, they were log-transformed and standardized. To prevent multicollinearity between features, we selected only fifteen features based on the lasso and ridge regression models. The selected features were applied to four classification algorithms: XGBoost classifier, support vector classifier (SVC), logistic regression (LR), and multilayer perceptron (MLP). Finally, the performance of each model was evaluated using six evaluation metrics: accuracy, precision, recall, F1-score, receiver operating characteristic (ROC) curve, and area under the curve (AUC).
The objective of this study was to develop a depression level classification framework based on actigraphy data with ML algorithms. The major contributions of this study are as follows: (1) We proposed an ML-based classification framework for depression levels based on the circadian rhythm characteristics embedded in physical activity. In addition, we evaluated the performance of the model under various conditions, including binary and multiclass classification, on a large dataset. Moreover, we compared the classification performance of commonly used ML classification algorithms: XGBoost classifier, SVC, MLP classifier, and LR; (2) Advancing beyond the analysis of simple characteristics of physical activity data, we extracted various features describing the inherent circadian rhythm of participants' activity. Furthermore, we identified the optimal length of actigraphy data for the extraction of circadian rhythm features. In addition, we confirmed that the feature importance of the trained XGBoost model in our framework was in agreement with that of previous studies.
The remainder of the paper is organized as follows: Section II includes a detailed description of the dataset and methodologies used in the study. In Section III, the classification performance of the four ML models in the experiments is reported. In Section IV, we discuss the results and their implementation. Finally, the conclusions and summary of our study are presented in Section V.

A. OVERVIEW
To test our hypothesis, we used six steps in our experimental design. First, we extracted demographic, physical activity, mental health, and subjective health status variables and collected actigraphy data from the Korea National Health and Nutrition Examination Survey (KNHANES) dataset. Selected variables and actigraphy data were combined based on the participant ID. Second, we extracted fourteen circadian indices from the actigraphy data using parametric and non-parametric methods. Third, selected variables were log-transformed and standardized after their distribution was checked. Fourth, to select suitable features, all the variables, including selected variables and extracted circadian indices, were filtered using the coefficients of the lasso and ridge regression models. Fifth, we generated six types of datasets to evaluate the optimal length of the actigraphy data used in an ML algorithm. Finally, four classification algorithms were trained and evaluated using the evaluation metrics. The detailed steps are shown in Fig. 2.
VOLUME 9, 2021
Agency (KDCA). KNHANES is a longitudinal survey conducted by the KDCA to investigate the health status, health-related awareness and behavior, and nutritional status of people in Korea. The survey started in 1998 and was conducted every three years until 2005; subsequently, it has been conducted annually. The original dataset is available on the KDCA website and the current dataset for 18 years from 1998 to 2019 is publicly available. This dataset consists of nine categories of survey variables, which are listed in Table 2.
This dataset consists of two sub-datasets stored in separate CSV files. The first sub-dataset consists of health behavior, blood tests, and grip test results covering the first four categories. The second sub-dataset includes the last five categories. All the survey results in the sub-datasets can be combined using the participant ID. A total of 216,815 people participated in this survey from 1998 to 2019. In our study, we selected the datasets for two years only (2014 and 2016), which were the only ones containing actigraphy data [44], [45]. The baseline characteristics of both datasets are presented in Table 3.
The actigraphy data in the KNHANES dataset were collected using an ActiGraph GTX3 wearable device. Acceleration values from the participants' activity were recorded at one sample per minute (sampling rate: 1/60 Hz). All the participants were instructed to wear the device for one week. Although the acceleration values of the three axes
We analyzed whether the actigraphy data were in agreement with survey variables. Among the nine categories included in the survey data, demographics, physical activity, subjective health status, and mental health variables were selected to identify levels of depression. The dimensions of the original datasets for 2014 and 2016 were (7,550 × 746) and (8,150 × 800) before, and (7,550 × 73) and (8,150 × 71) after, the extraction of relevant variables, respectively. For the remaining dataset, we merged the actigraphy dataset and survey dataset based on the participant ID in the actigraphy dataset (977 participants in 2014 and 575 participants in 2016). Following merging, the dimensions of both datasets were unchanged.

2) EXTRACTION OF PARAMETRIC AND NON-PARAMETRIC CIRCADIAN INDICES
To classify the depression status, we extracted circadian rhythm indices from actigraphy time-series data. Both parametric and non-parametric metrics were applied to capture the various characteristics of the circadian cycles from activity patterns. To confirm the optimal actigraphy length for index extraction, the duration of the actigraphy data was varied (two, three, four, five, six, and seven days). For each length condition, the data were sliced into overlapping windows. For example, in the two-day length condition, a window with a length of two days was sequentially sliced from the actigraphy data (e.g., the first window consists of days 1 and 2, the second window consists of days 2 and 3, and so on; in total, six windows were applied). Consequently, we obtained six index vectors from the two-day actigraphy data. The detailed process for the two-day length condition is depicted in Fig. 3.
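The windowing step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `slice_day_windows`, the one-day stride, and the all-zero placeholder recording are assumptions, while the one-sample-per-minute rate (1,440 samples per day) and the seven-day wear period come from the text.

```python
def slice_day_windows(samples, window_days, samples_per_day=1440):
    """Return overlapping windows of `window_days` days, advancing one day at a time."""
    total_days = len(samples) // samples_per_day
    windows = []
    for start_day in range(total_days - window_days + 1):
        start = start_day * samples_per_day
        stop = start + window_days * samples_per_day
        windows.append(samples[start:stop])
    return windows


# Placeholder seven-day recording at one sample per minute (values are not real data).
recording = [0] * (7 * 1440)

# Number of windows obtained for each length condition (two to seven days).
window_counts = {days: len(slice_day_windows(recording, days)) for days in range(2, 8)}
two_day_windows = slice_day_windows(recording, 2)
```

Under these assumptions, a seven-day recording yields six windows in the two-day condition and a single window in the seven-day condition, which matches the decreasing row counts reported for the longer conditions.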
The indices were extracted using the same process for the other conditions. After extraction, the indices were arranged in a column-wise matrix. Each column was denoted as $x_i$, where $i = 1, \ldots, N$. The rows of the matrix indicate the extracted indices for the six conditions per participant. We termed this matrix the circadian index matrix, which is denoted as follows:
$$X = [x_1, x_2, \ldots, x_N] \quad (1)$$
Finally, the corresponding condition label vector $y$ = [two days, . . . , seven days] was merged with the circadian index matrix. A detailed description of each circadian rhythm index is provided in the following subsections.

[1] INTERDAILY STABILITY (IS)
In this study, the stability of activity over multiple days was calculated by normalizing the number of actigraphy samples at 24-h values. This indicator was calculated using (2) [46]:
$$\mathrm{IS} = \frac{N \sum_{h=1}^{p} (\bar{X}_h - \bar{X})^2}{p \sum_{i=1}^{N} (x_i - \bar{X})^2} \quad (2)$$
where $N$ is the total number of samples, $p$ is the number of samples per day, $\bar{X}$ is the mean value of all samples, $\bar{X}_h$ are the hourly means, and $x_i$ indicates the individual actigraphy samples. Changes in IS can represent changes in the coupling of the rest-activity cycle, and decreased IS values indicate higher day-to-day variation in activity patterns [47].
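As a hedged illustration of the IS definition, the sketch below assumes the standard non-parametric form of the index (variance of the average 24-h profile relative to the total variance) and assumes that the minute-level counts have already been aggregated to hourly values, so that p = 24. The function name and the toy recordings are hypothetical.

```python
def interdaily_stability(hourly, p=24):
    """IS = N * sum_h (mean_h - mean)^2 / (p * sum_i (x_i - mean)^2)."""
    n = len(hourly)
    if n == 0 or n % p != 0:
        raise ValueError("expected a whole number of days of hourly values")
    days = n // p
    grand_mean = sum(hourly) / n
    # Average value of each clock hour across all recorded days.
    hour_means = [sum(hourly[d * p + h] for d in range(days)) / days for h in range(p)]
    numerator = n * sum((m - grand_mean) ** 2 for m in hour_means)
    denominator = p * sum((x - grand_mean) ** 2 for x in hourly)
    return numerator / denominator


# Toy data: the same rest/activity day repeated twice (perfectly stable rhythm).
stable_day = [0.0] * 8 + [10.0] * 16
is_stable = interdaily_stability(stable_day * 2)

# Toy data: the second day is the mirror image of the first (no stable rhythm).
is_unstable = interdaily_stability([0.0] * 12 + [10.0] * 12 + [10.0] * 12 + [0.0] * 12)
```

In this form the index is 1 when every day repeats the same profile and 0 when the average profile is flat, consistent with lower values indicating higher day-to-day variation.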

[2] INTRADAILY VARIABILITY (IV)
The IV index is the ratio of the mean squared first derivative of the samples to the total variance of the actigraphy samples, as in (3):
$$\mathrm{IV} = \frac{N \sum_{i=2}^{N} (x_i - x_{i-1})^2}{(N - 1) \sum_{i=1}^{N} (x_i - \bar{X})^2} \quad (3)$$
The elements included in the equation have the same meaning as in (2). This index indicates fragmentation of the rest-activity rhythm [46].

[3] MOST ACTIVE 10-H PERIOD (M10)
The M10 was computed by averaging the ten highest hourly means. This index indicates the activity during the most active period of the day.
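The IV and M10 definitions above, together with the L5 and RA indices described in the next two subsections, can be sketched in a few lines. This is an illustrative sketch under stated assumptions: hourly aggregated input, the standard IV form, and M10 and L5 taken as the mean of the ten highest and five lowest hourly means of the average day, following the wording in the text. All function names and the toy recording are hypothetical.

```python
def intradaily_variability(hourly):
    """Mean squared hour-to-hour difference divided by the overall variance."""
    n = len(hourly)
    mean = sum(hourly) / n
    squared_diffs = sum((hourly[i] - hourly[i - 1]) ** 2 for i in range(1, n))
    total_var = sum((x - mean) ** 2 for x in hourly)
    return (n * squared_diffs) / ((n - 1) * total_var)


def hourly_profile(hourly, p=24):
    """Average value of each clock hour across all recorded days."""
    days = len(hourly) // p
    return [sum(hourly[d * p + h] for d in range(days)) / days for h in range(p)]


def mean_of_extreme_hours(profile, k, most_active):
    """Mean of the k highest (or k lowest) hourly means."""
    ordered = sorted(profile, reverse=most_active)
    return sum(ordered[:k]) / k


def relative_amplitude(m10, l5):
    return (m10 - l5) / (m10 + l5)


# Toy two-day recording: 8 h of rest followed by 16 h of activity, repeated.
recording = ([0.0] * 8 + [10.0] * 16) * 2
profile = hourly_profile(recording)
iv = intradaily_variability(recording)
m10 = mean_of_extreme_hours(profile, 10, most_active=True)
l5 = mean_of_extreme_hours(profile, 5, most_active=False)
ra = relative_amplitude(m10, l5)
```

Many actigraphy toolkits instead define M10 and L5 over consecutive hours; that variant would replace the sort with a sliding window over the profile.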

[4] LEAST ACTIVE 5-H PERIOD (L5)
The L5 represents movement during sleep and nighttime arousals. This value indicates the average value in the five least active hours in the entire actigraphy recording.

[5] RELATIVE AMPLITUDE (RA)
The RA of the activity cycle in actigraphy can be calculated from the M10 and L5 values, as in (4) [48], [49]:
$$\mathrm{RA} = \frac{\mathrm{M10} - \mathrm{L5}}{\mathrm{M10} + \mathrm{L5}} \quad (4)$$
To extract parametric indices from the actigraphy data, we used cosinor analysis. The least squares method was used to fit a cosine wave to the actigraphy data [50]. We calculated nine parametric indices: minimum, amplitude, alpha, beta, acrotime, upmesor, downmesor, mesor, and f_pseudo.
[1] MINIMUM This index is the minimum value of the fitted cosine function with actigraphy data.
[2] AMPLITUDE This index represents the highest activity value in the activity cycle.
[3] ALPHA The alpha index determines whether the peaks of the curve are wider than the troughs. High alpha values indicate wide troughs and narrow peaks. In contrast, low alpha values indicate narrow troughs and wide peaks.
[4] BETA This index determines whether the transformed function rises and falls more steeply than the cosine curve. Large values of the beta index indicate that the curves are nearly square waves.
[5] ACROTIME The acrotime indicates the time of the peak activity from the total activity time.
[6] UPMESOR The upmesor is the time of the day in which the switch from low to high activity occurs. In the rest-activity rhythm, this value indicates the timing of the variation. Lower values indicate increased activity earlier in the day.
[7] DOWNMESOR The downmesor is the time of the day in which the switch from high to low activity occurs. It indicates the timing of the change in the rest-activity cycle. Lower values represent a decline in activity.
[8] MESOR This index, calculated similarly to the MESOR of the cosine model, can be calculated using (5).
However, as it goes through the middle of the peak, it is not equal to the MESOR of the cosine model. Generally, this index represents the mean of the actigraphy data.
[9] F_PSEUDO This index measures the improvement of the fit obtained by nonlinear estimation of the transformed cosine model.
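The parametric indices above come from a transformed (extended) cosine model. As a simpler starting point, the following sketch fits a plain 24-h cosine by linear least squares, which is the basic cosinor step the transformed model builds on. It is not the extended model used in the study: alpha, beta, upmesor, downmesor, and f_pseudo are not estimated here, and the amplitude is the half peak-to-trough amplitude of the plain cosine. The function names and the synthetic series are assumptions.

```python
import math


def solve_linear_system(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_cosinor(times_h, values, period_h=24.0):
    """Fit y = M + b1*cos(wt) + b2*sin(wt) via the normal equations."""
    omega = 2.0 * math.pi / period_h
    rows = [[1.0, math.cos(omega * t), math.sin(omega * t)] for t in times_h]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    mesor, b_cos, b_sin = solve_linear_system(xtx, xty)
    amplitude = math.hypot(b_cos, b_sin)
    # Time of peak activity, in hours after the start of the day.
    acrotime = (math.atan2(b_sin, b_cos) / omega) % period_h
    return {"mesor": mesor, "amplitude": amplitude, "acrotime": acrotime,
            "minimum": mesor - amplitude}


# Synthetic two-day hourly series with mesor 50, amplitude 20, and a peak at 14:00.
times = list(range(48))
series = [50.0 + 20.0 * math.cos(2.0 * math.pi * (t - 14.0) / 24.0) for t in times]
fit = fit_cosinor(times, series)
```

On noise-free synthetic data the fit recovers the generating parameters; on real actigraphy the residual structure is what motivates the transformed cosine described in the list above.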

3) REMOVAL OF VARIABLES WITH NO-RESPONSE DATA
In the KNHANES dataset, -8 indicates the 'not applicable' answers of the participants. To reflect the exact response in each variable, we checked the distribution of each variable, including both the survey and circadian indices

4) CHECK DISTRIBUTION AND TRANSFORM BY LOG TRANSFORMATION AND STANDARDIZATION
After removing the irrelevant variables with invalid responses, we confirmed the distributions of the remaining variables to improve the evaluation of the ML algorithms. In addition, we applied log transformation and z-score standardization to the arranged dataset to overcome possible unequal and skewed distribution of variables.
In the case of the 'PHQ-9' variable (target variable), which was discrete and not continuous, we could not apply log transformation. Distributions of variables used in our study, including 'PHQ-9,' are depicted in Fig. 4.
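A minimal sketch of the two transformations applied to a single continuous variable is given below. The use of log(1 + x), chosen so that zero values remain defined, and the population standard deviation are assumptions; the text states only that a log transformation and z-score standardization were applied. The sample values are hypothetical.

```python
import math
import statistics


def log_standardize(values):
    """Apply log(1 + x), then rescale to zero mean and unit standard deviation."""
    logged = [math.log1p(v) for v in values]
    mean = statistics.fmean(logged)
    spread = statistics.pstdev(logged)
    return [(x - mean) / spread for x in logged]


# Hypothetical right-skewed activity values for one variable.
raw = [0, 3, 12, 40, 150, 900]
scaled = log_standardize(raw)
```

Because both steps are monotonic, the ordering of the observations is preserved while the long right tail is compressed.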

5) FEATURE SELECTION
Features of both the extracted circadian rhythm indices and selected survey variables may have two problems. First, features can have a high correlation between them (multicollinearity or redundant variables in classifying dependent variables). Second, a low correlation can be observed between features and class (irrelevant features for classifying  dependent variables). To select adequate features, we applied three-step rank and frequency feature selection methods. Two feature selection criteria, lasso and ridge regression models [51], [52], were applied.
In the feature selection steps, we first fitted the lasso and ridge regression models based on a dataset including both circadian indices and survey variables. The coefficients were confirmed for the individual features and sorted by their magnitude. Second, we selected the top 15 features based on each coefficient. All high-ranking features selected from the regression models were collected. Finally, the collected features were sorted again according to their frequency. After feature sorting, we chose the top-15 features to reflect both rank and frequency from both selection criteria. The features selected in this section are listed in Table 4.
Considering the selected features, 'EQ5D' indicates subjective quality of life index, 'BP1' indicates awareness of usual stress, 'LQ4_00' indicates uncomfortable physical activity, 'D_2_1' indicates uncomfortable experience in the last two weeks, 'HE_BMI' indicates the BMI index values, 'mh_stress' indicates stress awareness, 'DF2_dg' indicates the doctor's diagnosis about depression, 'D_1_1' indicates the subjective health status, and 'pa_aerobic' indicates aerobic physical activity.
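The rank-and-frequency step can be sketched independently of how the regression models are fitted. In the sketch below, the coefficient dictionaries stand in for fitted lasso and ridge models; their values are hypothetical, the tie-breaking rule (best rank, then name) is an assumption, and k = 2 is used only to keep the toy example small (the study used the top 15).

```python
from collections import Counter


def top_features(coefs, k):
    """Names of the k features with the largest absolute coefficients."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)[:k]


def rank_frequency_selection(coef_sets, k):
    """Collect each model's top-k features, then keep the k most frequently selected."""
    counts = Counter()
    best_rank = {}
    for coefs in coef_sets:
        for rank, name in enumerate(top_features(coefs, k)):
            counts[name] += 1
            best_rank[name] = min(rank, best_rank.get(name, rank))
    # Sort by frequency first, then by best rank, then by name for determinism.
    ordered = sorted(counts, key=lambda name: (-counts[name], best_rank[name], name))
    return ordered[:k]


# Hypothetical coefficients; a lasso fit typically drives some of them to exactly zero.
lasso_coefs = {"EQ5D": -0.9, "BP1": 0.5, "IS": 0.0, "RA": 0.1}
ridge_coefs = {"EQ5D": -0.7, "BP1": 0.2, "IS": 0.3, "RA": 0.25}
selected = rank_frequency_selection([lasso_coefs, ridge_coefs], k=2)
```

Features chosen by both criteria are ranked ahead of those chosen by only one, which is the intent of combining rank and frequency.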

6) GENERATION OF DATASET UNDER SIX CONDITIONS TO CONFIRM THE OPTIMAL LENGTH OF ACTIGRAPHY DATA
To confirm the optimal length of the actigraphy monitoring for classification, we constructed six datasets by varying the duration of the actigraphy data. The first dataset contained circadian rhythm indices extracted from the actigraphy data of two days. Similarly, the second to sixth datasets consisted of three, four, five, six, and seven days of actigraphy data, respectively. The numbers of rows in the first to sixth datasets were 9158, 7681, 6160, 4627, 3086, and 1544, respectively. Additionally, each dataset was split into training and test datasets at a 9:1 ratio.

7) EVALUATION OF CLASSIFICATION PERFORMANCE IN EACH CONDITION
In the final step, we constructed an additional dataset with various class conditions to compare the classification performance at diverse class levels. Classification performance was evaluated by four conditions: binary, three, four, and five classes.
Four classification algorithms (XGBoost, SVC, MLP, and LR) were compared in a total of 24 conditions (six actigraphy length conditions × four conditions for classification labels). To check the relevance as input features for depression level classification, we confirmed the list of features sorted by feature importance from trained ML models.
Due to the imbalance in the number of subjects belonging to each class label, weights were applied to complement the algorithm training. We conducted a random search to determine the optimal hyperparameters of the four ML classifiers, as listed in Table 5. In addition, 10-fold cross validations were applied to prevent overfitting of classification algorithms.
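Two of the training safeguards mentioned above, class weighting and 10-fold cross-validation, can be sketched as follows. The text does not specify the weighting scheme, so the inverse-frequency ("balanced") heuristic used here is an assumption, as are the function names, the contiguous fold layout, and the hypothetical label counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical imbalanced label distribution.
labels = ["mild"] * 80 + ["moderate"] * 15 + ["severe"] * 5
weights = balanced_class_weights(labels)
folds = list(k_fold_indices(len(labels), k=10))
```

In practice the rows would be shuffled or stratified before folding so that each fold contains every class; the contiguous layout is kept here only for readability.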

D. CLASSIFICATION ALGORITHMS
In this study, we utilized four classification algorithms to model the relationship between the selected features and the level of depression. The first classification model was the XGBoost classifier, which is based on an ensemble of several decision tree models [53]. Algorithms with regularized objectives constitute the basis for the model. To optimize the algorithm with a dataset, we minimize the regularized objective function in (6):
$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \quad \text{where } \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \quad (6)$$
where $\hat{y}_i$ indicates the predicted value from the tree model and each $f_k$ corresponds to an individual tree. The function $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$. In the second term, the function $\Omega$ is the penalization term for the complexity of the model, where $T$ is the number of leaves in a tree and $w$ is the vector of leaf weights. To avoid overfitting, the additional regularization term smooths the final learned weights. In this study, we set $y_i$ as the class labels to which depression levels are assigned (e.g., 'mild,' 'moderate,' 'severe' depression).
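To make the structure of the regularized objective concrete, the sketch below evaluates a loss term plus a per-tree complexity penalty. It assumes the standard XGBoost penalty, gamma times the number of leaves plus half of lambda times the squared leaf weights, and uses squared error as one example of a differentiable convex loss. The function names, the toy predictions, and the leaf weights are hypothetical; this is not the XGBoost library's training routine.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Complexity penalty of one tree: gamma * T + 0.5 * lambda * sum(w^2)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def regularized_objective(y_true, y_pred, trees, gamma=1.0, lam=1.0):
    """Training loss over all samples plus the penalty summed over all trees."""
    loss = sum((pred - target) ** 2 for target, pred in zip(y_true, y_pred))
    penalty = sum(tree_penalty(leaves, gamma, lam) for leaves in trees)
    return loss + penalty


# Toy example: three samples and an ensemble of two trees given by their leaf weights.
y_true = [1.0, 0.0, 1.0]
y_pred = [0.5, 0.5, 1.0]
trees = [[0.5, -0.5], [1.0]]
objective = regularized_objective(y_true, y_pred, trees, gamma=1.0, lam=2.0)
```

Raising gamma discourages additional leaves and raising lambda shrinks leaf weights, which is how the penalty trades training fit against model complexity.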
The second classification algorithm applied was the SVC with nonlinear kernels [54]. This algorithm classifies the feature space using hyperplanes that are separated by class labels. In previous studies, researchers used linear kernels to classify binary-class conditions for stress [55]. In contrast, we used a nonlinear kernel (radial basis function kernel) to evaluate the classification performance with more diverse class levels. In addition, to avoid overfitting when nonlinear kernels are used, we developed and tested the model performance using completely participant-separated datasets.
The third classification algorithm was an LR classifier. To estimate the coefficients of the regression model, a maximum likelihood estimation method was applied. Consequently, the classifier yields a likelihood value L(x), where 0 ≤ L(x) ≤ 1. This value indicates the association between class labels and input vectors. A likelihood value higher than 0.5, which is the assigned threshold, indicates that the condition was classified as a severe depression level in binary cases. For this classifier, we considered the basic form of the LR model with our features and depression classes as follows:
$$P(Y \mid X_1, \ldots, X_k) = \frac{1}{1 + e^{-z}} \quad (9)$$
where
$$z = \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \quad (10)$$
and $Y$ represents the depression level as a class. We considered $Y$ as a specified value of either 'mild,' 'moderate,' or 'severe' in the three classes. In summary, the LR model produced probability values to categorize each class under various conditions.
The final classification algorithm used in this study was an MLP classifier (i.e., an artificial neural network model). It consists of at least three layers of nodes (input, hidden, and output layers). Each node calculates the output vectors through the activation function g with weight and bias vectors. The detailed calculation is as follows: where h (l)
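The two computations described in this subsection, the logistic probability with a 0.5 threshold and the per-layer calculation of an MLP, can be sketched as follows. The coefficients, weights, the 'not severe' label, and the ReLU activation are hypothetical choices for illustration; the text does not specify them.

```python
import math


def logistic_probability(features, intercept, coefficients):
    """Logistic function applied to z = alpha + sum(beta_j * X_j)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify_binary(features, intercept, coefficients, threshold=0.5):
    """Label a sample 'severe' when its probability exceeds the threshold."""
    probability = logistic_probability(features, intercept, coefficients)
    return "severe" if probability > threshold else "not severe"


def relu(value):
    return max(0.0, value)


def dense_layer(inputs, weights, biases, activation):
    """One MLP layer: activation(W x + b), with one weight row per output node."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


neutral = logistic_probability([0.0, 0.0], 0.0, [1.0, 1.0])
label = classify_binary([1.0, 1.0], 0.0, [1.0, 1.0])
hidden = dense_layer([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, -3.0], relu)
```

Stacking several `dense_layer` calls, with the output of one layer passed as the input of the next, gives the forward pass of a multilayer network.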

E. EVALUATION METRICS
We compared the classification performance of the ML classifiers based on six evaluation metrics. To evaluate the classification results of the algorithms using other indicators rather than only the accuracy, we calculated the true positive (TP), true negative (TN), false negative (FN), and false positive (FP) values from the confusion matrix. The correctly classified samples were calculated using the TP and TN values. In contrast, incorrectly classified samples were indicated by FN and FP. Based on the four basic values from the confusion matrix, we obtained four additional indicators: precision, recall, F1-score, and accuracy, calculated using (13-16), respectively. Furthermore, we confirmed the true positive rate (TPR) and false positive rate (FPR), using (17) and (18), respectively, to draw the ROC curve. In addition, we evaluated the performance based on AUC values using an ROC curve.
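The confusion-matrix indicators named above follow their standard definitions, which the sketch below computes for a binary case. The function name and the example counts are hypothetical and are not results from the study.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard indicators derived from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to the true positive rate (TPR)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn)  # false positive rate
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "tpr": recall, "fpr": fpr}


# Hypothetical counts for one classifier on a 100-sample test set.
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
```

Evaluating the TPR and FPR at a series of decision thresholds gives the points of the ROC curve, and the area under that curve is the AUC.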
To validate the classification performances of each classifier, we applied a one-way analysis of variance (ANOVA) test considering the evaluation indices values from classification algorithms.

III. RESULTS
The performance of ML algorithms at classifying depression levels is shown in Tables 6-9. Specifically, we examined the classification performance and optimal length of the actigraphy data extracted for circadian rhythm indices. First, in terms of the classification performance, the XGBoost classifier outperformed the other algorithms based on all the evaluation metrics. In addition, to identify the classification performance in terms of various label conditions, we compared the values of evaluation metrics under four conditions (binary, three, four, and five classes). The evaluation
Second, the performance of each classification algorithm was compared based on the length of the actigraphy data from which the circadian indices were extracted. The maximum evaluation metric values of the classifiers were obtained for the actigraphy data for two days. Additionally, we investigated whether the number of rows differed for each of the datasets obtained for the six durations of actigraphy monitoring. The number of rows gradually decreased as the length of the actigraphy data increased. To prevent the dataset size from affecting the performance, we controlled the size of the dataset and conducted an additional experiment. In this experiment, 1000 rows were sampled through stratified random sampling for all the datasets to reduce the bias for each class label. Subsequently, the same experimental process was applied to each dataset. We confirmed the same tendency in additional experiments and concluded that circadian indices extracted from two-day actigraphy data were sufficient to classify depression levels. The detailed results are presented in Appendix B.
Finally, we validated the classification performance using a one-way ANOVA test. The null hypothesis established was that the average performance of the four algorithms was the same. We verified that the test results of evaluation indices (accuracy, precision, recall, F1-score, AUC) were statistically significant.
As a result, statistical significance of performance was confirmed, and the null hypothesis was rejected. Detailed one-way ANOVA test results are shown in Table 10.
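The test statistic behind a one-way ANOVA can be sketched as the ratio of between-group to within-group mean squares. The sketch below computes only the F statistic and its degrees of freedom; obtaining a p-value additionally requires the F distribution, which is outside the Python standard library. The group values are toy numbers, not the study's metric scores.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy scores for three hypothetical classifiers, three repetitions each.
f_stat, df_between, df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F value relative to the F distribution with these degrees of freedom leads to rejection of the null hypothesis that all group means are equal.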

IV. DISCUSSION
In this study, we attempted to classify depression levels using actigraphy data based on ML algorithms. Survey variables and circadian rhythm indices extracted from actigraphy data were collected from the KNHANES dataset. To obtain reasonable evidence for depression status identification with physical activity, we found several studies related to clinical and technical aspects. First, considering the relationship between depression and physical activity, Wu et al. [56] established that physical inactivity in patients with Parkinson's disease caused depression and degeneration of motor skills through a comprehensive review of relevant studies. Teixeira et al. [57] proved that physical activity was associated with depression and anxiety in elderly groups. Moreover, Roshanaei-Moghaddam et al. [58] verified that decreased levels of physical exercise or a sedentary lifestyle were a significant risk factor for depression. Ku et al. [59] tracked elderly groups for 11-year periods. They identified that physical activity engagement was associated with a lower risk of depressive symptoms. Based on these previous studies, we determined that physical activity, including aerobic exercise, can act as a main factor in depression.
Second, related to analyses with ML algorithms, Albahli et al. [60] suggested a thoracic disease identification framework through deep neural network models. Albahli et al. [61] showed that the detection performance of a convolutional neural network in X-ray images was superior to that of other models. In addition, Chekroud et al. [62] built ML models to find predictive factors for determining the responsiveness to antidepressant treatment in patients with depression. Furthermore, Bhakta and Arkaprabha [63] compared five ML algorithms to predict depression in the elderly population. Based on these studies, we concluded that ML algorithms can be used to detect or identify diseases. Therefore, our topic, the classification of depression levels using ML algorithms, was well-founded.
To reflect variations in specific factors, time-series data collected from study participants directly (EEG or ECG recorded by electrodes attached to the skin) or indirectly (actigraphy data measured using wearable devices) were utilized with structured data. For example, Hosseinifard et al. [64] used electrical activities of the brain to evaluate depression. EEG signals of depression patients were utilized to extract feature vectors. Both linear features (e.g., power values of four EEG bands from power analysis) and nonlinear features (e.g., detrended fluctuation analysis (DFA), Higuchi fractal dimension, and Lyapunov exponent) were applied to ML classifiers. Three classifiers, linear discriminant analysis, LR, and k-nearest neighbors, were compared to identify depression patients in a study of patient and control groups. LR classifiers yielded a classification accuracy of 83.3% with a correlation dimension. In addition, the LR classifiers showed 90% accuracy with all nonlinear features. The authors indicated that the model performance was significantly better when a combination of linear and nonlinear features was used, compared to the case when only linear features were used.
Mohammadi et al. [65] proposed a fuzzy function-based ML classifier trained by three nonlinear features (fuzzy entropy, Katz fractal dimension, and fuzzy fractal dimension) to distinguish depression levels. To reflect variation of brain activities, the researchers collected EEG signals from depression patient groups, based on which all the features were calculated. To evaluate the classification performance in combination with each feature, three nonlinear features were randomly combined into groups with one, two, and three feature groups. The proposed algorithms (fuzzy function-based algorithms) were compared with SVM classifiers with 90.0% accuracy. Among the classifiers, those trained using all features (three features) showed the best performance under all conditions.
To classify the stress status of study participants, Rizwan et al. [66] proposed classification algorithms using SVM with features extracted from ECG signals. Three features (QT interval, RR interval, and ECG-derived respiration) were applied to the classifiers. In addition, two kernels (Gaussian and cubic) and three model types (linear, quadratic, and cubic) in SVM algorithms were compared to find high-performance algorithms for stress status. The classification algorithms yielded their best performance when all the three features were used, compared to the cases in which only one or two were used. In addition, models with Gaussian kernels exhibited promising accuracy (linear SVM: 98.6%, quadratic SVM: 98.6%, and cubic SVM: 98.6%) compared to that of cubic kernel SVM models (linear SVM: 97.2%, quadratic SVM: 97.1%, and cubic SVM: 97.2%).
Zhong et al. [67] used whole-brain resting-state functional MRI (rs-fMRI) data from both depression and healthy groups to identify major depressive disorder. From the collected rs-fMRI data, brain activity time-series data of 116 brain regions were extracted to construct a functional connectivity network. Functional connectivity, represented by a Pearson correlation matrix, and the correlation coefficient vectors in several matrices were applied to SVM classifiers as input features. To select highly discriminative features for classification, the Kendall tau rank method was applied. SVM classifiers with a linear kernel function showed the best classification accuracy (91.9%) in the experimental conditions. In addition, only six features were confirmed as efficient from a total of 116 features.
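The general idea of turning regional time series into correlation-based features can be sketched as follows. This is a minimal illustration on synthetic data, not the pipeline of the cited study: flattening the upper triangle of the Pearson correlation matrix into one vector is an assumption of this sketch, and the function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def connectivity_vector(region_series):
    """Flatten the upper triangle of the region-by-region correlation matrix.

    region_series: list of time series, one per brain region.
    Returns one correlation value per unordered pair of regions.
    """
    n = len(region_series)
    return [
        pearson(region_series[i], region_series[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for regional signals: 116 regions, 50 time points each.
    series = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(116)]
    vector = connectivity_vector(series)
    print(len(vector))  # number of region pairs in this sketch
```

A feature-selection step such as a rank-correlation filter would then operate on vectors of this kind, one per participant, before they are passed to a classifier.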
To enable comparison with previous studies, we designed our experiments around similar steps (feature extraction from time-series data, feature selection, and classification through ML algorithms). Unlike the time-series data widely used in previous works (EEG, ECG, and time series from rs-fMRI), we used the variation in physical activity to investigate a possible relationship with depression levels.
To reflect the variation in physical activity, diverse methods can be used to collect physical activity data from participants. Physical activity data obtained by self-report questionnaires are widely utilized to measure average physical activity. De Mello et al. [68] used a self-report physical activity questionnaire to assess the physical activity of depression patients. They surveyed various types of physical activity (e.g., weak, moderate, or vigorous) and regularity of activity (e.g., the number of activities in a week). Additionally, to monitor subjects' physical activity over a long period, Sabia et al. [69] collected physical activity questionnaires from elderly groups with dementia for 28 years. Furthermore, detailed physical activity patterns can be collected as time-series data by accelerometer or pedometer.
Harris et al. [70] compared self-report and time-series physical activity data to validate each metric. The researchers suggested that each type of data has advantages that depend on the research method and topic. Because they consist of densely sampled continuous activity values (acceleration or step values), time-series data collected by wearable devices can capture the detailed intensity of activity on an hourly or daily basis. Self-report questionnaire data, in contrast, were suggested to be more convenient for long-term follow-up studies and more useful for evaluating activity type or for combination with other structured datasets.
In our study, we attempted to confirm the relationship between physical activity and depression status. In particular, we focused on the characteristics of the circadian cycle embedded in physical activity to identify an inherent relation with depression. To extract circadian indices from physical activity, we used actigraphy data, which are time-series data. Moreover, the continuous activity values that make up actigraphy data are more favorable for establishing detailed patterns. Therefore, we used accelerometer-based actigraphy data to calculate circadian rhythm indices. Furthermore, we determined which of the features extracted from actigraphy data were more effective for classification. After classification by ML classifiers, we evaluated the results based on both classification performance and feature importance. In terms of classification performance, we compared performance under 24 experimental conditions. To confirm changes in performance with class conditions, we set four conditions (binary-, three-, four-, and five-class labels). Among the four classifiers (XGBoost, SVC, MLP classifier, and LR), the XGBoost classifier showed the best performance in all experimental conditions. Furthermore, we compared the performance of our framework with that of the classifiers proposed in previous studies. The performance of each classifier is listed in Table 11. Although the data differ, we confirmed that the XGBoost classifier proposed in our study performed well compared to the classifiers developed in previous studies. Moreover, based on these results, we found that the circadian characteristics of physical activity, which have not been widely used, are valuable for classifying depression levels.
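To make the notion of circadian rhythm indices concrete, the sketch below computes five of the nonparametric indices named in this paper (IS, IV, RA, M10, and L5) from hourly activity values, using their commonly cited definitions. It is a minimal illustration rather than the extraction code used in this study: the hourly resolution, the wrap-around search for the M10 and L5 windows on the average day, and all function names are assumptions of this example.

```python
def variance(values):
    """Population variance of a sequence."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def average_day(x, per_day=24):
    """Mean activity at each clock hour, averaged across all recorded days."""
    days = len(x) // per_day
    return [sum(x[d * per_day + h] for d in range(days)) / days for h in range(per_day)]


def window_means(profile, width):
    """Mean of every run of `width` consecutive hours, wrapping past midnight."""
    p = len(profile)
    return [sum(profile[(s + k) % p] for k in range(width)) / width for s in range(p)]


def circadian_indices(hourly, per_day=24):
    """Nonparametric circadian indices from hourly activity values."""
    x = hourly[: (len(hourly) // per_day) * per_day]  # keep whole days only
    profile = average_day(x, per_day)
    m10 = max(window_means(profile, 10))  # most active 10 hours
    l5 = min(window_means(profile, 5))    # least active 5 hours
    # Mean squared hour-to-hour change, used by intradaily variability.
    step_var = sum((b - a) ** 2 for a, b in zip(x, x[1:])) / (len(x) - 1)
    return {
        "IS": variance(profile) / variance(x),  # interdaily stability
        "IV": step_var / variance(x),           # intradaily variability
        "M10": m10,
        "L5": l5,
        "RA": (m10 - l5) / (m10 + l5),          # relative amplitude
    }


if __name__ == "__main__":
    # Synthetic two-day recording: low activity at night, high during the day.
    day = [1] * 8 + [9] * 12 + [1] * 4
    print(circadian_indices(day * 2))
```

For a recording that repeats the same daily pattern exactly, IS equals 1, which reflects a perfectly stable rhythm across days.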
Among the factors influencing algorithm performance, the length of the actigraphy data used for feature extraction was critical. We expected that classification performance would be affected by this length, which is also one of the hyperparameters that the researcher must determine. To determine the optimal length of actigraphy, six conditions (two, three, four, five, six, and seven days) were evaluated. As a result, all evaluation metrics reached their highest values when actigraphy monitoring was conducted for two days.
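The experimental conditions described above can be enumerated as in the following sketch. It assumes that the 24 conditions are the combinations of the four class conditions with the six length conditions, and that each length condition uses the first days of a recording; both are readings of the text rather than confirmed details, and the helper names are hypothetical.

```python
from itertools import product

CLASS_CONDITIONS = (2, 3, 4, 5)          # binary-, three-, four-, five-class labels
LENGTH_CONDITIONS = (2, 3, 4, 5, 6, 7)   # days of actigraphy used for features


def first_days(samples, days, samples_per_day):
    """Keep only the first `days` days of a recording (assumed windowing)."""
    needed = days * samples_per_day
    if len(samples) < needed:
        raise ValueError("recording is shorter than the requested window")
    return samples[:needed]


def experiment_grid():
    """All (number of classes, number of days) pairs to evaluate."""
    return list(product(CLASS_CONDITIONS, LENGTH_CONDITIONS))


if __name__ == "__main__":
    recording = list(range(7 * 24))  # synthetic week of hourly values
    for n_classes, n_days in experiment_grid():
        window = first_days(recording, n_days, samples_per_day=24)
        print(n_classes, n_days, len(window))
```

Feature extraction and classification would then be repeated once per pair, so that the metrics can be compared across length conditions within each class condition.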
However, the datasets of extracted features differed in size. Because different dataset sizes can affect the evaluation results, we constructed a dataset with 1000 samples through stratified random sampling. After repeating the experiments with this resized dataset, we found that the tendency in the previous experimental results was reproduced (i.e., performance under the two-day length condition showed the highest evaluation values).
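Stratified random sampling, as mentioned above, draws a fixed-size sample while preserving the class proportions of the full dataset. The sketch below shows one way to do this with the standard library; the largest-remainder allocation of per-class quotas and the synthetic label counts are assumptions of this example, not details reported in this study.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, n_samples, seed=0):
    """Return sorted indices of a sample that preserves class proportions."""
    if n_samples > len(labels):
        raise ValueError("cannot sample more items than are available")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    total = len(labels)
    # Largest-remainder allocation so the quotas sum exactly to n_samples.
    exact = {c: n_samples * len(ix) / total for c, ix in by_class.items()}
    quota = {c: int(e) for c, e in exact.items()}
    leftover = n_samples - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1

    chosen = []
    for c, ix in by_class.items():
        chosen.extend(rng.sample(ix, quota[c]))  # sample without replacement
    return sorted(chosen)


if __name__ == "__main__":
    # Synthetic labels: 5000 items in three imbalanced classes.
    labels = [0] * 3000 + [1] * 1500 + [2] * 500
    picked = stratified_sample(labels, 1000)
    print(Counter(labels[i] for i in picked))
```

Fixing the sample size in this way lets the length conditions be compared on datasets of equal size and equal class balance.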
Based on the two experimental results, we found that the optimal duration of actigraphy monitoring to effectively determine depression status was two days, regardless of the dataset size. Similarly, Thomas et al. [75] investigated the reliability of actigraphy length using different individuals as a case study and suggested that a two-day period adequately reflected the circadian rhythm of actigraphy. Thus, we concluded that the actigraphy data of two days were sufficient for feature extraction to classify depression levels.
In the case of feature importance, we investigated the ranked features in the XGBoost classifiers. In the XGBoost algorithm, the importance score of a feature (F-score) was calculated based on the number of times that the feature was used to split the data in the decision trees of the model. A detailed list of important features is presented in Table 12.
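A count-based importance of this kind can be illustrated with a toy example. The sketch below counts how often each feature appears as a split in a small hand-written ensemble, which is the idea behind the "weight" importance type in XGBoost. The tree encoding, the thresholds, and the feature names are hypothetical and do not come from the trained models in this study.

```python
from collections import Counter


def count_splits(node, counts):
    """Walk one tree and count how often each feature is used as a split."""
    if not isinstance(node, dict):  # leaf value, nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


def split_count_importance(trees):
    """Rank features by their total number of splits across all trees."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts.most_common()


if __name__ == "__main__":
    # Two toy trees; internal nodes are dicts, leaves are plain numbers.
    trees = [
        {"feature": "M10", "threshold": 120.0,
         "left": {"feature": "IS", "threshold": 0.4, "left": -0.2, "right": 0.1},
         "right": {"feature": "M10", "threshold": 300.0, "left": 0.3, "right": 0.5}},
        {"feature": "M10", "threshold": 200.0,
         "left": {"feature": "IS", "threshold": 0.6, "left": -0.1, "right": 0.2},
         "right": 0.4},
    ]
    for feature, n_splits in split_count_importance(trees):
        print(feature, n_splits)
```

A split count measures how frequently a feature is used rather than how much each split improves the model, so rankings based on gain can differ from this one.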
The ranked features of the XGBoost classifiers were the same under all conditions. A total of 15 input features were selected by regression models in the feature selection steps and consisted of 10 survey variables and 5 circadian rhythm indices. The feature ranking in this study is consistent with the factors identified in previous studies on depression and physical activity. The authors of those studies focused on mean activity levels and lower values of physical activity to identify factors associated with depressed individuals. Additionally, they observed lower values of physical activity in the depressed groups [76]-[78]. In summary, our study represents a reliable experimental paradigm in terms of both classification performance and feature importance for classification.

V. CONCLUSION
Classifying depression levels is critical for various fields, including clinical and psychological domains. In this study, we proposed a framework for classifying depression levels using ML algorithms. Based on previous studies on the relationship between depression and physical activity, actigraphy data using an accelerometer were used to extract circadian rhythm indices as features. To evaluate our framework from a diverse perspective, we designed experiments with various class labels and actigraphy length conditions. We found that the XGBoost classifier exhibited the best classification performance and that two days of actigraphy data were suitable for representing the circadian cycle in physical activity.
The first strength of this study was the application of accelerometer-based actigraphy data, which are not widely used to classify depression levels. Second, we determined the ideal length of actigraphy data for feature extraction. Third, a large-scale real-world dataset collected from people living in Korea was used to reflect practical tendencies.
Our study has some limitations. First, actigraphy data include detailed differences (e.g., the gap between morning and afternoon, and the difference between weekdays and weekends). These differences can affect depression levels. However, we considered overall characteristics instead of specific changes to classify depression levels. Second, various classification methodologies, including deep learning algorithms, can be applied to address our research questions. We used ML algorithms in this study because they make feature importance easier to confirm. Third, further studies should conduct external validation with datasets collected from other countries to generalize our framework.

APPENDIX A
The ROC curves under the other length conditions (three, four, five, six, and seven days) for each class condition (binary, three, four, and five classes). See