Periodic Time Series Data Analysis by Deep Learning Methodology

The detection of periodicity in a time series is considered a challenge in many research areas. The difficulty of period length extraction involves the varying noise levels among working environments: a system that performs well in one environment may not be accurate in another. Different methods, including deep neural networks, have been proposed across many applications to find suitable solutions to the period length extraction problem. This article proposes a convolutional neural network (CNN) based period classification algorithm, named PCA, to detect the periods of datasets. In particular, assuming that a data stream contains periodic features, the PCA utilizes historical labeled data as training material and classifies new instances according to their periods. Its performance has been tested on both synthetic and real-world periodic time series data (PTSD) with very encouraging results. In particular, we have observed that the PCA is capable of achieving 100% accuracy in the case of low-noise PTSD. Even in cases where training the PCA is not economical, such as when the data contain little noise, it still demonstrates high performance on both synthetic and real-world datasets. Moreover, we have shown that our new algorithm can capture the relationship between the shape of the waves and the target period, which is significantly different from the classical methods that mainly focus on the wave's amplitude.


I. INTRODUCTION
A. MOTIVATION AND OBJECTIVE
The analysis of periodic time series datasets (PTSDs) is relevant to many research fields. In addition, precisely classifying the period length of PTSDs is useful for industries such as health care. For example, knowing the period length from general public health data can enable pharmaceutical organizations to choose the delivery season of their flu vaccines [1] in a timely manner. Consumers and patients generally hope that there are sufficient resources to meet their demands. However, companies do not wish to have idle resources, which would increase their running budget. The ideal scenario is thus one in which the available resources match the demand with neither surplus nor shortage.
In the past decades, several decision optimization algorithms have been proposed [2]. Through examining their approaches, we observed that the majority of supply chain optimization algorithms require predetermined period
information. The vast quantity of data makes manual analysis infeasible; however, because the tasks are simple but tedious, they are suitable for artificial intelligence (AI) programs. The motivation of this article began with this problem. Currently, many AI applications are integrated into our daily life, and we believe that it is possible to develop an AI agent that can free business managers, medical doctors, and customers from manually extracting period information. In this article, we limit our discussions to datasets containing periodic features.
This article is an extended version of the preliminary conference article [3] in which we proposed the original period classification algorithm.

B. RELATED WORK
In the machine learning literature, one of the most challenging problems has always been Time Series Classification (TSC) [4], [5]. In recent years, the availability of real-world data has significantly increased [6], and hundreds of TSC algorithms with different design philosophies have been proposed [7]. Because no single approach generalizes across tasks, most TSC problems still require some level of human involvement [8]. Meanwhile, any classification problem that takes ordered data can be transformed into a TSC problem [9]. TSC problems are not only challenging but also highly diverse. Related problems include, but are not limited to: 1) electronic health records [10], 2) human activity recognition [9], [11], 3) acoustic scene classification [12], and 4) cyber-security [13].
Researchers have proposed many algorithms to achieve the required accuracy levels for various tasks [7]. A classical method is to train a k-nearest neighbor (KNN) classifier using a distance function [14]. Specifically, the Dynamic Time Warping (DTW) distance function combined with a KNN classifier appears to be a very strong baseline [7]. Different distance functions have been tested empirically; however, these studies showed no statistically significant differences between individual distance functions [14]. On the other hand, researchers have demonstrated that an ensemble of KNN classifiers using different distance measures outperforms an individual classifier. Therefore, recent advances have focused heavily on ensemble techniques [15]-[22]. Most of these works outperform the KNN-DTW approach [7] and share one common technique: transforming the raw data into another feature dimension (for instance, the shapelet transformation [17] can encode raw data into the feature dimension defined by DTW [20]).
Researchers believe that an ensemble of KNNs performing various distance measurements can outperform a system using only one distance measurement. This view motivated the development of an ensemble model called COTE (Collective Of Transformation-based Ensembles) [15], which takes advantage of 35 independent classifiers. COTE ensembles not only different distance measurements but also different feature transformations. Lines et al. [18], [23] later upgraded COTE to HIVE-COTE using a hierarchical vote system. Recently, scholars have come to believe that deep learning models can extract hierarchical features from raw data with very little human involvement. We believe this was a major motivation for the recent rise of deep learning models for TSC [24].
The Fourier transform and the discrete Fourier transform (DFT) [25] have also been widely used in fields such as signal analysis and sound analysis. After the computation is completed, the period information can be easily obtained. However, if the original function f(x) is unknown or sampled with noise, the resulting function may have multiple peaks, which can lead to confusion.
As discussed, using the Fourier transform and related approaches to obtain period information is powerful when the periodic functions are clearly defined and noise-free. However, when the target period is not the most dominating period with the highest Fourier coefficient, then the Fourier transform will have difficulty processing it. More importantly, if the data samples are noisy enough to offset the real distribution, or if the noise is also periodic, the Fourier transform will produce incorrect period information.

C. MAIN CONTRIBUTION
The main contribution of this article concerns the period length extraction problem. This article proposes the Period Classification Algorithm (PCA), which takes advantage of historical data to tolerate noise. The algorithm requires pre-labeled data, and raw data streams must be converted into images prior to period extraction. Our empirical study shows that the PCA performs quite well on both synthetic and real-world data. Moreover, the study also shows that the algorithm can learn the relationship between the changing patterns of the waves and the target periods. This exclusive feature makes the algorithm quite different from the classical ones, which mainly focus on the amplitudes of the waves.

D. ARTICLE ORGANIZATION
The remainder of this article is organized as follows. Section II first introduces the preprocessing algorithm, which converts the raw data samples to a red-green-blue (RGB) image. Then, Section III presents the proposed model in detail. Section IV presents the performance measurements, whereas Section V provides the training policy along with problems encountered during training. Afterward, Section VI introduces the synthetic data generation and discusses the algorithm's performance on it. Furthermore, Section VII contains insights that relate to the selected real-world data, and Section VIII demonstrates the PCA's performance on real-world data. We draw the conclusion in Section IX and list potential future work in Section X.

II. PREPROCESSING ALGORITHM
The PCA can be partitioned into two modules: one is the preprocessing algorithm, and the other is the neural network for classification. The second module is mainly implemented with a convolutional neural network (CNN).
Considering that CNN models were initially proposed for algorithms that take images as inputs [26], Algorithm 1 is introduced to transform raw PTSDs into image-like representations. The algorithm has two hyperparameters, the desired width and height of the output image representations, and it converts the raw PTSDs into images of the required size.
The algorithm starts by marking the data points and connecting them with line segments on a white canvas. Notice that, given various PTSDs, the generated raw graphs may have different sizes, as they may have different period lengths and amplitudes. To facilitate later processing, we pad the images so that they have the same size. Most pixels of the padded images are white and contain no information, which would affect the subsequent neural network training and lower the data processing efficiency. To fix this problem, we use Algorithm 1 to partition rawImage into pieces of the same desiredWidth and desiredHeight. We then stack all the pieces by taking the sum of the pixel values mod 255. Note that the mod operation ensures data validity by keeping the RGB values between 0 and 255.

Algorithm 1 PTSD Conversion Algorithm
Input: rawdataset, desiredWidth, desiredHeight
Output: stackedImage
1: plot rawdataset to rawImage
2: compute the smallest multiple mul such that mul × desiredWidth ≥ rawImage.width
3: put white surrounding padding on rawImage to make its size equal to (desiredHeight, desiredWidth × mul)
4: segment rawImage into small pieces of size (desiredHeight, desiredWidth)
5: stack the pieces on top of each other to generate stackedImage
6: return stackedImage

The stacking process may cause line segments to intersect with one another, which could increase the difficulty of the neural network training introduced in Section V. However, the process preserves the correlation among the pixels and does not change the period information contained in the image. Moreover, the white backgrounds provide no relevant information for the classification. Stacking the image patches thus alleviates the information sparsity problem and decreases the GPU memory demand, which also increases the convergence speed of the CNN.
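The padding, segmentation, and stacking steps of Algorithm 1 can be sketched in a few lines. The sketch below is ours, not the authors' implementation; it assumes the raw dataset has already been plotted to a grayscale array (step 1) and that the array's height already equals desiredHeight:

```python
import numpy as np

def stack_image(raw, desired_w):
    """Steps 2-5 of Algorithm 1 on an already-rasterized grayscale image."""
    h, w = raw.shape
    mul = -(-w // desired_w)  # smallest mul with mul * desired_w >= w
    # pad with white (255) up to a multiple of the desired width
    padded = np.full((h, mul * desired_w), 255, dtype=np.uint8)
    padded[:, :w] = raw
    # segment into equal-width pieces and stack them, summing pixel values mod 255
    pieces = np.split(padded, mul, axis=1)
    return np.sum(np.stack(pieces).astype(np.int64), axis=0) % 255
```

Taking the sum mod 255 keeps every stacked pixel in the valid 0-255 range, at the cost of occasional wrap-around where many dark segments overlap.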

III. DEEP CONVOLUTIONAL MODEL DESIGN
The neural network we applied in this article consists of a stack of building layers. Given an image represented as a 3D tensor (the first two axes are the coordinates of the pixels, and the last one records the RGB values), the tensor is first fed into the feature extraction module, which includes six convolution-pooling (conv-pooling) layers in cascade followed by a flatten layer. The flattened tensor is then fed into the classification module, which consists of two fully connected layers.
As shown in Table 1, the i-th (i = 1, 2, ..., 6) feature extraction layer contains two consecutive convolution layers consisting of 2^(4+i) kernels of sizes 3 × 1 and 1 × 3, respectively. We add nonlinearity using the leaky ReLU function with α = 0.1. All pooling layers have a pooling size of 2 × 2, and we apply the dropout method with a dropout rate of 0.25 to avoid overfitting.
Note that, instead of applying a convolution layer with a 3 × 3 square kernel, we used two parallel layers with kernels of sizes 3 × 1 and 1 × 3, respectively. This design, originally proposed by Szegedy et al. [27], has a performance similar to that of the classical CNN but contains fewer parameters. In our case, the kernels need 3 + 3 = 6 parameters and perform similarly to the regular kernel implementation that requires 3 × 3 = 9 parameters. In principle, fewer parameters require fewer training samples and training iterations and result in better generalization.
The leaky ReLU function simply returns the input value if it is greater than zero; otherwise, it returns α times the input. Compared to the traditional ReLU function, this design introduces nonlinearity without discarding information when the input is less than zero, while maintaining high computational efficiency.
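As a sanity check, the leaky ReLU described above is a one-liner (a minimal sketch of ours; `alpha` defaults to the α = 0.1 used in our model):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: return x where x > 0, otherwise alpha * x."""
    return np.where(x > 0, x, alpha * x)
```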
The introduction of pooling layers removes irrelevant features and keeps the most important ones. A stack of conv-pooling layers can then extract low-level features, such as edges and corners, while more complicated features, such as the periods of heart rate diagrams, are composed of those low-level features and are captured by the layers closer to the output. The extracted features are sent to the classifier module, consisting of two fully connected layers, which makes the period length prediction.
The dropout design removes kernels of the CNN with a probability equal to the dropout rate. This forces the model prediction not to rely on any particular kernels and their outputs, which could otherwise cause a serious overfitting problem [28]. Fig. 1 visualizes the part of the model that follows the stack of conv-pooling layers. A well designed model should have sufficient independent features after the completion of the training process. Assuming that each neuron at the last conv-pooling layer represents a single feature (that is, whether a certain pattern appears in the input image), there is no need for us to keep the spatial information. Thus, we can flatten the outputs to meet the input size requirement of the classifier module without losing much information. In the classifier module, we still apply the leaky ReLU function for nonlinearity. We also use dropout with a dropout rate of 0.25 to alleviate the overfitting problem. From a mathematical point of view, a fully connected layer (without activation functions) can be considered an affine transformation f: R^{D_j} → R^n, where D_j is the output size of the previous layer and n is the size of the current one. Each f_j is defined as

f_j(x) = Σ_{i=1}^{D_j} w_{j,i} x_i + b_j,

and a fully connected layer is the concatenation of the functions f_1, f_2, ..., f_n. Remark that, in order to increase the model's capacity, we also add the leaky ReLU functions for nonlinearity. At last, the classification prediction is made in the output layer. This layer is similar to a fully connected layer but uses the softmax function [29] (defined in Equation (2) as softmax(z)_i = exp(z_i) / Σ_k exp(z_k)) instead of the leaky ReLU activation. Notice that the output layer has the same number of outputs as the number of classes to be classified.
The output of the softmax function can be seen as the probability mass function of a multinomial distribution. When performing the prediction task, the class corresponding to the highest probability value is selected. We summarize the entire neural network model in Table 2 along with the hyperparameters we used for the experiments.
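The softmax-based prediction step can be sketched as follows (our own minimal illustration, not the paper's code; shifting by max(z) is a standard numerical-stability trick):

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize to a probability mass function."""
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

def predict(z):
    """Pick the class with the highest probability."""
    return int(np.argmax(softmax(z)))
```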
The main task to be solved in this article is to classify the PTSDs according to the length of their period. Notice that, the value of the target labels solely depends on the period length. As a result, the choice of the unit time length is not a concern as long as all input data sets share the same unit.
Moreover, to facilitate the model implementation, the labels are converted into the one-hot representation instead of integers. For instance, assuming that the possible period length choices are from zero to three, then (1, 0, 0, 0) denotes the period length zero and (0, 0, 1, 0) represents the length two.
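A minimal sketch of this label conversion (function name is ours):

```python
import numpy as np

def to_one_hot(label, num_classes):
    """Convert an integer period-length label into a one-hot vector."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v
```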

IV. PERFORMANCE MEASUREMENT
The performance of deep models can be measured by loss and accuracy. The accuracy value measures deep models from a macro perspective: its calculation is based on a set of instances, and the result simply tells the percentage of the set that is correctly classified. It provides a general understanding of how well or poorly a model performs; however, it lacks details on the errors made for each instance.
Unlike accuracy, loss is not a percentage measurement. Loss is a number telling how far off a prediction was on a single instance. The same deep model may obtain different loss values when predicting two independent instances. The goal is thus to find a set of weights and biases that minimizes the loss values over all possible instances.
We utilize the cross entropy [30] defined by Equation (3):

H(y, ŷ) = −Σ_i y_i log(ŷ_i).

Cross entropy was introduced to quantify the difference between two probability distributions. When measuring the performance of a CNN model, we take the ground truth label as the target distribution y, and the model predicts another distribution, ŷ.
The ground truth label should be definite; however, CNNs are not designed to produce one single certain label but a probability distribution over all possible labels. For a binary classification task, the ground truth labels are positive <1.0, 0.0> and negative <0.0, 1.0>. A CNN could produce a prediction such as <0.7, 0.3>, signifying that the CNN is 70% certain the instance is positive. If the same CNN produced another prediction <0.9, 0.1> for another instance sharing the same ground truth label, then Equation (3) measures a greater loss for the prediction on the former instance than on the latter.
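The binary example above can be checked numerically with a small cross-entropy sketch (ours; the `eps` term guards against log(0)):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross entropy of Equation (3): -sum_i y_i * log(y_hat_i)."""
    return -np.sum(y * np.log(y_hat + eps))

# a 70%-confident correct prediction is penalized more than a 90%-confident one
loss_07 = cross_entropy(np.array([1.0, 0.0]), np.array([0.7, 0.3]))
loss_09 = cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
```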

V. TRAINING ALGORITHMS
After measuring the performance of the CNN models, we require a strategy to adjust the convolutional kernels and the coefficients of the fully connected layers. As mentioned earlier, loss functions measure the performance of a deep model, where a lower loss value indicates a better performance. Predictions ŷ_i are produced by model(x_i, θ), where θ represents the parameters. Therefore, we can treat the loss function f_loss(y_i, ŷ_i) as an objective function and seek argmin_θ f_loss(y_i, model(x_i, θ)). Thus, we require strategies to optimize θ in order to minimize the loss value. The Adam optimizer [31] was utilized to train our deep model. The name ''Adam'' stands for adaptive moment estimation. It has demonstrated state-of-the-art performance among deep model optimizers in terms of both training speed and results.
Adam combines the innovations of Adagrad [32] and Momentum [33] into one optimizer. It takes two hyperparameters: β_1, the decay rate of the first-moment estimate, and β_2, the decay rate of the second-moment estimate. The authors of Adam reported state-of-the-art performance on many applications with β_1 = 0.9 and β_2 = 0.999. Therefore, all experiments mentioned in Sections VI and VIII were trained with the Adam optimizer using the parameter configuration proposed by its authors.
Essentially, the training phase is a repeating process, and it can be broken down into iterations. In the deep learning literature, one full pass over the training set is usually called an epoch. Each epoch starts by randomly selecting one or several instances x_i from the training set. These selected n instances form a batch X. The deep model then produces prediction labels for the batch and sums their loss values. This part of the computation is also referred to as forward propagation. Then, backward propagation is performed according to the optimizer's parameter update policy. Specifically, Adam calculates the following:

m_t = β_1 m_{t−1} + (1 − β_1) g_t,
v_t = β_2 v_{t−1} + (1 − β_2) g_t²,

where g_t is the gradient of the loss with respect to θ at step t, and θ represents the collection of all trainable parameters. Then m_t and v_t are estimates of the first moment and second moment of the gradients, respectively. If m_0, v_0 are initialized as vectors of zeros, and β_1, β_2 are close to 1, the authors of Adam observed that m_t, v_t are biased toward zero. Thus, instead of taking m_t, v_t directly, the bias-corrected first and second moment estimates are computed as

m̂_t = m_t / (1 − β_1^t),  v̂_t = v_t / (1 − β_2^t).

The bias-corrected terms are then used to update the parameters as follows:

θ_t = θ_{t−1} − α m̂_t / (√v̂_t + ε).

These selected instances will not be used in the next batch of training. One epoch completes when every instance has been selected. Afterward, all instances are randomly shuffled, and we repeat the above procedure. Besides the optimizer and the CNN architecture, other factors could affect the training result. For instance: 1) The noise component must not dominate the real PTSD trend. Specifically, the PTSD's amplitude should remain obvious to human eyes after adding the noise component.
2) The training set must have a strong connection with its labels. Otherwise, a large gap between the training and testing evaluations could result. The model could perform very well on training instances by memorizing each of them; however, this does not guarantee that the model generalizes effectively at the evaluation stage, as reported by Zhang et al. [34]. 3) Usually, a training set should include all possible labels from the problem domain. If a deep model is deployed to a working environment that contains unseen labels, it can only classify such instances as the closest label from the training set. The factors listed above are essential but not exhaustive; other, unlisted factors can further push the training result away from the potential optimum.
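The Adam update rules described in this section can be sketched for a single scalar parameter (a toy illustration of the update policy, not our training code; the learning rate and loss function here are chosen only for demonstration):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = b1 * m + (1 - b1) * grad          # first-moment estimate m_t
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate v_t
    m_hat = m / (1 - b1 ** t)             # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# toy example: minimize f(theta) = theta**2, whose gradient is 2 * theta
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.1)
# theta is now close to the minimizer 0
```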

VI. EXPERIMENTS ON SYNTHETIC DATASETS
The core problem in this article is to classify PTSDs by their period length, and the experiments are designed around it. The experiments in this section consist of two parts: 1) we test how resistant our algorithm is to noise; 2) we show that our algorithm can capture the relationship between the wave's shape and the target period, which most classical algorithms cannot.
Due to the nature of deep models, we used Graphics Processing Units (GPUs) instead of a traditional Central Processing Unit (CPU) for the experiments. More specifically, two Gigabyte GeForce GTX 1080 graphics cards accelerated the process tremendously. In terms of software, the deep model was implemented with the Python library Keras, version 2.2.2, with TensorFlow-GPU version 1.10.0 as its backend. Because of limitations in computational power, all experiments were configured with desiredWidth = desiredHeight = 250 during the preprocessing phase (see Algorithm 1); these values can be set higher if extra computing resources are available. Interestingly, a higher resolution provides more periodic features when polynomial components dominate. The batch size was set to 32, and the initial learning rate was 0.0003.
The first part of our experiments was aimed at determining the noise threshold of the PCA. The data were generated through Algorithm 2 with Gaussian noise added. The parameter periodLength determines the desired length of a single period measured in time indices. The parameter window represents the upper bound of the time indices, whereas sampleSize decides the number of data samples for a single period. In addition, the callable parameter noiseFunc() can be any possible noise function.
The algorithm starts by filling the array X with evenly spaced time indices between zero and periodLength, which represent the time indices of the first period. The temporary array Y_p is then populated with samples from a standard normal distribution. The algorithm now has the first period and keeps copying Y_p and extending X until there is not enough room for another period.
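Algorithm 2, as described above, can be sketched as follows (our paraphrase; the parameter names follow the text, but the implementation details and function name are assumptions):

```python
import numpy as np

def generate_ptsd(period_length, window, sample_size, noise_func=None, seed=0):
    """Sketch of Algorithm 2: draw one random period, tile it across the
    window, and optionally add noise on top."""
    rng = np.random.default_rng(seed)
    # time indices and values of the first period
    x_p = np.linspace(0.0, period_length, sample_size, endpoint=False)
    y_p = rng.standard_normal(sample_size)
    # copy the period until there is no room left within the window
    repeats = int(window // period_length)
    x = np.concatenate([x_p + i * period_length for i in range(repeats)])
    y = np.tile(y_p, repeats)
    if noise_func is not None:
        y = y + noise_func(x)
    return x, y
```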
The experiments had two possible ground truth labels, 2 and 7, and 2,000 images were generated for each label. Altogether, 4,000 images were generated. After shuffling, 3,600 (90%) were selected as training instances, and the remaining 400 (10%) were saved as testing instances. We did not use the same image multiple times across different experiment runs.
That is, we first attempted to determine the maximum noise that the PCA could tolerate, so we tested the algorithm by setting the standard deviation σ of the Gaussian noise from zero to ten. Every time we increased the deviation, we trained the model from scratch. The second set of experiments determined the highest polynomial degree as well as the highest coefficient variance for which the PCA could provide a correct answer. The listings below provide the degree values of the polynomial noise functions along with the variances of the polynomial coefficients used in these experiments: 1) polynomial degree {0, 1, 2, 3, 4}; 2) variance of polynomial coefficient {0.5, 1, 1.5, 2, 3, 4}. By choosing one value from each listing and combining them, we obtain 30 configurations. For each configuration, we performed 10 runs and averaged the results for performance measurement. As mentioned previously, images were randomly generated by the method introduced in Algorithm 2.
Thus, the results were not biased by a random sample or by random deep model initialization. In the resulting plot, the x-axis shows the polynomial degrees of the noise components, whereas the y-axis indicates the average out-of-sample accuracy over 10 identical runs. Each line represents a different coefficient variance configuration, as mentioned earlier. Specifically, the PCA showed steady performance when the polynomial degrees were below 3, with average accuracy readings above 90%.
Once the polynomial degree reached 3, the performance on some test cases decreased accordingly. This observation is expected, as all training and testing data were plotted to equal-sized images. As mentioned earlier, all PTSDs' periodic components were generated with an equal amplitude. Therefore, periodic features vanished in those PTSDs composed with higher-degree polynomials.
Furthermore, when the degree was equal to 4, two coefficient test cases rose above 90% again, while two others decreased. In addition, the lower bound of the accuracy decreased even further, which demonstrates that the dataset generation became more random, and it was therefore more likely for polynomial components to become dominant. Some waveforms became unrecognizable not with increasing polynomial noise but with decreasing polynomial noise. A sawtooth wave is a typical example in which the amplitude increases within each period. Both Figs. 6 and 7 are examples of the same sawtooth wave (g(x) = 0.7 × mod(x, 2)) with different polynomial noise. Fig. 6 has the noise polynomial h(x) = 0.005(x)(x − 10)(x − 17). When 4 < x < 12, the trend of the waveform decreases, and the sawtooth is evident within this interval. The sawtooth is still recognizable when x < 4; however, it fades when x surpasses 13. Fig. 6 plots a noise polynomial of degree 3, and its periodic feature is not yet blurred. However, Fig. 7 demonstrates what can happen when the polynomial degree continues to increase. The coefficient variances of both waveforms are at the same level. In comparison with Fig. 6, the sawtooth in Fig. 7 is unrecognizable across the domain.
In addition, these plots did not contain enough periodic information for the PCA to extract; the waves looked more like curves than waves. However, this problem was not present in the selected real-world datasets, where waveforms looked similar across periods. The second part of the experiments was performed to show that our new algorithm is able to detect the period of data with certain changing features. For instance, consider an input that is a mixture of sawtooth, square, and sine waves. Our algorithm can learn to always detect the period of the sawtooth wave, which is very different from those classical methods that often output the period of the wave with the greatest amplitude.
We trained and tested our algorithm on a dataset generated through the following procedure, which mixes waves of various shapes and amplitudes and selects the period of the sawtooth wave as the target: the amplitude amp_i of each component wave is a random integer drawn from the closed interval [1, 6] with respect to the uniform distribution. We list the test results of our algorithm as well as the Fourier transform in Table 3. The table indicates that our algorithm successfully captures the relation between the shape of the waves and the target period. At the same time, the Fourier transform can only output the period of the wave with the greatest amplitude and consistently fails during the test.
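The wave mixture described above can be sketched as follows (our own illustration; the component periods and the exact mixing formula are assumptions, while the amplitude range [1, 6] follows the text):

```python
import numpy as np

def mixed_wave(x, saw_period, sine_period, square_period, amps):
    """Mix a sawtooth, a sine, and a square wave; the target label would be
    the sawtooth period regardless of the random amplitudes."""
    saw = np.mod(x, saw_period) / saw_period            # sawtooth in [0, 1)
    sine = np.sin(2 * np.pi * x / sine_period)
    square = np.sign(np.sin(2 * np.pi * x / square_period))
    return amps[0] * saw + amps[1] * sine + amps[2] * square

rng = np.random.default_rng(0)
amps = rng.integers(1, 7, size=3)   # uniform integers from the interval [1, 6]
x = np.linspace(0, 20, 1000)
y = mixed_wave(x, 2.0, 3.0, 5.0, amps)
```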

VII. REAL-WORLD DATASETS SELECTION
As mentioned, Algorithm 2 provides datasets that have balanced classes but different noise levels or noise types. The experiments demonstrated the PCA's capacity for binary classification tasks. For a real-world dataset, we used a different approach.
The PTSD generation algorithm provides all possible situations to fully test the PCA. However, the real-world datasets described below were collected in addition to the generated datasets. To simulate the working conditions of the PCA, these real-world datasets are more suitable than generated ones. Most importantly, they provide more insights into the performance of the PCA in a real-world setting.
Theoretically, Algorithm 2 can generate all possible noise on the amplitude axis. However, there are other aspects that can affect the performance, such as a phase shift or an inconsistent sample rate caused by internet delay. Thus, we are still interested in how the PCA performs on real-world datasets. After careful examination, we selected the following datasets: 1) historical hourly weather data (HHWD) [35]. We did not use the entire datasets because some fields did not contain periodic features. Some fields of the HHWD dataset, such as windDirection, appeared completely random across the time index.
Fields selected from the datasets were manually verified, including temperature data and electricity usage. Electricity usage appeared to be similar on the same days in different weeks. Thus, several weeks of data were used to plot periodic waveforms. Fig. 8 presents two examples from the PJMHEC dataset that reveal the nature of the dataset. Within each period, the five high peaks indicate that more electricity is consumed during working days and less during the weekend.
Unlike the electricity usage data from the PJMHEC dataset, the temperature data from the HHWD and LEMD datasets contain both short-term and long-term periods. In short-term periods, such as different days of the same week, they may have similar temperature readings. In addition, temperature readings in different years within a decade appear to be periodic as well. However, the shorter the time window is, the noisier the waveform appears. Short-term prediction is also less challenging than long-term prediction. Therefore, the number of years was used as the label for temperature data. Fig. 9 presents three examples from the HHWD dataset. The plots all have a different number of periods; however, the images are the same size. The left plot has only one period, the middle plot has three periods, and the right plot has five periods. Comparing the three plots demonstrates that a larger number of periods plotted in an image leads to plots that are less clear. In addition, there are differences between each period, which signifies that noise exists within the dataset.

VIII. EXPERIMENTS ON REAL-WORLD DATASET
Unlike generated datasets, real-world datasets usually have an unbalanced label distribution. This signifies that if 70% of the instances of a dataset are from one class, then a model that simply maps everything to that class will achieve 70% accuracy. Thus, accuracy and loss alone are not sufficient to justify the performance of the PCA, and extra measurements must be deployed.
We call instances that are correctly classified with respect to the ground truth label true positives (TPs) and true negatives (TNs). Positive instances classified as negative are called false negatives (FNs), while negative instances classified as positive are called false positives (FPs).
As mentioned, the accuracy measurement is not sufficient for unbalanced datasets. Thus, we use the F1 score, which is defined as follows:

F1 = 2 · precision · recall / (precision + recall),

where precision = TP / (TP + FP) and recall = TP / (TP + FN). The F1 score combines two other measurements: precision and recall. Precision represents the proportion of predicted positive instances that are actually from the positive class; that is, when the classifier reports that an instance is positive, precision is the probability that the classifier is correct. Recall is the fraction of the total actual positive instances that are true positives; in other words, it represents the percentage of actual positive instances that are correctly classified. Precision and recall are not entirely independent, and it is sometimes difficult to select one. This is one of the reasons for using the F1 score, which combines the two. The F1 score is the harmonic mean of precision and recall and always lies between them. However, whereas the arithmetic mean gives the same weight to every element, the F1 score gives larger weights to lower values. For example, if a classifier produces 100% recall and 0% precision, the F1 score is 0%, whereas the arithmetic mean is 50%.

The experiments were configured as follows. In order to prevent possible overfitting, the algorithm had 20 epochs to learn features from each dataset, and the network was then deployed on the testing datasets right before re-initialization. The result was obtained by averaging the performance measurements over multiple runs. Fig. 10 presents the F1 score of the PCA during its training on the real-world datasets. The plot illustrates that the PCA achieved an F1 score of 95% for the HHWD, ENTSOE, and PJMHEC datasets. After the first epoch, the PCA hardly learned anything new from the data. However, the PCA struggled on the WeatherAUS dataset and even more so on the LEMD dataset.
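The precision, recall, and F1 computations above can be sketched directly from the TP/FP/FN counts (a minimal illustration of ours):

```python
def precision_recall_f1(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```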
The PCA trained for 14 epochs on the WeatherAUS dataset to achieve the same performance that it reached after one epoch on the HHWD dataset. In addition, the PCA completed 20 epochs on the LEMD dataset with an F1 score of over 80%. We believe that the difference was caused by the number of instances in each dataset: HHWD, ENTSOE, and PJMHEC had the largest numbers of images, which means that the PCA had more training material within each epoch. Because we trained the PCA with the Adam optimizer, weight updates occurred at the end of each batch. This means that, with the batch size held constant, a larger dataset produces a larger number of batches and, consequently, more optimization steps.

We now know that the PCA can fit the training data well; however, this does not guarantee that it can fit the test data successfully. Ten percent of the training data were saved for later validation. The validation phase occurred at the end of each epoch, and these data were never used for optimization. By applying the model to the validation data, we can gain insight into whether the PCA overfits the training data.
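The scaling argument above, that with a fixed batch size a larger dataset yields more weight updates per epoch, can be illustrated with a quick calculation. The dataset sizes, batch size, and epoch count below are made-up figures, not the paper's actual configuration:

```python
import math

def weight_updates(n_samples, batch_size, epochs):
    """Number of optimizer steps when weights are updated once per batch."""
    return epochs * math.ceil(n_samples / batch_size)

# Two hypothetical datasets trained for the same 20 epochs at batch size 32.
print(weight_updates(50_000, 32, 20))  # 31260 updates for the larger dataset
print(weight_updates(5_000, 32, 20))   # 3140 updates for the smaller one
```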
Figs. 11 and 12 present the results of applying the PCA to the validation dataset. Generally, Figs. 12 and 10 exhibit the same trends; however, on the WeatherAUS dataset, the PCA performed more poorly on the validation data than on the training data. In addition, Fig. 11 demonstrates whether the PCA actually learned concepts or was simply biased by the training data. The plot illustrates that the optimizer did not overtrain the PCA during the training procedure: the validation loss does not increase for the HHWD, ENTSOE, and PJMHEC datasets. Although there are peaks in the plots, the validation loss decreases overall.
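The criterion used above, that a validation-loss curve may contain isolated peaks yet still decrease overall, can be expressed as a simple windowed check. This is a hypothetical helper sketched for illustration, not code from the paper; the window size is an assumption.

```python
def validation_loss_rising(losses, window=3):
    """Flag potential overfitting: the mean validation loss over the last
    `window` epochs exceeds the mean over the preceding `window` epochs.
    Averaging smooths over isolated peaks in the curve."""
    if len(losses) < 2 * window:
        return False
    recent = sum(losses[-window:]) / window
    earlier = sum(losses[-2 * window:-window]) / window
    return recent > earlier

# A decreasing curve with one peak is not flagged as overfitting.
print(validation_loss_rising([0.9, 0.7, 0.6, 0.8, 0.5, 0.4]))  # False
# A curve that turns upward is flagged.
print(validation_loss_rising([0.6, 0.5, 0.4, 0.5, 0.6, 0.7]))  # True
```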
Ten percent of the samples were saved for the final testing of the PCA. Note that we made the label distributions of the training and testing datasets the same, so that their data distributions should be similar. Unlike a traditional configuration, which usually chooses an 80%/20% train/test split, we chose a higher ratio of 90%/10%. This configuration can make the training and testing data distributions less similar to each other and thus makes the task more challenging; readers should expect better performance from our algorithm if an 80%/20% split is adopted. Overall, the PCA performed better on the real-world datasets than on the generated datasets: it achieved over 90% on all five real-world datasets, whereas its performance dropped below 70% on some of the generated datasets.
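Matching the label distributions of the training and testing sets, as described above, amounts to a stratified split: sampling the test fraction separately within each label. The helper below is an illustrative sketch under that assumption, not the paper's code:

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_fraction=0.10, seed=0):
    """Split so that each label keeps approximately the same share of the
    test fraction, keeping train/test label distributions matched."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend((s, label) for s in items[:n_test])
        train.extend((s, label) for s in items[n_test:])
    return train, test

# Toy example: 100 samples per class, split 90%/10% within each class.
samples = list(range(200))
labels = [i % 2 for i in range(200)]
train, test = stratified_split(samples, labels)
print(len(train), len(test))  # 180 20
```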

IX. CONCLUSION
Our experiments demonstrate that the PCA is suitable for real-world periodic classification tasks. Unlike the classical methods, we have shown that the PCA can perform well when the target period is given by the wave that changes in a particular pattern rather than by the wave with the largest amplitude. Its performance may vary with the quantity of data and the noise component. If the noise component follows a Gaussian distribution, its effect on the PCA's performance is limited; however, if the noise component is a polynomial function and the polynomial trend is too large, the data may no longer appear periodic in the images. Moreover, we have observed that the PCA's performance approaches 100% as the training data size increases, which suggests that our model has enough capacity to fully capture the relationship between the input data and the target periods.

X. FUTURE WORK
Anticipated future work involves the following areas:
• Applications: We tested the PCA on synthetic and real-world datasets, and the performance analysis revealed that it achieves high accuracy. We believe that various applications can utilize this model to improve their user experience. For instance, Internet traffic usage prediction can be made more accurate with precise period information.
• Efficiency: The deep convolutional model used in this work is relatively complex for the given task, and we believe that a shallower network could be trained to achieve the same performance. In addition, the necessary quantity of training data remains untested. Because shallower models have a smaller parameter space, they may require less training data. A model with fewer parameters also offers higher portability, lower energy consumption, and faster computation. Finally, the hyper-parameters used in this article can be further optimized for better performance.