Octave Mix: Data augmentation using frequency decomposition for activity recognition

In the research field of activity recognition, although it is difficult to collect a large amount of measured sensor data, data augmentation (DA) has not been discussed much. In this study, I propose Octave Mix as a new synthetic-style DA method for sensor-based activity recognition. Octave Mix is a simple DA method that combines two waveforms by intersecting their low- and high-frequency components using frequency decomposition. In addition, I propose a DA ensemble model and its training algorithm to acquire robustness to the original sensor data while retaining a wide variety of feature representations. I conducted experiments to evaluate the effectiveness of my proposed method using four benchmark datasets of sensor-based activity recognition. As a result, my proposed method achieved the best estimation accuracy. Furthermore, I found that ensembling two DA strategies, Octave Mix with rotation and mixup with rotation, makes it possible to achieve even higher accuracy.


Introduction
The task of recognizing human activity by measuring movement with sensors carried by people is called Human Activity Recognition (HAR). Recognizing people's activities enables lifelogging and the provision of activity-based services. Collecting a large amount of activity data from many people makes it possible to exploit it for marketing and traffic jam mitigation. HAR is often implemented with machine learning [1], and HAR based on deep learning has been actively studied in recent years [2]. Due to the high expressive ability of deep learning models, a large amount of training data is required to obtain a generic model that avoids overfitting. To address this challenge, data augmentation (DA) [3] is generally used to expand the amount of data in the research field of image recognition. On the other hand, in the field of context awareness using sensors, it is not easy to expand the labeled data. For example, Inoue et al. [4] collected a dataset for sensor-based nursing activity recognition. In their study, 22 nurses wearing sensors performed their nursing tasks, and another nurse acting as an observer manually recorded their activities. 41 different activity class labels were annotated on the sensor data measured over two weeks. Collecting training data for HAR therefore takes a great deal of time and effort, and improving DA methods is desirable.
In this study, I propose Octave Mix (Fig. 1) as a new DA method for HAR using sensor data. In addition, I propose an ensemble model and a training algorithm that combine existing DA methods to improve the estimation accuracy. There are several kinds of DA, such as simple geometric-transformation-style DA applied to a single sample and synthetic-style DA applied to multiple samples. Octave Mix is a synthetic-style DA method.

Figure 1: Process of data augmentation by Octave Mix.

After applying a Low Pass Filter (LPF) and a High Pass Filter (HPF) to two sensor recordings, the two are combined by intersecting the low- and high-frequency waveforms, and a weighted sum of both combined waveforms is calculated. Finally, Octave Mix is used in conjunction with existing DA methods based on geometric transformations. Based on the above, the main contribution of this study is the proposal of the following three methods.
• Octave Mix: I propose a new synthetic-style DA method, Octave Mix, for sensor data based on frequency decomposition.
• DA ensemble model: I propose a deep learning model that ensembles multiple DA methods, and investigate the optimal combination of DA for HAR.
• DA revisited for fixed feature extractor (DAR-FFE): I propose a training algorithm for the proposed ensemble model that applies pre-training to enhance the effectiveness of DA.
In addition to the above proposals, my experimental results showed the following beneficial findings.
• It was found that synthetic-style DA methods (mixup [5] and RICAP [6]) work well for HAR, and that Octave Mix outperforms them.
• The best accuracy was achieved by combining two DA policies: Octave Mix with Rotation (one of the DA methods based on geometric transformations) and mixup with Rotation.
Related works
With the widespread adoption of deep learning, end-to-end learning methods that include the feature extractor as a trainable network have been developed, and many studies on HAR using deep learning have been published. Many of these simply apply convolutional neural networks (CNNs) to sensor-based HAR tasks [21,22,23,24,25,26]; the model architecture consists of several convolution-pooling layers followed by a fully-connected layer. A method combining multiple sensors [27], methods inserting a recurrent layer after several convolution-pooling layers [28,29], and more advanced methods introducing Inception, Residual, and Attention modules [30,31,32,33,34] have also been studied. However, these studies focus on the optimal model architecture for sensor-based HAR tasks and do not discuss DA. Meanwhile, a few studies (e.g., Um et al. [35]) have applied simple geometric-transformation-style DAs to sensor waveforms, such as:
• Permutation: swap sections of the time series.
• Scaling: scale the waveform in the amplitude direction.
• Time-warping: scale the waveform in the time direction.
Therefore, the discussion of DA in HAR has been limited to simple geometric transformations, and the effectiveness of synthetic-style DA has not been discussed.

Synthetic-style DA
According to Shorten et al. [3], image DAs can be broadly classified into "Basic Image Manipulations" and "Deep Learning Approaches", and there are also "Meta Learning" approaches that explore optimal DAs. The DAs used in HAR fall under "Geometric Transformations" or "Random Erasing" within "Basic Image Manipulations". In other words, only a small part of the DA methods studied in image recognition has been used in HAR.
In this study, I focus on synthetic-style DA methods, which have not been utilized in HAR in the past. In image recognition, synthetic-style DA methods such as mixup [5] and RICAP [6] are widely used. mixup [5] is a method that combines multiple training data. Given two labeled data (x1, y1) and (x2, y2), mixup generates the synthetic data (x̃, ỹ) by the following formula (1):

x̃ = λx1 + (1 − λ)x2,  ỹ = λy1 + (1 − λ)y2,  (1)

where λ ∼ Beta(α, α), for α ∈ (0, ∞), and λ ∈ [0, 1]. α is a hyperparameter of mixup. From the above equation, mixup combines two inputs and outputs by weighted averaging with a random weight λ drawn from a beta distribution. The important point of mixup is that not only the input x but also the output y is combined by weighted averaging. This process generates data that does not exist in the training set, intermediate between the two samples, together with an intermediate label.
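Equation (1) maps directly to code. The following is a minimal numpy sketch for a single pair of labeled sensor windows (one-hot labels assumed; the function name `mixup` and its signature are my own, not the authors' implementation):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=5.0, rng=None):
    """Combine two labeled samples by a Beta-weighted average (Eq. 1).

    x1, x2: sensor windows of shape (winsize, channels)
    y1, y2: one-hot label vectors
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # lambda ~ Beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2       # interpolate the inputs
    y = lam * y1 + (1.0 - lam) * y2       # interpolate the labels the same way
    return x, y
```

Because λ weights both x and y, a mixture of two one-hot labels still sums to 1, i.e. it is a soft label over the two classes.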
RICAP [6] is a method that generates a composite image by cutting out randomly determined rectangular regions from each of four training images and combining them side by side. As with mixup, the output y is synthesized as a weighted average according to the sizes of the cut-out rectangular regions.

Advanced DA approaches
Shorten et al. [3] describe "Deep Learning Approaches" that use deep learning models to perform DA, such as augmenting data by automatic generation [40] and by style transformation [41]. In recent years, meta-learning methods have been proposed: AutoAugment [42], which uses reinforcement learning to search for the best strategy among multiple DA methods; RandAugment [43], which is faster because it searches randomly; and Adversarial AutoAugment [44], which is faster through adversarial training. A method called AugMix [45], which improves robustness by compositing multiple DAs, has also been proposed.
Advanced DA approaches explore combinations of DA methods from "Basic Image Manipulations"; therefore, it is still important to develop new methods for "Basic Image Manipulations". In this paper, I propose a new synthetic-style DA method together with an optimal ensemble method and training algorithm; this is the position of this study.

Outline
My proposed method consists of three components: Octave Mix, a new synthetic-style DA method; the DA ensemble model, which uses feature extractors trained with multiple DAs together; and DAR-FFE, which pre-trains using DAs and then additionally trains only the classifier part without DAs. An overview diagram is illustrated in Figure 2. The individual details are described in the following sections.

Octave Mix (OctMix)
Octave Mix is inspired by Octave Convolution [46], which performs convolution after decomposing features into low- and high-frequency components, and applies the idea to synthetic-style DA. The Octave Mix algorithm for a mini-batch input is shown in Algorithm 1. First, a Low Pass Filter (LPF) and a High Pass Filter (HPF) are applied to each input to decompose it into the low-frequency component LPF(x) and the high-frequency component HPF(x). The low- and high-frequency components of the two inputs are then cross-combined in a randomized order. Finally, the resulting two composite waveforms are combined using a weighted sum with the coefficient λ as the weight.
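The decompose-and-swap procedure can be sketched as follows. The paper does not specify the filter design, so this sketch assumes an ideal (brick-wall) FFT filter with HPF(x) = x − LPF(x); the helper names `fft_split` and `octave_mix` are mine:

```python
import numpy as np

def fft_split(x, fc, fs):
    """Split x (winsize, channels) into low/high parts with an ideal FFT filter.
    fc: cutoff frequency [Hz]; fs: sampling frequency [Hz]. This filter choice
    is an assumption; any LPF/HPF pair with HPF(x) = x - LPF(x) would do."""
    X = np.fft.rfft(x, axis=0)
    freqs = np.fft.rfftfreq(x.shape[0], d=1.0 / fs)
    low_mask = (freqs <= fc)[:, None]                     # keep bins up to fc
    low = np.fft.irfft(X * low_mask, n=x.shape[0], axis=0)
    high = x - low                                        # HPF(x) = x - LPF(x)
    return low, high

def octave_mix(x1, y1, x2, y2, fc=2.1, fs=100.0, alpha=0.5, rng=None):
    """Cross-combine frequency components of two samples, then weight-sum."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # lambda ~ Beta(alpha, alpha)
    l1, h1 = fft_split(x1, fc, fs)
    l2, h2 = fft_split(x2, fc, fs)
    g1 = l1 + h2                          # low of x1 + high of x2
    g2 = l2 + h1                          # low of x2 + high of x1
    x = lam * g1 + (1.0 - lam) * g2       # weighted sum of composites
    y = lam * y1 + (1.0 - lam) * y2       # labels mixed as in mixup
    return x, y
```

With fc above the Nyquist frequency, LPF(x) = x and HPF(x) = 0, so g1 = x1 and g2 = x2 and the procedure degenerates to plain mixup, matching the paper's observation that Octave Mix includes mixup as a special case.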
The main idea of Octave Mix is to perform frequency decomposition before synthesizing x and, in addition, to synthesize y as in mixup. Fig. 1 shows an example of generating synthetic waveforms by Octave Mix using the data x1 = "walking" and x2 = "jogging". The waveform of x1, the walking data, has lower amplitude and lower frequency. When the LPF and HPF are applied to x1, the waveform is decomposed into a smooth walking waveform and a noise-like vibration. Similarly, for x2, the amplitude and frequency of the low-frequency component are slightly higher than those of x1, and the amplitude of the high-frequency component is generally larger. By cross-combining these two decompositions, one waveform adds jogging-like high-frequency vibration to the low-frequency component of walking, and the other adds walking-like high-frequency vibration to the low-frequency component of jogging. Finally, the two waveforms are combined by weighted averaging to produce a waveform according to the weight coefficient λ, as shown on the right of the figure.
Octave Mix has two hyperparameters. One is α, which is used to determine the weight value λ ∼ Beta(α, α); following mixup, the parameters of the beta distribution are unified as α. The other is the cutoff frequency fc of the LPF and HPF. Especially in HAR, the dominant observed frequency changes depending on the types of activities to be recognized. Therefore, it is desirable to adjust fc according to the target task.
Defining the two composite waveforms as g1 = LPF(x1) + HPF(x2) and g2 = LPF(x2) + HPF(x1), the Octave Mix synthesis can be written as the following equation (3):

x̃ = λg1 + (1 − λ)g2,  ỹ = λy1 + (1 − λ)y2.  (3)

Equation (3) reduces to the mixup equation (1) when g1 = x1 and g2 = x2. This means the Octave Mix process is identical to mixup when the cutoff frequency fc is made as large as possible, since then LPF(x) = x and HPF(x) = 0. Therefore, Octave Mix is an extension of mixup that includes mixup as a special case.

Ensemble augmentation model architecture
While using Octave Mix as a DA strategy, I propose an ensemble model architecture in which multiple types of feature representations are acquired through multiple types of DAs. The model architecture is illustrated in Fig. 2, in which the upper path is a general deep learning model. Here, DA1 and DA2 are different DA strategies, E1 and E2 are feature extractors, and C1, C2, and C are classifiers. Since the internal architecture of the feature extractors and classifiers is not restricted, a variety of architectures such as VGG [47] and ResNet [48] can be supported. For example, in the case of VGG, the convolution-pooling layers up to just before the flatten operation serve as the feature extractor, and the remaining fully-connected layers serve as the classifier. The proposed method is thus an ensemble in which two feature extractors are trained separately with two different DA strategies. Since this yields two predictions at the prediction phase, the outputs of the two feature extractors are combined into a single prediction using a separate classifier C (orange dashed line). I describe the training procedure for C in the next section.
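To make the data flow concrete, here is a toy numpy forward pass through the ensemble at prediction time. The linear "extractors" and single-layer classifier are hypothetical stand-ins for the trained networks E1, E2, and C in Fig. 2, not the VGG-based architecture used in the experiments:

```python
import numpy as np

# Toy stand-ins for the trained weights of E1, E2 (feature extractors) and
# the combined classifier C. Dimensions are illustrative only.
rng = np.random.default_rng(0)
W_E1 = rng.normal(size=(6, 16))   # E1: 6 input channels -> 16 features
W_E2 = rng.normal(size=(6, 16))   # E2: a second, independently trained extractor
W_C = rng.normal(size=(32, 4))    # C: maps concatenated features -> 4 classes

def extract(x, W):
    """Toy extractor: per-channel mean (like GlobalAveragePooling) + linear + ReLU."""
    return np.maximum(x.mean(axis=0) @ W, 0.0)

def ensemble_predict(x):
    """Run both frozen extractors, concatenate features, classify with C."""
    f1 = extract(x, W_E1)                   # features from the DA1-trained path
    f2 = extract(x, W_E2)                   # features from the DA2-trained path
    z = np.concatenate([f1, f2]) @ W_C      # combined classifier C
    e = np.exp(z - z.max())
    return e / e.sum()                      # softmax over activity classes
```

The point of the structure is that C sees both feature spaces at once, so it can weight whichever representation is more informative for a given input.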
The ensemble model is based on the idea that two different DA strategies contribute to the acquisition of different feature representations. Therefore, DA1 and DA2 must be different strategies. In this study, based on the experiments described below, I adopted Octave Mix overlaid with Rotation for DA1 and mixup overlaid with Rotation for DA2.

Data Augmentation Revisited for Fixed Feature Extractor (DAR-FFE)
He et al. [49] pointed out that performing a powerful data augmentation, such as mixup or AutoAugment, may emphasize the gap between the original data and the augmented data. To address this issue, they proposed DA Revisited, which trains the model for N epochs on the augmented data and then additionally trains it for M epochs on the clean data.
In this study, inspired by DA Revisited, I propose DA Revisited with Fixed Feature Extractor (DAR-FFE), in which E1, E2, C1, and C2 in Fig. 2 are trained using augmented data and only C is trained using the original data without DA.
Although it is not discussed in He et al.'s paper, additionally training the feature extractor for M epochs with clean data may cause the loss of the varied feature representations acquired through DA. Therefore, in DAR-FFE, the feature extractors (E1, E2) are trained using DA in the pre-training, and only the combined classifier C is trained using clean data in the additional training, with the weight-fixed feature extractors (Ê1, Ê2). Based on the above, the training procedure for the model ensembling K types of DAs is shown in Algorithm 2.
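The two-phase procedure can be sketched as a framework-agnostic training skeleton. Here `train_epoch` and `freeze` are hypothetical placeholders for framework-specific training and weight-freezing code, and the argument names are my own:

```python
def dar_ffe(extractors, classifiers, combined_C, das, data, N, M,
            train_epoch, freeze):
    """Sketch of DAR-FFE training (two phases).

    Phase 1: pre-train each (E_k, C_k) pair for N epochs on DA_k-augmented data.
    Phase 2: fix the extractor weights and train only the combined classifier C
             for M epochs on the original (clean) data.
    """
    # Phase 1: pre-training with augmentation
    for E_k, C_k, da_k in zip(extractors, classifiers, das):
        for _ in range(N):
            train_epoch([E_k, C_k], da_k(data))     # augmented mini-batches
    # Phase 2: freeze extractors, retrain only C without DA
    frozen = [freeze(E_k) for E_k in extractors]
    for _ in range(M):
        train_epoch([combined_C], data, features=frozen)
    return frozen, combined_C
```

The design point is that DA never touches phase 2: the frozen extractors keep the varied representations learned under augmentation, while C adapts to the clean-data distribution.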
Datasets

I conducted experiments using four benchmark datasets (HASC [50], PAMAP2 [13], UCI Smartphone [14], and UniMiB SHAR [19]) summarized in Table 1. All of these are benchmark datasets for HAR using sensor data.
HASC [50] is a benchmark dataset for basic HAR using smartphone sensors. It consists of accelerometer and gyroscope measurements labeled with six basic activities (staying, walking, jogging, skipping, going up stairs, and going down stairs). I extracted data with a sampling frequency of 100 Hz from the BasicActivity portion of the 2011-2013 corpus, and used only the raw accelerometer data. As a preprocessing step, I removed 5 seconds from the beginning and end of each measurement file, to eliminate the influence of handling the device at the start and end of measurement, and divided the data into time series with a frame size of 256 samples and a stride of 256 samples. As a result, the shape of the input data is: winsize=256, channels=3 (x, y, z). I did not use meta labels such as the measurement device information or the personal information of the subjects (e.g., gender, age, height, and weight). As the experimental dataset, I used the data of the 176 persons from whom at least one frame could be obtained after trimming.
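The trim-and-frame preprocessing above can be sketched as follows, assuming a 100 Hz recording stored as an (n_samples, channels) array (`frame_signal` is a hypothetical helper name, not from the paper):

```python
import numpy as np

def frame_signal(x, frame=256, stride=256, fs=100, trim_sec=5):
    """Trim trim_sec seconds from both ends, then slice into fixed windows.

    x: raw recording of shape (n_samples, channels)
    Returns an array of shape (n_windows, frame, channels).
    """
    t = trim_sec * fs
    x = x[t:len(x) - t]                       # drop 5 s at each end
    n = (len(x) - frame) // stride + 1        # number of full windows
    if n <= 0:
        return np.empty((0, frame, x.shape[1]))
    return np.stack([x[i * stride:i * stride + frame] for i in range(n)])
```

With frame == stride, windows do not overlap, which matches the 256/256 setting described above; a smaller stride would produce overlapping windows instead.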
PAMAP2 [13] is a benchmark dataset in which three wireless IMUs worn on the chest, wrist, and ankle recorded daily activities. It contains 8 subjects' sensor data labeled with 12 activities (other, lying, sitting, standing, walking, jogging, cycling, computer work, car driving, ascending stairs, descending stairs, ironing). I divided the data into time series with a frame size of 256 samples and a stride of 256 samples. From the above, the shape of the input data is: winsize=256, channels = 3 IMUs × 4 types of sensors × 3 (x, y, z) = 36.
UCI Smartphone [14] is a benchmark dataset using smartphone sensors for HAR. It contains 30 subjects' accelerometer and gyroscope data labeled with 6 activities (sitting, standing, laying, walking, going up stairs, going down stairs). This dataset has already been divided into windows of 128 samples each. From the above, the shape of the input data is: winsize=128, channels = 2 types of sensors × 3 (x, y, z) = 6.
UniMiB SHAR [19] is a benchmark dataset whose labels include fall activities (falling rightward, generic falling backward, hitting an obstacle in the fall, falling with protection strategies, falling backward-sitting-chair, falling leftward, syncope). This dataset has already been divided into windows of 151 samples each. From the above, the shape of the input data is: winsize=151, channels = 3 (x, y, z).

Model training
My proposed method can be applied to any deep learning model architecture. It divides a general CNN model into a feature extractor and a classifier, as illustrated in Fig. 2. In this study, I adopted the VGG architecture, which was validated in my previous study [51], and used the architecture illustrated in Fig. 3. My previous study adopted the original VGG architecture [47], but in this study, to reduce the effect of C, I changed the flatten operation to GlobalAveragePooling and reduced the fully-connected part of C to a single layer.
The model is trained with Adam [52] for 300 epochs with a learning rate η = 0.001. I confirmed that training converges within 300 epochs under all conditions. When using DA, I applied DA to the input data (X, Y) with a probability of 50% to obtain (X̃, Ỹ), and then combined (X, Y) and (X̃, Ỹ) as the input.

Metrics
As Gholamiangonabadi et al. [53] pointed out, sensor-based HAR should be evaluated with the dataset divided by subject. Assuming the real use case in which labeled sensor data of the prediction-target user cannot be obtained, I evaluated by subject-based hold-out validation, which divides the dataset into training, validation, and testing sets by subject. The number of subjects included in each dataset is shown in Table 1. Subjects were selected by random sampling. (1) Sampling the subjects, (2) dividing the dataset, and (3) training and evaluating each method are considered one trial, and the estimation accuracy of the methods is compared by the average over 10 trials. Since the number of data per label is biased in some datasets, I use the average f-score as an evaluation index in addition to accuracy.
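The subject-based hold-out split described above might look like the following sketch, assuming subjects are identified by hashable IDs (`split_by_subject` is a hypothetical helper name):

```python
import numpy as np

def split_by_subject(subject_ids, n_train, n_val, n_test, rng=None):
    """Subject-based hold-out: disjoint subject sets for train/val/test.

    subject_ids: iterable of per-sample subject identifiers.
    Returns three sets of subject IDs; sample-level masks can then be built
    by membership tests, so no subject appears in more than one split.
    """
    rng = rng or np.random.default_rng()
    ids = rng.permutation(np.unique(np.asarray(list(subject_ids))))
    assert n_train + n_val + n_test <= len(ids), "not enough subjects"
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:n_train + n_val + n_test]))
```

Repeating this sampling with a fresh random permutation per trial gives the 10-trial protocol described above.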

Parameter tuning of DA
The purpose of this study is to verify the effectiveness of synthetic-style DA methods and to develop new ones. I compared my proposed method with typical synthetic-style DA methods: mixup [5] and RICAP [6]. As a DA method based on simple geometric transformations, I applied Rotation, which was shown to be effective for HAR in the study by Um et al. [35]. Because the data are waveforms, I adapted RICAP to concatenate two waveforms front-to-back along the time axis. Similarly, mixup was applied as a weighted average of two waveforms.
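The waveform adaptation of RICAP might look like the sketch below. The paper does not state how the cut position is drawn, so sampling it from Beta(α, α), by analogy with the synthesis weight, is an assumption here, as is the name `ricap_1d`:

```python
import numpy as np

def ricap_1d(x1, y1, x2, y2, alpha=5.0, rng=None):
    """RICAP adapted to waveforms: concatenate a prefix of x1 and a suffix of
    x2 along the time axis; mix labels by the length ratio of each part."""
    rng = rng or np.random.default_rng()
    T = x1.shape[0]
    lam = rng.beta(alpha, alpha)               # assumed cut-ratio distribution
    cut = int(round(lam * T))                  # boundary position in samples
    x = np.concatenate([x1[:cut], x2[cut:]], axis=0)
    w = cut / T                                # label weight = share of x1
    y = w * y1 + (1.0 - w) * y2
    return x, y
```

Unlike mixup, the two signals are never superimposed; each time step comes from exactly one source waveform, and only the label is a weighted mixture.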
For subsequent comparisons, the HASC dataset was first used to tune the hyperparameters of each DA method: α, the parameter of the beta distribution used to determine the weights during synthesis, was tuned for mixup and RICAP, while α plus the cutoff frequency fc were the hyperparameters tuned for Octave Mix. The tuning was evaluated using the accuracy on the validation data. The effect of α was not very large, and fc did not make much difference as long as it was greater than 1. In subsequent experiments, I use mixup and RICAP with α set to 5.0, and Octave Mix with α = 0.5 and fc = 2.1.

Evaluation using four datasets

Table 3 shows the results of the evaluation of my proposed method on the four benchmark datasets. The results were evaluated on the test data using the hyperparameters determined in the previous section. The best results are marked with an underscore and boldface, and the second-best results are marked with an underscore.
My proposed method achieved the highest average f-score, improving over Rotation alone by 4.1% for HASC, 4.9% for PAMAP2, 2.9% for UCI Smartphone, and 7.9% for UniMiB SHAR. Although which synthetic-style DA was effective differed for each target task, ensembling with Octave Mix was found to improve the accuracy in all cases.

Ablation study

Table 4 shows the results of the ablation study using the HASC dataset. The best accuracy and average f-score were obtained by my proposed method (9). Comparing the effectiveness of each proposed component using (2) to (5), the effect of introducing Octave Mix was particularly significant, improving the f-score by 2.6% compared to Rotation alone (2). DAR-FFE also had a slight effect, improving the average f-score by 0.7%. On the other hand, (4) and (6), which were simple ensembles, showed a tendency to decrease the average f-score. Therefore, simple ensembling of DAs does not lead to the acquisition of feature representations with variation. Note that simple ensembling here denotes a method that trains (E1, E2, C) all together using two DA policies, in order to evaluate the effect of omitting DAR-FFE.

In (6) to (9), I consider the effect of removing one component at a time. In (7), the model was trained using a single DA (Rot.&OctMix) without ensembling, and then only the classifier was retrained on the original data. Despite the number of model parameters being the same as in the simple methods (1)-(3), this method improved the f-score by 3.2% over Rotation alone (2) and by 0.6% over Rot.&OctMix (3). This was equivalent to the accuracy of (8), where only the classifier was additionally trained after ensembling RICAP and mixup. The proposed method (9) further improved the accuracy by about 1%.

Fig. 4 shows the change in the average f-score when the number of subjects in the training data is varied. The proposed method works effectively regardless of the amount of training data. In addition, the difference between my method and Rot.&OctMix was relatively small when the number of subjects was small (fewer than 10 persons), and became more pronounced as the number of subjects increased. In other words, the effect of Octave Mix was large when the amount of data was small, and the effect of the ensemble became larger as the number of subjects increased.

Table 5 shows the experimental results of my proposed method with different DA combinations. My proposed method is the ensemble of Octave Mix and mixup, (b) in the table. Its estimation accuracy is higher than that of the other combinations (a) and (c). Since increasing the number of DA combinations might improve the accuracy further, I also evaluated a method combining three DA policies (d); however, there was almost no change in accuracy from (b). Therefore, combining two DA policies was sufficient for the DA variations adopted in this study. In the future, combining other DA methods that acquire different feature representations may improve the accuracy by increasing the number of combinations.

Conclusion
In this study, I proposed and validated Octave Mix, a new DA method for sensor-based HAR; a model architecture for ensembling DAs; and DAR-FFE, a method for additional training on the original data. Octave Mix combines multiple input data and improves the conventional mixup method by using frequency decomposition. The experiments confirmed that the three components of my proposed method (Octave Mix, the ensemble model, and DAR-FFE) can each improve the estimation accuracy of HAR. In the future, I expect these three components to be applied to various problems and problem settings in different fields.