Improving Cross-Subject Activity Recognition via Adversarial Learning



I. INTRODUCTION
Deep learning techniques such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have recently been applied to human activity recognition (HAR) using wearable sensors, and have been shown to outperform shallow learning techniques such as the Support Vector Machine (SVM). Examples, as listed in Table 1, include recognition of hand gestures (e.g. raise, lower) and body movements (e.g. walking, sitting) from readings of an inertial measurement unit (IMU). Among the activities that have not been well studied, those involving hand-object interaction are essential for implementing emerging augmented/mixed reality applications, such as cognitive assembly and maintenance assistance. In this work, we take the activities involved in the process of elevator panel maintenance as an example, and investigate the challenges and practical solutions of deep learning-based hand-object interaction recognition.
The associate editor coordinating the review of this manuscript and approving it for publication was Xian Sun.
One key challenge in applying deep learning to HAR is cross-subject performance degradation. As different subjects conduct the same activities in different ways, the gap in data distribution between the training and testing sets often causes significant performance degradation when the trained deep learning models are tested on subjects not included in the training set. Ideally, this issue could be addressed by having a training set composed of data recorded with tens, or possibly hundreds, of different subjects. However, data collection and labeling is a laborious and time-consuming task. As a matter of fact, existing open datasets, such as PAMAP2 [29], Opportunity [3] and Daphnet [1], typically contain no more than 10 subjects. Therefore, the question is how to improve cross-subject performance with limited subject variability in the training set. To the best of our knowledge, the works of Jiang et al. [10] and Khan et al. [14] are the only ones that propose to address this problem. However, as we will discuss in Section II, their solutions present limitations that we aim to overcome with our method: 1) the need for training with data from subjects pertaining to the test set and 2) the need to train a different model for every single test subject or group of test subjects.
Implementing HAR also involves choosing the sensor modalities, the sampling rate, and the number of subjects from whom to collect data. It might sound plausible that maximizing these three factors would also maximize the classification performance. However, above a certain point, further increasing them provides negligible or nonexistent performance gains. Instead, it may cause practical issues such as increased resource consumption and processing delay. Even though this is a key trade-off in implementing HAR systems, there is a lack of practical guidelines on the selection of a minimal sensible setting for the aforementioned factors, which constitutes another challenge to be addressed in HAR.
This paper aims to address both challenges. Our key contributions are summarized as follows.
1) We develop a novel deep learning solution for bridging the gap in performance across different subjects in HAR with wearable sensors. To achieve this goal, while training the activity classifier, our solution generates additional training data that mimic artificial subjects, with the purpose of increasing subject variability, and instructs the activity classifier to ignore subject-dependent information in the data. Our solution is versatile since it places no restriction on which activity classifier to use. Taking a CNN-LSTM baseline as the classifier and PAMAP2 as the dataset, our solution provides a gain of nearly 10% in cross-subject performance (in terms of mean F1-score) compared to the sole use of the CNN-LSTM baseline. Applied to the state-of-the-art InnoHAR [36] classifier, the gain in performance reaches almost 5%, also for the PAMAP2 dataset. These improvements correspond to a decreased need for variability of subject behavior in the training set, which translates into fewer subjects with whom to collect and label data.
2) We provide deep insights into the impact of different influencing factors on classification performance and summarize a practical guideline based on our findings from experiments.
Figure 1 illustrates the blocks that form the structure of this work, as well as their corresponding sections. The rest of this paper is organized as follows. Section II introduces the background. Section III presents the method proposed in this work. Section IV describes the datasets, with the experimental results presented in Section V. Section VI summarizes the practical guideline and further discusses our method and the remaining issues. Section VII presents the related work before we conclude this work in Section VIII.
VOLUME 8, 2020

II. BACKGROUND
A. HUMAN ACTIVITY RECOGNITION (HAR)
HAR refers to the class of methods used for automatically understanding what task humans are performing by analyzing video, readings of wearable sensors, or wireless signals reflected by the human body [35]. The algorithms for HAR can be classified into shallow and deep learning methods. Common shallow methods in HAR include SVM [13], [20], [23], k-nearest neighbors (kNN) [16], [24], linear discriminant analysis (LDA) [9], and random forest (RF) [21]. Deep learning approaches, such as LSTM [7], [15], CNN-LSTM [25], [27], CNN [22], and convLSTM [26], have shown impressive leaps in performance compared to their shallow counterparts by learning to automatically extract features from raw sensor data, thus removing the need for human experts to provide hand-engineered features. A summary of recent works is listed in Table 1. Regarding the activities to be recognized, our work also helps to address the scarce attention given so far to activities involving hand-object interactions.

B. DOMAIN SHIFT
In computer vision, one often faces the problem of performance degradation when the training and the test sets present differences in terms of illumination, pose and image quality [34]. Such differences in the underlying data distribution of the training set (i.e. source domain) and the test set (i.e. target domain) are named domain shift (or domain gap) and may bring huge discrepancies in performance when testing the model.
In HAR, when training a deep learning model on the labeled source domain data, since the distribution of the raw data depends on the subject, it is expected that the part of the network responsible for the feature extraction process outputs subject-dependent information (features) to the classification layers. That is, the extracted features depend on the behavioral style of the subjects of the training set. Hence, the domain shift problem is also present in HAR as a result of the difference between the behavioral styles of the subjects in the training set and those in the test set. There are a few factors that determine the behavioral style of a subject.
• Different subjects might perform the same activity in significantly different ways. The activity of walking, for instance, presents enough differences across subjects that it is possible to identify people by their gait.
• The level of dexterity and speed of performing the activities also differ from subject to subject. For instance, one can observe clear differences (e.g. in speed) in the behavior of a maintenance engineer when disassembling elevator buttons in comparison with a non-technician.
• In energy-demanding activities, different subjects may experience varying levels of tiredness that change, in distinctive ways, how they perform the activity.
• While performing the activities, the subjects can involuntarily shift the placement of the sensors.

C. DOMAIN ADAPTATION AND DATA AUGMENTATION
We envision that the cross-subject performance degradation can be minimized by reducing the domain shift through domain adaptation or by augmenting the subject variability in the training set through data augmentation. Domain adaptation (DA) techniques aim at reducing this performance degradation by bridging the domain gap. While there exist a handful of DA methods in the literature [4], the so-called adversarial DA methods (a subset of DA methods) have recently shown impressive results [42] and increasingly attracted the interest of many researchers [31]. In adversarial DA, one network (the discriminator) is trained to distinguish data between domains, while another network (the generator) learns to generate domain-indistinguishable data, thus confusing the discriminator. These two networks are pitted against each other, hence the term "adversarial". Following this concept, the generator could be the feature extraction layers of the activity classifier trying to learn subject-indistinguishable features, whereas the discriminator could be a network that tries to predict the subject given the extracted features. A limitation of DA methods is that they require the use of labeled or unlabeled target domain data during the training phase. In this work, we follow the concept of adversarial learning to address the cross-subject performance degradation; however, we drop the need for utilizing any data pertaining to the target domain during the training of the classifier.
Data augmentation techniques generate artificial data that are combined with the real data during the training of the classifier. Simple techniques include adding noise to the sensor readings or increasing/decreasing their magnitude. Adversarial learning has also shown impressive results in generating artificial data [12]. Again, two networks (a discriminator and a generator) are pitted against each other. The generator receives random noise as input and is required to learn how to transform such noise into an output that resembles the real data. The generator's output is fed into the discriminator, which is trained to distinguish between real and artificial data. This adversarial learning method does not guarantee that the training set, formed by artificial and real data, contains a higher subject variability, since the artificial data created by the generator should exhibit the same data distribution as the real training data. In our work, we utilize adversarial learning to generate artificial data. However, in our method, the artificial data are generated to present a different distribution from the real training data such that they mimic synthetic subjects.
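The two simple augmentation techniques mentioned above (adding noise and changing the magnitude of the readings) can be sketched as follows. This is a minimal illustration, not part of the proposed method: the function names, the noise level `sigma`, and the scaling factor are hypothetical choices made for the example.

```python
import numpy as np

def add_noise(window, sigma=0.05, rng=None):
    """Return a copy of `window` with zero-mean Gaussian noise added."""
    rng = np.random.default_rng(0) if rng is None else rng
    return window + rng.normal(0.0, sigma, size=window.shape)

def scale_magnitude(window, factor):
    """Return `window` with every sensor reading multiplied by `factor`."""
    return window * factor

# Toy window: 4 time steps, 3 sensor channels (hypothetical values).
window = np.arange(12, dtype=float).reshape(4, 3)
noisy = add_noise(window, sigma=0.05)
scaled = scale_magnitude(window, 1.1)
```

Both transformations keep the shape and the activity label of the window unchanged, but neither is designed to imitate a new subject.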

III. ADVERSARIAL LEARNING
We divide this section into two parts. First, we explain the architecture of the method that generates artificial data with rich forms of subject variability. Such artificial data are combined with the original training data and used to train the activity classifier, explained in the second part of this section.
Figure caption (fragment): During the artificial data generation, the architecture is fed with original data from which subject-dependent characteristics are extracted and altered, thus generating data from an artificial subject performing the same activity as in its original counterpart.

A. GENERATING ARTIFICIAL SUBJECT VARIABILITY
Denoting x as a sequence of time-series data, our goal is to find which activity (among a set of predefined activities) is performed in x. We start from the premise that x contains subject-dependent characteristics, i.e. information that can be used to classify which subject (among the set of subjects in the training data) generated x, and subject-independent characteristics. Also, let us presume that there exist functions E_SDC(·) and E_SIC(·) that can extract from x subject-dependent and subject-independent characteristics, respectively. Moreover, let M(E_SDC(x), E_SIC(x)) be a so-called merger function whose goal is to reconstruct the original data x from its constituents E_SDC(x) and E_SIC(x).
If we perturb E_SDC(x), we can theoretically create data that represent artificial subjects. We denote by A(x, η) (given in Eq. 1) the data created from x that represent an artificial subject, given a disturbance η applied to E_SDC(x).
In Eq. 1, ⊙ represents the Hadamard product (also known as the element-wise product) and η is an injected noise.
The functions E_SDC(·), E_SIC(·) and M(·) are characterized by neural networks. Given that the goal of M is to reconstruct the split data, we define the reconstruction loss function as in Eq. 2.
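The definitions above suggest the following forms for Eq. 1 and Eq. 2. This is a reconstruction from the surrounding text rather than a verbatim copy of the equations: the placement of η inside the merger follows the description of a disturbance applied to E_SDC(x), while the symbol L_RE and the squared L2 norm in the reconstruction loss are assumptions.

```latex
% Eq. 1 (reconstructed): data representing an artificial subject
A(x, \eta) = M\big(E_{SDC}(x) \odot \eta,\; E_{SIC}(x)\big)

% Eq. 2 (assumed form): reconstruction loss; the choice of norm is an assumption
L_{RE} = \big\lVert x - M\big(E_{SDC}(x),\, E_{SIC}(x)\big) \big\rVert_2^2
```

With η equal to a vector of ones, Eq. 1 reduces to the plain reconstruction M(E_SDC(x), E_SIC(x)), which is the quantity that the reconstruction loss compares against x.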
To split x into its two constituents, we utilize two discriminator networks denoted as D_SDC(·) and D_SIC(·). We require both discriminators to learn to predict, in a supervised way, the subject to whom the input is related and to achieve maximum certainty about the prediction. Hence, the classes predicted by the discriminators are subject IDs. However, unlike D_SDC(·), D_SIC(·) establishes a mini-max game with E_SIC(·). That is, E_SIC(·) tries to confuse D_SIC(·) by outputting information such that it is impossible for D_SIC(·) to predict which subject the information is related to, while D_SIC(·) does its best to learn to distinguish between subjects in its input. This mini-max game is implemented through adversarial training. Before detailing how the weights of the networks are learned with adversarial training, let us define the cross-entropy and entropy loss functions for subject classification, L_CES and L_H, respectively.
In Eq. 3, N is the number of subjects in the training data, and u_i and O_i(P(x)) are, respectively, the label and the probability prediction for subject i given by a function O(·) to a transformation P(x) of x.
In Eq. 4, H(·) is the Shannon entropy function. From these functions, the weights of E_SIC(·) and D_SIC(·) (Eq. 5 and Eq. 6) are learned in an adversarial approach as in the vanilla GANs [6], with the exception that the concept of source and target domains is not valid here. Instead, each subject represents a domain and E_SIC(·) learns to map different domains (subjects) into a common domain as in categorical GANs [30]. Note that the E_SIC(·) and D_SIC(·) networks represent, respectively, the generator and the discriminator in the common GAN scheme. In our notation, the weights of a network O are expressed as θ_O, and the asterisk in θ*_O denotes the optimal values for θ_O.
In Eq. 5 and Eq. 6, λ_RE, λ_HE, λ_CE and λ_HD are positive real-valued constants.
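To make the two subject-classification losses concrete, the sketch below computes a cross-entropy term and a Shannon entropy term for a single prediction over N = 4 subjects. It assumes the standard definitions of both quantities; the function names and the small constant `eps` are illustrative, and the way the terms are weighted and combined in Eq. 5 to Eq. 8 is not reproduced here.

```python
import numpy as np

def cross_entropy(probs, label_onehot, eps=1e-12):
    """Cross-entropy -sum_i u_i * log(O_i) for one sample."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(label_onehot, dtype=float)
    return float(-np.sum(labels * np.log(probs + eps)))

def shannon_entropy(probs, eps=1e-12):
    """Shannon entropy -sum_i O_i * log(O_i) of one prediction."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs + eps)))

# Predictions over N = 4 subjects (hypothetical values).
confident = np.array([1.0, 0.0, 0.0, 0.0])  # discriminator is certain
uniform = np.full(4, 0.25)                  # discriminator is fully confused

entropy_confident = shannon_entropy(confident)
entropy_uniform = shannon_entropy(uniform)
ce_uniform = cross_entropy(uniform, [1, 0, 0, 0])
```

A discriminator that achieves maximum certainty drives the entropy towards zero, whereas an extractor that successfully confuses the discriminator pushes the prediction towards the uniform distribution, whose entropy is log N.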
The weights of E_SDC(·) and D_SDC(·) are given similarly (Eq. 7 and Eq. 8); however, there is no mini-max game between these two networks, as indicated by the positive sign before L_AH in Eq. 7.
Finally, the optimal weights of the merger network (Eq. 9) are simply given as the result of the minimization of the reconstruction loss function. Figure 2 illustrates the scheme of the generation of artificial subject variability.
In the classification loss L_CL, K is the number of activity classes, y_i is the label for class i of the labeled sample x, and λ_A and λ_T are positive real-valued constants that weigh the importance of correctly classifying the original and the artificial data, respectively. Notice that the loss function includes both real data x and artificial data A(x, η).
The optimal weights of the classification layers (Eq. 11) can be readily defined as those which minimize L_CL. Since the feature extraction layers F(·) are required to learn subject-independent features, we employ a third discriminator D_SIF(·) whose goal is to play a mini-max game with the feature extraction layers, similar to the case of E_SIC(·) and D_SIC(·). Hence, the optimal weights of F(·) and D_SIF(·) are expressed in Eq. 12 and Eq. 13, respectively. Figure 3 illustrates the scheme involving the activity classifier.
In Eq. 12 and Eq. 13, λ_CL and λ_HF are positive real-valued constants. It should be noted that once the training has been completed, all networks, except for the feature extraction and classification layers, can be discarded, as they only serve the purpose of assisting the activity classifier in obtaining robustness against cross-subject performance degradation. Furthermore, we explicitly differentiate between the extracted subject-independent characteristics (the output of E_SIC(·)) and the subject-independent features (the output of F(·)). The reason for this is that the loss functions for learning E_SIC(·) and F(·) differ; hence, E_SIC(·) is not expected to equal F(·). The overall step-by-step algorithm is detailed in Algorithm 1. We used a fixed number of epochs as the convergence criterion for the training of the networks. To improve the stability of the adversarial training, we enforced Lipschitz continuity through spectral normalization [18] on all networks except for the activity classifier.
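Spectral normalization divides a weight matrix by its largest singular value so that the corresponding linear map has a Lipschitz constant of about one. The sketch below illustrates the idea with NumPy and a power iteration; it is not the TensorFlow code used in this work, and the function name, the iteration count, and the toy matrix are assumptions made for the example.

```python
import numpy as np

def spectral_normalize(weight, n_iters=200, rng=None):
    """Divide `weight` by a power-iteration estimate of its largest singular value."""
    rng = np.random.default_rng(0) if rng is None else rng
    u = rng.normal(size=weight.shape[0])
    for _ in range(n_iters):
        # Alternate between right and left singular vector estimates.
        v = weight.T @ u
        v /= np.linalg.norm(v)
        u = weight @ v
        u /= np.linalg.norm(u)
    sigma = u @ weight @ v  # estimate of the spectral norm
    return weight / sigma

# Toy weight matrix of a fully-connected layer (hypothetical size).
w = np.random.default_rng(1).normal(size=(8, 5))
w_sn = spectral_normalize(w)
spectral_norm_after = np.linalg.norm(w_sn, 2)
```

Practical implementations usually run a single power-iteration step per training update and keep the vector `u` between updates, which is much cheaper than the many iterations used here for clarity.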
Note that there isn't any restriction concerning the structure of the activity classifier. This is the versatility of our method. As a matter of fact, in Section V, we apply our method with two different activity classifiers: a CNN-LSTM baseline and InnoHAR [36].

IV. DATASETS
A. THE USED DATASETS
1) HARD
We collected three different datasets (namely HARD, HARD2 and HARD3) of hand activities reproducing an elevator maintenance process. We asked participants to perform 7 different activities: 1) press buttons, 2) unplug the elevator cables, 3) plug the elevator cables back in, 4) remove the panel's buttons, 5) insert the buttons on the panel, 6) use a screwdriver to loosen and tighten screws in the panel and 7) use a hammer with the purpose of only mimicking the movement of hitting an object. Moreover, the null class is also considered, resulting in 8 different classes. Each participant took roughly 15 minutes to perform all the requested activities. The first setup (i.e. HARD) uses flex sensors on all fingers of both hands, thumb pressure sensors on both hands, and accelerometers on the back of each hand. The data were recorded at 25 Hz with 19 different subjects. In the second setup (HARD2), gyroscopes on the back of each hand were used in addition to the sensors of the first setup; however, the data were now recorded at a sampling rate of 16.67 Hz.

2) PAMAP2
The PAMAP2 dataset [29] includes data collected from a heart monitor and three IMUs attached to the chest, hand, and ankle of the subject, respectively.There are in total 18 different physical activities performed by 9 different participants, as well as transient activities labeled as the null class.
Out of the 18 activities, 6 are rarely present in the data. To avoid having a heavily imbalanced dataset and following previous works [7], only the remaining 12 activities are considered in our experiments: lying quietly, sitting, standing, ironing, vacuum cleaning, ascending stairs, descending stairs, walking, Nordic walking, bicycling, running, and rope jumping. Furthermore, the sampling rate is reduced from 100 Hz to 33.3 Hz (sampling rates higher than 33.3 Hz do not show any improvement in the performance, but add further computational cost and memory footprint), and the missing values present in the raw data (reported as NaN values), as well as data originating from transient activities, are discarded.

3) OPPORTUNITY
This dataset was recorded from 4 participants with 23 body-worn sensors. It incorporates 18 domestic activities: cleaning a table, opening/closing the fridge, opening/closing the dishwasher, opening/closing 3 different drawers, opening/closing 2 different doors, toggling lights on and off, and drinking from a standing and sitting position. The sampling frequency was set to 30 Hz. When running the experiments, we considered sensory readings (as in [7]) from the upper limbs, the back, and both feet.

4) DAPHNET
The Daphnet [1] dataset uses three wearable accelerometers placed on the ankle, thigh, and trunk of Parkinson's disease patients to detect freezing of gait (FOG). FOG is a condition that causes sudden impediments to walking, elevating the risk of falls. The Daphnet data were recorded during various walking tasks of 10 different participants and have three different annotations: 1) transient activities (which are discarded here), 2) freezing of gait, and 3) normal movements. The data were recorded with a sampling rate of 64 Hz; however, we downsample them to 32 Hz by decimation and discard the transient activities, following [40].

B. DATA PRE-PROCESSING
As a pre-processing step, all the data are normalized to zero mean and unit variance. We choose a sliding window of approximately 2.5 seconds for the HARD, HARD2 and HARD3 datasets, with 50% overlap. Following other works [8], [26], [40], the PAMAP2 and Daphnet datasets have window sizes of approximately 5.12 seconds and one second, with 78% and 50% overlap, respectively. Following [8], the sliding window size for the Opportunity dataset was set to 1 second with 50% overlap. The label for each sliding window corresponds to the activity whose duration occupies the largest percentage of the window.
TABLE 2. Network architectures. The Rectified Linear Unit (ReLU) was used as the activation function after the CNN layers and the first fully-connected (FC) layer in the D_SIC, D_SDC, D_SIF, and C networks. In these networks, the second FC layer contains the same number of neurons as the number of classes and is followed by the softmax function. The convolutional kernel for all CNN layers was set to 3 × 3, whereas the max-pooling kernel size was 2 × 2. The numbers of filters in the three CNN layers of the F, E_SIC, and E_SDC networks are, respectively, 8, 16, and 32. In the three CNN layers of D_SIC, D_SDC, and D_SIF, the numbers of filters are 8, 4, and 2, respectively. The transposed CNN layers of the M network have 16, 8 and 1 filters.
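The pre-processing steps of this section (normalization, overlapping sliding windows, and majority labeling) can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the window length is given in samples rather than seconds, and ties between labels are broken in favor of the smallest label value, a rule the text does not specify.

```python
import numpy as np

def standardize(data):
    """Normalize each channel to zero mean and unit variance."""
    return (data - data.mean(axis=0)) / data.std(axis=0)

def sliding_windows(data, labels, window_len, overlap):
    """Split `data` (time, channels) into overlapping windows.

    Each window receives the label that covers most of its samples.
    """
    step = max(1, int(round(window_len * (1.0 - overlap))))
    windows, window_labels = [], []
    for start in range(0, len(data) - window_len + 1, step):
        stop = start + window_len
        windows.append(data[start:stop])
        values, counts = np.unique(labels[start:stop], return_counts=True)
        # Assumption: on a tie, the smallest label value wins.
        window_labels.append(values[np.argmax(counts)])
    return np.stack(windows), np.array(window_labels)

# Toy recording: 10 samples, 2 channels, label switching from 0 to 1.
data = standardize(np.arange(20, dtype=float).reshape(10, 2))
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
x, y = sliding_windows(data, labels, window_len=4, overlap=0.5)
```

In this toy case the four windows start at samples 0, 2, 4 and 6, and only the last one is dominated by label 1.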

V. EVALUATION
We implemented the workflow illustrated in Figure 1, and present the experimental setup and results of each step in this section. The experiments consist of two parts. The first part focuses on the evaluation of classification performance and its influencing factors without applying our proposed method, i.e. without using artificially generated training data and without requiring the feature extraction layers to learn subject-independent features. The second part applies our novel method and evaluates its effectiveness in improving cross-subject performance. In both parts, we choose the mean (over all classes) F1-score as the performance metric and calculate it following Eq. 14. In cases of imbalanced class distribution, a common case in HAR, the mean F1-score can prove more meaningful than the accuracy metric [26].
In Eq. 14, TP_i, FP_i and FN_i represent the number of true positives, false positives and false negatives of a class i, respectively. The number of classes is given by K.
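As an illustration of the metric, the sketch below computes an unweighted mean F1-score from per-class counts of true positives, false positives and false negatives. It assumes the standard per-class definition 2·TP_i / (2·TP_i + FP_i + FN_i) averaged over the K classes, which may differ in detail from Eq. 14; the function name and the toy labels are hypothetical.

```python
import numpy as np

def mean_f1(y_true, y_pred, num_classes):
    """Unweighted mean over classes of 2*TP / (2*TP + FP + FN)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # Assumption: a class absent from labels and predictions scores 0.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Toy example with K = 3 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = mean_f1(y_true, y_pred, num_classes=3)
```

For these toy labels the per-class scores are 0.5, 0.8 and 2/3, so every class contributes equally to the mean regardless of how many samples it has.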

A. EXPERIMENTAL SETUP
The network architectures are described in Table 2. Note that the activity classifier (composed of the feature extraction layers F and the classification layers C) follows a CNN-LSTM architecture. This choice was influenced by its superior performance compared to other basic networks [27].
The hyper-parameters were chosen by trial and error instead of using automatic hyper-parameter tuning methods such as grid or random search, for the following reasons. Firstly, hyper-parameter search requires heavy computation. Given our limited computational resources, we needed to keep the number of search runs during model training low. Secondly, since the methods proposed here can easily lead to an imbalanced competition between the networks trained in an adversarial way, we need a human in the loop to understand the effects of each hyper-parameter and propose meaningful values for them. In Section VI, based on our experience in fine-tuning by trial and error, we provide a brief guideline on how to choose sensible values for the hyper-parameters.
All the aforementioned networks were coded in Python 3.7.4 using the TensorFlow 2.0 framework. We used an NVIDIA Tesla V100 to run the code. To guarantee the reproducibility of results, the initial seed for all random operations was chosen to be zero. We used Adam as the optimization algorithm with β_1 = 0.9 and β_2 = 0.999. As the convergence criterion, we used a fixed number of epochs: 50 epochs and 150 epochs for the experiments of Section V-B and Section V-C, respectively.

B. CLASSIFICATION WITHOUT ADVERSARIAL LEARNING
To evaluate the impact of different factors on the classification performance, we compare the performance across different combinations of sensor modalities, sampling rates and numbers of subjects in the training set. The CNN-LSTM architecture used for the tests in this section is formed by the F and C networks shown in Table 2.

1) SELECTION OF SENSOR MODALITIES
HARD and HARD2 were collected using the smart gloves equipped with flex sensors, accelerometer, and gyroscope.For comparison, we tested the data collected with 7 different configurations: 1) flex sensors only, 2) accelerometer only, 3) gyroscope only, 4) flex sensors and accelerometer, 5) flex sensors and gyroscope, 6) accelerometer and gyroscope, and 7) flex sensors, accelerometer and gyroscope.
The effect of each sensor is measured in both cross-subject and same-subject scenarios. In the cross-subject scenario, one random subject was chosen to compose the validation set and another one for the test set. The data from the remaining subjects formed the training set, that is, 17 and 6 subjects, respectively, for the training sets of HARD and HARD2. In the same-subject scenario, the entire dataset was randomly divided into training (60%), validation (20%) and test (20%) sets. For both scenarios and for each different configuration of sensor modality, we performed six different experiments with varying subjects in the training, validation and test sets. The results are averaged over these six runs.
In the cross-subject scenario, as shown in Figure 5, the accelerometer readings are more informative than those of the flex sensors or the gyroscope. The significantly lower performance of the flex sensors can be attributed to the fact that there exists a huge variability in the ways the subjects move their fingers to perform a certain activity. Also, we noticed during the data collection sessions that the flex sensors inside the gloves can slide along the fingers, thus constantly changing their position. However, in the same-subject scenario, the flex sensors can be as informative as the accelerometer.
A comparison of Figure 5a and Figure 5b indicates a trade-off between the cross-subject and same-subject performance. In the cross-subject scenario, with more subjects in the training set, the classification model generalizes better and maintains high performance across subjects. However, in the same-subject scenario, training on a large number of subjects harms the performance. As the number of subjects in the training set increases, the extracted features of the deep learning model become more subject-independent. This is desired when our goal is to have a model that generalizes better when fed with data from a new subject. However, the loss of subject-specific features makes it more difficult for the classification layers to make correct predictions on unseen data of the subjects present in the training set.

2) THE EFFECT OF THE NUMBER OF SUBJECTS IN TRAINING SET
Using the CNN-LSTM classifier architecture, we varied the number of subjects in the training set of the HARD, HARD2, PAMAP2 and Daphnet datasets, while keeping one subject in the validation set and a different one in the test set.For each dataset and for each number of subjects in the training set, we performed 5 runs.Therefore, 5 different subjects were present in the validation and test sets considering all the runs.In each run, the training, validation and test sets were randomly generated.Figure 6 shows the evolution of the mean F1-scores as the number of subjects in the training set grows.
As we increase the number of activities, the addition of a subject to the training set is likely to have a larger impact on the performance. This can be explained as follows. The more activities we desire to classify, the higher the chances of having activities that can be performed in rather different ways by different subjects. Therefore, to learn features helpful in classifying activities irrespective of the subject, the deep learning classifier needs to be trained on subject-rich data. As an example, when we vary the number of subjects from 1 to 5 in the training set, for each additional subject included, the Daphnet dataset (only 2 activities included) reports an average F1-score increase of 1.3%, whereas PAMAP2 (including 12 activities) shows a growth of 6.1%. In all cases, we observe an ever slower growth in performance, i.e. saturation, as the number of subjects is increased.

3) THE EFFECT OF THE SAMPLING RATE
To evaluate the effect of the sampling rate of sensor readings on the classification performance, we downsampled the data from PAMAP2 and HARD3 (both recorded originally at 100 Hz) to various sampling rates while maintaining the same window size (5.12 and 2.5 seconds, respectively, for the PAMAP2 and HARD3 datasets) and overlap percentage (78% and 50%, respectively, for the PAMAP2 and HARD3 datasets). Subjects 5 and 6 form the validation and test sets, respectively, for the PAMAP2 dataset. This is a common choice of subjects in the literature [7], [8], [26], [37], [40]. Decimation was used as the downsampling method. The CNN-LSTM classifier is also used here.
From Figure 7, in the range of 100-10 Hz, we only observe very small and random variations in the performance of the classifier. This indicates that, for the considered activities, it is unnecessary to sample data at rates above 10 Hz (when the maximum sampling rate available is 100 Hz). While there isn't any appreciable performance variation in the 100-10 Hz range, the elapsed time for performing a forward and a backward pass on the network is, respectively, roughly 7x and 3.5x longer at 100 Hz than at 10 Hz. It is also highly doubtful that, for these activities, a sampling rate in the range of 100-1000 Hz would provide any benefit in terms of prediction performance.
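Downsampling by decimation can be sketched as keeping every q-th sample of the recording. The example below is a minimal illustration that applies no anti-aliasing filter, which is an assumption, since the text does not state whether one was used; library routines such as `scipy.signal.decimate` filter the signal before discarding samples. The function name and the toy signal are hypothetical.

```python
import numpy as np

def decimate(data, factor):
    """Keep every `factor`-th sample along the time axis (axis 0)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return data[::factor]

# One second of a toy single-channel signal sampled at 100 Hz.
signal_100hz = np.arange(100, dtype=float).reshape(100, 1)
signal_10hz = decimate(signal_100hz, factor=10)
```

Because the window duration is kept fixed, a factor of 10 also shrinks each window to a tenth of its samples, which is consistent with the shorter forward and backward passes reported at 10 Hz.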

C. CROSS-SUBJECT PERFORMANCE WITH ADVERSARIAL LEARNING
In the evaluation of our method for cross-subject performance improvement, we utilized four datasets: Opportunity, HARD, HARD2, and PAMAP2. The Daphnet dataset was discarded. As discussed earlier, the Daphnet dataset includes only 2 very simple activities and, according to the experiments of Section V-B2, it did not exhibit appreciable cross-subject performance degradation. In addition to having a CNN-LSTM network as the activity classifier, we also performed tests with InnoHAR [36], composed of inception layers followed by GRU layers, as the activity classifier, since it has exhibited state-of-the-art performance in HAR. The goal is to compare the performance of each of these two activity classifiers with and without our adversarial learning method.
Denoting n as the number of subjects in a particular dataset, we performed n different experiments for the dataset. Each experiment contains a different subject in the test set. The same is valid for the validation set. Therefore, the training set is always composed of n − 2 subjects. Table 3 lists the number of subjects for each dataset. The performance of all experiments for a particular dataset is then averaged. Note that our method does not utilize any data from the validation or test set for training. Table 4 presents the results of all the experiments for all the considered datasets. The performances of DeepConvLSTM [26] and the LSTM with Uniqueness Attention [41] are also reported. However, these classifiers have not been used with our method since overall they do not perform as well as InnoHAR, which is already being combined with our method.
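The evaluation protocol above can be sketched as a rotation over the subjects. The text only states that each experiment uses a different test subject and a different validation subject; pairing each test subject with the next subject in the list as validation subject is an assumption made for this example, as is the function name.

```python
def cross_subject_splits(subject_ids):
    """Yield (train, validation, test) subject splits, one per test subject."""
    n = len(subject_ids)
    for i, test_subject in enumerate(subject_ids):
        # Assumption: the next subject in the list serves as validation subject.
        val_subject = subject_ids[(i + 1) % n]
        train = [s for s in subject_ids if s not in (test_subject, val_subject)]
        yield train, val_subject, test_subject

# Toy dataset with n = 4 subjects.
splits = list(cross_subject_splits([1, 2, 3, 4]))
```

Each of the n splits trains on n − 2 subjects, and the reported score for a dataset would be the average of the test results over all splits.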
We were able to obtain an average improvement of 3.41% when utilizing our method combined with either the CNN-LSTM or the InnoHAR classifier. We estimate that such an improvement may be equivalent to adding 2-6 subjects to the training set. We emphasize that this performance gain was achieved without utilizing any data from subjects belonging to the test set. Domain adaptation techniques may achieve higher improvements in terms of performance. However, they require unlabeled or partially labeled data from subjects of the test set, which implies an additional burden in collecting and partially labeling data. Quantifying the difference in improvement between our method and domain adaptation methods is a topic for future research.
Our CNN-LSTM classifier has 2 orders of magnitude fewer parameters and is 1 order of magnitude computationally lighter than InnoHAR. Therefore, our method is able to provide performance gains comparable to those of adopting a more complex network architecture. As an example, for the PAMAP2 case, our CNN-LSTM classifier used with our adversarial learning method achieved significantly higher performance than InnoHAR.
We hypothesize that 4 different factors can determine the performance gain for a certain dataset:
1) The nature of the activities in the dataset. Different activities present distinct levels of variability across subjects. In general, activities of higher complexity (e.g. preparing a sandwich) allow for a greater level of variation across subjects than simpler activities such as pressing a button. The PAMAP2 dataset includes activities of higher complexity compared to the other datasets used in this work. We believe this is the main reason for the significantly higher performance gain observed with respect to the other datasets.
2) The number of sensors. Decoupling subject-dependent and subject-independent characteristics becomes harder when the number of sensors increases, since the data becomes more complex to learn from.
3) The amount of data per subject. Higher amounts of data per subject help in the learning process of subject-dependent and subject-independent characteristics.
4) The number of subjects in the training set. The aforementioned learning process is negatively affected when subjects are scarce. On the other hand, the purpose of our method is to help the activity classifier learn subject-independent features without resorting to an exceedingly high number of subjects. For the HARD dataset (with 19 subjects), the performance gain is slightly smaller than for the HARD2 dataset (with 8 subjects), even though both datasets have the same activities and a similar number of sensors and amount of data per subject.
With respect to the classifier, the CNN-LSTM classifier showed a slightly higher performance gain (4.34%) compared to the InnoHAR classifier (2.48%). We speculate that the higher dimensionality of the features in the activity classifier leads to a harder adversarial learning process between the feature extraction layers of the classifier and the subject-independent features discriminator. In computer vision, this is analogous to the limitation of GANs in generating high-resolution images, which is a well-known open problem [11].

VI. DISCUSSION
Based on the experiments carried out in this work, we summarize a practical guideline for sensor-based HAR. We also discuss our adversarial learning method, along with its limitations and possible areas for future work.

A. PRACTICAL GUIDELINE
We have seen that, among flex sensors, gyroscopes, and accelerometers, accelerometers are the most recommended for implementing sensor-based HAR, since they provide more helpful and less subject-dependent information, which led to considerably higher classification scores in cross-subject scenarios. Gyroscopes, when used with accelerometers, can lead to significantly better results in both cross-subject and same-subject cases; however, using gyroscopes by themselves is not recommended. Flex sensors are not appropriate for cross-subject scenarios, as they produce highly subject-variant data. We only advise using flex sensors in combination with accelerometers and gyroscopes in a same-subject case.
We have seen that the number of subjects to include in a training dataset depends on how much the activities we want to classify can differ from one subject to another. We recommend using approximately 5 different subjects to compose the training dataset in sensor-based HAR, as we have seen empirically that this number balances the time-consuming task of data collection and labeling against the performance scores in cross-subject scenarios.
Finally, when the available sampling rate does not exceed 100 Hz, a choice of 15 Hz keeps both data transmission and processing times at values more suitable for real-time implementation of HAR, without any degradation in classification performance. We have not studied the effects of sampling rates above 100 Hz. It is possible that, for instance, when one wishes to recognize activities directly related to the use of highly vibrating machines (e.g. a hairdryer or an electric screwdriver), sampling at less than 100 Hz may not be enough to correctly distinguish between activities. On the other hand, the trade-off between transmission delays and sampling rate also needs to be taken into account in the case of real-time HAR. Future work could address the inclusion of other sensor modalities such as sEMG, as well as the effect of the sampling rate on activities related to machine operation.

B. THE ADVERSARIAL LEARNING METHOD FOR HAR
Regardless of which sensor modalities are present in the data, our adversarial learning method was able to provide performance improvements in all cases, especially for the PAMAP2 dataset. One epoch of training using our method takes approximately twice as long as one epoch of training the activity classifier alone. For PAMAP2, training with our method for 150 epochs on an NVIDIA Tesla V100 lasted approximately 4.2 hours. Training only the activity classifier for 50 epochs lasted roughly 42 minutes on the same GPU. This is acceptable since the duration of the training represents only a small fraction of the total time taken to collect, label, and prepare the data for training. Most importantly, the inference time is never affected, since all neural networks except the activity classifier (networks F and C) are discarded. Also, there is no reason to suspect that the practical guideline detailed previously does not hold when applying our method.
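The training times reported above can be checked with a short calculation. The snippet below only restates the figures given in the text (4.2 hours for 150 epochs, 42 minutes for 50 epochs); the variable and function names are illustrative.

```python
def minutes_per_epoch(total_minutes, epochs):
    """Average wall-clock minutes spent on one training epoch."""
    return total_minutes / epochs


# Figures reported for PAMAP2 on an NVIDIA Tesla V100.
full_method = minutes_per_epoch(4.2 * 60, 150)  # adversarial method, 150 epochs
classifier_only = minutes_per_epoch(42, 50)     # activity classifier alone, 50 epochs

# Cost of one epoch with the method relative to the classifier alone.
per_epoch_ratio = full_method / classifier_only

# Cost of the complete training run relative to the classifier alone.
total_ratio = (4.2 * 60) / 42
```

The per-epoch ratio comes out at about 2, matching the statement that one epoch takes roughly twice as long, while the complete run is about 6 times longer because it also uses three times as many epochs.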
Our method utilizes 8 networks and 8 real-valued constants. To reduce the time-consuming and resource-consuming effort associated with the search for optimal hyper-parameters, we have compiled a guide (Table 5) based on all our experiments. It should be noted that the hyper-parameters that led to the best validation performance on one dataset might not suit a different dataset. Therefore, even though this guide is based on experiments with diverse datasets, its only purpose is to serve as a starting point.
Our fine-tuning by trial and error followed the principle of maintaining a balanced adversarial competition between networks, and we always used the performance on the validation set to compare choices of hyper-parameters. We observed that setting the learning rates of the networks D_SIC and D_SIF to lower values than those of the F and E_SIC networks resulted in better performance. This is because the first group of networks has a simpler task than the latter group. Starting with a low value for the magnitude of the noise and gradually increasing it, until a performance drop became evident, also proved to be a good practice. We also noticed that assigning slightly lower values to λ_CE and λ_HD than to λ_HE and λ_HF produced better results. However, we are unsure about the reasons for this. These relations between the mentioned parameters proved consistent across the datasets used. Concerning the parameters λ_RE and λ_A, we did not observe consistent relations. Nevertheless, we were able to determine an appropriate interval for each of them.
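The practice of gradually increasing the noise magnitude until the validation performance drops can be sketched as a simple search loop. This is an illustration under stated assumptions, not the procedure used in the experiments: `evaluate` is a hypothetical callable that trains with a given noise magnitude and returns the validation score, and the starting value, growth factor, and drop tolerance are arbitrary placeholders.

```python
def search_noise_magnitude(evaluate, start=0.01, factor=2.0,
                           tolerance=0.005, max_steps=10):
    """Increase the noise magnitude until the validation score clearly drops.

    evaluate:  hypothetical callable, noise magnitude -> validation score.
    tolerance: how far below the best score counts as an evident drop
               (an assumed threshold, not a value from the paper).
    Returns the best magnitude found and its validation score.
    """
    best_mag, best_score = start, evaluate(start)
    mag = start
    for _ in range(max_steps):
        mag *= factor
        score = evaluate(mag)
        if score < best_score - tolerance:
            break  # evident performance drop: stop increasing the noise
        if score > best_score:
            best_mag, best_score = mag, score
    return best_mag, best_score
```

In practice each call to `evaluate` is a full training run, so a coarse growth factor keeps the number of runs small; the result is a starting point to refine, in the same spirit as Table 5.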
As a way to generate artificial subject variability, we have injected noise into the subject-dependent characteristics before the reconstruction. It is reasonable to conjecture that, in some cases, this can result in synthetic data with unrealistic subject variability. As future work, we would like to unravel, at least to some extent, the black-box nature of this process in order to obtain artificial data that more faithfully represents reality. Furthermore, the performance obtained on the test set is sensitive to the hyper-parameters used during the training phase. An automatic search for sensible hyper-parameters is therefore another topic for future research.
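The noise-injection step described above can be sketched schematically. The sketch assumes additive zero-mean Gaussian noise and represents the two extractors and the merging network as plain callables (`encode_si`, `encode_sd`, `merge`); these names and the noise distribution are assumptions for illustration and are not specified in this section.

```python
import random


def generate_artificial_sample(x, encode_si, encode_sd, merge, noise_scale, rng):
    """Perturb the subject-dependent characteristics of x, then reconstruct.

    encode_si, encode_sd, merge: hypothetical stand-ins for the networks that
    extract subject-independent and subject-dependent characteristics and
    merge them back into a data sample.
    noise_scale: standard deviation of the (assumed) Gaussian noise.
    """
    z_si = encode_si(x)  # left untouched, so the activity is preserved
    z_sd = encode_sd(x)
    z_sd_noisy = [v + rng.gauss(0.0, noise_scale) for v in z_sd]
    return merge(z_si, z_sd_noisy)


# Toy stand-ins: the first half of the vector plays the subject-independent
# part and the second half the subject-dependent part.
toy_si = lambda x: x[:2]
toy_sd = lambda x: x[2:]
toy_merge = lambda a, b: a + b

sample = [1.0, 2.0, 3.0, 4.0]
artificial = generate_artificial_sample(
    sample, toy_si, toy_sd, toy_merge, noise_scale=0.5, rng=random.Random(0)
)
```

With the toy stand-ins, only the last two entries of `artificial` differ from `sample`. The concern raised above corresponds to choosing `noise_scale` so large that the perturbed characteristics no longer resemble any real subject.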

VII. RELATED WORK

A. DOMAIN ADAPTATION
In [2], the authors used shallow DA approaches to bridge the domain gaps across people's age, sensor placement, and the environment in HAR. In some cases, they achieved a significant increase in performance ranging between 8% and 12%. In other cases, however, the DA approaches reduced the performance. Their experiments were conducted with public datasets such as PAMAP2.
In the work of Wang et al. [33], the authors developed a DA method for different scenarios: adaptation between similar body parts on the same person, different body parts on the same person, and similar body parts on different people. Their method was evaluated with public datasets, such as PAMAP2 and OPPORTUNITY, against six common alternatives and performed better on average.
To address the scarcity of labeled data in a certain dataset, Ye [38] proposed a method that leverages labeled data from different domains (in this case, datasets), providing a significant improvement in the performance of activity recognition models even when only a small fraction of annotated data of the target domain is available.
In [14], Khan et al. performed DA in HAR in cross-device (smartphone to smartwatch and vice versa) and cross-subject scenarios. Their method, named HDCNN, consists of first training a deep learning model on the labeled source domain data and then adapting it to the unlabeled target domain. In the adaptation, the authors proposed to minimize the Kullback-Leibler divergence between the weights of the source domain model and those of the target domain model.
In [10], the authors proposed a device-free HAR system that uses adversarial DA to bridge the gap between different domains, each of which represents different physical environments and different groups of subjects. Their solution consists of feature extraction layers that are trained to output environment-independent and subject-independent information.
The aforementioned works present two unaddressed issues: 1) during training, they need to utilize data from subjects on which the HAR algorithm will be tested (i.e. target domain data), which results in an additional burden even if their methods do not require such data to be labeled; and 2) a different model must be trained for every test subject or group of test subjects. In [10], even though the feature extraction layers are trained to remove subject-specific information, in practice they can only remove subject-specific characteristics seen in the training data. Therefore, this approach does not completely solve the problem and still leaves room for improvement.
In our work, we design a HAR scheme that 1) removes the need for utilizing any data from the target domain during training and 2) aims at training the activity classifier to ignore not only the subject variability present in the training data but also artificially generated subject variability that is never seen in the training data.

B. DATA AUGMENTATION
Wang et al. [32] used vanilla GANs to artificially generate data as a data augmentation framework for HAR. Simple vanilla GANs have difficulty learning to generate data with rich subject variability, since the training is performed in such a way that the artificial data exhibit the same data distribution as the real data from the training set. Therefore, we do not find the data generation method of [32] appropriate for creating artificial subject variability.
Erol et al. [5] also utilized GANs to generate synthetic data. However, instead of the vanilla GAN approach, they conditioned the generator on class labels and trained the discriminator to predict the class of the synthetic data given by the generator. With their own dataset, the authors achieved an improvement of approximately 3% when training the activity classifier with both real and synthetic data. Their approach is not compared with the one by Wang et al. [32]. For the same reason as the previously mentioned work, this approach cannot be used to generate data from synthetic subjects.
Rashid and Louis [28] proposed four distinct data augmentation methods for time-series data: scaling, rotation, time-warping, and jittering. These methods are limited to IMU sensors. In scaling, the magnitude of the raw data is changed while preserving the label. Rotation applies artificial changes to the data that mimic different orientations of the sensors, considering that the labels should be invariant to such transformations. Time-warping alters how fast or slow an activity is performed. Finally, jittering simulates random additive sensor noise to increase the robustness of the classifier to small variations. Applied to their own dataset, the authors were able to achieve an accuracy improvement of at least 10%. While these techniques may help in reducing cross-subject performance degradation, they were not designed for that purpose. The authors did not claim that their methods address cross-subject performance degradation, nor did they perform experiments to evaluate their potential in addressing this issue.
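The four transformations described above can be illustrated with a minimal sketch on a 3-axis signal stored as a list of [x, y, z] samples. This is not the implementation of [28]: the function names are hypothetical, rotation is restricted to the z axis, time-warping is simplified to a uniform speed change with linear interpolation, and the jitter is assumed to be Gaussian.

```python
import math
import random


def scale(signal, factor):
    """Scaling: change the magnitude of every reading; the label is kept."""
    return [[v * factor for v in sample] for sample in signal]


def rotate_z(signal, angle_rad):
    """Rotation: mimic a different sensor orientation (about the z axis only)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c * x - s * y, s * x + c * y, z] for x, y, z in signal]


def time_warp(signal, speed):
    """Time-warping: resample so the activity looks faster (speed > 1) or slower.

    Simplification: a uniform speed change via linear interpolation.
    Requires at least two samples.
    """
    n_out = max(2, round(len(signal) / speed))
    out = []
    for i in range(n_out):
        pos = i * (len(signal) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(signal[lo], signal[hi])])
    return out


def jitter(signal, sigma, rng):
    """Jittering: add random noise (assumed Gaussian) to every reading."""
    return [[v + rng.gauss(0.0, sigma) for v in sample] for sample in signal]
```

All four functions operate on the signal alone and take no subject information as input, which is consistent with the observation that these techniques were not designed around subject variability.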
To the best of our knowledge, our work is the first 1) to utilize data augmentation to explicitly increase subject variability in the training data and 2) to perform experiments to evaluate this data augmentation scheme in addressing cross-subject performance degradation.

VIII. CONCLUSIONS
In this paper, we have drawn attention to an understudied yet crucial challenge in HAR: cross-subject performance degradation. We have then proposed a novel method for addressing this adverse variance of classification performance seen across different subjects in HAR. Through various experiments, we have demonstrated its potential to provide appreciable performance gains that reduce the need for the larger data collection and annotation procedures with many subjects that are common in HAR. With additional experiments related to sensor modalities, sampling rates, and the number of subjects in the training set, we have proposed a practical guideline for implementing more efficient and better-performing sensor-based HAR solutions.

FIGURE 2. The architecture of the subject variability-oriented data augmentation.During training, this architecture is required to extract subject-dependent and subject-independent characteristics from the original data and merge these characteristics back together to reconstruct the data.During the artificial data generation, the architecture is fed with original data from which subject-dependent characteristics are extracted and altered, thus generating data from an artificial subject performing the same activity as in its original counterpart.

B. THE CLASSIFIER
We denote by F(•) and C(•) the feature extraction and the classification layers, respectively, of the activity classifier. First, let us define the cross-entropy function of the activity classification, L_CL.

Algorithm 1 Training of Our Method
Load training data D_TRAIN
Create networks F, C, E_SIC, E_SDC, D_SIC, D_SDC, M, and D_SIF
Define λ_T, λ_A, λ_CL, λ_CE, λ_HD, λ_HF, λ_HE, and λ_RE
while not converged do
    for all x_train in D_TRAIN do
        Compute gradients of L_RECONS, L_H(D_SIC, E_SIC), L_CES(D_SIC, E_SIC), L_H(D_SDC, E_SDC), L_CES(D_SDC, E_SDC)
        Perform optimization step on θ_{E_SIC}, θ_{D_SIC}, θ_{E_SDC}, θ_{D_SDC}, θ_M
    end
end
while not converged do
    for all x_train in D_TRAIN do
        Sample random noise η
        Compute A(x_train, η)
        Compute gradients of L_CL, L_ACE(D_SIF, F), L_AH(D_SIF, F)
        Perform optimization step on θ_F, θ_C, and θ_{D_SIF}
    end
end

with 9 subjects. The last setup (HARD3) was only recorded with 4 subjects using accelerometers on each hand at a rate of 104 Hz.

FIGURE 4. Tasks used in the HARD datasets.

FIGURE 5. Violin plots of the effects of accelerometers, gyroscopes and flex sensors in the performance of classification models of hand activities.

FIGURE 6. The cross-subject performance increase with the number of subjects in the training set.

FIGURE 7. The impact of the sampling rate on the F1-scores.

TABLE 1. Recent works on sensor-based human activity recognition.
FIGURE 1. Overview of our work.

TABLE 3. Number of subjects in each dataset.

TABLE 5. Recommended values for the hyper-parameters of our proposed method.