Introduction
Data stream classification is the process of analyzing big data in real time to enable intelligent systems, such as online system monitoring, intelligent transportation systems[1], and financial risk management. With the rapid development of mobile networks and social media, data stream classification has gained increasing interest in academia and industry[2]. In addition, data stream analysis is a crucial component of the big data and analytics technology that enables Industry 4.0[3].
The demand for real-time data stream processing is increasing, so developing incremental learning methods is urgently needed. Incremental learning is a learning paradigm that continuously updates a model on newly arriving data rather than training it once on the entire dataset. Its purpose is to process the continuous flow of information in the real world while retaining, integrating, and optimizing previous knowledge as new knowledge is absorbed. In the field of machine learning, incremental learning is dedicated to solving a common flaw in model training: catastrophic forgetting [4]-[6]. That is, when a general machine learning model is trained on an entirely new task, its performance on previous tasks typically degrades markedly.
Multi-task multi-view learning provides a useful paradigm for the various types of data collected from heterogeneous devices with multiple modalities in real-world scenarios, such as sensor-based Human Activity Recognition (HAR)[7] and traffic data analysis in smart cities. Its goal is to improve efficiency and accuracy by jointly learning the objectives of multiple tasks over multiple views collected from different sources[8]. Compared with single-task multi-view models, multi-task multi-view models are more efficient to train and cheaper at inference, because simultaneously learning representations shared across related tasks and views promotes generalization [9]-[11]. However, research on applying multi-task multi-view learning to incremental learning is limited. In multi-task multi-view scenarios, first, learning a new task can interfere with previously learned tasks through updates to shared parameters. Second, different sensors provide different views, leading to differences in the input data that may be ignored during learning. Finally, data imbalance, in which certain tasks have more samples than others, can cause underrepresented tasks to be neglected in favor of overrepresented ones, resulting in catastrophic forgetting.
Several challenges arise when solving multi-task multi-view incremental data stream classification on mobile sensors. (1) Sensor data are continuously generated at a fast pace and occupy large storage space, so incremental learning is an urgent need. (2) Different tasks and views interfere with each other during training, and the data distribution differs across tasks and views: participants and sensors behave heterogeneously even when performing the same activity. Traditional single-task single-view incremental learning cannot effectively handle this form of catastrophic forgetting. (3) Existing methods for catastrophic forgetting, such as the traditional Elastic Weight Consolidation (EWC) approach[12], often ignore the differing significance of the components of the Fisher information matrix and fail to adapt as the model structure evolves. Current data stream classification research only partially addresses these challenges, and few attempts have been made to address all of them within a single framework.
For example, the method in Ref. [13] was designed to run on mobile devices: convolutional neural networks learn local interactions within each sensory view as well as global interactions between different sensor inputs, and perform regression and classification tasks on mobile sensor data. Reference [14] used incremental learning to build a personalized human activity recognition model; its algorithm, Learn++, is an ensemble method that can use any classifier as a base classifier. Reference [15] proposed an approach based on the "Kappa" architecture (namely MOLESTRA) to exploit the cognitive and learning relationships among streaming data. Reference [16] transferred multi-view streaming data from multiple views into a shared latent subspace and integrated discriminative information by maximizing the interclass and minimizing the intraclass separability of the streaming data. However, existing work has not addressed all the aforementioned challenges of streaming data analysis within one overall framework.
Therefore, an adaptive Multi-Task Multi-View Incremental learning framework for data Stream classification (MTMVIS) is proposed to address the above challenges. A general overview of MTMVIS is depicted in Fig. 1, with the notations listed in Table 1. First, a hierarchical attention mechanism weighs each measurement within a given time interval to align the data collected from different sensors at the same body location. Task-view relationships are then comprehensively modeled with the Multi-gate Mixture-of-Experts (MMoE)[17]: a gating network is constructed for each task-view, which determines a customized feature representation by combining the outputs of multiple experts. In addition, another attention layer builds a view fusion layer for each task to measure the importance of different views to the same task. Inspired by a single-task single-view method, IADM[18], a special adaptive output layer is introduced for each task: every layer of the network is trained to produce a specialized feature representation, which serves as input to the final output layer, and each task obtains a specific output representation by weighting the combination of the multilayer network outputs. Homoscedastic uncertainty[19] is further used to tune the loss weights of the tasks adaptively during training. Finally, adaptive Fisher regularization is employed to overcome the catastrophic forgetting problem. Extensive experiments on two different datasets demonstrate that MTMVIS is superior to other state-of-the-art methods. The main contributions can be summarized as follows.
A novel framework for adaptive multi-task multi-view incremental learning is proposed to address the catastrophic forgetting problem. Different from traditional incremental learning methods, the proposed model utilizes the relevant knowledge of multi-task multi-view learning.
Compared with traditional EWC methods for solving the catastrophic forgetting problem, adaptive Fisher regularization is introduced to enhance the scalability of the model.
The proposed approach is evaluated on the basis of two real datasets, namely RealWorld-HAR[20] and GLass Eating And Motion (GLEAM)[21], to demonstrate the effectiveness and efficiency of MTMVIS.
Fig. 1 Overview of the proposed MTMVIS framework. Specifically, each layer of the multi-task multi-view neural network extracts features separately, and an attention layer weighs the outputs of all layers of the multi-task multi-view deep neural network to form the final output layer for each task. Adaptive Fisher regularization is utilized to alleviate catastrophic forgetting and enhance model scalability. The details of the multi-task multi-view deep neural network are shown in Fig. 3 in Section 3.1. The notations appearing in this figure are shown in Table 1.
Related Work
2.1 Multi-Task Learning for Data Stream Classification
The volume of heterogeneous streaming data has been increasing with the rise of mobile networks. Moreover, different participants exhibit different features for the same activity due to variations in the sensors worn, the usage environment, or personal habits, which cannot be resolved with traditional single-task models. Multi-task learning approaches generally provide superior prediction results because the noise of a single model is bigger than the combined noise of the corresponding multiple models. Multi-task learning is a machine learning paradigm that leverages inter-task correlations: it learns all relevant tasks and the knowledge shared among them simultaneously to enhance generalization performance[22]. According to Ref. [10], the two most common approaches to multi-task learning are hard parameter sharing and soft parameter sharing. Previous work has focused on joint feature selection and feature learning, where task correlations are represented as mutual blocks of tasks [23]-[25].
Most work assumes that multiple tasks utilize a shared representation. However, in practical applications, sharing one set of common features across all tasks often ignores task-specific features due to task heterogeneity. Some recent methods have proposed different viewpoints. Inspired by Mixture-of-Experts (MoE)[26], Ref. [17] proposed MMoE, which introduces multiple expert networks along with a gating network for each task. The gating network learns a different combination of expert networks for each task; that is, it adaptively weighs the outputs of the expert networks. In addition, the conventional treatment of the multi-task loss simply adds the losses of the tasks or assigns them uniform weights, and manually tuning the weights is inefficient and difficult. Reference [19] proposed the method of setting the loss weights of different tasks by considering the homoscedastic uncertainty between tasks.
Furthermore, some works consider data stream classification based on multi-task learning. For example, Ref. [27] proposed a multi-task deep learning architecture for maritime surveillance using Automatic Identification System (AIS) data streams, where multiple tasks are learned simultaneously via shared and task-specific layers, and a fusion layer combines the learned representations from different tasks to produce a final prediction. Reference [28] proposed a novel double-coupling learning method for multi-task data stream classification, which mainly couples the mutual influences between different tasks and the temporal relationships within data streams to achieve multi-task data stream classification and avoid catastrophic forgetting. Furthermore, MOLESTRA[15] combines multi-task grouping and overlapping learning techniques with sliding learning windows for data stream classification. Moreover, Ref. [29] developed two complex models based on fully convolutional networks to predict node and edge flows, which are connected and trained together by coupling their latent representations in the middle layers. However, most existing studies only consider multiple tasks and fail to analyze additional fine-grained information; some models cannot even handle incremental data or ignore the catastrophic forgetting problem in multi-task incremental learning. To address these issues, approaches that classify data streams incrementally using multi-task learning are needed to handle task heterogeneity.
2.2 Multi-View Learning for Data Stream Classification
With the growing number and variety of sensors, the sensors located in different parts of the body acquire a considerable amount of information. Sensors at distinct body locations are often considered distinct views, and co-learnable features exist across these views. Therefore, a multi-view learning approach can be employed to exploit this structure.
Multi-view learning exploits the consistency and diversity among multiple data views, resulting in superior learning performance compared with single-view approaches. In traditional multi-view learning, separate functions are employed to model specific views, and all functions are jointly optimized to leverage redundant views of the same inputs, thereby enhancing learning performance[30]. Multi-view learning is also frequently employed to enhance single-view clustering techniques[31]. Reference [32] discovered that Expectation-Maximization (EM) based multi-view algorithms perform considerably better than their single-view counterparts. Furthermore, recent studies have demonstrated that multi-view spectral clustering holds substantial research implications. References [33]-[35] provided insights into the challenges and approaches of Low-Rank Representation (LRR) in multi-view data clustering. Reference [33] proposed a structured LRR that decomposes latent low-dimensional representations characterizing the clustering structure of each view. Reference [34] demonstrated that LRR essentially incorporates an optimized local structure for spectral clustering, yielding a latent orthogonal-projection-based clustering representation that leads to a highly desirable multi-view consensus. Reference [35] developed an effective and efficient approach to learning low-rank kernelized hash functions shared across views.
Moreover, the co-training algorithm[36] uses the predictor of one view to forecast the labels of the other views, which in turn extends the training dataset. Reference [37] constructed a data-dependent "co-regularization" norm, showed that it corresponds to a reproducing kernel associated with a single Reproducing Kernel Hilbert Space (RKHS), simplified the theoretical analysis, and extended the co-regularization algorithms. References [22], [38] explored the basis vectors of two sets of variables by maximizing the correlation between them; these vectors were directly applied to dual-view data to select shared latent subspaces and were further developed for multi-view regression[39].
Some existing works consider data stream classification based on multi-view learning. For example, Ref. [40] built an integrated global clustering model on each data block, integrating data from multiple views in a streaming manner through a split-merge clustering algorithm applied at each time window. Reference [16] transferred multi-view streaming data from different views into a shared latent subspace and integrated discriminative information by minimizing the intraclass and maximizing the interclass separability of the streaming data. However, all these methods only address single-task learning. The proposed MTMVIS focuses on the multi-task learning paradigm based on multi-view learning.
2.3 Multi-Task Multi-View Learning
Research on multi-task multi-view learning has recently generated a wide range of interest. Many real-world scenarios can be formalized as multi-task multi-view learning problems, such as human activity recognition and news classification. Reference [8] is the earliest work on multi-task multi-view learning; it presented a graph-based framework together with an effective algorithm to manipulate the framework (namely GraM2). However, GraM2 can only handle data with nonnegative eigenvalues. Reference [41] assumed that the views of different prediction models should be congruent, following the idea of consensus regularization. Subsequently, Ref. [42] proposed a highly general algorithm (namely CSL-MTMV), which assumes that multiple related tasks with the same view should share a low-dimensional common subspace. Inspired by linear discriminant analysis, Ref. [43] addressed the double heterogeneity through a latent space shared among multiple views. Reference [44] introduced the MFM algorithm, which is based on a multilinear factorization machine and learns the multilinear structure shared by task-specific feature maps and task-views. A unified deep learning framework (namely DMTMV)[11] was proposed to solve multi-task multi-view problems by establishing three types of networks: a feature-sharing network, specific feature networks, and task networks, and by using layer-by-layer regularization to update the parameters. However, these methods generally assume that views and tasks are independent and do not study the fine-grained task-view interaction relationships.
2.4 Incremental Learning for Data Stream Classification
The change in data distribution over time is usually called concept drift, which is a challenging problem in data stream classification. A considerable amount of recent work has studied incremental learning to address this problem [45]-[47]. Specifically, incremental learning can continuously process the flow of information in the real world and absorb new knowledge while retaining, integrating, and optimizing previous knowledge. Learning from emerging data drawn from a nonstationary distribution is inevitably prone to catastrophic forgetting[4], [5]. Incremental learning uses three different paradigms, namely regularization[48], replay[49], and parameter isolation, to address the catastrophic forgetting problem. Among these paradigms, Ref. [12] is one of the representatives of regularization. After training on each task, the EWC method computes a diagonal Fisher matrix from the final loss, where each entry reflects the importance of the corresponding parameter to the model. When training on the next task, the model uses a regularization term to limit the variation of these important parameters.
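For concreteness, the following minimal PyTorch sketch shows the two ingredients described above: a diagonal Fisher estimate from squared gradients and the quadratic penalty that restrains important parameters. The function names and the plain averaged squared-gradient estimator are illustrative assumptions, not the exact implementation of Ref. [12].

```python
import torch

def diagonal_fisher(model, data_loader, loss_fn):
    """Estimate the diagonal Fisher matrix as averaged squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(data_loader), 1) for n, f in fisher.items()}

def ewc_penalty(model, old_params, fisher, lam):
    """Regularization term limiting the variation of important parameters:
    lam * sum_i F_i * (theta_i - theta_i*)^2."""
    loss = torch.zeros(())
    for n, p in model.named_parameters():
        loss = loss + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return lam * loss
```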
Incremental learning for data stream classification has long been an active research field[12], [45]. Reference [50] proposed an ensemble-based incremental learning algorithm for imbalanced streaming data under concept drift, which reduces class imbalance by using bagging-based subsets and employs different methods to focus on class-specific performance. Inspired by the Hoeffding Decision Tree (HDT)[51] for efficient data stream classification, Ref. [52] adapted the HDT learning procedure to a recently introduced fuzzy decision tree and utilized a uniform fuzzy partition of each input feature to address the data stream classification problem. Reference [53] proposed a new incremental semi-supervised learning framework for streaming data to address the problem of insufficient labeled samples in data stream classification; this framework uses autoencoders to learn features from streaming data and regularizes the network by establishing pairwise similarity and dissimilarity constraints. However, the existing approaches only focus on solving catastrophic forgetting in single-task scenarios. Data stream classification can generally be formalized as a multi-task multi-view problem. Therefore, the idea of multi-task multi-view learning may be necessary to solve catastrophic forgetting.
MTMVIS: Adaptive Multi-Task Multi-View Incremental Learning Framework for Data Stream Classification
Owing to the nonstationary distribution of streaming data, old knowledge is disturbed by new knowledge when knowledge is continuously acquired, resulting in a rapid decline in overall performance. This phenomenon is called catastrophic forgetting, and incremental learning is introduced to address it in data streams. Furthermore, the following two scenarios involving sensor data are examined: in the first, data from the same type of sensor are gathered from different participants; in the second, data are collected from different parts of the same body using various sensors. Task heterogeneity or view heterogeneity arises because each participant or sensor behaves differently for the same activity. Each participant is assumed to represent a specific task, and each sensor corresponds to a view, as shown in Fig. 2. A multi-task multi-view incremental learning approach is adopted to tackle these scenarios. Therefore, an adaptive multi-task multi-view incremental learning framework for data stream classification, called MTMVIS, is proposed, as depicted in Fig. 1. Specifically, the proposed MTMVIS first aligns the data to extract features effectively. Then, MTMVIS uses MMoE[17] to select appropriate expert networks for the different task-views and constructs a special view fusion layer for each task. Finally, MTMVIS combines adaptive Fisher regularization and task homoscedastic uncertainty[19] to solve the catastrophic forgetting problem. The details of MTMVIS are presented as follows.
3.1 Problem Statement
This paper considers using incremental learning to classify the data streams collected from mobile sensors while improving the overall performance of the model via multi-task multi-view learning. Formally, for the k-th task, the model maps the data stream observed up to stage m to the corresponding label,\begin{equation*}
f(\{X_{k}^{t}\}_{t=0}^{t_{m}})\rightarrow y_{k}^{m}
\tag{1}
\end{equation*}
It is assumed that each participant represents a specific task, that each sensor corresponds to a view, and that the data collected by the sensors form a data stream. Task heterogeneity and view heterogeneity arise because each participant or sensor behaves differently for the same activity. Thus, a multi-task multi-view incremental learning method is used to solve these problems.
3.2 Data Alignment
In the collected datasets, the input data come from different sensors, which simultaneously have different effects on the prediction of the same activity. In addition, different measurement values generated in the same time interval may have different effects on the prediction results. A hierarchical attention mechanism is introduced to handle these two situations separately and address the heterogeneity. The attention mechanism can assign weights based on the importance of different parts and is already widely used in computer vision tasks. Specifically, the first part of the attention layer aligns the data of the different sensors located at the same part of the body and generates a fixed-length feature vector for each time interval; the second part then weighs these interval-level vectors across time intervals to produce an aligned representation for each view.
The details of each part of the attention layer can be formalized as follows:\begin{align*}
a_{k,v}^{n,s} &= \frac{\exp\left(\tanh\left(w_{k,v}^{1} \cdot x_{k,v}^{n,s}+b_{k,v}^{1}\right)\right)}{\sum_{s=1}^{S} \exp\left(\tanh\left(w_{k,v}^{1} \cdot x_{k,v}^{n,s}+b_{k,v}^{1}\right)\right)},
\tag{2}\\
x_{k,v}^{n} &= \sum_{s=1}^{S} a_{k,v}^{n,s} \cdot x_{k,v}^{n,s},\\
a_{k,v}^{n} &= \frac{\exp\left(\tanh\left(w_{k,v}^{2} \cdot x_{k,v}^{n}+b_{k,v}^{2}\right)\right)}{\sum_{n=1}^{N} \exp\left(\tanh\left(w_{k,v}^{2} \cdot x_{k,v}^{n}+b_{k,v}^{2}\right)\right)},
\tag{3}\\
\hat{x}_{k,v} &= \sum_{n=1}^{N} a_{k,v}^{n} \cdot x_{k,v}^{n}
\end{align*}
Fig. 3 Multi-task multi-view deep neural network of the proposed model MTMVIS. The MTMVIS model initially aligns the data to facilitate optimal feature extraction. Subsequently, this model employs MMoE to choose the most suitable expert networks for the diverse task-views and establishes a specialized view fusion layer for each task. The notations appearing in this figure are shown in Table 1.
The learning capability of the model is further improved by applying a fully connected neural network with two hidden layers to the aligned representation,\begin{equation*}
\bar{x}_{k,v}=\text{FCL}(\hat{x}_{k,v})
\tag{4}
\end{equation*}
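A minimal PyTorch sketch of this alignment step (Eqs. (2)-(4)) follows. One such module would be instantiated per task-view pair (k, v); the module name, hidden sizes, and the use of single linear scoring layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HierarchicalAlignment(nn.Module):
    """Two-level attention: over the S measurements in each interval (Eq. (2)),
    then over the N intervals (Eq. (3)), followed by the two-hidden-layer
    fully connected network FCL(.) of Eq. (4)."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.score1 = nn.Linear(d_in, 1)   # w^1_{k,v}, b^1_{k,v} of Eq. (2)
        self.score2 = nn.Linear(d_in, 1)   # w^2_{k,v}, b^2_{k,v} of Eq. (3)
        self.fcl = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU())

    def forward(self, x):                  # x: (batch, N, S, d_in)
        a = torch.softmax(torch.tanh(self.score1(x)), dim=2)      # over S
        x_n = (a * x).sum(dim=2)                                  # (batch, N, d_in)
        a_n = torch.softmax(torch.tanh(self.score2(x_n)), dim=1)  # over N
        x_hat = (a_n * x_n).sum(dim=1)                            # \hat{x}_{k,v}
        return self.fcl(x_hat)                                    # \bar{x}_{k,v}
```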
3.3 Task-Views Shared Layer With Multi-Gate Mixture-of-Experts
After computing the representation of each view of each task, we then model the relationships among task-views. Most prior studies utilized soft-sharing constraints and divided the Multi-Task Multi-View Learning (MTMVL) problem into multiple multi-task learning problems treated separately. Moreover, these studies assumed that all views are conditionally independent[11], [42], [43], and finer-grained task-view interaction correlations were ignored. Inspired by MMoE and the recent MoE layer, we propose an MMoE network that takes the feature representation obtained by FCL(·) for each view of each task as input. The MMoE network consists of K × V gates and E expert networks.
The MMoE network can be formalized as follows:\begin{align*}
& \tilde{x}_{k, v}^e=\text{ReLU}\left(w_{k, v}^e \cdot \bar{x}_{k, v}+b_{k, v}^e\right), \\
& c_{k, v}^e=\frac{\exp \left(\text{ReLU}\left(w_{k, v}^e \cdot \bar{x}_{k, v}+b_{k, v}^e\right)\right)}{\sum_{e=1}^E \exp \left(\text{ReLU}\left(w_{k, v}^e \cdot \bar{x}_{k, v}+b_{k, v}^e\right)\right)},
\tag{5}\\
& \tilde{x}_{k, v}=\sum_{e=1}^E\left(c_{k, v}^e \cdot \tilde{x}_{k, v}^e\right)
\end{align*}
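The gating computation of Eq. (5) can be sketched in PyTorch as follows. Eq. (5) indexes the expert parameters by task-view; the sketch below uses E experts shared across task-views with one gate per (task, view) pair, which follows the standard MMoE arrangement and is an assumption here, as are the single-linear-layer experts.

```python
import torch
import torch.nn as nn

class TaskViewMMoE(nn.Module):
    """E experts with one gate per (task, view) pair, in the spirit of Eq. (5)."""
    def __init__(self, d_in, d_out, n_experts, n_tasks, n_views):
        super().__init__()
        self.n_views = n_views
        self.experts = nn.ModuleList(
            nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.gates = nn.ModuleList(          # K * V gating networks
            nn.Linear(d_in, n_experts) for _ in range(n_tasks * n_views))

    def forward(self, x_bar, k, v):          # x_bar: (batch, d_in)
        # expert outputs \tilde{x}^e_{k,v}
        outs = torch.stack([torch.relu(e(x_bar)) for e in self.experts], dim=1)
        # gate weights c^e_{k,v}: softmax over the E experts, as in Eq. (5)
        c = torch.softmax(torch.relu(self.gates[k * self.n_views + v](x_bar)), dim=-1)
        return (c.unsqueeze(-1) * outs).sum(dim=1)   # \tilde{x}_{k,v}
```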
3.4 View Fusion Layer
Based on the assumption that all views are conditionally independent, traditional methods transform the multi-task multi-view learning problem into a multi-task learning problem; that is, previous multi-task multi-view models simply take the average of the outputs of all views as the task output. In our work, after obtaining improved task-view feature representations, we use an additional attention layer to fuse the view features of each task. Considering that different views may contribute differently to different tasks, this attention layer calculates the importance of each view.
For a single task, its attention layer can be formalized as follows:\begin{align*}
& d_{k, v}=\frac{\exp \left(\tanh \left(w_{k, v} \cdot \tilde{x}_{k, v}+b_{k, v}\right)\right)}{\sum_{v=1}^V \exp \left(\tanh \left(w_{k, v} \cdot \tilde{x}_{k, v}+b_{k, v}\right)\right)} \\
& \vec{X}_k=\sum_{v=1}^V d_{k, v} \cdot \tilde{x}_{k, v}
\tag{6}
\end{align*}
Thus, we utilize a regularization term to represent the multi-view loss as follows:\begin{equation*}
L_{\mathrm{M}\mathrm{V}}=\lambda_{\mathrm{M}\mathrm{V}} \sum_{k=1}^K \sum_{v=1}^V \frac{1}{V}\left\Vert f_k(\cdot)-f_{k, v}(\cdot)\right\Vert_2
\tag{7}
\end{equation*}
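A sketch of the view fusion attention (Eq. (6)) and the multi-view regularizer (Eq. (7)) for one task might look as follows; the per-view scoring layers and the way the predictions f_k and f_{k,v} are passed in are assumptions. Summing the regularizer over the K tasks is done outside.

```python
import torch
import torch.nn as nn

def fuse_views(view_feats, score_layers):
    """Eq. (6): attention over the V view representations of one task.
    view_feats: list of V tensors (batch, d); score_layers: V x nn.Linear(d, 1)."""
    scores = torch.stack(
        [torch.tanh(s(f)) for s, f in zip(score_layers, view_feats)], dim=1)
    d_kv = torch.softmax(scores, dim=1)            # d_{k,v}, shape (batch, V, 1)
    stacked = torch.stack(view_feats, dim=1)       # (batch, V, d)
    return (d_kv * stacked).sum(dim=1)             # fused representation X_k

def multi_view_loss(task_out, view_outs, lam_mv):
    """Eq. (7) for one task k: penalize disagreement between the fused task
    prediction f_k and each per-view prediction f_{k,v}."""
    return lam_mv * sum(
        torch.norm(task_out - v_out, p=2) for v_out in view_outs) / len(view_outs)
```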
3.5 Adaptive Output Layer
A considerable amount of streaming data is constantly evolving in nature; that is, the joint distribution between the ground truth and the input features varies due to concept drift[54]. If the changes in the distribution are ignored, then the performance on the previously observed distribution degrades markedly, leading to catastrophic forgetting[55]. Some recent works utilize the Fisher information matrix[56] to address this problem. However, these works ignore the differing significance of the components of the Fisher information matrix. Therefore, inspired by IADM[18], a special adaptive output layer is proposed for each task. Unlike the original network, which uses only the final feature representation for prediction, MTMVIS (Fig. 1) trains each network layer in the model and feeds the customized feature representations obtained at the end of each attention layer into the final output layer. Each task obtains a specific output representation by weighting the combination of the multilayer network outputs, which helps improve the overall prediction performance.
The details of the adaptive output layer for each task can be formalized as follows:\begin{align*}
\hat{\beta}_{k,v} &= \frac{\exp\left(\tanh\left(\hat{w}_{k,v} \cdot \hat{x}_{k,v}+\hat{b}_{k,v}\right)\right)}{\sum_{k=1}^{K} \sum_{v=1}^{V} \exp\left(\tanh\left(\hat{w}_{k,v} \cdot \hat{x}_{k,v}+\hat{b}_{k,v}\right)\right)},\\
\hat{X} &= \sum_{k=1}^{K} \sum_{v=1}^{V} \hat{\beta}_{k,v} \cdot \hat{x}_{k,v}
\tag{8}\\
\bar{\beta}_{k,v} &= \frac{\exp\left(\tanh\left(\bar{w}_{k,v} \cdot \bar{x}_{k,v}+\bar{b}_{k,v}\right)\right)}{\sum_{k=1}^{K} \sum_{v=1}^{V} \exp\left(\tanh\left(\bar{w}_{k,v} \cdot \bar{x}_{k,v}+\bar{b}_{k,v}\right)\right)},\\
\bar{X} &= \sum_{k=1}^{K} \sum_{v=1}^{V} \bar{\beta}_{k,v} \cdot \bar{x}_{k,v}
\tag{9}\\
\tilde{\beta}_{k,v} &= \frac{\exp\left(\tanh\left(\tilde{w}_{k,v} \cdot \tilde{x}_{k,v}+\tilde{b}_{k,v}\right)\right)}{\sum_{k=1}^{K} \sum_{v=1}^{V} \exp\left(\tanh\left(\tilde{w}_{k,v} \cdot \tilde{x}_{k,v}+\tilde{b}_{k,v}\right)\right)},\\
\tilde{X} &= \sum_{k=1}^{K} \sum_{v=1}^{V} \tilde{\beta}_{k,v} \cdot \tilde{x}_{k,v}
\tag{10}\\
\gamma_{k_l} &= \frac{\exp\left(\tanh\left(w_{k} \cdot X_{k_l}+b_{k}\right)\right)}{\sum_{l=1}^{L} \exp\left(\tanh\left(w_{k} \cdot X_{k_l}+b_{k}\right)\right)},
\tag{11}\\
F_{k} &= w_{\text{out}}\left(\sum_{l=1}^{L} \gamma_{k_l} \cdot X_{k_l}\right)+b_{\text{out}}
\tag{12}
\end{align*}
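For illustration, the layer-wise combination of Eqs. (11) and (12) can be sketched as follows; it assumes the per-layer representations X_{k_l} (e.g., \hat{X}, \bar{X}, and \tilde{X} from Eqs. (8)-(10)) are precomputed and share a common dimension, which is a simplifying assumption.

```python
import torch
import torch.nn as nn

class AdaptiveOutput(nn.Module):
    """Eqs. (11) and (12): weigh the representations X_{k_l} of the L network
    layers with attention weights gamma_{k_l}, then apply the output layer."""
    def __init__(self, d, n_classes):
        super().__init__()
        self.score = nn.Linear(d, 1)          # w_k, b_k of Eq. (11)
        self.out = nn.Linear(d, n_classes)    # w_out, b_out of Eq. (12)

    def forward(self, layer_reprs):           # list of L tensors, each (batch, d)
        stacked = torch.stack(layer_reprs, dim=1)                    # (batch, L, d)
        gamma = torch.softmax(torch.tanh(self.score(stacked)), dim=1)
        return self.out((gamma * stacked).sum(dim=1))                # F_k
```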
Cross entropy is used in the current work. The classification loss can be formalized as follows:\begin{equation*}
L_{\mathrm{C}\mathrm{L}}=\sum_{k=1}^{K}\Vert y_{k}-F_{k}\Vert _{2}
\tag{13}
\end{equation*}
3.6 Modeling Task Relationships With Homoscedastic Uncertainty
Simultaneously, MTMVIS is concerned with the joint optimization of multiple related tasks. The conventional method defines a total loss function as a linear combination of the losses of the individual tasks. However, manually adjusting the weight parameters is expensive and tricky: if the loss weights of different tasks differ greatly, then one task will dominate the overall loss and impair the learning of the remaining tasks. Inspired by Ref. [19], the loss of each task is automatically traded off via homoscedastic uncertainty during model training to address this problem. The relative weight of each task in the total loss function is adjusted by deriving a multi-task loss function from maximum likelihood estimation under task-dependent homoscedastic uncertainty. Specifically, the multi-task loss function is rewritten as follows:\begin{equation*}
L_{\mathrm{M}\mathrm{T}}=\lambda_{\mathrm{M}\mathrm{T}} \sum_{k=1}^K L_k\left(X_k, \theta_k, \sigma_k\right)
\tag{14}
\end{equation*}
The classification likelihood is modeled by squashing a scaled version of the model output through a softmax function,\begin{equation*}
p\left(y \mid F_k\left(X_k\right), \sigma_k\right)=\text{softmax}\left(\frac{1}{\sigma_k^2} F_k\left(X_k\right)\right)
\tag{15}
\end{equation*}
The output can be written as\begin{equation*}
p\left(y=c \mid X_k, \theta_k, \sigma_k\right)=\frac{\exp \left(\frac{1}{\sigma_k^2} f_k^c\left(X_k\right)\right)}{\sum_{c=1}^C \exp \left(\frac{1}{\sigma_k^2} f_k^c\left(X_k\right)\right)}
\tag{16}
\end{equation*}
The corresponding negative log-likelihood loss of task k is then\begin{align*}
L_k\left(X_k, \theta_k, \sigma_k\right) &= \sum_{c=1}^C-\log p\left(y=c \mid X_k, \theta_k, \sigma_k\right)= \\
& \frac{1}{\sigma_k^2} \sum_{c=1}^C-\log p\left(y=c \mid X_k, \theta_k\right)+\log \sigma_k
\tag{17}
\end{align*}
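A minimal sketch of this weighting scheme follows, assuming learnable per-task log-variances, which is a common parameterization consistent with Eq. (17) but not necessarily the exact implementation used here.

```python
import torch
import torch.nn as nn

class HomoscedasticWeighting(nn.Module):
    """Trade off K task losses as in Eq. (17):
    L_total = sum_k L_k / sigma_k^2 + log(sigma_k)."""
    def __init__(self, n_tasks):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(n_tasks))   # log sigma_k^2

    def forward(self, task_losses):            # iterable of K scalar losses
        losses = torch.stack(list(task_losses))
        precision = torch.exp(-self.log_var)   # 1 / sigma_k^2
        # 0.5 * log sigma_k^2 equals log sigma_k
        return (precision * losses + 0.5 * self.log_var).sum()
```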
3.7 Adaptive Fisher Regularization for Overcoming Catastrophic Forgetting
A model that adapts to new data often suffers from catastrophic forgetting, i.e., it forgets what it has learned before. This phenomenon poses a practical challenge for incremental learning: forgetting previous knowledge as little as possible. To address this challenge, previous work protects old knowledge from being overwritten by new knowledge by imposing constraints on the loss function of the new task. This study takes the following view: the distribution of instances does not change significantly within a short period; for example, the interest of users following an online information stream remains stable. Moreover, in highly complex cases, a drift detection algorithm can be employed to partition the stream into epochs with relatively smooth underlying distributions. Therefore, the Fisher information matrix is used to regularize the conditional likelihood distribution at each stage as a forgetting metric. The regularization term can be represented as follows:\begin{equation*}
L_{\mathrm{IL}}\left(\theta_k^{m+1}\right)=\sum_{p=1}^P F_k^{m_p}\left(\theta_k^{(m+1)_p}-\theta_k^{m_p}\right)^2
\tag{18}
\end{equation*}
Notably, Eq. (18) only considers the Fisher information matrix of the previous stage and disregards all earlier stages, which may still lead to interval forgetting. Moreover, in the incremental setting, different hidden layers of the neural network may have different importance because the network structure evolves with the attention mechanism, so different parts of the Fisher information matrix carry different significance in successive stages. To address this issue, the attention weights are incorporated into the relevant parameters of the Fisher regularizer, gradually aligning the posterior distributions of the neural network trained at successive stages. Therefore, Eq. (18) can be further expressed as\begin{equation*}
L_{\mathrm{I}\mathrm{L}}\left(\theta_k^{m+1}\right)=\lambda_{\mathrm{I}\mathrm{L}} \sum_{p=1}^P \gamma_k^m \odot F_k^{m_p}\left(\theta_k^{(m+1)_p}-\theta_k^{m_p}\right)^2
\tag{19}
\end{equation*}
\begin{equation*}
F_k^{m_p}=\sum_{t=1}^{t_m}\left(\frac{\partial L\left(X_k^t\right)}{\partial \theta_k^{m_p}}\right)^2+F_k^{(m-1)_p}
\tag{20}
\end{equation*}
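A sketch of Eqs. (19) and (20) follows; storing the Fisher matrices and attention weights as per-parameter dictionaries keyed like `model.named_parameters()` is an implementation assumption.

```python
import torch

def adaptive_fisher_loss(params, old_params, fisher, gamma, lam_il):
    """Eq. (19): attention weights gamma modulate the Fisher penalty so that
    currently important layers are protected more strongly."""
    loss = torch.zeros(())
    for name in params:
        loss = loss + (gamma[name] * fisher[name]
                       * (params[name] - old_params[name]) ** 2).sum()
    return lam_il * loss

def update_fisher(grad_sq_sum, prev_fisher):
    """Eq. (20): add the squared gradients accumulated over stage m to the
    previous stage's Fisher, so earlier stages are not forgotten."""
    return {n: grad_sq_sum[n] + prev_fisher.get(n, 0.0)
            for n in grad_sq_sum}
```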
The parameters related to the multi-view loss and the multi-task loss are integrated with those related to the classification loss. In the next training stage, the proposed MTMVIS preferentially preserves these parameters to overcome the catastrophic forgetting problem. The overall loss function is as follows:\begin{gather*}
L_F\left(\theta_k^{m+1}\right)=L_{\mathrm{C}\mathrm{L}}\left(\theta_k^{m+1}\right)+L_{\mathrm{M}\mathrm{V}}\left(\theta_k^{m+1}\right)+L_{\mathrm{M}\mathrm{T}}\left(\theta_k^{m+1}\right)+ \\
\sum_{k=1}^K L_{\mathrm{I}\mathrm{L}}\left(\theta_k^{m+1}\right)
\tag{21}
\end{gather*}
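Assembling the overall objective of Eq. (21) from the preceding pieces is then straightforward (the function names refer to the illustrative sketches above, not to code from the paper):

```python
def total_loss(cl_loss, mv_loss, mt_loss, il_losses):
    """Eq. (21): classification + multi-view + multi-task losses plus the
    per-task incremental (adaptive Fisher) terms."""
    return cl_loss + mv_loss + mt_loss + sum(il_losses)
```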
Experiment
Experiments using two real-world human activity recognition datasets are conducted in this section to demonstrate that the proposed model outperforms many existing baselines. A detailed introduction to the two datasets, the baselines, and the experimental setup is first presented. The effectiveness of the proposed MTMVIS is then demonstrated.
4.1 Dataset Descriptions
(1) RealWorld-HAR[20]
The dataset comprises motion sensor data for various activities, including stair climbing and jumping. Four males and four females are selected from a pool of 15 participants to join the experiment. The acceleration at seven body positions (chest, forearm, head, calf, thigh, upper arm, and waist) during each activity is recorded using sensors. Each participant is regarded as a separate task, and each body position is considered a distinct view for behavioral recognition. Thus, this dataset has eight tasks and seven views.
(2) GLEAM[21]
The GLEAM dataset is a collection of head motion tracking data gathered from Google Glass, comprising 2 hours of data spanning various activities, such as eating and walking. These labeled data obtained from the head-mounted sensor can be utilized to identify diet and other activities, ultimately aiding patients with chronic diseases. The dataset includes data from all the sensors of Glass, which include acceleration, gravity, linear acceleration, rotation, gyroscope, magnetic field, and light. Table 2 provides a brief description of these sensors. The data are collected from 38 participants aged 18–21, and eight of them are randomly selected as experimenters, each having six views (light sensor data are not used).
4.2 Baselines
The following baselines are compared with the proposed method.
• DMTMV[11]
This baseline is a deep multi-task multi-view approach in a unified framework, which learns not only nonlinear feature representations and classifiers but also the relationships between tasks in nonlinear models.
• ASM2TV[57]
This baseline is a semi-supervised multi-task multi-view learning algorithm. ASM2TV presents a new perspective named gating control policy, a learnable task-view-interacted sharing policy that adaptively selects the most desirable candidate shared block for any view across any task.
• IADM[18]
This baseline is an incremental adaptive model that adds an attention model over the hidden layers and prevents forgetting by exploiting the attention-based Fisher information matrix, addressing the capacity sustainability problem. Specifically, when computing the Fisher information matrix, the method incorporates the learned attention weights into the corresponding model parameters.
• MTIS
A multi-task multi-view neural network and adaptive Fisher regularization are used to train the model sequentially. MTIS is the same as the MTMVIS structure, except that the attention mechanism is not used in the view fusion layer. Instead, the average of the output results of all views is used as the output vector of each task.
• MVIS
A multi-task multi-view neural network and adaptive Fisher regularization are used to train the model sequentially. MVIS is the same as the MTMVIS structure, except that only one expert is used in the MMoE layer.
4.3 Experiment Settings
The model is trained stage by stage in a sequential style and no longer stores training data after completing each training stage. Each activity occurs three times, and the sequence of activities in each stage is randomly arranged. Specifically, the measurement values from different sensors in a certain time window (T = 2.5 s by default) are used as the input for action recognition. For both datasets, 70% of the data are randomly selected as the training set, 10% as the validation set, and the remaining 20% as the test set. Adam is chosen as the optimizer, and adaptive Fisher regularization is used to tackle the catastrophic forgetting problem. The hyperparameter λ is searched over {10^-6, 10^-5, ..., 10^6}, and the learning rate is set to 10^-4. Accuracy, F1-score (F1), precision, and AUC are used to measure model performance.
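To make these settings concrete, the fragment below sketches the training configuration (Adam, learning rate 10^-4, grid search over λ); `build_mtmvis`, `stage_loaders`, and `train_one_stage` are hypothetical placeholders rather than code from the paper.

```python
import torch

# Assumed reproduction of the stated settings.
lambdas = [10.0 ** e for e in range(-6, 7)]        # {10^-6, ..., 10^6}
for lam in lambdas:
    model = build_mtmvis()                         # hypothetical constructor
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Stages are visited sequentially; data are discarded after each stage.
    for stage_loader in stage_loaders:             # hypothetical per-stage loaders
        train_one_stage(model, optimizer, stage_loader, lam)   # hypothetical
```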
4.4 Experiment Results
4.4.1 Numerical Results
Table 3 and Fig. 4 display the results of the proposed framework and all baselines under four metrics on the RealWorld-HAR dataset[20], and Table 4 and Fig. 5 display the corresponding results on the GLEAM dataset[21]. The evaluation metrics are derived from eight tasks on each dataset (Tables 3 and 4), and the curves of the four metrics over the number of training stages are presented in Figs. 4 and 5. The following observations are provided.
Compared with the baselines, the proposed MTMVIS reaches the best predictive performance on all four metrics across all tasks on the two real-world datasets.
DMTMV and ASM2TV are not designed for incremental learning scenarios. Thus, their performance decreases sharply when they are trained over sequential stages. The results show that, after all training stages are completed, MTMVIS improves the four metrics by approximately 45% compared with DMTMV and ASM2TV.
The performance of IADM is worse than that of MTMVIS, indicating that the attention mechanism performs well in processing time-series sensory data in incremental learning and that the idea of multi-task multi-view learning also contributes to the improvement.
The performance of MTIS is lower than that of the proposed MTMVIS, because MTIS discards the attention mechanism of the view fusion layer and directly averages the output of all views. This approach ignores the importance of different sensors for the same activity and the impact of sensor data at different times. By contrast, considering the different importance of various views in the same task is beneficial for learning the invariant features between views and improving overall performance.
The performance of MVIS is worse than that of MTMVIS, because harmful interference or negative feedback cannot be avoided when all task-views share the same expert network.
The performance of MTMVIS can be maintained at a stable level under repetitive activities. For instance, on the GLEAM dataset[21] shown in Fig. 5, all metrics decrease from the first to the third stage due to catastrophic forgetting. However, the performance improves from the third to the fourth stage. This occurs because the activities in the first and fourth stages are both talking, and the proposed MTMVIS consolidates the parameters important for talking.
4.4.2 Ablation Studies
A series of ablation studies is further conducted on the effect of each component to validate the effectiveness of the proposed MTMVIS. Table 5 reports performance improvement results (in percent) on both datasets compared to IADM. Specifically, the following observations are presented.
Compared to single-task single-view models, MTMVIS generally achieves significant improvements of over 23% and 15% for all metrics on the RealWorld-HAR and GLEAM datasets, respectively.
MTIS can also achieve 17% and 11% significant improvements on all metrics on two datasets. This finding indicates that the appropriate number of experts often has a considerable effect on model improvement.
MVIS is enhanced less significantly, with improvements of 8% and 4% on all metrics on the two datasets. This finding shows that negative feedback is often unavoidable when only one expert network is shared.
MTMVS has the same structure as MTMVIS, except that it only uses the output of the last layer as the final output. The performance of MTMVS approaches that of the proposed MTMVIS, with 21% and 11% improvements on all metrics on the two datasets. This finding indicates that combining the outputs of different layers as the final output performs better than using only the output of the last layer.
Overall, each component contributes to the performance achieved by the final model.
4.4.3 Parameter Analysis
Further experiments are conducted on the parameters presented below to explore the effect of different learning parameters on the effectiveness and efficiency of the proposed MTMVIS.
(1) Number of Experts
The influence of the number of experts on model performance (Fig. 6) is investigated on the two datasets RealWorld-HAR[20] and GLEAM[21]. The best performance on both datasets is observed when the number of experts is 5. As the number of experts increases further, model performance improves only slightly and the curves flatten; the differences are not significant under statistical testing. With this finding, the model can achieve its best performance with a small number of experts, which helps reduce the number of network parameters, increase the convergence speed, and reduce memory consumption.
(2) Attention Weight
Figure 7 depicts the attention weight matrices of the output layer on the two datasets. The results show that the largest weights lie in the shallow layers at the initial stage, and the deep layers gain additional weight as the stages progress. This weight evolution indicates that MTMVIS can perform model selection. Moreover, different stages have different attention weight matrices, indicating that MTMVIS learns additional discriminative features in the task-view shared layer and the view fusion layers. Thus, multi-task multi-view learning can play a significant role in solving catastrophic forgetting in incremental learning.
In addition, the adaptive multi-task multi-view incremental learning framework performs well even with highly complex network structures. For traditional models, an overly complex network structure markedly slows overall convergence and degrades performance. The proposed model, by contrast, adaptively makes suitable choices along the training process.
Conclusion
An adaptive multi-task multi-view incremental learning framework for data stream classification, namely MTMVIS, is proposed in this work to address the catastrophic forgetting problem in incremental learning. MTMVIS heuristically uses a hierarchical attention mechanism to align the data collected by different sensors in different views. In addition, MTMVIS utilizes adaptive Fisher regularization from the perspective of multi-task multi-view learning to preserve knowledge from previous stages. The experimental results demonstrate its superiority, with better performance than other advanced models on two public datasets, and show that MTMVIS effectively alleviates catastrophic forgetting in incremental learning. Because numerous unlabeled samples exist in real-world scenarios, future work aims to extend the proposed model with semi-supervised learning to improve its overall performance.
Fig. 7 Attention weight matrix of the output layer with the change in stage on two datasets.
Acknowledgment
This work was supported by the National Key Research and Development Program of China (No. 2022YFF0712100), the National Natural Science Foundation of China (Nos. 62202273 and 62072279), the National Science Fund for Excellent Young Scholars of China (No. 62122042), the Shandong Provincial Natural Science Foundation of China (No. ZR2021QF044), and the Fundamental Research Funds for the Central Universities.