Introduction
In multi-view learning, more than one feature set is used for learning. The feature sets may be redundant, but they are not identical. As such, besides learning the patterns within each feature set, the relationships among the feature sets can also be exploited for learning.
Multi-view learning was first introduced as a framework by Blum and Mitchell [1] for the semi-supervised learning of web page classification. The text of the web pages and the anchor text in the hyperlinks of the web pages were used as the feature sets in this two-view setting. Using co-training, two separate models built on the two disjoint views were used to predict labels for the unlabeled data, and these predictions were used to decide which of the unlabeled data to add to the training set. In this way, the training set could be enlarged for further training.
A survey on multi-view learning by Sun [2] reviews the theories, properties and behaviors of multi-view learning. It shows that multi-view learning, as an emerging and rapidly growing field in machine learning, has been applied across the branches of machine learning, including unsupervised learning [3], semi-supervised learning, active learning [4], supervised learning, transfer learning [5], and ensemble training [6]. Some examples of applications include the sentiment analysis of the attitude or opinion of a user [7], and speech analysis for phonetic recognition [8].
An important part of multi-view learning is the construction of the views. The views may be naturally distinct, as in the text of the web page and the anchor text in the hyperlink of the web page, or the video and audio signals of a multimedia content [9]. They may be distinct due to the feature extraction methods used on the raw data, such as the CELP features and the MFCC features of an audio signal [10]. They may be subsets that are split from a single feature set, based on the ordered importance of the features in the feature set.
When multi-view features are not available, random feature split of a single view can be used to construct artificial views. The use of the artificial views in multi-view learning can still improve the generalization performance. This is because multi-view learning is robust to the violated assumptions of its underlying classifiers [11].
The architectures for the two types of multi-view data, namely natural and artificial, are shown in Fig. 1.
In this work, we utilize deep learning to create the artificial views, and then make use of the artificial views in multi-view learning for the classification of time series data, in particular sounds. The framework makes use of differently configured deep learning sub-models to extract the views, which are then combined according to their complementarity before being passed to a final classifier.
Fig. 2 shows the architecture of the proposed network. The time series data are first decomposed in the time-frequency domain to expose the spectral aspect of the time series to the deep learning sub-models. The features extracted by the sub-models form the views. These views are obviously redundant, but they are not entirely similar, due to the different configurations of the deep learners. The views from the deep learners can be combined according to their complementarity. The combined data, being more representative of the target concept, will result in better performance by the final classifier.
The proposed framework addresses the problem of the strong dependency of the performance of a trained model on the representativeness of the data. As is well known, it is tedious and expensive to construct a representative training set, due to the extensive manual curation and annotation that are needed. This is particularly true for time series data, as clear segmentation is not readily available. By treating the outputs of the deep learning sub-models as the views of the same target concept, the dependency on any one of the views could be weakened through the appropriate use of the views’ complementarity. This helps reduce the need for clear segmentation and improve the generalization performance of the classifier.
A. Construction of Multiple Views by Deep Learning
In a traditional single-view classifier, the training set consists of a single feature set extracted from the input data.
The proposed way to construct the views is to subject each of the input data segments to a number of differently configured deep learning sub-models, with each sub-model producing one view of the segment.
The view to be retrieved from the sub-model is the penultimate layer of the sub-model, rather than the final softmax layer. The penultimate layer can be thought of as the feature set that is extracted by deep learning from the input data. It can be represented as the approximate function learned by the sub-model from the input data.
According to the Representer theorem [12], the approximate function of a machine learning model is the linear combination of the basis functions. Thus, assuming that each of the views is a basis function, the views can be combined linearly, with appropriate weights assigned to the linear combination, as shown in (1) below.\begin{equation*} \boldsymbol{V}_{combined}=\sum_{i=1}^{M}\alpha^{(i)}\boldsymbol{V}^{(i)}\tag{1}\end{equation*}
The weights $\alpha^{(i)}$ are constrained to be positive and to sum to one, as shown in (2) below.\begin{equation*} \sum_{i=1}^{M}\alpha^{(i)}=1,\quad \alpha^{(i)}>0\tag{2}\end{equation*}
The value of each weight $\alpha^{(i)}$ reflects the complementarity of the corresponding view; the method for computing it is described in Section II.
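As a minimal numerical illustration of (1) and (2), the sketch below combines three hypothetical view matrices with hand-picked weights; the array sizes and weight values are illustrative only.

```python
import numpy as np

# Hypothetical example: M = 3 views, each a matrix of N data points by d features.
N, d = 100, 128
views = [np.random.randn(N, d) for _ in range(3)]

# Mixing coefficients: positive and summing to one, as required by (2).
alpha = np.array([0.5, 0.3, 0.2])
assert np.all(alpha > 0) and np.isclose(alpha.sum(), 1.0)

# Weighted linear combination of the views, as in (1).
V_combined = sum(a * V for a, V in zip(alpha, views))
print(V_combined.shape)  # (100, 128)
```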
B. Complementarity of Multiple Views
Intuitively, views that are independent and supplemental will contribute equally to the global view of the combined data. The weight of each of these views is the average weight $1/M$.
On the other hand, if a view contains complementary information, it will contribute more to the global view, and its weight will be higher than the average weight $1/M$.
So, instead of using the average weight $1/M$ for every view, the weight of each view should be set according to how much it contributes to the global view of the combined data.
However, the global view of a linear mixture is actually latent, given the individual views. In other words, although the global view can be obtained from the weighted sum of the individual views, the weight values for that linear combination are themselves unknown.
The candidate method to solve the minimization problem with two unknowns (the weights and the global view) is alternate optimization. An example of alternate optimization is the expectation maximization (EM) method used in the Gaussian mixture [14].
A similar approach is proposed for the multi-view temporal ensemble. The cost function, which has to be defined in alternate optimization, is based on that of Laplacian eigenmap [15], a non-linear data reduction technique. It will be modified in this work so that it can be used in the multi-view setting. This will be described later in Section II where the computation method for complementarity is explained.
C. Features in the Time-Frequency Domain
Time-frequency decomposition exposes the spectral changes in the time series data to the sub-model.
The sub-model, as a machine learning model, can be a generalized linear model, decision tree, k nearest neighbor, or neural network. In the past decade, deep learning, which is the composition of layers of models, has been found to be effective in the classification of raw signals.
Deep learning, as a feature extractor, produces a smooth output in the feature space that can be classified easily by the final classifier. Not only can it approximate the target function with exponentially fewer training parameters than a shallow network, it is also less prone to overfitting [19].
The workhorses of deep learning are the deep belief network [20], the convolutional neural network (CNN) [21] and the long short-term memory (LSTM) recurrent neural network [22]. These models can be combined in different ways to form practical models for signal classification.
In this work, the CNN-LSTM model is proposed for use in the multi-view temporal ensemble. The reason for using the CNN-LSTM model is to extract the temporal and spectral patterns from the two-dimensional time-frequency domain. The lower CNN layer takes in the input data in two dimensions, while the LSTM works on the subsequently flattened layer in one dimension.
The fully-connected layer before the final softmax layer is the penultimate layer. It contains the features that form the view of the sub-model. By linearly combining the penultimate layers of the CNN-LSTM sub-models, a new input will be formed for the final classifier. This qualifies the proposed multi-view temporal ensemble as an intermediate data fusion technique, rather than a late data fusion technique. This is because the penultimate layer represents the feature extracted by the sub-model, not the decision made by the sub-model.
Method
This section first provides the overview of the method to compute complementarity, followed by the details in the sub-sections.
The input, output, and initial weight values are as shown below:
Input: A set of $M$ data matrices, each with $N$ data points of length $d$, $\boldsymbol{X}=\left\{\boldsymbol{X}^{(i)}\in\mathbb{R}^{N\times d}\right\}_{i=1}^{M}$
Output: A set of $M$ mixing coefficients, $\boldsymbol{\alpha}=\left\{\alpha^{(i)}\right\}_{i=1}^{M}$
Initialize $\boldsymbol{\alpha}=\left[\frac{1}{M},\ldots,\frac{1}{M}\right]$
The set of $M$ data matrices corresponds to the $M$ views, i.e. the outputs of the penultimate layers of the deep learning sub-models.
A summary of the terms used in this section is shown below:
$\boldsymbol{W}$ - Weighted adjacency matrix of a view, $\boldsymbol{W}\in\mathbb{R}^{N\times N}$
$\boldsymbol{L}$ - Laplacian matrix of a view, $\boldsymbol{L}\in\mathbb{R}^{N\times N}$
$\boldsymbol{Y}$ - Spectral embedding of a view, $\boldsymbol{Y}\in\mathbb{R}^{N\times m}$, $m<N$
$\boldsymbol{W}^{(i)}$ - Weighted adjacency matrix of the $i$-th view
$\boldsymbol{L}^{(i)}$ - Laplacian matrix of the $i$-th view
$\boldsymbol{Y}^{(i)}$ - Spectral embedding of the $i$-th view
$\boldsymbol{L}^{(G)}$ - Laplacian matrix of the global view
$\boldsymbol{Y}^{(G)}$ - Spectral embedding of the global view
$\alpha^{(i)}$ - Complementarity (i.e. the mixing coefficient, or weight) of the $i$-th view
To compute complementarity, a set of $N$ co-occurring data vectors of the same class is drawn from each of the $M$ views.
From the adjacency matrix of each view, the Laplacian matrix $\boldsymbol{L}^{(i)}$ of that view is computed.
The global spectral embedding $\boldsymbol{Y}^{(G)}$ is obtained by eigen-decomposition of the global Laplacian matrix $\boldsymbol{L}^{(G)}$, which is the weighted combination of the individual Laplacian matrices.
Alternate optimization of the global spectral embedding $\boldsymbol{Y}^{(G)}$ and the weights $\alpha^{(i)}$ is then carried out until the weights converge.
The iterative process in Fig. 4 can be summarized by the following steps:
1. Obtain $\boldsymbol{L}^{(i)}$ from a set of $N$ co-occurring data vectors of the same class from the $i$-th view.
2. Align the individual $\boldsymbol{L}^{(i)}$ to the global spectral embedding in 2 steps:
   a. obtain $\boldsymbol{L}^{(G)}$ from the $\boldsymbol{L}^{(i)}$ by linear combination, according to the weights $\alpha^{(i)}$;
   b. obtain $\boldsymbol{Y}^{(G)}$ from $\boldsymbol{L}^{(G)}$ by eigen-decomposition, formed by the $m$ eigenvectors that correspond to the $m$ smallest eigenvalues other than $\lambda_{0}$, where $0=\lambda_{0}\le\lambda_{1}\le\ldots\le\lambda_{N-1}$ and $m<N$.
3. Update the values of $\alpha^{(i)}$, based on the inverse of the trace of $\boldsymbol{Y}^{(G)T}\boldsymbol{L}^{(i)}\boldsymbol{Y}^{(G)}$, as given later in (8).
4. Repeat from step 2 if the norm of the change in $\boldsymbol{\alpha}$ is greater than a small threshold $\varepsilon$; otherwise stop.
The above is the overview of how complementarity is computed from sets of $N$ co-occurring, class-specific data vectors across the views. The details are explained in the following sub-sections.
A. Adjacency Matrix and Laplacian Matrix
For a set of $N$ data vectors $\{\boldsymbol{x}_{i}\}_{i=1}^{N}$ from a view, the weighted adjacency matrix $\boldsymbol{W}$ is defined as shown in (3) below.\begin{equation*} \left[\boldsymbol{W}\right]_{i,j}= \begin{cases} \exp\left(-\dfrac{\left\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\right\|_{2}^{2}}{\sigma^{2}}\right) & \text{if } \boldsymbol{x}_{i},\boldsymbol{x}_{j} \text{ connected}\\ 0 & \text{otherwise} \end{cases}\tag{3}\end{equation*}
According to (3) above, the entry $[\boldsymbol{W}]_{i,j}$ is the Gaussian-kernel similarity between the data vectors $\boldsymbol{x}_{i}$ and $\boldsymbol{x}_{j}$ when they are connected, and zero otherwise; the kernel width $\sigma$ controls how quickly the similarity decays with distance.
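A minimal Python sketch of (3) is given below. The rule for deciding which data vectors are connected is not prescribed above, so a symmetrized k-nearest-neighbour rule is assumed here, and the Laplacian is formed as $\boldsymbol{L}=\boldsymbol{D}-\boldsymbol{W}$; the parameter values are illustrative.

```python
import numpy as np

def adjacency_and_laplacian(X, k=5, sigma=1.0):
    """Gaussian-kernel adjacency matrix W as in (3) and Laplacian L = D - W.

    X : array of shape (N, d), the N co-occurring data vectors of one view.
    k, sigma : assumed neighbourhood size and kernel width.
    """
    N = X.shape[0]
    # Pairwise squared Euclidean distances between the data vectors.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / sigma ** 2)

    # Connect each point to its k nearest neighbours (symmetrized); zero elsewhere.
    mask = np.zeros((N, N), dtype=bool)
    nearest = np.argsort(sq_dist, axis=1)[:, 1:k + 1]
    mask[np.repeat(np.arange(N), k), nearest.ravel()] = True
    mask |= mask.T
    W = np.where(mask, W, 0.0)

    D = np.diag(W.sum(axis=1))   # diagonal degree matrix
    L = D - W                    # unnormalized graph Laplacian
    return W, D, L
```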
B. Spectral Embedding of the Data Manifold
The spectral embedding $\boldsymbol{Y}$ of the data manifold is the low-dimensional representation obtained by minimizing the cost function $J(\boldsymbol{Y})$ shown in (4) below.\begin{equation*} J\left(\boldsymbol{Y}\right)=\sum_{i,j\in\{1,\ldots,N\}}\left\|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\right\|^{2}\left[\boldsymbol{W}\right]_{i,j}\tag{4}\end{equation*}
As seen from (4) above, the cost function $J(\boldsymbol{Y})$ heavily penalizes embeddings that place strongly connected data vectors (those with large $[\boldsymbol{W}]_{i,j}$) far apart, so minimizing it preserves the local proximity of the data manifold.
The solution $\boldsymbol{Y}^{\ast}$ is obtained by solving the constrained minimization problem shown in (5) below.\begin{equation*} \boldsymbol{Y}^{\ast}=\arg\min_{\boldsymbol{Y}^{T}\boldsymbol{D}\boldsymbol{Y}=\boldsymbol{I},\;\boldsymbol{Y}^{T}\boldsymbol{D}\boldsymbol{1}=\boldsymbol{0}}{tr(\boldsymbol{Y}^{T}\boldsymbol{L}\boldsymbol{Y})}\tag{5}\end{equation*}
In (5) above, $\boldsymbol{D}$ is the diagonal degree matrix with $[\boldsymbol{D}]_{i,i}=\sum_{j}[\boldsymbol{W}]_{i,j}$, and $\boldsymbol{L}=\boldsymbol{D}-\boldsymbol{W}$ is the Laplacian matrix. The first constraint removes the arbitrary scaling of the embedding, while the second removes the trivial constant solution.
Importantly, finding $\boldsymbol{Y}^{\ast}$ reduces to a generalized eigenvalue problem, $\boldsymbol{L}\boldsymbol{y}=\lambda\boldsymbol{D}\boldsymbol{y}$, so the embedding can be computed by eigen-decomposition.
With the eigenvectors corresponding to the $m$ smallest non-zero eigenvalues stacked as columns, the spectral embedding $\boldsymbol{Y}\in\mathbb{R}^{N\times m}$, $m<N$, is obtained.
It is interesting to note that the spectral embedding depends on the data only through the Laplacian matrix, which makes it natural to combine multiple views at the level of their Laplacian matrices.
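A sketch of the embedding step in (5) is shown below, assuming the generalized eigenvalue formulation $\boldsymbol{L}\boldsymbol{y}=\lambda\boldsymbol{D}\boldsymbol{y}$ and a degree matrix with no zero entries on its diagonal; scipy is used for the eigen-decomposition.

```python
import numpy as np
from scipy.linalg import eigh

def spectral_embedding(L, D, m):
    """Return the N x m spectral embedding Y of one view.

    L, D : Laplacian and degree matrices of the view (D assumed positive definite).
    m    : embedding dimension, m < N.
    """
    eigvals, eigvecs = eigh(L, D)        # generalized problem L y = lambda D y
    order = np.argsort(eigvals)          # eigenvalues in ascending order
    return eigvecs[:, order[1:m + 1]]    # drop the trivial eigenvector of lambda_0
```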
C. Multi-View Laplacian Eigenmap
With multiple views, say $M$ of them, each view has its own Laplacian matrix $\boldsymbol{L}^{(i)}$ and therefore its own spectral embedding.
Following the method of patch alignment with multi-view spectral embedding for image and video [25], it is proposed that the global view be represented by the global Laplacian matrix $\boldsymbol{L}^{(G)}$, formed as the weighted combination of the individual Laplacian matrices shown in (6) below.\begin{equation*} \boldsymbol{L}^{(G)}=\sum_{i=1}^{M}\left(\alpha^{(i)}\right)^{r}\boldsymbol{L}^{(i)},\quad r>1\tag{6}\end{equation*}
The minimization problem in (5) then becomes (7) as shown below.\begin{equation*} \boldsymbol{Y}^{(G)\ast}=\arg\min_{\boldsymbol{Y}^{(G)T}\boldsymbol{Y}^{(G)}=\boldsymbol{I}}\sum_{i=1}^{M}\left(\alpha^{(i)}\right)^{r}tr\left(\boldsymbol{Y}^{(G)T}\boldsymbol{L}^{(i)}\boldsymbol{Y}^{(G)}\right)\tag{7}\end{equation*}
The hyper-parameter $r>1$ controls how the weights are distributed across the views: a larger $r$ spreads the weights more evenly, while a value closer to 1 concentrates the weight on the view with the smallest alignment cost.
The eigenvectors are arranged in order of the eigenvalue, from the smallest to the largest, up to the specified dimension $m$ of the embedding.
The eigenvectors with the smallest eigenvalues are selected because a compact representation in the projection space is desired. However, since the eigenvector associated with the smallest eigenvalue ($\lambda_{0}=0$) is likely to represent the noise, it will have to be discarded. Thus, only the column vectors associated with $\lambda_{1},\ldots,\lambda_{m}$ are retained to form $\boldsymbol{Y}^{(G)}$.
D. Complementarity
The complementarity of the $i$-th view is computed as shown in (8) below.\begin{equation*} \alpha^{(i)}=\frac{\left(1/tr\left(\boldsymbol{Y}^{(G)\ast T}\boldsymbol{L}^{(i)}\boldsymbol{Y}^{(G)\ast}\right)\right)^{\frac{1}{r-1}}}{\sum_{j=1}^{M}\left(1/tr\left(\boldsymbol{Y}^{(G)\ast T}\boldsymbol{L}^{(j)}\boldsymbol{Y}^{(G)\ast}\right)\right)^{\frac{1}{r-1}}}\tag{8}\end{equation*}
The alternate optimization iterates until the weights converge, i.e. until the change in the weights between consecutive iterations falls below a small threshold, as shown in (9) below.\begin{equation*} \sqrt{\sum_{i=1}^{M}\left(\alpha_{k}^{(i)}-\alpha_{k-1}^{(i)}\right)^{2}}<\varepsilon\tag{9}\end{equation*}
In (9) above, $k$ is the iteration index and $\varepsilon$ is a small positive threshold.
At convergence, the weights $\alpha^{(i)}$ represent the complementarity of the views, and they are used as the mixing coefficients in the linear combination of (1).
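Putting the pieces together, a minimal sketch of the alternate optimization of (6)-(9) is given below. The per-view Laplacian matrices are assumed to be precomputed from $N$ co-occurring, class-specific data vectors, and the values of $r$, $\varepsilon$ and the iteration cap are illustrative only.

```python
import numpy as np

def compute_complementarity(Ls, m, r=2.0, eps=1e-4, max_iter=100):
    """Alternate optimization of the mixing coefficients, following (6)-(9).

    Ls : list of M per-view Laplacian matrices, each of shape (N, N).
    m  : dimension of the global spectral embedding, m < N.
    """
    M = len(Ls)
    alpha = np.full(M, 1.0 / M)                            # initialize to 1/M
    for _ in range(max_iter):
        # (6): global Laplacian as the weighted combination of the per-view Laplacians.
        L_G = sum((a ** r) * L for a, L in zip(alpha, Ls))
        # (7): global embedding from the m smallest non-zero eigenvalues of L_G.
        eigvals, eigvecs = np.linalg.eigh(L_G)
        Y_G = eigvecs[:, np.argsort(eigvals)[1:m + 1]]
        # (8): update the weights from the per-view alignment costs.
        costs = np.array([np.trace(Y_G.T @ L @ Y_G) for L in Ls])
        new_alpha = (1.0 / costs) ** (1.0 / (r - 1.0))
        new_alpha /= new_alpha.sum()
        # (9): stop when the change in the weights falls below the threshold.
        converged = np.linalg.norm(new_alpha - alpha) < eps
        alpha = new_alpha
        if converged:
            break
    return alpha
```

The converged weights can then be applied in (1) to linearly combine the views.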
E. Co-Occurrence and Class-Specificity
The computation of complementarity is a way to produce data that are more representative of the target concept. Thus, when computing complementarity, the data points across the views must describe the same target concept. For time series data, this translates to the following rules:
The data points across the views must be aligned in time, i.e. co-occurring.
The data points must belong to the same class, i.e. class-specific.
1) Co-Occurrence
Co-occurrence does not preclude the shuffling of the data points in the individual views, which is often a necessary operation to achieve independent and identical distribution of the input data for model training. It merely states that the same shuffled order must be used across the views so that the data points across the views will occur at the same time point and thus describe the same target concept.
To ensure co-occurrence at the penultimate layers of the deep learning sub-models, the same data set (shuffled and so random in order) will have to be used as the inputs for all the sub-models. As long as there is no randomization of the data vectors in the sub-models, the outputs at the penultimate layers of the sub-models will be co-occurring too. These outputs, which are co-occurring, can then be used as the views for multi-view learning.
The rule of co-occurrence has to be enforced during both the training and the testing process.
2) Class-Specificity
The rule of class-specificity applies to multi-view learning because complementarity can only be determined among data of the same class. It cannot be used for classes that are different.
The cats and dogs analogy illustrates this idea. For a set of dog images and a set of cat images, it is meaningful to define the complementarity of the images within the sets (either the cats or the dogs) but not across the sets. This is because complementarity is ill defined for a combined data set that has different concepts.
The proposed solution to satisfy the requirement of class-specificity is to re-arrange the outputs of the sub-models by class, yet without disturbing the time order necessary for co-occurrence. Complementarity is then computed on the class-specific data, which is then combined across the views.
This process, when carried out separately for the classes, will result in linearly combined data that are class-specific. The data of these classes will have to be stacked together as one single feature set and then shuffled so that they can be used as the input by the final classifier.
The rule of class-specificity seems to contradict the testing requirement in machine learning, where the class in the test set is assumed unknown. Class-specific data seems impossible when the class information is not available in the test set.
Actually, this is not a problem in the proposed multi-view temporal ensemble. This is because the sub-models can predict the class during testing. The predicted class, instead of the actual class, can be used to re-arrange the outputs of the sub-models. The linearly combined data, based on the predicted classes, are then used by the final classifier for the final prediction.
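A minimal sketch of this test-time rearrangement is shown below. The arrays, the number of views, and the uniform per-class weights are placeholders; in the proposed method the per-class weights come from the complementarity computation described above.

```python
import numpy as np

# Hypothetical inputs: M = 3 co-occurring views of N data points each,
# and the class predicted by the sub-models for each data point.
N, d, num_classes = 600, 128, 50
views = [np.random.randn(N, d) for _ in range(3)]
pred = np.random.randint(num_classes, size=N)

combined, labels = [], []
for c in range(num_classes):
    idx = np.where(pred == c)[0]          # the same indices are taken from every view,
    if idx.size == 0:                     # so co-occurrence is preserved
        continue
    class_views = [V[idx] for V in views]
    # Placeholder per-class mixing coefficients (uniform here for brevity).
    alpha = np.full(len(views), 1.0 / len(views))
    combined.append(sum(a * V for a, V in zip(alpha, class_views)))
    labels.append(np.full(idx.size, c))

# Stack the class-specific blocks into one feature set and shuffle it
# before it is used as the input by the final classifier.
X_final = np.vstack(combined)
y_final = np.concatenate(labels)
perm = np.random.permutation(len(y_final))
X_final, y_final = X_final[perm], y_final[perm]
```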
F. CNN-LSTM Sub-Model
The multi-view temporal ensemble, when applied to time series data, entails some considerations as shown below:
In general, it is a good idea to decompose the time series into the time-frequency representation of the signals so that spectral features are exposed to the learner.
The sub-model will need to be a good learner because a good learner is able to produce data that are smooth with respect to their target class labels, thus making the criterion of local proximity in the spectral embedding achievable.
Different configurations of the same sub-model could be used to generate artificial views of the input data that may not be segmented well.
With the 2-dimensional CNN as the front end, a 1-dimensional CNN can be added on top of it to extract the temporal features across the feature maps. This is then followed by an LSTM to extract the remaining high-level temporal features. Different configurations of such CNN-LSTM models can be used as the sub-models of the multi-view temporal ensemble to produce the views that are needed by multi-view learning.
Data Experiment and Result
This section will describe the data experiment done on the ESC-50 data set [28]. The ESC-50 data set is chosen for this work because the signals are non-stationary with no obvious time-dependent structure. The purpose is to validate the performance of the multi-view temporal ensemble on a time series data set without curation or manual segmentation.
The work is presented in four parts: (1) the description of the data set, (2) the spot-checking to get the general benchmark of the data set, (3) the performance evaluation of the individual views, each of which is a CNN-LSTM model configured in a particular way, and (4) the performance evaluation of the multi-view temporal ensemble, based on the penultimate outputs of the CNN-LSTM sub-models.
A. ESC-50 Data Set
The ESC-50 data set is a univariate numeric time series data set with 2,000 audio recordings constructed from the sound clips in the Freesound project [29]. There are 50 classes, of which 22 classes are the sounds of animals and humans (dog, rooster, etc.), and the rest natural or mechanical sounds (door knock, siren, etc.).
Each of the 50 classes has 40 recordings. Each recording is a 5-second-long .wav file (110,250 samples at 22,050 Hz). The recordings can be decoded with the avconv library package and processed using the LibROSA library package in the Python programming environment.
According to [28], human accuracy in recognizing the sounds in the data set is estimated at 81.3%. The performance varies across the sounds, with a low of 34.1% for washing machine noise and almost 100% for crying babies. It is postulated that trained and attentive listeners could reach 90% accuracy on the data set.
With just 40 recordings per class, there are hardly enough training instances per class for deep learning. To overcome this problem, each 5-second audio clip is split into 9 overlapping segments of 20,992 samples (0.952 second) each. The segments are cut arbitrarily, with no curation other than the removal of segments with very low power, which are likely to be silence.
Within each segment, 41 time-consecutive frames, each with 512 samples (0.023 second), are formed. Each frame is subjected to Fourier transform and converted to the energy values of a 60-bin Mel-frequency cepstrum.
As a result, the data of each segment is a 2-D matrix with 41 time steps and 60 coefficients. The 2-D matrix has a total of 2,460 coefficients in it and is associated with a particular sound class.
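One plausible way to reproduce this front end with the LibROSA package is sketched below; the FFT size, hop length and the use of log-scaled energies are assumptions, chosen only so that a 20,992-sample segment yields a 41 × 60 matrix as described.

```python
import librosa

def segment_to_features(segment, sr=22050, n_mels=60):
    """Convert one 20,992-sample audio segment into a 41 x 60 time-frequency matrix.

    Non-overlapping frames of 512 samples (0.023 s) give 41 frames per segment;
    the FFT size and hop length below are assumptions consistent with that layout.
    """
    S = librosa.feature.melspectrogram(
        y=segment, sr=sr, n_fft=512, hop_length=512, center=False, n_mels=n_mels)
    S_db = librosa.power_to_db(S)    # log-energy values of the 60 Mel bins
    return S_db.T                    # shape (41, 60): time steps x coefficients
```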
B. Spot-Checking
Previous work [15] shows that using a deep learning approach with two convolutional layers with max-pooling followed by two fully connected layers can produce a classification accuracy of 64.5%.
It is also interesting to note that not all deep learning will yield good result on the ESC-50 data set. To show this, a deep learning model with two LSTM layers, a dense layer, and a softmax layer, as shown in Fig. 5 below, was used on the time-frequency representation of the ESC-50 data set.
The result (60.9% accuracy) is less than appealing despite the use of dropout for regularization. This is likely because the LSTM layers do not extract the spectral features as well as CNN layers do.
C. Performance of the Individual Views
Three configurations of the CNN-LSTM model are used in this work, referred to here as View 1, 2, and 3.
The CNN-LSTM model used for View 1 is shown in Fig. 6 below. It consists of two groups of 2-D CNN layers, one 1-D CNN layer, one LSTM layer, a fully connected dense layer, and a softmax layer. It has 1,454,226 trainable parameters.
The input of the CNN-LSTM model is a tensor of size (1, 41, 60), where the number of channels is 1, the number of time steps is 41, and the number of attributes is 60. As Fig. 6 shows, the input is filtered by 32 kernels in the first CNN layer. This will result in 32 feature maps. After max pooling by a 2×2 window, dropout is applied to complete the first 2-D CNN group.
The 2-D CNN group (CNN, max pooling and dropout) is then repeated, this time with 64 kernels, giving rise to the second 2-D CNN group. Together, the two 2-D CNN groups serve as a deep learner to capture the invariant features across the time-frequency structure of the audio segment.
The features are then re-organized as a matrix of 10 time-steps of 960 features. This is used as the input for the 1D-CNN layer. The kernels in the 1D-CNN layer have a size of 3 time steps by 960 features, covering all the 960 features in one dimension.
The output from the 1D-CNN layer is fed to an LSTM layer to extract the remaining high-level features. Thereafter, a fully connected layer with ReLU activation is used with a softmax layer to implement multi-class classification.
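A Keras sketch of a CNN-LSTM sub-model along the lines of View 1 is given below. The kernel sizes, pooling windows, dropout rates, Conv1D filter count and LSTM width are assumptions where the text does not specify them, and channels-last ordering is used for convenience; the exact parameter count will therefore differ from the 1,454,226 reported above.

```python
from tensorflow.keras import layers, models

num_classes = 50

model = models.Sequential([
    layers.Input(shape=(41, 60, 1)),                   # time steps x Mel bins x 1 channel
    # First 2-D CNN group: 32 kernels, max pooling, dropout.
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    # Second 2-D CNN group: 64 kernels.
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    # Re-organize as 10 time steps of 15 x 64 = 960 features for the 1-D CNN.
    layers.Reshape((10, 960)),
    layers.Conv1D(128, 3, padding="same", activation="relu"),
    # LSTM to extract the remaining high-level temporal features.
    layers.LSTM(128),
    # Fully connected penultimate layer: its activations form the view.
    layers.Dense(128, activation="relu", name="penultimate"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The view can then be read out with a truncated model, e.g. `view_extractor = models.Model(model.input, model.get_layer("penultimate").output)`.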
Validation of the performance of the CNN-LSTM model for View 1 is done with a 66/33 training/test split. The classification accuracy is used as the performance metric, since the data set is balanced. Table 1 shows the View 1 result (classification accuracy) over 20 epochs, and shows that the result has converged over the epochs. The final result, at 83.94%, is close to the reported top scores for this data set.
The configurations of the CNN-LSTM models for View 2 and View 3 differ from that for View 1 in terms of the number of kernels used in the two 2-D CNN groups. Instead of 32 and 64 kernels for View 1, they are 8 and 16 for View 2, and 16 and 32 for View 3. As a result, the numbers of trainable parameters of the CNN-LSTM models for View 2 and View 3 are 990,834 and 1,138,898 respectively.
There is no sure way of knowing which configuration is better for the given data set, and the purpose here is not to select a good configuration. Rather, it is to use the different configurations to generate multiple artificial views of the data, so that the views can be linearly combined based on their complementarity to boost the generalization performance.
Based on the CNN-LSTM model for View 2, the results over 20 epochs are shown in Table 2 below. At 82.64%, it is close to the results of View 1.
The results produced by the CNN-LSTM model for View 3 are shown in Table 3 below. At 83.06%, it is, again, similar to the results of View 1 and View 2.
The results in Table 1, Table 2 and Table 3 show that the single views from the CNN-LSTM models are sufficient for state-of-the-art performance for sound classification. The purpose of this work, however, is to show that when a number of such views are available from the same set of time series data, multi-view learning based on the proposed complementarity can boost the performance further.
D. Performance of the Multi-View Temporal Ensemble
The penultimate layer of the CNN-LSTM model has 128 nodes. These are the features as extracted by the model. They form the view of the time series data as seen by the model.
There are three CNN-LSTM models in the ensemble, each configured differently from the rest. As such, the ensemble has three complementary views.
In both training and testing, the views will have to be computed for complementarity in small mini-batches of $N$ co-occurring, class-specific data vectors, as described in Section II.
The final classifier used here is a neural network with a hidden layer and a 50-output softmax layer. As for all classifiers, it will have to be trained before it can be used for prediction. Since its training data are now more representative of the target concept, compared to the single view of the CNN-LSTM models, better performance is expected.
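A minimal Keras sketch of such a final classifier is shown below; the hidden-layer width of 64 is an assumption, and the 128-dimensional input corresponds to the linearly combined view.

```python
from tensorflow.keras import layers, models

final_classifier = models.Sequential([
    layers.Input(shape=(128,)),               # the linearly combined view
    layers.Dense(64, activation="relu"),      # hidden layer (width is an assumption)
    layers.Dense(50, activation="softmax"),   # one output per ESC-50 class
])
final_classifier.compile(optimizer="adam",
                         loss="sparse_categorical_crossentropy",
                         metrics=["accuracy"])
```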
Validation of the performance of the proposed multi-view temporal ensemble is done with a 66/33 training/test split. The classification accuracy improved to 85.5%, which is better than that of any of the single views.
Fig. 7 below shows the performance of the multi-view temporal ensemble versus those of the individual views. It shows that the complementary data in the individual views boost the system performance when blended according to their complementarity.
Fig. 7. Comparison of MTE vs. individual views (ESC-50) in terms of classification accuracy.
The mean and standard deviation of 10-times bootstrap resampling are used to compare the models, as shown in Table 4 below. It shows that the accuracies of the individual sub-models (83.59%, 81.05%, and 82.58%) improve to 85.97% when the multi-view temporal ensemble is used to blend the views, which are then reclassified by the final classifier.
The better performance is due to the final classifier having a more complete view of the underlying phenomenon. This can be explained from the perspective of the bias-variance dilemma [30]. With complementary views, the consensus between the views reduces the variance while the increase in the discriminatory information reduces the bias.
Conclusion
In this paper, the exploration of deep learning was extended to ensemble techniques and multi-view learning. An intermediate data fusion technique, called the multi-view temporal ensemble, is proposed for use with time series data such as sound, to boost the generalization performance of classification. In the proposed method, the outputs of the sub-models in the ensemble are linearly combined according to their complementarity, so that the features used as the input by the final classifier are more representative of the target concept.
It is proposed that the cost function of the Laplacian eigenmap be adopted for alternate optimization to solve the two-fold problem: (1) the mixing coefficients are unknown, and (2) the global view (i.e. the weighted sum of the individual views) is also unknown. The alternate update of the two unknowns will result in the minimization of the cost function, resulting in the convergence of the mixing coefficients. This technique can be used with time series data with two rules: (1) co-occurrence, and (2) class-specificity.
A CNN-LSTM ensemble framework was described and tested with a time series data set. The result shows that, without manual segmentation and curation, the time series data can be classified with better generalization performance in the multi-view setting than with deep learning based on a single view alone.