Deep Residual Learning with Dilated Causal Convolution Extreme Learning Machine

A feedforward neural network with random weights (RW-FFNN) uses a randomized feature map layer. This randomization enables the optimization problem to be replaced by a standard linear least-squares problem, which offers a major advantage in terms of training speed. An extreme learning machine (ELM) is a well-known RW-FFNN that can be implemented as a single-hidden-layer feedforward neural network. However, for a large dataset, owing to the shallow architecture, such an ELM typically requires a very large number of nodes in a single hidden layer to achieve a sufficient level of accuracy. In this paper, we propose a deep residual learning method with a dilated causal convolution ELM (DRLDCC-ELM). The baseline layer performs feature mapping to predict the target features based on the input features. The subsequent residual-compensation layers then iteratively remodel the uncaptured prediction errors in the previous layer. The proposed network architecture also adopts dilated causal convolution based on the ELM in each layer to effectively expand the receptive field of the multilayer network. The results of experiments involving acoustic scene classification of daily activities in a home environment confirmed that the proposed DRLDCC-ELM outperforms the previously proposed residual-compensation ELM and deep-residual-compensation ELM methods. We also confirmed that the generalization capability of the proposed DRLDCC-ELM tends to be superior to that of convolutional neural network-based models, especially for a large number of parameters.


I. INTRODUCTION
A feedforward neural network with random weights (RW-FFNN) uses a randomized feature map layer that is defined independently of the training data. This randomization offers a major advantage in that optimization becomes a standard linear least-squares problem, which can be solved using a single analytical step implemented efficiently in most linear algebra libraries. Ideas similar to the RW-FFNN have been proposed many times, in different forms, over the last several decades [1]. The current form of the RW-FFNN was originally introduced as a random vector functional link network [2]. Recently, an RW-FFNN in the form of a single-hidden-layer feedforward neural network (SLFN) has been proposed and is known as an extreme learning machine (ELM) [3]. In an ELM, the weights between the input layer and the hidden layer, and the biases of the hidden layer, are randomly generated and do not need to be adjusted. The weights between the hidden and output layers are analytically determined using the least-squares method. Over the last decade, researchers have devoted attention to comparing ELMs with other machine learning algorithms, including support vector machines [50], random neural networks [51]-[54], radial basis function neural networks [55]-[58], and other deep learning methods. Generally speaking, ELMs are faster than those methods and their variants tend to be more robust [4]. ELMs have been applied to many learning tasks for classification, clustering, and regression. Because of their superiority in terms of training speed, accuracy, and generalization, ELMs have been used in fields such as medicine [14]-[29], chemistry [30]-[32], economics [33]-[40], transportation [41]-[45], robotics [46]-[49], and others [4].
However, for a large dataset, owing to the shallow architecture, an ELM typically requires a very large number of nodes in a single hidden layer to achieve a sufficient level of accuracy, which makes the process computationally intensive. To overcome this, multilayer ELMs have been proposed in recent years. Kasun et al. proposed an ELM autoencoder (ELM-AE) [5] in which a multilayer ELM was constructed by iteratively stacking several layers. This ELM-AE was proposed mainly for classification problems rather than regression problems. Zhang et al. proposed a residual-compensation ELM (RC-ELM) for regression problems, which compensated for prediction errors [6]. The RC-ELM employs a multilayer structure consisting of a baseline layer and several residual-compensation layers. The baseline layer constructs a feature map that predicts the target features from the input features. The residual-compensation layers iteratively remodel the uncaptured prediction errors in the previous layer, where the same input features fed into the baseline layer are repeatedly fed into the residual-compensation layers. Additionally, the output of each individual layer is not fed into the input of another layer. Therefore, an RC-ELM has an ensemble network structure rather than a deep network structure. However, it has been reported [8] that an RC-ELM cannot perform effective error correction and tends to overfit once three residual-compensation layers have been stacked. Chen et al. proposed a deep-residual-compensation ELM (DRC-ELM) to improve the robustness and generalization capabilities of an RC-ELM [8]. A DRC-ELM borrows the basic idea from a residual neural network [9] and has a deep network structure.
The architecture of a DRC-ELM is slightly different from that of a residual neural network; in the former, the predicted target features for the previous layer are concatenated with the original features fed into the baseline layer, and the new concatenated features are then fed into the following layer. The input features fed into the baseline layer are thus repeatedly fed into the residual-compensation layers, as in the case of an RC-ELM. Chen et al. applied a DRC-ELM to predict gold prices based on those for the five previous trading days and several other indices, and demonstrated the effectiveness of the method [8]. However, for time-series predictions based on past observations over a longer time span, a DRC-ELM is not practical because of the intensive calculations required.
In this paper, we propose a modified architecture for deep residual learning using a dilated causal convolution ELM that is applicable to learning tasks such as recognition, classification, and regression of a time series of feature vectors. The proposed network architecture effectively utilizes past input features over a longer time span and offers improved robustness and generalization capabilities. Our architecture adopts a dilated causal convolution based on an ELM in each layer to effectively expand the receptive field of the multilayer network. This idea stems from WaveNet [11]. In a dilated convolution, a filter is applied to an area larger than the filter width by skipping the input features with a certain step size. The dilation step size can be changed layer by layer. Stacked dilated convolutions enable networks to have very large receptive fields using only a few layers, while preserving the input resolution throughout the network as well as the computational efficiency [11]. In the baseline layer, the original input and time-delayed features are concatenated and fed into the ELM to achieve dilated causal convolution, where the delay time corresponds to the dilation size of the ELM filter with randomized weights. The ELM is then trained to directly predict the original target features. In the residual-compensation layers, the original input features and the prediction of the previous stacked layer are concatenated, and the results are then concatenated with the time-delayed features and fed into the ELM. By using a combination of both the original input features and the predicted target features as the ELM input, together with the corresponding prediction residuals, the ELM for the residual-compensation layer can effectively learn the prediction characteristics of the previously stacked layers. This is expected to allow the ELM to accurately infer the residual of the previous layer.
In addition, because the proposed method executes the dilated causal convolutions with different dilation sizes for each layer, this is expected to improve the learning efficiency of the ELM compared with a DRC-ELM that repeatedly feeds the same original input features into each layer. Hereinafter, we refer to the proposed model as a deep residual learning with dilated causal convolution ELM (DRLDCC-ELM).

II. EXTREME LEARNING MACHINE
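The ELM in the present study is an RW-FFNN in the form of an SLFN, as depicted in Fig. 1. Here, we assume that the input layer, hidden layer, and output layer consist of M, L, and K nodes, respectively. Let $w_{l,m}$ denote the weight between the m-th node in the input layer and the l-th node in the hidden layer, and let $b_l$ denote the bias at the l-th node in the hidden layer. The output of the l-th node in the hidden layer is then given by

$$h_l(\mathbf{x}_n) = g\left(\sum_{m=1}^{M} w_{l,m}\, x_{n,m} + b_l\right), \tag{1}$$

where $\mathbf{x}_n = [x_{n,1} \cdots x_{n,M}]^{\mathsf{T}}$, $n = 1, \cdots, N$ represents the input feature vectors, N denotes the number of input feature vectors, and $g(\cdot)$ represents a nonlinear activation function. The weights and biases are generated randomly. Huang et al. [10] proposed a method to orthogonalize the weights so that the network can extract a more complete set of features in comparison with non-orthogonalized weights, and reported that the use of orthogonalized weights can achieve better performance than even their well-trained counterparts. Inspired by this, the proposed method adopts orthonormalized weights obtained by applying the Gram-Schmidt orthonormalization method to randomly generated weights. If the number of nodes in the hidden layer, L, is larger than the number of nodes in the input layer, M, the weight vectors $\mathbf{w}_l = [w_{l,1} \cdots w_{l,M}]^{\mathsf{T}}$, $l = 1, \cdots, L$, are orthonormalized, and the bias vector $\mathbf{b} = [b_1 \cdots b_L]^{\mathsf{T}}$ is normalized as $\mathbf{b}^{\mathsf{T}}\mathbf{b} = 1$. Next, let $\beta_{l,k}$ denote the weight between the l-th node in the hidden layer and the k-th node in the output layer. The predicted output value for the k-th node in the output layer is given by

$$y_k(\mathbf{x}_n) = \sum_{l=1}^{L} \beta_{l,k}\, h_l(\mathbf{x}_n). \tag{2}$$

We define the predicted target feature matrix based on all the predicted target feature vectors $\hat{\mathbf{y}}_n = [y_1(\mathbf{x}_n) \cdots y_K(\mathbf{x}_n)]^{\mathsf{T}}$, $n = 1, \cdots, N$ as

$$\hat{\mathbf{Y}} = [\hat{\mathbf{y}}_1 \cdots \hat{\mathbf{y}}_N]^{\mathsf{T}} = \mathbf{H}\boldsymbol{\beta}, \tag{3}$$

where $\mathbf{H}$ is the $N \times L$ matrix whose $(n, l)$ element is $h_l(\mathbf{x}_n)$ and $\boldsymbol{\beta}$ is the $L \times K$ matrix whose $(l, k)$ element is $\beta_{l,k}$. Given a training dataset $(\mathbf{x}_n, \mathbf{t}_n)$, where $\mathbf{t}_n$ denotes an original target feature vector $\mathbf{t}_n = [t_{n,1} \cdots t_{n,K}]^{\mathsf{T}}$, the optimal weight matrix, $\boldsymbol{\beta}^*$, can be obtained as a solution to a standard regularized least-squares problem [10]:

$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \left( \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \frac{C}{2}\|\mathbf{T} - \mathbf{H}\boldsymbol{\beta}\|^2 \right), \tag{4}$$

where $\mathbf{T}$ denotes the original target feature matrix $\mathbf{T} = [\mathbf{t}_1 \cdots \mathbf{t}_N]^{\mathsf{T}}$ and C denotes a user-defined scalar for weighting the prediction error term. The closed-form solution of Eq. (4) is given by

$$\boldsymbol{\beta}^* = \left(\mathbf{H}^{\mathsf{T}}\mathbf{H} + \frac{\mathbf{I}}{C}\right)^{-1}\mathbf{H}^{\mathsf{T}}\mathbf{T}. \tag{5}$$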
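The training procedure of this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: QR factorization stands in for explicit Gram-Schmidt orthonormalization, the bias vector is scaled to unit norm, ReLU is used as the activation function, and the closed-form solution is taken to be the standard regularized least-squares form $(\mathbf{H}^{\mathsf{T}}\mathbf{H} + \mathbf{I}/C)^{-1}\mathbf{H}^{\mathsf{T}}\mathbf{T}$. All function and variable names are hypothetical.

```python
import numpy as np

def orthonormal_weights(n_hidden, n_input, rng):
    """Random (n_hidden, n_input) matrix with orthonormal columns (or rows).

    QR factorization is used here in place of explicit Gram-Schmidt;
    both yield an orthonormal basis of the same random subspace.
    """
    a = rng.standard_normal((n_hidden, n_input))
    if n_hidden >= n_input:
        q, _ = np.linalg.qr(a)      # columns orthonormal: q.T @ q = I
        return q
    q, _ = np.linalg.qr(a.T)        # otherwise make the rows orthonormal
    return q.T

def train_elm(x, t, n_hidden, c=1.0, seed=0):
    """x: (N, M) inputs, t: (N, K) targets. Returns (w, b, beta)."""
    rng = np.random.default_rng(seed)
    w = orthonormal_weights(n_hidden, x.shape[1], rng)
    b = rng.standard_normal(n_hidden)
    b /= np.linalg.norm(b)                      # unit-norm bias (assumption)
    h = np.maximum(x @ w.T + b, 0.0)            # hidden outputs with ReLU
    # Regularized least squares: beta = (H^T H + I/C)^{-1} H^T T
    beta = np.linalg.solve(h.T @ h + np.eye(n_hidden) / c, h.T @ t)
    return w, b, beta

def predict_elm(x, w, b, beta):
    """Predicted target matrix H @ beta for new inputs x."""
    return np.maximum(x @ w.T + b, 0.0) @ beta

# Small synthetic usage example (data are random, for illustration only).
rng = np.random.default_rng(1)
x_demo = rng.standard_normal((200, 8))
t_demo = np.sin(x_demo[:, :2])
w_demo, b_demo, beta_demo = train_elm(x_demo, t_demo, n_hidden=50)
y_demo = predict_elm(x_demo, w_demo, b_demo, beta_demo)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse, which is usually preferable numerically when L is in the thousands.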

III. PROPOSED METHOD
The proposed model (DRLDCC-ELM) adopts a dilated causal convolution ELM in each layer to effectively expand the receptive field of a deep-residual-compensation ELM. Figs. 2(a) and 2(b) depict the baseline layer and residual-compensation layer for the DRLDCC-ELM, respectively. In contrast to a residual neural network in which the entire network is simultaneously trained using backpropagation, the DRLDCC-ELM trains each ELM layer by layer from the baseline layer and stacks the trained ELMs in order. Let $\mathbf{u}_n \in \mathbb{R}^{P}$ and $\mathbf{v}_n \in \mathbb{R}^{Q}$ denote an original input feature vector and an original target feature vector, respectively. During training, the pairs of input and output feature vectors $(\mathbf{u}_n, \mathbf{v}_n)$, $n = 1, \cdots, N$ are given. The following describes the training method for the baseline layer and residual-compensation layers.
In the baseline layer, the original input feature vectors are directly fed into the feature preprocessing block, which applies a time delay and performs concatenation. Let $d_1$ denote the time-delay parameter that determines the dilation step. The original input feature vector, $\mathbf{u}_n \in \mathbb{R}^{P}$, and its time-delayed vector, $\mathbf{u}_{n-d_1}$, are concatenated as

$$\mathbf{x}_n = \left[\mathbf{u}_n^{\mathsf{T}} \;\; \mathbf{u}_{n-d_1}^{\mathsf{T}}\right]^{\mathsf{T}}, \tag{6}$$

which is an input feature vector fed into the next ELM. The time-delayed vectors in the range of $n - d_1 < 1$ were set to $\mathbf{u}_{n-d_1} = \mathbf{0}$. The dimension of the ELM input feature vector is $2P$ and each element value of the vector is assigned to each node in the input layer, so the number of nodes in the input layer is also set to $M = 2P$. The ELM in the baseline layer is trained to directly predict the original target feature vector, $\mathbf{v}_n \in \mathbb{R}^{Q}$. Thus, the target feature vector of the ELM is given by

$$\mathbf{t}_n = \mathbf{v}_n. \tag{7}$$

Because the dimension of the target feature vector is $Q$, the number of nodes in the output layer is set to $K = Q$. The number of nodes in the hidden layer, $L$, is a user-defined parameter. The training of the ELM is conducted using the training dataset $(\mathbf{x}_n, \mathbf{t}_n)$ given by Eqs. (6) and (7), respectively. We first prepare the randomly generated weight vectors, $\mathbf{w}_l = [w_{l,1} \cdots w_{l,M}]^{\mathsf{T}}$, $l = 1, \cdots, L$, and the bias vector, $\mathbf{b} = [b_1 \cdots b_L]^{\mathsf{T}}$, and then orthonormalize these vectors according to the method described in section II. The output values in the hidden layer are then given by Eq. (1). Matrix $\mathbf{H}$ defined in Eq. (3) is constructed using the output values in the hidden layer. Using this matrix and the original target feature matrix, $\mathbf{T}$, the optimal weight matrix, $\boldsymbol{\beta}^*$, is given by Eq. (5). The ELM's predicted target feature matrix, $\hat{\mathbf{Y}}$, can be calculated using Eq. (3) by replacing the weight matrix, $\boldsymbol{\beta}$, with the obtained optimal matrix, $\boldsymbol{\beta}^*$.
Because the ELM in the baseline layer is intended to predict the original target feature vectors, $\mathbf{v}_n$, the target feature vectors, $\hat{\mathbf{y}}_n$, predicted by the ELM are regarded as the original target feature vector prediction, $\hat{\mathbf{v}}_n^{1}$:

$$\hat{\mathbf{v}}_n^{1} = \hat{\mathbf{y}}_n, \tag{8}$$

where the superscript 1 indicates that the output feature vector is obtained from the baseline layer.
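The feature preprocessing block (time delay followed by concatenation) can be illustrated with a short NumPy sketch. It assumes that frames whose delayed index falls before the start of the sequence are filled with zeros; the function name `delay_and_concat` is hypothetical and chosen only for this illustration.

```python
import numpy as np

def delay_and_concat(z, d):
    """Concatenate each frame with the frame d steps earlier.

    z: (N, D) array holding a time series of D-dimensional feature vectors.
    d: time delay (dilation step), a non-negative integer.
    Returns an (N, 2D) array whose n-th row is [z[n], z[n - d]].
    Rows with n - d before the start of the sequence use zeros (assumption).
    """
    delayed = np.zeros_like(z)
    if d < len(z):
        delayed[d:] = z[:len(z) - d]
    return np.concatenate([z, delayed], axis=1)

# Usage example: five 2-dimensional frames, delay of two frames.
z_demo = np.arange(10, dtype=float).reshape(5, 2)
x_demo = delay_and_concat(z_demo, 2)
```

Because only the current frame and one delayed frame are concatenated, the ELM input dimension stays at twice the feature dimension regardless of how large the delay is.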
In the residual-compensation layers, several layers were trained and stacked on the baseline layer. To distinguish the layers, we introduce a layer index denoted by #$s$, where $s = 1$ indicates the baseline layer and $s > 1$ indicates the residual-compensation layers. In the following, residual-compensation layer #$s$ is assumed to be trained. Let $\hat{\mathbf{v}}_n^{s-1} \in \mathbb{R}^{Q}$ denote the predicted original target feature vector for the previous layer, which is to be concatenated with the original input feature vector, $\mathbf{u}_n \in \mathbb{R}^{P}$, to form a new feature vector, $\mathbf{z}_n^{s}$, as follows:

$$\mathbf{z}_n^{s} = \left[\mathbf{u}_n^{\mathsf{T}} \;\; (\hat{\mathbf{v}}_n^{s-1})^{\mathsf{T}}\right]^{\mathsf{T}}. \tag{9}$$

This feature vector, $\mathbf{z}_n^{s}$, is then fed into the feature preprocessing block, which introduces a time delay of $d_s$ and performs concatenation. The ELM input feature vector in this layer is given by

$$\mathbf{x}_n = \left[(\mathbf{z}_n^{s})^{\mathsf{T}} \;\; (\mathbf{z}_{n-d_s}^{s})^{\mathsf{T}}\right]^{\mathsf{T}}. \tag{10}$$

The dimension of the ELM input feature vector is $2(P + Q)$, so the number of nodes in the input layer is also $M = 2(P + Q)$. The ELM in this residual-compensation layer is trained to predict the residual feature vector for the previous layer, $\mathbf{e}_n^{s-1}$. Thus, the target feature vector for the ELM is given by

$$\mathbf{t}_n = \mathbf{e}_n^{s-1} = \mathbf{v}_n - \hat{\mathbf{v}}_n^{s-1}. \tag{11}$$

Because the dimension of the target feature vector is $Q$, the number of nodes in the output layer is set to $K = Q$. The number of nodes in the hidden layer, $L$, is a user-defined parameter. The training of the ELM is conducted using the training dataset $(\mathbf{x}_n, \mathbf{t}_n)$ given by Eqs. (10) and (11), respectively. The ELM was trained using the same procedure as that for the baseline layer. Because the ELM in the residual-compensation layer is intended to predict the residual feature vectors for the previous layer, $\mathbf{e}_n^{s-1}$, the predicted target feature vectors for the ELM, $\hat{\mathbf{y}}_n$, are regarded as the prediction for the residual feature vector, $\hat{\mathbf{e}}_n^{s-1}$:

$$\hat{\mathbf{e}}_n^{s-1} = \hat{\mathbf{y}}_n. \tag{12}$$

Therefore, the prediction for the original target feature vector at this layer, $\hat{\mathbf{v}}_n^{s}$, is given by

$$\hat{\mathbf{v}}_n^{s} = \hat{\mathbf{v}}_n^{s-1} + \hat{\mathbf{e}}_n^{s-1}. \tag{13}$$

For $S$ stacked layers, the prediction for the final layer, $\hat{\mathbf{v}}_n^{S}$, for the original target feature vector can be rewritten by recursively applying Eq. (13), as follows:

$$\hat{\mathbf{v}}_n^{S} = \hat{\mathbf{v}}_n^{1} + \sum_{s=2}^{S} \hat{\mathbf{e}}_n^{s-1}. \tag{14}$$

As discussed by He et al. [9], if the added layers can be constructed as identity mappings, a deeper model should have a training error no greater than its shallower counterpart. In the proposed model, if the $(s-1)$-th layer has already achieved perfect prediction such that $\hat{\mathbf{v}}_n^{s-1} = \mathbf{v}_n$, the following $s$-th residual-compensation layer should be an identity mapping. This identity mapping can be easily constructed: from Eq. (11), the ELM target feature vectors reduce to $\mathbf{t}_n = \mathbf{e}_n^{s-1} = \mathbf{0}$, so the optimal weight matrix given by Eq. (5) is obtained as $\boldsymbol{\beta}^* = \mathbf{0}$, which results in $\hat{\mathbf{v}}_n^{s} = \hat{\mathbf{v}}_n^{s-1} = \mathbf{v}_n$.
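The layer-by-layer training described above can be sketched as follows. This is a simplified illustration, not the authors' code: it uses plain Gaussian random weights without orthonormalization for brevity, zero-fills delayed frames at the start of the sequence, and assumes the regularized least-squares solution $(\mathbf{H}^{\mathsf{T}}\mathbf{H} + \mathbf{I}/C)^{-1}\mathbf{H}^{\mathsf{T}}\mathbf{T}$ for each ELM. All names (`fit_stack`, `predict_stack`, and so on) are hypothetical.

```python
import numpy as np

def delay_and_concat(z, d):
    """Rows [z[n], z[n - d]], with zeros where n - d is out of range (assumption)."""
    delayed = np.zeros_like(z)
    if d < len(z):
        delayed[d:] = z[:len(z) - d]
    return np.concatenate([z, delayed], axis=1)

def fit_elm(x, t, n_hidden, c, rng):
    """Single ELM: random hidden layer, ridge-regression output weights."""
    w = rng.standard_normal((x.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    h = np.maximum(x @ w + b, 0.0)
    beta = np.linalg.solve(h.T @ h + np.eye(n_hidden) / c, h.T @ t)
    return w, b, beta

def apply_elm(x, params):
    w, b, beta = params
    return np.maximum(x @ w + b, 0.0) @ beta

def layer_input(u, v_hat, d):
    """Baseline layer uses u only; later layers append the previous prediction."""
    z = u if v_hat is None else np.concatenate([u, v_hat], axis=1)
    return delay_and_concat(z, d)

def fit_stack(u, v, delays, n_hidden=64, c=1.0, seed=0):
    """Train one ELM per delay; layers after the first fit the residual."""
    rng = np.random.default_rng(seed)
    layers, errors, v_hat = [], [], None
    for d in delays:
        x = layer_input(u, v_hat, d)
        target = v if v_hat is None else v - v_hat
        params = fit_elm(x, target, n_hidden, c, rng)
        out = apply_elm(x, params)
        v_hat = out if v_hat is None else v_hat + out   # accumulate corrections
        layers.append(params)
        errors.append(float(np.sum((v - v_hat) ** 2)))
    return layers, v_hat, errors

def predict_stack(u, layers, delays):
    v_hat = None
    for params, d in zip(layers, delays):
        out = apply_elm(layer_input(u, v_hat, d), params)
        v_hat = out if v_hat is None else v_hat + out
    return v_hat

# Synthetic usage example (random data, for illustration only).
rng = np.random.default_rng(0)
u_demo = rng.standard_normal((300, 5))
v_demo = np.tanh(u_demo[:, :2] + np.roll(u_demo[:, 2:4], 3, axis=0))
delays_demo = [1, 2, 4, 8]
layers_demo, v_hat_demo, errors_demo = fit_stack(u_demo, v_demo, delays_demo)
```

In this sketch, the training squared error recorded in `errors_demo` cannot increase from one layer to the next, because a zero output weight matrix is always a feasible solution of each layer's regularized least-squares problem. This mirrors the identity-mapping argument given above.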
The dilation pattern in the proposed method is the same as that used in WaveNet, in that the dilation is doubled for every layer up to a certain limit and the pattern is then repeated [11]:

$$d_s = 2^{(s-1) \bmod W}, \tag{15}$$

where $W$ denotes the repeat period. The receptive field of the multilayer network stacking $S$ layers is then given by

$$R(S, W) = 1 + \sum_{s=1}^{S} d_s. \tag{16}$$

For instance, in the case of $S = W = 4$, the dilation pattern is represented by the time delay parameters, $(d_s)_{s=1,\cdots,4} = (1, 2, 4, 8)$. The prediction is equivalently conducted according to the block diagram shown in Fig. 3. This multilayer network calculates the prediction of the final layer, $\hat{\mathbf{v}}_n^{4}$, based on the original input feature vectors in a range from time $n$ to $n - 15$. Thus, the receptive field is 16, as given by Eq. (16).
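The dilation pattern and the resulting receptive field can be checked numerically with a short sketch. It assumes that the delay of layer s is 2 raised to the power (s - 1) modulo the repeat period, and that the receptive field equals one plus the sum of all delays; both assumptions reproduce the values quoted in the text (16 for four layers, and 1024, 2048, and 4096 for 10, 11, and 12 layers with a repeat period of 12). The function names are hypothetical.

```python
def dilation_pattern(n_layers, period):
    """Time delays for layers 1..n_layers: doubled each layer, repeated every `period`."""
    return [2 ** ((s - 1) % period) for s in range(1, n_layers + 1)]

def receptive_field(n_layers, period):
    """Number of input frames (current frame included) that reach the final layer."""
    return 1 + sum(dilation_pattern(n_layers, period))

# Worked values from the text.
pattern_demo = dilation_pattern(4, 4)
field_demo = receptive_field(4, 4)
fields_w12 = [receptive_field(s, 12) for s in (10, 11, 12)]
```

With a short repeat period the pattern wraps around, for example `dilation_pattern(6, 4)` yields delays of 1, 2, 4, 8, 1, 2, so the receptive field grows more slowly than in the purely doubling case.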

IV. EXPERIMENTS
The proposed DRLDCC-ELM has the following characteristics:
A. Several residual-compensation layers are iteratively stacked to remodel the uncaptured prediction error for the previous layer.
B. The predicted target features for the previous layer are fed into the next residual-compensation layer following concatenation with the original input features.
C. Each layer is subjected to a dilated causal convolution, where the dilation step size is different for each layer.
Characteristic A is shared by the RC-ELM, DRC-ELM, and DRLDCC-ELM. Characteristic B is lacking in the RC-ELM because the predicted target features for the previous layer are not fed into the next layer. Characteristic C is lacking in both the RC-ELM and DRC-ELM because some of the same input features are repeatedly fed into each layer.
Characteristics B and C are very important for the proposed DRLDCC-ELM. In the following experiments, the effectiveness of these characteristics was evaluated.

A. DATASET
The task we chose was acoustic scene classification of daily activities in a home environment, and we conducted experiments using a dataset named "SINS" that contains recordings of the activities of a person at home obtained using an acoustic sensor network. The dataset was collected in a vacation home consisting of five different rooms: a combined living room and kitchen, a bathroom, a toilet, a bedroom, and a hall. Thirteen sensor nodes, each containing four microphones, were distributed uniformly around the five rooms [12]. Each audio channel was sampled at a rate of 16 kHz with a bit depth of 12. In our experiments, we selected the recordings of living room activities because they were the most varied. Nine kinds of activities were included in our analysis: making a phone call, cooking, dishwashing, eating, visiting, watching TV, working, vacuum cleaning, and the absence of any activity. Other activities were excluded. We used the monaural channel #1 audio signal extracted from the four audio channels recorded by the microphone array of node 7 situated in the living room. Overall, the dataset was highly variable, reflecting the imbalance of the activities that occur in daily life. To balance the dataset, we therefore extracted data segments with a duration of one minute that included only one activity, and, for each activity, we concatenated the segments to generate training and testing datasets with durations of 30 minutes and 15 minutes, respectively. Because there were nine kinds of activity, the total durations of these datasets were 270 min and 135 min, respectively.

1) PROPOSED DRLDCC-ELM
During front-end processing, mel-spectrogram features were extracted as original input features, $\mathbf{u}_n \in \mathbb{R}^{P}$, from the monaural audio signals pre-emphasized with a difference filter of $1 - 0.97z^{-1}$. The size and shift of the analysis frame were set to 30 ms and 20 ms, respectively. The dimension of the mel-spectrogram feature vector was $P = 40$. In the proposed DRLDCC-ELM, the concatenated feature vectors given by Eq. (6) are fed into the baseline-layer ELM. Thus, for the baseline layer, the number of nodes in the ELM input layer was set to $M = 2P = 80$. In this experiment, a nine-class classification task was conducted, so the original target feature vectors, $\mathbf{v}_n \in \mathbb{R}^{Q}$, were one-hot vectors with a dimension of nine. The number of nodes in the ELM output layer was set to $K = Q = 9$. For the residual-compensation layers, the input feature vectors given by Eq. (10) were fed into the residual-compensation-layer ELM, so the number of nodes in the ELM input layer was set to $M = 2(P + Q) = 98$. The nonlinear activation functions and other user-defined values were the same for all ELMs. A rectified linear unit (ReLU) function was adopted as the nonlinear activation function in Eq. (1). For the number of nodes in the hidden layer, we used ten values from $L = 1000$ to $10000$, with a step size of 1000. The weight of the prediction error term in Eq. (4) was set to $C = 1.0$. The number of stacked layers was set to $S = 10, 11, 12$. The repeat period for the dilation pattern given in Eq. (15) was set to $W = 12$. The receptive fields for $S = 10, 11, 12$ are given by Eq. (16) as $R = 1024, 2048, 4096$, respectively. The learning model was iterated five times for each condition. Table 1 shows the experimental results for the proposed DRLDCC-ELM, where the average and standard deviation of the average F1 scores evaluated from five trials are presented.
The table presents the total number of learnable parameters, which is given by $S \times L \times K$, for each combination of the number of stacked layers, $S$, and the number of nodes in the hidden layer, $L$. The DRLDCC-ELM predicts the target output features for the same frame-shift period as the original input features. We evaluated the F1 score for each class by comparing the predictions with the correct labels for each frame. We then adopted the average F1 score as a metric for multiclass classification. These results are summarized in Table 1, and the graphs are shown in Fig. 4. The average F1 score increased almost monotonically as the number of nodes in the hidden layer and/or the number of stacked layers increased.
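The evaluation metric and the parameter count can be illustrated with a minimal sketch. It assumes that the class of each frame is chosen as the index of the largest network output (the decision rule is not spelled out in the text) and that the learnable parameters are the output weights only, counted as layers times hidden nodes times output nodes. The function names and the toy labels are hypothetical.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class F1 scores over frame-wise labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for k in range(n_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def frame_labels(v_hat):
    """Assumed decision rule: pick the largest output in each frame."""
    return np.argmax(v_hat, axis=1)

def n_learnable_params(n_layers, n_hidden, n_outputs):
    """Output weights only: one (n_hidden x n_outputs) matrix per layer."""
    return n_layers * n_hidden * n_outputs

# Toy usage example with three classes and six frames.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
f1_demo = macro_f1(y_true_demo, y_pred_demo, 3)
params_demo = n_learnable_params(10, 1000, 9)
```

Under this counting, the smallest configuration in the experiments (10 layers, 1000 hidden nodes, 9 outputs) has 90000 learnable parameters, since the random input weights and biases are never trained.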

2) COMPARISON WITH RC-ELM
To evaluate the effectiveness of the aforementioned characteristic B, instead of comparing the DRLDCC-ELM directly with the original RC-ELM, which lacks characteristics B and C, we conducted experiments using an alternative method lacking only characteristic B. In this alternative method, the predicted target features of the previous layer were not fed into the next residual-compensation layer; only the original input features were fed into it. The alternative method thus lacks characteristic B, as in the case of the original RC-ELM. On the other hand, the alternative method executes the dilated convolution in the same way as the DRLDCC-ELM. Because of this, the alternative method has characteristic C. In the alternative method, the ELM input feature vector in the residual-compensation layer, given by Eq. (10), takes the same form as that in the baseline layer, given by Eq. (6). The total number of learnable parameters is the same as that for the DRLDCC-ELM. In this experiment, we considered only the case where the number of stacked layers was $S = 10$. The learning model was iterated five times for each condition. The results are presented in Table 3, and a graph of the results is shown in Fig. 5. The average F1 score for the DRLDCC-ELM is 21.81% higher than that of the alternative RC-ELM method.

3) COMPARISON WITH DRC-ELM
Next, we evaluated the effectiveness of characteristic C by conducting experiments using DRC-ELM, which lacks this characteristic. Instead of the dilated convolution adopted in the DRLDCC-ELM, $(\tau + 1)$ successive original input features were fed into the baseline layer and were concatenated with the predicted original target feature vector for the previous layer, and then repeatedly fed into the residual-compensation layers. Figs. 6(a) and 6(b) depict the baseline layer and residual-compensation layer for the DRC-ELM, respectively. As shown, the DRC-ELM used in this experiment can be constructed by embedding a successive feature sampling block instead of the feature preprocessing block used in the DRLDCC-ELM, as depicted in Fig. 2.
Owing to the successive feature sampling blocks, the DRC-ELM used in this experiment is a slight extension of the original DRC-ELM such that each residual-compensation layer can utilize the previous layer's past original target feature predictions to predict the current original target feature. When $(\tau + 1)$ successive original input features, including that at the current time, are fed into the ELMs for both the baseline layer and all the residual-compensation layers, each sample time-delay block needs to be concatenated $\tau$ times. (The maximum delay time is thus $\tau$.) All of the time-delayed features as well as the current time feature are concatenated and then fed into each ELM. The receptive field for this type of multilayer network is given by $R(S, \tau) = 1 + S\tau$. To ensure that the experimental conditions were the same as those for the DRLDCC-ELM, where the number of stacked layers was $S = 10$ ($R = 1024$), we set the number of stacked layers and the maximum delay time to $S = 10$ and $\tau = 103$, respectively. In this case, the receptive field for the DRC-ELM is $R = 1031$, which is slightly larger than that for the DRLDCC-ELM and may be advantageous for the DRC-ELM. The learning model was iterated five times for each condition. The results are presented in Table 2 and the graphs are shown in Fig. 5. It can be seen that the average F1 score for the DRLDCC-ELM is higher than that for the DRC-ELM by 6.07%.
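As a quick check of how the two receptive fields were matched, the arithmetic can be written out explicitly. The sketch below assumes that the DRLDCC-ELM delays double in every layer (no wrap-around for 10 layers) and that the DRC-ELM receptive field is one plus the number of layers times the maximum delay; the symbol $\tau$ for the maximum delay is an illustrative choice.

```latex
\begin{align*}
R_{\mathrm{DRLDCC}} &= 1 + \sum_{s=1}^{10} 2^{\,s-1} = 1 + 1023 = 1024,\\
R_{\mathrm{DRC}}    &= 1 + S\tau = 1 + 10 \times 103 = 1031 .
\end{align*}
```

Under these assumptions, 103 is the smallest integer maximum delay for which the 10-layer DRC-ELM covers at least 1024 frames, since a maximum delay of 102 would give only 1021.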

4) COMPARISON WITH CNN-BASED METHOD
Finally, we compared the DRLDCC-ELM with convolutional neural network (CNN)-based models. We constructed CNN-based networks by modifying a previously proposed network architecture [13]. The network architectures used in the following experiments were constructed by combining the element blocks presented in Table 4. The input block consisted of the mel-spectrogram (40 dimensions) of 1024 successive frames; thus, the total number of dimensions was 40 × 1024. Four types of network architectures were constructed by inserting several middle blocks between the input and output blocks, as shown in Table 5. The Adam optimization algorithm was adopted for stochastic gradient descent when training the deep learning models, and the learning rate was set to 0.0001. The learning model was iterated five times for each condition. The average and standard deviation of the F1 scores evaluated from the five trials are presented in Table 6, together with the number of learnable parameters. Fig. 5 shows a graph of the results. When the number of learnable parameters was less than approximately 265k, the CNN-based models outperformed the DRLDCC-ELM model. When the number of learnable parameters was between 265k and 565k, the DRLDCC-ELM performance was comparable to that of the CNN-based models. For a number larger than approximately 565k, the F1 scores for the CNN-based models decreased slightly, whereas the F1 scores for the DRLDCC-ELM continued to increase. Thus, we confirmed that the generalization capability of the DRLDCC-ELM is superior to that of the CNN-based models when the number of parameters is large.

V. HYPERPARAMETER OPTIMIZATION
The proposed DRLDCC-ELM has the following hyperparameters:
• The number, $L$, of nodes in the hidden layers.
• The weight, $C$, for the prediction error term in Eq. (4).
• The number, $S$, of stacked layers in Eq. (16).
• The repeat period, $W$, of the dilation pattern in Eq. (16).
In the present study, the number of nodes in the hidden layers ranged from 1000 to 10000. The weight for the prediction error term was set to an empirically determined value of 1.0. The number of stacked layers and the repeat period for the dilation pattern both have a significant impact on the performance of the resulting network. The grid search method has traditionally been used for hyperparameter optimization. This technique investigates all combinations of hyperparameters by evaluating the performance of the resulting networks. However, when this method is used with the DRLDCC-ELM and there are a large number of hyperparameter combinations, intensive calculations are required, which undermines the main advantage of the ELM, namely that the network weights can be optimized by solving a single standard linear least-squares problem. To avoid this situation, we introduce a fundamental concept to effectively search for the semi-optimal hyperparameters $S$ and $W$ in Eq. (16) from the training data for sound event detection.
In this approach, the DRLDCC-ELM was trained to detect the existence of only one target sound event, frame by frame, from a mel-spectrogram time series. First, we conducted a grid search. The models were trained on all combinations of hyperparameters, where $S$ was 1-30 and $W$ was 1-10. Therefore, the number of hyperparameter combinations was 300, and the number of trained models was also 300. We evaluated the F1 score of each model on the test data. Fig. 7 shows the evaluated F1 scores for all the trained models, where scores obtained with a fixed value of $W$ are connected by a line, and the horizontal axis represents the receptive field, $R(S, W)$, in Eq. (16). Irrespective of the value of $W$, all the lines are single-peaked (upward-convex) curves, and the maximum F1 scores are obtained for receptive fields of approximately 100 frames. We then investigated the duration of the target sound events in the training data. The statistics obtained are as follows: the minimum duration was 46 frames, the maximum duration was 456 frames, and the average duration was 108 frames with a standard deviation of 66 frames. Therefore, the maximum F1 scores tended to be obtained with models whose receptive field was close to the average duration of the target sound event in the training data. This implies that only models whose receptive field is around the average duration of the target sound event in the training data should be considered as candidates for the optimal model. Fig. 8 shows the receptive fields calculated for all combinations of S and W. Receptive fields obtained with a fixed value of W are connected by a line, and the horizontal axis represents the number of stacked layers, $S$. The vertical axis represents the receptive field on a logarithmic scale. In the graph, one dot represents one combination of S and W, that is, one model. For the grid search method, all dots (models) in the graph were trained and evaluated.
However, if we focus only on models with receptive fields that are around the average duration of the target sound event, we can drastically reduce the number of models to be considered. In the graph, the models depicted in the orange area can be regarded as candidates for optimal models.
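The candidate selection idea can be sketched as a simple filter over the hyperparameter grid. The sketch makes several assumptions that are not stated in the text: the receptive field is computed as one plus the sum of the layer delays with delays of 2 raised to (s - 1) modulo the repeat period, the grid covers 1 to 30 stacked layers and repeat periods of 1 to 10, and "around the average duration" is interpreted as within one standard deviation of the mean (108 ± 66 frames). The function names are hypothetical.

```python
def dilation_pattern(n_layers, period):
    """Time delays for layers 1..n_layers, doubled each layer and repeated."""
    return [2 ** ((s - 1) % period) for s in range(1, n_layers + 1)]

def receptive_field(n_layers, period):
    """One plus the sum of all layer delays (assumed form of the receptive field)."""
    return 1 + sum(dilation_pattern(n_layers, period))

def candidate_models(max_layers, max_period, mean_dur, std_dur):
    """Keep (layers, period) pairs whose receptive field lies within
    one standard deviation of the mean event duration (assumed band)."""
    low, high = mean_dur - std_dur, mean_dur + std_dur
    return [
        (s, w)
        for s in range(1, max_layers + 1)
        for w in range(1, max_period + 1)
        if low <= receptive_field(s, w) <= high
    ]

# Event-duration statistics quoted in the text: mean 108, std 66 frames.
candidates = candidate_models(30, 10, mean_dur=108, std_dur=66)
```

Only the models in `candidates` would then need to be trained and evaluated, instead of all 300 grid points; the width of the band is a design choice that trades search cost against the risk of missing the best model.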

VI. CONCLUSIONS
We proposed a modified architecture based on the deep-residual-compensation ELM that effectively utilizes past input features over a longer time span by introducing a dilated causal convolution ELM. In the proposed DRLDCC-ELM, several residual-compensation layers are iteratively stacked to remodel the uncaptured prediction errors in the previous layer, the predicted target features of the previous layer are concatenated with the original input features and fed into the next residual-compensation layer, and each layer has a dilated causal convolution with a different dilation step size. The experimental results confirmed that the latter two features are necessary for the DRLDCC-ELM to outperform the previously proposed RC-ELM and DRC-ELM. We also found that the generalization capability of the DRLDCC-ELM was superior to that of the CNN-based models, especially for larger numbers of parameters.