Feature Extraction Based on the Non-Negative Matrix Factorization of Convolutional Neural Networks for Monitoring Domestic Activity With Acoustic Signals

In this paper, a feature extraction method based on non-negative matrix factorization (NMF) is proposed for classifiers that monitor domestic activities with acoustic signals. Most classifiers of acoustic signals use data-independent spectral features (e.g., log-Mel spectrum and Mel-frequency cepstral coefficients). Recently, some novel feature extraction methods have been researched, including convolutional-NMF-based features combined with K-means clustering. This study proposes an enhanced NMF-based feature extraction method that is inspired by the NMF-based noise reduction algorithm. The proposed method independently estimates the frequency basis matrix for each class and then concatenates the class-wise basis matrices to form the entire set of frequency bases; an acoustic signal is then transformed into the proposed feature by estimating the temporal basis matrix with the trained frequency bases. In addition, this study proposes a data augmentation method for the proposed feature that is inspired by the "mix and shuffle" method for audio waveforms. To evaluate the proposed system, which consists of the proposed NMF-based feature and a convolutional-neural-network-based classifier, evaluations were performed using the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Task 5 database (Monitoring of Domestic Activities Based on Multi-channel Acoustics). The results showed that the proposed system performs comparably to state-of-the-art algorithms and that it improves the F1-score by 6%–12% in comparison with the conventional NMF-based feature extraction method that is based on convolutional NMF and K-means clustering.


I. INTRODUCTION
Acoustic scene classification (ASC) is the task of automatically recognizing environments through acoustic signals. In particular, the ASC task focuses on classifying long audio segments by characterizing whole audio environments, which distinguishes it from sound event detection problems that detect short sound events [1]. The recognition of environments via acoustic signals is one of the main problems of computational auditory scene analysis (CASA) [2], and it has become a major area of interest in many recent machine learning applications, including robotic navigation [3] and personal archiving [4]. As the interest in the ASC problem grows, ASC-related tasks are being researched by several communities, such as the detection and classification of acoustic scenes and events (DCASE) [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Mehul S. Raval.
Both ASC and acoustic event classification (AEC) are among the main problems of analyzing environments through sound signals, and the scene and event classification tasks are not clearly distinguished. Recently, ASC tasks have tended to focus on classifying longer signals by analyzing the whole acoustic environment [1], while AEC tasks deal more with short acoustic events, such as knock sounds or laughs. Also, ASC tasks have recently expanded into monitoring of domestic activity (MDA) tasks, which classify indoor sounds into several activity classes (e.g., vacuum cleaning, cooking, or watching TV). Both the modern ASC and MDA algorithms consist of two parts: feature extraction and classification modules.
The most common choices of audio feature for recent ASC and MDA algorithms are Mel-frequency cepstral coefficients (MFCC) [3], [6], [7] or the Mel-frequency-domain spectrum [8]-[10], which are spectrum-based values processed in a psychoacoustic frequency domain [11]. These features are motivated by their success in speech signal applications, but their performance is limited in acoustic scene or event classification applications, as acoustic environmental signals are less structured [1]. To substitute the Mel-frequency-based features, research on ASC algorithms has explored features inspired by computer vision [12] or by the modeling of statistical distributions [13]. Recently, several psychoacoustics-based features, such as the Mel-frequency discrete wavelet coefficients [14], [15], hybrid constant-Q transform [16], gammatonegram [17], and gammatone-frequency cepstral coefficients [18], have been studied. To improve the performance, combinations of multiple features have also been studied, such as a DNN-based ensemble network of the log-Mel-spectrogram, gammatonegram, and constant-Q transform [19] and an ensemble of label-tree embeddings of the log-Mel-spectrogram, gammatonegram, and MFCC [20].
MFCC and other similar features can be considered data-independent analysis techniques, as the analysis processes required for extracting the features do not depend on the signal characteristics. Recently, data-dependent signal analysis methods, such as principal component analysis (PCA) [21] and non-negative matrix factorization (NMF) [22], have been researched. In particular, the NMF algorithm has been applied to analyze the magnitude spectrogram of an acoustic signal in recent acoustic signal processing, such as music signal processing [23]-[25] and speech denoising [26]-[29], as the NMF technique can decompose the spectrogram into the frequency and temporal basis matrices.
Lee and Seung have shown that the NMF algorithm can analyze two-dimensional non-negative data by using a parts-based representation [22]. For example, when applied to facial images, the NMF algorithm produces decomposed images that correspond to parts of a face, e.g., the mouth and eyebrows, while the vector quantization algorithm produces prototypes of the whole face and the PCA algorithm produces "eigenfaces" that form distorted versions of the whole face. For the analysis of a sound signal, the NMF algorithm decomposes magnitude spectrograms into frequency basis and temporal basis matrices due to the parts-based representation characteristic. When the NMF method is applied to a music signal, each frequency basis and temporal basis can represent the frequency structure and the temporal envelope of a musical note, respectively [23]. In speech denoising applications, the NMF method is used to learn the frequency structures of speech and noise signals a priori, and the temporal basis matrix of each frequency basis is estimated from noisy speech signals [28], [29]. Most of the NMF applications take advantage of the fact that the NMF can analyze the characteristic frequency structure and the temporal activation of the same class of signals.
Recently, the NMF algorithm was applied to acoustic scene classification in both supervised [1] and unsupervised methods [1], [30], [31]. The supervised method was developed based on the task-driven dictionary learning (TDL) model with a multinomial logistic regression [32] and L-BFGS [33]. However, the model and the update algorithms were far from the recent classifiers, such as deep neural networks and gradient-based algorithms, so it was difficult to extend them using recent techniques, such as the convolutional neural network (CNN). The unsupervised methods were developed based on the NMF with time-averaged clips or convolutional NMF with K-means clustering [1], [31], but they required a very large data matrix and additional data reduction processes, such as time averaging or K-means clustering, as they have to deal with the whole uncategorized dataset. If the task is supervised, we may generate the basis matrices via simpler processes.
The task of monitoring domestic activity [34] has the goal of classifying audio segments into predefined classes that are composed of daily activities in home environments, e.g., cooking, dishwashing, vacuum cleaning, etc. In order to achieve this goal, in this paper, we develop a scene classification method based on the NMF and CNN techniques that is as simple and extensible as possible. The proposed system consists of two modules: an NMF-based feature extraction module and a CNN-based classifier module. Our main contribution is the development of simple, supervised feature extraction and augmentation methods based on NMF that are compatible with common classifiers, such as a simple CNN classifier.

II. PROBLEM DESCRIPTION
A. PROBLEM DESCRIPTION
The ASC is a task of classifying audio segments with given durations. The common ASC is defined as the recognition of audio environments, which are defined based on physical or social contexts, such as parks, offices, etc. [35]. However, the goal of the monitoring of domestic activity task is to classify the activities performed by people, such as cooking, dishwashing, working, etc., as shown in Fig. 1. Since the sounds of domestic activities include ensembles of multiple sound events, classifying domestic activities can be regarded as a kind of ASC task [34]. Moreover, the algorithm has to focus on the characteristics of sound events, such as keyboard typing and running water, rather than on the room environments, such as room transfer functions and background noise.
In order to classify domestic activity sounds, features can be extracted from the sound clips and then classified into activity classes with neural-network-based classifiers, just like in many other recent algorithms. Although various classifiers with different network structures have been tried recently, the features adopted were still similar frequency-based features. For example, the log-Mel spectral energies were used as the input feature in 26 of the 31 systems submitted to the DCASE 2018 Task 5. So, we would like to find a different feature extraction strategy that is suitable for the recent neural-network-based classifiers. Fig. 2 shows examples of the magnitude spectrograms of domestic activities. The domestic activity sounds include sets of event sounds that have distinctive characteristics in the time-frequency domain, as shown in Fig. 2. For example, the "vacuum cleaning" sound consists of broadband noise with two tonal lines around 500 Hz and 1 kHz, and the "watching TV" and "social activity" sounds consist of various harmonic components. Both the "eating" and "working" classes may have similar structures (impulsive sounds), but their temporal characteristics (intervals between the impulsive sounds) are quite different. Therefore, we expect that the NMF method can generate distinctive features by analyzing the temporal and spectral characteristics.

B. RELATED WORKS
For the acoustic scene classification problem, V. Bisot et al. have developed two NMF-related methods as mentioned in the introduction. One of them is the supervised TDL model with an L-BFGS optimizer [36]. Although the TDL model demonstrated a good performance in the evaluation, it is difficult to apply it with arbitrary classifiers, as the update equations of the NMF basis and the model weight are strongly combined. The other one is an unsupervised NMF model for feature extraction [1]. The method can be applied to various classifiers, as the NMF-based feature extraction and the classifier learning process are clearly separated. Therefore, this method is consistent with the goal of this study, but it does not use the annotation data, so there is room for improvement if we use the annotation data. Recently, some networks for acoustic signals that are based on the nonnegative auto encoder (NAE) [37], [38], which is a variant of the NMF, have been researched, but there is still not enough research regarding the application of NAE to the ASC tasks.
The unsupervised NMF method performs convolutive NMF to each audio clip to generate a large set of NMF bases, which are then clustered using the K-means clustering technique. Unfortunately, this process is complicated, and it takes a long time to estimate the NMF basis matrix because of the K-means clustering technique. Therefore, in this study, we aimed to develop an NMF-based feature extraction method that is simple and easy to use. Furthermore, we also tried to enhance the classification performance by using the annotation data in the NMF basis learning step.
The NMF method has been tried in previous studies for acoustic scene classification and sound event detection tasks. However, these previous investigations utilize the NMF method as an auxiliary tool to pre-process the input signal or as part of the activity classifier, rather than as a feature extraction method. Zhou et al. [39] proposed an NMF-based sound event detector, but the NMF method was only used to perform noise reduction of the evaluation data. Mesaros et al. [40] also proposed a coupled-NMF-based sound event detector that consisted of data analysis and classification steps based on the NMF. The data analysis step has a similar purpose to the feature extraction of the proposed method, but the dictionary matrix was generated in an unsupervised manner and coupled with the classifier, while the proposed method generates the dictionary matrix in a supervised manner and independently from the classifier. Chan's NMF-CNN structure [41] was developed for the weakly-supervised sound event detection task, whose dataset consisted of data with annotations of the class and onset-offset time (strongly labeled), data with class annotations only (weakly labeled), and data without any annotation (unlabeled). Chan's method may look similar to the proposed method in that it uses both the NMF and the CNN, but the NMF method was simply used to preprocess the weakly labeled and unlabeled data with pseudo-labeling of the onset and offset times, so the design purpose and the structure are totally different from those of the proposed method.

1) NON-NEGATIVE MATRIX FACTORIZATION
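The NMF is a method for estimating the non-negative matrices $W \in \mathbb{R}_+^{K \times R}$ and $H \in \mathbb{R}_+^{R \times N}$ whose product approximates a known non-negative matrix $V \in \mathbb{R}_+^{K \times N}$ as

$$V = WH + E, \tag{1}$$

where $E \in \mathbb{R}^{K \times N}$ is an error matrix. The matrices $W$ and $H$ are estimated by minimizing the cost function between $V$ and $WH$ as [42]

$$W, H = \arg\min_{W, H \ge 0} C(V \mid WH), \tag{2}$$

where $C(A \mid B)$ is the distance measure between the two matrices $A$ and $B$. Various distance measures, e.g., the Euclidean distance and the Kullback-Leibler (KL) divergence, can be used. For the KL divergence,

$$C_{KL}(A \mid B) = \sum_{i,j} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right), \tag{3}$$

the matrices are estimated by iterating the multiplicative update rules

$$W \leftarrow W \otimes \frac{\frac{V}{WH} H^{T}}{1_{K \times N} H^{T}}, \tag{4}$$

$$H \leftarrow H \otimes \frac{W^{T} \frac{V}{WH}}{W^{T} 1_{K \times N}}, \tag{5}$$

where $\otimes$ and the fraction denote element-wise multiplication and division, respectively, and $1_{K \times N}$ means a $K \times N$ matrix whose elements are all one.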
In most of the NMF applications for acoustic signals, the known matrix V is the magnitude spectrogram of the input signals, and R is set to a small value relative to K or N so that the magnitude can be modeled as the multiplication of the matrices W and H, which represent the spectral characteristics and temporal activations of acoustical events, respectively. For example, if the NMF algorithm is applied to a magnitude spectrogram of a music signal that consists of three musical events, each column vector of the matrix W may correspond to a frequency structure, and the row vector of the matrix H may correspond to a temporal envelope of a musical event, as shown in Fig. 3 (a). By focusing on these characteristics of the NMF method in the acoustic signals, several NMF applications have been developed, e.g., the speech denoising [28], [29] and the active sonar reverberation suppression [44], as shown in Fig. 3 (b). Speech denoising methods divide the bases into two classes, speech and noise, and remove the noise bases after calculating the temporal bases of each class. The active sonar reverberation suppression technique uses a similar methodology to that of speech denoising, where it divides the basis matrix into target echo and reverberation classes instead of speech and noise classes. Both the denoising and reverberation suppression methods use the NMF method as a separation tool by pretraining and merging the class-wise frequency bases. Focusing on the music signal applications, we believe that if we consider the matrix W and H as a transform matrix and a feature matrix, respectively, the generated feature matrix by the NMF can be considered as a sparse representation of the input spectrogram, because matrix H is a sparse representation of the input music signal in the music signal processing systems. 
Also, inspired by the denoising and the reverberation suppression methods, we believe that if we construct the frequency basis matrix by concatenating the class-wise frequency basis matrices, the temporal activation pattern, which is the $H$ matrix, may vary depending on the class of the input signal. This is analogous to the speech denoising system, where $H_s$ and $H_n$ represent the temporal activations of the speech and noise classes. Thus, we first propose a method to construct the frequency basis matrix by concatenating the class-wise bases, which differs greatly from the conventional NMF-based feature extraction [1], as described in the next section.
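As an illustration of the decomposition discussed in this section, the following is a minimal NumPy sketch of NMF with multiplicative updates. It assumes the KL divergence as the distance measure; the function names `nmf_kl` and `kl_divergence`, the random initialization, and the small constant `eps` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def nmf_kl(V, R, n_iter=100, eps=1e-12, seed=0):
    """Factorize a non-negative K x N matrix V into W (K x R) and H (R x N).

    Hypothetical helper: random initialization and eps are assumptions.
    """
    rng = np.random.default_rng(seed)
    K, N = V.shape
    W = rng.random((K, R)) + eps
    H = rng.random((R, N)) + eps
    ones = np.ones_like(V)  # the all-ones K x N matrix
    for _ in range(n_iter):
        # W <- W * ((V / WH) H^T) / (1 H^T), element-wise product and division
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        # H <- H * (W^T (V / WH)) / (W^T 1)
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
    return W, H

def kl_divergence(V, V_hat, eps=1e-12):
    """Generalized KL divergence between V and its approximation V_hat."""
    return float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat))
```

With a magnitude spectrogram as `V`, the columns of `W` play the role of frequency bases and the rows of `H` the role of temporal activations.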

2) CLASS-WISE LEARNING OF THE FREQUENCY BASIS MATRIX
As mentioned in the previous section, the NMF method decomposes the spectrogram V into a transform matrix W and a feature matrix H. The matrix W may be estimated before or during the analysis process in the acoustic signal processing applications. However, we decide to learn the matrix W in advance because the NMF method has scale and ordering ambiguities, so it may interfere with the training and inference procedures if it is learned during the analysis.
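Inspired by the NMF-based noise reduction algorithm [28], [44], we divide the basis vectors into $C$ groups as

$$W = [W_1 \; W_2 \; \cdots \; W_C], \tag{6}$$

where $W_c \in \mathbb{R}_+^{K \times R_c}$ is a class-wise frequency basis matrix. $W_c$ is estimated by the iterative update equations

$$W_c \leftarrow W_c \otimes \frac{\frac{V_c}{W_c H_c} H_c^{T}}{1_{K \times NL} H_c^{T}}, \tag{7}$$

$$H_c \leftarrow H_c \otimes \frac{W_c^{T} \frac{V_c}{W_c H_c}}{W_c^{T} 1_{K \times NL}}, \tag{8}$$

where $V_c \in \mathbb{R}_+^{K \times NL}$ and $H_c \in \mathbb{R}_+^{R_c \times NL}$ are the class-wise spectrogram and temporal basis matrix, respectively, and $R_c$, $K$, $N$, and $L$ are the number of bases per class, the number of frequency bins, the number of frames in a clip, and the number of clips in a class, respectively. The data matrix $V_c$ consists of the spectrograms of the files in class $c$ as

$$V_c = [V_{c,1} \; V_{c,2} \; \cdots \; V_{c,L}], \tag{9}$$

where $V_{c,l} \in \mathbb{R}_+^{K \times N}$ is the spectrogram of the $l$-th file in the $c$-th class.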
The procedure for constructing the frequency basis matrix is described in Fig. 4 (a). To construct the frequency basis matrix $W$, the audio clips are collected for each class, the spectrograms in each class are concatenated along the temporal axis, and the NMF updates (7) and (8) are applied until convergence to estimate $W_c$. After that, the class-wise frequency basis matrices are concatenated by (6) to compose the frequency basis matrix $W$.
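The construction procedure above can be sketched as follows. This is a minimal illustration, assuming KL-divergence multiplicative updates and a fixed iteration count in place of a convergence check; the names `learn_basis` and `learn_classwise_bases`, the per-class random seed, and `eps` are hypothetical.

```python
import numpy as np

def learn_basis(V, R, n_iter=100, eps=1e-12, seed=0):
    """Estimate a K x R frequency basis matrix from a K x M data matrix V."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], R)) + eps
    H = rng.random((R, V.shape[1])) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
    return W  # the class-wise temporal basis H is discarded after training

def learn_classwise_bases(clips_by_class, R_c, n_iter=100):
    """clips_by_class[c] is a list of K x N magnitude spectrograms of class c."""
    bases = []
    for c, clips in enumerate(clips_by_class):
        V_c = np.concatenate(clips, axis=1)   # concatenate along the temporal axis
        bases.append(learn_basis(V_c, R_c, n_iter=n_iter, seed=c))
    return np.concatenate(bases, axis=1)      # W = [W_1 ... W_C], K x (C * R_c)
```

Because each class is factorized separately, the per-class factorizations are independent of one another and could be run in parallel.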

3) FEATURE EXTRACTION
After the learning of the matrix $W$ is completed, the feature extraction and classifier learning steps can be performed. If we denote $V_l$ as the magnitude spectrogram of the $l$-th audio clip, the feature matrix $H_l$ of the clip, which describes the temporal activation of the basis vectors, is obtained by iterating

$$H_l \leftarrow H_l \otimes \frac{W^{T} \frac{V_l}{W H_l}}{W^{T} 1_{K \times N}}. \tag{10}$$

During the estimation of $H_l$, the frequency bases $W$ are not changed. Thus, the feature extraction procedure requires a relatively small number of iterations. Fig. 5 shows examples of the change in the cost function with the number of iterations for the training data (the details of the dataset are described in Section IV). The gray-colored area denotes the inter-quartile range between the 25th and 75th percentile points, and the thick solid line indicates the average values. The graphs show that the cost function converges within about 20 iterations. The entire structure of the classifier system with the proposed feature extraction method is shown in Fig. 4 (b). As shown in Fig. 4 (b), once the NMF frequency basis matrix $W$ is pre-trained, the proposed feature extraction method can be used within the same structure as that of conventional features, e.g., the Mel-spectrogram.
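A minimal sketch of this feature extraction step is given below. It assumes the KL-divergence update for $H_l$ with $W$ held fixed; the function name `extract_feature`, the random initialization, and `eps` are illustrative assumptions, and the default of 30 iterations follows the setting reported in the evaluation section.

```python
import numpy as np

def extract_feature(V_l, W, n_iter=30, eps=1e-12, seed=0):
    """Estimate the temporal basis H_l (R x N) of a K x N spectrogram V_l.

    W (K x R) is the pre-trained frequency basis matrix and is never updated.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V_l.shape[1])) + eps
    denom = W.T @ np.ones_like(V_l) + eps  # constant because W is fixed
    for _ in range(n_iter):
        # H <- H * (W^T (V / WH)) / (W^T 1)
        H *= (W.T @ (V_l / (W @ H + eps))) / denom
    return H
```

Since only `H` is updated, the denominator can be computed once per clip, which is part of why this step is cheaper than learning the bases.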

4) DATA AUGMENTATION
Inspired by the data augmentation based on mixing and shuffling of sound waveforms [9], [45], we augment the data by mixing and shuffling the temporal basis matrix. In the waveform-based data augmentation method, a new waveform is generated by mixing two randomly chosen waveforms with a randomly shuffled block order. That is, the augmented waveform $x_{aug}$ is generated as

$$x_{aug} = [x_{b_1,c_1}, x_{b_2,c_2}, \ldots, x_{b_L,c_L}], \tag{11}$$

where $b_l$ and $c_l$ are the block number and the clip number, respectively, $L$ is the number of shuffle blocks in a clip, and $x_{b_l,c_l}$ is the $b_l$-th block of the $c_l$-th clip in the database of a certain class. $b_l$ is randomly chosen in $\{l : 1 \le l \le L\}$ without duplication, and $c_l$ is randomly chosen from two clip numbers.
If we assume that the length of each block is an integer multiple of the length of the FFT window, (11) can be presented in the time-frequency domain as

$$V_{aug} = [V_{b_1,c_1}, V_{b_2,c_2}, \ldots, V_{b_L,c_L}], \tag{12}$$

where $V$ is the time-frequency-domain representation, e.g., the spectrogram, of $x$. According to the NMF model (1), a temporal slice of the spectrogram corresponds to a slice of the temporal basis matrix. For example, $V_{b_l,c_l} \approx W H_{b_l,c_l}$. Therefore, (12) can be represented as

$$V_{aug} \approx W [H_{b_1,c_1}, H_{b_2,c_2}, \ldots, H_{b_L,c_L}]. \tag{13}$$

As a result, the temporal basis matrix, which is the proposed feature, can be augmented by mixing and shuffling as

$$H_{aug} = [H_{b_1,c_1}, H_{b_2,c_2}, \ldots, H_{b_L,c_L}] \tag{14}$$

without performing an additional NMF feature extraction process. The proposed augmentation procedure is illustrated in Fig. 6. The conventional mix and shuffle augmentation method has to be applied to the waveform directly, so the augmented data have to be processed by the NMF, which is the most time-consuming part of our feature extraction procedure. However, our data augmentation method, which mixes and shuffles the matrix $H_l$ instead of the waveform, can augment data without additional STFT or NMF calculations. While the NMF method consists of numerous multiplications, the mix and shuffle operation uses no multiplication, so the proposed data augmentation method can generate a large amount of data with very light operations.
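The feature-domain mix and shuffle can be sketched as below. This is a minimal illustration, assuming two feature matrices from the same class with equal shapes and a frame count divisible by the number of blocks; the function name `mix_and_shuffle` and its arguments are hypothetical.

```python
import numpy as np

def mix_and_shuffle(H_a, H_b, n_blocks, rng):
    """Build an augmented feature matrix from two same-class features (R x N)."""
    R, N = H_a.shape
    if H_b.shape != (R, N) or N % n_blocks != 0:
        raise ValueError("features must share a shape divisible into blocks")
    block_len = N // n_blocks
    order = rng.permutation(n_blocks)            # b_l: block indices, no duplication
    source = rng.integers(0, 2, size=n_blocks)   # c_l: which of the two clips
    clips = (H_a, H_b)
    blocks = [clips[c][:, b * block_len:(b + 1) * block_len]
              for b, c in zip(order, source)]
    return np.concatenate(blocks, axis=1)        # only slicing and copying, no NMF
```

For example, `mix_and_shuffle(H_1, H_2, n_blocks=4, rng=np.random.default_rng(0))` returns a matrix with the same shape as `H_1`, built from column blocks of the two inputs.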

B. NETWORK STRUCTURE OF THE CLASSIFIER
The recent classifiers for two-dimensional data, e.g., the Mel-frequency spectrogram and MFCCs, are mainly based on or include the CNN structure [9], [46]. The feature matrix $H_l$ is also two-dimensional, so a CNN-based classifier is used in our system. In order to compare the proposed method with the conventional feature extraction method, the classifier is designed similarly to the state-of-the-art classifier for log-Mel energy features [9]. An example of the classifier structure for the proposed NMF-based features is displayed in Fig. 7 with $R_C = 10$. Since the first-axis dimension of the input matrix (90 in Fig. 7) is defined by $N_C R_C$, where $N_C$ is the number of classes, the filter length of the second CNN layer (Conv (22,1) in Fig. 7) is calculated by $\lfloor N_C R_C / 4 \rfloor$, where $\lfloor \cdot \rfloor$ means the floor function.
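As a small worked example of the filter-length rule, the sketch below evaluates the floor expression for a given number of classes and bases per class. The function name is hypothetical, and only the case of 9 classes with 10 bases per class is confirmed by Fig. 7.

```python
def second_layer_filter_length(n_classes, bases_per_class):
    """Filter length of the second CNN layer: floor(N_C * R_C / 4)."""
    return (n_classes * bases_per_class) // 4

# 9 classes x 10 bases = 90 input rows, giving the Conv (22,1) layer of Fig. 7
filter_length = second_layer_filter_length(9, 10)
```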

A. EVALUATION SETTING
In order to evaluate the proposed system for the monitoring domestic activities, some simulations were performed with the DCASE 2018 Task 5 database, which is an audio dataset for the monitoring of domestic activities that was recorded in a living room and a kitchen [34]. The audio files were recorded with 4-channel linear microphone arrays.
There were 9 activity classes: absence (nobody in the room), cooking, dishwashing, eating, social activity, vacuum cleaning, watching TV, working, and other (non-relevant activity), as shown in Table 1. Each audio file was 10 seconds long and represented one activity. The audio files were acquired with a 16-kHz sampling rate and 12-bit quantization. The detailed recording setup, including the floorplan, can be found in [34].
The dataset consisted of development and evaluation sets. The development set had approximately 200 hours of data from 4 microphone arrays for the training and evaluation of the monitoring system. The evaluation set consisted of data from 7 microphone arrays, and the quantity of the evaluation set was similar to that of the development set. Four of the microphone arrays used for the evaluation set were the same arrays used for the development set, and the other three were not used for the development set.
The audio clips were short-time-Fourier-transformed with a 512-sample Hamming window with 50% overlap into 512 frequency bins. The number of NMF iterations was set to 100 for learning the frequency basis matrix and 30 for the feature extraction. We also tested NMF methods enhanced by sparseness and temporal continuity constraints [47] with various parameters, but they did not improve the performance. The classifiers were trained by the Adam optimizer [48] with a learning rate of 0.0001. The batch size and number of epochs were 16 and 100, respectively. The input audio in the dataset was a 4-channel signal, so a frequency basis matrix was independently trained and applied for each channel.
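The spectrogram computation described above can be sketched with NumPy as follows. This is an illustration under assumptions: a one-sided FFT is used, which yields 257 non-negative-frequency bins for a 512-point transform (the bin count that appears in the dictionary size of the conventional method in Section D), no padding is applied at the clip edges, and the function name `magnitude_spectrogram` is hypothetical.

```python
import numpy as np

def magnitude_spectrogram(x, win_len=512, hop=256):
    """Magnitude STFT with a Hamming window and 50% overlap (hop = win_len / 2)."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop   # frames that fit without padding
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)],
        axis=1,
    )
    # One-sided spectrum: (win_len // 2 + 1) frequency bins per frame
    return np.abs(np.fft.rfft(frames, n=win_len, axis=0))
```

For a 4-channel clip, the function would be applied to each channel separately, in line with the per-channel basis matrices described above.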
The performance was measured by the $F_1$-score, which is defined as

$$F_1 = \frac{2PR}{P + R}, \tag{15}$$

where $P$ and $R$ are the precision and recall, respectively. The precision and recall are relevance measures, which are defined as

$$P = \frac{n_{TP}}{n_{TP} + n_{FP}}, \tag{16}$$

$$R = \frac{n_{TP}}{n_{TP} + n_{FN}}, \tag{17}$$

where $n_{TP}$, $n_{FP}$, and $n_{FN}$ are the numbers of true positives (relevant answers), false positives (false answers), and false negatives (missing answers), respectively. We used the macro-averaged score, where the class-wise scores were first calculated and then averaged, to evaluate the performance. The performance on the development dataset was cross-validated with 4 folds and then averaged. For example, suppose we divide the development dataset into 4 blocks, named a, b, c, and d. The first fold consists of the training data of a, b, c and the evaluation data of d, the second fold consists of the training data of a, b, d and the evaluation data of c, and so on. The frequency basis matrix for each fold was generated by only using the training data, and the evaluation data of the development and evaluation datasets were not used. Also, the detailed cross-validation configuration, including the clip list for each fold, was in accordance with the DCASE 2018 Task 5.
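The macro-averaged score can be computed as in the following minimal sketch. The function name `macro_f1` and the convention of scoring a class as zero when its precision or recall is undefined are assumptions, not details from the paper.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Average of the class-wise F1-scores computed from integer labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        n_tp = np.sum((y_pred == c) & (y_true == c))
        n_fp = np.sum((y_pred == c) & (y_true != c))
        n_fn = np.sum((y_pred != c) & (y_true == c))
        p = n_tp / (n_tp + n_fp) if n_tp + n_fp > 0 else 0.0  # precision
        r = n_tp / (n_tp + n_fn) if n_tp + n_fn > 0 else 0.0  # recall
        scores.append(2 * p * r / (p + r) if p + r > 0 else 0.0)
    return float(np.mean(scores))
```

In a 4-fold setting, this function would be evaluated once per fold and the four results averaged.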

B. COMPARISONS WITH THE STATE-OF-THE-ART ALGORITHM
In order to evaluate the performance, the proposed system was compared with Inoue's algorithm [9] and Liu's method [7], which had the best performances in the DCASE 2018 Task 5 competition. Inoue's algorithm consisted of log-Mel-spectrogram-based features and a CNN-based classifier with three CNN layers with batch normalization (BN) and ReLU activation and two fully-connected layers with a softmax output, which is a similar structure to that of the proposed system. In the implementation of Inoue's system, the 40-bin log-Mel spectrograms were extracted using a 64-ms window with a 20-ms overlap, and the classifiers were trained using the Adam optimizer [48] with a learning rate of 0.0001 for 100 epochs. The detailed structure of the classifier can be found in [9].
Liu's method used an ensemble structure of three sub-systems. The first sub-system used 40-bin Mel-spectrogram features per frame and a CNN-based classifier, which has a similar structure to that of the proposed and Inoue's systems. The second sub-system used 40 Mel-frequency cepstral coefficients (MFCC) per frame and a CNN-based classifier with the same structure as that of the first sub-system. The third sub-system used 128 features per frame extracted by a pre-trained VGGish [49], which is a variant of the VGG [50] for audio signals, and a long short-term memory (LSTM)-based classifier. The detailed structures can be found in [49], and the classifiers were trained by the Adam optimizer with a learning rate of 0.0001 (0.001 for the LSTM classifier) for 100 epochs. Table 2 shows the F1-score results of the comparison and the proposed methods. NMF-CNN denotes the proposed system, and "with BN" means that each CNN layer in the classifier was combined with the BN modules. The results show that the performance of the proposed system is slightly lower than that of Inoue's method and better than that of Liu's method in both the Dev and Eval2 datasets. The Eval1 performance of the proposed method is similar to that of both Inoue's and Liu's methods.
According to the results of Inoue's method with and without BN, the BN in Inoue's method can improve the performance. However, the results of the proposed method with and without BN show that the BN is not effective for the proposed system. The performance of the proposed method without BN is comparable to that of Inoue's method without BN. The proposed method is slightly better in the Dev dataset, slightly worse in the Eval1 dataset, and almost the same in the Eval2 dataset. Therefore, the performance differences between the proposed method and Inoue's method may be due to the difference in the adequacy of BN for the features.
There is one more thing to note. The performances of Inoue's and Liu's methods are about 1% and 2.5% lower, respectively, in the Eval2 data than in the Eval1 data. This phenomenon occurs not only in those methods but also in most of the methods submitted to DCASE 2018 Task 5. However, the performance difference between the two datasets is relatively small, about 0.4%, in the proposed method.

C. PERFORMANCE CHANGE ACCORDING TO THE NUMBER OF BASES
The number of bases is used as a major engineering parameter in many NMF-based signal processing methods. Therefore, the performance change according to the number of bases was analyzed in this paper. Table 3 and Fig. 8 show the performances of the proposed systems with $R_C = 20$, $R_C = 10$, $R_C = 5$, and $R_C = 3$. Since the number of classes is 9 in the experiment, the dimensions of the features for a certain time frame are 180, 90, 45, and 27, respectively. In the $R_C = 20$, $R_C = 10$, and $R_C = 5$ cases, the performances of the systems slightly increase with the increase in the number of bases. The performance on the Eval2 dataset does not significantly change even if the number of bases changes in those cases. However, the performance of the proposed method with $R_C = 3$ is noticeably reduced on all the datasets. We think that the performance of the proposed system was not largely affected by the change in the number of bases when $R_C \ge 5$ in this experiment.
As mentioned in the previous section, the performance differences of the proposed systems between the Eval1 and Eval2 datasets are relatively small. This property is also shown in the results of the $R_C = 10$ and $R_C = 5$ cases.

D. COMPARISONS TO THE CONVENTIONAL NMF-BASED FEATURE EXTRACTION METHOD
In order to evaluate the performance of the proposed feature extraction method, we compared the proposed system with the conventional NMF-based feature extraction method [1], which consists of the convolutional NMF and K-means clustering. Just like for the proposed features, the audio clips were short-time-Fourier-transformed with a 512-sample Hamming window with 50% overlap into 512 frequency bins. The spectrogram of each audio clip was decomposed into 20 2D-dictionaries with 257 frequency bins and 4 time frames. Therefore, the dictionary from an audio clip was in $\mathbb{R}^{257 \times 4 \times 20}$. All the dictionaries were clustered into 256 and 512 centers with K-means clustering, but the K-means clustering to 512 centers failed to converge on our dataset. Table 4 shows the comparison results between the classification performances using the conventional convolutional-NMF-based features and the proposed features. The classifier used for the conventional features is the same as that of the proposed system. The results show that the proposed NMF-based features may be more suitable for the CNN-based architecture used in the proposed system.

E. COMPARISONS WITH CONVENTIONAL FEATURES
In order to compare the performance of the proposed feature with various existing features, the performance of the proposed system was compared to systems utilizing conventional features, including constant-Q transforms (CQT) [51], [52], power-normalized cepstral coefficients (PNCC) [53], Mel-frequency discrete wavelet coefficients (MFDWC) [14], gammatonegram (GAM) [17], and gammatone frequency cepstral coefficients (GFCC) [18]. The window length and overlap for the Fourier transform were set to 64 ms and 20 ms, respectively, which are the same values as in the log-Mel spectrogram case. The CNN classifiers were the same as in the proposed system, as shown in Fig. 7. Similar to the proposed system, the filter length of the second CNN layer was adjusted to $\lfloor N_{feature} / 4 \rfloor$, where $N_{feature}$ is the number of features in a frame, so that the first dimension of the second CNN layer output was one. All training data were equally augmented by the mix & shuffle method [45] in the waveform domain.
All the compared features are frequency-based, so they can be implemented with the short-time Fourier transform. The parameters of the Fourier transforms, e.g., the window/overlap length, number of FFT points, and type of the window function, were set to the same values as used by Inoue [9] and the proposed system. The number of features adopted for each system is displayed in Table 5. In the CQT system, the lower bound frequency was set to 32.7 Hz and the number of CQT bins was 12 per octave, so the 96 CQT bins (= 8 octaves) could cover the whole frequency range. The number of PNCC features was set to 40, the same as in reference [53]. The numbers of features of the MFDWC and GFCC were 15 and 13, respectively, as in the previous studies [14], [18], and larger numbers of coefficients (MFDWC31 and GFCC26) were also tested. The MFDWC15 system used 8, 4, 2, and 1 coefficients at scales 4, 8, 16, and 32, respectively, the same as in reference [14], while the MFDWC31 system used 16, 8, 4, 2, and 1 coefficients at scales 4, 8, 16, 32, and 64, respectively. Therefore, the number of Mel-bands was set to 64 in the MFDWC31 system, while the MFDWC15 system used 32 Mel-bands. The ensemble systems consisted of independent networks for the log-Mel spectrogram, the CQT, and the GAM, and the prediction results of the networks were averaged. In the averaged results for the Dev dataset, the performance of the proposed feature is better than the results for all of the other features, except for the GAM, MFDWC31, and the ensemble system. Although the proposed feature performs slightly better than MFDWC31, the improvement is only marginal. The results from the Eval1 dataset show that the MFDWCs and the GFCC26 features perform better than the proposed algorithm, unlike the results from the Dev dataset. However, the MFDWCs and the GFCC26 features exhibit significantly degraded results for the Eval2 dataset.
As a result, the proposed features demonstrate a better performance than all of the compared systems, except the GAM and the ensemble system, and a very close performance to the GAM and the ensemble system.
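The feature dimensions quoted for the CQT and MFDWC systems, and the averaging used by the ensemble, can be checked with a short sketch. The function names and the example probabilities below are hypothetical, and averaging class probabilities is only one plausible reading of "the prediction results of the networks were averaged".

```python
import numpy as np

def cqt_center_frequencies(f_min=32.7, bins_per_octave=12, n_bins=96):
    """Geometrically spaced CQT center frequencies in Hz."""
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)

freqs = cqt_center_frequencies()
n_octaves = 96 // 12                 # 8 octaves
upper_edge = 32.7 * 2 ** n_octaves   # top of the 8th octave, about 8371.2 Hz

# MFDWC layouts from the text, stored as {scale: number of coefficients}
mfdwc15 = {4: 8, 8: 4, 16: 2, 32: 1}
mfdwc31 = {4: 16, 8: 8, 16: 4, 32: 2, 64: 1}
n_coeff_15 = sum(mfdwc15.values())   # 15 coefficients
n_coeff_31 = sum(mfdwc31.values())   # 31 coefficients
# At every scale, scale x count equals the number of Mel bands (32 or 64)
mel_bands_15 = {s * c for s, c in mfdwc15.items()}
mel_bands_31 = {s * c for s, c in mfdwc31.items()}

def average_predictions(prob_list):
    """Average per-network class probabilities (assumed ensemble rule)."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

# Hypothetical outputs of three networks for one clip and two classes
ensemble = average_predictions([
    np.array([[0.2, 0.8]]),
    np.array([[0.6, 0.4]]),
    np.array([[0.4, 0.6]]),
])
```

The 8-octave span starting at 32.7 Hz ends at roughly 8.4 kHz, which is why 96 bins suffice if the audio is sampled at 16 kHz (an assumption here, not stated in this section).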
The F1-score of most of the analyzed systems is about 1.5% to 5% lower for the Eval2 dataset than for the Eval1 dataset. For the MFDWC and GFCC systems, the performance assessed on the Dev and Eval1 datasets improves as the number of features increases, but the difference between the Eval1 and Eval2 datasets also increases, so the performance on the Eval2 dataset decreases. However, the proposed system shows a difference of only 0.47% between the two datasets, so the proposed system can be regarded as robust to changes in the room transfer function. The PNCC system exhibits the smallest difference (0.8%) between the datasets among the compared systems, but the performance of the PNCC system itself is inferior to that of the proposed system.

V. CONCLUSION
In this paper, an NMF-based feature extraction method is proposed for monitoring domestic activities using sound signals. The proposed method was designed for supervised classifiers of domestic sounds. Inspired by NMF-based source separation methods, the proposed method estimates class-wise frequency bases using annotated sound signals. Then, the temporal basis matrix is extracted from the input signal based on the concatenated class-wise frequency basis matrices. The temporal basis matrix is used as the feature matrix, and the features can be augmented, without additional calculations, using the proposed data augmentation method, which is derived from the waveform-based mix and shuffle method.
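The two-stage procedure summarized above can be illustrated with a minimal sketch. It assumes the Euclidean-cost multiplicative updates of standard NMF, which may differ from the cost function actually used in the proposed method; the toy dimensions, the random spectrograms, and the names `nmf`, `temporal_features`, and `r_c` (bases per class) are hypothetical.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Plain NMF, V ~ W @ H, with Euclidean multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def temporal_features(V, W, n_iter=200, seed=0, eps=1e-9):
    """Estimate only the temporal basis H while the frequency bases W stay fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

# Toy data: one random magnitude spectrogram per class (placeholders for real audio)
rng = np.random.default_rng(1)
n_bins, n_frames, n_classes, r_c = 64, 50, 3, 4
class_specs = [rng.random((n_bins, n_frames)) for _ in range(n_classes)]

# Stage 1: class-wise frequency bases, concatenated column-wise
W_all = np.hstack([nmf(V, r_c)[0] for V in class_specs])

# Stage 2: temporal basis matrix of a new clip, used as its feature matrix
features = temporal_features(rng.random((n_bins, 30)), W_all)
print(W_all.shape, features.shape)
```

In this sketch the feature matrix has one row per concatenated basis (number of classes times `r_c`) and one column per time frame.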
In order to evaluate the proposed feature extraction method, experiments were performed on the DCASE 2018 Task 5 database. First, the proposed method was compared to state-of-the-art algorithms that utilize the log-Mel spectrum, Mel-spectrogram, and VGGish model output as input features. The evaluation results showed that the combined system of the proposed NMF-based feature and the CNN-based classifier has comparable performance to that of the state-of-the-art algorithms. Second, the proposed algorithm was evaluated by changing the feature dimension, and the results show that the performance of the proposed algorithm remains consistent as the feature dimension changes, except for the extremely-small-bases case ($R_C = 3$).
Third, the proposed algorithm was compared to the conventional NMF-based feature extraction method, which consists of convolutional NMF and K-means clustering, and the results showed that the proposed algorithm improves the F1-score by 6%–12% in comparison with the conventional NMF-based features.