Open Set Audio Recognition for Multi-Class Classification With Rejection

Most supervised audio recognition systems developed to date have used a testing set that includes the same categories as the training set. Such systems are called closed-set recognition (CSR) systems. However, audio recognition in real applications can be more complicated: datasets can be dynamic, and novel categories can continually appear. Hence, in practice, the usual methods will assign these novel classes labels that are often incorrect. This work investigates audio open-set recognition (OSR) suitable for multi-class classification, with a rejection option for classes never seen by the system. A probabilistic calibration of a support vector machine classifier is utilized and formulated under the open-set scenario. For this, we propose to apply a threshold technique called the peak side ratio (PSR) to the audio recognition task. A candidate sample is first examined by a Platt-calibrated support vector machine (SVM) to produce posterior probabilities. The PSR is then used to characterize the distribution of the posterior probability values. This process helps to determine a threshold in order to reject or accept a particular class. Our proposed method is evaluated on different variations of open sets, using well-known metrics. Experimental results reveal that our proposed method outperforms previous OSR approaches over a wide range of openness values.


I. INTRODUCTION
Closed-set recognition (CSR) systems are often governed by misleading assumptions, where all testing and training data are taken from the same database, often with the same distribution. Under these assumptions, several algorithms have achieved significant success in many machine learning applications. Machine learning algorithms perform empirical risk minimization very well, using their ability to handle large feature spaces and to identify outliers. However, these assumptions do not reflect some practical applications in which out-of-set data may be encountered. When data from a new class occurs, it is classified as one of the known classes. Even if this sample lies far from any of the training samples, it may be classified with a high probability; that is, the algorithm will not only be wrong, but it may also be very confident in its results [1]. A more practical problem is open-set recognition (OSR), where samples of classes not seen during training may appear at testing time. OSR systems need to formulate new assumptions that balance two risks: the risk of misclassifying a tested sample and the risk of labeling the unknown space [2]. Any signal or event in an open-set classification scenario falls into one of the following categories [3]:
1. Known classes (KCs): classes for which data samples are labeled positive for training and testing.
2. Known unknown classes (KUCs): classes that are seen at training time but are treated as unknown at validation time, in order to build models of the unknown. These are used for tuning parameters.
3. Unknown unknown classes (UUCs): classes that have not been seen in the training or validation stages. They appear only in the testing stage.
In this paper, we aim to solve the open-set audio or sound event identification and detection problem by setting a threshold to detect new instances and new classes, based on the peak side ratio (PSR) of calibrated posterior probabilities, rather than applying a threshold on the probabilities themselves.
Experiments need to be carefully designed to evaluate multi-class open-set recognition. We use well-known datasets retrieved from the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge [4]. Because open-set recognition requires an experimental setup with classes unseen during training, we withhold some classes from the database to simulate the unknown classes. A one-vs-rest multi-class method is used to implement the radial basis function support vector machine (RBF SVM), so the system has as many classifiers as known classes. During the test phase, the classifiers predict the posterior probability of each class for the tested sample. Under a closed-set setting, the posterior probability reflects the tested sample's distance from the separating hyperplane, and the highest probability value indicates the class to which the sample belongs. In an open-set setting, however, the system does not know the unknown classes and thus cannot estimate their probabilities. Instead, we examine the distribution of posterior probabilities over all classes for the tested sample. If the highest probability value stands far from the other values, the sample is recognized as belonging to a known class; otherwise, it is considered an unknown class. We propose the PSR to measure this aspect. This measure was introduced in [5] for face recognition purposes; in this work, we propose to include the PSR in audio recognition tasks and to validate its use. Consequently, our proposed algorithm consists of: deriving a set of calibrated posterior probability values; computing the PSR for the tested sample from these values; comparing the PSR to a rejection threshold; and rejecting the sample as unknown if the PSR is greater than or equal to the threshold.
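The accept/reject decision described above can be sketched in a few lines. Since the exact PSR formula follows [5], this sketch substitutes a normalized-entropy score as an illustrative stand-in with the same convention used in the paper (a high score signals ambiguous posteriors and triggers rejection); the function name and threshold value are hypothetical.

```python
import numpy as np

def classify_with_rejection(posteriors, threshold):
    """Accept the argmax class only when the posterior distribution is
    sufficiently peaked. NOTE: the score below is a normalized-entropy
    stand-in for the paper's PSR, not the formula from [5]."""
    p = np.asarray(posteriors, dtype=float)
    entropy = -np.sum(p * np.log(p + 1e-12))
    score = entropy / np.log(len(p))   # 0 = fully peaked, 1 = uniform
    if score >= threshold:
        return -1                      # reject: unknown class
    return int(np.argmax(p))           # accept: known class index
```

For example, `classify_with_rejection([0.9, 0.05, 0.03, 0.02], 0.5)` accepts class 0, while a near-uniform posterior vector is rejected as unknown.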
The structure of this paper is as follows. In section II, we present the literature review; in section III, we clarify some definitions and notation concepts of open-set scenarios; in section IV, we describe feature extraction; in section V, we illustrate the evaluation metrics used to measure the system performance; in section VI, we propose our method of classification; and in section VII, we evaluate the performance of our proposed method and compare it with other methods. A conclusion then follows in section VIII.

II. LITERATURE REVIEW
This section reviews recent work in the literature that explicitly deals with audio features representations (II-A), audio recognition (II-B), and open-set scenarios (II-C).

A. AUDIO FEATURES REPRESENTATIONS
Feature extraction is a critical step in audio/sound event classification. An efficient representation must capture the most significant sound properties for the task while keeping the feature dimension and sample size manageable. The most common features used in audio recognition are the Mel-frequency cepstral coefficients (MFCCs). In the speech processing domain, the first thirteen MFCC values have proven particularly pertinent due to their approximate separation of the glottal excitation from the vocal tract [6]. The MFCCs are perceptual features computed from the short-term Fourier transform. The power spectrum bins are computed and scaled with Mel-frequency scaling. The output is then framed into a number of filter banks, corresponding to overlapping triangular filters. Finally, a discrete cosine transform (DCT) is applied to the logarithm of the filter bank magnitudes, producing vectors of nearly decorrelated MFCC features. By equally spacing the frequency bands on the Mel scale, the MFCC decomposition resembles how humans perceive sound and provides statistics of how the frequency content changes in different spectrum bands, which makes MFCCs useful for audio recognition. Other features have been used for audio classification, including but not limited to: MPEG-7 audio features [7], feature representations based on bag-of-audio-words [8], and indexing-based features [9]. Also, features based on the discrete Hartley transform [10] have been shown to be useful in audio scene classification applications.
Although the choice of features affects the overall classification performance, this choice is not key to the main objective of this paper, which is to investigate the benefit provided by the PSR measurement for open-set problems in audio classification. This paper uses the MFCC features as the base features because of their widespread use and since they have proven to be effective in recognizing the structure of acoustic/speech/audio signals (sometimes in combination with other features, as we do in this paper).

B. AUDIO DETECTION/RECOGNITION
Many of the classification methods that can be encountered in the literature detect presence/absence of a specific event during a period of time. Several organized challenges have been created for acoustic events detection and recognition to evaluate these research methods. These challenges regularly report progress and provide useful resources for research, such as the SiSEC evaluation for signal separation [11], the CHiME speech separation and recognition challenge [12], the MIREX competition for music information retrieval [13], and the DCASE 2013-2019 challenges [4].
Other notable previous work includes [14], where Lopatka et al. used an SVM classifier to discern between classes of hazardous situations. They used long-range audio features to classify four predefined classes and then localize them. Similarly, Hilal et al. [15] used linear discriminant analysis (LDA) and SVM classifiers to identify and localize a set of predefined environmental sound events.

C. OPEN-SET SCENARIO
In the audio domain, there have been few previous publications investigating open-set scenarios. Battaglino et al. [16] applied the open-set method to audio scene classification, where a specific type of 1-class SVM has shown promising results for open-set recognition. Recently, Krstulovic [17] published a book chapter illustrating the importance of the open-set problem in the audio domain and investigating the restrictions of the existing evaluation practices, such as using F1-score, precision, and recall.
The problem of road surveillance was considered in [8], where two hazardous situations were detected: car crashes and tire skidding. A bag-of-words dictionary technique was used to count occurrences of low-level features and assemble a high-level vector whose dimensionality equals the number of possible words in the dictionary. Crocco et al. [18] presented a thorough survey of audio surveillance methods dealing with a small group of sound classes, and emphasized the challenges that a surveillance scenario faces.
The concept of open-set recognition has so far received more attention in image/face recognition tasks. Recent methods to detect unrelated samples in open-set problems were investigated by Scheirer et al. [2], who used a one-vs-all setting to formulate the problem of open-set image recognition, aiming to balance the open space risk and the empirical error. The work was then extended in [3] by introducing the 1-vs-set machine with the compact abating probability model. Jain et al. [19] introduced the Weibull-calibrated multi-class SVM classifier for open-set image recognition, showing promising results based on minimizing the open space risk as a compromise between computing capabilities and support vector estimations. Bendale et al. [20] proposed the Nearest Non-Outlier (NNO) algorithm and gave a definition of open world image recognition, where the unknown samples are not a static set. They extended their work in [21] by modifying a deep learning architecture for open-set recognition, combining the concept of the penultimate layer with meta-recognition [22]. Similarly, in radar image recognition, Roos and Shaw [23] used open-set recognition on high range resolution radar and formulated an automatic target recognition system.
Our proposed solution to distinguishing known from unknown events relies on estimating posterior probabilities. The recognition algorithm is implemented within the SVM classifier framework. The SVM has a good reputation as a robust classifier due to the optimal margin between separating hyperplanes. The decision function in a Platt-calibrated SVM provides the posterior probability of each classifier. Since we deal with multi-class recognition, the decision functions are combined in a one-vs-rest fashion.
However, a traditional SVM works for the closed-set problem, where the testing set includes the same categories as the training set. Open-set recognition cannot simply take the maximum a posteriori (MAP) estimate over the known classes as the best solution, because the probability of unknown classes cannot be estimated. In addition, the probability estimates are not a reliable measure when novel class data lie far from any training data. The PSR measurement is proposed for audio classification in our work to make a rejection decision by relating the maximum posterior probability value to the other values. If the PSR is larger than a certain threshold, the sample is rejected and labeled as an unknown class: a large PSR value means that the recognition is questionable due to ambiguity and yields rejection. The threshold is computed a priori during a validation step, where the system is inspected under a variety of thresholds. The best threshold found is then considered predefined and constant during the actual experiments.
As can be seen from the literature, most of the previous works on open-set scenarios have been done in image processing applications. Therefore, in this paper we investigate the applicability of some techniques previously developed for image classification and evaluate their use and performance in the context of audio classification, where the signals have different characteristics.

III. NOTATION AND DEFINITION
Let us first establish preliminaries related to open-set recognition. The openness of a given problem is an estimate of the need for rejection in order for a proper solution to be found. It measures the potential for the existence of sound classes that are not fully known in the training and validation stages; therefore, some classes are withheld as unknown to simulate this scenario. Following Scheirer et al. [2], who defined the three quantities target, known, and unknown, we call Y the set containing all class labels in the given dataset and use the subscripts t, k, and u for the target, known, and unknown sets, as described by the Venn diagram in Fig. 1.
The data from the target and known negative classes are used for training, Y_train = Y_t ∪ Y_k, and representative data from the unknown classes are used only to evaluate the open-set classification; the testing dataset is thus a combination of all classes. If we define the sets of training and testing classes as Y_train ⊂ Y and Y_test ⊂ Y, the novel classes are Y_u = {y | y ∈ Y_test and y ∉ Y_train}. The openness is defined in (1) as:

openness = 1 − sqrt( 2 |Y_train| / (|Y_train| + |Y_test|) ),   (1)

where |·| represents the number of elements in each set. The openness ranges from 0 to 1, where 0 represents a completely closed-set problem. Increasing the number of classes available in the training phase decreases the openness, and the square root keeps the openness from quickly approaching the value 1.
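The openness measure can be computed with a small helper. This is a minimal sketch assuming the common two-set form of the formula, written in terms of training and testing class counts:

```python
import math

def openness(n_train_classes, n_test_classes):
    """Openness of an open-set problem: 0 for a fully closed set,
    growing toward 1 as more test-only classes are added; the square
    root slows the approach toward 1."""
    return 1.0 - math.sqrt(2.0 * n_train_classes /
                           (n_train_classes + n_test_classes))
```

For instance, training on 6 classes and testing on all 16 classes of the dataset gives an openness of about 0.26, while equal training and testing sets give 0.
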

IV. FEATURES EXTRACTION
In contrast to video signals, an audio signal has quick variations within a short time. Hence, the input audio stream is first framed into groups of P overlapping segments. Each segment has a duration of N samples and is multiplied by a Hamming window. The choice of the segment duration is a trade-off. If the length is too short, it may be unable to represent accurately fine details of the low-frequency components (from poor frequency resolution). Conversely, if it is too long, the frame cannot describe the short-time changes in the audio signal. We used N = 2048 samples and 512 samples for the segment shift. Before executing any training or classification, the sounds under investigation need to be characterized and represented in a form that can easily be computed. Feature extraction is the process of extracting useful discriminative information from the raw waveform signals and producing a compact set of feature vectors.
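The framing and windowing step above can be sketched as follows; the segment length (2048 samples) and shift (512 samples) match the values stated in the text, while the function name is illustrative:

```python
import numpy as np

def frame_signal(x, frame_len=2048, hop=512):
    """Split a 1-D signal into P overlapping, Hamming-windowed
    segments of N samples each. Returns an array of shape (P, N)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    return np.stack([x[p * hop : p * hop + frame_len] * win
                     for p in range(n_frames)])
```
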

A. MEL FREQUENCY CEPSTRAL COEFFICIENTS
As previously mentioned, extracting MFCCs is a widely used procedure in speech classification applications. The frequency bands are equally spaced on the Mel scale to resemble how humans perceive sound. For speech or speaker recognition, 13 MFCCs are enough to encode information about the vocal tract [24]. However, there is no optimal number of MFCCs for non-speech applications; in this work, we adopt the choice made in the DCASE challenge of using 40 coefficients [25]. The MFCCs are computed using the following steps. The signal is first divided into frames. Each frame x_p(n) of length N is pre-emphasized with a high-pass filter to compensate for the spectral slope, i.e., the tendency of natural audio signals to have less energy at high frequencies. The transfer function is H(z) = 1 − αz^(−1), where 0.9 ≤ α ≤ 1.0; we used α = 0.97.
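In the time domain, the pre-emphasis filter H(z) = 1 − αz^(−1) is a one-tap difference, as this small sketch shows:

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """Pre-emphasis filter H(z) = 1 - alpha * z^{-1}, i.e.
    y[n] = x[n] - alpha * x[n-1], which boosts high frequencies to
    compensate for the natural spectral slope of audio signals."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```
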
Each frame is multiplied by a Hamming window w(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, followed by a discrete Fourier transform. In practice, this is referred to as the (discretized) short-time Fourier transform (STFT) [26]:

X_p(k) = Σ_{n=0}^{N−1} x_p(n) w(n) e^{−j2πkn/N},   (2)

with k = 0, 1, ..., N − 1 frequency coefficients (bins) and p = 0, 1, ..., P − 1 frames. X_p(k) corresponds to the content at the discrete frequency f(k) = kF_s/N, where F_s is the sampling frequency. The result is a matrix of size N × P.
The spectrum X_p(k) is then scaled in both magnitude and frequency. The frequency axis is logarithmically scaled as in (3):

X'_p(m) = Σ_{k=0}^{N−1} |X_p(k)| H(m, k),   (3)

with m = 1, 2, ..., F, where F is the number of filter banks, and p = 0, 1, ..., P − 1. The filter bank output is thus the product of the magnitude spectrum X_p(k) and the Mel filter bank H(m, k), described by an F × N matrix. The triangular responses of the Mel filter bank H(m, k) are defined in terms of the center frequencies f_c(m) as:

H(m, k) = (f(k) − f_c(m − 1)) / (f_c(m) − f_c(m − 1)), for f_c(m − 1) ≤ f(k) ≤ f_c(m),
H(m, k) = (f_c(m + 1) − f(k)) / (f_c(m + 1) − f_c(m)), for f_c(m) < f(k) ≤ f_c(m + 1),
H(m, k) = 0, otherwise.

The filter bank center frequencies in Hz are obtained from uniformly distributed frequencies in the Mel scale. There are several formulae in the literature to convert a frequency in Hz to a frequency in Mel. We use the one found both in the Librosa Python library [27] and the MATLAB Auditory Toolbox [28], which is linear below 1 kHz and logarithmic above it:

f_Mel = f / (200/3), for f < 1000 Hz,
f_Mel = 15 + 27 ln(f / 1000) / ln(6.4), for f ≥ 1000 Hz.

A uniform Mel-scale frequency resolution Δf_Mel is then obtained by dividing the Mel-scale bandwidth evenly among the F filters. The Mel-scale center frequencies are computed by f_c,Mel(m) = m Δf_Mel for m = 1, 2, ..., F, from which the center frequencies f_c(m) in Hz are obtained by the inverse mapping. Applying the DCT to the logarithm of X'_p(m) provides the MFCCs:

c_p(r) = Σ_{m=1}^{F} log(X'_p(m)) cos(r (m − 1/2) π / F),

where c_p(r) is the r-th MFCC of the p-th frame, and r = 0, 1, ..., 39. Features are then collected, including the coefficient for frequency zero. We also compute delta ''differential'' coefficients that capture the dynamic properties of the cepstrum, calculated using the following formula:

d_p = Σ_{n=1}^{N_d} n (c_{p+n} − c_{p−n}) / (2 Σ_{n=1}^{N_d} n²),

where d_p is a delta coefficient for frame p, and N_d is the number of neighboring frames used on each side of the regression. We use N_d = 9 in our proposed system.
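The delta-coefficient regression can be sketched as a short function; edge handling by repeating the first and last frames is an implementation assumption, not specified in the text:

```python
import numpy as np

def delta(c, n_d=9):
    """Delta coefficients via the regression
    d_p = sum_{n=1..N_d} n * (c[p+n] - c[p-n]) / (2 * sum n^2).
    c has shape (P, n_coeffs); edge frames are handled by
    repeating the first/last frame (an assumption made here)."""
    denom = 2.0 * sum(n * n for n in range(1, n_d + 1))
    padded = np.pad(c, ((n_d, n_d), (0, 0)), mode='edge')
    P = c.shape[0]
    d = np.zeros_like(c, dtype=float)
    for n in range(1, n_d + 1):
        d += n * (padded[n_d + n : n_d + n + P] -
                  padded[n_d - n : n_d - n + P])
    return d / denom
```

A constant cepstral track yields zero deltas, and a track that grows by one per frame yields deltas of one for interior frames, as expected of a slope estimator.
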

B. FREQUENCY-DOMAIN FEATURES
While MFCCs are well-known features in speech classification applications that resemble how people interpret sound, additional features are often used for audio recognition; in this work we use spectral sparsity, spectral flux, and spectral centroid. Spectral sparsity is defined as the maximum of the magnitude spectrum normalized by the sum of the spectrum:

S_p = max_k |X_p(k)| / Σ_k |X_p(k)|.

The spectral flux captures the change of spectra between two successive segments. It is computed as in [22] by taking the squared difference between two consecutive spectral contents:

SF_p = Σ_k ( |X_p(k)| − |X_{p−1}(k)| )².

The spectral centroid computes the center of gravity of the spectrum. For the p-th frame, it is computed as:

SC_p = Σ_k f(k) |X_p(k)| / Σ_k |X_p(k)|.
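These three descriptors can be computed directly from the frame-wise magnitude spectra. In this sketch the sparsity is the peak magnitude normalized by the frame's total magnitude, one common reading of a "normalized maximum", used as an illustrative stand-in for the paper's exact definition:

```python
import numpy as np

def spectral_features(mag, freqs):
    """Per-frame spectral descriptors.
    mag: (P, K) magnitude spectra; freqs: (K,) bin frequencies in Hz.
    Returns flux (P-1,), centroid (P,), and sparsity (P,)."""
    mag = np.asarray(mag, dtype=float)
    flux = np.sum(np.diff(mag, axis=0) ** 2, axis=1)        # change between frames
    centroid = (mag @ freqs) / (mag.sum(axis=1) + 1e-12)    # center of gravity
    sparsity = mag.max(axis=1) / (mag.sum(axis=1) + 1e-12)  # normalized maximum
    return flux, centroid, sparsity
```
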

V. EVALUATION METRICS
The choice of statistics to use for evaluation of classification performance needs to be addressed. Metrics used in audio detection and classification include accuracy, F-score, precision, acoustic event error rate (AEER), and error rate (ER). Let TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative counts, as explained in [29].

A. PRECISION AND RECALL
Precision is the fraction of positive classifications that are correct, while recall is the fraction of the truly positive samples that are detected by positive classifications. These can be averaged over classes in two ways, macro-averaging as in (12) and micro-averaging as in (13):

P_macro = (1/N) Σ_i TP_i / (TP_i + FP_i),   R_macro = (1/N) Σ_i TP_i / (TP_i + FN_i),   (12)

P_micro = Σ_i TP_i / Σ_i (TP_i + FP_i),   R_micro = Σ_i TP_i / Σ_i (TP_i + FN_i),   (13)

where i indexes the classes and N is their number. In this work, we computed both macro and micro averages of precision and recall over all classes, treating all classes equally, as described in [30].
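The difference between the two averaging schemes is easy to see in code: macro-averaging weights every class equally, while micro-averaging pools the counts so frequent classes dominate. A minimal sketch for precision (recall is analogous with FN in place of FP):

```python
import numpy as np

def precision_macro_micro(tp, fp):
    """tp, fp: per-class true-positive and false-positive counts.
    Macro averages the per-class precisions; micro pools counts."""
    tp = np.asarray(tp, dtype=float)
    fp = np.asarray(fp, dtype=float)
    macro = np.mean(tp / (tp + fp))
    micro = tp.sum() / (tp.sum() + fp.sum())
    return macro, micro
```
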

B. AVERAGE ACCURACY AND F-MEASURE
Following the definition of metrics as in [31], the average accuracy is computed as:

Accuracy_avg = (1/N) Σ_i (TP_i + TN_i) / (TP_i + TN_i + FP_i + FN_i),

where the subscript i refers to the i-th training class. The F-measure combines precision and recall in a single score. Its general form is computed as:

F_β = (1 + β²) P R / (β² P + R),

where β is a parameter weighting the precision (P) against the recall (R). We used β = 1 in this work, which gives the commonly used F1-measure.
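The general F-measure is a one-liner; with β = 1 it reduces to the harmonic mean of precision and recall:

```python
def f_measure(precision, recall, beta=1.0):
    """General F-measure; beta = 1 yields the F1 score used in this
    work (the harmonic mean of precision and recall)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, `f_measure(0.5, 0.5)` gives 0.5, and a perfect precision with 0.5 recall gives 2/3, reflecting the penalty on the weaker of the two.
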

C. CONFUSION MATRIX
The accuracy metric provides an appropriate evaluation only if the class labels are uniformly distributed, which is the case for the closed-set recognition task also considered in this work. In this case, the confusion matrix CM(i, j) is a good way to summarize the performance of multi-class classification. Each row of the matrix corresponds to an actual class, and each column corresponds to a predicted class. The diagonal of the confusion matrix shows the correct classification predictions (i = j). In this work we used a row-wise normalized confusion matrix CM_n(i, j) as defined in [32], where each element is divided by the sum of the elements in its row:

CM_n(i, j) = CM(i, j) / Σ_j CM(i, j).
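Row-wise normalization takes one line with broadcasting; after it, each row (true class) sums to 1, so every class contributes equally regardless of its sample count:

```python
import numpy as np

def normalize_confusion(cm):
    """Row-wise normalization of a confusion matrix: divide each row
    (true class) by its sum, so rows become class-conditional rates."""
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)
```
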

D. REDUCED CONFUSION MATRIX
The reduced confusion matrix is the confusion matrix reduced to a 2 × 2 matrix to evaluate the open set performance [23]. The classes of the matrix are the known targets and unknown targets, as shown in Table 1. The performance of the classifier can be computed based on how well these two classes are distinguished.

E. ACOUSTIC EVENT ERROR RATE
This metric, defined in [33], is used to measure the classification error at the event level. Given T as the number of detected events, and with D, I, and S denoting the numbers of deletion, insertion, and substitution errors, the Acoustic Event Error Rate (AEER) is computed as:

AEER = (D + I + S) / T.
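A minimal sketch of the metric in its common form (the exact normalization in the paper follows [33]):

```python
def aeer(deletions, insertions, substitutions, n_events):
    """Acoustic Event Error Rate: deletion, insertion, and
    substitution errors normalized by the number of events."""
    return (deletions + insertions + substitutions) / n_events
```
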

VI. METHODOLOGY
Compared with closed-set audio classification, which has been investigated for decades, open-set audio recognition needs a special setup, methodology, and experiments that can distinguish audio events of interest from frequent daily events. This section discusses the proposed method in detail.

A. DATABASE
We validate our proposed framework in an open-set audio recognition experiment by using a dataset that contains audio events recorded in an office-like environment retrieved from DCASE 2013 [34] and from the Freesound online database [35]. The recorded events have different sizes and varying noise levels. The sound event classes used are: short alert, clearing throat, cough, door slam, drawer, keyboard clicks, keys put on the table, knocking on door, laughter, mouse click, page turning, pen or pencil touching a table surface, phone, printer, speech, and switch. This is a challenging set to work with, as some of these sounds can be similar. To address open-set recognition, we randomly select six classes to build our models and keep the remaining ten classes as unknown classes. To create different levels of openness, for each setting 2-5 classes from the known classes are considered as target classes. From the unknown classes, in each setting 3-9 classes are selected. Experiments in multiple settings are performed, for a total of 28 combinations.

B. SIGNALS PROCESSING
The input audio files are preprocessed with normalization, frame segmentation, and windowing. The sampling frequency is fixed at 44100 samples/s; files with different sampling frequencies are resampled to this value. Each audio file is normalized to maximum unit amplitude, bringing the gain of the entire track to its maximum without clipping. Due to the presence of silence frames, and to avoid ambiguity in sound event labels, the system uses a voice activity detection (VAD) technique to select the most energetic frames of the audio and discard the rest. We divided each recording into frames of 20 ms with 50% overlap. The signal energies ES_pq are computed in the frequency domain frame by frame, where p is the frame index and q is a frequency bin. The noise energies En_pq are estimated independently for each frame: the first frame is considered pure noise, and the estimate adapts frame by frame with the moving average En_pq = 0.94 × En_pq + (1 − 0.94) × ES_pq. This update is computed only when the signal energy ES_pq is not higher than a threshold η, where η = 2 En_pq. The signal-to-noise ratio (SNR) of the p-th frame in dB is computed as:

SNR_p = 10 log10( Σ_q ES_pq / Σ_q En_pq ).

The VAD algorithm compares the local SNR of the p-th frame to a global SNR, computed as the average of the local SNRs over the entire audio file. If the SNR of the p-th frame is lower than the global SNR, the frame is considered silence and is removed. After silence removal and annotation/labeling, the audio clips are segmented into frames of 2048 samples with a 512-sample shift. Only signals detected for at least four consecutive frames are considered events.
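The energy-based VAD recursion described above can be sketched as follows; the function operates on per-frame energies already summed over frequency bins, which is a simplification of the bin-wise description in the text:

```python
import numpy as np

def vad_mask(frame_energies, alpha=0.94):
    """Energy-based VAD sketch following the paper's recursion:
    the noise estimate is a moving average updated only while the
    frame energy stays below eta = 2 * noise; frames whose local SNR
    falls below the global mean SNR are marked for removal."""
    noise = frame_energies[0]              # first frame = pure noise
    snrs = []
    for e in frame_energies:
        if e <= 2 * noise:                 # quiet frame: adapt noise
            noise = alpha * noise + (1 - alpha) * e
        snrs.append(10 * np.log10(e / noise + 1e-12))
    snrs = np.asarray(snrs)
    return snrs >= snrs.mean()             # True = keep the frame
```
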

C. PSR-SVM CLASSIFIER
We use the support vector machine (SVM) classifier along with the peak side ratio (PSR). The SVM uses hyperplanes to define the decision boundaries that separate the data spaces of different classes. Since the SVM is a binary classifier, several SVM classifiers operating in parallel are required to solve multiple binary classification problems. We consider the problem as a set of N_k two-class problems, where N_k is the number of classes during the training stage. Multiple binary SVM classifiers are constructed for a multi-class problem using a One-versus-All (OVA) technique that applies a winner-takes-all strategy to the classifiers' outputs. For the j-th binary SVM classifier, class j is considered the positive class (+1), whereas the remaining classes are collectively considered negative (−1). Given a separating hyperplane, the decision function of the SVM is given by:

f(x) = ω · ψ(x) + b,

where b is an offset parameter, ω is a multi-dimensional weight vector, and ψ : R^d → H_feat transforms the input data into a high-dimensional feature space H_feat. Practically, the transformation ψ is defined indirectly by a kernel function, so that K(x_i, x_j) = ψ(x_i) · ψ(x_j). We use the Radial Basis Function (RBF) kernel

K(x_i, x_j) = exp(−γ ||x_i − x_j||²),

where γ > 0. Each model is designed to find the maximum-margin boundary separating the j-th class from the rest of the classes; the hyperplane for the j-th class is defined by f_j(x) = ω_j · ψ(x) + b_j = 0. As seen in Fig. 2, during the training stage, each classifier builds its own model and saves it to be used in the testing stage. In the testing stage, the Platt probability estimates [36] are computed from the decision function using a sigmoid:

P(Y_j | x) = 1 / (1 + exp(A f(x) + B)),

where the parameters A and B are determined by maximum likelihood estimation (MLE). Our proposed method for the detection aspect uses the PSR [5], a confidence measure that quantifies how far the maximum posterior probability stands out from the other values.
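The Platt calibration step maps a raw SVM decision value to a posterior through a fitted sigmoid; a minimal sketch, where the A and B values passed in are illustrative rather than MLE-fitted:

```python
import numpy as np

def platt_posterior(f, A, B):
    """Platt scaling: map an SVM decision value f(x) to a posterior
    P(y = 1 | x) = 1 / (1 + exp(A * f + B)). In practice A and B are
    fit by maximum likelihood on held-out decision values."""
    return 1.0 / (1.0 + np.exp(A * f + B))
```

With A < 0, the posterior increases monotonically with the decision value, reaching 0.5 at the decision boundary when B = 0; scikit-learn's `SVC(probability=True)` performs an equivalent calibration internally.
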
Let us arrange the posterior probability values P_j, j = 1, ..., N_k, in descending order, where N_k is the number of classifiers, P_1 is the largest value, and P_{N_k} is the smallest. The PSR is then computed from these ordered values, where P̄ denotes their average and |·| the absolute value. When the system receives a new sample x, the SVM classifiers compute the posterior probabilities P(Y_j | x) as in (21). Traditionally, the new sample would be assigned to the class Y_j whose index is associated with the largest probability value; however, the probability P(Y_unknown | x) of an unknown class cannot be estimated, and the argmax of the probability estimates is not a reliable measure when novel class data lie far from any training data. The PSR is used here to make the rejection decision. If the PSR is larger than a certain threshold, the sample is rejected and labeled as an unknown class. Otherwise, the sample is assigned to the class with the maximum posterior probability:

y* = argmax_j P(Y_j | x), j = 1, ..., N_k.   (23)

The PSR characterizes the distribution of the posterior probability values and helps to determine the threshold for rejecting or accepting a particular class. A small PSR score gives credibility to the classification; a high score means the posterior probabilities are close to randomly distributed, which makes the classification decision questionable, and the tested signal is rejected. To visualize the effect of the PSR, Fig. 3 shows histograms of the PSR and of the posterior probabilities. The PSR is clearly more reliable for distinguishing new classes from predefined classes.

D. THRESHOLDING CRITERIA
In order to obtain an optimum classifier, we have to tune the decision threshold. We select 10% of the audio samples from all classes for validation, i.e., for tuning parameters with a grid search. The remaining data are used later for the classification experiments. The optimal threshold is determined over threshold values in the range δ ∈ [1, 4], with a resolution of 0.01. The F1-measure is computed for each threshold value. To improve robustness, we use five-fold stratified cross-validation, and the F1-measure values are averaged across folds. We found that the threshold δ = 2.1 leads to the best performance; this threshold is then held constant throughout our experiments. To tune the SVM hyperparameters, following [37], we plotted the grid-search surface, as shown in Fig. 4, and found that the optimum hyperparameters giving the best prediction score were c = 100 and γ = 20.
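The grid search over rejection thresholds can be sketched as follows. It assumes PSR-style scores where high means "reject" (matching the paper's convention) and ground-truth unknown labels from the validation split; the function name is illustrative:

```python
import numpy as np

def tune_threshold(scores, is_unknown, lo=1.0, hi=4.0, step=0.01):
    """Pick the rejection threshold maximizing F1 on a validation
    set. scores: per-sample rejection scores (high = reject);
    is_unknown: ground-truth booleans for 'sample is unknown'."""
    best_t, best_f1 = lo, -1.0
    for t in np.arange(lo, hi + step, step):
        pred = scores >= t                 # predicted unknown
        tp = np.sum(pred & is_unknown)
        fp = np.sum(pred & ~is_unknown)
        fn = np.sum(~pred & is_unknown)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

In the paper this search is repeated over the folds of a stratified cross-validation and the fold-averaged F1 is maximized.
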
For the closed-set scenario, the audio dataset is split into training and testing datasets using five-fold cross-validation. In other words, from the data unused in the validation step, 20% are used for testing and 80% are used as the training dataset to model the classifiers. The classifiers are multi-class SVMs with the one-vs-all approach, where a separate classifier is trained for each event class. In the testing stage, a classifier fusion step merges the results of the multiple classifiers, and the final predicted class is the one with the maximum vote. The audio classification is performed at both the frame level and the event level. In general, event-based performance is expected to be better than frame-based performance, because the detected event class is obtained as the most frequent frame-based class over the whole event (the statistical mode), which removes some local errors.
While multi-class classification for the closed set system is evaluated by tracking the correct and incorrect classification, the open set classification evaluation must keep track of incorrect multi-class classification over known categories and errors between unknown and known categories. Therefore, two types of open set experiments are conducted: reduced open-set recognition, which computes errors between known and unknown categories, and multi-class open-set recognition, which evaluates the whole system. This will be further explained in the next section.

VII. RESULTS AND DISCUSSION
This section evaluates the efficiency of our proposed architectures. We conducted our experiments, as shown in Fig. 5, on a closed-set scenario and an open-set scenario. We follow a cross-validation testing procedure to verify the reliability of the reported results across several trials; the experiments are conducted five times. During the experiments, our procedure randomly sorted classes into target, known, and unknown classes, producing a gradual transition from closed-set to increasingly open-set configurations. We report the efficiency of the system for both segment-based and event-based metrics, using the standard evaluation setups provided in [3]. The systems are evaluated for the following types of errors:
- Misclassification: test samples misclassified with a wrong label belonging to one of the predefined classes;
- False unknown: test samples rejected as unknown but in reality belonging to one of the predefined classes;
- False known: test samples truly unknown but assigned to one of the predefined classes.
Three tasks are considered:
- Closed-set recognition examines the classifiers' accuracy when only the first type of error is possible. This task is important because it shows how well an algorithm learns the training data.
- Reduced open-set recognition examines the classifiers' ability to distinguish unknown from known targets, as described in [38]. For a signal labeled by the classifier as known, it does not show whether the signal would be further correctly assigned to one of the known classes. The first type of error is not possible in this scenario.
- Multi-class open-set recognition, described by Scheirer et al. [2], where the system not only rejects a target not seen in the training stage but also labels a known target with its class. In this scenario, all three types of errors are possible.
In order to assess the accuracy and effectiveness of our proposed method, it was compared with the following open-set recognition methods from the prior art:
- LP-SVM classifier [36]: an SVM classifier with a linear kernel, with posterior probabilities calibrated using Platt scaling.
- RBF-SVM classifier [19]: a one-vs-all multi-class SVM classifier with a radial basis function (RBF) kernel, with posterior probabilities calibrated using Platt scaling.
- 1-vs-set machine [2]: a linear classifier with a one-vs-all approach that jointly minimizes the empirical risk and the open space risk. We used the new version of the C code implementation provided on the website [39].
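The two Platt-calibrated SVM baselines can be reproduced with, for instance, scikit-learn, whose SVC fits Platt scaling internally when probability=True; the kernel choice is the only difference between the two. The features below are random stand-ins, not the paper's acoustic features:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 8))     # stand-in acoustic feature vectors
y_train = rng.integers(0, 3, size=120)  # three known classes

# probability=True fits Platt scaling internally, mapping SVM decision
# values to posterior probabilities via an internal cross-validation.
linear_platt = SVC(kernel="linear", probability=True, random_state=0)
linear_platt.fit(X_train, y_train)

rbf_platt = SVC(kernel="rbf", probability=True, random_state=0)
rbf_platt.fit(X_train, y_train)

# Calibrated posteriors for one sample: one probability per known class.
probs = rbf_platt.predict_proba(X_train[:1])
```

The rejection stage (thresholding these posteriors) is applied afterwards and is not part of the classifiers themselves.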

A. CLOSED-SET RECOGNITION
When Y_k = Y_u, the number of unknown acoustic classes is 0, which corresponds to a fully closed set. The training/test protocol is 5-fold cross-validation: the model of each class is trained on four folds and tested on the remaining fifth fold. Since the purpose of the closed-set recognition experiment is to validate the ability of our proposed algorithm to discriminate among known classes, we did not conduct comparisons with other algorithms in this part; comparisons with other algorithms are made in the open-set recognition experiments of the next sub-sections. Table 2 shows the overall accuracy and F1 metrics, computed with macro-averages and micro-averages. As can be seen, there are 11% misclassifications due to the similarities among some classes. In general, the classification performance under both event-based and frame-based metrics shows that the proposed algorithm is a reliable classifier. For frame-based and event-based metrics, the confusion matrices are displayed in Fig. 6 and Fig. 7, respectively. The matrix rows and columns refer to the ground truth and predicted labels, respectively. The matrix is normalized row-wise, as mentioned earlier; this gives all classes equal weight, effectively treating the dataset as class-balanced. In the confusion matrix, the diagonal elements correspond to correct classifications, while the off-diagonal elements correspond to misclassifications.
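As a sketch of this 5-fold protocol and the row-wise normalized confusion matrix, assuming scikit-learn and synthetic stand-in data in place of the paper's acoustic features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the audio feature set (4 known classes).
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# 5-fold CV: each sample is predicted by a model trained on the other four folds.
y_pred = cross_val_predict(SVC(kernel="rbf"), X, y, cv=5)

# Row-normalized confusion matrix: rows = ground truth, columns = predictions.
# Each row sums to 1, so every class contributes equally regardless of size.
cm = confusion_matrix(y, y_pred, normalize="true")
```

Diagonal entries of `cm` are per-class recall; off-diagonal entries show which classes are confused with each other, as in Fig. 6 and Fig. 7.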
As can be seen, most of the signals are correctly classified. However, there are some cases where the system fails to distinguish classes due to strong correlation between them. In both figures, among the 16 classes, 'door slam' was the most difficult sound class to identify: it was frequently confused with other short-duration classes, such as 'switch' and 'keys.' The macro-averaged F1-scores for the experiments of Fig. 6 and Fig. 7 are 0.854 and 0.892, respectively, while the micro-averaged F1-scores are 0.844 and 0.876, respectively. These scores show the expected overall better performance of event-based metrics. The closed-set results of this sub-section will be used as a baseline for the open-set results of the next sub-sections, where a comparison is also made with prior art methods.

B. REDUCED OPEN-SET RECOGNITION
The reduced open-set performance shows how well an algorithm distinguishes known from unknown data. This scenario is simulated by selecting a portion of the available classes and treating them as known in the training and testing stages. This experiment is designed only for rejecting or accepting a new sample: it decides whether or not a new sample belongs to the defined groups (known classes), but it does not reveal whether the correct class would be assigned within those groups; that information is captured by the closed-set performance. The F1-score, precision, and recall measures are computed. The experiments are conducted with different numbers of target, known, and unknown classes. Table 3 shows some selected trials and our proposed algorithm's responses, reporting both macro-averaged and micro-averaged metrics. As described in (12)-(13), a macro-average treats all classes equally, since it computes each class's score independently and then averages them, whereas a micro-average pools the contributions of all classes before averaging.
Since we have more examples for some classes than others, the micro-average is preferable here.
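The difference between the two averages is easy to see on an imbalanced toy example; scikit-learn's f1_score implements both (the labels below are illustrative only):

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: class "a" dominates the test set.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a"] * 2   # classifier never predicts the rare class "b"

macro = f1_score(y_true, y_pred, average="macro")  # every class weighted equally
micro = f1_score(y_true, y_pred, average="micro")  # pooled over all samples
```

Here the macro-average is dragged down by the rare class's zero F1, while the micro-average reflects the per-sample performance, which is the behavior preferred when class sizes differ.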
The results of the experiments evaluating reduced open-set recognition for the proposed method, as well as the prior art methods, are summarized and discussed in the following. Using frame-based metrics, the results of the detection experiment for the different systems are shown in Fig. 8. Rejection by thresholding with a linear kernel and Platt probabilities produced the worst performance. A likely explanation is that the linear-kernel Platt calibration generalizes poorly to unknown classes, which tend to fall between the separating hyperplanes.
Similarly, the SVM with an RBF kernel and Platt probabilities produced high F1 measures at lower levels of openness, but its performance degraded as openness increased; the classification problem becomes more challenging as the number of unknown classes grows. Compared with the other classifiers, our proposed method produced better results over a wide range of openness values, except for the extreme case with a high openness of 0.46, where the 1-vs-set machine produced better results.
In the same way, the reduced open-set recognition results for event-based metrics are reported in Fig. 9. As expected, when more classes are available during training (less openness), the classifiers are again more accurate. We note that the performance under event-based metrics is better than under frame-based metrics in Fig. 8, for all classifiers. As the openness increases, the performance of the RBF-kernel Platt classifier drops quickly, and once again the 1-vs-set machine maintained good performance in very open scenarios. Overall, however, our proposed method again produced better results than the other classifiers over a wide range of openness values.
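For reference, the openness values quoted in this section follow the measure introduced by Scheirer et al. [2], which relates the numbers of training, target, and testing classes (the function name below is illustrative):

```python
from math import sqrt

def openness(n_train, n_target, n_test):
    """Openness of an experiment (Scheirer et al. [2]).

    n_train  -- number of classes seen during training
    n_target -- number of target classes to be recognized
    n_test   -- number of classes present at testing time
    Returns 0 for a fully closed set, approaching 1 as more
    unseen classes appear at test time.
    """
    return 1.0 - sqrt(2.0 * n_train / (n_target + n_test))

closed = openness(n_train=6, n_target=6, n_test=6)    # no unseen classes
open_ = openness(n_train=6, n_target=6, n_test=16)    # 10 unseen classes added
```

Adding unknown classes at test time increases n_test and therefore the openness, which is how the sweeps in Figs. 8-11 are generated.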

C. MULTI-CLASS OPEN SET RECOGNITION
In the previous subsection, the results showed the ability of our proposed method to identify and reject unknown classes, and to accept known classes without labeling them. In this subsection, the classifiers are evaluated for multi-class open-set recognition, so the experiments are restricted to algorithms that have a rejection option. The experiments are performed by selecting six classes for training; the remaining ten classes are used as unknown data. To generate different amounts of openness, each trial selects from 1 to 10 of the available unknown classes, and from 2 to 6 classes to be considered as target classes. All the algorithms are executed on the same data, with exactly the same negative and positive examples, to ensure a fair comparison. During testing, we count rejected samples as true rejections if they come from an unknown class, and as false rejections if they come from a known class. Furthermore, the algorithms assign each accepted sample to one of the known classes: the class with the maximum score, probability, or number of votes is the predicted class. Any algorithm that does not reject well will have very poor precision as the setup becomes more open, because a sample wrongly accepted or rejected at the threshold stage appears as a misclassification in the metrics used.
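The accept-or-reject step followed by class assignment can be sketched as a threshold on the calibrated posteriors. The sketch below substitutes a simple maximum-posterior rule for the paper's PSR-based rule, so the function name, the threshold value, and the labels are illustrative only:

```python
import numpy as np

def predict_with_rejection(posteriors, classes, threshold):
    """Assign each sample to the argmax class, or reject it as 'unknown'
    when the top posterior falls below the threshold. A simplified
    stand-in for the PSR-based decision rule; `threshold` is tuned
    on validation data in practice."""
    labels = []
    for p in posteriors:
        k = int(np.argmax(p))
        labels.append(classes[k] if p[k] >= threshold else "unknown")
    return labels

probs = np.array([[0.90, 0.05, 0.05],   # peaked posterior -> accepted
                  [0.40, 0.35, 0.25]])  # flat posterior   -> rejected
labels = predict_with_rejection(probs, ["door", "keys", "siren"], threshold=0.6)
```

A peaked posterior distribution is accepted and labeled with the argmax class, while a flat distribution, typical of an unseen class, is rejected as unknown.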
For frame-based metrics, we see from Fig. 10 that our proposed method provides the best or near-best performance over a wide range of openness values, both for separating known classes from unknown classes and for distinguishing among the known classes. The linear-kernel Platt classifier is again the weakest method in this experiment, and the decrease in performance as the openness of the dataset increases is again very clear for most methods. Fig. 11 shows the performance of the same simulation setups under event-based metrics. We note that performance under event-based metrics is again better than under frame-based metrics in Fig. 10, for all classifiers, and again our proposed method has the best or near-best performance over a wide range of openness values.

VIII. CONCLUSION
In this work, we investigated the use of a supervised classification strategy for sound event detection and identification in an open-set scenario. Extensive experiments were conducted using sound features proposed in the literature for closed-set audio identification. For challenging open-set scenarios, experiments using SVM classifiers were performed to recognize known versus unknown audio events, using a threshold and a rejection function. The rejection function proposed in this paper for audio classification is a confidence measure called the peak side ratio (PSR). It characterizes the distribution of posterior probabilities across all classifier outputs to determine whether a measured event belongs to a given group of known events. The experiments in this paper were performed on data from the DCASE 2013 challenge. Compared to previous work, our proposed method delivered the best or nearly the best performance over a wide range of openness values. However, for very large openness values, our proposed method was outperformed by the 1-vs-set machine method.
Overall, we demonstrated that our proposed method is promising for open-set audio classification. Future work should include designing a system with improved performance when the number of known classes is very small.