AI-enabled Sound Pattern Recognition on Asthma Medication Adherence: Evaluation with the RDA Benchmark Suite

Asthma is a common, usually long-term respiratory disease with a negative impact on global society and the economy. Treatment involves using medical devices (inhalers) that distribute medication to the airways, and its efficiency depends on the precision of the inhalation technique. There is a clinical need for objective methods to assess the inhalation technique during clinical consultation. Integrated health monitoring systems equipped with sensors and embedded sound signal detection, analysis and identification enable the recognition of drug actuation and could provide powerful tools for reliable and effective audio content analysis. This paper revisits sound pattern recognition with machine learning techniques for asthma medication adherence assessment and presents the Respiratory and Drug Actuation (RDA) Suite (https://gitlab.com/vvr/monitoring-medication-adherence/rda-benchmark) for benchmarking and further research. The RDA Suite includes a set of tools for audio processing, feature extraction and classification procedures and is provided along with a dataset consisting of respiratory and drug actuation sounds. The classification models in RDA are implemented based on conventional and advanced machine learning and deep network architectures. This study provides a comparative evaluation of the implemented approaches, examines potential improvements and discusses challenges and future trends.


INTRODUCTION
Asthma is a chronic inflammatory condition of the airways, with over 235 million people suffering worldwide [1]- [3]. The direct cost associated with asthma has placed a significant burden on society and healthcare systems [4], [5]. Asthma is a common disease among children and one of the most common chronic conditions. It is characterized by recurrent episodes of wheezing, breathlessness, dyspnea, chest tightness and coughing, known as asthma attacks [6]. It deteriorates the quality of life of patients and their families to a degree that leads to increased healthcare cost, poor clinical outcomes and increased morbidity rates [7]. The variety of obstructive respiratory diseases [8], [9] reveals the importance of innovative models that could help patients cope with their condition and enjoy a better quality of life [10]. There is evidence that adherence to asthma treatment is variable, hindering the proper management of the disease [11]. Detailed constructive feedback from clinicians on inhaler usage may motivate patients to focus more on treatment adherence, which in turn may improve their quality of life, prevent exacerbation and hospitalization events and, eventually, reduce mortality rates associated with chronic respiratory diseases [12].
Inhaled aerosol therapies are the main treatment of obstructive lung diseases [13]. Inhaler-based monitoring devices were introduced at the beginning of the 1980s and have since been developed mainly for the proper assessment of medication adherence. These devices are presented in section 3. Aerosol devices deliver a fixed medication dose, rapidly and directly into the airways, from a pressurized canister containing a medication/propellant mixture [14], [15]. The efficient and effective management of asthma is strongly connected with patient adherence to the prescribed action plan, while reduced adherence has been strongly linked with significant indicators of health degradation. Active feedback may encourage patients to improve their adherence and to manage their condition better. For this purpose, specialists have developed methodologies to monitor inhaler users and to determine whether patients use their inhaler devices with the appropriate technique and with the correct timing. Extracting the desired information from the measured audio data requires a transformation of the data (section 5). The current study presents machine learning and, mainly, deep learning approaches that allow researchers to develop classification models which, given the monitored acoustic signals as input, automatically recognize the phase of respiration and detect the drug inhalation onset (section 6). Often, time-frequency analysis of the respiratory signals [16]- [18] is performed prior to classification, to extract a variety of features that are critical for extensive analysis. Acoustic analysis of breathing has been employed to detect the different phases of inhaler use, such as inhalation, exhalation and drug actuation [19]- [21]. Many of these works are related to personalized management services for obstructive respiratory diseases, aiming to provide methodologies for medication adherence monitoring [22].
Specifically, these studies focus on the deployment either of device-integrated solutions using pressure-activated switches [23], [24] or of intelligent systems for ambient sound analysis [130]. Despite their differences, every approach targets the detection of drug actuation and the recognition of respiratory events, for correct drug assessment and procedure management.
Signal processing techniques can effectively extract useful information. Audio processing problems can require complex and intricate approaches for extracting patterns of critical information, especially from noisy and sometimes incomplete (i.e., time series with missing values) sound measurements [25], [26]. Such approaches, however, can be quite difficult to develop. The adherence of patients to their medication intake, in terms of prescribed dosage and careful usage of inhaler devices, is critical for controlling the disease, as 24% of asthma exacerbations and 60% of hospitalizations are caused by poor medication adherence [27]. Studies suggest that up to 67% of clinicians cannot describe the steps correctly or demonstrate correct inhaler usage, so we focus on the optimization of adherence and on the management of non-adherence through systems and methodologies that take the patient's preferences into account in treatment and care decisions [28], [29]. For these experiments, a pressurized metered-dose inhaler (pMDI) was used, and patients were instructed to actuate the canister of the pMDI as they began a "slow" and "deep" inhalation [30]. To ensure that the medication reaches the lower airways, the inhalation flow during drug actuation must remain steadily below 90 L/min [31], [32]. Because the majority of patients perform at least one step of the inhalation technique incorrectly (e.g., insufficient respiratory effort), systematic training is required to achieve optimal inhaler technique [33]. Through breathing (inhalation and exhalation) the respiratory system facilitates the exchange of gases between the air and the blood and between the blood and the body's cells [34]. Modern signal analysis techniques have been applied to extract features from inhalation sounds that characterize these events.
Hence, we hypothesized that by analyzing the acoustics of inhalation, in a group of patients with a variety of respiratory and non-respiratory diseases, we could assess the accuracy, the sensitivity and the specificity of the models, related with inhalation sound recognition [35].

METHODOLOGY
This paper presents an extensive review and discussion of the state-of-the-art methods and tools for acoustic analysis and content-based audio classification of inhaler sounds for medication adherence, which could be used to improve aerosol therapy techniques [36]. We aim to cover a large part of the existing material, from the early work on the topic to the current status of the research, so that the review is of broad interest, while restricting its scope to inhaler and respiratory sounds.
The methodology that we followed aimed to include works utilizing pMDI inhalers for audio analysis and recognition purposes. Our research includes articles that use classical algorithms and machine learning approaches for acoustic signal analysis, detection and recognition. The state of the art begins with methods from 2010, which mainly use decision trees to identify the signal, and continues with supervised learning methods that aim to classify respiratory sounds obtained from pMDIs. This scientific sub-field is referred to as "sound analysis, detection and recognition" and the search query was formed as follows: (("inhalers' sounds") AND ("identification" OR "recognition" OR "classification") AND ("machine learning" OR "deep learning")). We retrieved papers published from January 1, 2010 until December 31, 2021. In addition to the articles retrieved by this search, we also examined works published in the same time range that were identified through manual search. The main findings of the algorithms detailed in this review suggest that temporal and spectral audio-based features of inhaler sounds can be used to assess the inhalation technique objectively.
Following the literature review, we developed the RDA Benchmark suite, in which we implemented techniques from original research articles and which we also used for a comparative analysis. The literature search identified boosting approaches, decision-tree logic and deep neural networks for respiratory and drug actuation classification. The boosting models are built in a stage-wise fashion, and arbitrary differentiable loss functions are introduced to improve the algorithms' performance and to account for possible over-fitting. We deployed these methods and evaluated them on the RDA dataset, which includes the actuation, inhalation and exhalation classes.
Data annotation was performed with Audacity, whereas feature extraction and signal analysis were performed using Python libraries (Pandas and NumPy). The overall flow of the research and the individual methodological components of the inhaler and respiratory sound analysis are illustrated in figure 1.
The rest of the paper is organized as follows: Section 3 provides a synopsis of the state of the art of inhaler devices with different system designs. Section 4 presents the RDA dataset and the annotation of its recordings. Section 5 describes the feature extraction procedures that the different approaches apply to the audio recordings. Section 6 analyzes the frameworks, architectural components and methodological aspects used for the detection and recognition of respiratory and actuation sounds in the RDA dataset. Section 7 describes the experimental evaluation of each approach and compares their accuracy, section 8 gives recommendations for future directions in acoustic analysis and, finally, section 9 concludes this study.

INHALER TECHNIQUE MONITORING SYSTEMS
There has been an increasing interest from researchers and software architects for medication adherence monitoring devices. The first inhalation device used for asthma was the pressurized metered-dose inhaler, in the 1950s. Today, there are many devices available with different techniques to support the proper drug intake [37]. We briefly present representative technologies utilized over time:

Fig. 1: Flow Diagram
• SmartMist TM is a microprocessor-controlled device, widely used in academic research, that optimizes the lung deposition of drugs emitted from metered-dose inhalers (MDIs) [38]- [40].
• Diskus Adherence Logger (DAL) is an inhaler device with a small sensor, designed to identify the motion of the dose delivery lever in Diskus DPIs and to communicate with the event recorder chip for control and data uploading to a computer [41], [42].
• The SmartTrack is an innovative adherence monitoring device for pressurized metered-dose inhalers that consists of an LCD screen and four push buttons for navigating the device menu [43], [44].
• The SmartTurbo (Adherium (NZ) Ltd, Auckland, New Zealand) is an electronic monitoring device that is used in combination with a Turbuhaler device (AstraZeneca, UK) and consists of electromechanical sensors that identify the state of the inhaler mouthpiece [45]- [47].
• The Asthmapolis system relies on technology that monitors the location of blister actuations, allowing the user to gain information about the disease, such as the date and time of usage [48], and to collect timely and geographically specific information about asthma management, with a clear picture of health status [49], [50].
• The "Inspiromatic" is an innovative approach to inhaler enhancement, based on real-time inhalation flow measurements [52].
• Sensohaler is a novel device that equips MDIs with fundamental acoustic sensing functionalities that are used for the prediction of volumetric flow rate [53].
• The T-Haler device is based on another innovative approach to the design of an MDI, with enhanced monitoring of key performance characteristics [52].
• Furthermore, an integrated system was presented at the University of Patras [54], consisting of three main parts: the monitoring device, the smartphone application and the cloud processing server.

EVALUATION DATASET
The central aim of this research is to identify associations between high-level classification labels and low-level features extracted from audio clips of different semantic activities. We investigate the clinical applicability of different audio-based signal processing methods for assessing medication adherence. The dataset [129] consists of recordings acquired in an acoustically controlled setting, free of ambient indoor environmental noise, at the University of Patras. Three subjects familiar with the inhaler technique participated in the study. The participants were instructed to use the inhaler as typically performed in a clinical procedure. Informed consent was obtained from every participant. During breathing and drug actuation, the audio signals were acquired by a microphone attached to the inhalation device, communicating with a mobile phone via Bluetooth. The addition of the adherence monitoring device did not impact the normal functioning of the inhaler, which had a full placebo canister. In total, 370 audio files were recorded, each of different duration and containing an entire inhaler use case, with respiratory flow ranging from 180 to 240 L/min. Each audio recording was sampled at a frequency of 8 kHz and stored as a mono-channel WAV file at 8-bit depth. The audio recordings were segmented and annotated by a human specialist into inhaler actuation, exhalation, inhalation and environmental noise. The obtained segments (of non-mixed states) were of variable length and, for some methods, were further segmented into frames of fixed length for the purposes of feature extraction, as described in section 5.
Fig. 2: Audio timeseries colored with ground truth labels (top) and visualization of the corresponding spectrogram, within a sliding window (bottom). Events include inhalation (blue), drug actuation (green), exhalation (red) and environmental noise (black). Window sliding positions might include non-mixed states (window 1) or transitional states, e.g. from inhalation to drug actuation (window 2) and breath holding to exhalation (window 3), respectively. The class labels of the sliding windows correspond to the central points (reproduced with permission from Pettas et al. [51]).
The acoustic signal of a typical patient recording is shown in figure 2. The constructed database overall consisted of 193 drug actuation segments, 319 inhalation, 620 exhalation and 505 environmental noise segments, ready to be used for audio recognition, using different sets of features.

AUDIO DESCRIPTORS
Various features have been extracted from audio signals, both in the temporal and in the spectral domain, and have been used as a basis for audio analysis algorithms [55]. It is typical for audio analysis to extract the features across sliding time windows, in order to capture the class or activity within that particular moment in time. The extraction and selection of robust and descriptive features for specific applications is the main challenge in designing audio classification systems [56], [57]. We have based our characterization of the audio signals on several spectral features and audio patterns proposed in the literature. We first denote a signal $x(t)$, $t = 1, \dots, T$, as a time series in the time domain.
As the signal changes through time, it is assumed that on short time scales the signal is statistically stationary and, thus, statistical features can be extracted through a windowing process, in which the signal is segmented into small, possibly overlapping, time windows (frames) of the same length $N$. We denote as $x_n(i)$ the $i$-th sample in the $n$-th frame of the audio signal, where $i = 1, \dots, N$.
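The windowing process described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the RDA Suite implementation: the function name `frame_signal` is hypothetical, and the frame and hop lengths (40 ms frames with 20 ms overlap at 8 kHz) are assumed values chosen only for the example.

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames of equal length N.

    Trailing samples that do not fill a whole frame are dropped.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop_len
    # Row n holds the samples x_n(1), ..., x_n(N) of the n-th frame.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx]

# Assumed example values: 1 s of audio at 8 kHz, 320-sample frames, 160-sample hop.
fs = 8000
x = np.arange(fs, dtype=float)
frames = frame_signal(x, frame_len=320, hop_len=160)
```

Each row of `frames` can then be passed to a per-frame feature extractor, so that one feature vector is obtained for every window position.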

Volume in time domain analysis
Volume in the time domain is a reliable indicator for silence detection. Therefore, it can be used to segment audio sequences and to determine clip boundaries. It is commonly perceived as loudness, since natural sounds are pressure waves that carry different amounts of power, with a different distribution for each audio recording. In electronic sounds, the physical quantity is amplitude and, therefore, volume is often calculated as the Root-Mean-Square (RMS) of the amplitude. The volume of the $n$-th frame is calculated by the following formula:
$$ V(n) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_n^2(i)} $$
where $V(n)$ is the volume of the $n$-th frame in the time domain. Figure 3 shows the volume of an inhalation recording.
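As a brief illustration of RMS volume and its use for silence detection, the following sketch computes the volume of each frame of a small array. The function name `frame_volume` and the silence threshold of 0.1 are assumptions made for this example only.

```python
import numpy as np

def frame_volume(frames):
    """RMS volume of each row (frame) of a 2-D array."""
    frames = np.asarray(frames, dtype=float)
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Two toy frames: one with constant magnitude 3, one silent.
frames = np.array([[3.0, -3.0, 3.0, -3.0],
                   [0.0, 0.0, 0.0, 0.0]])
volume = frame_volume(frames)
is_silent = volume < 0.1  # hypothetical silence threshold
```

In practice the threshold would have to be tuned to the recording conditions, since the absolute volume depends on the microphone gain and bit depth.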

Zero Crossing Rate analysis
Zero-Crossing Rate (ZCR) is defined formally as the number of time-domain zero-crossings of the sound pressure waveform within a defined region of the signal, divided by the number of samples of that region [58]. In the context of discrete-time signals, a zero crossing is said to occur if successive samples have different algebraic signs. The ZCR is thus the rate at which the signal changes from positive to negative and vice versa [59], [60]. In some cases, only the "positive-going" or "negative-going" crossings are counted, rather than all the crossings, since between a pair of adjacent positive-going zero-crossings there must be exactly one negative-going zero-crossing [61]. This feature has been used extensively in speech recognition and music information retrieval, for example to classify percussive sounds:
$$ \mathrm{ZCR} = \frac{1}{T-1} \sum_{t=2}^{T} J\{x(t)\, x(t-1) < 0\} $$
where $x(t)$ is the signal in the time domain, with length $T$, and $J\{y\}$ is a logical function returning one if its argument is true and zero otherwise [62].
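The sign-change definition of the ZCR translates directly into code. The sketch below counts all crossings (both positive-going and negative-going) and normalizes by the number of successive sample pairs; the function name `zero_crossing_rate` is hypothetical.

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of successive sample pairs whose signs differ."""
    x = np.asarray(x, dtype=float)
    if len(x) < 2:
        return 0.0
    # A crossing occurs when the product of neighbouring samples is negative.
    crossings = x[1:] * x[:-1] < 0
    return float(np.sum(crossings)) / (len(x) - 1)

zcr_alternating = zero_crossing_rate([1, -1, 1, -1, 1])  # sign flips at every step
zcr_monotone = zero_crossing_rate([1, 2, 3, 4])          # never crosses zero
```

Noise-like events such as a drug actuation plume tend to yield a higher ZCR than low-frequency breathing sounds, which is why the feature is often paired with volume.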

Spectrogram analysis
Any sound signal can be expressed in the frequency spectrum, which shows the average amplitude of various frequency components in the audio signal and the (frequency) distribution. To obtain frequency domain features, the spectrogram of an audio clip in the form of Short-Time Fourier Transform (STFT), is calculated for each audio frame. The spectrogram is used for the extraction of two features, namely frequency centroid and frequency bandwidth.
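The text does not spell out how the frequency centroid and the frequency bandwidth are computed. A common choice, assumed here, is to take the power-weighted mean of the frequency axis as the centroid and the power-weighted standard deviation around it as the bandwidth. The sketch below applies this to a single Hamming-windowed frame; the function name is hypothetical.

```python
import numpy as np

def spectral_centroid_bandwidth(frame, fs):
    """Power-weighted mean frequency and spread of one audio frame (in Hz)."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    total = power.sum()
    if total == 0:
        return 0.0, 0.0
    centroid = np.sum(freqs * power) / total
    bandwidth = np.sqrt(np.sum((freqs - centroid) ** 2 * power) / total)
    return float(centroid), float(bandwidth)

# A pure 1 kHz tone sampled at 8 kHz should be centred close to 1 kHz.
fs = 8000
t = np.arange(512) / fs
centroid, bandwidth = spectral_centroid_bandwidth(np.sin(2 * np.pi * 1000 * t), fs)
```

Applying the function to every frame of a recording gives one (centroid, bandwidth) pair per column of the spectrogram.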

Cepstrum analysis
The first paper on cepstrum analysis [63] defined the cepstrum as "the power spectrum of the logarithm of the power spectrum". The cepstrum results from the Inverse Discrete Fourier Transform (IDFT) of the logarithm of the magnitude of a signal's Discrete Fourier Transform (DFT). It has been used in speech analysis for determining voice pitch (by accurately measuring the harmonic spacing), but also for separating the formants (transfer function of the vocal tract) from voiced and unvoiced sources, which led quite early to similar applications in mechanics [64]. The definition of the complex cepstrum is:
$$ \hat{x}(t) = F^{-1}\left\{ \log\left( F\{x(t)\} \right) \right\}, \qquad \log X = \log|X| + j \arg X \qquad (2) $$
where $F^{-1}$ is the IDFT and $F$ is the DFT, and the complex logarithm is expressed in terms of the amplitude and phase of the spectrum [65].
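To make the IDFT-of-log-magnitude construction concrete, the sketch below computes the real cepstrum, which discards the phase term that the complex cepstrum retains. The function name `real_cepstrum` and the small constant `eps` are assumptions of this example. The test signal is an impulse followed by an attenuated echo, for which the cepstrum shows a peak at the echo delay.

```python
import numpy as np

def real_cepstrum(x, eps=1e-12):
    """IDFT of the log magnitude of the DFT (phase is discarded)."""
    spectrum = np.fft.fft(np.asarray(x, dtype=float))
    log_mag = np.log(np.abs(spectrum) + eps)  # eps guards against log(0)
    return np.real(np.fft.ifft(log_mag))

# Impulse plus an echo of half the amplitude, delayed by 32 samples.
n, delay = 256, 32
x = np.zeros(n)
x[0] = 1.0
x[delay] = 0.5
cepstrum = real_cepstrum(x)
# Search the positive quefrencies, skipping the zeroth coefficient.
peak_lag = 1 + int(np.argmax(cepstrum[1:n // 2]))
```

The same mechanism is what allows the cepstrum to expose harmonic spacing: periodic structure in the log spectrum becomes a localized peak in the quefrency domain.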

Mel Frequency Cepstral Coefficients Analysis and Power Spectral Density
Audio feature extraction is used to decrease the dimensionality of the input vector, while maintaining the discriminating power of the signal. It is an important step in preparing data for the sound identification process, and this kind of analysis is derived from the cepstral representation of the audio data [66]. A systematic study of various spectral features can be found in (Kinnunen 2004) [67]. The Mel Frequency Cepstral Coefficient (MFCC) feature extraction is a sound recognition technique based on a frequency-domain representation on the Mel scale, which models the frequency scale of the human ear [68]. MFCC extraction is a significant technique, mainly due to its efficient computation schemes and its robustness in the presence of different kinds of noise [69]. To compute the MFCCs, each overlapping frame is multiplied by a Hamming window and the Fast Fourier Transform (FFT) is computed for every frame. The Hamming window sequence is defined by:
$$ h(n) = 0.54 - 0.46 \cos\left( \frac{2 \pi n}{N - 1} \right), \qquad 0 \le n \le N - 1 $$
For each frame we take the periodogram-based Power Spectral Density (PSD) estimate:
$$ P_i(k) = \frac{1}{N} \left| X_i(k) \right|^2 $$
where $X_i(k)$ is the complex DFT of the windowed $i$-th frame $x_i(n)$, with $N$ samples:
$$ X_i(k) = \sum_{n=0}^{N-1} x_i(n)\, h(n)\, e^{-j 2 \pi k n / N} $$
The power spectral density function helps calculate the total power contained in each spectral component of a specific signal. The power spectrum of any time-domain signal $x(t)$ helps to determine the distribution of the variance of $x(t)$ over the frequency domain, in the form of the spectral components into which the actual signal can be decomposed [70]. This is motivated by the human cochlea (an organ in the ear), which vibrates at different spots depending on the frequency of the incoming sounds. According to the location in the cochlea that vibrates, different nerves fire, informing the brain that specific frequencies are present. Therefore, the estimated PSD of the signal is related to the modulus of its DFT [71].
After this step, the spectrum is segmented into a number of critical bands by means of a filterbank. The filterbank typically consists of overlapping triangular filters, which are spaced linearly on the perceptual Mel scale. The Mel filterbank energies are then calculated, in order to examine how much energy exists in various frequency regions. The Mel scale determines how to space the filters and how wide to make them. It relates the perceived frequency of a pure tone to its actual measured frequency. Once we have the filterbank energies, we compute their logarithm. The logarithm allows us to use cepstral mean subtraction, which is a channel normalization technique. Finally, the Discrete Cosine Transform (DCT) is applied to the logarithm of the filterbank outputs, which results in the raw MFCC vector [72]. Because the filters all overlap, the filterbank energies are quite correlated with each other, and the DCT decorrelates them. There can be variations on this process, for example differences in the shape or spacing of the windows used to map the scale [73], or the addition of dynamic features, such as "delta" and "delta-delta" (first- and second-order frame-to-frame difference) coefficients. Exponentiating the log Mel filterbank spectrum before the cepstrum computation can significantly reduce the sensitivity of the cepstra to spurious low-energy perturbations [74]. The difference between the cepstrum and the Mel frequency cepstrum is that in the MFCC the frequency bands are equally spaced on the Mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can permit a better representation of sound in audio compression. MFCCs have been successfully used in speech and music applications, playing a central part in recent efforts towards machine audition [75].
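The steps above (Hamming window, periodogram, Mel filterbank, logarithm, DCT) can be sketched for a single frame as follows. This is an illustrative sketch and not the RDA Suite code: the function names are hypothetical, the Hz-to-Mel conversion uses one commonly quoted formula, and the choices of 26 filters and 13 coefficients are assumed defaults. The DCT-II is written out with NumPy to keep the example free of further dependencies.

```python
import numpy as np

def hz_to_mel(f):
    # One widely used Mel-scale formula (an assumption of this sketch).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose centres are equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):       # rising edge
            fbank[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling edge, equals 1 at the centre
            fbank[m - 1, k] = (right - k) / (right - centre)
    return fbank

def mfcc_frame(frame, fs, n_filters=26, n_coeffs=13):
    """MFCC vector of one frame: window, periodogram, Mel energies, log, DCT-II."""
    frame = np.asarray(frame, dtype=float)
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed)) ** 2 / n_fft
    energies = mel_filterbank(n_filters, n_fft, fs) @ power
    log_energies = np.log(energies + 1e-10)  # small offset avoids log(0)
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_filters)
    return basis @ log_energies

fs = 8000
t = np.arange(512) / fs
coeffs = mfcc_frame(np.sin(2 * np.pi * 440 * t), fs)
fbank = mel_filterbank(26, 512, fs)
```

Delta and delta-delta coefficients, where needed, would be obtained by differencing the resulting MFCC vectors across consecutive frames.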

Wavelet Transform analysis
The wavelet transform can construct a time-frequency representation of a signal that offers very good time and frequency localization. The wavelet transform of a signal $x(t)$ with the wavelet $\psi$ is defined as:
$$ W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\left( \frac{t - b}{a} \right) dt $$
where $a$ is called the dilation parameter and $b$ the translation parameter. The Morlet wavelet, consisting of a sinusoid multiplied by a Gaussian window, is commonly used, because its scale-frequency relationship requires less computation, as the peak frequency is equal to the center frequency of the wavelet. The Morlet wavelet $\psi$, which is sometimes called the Gabor wavelet, has impressive mathematical and biological properties and is given by:
$$ \psi(t) = \pi^{-1/4}\, e^{j 2 \pi f_0 t}\, e^{-t^2 / 2} $$
where $f_0$ is the center frequency of the mother wavelet [76].

Continuous Wavelet Transform
By definition, the Continuous Wavelet Transform (CWT) is a convolution of the input data sequence with a set of functions generated by the mother wavelet. It provides an overcomplete representation of a signal by letting the translation and scale parameters of the wavelets vary continuously [77], [78]. The convolution can be calculated in the time domain or in the frequency domain, by using an FFT algorithm. For the STFT, a fixed-width segment size controls the time-frequency resolution trade-off. This results in a single resolution in the time domain and a single resolution in the frequency domain, regardless of the rendered frequency. In contrast, wavelet analysis is a multi-resolution method. The time-frequency resolution is not constant, but varies with frequency. Multi-resolution analysis was designed for the common condition where high-frequency components exist for a short time duration within a signal, while low-frequency components are more persistent. Short-lived high-frequency components need strong time localization. In order to achieve this, the frequency resolution of high-frequency components is diminished. On the other hand, long-lived low-frequency components can tolerate poorer time resolution, but require effective frequency resolution. Low-frequency components often determine the significant part of a signal's character, and these properties are best quantified if the frequency resolution is as fine as possible. More references on the CWT can be found in appendix B.1.
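The frequency-domain evaluation of the CWT mentioned above can be sketched as follows: the FFT of the signal is multiplied by the Fourier transform of the scaled wavelet and transformed back. The sketch assumes a Morlet wavelet with a Gaussian envelope of unit width, a scale-to-frequency mapping a = f0 / f, and no calibration of the absolute amplitude; the function name `morlet_cwt` is hypothetical.

```python
import numpy as np

def morlet_cwt(x, fs, freqs, f0=1.0):
    """CWT coefficients for the requested analysis frequencies (in Hz).

    Row i holds the coefficients at scale a = f0 / freqs[i] for every
    translation b. The convolution is circular because it uses the FFT.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    X = np.fft.fft(x)
    omega = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)  # rad/s
    coeffs = np.empty((len(freqs), n), dtype=complex)
    for i, f in enumerate(freqs):
        a = f0 / f
        # Fourier transform of the Morlet wavelet, dilated by the scale a.
        psi_hat = (np.pi ** -0.25 * np.sqrt(2 * np.pi)
                   * np.exp(-0.5 * (a * omega - 2 * np.pi * f0) ** 2))
        coeffs[i] = np.fft.ifft(X * np.sqrt(a) * psi_hat)
    return coeffs

# A 500 Hz tone should respond most strongly at the 500 Hz analysis frequency.
fs = 8000
t = np.arange(2048) / fs
freqs = np.array([250.0, 500.0, 1000.0])
coeffs = morlet_cwt(np.sin(2 * np.pi * 500 * t), fs, freqs)
dominant = freqs[np.argmax(np.mean(np.abs(coeffs) ** 2, axis=1))]
```

Because the scale shrinks as the analysis frequency grows, the rows for high frequencies are sharply localized in time, which is the multi-resolution behaviour described in the text.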

RESPIRATORY AND INHALER'S SOUND CLASSIFICATION
The elements of the pipeline will be executed in parallel and in a time-sliced fashion, at some points. In some approaches, the data must be transformed to take advantage of the value it can deliver, so it can continuously improve the accuracy of the models and achieve successful results [79]. We tried to have the appropriate data quality, reliability and accessibility. Metadata extraction is correlated with the captured data and provides descriptive and targeted information, about the object and the data itself.

Decision Trees with CWT
Chronologically, the first algorithm employed to identify actuation sounds [80] utilized Decision Trees (DTs) with the CWT for inhalation classification. The CWT was calculated using a Morlet Wavelet (MW) with an adjustable parameter of 20. A peak assessment routine was then employed to detect and assess peaks exceeding a threshold $\theta_1 = 0.38$. Each peak was initially marked as a potential actuation plume. Power values were taken at a specific time distance (56 ms) before and after the peak point, to observe whether the power had decreased by a given threshold ($\theta_2 = 0.25$) in comparison to the peak value, in which case the peak was flagged as an actuation plume. It was observed that the actuation acoustic signal was of very short duration (100-150 ms). The wavelet variance is calculated so that different datasets may be compared at different scales:
$$ V(a) = \frac{1}{n} \sum_{j=1}^{n} W^2(a, x_j) $$
where $W^2(a, x_j)$ is the squared wavelet coefficient associated with scale $a$ at data point $x_j$ and $n$ is the number of data points. From this definition, wavelet variance is a function of scale. Bradshaw and Spies [81] pointed out that "high values of wavelet variance, at a given scale, reflect the presence of a greater number of peaks or a greater intensity of the signal, or both". Other algorithmic approaches in the sub-field of blister and respiratory sound detection and classification follow a three-stage methodology including (i) blister detection, (ii) breath detection and (iii) inhalation/exhalation differentiation [130], [131]. The algorithms were designed and developed to automatically detect inhaler events from the audio signals and to provide feedback concerning medication adherence. These approaches have multiple clinical implications, as they demonstrate the practicability of using acoustics to objectively monitor patient inhaler adherence and to provide real-time personalized medical care for chronic respiratory illness.
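The peak assessment routine and the wavelet variance can be illustrated with the sketch below. It encodes one possible reading of the description, stated here as an assumption: the power envelope is normalized to its maximum, a candidate is a local maximum above θ1, and it is accepted when the power 56 ms before and after the peak has fallen to at most (1 - θ2) times the peak value. The original algorithm [80] may define these thresholds differently, and the function names are hypothetical.

```python
import numpy as np

def detect_actuation_peaks(power, fs, theta1=0.38, theta2=0.25, offset_s=0.056):
    """Indices of short, sharp peaks in a power envelope (assumed interpretation)."""
    power = np.asarray(power, dtype=float)
    peak = power.max()
    if peak <= 0:
        return []
    p = power / peak                      # normalise so thresholds are relative
    d = max(int(round(offset_s * fs)), 1)
    hits = []
    for i in range(d, len(p) - d):
        is_local_max = p[i] > p[i - 1] and p[i] >= p[i + 1]
        if not (is_local_max and p[i] > theta1):
            continue
        # A plume is brief: power must have dropped on both sides of the peak.
        limit = (1 - theta2) * p[i]
        if p[i - d] <= limit and p[i + d] <= limit:
            hits.append(i)
    return hits

def wavelet_variance(coeffs_at_scale):
    """Mean squared wavelet coefficient at a single scale."""
    c = np.asarray(coeffs_at_scale)
    return float(np.mean(np.abs(c) ** 2))

# Synthetic envelope at 1 kHz: a narrow burst at sample 500 and a broad bump at 1500.
fs = 1000
t = np.arange(2000)
power = np.exp(-0.5 * ((t - 500) / 20.0) ** 2) + 0.8 * np.exp(-0.5 * ((t - 1500) / 300.0) ** 2)
hits = detect_actuation_peaks(power, fs)
variance = wavelet_variance([1.0, -1.0, 2.0])
```

In this example only the narrow burst is accepted, while the broad bump is rejected because its power barely changes within 56 ms, mirroring the observation that actuation sounds are very short.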

Hidden Markov Model
Another approach proposes the Directed Acyclic Graph (DAG) logic [135]. The formulation relies on a graph denoted as $G = \{N, L\}$, where $N = \{n_1, \dots, n_m\}$ represents the nodes and $L = \{l_1, \dots, l_p\}$ the links. Each node in $N$ is responsible for a binary classification task conducted via a set of Hidden Markov Models (HMMs), which fit well the specifications for sound pattern classification [136]. DAGs can be seen as a generalization of the class of Decision Trees (DTs), in which repetitions that may occur in different branches of the tree can be handled more efficiently, since different decision paths might be merged. In order to get an early indication of the degree of difficulty of a classification task, a metric is employed that represents the distance of the involved classes in the probabilistic space, the Kullback-Leibler Divergence (KLD). The KLD between two $J$-dimensional probability distributions $A$ and $B$ is defined as:
$$ D(A \,\|\, B) = \int_{\mathbb{R}^J} p(X \mid A) \log \frac{p(X \mid A)}{p(X \mid B)} \, dX $$
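For intuition about the KLD as a class-separability measure, the sketch below evaluates it for discrete distributions, for example normalized histograms of a feature computed for two classes. This is a simplification assumed for illustration: for the continuous distributions modelled by HMMs, the divergence generally has to be approximated numerically. The function name `kl_divergence` is hypothetical.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence D(p || q) in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()                       # normalise histograms to probabilities
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

a = [0.5, 0.5]
b = [0.9, 0.1]
d_ab = kl_divergence(a, b)
d_ba = kl_divergence(b, a)
d_aa = kl_divergence(a, a)
```

The divergence is zero for identical distributions and is not symmetric, so D(A || B) and D(B || A) generally differ; larger values suggest that the two classes are easier to separate.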

Quadratic Discriminant analysis
In 2018, a Quadratic Discriminant Analysis (QDA) model was employed, for audio-based analysis of respiratory sounds [82]. The algorithm was composed of two phases, training and testing, to automatically recognize the sound events. In both training and testing phases, the inhaler audio signal was band-pass filtered at 140 to 22000 Hz, to emphasize the events and to reduce the external noise. The audio signals were divided into frames of 40 ms duration, with 20 ms overlap. The DC offset (mean amplitude displacement from zero) was removed from each frame. Thirty audio-based features from time and spectral domains were extracted, for each frame. The extracted features are: 12 MFCC's, 10 Linear Predictive Coding (LPC) coefficients, PSD, ZCR and a high frequency power value (over 15 kHz), estimated using the CWT. Since the Flo-Tone device generates a harmonic sound during inhalation, a harmonic feature was also extracted. This harmonic feature was calculated as the peak value of the frame's auto-correlation function, searched in the range of 500-600Hz.
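The harmonic feature described above can be sketched as the maximum of the normalized autocorrelation of a frame over the lags that correspond to fundamental frequencies between 500 and 600 Hz. The sampling rate of 44.1 kHz used in the example is an assumption (the text only gives the 22000 Hz upper band edge), as are the function name and the use of a biased autocorrelation estimate.

```python
import numpy as np

def harmonic_feature(frame, fs, f_lo=500.0, f_hi=600.0):
    """Peak of the normalised autocorrelation within the 500-600 Hz lag range."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()                      # remove the DC offset
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if acf[0] <= 0:
        return 0.0
    acf = acf / acf[0]                                # lag 0 becomes 1
    lag_min = int(np.ceil(fs / f_hi))                 # shortest period of interest
    lag_max = int(np.floor(fs / f_lo))                # longest period of interest
    return float(acf[lag_min:lag_max + 1].max())

# Assumed setup: 40 ms frames at 44.1 kHz.
fs = 44100
t = np.arange(int(0.040 * fs)) / fs
in_band = harmonic_feature(np.sin(2 * np.pi * 550 * t), fs)   # tone inside the range
out_of_band = harmonic_feature(np.sin(2 * np.pi * 150 * t), fs)
```

A frame containing the harmonic tone yields a value close to one, whereas a frame without periodicity in that range yields a value near zero or below.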
Classifying between two multivariate normal populations leads to the idea of discriminant analysis, which is essentially the Bayes classifier for the problem and constructs a combination of the features. The features of the classes follow multivariate normal distributions with different means $\mu_i$ and precision matrices $\Omega_i = \Sigma_i^{-1}$, with $i = 1, 2, \dots, Q(Y)$, where $Q(Y)$ is the cardinality of the set $Y$ of the classes. In a high-dimensional setting, these matrices can be estimated only in the presence of sparsity.
Consider observing training data $(X_m, Y_m)$, $m = 1, \dots, n$, where $X_m \in X$ and $Y_m \in \{0, 1, \dots, n\}$. The idea of the one-versus-the-rest method is as follows: to get a $K$-class classifier, first construct a set of binary classifiers $C_1, C_2, \dots, C_K$. Each binary classifier is first trained to separate one class from the rest and, then, the multi-class classification is carried out according to the maximal output of the binary classifiers. Since the binary classifiers are obtained by training on different binary classification problems, it is unclear whether their real-valued outputs (before thresholding) are on comparable scales [83]. In practice, moreover, situations often arise where several binary classifiers assign the same instance to their respective class or where none does [83].
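The decision rule of the one-versus-the-rest method, together with the ambiguous situations mentioned above, can be illustrated with a matrix of real-valued classifier outputs. The scores below are made-up numbers for illustration, and both function names are hypothetical.

```python
import numpy as np

def one_vs_rest_predict(scores):
    """Pick, for each sample, the class whose binary classifier scores highest.

    scores[i, k] is the real-valued output of classifier C_k for sample i.
    """
    return np.argmax(np.asarray(scores, dtype=float), axis=1)

def ambiguous_samples(scores, threshold=0.0):
    """Flag samples claimed by several binary classifiers or by none."""
    positives = (np.asarray(scores, dtype=float) > threshold).sum(axis=1)
    return positives != 1

# Illustrative outputs of three binary classifiers for three samples.
scores = np.array([[0.9, -0.2, -0.5],    # exactly one classifier fires
                   [0.3, 0.4, -0.1],     # two classifiers fire
                   [-0.3, -0.2, -0.6]])  # no classifier fires
labels = one_vs_rest_predict(scores)
flags = ambiguous_samples(scores)
```

Taking the maximal output still produces a label for the ambiguous samples, but the result is only meaningful if the outputs of the binary classifiers are on comparable scales.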

Support Vector Machines
The Support Vector Machine (SVM) classifier is based on statistical learning theory [84] and has been shown to be one of the most robust supervised learning methods. The simplicity of SVMs comes from the fact that they apply a simple linear method to the data, but in a high-dimensional feature space that is nonlinearly related to the input space. In binary classification, only the decision boundary of the first class needs to be known and the rest (the complement of the first class) is considered the second class, whereas in multi-class classification several decision boundaries need to be calculated, which may increase the error probability. SVMs are highly accurate and able to model complex non-linear decision boundaries. The classifier may be applied to both linearly and non-linearly separable data, through the use of kernel transformations. Specifically, it transforms the data to a higher dimension, in which it can identify a hyperplane that separates the data.
Furthermore, another interesting application of SVMs to respiratory signal classification is presented in Eleftheriadou et al. [107], where an audio-based method assesses the proper usage of dry powder inhalers using the FFT of the signal. A window size of 512 frames was used, with 128 frames between FFT columns, and 16 MFCCs, including the zero coefficient, were returned, resulting in an array of size (16 × 126). To remove "silent" areas, the short-term features of the whole recording were extracted and an SVM model was trained to distinguish between high-energy and low-energy short-term frames, using 50% of the highest and 50% of the lowest energy components for training. Then, the segments containing sound and the active segments were detected by median and dynamic thresholding, respectively. In addition, a work in 2021 showed that the SVM classifier can be used on inhalation sounds, achieving an accuracy of 96.9% on an open access respiratory sounds database [132].

Random Forest
The Random Forest (RF) algorithm trains several tree-like classifiers [85] and aggregates their results by majority voting. RF usually achieves high accuracy and processing speed, though the correlation between trees may affect the accuracy of the outcomes. The RF classifier draws $n_{tree}$ bootstrap samples from the original data and constructs one tree from each sample, selecting the best split among a small set of randomly selected attributes. The overall accuracy of the RF depends on the strength of each tree and the correlation between any two trees.
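The three ingredients named above (bootstrap sampling, random attribute selection and majority voting) can be sketched without a full tree learner. The helper names are hypothetical and the tree-growing step itself is deliberately left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample per tree)."""
    return [rows[rng.randrange(len(rows))] for _ in rows]


def random_attribute_subset(n_attributes, subset_size, rng):
    """Pick the attribute indices a tree may consider when searching for a split."""
    return sorted(rng.sample(range(n_attributes), subset_size))


def majority_vote(tree_predictions):
    """Aggregate the per-tree class labels into the forest's decision."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)                       # fixed seed, for repeatability only
rows = list(range(10))                       # stand-in for 10 training examples
sample = bootstrap_sample(rows, rng)
attributes = random_attribute_subset(n_attributes=30, subset_size=5, rng=rng)
decision = majority_vote(["drug", "inhale", "drug", "exhale", "drug"])
```

Because each tree sees a different bootstrap sample and attribute subset, the trees are less correlated, which is the property the accuracy of the forest depends on.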

Adaboost
Boosting was proposed by Freund and Schapire, in 1990 [86]. Adaboost is the most common boosting algorithm. It is an efficient instrument for improving the predictive ability of a learning system and one of the most typical methods in ensemble learning. It usually employs DT models as weak learners and evaluates them sequentially [87]. Subsequent DTs are adjusted in favour of the samples misclassified by previous DTs.
The training set is $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in X$, $X$ represents a certain domain or instance space and each member is a training example with a label. In its initial development, boosting addressed a two-class learning problem. In going from two-class to multi-class classification, most boosting algorithms have been restricted to reducing the multi-class classification problem to multiple two-class problems (e.g. as shown in early works [88]- [93]). The weights of all the training examples are initially set equal to $1/n$. Adaboost runs $T$ iterations, repeatedly calling a weak learning algorithm. In iteration $t$, given the distribution $D_t$ over the examples, the weak learner finds an appropriate weak hypothesis $h_t : X \to \mathbb{R}$, so that a sequence of predicting functions is obtained. In the simplest case, where the range of each $h_t$ is two-valued, $\{-1, +1\}$, the task of the learner is to minimize the error. After $T$ iterations, the $T$ weak hypotheses are combined into the final predicting function (hypothesis) $H$ by weighted majority voting. The accuracy of a single weak learner is not high enough [94]; with the application of boosting, however, the accuracy of the final result is improved. The Adaboost algorithm is described in detail in appendix A.2.

Gradient Boosting
Recent work on inhalation signal classification [107] has used Gradient Boosting, as a numerical optimization method. The objective has been to minimize the loss function by adding weak learners in a gradient descent type procedure. Decision trees have also been used as weak learners in gradient boosting to produce an ensemble of weak prediction models [108]. The DT model is built in a stage-wise fashion and the generalization is achieved by an arbitrary differentiable loss function. The Gradient Boosting algorithm can easily overfit a training data set and that is why different regularization methods are applied to improve the algorithm's performance and address the problem of possible overfitting [109].

Gaussian Mixture Models
Gaussian Mixture Models (GMMs) can be employed to approximate any Probability Density Function (PDF), given a sufficient number of components. A novel content-based audio classification approach for monitoring pMDI medication adherence has been proposed [95], which exploits the separability of the cepstrogram features using a GMM classifier. GMMs are statistical models for representing normally distributed sub-populations within an overall population and are used in many pattern recognition applications. For each class, a separate model is trained by fitting the corresponding feature vectors to a GMM with parameters $\{a_i, \mu_i, C_i\}$, where $i \in \{1, \ldots, K\}$, $K$ is the number of components, $a_i$ is the mixture weight of component $i$, $\mu_i$ is the $d$-dimensional vector containing the mean values for each feature and $C_i$ is the corresponding covariance matrix. The Gaussian mixture density $p(v \mid \lambda_n)$ is modeled as a linear combination of multivariate Gaussian PDFs, where $v$ is a feature vector and $\lambda_n$ is the GMM corresponding to class $n$. To classify a test feature vector, we derive $p(v \mid \lambda_n)$ for each class and assign the vector to the class $n$ with the greatest likelihood. An expectation maximization (EM) approach is utilized to derive the parameters $K_n$, $\{a_i, \mu_i, C_i\}$ of the GMM $\lambda_n$ that best fits the input data of class $n$.
After the optimal parameters for the GMMs have been computed, and given $d$ the number of features, $K$ the number of components, $v$ a feature vector and $\lambda_n$ the GMM of class $n$, we have the following equation:
$$p(v \mid \lambda_n) = \sum_{i=1}^{K} a_i^n \, \mathcal{N}(v; \mu_i^n, C_i^n),$$
where $\mathcal{N}(v; \mu_i^n, C_i^n)$ is the multivariate Gaussian PDF of component $i$ and $a_i^n$ are the mixture weights, which satisfy the constraint:
$$\sum_{i=1}^{K} a_i^n = 1.$$
Finally, after $p(v \mid \lambda_n)$ has been estimated for the test feature vector $v$ and for each class $n$, the test feature vector is assigned to the class $n$ with the greatest likelihood, and the relevant feedback is provided. This procedure rests on the assumption that the initial dataset was compiled by a small group of people. For the derivation of the GMM parameters through the EM approach, we refer to appendix A.3.
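The maximum-likelihood decision described above can be sketched as follows, under the simplifying assumption of diagonal covariance matrices (full covariances would need a matrix inverse and determinant). The parameter values and the names `gmm_loglik` and `classify` are illustrative assumptions, not fitted models from the RDA suite.

```python
import math


def gaussian_diag_logpdf(v, mean, var):
    """Log density of a Gaussian with diagonal covariance (var holds the diagonal)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * s) + (x - m) ** 2 / s)
               for x, m, s in zip(v, mean, var))


def gmm_loglik(v, gmm):
    """log p(v | lambda) for a GMM given as a list of (weight, mean, var) components."""
    terms = [math.log(a) + gaussian_diag_logpdf(v, mu, var) for a, mu, var in gmm]
    peak = max(terms)
    # log-sum-exp keeps the weighted sum of densities numerically stable
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def classify(v, class_gmms):
    """Assign v to the class whose GMM gives the greatest likelihood."""
    return max(class_gmms, key=lambda name: gmm_loglik(v, class_gmms[name]))


# Assumed toy models: one GMM per class, mixture weights summing to 1.
class_gmms = {
    "drug": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "inhale": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 4.0], [1.0, 1.0])],
}
predicted = classify([5.5, 4.5], class_gmms)
```

Working in the log domain does not change the arg-max, but it avoids underflow when the feature dimension is large.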

Deep Learning Models
Neural Networks (NNs) have been employed in the past for a variety of classification problems and have shown highly accurate results in medical applications [96], [97] and general audio classification problems [98], [99]. Convolutional Neural Networks (CNNs) can adapt to the characteristics of the training dataset and create a hierarchy of increasingly complex features [100], while at the same time exhibiting relatively fast and consistent convergence in the training process. This model is robust in automatically learning the intrinsic patterns from the data, which both avoids time-consuming manual feature engineering and captures hidden patterns more effectively. Moreover, a CNN is more capable of discovering intricate patterns in high-dimensional data than manual feature engineering. A pooling layer is introduced in each stage to merge similar features, reducing the dimensionality and dealing with some motif variations due to small signal variability (shifts and distortions).
The prior probabilities of each class were determined by their frequency in the training dataset. A function that takes audio recordings as inputs was developed, which returned segment endpoints that correspond to individual sound events. Backpropagation, which computes the gradient of a predefined objective function with respect to all the neuron parameters, was used to train the multi-layer neural network models. The gradients are propagated backwards from the output layer to the input layer to adjust the parameters, such that the network can converge to a state that is able to encode the training patterns.

Convolutional Neural Network's approaches
Moving beyond the classical machine learning direction, an algorithmic methodology using CNNs has been proposed, intended to produce more accurate results in real-life environments, as opposed to controlled laboratory conditions with reduced levels of noise. This type of model has been established as a reliable, data-driven approach for time series and image classification [101], [102]. Also, the authors in [103] have demonstrated their accuracy and efficiency in classification tasks, while strengthening the basis for CNN architectures.
One important CNN approach [104] has a deep architecture that consists of two convolutional layers of 16 1D kernels of size 100. Two fully connected layers of size 256 and 16, each of which was subjected to 50% dropout in the training process, lead to the output of two distinct states. For all layers of the CNN, ReLU was selected as the activation function, in order to reduce the computational complexity of the algorithm. Furthermore, the network's parameters were initialized using a random generator with uniform distribution. Finally, the training of the 1D CNN utilized the Adam optimizer and was based on the categorical cross-entropy between the predictions and the target values.
Later, a fast data-driven approach [105], [106] was proposed, based on 2D CNNs. The benefits, according to the authors, can be summarized in the following points:
• The presented approach is applied directly in the time domain, to gain from reduced computational complexity.
• Convolutional deep sparse coding speeds up the computational graph, aiming to allow real-time implementation.
Specifically, the architecture utilizes three convolutional layers, each followed by a max pooling layer and a dropout function [110]. The output is fed to a set of four dense layers. Categorical cross-entropy is used to compute the loss of each model. Furthermore, the stride is set to one and zero padding is used to keep the output shape of each filter constant. By processing an audio recording, a vector of n samples is created and reshaped into a two-dimensional array. Next, an ensemble of C-DNN and Autoencoder networks is deployed for the analysis and classification of respiratory sounds [133]. In this work, the experiments were conducted using the 2017 ICBHI (International Conference on Biomedical and Health Informatics) benchmark dataset. Also, an attention-based CNN was developed for automatic identification of respiratory illness and applied on the same database [134]. The experimental results indicate that the residual networks can lead to important improvement, as compared with the baseline algorithms.

LSTM analysis and implementation
In this approach [111], deep layers were added to the network. The proposed sequential architecture consists, initially, of one layer of LSTM memory cells with $h = 64$ units. After the LSTM input layer, a dropout layer [112] with the dropout rate set to 0.3 is introduced to reduce overfitting, followed by a flatten layer and a dense output layer that returns a 4 × 1 vector. Finally, a softmax activation function is used. The model is optimized using binary cross-entropy loss [113] and the Adam optimizer [114].
The spectrogram was used as a tool to develop a classifier of inhaler sounds. It is swept across the temporal dimension with a sliding window of length $w = 15$, moving with a step size equal to 1 window. In order to form the training instances, we segment the spectrogram $S$ into time windows of size $N_f \times T$ and assign a class to each one of them according to the class of the central point of the window, as presented in figure 2. Each training example $W_k$ is thus the window of $S$ centred on time frame $k$. The training set is organized in minibatches $B$, so that $B \in \mathbb{R}^{b \times (T \times N_f)}$ (with $b = 25$ in our experiments). The input tensor can be defined as $X \in \mathbb{R}^{n \times w \times F}$, where $n = 25$ is the minibatch size, $w = 25$ is the window size and $F = 42$ is the dimension of the spectrogram feature vector.
Extensive hyper-parameter optimization took place to define the number of hidden units, the dropout rate and the minibatch size. By observing the performance of the network on the validation set, we stopped training at 70 epochs to avoid over-fitting: after around 70 epochs, the testing loss increases and the average testing accuracy stabilizes.
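The window-labelling rule (each window takes the class of its central frame) can be sketched as below. This is an illustrative sketch only: the spectrogram is represented as a list of per-frame feature vectors, the window advances by one frame at a time (an assumption about the step size), and the function name is hypothetical.

```python
def sliding_windows(frames, labels, width, step=1):
    """Build (window, label) training examples from per-frame features.

    frames: list of feature vectors, one per time frame
    labels: one class label per time frame
    The label of each window is the label of its central frame.
    """
    half = width // 2
    examples = []
    for start in range(0, len(frames) - width + 1, step):
        window = frames[start:start + width]
        examples.append((window, labels[start + half]))
    return examples


# Toy data (assumed): six frames with one feature each, two classes.
toy_frames = [[float(i)] for i in range(6)]
toy_labels = ["inhale", "inhale", "inhale", "drug", "drug", "drug"]
toy_examples = sliding_windows(toy_frames, toy_labels, width=3)
```

Windows that straddle a class boundary contain frames from two classes, which is why only the central frame decides the label.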

Simulation setup and validation settings
This work aims to highlight the RDA suite of methods and models¹, trained with the Respiratory and Drug Actuation Dataset². The methodology of the simulation studies, presented in [137], entails the comparison of the classification performance of the aforementioned important and widely used machine learning and deep learning algorithms, namely Random Forests (RFs), Support Vector Machines (SVMs), Adaboost, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
They were evaluated using spectrogram, cepstrogram and MFCC features. CNNs were applied directly on time-series values, aiming to demonstrate their capability to provide dependable solutions at lower execution times.
A main differentiation parameter in validation comes from the availability of previous recordings, from a specific individual. It is expected that prior information can increase the classification accuracy, however it puts additional burden in the usage of the monitoring system, since it requires the collection of data every time a new patient wants to test the framework.
Firstly, we consider the Multi Subject modeling approach, denoted as MultiSubj. In this case, the recordings of all subjects are used to form a large dataset, which is divided in five equal parts used to perform five-fold cross-validation, thereby allowing different samples from the same subject to be used in training and test set, respectively. This validation scheme was followed in previous work [105] and thus is performed, also, here for comparison purposes.
The second case includes the Single Subject setting, in which the performance of the classifier is validated through training and testing within each subject's recordings. We denote such models as SingleSubj. The recordings of each subject are split into five equal parts, to perform cross-validation. The accuracy is assessed for each subject separately and, then, the overall performance of the classifier is calculated by averaging the three individual results.
The third evaluation setting refers to the case when no previous recordings for the testing subject are available, thus samples from other subjects are used. This is the leave-one-subject-out (LOSO) approach, which illustrates how well the trained network can generalize to individuals that it never saw during training. LOSO models facilitate the use of the monitoring system, since they do not require a data pre-collection phase and they also have the lowest risk of over-fitting. However, if the inter-subject variability is high, the model might not generalize well, especially if the number of training subjects is small, as in our case. With this approach, we use the recordings of two subjects for training and the recordings of the third subject for testing. This procedure is completed when all subjects have been used for testing, and the accuracy is averaged to obtain the overall performance of the classifiers.
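The LOSO protocol can be expressed as a small splitting routine. The subject identifiers and sample names below are placeholders, and the function name is hypothetical; the sketch only shows how each subject is held out in turn.

```python
def leave_one_subject_out(recordings):
    """Build LOSO folds from a dict mapping subject id -> list of samples.

    Returns a list of (held_out_subject, train_samples, test_samples) tuples,
    one fold per subject.
    """
    folds = []
    subjects = sorted(recordings)
    for held_out in subjects:
        train = [s for subj in subjects if subj != held_out
                 for s in recordings[subj]]
        test = list(recordings[held_out])
        folds.append((held_out, train, test))
    return folds


# Placeholder data: three subjects, as in the setting described above.
recordings = {
    "subject_1": ["s1_a", "s1_b"],
    "subject_2": ["s2_a", "s2_b", "s2_c"],
    "subject_3": ["s3_a"],
}
folds = leave_one_subject_out(recordings)
```

The overall LOSO accuracy is then the average of the per-fold accuracies, with no sample from the held-out subject ever reaching the training set.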
Furthermore, a new parameter was brought into the comparison, namely the mixing of the testing dataset. In a non-mixed setup, each audio segment in the testing set consists of a single class, whereas in a mixed setup each testing sample emanates from a sliding window that naturally includes parts belonging to multiple classes. In the latter case, the class of an audio segment is the class of the central sample.
2. https://ieee-dataport.org/documents/respiratory-and-drug-actuation-dataset

Table 1 summarizes the classification accuracy for drug actuation, exhalation and inhalation sounds, across all validation setups. On a superficial examination, no method is clearly better, so key performance indicators need to be established. Multi-subject and single-subject settings indicate the classifier's success if the user has previously participated in the sampling process. The leave-one-subject-out setting is the closest to a real situation, since a commercial classifier would not assume that a new user has already submitted samples to the training process. Even though a feedback loop can improve accuracy [95], such a process cannot be a prerequisite. As a result, leave-one-subject-out performance is the most representative. Furthermore, between the mixed and non-mixed setups, the first is closer to the real situation, since the non-mixed setup assumes that the position of audio segments containing a certain audio event is known before the classifier is applied. In short, the key performance indicators can be summarized below:

Key performance indicators
• The three highest non-mixed LOSO accuracies should be highlighted (Table 1)
• The three highest mixed multi-subject accuracies should be highlighted (Table 1)
• The three highest mixed LOSO accuracies should be highlighted (Table 1)
• The three highest drug precision mixed and non-mixed sensitivity should be highlighted (Table 1)
Metrics to measure the performance of the compared classifiers are accuracy, sensitivity and specificity. For the sake of completeness, they are defined as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP},$$
where $TP$ is the number of positive correct identifications, $TN$ the number of negative correct identifications, $FP$ is the number of positive incorrect identifications and $FN$ is the number of negative incorrect identifications.
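As a worked illustration, the standard confusion-matrix definitions of these metrics, together with precision and the F1 score, can be computed as follows. The counts are made-up numbers for the example and do not come from Table 1; the function name is hypothetical.

```python
def binary_metrics(tp, tn, fp, fn):
    """Standard metrics from the confusion-matrix counts of one class versus the rest.

    Assumes the denominators are non-zero (no guard is included in this sketch).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2.0 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Made-up counts for a hypothetical "drug actuation vs. rest" detector.
example = binary_metrics(tp=40, tn=50, fp=5, fn=5)
```

For a multi-class problem such as drug actuation, inhalation and exhalation, the counts are typically obtained per class by treating that class as positive and all others as negative.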

Results
Close inspection of tables 1 and 2 reveals that no method clearly outperforms the others. The performance of data-driven approaches largely depends on the dataset and the pre-processing steps. However, since all the approaches were trained with the same dataset, we can quantify which can capture the individual characteristics more efficiently. For all methodologies, the multi-subject approach yields the highest score. This means that if the user's data have been included in the training process, the success probability climbs to a level near 96%.
Given the Key Performance Indicators (KPIs) defined in the previous section, in the non-mixed setup SVM-MFCC, ADA-MFCC and CNN-TIME yield the highest accuracy when the patients' data are already included in the dataset, corresponding to the Multi and Single subject cases. However, in the "Leave One Subject Out" case, which corresponds to a real-life scenario, CNN-TIME, ADA-CEPST and RF-CEPST yield the best results. In the mixed setup the results are similar, with LSTM-SPECT replacing CNN-TIME only in the multi-subject setup. Further insight is provided by table 2. Measuring the F1 score is highly important, since it is the harmonic mean of the precision and recall of a candidate detector. A close inspection reveals that drug actuation sounds have markedly different characteristics from exhalation and inhalation signals. For the Exhale/LOSO case, CNN-TIME has the best performance, which concurs with the results presented in the accuracy table. For detecting inhalations in the LOSO setup, CNN-TIME also yields the best performance. However, in the multi-subject setup, MFCC methods demonstrate a more stable performance than the CNN-TIME method. In the drug classification case, the F1 score reveals quite different results. Specifically, the mixed setup demonstrates lower F1 scores than the non-mixed setup, meaning that "Drug" detection is highly affected by the surrounding audio events. ADA-SPECT, GMM-SPECT and RF-SPECT have the highest accuracy in the multi-subject mixed setup, reaching over 73%, while for the non-mixed case the accuracy is over 96%. Likewise, in the LOSO non-mixed setup Random Forest based methods reach 84%, but in the mixed setup the corresponding performance is 32%.
Finally, we compare the computational cost of the CNN model with the other approaches, executed on the same machine (Intel(R) Core(TM) i5-5250U CPU @ 2.7GHz). The results are summarized in figure 6. This figure highlights the gain in computational speed, compared to the time-consuming spectrogram and similar feature-based algorithms. Specifically, figure 6 shows that classification by CNN is 40 times faster than the slowest cepstrogram-based methods and 15 times faster than spectrogram and MFCC-based methods.

DISCUSSION AND FUTURE CONSIDERATIONS
The pressurized metered dose inhaler (pMDI) is the most commonly used inhaler, with total worldwide sales of pMDI products reaching over $2 billion per year [115]. Studies have reported that over 50% of patients are prone to not adhering to the correct inhaler technique [116]. It has been stated that acoustics can be employed to detect and recognize dry powder inhaler sounds [117]. Almost four decades ago, the earliest investigations into poor inhaler technique in patients using pMDIs indicated that poor technique was likely to be associated with a suboptimal response to the therapy [118], [119]. Several studies have since suggested a strong connection between poor inhaler technique and patient outcomes on a larger scale [120], including higher rates of hospital access and emergency room attendance [121]. By providing suitable feedback to medical staff and guiding patients to improve their inhaler usage technique, we could facilitate efficient self-management of obstructive respiratory diseases, allowing patients to avoid dangerous exacerbation events [110]. Asthma is close to the center of the wave of digital health developments, as it requires systematic attention from both specialist doctors and patients [122]. Employing acoustic signal processing methods, the aforementioned algorithms were developed to accurately identify drug actuations from pMDIs. As science and technology evolve and modern sensing components become more available, the continual improvement of inhalers with an extended range of monitoring capabilities holds the promise to further optimize asthma self-management. These methods provide an opportunity to enhance clinical education by providing informative feedback to patients, which may contribute to improving respiratory health.
Future work will consist of identifying pMDI inhalations to monitor actuation coordination technique and to provide patient feedback, regarding drug delivery using acoustic methods. The accurate estimation of the parameters would be of significant clinical benefit to both patients and healthcare professionals, by enhancing precision medicine for chronic respiratory diseases. Ultimately, this work aims to prevent common mistakes, leading to potential upcoming dangerous events, such as exacerbation and hospitalization.

CONCLUSIONS
Asthma forms an important socioeconomic burden, both in terms of medication costs and disability adjusted life years. The accurate assessment of the state of asthma is the fundamental basis of digital health approaches and is also the most significant factor towards the preventive and efficient management of the disease. The necessity of inhaled medication offers a basic platform upon which modern technologies can be integrated, namely the inhaler device itself. The control of asthma is a complex and multiparametric issue that is greatly affected not only by physiological and environmental parameters, but also by the psychological state of patients and their cultural and socioeconomic background. Indicative of the complexity of asthma is the diversity of its prevalence around the world. All the above outline the need to increase the active involvement of patients in modern treatment methodologies and to use modern technologies to create easy-to-use tools for safe and effective self-management. A fundamental step in this direction is the creation of a sensing framework that could provide accurate information about the health of patients and help their doctors understand any possible difficulty that prevents patients from using their inhaled medication correctly. This need for the modernization of inhaler devices has stimulated research and commercial interest in their enhancement with novel sensing capabilities and has led to a number of approaches that focus mainly on the detection of inhaler actuations. The modern adherence monitoring environment has also been analyzed in other studies, addressing important related issues, such as the interpretation of results and the design of interventions that promote adherence.

A.1 Support vector machines
For a decision hyper-plane $x^T w + b = 0$ to separate the two classes $P = \{(x_i, 1)\}$ and $N = \{(x_i, -1)\}$, it has to satisfy $y_i (x_i^T w + b) \ge 0$ for both $x_i \in P$ and $x_i \in N$. Among all planes satisfying this condition, we want to find the optimal one, $H_0$, that separates the two classes with the maximal margin (the distance between the decision plane and the closest sample points). The optimal plane should be in the middle of the two classes, so that the distance from the plane to the closest point on either side is the same. We define two additional planes $H_+$ and $H_-$ that are parallel to $H_0$ and go through the points closest to the plane on either side:
$$x^T w + b = 1 \quad \text{and} \quad x^T w + b = -1.$$
All points $x_i \in P$ on the positive side should satisfy $x_i^T w + b \ge 1$, $y_i = 1$, and all points $x_i \in N$ on the negative side should satisfy $x_i^T w + b \le -1$, $y_i = -1$. These can be combined into one inequality:
$$y_i (x_i^T w + b) \ge 1, \quad (i = 1, \cdots, m).$$
The equality holds for those points on the planes $H_+$ or $H_-$. Such points are called support vectors, for which
$$x_i^T w + b = y_i,$$
i.e., the following holds for all support vectors:
$$b = y_i - x_i^T w.$$
Moreover, the distances from the origin to the three parallel planes $H_+$, $H_0$ and $H_-$ are, respectively, $|b - 1|/\|w\|$, $|b|/\|w\|$ and $|b + 1|/\|w\|$, and the distance between planes $H_-$ and $H_+$ is $2/\|w\|$. Our goal is to maximize this distance or, equivalently, to minimize the norm $\|w\|$. The problem of finding the optimal decision plane in terms of $w$ and $b$ can now be formulated as:
$$\text{minimize } \frac{1}{2} w^T w = \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (x_i^T w + b) \ge 1, \quad (i = 1, \cdots, m).$$
Since the objective function is quadratic, this constrained optimization problem is called a quadratic program (QP) problem. If the objective function is linear instead, the problem is a linear program (LP) problem. This QP problem can be solved by the method of Lagrange multipliers, minimizing the following with respect to $w$ and $b$, with Lagrange coefficients $\alpha_i \ge 0$ $(i = 1, \cdots, m)$:
$$L_p(w, b, \alpha) = \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i \left( 1 - y_i (x_i^T w + b) \right).$$
We let
$$\frac{\partial L_p}{\partial w} = 0, \quad \frac{\partial L_p}{\partial b} = 0.$$
These lead, respectively, to
$$w = \sum_{j=1}^{m} \alpha_j y_j x_j \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.$$
Substituting these two equations back into the expression of $L_p(w, b, \alpha)$, we get the dual problem (with respect to $\alpha_i$) of the above primal problem:
$$\text{maximize } L_d(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to } \alpha_i \ge 0, \quad \sum_{i=1}^{m} \alpha_i y_i = 0.$$
The dual problem is related to the primal problem by:
$$L_d(\alpha) = \inf_{(w, b)} L_p(w, b, \alpha),$$
i.e., $L_d$ is the greatest lower bound (infimum) of $L_p$ for all $w$ and $b$. Solving this dual problem (an easier problem than the primal one), we get $\alpha_i$, from which $w$ of the optimal plane can be found. Those points $x_i$ on either of the two planes $H_+$ and $H_-$ (for which the equality $y_i (w^T x_i + b) = 1$ holds) are called support vectors and they correspond to positive Lagrange multipliers $\alpha_i > 0$. The training depends only on the support vectors, while all other samples, away from the planes $H_+$ and $H_-$, are not important. For a support vector $x_i$ (on the $H_-$ or $H_+$ plane), the constraining condition is
$$y_i (x_i^T w + b) = 1, \quad (i \in sv),$$
where $sv$ is the set of all indices of support vectors $x_i$ (corresponding to $\alpha_i > 0$). Substituting
$$w = \sum_{j=1}^{m} \alpha_j y_j x_j = \sum_{j \in sv} \alpha_j y_j x_j,$$
we get
$$y_i \Big( \sum_{j \in sv} \alpha_j y_j x_i^T x_j + b \Big) = 1.$$
Note that the summation only contains terms corresponding to those support vectors $x_j$ with $\alpha_j > 0$, i.e.
$$y_i \sum_{j \in sv} \alpha_j y_j x_i^T x_j = 1 - y_i b.$$
For the optimal weight vector $w$ and optimal $b$, we have:
$$\|w\|^2 = w^T w = \sum_{i \in sv} \alpha_i y_i x_i^T \sum_{j \in sv} \alpha_j y_j x_j = \sum_{i \in sv} \alpha_i \, y_i \sum_{j \in sv} \alpha_j y_j x_i^T x_j = \sum_{i \in sv} \alpha_i (1 - y_i b) = \sum_{i \in sv} \alpha_i - b \sum_{i \in sv} \alpha_i y_i = \sum_{i \in sv} \alpha_i.$$
The last equality is due to $\sum_{i=1}^{m} \alpha_i y_i = 0$ shown above. Recalling that the distance between the two margin planes $H_+$ and $H_-$ is $2/\|w\|$, the distance between $H_+$ (or $H_-$) and the optimal decision plane $H_0$, i.e. the margin, is
$$\frac{1}{\|w\|} = \Big( \sum_{i \in sv} \alpha_i \Big)^{-1/2}.$$
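A small numerical check can make the margin geometry concrete. The four 2-D points, the plane $w = (1, 0)$, $b = -1$ and the multipliers below are a hand-constructed toy case (an assumption for illustration, not data from the paper) in which the constraints $y_i (x_i^T w + b) \ge 1$, the support vectors and the identity $\|w\|^2 = \sum_{i \in sv} \alpha_i$ can be verified directly.

```python
import math


def functional_margins(points, labels, w, b):
    """y_i * (x_i . w + b) for every training point."""
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for x, y in zip(points, labels)]


def margin_width(w):
    """Distance 2 / ||w|| between the planes H+ and H-."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hand-constructed toy problem: the classes are separated by the plane x1 = 1.
points = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 0.0), -1.0

margins = functional_margins(points, labels, w, b)
support = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-12]

# Multipliers chosen so that w = sum_i alpha_i y_i x_i and sum_i alpha_i y_i = 0.
alphas = {0: 0.5, 2: 0.5}
w_from_dual = tuple(sum(a * labels[i] * points[i][k] for i, a in alphas.items())
                    for k in range(2))
```

Here the support vectors are the two points lying on $H_+$ and $H_-$, and the sum of their multipliers equals $\|w\|^2 = 1$, giving a margin of $1/\|w\| = 1$ on each side.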

A.2 Adaboost
Let us assume that $N$ samples $x_i \in X$, $i = 1, \ldots, N$, with corresponding labels $y_i \in Y = \{-1, +1\}$ are given and used as a training set $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$. Initially, the weights are set to $D_1(x_i) = 1/N$ and the algorithm iterates for $t = 1, 2, \ldots, T$, where $T$ is a parameter representing the maximum number of training rounds. In round $t$, firstly, the weight distribution over the samples is recorded as $D_t$ and, as per this distribution, the weak learner finds its weak hypothesis $h_t : X \to \{+1, -1\}$. Secondly, the error rate of $h_t$ is computed: $\varepsilon_t = \sum_{i=1}^{N} D_t(x_i) \, [h_t(x_i) \ne y_i]$, where $[\cdot]$ equals 1 when its argument is true and 0 otherwise. Thirdly, we compute the weight of the weak classifier based on the error rate: $\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$. Fourthly, we update the sample weights: $D_{t+1}(x_i) = \frac{D_t(x_i)}{Z_t} \exp\left(-\alpha_t y_i h_t(x_i)\right)$, with $Z_t = \sum_{i=1}^{N} D_t(x_i) \exp\left(-\alpha_t y_i h_t(x_i)\right)$, where $Z_t$ is a normalization factor that makes $D_{t+1}$ a probability distribution. $T$ weak classifiers are obtained after $T$ rounds and the strong classifier $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$ is obtained by combining them with their weights $\alpha_t$ [54].
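The four steps of each round can be traced in a compact sketch. It uses one-dimensional threshold stumps as weak learners, which is an assumption made to keep the example short (the text above mentions decision trees), and the ten-point toy dataset is illustrative only.

```python
import math


def train_adaboost(xs, ys, rounds):
    """Train Adaboost with 1-D threshold stumps; returns a list of (alpha, threshold, sign)."""
    n = len(xs)
    weights = [1.0 / n] * n                       # D_1(x_i) = 1/N
    values = sorted(set(xs))
    candidates = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    ensemble = []
    for _ in range(rounds):
        best = None
        for thr in candidates:                    # step 1: weak learner under D_t
            for sign in (1, -1):
                preds = [sign if x > thr else -sign for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign, preds)
        err, thr, sign, preds = best              # step 2: weighted error rate
        err = min(max(err, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err) # step 3: classifier weight
        ensemble.append((alpha, thr, sign))
        weights = [w * math.exp(-alpha * y * p)   # step 4: re-weight and normalize
                   for w, y, p in zip(weights, ys, preds)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def predict(ensemble, x):
    """Strong classifier: sign of the alpha-weighted vote of the stumps."""
    score = sum(alpha * (sign if x > thr else -sign) for alpha, thr, sign in ensemble)
    return 1 if score >= 0 else -1


# Toy data: no single threshold separates the two classes.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = train_adaboost(xs, ys, rounds=3)
```

On this toy set each stump alone misclassifies at least three points, whereas the weighted vote of three stumps classifies all ten correctly.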

A.3 GMM
In the GMM classifier, the Gaussian mixture density of each $d$-dimensional feature vector $v$ is modeled as a linear combination of multivariate Gaussian PDFs with the general form:
$$p(v \mid \theta_i) = \frac{1}{(2\pi)^{d/2} |C_i|^{1/2}} \exp\left( -\frac{1}{2} (v - \mu_i)^T C_i^{-1} (v - \mu_i) \right),$$
where $\theta_i = (\mu_i, C_i)$ are the parameters of component $i$, including the mean feature vector $\mu_i$ and the $d \times d$ covariance matrix $C_i$, and $|C_i|$ is the determinant of $C_i$. The complete set of parameters for a mixture model with $K$ components is $\Theta = \{\alpha_1, \ldots, \alpha_K, \theta_1, \ldots, \theta_K\}$. Each GMM model $\lambda_n$ for class $n$ is parameterized as follows:
$$\lambda_n = \{\alpha_k^n, \mu_k^n, C_k^n\},$$
where $k = 1, \ldots, K$. It is important to note that each multivariate Gaussian PDF is completely defined if we know $\theta$.
At this point, we analyze the expectation-maximization (EM) algorithm employed to compute the GMM parameters. The membership weight of data point $v_i$ in component $k$, given the parameters $\Theta$, is defined as:
$$w_{ik} = \frac{\alpha_k \, p(v_i \mid \theta_k)}{\sum_{m=1}^{K} \alpha_m \, p(v_i \mid \theta_m)},$$
for all components $k$, $1 \le k \le K$, and all data samples $i$, $1 \le i \le N$. In each iteration of the EM algorithm for Gaussian mixtures, we deploy an E-step and an M-step. At the E-step, we compute $w_{ik}$ for all feature vectors $v_i$ and all mixture components $k$. At the M-step, we calculate the new parameters. Given $N_k = \sum_{i=1}^{N} w_{ik}$, the sum of membership weights for the $k$-th component, we get the mixture weights:
$$\alpha_k^{new} = \frac{N_k}{N},$$
the updated mean:
$$\mu_k^{new} = \frac{1}{N_k} \sum_{i=1}^{N} w_{ik} \, v_i,$$
and the updated covariance:
$$C_k^{new} = \frac{1}{N_k} \sum_{i=1}^{N} w_{ik} \, (v_i - \mu_k^{new}) (v_i - \mu_k^{new})^T.$$
The termination criterion for the EM is the following:
$$\left| \log l(\Theta^{new}) - \log l(\Theta) \right| < \epsilon,$$
where the log-likelihood is defined as $\log l(\Theta) = \sum_{i=1}^{N} \log p(v_i \mid \Theta)$ and $\epsilon$ is a small user-defined scalar value. In order to find the best fit for the data, we compute the GMM for 1 to $d = 40$ components, iterating over full and diagonal covariance matrices, where $d$ is the size of each feature vector $v$. With the generation of each model, we estimate the Bayesian Information Criterion (BIC). The model with the lowest BIC best fits the input data.
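The E-step and M-step can be illustrated on scalar data, where the covariance matrix reduces to a variance. This is a simplified sketch under that assumption: the function name, the variance floor and the toy data are illustrative choices, and model selection by BIC is not included.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, tol=1e-8, max_iter=200):
    """Fit a 1-D Gaussian mixture by EM, starting from the given initial parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    n, k = len(data), len(means)
    prev_ll = None
    ll = 0.0
    for _ in range(max_iter):
        # E-step: membership weights w_ik and the current log-likelihood
        resp, ll = [], 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        # Termination: change in log-likelihood below the tolerance
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # M-step: update mixture weights, means and variances
        for j in range(k):
            nk = sum(r[j] for r in resp)
            weights[j] = nk / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nk
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nk
            variances[j] = max(var, 1e-6)         # floor avoids a collapsing component
    return weights, means, variances, ll


# Toy data (assumed): two well-separated clusters around -2 and 3.
data = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.1, 2.9, 3.2, 2.8]
fit_weights, fit_means, fit_vars, fit_ll = em_gmm_1d(
    data, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

The log-likelihood never decreases from one iteration to the next, which is why its change is a natural stopping test.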

B.1 Continuous Wavelet Transform
In theory, it is assumed that a continuous signal $x(t)$ can be represented perfectly by the coefficients $a_n$ and the functions $\psi_n$, but in practice this is often not the case. Consider a real signal $x(t)$ and its approximation $\tilde{x}(t)$, which is represented exactly by the first $M$ terms $a_n$ and $\psi_n$. The approximation error is defined as the difference between $x(t)$ and $\tilde{x}(t)$. To obtain a measure of the overall error, the sum of the remaining squared inner products (squared 2-norm) is calculated and denoted by $\epsilon$:

$$\epsilon[M] = \left\| x(t) - \tilde{x}(t) \right\|^{2} = \sum_{n=M}^{+\infty} \left| \langle x(t), \psi_n(t) \rangle \right|^{2} \tag{24}$$

If the number of terms $M$ is increased, the error becomes smaller, i.e. as $M$ goes to $+\infty$ the approximation error goes to zero. The rate at which $\epsilon[M]$ goes to zero with increasing $M$ is called the decay rate and indicates how well a certain frame can approximate a signal. To be "admissible" as a wavelet, the function $\psi$ must have zero mean and be localized in both time and frequency space [123].
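The behaviour of the approximation error can be checked numerically with a short sketch. It assumes a finite, discrete setting in which an orthonormal DCT-II basis stands in for the functions ψ n; this basis and all function names are illustrative choices, not taken from the paper.

```python
import math

def dct_basis(N):
    """Orthonormal DCT-II basis vectors psi_n, n = 0..N-1 (an illustrative choice)."""
    basis = []
    for n in range(N):
        c = math.sqrt((1.0 if n == 0 else 2.0) / N)
        basis.append([c * math.cos(math.pi * (t + 0.5) * n / N) for t in range(N)])
    return basis

def inner(x, y):
    """Discrete inner product <x, y> of two real sequences."""
    return sum(a * b for a, b in zip(x, y))

def approximation_error(x, basis, M):
    """eps[M]: sum of squared inner products over the discarded terms n >= M."""
    return sum(inner(x, psi) ** 2 for psi in basis[M:])

def residual_energy(x, basis, M):
    """||x - x_tilde||^2, where x_tilde keeps only the first M terms."""
    x_tilde = [0.0] * len(x)
    for psi in basis[:M]:
        a_n = inner(x, psi)  # expansion coefficient a_n = <x, psi_n>
        x_tilde = [xt + a_n * p for xt, p in zip(x_tilde, psi)]
    return sum((a - b) ** 2 for a, b in zip(x, x_tilde))
```

For an orthonormal basis the two functions agree up to rounding, and `approximation_error` is non-increasing in M and reaches zero once all N terms are kept. How quickly it falls for a given signal is the decay rate discussed above.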
The Continuous Wavelet Transform (CWT) was introduced almost three decades ago to overcome the limited time-frequency localization of the FFT [124] for nonstationary signals, and it has been found suitable for multiple applications [124], [126]. Its time-frequency resolution characteristics resemble those of the human ear [127], [128]. While the Fourier Transform decomposes a signal into infinite-length sines and cosines, effectively losing all time-localization information, the CWT's basis functions are scaled and shifted versions of the time-localized mother wavelet. The CWT of $x(t)$ at any scale $s$ and position $u$ is the projection of $x$ on the corresponding wavelet atom $\psi_{u,s}$, as described in the following formula:

$$W_x(u, s) = \langle x, \psi_{u,s} \rangle = \int_{-\infty}^{+\infty} x(t) \, \frac{1}{\sqrt{s}} \, \psi^{*}\left( \frac{t - u}{s} \right) dt$$

It represents one-dimensional signals by highly redundant time-scale images in $(u, s)$, which makes the CWT an excellent tool for mapping the changing properties of non-stationary signals. The CWT consists of N spectral values for each scale used, and each scale requires an IFFT. The computational load of the CWT and its memory requirements are thus considerable. The benefit of this high degree of redundancy is an accurate time-frequency spectrum.

Unlike a Fourier decomposition, which always uses complex exponential (sine and cosine) basis functions, a wavelet decomposition uses a time-localized oscillatory function as the analyzing or mother wavelet. The mother wavelet is a function that is continuous in both time and frequency and serves as the source function from which scaled and translated basis functions are constructed. The mother wavelet can be complex or real, and it generally includes an adjustable parameter which controls the properties of the localized oscillation.
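The projection on scaled and shifted wavelet atoms can be sketched directly in the time domain. The example below is an illustration under stated assumptions, not the RDA Suite code: it uses a complex Morlet mother wavelet with adjustable parameter `omega0` (set to 6 here), a unit sampling step, and a direct summation instead of the FFT/IFFT-based evaluation mentioned above, which would be preferable for long signals. The function names are hypothetical.

```python
import cmath
import math

def morlet(t, omega0=6.0):
    """Complex Morlet mother wavelet (approximately zero-mean for omega0 >= 5)."""
    return math.pi ** -0.25 * cmath.exp(1j * omega0 * t) * math.exp(-0.5 * t * t)

def cwt(x, scales, omega0=6.0):
    """Return W[j][u]: projection of x on the atom at scale scales[j], position u."""
    N = len(x)
    W = []
    for s in scales:
        # Truncate the Gaussian envelope at 5 s; beyond that the atom is negligible.
        half = min(N - 1, int(math.ceil(5 * s)))
        kernel = {
            lag: morlet(lag / s, omega0).conjugate() / math.sqrt(s)
            for lag in range(-half, half + 1)
        }
        row = []
        for u in range(N):
            acc = 0j
            for lag, k in kernel.items():
                t = u + lag
                if 0 <= t < N:  # samples outside the signal are treated as zero
                    acc += x[t] * k
            row.append(acc)
        W.append(row)
    return W
```

Each row of the result is one line of the time-scale image in (u, s). For a pure tone, the magnitude |W| is largest at the scale whose atom oscillates at roughly the tone's frequency, which is what makes the representation useful for locating events in time and frequency.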