Marathi Speech Intelligibility Enhancement Using I-AMS Based Neuro-Fuzzy Classifier Approach for Hearing Aid Users

Globally, 1.6 billion individuals suffered from hearing disability in 2019. According to the World Health Organization, the number of people with hearing impairments will rise to 2.5 billion by 2050. Speech perception in noisy surroundings is a challenge for hearing aid users. This study aimed to design a novel methodology to improve the speech recognition ability of hearing aid users in various backgrounds. To improve speech enhancement, we propose a discrete cosine transform (DCT)-based improved amplitude-magnitude spectrogram (I-AMS) algorithm with a fuzzy classifier. First, the I-AMS approach decomposes noisy speech signals into time-frequency units and eliminates the noise present in the signal. Next, the time-frequency units (t-f units), modulation frequency (fm), and centre frequency (fc) are extracted from the denoised signal. A neuro-fuzzy classifier was used to classify the background speech environment into three different classes. The proposed I-AMS algorithm was tested and achieved improvements in sensitivity (+1.02%) and accuracy (+11.80%). Speech denoising yielded a 1.27% improvement in speech recognition performance.


I. INTRODUCTION
Speech is a form of human communication, and over the past 50 years, speech recognition has become a fascinating research field. Speech is at the core of human activity because it helps people collaborate in a common and viable manner [1]. Approximately 13% of the population in developed countries suffers from hearing deficiencies. These deficiencies impair communication abilities and prevent normal living [2]. In addition, approximately 25% of users avoid hearing aids owing to annoying and unpleasant squealing (acoustic feedback).
Most hearing aids (HAs) are designed for single-background environments; in a noisy background, the signal and noise are amplified alike [3]. Hearing aids are electroacoustic devices that improve the audibility of individuals with hearing impairment. Their main objective is to increase speech intelligibility through amplification [4]. However, this procedure typically increases the sound power level in every frequency band, including bands where the user's hearing thresholds are elevated and amplification brings no noticeable benefit [5]. To avoid this, a frequency-lowering technique was used to transfer the frequency band from the dead (impaired) region to the audible band [6], [7]. Playing speech back at a rate lower than the sampling rate is a unique frequency-lowering method, but it degrades speech quality [8]. Several approaches have been proposed to overcome the challenges of speech recognition. The speech recognition process was developed using the hidden Markov model [9] and the stereo vision neural network model [10]. Well-known speech denoising methods include Wiener filtering [11], spectral subtraction algorithms [12], and subspace filtering [13], which have attracted substantial attention and exploration owing to their simple designs and implementations. The orthogonal-polynomial-based speech enhancement algorithm emphasizes the development of a minimum low-distortion estimator for speech and noise signals; the observed signal is transformed into the transform domain using an orthogonal polynomial [14]. During speech processing, these linear approaches minimize noise while simultaneously enhancing the signal-to-noise ratio (SNR). The support vector machine (SVM) approach [15] has been shown to improve the generalization capability of the classifier. Speech recognizers are generally calibrated to avoid mismatches during the recognition period, such as minimal-distinction malfunctions [16].
This study proposes a novel combinational feature extraction and classification approach to increase the speech intelligibility of hearing aid users in the Marathi language. Numerous commercial hearing aids cannot adapt to acoustic environmental changes or background conditions. The focus of this study was to design a speech background classifier that helps improve the auditory performance of HA users under different background conditions. First, we decomposed the input speech into discrete t-f units and reduced the signal noise. Next, we extracted useful features, such as the time-frequency unit (t-f), centre frequency (fc), and modulation frequency (fm), from the Marathi speech using the improved amplitude magnitude spectrogram (I-AMS) technique. The input speech feature values were categorized into corresponding classes based on a ratio [17] determined by a neuro-fuzzy classifier from the approximate and original spectral values. A window function was applied, followed by weighting and addition of the corresponding mask value to obtain the enhanced signal. Quality improvement of speech includes the recognition of syllables, monosyllables, vowels, consonants, words, short sentences, and phonemes by hearing aid users under different speech background conditions. Marathi speech samples were collected from participants of different sexes and speech backgrounds. A novel contribution of the proposed approach lies in feature selection for speech enhancement in HAs. The features are selected using a discrete cosine transform (DCT)-based improved amplitude magnitude spectrogram (I-AMS) algorithm, which reduces speech processing time. The neuro-fuzzy classifier categorizes the denoised speech signal into four classes: target, target-dominated, masker-dominated, and masker.

This paper is organized into six sections. Section 2 reviews the state-of-the-art literature. The proposed approach for improving speech intelligibility is described in detail in Section 3. Section 4 describes the database collection, experimentation process, and audiogram analysis. Section 5 focuses on the signal enhancement, classifier performance, and recognition results. Finally, Section 6 presents conclusions and future work.

II. LITERATURE REVIEW
Numerous researchers have assessed hearing loss at specific frequencies.
Ching et al. [18] clarified the speech perception of people with hearing disabilities and calculated their speech intelligibility index (SII). Customization of the SII is considered to boost accuracy [19]. The index scale was found to be inadequate in a recursive recognition skill test, and alternative improvements were proposed [18]. Satisfactory outcomes were obtained using this system, in which a distortion measure was merged with the user's audible frequency capability. This approach has been assessed for syllables and sentences. Moreover, in [20], noise reduction techniques for improving speech quality for hearing-impaired (HI) individuals were examined. Speech quality was improved using the shrinkage sparse coding (SSC) technique [21]. In this method, the evaluation was extended to include speech quality assessments using interrelated comparative ranking (ICR) [22].
In [23], improved high-frequency speech intelligibility in noise was proposed, and sound sources were located on a horizontal plane with high accuracy. First, speech frames are decomposed into three groups of speech models: amplitude, frequency, and phase [24]. The input speech frequencies above the reference cut-off frequency (fc) were reallocated towards a lower frequency range to improve high-frequency speech recognition ability [25]. The frequency compression ratio (CR) was set for various frequency ranges. To prevent spectral distortion of speech, the input spectrum was categorized into six octaves [26]. Furthermore, Matthias et al. [27] developed the F0-modulation (F0mod) processing technique for the cochlear implant (CI). This approach enhances the spectral pitch cue by performing intensity modulation of multichannel electrical stimulation [28], using the fundamental frequency (F0) of the input speech signal. This approach has been verified for recognition at the word and sentence levels in various noise-level situations. Table 1 summarizes existing speech processing methods [1], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39] and their processing strategies, along with their highlights and limitations for vowel, consonant, word, and sentence recognition.
Cochlear implant (CI) speech perception systems, in which acoustic listening was proposed to replicate the effects of speech-in-noise intelligibility, have been proposed in [40] and [41]. A model has been used to simulate the neurons of the auditory system in response to electrical and/or acoustic stimuli [42]. Spatial positioning and spatio-temporal structured spiking variations were employed as internal representations of the noisy voices. For signals containing stationary noise, speech reception thresholds were predicted for a sentence [43] and checked with an automatic speech recognition method in [44]. Furthermore, in [45], interaural differences in terms of intensity, time, and phase were introduced. The sound arrival time (S_at) and sound arrival level (S_al) allow sounds to be distinguished in the horizontal plane, and these parameters are helpful for source separation and speech perception in environments with background noise [46], [47]. Sound localization behaviour was examined using 14 bimodal users; all users wore the same CI and an advanced Phonak hearing aid. The primary objective was to find binaural and monaural cues for horizontal sound-source localization [48].
In [49], the variability in speech intelligibility among cochlear implant (CI) users was investigated. Speech comprehension in various environments with background noise was investigated in [46] and [50]. These studies utilized a neural-network-based speech amplification algorithm [51] to increase speech perception in the presence of background noise. The noise was separated from the speech signal, which was converted into time-frequency units [52], [53]. A neural network was used for channel frequency estimation in [54].
There is still a need for improvement in terms of the gain-frequency correlation, noise cancellation, acoustic feedback, and signal processing delay. Many algorithms designed and implemented for single-background environments are not useful for other speech backgrounds. Input speech denoising, insertion gain, and the feature extraction required for classification are the key parameters of the proposed method.

III. THE I-AMS BASED SPEECH SIGNAL ENHANCEMENT TECHNIQUE
The proposed approach comprises four phases: pre-processing, feature extraction, training and testing of the classifier, and speech enhancement. Figure 1 shows the overall diagram of the proposed neuro-fuzzy classifier for speech enhancement. First, noise is removed from the input speech. Subsequently, the improved amplitude-magnitude spectrogram (I-AMS) technique is used to extract features, which are then used to train a neuro-fuzzy classifier. During the training phase, the noise-masked signal t-f units [55] are classified into four classes: target class (class_1), target-dominated class (class_2), masker class (class_3), and masker-dominated class (class_4).
During the enhancement phase, the individual t-f units of the noise-masked signal are multiplied by the equivalent class weights to obtain the enhanced speech waveform.

A. PRE-PROCESSING
Pre-processing in I-AMS is an important stage in speech enhancement and consists of four steps: pre-emphasis, Automatic Gain Control (AGC), FFT filter bank, and envelope extraction.

1) PRE-EMPHASIS
Pre-emphasis is the first stage of pre-processing. The input signal may contain different frequency components ranging between high and low frequencies. To compensate for the attenuation of the high-frequency components [2], we use a pre-emphasis filter whose pre-emphasis factor α is selected in the range of 0.9 to 1. In the pre-emphasis phase, the high-frequency speech components are amplified to a higher magnitude than the noise components, which helps improve the signal-to-noise ratio (SNR).
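As an illustration, here is a minimal sketch of this step in Python, assuming the standard first-order pre-emphasis filter y[n] = x[n] − αx[n−1]; the value α = 0.95 matches the factor used later in Section III-B:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.95) -> np.ndarray:
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].

    Boosts high-frequency content relative to low frequencies, which
    raises the SNR seen by the later filter-bank stages.
    """
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```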

2) AUTOMATIC GAIN CONTROL
After the pre-emphasis stage, the filtered output signal is passed through automatic gain control (AGC) [56]. Amplification brings soft, moderate, and loud sounds into the audible range. The AGC adjusts the gain during processing according to the background environment of the speaker and the HA user. It is improved using a dual-loop AGC that contains a low-gain AGC for ordinary level deviations and a high-gain AGC for severe deviations. In sentence assessments, the dual-loop AGC offered better speech understanding than the fast AGC approach.
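The paper does not specify the AGC equations; the sketch below shows one common dual-loop arrangement, with a slow loop tracking ordinary level deviations and a fast loop catching severe peaks. All time constants and the target level are illustrative assumptions:

```python
import numpy as np

def dual_loop_agc(x, fs=16000, target_rms=0.1, slow_tau=1.0, fast_tau=0.01):
    """Illustrative dual-loop AGC: the slow envelope follows gradual level
    changes; the fast envelope reacts to severe deviations. Time constants
    are placeholders, not values from the paper."""
    slow_a = np.exp(-1.0 / (slow_tau * fs))
    fast_a = np.exp(-1.0 / (fast_tau * fs))
    env_slow = env_fast = target_rms
    y = np.empty(len(x))
    for n, s in enumerate(x):
        level = abs(s)
        env_slow = slow_a * env_slow + (1 - slow_a) * level
        env_fast = fast_a * env_fast + (1 - fast_a) * level
        env = max(env_slow, env_fast)  # fast loop dominates on severe peaks
        y[n] = s * (target_rms / max(env, 1e-8))
    return y
```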

3) FFT FILTER BANK
The compressed signal is transformed using the FFT, which computes the real and imaginary parts of the signal by decomposing the N-point time-domain signal into the frequency domain. It then calculates the corresponding N-point frequency spectra and combines them into a single frequency spectrum.
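A hedged sketch of this stage: each N-point frame is transformed into its frequency spectrum, and the magnitude spectrum is grouped into channels. The linearly spaced band edges below are illustrative, since the paper does not list its channel cut-off frequencies:

```python
import numpy as np

def fft_filter_bank(frame, n_channels=25):
    """Transform one N-point time-domain frame to the frequency domain
    and group the magnitude spectrum into `n_channels` bands.
    Linearly spaced band edges are an assumption."""
    mag = np.abs(np.fft.rfft(frame))                 # N-point real FFT
    edges = np.linspace(0, len(mag), n_channels + 1, dtype=int)
    return np.array([mag[edges[i]:edges[i + 1]].sum()
                     for i in range(n_channels)])
```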

B. FEATURE EXTRACTION USING IMPROVED AMPLITUDE MAGNITUDE SPECTROGRAM (I-AMS)
Feature extraction is the most significant step in the I-AMS technique, as illustrated in Fig. 2. The signals are sampled, bandpass filtered, rectified, and segmented. In this method, we used the discrete cosine transform rather than the conventional Fourier transform [57]. We also added delta functions to improve the feature vector values, as given in Equation (1).
where F_T is the transformed frequency and F_S is the speech frequency. The dataset (D) of Marathi speech samples was partitioned into training (D_TR) and testing (D_TS) datasets. The input signal I(t) contains both the clean signal C(t) and the noise signal N(t), as indicated in Equation (2): I(t) = C(t) + N(t).
During the sampling process, the continuous-time signal is transformed into the discrete-time domain at a sampling rate of 16 kHz [58], [59]. The input signal I(t) in Equation (2) is sampled as I(n), as shown in Equation (3): I(n) = C(n) + N(n).
The time duration of each frame of 320 samples was 20 ms, with an overlap of 50%. Rounding and truncation are widely used in quantization processes; we implemented a 6-bit quantization process in which quantization was improved using the µ-law. The quantized signal is processed through a pre-emphasis phase to enhance the power level, in which the higher-frequency contents of the signal are emphasized relative to the lower-frequency contents. SNR improvement is achieved by limiting the undesirable effects of saturation and attenuation losses [60]. The SNR of the n-th frequency band was calculated using Equation (4),
where α = 0.95 and R_n is the n-th frequency band. The pre-emphasized signal was passed through a band-pass filter bank with 25 channels, which separates the processed signal into different time-frequency (t-f) units. The signal is converted into 25 t-f units, where each t-f unit corresponds to a channel C_i with 1 ≤ i ≤ 25. Each channel has corresponding upper and lower cut-off frequencies. After full-wave rectification, the envelope of each band was decimated by a factor of 3.
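A sketch of this envelope-extraction chain under stated assumptions: Butterworth band-pass filters and linearly spaced channel edges are illustrative, since the paper does not give its filter design or cut-off frequencies:

```python
import numpy as np
from scipy.signal import butter, sosfilt, decimate

def channel_envelopes(signal, fs=16000, n_channels=25, dec_factor=3):
    """Split the pre-emphasized signal into 25 band-passed channels C_i,
    full-wave rectify each band, and decimate the envelope by 3."""
    edges = np.linspace(100, fs / 2 - 100, n_channels + 1)  # assumed edges
    envelopes = []
    for i in range(n_channels):
        sos = butter(4, [edges[i], edges[i + 1]], btype='band',
                     fs=fs, output='sos')
        band = sosfilt(sos, signal)          # channel C_i
        rectified = np.abs(band)             # full-wave rectification
        envelopes.append(decimate(rectified, dec_factor))
    return np.array(envelopes)
```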
The decimated envelope was divided into 128 overlapping segments of 32 ms each, with 64 overlapping samples per frame. Each segmented signal is denoted by S_ij, where 1 ≤ i ≤ 25 and 1 ≤ j ≤ N_i; here, N_i is the number of segments in the i-th channel. The sampled signals are windowed using a Hanning window with a 25-ms window size, which eliminates spectral artifacts [61]. The window function is defined by Equation (5): w(n) = 0.5(1 − cos(2πn/(N − 1))), where N is the sample width and n is an integer from 0 to (N − 1). Zero padding and the DCT have also been used [62].
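A minimal sketch of the segmentation and windowing, assuming the standard Hanning formula in Equation (5); the mapping of "32 ms with 64 overlapping samples" to sample counts at the decimated rate is an assumption:

```python
import numpy as np

def segment_and_window(envelope, seg_len=170, overlap=64):
    """Split one channel envelope into overlapping segments and apply
    a Hanning window w(n) = 0.5 * (1 - cos(2*pi*n / (N - 1))).
    seg_len = 170 approximates 32 ms at the decimated rate (assumed)."""
    hop = seg_len - overlap
    n_seg = 1 + max(0, (len(envelope) - seg_len) // hop)
    n = np.arange(seg_len)
    window = 0.5 * (1 - np.cos(2 * np.pi * n / (seg_len - 1)))
    return np.stack([envelope[i * hop:i * hop + seg_len] * window
                     for i in range(n_seg)])
```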
A discrete cosine transform (DCT) expresses a finite-length sequence in terms of cosine functions at various frequencies [6], evaluated at different DCT data points. The DCT was applied to the input signal using Equation (6), where t = 0, . . . , N − 1. The DCT computes the modulation spectrum for each of the 25 channels, and each channel is weighted by 15 triangular windows [63] within a range of 15.6-400 Hz (Equation (7)).
These spectrum amplitudes are summed, and each sum describes the feature vector F_s(ρ, t), where t is the time slot and ρ corresponds to the sub-band.
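A sketch of the per-channel modulation features under stated assumptions: the common DCT-II stands in for Equation (6), and the placement of the 15 triangular windows over 15.6-400 Hz is assumed to be linear, since the paper does not specify it:

```python
import numpy as np
from scipy.fft import dct

def modulation_features(windowed_segment, n_bands=15):
    """DCT of one windowed envelope segment (Equation (6)), followed by
    triangular-window pooling into 15 values, one F_s(rho, t) entry per
    sub-band (Equation (7))."""
    spectrum = np.abs(dct(windowed_segment, type=2, norm='ortho'))
    edges = np.linspace(0, len(spectrum), n_bands + 1, dtype=int)
    feats = np.empty(n_bands)
    for i in range(n_bands):
        band = spectrum[edges[i]:edges[i + 1]]
        feats[i] = float(np.sum(band * np.bartlett(len(band))))
    return feats
```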
We included delta functions in the extracted features to account for shifts in the time and frequency domains [64], [65]; the delta function is expressed in Equation (8).
The delta function [66] in terms of frequency is defined by Eq. (9).
For t = 1, Equation (9) takes a boundary form, and for ρ = 2 it takes the corresponding sub-band form. The overall feature vector [67] is expressed using the delta functions. We selected 25 sub-bands (B). Because a_S(b, τ), a_T(b, τ), and a_B(b, τ) each have a dimension of 15, the total feature vector A_S(b, τ) has a dimension of 45.
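A minimal sketch of assembling the 45-dimensional vector, assuming simple first differences for the time- and band-direction deltas (the exact forms in Equations (8)-(9) are not reproduced in this text):

```python
import numpy as np

def add_deltas(feats):
    """Append deltas along time (axis 0) and along the sub-band axis
    (axis 1) to a (time, 15) feature matrix, giving 15 + 15 + 15 = 45
    values per (b, tau), as described in the text."""
    d_time = np.diff(feats, axis=0, prepend=feats[:1])      # time shift
    d_band = np.diff(feats, axis=1, prepend=feats[:, :1])   # frequency shift
    return np.concatenate([feats, d_time, d_band], axis=-1)
```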

C. NEURO-FUZZY CLASSIFIER TRAINING
Each input t-f unit is classified into its corresponding class [68]. In the proposed method, the signal is classified into four classes: masker, masker-dominated, target-dominated, and target, represented by the quality-ratio classes Q1, Q2, Q3, and Q4, respectively. Consider the noisy speech spectrum N(b, τ) at time slot τ and sub-band b. The signal spectrum Ȳ(b, τ) is estimated by multiplying the gain function [72], [73] with the noisy speech spectrum N(b, τ), as in Equation (11): Ȳ(b, τ) = G(b, τ) N(b, τ). The gain G(b, τ) is calculated using Equation (12).
The prior signal-to-noise ratio [71] is SNR p and is computed using Equation (13).
where the smoothing constant is α = 0.98 and λ_D is the estimated background noise variance. The estimated speech magnitude was compared with the actual speech magnitude [72], and Equation (14) was used for the corresponding t-f unit.
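A hedged sketch of Equations (11)-(13): the decision-directed prior-SNR update with α = 0.98 follows the text, while the Wiener-type gain G = SNR_p / (1 + SNR_p) is one common choice, assumed here because the paper's exact Equation (12) is not reproduced:

```python
import numpy as np

def enhance_spectrum(noisy_mag, noise_var, alpha=0.98):
    """Per-frame gain estimation. noisy_mag: (frames, bands) magnitudes
    N(b, tau); noise_var: (bands,) background noise variance lambda_D."""
    enhanced = np.empty_like(noisy_mag)
    prev_clean_power = noise_var.copy()          # initialization assumption
    for t in range(noisy_mag.shape[0]):
        snr_post = noisy_mag[t] ** 2 / noise_var
        snr_p = (alpha * prev_clean_power / noise_var
                 + (1 - alpha) * np.maximum(snr_post - 1, 0))
        gain = snr_p / (1 + snr_p)               # assumed Wiener-type G(b, tau)
        enhanced[t] = gain * noisy_mag[t]        # Eq. (11): Y = G * N
        prev_clean_power = enhanced[t] ** 2
    return enhanced
```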
During the training stage, the four classes shown in Equation (15) were used, where Q_1 is the masker class, Q_2 is the masker-dominated class, Q_3 is the target-dominated class, and Q_4 is the target class. The DCT provides significantly higher energy compaction than the DFT. We collected 8100 Marathi speech samples from 18 speakers (14 female, 4 male) under different speech background conditions. For neuro-fuzzy classifier training and testing, we used 70-30%, 60-40%, and 80-20% splits of the collected samples.

D. ENHANCEMENT MODULE
After classifier training, the pre-processed noisy input signal is convolved with the calculated optimal binary mask value.
The proposed waveform synthesis technique is illustrated in Fig. 3. The predicted class [73] produces the gain G(b, τ) of the mask, as represented in Equation (16).
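A sketch of the synthesis step under stated assumptions: each t-f unit is scaled by the weight of its predicted class and the 25 channels are overlap-added back into one waveform; the per-class weight values and the 50% overlap are assumptions carried over from the framing described earlier:

```python
import numpy as np

def synthesize_enhanced(tf_units, unit_classes, class_weights):
    """tf_units: (channels, frames, frame_len) band-passed frames;
    unit_classes: (channels, frames) predicted class index per t-f unit;
    class_weights: (n_classes,) gain per class (illustrative values).
    Scales each unit by its class weight, overlap-adds the frames, and
    sums the channels into the enhanced waveform."""
    ch, fr, flen = tf_units.shape
    hop = flen // 2                              # 50% overlap (assumed)
    out = np.zeros(hop * (fr - 1) + flen)
    for c in range(ch):
        for f in range(fr):
            w = class_weights[unit_classes[c, f]]
            out[f * hop:f * hop + flen] += w * tf_units[c, f]
    return out
```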

IV. EXPERIMENTATION AND TESTING
This section presents a detailed analysis of the proposed method of experimentation and testing with hearing aid users. The selection of hearing aid users with relevant audiogram analysis is a key stage of the experimentation. Each participant was carefully examined to determine their audiogram response at particular frequency and decibel levels.

A. RECORDING SPEECH DATASET
Marathi letters, words, short sentences, and rhyming words were recorded in three main situations: a silent room, speech with a musical background, and speech with fan noise. The speech dataset statistics and sample details are listed in Table 2. Figure 4 illustrates the experimental flow for the speech intelligibility measurements. Each speech-processing method was examined in terms of the performance parameter (scientific) and the recognition approach (developmental). The recognition scores of all hearing-aid users were measured using all the processing techniques.

C. PARTICIPANTS SELECTION PROCESS AND AUDIOGRAM ANALYSIS
Candidate selection and system design verification procedures were performed in accordance with the clinical test practice suggested by hearing aid manufacturers' standards. Twelve hearing aid users participated in the testing: seven in the 7-14-year age group and five in the 14-17-year age group. The participants were selected from the NGO-operated Priyadarshini Deaf Residential School, Shirpur (M.S.), located in the North Maharashtra region. All selected participants had mild to moderate hearing loss. The participants were categorized into two groups according to their sex. This categorization was useful for assessing the impact of words spoken by female speakers, sentence recognition, and intelligibility. The audiologist's findings and the parameter fitting process revealed the dead region and the patient's requirements. Using the audiologist's findings helped us select the essential set of performance parameters for the algorithm and avoid overfitting.

V. RESULTS AND DISCUSSIONS
In this study, the amplitude-frequency variation of the noisy input and enhanced (denoised) signals and the neuro-fuzzy classifier performance parameters were calculated for different training and testing rates. After designing and implementing the proposed algorithm, recognition tests were performed under different conditions for a group of HA users.

A. SIGNAL ENHANCEMENT RESULTS
Denoising was the first stage evaluated in this section. The proposed classifier was used to identify the class of the incoming signals.
The signal-to-noise ratio (SNR) variation was plotted for the various collected samples and compared with the existing pitch-intensity-based neural network approach [5], [7], [28], as shown in Figs. 5 and 6. The proposed method showed an SNR improvement over the existing neural-network approach across different backgrounds. The proposed denoising approach achieved a maximum SNR of 25 dB, whereas the traditional neural-network-based speech intelligibility method achieved a maximum SNR of 23 dB in the silent-room situation.
The proposed I-AMS-based neuro-fuzzy classifier approach achieves a maximum SNR of 13 dB, while the neural-network-based speech intelligibility method achieves a maximum SNR of 11 dB under the music background situation.
Insertion Gains for Speech (IGSPXX): Insertion gains (IG) are required to bring the processed signal up to the requirements of a hearing aid user [76]. The insertion gain was estimated using an audiogram. The insertion gain (dB) response over the channel centre frequency is shown in Fig. 7. Audiogram observations and frequency-gain functions were incorporated into the proposed algorithm.

B. NEURO FUZZY CLASSIFIER PERFORMANCE
The neuro-fuzzy classifier performance was measured using the following parameters: sensitivity [76], specificity, classification accuracy, false positive classification rate (FPCR), false negative classification rate (FNCR), false acceptance classification rate (FACR), false rejection classification rate (FRCR), positive estimation value (PEV), negative estimation value (NEV), and the Matthews correlation coefficient (MCC). These performance parameters were calculated for the different training and testing ratios of the neuro-fuzzy classifier. Sensitivity refers to the ability of the classifier to classify correctly; specificity [77] is a measure of the capability of the classifier to correctly classify negative signals [78]; and accuracy is given by Equation (17):

Accuracy = (True positives + True negatives) / (Total number of samples)    (17)

The Matthews correlation coefficient (MCC) is widely adopted in machine learning to assess the quality of binary (two-class) classification. The MCC is the correlation coefficient between the observed and predicted two-class labels [79]; its value lies between −1 and +1. Table 3 shows the performance parameters of the I-AMS-based classifier obtained by extracting different features (t-f units, centre frequency, and modulation frequency) for different training and testing rates. The target (higher) frequency band is linearly compressed into a lower (audible) frequency band. This approach is designed using critical Bark-band scaling, which reduces spectral loss in the lower speech frequency range [47].
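A compact sketch of the binary forms of these metrics from confusion-matrix counts (the multi-class case in Table 3 would average these per class, an assumption about the paper's reporting):

```python
import numpy as np

def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy (Equation (17)), and the
    Matthews correlation coefficient from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # lies in [-1, +1]
    return sensitivity, specificity, accuracy, mcc
```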

C. MARATHI LANGUAGE RECOGNITION TEST RESULTS
The 7- to 12-year-old participants were randomly divided into two groups. Figures 8 and 9 show the average vowel and consonant recognition scores calculated for the proposed method by extracting the t-f unit, centre frequency, and modulation frequency, and for the individual hearing aid. In these experiments, each vowel and consonant was played randomly multiple times with different speaker and listener backgrounds. For the short-sentence recognition test, six listeners were selected from the existing 12 users. These participants were selected based on the highest individual recognition rates during the vowel and consonant tests. The short-sentence recognition test was conducted with different speaker and listener backgrounds.
The overall recognition score was measured for speakers and listeners in the quiet-quiet, quiet-crowded, crowded-quiet, and crowded-crowded room conditions. Table 4 shows the individual recognition scores calculated for five different cases, indicating that the highest rate, approximately 67%, was achieved after denoising and t-f unit extraction.
In Fig. 10, the SNR variation comparison for the proposed technique indicates that the processing method retains the SNR level and reduces the insertion gain requirement.
In Table 5, the processed speech after t-f unit extraction retains an SNR level between a minimum of 50.6 dB and a maximum of 67.3 dB, which yields better speech quality for HA users, while other speech features, such as the centre frequency (fc) and modulation frequency (fm), reduce the processed-speech SNR and demand more insertion gain to meet the patient's requirements. Table 6 presents the spoken and listened confusion matrices for the Marathi consonants. The confusion matrix diagonals indicate the correct identification of each consonant. The group of confusable Marathi consonants was responsible for the reduced recognition rate.

VI. CONCLUSION
The proposed I-AMS-based speech enhancement method was designed to improve hearing precision for hearing-disabled users under different speaker and listener background conditions. It makes several contributions to signal denoising (enhancement), insertion gains at different frequency levels, and feature extraction, training, and testing of the neuro-fuzzy classifier.
Current signal processing techniques in hearing aids process speech signals regardless of the speech background, which may keep the SPL below the hearing threshold level. In the proposed technique, the minimum insertion gain (IGSPxx) is added according to the speech background to satisfy the hearing aid user's requirements. The proposed method achieves an SNR of 25 dB in quiet-room conditions, in contrast to the 23 dB provided by the existing technique; similarly, in a noisy background, it achieves 13 dB compared with the existing 11 dB. The recognition results showed the importance of denoising: the short-sentence correct recognition rate increased from 47.33% to 48.60%. Denoising has a greater impact on recognition results in a noisy background than in a quiet background. The performance of the neuro-fuzzy classifier varies with the training and testing rates. Sensitivity varied in the ranges 98.44%-99.46%, 97.61%-100%, and 94.44%-98.40%, respectively, after extracting the t-f unit, centre frequency, and modulation frequency. The classification accuracy ranged from 74.86% to 86.66% for the 80-20% training and testing condition. Finally, the t-f unit and the training-testing rate played a vital role in improving classifier performance. Speech enhancement with t-f unit extraction had a positive impact on short-sentence recognition, and 66.70% accuracy was obtained for the t-f unit.
In addition, the proposed I-AMS-based classification method showed a significant improvement for female speakers compared with male speakers. This research can be extended by extracting additional speech features and incorporating appropriate modifications during the training, classification, and testing phases. Additionally, implementations on different hardware platforms, such as complex programmable logic devices (CPLDs) and field-programmable gate arrays (FPGAs), are envisaged.

ACKNOWLEDGMENT
This research work was demonstrated and verified at the Priyadarshini Deaf Residential School, Shirpur, Dhule, India. Ethical approval was obtained from the relevant ethics committee of the Institutional Oversight Board. The researchers were granted permission to use the experimental results and outcomes, and the institute provided written permission to publish the research outcomes in conferences and journals.
PRASHANT G. PATIL received the B.E. degree from North Maharashtra University, Jalgaon, in 2004, the M.E. degree from the RKDF Institute of Science and Technology, in 2011, and the Ph.D. degree from Rashtrasant Tukdoji Maharaj Nagpur University, in 2020. He is currently working as an Associate Professor with the Department of Electronics and Telecommunication Engineering, R. C. Patel Institute of Technology, Shirpur, Maharashtra, India. He has more than 19 years of teaching experience. His current research interests include biomedical signal and speech processing, natural language processing, and image processing.