Application of PNN-HMM Model Based on Emotion-Speech Combination in Broadcast Intelligent Communication Analysis

The emotive manifestation of a news anchor ought not to be arbitrary, but rather meticulously crafted and refined to elicit a more coherent emotional response. Hence, the identification of the appropriate emotional disposition during news broadcasts constitutes a meritorious domain of inquiry. To address concerns related to imprecise extraction of speech features and intricate detection of emotional states in broadcasting, this study presents an innovative Chinese pronunciation system predicated on speech recognition. The GA-SVM algorithm is employed to ascertain the endpoint of input Chinese speech signals and extract emotional feature parameters. For the recognition of the emotional temperament in broadcast speech, a PNN model is utilized to execute the decoding process of Viterbi in HMM. Experimental findings evince that the SVM optimized by GA attains a robust classification effect across diverse test samples. Moreover, the PNN-HMM model exhibits a noteworthy capacity to withstand noise during Chinese speech extraction, thereby enabling accurate discernment of the emotional characteristics of the speech under examination. Additionally, this research furnishes a valuable point of reference for the application of intelligent classification technology to audio information.


I. INTRODUCTION
Although the dissemination of Chinese news information commenced relatively late, the collaborative efforts of several generations of announcers have culminated in a fixed paradigm of ''apt vocabulary, rhythmic clarity, and fluent language'' in linguistic expression. However, in light of the evolving times, an increasing number of newscasters recognize the need to infuse emotion into the process of news broadcasting. Presently, computer-assisted language learning systems in multimedia primarily emphasize training in listening, speaking, and reading, with minimal focus on pronunciation. Furthermore, even when real-time pronunciation correction feedback is provided, applications that judge users' emotional states and adjust the interaction mode accordingly remain exceptionally rare.
Humans possess the ability to convey feelings of joy or dissatisfaction through speech. Moreover, human emotions play a pivotal role in interpersonal communication through language, serving as a crucial information resource for perceiving phenomena [1]. As this intricate information accumulates, the volume of informational resources in the teaching process expands. These short texts brimming with intense personal sentiment may appear sparse and disordered, yet they harbor significant untapped value that calls for exploration [2]. Hence, identifying and summarizing such texts while uncovering their underlying patterns has emerged as a profoundly meaningful research direction. Leveraging digital multimedia technology for broadcasting training therefore proves highly effective, contingent upon the computer's ability to discern the user's emotional state.
In early emotion research, an emotion dictionary was first constructed manually; the emotion words in the text data were then matched against the dictionary and weighted according to the values marked in it, and finally the average of all weighted values was taken as the emotion score of the text. However, the emotion expressed by many words varies with context and often differs from the emotion marked in the dictionary, so the accuracy of text emotion classification is low [3], [4]. The use of emotional expression in news broadcasting does not simply require announcers to employ cadence in intonation; more important is to use logical thinking to integrate emotion at the right moments. Under the current broadcasting and hosting environment, different regions differ in education level and dissemination scope. With the continuous development of Internet technology, broadcast content is no longer limited to radio and TV; the intervention of more multimedia platforms has greatly improved the communication environment in remote areas. In addition, a voice emotion recognition system can infer from students' speech during training whether they are in a positive, confident, negative, or nervous emotional state, classify the emotion accordingly, and then conduct one-to-one emotional interaction based on that state to encourage and promote efficient learning.
In the pre-processing stage of speech emotion recognition, the data collection process exhibits imperfections. Noisy speech signals significantly alter the distribution patterns of acoustic features, hindering accurate long-term tracking of emotional states. Furthermore, acoustic characteristics affected by variables such as gender, speaker age, and data collection method influence the efficacy of feature selection [5]. Moreover, cross-cultural and cross-linguistic disparities introduce variation in how emotions are expressed; in particular, the distinct cultural and linguistic features of different countries lead to divergent patterns of emotional expression. In scenarios where the available datasets are limited, the number of extractable features is constrained, ultimately resulting in unsatisfactory recognition outcomes.
Addressing these imperfections in data collection, accounting for diverse cultural and linguistic characteristics, and developing robust feature extraction techniques are therefore essential to the accuracy and reliability of speech emotion recognition systems. The main contributions of this paper are as follows: (1) Design of a broadcasting training system based on an emotion-speech model: the system employs the GA-SVM algorithm to detect the endpoints of input Chinese speech signals carrying emotion, enabling the extraction of emotional feature parameters. (2) Utilization of the PAD (Pleasure, Arousal, Dominance) emotion model: the paper adopts the Hidden Markov Model (HMM) to extract sequential features from speech emotion data and incorporates a Probabilistic Neural Network (PNN) to enhance the Viterbi decoding process of the HMM, improving the system's robustness to noise and the accuracy of speech extraction.

II. RELATED WORKS
Speech emotion recognition refers to a technology that uses computer processing to extract the features of emotional signals frame by frame, simulates human perception to understand human emotions, and then infers the type of speech emotion. At present, speech emotion recognition algorithms can be divided, according to the pattern recognition approach, into template matching methods, probability statistics methods, and discriminative classifiers. They can also be divided into statistical classifiers, represented by the Hidden Markov Model (HMM) [6], the Gaussian Mixture Model (GMM) [7], and K-Nearest Neighbors (KNN) [8], and discriminative classifiers, represented by artificial neural networks [9], decision trees [10], and the Support Vector Machine (SVM) [11]. HMM is suitable for recognizing time series, and such a system has good extensibility: it only needs to be trained on new samples. However, the HMM's ability to fit speech emotion data is mediocre; it is strongly influenced by phoneme information and discriminates poorly between neighboring emotions. In view of these shortcomings, Emperuma [12] proposed an improved HMM/RBF hybrid recognition method that introduced a neural predictor into the HMM to calculate the state observation probability, allowing the HMM to make effective use of inter-frame information. GMM is a model that decomposes a phenomenon into several speech feature vectors based on Gaussian probability density functions, and it has achieved great success in speech recognition and related fields. The advantage of GMM is its higher fitting ability for speech emotion data and greater robustness than HMM; its disadvantages are its high computational cost and strong dependence on training data. Addressing this shortcoming, Ashev et al. [13] proposed a speech emotion recognition method based on an improved GMM model: by replacing the traditional output probability value with the vector quantization error value when calculating the model score, the amount of training data needed for modeling was reduced and the recognition speed improved. In addition, SVM is suitable for small training sets and has high fitting ability for speech emotion data; it can avoid local optima and achieve global optimization. However, SVM has shortcomings in multi-classification problems. Bang et al. [14] proposed an adaptive incremental SVM algorithm that effectively solved the problems of incremental and large-scale data; the hybrid model based on a decision tree and an improved SVM proposed in [15] effectively avoided the problems of unbounded generalization error, a large number of classifiers, and restricted optimization.
With the development of deep learning, researchers have increasingly focused on automatically learning optimal features directly from raw data. Le et al. [16] introduced a multi-task deep bidirectional Long Short-Term Memory (LSTM) recurrent neural network trained with a cost-sensitive cross-entropy loss. Their method jointly modeled different emotion tags and demonstrated competitive performance in pure audio-based emotion recognition on the RECOLA dataset, providing an alternative approach for continuous emotion recognition and showcasing the potential of deep learning in this domain.
In the field of speech emotion recognition, Hidden Markov Models (HMM) and Artificial Neural Networks (ANN) are commonly employed models. HMMs possess strong dynamic modeling abilities, allowing them to capture temporal dependencies in speech data. On the other hand, ANNs excel in classification tasks due to their powerful discriminative capabilities.
Several studies have explored the strengths and limitations of these models and their hybrid combinations. For instance, Konang et al. [17] compared the performance of HMMs, ANN-based systems, and hybrid HMM-ANN models for speech emotion recognition. Their findings indicated that the hybrid model outperformed both standalone HMMs and ANNs in terms of recognition accuracy. The integration of HMMs and ANNs in a hybrid architecture benefited from the temporal modeling abilities of HMMs and the discriminative capabilities of ANNs, resulting in improved recognition performance.
Additionally, other researchers have investigated alternative approaches. For example, Zhang et al. [18] proposed a deep learning framework based on CNN for speech emotion recognition. Their method leveraged spectrogram images as input and achieved competitive performance on multiple emotion datasets. The CNN-based approach offered advantages in capturing local patterns and extracting discriminative features from speech signals. Moreover, some studies have explored the fusion of multiple modalities for enhanced emotion recognition. Wang et al. [19] employed a multimodal approach that combined audio and visual features to improve the accuracy of emotion recognition. Their results demonstrated the benefits of leveraging complementary information from different modalities.
In summary, the development of deep learning techniques has propelled the research on automatically learning features directly from raw data for speech emotion recognition. While HMMs and ANNs have been widely used, hybrid models, such as the HMM-ANN combination, have shown improved recognition performance compared to standalone models. Furthermore, alternative approaches like CNN-based methods and multimodal fusion have also yielded promising results. Future research in this area should explore further advancements in deep learning architectures, feature representations, and fusion strategies to enhance the accuracy and robustness of speech emotion recognition systems.

III. BROADCAST TRAINING SYSTEM BASED ON SPEECH RECOGNITION

A. OVERALL DESIGN
The Broadcast Training System, grounded in speech recognition technology, leverages advances in speech recognition to facilitate effective language learning. The system evaluates the resemblance and alignment between a user's pronunciation and a standardized template library, and employs a designated scoring methodology to give the user constructive feedback for improving pronunciation. The scoring system is a manifestation of the amalgamation of computer technology and digital signal processing techniques in language pedagogy. At its core, the Pronunciation Teaching System functions as an artificial-intelligence-driven expert system that guides users in honing their pronunciation abilities; its architecture is depicted in Figure 1, which shows the interconnected components and modules that collectively support its operation. Through advanced speech recognition techniques and the scoring methodology, the system empowers learners to refine their pronunciation in a guided and structured manner, while the integration of artificial intelligence enhances the adaptability and personalization of the learning experience, fostering improved pronunciation accuracy and fluency. The system should have the following functions: (1) standard pronunciation prompts and training of the standard template library; (2) scoring of the user's pronunciation level, with synthesis and mapping of the score; (3) feature extraction, emotion recognition, and classification of the user's pronunciation state.

B. SPEECH SIGNAL RECOGNITION
Even if the speaker and text information are consistent, the effective speech segments in different emotional states will be different. The length, starting point and ending point of effective speech of the same text in different emotional states are different. If we take the whole recorded speech as valid data, it will inevitably affect the recognition performance of the whole system. Therefore, we must adopt a method to extract the effective speech segments.
The pre-emphasis process can obviously improve the signal-to-noise ratio and is realized by a high-pass filter. Its purpose is to eliminate low-frequency interference and enhance the high-frequency components of the speech spectrum. Speech is approximately stationary over short intervals; to exploit this property, the speech signal must be divided into frames. In this paper, every 16 ms of signal is taken as one frame s(n), with a frame length of 16 ms and a frame shift of 5 ms; within this 16 ms window, the short speech signal can be regarded as approximately stationary.
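To make this front end concrete, the following Python sketch implements pre-emphasis and framing as described above. The pre-emphasis coefficient 0.97 and the 16 kHz sampling rate are illustrative assumptions; the paper itself fixes only the 16 ms frame length and 5 ms frame shift.

```python
import numpy as np

def preemphasis(signal, alpha=0.97):
    """High-pass pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, fs, frame_ms=16, shift_ms=5):
    """Split a 1-D signal into overlapping frames of 16 ms with a 5 ms shift."""
    frame_len = int(fs * frame_ms / 1000)
    frame_shift = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

# Example: one second of audio at an assumed 16 kHz rate
fs = 16000
x = np.random.randn(fs)              # stand-in for a real recording
frames = frame_signal(preemphasis(x), fs)
print(frames.shape)                  # (197, 256): 256-sample frames, 80-sample shift
```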
The signal s(n) is decomposed into a low-frequency approximation c_1(n) and a high-frequency detail d_1(n); c_1(n) is then decomposed into c_2(n) and d_2(n), and so on until the five-level decomposition is completed, yielding 10 frequency components c_1(n) ∼ c_5(n) and d_1(n) ∼ d_5(n). According to the principle of wavelet multiresolution analysis, d_1(n) ∼ d_5(n) and c_5(n) together represent all the frequency information of the original speech signal. The average energy of each of these six coefficient sequences in the m-th frame is calculated by Formula (1):

$$E_i(m) = \frac{1}{N(m)} \sum_{n=1}^{N(m)} c_i^2(n), \quad i = 1, 2, \ldots, 6 \tag{1}$$

where c_i(n) represents a coefficient in d_1(n) ∼ d_5(n) or c_5(n), and N(m) is the number of coefficients contained in the m-th frame of the speech signal. The mean value E_m and variance σ_m² of these six average energies are

$$E_m = \frac{1}{6} \sum_{i=1}^{6} E_i(m), \qquad \sigma_m^2 = \frac{1}{6} \sum_{i=1}^{6} \left(E_i(m) - E_m\right)^2.$$

The above parameters constitute the feature vector of one frame of the speech signal:

$$x_m = \left[E_1(m), \ldots, E_6(m), E_m, \sigma_m^2\right]^{\mathrm{T}}.$$
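A minimal sketch of this frame-level feature extraction, using the PyWavelets library, is given below. The db4 mother wavelet is an assumption, since the paper does not name the wavelet used.

```python
import numpy as np
import pywt

def wavelet_features(frame, wavelet="db4", level=5):
    """8-D feature vector for one frame: the six sub-band average
    energies of Formula (1), plus their mean and variance."""
    # wavedec returns [c5, d5, d4, d3, d2, d1]: the level-5
    # approximation and the five detail sub-bands
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    energies = np.array([np.mean(c ** 2) for c in coeffs])
    return np.concatenate([energies, [energies.mean(), energies.var()]])

# Example: one 256-sample frame -> an 8-dimensional feature vector
print(wavelet_features(np.random.randn(256)).shape)   # (8,)
```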

IV. SPEECH ENDPOINT DETECTION BASED ON GA-SVM MODEL
At present, many voice endpoint detection algorithms are available, including the double-threshold algorithm and detection algorithms based on spectral characteristics and multi-feature fusion. However, these algorithms only work well under a high signal-to-noise ratio (SNR); as the SNR decreases, their performance drops rapidly or is lost entirely. In recent years, many researchers have used the support vector machine as a classifier to detect the endpoints of speech signals. However, the performance of SVM as a classifier depends heavily on the selection of the kernel parameter and penalty factor [7], which at present can only be roughly determined by experience or by repeated experiments under specific conditions. Therefore, in this paper, feature vectors extracted by wavelet analysis serve as the input of the SVM, and the optimal SVM parameters are obtained by GA. Figure 2 shows the recognition process of voice endpoint detection using the GA-SVM algorithm.

A. SVM CLASSIFICATION
For a two-class nonlinear classification problem, a set of SVM inputs is $(x_i, y_i), i = 1, 2, \ldots, n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$ is the output of the SVM. First, the input feature vectors are made linearly separable by mapping the nonlinear space $\mathbb{R}^d$ into a higher-dimensional space $F$ through a function $\phi : \mathbb{R}^d \to F$. The optimal hyperplane of the SVM classifier is found by solving the following quadratic programming problem:
$$\min_{\omega, b, \xi} \Phi(\omega) = \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i\left(\omega \cdot \phi(x_i) + b\right) \geq 1 - \xi_i, \; \xi_i \geq 0,$$

where ω represents the weight coefficient vector, C the penalty factor, b the classification threshold, and ξ_i the slack variable. Introducing Lagrange factors transforms it into its dual form:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \; 0 \leq \alpha_i \leq C,$$

where α_i is the Lagrange factor. In this programming problem only a small part of the α_i are non-zero; these are denoted α_i*, their number is denoted N, and the corresponding samples are the support vectors x_i* lying on the classification lines H1 and H2. The optimal solution obtained is

$$\omega^* = \sum_{i=1}^{N} \alpha_i^* y_i \phi(x_i^*),$$

and the function of the optimal classification hyperplane is $\omega^* \cdot \phi(x) + b^* = 0$. Supposing the kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, the resulting decision function is

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{N} \alpha_i^* y_i K(x_i^*, x) + b^*\right).$$

SVM classifiers with different performance are constructed by selecting different kernel functions. In this paper, the RBF is used as the kernel function of the SVM, defined as

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right).$$

B. GA OPTIMIZED SVM

GA can search for the optimal solution globally and optimize multiple parameters at the same time. Therefore, a genetic algorithm is used to obtain the optimal kernel parameter and penalty factor, so as to obtain a better speech detection rate. The kernel parameter and penalty factor of the SVM are encoded as a chromosome in binary form; the encoding length depends on their value range and the required solution accuracy. Let the value range of a parameter x be [x_min, x_max] and the solution accuracy be ∂; the coding length l then satisfies

$$2^{l-1} - 1 < \frac{x_{\max} - x_{\min}}{\partial} \leq 2^{l} - 1.$$

In the endpoint detection of speech signals, the value range of C is (0, 100], the value range of γ is [0, 100], and the accuracy is 0.0001. Therefore, both the kernel parameter γ and the penalty factor C can be represented by a 20-bit binary string. Decoding the binary string $a_{l-1} \cdots a_1 a_0$ into a value x gives

$$x = x_{\min} + \frac{x_{\max} - x_{\min}}{2^{l} - 1} \sum_{i=0}^{l-1} a_i 2^{i}.$$

The fitness function is used to calculate the fitness value of individuals, and the results are the basis for the selection operator. The fitness function is defined as

$$F = C_{\text{accuracy}},$$

where C_accuracy represents the proportion of samples in the dataset that are classified accurately.
Therefore, under the same test sample conditions, the function value is directly proportional to the classification accuracy.
When the fitness values of several adjacent generations show no obvious change, or the number of generations reaches the set maximum, the termination condition is met. In this paper, a preset maximum number of generations is used as the termination condition for obtaining the optimal parameters.
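The sketch below, built on scikit-learn's SVC, illustrates one way to run this GA search over C and γ. The population size, tournament selection, mutation rate, and 3-fold cross-validation fitness are illustrative assumptions, and a small positive lower bound stands in for the open interval (0, 100].

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
L = 20                                    # 20-bit chromosome per parameter

def decode(bits, lo, hi):
    """Map a binary string to [lo, hi] (the decoding formula above)."""
    value = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * value / (2 ** len(bits) - 1)

def fitness(chrom, X, y):
    """Fitness = cross-validated classification accuracy of the SVM."""
    C = decode(chrom[:L], 1e-3, 100.0)     # C in (0, 100]
    gamma = decode(chrom[L:], 1e-3, 100.0)
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

def ga_svm(X, y, pop_size=20, generations=30, p_mut=0.02):
    """Return (C, gamma) found by a simple binary-coded GA."""
    pop = rng.integers(0, 2, (pop_size, 2 * L))
    for _ in range(generations):
        scores = np.array([fitness(c, X, y) for c in pop])
        # tournament selection: each slot gets the fitter of two random parents
        idx = rng.integers(0, pop_size, (pop_size, 2))
        parents = pop[np.where(scores[idx[:, 0]] > scores[idx[:, 1]],
                               idx[:, 0], idx[:, 1])]
        # single-point crossover between consecutive pairs
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, 2 * L)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        # bit-flip mutation
        children ^= (rng.random(children.shape) < p_mut)
        pop = children
    scores = np.array([fitness(c, X, y) for c in pop])
    best = pop[scores.argmax()]
    return decode(best[:L], 1e-3, 100.0), decode(best[L:], 1e-3, 100.0)
```

In use, X would hold the 8-dimensional wavelet feature vectors of the manually calibrated frames and y their ±1 speech/non-speech labels.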

C. VOICE DETECTION STEPS
(1) Pre-processing the data of the voice library.
(2) The Mallat tower algorithm is used to perform a five-layer wavelet decomposition on each quasi-stationary frame signal to obtain the feature vector, which is used as the input of the SVM. Each frame is calibrated manually: speech frames are labeled +1 and non-speech frames −1, which serve as the output of the SVM.
(3) GA is used to get the optimal SVM parameters that are used to train the SVM model.
(4) Test the SVM model. After training is completed, the training samples are used as the input for SVM prediction, and the training result is verified by comparing the predicted output with the manual calibration. If high accuracy is achieved, training is considered complete; otherwise, the model is retrained until satisfactory accuracy is reached.
(5) The trained SVM classifier is used to predict the test samples, and the output results are obtained.
(6) Make a comprehensive judgment on the SVM output. Suppose the SVM output vector for a piece of voice data is $Y = \{y_1, y_2, y_3, \ldots, y_n\}$, where n is the total number of frames of the voice data. If the 10 consecutive frames starting from y_i (a count that can be adjusted for different noise ratios) are speech frames, the starting point of the speech data is y_i; if the 10 consecutive frames starting from y_j are non-speech frames, the end point of the speech data is y_j.
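A minimal sketch of the comprehensive judgment in step (6) follows; the run length of 10 frames comes from the text, while the toy label sequence is hypothetical.

```python
def find_endpoints(labels, run=10):
    """Scan per-frame SVM outputs (+1 speech, -1 non-speech): the start
    point is the first frame beginning `run` consecutive speech frames,
    and the end point is the first frame after the start beginning
    `run` consecutive non-speech frames."""
    start = end = None
    for i in range(len(labels) - run + 1):
        if start is None and all(l == 1 for l in labels[i:i + run]):
            start = i
        elif start is not None and all(l == -1 for l in labels[i:i + run]):
            end = i
            break
    return start, end

# Toy sequence: 12 noise frames, 25 speech frames, 15 noise frames
y = [-1] * 12 + [1] * 25 + [-1] * 15
print(find_endpoints(y))   # (12, 37)
```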

V. EMOTION RECOGNITION MODEL BASED ON PNN

A. PAD EMOTIONAL MODEL
In the early stage, the selected and recorded emotional speech corpus becomes the cornerstone for the later correspondence analysis and modeling of speech emotion. The acoustic characteristics of emotion comprise two important parts: a neutral part for the basic transmission of information and an emotional part for expressing the type of emotion [18]. In synthesizing emotional speech, we pay more attention to the emotional part, analyzing and modifying it to realize the synthesis of emotional speech. From the emotional features we build models and establish the corresponding acoustic models. As elaborated above, the acoustic performance of emotional speech is mainly divided into three parts: the magnitude of energy, the speed of speech (also regarded as the duration of pronunciation), and the fundamental frequency. Based on this acoustic research, we adopted the PAD emotional model, in which Pleasure-Displeasure reflects the degree of pleasure of a person's emotion, Arousal-Nonarousal reflects the degree of physiological excitement toward the environment or individuals, and Dominance-Submissiveness reflects the degree of control the emotion exerts toward the outside world, distinguishing passive from active emotions; for example, contempt is active, while fear is passive.
Many testers listened to different emotional voices and scored them, and according to three original-value calculation methods, the specific parameters corresponding to different emotions were finally summarized. We selected the PAD rating scale of typical Chinese emotions proposed by the Chinese Academy of Sciences, as shown in Table 1.
The expression of different emotions in the acoustic characteristic parameters shows differences in energy, duration, and fundamental frequency. That is, there is a correspondence between the parameters of the PAD emotional model and the acoustic characteristic parameters, measured here by the Pearson linear correlation coefficient. The relation between the target value s and the feature x_j is

$$r_{s,x_j} = \frac{\sum_{i=1}^{n} (s_i - \bar{s})(x_{ij} - \bar{x}_j)}{\sqrt{\sum_{i=1}^{n} (s_i - \bar{s})^2} \sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}.$$

When $|r_{s,x_j}| = 1$ the correlation is highest, and when it is 0 the correlation is lowest. According to the calculation and analysis, P is highly correlated with speaking speed: speaking speed and average duration have a large influence on it, while fundamental frequency has a small influence. A is positively correlated with fundamental frequency and negatively correlated with speaking speed. D shows little correlation overall but has a certain correlation with pauses.
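As a small illustration, the NumPy sketch below computes $r_{s,x_j}$ between one PAD dimension and each column of an acoustic feature matrix; the toy P scores and feature values are hypothetical.

```python
import numpy as np

def pad_feature_correlation(pad_scores, features):
    """Pearson correlation r_{s,x_j} between one PAD dimension and
    each acoustic feature column (e.g. speech rate, mean F0)."""
    s = pad_scores - pad_scores.mean()
    x = features - features.mean(axis=0)
    return (s @ x) / (np.linalg.norm(s) * np.linalg.norm(x, axis=0))

# Hypothetical P scores for four utterances vs. two feature columns
p = np.array([0.8, 0.3, -0.2, -0.6])
feats = np.array([[5.1, 210.0],     # speech rate (syll/s), mean F0 (Hz)
                  [4.0, 180.0],
                  [3.2, 150.0],
                  [2.5, 140.0]])
print(pad_feature_correlation(p, feats))   # both close to +1
```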

B. PNN-HMM MODEL
A calm voice is generated by speech synthesis technology, and emotional voices are then generated by modifying the prosodic parameters corresponding to different emotional states according to the emotional features of each state; the naturalness of emotional voices generated this way is not high enough. In the whole speech synthesis process, calm speech is used for training to form a Chinese HMM emotional model of calm speech. The original model is then updated and modified with the characteristic parameters extracted from the previously recorded emotional voices, evolving from a calm model into a model for each specific emotion. The HMM is widely used because of its strong modeling ability for certain time series. However, the HMM still has some defects. First, its modeling ability is not equally strong for all time series; for example, it models speech data with low-level acoustic phonemes relatively poorly, so it cannot effectively distinguish acoustically similar speech signals such as surprise and happiness. Second, the prior hypothesis of the HMM assumes that the current state depends only on the previous state, which is obviously inconsistent with the characteristics of speech.
The probabilistic neural network trains quickly, making it suitable for real-time systems. It can also solve nonlinear problems with a linear learning method, giving it faster training and higher classification accuracy than other neural networks. Finally, the probabilistic neural network has a feed-forward structure, so it can easily be implemented in hardware and its structure is relatively simple. Therefore, in this paper, the HMM is used to extract the time-series characteristics of Chinese speech emotion data, and the probabilistic neural network is used as the classifier for the emotional states, compensating for the HMM's weak classification ability. The structure of the hybrid model is shown in Figure 3.
Suppose there are two emotional patterns a and b, and the vector to be identified is X. By the Bayesian minimum-risk criterion:

$$\text{if } H_a L_a F_a(X) > H_b L_b F_b(X), \text{ then } X \in a; \text{ otherwise } X \in b,$$

where H_a and H_b represent the prior probabilities that X belongs to class a and class b, respectively. Taking $H_a = n_a/n$ and $H_b = n_b/n$, n_a and n_b are the numbers of training samples of affective modes a and b, and n is the total number of training samples. L_a and L_b are the cost factors of misclassification, and F_a(X) and F_b(X) are the probability density functions of X belonging to class a and class b, which can be obtained by the method of probability density estimation [20]:

$$F_a(X) = \frac{1}{(2\pi)^{p/2} \sigma^{p} n_a} \sum_{q=1}^{n_a} \exp\left(-\frac{\|X - X_{aq}\|^2}{2\sigma^2}\right).$$

A formula of the same form yields F_b(X), where p is the dimension of the feature vector, X_aq is the q-th training sample in emotional state a, and σ is the smoothing coefficient.
The number of states of the HMM is N = 4 and the number of observation symbols is M = 3. Based on the best state sequence obtained by Viterbi decoding of the HMM, the feature vectors belonging to the same state are time-normalized into two frames, each composed of 30 feature parameters. With 4 states, a Chinese speech emotion sentence can therefore be represented by a vector of 4 × 2 × 30 = 240 dimensions; that is, the input layer of the probabilistic neural network has 240 neurons. The number of neurons in the pattern layer is determined by the size of the training sample set. Since five Chinese emotional states need to be classified in this paper, five neuron nodes are selected for the summation layer and the output layer of the probabilistic neural network.
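The decision rule and Parzen-window density above translate directly into code. The sketch below is a minimal PNN classifier under those formulas; the smoothing coefficient and the two-dimensional toy data are hypothetical stand-ins for the 240-dimensional, HMM-state-normalized vectors described above.

```python
import numpy as np

def pnn_classify(x, class_samples, sigma=0.5, priors=None, costs=None):
    """Bayesian minimum-risk PNN decision: pick the class k maximizing
    H_k * L_k * F_k(x), with F_k the Parzen-window density estimate."""
    p = len(x)
    n_total = sum(len(v) for v in class_samples.values())
    # normalizing constant; common to all classes when sigma is shared
    norm = (2 * np.pi) ** (p / 2) * sigma ** p
    scores = {}
    for label, Xk in class_samples.items():
        d2 = np.sum((Xk - x) ** 2, axis=1)          # ||x - X_kq||^2
        f = np.exp(-d2 / (2 * sigma ** 2)).mean() / norm
        h = len(Xk) / n_total if priors is None else priors[label]
        l = 1.0 if costs is None else costs[label]
        scores[label] = h * l * f
    return max(scores, key=scores.get)

# Toy usage with 2-D vectors (the real input would be 240-D)
train = {"happy": np.array([[1.0, 1.0], [1.2, 0.9]]),
         "sad":   np.array([[-1.0, -1.0], [-0.8, -1.1]])}
print(pnn_classify(np.array([0.9, 1.1]), train))   # -> "happy"
```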

VI. EXPERIMENT AND ANALYSIS

A. DATA SOURCES
In order to lay the foundation for the synthesis of emotional speech, and in view of the complexity of emotional speech, we selected the five most representative emotions from the CASIA database [21]: Angry, Fear, Happy, Neutral, and Sad, with 1671 sentences selected as the basic corpus for each emotion. The selected corpus resources are used for audio feature extraction and analysis, in which the short-term energy, amplitude, zero-crossing rate, pitch frequency, and other feature parameters are mainly selected. The feature extraction of each speech sample is repeated five times, and the feature parameters are stored in Excel and MAT files according to emotion category.

B. RESULTS AND DISCUSSION
When the noise samples and signal-to-noise ratios differ, the optimal parameters obtained by the genetic algorithm will also differ. Therefore, the parameters must be selected algorithmically; when the background environment is constant, the optimal parameters can be obtained from the training set. White noise with SNR = 5 dB was added to the training-set speech as the sample, and the input and output vectors were obtained by the proposed method. First, the optimal penalty factor C and kernel parameter γ were obtained by the genetic algorithm and used to set the γ parameter of the RBF kernel function. The result is shown in Figure 4: the optimal parameters are C = 7.4593 and γ = 30.0787. Using these parameters for SVM training and classification prediction, the detection rate is 94.6197%. Figure 5 shows the results of speech feature classification. It can be seen from the figure that the SVM optimized by GA shows a good classification effect on different test samples.
Three algorithms, namely, single HMM model, PNN model and PNN-HMM, are used to classify and recognize the five emotional states of emotional speech test set. The recognition rates of five different emotional states by different algorithms are shown in Figure 6.
As depicted in Figure 6, the enhanced HMM hybrid model demonstrates a notable resilience against noise interference, exhibiting remarkable performance even in acoustically challenging settings with a signal-to-noise ratio of merely 5dB. This hybrid model displays a commendable capability to discern the emotional state conveyed through speech signals.
Moreover, compared with the standalone HMM and PNN, the hybrid model achieves a substantially higher average recognition rate of 62%. This improvement underscores the superiority of the hybrid model over its individual counterparts in accurately identifying and interpreting the emotional nuances embedded within speech data.
The anti-noise ability of the improved HMM hybrid model, as evidenced by its reliable emotional state recognition in adverse acoustic conditions, reflects its capacity to effectively mitigate the detrimental effects of noise interference. The amalgamation of HMM and other computational techniques within this hybrid model synergistically harnesses the strengths of each component, resulting in a robust and proficient system for speech emotion recognition. The substantial performance gain exhibited by the hybrid model in comparison to the singular HMM and PNN models indicates its potential for broader applicability in various domains, such as human-computer interaction, affective computing, and speech analysis. Future research endeavors may explore further optimizations and refinements to this hybrid model, aiming to unlock its full potential in capturing and deciphering the complex emotional expressions conveyed through speech.
The audio features of the same Chinese sentence under the five emotions are extracted and analyzed. Figure 7 shows a short-term energy comparison of the speech signals under the five emotions; from top to bottom, the corresponding emotions are angry, scared, happy, neutral, and sad. By comparing the time-domain waveforms under different emotions, as well as the characteristics of energy, amplitude distribution, and pitch track, it can be seen that the statistical characteristics of waveform duration and amplitude differ markedly across emotions, which indicates that the characteristic parameters selected in this paper are discriminative. As can be seen from the figure, the recognition rates of happiness and fear are higher, both over 90%, while those of anger, neutrality, and sadness are relatively low but still above 70%. The recognition rate of neutrality is the lowest: its emotional color is weaker than that of the other four emotions, so its misjudgment rate is relatively higher.
The polarity of emotion words may differ across descriptive objects. For example, the word ''high'' indicates different emotional polarities in different sentences: when used to describe prices it means prices are too high, carrying a negative bias, while when used to describe quality it means high quality, a positive tendency.

VII. CONCLUSION
Due to variations in cultural and linguistic characteristics across countries, the expression of emotions also differs, and limited sample sizes in datasets restrict the number of extractable features, leading to unsatisfactory recognition outcomes. This paper examines the performance of a GA-SVM-based endpoint detection method and achieves the extraction of emotional feature parameters. The PNN-HMM model is employed for the classification and recognition of speech-based emotional states, comparing the single and hybrid models under different signal-to-noise ratios. The results demonstrate that the PNN-HMM model exhibits high recognition rates and strong resistance to interference. Notably, on the dataset used in this study, the recognition rates for happiness and fear exceed 90%, while the recognition rate for neutral emotion is the lowest. By constructing a speech emotion recognition model, this research enables real-time monitoring of broadcasters' emotional states, facilitating prompt feedback and adjustment. This supports the continuous enhancement of broadcasting and hosting content and ensures accurate information delivery by leveraging AI-driven stored information. Future work will integrate auditory features, spectral features, prosodic features, acoustic quality features, and additional emotional features, and more sophisticated classification models will be employed to further improve the emotion recognition rate.

ACKNOWLEDGMENT
The author would like to thank the anonymous reviewers whose comments and suggestions helped improve this manuscript.

CONFLICTS OF INTEREST
The author declares that they have no known competing financial interests or personal relationships.
HUA YANG received the degree from Nanyang Technological University, Singapore, specializing in journalism and broadcasting. She is currently a Lecturer and the Director of the New Media and Art College, Xi'an Kedagaoxin University. With profound expertise in news communication and hosting, she is the speaker of the provincial-level flagship course ''Applied Journalism.'' Her dedication and innovative teaching methodologies earned her the third prize in the Fourth Undergraduate Classroom Teaching Innovation Competition in Shaanxi Province. She has authored over ten research papers and successfully led two Xi'an Municipal Social Science Fund Projects, in 2021 and 2022, consecutively.