Attention-Based Convolution Skip Bidirectional Long Short-Term Memory Network for Speech Emotion Recognition

Speech emotion recognition is a challenging task in natural language processing. It relies heavily on the effectiveness of speech features and acoustic models. However, existing acoustic models may not handle speech emotion recognition efficiently due to their built-in limitations. In this work, a novel deep-learning acoustic model called attention-based skip convolution bi-directional long short-term memory, abbreviated as SCBAMM, is proposed to recognize speech emotion. It has eight hidden layers, namely, two dense layers, a convolutional layer, a skip layer, a mask layer, a Bi-LSTM layer, an attention layer, and a pooling layer. SCBAMM makes better use of spatiotemporal information and captures emotion-related features more effectively. In addition, it alleviates the problems of gradient exploding and gradient vanishing in deep learning to some extent. On the databases EMO-DB and CASIA, the proposed model SCBAMM achieves accuracy rates of 94.58% and 72.50%, respectively. As far as we know, these are the best accuracy rates among peer models.


I. INTRODUCTION
The emotional state is an important element in the interactions of human beings. It influences many aspects of communication such as facial expressions, voice characteristics, and semantic contents [1]. As we all know, emotion is an inseparable component of speech, and it plays an important role in recognizing, interpreting, and responding to the emotions expressed in speech for a human-machine interface [2]. Therefore, speech emotion recognition (SER) is an essential component of natural language processing (NLP). SER consists of the following main steps: corpus construction, signal preprocessing, feature extraction, and acoustic modeling [3]. Among these steps, the acoustic model is the core component of an SER system. It deciphers the relationship between an audio input signal and linguistic elements through knowledge discovery models.
Traditionally, emotional features are input into various acoustic models such as hidden Markov models (HMM) [4], Gaussian mixture models (GMM) [5], and support vector machines (SVM) [6], and the recognition results are obtained from their outputs. HMM is a parametric representation of time-varying features that simulates human language processing; it needs a large number of samples and time-consuming training [7]-[9]. GMM is a probability density estimation model that can fit arbitrary probability distribution functions, but it depends heavily on data and is sensitive to noise [10]-[12]. SVM first maps the feature vectors from the input space to a high-dimensional Hilbert space by using kernel tricks and then seeks an optimal hyperplane in that space to classify samples. However, it cannot handle large-scale training sets, which lead to a large or prohibitively huge kernel matrix [13]-[16].
With the rise of deep learning, a variety of artificial neural networks (ANNs) [17] have been introduced for acoustic modeling. Compared with traditional methods, these deep neural networks perform better owing to their capability of learning from large-scale data. However, different deep-learning acoustic models have their own pros and cons. For example, recurrent neural networks (RNN) are good at dealing with time-series information [18]-[20], convolutional neural networks (CNN) do well in capturing spatial information [21]-[23], and deep residual networks (DRN) can tackle the problems of gradient exploding or vanishing, which become prevalent as network layers deepen [24]-[26]. Some representative deep-learning acoustic models are summarized in detail as follows.
RNN is normally used as a dynamic model for sequential input: its output is related not only to the current input but also to the output of the previous hidden layer. RNN can successfully predict subsequent information when the context length is small. However, it may not predict well because of gradient vanishing or exploding caused by its training algorithm, back-propagation through time (BPTT) [27]-[30].
To alleviate gradient exploding and gradient vanishing, long short-term memory (LSTM) is used as the basic recurrent unit of RNN; it uses memory cells and gates to control whether input information is memorized, output, or forgotten [31]-[34].
LSTM only makes good use of the information of previous time steps. In contrast, Bi-LSTM (bidirectional LSTM) presumes that the state at the current time step relies not only on information from previous time steps but also on that from future time steps. This enables the network to make full use of context information and make more accurate judgments. The presumption, however, makes the network focus mainly on memorizing a large amount of input information and weakens its modeling capability [35], [36]. To make up for this deficiency, skip connections [37]-[39], the core technique of DRN [40], are introduced, especially for deeper Bi-LSTM networks, because each neuron node in a skip connection makes use of the information of the previous hidden layer and enhances the modeling ability of the network.
Furthermore, Bi-LSTM cannot deal with spatial information in emotion recognition, and its computation is complicated. These problems are handled well by introducing convolution and pooling, the core operations of CNN [41]-[45].
Some other techniques are proposed to handle the challenges that may affect the recognition accuracies of acoustic models based on deep learning. For example, the masking operation is introduced to reduce the amount of calculation [46]- [49]. Similarly, weighted pooling based on attention over time is proposed to tackle the problems caused by a long silence, pause, or non-speech filler of the input voice, because it focuses mainly on specific regions of a speech signal that are more emotionally salient [50]- [52].
Given all that, a novel acoustic model SCBAMM is proposed to handle the challenges in speech emotion recognition. It has eight hidden layers, namely, two dense layers, a convolutional layer, a skip layer, a mask layer, a Bi-LSTM layer, an attention layer, and a pooling layer. This novel model makes good use of spatiotemporal information and captures emotion-related features effectively. In addition, it alleviates, to some extent, the problems of gradient exploding and gradient vanishing in deep learning. It demonstrates its superiority over peer models on the EMO-DB [53] and CASIA [54] corpora.
The remainder of the paper is organized as follows. Section II describes the details of the proposed model. Section III presents the experimental results. Section IV discusses future research directions and concludes this study.

II. METHODS
The development path of the proposed models is unveiled in this section in the sequence CBAM, SCBAM, and SCBAMM.

A. CBAM: ATTENTION-BASED CONV-BiLSTM
The visual attention mechanism is a special brain signal processing mechanism of human vision. Human vision scans the global image quickly and then obtains a target area. To obtain more detailed information, more attention would be invested in the target area. In the meantime, it suppresses other useless or irrelevant information [55]- [60].
The attention mechanism of deep learning, first proposed by DeepMind for image classification, is similar to that of human vision [61]. It enables the neural network to focus more on the relevant parts of the input and less on the irrelevant parts. Since then, the attention mechanism has been widely used in many NLP fields, especially in speech emotion recognition, to extract features [62].
To extract the temporal information of speech more effectively, Bi-LSTM is first introduced because it can simultaneously use information from previous and future time steps. CNN is then used to extract the spatial information of speech signals. Furthermore, the attention mechanism is employed to select the features that best represent emotions.
Based on the above analysis, a model called attention-based Conv-BiLSTM, abbreviated as CBAM, is developed. Figure 1 depicts the flow chart of the proposed CBAM. There are six hidden layers, namely, two dense layers, a convolutional layer, a Bi-LSTM layer, an attention layer, and a weighted pooling layer. The convolutional layer is used to extract spatial information, the Bi-LSTM layer is used to extract contextual information, the attention mechanism is employed to learn the weight of each time sequence, and then the weighted pooling is computed as the representation of the whole utterance. In this way, the proposed model CBAM can learn to assign weights to different time steps from data, which is especially effective in emotion recognition. A Softmax(·) function is finally employed to classify emotions based on the fused features output by CBAM. The model parameters are optimized by minimizing the cross-entropy loss objective function. The following subsections present the details of the proposed CBAM model layer by layer.
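To make the layer stack concrete, the following is a minimal Keras sketch of the CBAM topology, assuming TensorFlow 2. Layer widths follow the text (512-node dense layers, 512 convolution filters, 128 LSTM units per direction, 7 output classes); the sequence length, convolution kernel size, and the single-score attention layer are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

TIME_STEPS, N_FEATURES, N_CLASSES = 144, 36, 7   # sequence length is illustrative

inputs = layers.Input(shape=(TIME_STEPS, N_FEATURES))
h1 = layers.LeakyReLU(alpha=0.01)(layers.Dense(512)(inputs))   # first dense layer
h2 = layers.LeakyReLU(alpha=0.01)(layers.Dense(512)(h1))       # second dense layer
conv3 = layers.Conv1D(512, kernel_size=5, strides=1,
                      padding='valid')(h2)                     # valid convolution
yB = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(conv3)

# Attention over time: score each step, softmax the scores, then pool.
scores = layers.Dense(1)(yB)               # one score per time step
alpha = layers.Softmax(axis=1)(scores)     # attention weights over time
zp = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([alpha, yB])
outputs = layers.Dense(N_CLASSES, activation='softmax')(zp)

model = models.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```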
The input layer receives features of speech frames. In this study, they are 36-dimensional acoustic features including 34-dimensional spectral features, 1-dimensional pitch, and 1-dimensional harmonic-to-noise ratio (HNR). The first hidden layer of the CBAM model is a dense layer; its output is recorded as h_1 and calculated as

$$h_1 = f\left(W_1 x + b_1\right) \quad (1)$$

where x is the input feature vector, b_1 is the bias vector, and $w^1_{ij}$, the element of the weight matrix $W_1$, represents the weight connecting the i-th node of the input layer to the j-th node of the first dense layer (512 nodes), with i = 1, 2, ..., 36 and j = 1, 2, ..., 512. f(·) is the LeakyReLU activation function, defined as

$$f(x) = \begin{cases} x, & x \geq 0 \\ \alpha x, & x < 0 \end{cases}$$

where α is a hyperparameter. When α = 0, f reduces to the ReLU function; for negative input, both the output of ReLU and its first derivative are always 0, which makes the neuron unable to update its parameters. When α > 0, f is the LeakyReLU; for negative input, both its output and its first derivative are non-zero, which alleviates the gradient problems in deep learning to some extent and solves the problem that neurons stop learning when the ReLU function enters the negative interval. The value of α in this paper is 0.01.

The obtained h_1 is used as the input of the next dense layer. The calculation of h_2 is similar to that of h_1; see formula (1). The calculated h_2 is in turn used as the input of the convolution layer. Valid convolution only considers the case where the one-dimensional tensor completely covers the convolution kernel, that is, the kernel moves strictly inside the tensor. The output conv_3 of the valid convolution, which is fed to the Bi-LSTM layer, is computed as

$$\mathrm{conv}_3 = f\left(h_2 * F\right)$$

where * denotes valid convolution, F = [k_1, k_2, ..., k_512] represents the convolution kernels, N is the number of filters and is set to 512, and S represents the stride and is set to 1.
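A quick NumPy check of the LeakyReLU definition and of the output length of a valid convolution may be useful; the input length L and kernel length K below are illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # f(x) = x for x >= 0 and alpha * x otherwise; alpha = 0 gives ReLU
    return np.where(x >= 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # [-0.02  0.    3.  ]

# Valid convolution keeps the kernel strictly inside the input, so the
# output length is (L - K) // S + 1 for stride S (S = 1 in the paper).
L, K, S = 150, 7, 1   # illustrative lengths
print((L - K) // S + 1)  # 144
```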
The Bi-LSTM layer has three inputs: the first is conv_3, which comes from the lower layer at the current time t; the second is h_{t-1}, the output of the same hidden state at time t - 1; and the third is h_{t+1}, the output of the same hidden state at time t + 1.
The gating mechanism of the memory cell is used to control the information flow. Figure 2 shows the LSTM cell. There is a cell state C_t that memorizes information, and it is updated as

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

where ⊙ represents the Hadamard product and C_{t-1} represents the cell state of the previous time step. f_t is the output of the forgetting gate at time t and is calculated as

$$f_t = \sigma\left(U_f\,\mathrm{conv}_3 + W_f\, h_{t-1} + b_f\right) \quad (7)$$

It represents the forgetting probability of the hidden state of the previous time step.
The output f_t is a three-dimensional array: the first dimension, which is set to 32, represents the batch size; the second, set to 144, represents the number of time steps; and the third, set to 128, represents the number of hidden states. In the following formulas, the dimensions of i_t, $\tilde{C}_t$, and o_t are equal to those of f_t.
$U = \left[r_{ij}\right]_{m \times n}$ represents the weight matrix between the convolution layer and the Bi-LSTM cell states, where i = 1, 2, ..., m with m = 512 and j = 1, 2, ..., n with n = 128. The dimensions of the matrices U_f, U_i, U_c, and U_o are the same as those of U, where U_f is the forgetting weight matrix, U_i and U_c are the input weight matrices, and U_o is the output weight matrix.
The symbol $W_{n \times n}$ represents the weight matrix between the hidden states at adjacent time steps. The dimensions of the matrices W_f, W_i, W_c, and W_o are all equal to those of $W_{n \times n}$, where W_f is the connection weight matrix between the former hidden state and the current forgetting gate, W_i is the connection weight between the former hidden state and the current input gate, W_c is the connection weight between the former hidden state and the current cell state, and W_o is the connection weight between the former hidden state and the current output gate.
The parameter h_{t-1} in equation (7) is a 128-dimensional vector of the hidden state, b_f is the bias, and σ(·) is the Sigmoid function, defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The input gate, responsible for processing the input information at the current sequence position, consists of two parts, i_t and $\tilde{C}_t$, which are multiplied to update the cell state. The parameter i_t is the output of the Sigmoid activation:

$$i_t = \sigma\left(U_i\,\mathrm{conv}_3 + W_i\, h_{t-1} + b_i\right)$$

Correspondingly, the parameter $\tilde{C}_t$ is the output of the tanh activation:

$$\tilde{C}_t = \tanh\left(U_c\,\mathrm{conv}_3 + W_c\, h_{t-1} + b_c\right)$$

where

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The updated hidden state h_t is the Hadamard product of o_t and tanh(C_t), that is,

$$h_t = o_t \odot \tanh\left(C_t\right)$$

where the dimensions of h_t equal those of h_{t-1}, and the output gate o_t is computed as

$$o_t = \sigma\left(U_o\,\mathrm{conv}_3 + W_o\, h_{t-1} + b_o\right)$$

Finally, the output $y_t^{o}$ of the current sequence is calculated as

$$y_t^{o} = V h_t + b_t$$

where b_t is the bias vector and V is the connection weight matrix between the cell hidden state and the output, with the same dimensions as $W_{n \times n}$. Similarly, the dimensionality of $y_t^{o}$ is the same as that of f_t. Because Bi-LSTM processes a sequence of information in both the forward and backward directions at the same time, the final output y_B of Bi-LSTM, which is also the input of the attention layer, is a three-dimensional array: the first dimension, set to 32, is the batch size; the second, set to 1024, is the number of time steps; and the third, set to 256, is the number of hidden states (128 per direction).
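A NumPy sketch of a single forward step through these equations follows; the weight shapes match the dimensions stated above (512-dimensional convolution features, 128-dimensional hidden state), while the random initialization is purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, C_prev, U, W, b):
    # One LSTM cell step: U*, W*, b* are the input-to-cell and
    # hidden-to-hidden weights and biases for the forget (f),
    # input (i), candidate (c), and output (o) gates.
    f = sigmoid(U['f'] @ x + W['f'] @ h_prev + b['f'])     # forget gate
    i = sigmoid(U['i'] @ x + W['i'] @ h_prev + b['i'])     # input gate
    C_tilde = np.tanh(U['c'] @ x + W['c'] @ h_prev + b['c'])
    C = f * C_prev + i * C_tilde                           # cell-state update
    o = sigmoid(U['o'] @ x + W['o'] @ h_prev + b['o'])     # output gate
    h = o * np.tanh(C)                                     # hidden-state update
    return h, C

m, n = 512, 128                       # conv feature dim, hidden dim
rng = np.random.default_rng(0)
U = {k: rng.standard_normal((n, m)) * 0.01 for k in 'fico'}
W = {k: rng.standard_normal((n, n)) * 0.01 for k in 'fico'}
b = {k: np.zeros(n) for k in 'fico'}
h, C = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n), U, W, b)
print(h.shape, C.shape)               # (128,) (128,)
```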
In the attention layer, Softmax(·) is used to learn the attention parameters of an input frame feature. It computes the final weights for the frames, which sum to unity. u represents a 256-dimensional vector calculated as

$$u = W_A\, i_A + b_A$$

where i_A, the input of the attention layer, is a two-dimensional array: the first dimension, set to 32, represents the batch size, and the second, set to 256, represents the dimension of the time step. In addition, $W_A = \left[w_{ij}\right]_{2m \times 2n}$ is the weight matrix and $b_A = [b_1, b_2, \ldots, b_{2n}]$ is the bias vector. Softmax(·) is an activation function that maps the original outputs to the interval (0, 1) such that they sum to 1. The outputs can be understood as probabilities, and the node with the highest probability is selected as the focus of attention. α is the probability of the sequence features passing through the attention layer, calculated as

$$\alpha = \mathrm{Softmax}\left(u \cdot y_B\right)$$

where · represents the dot product. Note that the last dimensions of u and y_B are taken for the dot product to calculate the probability through the Softmax(·) function.
The vector corresponding to the maximum probability is the target of the attention mechanism; it has the same dimension as i_A.
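The following NumPy sketch illustrates this attention computation; the construction of i_A from the Bi-LSTM outputs (a time average here) is an assumption, and the shapes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T, d = 10, 256                            # T time steps, 256-D Bi-LSTM features
rng = np.random.default_rng(1)
y_B = rng.standard_normal((T, d))         # Bi-LSTM outputs over time
W_A = rng.standard_normal((d, d)) * 0.01  # attention weight matrix
b_A = np.zeros(d)                         # attention bias
i_A = y_B.mean(axis=0)                    # attention input (assumed: time average)
u = W_A @ i_A + b_A                       # u = W_A i_A + b_A
alpha = softmax(y_B @ u)                  # dot product along the feature axis
print(alpha.shape, alpha.sum())           # (10,) 1.0
```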
In the weighted pooling layer, in order to obtain the utterance-level representation z_p, a weighted pooling operation is performed over the sentence by taking the dot product of α and y_B along the time axis, that is,

$$z_p = \sum_{t} \alpha_t\, y_B^{t}$$

On top of the CBAM model there is an output layer, which computes class probabilities through the Softmax(·) function to perform classification:

$$\hat{y} = \mathrm{Softmax}\left(W_p\, z_p + b_p\right)$$

where W_p and b_p are the weight matrix and bias of the output layer. To find the optimal weights and biases, the cross-entropy loss function is employed to train the CBAM network. The cross-entropy L_CE is calculated as

$$L_{CE} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k} t_{nk}\,\ln y_{nk}$$

where N denotes the total number of samples, n denotes the n-th sample, and k denotes the k-th class. It is worth pointing out that t_{nk} denotes the ground-truth probability that the n-th sample belongs to class k (k = 0, 1, 2, ...). In addition, y_{nk} is the output of the neural network and represents the predicted probability of the n-th sample belonging to class k.
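A small self-contained NumPy sketch of the weighted pooling and the cross-entropy loss; the array sizes and values are illustrative only.

```python
import numpy as np

# Weighted pooling z_p = sum_t alpha_t * y_B[t]: illustrative shapes,
# T = 4 time steps and d = 6 Bi-LSTM features per step.
rng = np.random.default_rng(2)
T, d = 4, 6
y_B = rng.standard_normal((T, d))
alpha = np.full(T, 1.0 / T)                # attention weights (sum to 1)
z_p = (alpha[:, None] * y_B).sum(axis=0)   # utterance-level representation
print(z_p.shape)                           # (6,)

def cross_entropy(t, y, eps=1e-12):
    # L_CE = -(1/N) * sum_n sum_k t_nk * ln(y_nk)
    return -np.mean(np.sum(t * np.log(y + eps), axis=1))

t = np.array([[0, 1, 0], [1, 0, 0]])              # one-hot labels
y = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])  # predicted probabilities
print(cross_entropy(t, y))  # ~0.29
```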

B. SCBAM: CBAM WITH SKIP CONNECTIONS
The CBAM network focuses on memorizing a large amount of input information. By adding a skip connection [37] between the first hidden layer and the convolution layer, a new model called SCBAM is developed in this study to enhance the modeling capability of deep-learning networks. Figure 3 illustrates its topology. The skip connection makes the network focus not only on memorizing a large amount of input information but also on improving its modeling ability. Furthermore, the SCBAM network can avoid the gradient exploding or vanishing problems as the network deepens. The reason is that SCBAM fuses the feature vectors conv_3 and h_1, where conv_3 is the feature vector extracted from the convolution layer and h_1 is the feature vector extracted from the dense layer. The fused feature F_c is calculated as

$$F_c = \mathrm{concatenate}\left(\mathrm{conv}_3,\, h_1\right)$$

where the concatenate(·) function concatenates the two features. Because of the concatenation, the dimension of F_c is the sum of the dimensions of conv_3 and h_1; that is, there are 1024 neuron nodes in the skip layer. The calculation procedures of the other layers in the SCBAM network are exactly the same as those in the CBAM network. In implementation, both the CBAM and SCBAM networks employ the LeakyReLU [63] activation function and the RMSprop [64] optimizer.
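A minimal Keras sketch of this skip connection follows; matching time lengths for h_1 and conv_3 are assumed here so the two tensors can be concatenated, and the shapes are illustrative.

```python
from tensorflow.keras import layers, Model

# SCBAM skip connection: the first dense layer's output h1 is
# concatenated with the convolution output conv3 before the Bi-LSTM,
# giving 512 + 512 = 1024 skip-layer nodes.
h1 = layers.Input(shape=(144, 512), name='h1')
conv3 = layers.Input(shape=(144, 512), name='conv3')
F_c = layers.Concatenate(axis=-1)([conv3, h1])   # F_c = concatenate(conv3, h1)
skip = Model([h1, conv3], F_c)
skip.summary()   # output shape: (None, 144, 1024)
```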

C. SCBAMM: SCBAM WITH MASKING OPERATIONS
The SCBAM network focuses on the specific regions of a speech signal that are emotionally salient. To extract the features of the target region more effectively, a mask layer [49] is added between the convolution layer and the Bi-LSTM layer of SCBAM to build a new model named SCBAMM. Figure 4 illustrates the system diagram of SCBAMM. The function of the masking operation is to extract the features of the region of interest. These features are obtained by multiplying a feature mask with the features to be processed: values inside the region of interest remain unchanged, while values outside it are set to 0.
When inputting the sample features, 0s are padded to align the dimensions of all sample features. For the Bi-LSTM network, all 0s in F_c need to be masked, that is, they do not participate in the calculation. The mask operation y_m can be represented as

$$y_m = \mathrm{mask}\left(F_c,\, 0\right)$$

where mask(F_c, 0) indicates that all 0s in F_c are excluded from the calculation. The calculation procedures of the other layers in the SCBAMM network are exactly the same as those in the SCBAM network. To prevent possible over-fitting [65], during the training of CBAM, SCBAM, and SCBAMM, dropout [66] is applied to all layers except the attention layer and the weighted pooling layer. The dropout rate is set to 0.1 unless specified otherwise. The batch size is 32, 10-fold cross-validation is used, and the number of epochs is 100. In addition, the optimizer and activation function are RMSprop and LeakyReLU, respectively.
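A sketch of the masking step, assuming the Keras Masking layer: time steps whose features are all zeros (the padding) are flagged and skipped by the downstream Bi-LSTM. Shapes are illustrative.

```python
from tensorflow.keras import layers, Model

F_c = layers.Input(shape=(None, 1024))      # variable-length padded sequences
y_m = layers.Masking(mask_value=0.0)(F_c)   # all-zero steps do not participate
y_B = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(y_m)
masked_bilstm = Model(F_c, y_B)
```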

III. EXPERIMENTAL RESULTS
The performances of the proposed CBAM, SCBAM, and SCBAMM are validated on the EMO-DB corpus [53] and the CASIA corpus [54].
EMO-DB is a German emotion database recorded by 10 actors (5 males and 5 females) simulating 7 classes of emotions, namely, anger (W), boredom (L), disgust (E), fear (A), joy (F), sadness (T), and neutral (N). The sample numbers of these classes are 127, 81, 46, 69, 71, 62, and 79, respectively. In total, the corpus contains 535 emotional speech sentences with a 48-kHz sampling rate and 16-bit quantization. One male and one female are randomly selected as testing subjects, and the data from the other subjects is used as validation data for early stopping. The 36-dimensional feature vector consists of 34-dimensional magnitude FFT vectors, the harmonic-to-noise ratio (HNR), and pitch (F0). Feature extraction is performed within a 25-ms window with a shifting step of 10 ms. The acoustic feature sequence is Z-normalized within each utterance [54].
The CASIA speech emotion database was recorded by the Institute of Automation, Chinese Academy of Sciences. It was recorded by 4 actors (2 men and 2 women) in six different emotions, namely, anger (A), fear (F), happy (H), neutral (N), sad (Sa), and surprise (Su). The signal-to-noise ratio (SNR) is about 35 dB, and the data was acquired in a clean recording environment with 16-bit quantization and a 16-kHz sampling rate. The publicly available CASIA dataset contains 1200 utterances: each actor speaks 300 utterances of the same texts across the six emotions. The average length of an audio file is about 1.9 s [55]. The 20-dimensional MFCC features are extracted, and high-level statistical functions of the MFCC features, namely the mean, variance, and maximum, are calculated. Feature extraction is performed within a 25-ms window with a shifting step of 10 ms. The acoustic feature sequence is Z-normalized within each utterance [54].
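A hedged sketch of this feature pipeline using librosa: 20-dimensional MFCCs over 25-ms windows with a 10-ms hop, per-utterance Z-normalization, then mean/variance/maximum statistics. The file path is a placeholder, and the exact extraction toolchain used by the authors is not stated.

```python
import numpy as np
import librosa

y, sr = librosa.load('utterance.wav', sr=16000)   # placeholder path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                            n_fft=int(0.025 * sr),       # 25 ms window
                            hop_length=int(0.010 * sr))  # 10 ms shift
# Z-normalize within the utterance
mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / \
       (mfcc.std(axis=1, keepdims=True) + 1e-8)
# High-level statistics: mean, variance, and maximum per coefficient
stats = np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1), mfcc.max(axis=1)])
print(stats.shape)  # (60,)
```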
The experiments are conducted on a powerful PC with 64 GB of RAM running Windows 10; the CPU has a base speed of 2.10 GHz with 40 cores and 80 logical processors, and two RTX 2080 Ti GPUs are employed for computation speedup.
The CBAM, SCBAM, and SCBAMM architectures are implemented with the TensorFlow toolkit. The parameters of the proposed models are shown in Table 1. The optimizer is RMSprop and the initial learning rate is set to 0.001. When training the neural network, a very large learning rate is likely to leave more neurons in the network 'dead'; LeakyReLU retains some values on the negative axis so that the information of the negative axis is not entirely lost, allowing the network to be trained better. RMSprop with bias correction accelerates convergence and decreases possible oscillations in training [64]. It is also more robust when the gradient becomes sparse.
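The training settings stated above translate into the following runnable sketch; the tiny model and random data are placeholders so the snippet runs standalone (the paper trains for 100 epochs with 10-fold cross-validation).

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(36,)),          # 36-D frame features
    tf.keras.layers.Dense(512),
    tf.keras.layers.LeakyReLU(alpha=0.01),       # alpha = 0.01 as in the paper
    tf.keras.layers.Dropout(0.1),                # dropout rate 0.1
    tf.keras.layers.Dense(7, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
              loss='categorical_crossentropy', metrics=['accuracy'])

x = np.random.randn(64, 36).astype('float32')                 # placeholder data
y = tf.keras.utils.to_categorical(np.random.randint(0, 7, size=64), 7)
model.fit(x, y, batch_size=32, epochs=2, verbose=0)           # 100 epochs in the paper
```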
The confusion matrix and five evaluation measures are employed to evaluate the performance of each model. In a confusion matrix, each row represents the predicted categories of an emotion, each column represents the actual categories, and each number on the diagonal indicates the number of correctly identified samples. The five evaluation measures are accuracy, precision, weighted average recall (WAR), unweighted average recall (UAR), and F1-score. Accuracy refers to the proportion of correct predictions among all predictions. Precision represents the proportion of samples predicted as positive that are actually positive; recall evaluates how many of the actual positive samples are predicted correctly. F1-score is the harmonic mean of recall and precision. In the case of imbalanced data, the recall rate can be biased; therefore, to evaluate the experimental performance comprehensively, both WAR and UAR are used.
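These five measures can be computed with scikit-learn as shown below on illustrative labels; WAR is the weighted-average recall and UAR the unweighted (macro) average recall, which is robust to class imbalance.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])   # illustrative labels
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0])

print(confusion_matrix(y_true, y_pred))
print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred, average='weighted'))
print('WAR      :', recall_score(y_true, y_pred, average='weighted'))
print('UAR      :', recall_score(y_true, y_pred, average='macro'))
print('F1-score :', f1_score(y_true, y_pred, average='weighted'))
```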

A. PERFORMANCE OF CBAM
Confusion matrices and 10-fold cross-validation are employed to verify the performance of CBAM. Figure 5 and Figure 6 show the best confusion matrices of CBAM on the EMO-DB and CASIA databases, respectively. It can be seen that: Firstly, the average accuracy rates are 80.75% on EMO-DB and 63.33% on CASIA. Secondly, 95.28% of the anger (W) samples are predicted correctly on the EMO-DB dataset, which is a considerable recognition result. Thirdly, one class can easily be predicted as another. Take the joy (F) emotion as an example: only 56.34% of its samples are identified correctly, while 35.21% are predicted as anger (W) and 4.23% as fear (A). Finally, W and F, as well as L and N, are easily confused emotion class pairs. Here, the spectral subtraction of two samples is used to demonstrate their similarity more intuitively. Figure 7 shows the spectrograms of W, F, F-W, and W-F. It can be seen that the spectrograms of joy (F) and anger (W) are similar; F-W and W-F reflect the difference between F and W, and the darker the spectrogram of F-W or W-F, the more similar classes F and W are.

B. PERFORMANCE OF SCBAM
Figures 8 and 9 show the best confusion matrices of SCBAM on the EMO-DB and CASIA databases. It can be seen that: Firstly, the average accuracy rates are 92.71% on EMO-DB and 70.00% on CASIA under the same computing environment as CBAM. Secondly, the accuracy of SCBAM is 11.96% higher than that of CBAM on the EMO-DB dataset. The reasons are as follows: the skip connections in SCBAM make the network focus not only on memorizing a large amount of input information but also on promoting its modeling ability; in addition, they can deal with the problems of gradient exploding or vanishing. Thirdly, except for the classes joy (F) and disgust (E), the samples of the other five types of emotions are recognized well; 5.63% of the joy (F) samples are predicted as another class. Once again, this proves that samples of class F are easily predicted as class W.
C. PERFORMANCE OF SCBAMM
Figure 10 and Figure 11 illustrate the best confusion matrices of SCBAMM on the EMO-DB and CASIA databases. It is obvious that: Firstly, the classification accuracy of the proposed model is 94.58% on EMO-DB and 72.90% on CASIA under the same computing environment as CBAM and SCBAM. Secondly, the accuracy of SCBAMM for each kind of emotion reaches 90.00% on the EMO-DB dataset, which indicates that the SCBAMM model has good robustness. Finally, the accuracy rate of SCBAMM is 13.83% and 1.87% higher than that of CBAM and SCBAM, respectively. The reason is that the masking operation in SCBAMM is good at extracting the effective features of the target regions, which contributes to detecting different emotion states.
D. COMPARISON OF CBAM, SCBAM, AND SCBAMM
Figure 12 shows the accuracy improvements of the proposed models CBAM, SCBAM, and SCBAMM in the 10-fold cross-validation on the EMO-DB database. It is easy to come to the following conclusions. Firstly, the average accuracy of SCBAMM is the best (orange squares in the box) among CBAM, SCBAM, and SCBAMM. Secondly, the results obtained by SCBAMM in the 10-fold cross-validation are relatively concentrated (the longitudinal height of the box), which indicates that SCBAMM has better stability and robustness. Finally, red solid circles indicate outliers. Table 2 summarizes the accuracy improvements of the proposed models CBAM, SCBAM, and SCBAMM over the peer models.
Firstly, SCBAMM is superior to SCBAM and CBAM on both datasets EMO-DB and CASIA in evaluation measures such as accuracy, UAR, precision, and F1-score.
Secondly, SCBAMM is superior to previous research results on the EMO-DB dataset, other than reference [75], no matter which evaluation index is measured. The accuracy rate of reference [75] is as high as 98.00%; the reason is that reference [75] selects only a subset of the EMO-DB database, which contains just four types of emotions, each with 30 emotional sentences.
Thirdly, SCBAMM demonstrates a strong prediction capability in emotion recognition. Mathematically, the weight matrix of SCBAMM can be more representative, and learned faster, than those of the peer models thanks to its customized optimizations.

IV. CONCLUSION AND FUTURE WORKS
SCBAMM, a novel acoustic model based on deep learning, is proposed for speech emotion recognition. To achieve better performance, several techniques, namely, the attention mechanism, skip connections, the mask operation, and the integration of spatial and time-series information, are all employed. SCBAMM demonstrates obvious advantages over the peer models on the benchmark datasets EMO-DB and CASIA. The experimental results suggest that SCBAMM is much better suited to emotion recognition than its peers because it makes good use of spatiotemporal information and captures emotion-related features effectively.
It would be interesting to further prove the superiority of the proposed model from the perspective of machine learning theory. For example, it is highly likely that the weight matrix sequences learned by SCBAMM are sparser and more meaningful than those of its peer models. Such a study would also reveal whether the specific operations proposed in SCBAMM are redundant for some special datasets (e.g., imbalanced data). To further verify the effectiveness of SCBAMM, it will be applied to other emotion classification databases. In addition, it will be extended to speech recognition and image classification.
HUIYUN ZHANG was born in Huanxian, Gansu, China, in 1993. She is currently pursuing the Ph.D. degree in computer science and technology with Qinghai Normal University, China. Her research interests include pattern recognition, intelligent systems, speech emotion recognition, and machine learning.
HEMING HUANG was born in Ledu, Qinghai, China, in 1969. He received the B.S. degree in mathematics from Shaanxi Normal University, the M.S. degree in computer application technology from Lanzhou University, and the Ph.D. degree in pattern recognition and intelligence system from Southeast University, China.
He is currently a Professor of computer science and technology with Qinghai Normal University and a Doctoral Supervisor of pattern recognition and intelligent system. He is also a member of the China Computer Federation (CCF) and the Association for Computing Machinery (ACM).
HENRY HAN received the Ph.D. degree from the University of Iowa, in 2004.
He is currently a Professor of computer science with the Department of Computer and Information Science, Fordham University. He is also the Director of the Laboratory of Big Data and Analytics. His current research interests include AI, data science, big data, bioinformatics/health informatics, fintech, and cybersecurity. He has published nearly 80 articles in leading journals and conferences in data science fields. He was the Founding Director of Fordham University's master's program in cybersecurity, besides serving as Department Associate Chair. He has supervised a total of about 60 undergraduate, master's, and Ph.D. students since 2005. His research has been supported by NSF, NIH, and research contracts from industry.