IEEE/ACM Transactions on Audio, Speech, and Language Processing - New TOC
http://ieeexplore.ieee.org
TOC Alert for Publication #6570655, 29 September 2016 (vol. 24, no. 12)

- Context-Dependent Piano Music Transcription With Convolutional Sparse Coding, pp. 2218-2230

- Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition, pp. 2231-2240

Abstract: A new approach, neural network-based multifactor aware joint training, is proposed to improve recognition accuracy in noise-robust speech recognition. The approach is a structured model that integrates several different functional modules into one deep computational model. Speaker, phone, and environment factor representations are extracted using deep neural networks (DNNs) and integrated into the main ASR DNN to improve classification accuracy. In addition, the hidden activations of the main ASR DNN are used to improve factor extraction, which in turn helps the ASR DNN. All model parameters, including those of the ASR DNN and the factor-extraction DNNs, are jointly optimized under the multitask learning framework. Unlike traditional factor-aware training techniques, this approach requires no explicit separate stages for factor extraction and adaptation. Moreover, it can easily be combined with conventional factor-aware training based on explicit factors, such as i-vector and noise energy values, to obtain additional improvement. The method is evaluated on two noise-robust tasks: the AMI single-distant-microphone task, in which reverberation is the main concern, and the Aurora4 task, in which multiple noise types exist. Experiments on both tasks show that the proposed model significantly reduces word error rate (WER).
The best configuration achieved more than 15% relative WER reduction over the baselines on these two tasks.

- Factorized Hidden Layer Adaptation for Deep Neural Network Based Acoustic Modeling, pp. 2241-2250

- On MMSE-Based Estimation of Amplitude and Complex Speech Spectral Coefficients Under Phase-Uncertainty, pp. 2251-2262

- Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition, pp. 2263-2276

- Generation of Affective Accompaniment in Accordance With Emotion Flow, pp. 2277-2287

- Scalable Audio Coding Using Trellis-Based Optimized Joint Entropy Coding and Quantization, pp. 2288-2300

- Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding, pp. 2301-2312

- Kernel Method for Voice Activity Detection in the Presence of Transients, pp. 2313-2326

- Bayesian Networks to Model the Variability of Speaker Verification Scores in Adverse Environments, pp. 2327-2340

- Novel Unsupervised Auditory Filterbank Learning Using Convolutional RBM for Speech Recognition, pp. 2341-2353

- Instantaneous Fundamental Frequency Estimation With Optimal Segmentation for Nonstationary Voiced Speech, pp. 2354-2367

- Robust Variable Step-Size Decorrelation Normalized Least-Mean-Square Algorithm and its Application to Acoustic Echo Cancellation, pp. 2368-2376

Abstract: The proposed algorithm minimizes the l_2 norm of the decorrelated a posteriori error signal subject to an l_2-norm constraint on the filter coefficients. Solving this minimization problem yields the efficient RVSSDNLMS algorithm, whose convergence performance and computational complexity are analyzed.
Finally, simulations show that the proposed RVSSDNLMS algorithm considerably outperforms the normalized least-mean-square (NLMS), robust variable step-size NLMS, and pseudo-affine projection algorithms in terms of convergence rate and steady-state error in both Gaussian and impulsive noise environments.

- Blind Separation of Audio Mixtures Through Nonnegative Tensor Factorization of Modulation Spectrograms, pp. 2377-2389

- Adaptive Compensation of Misequalization in Narrowband Active Noise Equalizer Systems, pp. 2390-2399

- Estimating Speech Recognition Accuracy Based on Error Type Classification, pp. 2400-2413
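The variable step-size NLMS idea behind the acoustic echo cancellation entry above can be illustrated with a minimal sketch. This is not the paper's RVSSDNLMS algorithm: the step-size rule below is a simple smoothed-error-power heuristic (no decorrelation or robustness constraint), and the function name `vss_nlms` and all parameter values are illustrative assumptions.

```python
import numpy as np

def vss_nlms(x, d, filt_len=32, mu_min=0.05, mu_max=1.0, eps=1e-8):
    """Variable step-size NLMS adaptive filter (illustrative sketch).

    x: far-end (reference) signal; d: microphone (desired) signal.
    Returns the error signal e = d - y and the final filter estimate w.
    NOTE: the step-size rule is a plain error-power heuristic, not the
    RVSSDNLMS rule from the paper.
    """
    w = np.zeros(filt_len)
    e = np.zeros(len(d))
    p = 0.0  # smoothed error power, drives the step size
    for n in range(filt_len - 1, len(d)):
        xn = x[n - filt_len + 1:n + 1][::-1]  # most recent sample first
        y = w @ xn                            # filter output (echo estimate)
        e[n] = d[n] - y                       # a priori error
        p = 0.9 * p + 0.1 * e[n] ** 2         # recursive error-power estimate
        mu = np.clip(p / (p + 1.0), mu_min, mu_max)  # bigger error -> bigger step
        w += mu * e[n] * xn / (xn @ xn + eps)        # normalized update
    return e, w

# Identify a synthetic echo path from white-noise input.
rng = np.random.default_rng(0)
h = rng.standard_normal(32) * np.exp(-0.2 * np.arange(32))  # decaying echo path
x = rng.standard_normal(20000)
d = np.convolve(x, h)[:len(x)]  # echo signal picked up at the microphone
e, w = vss_nlms(x, d)
# After convergence, w approximates h and the late error samples are small.
```

Shrinking the step size as the error power falls is what trades fast initial convergence against low steady-state misadjustment; the paper's contribution is deriving that trade-off from a constrained minimization rather than a heuristic.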