SECTION I

## INTRODUCTION

Recently, large margin based discriminative training methods have been successfully applied to speech recognition, where Gaussian mixture continuous density hidden Markov models (CDHMMs) are estimated based on the principle of maximizing the minimum margin, as in [1], [2], [3]. According to theoretical results in machine learning, a large margin classifier implies good generalization power and generally yields much lower generalization errors on unseen data. The estimation of large margin HMMs turns out to be a constrained minimax optimization problem. Several convex optimization methods have been proposed to solve the large margin estimation of CDHMMs, as in [3], [5]. In the previous work of Li and Jiang [5], large margin estimation of Gaussian mixture CDHMMs was formulated as a semidefinite programming (SDP) problem under some SDP relaxation conditions. The SDP problem can be solved by many efficient algorithms, such as interior-point methods, which lead to the globally optimal solution since SDP is a well-defined convex optimization problem. Moreover, it has been experimentally shown that the SDP relaxation is tight enough to maintain high accuracy for LME. As a result, it has been reported in [5] that the SDP-based large margin estimation method (denoted LME/SDP for short) outperforms all other discriminative training methods used for speech recognition, and it has achieved one of the best performances on the standard TIDIGITS connected digit string speech recognition task. However, the optimization time of the LME/SDP method increases dramatically as the size of the HMM model set grows, because the size of the SDP variable matrix in [5] (i.e., *Z*) is roughly equal to the square of the total number of Gaussians in the model set.
It has been reported in [5] that the LME/SDP method successfully handled a CDHMM set consisting of about 4 k Gaussians. However, it is unlikely that the LME/SDP method in [5] can be directly extended to large vocabulary speech recognition tasks, which typically involve tens or even hundreds of thousands of Gaussians, because of the long optimization time and the large amount of memory required to solve such a large-scale SDP problem.

In this paper, we propose to use a different convex optimization method, namely *second order cone programming (SOCP)*, to solve the large margin estimation of CDHMMs for speech recognition. Compared with SDP, SOCP is a simpler convex optimization problem and can be solved much faster for the same problem size and structure; see [6], [7], [8]. But just like SDP, an SOCP algorithm is guaranteed to find the globally optimal solution, since SOCP is also a well-defined convex optimization problem. Based on the SOCP relaxation method originally proposed by Kim and Kojima in [6], we have formulated large margin estimation (LME) of CDHMMs as an SOCP problem, where the size of the SOCP variable vector is only proportional to the total number of Gaussians, as opposed to the square of the number of Gaussians in the previous LME/SDP method. However, it has been found that the original SOCP relaxation in [6] is too loose to yield good performance for LME in speech recognition. In this work, we have studied and proposed two new, tighter SOCP relaxation methods for LME of CDHMMs. Experimental results on the standard TIDIGITS task show that our LME/SOCP methods based on the newly proposed SOCP relaxations significantly outperform the previous gradient descent based method and achieve performance almost comparable to that of the previous LME/SDP approach. Moreover, the new LME/SOCP methods show much better efficiency in terms of optimization time (about 20–200 times faster across various model sizes) and memory usage when compared with the previous LME/SDP method in [5].

SECTION II

## LARGE MARGIN ESTIMATION (LME) OF HMMS

Following [1], [2], the separation margin for a speech utterance *X*_{t} in a multi-class classifier is defined as:
$$d(X_t) = \min_{j\in\Omega,\, j \neq W_t} [{\cal F}(X_t\,\vert\,\lambda_{W_t})-{\cal F}(X_t\,\vert\,\lambda_j)]\eqno{\hbox{(1)}}$$where *Ω* denotes the set of all possible words or word sequences, λ_{W} denotes the HMM representing a word or word sequence *W*, *W*_{t} is the true word identity for *X*_{t}, and ${\cal F}(X_t\,\vert\,\lambda_W)$ is called the discriminant function, which is usually calculated as the log-likelihood function of the HMM λ_{W}.
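As a toy illustration of (1), the margin of one utterance can be computed from the discriminant scores of all candidate models; the scores and labels below are hypothetical, not taken from the paper:

```python
def separation_margin(scores, true_label):
    """Separation margin d(X_t) from Eq. (1): the true model's discriminant
    score minus the best competing model's score."""
    best_competitor = max(s for j, s in scores.items() if j != true_label)
    return scores[true_label] - best_competitor

# Hypothetical log-likelihood scores F(X_t | lambda_j) for a 3-class toy task.
scores = {"one": -10.0, "two": -12.5, "three": -15.0}
print(separation_margin(scores, "one"))  # 2.5: positive, so X_t is correctly classified
```

A positive margin means the utterance is correctly recognized; a negative margin means some competing model scores higher than the true one.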

Given a set of training data ${\cal D} = \{X_1, X_2, \ldots, X_T\}$ consisting of *T* speech utterances, we usually know the true word identities for all utterances in ${\cal D}$, denoted as $\{W_1, W_2, \ldots, W_T\}$. The large margin principle leads to estimating the HMM model set *Λ* based on the criterion of maximizing the minimum margin of all training data, known as large margin estimation (LME) of HMMs:
$$\eqalignno{\Lambda^* &= \arg\max_{\Lambda}\min_{X_t\in {\cal D}} d(X_t)\cr&= \arg\min_{\Lambda}\max_{X_t\in {\cal D},\, j\in \Omega,\, j\neq W_t}\left[{\cal F}(X_t\,\vert\,\lambda_j) - {\cal F}\left(X_t\,\vert\,\lambda_{W_t}\right)\right].&\hbox{(2)}}$$

In practice, in order to apply LME to real-world problems involving large training data sets, we normally pre-select a subset of training data, named the *support vector set* ${\cal S}$, as follows:
$${\cal S} = \{X_t\,\vert\,X_t \in {\cal D}\ {\rm and}\ 0 \leq d(X_t)\leq \gamma\}\eqno{\hbox{(3)}}$$where γ > 0 is a pre-set positive number. All utterances in ${\cal S}$ are relatively close to the classification boundary even though all of them lie in the correct decision regions. Then, in each iteration of LME, we apply only the selected support token set ${\cal S}$ in the above minimax optimization in (2), instead of the entire training set ${\cal D}$. It is reasonable to exclude data with large margins from each step of LME, since such data are normally inactive in the optimization towards maximizing the minimum margin, especially when a local optimization method is used.
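The selection rule in (3) amounts to a simple filter over per-utterance margins; a minimal sketch, using hypothetical margin values:

```python
def select_support_tokens(margins, gamma):
    """Support set S from Eq. (3): utterances that are correctly classified
    (d(X_t) >= 0) but lie within margin gamma of the decision boundary."""
    return [t for t, d in margins.items() if 0.0 <= d <= gamma]

# Hypothetical margins d(X_t) for five utterances; gamma is the preset threshold.
margins = {"utt1": 0.4, "utt2": -0.2, "utt3": 2.0, "utt4": 0.9, "utt5": 5.7}
print(select_support_tokens(margins, gamma=1.0))  # ['utt1', 'utt4']
```

Misclassified utterances (negative margin) and utterances far from the boundary are both excluded, matching the two conditions in (3).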

SECTION III

## SECOND ORDER CONE PROGRAMMING (SOCP)

A second order cone programming (SOCP) problem is a nonlinear convex optimization problem in which a linear function is minimized over the intersection of an affine set and the product of various second-order cones. A standard SOCP has the following form:
$$\eqalignno{{\rm minimize}\quad &f^Tx\cr{\rm subject\ to}\quad & \Vert A_ix + b_i\Vert\leq c^T_i x + d_i\quad (i=1,\ldots,N)&\hbox{(4)}}$$where *x* ∊ **R**^{n} is the optimization variable, and the problem parameters include *f* ∊ **R**^{n}, *A*_{i} ∊**R**^{(ni−1) × n}, *b*_{i} ∊**R**^{ni−1}, *c*_{i} ∊**R**^{n} and *d*_{i} ∊**R**. The norm appearing in the constraints is the standard Euclidean norm, i.e., ‖ *u*‖ = (*u*^{T} *u*)^{½}. We call the constraint ‖ *A*_{i} *x* + *b*_{i}‖ ≤ *c*^{T}_{i} *x* + *d*_{i} in (4) a second-order convex cone constraint of dimension *n*_{i}.
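As a small numerical sketch, a candidate point can be tested against a second-order cone constraint of the form in (4); the toy matrices and test points below are illustrative assumptions:

```python
import numpy as np

def soc_feasible(x, A, b, c, d, tol=1e-9):
    """Check the second-order cone constraint ||A x + b|| <= c^T x + d
    from Eq. (4) at the point x (with a small numerical tolerance)."""
    return np.linalg.norm(A @ x + b) <= c @ x + d + tol

# Toy 2-D instance: the constraint ||x|| <= x_1 + 1 (A = I, b = 0).
A, b = np.eye(2), np.zeros(2)
c, d = np.array([1.0, 0.0]), 1.0
print(soc_feasible(np.array([1.0, 1.0]), A, b, c, d))   # ||(1,1)|| ~ 1.41 <= 2.0
print(soc_feasible(np.array([-1.0, 2.0]), A, b, c, d))  # ||(-1,2)|| ~ 2.24 > 0.0
```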

SOCP includes linear programming (LP) and (convex) quadratic programming (QP) as special cases, but it is less general than SDP. Many efficient primal-dual interior-point methods have been developed for SOCP. The computational effort per iteration required by these methods, although greater than for LP and QP problems, is much less than that required to solve an SDP problem of similar size and structure. Moreover, since SOCP is a well-defined convex optimization problem, these efficient algorithms lead to the globally optimal solution.

SECTION IV

## LME OF HMMS BASED ON SECOND ORDER CONE PROGRAMMING (SOCP)

In this work, we are interested in formulating LME of Gaussian mixture CDHMMs as an SOCP problem. First, we assume the HMM set *Λ* is composed of ${\cal K}$ different Gaussians in total, denoted as ${\cal N}({\mbi \mu}_k, \Sigma_k)$ with $1 \leq k \leq {\cal K}$. For simplicity, we only consider estimating Gaussian mean vectors with LME while assuming all other HMM parameters remain constant during LME. Given any speech utterance *X*_{t} = {**x**_{t1},**x**_{t2},…,**x**_{tR}}, the decision margin *d*_{j}(*X*_{t}) in (1) can be represented in a standard diagonal quadratic form as:
$$\eqalignno{d_j(X_t) &= {\cal F}\left(X_t\,\vert\,\lambda_{W_t}\right) - {\cal F}(X_t\,\vert\,\lambda_j)\cr&\approx c_{tj}- {1\over 2}\sum^R_{r=1}\sum^D_{d=1}\left[{(x_{trd}-\mu_{i_rd})^2 \over \sigma^2_{i_rd}} -{(x_{trd}-\mu_{j_rd})^2 \over \sigma^2_{j_rd}}\right]&\hbox{(5)}}$$where *D* is the feature dimension, we denote the optimal Viterbi path of *X*_{t} against λ_{Wt} as **i** = {*i*_{1},*i*_{2},…,*i*_{R}} and the optimal Viterbi path against λ_{j} as **j** = {*j*_{1}, *j*_{2},…,*j*_{R}}, and *c*_{tj} is a constant independent of all Gaussian means.

Since the margin as defined in (5) is actually unbounded for Gaussian mixture CDHMMs (see [2] for details), we adopt the following spherical constraint to guarantee the boundedness of margin as in [5]:
$$R(\Lambda) = \sum^{\cal K}_{k=1}\sum^{\cal D}_{d=1}{\left(\mu_{kd} -\mu_{kd}^{(0)}\right)^2 \over \sigma^2_{kd}} \leq r^2\eqno{\hbox{(6)}}$$where *r* is a pre-set constant, and μ_{kd}^{(0)} is also a constant which is set to be the value of μ_{kd} in the initial models.

As shown in [5], the minimax optimization problem in (2) becomes solvable under the constraint (6). Following [5], by introducing a new variable −ρ (ρ ≥ 0) as the common upper bound for all terms in the minimax optimization, we can convert the minimax optimization in (2) into an equivalent minimization problem as follows:

*Problem 1:*
$$\Lambda^* = \arg\min_{\Lambda,\rho} - \rho\eqno{\hbox{(7)}}$$*subject to*
$$\eqalignno{&{\cal F}(X_t\,\vert\,\lambda_j) - {\cal F}(X_t\,\vert\,\lambda_{W_t}) \leq - \rho&\hbox{(8)}\cr&R(\Lambda) = \sum^{\cal K}_{k=1}\sum^{\cal D}_{d=1}{\left(\mu_{kd} -\mu_{kd}^{(0)}\right)^2 \over \sigma^2_{kd}} \leq r^2&\hbox{(9)}}$$and ρ ≥ 0, for all $X_t \in {\cal S}$ and *j* ∊ Ω and *j* ≠ *W*_{t}.

Now, we introduce some notation: a column vector **x** is written as **x** = (*x*_{1};*x*_{2};…; *x*_{n}) and a row vector as **x** = (*x*_{1},*x*_{2},…, *x*_{n}). **I**_{D} is a *D* × *D* identity matrix, and **0**_{D} is a *D* × *D* zero matrix. Finally, **u** is a large column vector created by concatenating all normalized Gaussian mean vectors:
$${\bf u} = (\tilde{\mbi \mu}_1;\tilde{\mbi \mu}_2;\ldots;\tilde{\mbi \mu}_{\cal K})\eqno{\hbox{(10)}}$$where each normalized Gaussian mean vector is $\tilde{\mbi \mu}_k = (\mu_{k1}/\sigma_{k1};\, \mu_{k2}/\sigma_{k2};\, \ldots;\, \mu_{kD}/\sigma_{kD})$. In the following, we will consider how to convert the minimization *Problem 1* into an SOCP problem of the form shown in (4).
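For a toy model set with hypothetical means and diagonal standard deviations, the vector **u** in (10) is built by dividing each mean component by its standard deviation and concatenating:

```python
import numpy as np

# Hypothetical model set: K = 2 Gaussians, D = 3 feature dimensions.
means = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # row k holds mu_k
stds  = np.array([[1.0, 2.0, 1.0],
                  [2.0, 1.0, 2.0]])   # row k holds sigma_k

# Eq. (10): u concatenates the variance-normalized means mu_kd / sigma_kd.
u = (means / stds).ravel()
print(u)  # [1. 1. 3. 2. 5. 3.]
```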

Firstly, we will formulate the constraint in (9) into the standard second order cone constraint form shown in (4):
$$R(\Lambda) = \sum^{\cal K}_{k=1} \left(\tilde{\mbi \mu}_k - \tilde{\mbi \mu}_k^{(0)}\right)^T \left(\tilde{\mbi \mu}_k - \tilde{\mbi \mu}_k^{(0)}\right) = \left\Vert {\bf u} - {\bf u}^{(0)}\right\Vert^2 \leq r^2\eqno{\hbox{(11)}}$$where **u**^{(0)} denotes the initial Gaussian mean vectors normalized as in (10).

Secondly, we will re-formulate the constraint in (8) into a standard second order cone constraint form. Suppose ${\mbi \mu}_{\bf i}$ and ${\mbi \mu}_{\bf j}$ denote two large column vectors created by concatenating all normalized Gaussian mean vectors along the Viterbi paths **i** and **j** respectively, i.e., ${\mbi \mu}_{\bf i} = (\tilde{\mbi \mu}_{i_1};\tilde{\mbi \mu}_{i_2};\ldots;\tilde{\mbi \mu}_{i_R})$ and ${\mbi \mu}_{\bf j} = (\tilde{\mbi \mu}_{j_1};\tilde{\mbi \mu}_{j_2};\ldots;\tilde{\mbi \mu}_{j_R})$. And let $\tilde{\bf x}^{\bf i}_t$ and $\tilde{\bf x}^{\bf j}_t$ denote two concatenated feature vectors in *X*_{t} = (**x**_{t1},**x**_{t2},…,**x**_{tR}) normalized by the Gaussian variances along the Viterbi paths **i** and **j** respectively, with $\tilde{\bf x}^{\bf i}_t = (\tilde{\bf x}^{i_1}_{t1};\tilde{\bf x}^{i_2}_{t2};\ldots;\tilde{\bf x}^{i_R}_{tR})$ where $\tilde{\bf x}^{i_r}_{tr} = (x_{tr1}/\sigma_{i_r1};\ldots;x_{trD}/\sigma_{i_rD})$, and similarly for $\tilde{\bf x}^{\bf j}_t$. Next, we construct a large matrix **Φ**_{i} according to the above Viterbi path **i** as follows:
$$\Phi_{\bf i} = \left(\matrix{\overbrace{{\bf 0}_D\quad \ldots\quad {\bf 0}_D}^{i_1-1} &{\bf I}_D &\ldots{\bf 0}_D\ldots\cr\overbrace{{\bf 0}_D\quad \ldots\quad {\bf 0}_D}^{i_2-1} &{\bf I}_D &\ldots{\bf 0}_D\ldots\cr\vdots\cr\overbrace{{\bf 0}_D\quad \ldots\quad {\bf 0}_D}^{i_R-1} &{\bf I}_D &\ldots{\bf 0}_D\ldots}\right)\eqno{\hbox{(12)}}$$

Obviously, **u** (in (10)) and *μ*_{i} satisfy: *μ*_{i} = **Φ**_{i}**u**. Similarly, we have *μ*_{j} = **Φ**_{j}**u**.
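The identity ${\mbi \mu}_{\bf i} = \Phi_{\bf i}{\bf u}$ can be verified numerically with a toy model set; the sizes and the Viterbi path below are hypothetical:

```python
import numpy as np

def selector_matrix(path, K, D):
    """Build Phi as in Eq. (12): block row r carries I_D in the block column
    of Gaussian path[r] (1-based) and zeros elsewhere, so that Phi @ u stacks
    the normalized mean vectors along the Viterbi path."""
    R = len(path)
    Phi = np.zeros((R * D, K * D))
    for r, i_r in enumerate(path):
        Phi[r * D:(r + 1) * D, (i_r - 1) * D:i_r * D] = np.eye(D)
    return Phi

# Toy model set: K = 3 Gaussians, D = 2 dims; the path visits Gaussians 2, 1, 2.
K, D = 3, 2
u = np.arange(1.0, K * D + 1)   # u = (mu_1; mu_2; mu_3) = [1, 2, 3, 4, 5, 6]
Phi = selector_matrix([2, 1, 2], K, D)
print(Phi @ u)                  # [3. 4. 1. 2. 3. 4.]  (mu_2; mu_1; mu_2)
```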

Therefore, we can rewrite (5) as:
$$\eqalignno{-d_j(X_t) &= {\cal F}(X_t\,\vert\,\lambda_j) - {\cal F}(X_t\,\vert\,\lambda_{W_t})\cr&= -{1\over 2}\left[{\bf u}^T\left(\Phi_{\bf j}^T\Phi_{\bf j} -\Phi_{\bf i}^T\Phi_{\bf i}\right){\bf u}+2\left(\left(\tilde{\bf x}^{\bf i}_t\right)^T\Phi_{\bf i}-\left(\tilde{\bf x}^{\bf j}_t\right)^T\Phi_{\bf j}\right){\bf u}\right.\cr&\quad +\left.\left(\left(\tilde{\bf x}^{\bf j}_t\right)^T\tilde{\bf x}^{\bf j}_t - \left(\tilde{\bf x}^{\bf i}_t\right)^T\tilde{\bf x}^{\bf i}_t\right)\right]-c_{tj}&\hbox{(13)}}$$We denote $Q_{tj} = \Phi_{\bf i}^T\Phi_{\bf i}- \Phi_{\bf j}^T\Phi_{\bf j}$, ${\bf q}_{tj} = 2\left(\Phi_{\bf j}^T\tilde{\bf x}^{\bf j}_t - \Phi_{\bf i}^T\tilde{\bf x}^{\bf i}_t\right)$, and $q_{tj} = \left(\tilde{\bf x}^{\bf i}_t\right)^T\tilde{\bf x}^{\bf i}_t - \left(\tilde{\bf x}^{\bf j}_t\right)^T\tilde{\bf x}^{\bf j}_t - 2c_{tj}$. It is easy to show that *Q*_{tj} is a diagonal matrix. After applying all of these to (13), we can rewrite the constraint in (8) as follows:
$${\bf u}^T Q_{tj} {\bf u} + {\bf q}^T_{tj}{\bf u} + q_{tj} + 2\rho \leq 0\eqno{\hbox{(14)}}$$We know that a convex quadratic constraint can be converted into a second order cone constraint (see [8] for details). However, we cannot guarantee that the constraint in (14) is a convex quadratic constraint, since *Q*_{tj} in (14) is not necessarily a positive semidefinite matrix.
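For completeness, the standard conversion of a convex quadratic constraint into a second order cone constraint (cf. [8]) can be sketched and checked numerically; the matrices and test points below are illustrative assumptions, not quantities from the paper:

```python
import numpy as np

def quad_leq_zero_via_soc(x, P, q, c):
    """Check x^T P x + q^T x + c <= 0 (P positive semidefinite) via the
    standard SOC form ||(P^{1/2} x ; (1 - s)/2)|| <= (1 + s)/2,
    where s = -(q^T x + c)."""
    # Symmetric square root of P via eigendecomposition.
    w, V = np.linalg.eigh(P)
    P_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
    s = -(q @ x + c)
    lhs = np.linalg.norm(np.concatenate([P_half @ x, [(1 - s) / 2]]))
    return lhs <= (1 + s) / 2 + 1e-9

# Toy PSD instance: P = diag(2, 1), q = (0, -1), c = -2.
P = np.array([[2.0, 0.0], [0.0, 1.0]])
q = np.array([0.0, -1.0])
c = -2.0
x = np.array([0.5, 1.0])
direct = (x @ P @ x + q @ x + c) <= 0        # 1.5 - 1 - 2 = -1.5 <= 0
print(quad_leq_zero_via_soc(x, P, q, c) == direct)  # True: both agree
```

The SOC form follows from squaring both sides: the constraint holds iff $\Vert P^{1/2}x\Vert^2 \leq s$, i.e., the original quadratic inequality.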

Here we adopt the SOCP relaxation in [6], under which the constraint in (14) can be converted into a second order cone constraint combined with a linear constraint. Let $\{\lambda^1_{tj},\lambda^2_{tj},\ldots,\lambda^M_{tj}\}$ be the eigenvalues of the matrix *Q*_{tj} and $\{{\bf v}^1_{tj},{\bf v}^2_{tj},\ldots,{\bf v}^M_{tj}\}$ be the corresponding eigenvectors, where **v**^{m}_{tj} is the eigenvector corresponding to λ^{m}_{tj} and satisfies ‖**v**^{m}_{tj}‖ = 1 and (**v**^{m}_{tj})^{T}**v**^{l}_{tj} = 0 (*m* ≠ *l*). Then we have $Q_{tj} = \sum^{M}_{m=1}\lambda^m_{tj}{\bf v}^m_{tj}({\bf v}^m_{tj})^T$.

Suppose ${\cal M} = \{1, 2, \ldots, M\}$ contains the indices of all the eigenvalues of *Q*_{tj}. We can then rewrite *Q*_{tj} as:
$$Q_{tj} = Q^+_{tj} + \sum_{m\in {\cal M},\lambda^m_{tj} < 0} \lambda^m_{tj}{\bf v}^m_{tj}\left({\bf v}^m_{tj}\right)^T\eqno{\hbox{(15)}}$$where *Q*^{+}_{tj} is constructed from all positive eigenvalues as $Q^+_{tj} = \sum_{m\in {\cal M},\lambda^m_{tj} > 0} \lambda^m_{tj}{\bf v}^m_{tj}({\bf v}^m_{tj})^T$. Obviously, *Q*^{+}_{tj} is a positive semidefinite matrix.
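The decomposition in (15) can be verified numerically for a toy diagonal $Q_{tj}$ (the diagonal entries below are illustrative):

```python
import numpy as np

# Q_tj is diagonal (as shown in the text), so its eigenvectors are the
# standard basis vectors and its eigenvalues are the diagonal entries.
Q = np.diag([2.0, -1.0, 0.5, -3.0])
eigvals, eigvecs = np.linalg.eigh(Q)

# Q^+ collects the positive-eigenvalue terms; the negative-eigenvalue
# terms are split off separately, as in Eq. (15).
Q_plus = sum(l * np.outer(v, v) for l, v in zip(eigvals, eigvecs.T) if l > 0)
Q_neg  = sum(l * np.outer(v, v) for l, v in zip(eigvals, eigvecs.T) if l < 0)

assert np.allclose(Q_plus + Q_neg, Q)           # the decomposition is exact
print(np.all(np.linalg.eigvalsh(Q_plus) >= 0))  # True: Q^+ is positive semidefinite
```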

Substituting (15) into (14), we derive the following two constraints, which are together equivalent to (14):
$${\bf u}^TQ^+_{tj}{\bf u} + \sum_{m\in {\cal M},\lambda^m_{tj} < 0}\lambda^m_{tj}z_m + q^T_{tj}{\bf u} +q_{tj} + 2\rho \leq 0\eqno{\hbox{(16)}}$$with
$$z_m = {\bf u}^T {\bf v}^m_{tj}\left({\bf v}^m_{tj}\right)^T {\bf u} = \tilde{\mu}^2_m \quad\left(\forall m \in {\cal M}\ {\rm and}\ \lambda^m_{tj} < 0\right)\eqno{\hbox{(17)}}$$where $\tilde{\mu}_m$ is the *m*th component of the vector **u** in (10). (The second equality in (17) holds because *Q*_{tj} is diagonal, so its eigenvectors are the standard basis vectors.)

Obviously, the constraint in (16) is a convex quadratic constraint. However, the constraint in (17) is a non-convex equality constraint. Here, as in [6], it can be relaxed into a convex constraint by allowing *z*_{m} to be a free variable bounded as follows:
$$\tilde{\mu}^2_m \leq z_m \leq C_m \quad \left(\forall m \in {\cal M}\ {\rm and}\ \lambda^m_{tj} < 0\right)\eqno{\hbox{(18)}}$$where *C*_{m} is a constant. In our case, *C*_{m} can be roughly estimated from (11) as $C_m = \left(\left\vert\tilde{\mu}^{(0)}_m\right\vert + r\right)^2$.

After the relaxation, *Problem* 1 can be converted into the following convex optimization problem:

*Problem 2:*
$$\min_{{\bf u},\rho,z_m}-\rho\eqno{\hbox{(19)}}$$*subject to:*
$$\eqalignno{&{\bf u}^TQ^+_{tj}{\bf u} + \sum_{m\in {\cal M},\lambda^m_{tj} < 0}\lambda^m_{tj}z_m + q^T_{tj}{\bf u} + q_{tj} + 2\rho \leq 0&\hbox{(20)}\cr&\tilde{\mu}^2_m \leq z_m \leq \left(\left\vert \tilde{\mu}_m^{(0)}\right\vert + r\right)^2\quad\left(\forall m \in {\cal M}\quad \lambda^m_{tj} < 0\right)&\hbox{(21)}\cr&\left\Vert {\bf u}- {\bf u}^{(0)}\right\Vert \leq r \quad \rho \ge 0&\hbox{(22)}}$$for all $X_t \in {\cal S}$ and *j* ∊ Ω and *j* ≠ *W*_{t}.

The above SOCP relaxation is intuitively illustrated in Fig. 1. The original LME problem can be viewed as optimizing the objective function along the solid curve segment in Fig. 1(a). After relaxation, the optimization is performed within the shaded area under the constant upper bound, which is a convex set. For convenience, *Problem 2* is named SOCP0.

SECTION V

## TWO IMPROVED SOCP RELAXATION METHODS FOR LME

From Fig. 1(a), it is obvious that the relaxation in SOCP0 is quite loose, which will inevitably introduce significant error into the original LME problem. In the following, we consider two tighter relaxation methods to improve accuracy.

### A. Mean-dependent Linear Upper Bound on *z*_{m}

Obviously, instead of using a constant upper bound for *z*_{m} as in SOCP0, it is possible to use a linear upper bound, as plotted in Fig. 1(b). The locality constraint in (11) implies $\vert\tilde{\mu}_m - \tilde{\mu}^{(0)}_m\vert \leq r$, and since $\tilde{\mu}^2_m = (\tilde{\mu}_m - \tilde{\mu}^{(0)}_m)^2 + 2\tilde{\mu}^{(0)}_m\tilde{\mu}_m - (\tilde{\mu}^{(0)}_m)^2$, we can easily derive this new constraint as follows:
$$\tilde{\mu}^2_m \leq z_m \leq 2\tilde{\mu}^{(0)}_m\tilde{\mu}_m + r^2 - \left(\tilde{\mu}^{(0)}_m\right)^2\quad\left(\forall m \in {\cal M}\ {\rm and} \ \lambda^m_{tj} < 0\right)\eqno{\hbox{(23)}}$$

Under the constraint in (23), the upper bound of *z*_{m} becomes a linear function of $\tilde{\mu}_m$. The optimization is performed in the shaded area under this linear upper bound as in Fig. 1(b), which still remains a convex set. For convenience, the SOCP under the linear constraint in (23), instead of the constant constraint in (18), is named SOCP1. It is clear that the relaxation in SOCP1 is much tighter than the one in SOCP0.
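That the linear bound in (23) indeed dominates $\tilde{\mu}^2_m$ over the whole feasible interval can be checked numerically; the values below are illustrative:

```python
import numpy as np

def socp1_upper_bound(mu, mu0, r):
    """Linear upper bound on z_m = mu^2 from Eq. (23), valid whenever
    |mu - mu0| <= r (the locality constraint)."""
    return 2.0 * mu0 * mu + r**2 - mu0**2

# Check the bound numerically over the feasible interval for a toy mean.
mu0, r = 1.5, 0.8
grid = np.linspace(mu0 - r, mu0 + r, 1001)
gap = socp1_upper_bound(grid, mu0, r) - grid**2   # equals r^2 - (mu - mu0)^2 >= 0
print(gap.min() >= -1e-12)  # True: the line dominates mu^2 on the interval
```

The gap is exactly $r^2 - (\tilde{\mu}_m - \tilde{\mu}^{(0)}_m)^2$, which vanishes at the interval endpoints, so the bound is tight there.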

### B. Mean Shifting for Tighter SOCP Relaxation

However, if the upper bound and lower bound of $\tilde{\mu}_m$ have different signs, i.e., the shaded area crosses the origin, the SOCP1 relaxation becomes quite loose, as shown in Fig. 2(a). In this work, we propose to right-shift all Gaussian means to ensure that the shaded area does not cross the origin for any Gaussian mean. More specifically, we choose a positive constant value *d*_{m} for each $\tilde{\mu}_m$ and shift it as $\tilde{\mu}'_m = \tilde{\mu}_m + d_m$. If *d*_{m} is big enough, we can guarantee that the upper and lower bounds of all shifted means have the same sign. Thus, the linear upper bound on *z*_{m} from Section V-A, expressed in terms of $\tilde{\mu}'_m$, achieves a better approximation, as shown in Fig. 2(b). This method is named SOCP2 in this paper.
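A minimal sketch of choosing the shift *d*_{m}; the particular rule below, including the small positive slack, is an illustrative assumption rather than the paper's exact recipe:

```python
def shift_for_same_sign(mu0, r, eps=1e-6):
    """Pick a shift d_m so that the feasible interval [mu0 - r, mu0 + r],
    after right-shifting by d_m, lies strictly on the positive side
    (i.e., the shaded region no longer crosses the origin)."""
    return max(0.0, r - mu0) + eps

mu0, r = 0.3, 0.8                  # original bounds 0.3 +/- 0.8 cross zero
d = shift_for_same_sign(mu0, r)
lo, hi = mu0 + d - r, mu0 + d + r  # shifted lower and upper bounds
print(lo > 0 and hi > 0)           # True: both bounds now have the same sign
```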

SECTION VI

## EXPERIMENTAL RESULTS

The proposed SOCP-based optimization methods for LME have been evaluated on the TIDIGITS database for connected digit string speech recognition at the string level. Only the adult portion of the TIDIGITS corpus is used in our experiments. The training set has 8623 digit strings (from 112 speakers) and the test set has 8700 strings (from another 113 speakers). Our model set consists of 11 whole-word CDHMMs representing all digits. Each HMM has 12 states and uses a simple left-to-right topology without state-skip. Acoustic feature vectors consist of the standard 39 dimensions (12 MFCCs and the normalized energy, plus their first and second order time derivatives). Different numbers of Gaussian mixture components (from 1 to 32 per state) are evaluated. In all LME methods, we use the best MCE models (see [4]) as the initial models, and only HMM mean vectors are re-estimated with LME. In each iteration of LME, a number of competing string-level models are computed for each utterance in the training set based on its N-best decoding results (*N* = 5). Then we select support tokens according to (3) and obtain the optimal Viterbi sequence for each support token according to the recognition result. Next, the relaxed SOCP optimization is conducted with respect to **u**, ρ and *z*_{m}. Finally, the CDHMM means are updated based on the optimization solution **u**^{*}. In this work, *Problem 2* is solved by an SOCP optimization tool, *MOSEK 4.0* [7], under Matlab.

In our experiments, three LME/SOCP methods, namely SOCP0, SOCP1 and SOCP2, have been compared with the gradient descent based LME method in [2], denoted as *GRAD*, and the SDP-based LME method in [5], denoted as *SDP*. We also include the maximum likelihood (ML) and minimum classification error (MCE) [4] baseline systems in the table for reference. Table I gives a performance comparison on the TIDIGITS test set using all of these different training methods. From the results, we can see that the SOCP0 method achieves only a small improvement over the MCE method, while the SOCP1 and SOCP2 methods not only significantly improve over the MCE method but also largely outperform the simple gradient descent based LME method. By applying mean shifting, SOCP2 achieves the best recognition performance among the three SOCP approaches. In addition, all LME/SOCP methods (SOCP0, SOCP1, and SOCP2) are relatively easy to run, while the gradient descent method requires a lot of fine-tuning of its parameters, such as the step size and penalty weight coefficients. From the results, we can also see that the LME/SDP method in [5] still achieves the best overall performance, especially for large models. However, the performance gap between LME/SDP and our best method, SOCP2, is not significant. It is safe to say that SOCP2 yields performance comparable to LME/SDP on the TIDIGITS task. However, when we compare the efficiency of LME/SDP and our proposed SOCP approaches, all three SOCP methods^{1} run substantially faster and consume much less memory during optimization. As one example, the CPU times needed to optimize each problem per iteration are listed in Table II for a comparison between LME/SDP and SOCP1. It is clear that the SOCP1 method runs about 20–200 times faster than the SDP method, and the speed gap grows as the model size increases.
This can be easily explained: an SOCP problem can be solved more efficiently than an SDP problem of similar size and structure. Moreover, the size of the optimization variable matrix in LME/SDP [5] is proportional to the square of the total number of Gaussians, while the size of the SOCP optimization variable (i.e., **u**) is only proportional to the number of Gaussians. As a result, for the same CDHMM model set, the problem size of SOCP is significantly smaller than that of LME/SDP.

SECTION VII

## CONCLUSION

In this paper, we have proposed to use second order cone programming (SOCP) for large margin estimation (LME) of CDHMMs in speech recognition and have studied three different SOCP relaxation methods. The two new SOCP relaxation methods, namely SOCP1 and SOCP2, have been demonstrated to be effective in terms of recognition performance. Compared with the previous LME/SDP method, the proposed LME/SOCP methods run much more efficiently in terms of optimization time and memory consumption. This opens the door to applying LME training to state-of-the-art speech recognition systems, which normally involve very large HMM model sets. Currently, we are extending the proposed LME/SOCP method to large vocabulary speech recognition tasks.