Autoregressive Asymmetric Linear Gaussian Hidden Markov Models

In a real-life process evolving over time, the relationship between its relevant variables may change. Therefore, it is advantageous to have different inference models for each state of the process. Asymmetric hidden Markov models fulfil this dynamical requirement and provide a framework where the trend of the process can be expressed as a latent variable. In this paper, we modify these recent asymmetric hidden Markov models to have an asymmetric autoregressive component, allowing the model to choose the order of autoregression that maximizes its penalized likelihood for a given training set. Additionally, we show how inference, hidden-state decoding and parameter learning must be adapted to fit the proposed model. Finally, we run experiments with synthetic and real data to show the capabilities of this new model.


Introduction
Hidden Markov models (HMMs) have been successfully used to analyze dynamic signals, e.g., in speech recognition [1] and tool wear monitoring [2], or sequential signals, e.g., in gene prediction [3]. These models assume the existence of a latent or hidden variable that drives an observable set of variables. However, in the case of continuous data, traditional HMMs assume that a fully dependent probabilistic model involving all the variables holds throughout the driving dynamical process, which can be untrue. As a consequence, the models learn a considerable number of unnecessary parameters, which may cause data overfitting.
The idea of asymmetric HMMs is introduced in [4] and [5]. These models imply that, depending on the value of certain variables, the distribution of the remaining variables may change. For HMMs, the asymmetric component is expressed through the hidden variable: depending on its value, a context-specific Bayesian network [6] encodes the distribution of the emission probabilities. These context-specific Bayesian networks reduce the number of parameters needed.
Autoregressive (AR) processes have been studied for a long time, especially for regression tasks [7]. However, the traditional approaches to AR processes make strong assumptions, such as stationarity, that do not hold in many real-world scenarios. This issue was addressed by [8], allowing the models to have changing parameters depending on the value of a hidden variable. Nevertheless, the order of the AR process had to be fixed beforehand by trial and error. HMMs and AR processes were combined in [9], where AR coefficients were added to the emission probabilities. In this paper we combine the ideas of asymmetric HMMs with AR processes to overcome the previous shortcomings: determine the AR order of a model for each hidden state and reduce the number of unnecessary parameters. Specifically, our model enables each variable, depending on the hidden state, to determine its parents within the context-specific Bayesian network and the number of lags that its distribution requires to maximize a model fitting score.
The structure of this document is as follows. Section 2 describes related work about asymmetric probabilistic models and HMMs with AR processes. Section 3 reviews HMMs in general and summarizes the expectation maximization algorithm (EM), the structural EM and the Yule-Walker equations [7] that are relevant tools for our model. Section 4 introduces the proposed autoregressive asymmetric linear Gaussian hidden Markov model (AR-AsLG-HMM). In this section we discuss the adaptation of the forward-backward and Viterbi algorithms [1]. We also describe the parameter and structural learning and show that the EM algorithm iteratively improves the log-likelihood of the data for our model. Section 5 presents experiments with synthetic data, real air quality and ball-bearing degradation data. The results obtained using the AR-AsLG-HMM are compared against its non-AR version and other state-of-the-art approaches. The paper is rounded off in Section 6 with conclusions and comments regarding possible future research.

Related work
In this section we review the related work regarding HMMs with AR behavior and asymmetric probabilistic models. Table 1 shows the reviewed articles grouped according to their contribution.

Modified emission probabilities in HMMs
One of the first combinations of HMMs and AR models was used to process speech data [9]. Autoregressive polynomials were added to the Gaussian emission probabilities, whose coefficients were determined via the Baum-Welch algorithm [1]. Later, [10] proposed AR mixtures of Gaussians hidden Markov models (AR-MoG-HMMs), where the emission probabilities were modelled as mixtures of Gaussians. These models were used for speech recognition. In [12], a vectorial AR multivariate Gaussian HMM (VAR-MVGHMM) was introduced. This model enables variables to have temporal dependencies with all the other variables. Again, the model was used for speech recognition.
Some authors [13] modified the emission probabilities such that they behave as an AR Gaussian but with an error coefficient given by the linear prediction residuals [28].
Others also considered variations of HMMs such as hidden semi-Markov models (HSMMs), where the time duration of each hidden state can be modified to not always follow a geometric distribution, or hierarchical hidden Markov models (HHMMs) where AR behavior was considered. For instance, [14] proposed an AR-HSMM, where AR variables and non-AR variables could be considered in the same model depending on the modeller's decision. [16] proposed a vector AR hierarchical hidden semi-Markov model (VAR-HHSMM) to classify and determine hand movements.
Other approaches to HMMs with AR properties can be found in [11] and [8], where a modified log-likelihood function was proposed to represent the AR behavior in data. Markov mean-switching AR models (MMSAR) and linear Markov-switching AR models (LMSAR) were studied, and their parameters were calculated with the EM algorithm. [15] proposed the transitional Markov switching autoregressive (TMSAR) model as an extension of the MMSAR and LMSAR models. In this case, the emission probabilities depend on past values of the hidden process to determine changes in its mean and its weight. The authors used maximum likelihood methods with a Newton-Raphson strategy to estimate the model parameters.

Modified hidden variables
In more recent works, new approaches have been proposed in which the assumptions about the hidden variables that govern the process were modified. For example, in the model given by [17], the authors edited an autoregressive hidden Markov model (AR-HMM) by introducing a memoryless hidden variable; the Markovian hidden states had a probabilistic dependency on this memoryless hidden variable. AR higher-order HMMs (AR-HO-HMMs) were introduced in [18]. The authors not only considered an autoregressive property in the observations, but also a fixed-order Markov assumption in the hidden states specified by the user. They used mixtures of Gaussians with AR properties for the emission probabilities.

Missing data in HMMs
Other works focused on the missing data. In [19] an AR-HMM with a missing at random assumption was proposed to perform exact inference in such scenarios. In [20] the missing data was considered as latent variables. Additionally, the authors proposed a modified forward-backward algorithm and Baum-Welch parameter updating formulas.

Asymmetric models
As regards asymmetric probabilistic graphical models, the Bayesian multinets introduced in [22] were used to describe different local graphical models depending on the values of certain observed variables; the similarity networks in [21] allowed the creation of independent influence diagrams for subsets of a given domain. Context-specific independence in Bayesian networks in [6] used tree-structured conditional probability distributions with a D-separation-based algorithm to determine statistical dependencies between variables according to contexts given by instantiations of subsets of variables. Following these ideas, more recently in [25], stratified graphical models (SGM) were proposed, where the concept of stratum was introduced to allow different factorizations for a probability distribution depending on the values of some of the variables. A nonreversible Metropolis-Hastings algorithm to calculate marginal likelihoods and learn decomposable SGMs was given. [24] introduced the chain event graphs (CEG). A CEG consists of a directed colored graph obtained from a staged tree by successive edge contraction operations. The obtained graphical model can represent conditional independence and causal behavior that traditional Bayesian networks cannot show. Later, a dynamic version was proposed [26].
Other authors have attempted to combine asymmetric models with HMMs. For example, in [23] the buried Markov models (BMM) were introduced. In that article, the models of [12] were used, but the temporal dependencies could vary depending on the hidden state. These context-specific dependencies are learned using mutual information strategies. [4] used Chow-Liu trees and conditional Chow-Liu trees coupled with HMMs. The HMMs were used to model the dynamic behavior of a process, and the Chow-Liu tree was used to model the emission probabilities. A Chow-Liu tree or conditional Chow-Liu tree was associated with each value of the hidden variable. The parameters of the model were computed with the EM algorithm; specifically, the tree structure was determined in the maximization step. However, the model was specified only for discrete variables. More recently, asymmetric hidden Markov models (As-HMMs) were proposed in [5], where a local graphical model was associated with each value of the hidden variable, and the graphical model was not restricted to Chow-Liu trees. However, again only models with discrete observable variables were allowed. In [27], this issue was addressed with the asymmetric linear Gaussian HMMs (AsLG-HMMs), where the emission probabilities were modeled as conditional linear Gaussian Bayesian networks. The estimation of the model parameters was performed with the EM algorithm.
In this paper, we extend the asymmetric HMMs for continuous variables of [27] so that, during the learning phase, the model can estimate for each variable the order of the AR process as well as its parameters, depending on the context or value of the hidden variable. Thus, we couple for the first time asymmetric linear Gaussian HMMs with AR processes.

Theoretical Framework
Because the proposed model needs to fit the forward-backward and Viterbi algorithms, we first review these algorithms and the traditional HMM. The parameter and structure learning of the proposed model will be performed via the EM and SEM algorithms; therefore, we also review these algorithms and their properties. Additionally, because the Yule-Walker equations will be used to determine the order of an AR process, they are briefly examined. Finally, Table 2 describes the relevant symbols used in this article.

Hidden Markov Models
An HMM can be seen as a double chain stochastic model, where a chain is observed, namely X 0:T = (X 0 , ..., X T ), where X t = (X t 1 , ..., X t M ) ∈ R M and the other chain is hidden, namely Q 0:T = (Q 0 , ..., Q T ). Here, T is the length of the data. The usual approach for HMMs [1] is to assume that the hidden process has the first-order Markovian property, that is, P (Q t |Q 0:t−1 ) = P (Q t |Q t−1 ). Furthermore, it is assumed that the observable process depends on the hidden process, more specifically P (X t |X 0:t−1 , Q 0:t ) = P (X t |Q t ). Additionally it is assumed that the range R of the hidden variable is finite, i.e., R(Q t ) = {1, 2, ..., N } for t = 0, 1, ..., T . Moreover R(Q 0:T ) = {1, 2, ..., N } T +1 .
All the previous HMM specifications can be summarized with the parameter λ = (A, B, π) ∈ Ω, where Ω denotes the space of all possible parameters; A = [a_ij], i, j = 1, ..., N, is a matrix representing the transition probabilities between hidden states i, j ∈ R(Q_t) over time, i.e., a_ij = P(Q_{t+1} = j | Q_t = i, λ); B = [b_i], i = 1, ..., N, is a vector representing the emission probabilities of the observations given the hidden state, where each b_i(x) = p(x | Q_t = i, λ) is a probability density function; and π = [π_j], j = 1, ..., N, is the initial probability distribution of the hidden states, where π_j = P(Q_0 = j | λ).
Additionally, an HMM can be seen as a probabilistic graphical model [30] (Fig. 1), where the nodes of the graph represent random variables and the arcs represent direct probabilistic dependencies.
Three main tasks can be performed in the context of HMMs. First, compute the likelihood of an observation x_{0:T} given a model λ, i.e., P(x_{0:T} | λ), which can be done using the forward-backward algorithm. Second, compute the most likely sequence of hidden states for a sequence of observations, i.e., find the value of δ_t(i) = max_{q_{0:t−1}} {P(x_{0:t}, q_{0:t−1}, Q_t = i | λ)}, t = 0, ..., T, i = 1, ..., N, which can be solved using the Viterbi algorithm. Third, learn the parameter λ, which is estimated with the EM algorithm. A theoretical tutorial for understanding these algorithms can be found in [1]. We briefly review them below.

Figure 1: An HMM as a probabilistic graphical model

The forward-backward algorithm
To execute the forward-backward algorithm, we first must define the forward and backward variables: α_t(i) = P(Q_t = i, x_{0:t} | λ) and β_t(i) = P(x_{t+1:T} | Q_t = i, λ), respectively, for i = 1, ..., N and t = 0, ..., T. The forward and backward variables can be written recursively:

α_{t+1}(j) = [Σ_{i=1}^N α_t(i) a_ij] b_j(x_{t+1}),   β_t(i) = Σ_{j=1}^N a_ij b_j(x_{t+1}) β_{t+1}(j).

Their initial values are α_0(i) = π_i b_i(x_0) and β_T(i) = 1. The forward variable can help us compute the likelihood of x_{0:T} since P(x_{0:T} | λ) = Σ_{i=1}^N α_T(i).
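The recursions above can be sketched in a few lines. The following is a minimal NumPy implementation of the standard (non-AR) forward-backward pass, assuming the emission densities b_i(x_t) have already been evaluated into a (T+1) × N matrix; names are illustrative:

```python
import numpy as np

def forward_backward(pi, A, b):
    """Standard forward-backward recursions for an HMM.

    pi : (N,) initial distribution
    A  : (N, N) transition matrix, A[i, j] = P(Q_{t+1}=j | Q_t=i)
    b  : (T+1, N) emission densities, b[t, i] = b_i(x_t)
    """
    T1, N = b.shape
    alpha = np.zeros((T1, N))
    beta = np.ones((T1, N))
    alpha[0] = pi * b[0]                      # alpha_0(i) = pi_i b_i(x_0)
    for t in range(T1 - 1):
        alpha[t + 1] = (alpha[t] @ A) * b[t + 1]
    for t in range(T1 - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
    likelihood = alpha[-1].sum()              # P(x_{0:T} | lambda)
    return alpha, beta, likelihood
```

Note that α_t(i) β_t(i) summed over i gives the same likelihood at every t, which is a convenient correctness check.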

The EM algorithm
To learn the parameter λ = (A, B, π) given a dataset x_{0:T} and a previous parameter λ′, the traditional EM approach [31] is used. In the EM algorithm, two steps, called the expectation step (E-step) and the maximization step (M-step), are iterated until convergence is met.
For the E step, we only need to calculate the probabilities γ_t(i) := P(Q_t = i | x_{0:T}, λ′) and ξ_t(i, j) := P(Q_t = i, Q_{t+1} = j | x_{0:T}, λ′), i, j = 1, ..., N, t = 0, ..., T, which are related in the following manner: γ_t(i) = Σ_{j=1}^N ξ_t(i, j) for t < T. For the M step, we must derive the updating formulas for parameter λ. For π_i and a_ij, for the hidden states i, j = 1, ..., N, these are:

π_i = γ_0(i),   a_ij = Σ_{t=0}^{T−1} ξ_t(i, j) / Σ_{t=0}^{T−1} γ_t(i).
The updating formula for parameter B relies on the assumptions made over the transition and emission probabilities. For example, in [1], the updating formulas are calculated when the emission probabilities are assumed to be discrete, a mixture of Gaussians (MoG) or a mixture of AR Gaussians (AR-MoG). If the hypotheses about the transition probabilities or the initial distribution change, the formulas given above are no longer valid.
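As an illustration, the E-step quantities γ and ξ and the π and A updates above can be sketched as follows. This is the standard Baum-Welch step (the adapted version for the proposed model is derived in Section 4), again assuming precomputed emission densities:

```python
import numpy as np

def baum_welch_step(pi, A, b):
    """One E-step plus the pi/A part of the M-step (standard Baum-Welch)."""
    T1, N = b.shape
    alpha = np.zeros((T1, N))
    beta = np.ones((T1, N))
    alpha[0] = pi * b[0]
    for t in range(T1 - 1):
        alpha[t + 1] = (alpha[t] @ A) * b[t + 1]
    for t in range(T1 - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
    lik = alpha[-1].sum()
    # gamma_t(i) = P(Q_t = i | x_{0:T}, lambda')
    gamma = alpha * beta / lik
    # xi_t(i,j) = P(Q_t = i, Q_{t+1} = j | x_{0:T}, lambda')
    xi = alpha[:-1, :, None] * A[None] * (b[1:] * beta[1:])[:, None, :] / lik
    new_pi = gamma[0]                                       # pi_i = gamma_0(i)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    return gamma, xi, new_pi, new_A
```

The consistency relation γ_t(i) = Σ_j ξ_t(i, j) holds by construction, and each row of the updated transition matrix sums to one.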

The SEM algorithm
When we deal with an unknown a priori probabilistic graphical model B, it is desirable to find the structure that maximizes the likelihood of the data. However, since dense networks use many parameters, the likelihood improves, but this can be due to data overfitting. Therefore, penalized likelihood-based scores such as the Bayesian information criterion (BIC) or the Akaike information criterion are used in structure optimization algorithms to prevent this issue. In [32], the structural EM (SEM) algorithm is introduced together with its convergence and optimality properties. SEM finds both the desired model and its parameters. SEM tries to maximize the function Q(B, λ | B′, λ′), where B′ is a previous or prior graphical model:

Q(B, λ | B′, λ′) = Σ_{q_{0:T}} P(q_{0:T} | x_{0:T}, B′, λ′) log P(x_{0:T}, q_{0:T} | B, λ) − (#(B)/2) log T.   (1)

Eq. (1) considers changes in the structure of the probabilistic graphical model and its parameters, and, to prevent overfitting, it is penalized by the number of parameters in the model #(B) and the logarithm of the length of the data T. The SEM algorithm consists of using the EM algorithm with a prior model B′ and a prior parameter λ′ to obtain the parameters λ*; then, using the latent probabilities P(q_{0:T} | x_{0:T}, B′, λ*) and the parameters λ*, a new structure B* is found by solving max_B Q(B, λ* | B′, λ*). Finally, the EM is again applied to the new structure. This process is iterated until convergence is met.

The Yule-Walker equations
The Yule-Walker equations [7] will be a key tool in constructing the proposed model. A linear AR process with k time-lag coefficients for a one-dimensional variable Y_t can be described as:

Y_t = φ_{k1} Y_{t−1} + φ_{k2} Y_{t−2} + ... + φ_{kk} Y_{t−k} + ε_t,   (2)

where ε_t ∼ N(0, σ²) is an error term following a Gaussian distribution with mean zero and variance σ². The correlogram function ρ(k) returns the correlation between Y_t and Y_{t−k}. We define Ȳ_t := Y_t − μ_Y, where μ_Y is the mean of Y_t, and ζ_k := E[Ȳ_t Ȳ_{t−k}], which is the expected value of the product of both shifted variables. The correlogram function is computed as ρ(k) = ζ_k / ζ_0. The partial correlogram function Φ(k) encodes the correlation between the variables Y_t and Y_{t−k} once the effect of the intermediary lags has been removed. To determine these partial correlations, observe that for l ∈ {1, ..., k}, multiplying Eq. (2) by Ȳ_{t−l} gives:

ρ(l) = φ_{k1} ρ(l−1) + φ_{k2} ρ(l−2) + ... + φ_{kk} ρ(l−k).   (3)

In the last line of Eq. (3), we applied the expectation operator and divided by ζ_0. We assumed that E[Ȳ_{t−l} ε_t] = 0 for all t, which implies that Y_t is not correlated with the error term; a plausible hypothesis in real situations. Additionally, ρ(0) = 1. Moreover, notice that if these equations are computed for l = 1, ..., k, we obtain a system of linear equations, which corresponds to the Yule-Walker equations:

[ρ(1)]   [ρ(0)    ρ(1)    ...  ρ(k−1)] [φ_{k1}]
[ρ(2)] = [ρ(1)    ρ(0)    ...  ρ(k−2)] [φ_{k2}]
[ ⋮  ]   [ ⋮                       ⋮  ] [  ⋮   ]
[ρ(k)]   [ρ(k−1)  ρ(k−2)  ...  ρ(0)  ] [φ_{kk}]

The partial correlogram function returns Φ(k) := φ̂_{kk}, the last component of the solution. Note that if we wish to evaluate up to k lags of the partial correlogram function, we must construct and solve k linear systems.
Assume that the sample is white noise. Then the parameter φ̂_{kk} is distributed approximately as N(0, 1/T). With this information, it is possible to perform hypothesis tests to determine the relevancy of each lag parameter. If Φ(p*) is the highest time-lag coefficient that is significantly different from zero, then p* is considered the AR order of the model [7].
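The lag-selection procedure just described can be sketched as follows: for each k we solve the k-th Yule-Walker system built from the sample autocorrelations and test the last coefficient φ̂_kk against the white-noise band ±1.96/√T. Function and variable names are illustrative, not from the paper:

```python
import numpy as np

def pacf_order(y, max_lag=10, z=1.96):
    """Estimate an AR order p* from the partial correlogram.

    Solves the k-th Yule-Walker system for each k = 1..max_lag and keeps
    the last coefficient phi_kk; the highest lag with |phi_kk| > z/sqrt(T)
    is taken as p* (white-noise approximation phi_kk ~ N(0, 1/T)).
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    yc = y - y.mean()
    # sample autocovariances zeta_0..zeta_max_lag and autocorrelations rho
    zeta = np.array([np.dot(yc[:T - k], yc[k:]) / T for k in range(max_lag + 1)])
    rho = zeta / zeta[0]
    threshold = z / np.sqrt(T)
    p_star = 0
    for k in range(1, max_lag + 1):
        # Yule-Walker system: Toeplitz(rho[0..k-1]) phi = rho[1..k]
        R = np.array([[rho[abs(i - j)] for j in range(k)] for i in range(k)])
        phi = np.linalg.solve(R, rho[1:k + 1])
        if abs(phi[-1]) > threshold:       # Phi(k) = phi_kk is significant
            p_star = k
    return p_star
```

As the text notes, evaluating the partial correlogram up to max_lag requires building and solving max_lag linear systems.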
It is worth mentioning that the previous lag estimation is only useful when the observed data are stationary; in other words, when the parameters do not change over time. This particular assumption is violated in the problems for which we want to use HMMs, because we seek to identify changes in the model parameters according to changes in the data distribution. With our proposed model, however, we will see that the order of the AR process within the HMM is able to change dynamically and self-adapt depending on the state of the hidden variable.

Proposed model
The proposed model uses context-specific linear Gaussian Bayesian networks to factorize the emission probabilities. The context is given by the hidden variable. Also, an AR component is added to each variable. The AR order of each variable for each possible context is determined by the SEM algorithm and the Yule-Walker equations when a score (to be specified later on) is optimized. Furthermore, for the proposed model, the likelihood function is modified; therefore, the forward-backward, Viterbi and EM algorithms have to be adapted.

Autoregressive asymmetric linear Gaussian hidden Markov models
Let p*_m be the AR order (time lag) determined by the Yule-Walker equations and the individual relevancy hypothesis tests for each variable X_m^t, m = 1, ..., M. Set p* = max_m p*_m. For our proposed model we work with the following log-likelihood function, which ensures that during the SEM algorithm the updated structures and AR orders are comparable:

LL(λ) = log P(x_{p*:T} | x_{0:p*−1}, λ).   (4)

For this proposed HMM, which is, as explained below, asymmetric autoregressive with linear Gaussian emission probabilities (AR-AsLG-HMM), we modify the emission probabilities {b_i(x^t)}_{i=1}^N such that they can be factorized into linear Gaussian Bayesian networks [33] with an asymmetric component [5], i.e., each variable X_m for each state i ∈ R(Q) is associated with a set of parents Pa_i(X_m) = {U_{im1}, ..., U_{imk_{im}}} ⊂ {X_1, ..., X_M} of size k_{im} (apart from Q), which influences its mean in a linear form. Additionally, the emission probabilities are now conditional probabilities given p_{im} ≤ p* past values of the variables X_m^t, m = 1, ..., M (AR terms), for each state i ∈ R(Q). More specifically, we define:

b_i(x^t) = ∏_{m=1}^M N(x_m^t | β_{im0} + Σ_{k=1}^{k_{im}} β_{imk} u_{imk}^t + Σ_{r=1}^{p_{im}} η_{imr} x_m^{t−r}, σ²_{im}).   (5)

Fig. 2 shows an example of an AR-AsLG-HMM. In this example, when Q_t = 1 (top), X_2^t depends on its parent X_1^t, on its past values X_2^{t−1} and X_2^{t−2}, and on Q_t, but X_1^t depends only on Q_t and X_1^{t−1}. However, when Q_t = 2 (bottom), X_1^t depends only on Q_t and X_2^t is dependent on X_2^{t−1} and Q_t. In terms of the model, this can be expressed as p_{11} = 1, p_{12} = 2 AR terms, k_{11} = 0 and k_{12} = 1 when Q = 1, and p_{21} = 0, p_{22} = 1 AR terms, k_{21} = 0 and k_{22} = 0 when Q = 2. From the model we can see that p* ≥ 2, since p_{12} = 2 and every p_{im} is upper bounded by p*.
Some comments on Eq. (5) follow. The set of parents Pa_i(X_m) of each variable X_m for each state i ∈ R(Q) is related to a context-specific Bayesian network B_i. Furthermore, depending on that hidden state, each variable X_m may have a different AR order, namely p_{im}, which is upper bounded by p*. This model must estimate the new parameters β_{imk}, η_{imr} and σ²_{im} for each state and variable. Additionally, because the first p* observations are used as conditionals in Eq. (4), the π parameter is shifted to predict the initial distribution of the hidden variable Q_{p*}, i.e., π_i = P(Q_{p*} = i | λ). Observe that the complete information probability of an instance x_{p*:T} of X_{p*:T} and an instance q_{p*:T} of Q_{p*:T} can be expressed as:

P(x_{p*:T}, q_{p*:T} | x_{0:p*−1}, λ) = π_{q_{p*}} b_{q_{p*}}(x^{p*}) ∏_{t=p*+1}^T a_{q_{t−1} q_t} b_{q_t}(x^t).
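As an illustration of the factorized emission probability of Eq. (5), the following sketch evaluates the log-density of one observation under a single hidden state. The dictionary-based parameter layout is a hypothetical choice for readability, not the paper's notation:

```python
import numpy as np

def emission_logpdf(x_t, x_lags, state_params):
    """Log of b_i(x^t) for one hidden state i (sketch of Eq. (5)).

    state_params: list over variables m of dicts with (hypothetical) keys
      'beta0'  : intercept beta_{im0}
      'betas'  : coefficients beta_{imk} of the parents Pa_i(X_m)
      'parents': indices of the parent variables inside x_t
      'etas'   : AR coefficients eta_{imr}, r = 1..p_im
      'var'    : variance sigma^2_{im}
    x_lags[r-1][m] holds the lagged value x_m^{t-r}.
    """
    logp = 0.0
    for m, p in enumerate(state_params):
        mean = p['beta0']
        mean += sum(b * x_t[u] for b, u in zip(p['betas'], p['parents']))
        mean += sum(eta * x_lags[r][m] for r, eta in enumerate(p['etas']))
        # Gaussian log-density of x_m^t around its linear mean
        logp += -0.5 * np.log(2 * np.pi * p['var']) \
                - (x_t[m] - mean) ** 2 / (2 * p['var'])
    return logp
```

With no parents and no AR terms this reduces to a product of independent Gaussians, i.e., the naïve case mentioned later for the M-step.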

Feasibility of the EM algorithm in AR-AsLG-HMMs
To perform the parameter learning, the EM algorithm can be applied. However, we must define an auxiliary function Q for the log-likelihood defined in Eq. (4). We propose Q_{p*}(λ | λ′) as the auxiliary function for the EM algorithm, defined as:

Q_{p*}(λ | λ′) = Σ_{q_{p*:T}} P(q_{p*:T} | x_{0:T}, λ′) log P(x_{p*:T}, q_{p*:T} | x_{0:p*−1}, λ).   (6)

Moreover, Q_{p*}(λ | λ′) can be decomposed as:

Q_{p*}(λ | λ′) = Σ_{q_{p*:T}} P(q_{p*:T} | x_{0:T}, λ′) log P(q_{p*:T} | x_{0:T}, λ) + LL(λ).   (7)

If we define H_{p*}(λ | λ′) as the first summand of Eq. (7), i.e., H_{p*}(λ | λ′) = Σ_{q_{p*:T}} P(q_{p*:T} | x_{0:T}, λ′) log P(q_{p*:T} | x_{0:T}, λ), then Q_{p*}(λ | λ′) = H_{p*}(λ | λ′) + LL(λ). We now show that if we apply the EM algorithm with Q_{p*}(λ | λ′), each iteration does not decrease the log-likelihood, as required.
Lemma 1 Let λ^(s) be the parameters at iteration s of the EM and λ^(s+1) be the resulting parameters after the next iteration of the EM. We have that Q_{p*}(λ^(s+1) | λ^(s)) ≥ Q_{p*}(λ^(s) | λ^(s)).

Lemma 2 Given two arbitrary models with respective parameters λ and λ′, we have that H_{p*}(λ | λ′) ≤ H_{p*}(λ′ | λ′), and the equality holds when P(q_{p*:T} | x_{0:T}, λ) = P(q_{p*:T} | x_{0:T}, λ′).
Theorem 1 Let λ^(s) be the parameters at an iteration s of the EM and λ^(s+1) be the resulting parameters after the next iteration of the EM. We have that LL(λ^(s+1)) ≥ LL(λ^(s)). In other words, the log-likelihood of the model cannot worsen after an EM iteration.

The forward-backward algorithm in AR-AsLG-HMMs
As the likelihood function of Eq. (4) and the emission probabilities given by Eq. (5) have changed, the forward-backward algorithm must be adapted. In the E step, we compute the probabilities γ_t(i) = P(Q_t = i | x_{0:T}, λ) for t = p*, ..., T and i = 1, ..., N as the initial point to fit the forward-backward algorithm. Note that γ_t(i) can be expressed as:

γ_t(i) = P(Q_t = i, x_{p*:t} | x_{0:p*−1}, λ) P(x_{t+1:T} | Q_t = i, x_{0:t}, λ) / P(x_{p*:T} | x_{0:p*−1}, λ).   (8)

From Eq. (8), the forward variable is α_t^{p*}(i) := P(Q_t = i, x_{p*:t} | x_{0:p*−1}, λ) and the backward variable is β_t^{p*}(i) := P(x_{t+1:T} | Q_t = i, x_{0:t}, λ). Observe that these quantities only make sense when t ≥ p*. The next lemma shows that we can easily adapt the forward-backward algorithm to compute the α^{p*} and β^{p*} variables of an AR-AsLG-HMM.
Lemma 3 α_t^{p*}(i) and β_t^{p*}(i) can be computed recursively as:

α_{p*}^{p*}(i) = π_i b_i^{p*}(x^{p*}),   α_{t+1}^{p*}(j) = [Σ_{i=1}^N α_t^{p*}(i) a_ij] b_j^{t+1}(x^{t+1}), for t = p*, ..., T−1,

β_T^{p*}(i) = 1,   β_t^{p*}(i) = Σ_{j=1}^N a_ij b_j^{t+1}(x^{t+1}) β_{t+1}^{p*}(j), for t = T−1, ..., p*.

Parameter learning in AR-AsLG-HMMs
To execute the EM algorithm, we must iterate the E step and the M step. For the E step, we can use the adapted forward-backward algorithm of Section 4.3 to compute γ_t(i) and ξ_t(i, j) for i, j = 1, ..., N and t = p*, ..., T:

γ_t(i) = α_t^{p*}(i) β_t^{p*}(i) / Σ_{j=1}^N α_t^{p*}(j) β_t^{p*}(j),   ξ_t(i, j) = α_t^{p*}(i) a_ij b_j^{t+1}(x^{t+1}) β_{t+1}^{p*}(j) / P(x_{p*:T} | x_{0:p*−1}, λ).

Computing these quantities is enough for the E step because Q_{p*}(λ | λ′) can be expressed in terms of γ_t(i) and ξ_t(i, j). Now, for the M step, we must find the updating formulas for the parameters (A, B, π), where B includes the parameters η_{imr}, β_{imk} and σ²_{im} of the Gaussians. In the following theorem, we provide the updating formulas for the proposed model.

Theorem 2
The M-step for an AR-AsLG-HMM model can be performed using the following updating formulas. The parameter π is updated as π_i = γ_{p*}(i), and the parameter A = {a_ij}_{i,j=1}^N is updated as:

a_ij = Σ_{t=p*}^{T−1} ξ_t(i, j) / Σ_{t=p*}^{T−1} γ_t(i).

The coefficients β_{im0}, β_{im1}, ..., β_{imk_{im}}, η_{im1}, ..., η_{imp_{im}} can be updated jointly by solving the linear system of Eq. (13).

[Fig. 3 (pseudocode of the adapted EM algorithm): Require: a prior parameter λ^(0) and a dataset X_{0:T}. Ensure: a learned parameter λ*. For s = 0, 1, ... until convergence in likelihood is met, use Eq. (22) and Eq. (23) to perform the adapted forward-backward algorithm with λ^(s), then apply the updating formulas above.]
Eq. (13) forms a linear system of k_{im} + p_{im} + 1 unknowns with k_{im} + p_{im} + 1 equations. If the resulting context-specific Bayesian model for every hidden state is a naïve Bayesian network and p_{im} = 0 for i = 1, ..., N and m = 1, ..., M, then we only need to update the parameters {β_{im0}}_{i=1,m=1}^{N,M}. The updating formula is:

β_{im0} = Σ_{t=p*}^T γ_t(i) x_m^t / Σ_{t=p*}^T γ_t(i).

Otherwise, we must solve the linear system, which can be done using exact or iterative methods. If we use, for example, the Gauss-Jordan reduction algorithm, an additional computational cost of O((k_{im} + p_{im} + 1)^3) must be assumed. Therefore, simpler structures are recommended in order not to slow down the learning process. This requirement is taken into account during the SEM algorithm, as mentioned in Section 3.5.
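The joint update of the β and η coefficients can be viewed as a γ-weighted least-squares problem: stacking the regressors (a constant, the parents and the AR lags) into a design matrix Z and letting Γ = diag(γ_t(i)), the normal equations Zᵀ Γ Z w = Zᵀ Γ y give a system of k_im + p_im + 1 equations. The sketch below uses this formulation; the exact matrix notation of Eq. (13) is not reproduced here:

```python
import numpy as np

def update_linear_gaussian(gamma_i, y, Z):
    """Solve the (k_im + p_im + 1)-dimensional M-step system for one (i, m).

    gamma_i : (T',) responsibilities gamma_t(i)
    y       : (T',) observations x_m^t
    Z       : (T', k_im + p_im + 1) design matrix [1, parents, AR lags]
    Returns the coefficients (beta_{im0}, beta_{imk}, eta_{imr}) and the
    updated variance sigma^2_{im}.
    """
    W = gamma_i[:, None]
    # weighted normal equations: (Z^T Gamma Z) w = Z^T Gamma y
    coef = np.linalg.solve(Z.T @ (W * Z), Z.T @ (gamma_i * y))
    resid = y - Z @ coef
    var = np.dot(gamma_i, resid ** 2) / gamma_i.sum()
    return coef, var
```

When Z is a single constant column (the naïve, no-AR case), the solution collapses to the γ-weighted mean, matching the simple update for β_im0 above.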
A pseudocode of the adapted EM algorithm can be found in Fig. 3.

The Viterbi algorithm in AR-AsLG-HMMs
In the following lemma we show that the traditional Viterbi algorithm can be adapted to determine the most probable sequence of hidden states in AR-AsLG-HMMs.
Lemma 4 If δ_t^{p*}(i) := max_{q_{p*:t−1}} {P(x_{p*:t}, q_{p*:t−1}, Q_t = i | x_{0:p*−1}, λ)} represents the most probable sequence of hidden states up to time t − 1 for state i at time t, then δ_t^{p*}(i) can be computed recursively as:

δ_{t+1}^{p*}(j) = max_{i=1,...,N} {δ_t^{p*}(i) a_ij} b_j^{t+1}(x^{t+1}).
The Viterbi algorithm is initialized with δ p * p * (i) = π i b p * i (x p * ).
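For reference, the Viterbi recursion (shown here in its standard non-AR form and computed in log-space for numerical stability) can be sketched as:

```python
import numpy as np

def viterbi(pi, A, b):
    """Most probable hidden-state sequence for an HMM (log-space Viterbi).

    pi : (N,) initial distribution; A : (N, N) transitions;
    b  : (T+1, N) emission densities b[t, i] = b_i(x_t).
    """
    T1, N = b.shape
    log_A = np.log(A)
    delta = np.zeros((T1, N))
    psi = np.zeros((T1, N), dtype=int)        # backpointers
    delta[0] = np.log(pi) + np.log(b[0])      # delta_0(i) = log(pi_i b_i(x_0))
    for t in range(1, T1):
        scores = delta[t - 1][:, None] + log_A    # (N, N): from i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(b[t])
    path = np.zeros(T1, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T1 - 2, -1, -1):           # backtrack
        path[t] = psi[t + 1][path[t + 1]]
    return path
```

For the AR-AsLG-HMM, the same recursion applies with the shifted initialization δ_{p*}^{p*}(i) and the conditional emission densities b_i^t(x^t).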

The SEM algorithm in AR-AsLG-HMMs
Regarding the structural optimization process, the SEM algorithm for AR-AsLG-HMM must also be modified. The proposed auxiliary function is:

Q_{p*}(B, λ | B′, λ′) = Q_{p*}(λ | λ′) − (#(B)/2) log T.   (15)

The steps for the adapted SEM algorithm are the same as in the general SEM. However, we must consider that, given a time slice t, the algorithm must not only look for the best instantaneous structure at time t, i.e., the best structure over the variables (X_1^t, ..., X_M^t), but must also look for the best transition structure at time t, i.e., the relationships between the variables (X_1^t, ..., X_M^t) and their AR versions (X_1^{t−1}, ..., X_M^{t−p*}), which implies that the search space dimension increases. More specifically, we have to search not only in the space of directed acyclic graphs (DAGs) for the best instantaneous structures, but also in the space S_{p*} = {0, 1, ..., p*}^{N×M} for the best transition structure. A component p_{im} of an element p ∈ S_{p*} indicates the number of lags for variable X_m in the hidden state i ∈ R(Q). For instance, if p_{im} = 2, X_m^t has incoming arcs from the variables X_m^{t−2} and X_m^{t−1} when Q_t = i. A pseudocode of the adapted SEM is given in Fig. 4.
It is pertinent to mention that, in the step of the SEM algorithm that finds B^(s) = arg max_B Q_{p*}(B, λ^(s) | B^(s−1), λ^(s)), it is not necessary to use Eq. (15) in full, since the initial distribution and the transition matrix remain unchanged. We can take advantage of the linearity of Eq. (15) to compare structures, i.e., if a dependency of X_m (an AR or parent parameter) has been added or deleted at hidden state i, it is reasonable to use the local score of Eq. (16). If changes have been made to many variables in many hidden states, it is better to use the global score of Eq. (17). To perform the structural optimization step, we must search in the space of structures. In this article we use a heuristic forward greedy algorithm to perform the structure optimization. In this approach, we initialize all the structures in a naïve form with no AR parameters. During the optimization, we visit each variable for each hidden state and add AR or parent dependencies as long as Eq. (16) improves. Its pseudocode can be seen in Fig. 5.
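The forward greedy search can be sketched generically: starting from an empty (naïve, no-AR) structure, we repeatedly add the single AR or parent dependency that improves a penalized score until no addition helps. The `score` callable is a hypothetical stand-in for the local score of Eq. (16):

```python
def forward_greedy(initial, candidates, score):
    """Forward greedy structure search.

    initial    : iterable of dependencies already in the structure
    candidates : iterable of candidate dependencies (AR lags or parent arcs)
    score      : callable mapping a set of dependencies to a penalized score
                 (higher is better); stands in for Eq. (16)
    Returns the selected set of dependencies and its score.
    """
    current = set(initial)
    best = score(current)
    improved = True
    while improved:
        improved = False
        for c in candidates:
            if c in current:
                continue
            s = score(current | {c})
            if s > best:                 # keep the addition only if it helps
                best, current = s, current | {c}
                improved = True
    return current, best
```

Because the score is penalized, the search stops as soon as the likelihood gain of an extra dependency no longer compensates for its parameter cost, which favors the simpler structures recommended for the M-step.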

Hidden states labelling
In practice, when HMMs are used, categorical labels are given to the hidden states for interpretation purposes. However, only after training the model are the model parameters manually checked to determine which categorical label corresponds to each trained hidden state. Here, we propose an automatic numerical labelling for trained models, where a numerical function is used to label a trained hidden state. Let g : R(Q) → R be a function that maps each hidden state into a real number depending on the model parameters. This function g not only helps us determine whether a change in hidden states occurs but also the magnitude of the change. For example, if deviations from some desirable values {κ_1, ..., κ_M} of X imply changes in state, the g functions described in Eq. (18) and Eq. (19) can be used to help in hidden-state labelling in AR-AsLG-HMMs. Observe that in Eq. (20), the value of ν_{im} depends on the ν value of the parents of variable X_m in the context-specific graph related to the i state, so Eq. (18) takes the whole context-specific structure into account, whereas Eq. (19) can be used when a high deviation of a single variable is enough to determine the dynamical process. For example, consider a patient with a chronic disease with many sensors that measure different biological variables. For each variable there is a desirable value that determines good health. If only one variable drifts from the desirable value, the health of the patient can be in danger. In conclusion, the experiment and the context of the problem may require a different g function to describe the hidden states.

Experiments
In this section, we compare our model (AR-AsLG-HMM) with AsLG-HMMs, LMSAR, AR-MoG-HMMs, MoG-HMMs, BMMs and a simple AR-AsLG-HMM with naïve Bayes context-specific Bayesian networks that we call naïve-HMM (this kind of model has been used in [34]). LMSAR [8] and AR-MoG-HMM [10] were defined only for one variable; in these experiments, we therefore assume that for these models every variable is independent. In particular, LMSAR is a special case of an AR-AsLG-HMM, where only AR parameters are used in the mean, but their number does not change with the hidden state and the variance of the model does not depend on the hidden state. In the case of AR-MoG-HMM, the model assumes that the variances are unitary and do not depend on the hidden state; despite this, it cannot be expressed directly as an AR-AsLG-HMM. The aim is to show the capabilities of our model to change the number of AR parameters and the context-specific Bayesian networks when they are needed.
Experiments with synthetic data are performed. The data are generated such that the AR process changes over time. Two dynamic processes are used, one with three variables and another with six. The models are learned using only one time series, where three possible hidden states are present and appear in time blocks. We aim to determine, for new data, the most likely sequence of hidden states. The sequence of hidden states tells us the current probabilistic distribution of the data and therefore which probabilistic relationships are relevant. The number of parameters and the BIC score also play an important role in identifying which model is better.
Air quality data and real bearing degradation data are also used. The p* values are computed using the Yule-Walker equations. These values are also used for AR-MoG-HMM, BMM and LMSAR to determine the maximum lag during the learning task. For the mixture models, one, two and three mixture components were tried; the models with two mixture components had the highest BIC and log-likelihood, so two mixture components were used.
For both synthetic and real data, the models are initialized with a uniform transition matrix A, specifically a_ij = 1/N for i, j = 1, ..., N; the same applies to the initial distribution π, specifically π_i = 1/N for i = 1, ..., N. The initialization of the variances σ²_{im} is chosen to avoid infinite or NaN values in the forward-backward algorithm. For the AR-MoG-HMM, the AR coefficients are initialized as zero, the distribution of the mixture coefficients is uniform, the mean coefficients for the mixture models are initialized using a k-means algorithm, and the covariance matrices are initialized with the covariance matrix of the data.

Synthetic data
We consider two scenarios with three known hidden states which follow AR-AsLG emission probabilities. We generate blocks of data for each hidden state and mix these blocks in no particular order to create a signal, training every model with this unique signal. We try to simulate data as in real-life applications, where hidden states do not have a particular order of appearance, i.e., any transition between hidden states is possible. We evaluate the learned models with four different types of sequences of hidden states. Each of these four sequences is generated fifty times to be evaluated in the testing phase. From the fifty evaluations we report the mean BIC, the mean log-likelihood (LL), the standard deviation of the log-likelihoods (Std) and the number of parameters of the model (#).
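As an illustration of this setup, a univariate sketch of block-wise synthetic generation is shown below (hypothetical helper names; the actual experiments use three or six variables with state-specific context-specific Bayesian networks):

```python
import numpy as np

def sample_state_block(rng, length, phi, mu, sigma):
    """Sample a block of AR(p) Gaussian data for one hidden state."""
    p = len(phi)
    x = np.zeros(length + p)  # zero warm-up values for the first lags
    for t in range(p, length + p):
        # phi[0] multiplies the most recent lag x[t-1]
        x[t] = mu + np.dot(phi, x[t - p:t][::-1] - mu) + sigma * rng.standard_normal()
    return x[p:]

def make_training_signal(rng, state_params, block_order, block_len=100):
    """Concatenate per-state blocks in an arbitrary order, as in the synthetic setup."""
    blocks, labels = [], []
    for s in block_order:
        phi, mu, sigma = state_params[s]
        blocks.append(sample_state_block(rng, block_len, phi, mu, sigma))
        labels.extend([s] * block_len)
    return np.concatenate(blocks), np.array(labels)
```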
The first scenario uses three variables and the second uses six. We aim to compare the performance of the models when different numbers of variables are used. In both scenarios we have a hidden state with no structural complexity (no AR and no parent relationships in f t im), a second one with some structural complexity and a last one with a complex structure (many AR and parent relationships in f t im). We also set the parameters in such a way that the more complex the context-specific Bayesian network is, the greater the amplitude of the g 1 function (Eq. (18)). The parameters used for the hidden states can be found in the appendix. The g 1 function used for all the experiments has v m = 1 and κ m = 0 for m = 1, ..., M. The sequence of hidden states used to construct the training signal for both scenarios can be seen in Fig. 6. The four sequences of hidden states for testing are illustrated in Fig. 7. In Table 3, we observe the results for scenario 1 for all the sequences. AR-AsLG-HMM obtained the second best results in LL and BIC score, followed by AR-MoG-HMM. The MoG-HMM, AsLG-HMM, LMSAR and naïve-HMM obtained fair results, whereas BMM obtained the worst results for all the evaluated sequences. In terms of BIC score, the penalization for the mixture models was higher since a greater number of parameters was needed. Remarkably, MoG-HMM obtained better LL than AsLG-HMM in sequences 1 and 4; however, due to the penalization, AsLG-HMM obtained slightly better mean BIC scores. In terms of standard deviation, BMM obtained the best results in spite of having the worst LL and BIC scores. AR-AsLG-HMM had the third best Std after BMM and AR-MoG-HMM, while LMSAR obtained the worst results in terms of standard deviation. In terms of number of parameters, naïve-HMM had the best results since it has the simplest structure.
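The labelling function can be sketched as follows; the exact form of Eq. (18) is not reproduced in this excerpt, so this assumes g 1 is a weighted sum of the shifted state values, which is consistent with v m = 1 and κ m = 0 reducing it to a plain sum:

```python
import numpy as np

def g1(values_i, v, kappa):
    """Assumed form of the labelling function g1 (Eq. (18)):
    each variable m is shifted by kappa_m, scaled by v_m, and summed."""
    return float(np.sum(np.asarray(v) * (np.asarray(values_i) - np.asarray(kappa))))
```

With v m = 1 and κ m = 0 (the synthetic experiments), g 1 simply adds the state values, so a structurally more complex state is given a larger amplitude.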
In general, asymmetric models used fewer parameters than mixture models, which used up to three times the number of parameters of naïve-HMM. Table 4 shows the results for scenario 2 for all the sequences. In this case, AR-AsLG-HMM obtained the best results in LL and BIC score. The naïve-HMM, AsLG-HMM and AR-MoG-HMM obtained fair results, while the remaining models (MoG-HMM, LMSAR and BMM) obtained poor results in LL and BIC score. In particular, BMM could not iterate its EM algorithm after the graph optimization. In terms of standard deviation, AR-MoG-HMM obtained the best Std followed by AR-AsLG-HMM, whereas MoG-HMM obtained the worst variance results. In terms of number of parameters, again naïve-HMM uses the fewest due to its simple structure, and again the mixture models use up to three times the number of parameters used by naïve-HMM. Fig. 9 shows the Viterbi paths obtained for sequences 1 and 3 in scenario 1. We are interested in these paths since they tell us the changes in the dynamics of the data at every time instance (smoothing). When new instances arrive, we can use the Viterbi path to determine the hidden state to which the new instance belongs (filtering) and make decisions or analyses of the observed process. The Viterbi paths obtained by the asymmetric models, naïve-HMM and MoG-HMM are desirable, since they correctly follow the pattern of the hidden state sequences shown in Fig. 7 (a) and (c). The remaining models did not produce well-defined readings of the evolution of hidden states proposed in Fig. 7. In particular, recall from Table 3 that AR-MoG-HMM obtained the best results in BIC score, likelihood and standard deviation; however, its predictions are not as expected, which can be caused by the AR-MoG-HMM hypothesis of constant unitary variance among the hidden states. Fig. 10 and Fig. 9 show the predicted Viterbi paths for sequences 2 and 4 in scenario 2.
In this case we observe that only AR-AsLG-HMM and MoG-HMM obtained good Viterbi paths that follow the pattern described by Fig. 7 (b) and (d). AsLG-HMM and naïve-HMM predicted the transitions between hidden states correctly but are wrong in magnitude. The remaining models obtained poor prediction results.

Table 5: Scenario 1 and 2 learning times (s) per model.

Table 5 shows the times required to learn the models in each scenario. The number of iterations of the EM algorithm for LMSAR had to be set to one, since further iterations caused nan values; this explains its short training times. In the case of BMM, the EM algorithm after the structure optimization was iterated once before nan values appeared in the model parameters. In scenario 2, AR-MoG-HMM and MoG-HMM iterated the EM algorithm eight times before nan values appeared in the emission distribution parameters, which caused short training times in this scenario for both models. The remaining models and cases could iterate the EM or SEM algorithm until convergence. For both scenarios, the learning times for BMM are the longest. Structure learning is time-consuming, since many mutual information terms and comparisons must be computed, and this process clearly scales with the number of variables. On the other hand, the shortest times with convergence were obtained by naïve-HMM due to the simple structures that the model uses. The proposed model, using the forward greedy algorithm, took a fair time to train considering the scores obtained and that it was iterated until convergence of the EM and SEM algorithms.
From these experiments we observe that AR-AsLG-HMM is capable of being simple enough to explain linear Gaussian autoregressive processes and prevent overfitting, while being complex enough to detect the relevant parameters that drive the hidden states. AsLG-HMM has this property as well, but, as can be seen, the AR variables are pertinent. In terms of variance of the predictions, AR-AsLG-HMM also had solid results, which implies it is stable in the case of synthetic data.

Air quality in Beijing
Here we use a dataset from the UCI Machine Learning Repository named "Beijing Multi-Site Air-Quality Data Data Set" [35]. The dataset consists of measurements of air quality from different monitoring stations in Beijing. We take in particular the measurements from the file "PRSA Data Aotizhongxin", which corresponds to the Aotizhongxin monitoring station. This dataset has hourly measurements from March 2013 until February 2017. It contains missing values (3.37% of the dataset for the selected variables), which are filled using the mean of the values of the five previous hours. The hidden variable in this problem can be understood as the air quality. For this study we use the following variables: sulfur dioxide (SO 2 in µg/m 3 ), nitrogen dioxide (NO 2 in µg/m 3 ), carbon monoxide (CO in µg/m 3 ), ozone (O 3 in µg/m 3 ), coarse particulate matter (PM 10 in µg/m 3 ) and fine particulate matter (PM 2.5 in µg/m 3 ). Bayesian networks and HMMs have been used before to determine air quality [36,37,38,39], showing advantages in the generation of information and the discovery of relationships between variables.
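The imputation rule for the missing values can be sketched as below (a minimal sketch; the fallback to 0.0 when no previous values exist is an assumption, not stated in the text):

```python
import numpy as np

def fill_missing(series):
    """Impute NaNs with the mean of the (already filled) values from the
    five previous hours, as described for the Beijing dataset."""
    x = np.asarray(series, dtype=float).copy()
    for t in range(len(x)):
        if np.isnan(x[t]):
            window = x[max(0, t - 5):t]
            window = window[~np.isnan(window)]
            # assumed fallback when no previous values are available
            x[t] = window.mean() if len(window) else 0.0
    return x
```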
The Chinese air quality limits for hourly, daily and monthly measurements are set in the law GB 3095-2012. These limits are used to model the g function for this problem. In particular, κ = {500, 200, 10000, 200, 150, 75} and v = {1/500, 1/200, 1/10000, 1/200, 1/150, 1/75}. The g function in this case uses Eq. (19). If g 2 (i) > 0, one or more variables are above the permissible limit and the air quality is poor; large negative values of g 2 indicate good air quality. The aim is to learn models to determine the air quality when new observations arrive. We use the year 2013 to train the models. The number of hidden states is set to two, indicating good and bad air quality. For each model we individually predict the air quality for the years 2014, 2015 and 2016. We present the mean likelihood over the tested years, the mean BIC, the standard deviation of the likelihoods and the number of parameters of each model. Table 6 shows the scores obtained for the testing years. We observe that AR-AsLG-HMM and AR-MoG-HMM obtained the best results in the LL and BIC scores. In terms of stability, AR-AsLG-HMM and AR-MoG-HMM also have the lowest standard deviation, followed by LMSAR and MoG-HMM. In terms of number of parameters, naïve-HMM and AsLG-HMM have the fewest; AR-AsLG-HMM and LMSAR obtained a fair number of parameters; and the mixture models, as expected, had to use more parameters (up to four times the number used by naïve-HMM). In the case of BMM, the learning of the structure could not be performed since nan values appeared in the re-training phase. Fig. 12 shows two learned graphs. In the context-specific Bayesian networks, AR variables are denoted as X m AR r, where r is the number of lags for the variable X m . In (a) we show a graph when the air quality is good and in (b) when it is bad.
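Under the stated κ and v, a sketch of a g 2 -style indicator follows; Eq. (19) is not reproduced in this excerpt, so the maximum-exceedance form below is an assumption, chosen so that g 2 (i) > 0 exactly when at least one variable is above its legal limit:

```python
import numpy as np

# Limits from GB 3095-2012 as listed in the text, for
# SO2, NO2, CO, O3, PM10, PM2.5 (in ug/m^3):
KAPPA = np.array([500.0, 200.0, 10000.0, 200.0, 150.0, 75.0])
V = 1.0 / KAPPA

def g2(values_i):
    """Assumed g2 (Eq. (19)): largest scaled exceedance over variables.
    Positive iff at least one value is above its limit."""
    return float(np.max(V * (np.asarray(values_i) - KAPPA)))
```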
In both graphs we can observe some interesting relationships similar to the ones found in [36]. For example, in (a) we observe that CO depends on PM 2.5 and PM 10 , and SO 2 and NO 2 are related to CO. These relationships come from the combustion of gas and charcoal. Also, NO 2 is related to O 3 , which indicates the photochemistry of NO 2 in the production of O 3 . In (b) we see that these relationships remain; however, the dependences on previous values of each variable change, which tells us the level of impact of the past on the pollution levels. Fig. 13 shows the predicted air quality for the first two weeks of 2016 for each model using the Viterbi algorithm. Real readings are shown in (h), where we express 1 when any of the variables surpasses the law limits and -1 when all the variables are under the law limits. Clearly, (h) tells us neither the severity of the pollution nor the closeness to an unlawful pollution level. We observe that AR-AsLG-HMM is capable of detecting the periods of high and low pollution, but in a noisy way, while AsLG-HMM, MoG-HMM, naïve-HMM and BMM make clearer predictions. On the other hand, AR-MoG-HMM and LMSAR obtained poor predictions in spite of their good scores in mean BIC and Std. Recall that in both models the variance is assumed to be the same for all the hidden states, which is not true for this dataset.

Ball-bearings degradation
Ball-bearings are used inside many mechanical tools, such as drills and rotors, where they represent critical components. The failure or degradation of these components translates into losses of time, money and assets for industries. Monitoring ball-bearings is therefore crucial, and the use of HMMs can give insight into the bearing degradation process and help in the development of maintenance policies [40].
The benchmark used to validate the proposed model in this section comes from ball-bearing vibrational data [41]. The run-to-failure tool machine setup is shown in Fig. 14. Four ball-bearings are tested in the setup. The signals are obtained with vibrational sensors. The desired vibrations are submerged in noise; therefore, filtering techniques are required. In this study, the signals are filtered as in [42], where spectral kurtosis algorithms are used. From the filtered signal, we calculate its spectrum with the Fourier transform and the ball-bearing fundamental frequencies, namely, the ball pass frequency outer (BPFO), related to the ball-bearing's outer race; the ball pass frequency inner (BPFI), related to its inner race; the ball spin frequency (BSF), related to its rollers; and the fundamental train frequency (FTF), related to its cage.
The training signal consists of 2156 records, while the testing signal has 6324 records. We use the fundamental frequencies as variables of the models, hence four variables are used. We must recall that the dataset comes from a coupled mechanical system; therefore, a failure in any part of the system generates vibrations that are transmitted across the whole system. For all the ball-bearings, the magnitude of their frequencies grows abruptly at the end of the measurements, indicating a failing phase somewhere in the mechanical setup. In the training dataset, Bearing 3 fails due to its inner race and Bearing 4 due to its rollers. In the testing dataset, Bearing 3 fails due to its outer race. The hidden variable in this context can be understood as the ball-bearing health state. The number of hidden states was set depending on the scores obtained by naïve-HMM on the training data; we observed that seven hidden states optimized the naïve-HMM scores.
For this problem, we constantly ran into underflow problems due to the small amplitudes of some frequencies; therefore, the whole dataset was multiplied by 1000. The g function uses v = {1/1000, 1/1000, 1/1000, 1/1000} and κ = {0, 0, 0, 0}; hence, using Eq. (18), g 1 adds the magnitudes of all the relevant frequencies. If there is degradation in any of the ball-bearing components, the relevant frequencies will have greater magnitudes, and this will be perceived by g 1 . Therefore, predicting the hidden state in the testing data can be seen as an approximation of the degradation of the ball-bearing. The idea is to train models such that they can determine the degradation state of forthcoming ball-bearings, which can be accomplished with the Viterbi paths. For this dataset we are particularly interested in Bearing 3, since it fails in both the training and the testing dataset. Nevertheless, we train the models for all the bearings and show the scores obtained on the testing dataset. Additionally, we show the Viterbi paths of Bearing 3 to observe the respective degradation.
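The rescaling and the resulting g 1 can be sketched as follows (hypothetical helper name `degradation_label`; with v m = 1/1000 and the data multiplied by 1000, g 1 recovers the sum of the original frequency amplitudes):

```python
import numpy as np

SCALE = 1000.0                      # the whole dataset is multiplied by 1000
V = np.full(4, 1.0 / 1000.0)        # v_m = 1/1000 for the four frequencies
KAPPA = np.zeros(4)                 # kappa_m = 0

def degradation_label(freq_amplitudes):
    """g1 for the bearing data: after scaling, v_m = 1/1000 and kappa_m = 0
    make g1 the plain sum of the original frequency magnitudes."""
    scaled = np.asarray(freq_amplitudes) * SCALE  # underflow mitigation
    return float(np.sum(V * (scaled - KAPPA)))
```

High values of this label indicate high amplitudes in the fundamental frequencies, i.e., evidence of degradation.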
During training, the iterations of LMSAR and BMM had to be tuned to prevent nan values in the parameters. Additionally, for BMM, no structural optimization was performed, since it was unfeasible in time. Table 7 shows the results obtained by the models for each ball-bearing. We observe that for all the ball-bearings, the best results were obtained by AR-AsLG-HMM and LMSAR, whereas the worst results were obtained by BMM. In B3, AR-AsLG-HMM, AsLG-HMM and naïve-HMM obtained results far from the ones obtained by AR-MoG-HMM and LMSAR. Nevertheless, as will be observed in the Viterbi paths, the predictions obtained by those models are poor, and again the hypothesis of constant variance among hidden states may explain the good fit but poor prediction capabilities. In terms of number of parameters, naïve-HMM, AsLG-HMM and AR-AsLG-HMM used the fewest parameters for all the ball-bearings; the remaining models used two or three times the number of parameters used by naïve-HMM. Fig. 15 shows the paths for the testing B3. All the models exhibit low g 1 (i) values during most of the life of the ball-bearing and a jump to higher values of g 1 (i) at the end of its life, as expected. High values of g 1 (i) represent high amplitudes in the fundamental frequencies of the ball-bearing and therefore show evidence of degradation and a near failure. As we can see, the proposed model is capable of determining this behaviour from incoming ball-bearing data. On the other hand, in spite of the high BIC scores obtained by LMSAR and AR-MoG-HMM, their predictions are noisy and not as clear as the Viterbi path predicted by AR-AsLG-HMM. A relevant part of this model is the generation of context-specific Bayesian networks. In Fig. 16 we show two context-specific Bayesian networks, where (a) represents a good health state.
In this graph we observe that the cage frequency (FTF) determines the remaining variables: knowing the behaviour of the cage determines the behaviour of the ball-bearing rollers and races. (b) represents a bad health state and shows a more complex structure. In this context-specific Bayesian network, AR values are relevant and are taken into consideration. We again see the dominance of the cage frequency (FTF) in determining the dynamical process of the model, but some frequencies are not directly dependent on this variable; e.g., the outer race frequency (BPFO) depends on the inner race frequency (BPFI) and the roller frequency (BSF), and these depend directly on the cage frequency (FTF). In summary, these graphs are capable of explaining the ball-bearing dynamical process depending on its health.

Conclusions
In this paper, we extended asymmetric hidden Markov models to determine and learn the optimal number of time lags depending on the value of the hidden state via the SEM algorithm. We also introduced a greedy forward heuristic to find the best structure for the model, and theoretically adapted the forward-backward, Viterbi and EM algorithms to the proposed log-likelihood function. Additionally, we showed that every iteration of the EM algorithm improves the log-likelihood of the model. We introduced a numerical labelling function, which can be helpful in determining the nature of the learned hidden Markov models and in identifying changes in the magnitude of the hidden variable.
We used synthetic and real data to validate the proposed model, comparing it with many other models. In general, AR-AsLG-HMM obtained fair results for both synthetic and real data. We also showed the use of the learned context-specific Bayesian networks to extract information about the nature of the problem being modelled. Additionally, the number of parameters learned by AR-AsLG-HMM was usually at an intermediate point between the simplest model (naïve-HMM) and the mixture models, which helps prevent data overfitting and interpret the nature of the model.
In future work, we would like to combine the idea of asymmetric autoregressive models with other types of HMMs, such as HSMMs or HHMMs. Finally, we want to apply the proposed model to online environments and observe its behaviour to detect and treat concept drifts.

Table 9: Scenario 2 parameters.

Proofs

Lemma 1: From the maximization step, we have Q p* (λ (s+1) |λ (s) ) = max λ Q p* (λ|λ (s) ) ≥ Q p* (λ (s) |λ (s) ).
To prove (b), we know from (a) that the sequence {LL(λ (s) )} s∈Z + does not decrease and is also upper bounded by zero. Therefore, {LL(λ (s) )} s converges to a certain real finite number LL * with LL * ≤ 0.

Lemma 3:
For the forward variable, we have that for t = p* + 1, ..., T and i = 1, ..., N:

α t p* (i) = P(x p*:t , Q t = i|x 0:p*−1 , λ)
= Σ j=1..N P(x t |Q t = i, x t−p*:t−1 , λ) P(Q t = i, Q t−1 = j, x p*:t−1 |x 0:p*−1 , λ)
= Σ j=1..N b p* i (x t ) P(Q t = i|Q t−1 = j, λ) P(Q t−1 = j, x p*:t−1 |x 0:p*−1 , λ)
= b p* i (x t ) Σ j=1..N a ji α t−1 p* (j)    (22)

In the second equality of Eq. (22), we used the fact that X t is d-separated from Q t−1 given Q t and X t−p*:t−1 . In the third equality, we used the fact that Q t is d-separated from X 0:t−1 given Q t−1 . D-separation implies conditional independence in Bayesian networks. As we can see, the forward variable can be computed iteratively as in the traditional HMM. Additionally, the forward variable is initialized with α p* p* (i) = π i b p* i (x p* ), for i = 1, ..., N.
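In log-space, the recursion above can be sketched as follows (a sketch under the assumption that the emission log-probabilities log b p* i (x t ), which depend on the AR means f t im, are precomputed):

```python
import numpy as np

def forward(pi, A, emission_ll, p_star):
    """Log-space forward pass for the AR-adapted recursion:
    emission_ll[t, i] = log b_i^{p*}(x_t); rows before t = p_star are unused."""
    T, N = emission_ll.shape
    log_alpha = np.full((T, N), -np.inf)
    log_alpha[p_star] = np.log(pi) + emission_ll[p_star]        # initialization
    for t in range(p_star + 1, T):
        for i in range(N):
            # log-sum-exp over previous states j of log alpha_{t-1}(j) + log a_ji
            terms = log_alpha[t - 1] + np.log(A[:, i])
            m = terms.max()
            log_alpha[t, i] = emission_ll[t, i] + m + np.log(np.sum(np.exp(terms - m)))
    return log_alpha
```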
If we compute the derivative of L in Eq. (24) with respect to π i and set it to zero, we obtain from Q p* (λ|λ (s) ) that π i = γ p* (i)/τ 0 , where τ 0 is the Lagrange multiplier of the constraint Σ i π i = 1. Summing over i gives τ 0 = Σ i=1..N γ p* (i) = 1. Therefore, the updating formula for π i , for i = 1, ..., N, is π i = γ p* (i).

Similarly, with respect to a ij , the Lagrange multiplier of the constraint Σ j a ij = 1 satisfies τ i = Σ t=p*..T−1 γ t (i). Therefore, the updating formula for a ij , for i, j = 1, ..., N, divides the expected number of transitions from i to j by Σ t=p*..T−1 γ t (i).

Now, we compute the derivative of L in Eq. (24) with respect to the parameters η imp , β imk and σ 2 im of the Gaussian ln(N(x t m |f t im , σ 2 im )). For the derivative with respect to β im0 , setting it to zero yields Σ t γ t (i)(x t m − f t im )/σ 2 im = 0; together with the analogous equations for the remaining β imk and η imp , this forms a linear system whose solution gives the updated mean coefficients, and the derivative with respect to σ 2 im gives the updated variance as a γ t (i)-weighted mean squared residual.

Observe that:

δ t p* (i) = max q p*:t−1 {P(x t |x t−p*:t−1 , Q t = i, λ) P(x p*:t−1 , q p*:t−1 , Q t = i|x 0:p*−1 , λ)}
= max q p*:t−1 {b p* i (x t ) P(Q t = i|q t−1 , λ) P(x p*:t−1 , q p*:t−1 |x 0:p*−1 , λ)}
= b p* i (x t ) max j {a ji δ t−1 p* (j)}    (31)

In the first equality of Eq. (31), we noticed that X t is d-separated from Q p*:t−1 given Q t and X t−p*:t−1 . In the second equality, we take advantage of Q t being d-separated from X 0:t−1 and Q p*:t−2 given Q t−1 . In the third equality, we used the dynamic programming principle [43,44] in δ t−1 p* . As we can see, δ t p* (i) can be computed iteratively as in its traditional version. The Viterbi algorithm is initialized with δ p* p* (i) = π i b p* i (x p* ) for i = 1, ..., N.
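The Viterbi recursion of Eq. (31) admits the same log-space sketch, with backtracking pointers (again assuming the emission log-probabilities log b p* i (x t ) are precomputed):

```python
import numpy as np

def viterbi(pi, A, emission_ll, p_star):
    """Log-space Viterbi for the AR-adapted recursion:
    delta_t^{p*}(i) = emission + max_j [a_ji * delta_{t-1}^{p*}(j)]."""
    T, N = emission_ll.shape
    log_delta = np.full((T, N), -np.inf)
    psi = np.zeros((T, N), dtype=int)                 # backtracking pointers
    log_delta[p_star] = np.log(pi) + emission_ll[p_star]   # initialization
    for t in range(p_star + 1, T):
        for i in range(N):
            scores = log_delta[t - 1] + np.log(A[:, i])
            psi[t, i] = int(np.argmax(scores))
            log_delta[t, i] = emission_ll[t, i] + scores[psi[t, i]]
    # backtrack the most likely hidden-state path from t = T-1 down to p_star
    path = [int(np.argmax(log_delta[T - 1]))]
    for t in range(T - 1, p_star, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]
```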