Algebraic Reduction of Hidden Markov Models

The problem of reducing a hidden Markov model (HMM) to one of smaller dimension that exactly reproduces its marginals is tackled with a system-theoretic approach. Realization theory tools are extended to HMMs by leveraging suitable algebraic representations of probability spaces. We propose two algorithms that return coarse-grained equivalent HMMs obtained via stochastic projection operators: the first returns models that exactly reproduce the single-time distribution of a given output process, while in the second the full (multi-time) distribution is preserved. The reduction method exploits not only the structure of the observed output but also its initial condition, whenever the latter is known or belongs to a given subclass. Optimal algorithms are derived for a class of HMMs, namely observable ones.


I. INTRODUCTION
Hidden Markov processes are a ubiquitous class of stochastic models with extensive applications in modeling and prediction for speech [1], [2], biological systems [3]-[6], and information and communication systems [7]-[9]. Dedicated optimal control and estimation methods have been developed for this class of models, see e.g. [10]-[12].
In the development of the realization theory for HMMs, two related yet distinct problems emerge: constructing an HMM from data, and reducing an existing model, when possible, to an equivalent one of smaller size. For an analysis and review of the first, see for example [13], [14], and more recent results in [15]. In this paper, we focus on the reduction problem. Besides its theoretical interest, model reduction is critical in effectively addressing problems in large-scale systems [16]-[18]. A characterization of equivalent HMMs, that is, models that produce the same output marginals as a given one, is proposed in [19]. Their treatment of equivalent HMMs is based on the definition of effective spaces, which specify equivalence classes of HMMs and represent the HMM analogue of minimal realization spaces for linear systems. In the same paper, the authors pose the problem of finding a minimal equivalent HMM. As a reduction to the effective space is not guaranteed to preserve the positivity of the model, the problem has so far remained unsolved.
In this paper, we show how effective spaces can be extended so that the reduced model remains an HMM. In fact, we propose a general approach to the model reduction problem that is based on an algebraic description of probability spaces. While this is done very frequently and almost implicitly, we take a deeper look into the algebraic structures and the associated representations. In particular, we shall need minimal algebraic models that represent a set of random variables (r.v.s) and conditional expectations. Such an algebraic approach has been developed to generalize the classical Kolmogorov description to the non-commutative case so that it suitably covers quantum mechanics [20]-[22], but it has proven useful in many other areas, from random matrix theory (see, e.g., the insightful introduction [23]) to algebraic statistics [24]. In our setting, the algebraic framework and the induced matrix representations allow us to leverage observability and reachability ideas in the characterization of equivalent models, as well as linear-algebraic algorithms that compute reduced models. Our approach remains deeply rooted in the system-theoretic analysis of the dynamical model and can be seen as a way to construct reduced stochastic realizations for an HMM. Furthermore, the proofs of effectiveness for the proposed methods all hinge on a result of model reduction for switched linear systems; in order to maintain the focus on HMMs, the latter is presented in Appendix A.
In what follows, we deal with reductions of a given HMM that exactly reproduce the marginals of the original system. This allows us to clearly illustrate the workings and theoretical foundations of the method; the extension to approximate reduction will be the focus of upcoming work.
Similar problems have been studied from different perspectives: in particular, the concept of lumpability of Markov processes [25], which induces coarse-grained processes analogous to those presented here, has been employed to characterize a class of exactly reducible HMMs (2-lumpable systems), see [26] and references therein. Other works, such as [27] and references therein, reframe the problem using cellular automata for hidden information sources and study reductions of Markov transition kernels within this abstract approach.
The differences between our approach and the existing results are manifold, both in the tools used and in the nature of the results. In the proposed framework, we introduce and solve two types of reduction problems: preserving only the single-time marginal, or the full (multi-time) distribution of the outcomes. We show that the former, which is of interest in model reduction of master equations for statistical models or mixing processes and algorithms [28], can lead to further reduction and smaller final models, as one might expect. In addition, our reductions leverage not only the structure of the measured process, but also the particular initial distribution of the HMM. We show that the initial conditions are indeed critical for obtaining minimal reductions in many situations, in particular when the original model is initialized in an equilibrium density. The method hinges on the use of conditional expectations as projections for obtaining a reduced representation of the dynamics. While the idea is certainly not new to the control community, see e.g. the derivation of Kalman filters [10], [29], in this work we develop it in an algebraic framework. After representing a conditional expectation as a linear operator, we construct a stochastic, non-square factorization of its dual with respect to the inner product associated with the expectation: the factors are then used to obtain the reduced probabilistic description, preserving its stochastic character. Lastly, we make direct contact with system-theoretic ideas in a linear-algebraic framework, which allows for effective, practically implementable reduction algorithms. In fact, while the whole analysis could be carried out in the infinite-dimensional case, we here restrict to the finite case: in order to derive computable algorithms, a finite-dimensional approximation would be needed anyway.
The structure of the paper is as follows. In Section II we review the fundamentals of the algebraic probabilistic models needed for our aims. The approach is directly borrowed from non-commutative probability [22], [30] and its use in quantum theory, where the algebras used for embedding the probability space need not be commutative (and are typically infinite-dimensional [21]), and can then be used to model quantum systems [20]. As remarked above, in this work we only use commutative, finite-dimensional associative algebras, represented as $\mathbb{R}^n$ endowed with its element-wise product. Subsection II-B focuses on conditional expectations as linear maps on algebras, their duals, and their representations. These are some of the key tools in the development of our method.
Section III is devoted to introducing the notation and the problems of interest, namely obtaining reduced models that reproduce either the single-time marginals or the multi-time marginals of a given HMM, while Section IV presents some preliminary results that build upon [19] from an explicit system-theoretic perspective. The main results of the section are obtained by specializing a switched-system result that we derive in Appendix A to maintain the focus on HMMs. The key ideas we leverage to obtain reduced HMMs are described in Section V, where a class of reduction algorithms for the single-time marginal problem is developed. Section VI then extends and adapts these ideas to the multi-time marginal problem. A key point in our analysis is that, in order to develop the algorithms, we must switch from the abstract quotient spaces of [19] to a representative effective subspace. We show that the choice of representative has a non-trivial effect on the reduction itself. How to select this and other parameters used in the algorithms is discussed in Section VII, where we provide optimal choices for a class of models that includes observable HMMs and Markov chains. The same choices prove to be optimal in all the tested examples, also in the presence of non-observable components of the reachable space. Some particularly instructive examples are given in Section VIII, and an outlook on future developments is provided with the concluding remarks in Section IX.

A. Basic Notation
In the following, we typically denote vectors $\mathbf{v} \in \mathbb{R}^n$ in boldface, and matrices in capitals, $V \in \mathbb{R}^{n\times m}$. We denote by $\mathbf{1}$ the vector of all ones, and by $\mathbf{0}$ the vector of all zeros. The matrix transpose of $V$ is $V^T$. Given a vector $\mathbf{x} \in \mathbb{R}^n$ and the standard basis $\{e_i\}$ for $\mathbb{R}^n$, we define its support as the vector space $\operatorname{supp}(\mathbf{x}) = \operatorname{span}\{e_i \,|\, e_i^T \mathbf{x} \neq 0\}$. Given a vector space $\mathcal{V} \subseteq \mathbb{R}^n$, its support is defined as the vector space $\operatorname{supp}(\mathcal{V}) = \operatorname{span}\{e_i \,|\, \exists\, \mathbf{x} \in \mathcal{V} \text{ s.t. } e_i^T \mathbf{x} \neq 0\}$. $\operatorname{diag}(\cdot)$ is the operator that, given a vector $\mathbf{v}$, returns the diagonal matrix $\operatorname{diag}(\mathbf{v})$ with $[\operatorname{diag}(\mathbf{v})]_{i,i} = v_i$.

II. ALGEBRAIC APPROACH TO PROBABILITY THEORY
The central idea in algebraic probability models is to represent all the key ingredients of a classical probabilistic model as elements of a suitable algebra $\mathcal{A}$, endowed with a probability functional (or state) $\mathbf{p}$. In the following sections, we start from a probability space $(\Omega, \Sigma, \mathrm{P})$ and briefly review how to construct an algebraic representation $(\mathcal{A}, \mathbf{p})$, with $\mathcal{A} \subseteq \mathbb{R}^n$. Conversely, we show that any pair $(\mathcal{A}, \mathbf{p})$ admits a classical representation. This allows for a natural probabilistic interpretation of the proposed reduction method.
A. Fundamentals of algebraic probabilistic models

1) Events and σ-algebras: Throughout the rest of this article, we consider finite-dimensional probability spaces $(\Omega, \Sigma, \mathrm{P})$. Without loss of generality, we can assume $\Omega = \{1, \dots, n\}$.
The first step in the construction entails the vector representation of events. The latter are in one-to-one correspondence with indicator functions: let $I_E(\omega)$ be the indicator function associated with the event $E$. Since the probability space is finite-dimensional, we can further associate indicator functions to vectors in $\mathbb{R}^n$. In particular, each indicator of an elementary event $\omega \in \Omega$ can be associated with the corresponding vector of the standard basis, i.e. $e_\omega \in \mathbb{R}^n$. Similarly, we can define indicator vectors for any event $E \in \Sigma$ as $f_E = \sum_{\omega \in E} e_\omega$. For these vectors, $(f_E)_\omega = 1$ if $\omega \in E$ and zero otherwise. Notice that $f_\Omega = \mathbf{1}$ and $f_\emptyset = \mathbf{0}$.
Let us denote by $F_\Sigma$ the set of indicator vectors of the events of the σ-algebra $\Sigma$. Let $\wedge$ denote the element-wise product, $(\mathbf{v} \wedge \mathbf{w})_i = v_i w_i$, let $\vee$ denote the modified sum operation defined as $\mathbf{v} \vee \mathbf{w} = \mathbf{v} + \mathbf{w} - \mathbf{v} \wedge \mathbf{w}$, and let $\neg$ denote the negation operation $\neg\mathbf{v} = \mathbf{1} - \mathbf{v}$. By construction, the set $F_\Sigma$ equipped with the operations $\wedge$, $\vee$, $\neg$ is isomorphic to the σ-algebra $\Sigma$ with $\cap$, $\cup$, $\bar{\cdot}$. In the following, we refer to $F_\Sigma$ as a vector σ-algebra, and we will drop the subscript when unnecessary.
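As an aside, the three operations are straightforward to realize numerically on indicator vectors; the following minimal sketch (the events and the dimension are made up for illustration, and are not taken from the paper) verifies the correspondence with intersection, union and complement.

```python
import numpy as np

def wedge(v, w):   # (v ^ w)_i = v_i w_i: element-wise product, plays the role of intersection
    return v * w

def vee(v, w):     # v v w = v + w - v ^ w: modified sum, plays the role of union
    return v + w - v * w

def neg(v):        # negation: 1 - v, plays the role of complement
    return 1 - v

# Indicator vectors of two events over Omega = {1, 2, 3, 4}
f_E = np.array([1, 1, 0, 0])
f_F = np.array([0, 1, 1, 0])

print(wedge(f_E, f_F))   # [0 1 0 0], the indicator of E ∩ F
print(vee(f_E, f_F))     # [1 1 1 0], the indicator of E ∪ F
print(neg(f_E))          # [0 0 1 1], the indicator of the complement of E
```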
A vector partition of $\Omega$ is a subset $\mathcal{P} \subseteq F \setminus \{\mathbf{0}\}$ such that $f_i \wedge f_j = \mathbf{0}$ for all $f_i, f_j \in \mathcal{P}$, $i \neq j$, and $\mathbf{1} = \bigvee_{f_j \in \mathcal{P}} f_j$. The finest resolution in $F$ is a partition $\operatorname{res}(F)$ such that every $f \in F$ can be written as $f = \bigvee_{f_j \in \operatorname{res}(F)} c_j f_j$ with $c_j \in \{0,1\}$.
Note that $\operatorname{res}(F)$ is not necessarily equal to the standard basis of $\mathbb{R}^n$ since, in general, $\Sigma$ is contained in, but not equal to, the power set of $\Omega$. We shall also write $\operatorname{res}(\Sigma)$ to indicate the finest resolution of a classical σ-algebra.
2) Random variables: Random variables (r.v.s) are $\Sigma$-measurable functions $X(\omega) : \Omega \to A \subset \mathbb{R}$, where $A = \{x_i\}$ is the finite set of outcomes of $X$, called the alphabet. Let $E_i = X^{-1}(x_i)$. An r.v. $X$ can then be represented as a linear combination of indicator functions, $X(\omega) = \sum_{i=1}^{|A|} x_i I_{E_i}(\omega)$.
Using the vector representation $f_{E_i}$ of the indicator functions $I_{E_i}$ in the previous equation, each $X$ can also be represented as a vector
$$\mathbf{x} = \sum_{i=1}^{|A|} x_i f_{E_i} \in \mathbb{R}^n,$$
where $\{f_{E_i}\} \subset F_\Sigma$ forms a partition of $\Omega$. Notice that in the vector formalism, the notion of $F_\Sigma$-measurability is equivalent to the condition $\mathbf{x} \in \operatorname{span}\{F_\Sigma\}$. Here and elsewhere, the boldface font $\mathbf{x}$ is used for (vector representations of) r.v.s, while $x$ denotes the corresponding outcome. As we show below, $\operatorname{span}\{F_\Sigma\}$ has the property of being an algebra, namely a vector space (or subspace) that is closed under the element-wise product $\wedge$. An algebra is unital if it contains $\mathbf{1}$. The whole $\mathbb{R}^n$ is then a unital algebra, and we denote its subalgebras using the script font, e.g. $\mathcal{A}$. A non-unital algebra $\mathcal{A}$ still contains the vector $\mathbf{1}_{\mathcal{A}}$, which has entries 1 on the support of $\mathcal{A}$ and 0 otherwise and acts as the product identity in $\mathcal{A}$.
The following proposition collects some known facts which clarify the relation between $F_\Sigma$ and $\mathcal{A} = \operatorname{span}\{F_\Sigma\}$ and proves that the latter is indeed an algebra.

Proposition 1. If $F \subset \mathbb{R}^n$ is a vector σ-algebra, then $\mathcal{A} = \operatorname{span}\{F\}$ is the smallest subalgebra of $\mathbb{R}^n$ containing $F$, and it is unital. Conversely, let $\mathcal{A}$ be any unital subalgebra of $\mathbb{R}^n$ and let $\operatorname{idem}(\mathcal{A}) := \{f \in \mathcal{A} \,|\, f \wedge f = f\} \subset \mathcal{A}$ be the set of idempotent vectors in $\mathcal{A}$. Then $\operatorname{idem}(\mathcal{A})$ is the smallest σ-algebra such that every element in $\mathcal{A}$ is $F$-measurable, and $\operatorname{res}(\operatorname{idem}(\mathcal{A}))$ forms an orthogonal basis for $\mathcal{A}$.
A proof of this proposition is reported in Appendix B for completeness. The proposition shows that not only does the space of $F_\Sigma$-measurable random variables form a unital subalgebra but, more importantly, given any unital subalgebra $\mathcal{A}$, it is possible to find the minimal (vector) σ-algebra that makes every random variable in $\mathcal{A}$ measurable. For convenience, in the following we write $\operatorname{res}(\mathcal{A})$ for $\operatorname{res}(\operatorname{idem}(\mathcal{A}))$.
3) Probability and expectations: Let us now consider a probability measure $\mathrm{P} : \Omega \to [0,1]$. For any probability measure $\Pr[\cdot]$ on $\Sigma$ we can define a vector as follows:
$$\mathbf{p} := \sum_{\omega \in \Omega} \mathrm{P}(\omega)\, e_\omega.$$
Then, for any $f_E \in F_\Sigma$ it is immediate to verify that $\Pr[E] = \langle \mathbf{p}, f_E \rangle$. In particular, notice that if we can write $\mathbf{p} = \sum_{f_r \in \operatorname{res}(\mathcal{A})} p_r f_r$, then $\mathbf{p}$ can be interpreted as a random variable in the same algebra, $\mathbf{p} \in \mathcal{A}$.
A vector $\mathbf{p}$ is said to be a probability vector if $p_i \geq 0$ for all $i$ and $\mathbf{1}^T \mathbf{p} = 1$. The set of probability vectors in $\mathcal{A}$ is defined as $\mathcal{D}(\mathcal{A}) := \{\mathbf{p} \in \mathcal{A} \,|\, p_i \geq 0\ \forall i,\ \mathbf{1}^T \mathbf{p} = 1\}$. Note that $\mathcal{D}(\mathcal{A}) = \mathcal{D}(\mathbb{R}^n) \cap \mathcal{A}$.
Consider an r.v. $X$ and let us denote again by $f_i$ the indicator vector associated with the outcome $x_i$. It then holds that $\Pr[X = x_i] = \langle \mathbf{p}, f_i \rangle$. Similarly, we can compute the expectation of a random variable as $\mathbb{E}[\mathbf{x}] = \sum_j x_j \Pr[E_j] = \sum_j x_j \langle \mathbf{p}, f_j \rangle = \langle \mathbf{p}, \mathbf{x} \rangle$.
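As a quick numerical illustration (all numbers here are made up), both formulas reduce to a single dot product:

```python
import numpy as np

p = np.array([0.1, 0.4, 0.3, 0.2])     # probability vector on Omega = {1,...,4}
f_E = np.array([1.0, 1.0, 0.0, 0.0])   # indicator vector of an event E
x = np.array([5.0, 5.0, -1.0, 2.0])    # a random variable, constant on E

print(p @ f_E)   # Pr[E] = <p, f_E>  ->  0.5
print(p @ x)     # E[x]  = <p, x>    ->  2.6
```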
In summary, we have shown that a unital subalgebra $\mathcal{A}$ can subsume both the σ-algebra and the space of measurable random variables of a given probability space. Moreover, it is equivalent to a probability space when paired with a positive linear functional, associated with the inner product with a probability vector $\mathbf{p}$. Conversely, given a pair $(\mathcal{A}, \mathbf{p})$, we can always construct a (classical) probability space associated with the pair. This can be done by choosing $\Omega = \{1, \dots, n\}$ and the underlying σ-algebra $\Sigma$ associated with $\operatorname{idem}(\mathcal{A})$ as in Proposition 1. Lastly, $\mathbf{p}$ represents the probability distribution associated with the functional $\Pr[E] = \langle \mathbf{p}, f_E \rangle$.

B. Stochastic maps and Conditional Expectations
Let us now focus on maps between probability vectors. Consider two unital subalgebras, $\mathcal{F}$ of $\mathbb{R}^n$ and $\mathcal{G}$ of $\mathbb{R}^m$. A linear map between probability vectors $P[\cdot] : \mathcal{D}(\mathcal{F}) \to \mathcal{D}(\mathcal{G})$, $\mathbf{p} \mapsto \mathbf{q} = P[\mathbf{p}]$, is called a stochastic map. Such a map can be represented as a (column-)stochastic matrix $P \in \mathbb{R}^{m\times n}$, i.e. a matrix such that $(P)_{i,j} \geq 0$ for all $i,j$ and $\mathbf{1}_m^T P = \mathbf{1}_n^T$. In the following, the main task will be to find reduced descriptions of linear dynamics associated with stochastic maps. In doing this, we exploit the properties of a particular class of stochastic maps: the duals of conditional expectations.
Recall that the conditional expectation of an r.v. given a σ-algebra $\Sigma$ with finest resolution $\operatorname{res}(\Sigma)$ can be written as follows:
$$\mathbb{E}[X|\Sigma] = \sum_{E \in \operatorname{res}(\Sigma)} \frac{\mathbb{E}[X I_E]}{\Pr[E]}\, I_E. \qquad (1)$$
Let us consider a vector r.v. $\mathbf{x} \in \mathcal{F} \subseteq \mathbb{R}^n$, a unital algebra $\mathcal{A} \subseteq \mathcal{F}$ with $\{a_i\} = \operatorname{res}(\mathcal{A})$ and $d = \dim(\mathcal{A}) < n$, and the underlying probability measure $\mathbf{p}$. Following the previous definition, we can define the conditional expectation of the vector r.v. with respect to the algebra $\mathcal{A}$:
$$\mathbb{E}[\mathbf{x}|\mathcal{A}] = \sum_{j=1}^{d} \frac{\langle \mathbf{p}, \mathbf{x} \wedge a_j \rangle}{\langle \mathbf{p}, a_j \rangle}\, a_j. \qquad (2)$$
Noticing that it is a linear operator acting on $\mathbf{x}$, we can represent it as a matrix $E_{|\mathcal{A},\mathbf{p}} \in \mathbb{R}^{n\times n}$, namely:
$$E_{|\mathcal{A},\mathbf{p}} = \sum_{j=1}^{d} \frac{a_j\, (\mathbf{p} \wedge a_j)^T}{\langle \mathbf{p}, a_j \rangle}. \qquad (3)$$
Consider the inner product of the conditional expectation of $\mathbf{x}$ with a probability distribution $\mathbf{q}$, which we have shown to correspond to its expectation. The dual of the conditional expectation is then a map on the probability distribution defined via $\langle \mathbf{q}, E_{|\mathcal{A},\mathbf{p}}\, \mathbf{x} \rangle = \langle E^T_{|\mathcal{A},\mathbf{p}}\, \mathbf{q}, \mathbf{x} \rangle$. It is immediate to verify that $E^T_{|\mathcal{A},\mathbf{p}}$ is stochastic. The conditional expectation and its adjoint are orthogonal projectors with respect to a modified inner product. Notice that $\mathbf{p} \wedge \mathcal{A} = \operatorname{span}\{\mathbf{p} \wedge a_i\} = \operatorname{diag}(\mathbf{p})\mathcal{A}$.
Lemma 1. Consider the modified inner product $\langle \mathbf{v}, \mathbf{w} \rangle_{\mathbf{p}} = \mathbb{E}_{\mathbf{p}}[\mathbf{v} \wedge \mathbf{w}]$, with $\mathbf{p} > 0$. Then $E_{|\mathcal{A},\mathbf{p}}$ is the orthogonal projector onto $\mathcal{A}$ with respect to the inner product $\langle \cdot, \cdot \rangle_{\mathbf{p}}$, and $E^T_{|\mathcal{A},\mathbf{p}}$ is the orthogonal projector onto $\mathbf{p} \wedge \mathcal{A}$ with respect to the inner product $\langle \cdot, \cdot \rangle_{\mathbf{p}^{-1}}$.
The proof of this lemma is reported in Appendix B for completeness.
Remark 1. Note that the above lemma also implies that $E_{|\mathcal{A},\mathbf{p}}$ acts as the identity on $\mathcal{A}$, while $E^T_{|\mathcal{A},\mathbf{p}}$ acts as the identity on $\mathbf{p} \wedge \mathcal{A}$. Furthermore, they are orthogonal projections for the standard inner product $\langle \cdot, \cdot \rangle$ if (and only if) $\mathbf{p} \in \mathcal{D}(\mathcal{A})$ and is positive, namely $\mathbf{p} = \sum_j \lambda_j a_j \in \mathbb{R}^n$ with $\lambda_j > 0$, $\sum_j \lambda_j = 1$. In this case, we have $E_{|\mathcal{A},\mathbf{p}} = E^T_{|\mathcal{A},\mathbf{p}}$.

Consider the standard basis $\{e_j\}$ for $\mathbb{R}^d$, where $d$ is the dimension of $\mathcal{A}$. We can then construct a (full-rank) stochastic factorization of $E^T_{|\mathcal{A},\mathbf{p}}$.

Proposition 2. Let
$$J = \sum_{j=1}^{d} \frac{(\mathbf{p} \wedge a_j)\, e_j^T}{\langle \mathbf{p}, a_j \rangle}, \qquad R = \sum_{j=1}^{d} e_j\, a_j^T. \qquad (4)$$
Then $J$, $R$ are stochastic matrices that satisfy $JR = E^T_{|\mathcal{A},\mathbf{p}}$, $RJ = I_d$, $\ker(R) = \mathcal{A}^\perp$ and $\ker(J^T) = (\mathbf{p} \wedge \mathcal{A})^\perp$.

Proof. $J$ and $R$ are clearly positive since both $\{a_j\}$ and $\{e_j\}$ are vectors of zeros and ones and $\mathbf{p}$ is positive. $J$ is clearly stochastic, $\mathbf{1}_n^T J = \mathbf{1}_d^T$, since $\mathbf{1}_n^T (\mathbf{p} \wedge a_j) = \langle \mathbf{p}, a_j \rangle$. On the other hand, we have $\mathbf{1}_d^T R = \sum_j a_j^T = \mathbf{1}_n^T$. We can then observe that $a_j^T (\mathbf{p} \wedge a_k) = \langle \mathbf{p}, a_j \wedge a_k \rangle = \langle \mathbf{p}, a_j \rangle\, \delta_{j,k}$ to conclude that $RJ = I_d$. Finally, if we consider $\mathbf{x} \in \mathcal{A}^\perp$, i.e. $\langle \mathbf{x}, a_j \rangle = 0$ for all $j$, we obtain $R\mathbf{x} = \mathbf{0}$ and, similarly, if $\mathbf{x} \in (\mathbf{p} \wedge \mathcal{A})^\perp$, i.e. $\langle \mathbf{x}, \mathbf{p} \wedge a_j \rangle = 0$ for all $j$, we obtain $J^T \mathbf{x} = \mathbf{0}$.

This stochastic factorization induces a reduction of the probabilistic description. In fact, for each distribution $\mathbf{q}$ and r.v. $\mathbf{x}$ we have:
$$\langle \mathbf{q}, E_{|\mathcal{A},\mathbf{p}}\, \mathbf{x} \rangle = \langle E^T_{|\mathcal{A},\mathbf{p}}\, \mathbf{q}, \mathbf{x} \rangle = \langle JR\mathbf{q}, \mathbf{x} \rangle = \langle R\mathbf{q}, J^T \mathbf{x} \rangle = \langle \check{\mathbf{q}}, \check{\mathbf{x}} \rangle,$$
where we define the reduced distribution as $\check{\mathbf{q}} := R\mathbf{q} \in \mathcal{D}(\mathbb{R}^d)$ and the reduced random variable as $\check{\mathbf{x}} := J^T \mathbf{x} \in \mathbb{R}^d$. This property shows that, given a unital algebra $\mathcal{A}$, it is possible to reduce the probabilistic description of the set of measurable events to the space $\mathbb{R}^d$ with $d = \dim(\mathcal{A})$. For this reason, we name $R$ the stochastic reduction and $J$ the stochastic injection.
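The following sketch illustrates, on a made-up four-dimensional example, the construction of $E_{|\mathcal{A},\mathbf{p}}$ from (3) and of the factors $J$, $R$ from (4), together with a numerical check of the properties stated in Proposition 2.

```python
import numpy as np

p = np.array([0.1, 0.4, 0.3, 0.2])           # underlying distribution, p > 0
a = np.array([[1, 1, 0, 0],                  # res(A): minimal idempotents a_j,
              [0, 0, 1, 1]], dtype=float)    # with mutually orthogonal supports

E = sum(np.outer(aj, p * aj) / (p @ aj) for aj in a)    # E_{|A,p}, eq. (3)
R = a.copy()                                            # R = sum_j e_j a_j^T
J = np.column_stack([(p * aj) / (p @ aj) for aj in a])  # stochastic injection, eq. (4)

assert np.allclose(J @ R, E.T)          # J R = E^T_{|A,p}
assert np.allclose(R @ J, np.eye(2))    # R J = I_d
assert np.allclose(J.sum(axis=0), 1.0)  # J is column-stochastic

q = np.array([0.25, 0.25, 0.25, 0.25])  # a distribution to be reduced
print(R @ q)                            # reduced distribution, still sums to 1
```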
In order to obtain smaller reduced models, it is useful to notice that even if $\mathcal{A}$ is a non-unital subalgebra of $\mathbb{R}^n$, namely the subalgebra has limited support, we can still use the reduction via factorization. In particular, we can use definitions (2), (3) and (4) to define orthogonal (for a modified product) projections onto the algebra, their duals, and their factorizations. We use the notation $E_{|\mathcal{A},\mathbf{p}}$ for simplicity, even if these are not true conditional expectations. One relevant difference in this case is highlighted in the following.
Corollary 1. Let $\mathcal{A}$ be a non-unital subalgebra and let $\mathbf{p}$ be such that $p_i > 0$ for all $i$. Then $E^T_{|\mathcal{A},\mathbf{p}}$ allows for a factorization $E^T_{|\mathcal{A},\mathbf{p}} = JR$ with $J$ and $R$ as defined above. Moreover, $J$ is stochastic, while $R$ is stochastic over the support of $\mathcal{A}$, i.e. $\mathbf{1}_d^T R = \mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T$ and $\mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T J = \mathbf{1}_d^T$.

Proof. The proof is the same as that of Proposition 2, with the only difference that $\sum_{j=1}^{d} a_j^T = \mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T$ and $\mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T (\mathbf{p} \wedge a_j) = \langle \mathbf{p}, a_j \rangle$ hold, since $\mathcal{A}$ is not unital.

III. HMM AND PROBLEM DEFINITION
Throughout the rest of this work, we consider stochastic processes that can be described as Markov processes or Hidden Markov processes (HMPs).
A stochastic process $\{\mathbf{x}_t\}$ is a collection of r.v.s taking values in the finite alphabet $A_x$, indexed by time $t$. Without loss of generality, we can assume $A_x = \{1, 2, \dots, n\}$. As the alphabet is independent of time, we can choose a fixed resolution of indicator vectors $\{f_i\}$ with respect to which $\mathbf{x}_t$ is measurable at all times, the standard basis for $\mathbb{R}^n$ being the most compact one. With this choice, $\{\mathbf{x}_t\}$ is a sequence in $\mathbb{R}^n$. In the following we thus denote by $\mathbf{x}_{0:k}$ a stochastic process with $t = 0, \dots, k$, and by $x_{0:k} \in A_x^{k+1}$ an ordered sequence of its outcomes, i.e. $x_{0:k} = x_0, x_1, \dots, x_k$, where $x_i \in A_x$ for all $i$ and $|x_{0:k}| = k+1$. The process is Markov if the joint probability of a sequence of outcomes can be written as
$$\Pr[\mathbf{x}_0 = x_0, \dots, \mathbf{x}_k = x_k] = \prod_{t=0}^{k-1} \Pr[\mathbf{x}_{t+1} = x_{t+1} \,|\, \mathbf{x}_t = x_t]\, \Pr[\mathbf{x}_0 = x_0],$$
and it is time-homogeneous if such transition probabilities are independent of $t$ for all pairs $x_{t+1}, x_t$.
In this case, there exist an initial probability vector $\mathbf{p}_0 \in \mathbb{R}^n$ and a stochastic matrix $P \in \mathbb{R}^{n\times n}$, called the transition probability matrix, such that $\Pr[\mathbf{x}_0 = x_0] = \langle \mathbf{p}_0, f_{x_0} \rangle$ and $\Pr[\mathbf{x}_{t+1} = x_{t+1} \,|\, \mathbf{x}_t = x_t] = f_{x_{t+1}}^T P f_{x_t}$, where $f_{x_t}$ represents the elementary event associated with the outcome $x_t$.
The main focus of this work are partially observed Markov processes, better known as HMPs. The following definition adapts [13, Definitions 9.2 and 9.3] to our setting.

Definition 1 (Hidden Markov processes).
A stochastic process $\{\mathbf{y}_t\}$ in $\mathbb{R}^m$ taking values in $A_y$ is an HMP if there exists a Markov process $\{\mathbf{x}_t\}$ in $\mathbb{R}^n$ taking values in $A_x$ such that $\{(\mathbf{y}_t, \mathbf{x}_t)\}$ is jointly Markov and
$$\Pr[\mathbf{y}_t = y_t, \mathbf{x}_t = x_t \,|\, \mathbf{y}_{t-1} = y_{t-1}, \mathbf{x}_{t-1} = x_{t-1}] = \Pr[\mathbf{y}_t = y_t, \mathbf{x}_t = x_t \,|\, \mathbf{x}_{t-1} = x_{t-1}]$$
for all $t$.
For HMPs, there exist an initial probability distribution $\mathbf{p}_0$ and a transition probability matrix $P \in \mathbb{R}^{n\times n}$ defined as before, as well as a stochastic matrix $C \in \mathbb{R}^{m\times n}$, called the emission probability matrix, such that $\Pr[\mathbf{y}_t = y_t \,|\, \mathbf{x}_t = x_t] = e_{y_t}^T C f_{x_t}$, where $\{e_i\}$ is the standard basis for $\mathbb{R}^m$ and $e_{y_t}$ represents the elementary event associated with $y_t$.

Definition 2 (Hidden Markov models). We define a hidden Markov model (HMM) as the couple $\theta = (P, C)$.
The HMM $\theta$ and the initial distribution $\mathbf{p}_0$ completely characterize the evolution of the probability distributions, leaving $n$, $m$ and the alphabets implicit. In fact, the marginal distribution evolution can be modeled by
$$\begin{cases} \mathbf{p}(t+1) = P\, \mathbf{p}(t) \\ \mathbf{q}(t) = C\, \mathbf{p}(t) \end{cases} \qquad (5)$$
associated with $\theta$ and initial condition $\mathbf{p}(0) = \mathbf{p}_0$, and the output marginal can then be computed as $\mathrm{P}_{\theta,\mathbf{p}_0}[\mathbf{y}_t = y_t] = e_{y_t}^T C P^t \mathbf{p}_0$.
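A toy sketch of the evolution (5), with made-up matrices, reads as follows.

```python
import numpy as np

P = np.array([[0.9, 0.2],    # column-stochastic transition matrix
              [0.1, 0.8]])
C = np.array([[1.0, 0.3],    # column-stochastic emission matrix
              [0.0, 0.7]])
p = np.array([0.5, 0.5])     # initial distribution p_0

for t in range(3):
    q = C @ p                # output marginal at time t: q(t) = C P^t p_0
    print(t, q, q.sum())     # each q is a probability vector
    p = P @ p                # propagate the hidden marginal
```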
Notice that we make the dependence on the HMM $\theta$ and the initial distribution $\mathbf{p}_0$ explicit whenever necessary to distinguish distributions induced by different models.
We are ready to state the first of the problems we will address in the following sections.
Problem 1 (Single-time marginals). Given an HMM $\theta = (P, C)$ and a finite set of initial probability distributions $S \subset \mathcal{D}(\mathbb{R}^n)$, find a reduced HMM $\check\theta = (\check P, \check C)$ of dimension $d \leq n$ and a linear map $\Psi[\cdot] : S \to \mathcal{D}(\mathbb{R}^d)$, $\mathbf{p}_0 \mapsto \check{\mathbf{p}}_0$, such that
$$\mathrm{P}_{\theta,\mathbf{p}_0}[\mathbf{y}_t = y_t] = \mathrm{P}_{\check\theta,\Psi[\mathbf{p}_0]}[\mathbf{y}_t = y_t]$$
for all $t \geq 0$ and for any initial condition $\mathbf{p}_0 \in S$.
The second problem that we address targets multi-time probability distributions.

Problem 2 (Multi-time marginals). Given an HMM $\theta = (P, C)$ and a finite set of initial probability distributions $S \subset \mathcal{D}(\mathbb{R}^n)$, find a reduced HMM $\check\theta = (\check P, \check C)$ of dimension $d \leq n$ and a linear map $\Psi[\cdot] : S \to \mathcal{D}(\mathbb{R}^d)$, $\mathbf{p}_0 \mapsto \check{\mathbf{p}}_0$, such that
$$\mathrm{P}_{\theta,\mathbf{p}_0}[\mathbf{y}_{0:k} = y_{0:k}] = \mathrm{P}_{\check\theta,\Psi[\mathbf{p}_0]}[\mathbf{y}_{0:k} = y_{0:k}]$$
for all sequences of the output process $y_{0:k}$ and for all initial conditions $\mathbf{p}_0 \in S$.

Remark 2. Although Problem 2 is more natural than Problem 1 for the typical HMM setting, the latter is also interesting in particular cases, which include efficiently simulating an unmeasured stochastic evolution and reproducing the mixing properties of lifted chains with more compact models. In fact, while we derive solutions of Problem 2 that are also solutions for Problem 1, the size of the effective multi-time reduced model is in general significantly larger, as it must exactly reproduce all transition probabilities; see also Proposition 3 below.

Remark 3. As we pointed out before, in Problems 1 and 2 we have assumed that $S$ is a finite set. This assumption can be relaxed since, as we show below, the proposed solution works for any initial condition contained in $\operatorname{span}\{S\}$. For this reason, when dealing with linear spaces of initial conditions, one can study the problem where $S$ is a set of generators of the space.

IV. PRELIMINARY RESULTS: A SYSTEM-THEORETIC VIEWPOINT

Finding minimal realizations of linear systems has been a central problem in control and system theory, for which well-established solutions are available. Nonetheless, when positivity is required on the reduced model, the minimal realization problem is, to the best of our knowledge, still open. In this section, we review some existing results and extend and adapt them so that they can be used in our scenarios. In particular, we shall allow for non-minimal realizations in order to guarantee their positivity.

A. Single-time marginal problem
Let us start by considering model (5) with initial condition $\mathbf{p}_0 \in S$. Let us define the non-observable subspace as:
$$\mathcal{N} := \bigcap_{t \geq 0} \ker(C P^t). \qquad (6)$$
The subspace $\mathcal{N}$ can be characterized as the largest $P$-invariant subspace contained in $\ker C$ [31], [32]. In the case of HMMs, the non-observable subspace has another useful property.
Lemma 2. For all $\mathbf{x} \in \mathcal{N}$ it holds that $\mathbf{1}^T \mathbf{x} = 0$.
Proof. From the definition of the non-observable space, we have that $\mathbf{x} \in \mathcal{N}$ if and only if $C P^t \mathbf{x} = \mathbf{0}$ for all $t \geq 0$. If we left-multiply by $\mathbf{1}^T$ on both sides, since $C$ and $P$ are stochastic we obtain $0 = \mathbf{1}^T C P^t \mathbf{x} = \mathbf{1}^T P^t \mathbf{x} = \mathbf{1}^T \mathbf{x}$ for all $\mathbf{x} \in \mathcal{N}$.
Next, define $\mathcal{R}$ as the smallest linear space that contains all probability distributions $\mathbf{p}(t)$ generated by the HMM for every $t \geq 0$ and any initial distribution $\mathbf{p}_0 \in S$:
$$\mathcal{R} := \operatorname{span}\{P^t \mathbf{p}_0 \,|\, t \geq 0,\ \mathbf{p}_0 \in S\}. \qquad (7)$$
Remark 4. The space $\mathcal{R}$ is, in fact, the reachable subspace of a state-space model in the typical form
$$\begin{cases} \mathbf{p}(t+1) = P\, \mathbf{p}(t) + B\, \mathbf{u}(t) \\ \mathbf{q}(t) = C\, \mathbf{p}(t) \end{cases} \qquad (8)$$
where $B \in \mathbb{R}^{n\times|S|}$ is a matrix whose columns are the initial conditions in $S$. This model reproduces the trajectories of (5) for inputs corresponding to discrete impulses. The non-observable subspaces of (5) and (8) are the same, and the subspace $\mathcal{R}$ coincides with the reachable subspace of model (8), and thus shares the same properties: $\mathcal{R}$ is the smallest $P$-invariant subspace that contains $\operatorname{span}\{S\}$. In light of this, we call the $\mathcal{R}$ defined above the reachable subspace.
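Numerically, both subspaces can be computed with standard linear algebra; the following sketch (relying on Cayley-Hamilton, so that powers $t < n$ suffice, and on SciPy's `null_space` and `orth` routines) is one possible realization.

```python
import numpy as np
from scipy.linalg import null_space, orth

def nonobservable(P, C):
    """Basis of N, eq. (6): kernel of the observability matrix [C; CP; ...; CP^{n-1}]."""
    n = P.shape[0]
    blocks, M = [], C.copy()
    for _ in range(n):
        blocks.append(M)
        M = M @ P
    return null_space(np.vstack(blocks))

def reachable(P, S):
    """Basis of R, eq. (7): span of the Krylov vectors {P^t p0 : t < n, p0 in S}."""
    n = P.shape[0]
    cols = []
    for p0 in S:
        v = np.asarray(p0, dtype=float)
        for _ in range(n):
            cols.append(v)
            v = P @ v
    return orth(np.column_stack(cols))
```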
Lastly, we call effective subspace $\mathcal{E}$ any subspace such that $\mathcal{E} \oplus (\mathcal{R} \cap \mathcal{N}) = \mathcal{R}$, namely, a completion of the intersection $\mathcal{R} \cap \mathcal{N}$ to the reachable subspace $\mathcal{R}$. Notice that the choice of $\mathcal{E}$ is not unique; in fact, any representative of the quotient space $\mathcal{R}/(\mathcal{R} \cap \mathcal{N})$ is a suitable candidate for this choice. The most natural choice for the effective subspace is of course the orthogonal complement (with respect to the natural inner product) of $\mathcal{R} \cap \mathcal{N}$ in $\mathcal{R}$, which we shall denote by $\mathcal{E}_\perp$. Any other orthogonal complement, with respect to a modified inner product, would also be a suitable choice for $\mathcal{E}$.
Remark 5. The situation is reminiscent of the classical linear state-space analysis proposed by Rosenbrock [33], where all representatives of the quotient space $\mathcal{R}/(\mathcal{R} \cap \mathcal{N})$ are equivalent and associated with minimal realizations. In our case, however, $\mathcal{E}$ needs to be further extended to ensure positivity of the reduced dynamical matrix, a notion that depends on the chosen reference basis. For this reason, not all choices of the effective subspace are equivalent. While we will show that the algorithm we propose works with any choice of the effective subspace, in Section VII we will argue that the choice of the representative $\mathcal{E}$ of $\mathcal{R}/(\mathcal{R} \cap \mathcal{N})$ plays a key role in constructing an optimal reduction.
As we just recalled, the restriction of model (8) to (any) $\mathcal{E}$ corresponds to a minimal realization (yet not necessarily a positive or stochastic one). The next corollary shows that the same reduction method can be used for the HMM (5), while also allowing for extensions of the effective space. In this case, the minimality of the linear realization may be lost, but the extra freedom will later allow us to enforce positivity. The proof relies on a related result for general autonomous switching systems that we present in detail in Appendix A.
Corollary 2. Consider an effective subspace $\mathcal{E}$ for the HMM (5) and a subspace $\mathcal{V}$ such that $\mathcal{E} \subseteq \mathcal{V}$, with $d = \dim(\mathcal{V})$. Let $\Pi_{\mathcal{V}}$ be the orthogonal projection onto $\mathcal{V}$ with respect to an arbitrary inner product $\langle \cdot, \cdot \rangle$, such that $\Pi_{\mathcal{V}}(\mathcal{R} \cap \mathcal{N}) \subseteq \mathcal{R} \cap \mathcal{N}$. Let $R : \mathbb{R}^n \to \mathbb{R}^d$ and $J : \mathbb{R}^d \to \mathcal{V}$ be two (non-square) factors of the orthogonal projection, $\Pi_{\mathcal{V}} = JR$. Define the reduced model $(\check P, \check C) = (RPJ, CJ)$ and the map $\check{\mathbf{p}}_0 = R\mathbf{p}_0$ for all $\mathbf{p}_0 \in S$. Then the linear systems associated with the pairs $(P, C)$ and $(\check P, \check C)$ reproduce the same marginal distribution at any given time instant, i.e. $C P^t \mathbf{p}_0 = \check C \check P^t \check{\mathbf{p}}_0$ for all $t \geq 0$ and any initial condition $\mathbf{p}_0 \in \operatorname{span}\{S\}$.
Proof. The result follows from the application of Theorem 4 reported in Appendix A with a single $F_i = P$, $H = C$ and $\mathbf{x}(0) = \mathbf{p}_0$.
In the following sections, we shall construct $\mathcal{V}$ so that the reduction is also an HMM.

B. Multi-time marginal problem
For the multi-time marginal problem, following the seminal work [19], we will assume that the $C$ of our initial model has only zero or one entries, i.e. $C \in \{0,1\}^{m\times n}$. The assumption is not restrictive, as any hidden Markov process admits a realization with $C$ of this type [13, Theorem 9.4].
The minimal reduction of the system producing the multi-time distribution can be obtained along the same lines. Calculating the probability of a sequence of events is however more involved: [19, Lemma 1] provides a closed form for such a computation. We report it here for completeness.

Lemma 3. Given an HMM $\theta$ and an initial probability distribution $\mathbf{p}_0$, the probability of a sequence of outcomes is given by
$$\mathrm{P}_{\theta,\mathbf{p}_0}[\mathbf{y}_{0:k} = y_{0:k}] = \mathbf{1}^T P_C^{y_{0:k}}\, \mathbf{p}_0, \quad \text{where} \quad P_C^{y_{0:k}} := \left(\prod_{t=k}^{1} \operatorname{diag}(e_{y_t}^T C)\, P\right) \operatorname{diag}(e_{y_0}^T C). \qquad (9)$$
In the above lemma, the multiplication by the diagonal matrices $\operatorname{diag}(e_{y_t}^T C)$ accounts for the conditioning of $\mathbf{p}_t$ on the outcome $\mathbf{y}_t = y_t$. Without the latter we recover the formulas for the single-time marginals.
In order to exploit system-theoretic tools, it is useful to write the probability of a sequence of outcomes as the output of a dynamical model. The dynamical model we present next resembles the "observable representations of HMMs" described in [34]. Call $\psi(t) = \mathrm{P}(\mathbf{y}_{0:t} = y_{0:t})$. We can obtain its evolution as the output of a discrete-time, autonomous, switching, linear system described by
$$\begin{cases} \boldsymbol{\varphi}(t+1) = P_C^{y_t}\, \boldsymbol{\varphi}(t) \\ \psi(t) = \mathbf{1}^T \boldsymbol{\varphi}(t) \end{cases} \qquad (10)$$
with initial condition $\boldsymbol{\varphi}_{y_0}(1) = \operatorname{diag}(e_{y_0}^T C)\, \mathbf{p}_0$, with $P_C^{y_t} := \operatorname{diag}(e_{y_t}^T C)\, P$ as in the previous lemma, and where $\psi(t)$ represents the probability associated with the sequence of events $y_{0:t}$. Clearly, the output $\psi(t)$ depends on the sequence of $P_C^{y_t}$, which in turn depends on the outcomes of the sequence. The output at any time $k > 0$ can be computed as $\psi(y_{0:k}) = \mathbf{1}^T \prod_{i=k}^{1} P_C^{y_i}\, \boldsymbol{\varphi}_{y_0}(1)$, while for $k = 0$ we have $\psi(y_0) = \mathbf{1}^T \boldsymbol{\varphi}_{y_0}(1)$, thus recovering the formulas of the lemma.
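A sketch of this computation (reusing the made-up $(P, C)$ from the earlier sketch) simply chains the propagators $P_C^{y} = \operatorname{diag}(e_y^T C)P$; as a sanity check, by Lemma 5 below the probabilities of all strings of a fixed length must sum to one.

```python
import numpy as np
from itertools import product

def string_probability(P, C, p0, ys):
    """psi(y_{0:k}): chain phi(t+1) = diag(e_y^T C) P phi(t) from phi(1)."""
    phi = np.diag(C[ys[0]]) @ p0           # phi(1) = diag(e_{y0}^T C) p0
    for y in ys[1:]:
        phi = np.diag(C[y]) @ (P @ phi)
    return phi.sum()                       # psi = 1^T phi

P = np.array([[0.9, 0.2], [0.1, 0.8]])
C = np.array([[1.0, 0.3], [0.0, 0.7]])
p0 = np.array([0.5, 0.5])

# Probabilities of all output strings of length 4 sum to 1 (cf. Lemma 5 below).
total = sum(string_probability(P, C, p0, ys)
            for ys in product(range(2), repeat=4))
print(total)   # ~1.0
```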
Given a finite set $S$ of initial distributions of interest, the corresponding set of initial conditions for this model is $\Phi = \bigcup_{y_0} \operatorname{diag}(e_{y_0}^T C)\, S$. Following the approach of [19] in a system-theoretic setting, we can define the reachable, non-observable, and effective subspaces for the multi-time problem. To avoid confusion with the previous definitions, we call these the conditioned subspaces and denote them with a $C$ subscript. Given an HMM $(P, C)$ and a set of initial conditions $S$, we define the conditioned non-observable subspace as
$$\mathcal{N}_C := \{\mathbf{x} \in \mathbb{R}^n \,|\, \mathbf{1}^T P_C^{y_{0:l}}\, \mathbf{x} = 0,\ \forall\, y_{0:l},\ \forall\, l \geq 0\} \qquad (13)$$
and the conditioned reachable subspace as
$$\mathcal{R}_C := \operatorname{span}\{P_C^{y_{0:l}}\, \mathbf{p}_0,\ \forall\, y_{0:l},\ \forall\, \mathbf{p}_0 \in S\}.$$
We can then define the conditioned effective subspace $\mathcal{E}_C$ as a completion of the intersection $\mathcal{R}_C \cap \mathcal{N}_C$ to the conditioned reachable subspace $\mathcal{R}_C$, i.e. $\mathcal{E}_C \oplus (\mathcal{R}_C \cap \mathcal{N}_C) = \mathcal{R}_C$. As before, the choice of $\mathcal{E}_C$ is not unique, as any representative of the quotient space $\mathcal{R}_C/(\mathcal{R}_C \cap \mathcal{N}_C)$ is a suitable choice. The properties of these spaces have been described in [19, Lemma 3, Section 3]. We recap them in the following lemma for the reader's convenience.

Lemma 4. $\mathcal{N}_C$ and $\mathcal{R}_C$ are $P$-invariant and $\operatorname{diag}(e_i^T C)$-invariant for all $i$, and thus $P_C^{y_{0:l}}$-invariant for all sequences $y_{0:l}$. A result similar to the Cayley-Hamilton theorem holds and lets us compute the spaces by using a finite number of generators:
$$\mathcal{R}_C = \operatorname{span}\{P_C^{y_{0:l}}\, \mathbf{p}_0,\ \forall\, \mathbf{p}_0 \in S,\ \forall\, y_{0:l} \text{ s.t. } l < n\}. \qquad (14)$$
We can then notice that $\mathcal{N}_C$ is the non-observable subspace of model (10), see e.g. [35], $\mathcal{R}_C$ is its reachable subspace, and $\mathcal{E}_C$ is its effective subspace. The second statement holds trivially, while the first holds because $\mathcal{N}_C$ is $\operatorname{diag}(e_i^T C)$-invariant for all $i$. The third follows by combining the first two.
A useful property of the propagator $P_C^{y_{0:k}}$ is proved in the following lemma.
Lemma 5. The sum over all sequences $y_{0:k}$ of the same length of $P_C^{y_{0:k}}$ is equal to the $k$-th power of $P$, i.e.
$$\sum_{y_{0:k}} P_C^{y_{0:k}} = P^k.$$
Proof. The statement is proved by observing that $\sum_{y_i} \operatorname{diag}(e_{y_i}^T C) = I$ for all $i$ and summing over all the possible strings $y_{0:k}$, starting from the first character.
The next proposition shows that, in general, solving the multi-time marginal case requires a larger model than the single-time case defined before.

Proposition 3. Given an HMM $(P, C)$ and a set of initial conditions $S$, it holds that $\mathcal{R} \subseteq \mathcal{R}_C$ and $\mathcal{N}_C \subseteq \mathcal{N}$, and also $\mathcal{E} \subseteq \mathcal{E}_C$ for suitable choices of the effective subspaces. The proof of this proposition can be found in Appendix B.
Remark 6. This result clarifies the relation as well as the distinction between Problems 1 and 2. In fact, the proposition shows that, at least in principle, there can be a larger reduction if we are only interested in describing the evolution of the marginal distribution at a specific time. Moreover, the conditioned effective subspace contains the effective subspace, thus showing, due to Corollary 2, that a solution for Problem 2 is also a solution for Problem 1.
We now propose a class of effective model reductions for the multi-time marginal problem.

Corollary 3. Consider any conditioned effective subspace $\mathcal{E}_C$ and a subspace $\mathcal{V}$ such that $\mathcal{E}_C \subseteq \mathcal{V}$, with $d = \dim(\mathcal{V})$, and let $\Pi_{\mathcal{V}}$ be the orthogonal projection onto $\mathcal{V}$ with respect to an inner product $\langle \cdot, \cdot \rangle$, such that $\Pi_{\mathcal{V}}(\mathcal{R}_C \cap \mathcal{N}_C) \subseteq \mathcal{R}_C \cap \mathcal{N}_C$, with (non-square) factors $\Pi_{\mathcal{V}} = JR$. Consider then the reduced model $(\{\check P_C^{y_i}\}, \mathbf{1}_d^T) = (\{R P_C^{y_i} J\}, \mathbf{1}_n^T J)$ and the map $\check{\boldsymbol{\varphi}}(1) = R\boldsymbol{\varphi}(1)$ for all $\boldsymbol{\varphi}(1) \in \Phi$. Then the two models described by equations (10) and denoted by the couples $(\{P_C^{y_i}\}, \mathbf{1}_n^T)$ and $(\{\check P_C^{y_i}\}, \mathbf{1}_d^T)$ reproduce the same probability of a sequence of outcomes, i.e.
$$\mathbf{1}_n^T \prod_{i=k}^{1} P_C^{y_i}\, \boldsymbol{\varphi}(1) = \mathbf{1}_d^T \prod_{i=k}^{1} \check P_C^{y_i}\, \check{\boldsymbol{\varphi}}(1)$$
for any sequence $y_{0:k}$ and any initial condition $\boldsymbol{\varphi}(1) \in \operatorname{span}\{\Phi\}$.
Proof. The result follows from the application of Theorem 4 reported in Appendix A with $F_i = \operatorname{diag}(e_{y_i}^T C)\, P$, $H = \mathbf{1}^T$, and $\mathbf{x}(0) = \operatorname{diag}(e_{y_0}^T C)\, \mathbf{p}_0$.
Remark 7. At this point one may notice that Corollary 3 provides a reduction for model (10) which includes the conditioning as part of the dynamics and, in general, may not translate directly into a reduction of (5) in the HMM form $(\check P, \check C, \check S)$. Nevertheless, we anticipate here that the algorithm we propose in Section VI for the multi-time case provides a model in HMM form, thanks to Lemma 4. Thanks to Proposition 3 and Corollary 2, the obtained model also reproduces the single-time marginals.
Remark 8. The two main results in this section, Corollaries 2 and 3, as well as the underlying Theorem 4 shown in Appendix A, have been stated for time-invariant dynamics for the sake of simplicity. While it is possible to generalize the analysis to time-dependent systems, in that case Cayley-Hamilton-type results do not apply and, consequently, the computation of the reachable and non-observable spaces may become impractical.

V. SINGLE-TIME SOLUTION
In this section, we illustrate how to obtain solutions to Problem 1 by appropriately choosing $\mathcal{V}$ in Corollary 2. We first discuss the intuition behind the method, next we present the proposed solution in the form of a parametric algorithm and prove that, under appropriate constraints, the algorithm indeed provides a solution. Finally, in Section VII, we propose a way to choose the relevant parameters.

A. Intuition
The core idea behind the method stems from the fact that in order to define an HMM we need an underlying probability space and, as we have seen in Section II, any probability space is associated with an algebra. This directly suggests that, in order to preserve the (stochastic) HMM structure in the reduction, it is natural to restrict the model to an algebra whose dual contains the effective subspace, and then use the dual of the conditional expectation to obtain a stochastic reduction.
More in detail, consider the two stochastic reduction matrices $R$ and $J$ obtained in Section II-B as factors of the dual of a conditional expectation $E^T_{|\mathcal{A},\mathbf{p}}$, which is an orthogonal projection onto $\mathbf{p} \wedge \mathcal{A}$ with respect to the inner product $\langle \cdot, \cdot \rangle_{\mathbf{p}^{-1}}$. According to Corollary 2, as long as $\mathcal{E} \subseteq \mathcal{V} = \mathbf{p} \wedge \mathcal{A}$ and $E^T_{|\mathcal{A},\mathbf{p}}$ leaves $\mathcal{R} \cap \mathcal{N}$ invariant, the reduced model reproduces the same marginal distribution as the original one.
In order to choose $\mathcal{A}$ such that $\mathcal{E} \subseteq \mathbf{p} \wedge \mathcal{A}$, we can $\wedge$-multiply both sides by $\mathbf{p}^{-1}$, obtaining $\mathbf{p}^{-1} \wedge \mathcal{E} \subseteq \mathcal{A}$. Let $\operatorname{alg}(\mathcal{X})$ denote the minimal subalgebra of $\mathbb{R}^n$ containing the set $\mathcal{X}$. Then, if we define $\mathcal{A} := \operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{E})$, we ensure that $\mathcal{E} \subseteq \mathbf{p} \wedge \mathcal{A}$ is satisfied and that the reduced model reproduces the same marginal at a single time.
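Computationally, $\operatorname{alg}(\mathcal{X})$ can be obtained by closing a basis under element-wise products until the dimension stabilizes; a minimal sketch follows (the function name and tolerance handling are ours, not the paper's).

```python
import numpy as np
from scipy.linalg import orth

def generated_algebra(X):
    """Orthonormal basis of the smallest ^-closed subspace (algebra)
    containing the columns of X."""
    B = orth(X)
    while True:
        prods = [B[:, i] * B[:, j]                 # element-wise products
                 for i in range(B.shape[1]) for j in range(i, B.shape[1])]
        B_new = orth(np.column_stack([B] + prods))
        if B_new.shape[1] == B.shape[1]:           # dimension stabilized: closed
            return B_new
        B = B_new
```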
To make this idea more concrete, we provide a simple illustrative example, which also highlights the importance of choosing the distribution p to be used in E T |A ,p .
Example 1. Let us consider a three-dimensional HMM $(P, C)$ with two outputs and initial condition $\mathbf{p}_0 = \begin{bmatrix} 1/5 & 1/5 & 3/5 \end{bmatrix}^T$. Notice that $\mathbf{p}_0$ is an equilibrium, $P\mathbf{p}_0 = \mathbf{p}_0$, thus the output distribution is $\mathbf{q}(t) = \begin{bmatrix} 2/5 & 3/5 \end{bmatrix}^T$, $\forall t \geq 0$. In this case $\mathcal{R} \cap \mathcal{N} = \operatorname{span}\{\mathbf{0}\}$, and we can thus choose $\mathcal{E} = \mathcal{R}$.

If we choose $\mathbf{p} = \mathbf{1}$, we obtain the algebra $\mathcal{A} = \operatorname{alg}(\mathcal{E})$ and the corresponding factors $R$ and $J$ of the dual of the conditional expectation, leading to a reduced HMM which correctly reproduces the output marginal distribution $\mathbf{q}(t) = \begin{bmatrix} 2/5 & 3/5 \end{bmatrix}^T$, $\forall t \geq 0$, but is not of minimal dimension.

On the other hand, if we choose $\mathbf{p} = \mathbf{p}_0$ we obtain a different result. In fact, in that case we have $\mathcal{A} = \operatorname{span}\{\mathbf{1}\}$, and the factors of the dual of the conditional expectation are $R = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}$ and $J = \begin{bmatrix} 1/5 & 1/5 & 3/5 \end{bmatrix}^T$; the associated reduced HMM is $\check P = 1$, $\check C = \begin{bmatrix} 2/5 & 3/5 \end{bmatrix}^T$ and $\check{\mathbf{p}}_0 = 1$, which also reproduces the output marginal distribution and is clearly minimal (an optimal reduction). This shows that the choice of $\mathbf{p}$ is important if we are interested in minimizing the dimension of the reduced model.

B. Proposed solution
We now formalize the proposed method to solve Problem 1 in the following algorithm. Let $\Gamma(\mathcal{R}, \mathcal{N})$ be a map that selects an effective space $\mathcal{E}$ given some $\mathcal{R}$, $\mathcal{N}$.
Notice that this algorithm depends, in addition to its inputs, on two parameters: the first one, $\mathbf{p}$, is a positive vector; the second one is the map $\Gamma$ that selects the effective subspace. We will discuss the choice of the effective subspace in more detail in Section VII.
We are finally ready to prove that Algorithm 1 solves the single-time marginal problem.
Algorithm 1: HMM reduction for Problem 1
Input: $(P, C)$, $S$. Parameters: $\mathbf{p}$, $\Gamma$.
1 Compute $\mathcal{R}$ and $\mathcal{N}$ using equations (7) and (6);
2 Compute $\mathcal{E} = \Gamma(\mathcal{R}, \mathcal{N})$;
3 Compute $\mathcal{A} := \operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{E})$;
4 Compute $E^T_{|\mathcal{A},\mathbf{p}}$ using equation (3);
5 If $E^T_{|\mathcal{A},\mathbf{p}}(\mathcal{R} \cap \mathcal{N}) \not\subseteq \mathcal{R} \cap \mathcal{N}$: redefine $\mathcal{A} := \operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{R})$ and recompute $E^T_{|\mathcal{A},\mathbf{p}}$;
6 Compute the factors $R$ and $J$ of $E^T_{|\mathcal{A},\mathbf{p}}$ with the definition given in equation (4);
Output: $(\check P, \check C) = (RPJ, CJ)$ and $R$.
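Before turning to the correctness proof, here is a condensed sketch of the pipeline in the observable case $\mathcal{R} \cap \mathcal{N} = \{0\}$ (so that $\mathcal{E} = \mathcal{R}$ and step 5 never triggers); `reachable` and `generated_algebra` are the helper sketches given earlier, and the clustering of equal rows used to recover $\operatorname{res}(\mathcal{A})$ is a numerical shortcut of ours, not part of the paper's formal construction.

```python
import numpy as np

def resolution(B, tol=1e-8):
    """res(A) from a basis B of A: since vectors in A are constant on atoms,
    coordinates with identical rows of B lie in the same minimal idempotent."""
    atoms = {}
    for i, row in enumerate(np.round(B, 8)):
        if np.linalg.norm(row) > tol:          # rows outside supp(A) are zero
            atoms.setdefault(tuple(row), []).append(i)
    res = []
    for idx in atoms.values():
        a = np.zeros(B.shape[0])
        a[idx] = 1.0
        res.append(a)
    return res

def reduce_hmm_observable(P, C, S, p):
    E = reachable(P, S)                    # E = R, since R ∩ N = {0}
    A = generated_algebra(E / p[:, None])  # A = alg(p^{-1} ^ E)
    a = resolution(A)
    R = np.array(a)                        # stochastic reduction, rows a_j^T
    J = np.column_stack([(p * aj) / (p @ aj) for aj in a])  # injection
    return R @ P @ J, C @ J, R             # reduced (P, C) and the map R
```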
Theorem 1. For any choice of $\mathcal{E}$ and any positive $\mathbf{p}$, i.e. $p_i > 0\ \forall i$, Algorithm 1 provides a solution to Problem 1.
Proof. To prove the statement we have to show that: i) the reduced model $\check\theta = (\check P, \check C)$ and the linear map $R$ provide the same marginal distribution at any time as the original model; ii) the reduced model $\check\theta$ is an HMM, and $R\mathbf{p}_0$ is a probability vector.
We start by proving the first point, leveraging Corollary 2. First of all, for any vector $\mathbf{p}$ such that $p_i > 0$ for all $i$, the inner product $\langle \cdot, \cdot \rangle_{\mathbf{p}}$ is positive-definite and thus well defined. Moreover, by definition of the algebra $\mathcal{A}$, for any choice of the effective subspace $\mathcal{E}$ it holds that $\mathcal{E} \subseteq \mathbf{p} \wedge \mathcal{A}$; so, by choosing $\mathcal{V} = \mathbf{p} \wedge \mathcal{A}$ and using the reduction and injection maps defined in equation (4), i) follows from Corollary 2 in case $E^T_{|\mathcal{A},\mathbf{p}}(\mathcal{R} \cap \mathcal{N}) \subseteq \mathcal{R} \cap \mathcal{N}$. If $E^T_{|\mathcal{A},\mathbf{p}}(\mathcal{R} \cap \mathcal{N}) \not\subseteq \mathcal{R} \cap \mathcal{N}$, pick $\tilde{\mathcal{N}} = \{0\}$ so that $\mathcal{R} \cap \tilde{\mathcal{N}} = \{0\}$, and Theorem 4 applies with $\mathcal{V} = \mathbf{p} \wedge \mathcal{A}$, $\mathcal{A} = \operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{R})$.
Regarding ii), if $\mathcal{A}$ is unital, then Proposition 2 ensures that $J$ and $R$ are stochastic, and thus $RPJ$ and $CJ$ are stochastic and $R\mathbf{p}_0$ is a probability vector for any probability vector $\mathbf{p}_0$. If $\mathcal{A}$ is not unital, by Corollary 1 we have that $J$ is stochastic (and thus $CJ$ is stochastic) but $R$ is only stochastic over $\operatorname{supp}(\mathcal{A})$, i.e. $\mathbf{1}_d^T R = \mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T$. We next show that this condition is sufficient to prove that the reduced model is stochastic.
We first notice that $\operatorname{supp}(\mathcal{E}) = \operatorname{supp}(\mathcal{A}) \subsetneq \mathbb{R}^n$. Assume that $\dim(\operatorname{supp}(\mathcal{E})) = k$. Then we can consider a permutation (that is, a doubly-stochastic change of basis) $T$ such that $T\mathbf{x} = \begin{bmatrix} \mathbf{x}_1^T & \mathbf{0}_{n-k}^T \end{bmatrix}^T$ for all $\mathbf{x} \in \mathcal{E}$, with
$$T P T^T = \begin{bmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{bmatrix}$$
and thus $P_{21} = 0$. This shows that $\operatorname{supp}(\mathcal{E})$ is $P$-invariant. Since $P$, $T$ and $T^T$ are stochastic, $T P T^T$ is also stochastic. This implies that $\mathbf{1}_k^T P_{11} = \mathbf{1}_k^T$. Then it holds that $\mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T T^T\, T P T^T = \mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T T^T$ or, in other words, $\mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T P = \mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T$. We can then verify that $\check P$ is stochastic through the following chain of equalities: $\mathbf{1}_d^T R P J = \mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T P J = \mathbf{1}_{\operatorname{supp}(\mathcal{A})}^T J = \mathbf{1}_d^T$, where the last equality comes from Corollary 1. Finally, to prove that $R\mathbf{p}_0$ is a probability vector we can observe that
$$\mathbf{1}_n^T \mathbf{p}_0 = \mathbf{1}^T \Pi_{\mathcal{E}}\, \mathbf{p}_0 + \underbrace{\mathbf{1}^T \Pi_{\mathcal{R}\cap\mathcal{N}}\, \mathbf{p}_0}_{=0} = 1$$
and then reuse the reasoning above.
Remark 9. In the proof of Theorem 1 we stated that a positive vector $\mathbf{p}$ is necessary to have a well-defined inner product $\langle \cdot, \cdot \rangle_{\mathbf{p}}$. This assumption can however be relaxed to the following: $\mathbf{p}$ is positive over $\operatorname{supp}(\mathcal{E}) = \operatorname{supp}(\mathcal{A})$, i.e. $p_i > 0$ for all $i$ such that $e_i^T \mathbf{x} \neq 0$ for some $\mathbf{x} \in \mathcal{E}$. This is due to the fact that the values of $\mathbf{p}$ where $\mathcal{E}$ has no support play no role in the projection.
Although such a $\mathbf{p}$ defines a positive semi-definite inner product over $\mathbb{R}^n$, it provides a positive definite inner product over $\operatorname{supp}(\mathcal{A})$, and this is sufficient to define the orthogonal projection onto $\mathcal{A}$. Consider, for example, the following case: assume $\operatorname{supp}(\mathcal{A}) \subsetneq \mathbb{R}^n$, let $\mathbf{p}_s$ be a positive vector over $\operatorname{supp}(\mathcal{A})$ and $\mathbf{p}_n$ a positive vector over the remaining support, i.e. such that $\mathbf{p} := \mathbf{p}_s + \mathbf{p}_n$ has $\operatorname{supp}(\mathbf{p}) = \mathbb{R}^n$. We can then notice that $\mathbf{p} \wedge \mathbf{x} = \mathbf{p}_s \wedge \mathbf{x}$ and $\langle \mathbf{y}, \mathbf{x} \rangle_{\mathbf{p}} = \langle \mathbf{y}, \mathbf{x} \rangle_{\mathbf{p}_s}$ for all $\mathbf{x} \in \operatorname{supp}(\mathcal{A})$ and $\mathbf{y} \in \mathbb{R}^n$. This implies that $E_{|\mathcal{A},\mathbf{p}} = E_{|\mathcal{A},\mathbf{p}_s}$. The role of the positivity of $\mathbf{p}$ will be further discussed in Section VII.

VI. MULTI-TIME SOLUTION
The solution of Problem 2 follows the same ideas presented in the previous section. In fact, the algorithm we propose to solve Problem 2 is identical to the previous algorithm except for the involved subspaces. The method takes the form of the following algorithm, where $\Gamma$ is defined as in the previous section.

Algorithm 2: HMM reduction for Problem 2
Input: $(P, C)$, $S$. Parameters: $\mathbf{p}$, $\Gamma$.
1 Compute $\mathcal{R}_C$ and $\mathcal{N}_C$ using equations (14) and (13);
2 Compute $\mathcal{E}_C = \Gamma(\mathcal{R}_C, \mathcal{N}_C)$;
3 Compute $\mathcal{A}_C := \operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{E}_C)$;
4 Compute $E^T_{|\mathcal{A}_C,\mathbf{p}}$ using equation (3);
5 If $E^T_{|\mathcal{A}_C,\mathbf{p}}(\mathcal{R}_C \cap \mathcal{N}_C) \not\subseteq \mathcal{R}_C \cap \mathcal{N}_C$: redefine $\mathcal{A}_C := \operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{R}_C)$ and recompute $E^T_{|\mathcal{A}_C,\mathbf{p}}$;
6 Compute the factors $R$ and $J$ of $E^T_{|\mathcal{A}_C,\mathbf{p}}$ with the definition given in equation (4);
Output: $(\check P, \check C) = (RPJ, CJ)$ and $R$.
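For completeness, a sketch of how the conditioned subspaces entering step 1 can be computed: by Lemma 4, $\mathcal{R}_C$ is the smallest subspace containing $\Phi$ and invariant under the propagators $P_C^{y} = \operatorname{diag}(e_y^T C)P$, while $\mathcal{N}_C$ is the annihilator of the smallest subspace containing $\mathbf{1}$ and invariant under $P^T$ and the $\operatorname{diag}(e_y^T C)$. The deflation to an orthonormal basis at each round is a practical shortcut that avoids enumerating all strings.

```python
import numpy as np
from scipy.linalg import orth, null_space

def conditioned_reachable(P, C, S):
    """R_C: close the span of Phi under all the maps diag(e_y^T C) P."""
    m = C.shape[0]
    B = orth(np.column_stack([np.diag(C[y]) @ p0 for y in range(m) for p0 in S]))
    while True:
        new = [np.diag(C[y]) @ (P @ B[:, j])
               for y in range(m) for j in range(B.shape[1])]
        B_new = orth(np.column_stack([B] + new))
        if B_new.shape[1] == B.shape[1]:
            return B_new
        B = B_new

def conditioned_nonobservable(P, C):
    """N_C: annihilator of the closure of 1 under P^T and diag(e_y^T C)."""
    n, m = P.shape[0], C.shape[0]
    W = np.ones((n, 1)) / np.sqrt(n)
    while True:
        new = [P.T @ W[:, j] for j in range(W.shape[1])]
        new += [np.diag(C[y]) @ W[:, j] for y in range(m) for j in range(W.shape[1])]
        W_new = orth(np.column_stack([W] + new))
        if W_new.shape[1] == W.shape[1]:
            return null_space(W_new.T)
        W = W_new
```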
We are finally ready to prove that Algorithm 2 solves the multi-time marginal problem.

Theorem 2. For any choice of $\mathcal{E}_C$ and any positive $\mathbf{p}$, i.e. $p_i > 0\ \forall i$, Algorithm 2 provides a solution to Problem 2.
Proof. The proof follows the lines of the proof of Theorem 1. In fact, the proof that the reduced HMM $\check\theta$ is stochastic and that $R\mathbf{p}_0$ is a probability vector is identical to the one given in Theorem 1. The only difference between the two proofs regards the fact that the reduced model $\check\theta$ with initial condition $R\mathbf{p}_0$ provides the same probability of a sequence of events as the model $\theta$ with initial condition $\mathbf{p}_0$.
From Corollary 3 we have that $(R\operatorname{diag}(e_i^T C) P J, \mathbf{1}^T J)$ with initial condition $R\operatorname{diag}(e_i^T C)\,\mathbf{p}_0$ generates the same probability as the original model. Since $\mathcal{R}_C$ and $\mathcal{N}_C$ are both $P$- and $\operatorname{diag}(e_i^T C)$-invariant, Corollary 5 applies, thus leading to the reduced HMM $\check\theta = (RPJ, CJ)$ and initial conditions $R\mathbf{p}_0$.

VII. CHOOSING THE ALGORITHM'S PARAMETERS
In this section, we discuss the best choice of the parameters for Algorithms 1 and 2. The structure of the two algorithms being identical, we only discuss the optimal choice of $\mathcal{E}$ and $\mathbf{p}$: the results extend directly to $\mathcal{E}_C$. The notion of optimality is related to the dimension of the reduced system: we want to find a choice of $\mathcal{E}$ and positive $\mathbf{p}$ such that the reduced model returned by Algorithm 1 has minimal dimension. This is equivalent to finding $\mathcal{E}$ and $\mathbf{p}$ such that $\operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{E})$ has minimal dimension.

A. Optimal distributions for observable HMMs
We start the discussion by finding the optimal choice of $\mathbf{p}$, assuming that an effective subspace $\mathcal{E}$ is given. Before proving the main result of this section, we state the following useful result.

Lemma 6. Given a vector space $\mathcal{W} \subseteq \mathbb{R}^n$ with generators $\{\mathbf{w}_i\}$, $\mathcal{W} = \operatorname{span}\{\mathbf{w}_i\}$, there exists a vector $\mathbf{w} := \sum_i \lambda_i \mathbf{w}_i$, with $\lambda_i \neq 0$ for all $i$, such that $\operatorname{supp}(\mathbf{w}) = \operatorname{supp}(\mathcal{W})$.
The proof of this lemma can be found in Appendix B.

Theorem 3. Consider a vector space $\mathcal{W} \subseteq \mathbb{R}^n$ and a vector $\mathbf{w}$ as in Lemma 6. Then there exists a unique algebra $\mathcal{A}^\ast$ of minimal dimension such that $\mathcal{W} \subseteq \mathbf{x} \wedge \mathcal{A}^\ast$ for some $\mathbf{x} \in \mathbb{R}^n$. Moreover, $\mathcal{A}^\ast = \operatorname{alg}(\mathbf{w}^{-1} \wedge \mathcal{W})$ and it is unital over the support of $\mathcal{W}$, i.e. $\mathbf{1}_{\operatorname{supp}(\mathcal{W})} \in \mathcal{A}^\ast$.
Proof. The existence of such a $\mathbf{w}$ is proved in Lemma 6. Since $\mathcal{A} = \mathbb{R}^n$ satisfies $\mathcal{W} \subseteq \mathbf{x} \wedge \mathcal{A}$ for all $\mathbf{x} \in \mathbb{R}^n$, and its possible subalgebras are finite in number (corresponding to the partitions of $n$), $\mathcal{A}^\ast$ exists. To prove that it is unique we proceed by contradiction. Assume that there exist two different algebras $\mathcal{A}, \mathcal{B} \subseteq \mathbb{R}^n$ of minimal dimension $\dim(\mathcal{A}) = \dim(\mathcal{B})$ and two vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$ such that $\mathcal{W} \subseteq \mathbf{a} \wedge \mathcal{A}$ and $\mathcal{W} \subseteq \mathbf{b} \wedge \mathcal{B}$. From Proposition 1 we know that $\mathcal{A} = \operatorname{span}\{a_j\}$ and $\mathcal{B} = \operatorname{span}\{b_j\}$, where $\{a_i\}$ and $\{b_i\}$ are the finest resolutions in $\operatorname{idem}(\mathcal{A})$ and $\operatorname{idem}(\mathcal{B})$ respectively. Clearly, if $a_i = b_i$ for all $i$ then $\mathcal{A} = \mathcal{B}$, which yields a contradiction. Therefore, we assume that there exists an index $j$ such that $a_j \neq b_i$ for all $i$. We can then notice that every $\mathbf{v} \in \mathcal{W}$ can be written as $\mathbf{v} = \sum_i \mu_i\, \mathbf{a} \wedge a_i = \sum_i \nu_i\, \mathbf{b} \wedge b_i$. For $j$ such that $a_j \neq b_i$ for all $i$ we can then write
$$a_j \wedge \mathbf{v} = \mu_j\, \mathbf{a} \wedge a_j = \sum_i \nu_i\, a_j \wedge \mathbf{b} \wedge b_i.$$
The first equality implies that, over the support of each $a_j$, every $\mathbf{v}$ must be proportional to $\mathbf{a} \wedge a_j$. The second equality, on the other hand, due to the fact that $a_j \neq b_i$, implies that at least two of the products $a_j \wedge \mathbf{b} \wedge b_i$ must be non-zero. In order for the non-trivial sum to always be proportional to $\mathbf{a} \wedge a_j$, the coefficients $\nu_i$ must always appear in a fixed ratio. Hence, the corresponding $b_i$ can be substituted by their sum and still generate the full $\mathcal{W}$ when multiplied by a suitable vector $\mathbf{b}$. This shows that $\mathcal{B}$ could not be a minimal algebra unless $a_i = b_i$ for all $i$, up to a reordering.

Let then $\mathcal{A}^\ast$ be the unique algebra of minimal dimension such that $\mathcal{W} \subseteq \mathbf{x} \wedge \mathcal{A}^\ast$ for some $\mathbf{x}$. From Proposition 1 we know that $\mathcal{A}^\ast = \operatorname{span}\{a_j\}$, where $\{a_j\}$ is the finest resolution in $\operatorname{idem}(\mathcal{A}^\ast)$. In particular, $\{a_j\}$ forms an orthogonal basis for $\mathcal{A}^\ast$ and its elements have complementary, mutually orthogonal supports, i.e. $\operatorname{supp}(a_k) \perp \operatorname{supp}(a_j)$ for $k \neq j$ and $\sum_j a_j = \mathbf{1}_{\operatorname{supp}(\mathcal{W})}$. We can then observe that $\mathbf{x} \wedge \mathcal{A}^\ast = \operatorname{span}\{\mathbf{x} \wedge a_j\}$ and that the vectors $\mathbf{x} \wedge a_j$ have complementary, mutually orthogonal supports. Then, for $\mathcal{W} \subseteq \mathbf{x} \wedge \mathcal{A}^\ast$ to hold, it must be that $\mathbf{w} = \sum_j \mu_j\, \mathbf{x} \wedge a_j$ for all $\mathbf{w} \in \mathcal{W}$. By the above discussion we can write $\mathbf{w}_i = \sum_j \mu_j^i\, \mathbf{x} \wedge a_j$ for each generator of $\mathcal{W}$. Notice that, for all $j$, $\mu_j^i \neq 0$ for at least one $i$. Using the definition of $\mathbf{w}$ given in the statement and substituting the form of the $\mathbf{w}_i$ just reported, we obtain $\mathbf{w} = \sum_j \sigma_j\, \mathbf{x} \wedge a_j$ with $\sigma_j = \sum_i \lambda_i \mu_j^i$. From the argument above, from the fact that $\lambda_i \neq 0$ for all $i$, and from the fact that, by hypothesis, $\mathbf{w}$ has maximal support, we have that $\sigma_j \neq 0$ for all $j$. Because of the structure of $\{\mathbf{x} \wedge a_j\}$ we have that $(a_j \wedge \mathbf{w})^{-1} = a_j \wedge \mathbf{w}^{-1} = \sigma_j^{-1} (\mathbf{x} \wedge a_j)^{-1} = \sigma_j^{-1}\, \mathbf{x}^{-1} \wedge a_j$, and thus $\mathbf{w}^{-1} = \sum_j \sigma_j^{-1}\, \mathbf{x}^{-1} \wedge a_j$. From this we have that the vector space $\mathbf{w}^{-1} \wedge \mathcal{W}$ is generated by vectors of the type $\mathbf{w}^{-1} \wedge \mathbf{w}_i = \sum_j \sigma_j^{-1} \mu_j^i\, a_j$. This proves that $\mathbf{w}^{-1} \wedge \mathcal{W} \subseteq \mathcal{A}^\ast$ and that any vector $\mathbf{v} \in \mathbf{w}^{-1} \wedge \mathcal{W}$ can be written as $\mathbf{v} = \sum_j c_j a_j$. Considering any two vectors $\mathbf{v}, \mathbf{u} \in \mathbf{w}^{-1} \wedge \mathcal{W}$ and computing their $\wedge$-product, we obtain $\mathbf{v} \wedge \mathbf{u} = \sum_j c_j d_j\, a_j \in \mathcal{A}^\ast$. This implies that $\operatorname{alg}(\mathbf{w}^{-1} \wedge \mathcal{W}) \subseteq \mathcal{A}^\ast$. On the other hand, it trivially holds that $\mathcal{W} \subseteq \mathbf{w} \wedge \operatorname{alg}(\mathbf{w}^{-1} \wedge \mathcal{W})$. But then, since we assumed that $\mathcal{A}^\ast$ was the unique algebra of minimal dimension such that $\mathcal{W} \subseteq \mathbf{x} \wedge \mathcal{A}^\ast$ for some $\mathbf{x}$, it must hold that $\operatorname{alg}(\mathbf{w}^{-1} \wedge \mathcal{W}) = \mathcal{A}^\ast$.
Finally, since $\mathbf{w} \in \mathcal{W}$, then $\mathbf{w}^{-1} \wedge \mathbf{w} = \mathbf{1}_{\operatorname{supp}(\mathcal{W})} \in \mathcal{A}^\ast$, which proves that $\mathcal{A}^\ast$ is unital over the support of $\mathcal{W}$.

Remark 10. Theorem 3 shows that, given any choice of the effective subspace, we can construct a vector $\mathbf{w}$ such that the algebra $\operatorname{alg}(\mathbf{w}^{-1} \wedge \mathcal{E})$ has minimal dimension. However, not all such $\mathbf{w}$ are positive over the support of $\mathcal{E}$. As a matter of fact, it may happen that some choices of $\mathcal{E}$ do not contain any non-negative vector, while having $\mathbf{w} = \mathbf{p}$ non-negative is fundamental to construct a stochastic reduction.
Theorem 3 is nonetheless sufficient to determine the optimal reduction for a class of HMMs, namely those for which $\mathcal{R}$ is "observable", i.e. $\mathcal{R} \cap \mathcal{N} = \{0\}$.

Proposition 4. Let $\{\mathbf{r}_i\}$ be a set of $N$ positive generators of $\mathcal{R}$ and let $\mathbf{p} := \sum_i \mathbf{r}_i / N$. Then, if $\mathcal{R} \cap \mathcal{N} = \{0\}$, $\mathcal{A} := \operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{R})$ provides the optimal reduction.

Proof. By hypothesis we have $\mathcal{R} \cap \mathcal{N} = \{0\}$. This implies that $\mathcal{E} = \mathcal{R}$. Then, using Theorem 3, we have that $\mathbf{p} = \sum_i \mathbf{r}_i / N$ provides the minimal dimension for $\operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{R})$ and thus the optimal reduction.

Notice that this result applies in particular to fully observable HMMs, i.e. when the pair $(P, C)$ is observable, and thus to finite-state Markov chains. In fact, the latter can be seen as HMMs with $C = I$. The corresponding optimal reduction is then a maximally-lumped version of the original process [25].

B. Effective subspace for the general case
In order to address the general case, in addition to a distribution $\mathbf{p}$ we also need to choose an effective subspace. Example 2 below illustrates that not all effective subspaces are equivalent: different choices lead to different dimensions for the reduced model, making this choice critical towards the optimality of the reduction. A natural candidate effective subspace is $\mathcal{E}_\perp$, the orthogonal complement (with respect to the natural inner product $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \mathbf{y}$) of $\mathcal{R} \cap \mathcal{N}$ in $\mathcal{R}$. Let then $\{\varepsilon_i\}_{i=1,\dots,d}$ be a set of generators of $\mathcal{E}_\perp$. Any choice of the effective subspace can then be described as $\mathcal{E} = \operatorname{span}\{\varepsilon_i + \mathbf{n}_i\}_{i=1,\dots,d}$, where $\{\mathbf{n}_i\}_{i=1,\dots,d}$ is a set of vectors in $\mathcal{N}$.
We next show that the choice of the orthogonal complement $\mathcal{E}_\perp$ always allows for finding a positive vector $\mathbf{w} = \mathbf{p}$ as in the statement of Theorem 3, and hence a valid stochastic reduction. The following proposition is instrumental to this aim.

Proposition 5. Let $\mathbf{p} \in \mathbb{R}^n$ be a probability vector, and let $\mathcal{V}$ be a vector space such that $\mathbf{1}^T \mathbf{v} = 0$ for all $\mathbf{v} \in \mathcal{V}$. Let $\Pi_{\mathcal{V}}$ be the orthogonal projector onto $\mathcal{V}$ with respect to the standard inner product $\langle \cdot, \cdot \rangle$. Then $\mathbf{q} := \mathbf{p} - \Pi_{\mathcal{V}}\, \mathbf{p}$ is a probability vector.
Proof. Let us start by defining $\mathbf{w} := \mathbf{1}/2 - \mathbf{p}$. We can then write $\mathbf{p} = \mathbf{1}/2 - \mathbf{w}$ and notice that $p_i \in [0,1]$ if and only if $-1/2 \leq w_i \leq 1/2$, that is, if and only if $\|\mathbf{w}\|_\infty \leq 1/2$. Moreover, we have that $\mathbf{1}^T \mathbf{p} = 1$ if and only if $\mathbf{1}^T \mathbf{w} = (n-2)/2$. We can then compute $\mathbf{q}$:
$$\mathbf{q} = \mathbf{1}/2 - \mathbf{w} - \Pi_{\mathcal{V}}\, \mathbf{1}/2 + \Pi_{\mathcal{V}}\, \mathbf{w} = \mathbf{1}/2 - (I - \Pi_{\mathcal{V}})\, \mathbf{w} = \mathbf{1}/2 - \Pi_{\mathcal{V}^\perp}\, \mathbf{w},$$
where we used the hypothesis $\mathbf{1} \perp \mathcal{V}$, which implies $\Pi_{\mathcal{V}}\, \mathbf{1} = \mathbf{0}$. Since $\Pi_{\mathcal{V}^\perp}$ is an orthogonal projection, and thus a contraction in norm, we have that $\|\Pi_{\mathcal{V}^\perp}\, \mathbf{w}\|_\infty \leq \|\mathbf{w}\|_\infty$. Then, using the argument above, $\mathbf{q}$ is a non-negative vector with $q_i \in [0,1]$. Lastly, we have that $\mathbf{1}^T \mathbf{q} = \mathbf{1}^T \mathbf{1}/2 - \mathbf{1}^T \mathbf{w} = n/2 - (n-2)/2 = 1$.
The result we are after is then obtained as a corollary of the previous one.
Corollary 4. Let $\mathcal{E}_\perp$ be the orthogonal complement of $\mathcal{R} \cap \mathcal{N}$ in $\mathcal{R}$. Let $\{\mathbf{r}_i\}$ be a set of $N$ probability vectors such that $\mathcal{R} = \operatorname{span}\{\mathbf{r}_i\}$. Then the vectors $\varepsilon_i := \mathbf{r}_i - \Pi_{\mathcal{R}\cap\mathcal{N}}\, \mathbf{r}_i$ are such that $\mathcal{E}_\perp = \operatorname{span}\{\varepsilon_i\}$. Moreover, $\varepsilon = \sum_i \varepsilon_i / N$ satisfies $\operatorname{supp}(\varepsilon) = \operatorname{supp}(\mathcal{E}_\perp)$ and $\varepsilon_i \geq 0$ for all $i$.
Proof. From Lemma 2 we have that $\mathbf{1}^T \mathbf{x} = 0$ for all $\mathbf{x} \in \mathcal{N}$, and thus, by applying the proposition above to every generator of $\mathcal{R}$, we have that $\{\varepsilon_i\}$ is a set of probability vectors. Being a convex combination of probability vectors, $\varepsilon$ is itself a probability vector and it shares the same support as $\mathcal{E}_\perp$.
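The construction of Corollary 4 is immediate to implement; a brief sketch (assuming an orthonormal basis of $\mathcal{R} \cap \mathcal{N}$ is available, e.g. from the subspace computations sketched earlier):

```python
import numpy as np

def eperp_generators(R_gens, RN_basis):
    """R_gens: columns are probability vectors spanning R; RN_basis: orthonormal
    columns spanning R ∩ N. Returns the eps_i and their average eps."""
    Pi = RN_basis @ RN_basis.T       # orthogonal projector onto R ∩ N
    eps = R_gens - Pi @ R_gens       # eps_i = r_i - Pi_{R∩N} r_i, nonnegative
    return eps, eps.mean(axis=1)     # eps has full support on E_perp
```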
Other choices are possible, and the choice of the effective subspace can influence the dimension of the reduced model, as illustrated in the following example.
Example 2. Consider subspaces $\mathcal{R}, \mathcal{N} \subset \mathbb{R}^4$ such that $\mathcal{R} \cap \mathcal{N} = \mathcal{N}$. Let us denote by $\mathcal{E}_\perp$ the orthogonal complement of $\mathcal{R} \cap \mathcal{N}$ in $\mathcal{R}$. We can easily notice that $\mathcal{E}_\perp$ is a unital algebra. Let us now consider another completion $\mathcal{E}$ of $\mathcal{R} \cap \mathcal{N}$ to $\mathcal{R}$: in general, its generators can be written in terms of some values $a, b \in \mathbb{R}$. We can then consider two cases. First, if $a = 0$ and $b \neq 0$, we can choose $\mathbf{v} = \begin{bmatrix} 1 & 1 & 1+b & 1-b \end{bmatrix}^T$, thus obtaining $\operatorname{alg}(\mathbf{v}^{-1} \wedge \mathcal{E}) = \mathcal{E}_\perp$. On the other hand, if $a \neq 0$ and $b \neq 0$, we can choose $\mathbf{w} = \begin{bmatrix} 1 & 1 & a+1+b & -a+1-b \end{bmatrix}^T$ (assuming that $a + b \neq \pm 1$), thus obtaining a larger algebra.
The corresponding stochastic reduction and injection matrices are R " 1 T and J " p which provide the (trivial) reduced model: We next focus on the multi-time marginal problem.We have that N C " N , while the conditioned-reachable is equal to: This implies that the intersection R C X N C " N C and thus: In this case we are only interested in the single-time marginal problem.We can notice that R " R n and thus Then we can consider the effective subspace as the orthogonal complement of N , that is / / -.
Then we can define p :" 1{4 to obtain that leads to the reduced model Suppose that, instead of the orthogonal complement, we were to consider the following space as an effective subspace: We can immediately notice that there is no convex combination of the generators of E such that it is positive, however, if we consider v " " 8 10{3 14 ´28{3 ‰ T we have that / /thus showing that a smaller algebra could be found for the reduction, if we were to consider vectors that are not nonnegative.

IX. CONCLUSIONS AND FUTURE WORK
In this work, we exploited system-theoretic ideas and algebraic representations of probability spaces to obtain effective reductions of HMMs that preserve the marginals of the original output process, in either the single- or multi-time case. While optimal reductions are explicitly characterized for a class of HMMs, including observable ones, the freedom of choice in the effective subspace makes finding the optimal reductions more challenging in the general case. Nonetheless, we provide an algorithm that produces reduced HMMs of minimal dimension in all considered examples. Based on the analytical and numerical examples we examined, we formulate the following conjecture on the optimality of the natural orthogonal complement.
Conjecture 1. Let $\mathcal{E}_\perp$ be defined as the (standard) orthogonal complement of $\mathcal{N} \cap \mathcal{R}$ in $\mathcal{R}$, and let $\mathbf{p}$ be defined as in Corollary 4. Then, given any other choice of $\mathcal{E}$ and non-negative $\mathbf{w}$, it holds that
$$\dim(\operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{E}_\perp)) \leq \dim(\operatorname{alg}(\mathbf{w}^{-1} \wedge \mathcal{E})).$$
Remark 11. If the effective subspace is already an algebra with respect to a $\mathbf{p}$-inner product, then $\dim(\mathcal{E}_\perp) = \dim(\operatorname{alg}(\mathbf{p}^{-1} \wedge \mathcal{E}_\perp))$; since $\dim(\mathcal{E}_\perp) = \dim(\mathcal{E})$ and $\dim(\mathcal{E}) \leq \dim(\operatorname{alg}(\mathbf{w}^{-1} \wedge \mathcal{E}))$ by Theorem 3, the choice of $\mathcal{E}_\perp$ is then optimal. Also notice that removing the assumption that $\mathbf{w}$ is non-negative makes the statement false: a counterexample is presented at the end of Example 4. However, having $\mathbf{w}$ non-negative is necessary in order to obtain a stochastic model.
Proving the conjectured minimality may require novel mathematical ideas: the choice of the $\mathcal{E}$, $\mathcal{E}_C$ that minimize the size of the generated algebras is equivalent to identifying the representative of the quotient space that can be described with the least number of indicator vectors, and a way to relate this notion to orthogonality to $\mathcal{N}$ does not seem straightforward to find.
Other natural developments of the proposed framework include a relaxation of the method so that it allows for approximate preservation of the marginals, thus yielding reductions in practical situations where noise and partial knowledge might make the exact equivalence we require in this work too stringent, due to the fact that controllable pairs are a dense set [36]. In addition, in many algorithms used to estimate HMMs from data, e.g. [34], the dimension of the "hidden" state space (i.e. $n$) is assumed to be known. When this is not the case, one could estimate an HMM with a larger than necessary number of hidden variables, and then use an approximate reduction scheme to reduce the estimated model to one of more manageable size. Future work will also be devoted to the adaptation and application of the method to approximate coarse-graining of large-scale systems, to address otherwise untreatable problems [16]-[18].
The algebraic approach also naturally extends to the noncommutative domain, and our method will be extended to quantum systems, in particular quantum walks and open systems in general. Analogies between HMMs and quantum walks have already been noted in [37], as well as in [38] and [39], which extend the results of [19] to include quantum walks. Lastly, the algebraic viewpoint makes our results potentially interesting towards the solution of outstanding open problems in realization theory and model reduction for positive systems [40].

APPENDIX A MODEL REDUCTION FOR SWITCHED AUTONOMOUS SYSTEMS
This appendix is dedicated to introducing a general condition ensuring exact model reduction for switched autonomous systems, denoted by the triplet $(\{F_i\}, H, I)$. The evolution at any time clearly depends on the sequence of evolutions $F_i$ activated; both the single-time and the multi-time marginal dynamics can be described by systems of this type. Let us denote with $y(s_{0:l})$ the output of the system at time $l$ associated to a sequence $s_{0:l} = s_l, \ldots, s_0$ of length $l$ of selected evolutions $F_{s_k}$. The output at any time $l > 0$ can be computed as $y(s_{0:l}) = H \prod_{j=l}^{0} F_{s_j} x_0$, while for $l = 0$ we have $y(0) = H x_0$.
Let $\mathcal{R} \subseteq \mathbb{R}^n$ be a linear subspace such that $I \subseteq \mathcal{R}$ and $\mathcal{R}$ is $F_i$-invariant, i.e. $F_i \mathcal{R} \subseteq \mathcal{R}$, for all $i$. Let $\tilde{\mathcal{N}} \subseteq \mathbb{R}^n$ be a linear subspace such that $\tilde{\mathcal{N}} \subseteq \ker H$ and $\tilde{\mathcal{N}}$ is $F_i$-invariant, i.e. $F_i \tilde{\mathcal{N}} \subseteq \tilde{\mathcal{N}}$, for all $i$. Let us then define $\mathcal{E}$ to be any completion of $\mathcal{R} \cap \tilde{\mathcal{N}}$ to $\mathcal{R}$, i.e. $\mathcal{R} = (\mathcal{R} \cap \tilde{\mathcal{N}}) \oplus \mathcal{E}$.
Theorem 4. Consider any subspace $\mathcal{V}$ such that $\mathcal{E} \subseteq \mathcal{V}$ with $m = \dim(\mathcal{V})$, and let $\Pi_{\mathcal{V}}$ be the orthogonal projection onto $\mathcal{V}$ with respect to an inner product $\langle \cdot, \cdot \rangle$. Assume that $\Pi_{\mathcal{V}}(\mathcal{R} \cap \tilde{\mathcal{N}}) \subseteq \mathcal{R} \cap \tilde{\mathcal{N}}$, and let $R : \mathbb{R}^n \to \mathbb{R}^m$ and $J : \mathbb{R}^m \to \mathcal{V}$ be two factors of the orthogonal projection, $\Pi_{\mathcal{V}} = JR$.
Let us consider the reduced model $(\{\check{F}_i\}, \check{H}, \check{I}) = (\{R F_i J\}, HJ, RI)$. Then the reduced model reproduces the same output as the original model, i.e.
$$\check{H} \prod_{j=l}^{0} \check{F}_{s_j} \check{x}_0 = H \prod_{j=l}^{0} F_{s_j} x_0$$
for any sequence $s_{0:l}$ and any initial condition $x_0 \in \mathrm{span}\{I\}$, with the corresponding $\check{x}_0 = R x_0$.
Proof. Let $\mathcal{W}_1$ be the completion of $\mathcal{R} \cap \tilde{\mathcal{N}}$ to $\tilde{\mathcal{N}}$, i.e. $\tilde{\mathcal{N}} = \mathcal{W}_1 \oplus (\mathcal{R} \cap \tilde{\mathcal{N}})$; let $\mathcal{W}_2$ be the completion of $(\mathcal{R} \cap \tilde{\mathcal{N}}) \oplus \mathcal{E} \oplus \mathcal{W}_1$ to $\mathbb{R}^n$, i.e. $\mathbb{R}^n = \mathcal{W}_1 \oplus \mathcal{W}_2 \oplus \mathcal{E} \oplus (\mathcal{R} \cap \tilde{\mathcal{N}})$; let $\mathcal{T}$ be the remainder subspace, such that $\mathbb{R}^n = \mathcal{V} \oplus \mathcal{T}$, and thus $\mathcal{T} \subseteq (\tilde{\mathcal{N}} \cap \mathcal{R}) \oplus \mathcal{W}_1 \oplus \mathcal{W}_2$. Let us also denote with $\Pi_{\mathcal{T}}$ the orthogonal projector onto $\mathcal{T}$ with respect to the considered inner product $\langle \cdot, \cdot \rangle$. We can notice that, for any sequence $s_{0:l}$, we have $\check{H} \prod_{j=l}^{0} \check{F}_{s_j} \check{x}_0 = H \Pi_{\mathcal{V}} \prod_{j=l}^{0} (F_{s_j} \Pi_{\mathcal{V}}) x_0$, and thus the statement can also be put in the form
$$\Big( H \prod_{j=l}^{0} F_{s_j} - H \Pi_{\mathcal{V}} \prod_{j=l}^{0} (F_{s_j} \Pi_{\mathcal{V}}) \Big) x_0 = 0$$
for any sequence $s_{0:l}$ and for any $x_0 \in I$. Since $\tilde{\mathcal{N}} \subseteq \ker H$, to prove the statement we will thus show that, for any sequence $s_{0:l}$ and for any initial condition $x_0 \in I$, it holds that
$$\prod_{j=l}^{0} F_{s_j} x_0 - \Pi_{\mathcal{V}} \prod_{j=l}^{0} (F_{s_j} \Pi_{\mathcal{V}}) x_0 \in \tilde{\mathcal{N}} \cap \mathcal{R}.$$
We will prove this statement by induction. Let us then consider the case $l = 0$: we have to prove that $[I - \Pi_{\mathcal{V}}] x_0 \in \tilde{\mathcal{N}} \cap \mathcal{R}$. Then, by noticing that $(I - \Pi_{\mathcal{V}}) x_0 = \Pi_{\mathcal{T}} x_0$ and that $\Pi_{\mathcal{T}} x_0 \in \tilde{\mathcal{N}} \cap \mathcal{R}$, the statement is proved in the case $l = 0$.
Assume then that $v := \prod_{j=l-1}^{0} F_{s_j} x_0 - \Pi_{\mathcal{V}} \prod_{j=l-1}^{0} (F_{s_j} \Pi_{\mathcal{V}}) x_0$ belongs to $\tilde{\mathcal{N}} \cap \mathcal{R}$, and we want to prove that the same holds at step $l$, i.e. that $\prod_{j=l}^{0} F_{s_j} x_0 - \Pi_{\mathcal{V}} \prod_{j=l}^{0} (F_{s_j} \Pi_{\mathcal{V}}) x_0 \in \tilde{\mathcal{N}} \cap \mathcal{R}$. By rewriting this difference as in Equation (15), we can observe that it is equal to the sum of three parts. We can then notice that each part lies in $\tilde{\mathcal{N}} \cap \mathcal{R}$: one by the inductive hypothesis on $v$ together with the $F_i$-invariance of $\tilde{\mathcal{N}} \cap \mathcal{R}$, the others because $(I - \Pi_{\mathcal{V}})$ maps $\mathcal{R}$ into $\tilde{\mathcal{N}} \cap \mathcal{R}$. Finally, since all three summands belong to $\tilde{\mathcal{N}} \cap \mathcal{R}$, their sum also belongs to the same subspace, and the statement is proved.
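Before moving on, the mechanism of Theorem 4 can be checked numerically on a toy switched system. The following sketch is illustrative only (the matrices are ad hoc choices, not taken from the paper): the third column of each $F_i$ lies in $\mathrm{span}\{e_3\}$, so $\tilde{\mathcal{N}} = \mathrm{span}\{e_3\} \subseteq \ker H$ is $F_i$-invariant, and we take $\mathcal{V} = \mathrm{span}\{e_1, e_2\}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-dimensional switched system: the third column of each F_i lies in
# span{e3}, so N~ = span{e3} is F_i-invariant, and e3 is in ker H.
F = [np.array([[0.5, 0.1, 0.0],
               [0.2, 0.4, 0.0],
               [0.3, 0.5, 0.9]]),
     np.array([[0.3, 0.6, 0.0],
               [0.1, 0.2, 0.0],
               [0.6, 0.2, 0.7]])]
H = np.array([[1.0, -1.0, 0.0]])
x0 = np.array([0.2, 0.5, 0.3])

# V = span{e1, e2}; Pi_V = J @ R with R, J the two factors of the projection.
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])            # reduction R^3 -> R^2
J = R.T                                    # injection R^2 -> V

F_red = [R @ Fi @ J for Fi in F]           # reduced evolutions R F_i J
H_red = H @ J                              # reduced output map H J
x0_red = R @ x0                            # reduced initial condition R x0

def output(Hm, Fs, seq, x):
    """y(s_{0:l}) = H F_{s_l} ... F_{s_0} x (s_0 is applied first)."""
    for s in seq:
        x = Fs[s] @ x
    return Hm @ x

for _ in range(5):                         # random switching sequences
    seq = rng.integers(0, 2, size=8)
    assert np.allclose(output(H, F, seq, x0),
                       output(H_red, F_red, seq, x0_red))
```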
In order to apply the result to our multi-time problem, we need a straightforward extension.
Corollary 5. Under the assumptions of Theorem 4, let us further assume that the $F_i$ are factorized as $F_i = D_i A$, that $\mathcal{R}$ and $\tilde{\mathcal{N}}$ are $A$-invariant and $D_i$-invariant for all $i$, and also that $I = \bigcup_i D_i S$ for some set $S$. Then the matrices $\{\check{F}_i\}$ of the reduced model can be taken to be $\check{F}_i = \check{D}_i \check{A}$, with $\check{D}_i = R D_i J$ and $\check{A} = R A J$.
The proof of this corollary follows exactly that of Theorem 4; in the induction we further leverage the fact that $\tilde{\mathcal{N}}$, $\mathcal{R}$, and thus $\tilde{\mathcal{N}} \cap \mathcal{R}$, are invariant for $\Pi_{\mathcal{V}}$, $A$ and $D_i$, for all $i$.
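Continuing the numerical sketch above (reusing `R`, `J` and NumPy), the factorized reduction of Corollary 5 can be checked on the same toy system, again with illustrative matrices: the $D_i$ are diagonal and $A$ preserves $\tilde{\mathcal{N}}$.

```python
# Factorized evolutions F_i = D_i A (setting of Corollary 5): D_i diagonal,
# A with third column in span{e3}, so N~ stays invariant for both factors.
A = np.array([[0.5, 0.1, 0.0],
              [0.2, 0.4, 0.0],
              [0.3, 0.5, 1.0]])
D = [np.diag([0.6, 0.3, 0.5]), np.diag([0.4, 0.7, 0.5])]

A_red = R @ A @ J                          # reduced A
D_red = [R @ Di @ J for Di in D]           # reduced D_i

for Di, Di_red in zip(D, D_red):
    # The factorized reduction agrees with reducing F_i = D_i A directly:
    assert np.allclose(Di_red @ A_red, R @ (Di @ A) @ J)
```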

APPENDIX B PROOFS
This Appendix collects some proofs that were omitted from the main text in order to improve its readability.
Proof of Proposition 1. Let us start with the first part of the statement. The fact that $\mathcal{A}$ is closed under linear combinations and that $\mathbb{1} \in \mathcal{A}$ follows directly from the definition of $\mathcal{A}$. The closure of $\mathcal{A}$ under element-wise product follows from the closure of $\mathcal{F}$ under the same operation. In particular, let us consider $x, y \in \mathcal{A}$; then $x \wedge y = \sum_{i,j} x_i y_j\, f_i \wedge f_j$ and, since $f_i \in \mathcal{F}$ for all $i$, $f_i \wedge f_j \in \mathcal{F}$ and thus $x \wedge y \in \mathcal{A}$. So $\mathcal{A}$ is an algebra, namely the set of $\mathcal{F}$-measurable random variables, and it is the minimal one by construction.
We can then consider the second part of the statement. First of all, notice that the vectors that are idempotent for the element-wise product are composed only of zeros and ones. Let us then consider a generic element $x \in \mathcal{A}$ and let $x_{i^*} = \max_{i=1,\ldots,n} |x_i|$. We can then compute $x' = x / x_{i^*} \in \mathcal{A}$, which has value $1$ in the positions where $x$ has value $x_{i^*}$, possibly value $-1$ in the positions where $x$ has value $-x_{i^*}$, and values in the range $(-1, 1)$ in all the other positions. We can then define $x'' = 0.5(x' + x' \wedge x') \in \mathcal{A}$, which has value $1$ in the positions where $x$ has value $x_{i^*}$ and values in the range $(-1, 1)$ in all the other positions. Finally, the first idempotent element of the desired set is $f_1 = \lim_{n \to \infty} (x'')^n \in \mathcal{A}$, with element-wise powers. Notice that $f_1$ has value $1$ in the same positions where $x'$ has value $1$, and zeros in all the others. This implies that $f_1$ is idempotent. By iterating the procedure on $x - x_{i^*} f_1$, and so on, we obtain the whole set of idempotent elements $\{f_i\} \subset \mathcal{A}$ such that $x = \sum_i x_i f_i$, up to a reordering of the coefficients $x_i$. We shall denote with $\mathrm{idemp}(x)$ the function that, given an element $x$, returns the set of idempotent elements $\{f_i\}$ that generate $x$. We then have $\mathrm{idemp}(\mathcal{A}) \supseteq \bigcup_{x \in \mathcal{A}} \mathrm{idemp}(x)$ by definition, while to prove $\mathrm{idemp}(\mathcal{A}) \subseteq \bigcup_{x \in \mathcal{A}} \mathrm{idemp}(x)$ it suffices to notice that each element of $\mathrm{idemp}(\mathcal{A})$ is also an element of $\mathcal{A}$. This implies $\mathrm{idemp}(\mathcal{A}) = \bigcup_{x \in \mathcal{A}} \mathrm{idemp}(x)$. Then, by construction, it holds that $\mathrm{span}\{\mathrm{idemp}(\mathcal{A})\} = \mathcal{A}$.
We shall then notice that $\mathcal{F} = \mathrm{idemp}(\mathcal{A})$ contains the elements $0, \mathbb{1} \in \mathcal{A}$ and is closed under the operations $\wedge$, $\vee$ and complementation. This shows that $\mathrm{idemp}(\mathcal{A})$ is a $\sigma$-algebra. Then, since $\mathcal{A} = \mathrm{span}\{\mathrm{idemp}(\mathcal{A})\}$, any element in $\mathcal{A}$ is $\mathrm{idemp}(\mathcal{A})$-measurable. Moreover, $\mathrm{idemp}(\mathcal{A})$ is minimal because subtracting any element from it would make that element (seen as a r.v.) non-measurable. We thus have $\mathcal{F} = \mathrm{idemp}(\mathcal{A})$.
Finally, $\mathrm{res}(\mathcal{F}) \subset \mathcal{F}$ is such that $f_i \wedge f_j = 0$ for all $f_i, f_j \in \mathrm{res}(\mathcal{F})$ with $i \neq j$. This implies that $\langle f_i, f_j \rangle = \delta_{i-j}$, which means that it is a set of orthogonal vectors. Moreover, $f = \vee_{f_j \in S}\, f_j$ with $S \subseteq \mathrm{res}(\mathcal{F})$ for all $f \in \mathcal{F} \setminus \{0\}$ or, equivalently, $f = \sum_j c_j f_j$ with $c_j \in \{0, 1\}$, for all $f \in \mathcal{F}$. This implies that $\mathcal{A} = \mathrm{span}\{\mathrm{res}(\mathcal{A})\}$, and thus $\mathrm{res}(\mathcal{A})$ is an orthogonal basis for $\mathcal{A}$ and $\dim(\mathcal{A}) = |\mathrm{res}(\mathcal{A})|$.
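The iterative extraction of idempotents used in this proof is constructive. A minimal NumPy sketch, with the element-wise power limit replaced by an exact 0/1 threshold, could read:

```python
import numpy as np

def idemp(x, tol=1e-9):
    """Decompose x into 0/1 idempotent vectors f_i with x = sum_i c_i f_i,
    following the iterative construction in the proof of Proposition 1."""
    x = np.asarray(x, dtype=float).copy()
    coeffs, idempotents = [], []
    while np.max(np.abs(x)) > tol:
        i_star = np.argmax(np.abs(x))
        v = x[i_star]                          # extremal value x_{i*}
        x1 = x / v                             # value 1 where x equals x_{i*}
        x2 = 0.5 * (x1 + x1 * x1)              # sends -1 to 0, fixes 1
        f = np.isclose(x2, 1.0).astype(float)  # limit of element-wise powers
        coeffs.append(v)
        idempotents.append(f)
        x = x - v * f                          # strip the extracted component
    return coeffs, idempotents

# Example: [3, -2, 3, 5] = 5*[0,0,0,1] + 3*[1,0,1,0] - 2*[0,1,0,0]
print(idemp([3.0, -2.0, 3.0, 5.0]))
```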
Proof of Lemma 1. First of all, note that the modified inner product can be written in many equivalent forms: $\langle v, w \rangle_p = \mathbb{E}_p[v \wedge w] = \langle p, v \wedge w \rangle = \langle p \wedge v, w \rangle = \langle v, p \wedge w \rangle$.
Let us then consider $\mathbb{E}_{|\mathcal{A},p}$. We can notice that $\mathrm{image}(\mathbb{E}_{|\mathcal{A},p}) = \mathcal{A}$ and that $\mathbb{E}^2_{|\mathcal{A},p} = \mathbb{E}_{|\mathcal{A},p}$. It remains to prove that $\mathbb{E}_{|\mathcal{A},p}$ is self-adjoint with respect to the inner product $\langle \cdot, \cdot \rangle_p$, that is, $\langle v, \mathbb{E}_{|\mathcal{A},p} w \rangle_p = \langle \mathbb{E}_{|\mathcal{A},p} v, w \rangle_p$. Such an equality can be rewritten, using the equivalent forms of the modified inner product above, as $\langle v, p \wedge \mathbb{E}_{|\mathcal{A},p} w \rangle = \langle p \wedge \mathbb{E}_{|\mathcal{A},p} v, w \rangle$. That is equivalent to proving that $p \wedge \mathbb{E}_{|\mathcal{A},p}$ is self-adjoint with respect to the standard inner product, which can be verified by simply computing it.
Identical reasoning can be applied to $\mathbb{E}^T_{|\mathcal{A},p}$. Note that $\mathrm{image}(\mathbb{E}^T_{|\mathcal{A},p}) = p \wedge \mathcal{A}$, that $(\mathbb{E}^T_{|\mathcal{A},p})^2 = \mathbb{E}^T_{|\mathcal{A},p}$, and that $p^{-1} \wedge \mathbb{E}_{|\mathcal{A},p}$ is self-adjoint with respect to the standard inner product, and the statement is proved.
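Both ingredients of this proof are easy to verify numerically. The sketch below is illustrative: the conditional expectation is built explicitly from the atoms of a partition-generated algebra (an assumed example, not the general construction of the paper), and we check the equivalent forms of $\langle \cdot, \cdot \rangle_p$ together with idempotency and self-adjointness.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
p = np.array([0.1, 0.2, 0.3, 0.4])                # a probability vector
v, w = rng.standard_normal(n), rng.standard_normal(n)

# Equivalent forms of the modified inner product <v, w>_p:
forms = [np.dot(p, v * w),                        # <p, v ^ w> = E_p[v ^ w]
         np.dot(p * v, w),                        # <p ^ v, w>
         np.dot(v, p * w)]                        # <v, p ^ w>
assert np.allclose(forms, forms[0])

# Conditional expectation onto the algebra generated by the partition
# with atoms {e1 + e2, e3 + e4} (illustrative choice):
atoms = [np.array([1.0, 1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0, 1.0])]
E = sum(np.outer(f, f * p) / np.dot(p, f) for f in atoms)

assert np.allclose(E @ E, E)                      # idempotent
M = np.diag(p) @ E                                # p ^ E, as a matrix
assert np.allclose(M, M.T)                        # self-adjoint for <.,.>
```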
Proof of Proposition 3. Both $\ker C \supseteq \mathcal{N}$ and $\mathcal{S} \subseteq \mathcal{R}$ are well-known properties. We have to prove that $\mathcal{N} \supseteq \mathcal{N}^C$ and $\mathcal{R} \subseteq \mathcal{R}^C$.
Regarding the reachable space, we have that $\mathrm{span}\{P_{y_{0:l}} p_0,\ \forall y_{0:l}\ \text{s.t.}\ l = k,\ \forall p_0 \in \mathcal{S}\} \supseteq \mathrm{span}\{P^k p_0,\ \forall p_0 \in \mathcal{S}\}$ for all $k \geq 0$. This is proven directly using Lemma 5.
For the non-observable subspace, it holds that
$$\begin{bmatrix} \mathbb{1}^T \mathrm{diag}(e_0)\, P P^C_{y_{0:l-1}} \\ \vdots \\ \mathbb{1}^T \mathrm{diag}(e_m)\, P P^C_{y_{0:l-1}} \end{bmatrix} = C P P^C_{y_{0:l-1}}.$$
Then, we have that $\ker[C P P^C_{y_{0:l-1}}] = \ker[C P^l]$ for all sequences $y_{0:l-1}$ of length $l$, for any $l$. Once again, this is proved by using Lemma 5. Consider a vector $v \in \ker[C P P^C_{y_{0:l-1}}]$ for all $y_{0:l-1}$. Then it holds that $C P P^C_{y_{0:l-1}} v = 0$; summing both sides of this equation over all sequences $y_{0:l-1}$ and using the Lemma above, we obtain $C P^{|y_{0:l-1}|+1} v = 0$, thus proving the statement.
The statement on the effective subspaces follows directly from the other two.
Proof of Lemma 6. We shall start by constructing a vector $\tilde{w}$ such that $\mathrm{supp}(\tilde{w}) = \mathrm{supp}(\mathcal{W})$. Starting from it, we then construct a vector $w$ that is a linear combination of every generator.
By definition of the support of a vector space, for each $e_i \in \mathrm{supp}(\mathcal{W})$ there exists a vector $x_i \in \mathcal{W}$ such that $e_i^T x_i \neq 0$, forming a set $\{x_i\}$. Without loss of generality, we assume $i = 0, \ldots, m$ with $m = \dim(\mathrm{supp}(\mathcal{W}))$. We can then define $\tilde{w}_0 = x_0$ and iteratively compute $\tilde{w}_i = \tilde{w}_{i-1} + \lambda_i x_i$ with $\lambda_i \notin \{-e_j^T \tilde{w}_{i-1} / e_j^T x_i,\ \forall j \,|\, e_j^T x_i \neq 0\} \cup \{0\}$. Since this set is finite, it is always possible to choose a suitable $\lambda_i \in \mathbb{R}$ for each $i$. At the end of the iteration, we obtain $\tilde{w} = \tilde{w}_m = \sum_{i=0}^{m} \lambda_i x_i \in \mathcal{W}$. To prove that $\mathrm{supp}(\tilde{w}) = \mathrm{supp}(\mathcal{W})$, we can simply observe that $e_j^T \tilde{w} = \sum_{i=0}^{m} \lambda_i e_j^T x_i \neq 0$ by construction for all $e_j \in \mathrm{supp}(\mathcal{W})$. On the other hand, for every $e_j \notin \mathrm{supp}(\mathcal{W})$, $e_j^T x_i = 0$ for all $i$ and thus $e_j^T \tilde{w} = 0$. This $\tilde{w}$ can be described as a linear combination of some of the generators, say $\tilde{w} = \sum_{i \in S} \lambda_i w_i$, for some set of indices $S$. We can then use the same procedure as before: take $w_i$ such that $i \notin S$; by choosing any $\lambda_i \notin \{-e_j^T \tilde{w} / e_j^T w_i,\ \forall j \,|\, e_j^T w_i \neq 0\} \cup \{0\}$, we have $\mathrm{supp}(\tilde{w} + \lambda_i w_i) = \mathrm{supp}(\mathcal{W})$. Iterating this procedure on the remaining vectors $\{w_i \,|\, i \notin S\}$, we obtain $w = \tilde{w} + \sum_{i \notin S} \lambda_i w_i$ such that $\mathrm{supp}(w) = \mathrm{supp}(\mathcal{W})$.
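The construction in this proof is effective. A minimal NumPy sketch of the generic-coefficient procedure (the forbidden set for each $\lambda_i$ is finite, so simply nudging $\lambda_i$ until it falls outside that set suffices):

```python
import numpy as np

def full_support_combination(generators):
    """Construct w = sum_i lambda_i w_i, with every lambda_i nonzero, such
    that supp(w) = supp(span{generators}) (construction of Lemma 6)."""
    gens = [np.asarray(g, dtype=float) for g in generators]
    w = gens[0].copy()
    for x in gens[1:]:
        # lambda must avoid the finite set {-w_j/x_j : x_j != 0} and 0,
        # so that no supported entry gets cancelled by the update.
        forbidden = {-wj / xj for wj, xj in zip(w, x) if abs(xj) > 1e-12}
        forbidden.add(0.0)
        lam = 1.0
        while any(abs(lam - f) < 1e-9 for f in forbidden):
            lam += 1.0
        w = w + lam * x
    return w

gens = [np.array([1.0, -1.0, 0.0]), np.array([1.0, 1.0, 0.0])]
print(full_support_combination(gens))   # supported exactly on the first two entries
```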
