On the Latent Variable Interpretation in Sum-Product Networks

One of the central themes in Sum-Product networks (SPNs) is the interpretation of sum nodes as marginalized latent variables (LVs). This interpretation yields an increased syntactic or semantic structure, allows the application of the EM algorithm, and enables efficient MPE inference. In the literature, the LV interpretation was justified by explicitly introducing the indicator variables corresponding to the LVs' states. However, as pointed out in this paper, this approach is in conflict with the completeness condition in SPNs and does not fully specify the probabilistic model. We propose a remedy for this problem by modifying the original approach for introducing the LVs, which we call SPN augmentation. We discuss conditional independencies in augmented SPNs, formally establish the probabilistic interpretation of the sum-weights and give an interpretation of augmented SPNs as Bayesian networks. Based on these results, we find a sound derivation of the EM algorithm for SPNs. Furthermore, the Viterbi-style algorithm for MPE proposed in the literature was never proven to be correct. We show that this is indeed a correct algorithm when applied to selective SPNs, and in particular when applied to augmented SPNs. Our theoretical results are confirmed in experiments on synthetic data and 103 real-world datasets.


INTRODUCTION
Sum-product networks (SPNs) are a promising type of probabilistic model, combining the domains of deep learning and graphical models [1], [2]. One of their main advantages is that many interesting inference scenarios are expressed as single forward and/or backward passes, i.e. these inference scenarios have a computational cost linear in the SPN's representation size. SPNs have shown convincing performance in applications such as image completion [1], [3], [4], computer vision [5], classification [6] and speech and language modeling [7], [8], [9]. Since their proposition [1], one of the central themes in SPNs has been their interpretation as hierarchically structured latent variable (LV) models. This is essentially the same approach as the LV interpretation in mixture models. Consider for example a Gaussian mixture model (GMM) with $K$ components over a set of random variables (RVs) $\mathbf{X}$:

$$p(\mathbf{X}) = \sum_{k=1}^{K} w_k \, \mathcal{N}(\mathbf{X} \mid \mu_k, \Sigma_k),$$

where $\mathcal{N}(\cdot \mid \cdot)$ is the Gaussian PDF, $\mu_k$ and $\Sigma_k$ are the means and covariances of the $k$-th component, and $w_k$ are the mixture weights with $w_k \geq 0$, $\sum_k w_k = 1$. The GMM can be interpreted in two ways: i) it is a convex combination of PDFs and thus itself a PDF, or ii) it is a marginal distribution of a distribution $p(\mathbf{X}, Z)$ over $\mathbf{X}$ and a latent, marginalized variable $Z$, where $p(\mathbf{X} \mid Z = k) = \mathcal{N}(\mathbf{X} \mid \mu_k, \Sigma_k)$ and $p(Z = k) = w_k$. The second interpretation, the LV interpretation, yields a syntactically well-structured model. For example, following the LV interpretation, it is clear how to draw samples from $p(\mathbf{X})$ by using ancestral sampling. This structure can also be of semantic nature, for instance when $Z$ represents a clustering of $\mathbf{X}$ or when $Z$ is a class variable. Furthermore, the LV interpretation allows the application of the EM algorithm, which is essentially maximum-likelihood learning under missing data [10], [11], and enables advanced Bayesian techniques [12], [13].
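The two readings of the GMM can be made concrete with a small sketch. It assumes a one-dimensional mixture for brevity; the function names (`normal_pdf`, `joint`, `gmm_pdf`, `ancestral_sample`) and all parameter values are illustrative and not taken from the paper.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def joint(x, k, w, mu, sigma):
    """LV reading: p(x, Z = k) = p(Z = k) * p(x | Z = k)."""
    return w[k] * normal_pdf(x, mu[k], sigma[k])

def gmm_pdf(x, w, mu, sigma):
    """Mixture reading: p(x) is the joint with the latent Z summed out."""
    return sum(joint(x, k, w, mu, sigma) for k in range(len(w)))

def ancestral_sample(rng, w, mu, sigma):
    """Draw Z from the weights first, then X from the selected component."""
    k = rng.choices(range(len(w)), weights=w)[0]
    return k, rng.gauss(mu[k], sigma[k])

if __name__ == "__main__":
    w, mu, sigma = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]  # hypothetical parameters
    rng = random.Random(0)
    print(gmm_pdf(0.4, w, mu, sigma), ancestral_sample(rng, w, mu, sigma))
```

Sampling needs the LV reading: the mixture formula alone gives a density, while the pair $(Z, X)$ gives a generative procedure.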
Mixture models can be seen as a special case of SPNs with a single sum node, which corresponds to a single LV. More generally, SPNs can have arbitrarily many sum nodes, each corresponding to its own LV, leading to a hierarchically structured model. In [1], the LV interpretation in SPNs was justified by explicitly introducing the LVs in the SPN model, using the so-called indicator variables (IVs) corresponding to the LVs' states. However, as shown in this paper, this justification is actually too simplistic, since it is potentially in conflict with the completeness condition [1], leading to an incompletely specified model. As a remedy we propose the augmentation of an SPN, which in addition to the IVs also introduces the so-called twin sum nodes, in order to completely specify the LV model. We further investigate the independency structure of the LV model resulting from augmentation and find a parallel to the local independence assertions in Bayesian networks (BNs) [14], [15]. This allows us to define a BN representation of the augmented SPN. Using our BN interpretation and the differential approach [16], [17] in augmented SPNs, we give a sound derivation of the (soft) EM algorithm for SPNs.
Closely related to the LV interpretation is the inference scenario of finding the most probable explanation (MPE), i.e. finding a probability-maximizing assignment for all RVs. Using results from [18], [19], we first point out that this problem is generally NP-hard for SPNs. In [1] it was proposed that an MPE solution can be found efficiently when maximizing over both model RVs (i.e. non-latent RVs) and LVs. The proposed algorithm replaces sum nodes by max nodes and recovers the solution by using Viterbi-style backtracking. However, it was not shown that this algorithm delivers a correct MPE solution. In this paper, we show that this algorithm is indeed correct when applied to selective SPNs [20]. In particular, since augmented SPNs are selective, this algorithm obtains an MPE solution in augmented SPNs. However, when applied to non-augmented SPNs, the algorithm still returns an MPE solution of the augmented SPN, but implicitly assumes that the weights for all twin sums are deterministic, i.e. they are all 0 except a single 1. This leads to a phenomenon in MPE inference which we call low-depth bias, i.e. more shallow parts of the SPN are preferred during backtracking.
The main contribution of this paper is to provide a sound theoretical foundation for the LV interpretation in SPNs and related concepts, i.e. the EM algorithm and MPE inference. Our theoretical findings are confirmed in experiments on synthetic data and 103 real-world datasets.
The paper is organized as follows: In the remainder of this section we introduce notation, review SPNs and discuss related work. In Section 2 we propose the augmentation of SPNs, show its soundness as a hierarchical LV model and give an interpretation as a BN. Furthermore, we discuss independency properties in augmented SPNs and the interpretation of sum-weights as conditional probabilities. The EM algorithm for SPNs is derived in Section 3. In Section 4 we discuss MPE inference for SPNs. Experiments are presented in Section 5 and Section 6 concludes the paper. Proofs for our theoretical findings are deferred to the Appendix.

Background and Notation
RVs are denoted by upper-case letters $W$, $X$, $Y$ and $Z$. The set of values of an RV $X$ is denoted by $\mathrm{val}(X)$, where corresponding lower-case letters denote elements of $\mathrm{val}(X)$, e.g. $x$ is an element of $\mathrm{val}(X)$. Sets of RVs are denoted by boldface letters $\mathbf{W}$, $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{Z}$. For an RV set $\mathbf{X} = \{X_1, \dots, X_N\}$, we define $\mathrm{val}(\mathbf{X}) = \mathrm{val}(X_1) \times \dots \times \mathrm{val}(X_N)$ and use corresponding lower-case boldface letters for elements of $\mathrm{val}(\mathbf{X})$, e.g. $\mathbf{x}$ is an element of $\mathrm{val}(\mathbf{X})$. For a subset $\mathbf{Y} \subseteq \mathbf{X}$, $\mathbf{x}[\mathbf{Y}]$ denotes the projection of $\mathbf{x}$ onto $\mathbf{Y}$.
The elements of $\mathrm{val}(\mathbf{X})$ can be interpreted as complete evidence, assigning each RV in $\mathbf{X}$ a fixed value. Partial evidence about $X$ is represented as a subset $\mathcal{X} \subseteq \mathrm{val}(X)$, which is an element of the sigma-algebra $\mathcal{A}_X$ induced by RV $X$. For all RVs we use $\mathcal{A}_X = \{\mathcal{X} \in \mathcal{B} \mid \mathcal{X} \subseteq \mathrm{val}(X)\}$, $\mathcal{B}$ being the Borel sets over $\mathbb{R}$. For discrete RVs, this choice yields the power set $\mathcal{A}_X = 2^{\mathrm{val}(X)}$. For example, partial evidence $\mathcal{X} = \{1, 3, 5\}$ for a discrete RV $X$ with $\mathrm{val}(X) = \{1, \dots, 6\}$ represents evidence that $X$ takes one of the states 1, 3 or 5, and $\mathcal{Y} = [-\infty, \pi]$ for a real-valued RV $Y$ represents evidence that $Y$ takes a value of at most $\pi$. Formally speaking, partial evidence is used to express the domain of marginalization or maximization for a particular RV.
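For sets of RVs $\mathbf{X} = \{X_1, \dots, X_N\}$, we use the product sets $\mathcal{H}_{\mathbf{X}}$, i.e. sets of the form $\mathcal{X}_1 \times \dots \times \mathcal{X}_N$ with $\mathcal{X}_n \in \mathcal{A}_{X_n}$. Elements of $\mathcal{H}_{\mathbf{X}}$ are denoted using boldface notation, e.g. $\boldsymbol{\mathcal{X}}$. When $\mathbf{Y} \subseteq \mathbf{X}$ and $\boldsymbol{\mathcal{X}} \in \mathcal{H}_{\mathbf{X}}$, we define $\boldsymbol{\mathcal{X}}[\mathbf{Y}] := \{\mathbf{x}[\mathbf{Y}] \mid \mathbf{x} \in \boldsymbol{\mathcal{X}}\}$. Furthermore, we use $e$ to symbolize any combination of complete and partial evidence, i.e. for RVs $\mathbf{X}$ we have some complete evidence $\mathbf{x}'$ for $\mathbf{X}' \subseteq \mathbf{X}$ and some partial evidence for the remaining RVs.

Given a node $\mathsf{N}$ in some directed graph $\mathcal{G}$, let $\mathrm{ch}(\mathsf{N})$ and $\mathrm{pa}(\mathsf{N})$ be the set of children and parents of $\mathsf{N}$, respectively. Furthermore, let $\mathrm{desc}(\mathsf{N})$ be the set of descendants of $\mathsf{N}$, recursively defined as the set containing $\mathsf{N}$ itself and any child of a descendant. Similarly, we define $\mathrm{anc}(\mathsf{N})$ as the ancestors of $\mathsf{N}$, recursively defined as the set containing $\mathsf{N}$ itself and any parent of an ancestor. SPNs are defined as follows.

Definition 1 (Sum-Product Network). [...] The size $|\mathcal{S}|$ of the SPN is defined as the number of nodes and edges in $\mathcal{G}$. For any node $\mathsf{N}$ in $\mathcal{G}$, the scope of $\mathsf{N}$ is denoted as $\mathrm{sc}(\mathsf{N})$. The function computed by $\mathcal{S}$ is the function computed by its root and denoted as $\mathcal{S}(\mathbf{x})$, where without loss of generality we assume that the scope of the root is $\mathbf{X}$.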
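For orientation, the node semantics that the remainder of this section relies on can be sketched as follows. This is written under the usual SPN conventions and is an assumption, not a quotation of Definition 1: the symbols $w_{\mathsf{S},\mathsf{C}}$ and $\mathrm{sc}(\cdot)$ follow their use later in the text, and $\mathsf{D}_{\mathbf{Y}}$ is assumed to denote an input distribution over the RVs $\mathbf{Y}$.

```latex
% Assumed standard form of sum nodes, product nodes and scopes in an SPN.
\begin{align*}
\mathsf{S}(\mathbf{x}) &= \sum_{\mathsf{C} \in \mathrm{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{C}} \, \mathsf{C}(\mathbf{x}[\mathrm{sc}(\mathsf{C})]),
  \qquad w_{\mathsf{S},\mathsf{C}} \geq 0, \\
\mathsf{P}(\mathbf{x}) &= \prod_{\mathsf{C} \in \mathrm{ch}(\mathsf{P})} \mathsf{C}(\mathbf{x}[\mathrm{sc}(\mathsf{C})]), \\
\mathrm{sc}(\mathsf{N}) &=
  \begin{cases}
    \mathbf{Y} & \text{if } \mathsf{N} \text{ is a distribution } \mathsf{D}_{\mathbf{Y}}, \\
    \bigcup_{\mathsf{C} \in \mathrm{ch}(\mathsf{N})} \mathrm{sc}(\mathsf{C}) & \text{otherwise.}
  \end{cases}
\end{align*}
```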
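We use symbols $\mathsf{D}$, $\mathsf{S}$, $\mathsf{P}$, $\mathsf{N}$, $\mathsf{C}$ and $\mathsf{F}$ for nodes in SPNs, where $\mathsf{D}$ denotes a distribution, $\mathsf{S}$ denotes a sum, and $\mathsf{P}$ denotes a product. Symbols $\mathsf{N}$, $\mathsf{C}$ and $\mathsf{F}$ denote generic nodes, where $\mathsf{C}$ and $\mathsf{F}$ indicate a child or parent relationship to another node, respectively. The distribution $p_{\mathcal{S}}$ of an SPN $\mathcal{S}$ is defined as the normalized output of $\mathcal{S}$, i.e. $p_{\mathcal{S}}(\mathbf{x}) \propto \mathcal{S}(\mathbf{x})$. For each node $\mathsf{N}$, we define the sub-SPN $\mathcal{S}_{\mathsf{N}}$ rooted at $\mathsf{N}$ as the SPN defined by the graph induced by the descendants of $\mathsf{N}$ and the corresponding parameters.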
Inference in unconstrained SPNs is generally intractable. However, efficient inference in SPNs is enabled by two structural constraints, completeness and decomposability [1]. An SPN is complete if for all sums $\mathsf{S}$ it holds that

$$\forall \mathsf{C}', \mathsf{C}'' \in \mathrm{ch}(\mathsf{S}) : \mathrm{sc}(\mathsf{C}') = \mathrm{sc}(\mathsf{C}''). \tag{3}$$

An SPN is decomposable if for all products $\mathsf{P}$ it holds that

$$\forall \mathsf{C}', \mathsf{C}'' \in \mathrm{ch}(\mathsf{P}),\ \mathsf{C}' \neq \mathsf{C}'' : \mathrm{sc}(\mathsf{C}') \cap \mathrm{sc}(\mathsf{C}'') = \emptyset. \tag{4}$$

Furthermore, a sum node $\mathsf{S}$ is called selective [20] if for all choices of sum-weights $\mathbf{w}$ and all possible inputs $\mathbf{x}$ it holds that at most one child of $\mathsf{S}$ is non-zero. An SPN $\mathcal{S}$ is called selective if all its sum nodes are selective. As shown in [17], [19], integrating $\mathcal{S}(\mathbf{x})$ over arbitrary sets $\boldsymbol{\mathcal{X}} \in \mathcal{H}_{\mathbf{X}}$, i.e. marginalization over $\boldsymbol{\mathcal{X}}$, reduces to the corresponding integrals at the input distributions and evaluating sums and products in the usual way. This property is known as validity of SPNs [1] and is key for efficient inference. In this paper we only consider complete and decomposable SPNs. Without loss of generality [17], [21], we assume locally normalized sum-weights, i.e. for each sum node $\mathsf{S}$ we have $\sum_{\mathsf{C} \in \mathrm{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{C}} = 1$, and thus $p_{\mathcal{S}} = \mathcal{S}$, i.e. the SPN's normalization constant is 1.

For RVs with finitely many states, we will use so-called indicator variables (IVs) as input distributions [1]. For a finite-state RV $X$ and state $x \in \mathrm{val}(X)$, we introduce the IV $\lambda_{X=x}(x') := \mathbb{1}(x = x')$, assigning all probability mass to $x$. A complete and decomposable SPN represents the (extended) network polynomial of $p_{\mathcal{S}}$, which can be used in the differential approach to inference [1], [16], [17]. Assume any evidence $e$ which is evaluated in the SPN. The derivatives of the SPN function with respect to the IVs (by interpreting the IVs as real-valued variables, see [16], [17] for details) yield

$$\frac{\partial \mathcal{S}}{\partial \lambda_{X=x}}(e) = \mathcal{S}(X = x, e[\mathbf{X} \setminus \{X\}]), \tag{5}$$

representing the inference scenario of modified evidence, i.e. evidence $e$ is modified such that $X$ is set to $x$. The computationally attractive feature of the differential approach is that (5) can be evaluated for all $X \in \mathbf{X}$ and all $x \in \mathrm{val}(X)$ simultaneously using a single back-propagation pass in the SPN, after evidence has been evaluated. Similarly, for the second (and higher) derivatives, we get

$$\frac{\partial^2 \mathcal{S}}{\partial \lambda_{X=x} \, \partial \lambda_{Y=y}}(e) = \mathcal{S}(X = x, Y = y, e[\mathbf{X} \setminus \{X, Y\}]), \qquad X \neq Y. \tag{6}$$

Furthermore, the differential approach can be generalized to SPNs with arbitrary input distributions, i.e. SPNs over RVs with countably infinite or uncountably many states (cf. [17] for details).
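The upward pass and the differential approach can be illustrated with a minimal sketch. It assumes a toy complete and decomposable SPN over two binary RVs with IVs as leaves; the dictionary-based node representation and the names `topo_order`, `upward`, `backward` and `build_example` are hypothetical choices made for this illustration.

```python
def topo_order(root):
    """Post-order over the DAG: children appear before their parents."""
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for c in n["ch"]:
            visit(c)
        order.append(n)
    visit(root)
    return order

def upward(root, lam):
    """Evaluate all nodes; lam maps (var, state) to the value of that IV."""
    val = {}
    for n in topo_order(root):
        if n["kind"] == "iv":
            val[id(n)] = lam[(n["var"], n["state"])]
        elif n["kind"] == "sum":
            val[id(n)] = sum(w * val[id(c)] for w, c in zip(n["w"], n["ch"]))
        else:  # product node
            v = 1.0
            for c in n["ch"]:
                v *= val[id(c)]
            val[id(n)] = v
    return val

def backward(root, val):
    """Back-propagation: derivative of the root with respect to every node."""
    order = topo_order(root)
    grad = {id(n): 0.0 for n in order}
    grad[id(root)] = 1.0
    for n in reversed(order):  # parents are processed before their children
        if n["kind"] == "sum":
            for w, c in zip(n["w"], n["ch"]):
                grad[id(c)] += w * grad[id(n)]
        elif n["kind"] == "prod":
            for c in n["ch"]:
                others = 1.0
                for o in n["ch"]:
                    if o is not c:
                        others *= val[id(o)]
                grad[id(c)] += grad[id(n)] * others
    return grad

def build_example():
    """Toy SPN: a mixture of two fully factorized distributions over X1, X2."""
    iv = {(v, s): {"kind": "iv", "var": v, "state": s, "ch": []}
          for v in ("X1", "X2") for s in (0, 1)}
    def bern(var, p1):
        return {"kind": "sum", "w": [1.0 - p1, p1], "ch": [iv[(var, 0)], iv[(var, 1)]]}
    p_a = {"kind": "prod", "ch": [bern("X1", 0.2), bern("X2", 0.7)]}
    p_b = {"kind": "prod", "ch": [bern("X1", 0.9), bern("X2", 0.1)]}
    return {"kind": "sum", "w": [0.6, 0.4], "ch": [p_a, p_b]}, iv
```

With evidence "X2 = 1, X1 marginalized" (both IVs of X1 set to 1), one upward and one backward pass yield the derivative for every IV at once, and each derivative equals the SPN evaluated on the correspondingly modified evidence, as in (5).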

Related Work
SPNs are related to negation normal forms (NNFs), a potentially deep network representation of propositional theories [22], [23], [24]. As in SPNs, structural constraints in NNFs enable certain polynomial-time queries in the represented theory. In particular, the notions of smoothness, decomposability and determinism in NNFs translate to the notions of completeness, decomposability and selectivity in SPNs, respectively. The work on NNFs led to the concept of network polynomials as a multilinear representation of BNs over finitely many states [16], [25]. BNs were cast into an intermediate d-DNNF (deterministic decomposable NNF) representation in order to generate an arithmetic circuit (AC) representing the BN's network polynomial. ACs, when restricted to sums and products, are equivalent to SPNs but have a slightly different syntax. In [26], ACs were learned by optimizing an objective trading off the log-likelihood on the training set and the inference cost of the AC, measured as the worst-case number of arithmetic operations required for inference (i.e. the number of edges in the AC). The learned models still represent BNs with context-specific independencies [27]. A similar approach, learning Markov networks represented by ACs, is followed in [28]. SPNs were first proposed in [1], where the represented distribution was no longer defined via a background graphical model, but directly as the normalized output of the network. In this work, SPNs were applied to image data, where a generic architecture reminiscent of convolutional neural networks was proposed. Structure learning algorithms not restricted to the image domain were proposed in [2], [3], [4], [29], [30], [31]. Discriminative learning of SPNs, optimizing conditional likelihood, was proposed in [6]. Furthermore, there is a growing body of literature on theoretical aspects of SPNs and their relationship to other types of probabilistic models. In [32] two families of functions were identified which are efficiently representable by deep, but not by shallow SPNs, where an SPN is considered shallow if it has no more than three layers. In [17] it was shown that SPNs can w.l.o.g. be assumed to be locally normalized and that the notion of consistency does not allow exponentially more compact models than decomposability. These results were independently found in [21]. Furthermore, in [17], a sound derivation of inference mechanisms for generalized SPNs was given, i.e. SPNs over RVs with (uncountably) infinitely many states. In [21], a BN representation of SPNs was found, where LVs associated with sum nodes and the model RVs are organized in a two-layer bipartite structure. The actual SPN structure is captured in structured conditional probability tables (CPTs) using algebraic decision diagrams. Recently, the notion of SPNs was generalized to sum-product functions over arbitrary semirings [33]. This yields a general unifying framework for learning and inference, subsuming, among others, SPNs for probabilistic modeling, NNFs for logical propositions and function representations for integration and optimization.

LATENT VARIABLE INTERPRETATION
As pointed out in [1], each sum node in an SPN can be interpreted as a marginalized LV, similar to the GMM example in Section 1. For each sum node $\mathsf{S}$, one postulates a discrete LV $Z$ whose states correspond to the children of $\mathsf{S}$.
For each state, an IV and a product are introduced, such that the children are switched on/off by the corresponding IVs, as illustrated in Fig. 1 (see footnote 1). When all IVs in Fig. 1b are set to 1, $\mathsf{S}$ still computes the same value as in Fig. 1a. Since setting all IVs of $Z$ to 1 corresponds to marginalizing $Z$, the sum $\mathsf{S}$ should be interpreted as a latent, marginalized RV.
Footnote 1: In graphical representations of SPNs, IVs are depicted as nodes containing a small circle, general distributions as nodes containing a Gaussian-like PDF, and sums and products as nodes with $+$ and $\times$ symbols. Empty nodes are of arbitrary type.
However, when we regard a larger structural context in Fig. 1b, we recognize that this justification is actually too simplistic. Explicitly introducing the IVs renders the ancestor $\mathsf{S}'$ incomplete when $\mathsf{S}$ is no descendant of $\mathsf{N}$, and $Z$ is thus not in the scope of $\mathsf{N}$. Note that setting all IVs to 1 in an incomplete SPN generally does not correspond to marginalization. Furthermore, note that $\mathsf{S}'$ also corresponds to an LV, say $Z'$. While we know the probability distribution of $Z$ if $Z'$ is in the state corresponding to $\mathsf{P}$, namely the weights of $\mathsf{S}$, we do not know this distribution when $Z'$ is in the state corresponding to $\mathsf{N}$. Intuitively, we recognize that the state of $Z$ is irrelevant in this case, since it does not influence the resulting distribution over the model RVs $\mathbf{X}$. Nevertheless, the probabilistic model is not completely specified, which is unsatisfying. A remedy for these problems is shown in Fig. 1c. We introduce the twin sum node $\bar{\mathsf{S}}$ whose children are the IVs corresponding to $Z$. The twin $\bar{\mathsf{S}}$ is connected as child of an additional product node, which is interconnected between $\mathsf{S}'$ and $\mathsf{N}$. Since this new product node has scope $\mathrm{sc}(\mathsf{N}) \cup \{Z\}$, $\mathsf{S}'$ is now rendered complete. Furthermore, if $Z'$ takes the state corresponding to $\mathsf{N}$ (or actually the state corresponding to the new product node), we now have a specified conditional distribution for $Z$, namely the weights of the twin sum node. Clearly, given that all IVs of $Z$ are set to 1, the network depicted in Fig. 1c still computes the same function as the network in Fig. 1a (or Fig. 1b), since $\bar{\mathsf{S}}$ constantly outputs 1, as long as we use normalized weights for it. Which weights should be used for the twin sum node $\bar{\mathsf{S}}$?
Basically, we can assume arbitrary normalized weights, which will cause $\bar{\mathsf{S}}$ to constantly output 1. A natural choice, however, would be to use uniform weights for $\bar{\mathsf{S}}$ (maximizing the entropy of the resulting LV model). Although the choice of weights is not crucial for evaluating evidence in the SPN, it plays a role in MPE inference, see Section 4. For now, let us formalize the explicit introduction of LVs, denoted as augmentation.

Augmentation of SPNs
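Let $\mathcal{S}$ be an SPN over $\mathbf{X}$ and let $\mathbf{S}(\mathcal{S})$ denote the set of its sum nodes. For each $\mathsf{S} \in \mathbf{S}(\mathcal{S})$ we assume an arbitrary but fixed ordering of its children $\mathrm{ch}(\mathsf{S}) = \{\mathsf{C}^1_{\mathsf{S}}, \dots, \mathsf{C}^{K_{\mathsf{S}}}_{\mathsf{S}}\}$, where $K_{\mathsf{S}} = |\mathrm{ch}(\mathsf{S})|$. Let $Z_{\mathsf{S}}$ be an RV on the same probability space as $\mathbf{X}$, with $\mathrm{val}(Z_{\mathsf{S}}) = \{1, \dots, K_{\mathsf{S}}\}$, where state $k$ corresponds to child $\mathsf{C}^k_{\mathsf{S}}$. We call $Z_{\mathsf{S}}$ the LV associated with $\mathsf{S}$. For sets of sum nodes $\mathbf{S}$ we define $\mathbf{Z}_{\mathbf{S}} = \{Z_{\mathsf{S}} \mid \mathsf{S} \in \mathbf{S}\}$. To distinguish $\mathbf{X}$ from the LVs, we will refer to the former as model RVs. For a node $\mathsf{N}$, we define the sum ancestors/descendants as

$$\mathrm{anc}_{\mathbf{S}}(\mathsf{N}) := \mathrm{anc}(\mathsf{N}) \cap \mathbf{S}(\mathcal{S}), \tag{7}$$

$$\mathrm{desc}_{\mathbf{S}}(\mathsf{N}) := \mathrm{desc}(\mathsf{N}) \cap \mathbf{S}(\mathcal{S}). \tag{8}$$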
[Fig. 2: Pseudo-code of procedure AUGMENTSPN($\mathcal{S}$) for augmenting an SPN.]

For each sum node $\mathsf{S}$ we define the conditioning sums as

$$\mathbf{S}^c(\mathsf{S}) := \{\mathsf{S}^c \in \mathrm{anc}_{\mathbf{S}}(\mathsf{S}) \setminus \{\mathsf{S}\} \mid \exists \mathsf{C} \in \mathrm{ch}(\mathsf{S}^c) : \mathsf{S} \notin \mathrm{desc}(\mathsf{C})\}. \tag{9}$$

Furthermore, we assume a set of locally normalized twin-weights $\bar{\mathbf{w}}$, containing a twin-weight $\bar{w}_{\mathsf{S},\mathsf{C}}$ for each weight $w_{\mathsf{S},\mathsf{C}}$ in the SPN. We are now ready to define the augmentation of an SPN.

Definition 2 (Augmentation of SPN). Let $\mathcal{S}$ be an SPN over $\mathbf{X}$, $\bar{\mathbf{w}}$ be a set of twin-weights and $\mathcal{S}'$ be the result of algorithm AUGMENTSPN, shown in Fig. 2. Then $\mathcal{S}'$ is called the augmented SPN of $\mathcal{S}$, denoted as $\mathcal{S}' = \mathrm{aug}(\mathcal{S})$.

In steps 4-11 of AUGMENTSPN we introduce the links $\mathsf{P}^k_{\mathsf{S}}$, which are interconnected between sum node $\mathsf{S}$ and its $k$-th child. Each link $\mathsf{P}^k_{\mathsf{S}}$ has a single parent, namely $\mathsf{S}$, and simply copies the former child $\mathsf{C}^k_{\mathsf{S}}$. In steps 13-15, we introduce IVs corresponding to the associated LV $Z_{\mathsf{S}}$, as proposed in [1]. As we saw in Fig. 1 and the discussion above, this can render other sum nodes incomplete. These sums are clearly the conditioning sums $\mathbf{S}^c(\mathsf{S})$. Thus, when necessary, we introduce a twin sum node in steps 17-23 to treat this problem. The following proposition states the soundness of augmentation.
Proposition 1. Let $\mathcal{S}$ be an SPN over $\mathbf{X}$, $\mathcal{S}' = \mathrm{aug}(\mathcal{S})$ and $\mathbf{Z} := \mathbf{Z}_{\mathbf{S}(\mathcal{S})}$. Then $\mathcal{S}'$ is a complete and decomposable SPN over $\mathbf{X} \cup \mathbf{Z}$ with $\mathcal{S}'(\mathbf{X}) = \mathcal{S}(\mathbf{X})$.
Proposition 1 states that the marginal distribution over $\mathbf{X}$ in the augmented SPN is the same distribution as represented by the original SPN, while being a completely specified probabilistic model over $\mathbf{X}$ and $\mathbf{Z}$. Thus, augmentation provides a sound way to generalize the LV interpretation from mixture models to more general SPNs. An example of augmentation is shown in Fig. 3.
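The effect of the twin sum can be checked numerically on a hand-built network shaped like the situation described for Fig. 1. This is a sketch under assumptions: the structure (a root sum over a product containing the inner sum, and a second child $\mathsf{N}$ that does not contain it), all parameter values, and the function names are hypothetical; `naive` adds only the IVs, `augmented` also adds the twin.

```python
from itertools import product

def bern(p1, x):
    """Bernoulli leaf: probability of binary state x."""
    return p1 if x == 1 else 1.0 - p1

W = (0.3, 0.7)        # weights of the inner sum S (LV Z)
W_TWIN = (0.5, 0.5)   # weights of the twin sum (uniform choice)
V = (0.6, 0.4)        # weights of the root sum S' (LV Z')

def original(x1, x2):
    s = W[0] * bern(0.1, x1) + W[1] * bern(0.8, x1)
    p = s * bern(0.5, x2)              # product P containing S
    n = bern(0.4, x1) * bern(0.9, x2)  # node N, which does not contain S
    return V[0] * p + V[1] * n

def naive(x1, x2, lam_zr, lam_z):
    """IVs added without a twin: N ignores Z, so the root is incomplete."""
    s = W[0] * lam_z[0] * bern(0.1, x1) + W[1] * lam_z[1] * bern(0.8, x1)
    n = bern(0.4, x1) * bern(0.9, x2)
    return V[0] * lam_zr[0] * s * bern(0.5, x2) + V[1] * lam_zr[1] * n

def augmented(x1, x2, lam_zr, lam_z):
    """IVs plus twin sum attached next to N."""
    s = W[0] * lam_z[0] * bern(0.1, x1) + W[1] * lam_z[1] * bern(0.8, x1)
    twin = W_TWIN[0] * lam_z[0] + W_TWIN[1] * lam_z[1]
    n = bern(0.4, x1) * bern(0.9, x2) * twin
    return V[0] * lam_zr[0] * s * bern(0.5, x2) + V[1] * lam_zr[1] * n

def one_hot(k):
    return tuple(1.0 if i == k else 0.0 for i in range(2))

def total_mass(net):
    """Sum a network over all complete assignments of X1, X2, Z', Z."""
    return sum(net(x1, x2, one_hot(zr), one_hot(z))
               for x1, x2, zr, z in product((0, 1), repeat=4))

if __name__ == "__main__":
    print(total_mass(naive), total_mass(augmented))
```

In this toy case the naive construction assigns a total mass of $1 + v_2 = 1.4$ to the joint states, so it is not a distribution over model RVs and LVs, whereas the augmented network sums to 1 and reproduces the original SPN when all IVs of the LVs are set to 1.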
Note that we understand the augmentation mainly as a theoretical tool to establish and work with the LV interpretation in SPNs. In most cases, it will be neither necessary nor advisable to explicitly construct the augmented SPN.
An interesting question is how the sizes of the original SPN and the augmented SPN relate to each other. A lower bound is $|\mathcal{S}'| \in \Omega(|\mathcal{S}|)$, holding e.g. for SPNs with a single sum node. An asymptotic upper bound is $|\mathcal{S}'| \in O(|\mathcal{S}|^2)$. To see this, note that the introduction of links, IVs and twin sums causes at most a linear increase of the SPN's size. The number of edges introduced when connecting twins to the links of conditioning sums is bounded by $|\mathcal{S}|^2$, since the number of twins and the number of links are both bounded by $|\mathcal{S}|$. Therefore, we have $|\mathcal{S}'| \in O(|\mathcal{S}|^2)$. This asymptotic upper bound is indeed achieved by certain types of SPNs: Consider e.g. a chain consisting of $K$ sum nodes and $K + 1$ distribution nodes. For $k < K$ the $k$-th sum is the parent of the $(k+1)$-th sum and the $k$-th distribution, and the $K$-th sum is the parent of the last two distributions. For the $k$-th sum, all preceding sums are conditioning sums, yielding $k - 1$ introduced edges. In total this gives $\sum_{k=2}^{K} (k - 1) = \frac{K(K-1)}{2} = \frac{K^2 - K}{2}$ edges, i.e. in this case $|\mathcal{S}'|$ indeed grows quadratically in $|\mathcal{S}|$.
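The edge count for the chain example can be reproduced by building the chain and applying the definition of conditioning sums (9) literally. The node naming scheme (`s1..sK`, `d1..d(K+1)`) and the helper names are assumptions made for this sketch.

```python
def chain_spn(K):
    """Children map of the chain: sums s1..sK and distributions d1..d(K+1)."""
    ch = {f"d{k}": [] for k in range(1, K + 2)}
    for k in range(1, K):
        ch[f"s{k}"] = [f"s{k + 1}", f"d{k}"]
    ch[f"s{K}"] = [f"d{K}", f"d{K + 1}"]
    return ch

def desc(ch, n):
    """Descendants of n, including n itself."""
    out, stack = set(), [n]
    while stack:
        m = stack.pop()
        if m not in out:
            out.add(m)
            stack.extend(ch[m])
    return out

def conditioning_sums(ch, s):
    """Sum ancestors of s (excluding s) with a child that does not lead to s."""
    sums = [n for n in ch if n.startswith("s")]
    anc_sums = [a for a in sums if a != s and s in desc(ch, a)]
    return {a for a in anc_sums if any(s not in desc(ch, c) for c in ch[a])}

def twin_edges(K):
    """One edge per (sum, conditioning sum) pair, summed over the chain."""
    ch = chain_spn(K)
    return sum(len(conditioning_sums(ch, f"s{k}")) for k in range(1, K + 1))

if __name__ == "__main__":
    for K in (1, 2, 4, 8):
        print(K, twin_edges(K), K * (K - 1) // 2)
```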

Conditional Independencies in Augmented SPNs and Probabilistic Interpretation of Sum-Weights
It is helpful to introduce the notion of configured SPNs, which takes a similar role as conditioning in the literature on DNNFs [22], [23], [24].
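Definition 3 (Configured SPN). Let $\mathcal{S}$ be an SPN over $\mathbf{X}$, $\mathbf{Y} \subseteq \mathbf{Z}_{\mathbf{S}(\mathcal{S})}$ and $\mathbf{y} \in \mathrm{val}(\mathbf{Y})$. The configured SPN $\mathcal{S}^{\mathbf{y}}$ is obtained by deleting the IVs $\lambda_{Y=y}$ and their corresponding links for each $Y \in \mathbf{Y}$, $y \neq \mathbf{y}[Y]$, from $\mathrm{aug}(\mathcal{S})$, and further deleting all nodes which are rendered unreachable from the root.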
Intuitively, the configured SPN isolates the computational structure selected by $\mathbf{y}$. All sum edges which "survive" in the configured SPN are equipped with the same weights as in the augmented SPN. Therefore, a configured SPN is in general not locally normalized. We note the following properties of configured SPNs.

Proposition 2. Let $\mathcal{S}$ be an SPN over $\mathbf{X}$, $\mathbf{Y} \subseteq \mathbf{Z}_{\mathbf{S}(\mathcal{S})}$ and $\mathbf{Z} = \mathbf{Z}_{\mathbf{S}(\mathcal{S})} \setminus \mathbf{Y}$. Let $\mathbf{y} \in \mathrm{val}(\mathbf{Y})$ and let $\mathcal{S}' = \mathrm{aug}(\mathcal{S})$. It holds that
1) Each node in $\mathcal{S}^{\mathbf{y}}$ has the same scope as its corresponding node in $\mathcal{S}'$.
2) $\mathcal{S}^{\mathbf{y}}$ is a complete and decomposable SPN over $\mathbf{X} \cup \mathbf{Y} \cup \mathbf{Z}$.
3) For any node $\mathsf{N}$ in $\mathcal{S}^{\mathbf{y}}$ with $\mathrm{sc}(\mathsf{N}) \cap \mathbf{Y} = \emptyset$, we have that $\mathcal{S}^{\mathbf{y}}_{\mathsf{N}} = \mathcal{S}'_{\mathsf{N}}$.
4) For $\mathbf{y}' \in \mathrm{val}(\mathbf{Y})$ it holds that [...]

The next theorem shows certain conditional independencies in the augmented SPN. For ease of discussion, we make the following definitions. [...]

We will show that the parents, children and non-descendants play the same role as for independencies in BNs [14], [15], i.e. $Z_{\mathsf{S}}$ is independent of $\mathbf{Y}_n$ given $\mathbf{Z}_p$. We will further show that the sum-weights of $\mathsf{S}$ are the conditional distribution of $Z_{\mathsf{S}}$, conditioned on the event that "$\mathbf{Z}_p$ select a path to $\mathsf{S}$". One problem in the original LV interpretation [1] was that no conditional distribution of $Z_{\mathsf{S}}$ was specified for the complementary event. Here, we will show that the twin-weights are precisely this conditional distribution. This requires that the event "$\mathbf{Z}_p$ select a path to the twin $\bar{\mathsf{S}}$" is indeed the complementary event to "$\mathbf{Z}_p$ select a path to $\mathsf{S}$". This is shown in the following lemma.

Lemma 1. Let $\mathcal{S}$ be an SPN over $\mathbf{X}$, let $\mathsf{S}$ be a sum node in $\mathcal{S}$ and $\mathbf{Z}_p$ be the parents of $Z_{\mathsf{S}}$. For any $\mathbf{z} \in \mathrm{val}(\mathbf{Z}_p)$, the configured SPN $\mathcal{S}^{\mathbf{z}}$ contains either $\mathsf{S}$ or its twin $\bar{\mathsf{S}}$, but not both.
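We are now ready to state our theorem concerning conditional independencies in augmented SPNs.

Theorem 1. Let $\mathcal{S}$ be an SPN over $\mathbf{X}$ and $\mathcal{S}' = \mathrm{aug}(\mathcal{S})$. Let $\mathsf{S}$ be an arbitrary sum in $\mathcal{S}$ and $w_k = w_{\mathsf{S},\mathsf{C}^k_{\mathsf{S}}}$, $\bar{w}_k = \bar{w}_{\mathsf{S},\mathsf{C}^k_{\mathsf{S}}}$, $k = 1, \dots, K_{\mathsf{S}}$. With respect to $\mathsf{S}$, let $\mathbf{Z}_p$ be the parents, $\mathbf{Y}_c$ be the children and $\mathbf{Y}_n$ be the non-descendants, respectively. Then there exists a two-partition of $\mathrm{val}(\mathbf{Z}_p)$, i.e. $\mathcal{Z}, \bar{\mathcal{Z}}$ with $\mathcal{Z} \cup \bar{\mathcal{Z}} = \mathrm{val}(\mathbf{Z}_p)$ and $\mathcal{Z} \cap \bar{\mathcal{Z}} = \emptyset$, such that

$$\forall \mathbf{z} \in \mathcal{Z} : \mathcal{S}'(Z_{\mathsf{S}} = k, \mathbf{Y}_n, \mathbf{z}) = w_k \, \mathcal{S}'(\mathbf{Y}_n, \mathbf{z}), \text{ and} \tag{11}$$

$$\forall \mathbf{z} \in \bar{\mathcal{Z}} : \mathcal{S}'(Z_{\mathsf{S}} = k, \mathbf{Y}_n, \mathbf{z}) = \bar{w}_k \, \mathcal{S}'(\mathbf{Y}_n, \mathbf{z}). \tag{12}$$

From Theorem 1 it follows that the weights and twin-weights of a sum node $\mathsf{S}$ can be interpreted as conditional probability tables (CPTs) of $Z_{\mathsf{S}}$, conditioned on $\mathbf{Z}_p$, and that $Z_{\mathsf{S}}$ is conditionally independent of $\mathbf{Y}_n$ given $\mathbf{Z}_p$, i.e.

$$\mathcal{S}'(Z_{\mathsf{S}} = k \mid \mathbf{Y}_n, \mathbf{z}) = \mathcal{S}'(Z_{\mathsf{S}} = k \mid \mathbf{z}) = \begin{cases} w_k & \text{if } \mathbf{z} \in \mathcal{Z}, \\ \bar{w}_k & \text{if } \mathbf{z} \in \bar{\mathcal{Z}}. \end{cases} \tag{13}$$

Using this result, we can define a BN representing the augmented SPN as follows: For each sum node $\mathsf{S}$, connect $\mathbf{Z}_p$ as parents of $Z_{\mathsf{S}}$, and all RVs in $\mathrm{sc}(\mathsf{S})$ as children of $Z_{\mathsf{S}}$. By doing this for each LV, we obtain our BN representation of the augmented SPN, serving as a useful tool to understand SPNs in the context of probabilistic graphical models. An example of the BN interpretation is shown in Fig. 4.
Note that the BN representation by Zhao et al. [21] can be recovered from the BN representation of augmented SPNs. They proposed a BN representation of SPNs using a bipartite structure, where an LV is a parent of a model RV if it is contained in the scope of the corresponding sum node. The model RVs and LVs are unconnected among each other, respectively. When we constrain the twin-weights to be equal to the sum-weights, we can see in (13) that $Z_{\mathsf{S}}$ becomes independent of $\mathbf{Z}_p$. This special choice of twin-weights effectively removes all edges between LVs, recovering the BN structure in [21]. In the next section, we use the augmented SPN and the BN interpretation to derive the EM algorithm for SPNs.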
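The CPT reading of weights and twin-weights in Theorem 1 can be checked numerically on a small hand-built augmented network. This is a sketch under assumptions: the two-sum structure, the parameter values and the names `augmented`, `one_hot` and `conditional` are hypothetical; the inner LV is called Z, the root LV Z', and X2 plays the role of an RV outside the scope of the inner sum.

```python
def bern(p1, x):
    """Bernoulli leaf: probability of binary state x."""
    return p1 if x == 1 else 1.0 - p1

W = (0.3, 0.7)  # weights of the inner sum S (LV Z)
V = (0.6, 0.4)  # weights of the root sum (LV Z')

def augmented(x1, x2, lam_zr, lam_z, w_twin):
    """Augmented network: Z' = 0 selects the part containing S, Z' = 1 the twin."""
    s = sum(W[k] * lam_z[k] * bern((0.1, 0.8)[k], x1) for k in range(2))
    twin = sum(w_twin[k] * lam_z[k] for k in range(2))
    p = s * bern(0.5, x2)
    n = bern(0.4, x1) * bern(0.9, x2) * twin
    return V[0] * lam_zr[0] * p + V[1] * lam_zr[1] * n

def one_hot(k):
    return tuple(1.0 if i == k else 0.0 for i in range(2))

def conditional(k, zr, x2, w_twin):
    """S'(Z = k | X2 = x2, Z' = zr), with X1 marginalized by summation."""
    num = sum(augmented(x1, x2, one_hot(zr), one_hot(k), w_twin) for x1 in (0, 1))
    den = sum(augmented(x1, x2, one_hot(zr), (1.0, 1.0), w_twin) for x1 in (0, 1))
    return num / den

if __name__ == "__main__":
    twin = (0.2, 0.8)
    print([conditional(k, 0, 1, twin) for k in range(2)])  # sum-weights
    print([conditional(k, 1, 1, twin) for k in range(2)])  # twin-weights
```

When `w_twin` is set equal to `W`, the conditional no longer depends on the state of Z', which is the special case in which the edges between LVs disappear.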

EM ALGORITHM
The EM algorithm is a general scheme for maximum likelihood learning when complete evidence is missing for some RVs [10], [11]. Thus, augmented SPNs are amenable to EM due to the LVs associated with sum nodes. Moreover, the twin-weights can be kept fixed, so that EM applied to augmented SPNs actually optimizes the weights of the original SPN. This approach was already pointed out in [1], where it was suggested that for evidence $e$ and for any LV $Z_{\mathsf{S}}$, the marginal posteriors should be given as an expression in the derivatives of $\mathcal{S}(e)$, which should be used for EM updates. These updates, however, cannot be the correct ones, as they actually leave the weights unchanged. Here, using augmented SPNs, we formally derive the standard EM updates for sum-weights and for the input distributions, when the latter are chosen from an exponential family.

Updates for Weights
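Assume a dataset $\mathcal{D} = \{e^{(1)}, \dots, e^{(L)}\}$ of $L$ i.i.d. samples, where each $e^{(l)}$ is any combination of complete and partial evidence for the model RVs $\mathbf{X}$, cf. Section 1.1. Let $\mathbf{Z} = \mathbf{Z}_{\mathbf{S}(\mathcal{S})}$ be the set of all LVs and consider an arbitrary sum node $\mathsf{S}$. Eq. (13) shows that the weights can be interpreted as conditional probabilities in our BN interpretation, where

$$w_k = \mathcal{S}'(Z_{\mathsf{S}} = k \mid \mathbf{Z}_p \in \mathcal{Z}), \qquad \bar{w}_k = \mathcal{S}'(Z_{\mathsf{S}} = k \mid \mathbf{Z}_p \in \bar{\mathcal{Z}}). \tag{14}$$

As mentioned above, the twin-weights $\bar{w}_k$ are kept fixed. Using the well-known EM updates in BNs over discrete RVs [10], [15], the updates for sum-weight $w_k$ are given by summing over the expected statistics

$$\mathcal{S}'(Z_{\mathsf{S}} = k, \mathbf{Z}_p \in \mathcal{Z} \mid e^{(l)}), \tag{15}$$

followed by renormalization. We make the event $\mathbf{Z}_p \in \mathcal{Z}$ explicit by introducing a switching parent $Y_{\mathsf{S}}$ of $Z_{\mathsf{S}}$: When the twin sum of $\mathsf{S}$ exists, $Y_{\mathsf{S}}$ assumes the two states $\mathrm{val}(Y_{\mathsf{S}}) = \{y_{\mathsf{S}}, \bar{y}_{\mathsf{S}}\}$, where $Y_{\mathsf{S}} = y_{\mathsf{S}} \Leftrightarrow \mathbf{Z}_p \in \mathcal{Z}$ and $Y_{\mathsf{S}} = \bar{y}_{\mathsf{S}} \Leftrightarrow \mathbf{Z}_p \in \bar{\mathcal{Z}}$. When the twin sum does not exist, $Y_{\mathsf{S}}$ just takes the single value $\mathrm{val}(Y_{\mathsf{S}}) = \{y_{\mathsf{S}}\}$. Clearly, when observed, $Y_{\mathsf{S}}$ renders $Z_{\mathsf{S}}$ independent from $\mathbf{Z}_p$. The switching parent can be explicitly introduced in the augmented SPN, as depicted in Fig. 5. Here we simply introduce two new IVs $\lambda_{Y_{\mathsf{S}} = y_{\mathsf{S}}}$ and $\lambda_{Y_{\mathsf{S}} = \bar{y}_{\mathsf{S}}}$, which switch on/off the output of $\mathsf{S}$ and $\bar{\mathsf{S}}$, respectively. It is easy to see that when these IVs are constantly set to 1, i.e. when $Y_{\mathsf{S}}$ is marginalized, the augmented SPN performs exactly the same computations as before. It is furthermore easy to see that completeness and decomposability of the augmented SPN are maintained when the switching parent is introduced. Using the switching parent, the required expected statistics (15) translate to

$$\mathcal{S}'(Z_{\mathsf{S}} = k, Y_{\mathsf{S}} = y_{\mathsf{S}} \mid e^{(l)}). \tag{16}$$

To compute (16), we use the differential approach [16], [17], [19], cf. also Section 1.1. First note that

$$\mathcal{S}'(Z_{\mathsf{S}} = k, Y_{\mathsf{S}} = y_{\mathsf{S}}, e^{(l)}) = \frac{\partial^2 \mathcal{S}'(e^{(l)})}{\partial \lambda_{Y_{\mathsf{S}} = y_{\mathsf{S}}} \, \partial \lambda_{Z_{\mathsf{S}} = k}}. \tag{17}$$

The first derivative is given as

$$\frac{\partial \mathcal{S}'(e^{(l)})}{\partial \lambda_{Y_{\mathsf{S}} = y_{\mathsf{S}}}} = \frac{\partial \mathcal{S}'(e^{(l)})}{\partial \mathsf{P}} \, \mathsf{S}(e^{(l)}) \tag{18}$$

$$= \frac{\partial \mathcal{S}'(e^{(l)})}{\partial \mathsf{P}} \sum_{k=1}^{K_{\mathsf{S}}} \lambda_{Z_{\mathsf{S}} = k} \, w_k \, \mathsf{C}^k_{\mathsf{S}}(e^{(l)}), \tag{19}$$

where $\mathsf{P}$ is the common product parent of $\mathsf{S}$ and $\lambda_{Y_{\mathsf{S}} = y_{\mathsf{S}}}$ in the augmented SPN (see Fig. 5b). Differentiating (19) with respect to $\lambda_{Z_{\mathsf{S}} = k}$ yields the second derivative

$$\frac{\partial^2 \mathcal{S}'(e^{(l)})}{\partial \lambda_{Y_{\mathsf{S}} = y_{\mathsf{S}}} \, \partial \lambda_{Z_{\mathsf{S}} = k}} = \frac{\partial \mathcal{S}'(e^{(l)})}{\partial \mathsf{P}} \, w_k \, \mathsf{C}^k_{\mathsf{S}}(e^{(l)}), \tag{20}$$

delivering the required posteriors

$$\mathcal{S}'(Z_{\mathsf{S}} = k, Y_{\mathsf{S}} = y_{\mathsf{S}} \mid e^{(l)}) = \frac{1}{\mathcal{S}'(e^{(l)})} \frac{\partial \mathcal{S}'(e^{(l)})}{\partial \mathsf{P}} \, w_k \, \mathsf{C}^k_{\mathsf{S}}(e^{(l)}). \tag{21}$$

We do not want to construct the augmented SPN explicitly, so we express (21) in terms of the original SPN. Since all LVs are marginalized, it holds that $\mathcal{S}'(e^{(l)}) = \mathcal{S}(e^{(l)})$ and $\frac{\partial \mathcal{S}'(e^{(l)})}{\partial \mathsf{P}} = \frac{\partial \mathcal{S}(e^{(l)})}{\partial \mathsf{S}}$, yielding

$$\mathcal{S}'(Z_{\mathsf{S}} = k, Y_{\mathsf{S}} = y_{\mathsf{S}} \mid e^{(l)}) = \frac{1}{\mathcal{S}(e^{(l)})} \frac{\partial \mathcal{S}(e^{(l)})}{\partial \mathsf{S}} \, w_k \, \mathsf{C}^k_{\mathsf{S}}(e^{(l)}), \tag{22}$$

delivering the required statistics for updating the sum-weights. We now turn to the updates of the input distributions.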
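The weight update based on (22) can be sketched for a small SPN with two sum nodes. The sketch makes several assumptions: the network structure and leaf parameters are hypothetical, leaves are kept fixed, only complete evidence is handled, and the derivatives are written out by hand instead of using a generic back-propagation pass.

```python
import math

def bern(p1, x):
    """Bernoulli leaf: probability of binary state x."""
    return p1 if x == 1 else 1.0 - p1

def forward(x1, x2, v, w):
    """Upward pass of a toy SPN: root = v0 * (S x D(X2)) + v1 * N."""
    c = [bern(0.1, x1), bern(0.8, x1)]   # children of the inner sum S
    s = w[0] * c[0] + w[1] * c[1]        # inner sum S
    p = s * bern(0.5, x2)                # product P = S x D(X2)
    n = bern(0.4, x1) * bern(0.9, x2)    # node N
    root = v[0] * p + v[1] * n
    return root, s, p, n, c

def em_step(data, v, w):
    """One EM iteration: accumulate (1/S(e)) * dS/dS * w_k * C_k(e), then renormalize."""
    n_v, n_w = [0.0, 0.0], [0.0, 0.0]
    for x1, x2 in data:
        root, s, p, n, c = forward(x1, x2, v, w)
        d_root = 1.0                     # derivative of the root w.r.t. itself
        d_s = v[0] * bern(0.5, x2)       # derivative of the root w.r.t. S
        for k, child in enumerate((p, n)):
            n_v[k] += d_root * v[k] * child / root
        for k in range(2):
            n_w[k] += d_s * w[k] * c[k] / root
    return [a / sum(n_v) for a in n_v], [a / sum(n_w) for a in n_w]

def log_lik(data, v, w):
    return sum(math.log(forward(x1, x2, v, w)[0]) for x1, x2 in data)

if __name__ == "__main__":
    data = [(0, 0), (1, 1), (1, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]
    v, w = [0.5, 0.5], [0.5, 0.5]
    for _ in range(10):
        v, w = em_step(data, v, w)
    print(v, w, log_lik(data, v, w))
```

Note that the statistics of the inner sum are scaled by the derivative of the root with respect to that sum, so samples explained mainly by the other root child contribute little to its update.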

Updates for Input Distributions
For simplicity, we derive updates for univariate input distributions, i.e. for all distributions $\mathsf{D}_Y$ we have $|\mathrm{sc}(\mathsf{D}_Y)| = 1$. Similar updates can be derived rather easily for multivariate input distributions as well. In [17], the so-called distribution selectors (DSs) were introduced to derive the differential approach for generalized SPNs. Similar to the switching parents for (twin) sum nodes, the DSs are RVs which render the respective model RVs independent from the remaining RVs. More formally, for each $X \in \mathbf{X}$, let $\mathbf{D}_X$ be the set of all input distributions which have scope $\{X\}$. Assume an arbitrary but fixed ordering of $\mathbf{D}_X$ and let $[\mathsf{D}_X]$ be the index of $\mathsf{D}_X$ in this ordering. Let the DS $W_X$ be a discrete RV with $|\mathbf{D}_X|$ states. The so-called gated SPN $\mathcal{S}^g$ is obtained by replacing each distribution $\mathsf{D}_X$ by the product node

$$\mathsf{P} := \mathsf{D}_X \times \lambda_{W_X = [\mathsf{D}_X]}. \tag{23}$$

The introduced product is denoted as gate. As shown in [17], $X$ is rendered independent from all other RVs in the SPN when conditioned on $W_X$. Moreover, $\mathsf{D}_X$ is the conditional distribution of $X$ given $W_X = [\mathsf{D}_X]$. Therefore, each $X$ and its DS $W_X$ can be incorporated as a two-RV family in our BN interpretation. When each input distribution $\mathsf{D}_X$ is chosen from an exponential family with natural parameters $\theta_{\mathsf{D}_X}$, the M-step is given by the expected sufficient statistics

$$\theta_{\mathsf{D}_X} \leftarrow \frac{\sum_{l=1}^{L} \mathcal{S}^g(W_X = k \mid e^{(l)}) \int \mathsf{D}_X(x \mid e^{(l)}) \, \theta_{\mathsf{D}_X}(x) \, dx}{\sum_{l=1}^{L} \mathcal{S}^g(W_X = k \mid e^{(l)})}, \tag{24}$$

where $k = [\mathsf{D}_X]$. When $e^{(l)}$ contains complete evidence $x'$ for $X$, then the integral $\int \mathsf{D}_X(x \mid e^{(l)}) \, \theta_{\mathsf{D}_X}(x) \, dx$ reduces to $\theta_{\mathsf{D}_X}(x')$. When $e^{(l)}$ contains partial evidence $\mathcal{X}$, then

$$\int \mathsf{D}_X(x \mid e^{(l)}) \, \theta_{\mathsf{D}_X}(x) \, dx = \frac{\int_{\mathcal{X}} \mathsf{D}_X(x) \, \theta_{\mathsf{D}_X}(x) \, dx}{\int_{\mathcal{X}} \mathsf{D}_X(x) \, dx}. \tag{25}$$

Depending on $\mathcal{X}$ and the type of $\mathsf{D}_X$, evaluating (25) can be more or less demanding. A simple but practical case is when $\mathsf{D}_X$ is Gaussian and $\mathcal{X}$ is some interval, permitting a closed-form solution for integrating the Gaussian's statistics $\theta(x) = (x, x^2)$, using truncated Gaussians [34].
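For the Gaussian case with interval evidence, the expected statistics $(x, x^2)$ have a closed form via the moments of a truncated Gaussian. The following is a minimal sketch using the standard textbook formulas; it assumes a finite interval $[a, b]$, and the function names are hypothetical.

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def truncated_gaussian_stats(mu, sigma, a, b):
    """Return (E[x], E[x^2]) of N(mu, sigma^2) restricted to the finite interval [a, b]."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    z = Phi(beta) - Phi(alpha)                 # probability mass of the interval
    r = (phi(alpha) - phi(beta)) / z
    mean = mu + sigma * r
    var = sigma ** 2 * (1.0 + (alpha * phi(alpha) - beta * phi(beta)) / z - r * r)
    return mean, var + mean ** 2

if __name__ == "__main__":
    print(truncated_gaussian_stats(1.0, 2.0, 0.0, 3.0))
```

For intervals that carry very little probability mass, the division by `z` becomes numerically delicate; a practical implementation would likely need a more careful treatment of the tails.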
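To obtain the posteriors $\mathcal{S}^g(W_X = k \mid e^{(l)})$ required in (24), we again use the differential approach. Note that

$$\mathcal{S}^g(W_X = k, e^{(l)}) = \frac{\partial \mathcal{S}^g(e^{(l)})}{\partial \lambda_{W_X = k}} = \frac{\partial \mathcal{S}^g(e^{(l)})}{\partial \mathsf{P}} \, \mathsf{D}_X(e^{(l)}), \tag{26}$$

where $k = [\mathsf{D}_X]$ and $\mathsf{P}$ is the gate of $\mathsf{D}_X$, cf. (23). If we do not want to construct the gated SPN explicitly, we can use the identity $\frac{\partial \mathcal{S}^g(e^{(l)})}{\partial \mathsf{P}} = \frac{\partial \mathcal{S}(e^{(l)})}{\partial \mathsf{D}_X}$. Thus the required posteriors are given as

$$\mathcal{S}^g(W_X = k \mid e^{(l)}) = \frac{1}{\mathcal{S}(e^{(l)})} \frac{\partial \mathcal{S}(e^{(l)})}{\partial \mathsf{D}_X} \, \mathsf{D}_X(e^{(l)}). \tag{27}$$

Fig. 6: Pseudo-code for the EM algorithm in SPNs.
procedure EXPECTATION-MAXIMIZATION($\mathcal{S}$)
    Initialize $\mathbf{w}$ and input distributions
    while not converged do
        $\forall \mathsf{S} \in \mathbf{S}(\mathcal{S}), \forall \mathsf{C} \in \mathrm{ch}(\mathsf{S})$ : $n_{\mathsf{S},\mathsf{C}} \leftarrow 0$
        $\forall X \in \mathbf{X}, \forall \mathsf{D}_X \in \mathbf{D}_X$ : $\theta_{\mathsf{D}_X} \leftarrow 0$, $n_{\mathsf{D}_X} \leftarrow 0$
        for $l = 1 \dots L$ do
            Input $e^{(l)}$ to $\mathcal{S}$
            Evaluate $\mathcal{S}$ (upward-pass)
            Backprop $\mathcal{S}$ (backward-pass)
            $\forall \mathsf{S} \in \mathbf{S}(\mathcal{S}), \forall \mathsf{C} \in \mathrm{ch}(\mathsf{S})$ : accumulate the posterior (22) in $n_{\mathsf{S},\mathsf{C}}$
            $\forall X \in \mathbf{X}, \forall \mathsf{D}_X \in \mathbf{D}_X$ : accumulate the posterior (27) in $n_{\mathsf{D}_X}$ and the weighted statistics of (24) in $\theta_{\mathsf{D}_X}$
        end for
        $\forall \mathsf{S} \in \mathbf{S}(\mathcal{S}), \forall \mathsf{C} \in \mathrm{ch}(\mathsf{S})$ : set $w_{\mathsf{S},\mathsf{C}}$ to the renormalized $n_{\mathsf{S},\mathsf{C}}$
        $\forall X \in \mathbf{X}, \forall \mathsf{D}_X \in \mathbf{D}_X$ : set parameters to $\theta_{\mathsf{D}_X} / n_{\mathsf{D}_X}$
    end while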
The EM algorithm for SPNs, both for sum-weights and input distributions, is summarized in Fig. 6. In Section 5.1 we empirically verify our derivation of EM and show that standard EM successfully trains SPNs when a suitable structure is at hand.
Note that recently Zhao and Poupart [35] derived a concave-convex procedure (CCCP) which yields the same sum-weight updates as the EM algorithm presented here and in [19]. This result is surprising, as EM and CCCP are rather different approaches in general.

MOST PROBABLE EXPLANATION
In [1], [4], [7], SPNs were applied for reconstructing data using MPE inference.Given some distribution p over X and evidence e, MPE can be formalized as finding arg max xPe ppxq, where we assume that p actually has a maximum in e. MPE is a special case of MAP, defined as finding arg max yPerYs ş erZs ppy, zq dz, for some two-partition of X, i.e.X " Y Y Z, Y X Z " H.Both MPE and MAP are generally NP-hard in BNs [36], [37], [38], and MAP is inherently harder than MPE [37], [38].Using the result in [18], it follows that MAP inference is NP-hard also in SPNs.In particular, Theorem 5 in [18] shows that the decision version of MAP is NP-complete for a Naive Bayes model, when the class variable is marginalized.Naive Bayes is represented by the augmentation of an SPN with a single sum node, the LV representing the class variable.Therefore, MAP in SPNs is generally NP-hard.Since MAP in the augmented SPN representing the Naive Bayes model corresponds to MPE inference in the original SPN, i.e. a mixture model, it follows that also MPE inference is generally NP-hard in SPNs.A proof tailored to SPNs can be found in [19].
However, when considering the sub-class of selective SPNs (cf. Section 1.1 and [20]), an MPE solution can be obtained using a Viterbi-style backtracking algorithm in max-product networks.
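Definition 5 (Max-Product Network). Let $\mathcal{S}$ be an SPN over $\mathbf{X}$. We define the max-product network (MPN) $\hat{\mathcal{S}}$ by replacing each distribution node $D$ by a maximizing distribution node

$$\hat{D}(\mathcal{X}) := \max_{x \in \mathcal{X}} D(x),$$

and each sum node $S$ by a max node

$$\hat{S} := \max_{C \in ch(S)} w_{S,C} \, \hat{C}.$$

A product node $P$ in $\mathcal{S}$ corresponds to a product node $\hat{P}$ in $\hat{\mathcal{S}}$.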
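Theorem 2. Let $\mathcal{S}$ be a selective SPN over $\mathbf{X}$ and let $\hat{\mathcal{S}}$ be the corresponding MPN. Let $N$ be some node in $\mathcal{S}$ and $\hat{N}$ its corresponding node in $\hat{\mathcal{S}}$. Then, for every $\mathcal{X} \in \mathcal{H}_{sc(N)}$ we have $\hat{N}(\mathcal{X}) = \max_{x \in \mathcal{X}} N(x)$.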
Theorem 2 shows that the MPN maximizes the probability in its corresponding selective SPN. The proof (see appendix) also shows how to actually find a maximizing assignment. For a product, a maximizing assignment is given by combining the maximizing assignments of its children. For a sum, a maximizing assignment is given by the maximizing assignment of a single child, whose weighted maximum is maximal among all children. Here the children's maxima are readily given by the upward pass in the MPN. Thus, finding a maximizing assignment of any node in a selective SPN recursively reduces to finding maximizing assignments for the children of this node; this can be accomplished by a Viterbi-like backtracking procedure. This algorithm, denoted as MPESELECTIVE, is shown in Fig. 7. Here $Q$ denotes a queue of nodes, where $Q \Leftarrow N$ and $N \Leftarrow Q$ denote the en-queue and de-queue operations, respectively. Note that Theorem 2 has already been derived for a special case, namely for arithmetic circuits representing network polynomials of BNs over discrete RVs [39].
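The backtracking procedure described above can be sketched as follows. This is a minimal illustration under simplifying assumptions: all class and function names are hypothetical, the inputs are indicator leaves over binary RVs, no evidence is given (so every maximizing indicator leaf outputs 1), and the toy SPN is selective by construction. The result is compared against exhaustive enumeration.

```python
import math
from collections import deque
from itertools import product as assignments

class Ind:
    """Indicator leaf for the event X_var = state."""
    def __init__(self, var, state):
        self.var, self.state = var, state

class Prod:
    def __init__(self, children):
        self.children = children

class Sum:
    def __init__(self, children, weights):
        self.children, self.weights = children, weights

def evaluate(node, x):
    """Value of an SPN node for a complete assignment x (a tuple of states)."""
    if isinstance(node, Ind):
        return 1.0 if x[node.var] == node.state else 0.0
    if isinstance(node, Prod):
        return math.prod(evaluate(c, x) for c in node.children)
    return sum(w * evaluate(c, x) for w, c in zip(node.weights, node.children))

def max_pass(node, best):
    """Upward pass in the MPN: sums are replaced by weighted maxima."""
    if node not in best:
        if isinstance(node, Ind):
            best[node] = 1.0  # maximum of an indicator without evidence
        elif isinstance(node, Prod):
            best[node] = math.prod(max_pass(c, best) for c in node.children)
        else:
            best[node] = max(w * max_pass(c, best)
                             for w, c in zip(node.weights, node.children))
    return best[node]

def mpe_selective(root):
    """Viterbi-style backtracking with a queue of nodes."""
    best = {}
    max_pass(root, best)
    x_star, queue = {}, deque([root])
    while queue:
        n = queue.popleft()
        if isinstance(n, Sum):
            # follow the single child whose weighted maximum is maximal
            _, child = max(zip(n.weights, n.children),
                           key=lambda wc: wc[0] * best[wc[1]])
            queue.append(child)
        elif isinstance(n, Prod):
            queue.extend(n.children)  # combine the children's assignments
        else:
            x_star[n.var] = n.state   # maximizing state of the input
    return x_star, best[root]

# Hypothetical selective SPN over two binary RVs (indices 0 and 1).
x0_0, x0_1, x1_0, x1_1 = Ind(0, 0), Ind(0, 1), Ind(1, 0), Ind(1, 1)
s_a = Sum([x1_0, x1_1], [0.9, 0.1])
s_b = Sum([x1_0, x1_1], [0.2, 0.8])
root = Sum([Prod([x0_0, s_a]), Prod([x0_1, s_b])], [0.4, 0.6])

x_star, p_star = mpe_selective(root)
brute = max(assignments([0, 1], repeat=2), key=lambda x: evaluate(root, x))
```

Selectivity is what makes the replacement of sums by maxima exact here: for any complete assignment, at most one child of each sum is non-zero, so the sum and the maximum over its weighted children coincide.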
A direct corollary of Theorem 2 is that MPE inference is tractable in augmented SPNs, since augmented SPNs are selective SPNs over $\mathbf{X}$ and $\mathbf{Z}$. This can easily be seen in AUGMENTSPN: for any $z$ and any sum $S$, exactly one IV of $Z_S$ is set to 1, so that at most one child of $S$ (or $\bar{S}$) can be non-zero. Therefore, we can use MPESELECTIVE in augmented SPNs in order to find an MPE solution over both model RVs and LVs. Note that an MPE solution for the augmented SPN does not in general correspond to an MPE solution for the original SPN when the states of the LVs are discarded. However, this procedure is a frequently used approximation for models where MPE is tractable for both model RVs and LVs, but not for model RVs alone.
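A small numerical example illustrates why discarding the LV states of an augmented MPE solution is only an approximation. The numbers below are made up for illustration: a mixture (a single sum node) with three components over one binary RV $X$, where $Z$ denotes the LV of the sum.

```python
# Hypothetical mixture: p(x) = sum_z weights[z] * components[z][x]
weights = [0.4, 0.3, 0.3]
components = [[0.9, 0.1],   # component 0 prefers X = 0
              [0.2, 0.8],   # components 1 and 2 prefer X = 1
              [0.2, 0.8]]

# Joint distribution over (X, Z), as represented by the augmented SPN.
joint = {(x, z): weights[z] * components[z][x]
         for x in range(2) for z in range(3)}

# MPE over model RV and LV jointly, then discard the state of Z.
x_aug, z_aug = max(joint, key=joint.get)

# MPE in the original SPN, i.e. with Z marginalized.
marginal = [sum(joint[(x, z)] for z in range(3)) for x in range(2)]
x_orig = max(range(2), key=lambda x: marginal[x])
```

Here the joint maximum is attained at $X = 0$ (with the first component), while the marginal over $X$ is maximized at $X = 1$, since the probability mass of $X = 1$ is spread over two components.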
Fig. 7 (excerpt of the MPESELECTIVE pseudo-code): … else if $N$ is a product node then … else if $N$ is a maximizing distribution node then: $N \leftarrow$ corresponding distribution node; $x^*[sc(N)] = \arg\max$ …

In [1], MPESELECTIVE was applied to original SPNs, not to augmented SPNs, but also with the goal of recovering an MPE solution over both model RVs and LVs. The states of the LVs were assigned during max-backtracking, as sum children and LV states are in one-to-one correspondence.
The states of the LVs whose sums are not visited during backtracking are not assigned. Again, this causes some confusion, since some LVs appear to be undefined in some contexts, cf. the illustrations in Section 2. However, since this algorithm was used as an approximation for MPE over model RVs by discarding the states of the LVs, this situation was not paid any further attention. Nevertheless, as we show here, applying MPESELECTIVE to original (non-selective) SPNs effectively "simulates" MPESELECTIVE in the corresponding augmented SPN. Thereby, however, deterministic twin-weights are implicitly assumed, i.e. twin-weights which are 0, except a single 1. To see this, let us modify MPESELECTIVE such that it can be applied to an original SPN, but returns an MPE solution for the corresponding augmented SPN. First note that in the augmented MPN, every twin node simply outputs the maximal twin-weight among all children whose states are contained in evidence $e$. For twin node $\bar{S}$, let this maximal weight be denoted by $\hat{w}_{\bar{S}}$. The effect of the twin nodes can now be simulated in the original SPN by replacing each weight $w_{S,C}$ in the original SPN by $w_{S,C} \times \bar{w}_{S,C}$. Here $\bar{w}_{S,C}$ is a correction factor, given as $\bar{w}_{S,C} = \prod_{\bar{S}} \hat{w}_{\bar{S}}$, where the product runs over all twins of those sums for which $S$ is a conditioning sum. By using these corrected weights, each max node in the corresponding MPN gets the same input as in the MPN of the augmented SPN, i.e. the twin nodes are simulated. We can identify the maximizing states of those LVs whose sums are visited during backtracking, as in [1]. The states of the sums which are not visited are given by the child which corresponds to the maximal twin-weight $\hat{w}_{\bar{S}}$. Pseudo-code for this somewhat technical modification of MPESELECTIVE can be found in [19].
We see that the algorithm used in [1] is essentially equivalent to MPESELECTIVE in augmented SPNs when the correction factor $\bar{w}_{S,C} = 1$ for all sum nodes, which implies that the twin-weights are deterministic. Therefore, although the LV model in [1] is not completely specified and it was not shown that the Viterbi-like algorithm recovers an MPE solution, it nevertheless corresponds to MPE inference in the augmented SPN for special twin-weights, i.e. deterministic weights.

Fig. 8. Illustration of the low-depth bias using an SPN over RVs $\{X_1, X_2, X_3\}$. The structure introduced by augmentation is depicted by small nodes and edges. When deterministic twin-weights are used, the state of $Z_{S_1}$ corresponding to $P_1$ is preferred over $P_2$ and $P_3$, since their probabilities are "dampened" by the weights of $S_2$ and $S_3$, respectively.
However, using deterministic twin-weights is a rather unnatural choice, since this prefers one arbitrary state over the others in cases where this LV is actually "rendered irrelevant". In this case, MPE inference also has a bias towards less structured sub-models, which we call low-depth bias. This is illustrated in Fig. 8, which shows an SPN over three RVs $X_1$, $X_2$, $X_3$. The augmented SPN has two twin sum nodes $\bar{S}_2$ and $\bar{S}_3$, corresponding to $S_2$ and $S_3$, respectively. When their twin-weights are deterministic, the selection of the state of $Z_{S_1}$ is biased towards the state corresponding to $P_1$, which is a distribution assuming independence among $X_1$, $X_2$ and $X_3$. This comes from the fact that the values of $P_2$ and $P_3$ are dampened by the weights of $S_2$ and $S_3$, respectively, which are generally smaller than 1. Therefore, when using deterministic weights for twin sum nodes, we introduce a bias towards the selection of sub-SPNs that are less deep and less structured. Using uniform weights for twin sum nodes is somewhat "fairer", since in this case $P_1$ gets dampened by $\bar{S}_2$ and $\bar{S}_3$, $P_2$ by $S_2$ and $\bar{S}_3$, and $P_3$ by $\bar{S}_2$ and $S_3$. Uniform weights are to some extent the opposite choice to deterministic twin-weights: the former represent the strongest possible dampening via twin-weights and therefore actually penalize less structured distributions. Investigating these effects further is subject to future work.
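The dampening argument can be made concrete with hypothetical numbers. The sketch below assumes a structure in the spirit of Fig. 8: the root sum $S_1$ has three children $P_1$, $P_2$, $P_3$ with uniform weights, $P_2$ contains $S_2$, $P_3$ contains $S_3$, both inner sums have two children with uniform weights, and all input maxima equal 1. None of these values are taken from the paper; they only illustrate the mechanism.

```python
from fractions import Fraction

root_w = [Fraction(1, 3)] * 3   # assumed weights of S1 for P1, P2, P3
inner_max = Fraction(1, 2)      # assumed maximal weight of S2 and of S3

def scores(twin_max):
    """Inputs of the root max node in the augmented MPN.
    twin_max is the maximal twin-weight of the twins of S2 and S3."""
    p1 = root_w[0] * twin_max * twin_max    # dampened by both twins
    p2 = root_w[1] * inner_max * twin_max   # dampened by S2 and the twin of S3
    p3 = root_w[2] * twin_max * inner_max   # dampened by the twin of S2 and S3
    return p1, p2, p3

deterministic = scores(Fraction(1))    # twin-weights are 0 except a single 1
uniform = scores(Fraction(1, 2))       # uniform twin-weights over two states
```

With deterministic twin-weights the three scores are 1/3, 1/6 and 1/6, so the state corresponding to $P_1$ wins although the root weights are uniform. With uniform twin-weights all three scores equal 1/12 in this symmetric setting.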

Experiments with EM Algorithm
In [1], [40] SPNs were applied to image data, where a generic architecture reminiscent of convolutional neural networks was proposed. We refer to this architecture as the PD architecture. Standard EM was not used in experiments for two reasons: First, explicitly constructing the proposed structure and training it with standard EM is hardly possible with current hardware, since the number of nodes grows as $O(l^3)$, where $l$ is the square-length of the modeled image domain in pixels [19]. Instead, a sparse hard EM algorithm was used, which virtualizes the PD structure, i.e. sums and products are generated on the fly (see [40] for details). Second, standard EM seemed unsuited to train large and dense SPNs, either because it is trapped in local optima or due to the gradient vanishing phenomenon.
In our experiments,$^2$ we investigated three questions: 1) Is our derivation of EM correct, both for complete and missing data? 2) Can the result of hard EM [1] be improved by standard EM? 3) Given a suitable sparse structure, does EM yield a good solution for parameters?
Question 1) is important since the original derivation contained an error. Questions 2) and 3) are concerned with the general applicability of EM for training SPNs. We used the same datasets and SPN structures as in [1], obtainable from [40]. The datasets comprise Caltech-101 (including the background class) [42] and the ORL face images [43], i.e. in total 103 datasets. The input distributions in these SPNs are single-dimensional Gaussians (4 for each pixel), where means were set to the averages of the 4-quantiles and variances were constantly 1. We ran EM (Fig. 6) for 30 iterations with various settings, varying the set of updated parameters, the initialization, and the amount of missing training data. The training likelihood never decreased from one iteration to the next,$^3$ i.e. our derived EM algorithm showed monotonicity in our experiments. Moreover, as can be seen in Fig. 9a, the training log-likelihood actually increased over iterations. The curves for the missing data scenarios are similar. This gives affirmative evidence for question 1). Fig. 9b shows the log-likelihood on the test set. Note that optimizing the parameter sets V and WV led to severe overfitting: while achieving extremely high likelihoods on the training set, they achieved extremely poor likelihoods on the test set. The parameter sets MV and WMV also tend to overfit, although not as strongly as V and WV.
Regarding question 2), we inspected the test log-likelihood more closely when the original parameters are used for initialization, i.e. when the parameters obtained by [40] are post-trained using EM. Table 1 summarizes the results. When parameter sets not including Gaussian variances are optimized (i.e. W, M, and WM), the test log-likelihood increased most of the time, i.e. for 83.5% (M) up to 92.23% (WM) of the datasets. Furthermore, given oracle knowledge about the ideal number of iterations (i.e. column best), the average log-likelihood increased by 0.58% (M) up to 1.39% (WM) relative to the original parameters. Most of this improvement happens in the first iteration, yielding 0.52% (M) up to 1.05% (WM) improvement. These results indicate that the parameters obtained by [40] slightly underfit the given datasets. Similarly to Fig. 9, we see that parameter sets including the Gaussian variances (V, WV, MV, WMV) are prone to overfitting: more than 60% of the datasets decreased their test log-likelihood during EM. However, in the remaining 40% of the datasets, the test log-likelihood could be improved substantially, by at least 14% on average.

We now turn to question 3). As pointed out above, a hard EM variant was used in [1], [40] which at the same time finds the effective SPN structure. Optimizing W using the 3 random initializations amounts to using the oracle structure obtained by [1], [40], discarding the learned parameters. For each dataset we selected the random initialization which yielded the highest likelihood on the training set in iteration 30. For this run, we compared the log-likelihoods with the log-likelihoods obtained by the original parameters. The results are summarized in Table 2. We see that on all datasets the log-likelihood on the training set is larger than for the original parameters. This is also the case for each individual random start (not just the best one): every random restart yielded a higher training log-likelihood than the original parameters. Thus, by considering the actual optimization objective, the likelihood on the training set, EM successfully trains SPNs, given a suitable oracle structure. Furthermore, as can be seen in Table 2, EM is also not more prone to overfitting than the algorithm in [1]: on 67.96% of the datasets, EM delivered a higher test log-likelihood than the original parameters, when using oracle knowledge about the ideal number of iterations (column best).

3. Except for tiny occasional decreases (always $< 10^{-8}$) after EM had converged, which can be attributed to numerical artifacts.

Experiments with MPE Inference
To illustrate the correctness of MPESELECTIVE (Fig. 7) when applied to augmented SPNs, we generated SPNs using the PD architecture [1], arranging 4, 9 and 16 binary RVs in a 2×2, 3×3 and 4×4 grid, respectively. As inputs we used two indicator variables for each RV, representing their two states. The sum-weights were drawn from a Dirichlet distribution.

TABLE 3
Differences of log-likelihood to the ground-truth MPE solution found by exhaustive enumeration, averaged over 100 independent draws of sum-weights. Numbers in parentheses are the number of times an MPE solution was found. Results for augmented SPNs using uniform twin-weights.
The results relative to the ground truth MPE solutions are shown in Tables 3, 4, and 5. As can be seen, MPEUNI always finds an MPE solution in the augmented SPN with uniform twin-weights and MPEDET always finds an MPE solution in augmented SPNs with deterministic twin-weights. This gives empirical evidence for the correctness of MPESELECTIVE for MPE inference in augmented SPNs. Furthermore, we wanted to investigate the quality of both algorithms when serving as an approximation for MPE inference in the original SPNs. For the SPNs considered here, MPEDET delivered on average slightly better approximations than MPEUNI. However, these results should be interpreted with caution, due to the rather similar nature of the distributions considered here. Investigating approximate MPE for (original) SPNs more closely is an interesting direction and will be subject to future research.

Fig. 1. Problems occurring when IVs of LVs are introduced. (a): Excerpt of an SPN containing a sum $S$, corresponding to LV $Z$. (b): Introducing IVs for $Z$ renders $S'$ incomplete, assuming that $S \notin desc(N)$. (c): Remedy by extending the SPN further, introducing the twin sum node $\bar{S}$.

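Definition 4. Let $S$ be a sum node in an SPN $\mathcal{S}$ and $Z_S$ its associated LV. All other RVs (model RVs and LVs) are divided into three sets:
• Parents $\mathbf{Z}_p$, which are all LVs "above" $S$, i.e. $\mathbf{Z}_p = \mathbf{Z}_{anc_{\mathbf{S}}(S)} \setminus Z_S$.
• Children $\mathbf{Y}_c$, which are all model RVs and LVs "below" $S$, i.e. $\mathbf{Y}_c = sc(S) \cup \mathbf{Z}_{desc_{\mathbf{S}}(S)} \setminus Z_S$.
• Non-descendants $\mathbf{Y}_n$, which are the remaining RVs, i.e. $\mathbf{Y}_n = (\mathbf{X} \cup \mathbf{Z}_{\mathbf{S}(\mathcal{S})}) \setminus (\mathbf{Z}_p \cup \mathbf{Y}_c \cup Z_S)$.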

Fig. 5. Explicitly introducing a switching parent $Y_S$ in an augmented SPN. (a): Part of an augmented SPN containing a sum node with three children and its twin. (b): Explicitly introduced switching parent $Y_S$ using IVs $\lambda_{Y_S = y_S}$ and $\lambda_{Y_S = \bar{y}_S}$.

• Update any combination of the three different types of parameters, i.e. sum-weights, Gaussian means and Gaussian variances. Each set of parameter types is encoded by a string of letters W (weights), M (means) and V (variances). (7 combinations)
• Use original parameters for initialization, obtained from [40], or use 3 random initializations, where sum-weights are drawn from a Dirichlet distribution with uniform $\alpha = 1$ hyper-parameter (i.e. uniform distribution on the standard simplex), Gaussian means are uniformly drawn from $[-1, 1]$ and Gaussian variances from $[0.01, 1]$. Only parameters which are actually updated are initialized randomly; otherwise the original parameters [1] are used and kept fixed. (4 combinations)
• Use complete data or missing training data, randomly discarding 33% or 66% of the observations, independently for each sample. (3 combinations)

Thus, in total we ran EM $7 \times 4 \times 3 \times 103 = 8652$ times, yielding 259560 EM-iterations. To avoid pathological solutions we used a lower bound of 0.01 for the Gaussian variances. In no iteration did we observe a decreasing likelihood on the training set.

2. Code available under [41].
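The run counts follow directly from enumerating the experimental grid. The short sketch below reproduces the arithmetic; the labels for the initializations are placeholders chosen for illustration.

```python
from itertools import combinations, product

# All non-empty subsets of the parameter types W, M, V.
param_sets = ["".join(c) for r in (1, 2, 3) for c in combinations("WMV", r)]

inits = ["original", "random-1", "random-2", "random-3"]  # placeholder labels
missing_percent = [0, 33, 66]
n_datasets, n_iterations = 103, 30

settings = list(product(param_sets, inits, missing_percent))
runs = len(settings) * n_datasets
total_iterations = runs * n_iterations
```

This yields 7 parameter sets, 84 settings per dataset, 8652 EM runs and 259560 EM-iterations.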

Fig. 9. Normalized log-likelihood over EM-iterations, averaged over all 103 datasets and 3 random initializations. (a): Training set. (b): Test set. Curves for V and WV are outside the displayed region, for better readability of the other curves. They start at approximately −8000 nats and decrease to approximately −11000 nats.
Definition (Sum-Product Network). A Sum-Product network (SPN) $\mathcal{S}$ over a set of RVs $\mathbf{X}$ is a tuple $(\mathcal{G}, w)$ where $\mathcal{G}$ is a connected, rooted and acyclic directed graph, and $w$ is a set of non-negative parameters. The graph $\mathcal{G}$ contains three types of nodes: distributions, sums and products. All leaves of $\mathcal{G}$ are distributions and all internal nodes are either sums or products. A distribution node (also called input distribution or simply distribution) $D_{\mathbf{Y}} : val(\mathbf{Y}) \mapsto [0, \infty]$ is a distribution function over a subset of RVs $\mathbf{Y} \subseteq \mathbf{X}$, i.e. either a PMF (discrete RVs), a PDF (continuous RVs), or a mixed distribution function (discrete and continuous RVs mixed). A sum node $S$ computes a weighted sum of its children, i.e. $S = \sum_{C \in ch(S)} w_{S,C} \, C$, where $w_{S,C}$ is a non-negative weight associated with edge $S \rightarrow C$, and $w$ contains the weights for all outgoing sum-edges. A product node $P$ computes the product over its children, i.e. $P = \prod_{C \in ch(P)} C$. The sets $\mathbf{S}(\mathcal{S})$ and $\mathbf{P}(\mathcal{S})$ contain all sum nodes and all product nodes in $\mathcal{S}$, respectively.
AUGMENTSPN (excerpt of the pseudo-code): … $K_S$ do … $\{\dots, K_S\}$: connect $\lambda_{Z_S = k}$ as child of $\bar{S}$, and let $w_{\bar{S}, \lambda_{Z_S = k}} = \bar{w}_{S,k}$; for $S^c \in \mathbf{S}^c(S)$ do; for $k \in \{k \mid S \notin desc(P^k_{S^c})\}$ do …

$\mathcal{S}'$ is called the augmented SPN of $\mathcal{S}$, denoted as $\mathcal{S}' =: aug(\mathcal{S})$. Within the context of $\mathcal{S}'$, $C^k_S$ is called the $k$th former child of $S$. The introduced product node $P^k_S$ is called the link of $S$, $C^k_S$ and $\lambda_{Z_S = k}$, respectively. The sum node $\bar{S}$, if introduced, is called the twin sum node of $S$. With respect to $\mathcal{S}'$, we denote $\mathcal{S}$ as the original SPN.

TABLE 1
Changes in test log-likelihoods when original parameters are post-trained using EM.% inc.: percentage of datasets where log-likelihood increased in the first iteration.% all, % pos., % neg.: relative change of log-likelihood, averaged over all datasets, datasets with positive change, datasets with negative change, respectively.

TABLE 2
Log-likelihoods when sum-weights (W) are trained, using random initialization. % >: percentage of datasets where the log-likelihood is larger than for the original parameters. % all, % pos., % neg.: relative log-likelihood w.r.t. the original parameters, for all datasets and for datasets where the relative log-likelihood is positive/negative, respectively.