Sum-product networks: A survey

—A sum-product network (SPN) is a probabilistic model, based on a rooted acyclic directed graph, in which terminal nodes represent univariate probability distributions and non-terminal nodes represent convex combinations (weighted sums) and products of probability functions. They are closely related to probabilistic graphical models, in particular to Bayesian networks with multiple context-speciﬁc independencies. Their main advantage is the possibility of building tractable models from data, i.e., models that can perform several inference tasks in time proportional to the number of links in the graph. They are somewhat similar to neural networks and can address the same kinds of problems, such as image processing and natural language understanding. This paper oﬀers a survey of SPNs, including their deﬁnition, the main algorithms for inference and learning from data, the main applications, a brief review of software libraries, and a comparison with related models.


I. Introduction
S UM-PRODUCT networks (SPNs) were proposed by Poon and Domingos [1] in 2011 as a modification of Darwiche's [2], [3] arithmetic circuits.Every SPN consists of a directed graph that represents a probability distribution resulting from a hierarchy of distributions combined in the form of mixtures (sum nodes) and factorizations (product nodes), as shown in Figure 1.SPNs, like arithmetic circuits, can be built by transforming a probabilistic graphical model [4], such as a Bayesian network or a Markov network, but they can also be learned from data.The main advantage of SPNs is that several inference tasks can be performed in time proportional to the number of links in the graph.
In this decade there has been great progress: numerous algorithms have been proposed for inference and learning, and SPNs have been successfully applied to many problems, including computer vision and natural language processing, in which probabilistic models could not compete with neural networks.The understanding of SPNs has also improved and some aspect can now be explained more clearly than in the original publications.For example, the first two papers about SPNs [1], [5] presented them as an efficient representation of network polynomials, while most of the later references, beginning with [6], define them as the composition of probability distributions, which is, in our view, more intuitive and much easier to understand.Consistency was initially one of the defining properties of SPNs, which made them more general than arithmetic circuits, but it later became clear that decomposability, a stronger but much more intuitive property, suffices to build SPNs for practical applications.In contrast, selectivity (called determinism in arithmetic circuits), which was not mentioned in the original paper [1], proved to be relevant for some inference tasks and for parameter learning [7], [8].Additionally, some of the algorithms for SPNs are only sketched in [1], without much detail or formal proofs, and one of them turned out to be correct only for selective SPNs.Other basic algorithms are scattered over several papers, each using a different mathematical notation.For these reasons we decided to write a survey explaining the main concepts and algorithms for SPNs with mathematical rigor.We have intentionally avoided any reference to network polynomials, which has forced us to develop new proofs for some algorithms and propositions, alternative to those found in other references, such as [9].We have also reviewed the literature on SPNs, with arXiv:2004.01167v1[cs.LG] 2 Apr 2020 especial emphasis on their applications.However, we are aware that some relevant references may have been omitted, especially those published recently in this thriving research area.
The rest of this paper is structured as follows.The two subsections of this introduction highlight the significance of SPNs by comparing them with probabilistic graphical models and neural networks respectively.After some mathematical preliminaries (Sec.II), we introduce the basic definitions of SPNs (Sec.III) and the main algorithms for inference (Sec.IV), parameter learning (Sec.V), and structural learning (Sec.VI).We then review some applications in several areas (Sec.VII), a few open-source packages (Sec.VIII), and some extensions of SPNs (Sec.IX).Section X contains the conclusions, Appendix A compares SPNs with arithmetic circuits, Appendix B analyzes the interpretation of sum nodes as weighted averages of conditional probabilities, and Appendix C contains the proofs of all the propositions.

A. SPNs vs. probabilistic graphical models (PGMs)
SPNs are similar to PGMs, such as Bayesian networks (BNs) and Markov networks (also called Markov random fields) [10], [4], in their ability to compactly represent probability distributions.The main difference is that in a PGM every node represents a variable androughly speaking-links represent probabilistic dependencies, sometimes due to causal influences, while in an SPN every node represents a probability function.PGMs and other factored probability distributions can be compiled into arithmetic circuits or SPNs [11].In general a PGM is more compact than an equivalent SPN, as shown in Figure 1.Inference in BNs and Markov networks is an NP-problem and existing exact algorithms have worst-case exponential complexity [4], while SPNs can do inference in time proportional to the number of links in the graph.In principle this would not be an advantage for SPNs, because the conversion of a BN or a Markov network into an SPN may create an exponential number of links.However, context-specific independencies in BNs [12] can reduce the size of the corresponding SPN, as shown in Figure 2. In fact, any problem solvable in polynomial time with tabular BNs (i.e., those whose conditional probabilities are given by tables) can be solved in polynomial time with arithmetic circuits or SPNs [3], but the converse is not true-see the example in [1]. 1  More importantly, while PGMs learned from data are usually intractable-except for small problems or for specific types of models with limited expressiveness, such as the naïve Bayes-the algorithms presented in Section VI can build tractable SPNs that yield excellent approximations both for generative and discriminative tasks. 1 However, every SPN of finite-state variables can be converted into a BN whose conditional probabilities are encoded as algebraic decision diagrams (ADDs), in time and space proportional to the network size, and it is possible to recover the original SPN with the variable elimination algorithm [13].In contrast, BNs can be built using causal knowledge elicited from human experts and there is a large body of recent research on building causal BNs from experimental and/or observational data, under certain conditions [14].It is also possible to combine causal knowledge and data, and even to build BNs interactively [15].All these options are currently impossible with SPNs.Additionally, the independencies in a BN or in a Markov model are easier to read than those in an SPN.On the other hand, the graph of an SPN can sometimes be built from human knowledge to represent part/subpart and class/subclass hierarchiessee [1,Sec. 5] for an example.
In conclusion, each type of model has advantages and disadvantages, and the choice for a real-world application must take into account the size of the problem, the amount of knowledge and data available, and the explanations required by the user.

B. SPNs vs. neural networks
SPNs can be seen as a particular type of feedforward neural networks because there is a flow of information from the input nodes (the leaves) to the output node (the root), but in this paper we reserve the term "neural network" (NN) for the models structured in layers connected by the standard operators: sigmoid, ReLU, softmax, etc.
The main difference is that SPNs have a probabilistic interpretation while standard NNs do not.Inference is also different: computing a posterior probability requires two passes, and finding a the most probable explanation (MPE) requires a backtrack from the root to the leaves, as explained in Section IV.Additionally, SPNs can do inference with partial information (i.e., when the values of some of the variables are unknown), while in a NN it is necessary to assign a value to each input node.
From the point of view of parameter learning, NNs are usually trained with gradient descent or variations thereof, while SPNs can also be trained with several probabilistic algorithms, such as EM and Bayesian methods, which are much more efficient and have lower risk of overfitting (cf.Sec.V).
When building practical applications, the main difference is the possibility of determining the structure of an SPN from data, looking for a balance between model complexity and accuracy.In contrast, NNs are usually designed by hand and it is necessary to examine different architectures of different sizes with different hyperparameters, in a trial-and-error approach, until a satisfactory model is found.For this reason, NNs that have succeeded in practical applications are usually very big and training them requires huge computational power.There are some proposals to learn the structure of NNs using evolutionary computation, which yields more efficient graphs, but this also requires immense computational power [16], [17].
In spite of these advantages, NNs are still superior to SPNs in many tasks.For example, in 2012 an SPN by Gens and Domingos [1] achieved a classification accuracy of 84% for the CIFAR-10 image dataset, one of the highest scores at the time.However, deep NNs have amply surpassed that result, reaching an impressive 99.3% accuracy. 2evertheless, in Section VII we mention several applications in which SPNs are competitive with NNs and superior in some aspects.For instance, random tensorized SPNs (RAT-SPNs) [18] have recently attained a classification accuracy comparable to deep NNs for MNIST and other image datasets, with the advantages of being interpretable as a probabilistic generative model and much more robust to missing features.For another recent example, Stelzner et al. [19] proved that the attend-infer-repeat (AIR) framework used for object detection and location is much more efficient when the variational autoencoders (VAEs) are replaced by SPNs: they achieved an improvement in speed of an order of magnitude, with slightly higher accuracy, as well as robustness against noise.Other examples can be found in Sections VII and IX.
For a more detailed comparison of SPNs with NNs, VAEs, generative adversarial networks (GANs), and other models, see [18].

II. Mathematical preliminaries
In this paper we assume that every variable either has a finite set of possible values, called states, or is continuous, i.e., takes values in R.

A. Configurations of variables
We denote by a capital letter, V , a variable and by the corresponding lowercase letter, v, any value of V .Similarly a boldface capital letter denotes a set of variables, V = {V 1 , . . ., V n }, the corresponding lowercase letter denotes any of its configurations, v = (v 1 , . . ., v n ), and conf(V) is the set of all the configurations of V.The empty set has only one configuration, represented by .
We denote by conf * (V) the set of all the configurations of V and its subsets: Put another way, We can think of conf * (V) \ conf(V) as the set of partial configurations of V, i.e., the configurations in which only some of the variables in V have an assigned value.
If X ⊆ V, the projection (sometimes called restriction) of a configuration v of V onto X, v ↓X , is the configuration of X such that every variable V ∈ X takes the same value as in v; for example, (+x 1 , +x 2 , ¬x 3 ) ↓{X1,X3} = (+x 1 , ¬x 3 ) and (+x 1 , +x 2 , ¬x 3 ) ↓∅ = .In order to simplify the notation, when X has a single variable, V , we will write v instead of (v) and v ↓V instead of v ↓{V } ; for example, (+x 1 , +x 2 , ¬x 3 ) ↓X2 = +x 2 .
Given two configurations, x and y, of two disjoint sets, X and Y, the composition of them, denoted by xy, is the configuration of X ∪ Y such that (xy) ↓X = x and (xy) ↓Y = y.For example, (+x 1 , +x 2 )(¬x 3 ) = (+x 1 , +x 2 , ¬x 3 ).
When X ⊆ V, a configuration x is compatible with configuration v if x = v ↓X , i.e., if every variable V ∈ X has the same value in both configurations.All configurations are compatible with .Definition 1.Given a value v of a finite-state variable V ∈ V, we define the indicator function, I v : conf(V) → {0, 1}, as follows: If all the variables in V are binary, then there are 2n indicator functions.

B. Probability distributions and probability functions
Definition 3. A probability distribution defined on V is a function P : conf(V) → R such that: This definition can be extended so that P represents not only a probability distribution but also all its marginal probabilities, as follows.

Definition 4.
A probability function defined on V is a function P : conf * (V) → R such that the restriction of P to conf(V) is a probability distribution and for every configuration x such that X ⊂ V, This equation is the definition of marginal probability: P (x) is obtained by summing the probabilities of all the configurations of V compatible with x.Proposition 5.If P is a probability function defined on V and X ⊂ V, the restriction of P to conf(X) is a probability distribution for X.
It is possible to define a new probability function as the sum or the product of other probability functions.Proposition 6.Let us consider n probability functions {P 1 , . . ., P n } defined on the same set of variables, V, and n weights, {w 1 , . . ., w n }, with w j ≥ 0 for every j, and is a probability function.It is said to be a weighted average or a convex combination of probability functions.

Proposition 7.
Let {P 1 , . . ., P n } be a set of probability functions defined on n disjoint sets of variables, {V 1 , . . ., is a probability function.

C. MAP, MPE, and MAX inference
In some inference tasks e denotes the evidence, i.e., the values observed (for example, the symptoms and signs of a medical examination or the pixels in an image) and X the variables of interest (for example, the possible diagnostics or the objects that may be present in the image), with X ∩ E = ∅.In this context, P (x | e) is called the posterior probability.
The maximum a-posteriori (MAP) configuration is MAP(e, X) = arg max Therefore, MAP inference divides the variables into three disjoint sets: observed variables (E), variables of interest (X), and hidden variables (H = V \ (E ∪ X)).
The most probable explanation (MPE) is the configuration of X = V\E that maximizes the posterior probability: MPE is a special case of MAP in which H = ∅, i.e., every variable that is not observed is a variable of interest.In general, MAP inference is much harder than MPE [20].Finally, MAX is a special case of MPE in which all the variables are of interest, i.e., X = V and H = E = ∅.The MAX configuration is the configuration of X that maximizes the probability: MPE and MAP are relevant when we wish to know the most probable configuration of the variables of interest X (for example, the possible diagnostics), which is different from finding the most probable value for each variable in X, as we will see in Example 26.MAP is relevant when some unobserved variables are not of interest; for example, H may represent the tests not performed: these variables are neither observed nor part of the diagnosis.See also [4, Secs.2.1.5.2 and 2.1.5.3],where MPE and MAP are called "MAP" and "marginal MAP" respectively.The definition of MAX will be useful in Section IV-C.

D. Basic definitions about graphs
Graphs have many applications in computer science.We describe here the type of graph used to build SPNs.
A directed graph consists of a set of nodes and a set of directed links.When there is a link n i → n j we say that n i is a parent of n j and n j is a child of n i ; there cannot be another link from n i to n j .Given a node n i , we denote by pa(i) the set of indices of its parents and by ch(i) the set of indices of its children.For example, in Figure 1, ch(1) = {2, 3}.Node n k is a descendant of n i if it is a child of n i or a child of a descendant of n i ; we also say that n i is an ancestor of n k .
A cycle of length l consists of a set of l nodes and l links A graph that contains no cycles, i.e., no node is a descendant of itself, is acyclic.An acyclic directed graph (ADG) is rooted if there is only one node (the root, denoted by n r ) having no parents.Terminal nodes, also called leaves, are those that do not have children.
A directed tree is a rooted ADG in which every node has one parent, except the root.In this paper when we say "a tree" we mean "a directed tree".

III. Basic definitions of SPNs A. Structure of an SPN
An SPN S consists of a rooted acyclic directed graph such that: • every leaf node represents a univariate probability distribution, • all the other nodes are either of type sum or product, • all the parents of a sum node (if any) are product nodes, and vice versa, and • every link n i → n j outgoing from a sum node has an associated weight, w ij ≥ 0. Usually w ij > 0. We will assume, unless otherwise stated, that all SPNs are normalized, i.e., ∀i, The probability distribution of each leaf node is defined on a variable V , which can have finite states or be continuous.In the first case, the distribution is usually degenerate, i.e., there is a particular value v * of V such that P (v * ) = 1 and P (v) = 0 otherwise.In the graphical representation this leaf is denoted by the indicator I v * , sometimes written I(v * ), as in Figures 1 and 2. When V is continuous, the probability distribution can be Gaussian [6], [21], Poisson [22], piecewise polynomial [23], etc. (SPNs can be further generalized by allowing each terminal node to represent a multivariate probability density-for example, a multivariate Gaussian [24], [25] or a Chow-Liu tree [26].) An SPN can be built bottom-up beginning with sub-SPNs of one node and joining them with sum and product nodes.All the definitions of SPNs can be established recursively, first for one-node SPNs, and then for sum and product nodes.Similarly, all the properties of SPNs can be proved by structural induction.
If the probability distribution of a leaf node is defined on V , its scope is the set {V }.If n i is a sum or a product node, its scope is the union of the scopes of its children: The scope of an SPN, denoted by sc(S), is the scope of its root, sc(n r ).The variables in the scope of an SPN are sometimes called model variables-in contrast with latent variables, which we present below.We define conf(S) = conf(sc(S)) and conf * (S) = conf * (sc(S)).
A sum node is complete if all its children have the same scope.An SPN is complete if all its sum nodes are complete.(In arithmetic circuits this property is called smoothness.) A product node is decomposable if its children have disjoint scopes.An SPN is decomposable if all its product nodes are decomposable.

Proposition 8. A product node n i is decomposable if and only if no node in the SPN is a descendant of two different children of n i .
In the rest of the paper we assume that all the SPNs are complete and decomposable.

B. Node values and probability distributions
In the following expressions we assume that all the variables have finite states.If V were continuous, we would write p(v) instead of P (v).Definition 9 (Value S i (x)).Let n i be a node of S and x ∈ conf * (S).If n i is a leaf node that represents P (v) then if it is a sum node, and if it is a product node, Because of Equations 3 and 13, if P (v) is degenerate, with P (v * ) = 1, then S i (x) = I v * (x), which justifies using indicators as leaf nodes.
Definition 10 (Value S(x)).The value S(x) returned by the SPN is the value of the root, S r (x).
Theorem 11.For a node n i in an SPN, the function P i : conf(S) → R, such that is a probability function defined on V = sc(n i ).
Please note that P i is defined on the configurations of the scope of the node, conf * (sc(n i )), while S i is defined on the configurations of the scope of the whole network, conf * (S).
The probability function P for the SPN is P (x) = P r (x).The above theorem guarantees that the SPN properly computes a probability distribution and all its marginal probabilities.The proof of the theorem, in Appendix C, relies on the completeness and decomposability of the SPN.We offer there some counterexamples showing that if an SPN is not complete or decomposable then S(x) is not necessarily a probability function.This theorem would still hold if we replaced decomposability with a weaker condition, consistency [1], but this would complicate the definition of SPN without offering any practical advantage.

C. Selective SPNs
We introduce now a particular type of SPNs that have interesting properties for MPE inference and parameter learning and for the interpretation of sum nodes.
When computing S(x) for a given x ∈ conf * (S), probability flows from the leaves to the root (cf.Def.9).Equation 14 says that all the children of a sum node n i can contribute to S i (x).However, n i may have the property that for every configuration v ∈ conf(S) at most one child makes a positive contribution, i.e., S j (x) = 0 for the other children of n i .We then say that n i is selective [27].The formal definition is as follows.
Please note that this definition says "conf", not "conf * ".Therefore even if n i is selective there may be a partial configuration x ∈ conf * (S) \ conf(V) such that several children of n i make positive contributions to S i (x).Definition 13.An SPN is selective if all its sum nodes are selective.Arithmetic circuits satisfying this property are said to be deterministic.Example 14.Given the SPN in Figure 2, we can check that if v = (+a, +b, ¬c) then S 2 (v) = 0.36 and S 3 (v) = 0.Only node n 2 makes a positive contribution to S 1 (v), so Property 17 holds for this v with j * = 2.We can make the same check for each of the 6 sum nodes and each of the 8 configurations of {A, B, C} in order to conclude that this SPN is selective.However, instead of making these 48 checks, in the next section we show that for this SPN selectivity is a consequence of the structure of its graph, regardless of its numerical parameters (the weights).

D. Sum nodes that represent model variables
Several papers about SPNs say that sum nodes represent latent random variables.However, in this section we show that in some cases sum nodes represent model variables.

Definition 15.
Let n i be a sum node having m children and V ∈ sc(n i ) a variable with m states.Let σ be a one-to-one function σ : {1, . . ., m} → ch(i).If for every m ∈ {1, . . ., m} either I vj is a child of n σ(j) (and hence a grandchild of n i ) or I vj = n σ(j) (i.e., the indicator itself is a child of n i ), we then say that n i represents variable V .In Appendix B we prove that when the root node of an SPN represents a variable V , this node can be interpreted as the weighted average of the conditional probabilities given V , with w i,σ(j) = P (v j ).Similarly, if every ancestor of a sum node n i represents a variable, then n i can be interpreted as the weighted average of the conditional probabilities given the context defined by the ancestors of n i .

Proposition 17. If a sum node n i represents a model variable V , then n i is selective.
This proposition is useful because it establishes a sufficient condition for concluding that an SPN is selective by just observing its graph, without analyzing its weights.For example, we can see that, according to Definition 15,   .Augmentation of an SPN, assuming that n i is not selective in S.This process adds an indicator I z(j) for every child n j of n i .
Node n i is added to restore the completeness of n l in S .
every node in Figures 1 and 2 represents a variable, which implies that every node is selective and consequently both SPNs are selective.In the next section we will use this property to transform any non-selective SPN into selective.

E. Augmented SPN
The goal of augmenting a non-selective SPN S [9], [7] is to transform it into a selective SPN S that represents the same probability function.For every non-selective node n i in S the augmentation consists in adding a new finite-state variable Z so that n i represents Z in S .The process is as follows.For every child n j we add a state z(j) to Z.If n j is a product node, we add the indicator I z(j) as a child of n j , as shown in Figure 3; if n j is a terminal node, we insert a product node, make n j a child of the new node (instead of being a child of n i ) and add I z(j) as the second child of the new node.In the resulting SPN, S , n i represents variable Z (because of Def.15) and is therefore selective.
However this transformation of the SPN may cause an undesirable side effect.Let us assume, as shown in Figure 3, that n i has a parent, n k , and n l is a parent of both n k and n k .Even though n l was complete in S, the addition of Z has made this node incomplete in S because Z ∈ sc(n i ) and Z ∈ sc(n k ) but Z / ∈ sc(n k ).It is then necessary to make Z ∈ sc(n k ) in order to restore the completeness of n l .So we create a new sum node, n i , and make it a parent of all the indicators of Z, {I z1 , . . ., I zm } (see again Fig. 3); the weights for n i can be chosen arbitrarily provided that they are all non-negative and their sum is 1.If n k is a product node, then we add n i as a child of n k .If n k is a terminal node, we insert a product node, making both n i and n k children of this new node.If n l has other children, such as n k , or ancestral sum nodes in S, then we must make each one of them a parent or a grandparent of n i , as we did for n k .
Given that variable Z was not in the scope of the original SPN, we can say that Z was latent in S and the augmentation of n i has made it explicit.The SPN S obtained by augmenting all the non-selective nodes is said to be the augmented version of S. Therefore, sc(S ) = sc(S) ∪ Z, where Z contains one variable for each sum node that was not selective in S.3 Proposition 18.If S is the augmented version of S, then S is complete, decomposable, and selective, and represents the same probability function for sc(S), i.e., if x ∈ conf * (S), then P (x) = P (x).

F. Induced trees
This section is based on [27].

Definition 19.
Let S be an SPN and v ∈ conf(S) such that S(v) = 0.The sub-SPN induced by v, denoted by S v , is a non-normalized SPN obtained by (1) removing every node n i such that S i (v) = 0 and the corresponding links, (2) removing every link n i → n j such that w ij = 0, and (3) removing recursively all the nodes without parents, except the root.

Proposition 20. If we denote by S
, and S(v) = 0, then S v is a tree in which every sum node has exactly one child.Following the literature, in this case we will write T v instead of S v to remark that it is a tree.

Example 22.
Given the SPN in Figure 2 and v = (+a, +b, ¬c), S v only contains the links drawn with thick lines in that figure and the nodes connected by them.This graph is a tree because the SPN is selective.
When an SPN is selective, the set of trees obtained for all the configurations in conf(S) is similar to the the set of trees obtained by recursively decomposing the SPN, beginning from the root, as proposed by Zhao et al. [28].
where (i, j) denotes a link.

A. Marginal and posterior probabilities
As defined in the previous section, P (x) = S(x) = S r (x).The value S(x) can be computed by an upward pass from the leaves to the root in time proportional to the number of links in the SPN.If X and E are two disjoint subsets of V, then P (x | e) = S(xe)/S(e), where xe is the composition of x and e.Therefore, any joint, marginal, or conditional probability can be computed with at most two upward passes.Partial propagation, which only propagates from the nodes in X ∪ E, can be significantly faster [29].

B. MPE inference
The MPE configuration for an SPN is (see Sec. II-C) Let us assume that S is selective.Then X ∪ E = sc(S) implies that xe ∈ conf(S) and, because of Proposition 21, the sub-SPN induced by xe is a tree in which every sum node has only one child.Therefore, the MPE can be found by examining all the trees for the configurations xe in which e is fixed and x varies.It is possible to compare all those trees at once with a single pass in S, by computing S max i (e) for each node as follows: • if n i is a sum node, then • otherwise S max i (e) = S i (e) (cf.Eqs. 13 and 15).Then the algorithm backtracks from the root to the leaves, selecting for each sum node the child that led to S max (e) and for each product node all its children.When arriving at a terminal node for variable V , the algorithm selects the value v = arg max v P (v ).In particular, if the terminal node is an indicator, A tie means that there are two or more configurations having the same probability P (x, e); these ties can be broken arbitrarily.The backtracking phase is equivalent to pruning the SPN in order to obtain a tree in which every sum node has only one child and there is exactly one terminal node for each variable; the v values selected at the terminal nodes make up the configuration x = MPE (e).
Example 26. Figure 4 shows the MPE inference for the SPN in Figure 2 when e = +c.The MPE is obtained by backtracking from the root to the leaves: x = MPE (e) = (+a, ¬b).We can check that S max 1 (e) = P (xe) = P (+a, ¬b, +c) = 0.144.For any other configuration x of X = V \ E = {A, B}, we have S(xe) < S(xe), in accordance with Equation 19.We can also check that the nodes selected by the backtracking phase are those of the tree induced by xe-see Def.19 and Prop.21.Note that, as mentioned in Section II-C, the MPE cannot be determined by selecting the most probable value for each variable.In this example P (¬a | e) = 0.57 > P (+a | e) and P (¬b | e) = 0.68 > P (+b | e), so we would obtain the configuration (¬a, ¬b), which is not the MPE.This algorithm was proposed by Chan and Darwiche [30] for arithmetic circuits, adapted to SPNs by Poon and Domingos [1], and later called Best Tree (BT) in [31].Peharz [7, Theorem 2] proved that when an SPN is selective, BT computes the true MPE.However, when a network is not selective, the sub-SPN induced by a configuration xe is not necessarily a tree, so the value S max (e) computed by BT-which only considers the probability that flows along trees with one child for each sum node-may be different from max x S(xe) and, consequently, the configuration returned by BT may be different from the true MPE.Therefore, even though the MPE can be found in time proportional to the size of the graph for selective SPNs, MPE is NP-complete for general SPNs [9, Theorem 5.3]. 4

C. MAX and MAP
Exact MAP inference for SPNs is NP-hard because it includes as a particular case MPE (see Sec. II-C), which is NP-complete.Nevertheless, Mei et al. [31] proposed several algorithms that are very efficient in practice.First, they presented an algorithm for the MAX problem in general SPNs.Second, they proved that every MAP problem for SPNs can be reduced to a MAX problem for a new SPN built in linear time.This way they were able to exactly solve MAP problems for SPNs with up to 1, 000 variables and 150, 000 links.
Third, they proposed several approximate MAP solvers that trade accuracy for speed, obtaining excellent results.In particular, they extended the BT method to the MAX problem for non-selective SPNs.This extension, called K-Best Tree (KBT), selects the K trees with the largest output.Then, the corresponding configurations are obtained (by backtracking) and evaluated in the SPN.The one with the largest output is the approximate solution to the MAX problem.Note that, for K = 1, KBT reduces to BT.

V. Parameter learning
Parameter learning consists in finding the optimal parameters for an SPN given its graph and a dataset.In generative learning the most common optimality criterion is to maximize the likelihood of the parameters of the model given a dataset, while in discriminative learning the goal is to maximize the conditional likelihood for each value of a variable C, called the class.

A. Maximum likelihood estimation (MLE)
Let D = {v 1 , v 2 , . . ., v T } be a dataset of T independent and identically distributed (i.i.d.) instances.We denote by W the set of weights of the SPN and by Θ the parameters of the probability distributions in the terminal nodes; both of them act as conditioning variables for the probability of the instances in the database.We define L D (w, θ) as the logarithm of the likelihood: The goal is to find the configuration of W ∪ Θ that maximizes L D (w, θ).Given that there is no restriction linking the parameters of one node (either sum, product, or terminal) with those of others, the optimization can be done independently for each node.For terminal nodes representing univariate distributions, standard statistical techniques can be applied.In this section we will focus on the optimization of the weights, W, so we omit Θ in the equations.The configuration that maximizes the likelihood is subject to w ij ≥ 0 and j∈ch(i) w ij = 1.
1) MLE for selective SPNs: When the SPN is selective and S(v) = 0, then the weights of the sum nodes can be estimated in closed form by applying MLE [27] as follows.
Proposition 27.When an SPN is selective and its weights are all different from 0, where n ij is the number of instances in the dataset for which (i, j) ∈ T v t and c is a value that does not depend on w.
The n ij 's can be computed by having a counter for every link in the SPN.For each instance v t in the dataset, we compute S(v t ) and then backtrack from the root to the leaves: for each product node we select all its children and for each sum node n i we select the only child for which S j (v t ) > 0 and increase by 1 the counter n ij .
It is then necessary to obtain the configuration w that maximizes the likelihood, defined in Equation 22.The only constraint is j∈ch(i) w ij = 1 for every i, which implies that the parameters for one node can be optimized independently of those for other nodes.The values that maximize the i-th term in Equation 23are There is a special case in which j∈ch(i) n ij = 0.This occurs when S i (v t ) = 0 for every t, i.e., when none of the instances in the dataset propagates through the sum node n i .In this case, the weights of this node can be set uniformly: Alternatively, it is possible to use a Laplace-like smoothing parameter α, so that with 0 < α ≤ 1.
2) Partial derivatives of S: For every node n i we define For the root node we have If n j is not the root, where pa(j) is the set of indices for the parents of n j .If n i is a sum node, if it is a product node, These equations mean that, for every node n j , S ∂ j (x) can be computed once we have the S ∂ i (x)'s of its parents and the S j (x)'s of its siblings.Therefore, after computing S i (x) for every node with an upward pass, S ∂ i (x) can be computed with a downward pass, both in linear time.This algorithm is similar to backpropagation for neural networks and can be implemented using software packages that support automatic differentiation.
3) Gradient descent: a) Standard gradient descent: Gradient descent (GD), a well known optimization method, was proposed for SPNs for both generative and discriminative models in [1] and [5], respectively. 5The algorithm is initialized by assigning an arbitrary value to each parameter, w (0) ij .and in every iteration, s, this value is updated, in order to increase the likelihood of the model: where γ is the learning rate (a hyperparameter).It may be necessary to renormalize the weights for each i after each update.
The partial derivative for L D can be efficiently computed using the S ∂ i 's defined above, as follows.Because of Equation 21, and which together with Equations 14 and 26 leads to and, finally, This equation allows us to perform each iteration of GD in time proportional to the size of the SPN and the number of instances in the database.
b) Stochastic GD and mini-batch GD: In the stochastic version of GD, each iteration s uses only one instance of the database, randomly chosen, so that until the algorithm converges.
Another possibility is to use mini-batches to update the weights, that is, to divide the dataset in batches of L randomly drawn instances, where L < T (usually L T ), and update the weights as follows: This version is the most popular when applying GD. c) Hard GD: The application of GD to deep networks, either neural networks or SPNs, suffers from the vanishing gradients problem: the deeper the layer, the lower the contribution of its weights to the model output, so the influence of the parameters in the deepest layers may be imperceptible.The hard version of GD for SPNs solves this problem by replacing the sum nodes with max nodes and reparametrizing the weights so that the gradient of the log-likelihood function remains constant.This method was introduced for SPNs by Gens and Domingos [5] for discriminative learning.

4) Expectation-Maximization (EM): a) Standard EM:
We have seen how to learn the parameters of a selective SPN from a complete dataset.However, many real-world problems have missing values.We denote by H t the variables missing (hidden) in the tth instance of the database.Additionally, when learning the parameters of S , an augmented SPN, the database is always incomplete, even if it contains all the values for the the model variables in S, because it does not contain the latent variables Z, added when augmenting the SPN, so Z ⊆ H t for every t.
In this situation we can apply the expectationmaximization (EM) algorithm, designed to estimate the parameters of probabilistic models from incomplete datasets.The problem is as follows.If we had a complete database, we would be able to estimate its parameters as explained in the previous section.Alternatively, if we knew the parameters, we would be able to generate a complete database by sampling from the probability distribution.
The EM algorithm proceeds by iteratively applying two steps.The E-step (expectation) computes the probability P (h t | v t ) for each configuration of the variables missing in v t in order to impute the missing values.More precisely, instead of assigning a single value to each missing cell, we create a virtual database in which all the configurations of H t are present, each with probability P (h t | v t ).The Mstep (maximization) uses this virtual complete database to adjust the parameters of the model by MLE, as in Section V-A1.The two steps are repeated until the parameters (the weights) converge.
The problem is that initially we have neither a complete database nor parameters for sampling the values of the missing variables.The algorithm can be initialized by assigning arbitrary values to the parameters or by assigning arbitrary values to the variables in Z. Unfortunately, a bad choice of the initial values may cause the algorithm to converge to a local maximum of the likelihood, which may be quite different from the global maximum.
The n ij 's required by the M-step are obtained by counting the number of instances in the database for which the link (i, j) belongs to the tree induced by v t h t .These are the n ij 's introduced in Equation 23, which in the case of the virtual database are and can be efficiently computed by applying the following result: Proposition 28.Given a database D with T instances and a selective SPN, the n ij 's in Equation 32are Once we have the n ij 's, the weights can be updated using Equation 24or 25.The time required by each iteration of EM is proportional to the size of the network and the number of instances in the dataset.
b) Hard EM: The EM algorithm needs the value of S ∂ i , which is proportional to ∂S/∂w ij and may thus be very small when the link (n i , n j ) is in a deep position, i.e., far from the root.Therefore this algorithm may suffer from the vanishing gradients problem in the same way as GD.To avoid it, Poon and Domingos [1] proposed a hard version of EM for SPNs that selects for each hidden variable H ∈ H t the most probable state.Thus, in the E-step of each iteration, every instance of the dataset contributes to the update of just one weight per sum node, instead of contributing to all of them proportionally.
Hsu et al. [25] proposed a variant of hard EM for SPNs with Gaussian leaves.This method proceeds top-down.It starts at the top sum node and distributes the instances among its children by maximum likelihood.Every other sum node receives only the instances distributed to its parents and redistributes them in the same fashion.This way it updates the weights of the sum nodes locally.The process is similar to the automatic parameter learning in LearnSPN (cf.Sec.VI-B).These authors also provide formulas for updating the parameters of Gaussian leaves.

5) Comparison of MLE algorithms:
The application of the EM to SPNs has been justified with different mathematical arguments.Peharz [9] exploited the interpretation of the sum nodes in the augmented network as the sum of conditional probability functions (cf.Eqs.39 and 40).In turn, Zhao et al. [28], using a unified framework based on signomial programming, designed two algorithms for learning the parameters of SPNs: sequential monomial approximations (SMA) and the concave-convex procedure (CCCP).GD is a special case of SMA, while CCCP coincides with EM in the case of SPNs, despite being different algorithms in general.Their experiments proved that EM/CCCP converges much faster than the other algorithms, including GD.In turn, Desana and Schnörr [24] derived the EM algorithm for SPNs whose leaf nodes may represent complex probability distributions.
In discriminative learning, neither EM nor CCCP have a closed-form expression for updating the weights [5].Rashwan et al. [32] addressed this problem with the extended Baum-Welch (EBW) algorithm, which updates the parameters of the network using a transformation that increases the value of the likelihood function monotonically.In the generative case, this transformation coincides with the update formula of EM/CCCP (the M-step), while in the discriminative case it provides a method to maximize the (conditional) likelihood function with a closed-form formula.They also adapted this method to SPNs with Gaussian leaves.
Both the algorithm of Desana and Schnörr and EBW outperformed GD and EM in a wide variety of datasets.

B. Semi-supervised learning
Trapp et al. [33] introduced a safe semi-supervised learning algorithm for SPNs.By "safe" they mean that the model performance can be increased but never degraded by adding unlabeled data.They extended the EM to generative semi-supervised learning and defined a discriminative semi-supervised learning approach.They also introduced the maximum contrastive pessimistic algorithm (MCP-SPN), based on [34], for learning safe semi-supervised SPNs.Their results were competitive with those of purely supervised algorithms.

C. Approximate Bayesian learning
There are alternative methods for learning the parameters of an SPN based on approximate Bayesian techniques, such as Bayesian moment matching [35] and collapsed variational inference [36], which are not as exposed to overfitting as GD or EM.Both Bayesian methods start with a product of Dirichlet distributions as a prior; the posterior distribution P (w ij | D) is a mixture of products of Dirichlets, which is computationally intractable.In both works the solution applied was to approximate that distribution with a single product of Dirichlet distributions.Rashwan et al. [35] applied online Bayesian moment matching (oBMM), which approximates the posterior distributions of the weights by computing a subset of their moments and finding another distribution from a tractable family that matches those moments.In this case, it sufficed to match the first and second order moments of the distribution.The experiments showed that this approach outperforms SGD and online EM.This method has also been adapted to SPNs with Gaussian leaves by Jaini et al. [37].In the same vein, Zhao and Gordon [38] presented an optimal linear time algorithm for computing the moments in SPNs with general acyclic directed graph structures, based on the mixture of trees interpretation of SPNs.This provides an effective method to apply Bayesian moment matching to a broad family of SPNs.
As mentioned above, Zhao et al. [36] addressed the problem by applying collapsed variational Bayesian inference (CVB-SPN).This approach treats the dataset as partial evidence, whose missing values correspond to the latent variables of the SPN.They assumed that the missing values of each instance are not independent of those missing in other instances, and marginalized these variables out of the joint posterior distribution (the "collapse" step).Then, they approximated this distribution with the product of the Dirichlets that maximizes some evidence lower bound (ELBO) of the log-likelihood function of the dataset (the "variational inference" step).The experiments showed that the online version of CVB-SPN-i.e., the version that receives a stream of data in real time-outperforms oBMM in many datasets.

D. Deep learning approach
Peharz et al. [18] considered a special class of SPNs, which they called random SPNs, and trained them with automatic differentiation, stochastic GD, and dropout, using GPU parallelization.The resulting model was called RAT-SPN.Its classification accuracy, measured on the MNIST images and other databases, was comparable to that of deep neural networks, with the advantages of a probabilistic generative model, such as interpretability and robustness to missing features.

VI. Structural learning
Structural learning consists in finding the optimal (or near-optimal) graph of an SPN.Most of the algorithms for this task require some computation of probabilities during the process.

A. First structure learners
BuildSPN, by Dennis and Ventura [39] was the first algorithm of this kind.It looks for subsets of highly correlated variables and introduces latent variables to account for those dependencies.These variables generate sum nodes and the process is repeated recursively looking for the new latent variables.
BuildSPN and the hand-coded structure of Poon and Domingos [1], both designed for image processing, assumed neighborhood dependence.In order to overcome that limitation, Peharz et al. [40] proposed an algorithm that subsequently combines SPNs of few variables into larger ones applying a statistical dependence test.
BuildSPN was also critiqued by Gens and Domingos [6] because (1) the clustering process may separate highly dependent variables, (2) the size of the SPN and the time required can grow exponentially with the number of variables, and (3) it requires an additional step to learn the weights. .The LearnSPN algorithm recursively creates a product node when there are subsets of (approximately) independent variables and a sum node otherwise, grouping similar instances.(Reproduced from [6] with the authors' permission.)

B. LearnSPN
It is common in machine learning to see a dataset as a data matrix whose columns are attributes or variables and whose rows are observations or instances.The LearnSPN algorithm [6] recursively splits the variables into independent subsets (thus "chopping" the data matrix, as shown in Figure 5) and then clusters the instances (thus "slicing" the matrix).Every "chopping" creates a product node and every "slicing" a sum node, as indicated in Algorithm 1.There are two base cases: 1) When the piece of the data matrix produced by "chopping" contains a single column (i.e., one variable), the algorithm creates a terminal node with a univariate distribution using MLE.2) When the piece of the data matrix produced by "slicing" contains several columns with relatively few rows, the algorithm applies a naïve Bayes factorization over those variables.This is like "chopping" that piece into individual columns, which will be processed as in the base case 1.
LearnSPN can be seen as a framework algorithm in the sense that it does not specify the procedures for splitting independent subsets of variables (splitVariables in Algorithm 1) and clustering similar instances (clusterInstances in that algorithm).Originally Gens and Domingos [6] chose the G-Test for splitting and hard incremental EM for clustering.
Splitting the variables ("chopping") only considers pairwise independencies.The process starts with a graph containing a node for each variable and no links.It randomly selects one variable and adds an edge to the first other variable deemed dependent by the G-test, then moves to that variable, and iterates until no new variable can be linked to this component of the graph.At the end, Algorithm 1 LearnSPN(T, V, α, m) Input: T: a data matrix of instances over the variables in V; m: minimum number of instances to allow a split of variables; α: Laplace smoothing parameter Output: an SPN S with sc(S) = if this component has gathered all variables only one component is generated; then the clustering concludes and the algorithm clusters instances instead.
Clustering similar instances ("slicing") is achieved by the hard EM algorithm assuming a naïve Bayes mixture model, where the variables are independent given the cluster C i : P (v) = i P (c i ) j P (v j | c i ).This particular model produces a clustering that can be chopped in the next recursion.This version of LearnSPN forces a clustering in the first step, without attempting a split.

C. ID-SPN
Rooshenas and Lowd [21] observed that PGM learners usually analyze direct interactions (dependencies) between variables while previous SPN learners analyze indirect interactions (dependencies through a latent variable).The indirect-direct SPN (ID-SPN) structure learner combines both methods.Their initial idea is that any tractable multivariate distribution that can be represented as an arithmetic circuit or an SPN can be the leaf of an SPN without losing tractability.With this idea they learned arithmetic circuit Markov networks (ACMN) [41], which are roughly Markov networks learned as arithmetic circuits.ID-SPN begins with a singular ACMN node and tries to replace it with a mixture (yielding a sum node) or a product (yielding a product node), similar to the cluster and split operations in LearnSPN.If a replacement increases the likelihood, it is saved and the algorithm recurs on the new ACMN leaves, until the likelihood does not increase.This top-down process represents the learning of indirect interactions, while the creation of ACMN leaves represents the learning of direct interactions.This algorithm outperformed all previous algorithms and is currently the state of the art.However, ID-SPN is slower and more complex than LearnSPN, and has many more hyperparameters to tune, which requires a random search in the space of hyperparameters instead of a grid search.

D. Other algorithms
Peharz et al. [27] proposed a structure learner that searches within the space of selective SPNs and showed that it is competitive with LearnSPN.
Adel et al. [42] pointed out that previous work had only compared algorithms on binary datasets.They designed SVD-SPN, which proceeds by finding rank-1 matrices.This allows the algorithm to cluster and split at the same time, producing optimal data matrix pieces.It operates recursively, like LearnSPN, but constructs the SPN from the rank-1 submatrices extracted.It also considers a multivariate base case when the variables in the pieces of the data matrix are highly correlated.In this case a sum node is created with as many children as instances in the piece of the matrix; each child is a product node of all the variables in the matrix.In their experiments, SVD-SPN obtained results similar to those of LearnSPN and ID-SPN for binary datasets, but outperformed them in multiple-category datasets, such as Caltech-101, and is 5 times faster.

E. Improvements to LearnSPN
Even though LearnSPN is not the best performing algorithm, it is still widely used owing to its simplicity and modularity [29] and has led to several variants.
1) Algorithm of Vergari et al. [26]: It consists of three modifications to LearnSPN: 1) Binary splits.Every split cuts the data matrix into only two pieces.This avoids creating too complex structures at early stages (which often occurs when learning from noisy data) and favors deep structures over shallow ones.This is not a limitation in the number of children of product nodes because consecutive splits can be applied if necessary.2) Chow-Liu trees (CLTs) in the leaves.The naïve Bayes factorization used as the base case of LearnSPN (see Algorithm 1) can be replaced by the creation of Chow-Liu trees [43], which are equivalent to treeshaped Bayesian networks or Markov networks.Every tree is built by linking the variables with higher mutual information until there is a path between every pair of variables.CLTs are more expressive than the naïve Bayes factorization (which is a particular case of CLT) without adding computational complexity.LearnSPN stops earlier when using CLTs as leaves because each tree can accommodate more instances, thus yielding simpler SPN structures (with fewer edges) with lower risk of overfitting.3) Bagging.This technique, originally used to build random forests [44], consists in taking several random samples from a dataset, each consisting of several instances, and building a classifier for each sample.
The overall classification can be the average of the outputs of the individual classifiers (for continuous variables) or the mode (for finite-state variables).
In SPN learning, it extracts-with replacementn samples of the dataset and produces a sum node with n children, setting every weight to 1/n, as shown in Figure 6.Each child represents one of the individual classifiers and the sum node averages the n results.Since the network size would grow exponentially if bagging were applied before every clustering, it is applied only before the first Learn-SPN operation-which is a clustering-in order to achieve the widest effect on the resulting structure.
The experiments showed that (1) binary splits yield deeper and simpler SPNs and generally reduce the number of edges and parameters, (2) using Chow-Liu trees attains the same effect and generally increases the likelihood, and (3) bagging also increases the likelihood, especially in datasets with a low number of instances.With these modifications, LearnSPN achieved the same performance as ID-SPN.
2) Beyond tree SPNs: One of the main disadvantages of both LearnSPN and ID-SPN is that they always produce trees (except when the leaves are Markov networks).In order to generate more compact SPNs, Dennis and Ventura [45] designed SearchSPN, an algorithm that produces SPNs in which nodes may have several parents.It selects the product node that contributes less to the likelihood and greedily searches for candidate structures using modified versions of the clustering methods of LearnSPN.The resulting likelihood is significantly better than that of LearnSPN for the majority of datasets and comparable with that of ID-SPN, but on average the execution is 7 times faster and the number of nodes 10 times smaller.
In the same vein, Rahman and Gogate [46] created a post-processing algorithm that, after applying LearnSPN with CLTs in the leaves, merges similar sub-SPNs.Similarity is measured with a Manhattan distance; if two sub-SPNs are closer than a certain threshold, the pieces of the data matrix from which they come are combined and the algorithm chooses the sub-SPN with the higher likelihood for the combined data.This modification of LearnSPN increases the likelihood and reduces the number of parameters of the SPN; additionally, it dramatically increases the learning time for some datasets.In combination with bagging, it outperformed other algorithms-including ID-SPN [21]-for high-dimensional datasets.
3) Further improvements to LearnSPN: As mentioned in Section V-A5, Zhao et al. [28] showed that learning the parameters with the CCCP algorithm improves the performance of LearnSPN.
Di Mauro et al. [47] proposed approximate splitting methods to accelerate LearnSPN, thus trading speed for quality (likelihood).
Butz et al. [29] studied the different combinations of algorithms for LearnSPN.They compared mutual information and the G-test for splitting, and k-means and Gaussian mixture models for clustering.The best results were obtained when using the G-test and either k-means or Gaussian mixture models, both for the standard Learn-SPN and for the version that generates CLTs in the leaves.
Liu et al. [48] proposed a clustering method that decides the number of instance clusters adaptively, i.e., depending on each piece of data matrix evaluated.Their goal was to generate more expressive SPNs, in particular deeper ones with controlled widths.When compared previous algorithms (namely, standard LearnSPN [6], LearnSPN with binary splits [26], and LearnSPN with approximate splitting [47]), their method achieved higher likelihood in 20 binary datasets and generated deeper networks (i.e., more expressive SPNs) while maintaining a reasonable size.
4) LearnSPN with piecewise polynomial distributions: Most learning algorithms assume that when a terminal node represents a continuous variable, the univariate distribution belongs to a known family (Poisson, Gaussian, etc.) and only the parameters must be optimized.However, there are at least two variants of LearnSPN that, in addition to having indicators for finite-state variables, use a piecewise polynomial distribution for each leaf node representing a numeric variable, instead of requiring the user to specify a parametric family [49], [23].
In LearnWMISPN [49], which combines LearnSPN with weighted model integration (WMI), the order of each polynomial is determined using the Bayesian information criterion (BIC) [50].A preprocessing step transforms finite-state, categorical, and continuous features into a binary representation before applying LearnSPN.The corresponding inference algorithm can answer complex conditional queries involving both intervals for continuous variables and values for discrete variables.

F. Online structural learning
The algorithms presented so far need the complete dataset to produce a structure.However, sometimes the dataset is so big that the computer does not have enough memory to store it at once.In other situations, e.g., in recommender systems, the data arrive constantly.In these cases the learning algorithm must be able to update the structure instead of learning it from scratch every time new data arrives.
In this context, Lee et al. [52] designed a version of LearnSPN where clustering (slicing) is replaced by online clustering, so that new sum children can be added when new data arrive, while product nodes are unmodified.
Later Dennis and Ventura [53] extended their Search-SPN algorithm [45] to the online setting.This online version is as fast as the offline version that works only on the current batch and the quality of the resulting SPN is the same.
Hsu et al. [25] created oSLRAU, an online structure learner for Gaussian leaves (oSLRAU) which begins with a completely uncorrelated SPN structure that is updated when the arriving data reveals a new correlation.The update consists in replacing a leaf with a multivariate Gaussian leaf or a mixture over its scope.
Jaini et al. [54] proposed an algorithm, Prometheus, whose first concern is to avoid the parameter that decides when two subsets of variables are independent in order to perform a LearnSPN split.So instead of creating a product node, it creates a mixture of them, representing different subset partitions.The way the partitions are created allows them to share subsets, which is reflected in the structure by common children, thus overcoming the restriction to trees on the way.This is in some sense similar to bagging in sum nodes (cf.Sec.VI-E) and makes the algorithm robust in low data regimes.However, the complexity of the algorithm grows with the square of the number of variables.In order to extend it to highdimensional datasets, the authors created a version that samples in each step from the set of variables instead of using all of them.This algorithm can treat discrete, continuous, and mixed datasets.Their experiments showed that this algorithm surpasses both LearnSPN and ID-SPN in the three types of datasets.It is also robust in low data regimes, achieving the same performance as oSLRAU with only 30-40% of the data.

G. Learning with dynamic data
Data are said to be dynamic when all the variables (or at least some of them) have different values in different time points-for example, Income-at-year-1, Income-atyear-2, etc.The set of variables for a specific time point is usually called a slice.The slice structure, called template, is replicated and chained to accommodate as many time points as necessary.The length of the chain is called the horizon.
For this problem Melibari et al. [55] proposed dynamic SPNs (DSPNs), which extend SPNs in the same way that dynamic Bayesian networks (DBNs) [56] extend BNs.A local-search structure learner generates an initial template SPN and searches for neighboring structures, trying to maximize the likelihood.Every neighbor, which comes from replacing a product node, represents a specific choice of factorization of the variables in its scope.The algorithm searches over other choices of factorizations and updates the structure if a better one is found.This method outperforms non-dynamic algorithms, such as LearnSPN, and other models, such as dynamic Bayesian networks and recurrent neural networks.
Later, Kalra et al. [57] extended oSLRAU to the dynamic setting by unrolling the SPN to match the length of the chain to the horizon, with shared weights and a shared covariance matrix, to decide when a new correlation requires a change in the template.This algorithm surpassed that of Melibari et al. [55] and hidden Markov models in 5 sequential datasets, and recurrent neural networks in 4 of those datasets.

H. Relational data learning
Nath and Domingos [58] introduced relational SPNs (RSPNs), which generalize SPNs by modeling a set of instances jointly, allowing them to influence each other's probability distributions, and modeling probabilities of relations between objects.Their LearnRSPN algorithm outperformed Markov Logic Networks in both running time and predictive accuracy when tested on three datasets.

I. Bayesian structure learning
Lee et al. [59] designed a Bayesian non-parametric extension of SPNs.Trapp et al. [60] criticized this work for neglecting induced trees in their posterior construction, corrected it by introducing infinite sum-product tree, and showed that it yields higher likelihood than infinite Gaussian mixture models.
A common problem of structural learning algorithms is the lack of a principled criterion for deciding what a "good" structure is.For this reason, Trapp et al. [61] proposed an alternative Bayesian approach that decomposes the problem into two phases: first finding a graph and then learning the scope-function, ψ, which assigns a scope to each node.The function ψ and the parameters of the model are learned jointly using Gibbs sampling.The Bayesian nature of this approach reduces the risk of overfitting, avoids the need for a separate validation set to adjust the hyperparameters of the algorithm, and enables robust learning of SPN structures under missing data.

VII. Applications
SPNs have been used for a wide variety of applications, from toy problems to real-world challenges.
A. Image processing 1) Image reconstruction and classification: Poon and Domingos, in their seminal paper about SPNs [1], applied them to image reconstruction, using a hand-designed structure that took into account the local structure of the image data.They tested their method on the datasets Caltech-101 and Olivetti.Then Gens and Domingos [5] used a different hand-made structure for image classification on the datasets CIFAR-10 and STL-10, obtaining excellent results for that time, as mentioned in Section I-A.
2) Image segmentation: Image segmentation consists in labeling every pixel with the object it belongs to.Yuan et al. [62] developed an algorithm that scales down every image recursively to different sizes and generates object tags and unary potentials for every scale.Then, it builds a multi-stacked SPN where every stack has a bottom and a top SPN.The bottom SPN works on a pixel and its vicinity, going from the pixel to bigger patches.Product nodes model correlations between patches while sum nodes combine them into a feature of a bigger patch.When the patch is as big as the pixel in the next scaled image, the results are introduced in the top SPN alongside the unary potentials and the tags of that scale.This process is stacked until the "patch" treated is the whole image.Multi-stacked SPNs have been especially effective for handling occlusions in scenes.
Rathke et al. [63] applied SPNs to segmentation OCT scans of retinal tissue.They first built a segmentation model for the health model and for every pathology and then added to those models typical shape variations of the retina tissue for some pathology-specific regions.The resulting SPN extracts candidate regions (either healthy or unhealthy) and selects the combination that maximizes the likelihood.After a smoothing step, they obtain a complete segmentation of the retina tissue as well as the diagnosis and the affected regions.This method achieved state-of-the-art performance without needing images labeled by pathologies.
3) Activity recognition: Wang and Wang [64] addressed activity recognition on still images.They used unsupervised learning and a convolutional neural network to isolate parts of the images, such as a hand or a glass, and designed a spatial SPN including the spatial indicator nodes "above", "below", "left", and "right" for the product nodes to encode spatial relations between pairs of these parts.They first partitioned the image to consider only local part configurations.Its SPN structure has two components: the top layers represent a partitioning of the image into sub-images where product nodes act as partitions and sum nodes as combinations of different partitions, while the bottom layers represent the parts included in each sub-image and their relative position using the spatial indicator nodes.In this sense the SPN first learns spatial relations of isolated parts in sub-images and then learns correlations between sub-images.Spatial SPNs outperformed other activity recognition algorithms and were able to discover discriminant pairs of parts for each class.
Amer and Todorovic [65] worked on activity localization and recognition in videos.In their work, a visual word is a meaningful piece of an image, previously extracted with a neural network.Visual words lie in a grid with three dimensions: height, width, and time.Every grid position has a histogram of associated visual words, called bag of words.To construct an SPN, each bag of words is treated as a variable with two states: foreground and background.Product nodes represent a combination of sub-activities into a more complex activity (for example, "join hands + separate hands = clap") and sum nodes represent variations of the same activity.An SPN is trained for every activity in a supervised context, in which the foreground and the background values are known, and in a weakly supervised context, in which only the activity is known.The structure is a near-completely connected graph, pruned after parameter learning, which proceeds iteratively: it learns-with GD-the weights of the SPN from the parameters of the bag of words and then learnswith variational methods-the parameters of the bag of words from the weights of the SPN.The accuracy of this weakly supervised setting was only 1.6 to 3% worse than that of the supervised setting.This approach in general achieved better performance than state-of-the-art algorithms on several action-recognition datasets.
4) Object detection: Stelzner et al. [19] proved that the attend-infer-repeat (AIR) framework used for object detection and location is much more efficient when the variational autoencoders (VAEs) that model individual objects are replaced by SPNs: they achieved an improvement in speed of an order of magnitude, with slightly higher accuracy, as well as robustness against noise.

B. Robotics
Sguerra and Cozman [66] used SPNs for aerial robots navigation.Micro aerial vehicles need a set of sensors that must comply with two criteria: light weight and realtime response.Optical recognition with cameras satisfies the former while fast inference with SPNs ensures the latter, as they were able to classify-in real time-what the camera sees into pilot commands, such as "turn right", and obtaining 75% of accuracy with just 66 images.
Pronobis et al. [67] designed a probabilistic representation of spatial knowledge called "deep spatial affordance hierarchy" (DASH), which encodes several levels of abstractions using a deep model of spatial concepts.It models knowledge gaps and affordances by a deep generative spatial model (DGSM) which uses SPNs for inference across different levels of abstractions.SPNs fit naturally with DGSM because latent variables of the former are internal descriptors in the latter.The authors tested it in a robot equipped with a laser-range sensor.
Zheng et al. [68] designed graph-structured SPNs (GraphSPNs) for structured prediction.Their algorithm learns template SPNs and makes a mixture over them (a template distribution), which can be applied to graphs of varying size re-using the same templates.The authors applied them to model large-scale global semantic maps of office environments with a exploring robot, obtaining better results than with the classical approach based on undirected graphical models (Markov networks).
The authors joined both models into an end-to-end deep model for semantic mapping in large-scale environments with multiple levels of abstraction, called TopoNets [69].These can perform inference about unknown spatial information, are useful for novelty detection, and achieve real-time performance.

C. NLP and sequence data analysis
Peharz et al. [70] applied SPNs to modeling speech by retrieving the lost frequencies of telephone communications (artificial bandwidth extension).In this problem tractable and quick (real-time) inference is essential.They used a hidden Markov model (HMM) to represent the temporal evolution of the log-spectrum, clustered the data using the Linde-Buzo-Gray algorithm and trained an SPN for each cluster.The SPNs model each cluster and can be used to retrieve the lost frequencies by MPE inference.This model has achieved better results than state-of-the-art algorithms both objectively, with a measure of log-spectral distortion, and subjectively, through listening tests.
In language modeling, Cheng et al. [71] used a discriminative SPN [5], whose leaf nodes represent vectors with information about previous words.This SPN was able to compute the probability of the next word more accurately than classic methods for language modeling, such as feedforward neural networks and recurrent neural networks.
Later, Melibari et al. [55] used dynamic SPNs (DSPNs) to analyze different sequence datasets.Unlike dynamic Bayesian networks, for which inference is generally exponential in the number of variables per time slice, inference in DSPNs has linear complexity.In a comparative study with five other methods, including HMMs and neural networks with long short-term memory (LSTM), DSPNs were superior in 4 of the 5 datasets examined.
Ratajczak et al. [72] replaced the local factors of higherorder linear-chain conditional random fields (HO-LC-CRFs) and maximum entropy Markov models (MEMMs) with SPNs.These outperformed other state-of-the-art methods in optical character recognition and achieved competitive results in phoneme classification and handwriting recognition.

D. Other applications
Vergari et al. [73] used SPNs as autoencoders (SPAEs) for feature extraction.They trained the SPNs with Learn-SPN and used the values of the internal nodes or the states of the latent variables associated to sum nodes as the codification variables.Although this model was not trained to reconstruct its inputs, experiments showed that SPAEs are competitive with state-of-the-art autoencoder architectures for several multilabel classification problems.
Butz et al. [74] used Bayesian networks to recognize independencies in 3,500 datasets of soil bacteria and combined them into an SPN in order to efficiently compute conditional probabilities and the MPE.
Nath and Domingos [75] used relational SPNs (cf.Sec.VI-H) for fault localization, i.e. finding the most probable location of bugs in computer source code.The networks, trained on a corpus of previously diagnosed buggy programs, learned to identify recurring patterns of bugs.They could also receive clues about bug suspicion from other bug detectors, such as Tarantula.
Hilprecht et al. [76] proposed learning database management systems from data instead of queries, using ensembles of relational SPNs.This approach provides better accuracy and better generalization to unseen queries than different state of the art methods.
Vergari et al. [77] also evaluated SPNs for representation learning [78].SPNs encode a hierarchy of part-based representations which can be ordered by scope length.When compared with other representation learners, such as VAEs or random Boltzmann machines (RBMs), they provided competitive results both in supervised and semisupervised settings.Moreover, the model trained for extracting representations can be used as-is for inference.
Vergari et al. [79] designed a tool for automatic exploratory data analysis without the need for expert statistical knowledge.It leverages Gibbs sampling and a modification of mixed SPNs [23] to model the data, and provides functionalities such as data type recognition, missing values imputation, anomaly detection, and others.
Roy et al. [80] addressed the explanation of activity recognition and localization in videos.A deep convolutional neural network is used for localization and its output is introduced to an SPN.Both models are learned jointly.The explainability of the system was evaluated by the subjective user trust in the explanations that the SPN provides about the criteria of the neural net.

VIII. Software for SPNs
Every publication about SPNs presents some experiments, and in many cases the source code is publicly available.The web page https://github.com/arranger1044/awesome-spn contains many references about SPNs, classified by year and by topic; the section "Resources" includes links to talks and tutorials, the source code for some of those publications, and several datasets commonly used for the experiments.Most of the software is written in Python or C++.
In particular, there are two projects that aim to develop comprehensive, simple, and extensible libraries for SPNs.Both of them are written in Python and use TensorFlow as a backend for speeding up some operations.LibSPN, 6 initiated by Andrzej Pronobis at the University of Washington, Seattle, WA [81], implements several methods for inference (marginal and conditional probabilities, and approximate MPE), parameter learning (batch and online, with GD and hard EM), and visualization of SPNs.It lacks algorithms for structural learning, but it allows building convolutional SPNs with a layer-oriented interface [82].The SPNs, stored as Python structures, are compiled into Tensor-Flow graphs for parameter learning and inference; for this purpose LibSPN has implemented in C++ and CUDA some operations that cannot be performed efficiently with native TensorFlow operations.Several tutorials in Jupyter Notebook are available at its website.It has been used mainly for computer vision and robotics [67], [68], [82].
The other library, SPFlow 7 , is developed by Alejandro  There are also some smaller libraries of interest, such as SumProductNetworks.jlfor Julia, 8 which implements inference and parameter learning, and the Libra Toolkit [84],9 a collection of algorithms written in OCaml for learning several types of probabilistic models, such as BNs, SPNs, and others, including the ID-SPN algorithm [21].

IX. Extensions of SPNs
In recent years there have been several extensions of SPNs to more general models.In this section we briefly comment on some of them.
Sum-product-max networks (SPMNs) [85] generalize SPNs to the class of decision making problems by including two new types of nodes, max and utility, like in influence diagrams.These networks can compute the expected utility and the optimal policy in time proportional to the number of links.
Autoencoder SPNs (AESPNs) [86] combine two SPNs with an autoencoder between them.This model produces better samples than SPNs by themselves.
Tan and Peharz [87] designed a mixture model of VAE pieces over local subsets of variables combined via an SPN.This combination yields better density estimates, smaller models and improved data efficiency with respect to VAEs.
In credal sum-product networks (CSPNs) [88] the weights of each sum node do not have a fixed value, but vary inside a set (a product of probability simplexes) in such a way that each choice of the weights defines an SPN.
Sum-product graphical models (SPGMs) [89] join the semantics of probabilistic graphical models with the evaluation efficiency and expressiveness of SPNs by allowing the nodes associated to variables to appear in any part of the network, not only in the leaf nodes.Their LearnSPGM algorithm outperformed both LearnSPN and ID-SPN on the 20 real-world datasets previously used in [90].
Sum-product-quotient networks (SPQNs) [91] introduce quotient nodes, which take two inputs and output their quotient, allowing these models to represent conditional probabilities explicitly.
Tensor SPNs (tSPNs) [92] are an alternative representation of SPNs.Their main advantage is an important reduction in the number of parameters-between 24 and 569 times in the experiments-with little loss of modeling accuracy.Additionally, tSPNs allow for faster inference and a deeper and more narrow neural-network architecture.
Convolutional SPNs (ConvSPNs or CSPN) were developed independently in [82] and [93].The first model exploits the inherent structure of spatial data in a similar way to convolutional neural networks by using the sum and product operations of SPNs.The second one finds a criterion for seeing when a convolutional neural network is an SPN.
Submodular SPNs (SSPNs) [94] are an extension of SPNs for scene understanding, whose weights can be defined by submodular energy functions.
Compositional kernel machines (CKMs) [95] are an instance-based method closely related to SPNs.They have been successfully applied to image processing tasks, mainly to object recognition.
Conditional SPNs (CSPNs) [96] extend SPNs to conditional probability distributions.They include a new type of node, called a gating node, which computes a convex combination of the conditional probability of its children with non-fixed weights.

X. Conclusion
SPNs are closely related to probabilistic graphical models (PGMs), such as Bayesian networks and Markov networks, but have the advantage of allowing the construction of tractable models from data.SPNs have been applied to the same tasks as neural networks, mainly image and natural language processing, which exceeded by far the capabilities of PGMs.Even though deep neural networks yield in general better results, SPNs have the possibility of automatically building the structure from data and learning the parameters with either gradient descent or some of the algorithms developed for probabilistic models.
In this paper we have tried to offer a gentle introduction to SPNs, collecting information that is spread in many publications and presenting it in a coherent framework, trying to keep the mathematical complexity to the minimum necessary for describing with rigor the main properties of SPNs, from their definition to the algorithms for parametric and structural learning.We have intentionally avoided presenting SPNs as representations of network polynomials; readers interested in them can consult [9] and references therein.We have then reviewed several applications of SPNs in different domains, some extensions, and the main software libraries for SPNs.Given the rapid growth of the literature about SPNs, some sections of this paper might become obsolete soon, but we still hope it will be useful for those researchers who wish to get acquainted with this fascinating topic.
In contrast, SPNs were proposed from their start [1] as a method for building probabilistic models from data.Nowadays there are many algorithms for learning ACs and other types of probabilistic circuits.The references can be found in the tutorial by Vergari et al. that we recommended at the end of Section IX.

Appendix B. Interpretation of sum nodes as weighted averages of conditional probabilities
In this appendix we show that when a sum node represents a a variable V , this node can be interpreted as the average of the conditional probabilities given V , weighted by the probabilities P (v).We study first the case in which this node is the root.Proposition 29.Let S be an SPN, whose scope contains at least two variables, such that its root is a sum node that represents a variable V having m values.For every j, let n σ(j) be the product node associated with value v j , as in Definition 15.Then P (v j ) = w i,σ(j) . ( Additionally, we can define X = sc(n i ) \ {V }, so that if w i,σ(j) = 0 then for every configuration x we have This proposition is especially interesting when each child of the root, n σ(j) , has only two children, namely I vj and another node, say n k(j) -as in Figures 1 and 2. In this case the latter equation reduces to Given that sc(n k(j) ) = X for every j, Theorem 11 implies that S k(j) ( x) = P k(j) ( x).Therefore, the probability computed by the root, can be interpreted as a weighted average of the probabilities P k(j) ( x), P ( x) = j w i,σ(j) • P k(j) ( x) , (38) or as a particular case of the law of total probability: Example 30.The root node in Figure 2  • P (¬a) w1,3 .
If every ancestor of a sum node n i represents a variable, then the above interpretation is still valid for the context defined by the ancestors of n i , A. In this case Equation 40 .

Appendix C. Proofs
This appendix contains the proofs of all the propositions and the theorem.
Proof of Proposition 5: It follows from Equation 6that P (x) ≥ 0 and Given that each configuration of V is compatible with exactly one configuration of X, we have Proof: Equation 18 implies w ij = 0 (because S(v) = 0) and ∂S(v) ∂w ij = S(v) w ij .
Proof of Proposition 28: Given that v t h t is a complete configuration of sc(S ), i.e., v t h t ∈ conf(S ), and S is selective, Proposition 35 implies that So Equation 32 can be rewritten as If link (i, j) does not belong to the tree induced by v t h t then ∂S (v t h t )/∂w ij = 0, so the inner summation in the previous expression can be extended to all the configurations of H t : Given that S (v t ) = h t S (v t h t ), we have Because of Proposition 18, S (v t ) = S(v t ) and This result, together with Equations 26 and 14, leads to Equation 33.
Proof of Proposition 29: Let j ∈ {1, . . .m}.Since n r represents variable V , either I vj = n σ(j) or I vj is a child of n σ(j) -see Definition 15.In the first case, we would have sc(n σ(j) ) = {V } and the completeness of S would imply that sc(n r ) = sc(n r ) = {V }, in contradiction with the assumption that sc(S) has at least to variables.Therefore I vj must be a child of n σ(j) , a product node, and S σ(j) (x) = I vj (x) • k∈ch(σ(j))\n k =Iv j S k (x) , for every x ∈ conf * (S).Let x = v j x, i.e., the composition of v j and any x ∈ conf * ( V).When j = j we have I vj (v j x) = I v j (v j ) = 0 and S σ(j ) (v j x) = 0. On the other hand, S σ(j) (v j x) = I vj (v j ) • k∈ch(σ(j))\n k =Iv j S k ( x) , = k∈ch(σ(j))\n k =Iv j S k ( x) .
Since n r is a sum node, In particular, if x = then S k ( ) = 1 for every k and P (v j ) = w i,σ(j) .

Figure 1 .
Figure 1.A Bayesian network (left) and an equivalent SPN (right).The 6 terminal nodes in the SPN are indicators for the 3 variables in the model, A, B, and C; they are the input of the SPN for every configuration of these variables, including partial configurations.The root node, n 1 , computes the joint and marginal probabilities.

Figure 2 .
Figure 2. If P (c | b, +a) = P (c | b, ¬a) for every value of B and C (context-specific independence), nodes n 16 and n 17 in Figure 1 can be coalesced into node n 16 in this figure.The numbers in red are the values S i (v) for v = (+a, +b, ¬c).

Figure 3
Figure3.Augmentation of an SPN, assuming that n i is not selective in S.This process adds an indicator I z(j) for every child n j of n i .Node n i is added to restore the completeness of n l in S .

Figure 4 .
Figure 4. MPE computation for the SPN in Figure 2. Sum nodes turn into max nodes.The numbers in red are the values S max i (e) when the evidence is e = +c.The most probable explanation, MPE (e) = (+a, ¬b), is found by backtracking from the root to the leaves (thick lines).

Figure 5
Figure5.The LearnSPN algorithm recursively creates a product node when there are subsets of (approximately) independent variables and a sum node otherwise, grouping similar instances.(Reproduced from[6] with the authors' permission.)

xPProposition 32 .Proposition 35 .
(x) = v P (v) = 1 .Before proving Propositions 6 and 7, we introduce a new result.Let P be a function P : conf * (V) → R that satisfies Equation6.If P ( ) = 1, then v P (v) = 1.If S is selective, v ∈ conf(S), S(v) = 0,and (i, j) ∈ S v , then [83]na at the University of Darmstadt, Germany, with contributors from several countries[83].It implements methods for inference (marginal and conditional probabilities, and approximate MPE), parameter learning (with GD) and several structural learning algorithms, and can be extended and customized to implement new algorithms.SPNs are usually compiled into TensorFlow for fast computation, but they can also be compiled into C, CUDA, or FPGA code.