Adaptive Social Learning

This work proposes a novel strategy for social learning by introducing the critical feature of adaptation. In social learning, several distributed agents update continually their belief about a phenomenon of interest through: i) direct observation of streaming data that they gather locally; and ii) diffusion of their beliefs through local cooperation with their neighbors. Traditional social learning implementations are known to learn well the underlying hypothesis (which means that the belief of every individual agent peaks at the true hypothesis), achieving steady improvement in the learning accuracy under stationary conditions. However, these algorithms do not perform well under nonstationary conditions commonly encountered in online learning, exhibiting a significant inertia to track drifts in the streaming data. In order to address this gap, we propose an Adaptive Social Learning (ASL) strategy, which relies on a small step-size parameter to tune the adaptation degree. First, we provide a detailed characterization of the learning performance by means of a steady-state analysis. Focusing on the small step-size regime, we establish that the ASL strategy achieves consistent learning under standard global identifiability assumptions. We derive reliable Gaussian approximations for the probability of error (i.e., of choosing a wrong hypothesis) at each individual agent. We carry out a large deviations analysis revealing the universal behavior of adaptive social learning: the error probabilities decrease exponentially fast with the inverse of the step-size, and we characterize the resulting exponential learning rate. Second, we characterize the adaptation performance by means of a detailed transient analysis, which allows us to obtain useful analytical formulas relating the adaptation time to the step-size.

However, such remarkably good convergence properties have a subtle consequence that has been overlooked so far in the literature: the exponentially increasing accuracy in learning the true hypothesis makes all agents stubborn! We illustrate this phenomenon through a simple example.
Consider a weather forecast problem solved by an online social learning algorithm. Assume that the agents are collecting data that drive them to believe that "tomorrow will be sunny". After some time, however, assume that the streaming data available for the decision evolve in response to changes in weather conditions, with the most recent evidence suggesting markedly that "tomorrow will be rainy". The traditional (existing) social learning algorithms discourage agents from changing their "mind", and it will be virtually impossible for the agents to adapt to the new situation and revise their earlier conclusion. This effect is clearly visible in the example of Fig. 1. In this example we considered a network of 10 agents that collect data originating from one of three possible hypotheses, namely, "sunny", "cloudy", and "rainy". The data are initially consistent with the hypothesis "sunny". We observe from the blue curve in the top plot of Fig. 1 that the belief of agent 1 for the hypothesis "sunny" approaches the value one and, therefore, this agent is able to arrive at the correct determination about the state of nature. However, in our simulation, the state of nature is made to change to "rainy" at instant i = 200 (not shown in the figure). It is observed that the beliefs of agent 1 start changing only around i = 350, and the agent first transitions to believing that it is "cloudy" (the green curve) before switching to believing that it is "rainy" many iterations later, around i = 550. This example shows that, under traditional social learning schemes, agents are not able to recover sufficiently fast to adapt their beliefs and track changes in the state of nature. The outcome of the social learning algorithm (we display in the figure the belief of agent 1, with similar behavior being observed for other agents) shows clearly that the agents learn well until instant i = 200, since they give almost full credibility to the hypothesis according to which the data are drawn, but react far slower afterward when the state of nature changes. As a matter of fact, the traditional social learning algorithm has a delayed reaction to the change, only perceiving that something has changed at instant i ≈ 350, but still not detecting the true state, because the agent gives maximum credibility to the wrong intermediate hypothesis "cloudy". After a prohibitive number of iterations, at i ≈ 550, agents manage to overcome their stubbornness and opt for the correct hypothesis "rainy".

[Fig. 1, bottom panel: the instantaneous decision taken by agent 1, obtained by choosing the hypothesis that maximizes the current belief.]
This behavior is problematic for many scenarios dealing with nonstationary environments where the state of nature undergoes regular changes. For this reason, a good learning algorithm must be able to adapt to drifts in the streaming information collected by the agents. This work proposes an Adaptive Social Learning (ASL) strategy to fill this gap. One instance of such a strategy is shown in Fig. 2 with reference to the same example from Fig. 1. We see that the ASL algorithm reacts much faster (almost instantly) and is able to track the target change at instant i ≈ 200, exhibiting an adaptation capacity that is remarkably higher than that of the classic social learning algorithm.
There are at least two advantages in devising the ASL algorithm. The first one is related to a modeling perspective. As already indicated, the existing social learning strategies are not able to endow agents with adaptation abilities, whereas the proposed ASL model will be able to do so. The second one is related to a design perspective. Social learning algorithms are useful not only in modeling opinion formation over social networks. They are also useful in designing man-made engineered systems (such as robotic swarms) tasked to solve decision problems collectively. Endowing such systems with adaptation abilities is critical for many applications.
The main contributions of this work are as follows. First, we introduce a novel social learning strategy that enables adaptation. Then, we provide an accurate analytical characterization of this strategy. In particular, by exploiting recent advances in the field of distributed detection over adaptive networks (see [20] for an overview), we furnish a detailed characterization of the social learning performance at each individual agent, in terms of: i) convergence of the system at the steady state (Theorem 1); ii) achievability of consistent learning (Theorem 2); iii) a Gaussian approximation for the learning performance (Theorem 3); and iv) the error exponents for the learning error probabilities (Theorem 4).

II. ASL STRATEGY
The agents of the network collect streaming observations (or data) about a phenomenon of interest. Agent $k = 1, 2, \ldots, N$, at time epoch $i = 1, 2, \ldots$, collects a "private" observation $\xi_{k,i}$ belonging to a certain space $\mathcal{X}_k$. The qualification "private" signifies that the observations cannot be shared among agents. The dependence of the space $\mathcal{X}_k$ upon $k$ allows for a possible heterogeneity in the types of data at the different agents. The data will be assumed statistically independent over time, i.e., over the index $i$, whereas they can be dependent across agents. In social learning, since the inter-agent dependence is usually not known to the agents, the focus is on marginal distributions, i.e., on the distribution pertaining to any individual agent. Specifically, it is assumed that the distribution of $\xi_{k,i}$ belongs to a set of $H$ admissible models that are identified by a discrete parameter (or hypothesis) $\theta \in \Theta = \{1, 2, \ldots, H\}$.
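The likelihood of agent $k$ evaluated at $\theta$ is denoted by:

$$L_k(\xi|\theta), \qquad \xi \in \mathcal{X}_k, \quad \theta \in \Theta. \tag{1}$$

The presence of the subscript $k$ highlights that the likelihoods are allowed to vary across the agents. In our treatment, $L_k(\xi|\theta)$ (regarded as a function of $\xi$) can be either a probability density or a probability mass function, depending on whether $\xi_{k,i}$ is continuous or discrete, respectively. Moreover, in order to avoid trivialities, we assume a regularity condition on the Kullback-Leibler (KL) divergences [21] (Assumption 1), namely, that the KL divergences between the likelihoods of each agent under distinct hypotheses are finite.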
The data collected by the agents are generated from one particular model (the true hypothesis) and the goal of social learning is to let the agents learn this hypothesis from the data.In the adaptive context, the system conditions can drift over time.For example, the true model governing the data can change, and one fundamental goal of adaptive social learning is to let the agents react promptly to these drifts and start learning the "new" model.
We are now ready to present our solution for adaptive social learning. As usual in social learning implementations, the two main objects of the learning process are: an intermediate belief $\psi_{k,i}(\theta)$, which each agent $k$ shares at time $i$ with its neighbors; and the belief $\mu_{k,i}(\theta)$, which agent $k$ obtains at time $i$ by combining the intermediate beliefs received from its neighbors. For the algorithm initialization, we assume a standard condition (Assumption 2), namely, that the initial beliefs of all agents are strictly positive at every hypothesis.
The ASL strategy is a two-step algorithm that iterates over time as follows. In the first step, each agent $k$ constructs an intermediate belief vector $\psi_{k,i}$ by incorporating the current observation $\xi_{k,i}$ into the belief of the preceding time epoch, $\mu_{k,i-1}$, through the following adaptive Bayesian update:

$$\psi_{k,i}(\theta) = \frac{\mu_{k,i-1}^{1-\delta}(\theta)\, L_k^{\delta}(\xi_{k,i}|\theta)}{\sum_{\theta' \in \Theta} \mu_{k,i-1}^{1-\delta}(\theta')\, L_k^{\delta}(\xi_{k,i}|\theta')}. \tag{2}$$

In (2), the denominator is a normalization factor that makes $\psi_{k,i}$ a probability vector, and $0 < \delta < 1$ is a design parameter that we are introducing and that will be referred to as the step-size, for a reason that will become clear later.
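As a minimal sketch of this first step, the snippet below implements one adaptive Bayesian update for a single agent, assuming the update raises the past belief to the power $1-\delta$ and the likelihood to the power $\delta$ before normalizing, as described in the text. The function name `adaptive_bayes_update` and the numerical values are illustrative assumptions, not taken from the paper; the computation is carried out in the log domain for numerical robustness.

```python
import math

def adaptive_bayes_update(prior, likelihoods, delta):
    """One adaptive update: psi(theta) proportional to
    prior(theta)**(1 - delta) * likelihood(theta)**delta.

    `prior` and `likelihoods` are lists indexed by hypothesis; all entries
    are assumed strictly positive. Names and values are hypothetical.
    """
    logs = [(1.0 - delta) * math.log(p) + delta * math.log(l)
            for p, l in zip(prior, likelihoods)]
    peak = max(logs)                      # subtract the max before exponentiating
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)                  # normalization factor (the denominator)
    return [w / total for w in weights]

# Illustrative values: the past belief favors hypothesis 0,
# while the new observation is most likely under hypothesis 2.
prior = [0.98, 0.01, 0.01]
likelihoods = [0.05, 0.30, 0.65]
psi = adaptive_bayes_update(prior, likelihoods, delta=0.1)
print(psi)
```

A larger `delta` shifts weight from the past belief toward the new observation, which is the adaptation mechanism discussed in the text.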
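In the second step, each agent $k$ aggregates into its own current belief $\mu_{k,i}$ the intermediate beliefs received from its neighbors:

$$\mu_{k,i}(\theta) = \frac{\exp\Big\{\sum_{\ell \in \mathcal{N}_k} a_{\ell k} \log \psi_{\ell,i}(\theta)\Big\}}{\sum_{\theta' \in \Theta} \exp\Big\{\sum_{\ell \in \mathcal{N}_k} a_{\ell k} \log \psi_{\ell,i}(\theta')\Big\}}, \tag{3}$$

using a collection of convex combination weights, namely,

$$a_{\ell k} \geq 0, \qquad \sum_{\ell \in \mathcal{N}_k} a_{\ell k} = 1, \qquad a_{\ell k} = 0 \ \text{for } \ell \notin \mathcal{N}_k, \tag{4}$$

where $\mathcal{N}_k$ denotes the neighborhood of agent $k$, with $k$ itself being included. For later use, we introduce the left-stochastic combination matrix $A = [a_{\ell k}]$. We assume that the network is strongly connected (i.e., for any two nodes $\ell$ and $k$, there always exists a path with nonzero weights linking them in both directions, and at least one node has a self-loop, i.e., $a_{kk} > 0$ for at least one agent $k$) [22]. Under this assumption, the Perron eigenvector $\pi$ associated with the matrix $A$ has all strictly positive entries [22]:

$$A\pi = \pi, \qquad \sum_{\ell=1}^{N} \pi_\ell = 1, \qquad \pi_\ell > 0 \ \text{for all } \ell. \tag{5}$$

We see from (3) that the agents combine linearly the logarithms of the received intermediate beliefs, and then use exponentiation and normalization to get back an admissible probability vector.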
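The combination step and the Perron eigenvector can be sketched as follows. This is an illustration under stated assumptions: the 3-agent matrix `A` is hypothetical, `A[l][k]` stores the weight that agent `k` assigns to agent `l` (so each column sums to one), and the Perron vector is approximated by plain power iteration rather than by any method prescribed in the paper.

```python
import math

# Hypothetical 3-agent combination matrix; A[l][k] is the weight that
# agent k assigns to agent l, so every column sums to one (left-stochastic).
A = [[0.5, 0.25, 0.0],
     [0.5, 0.50, 0.5],
     [0.0, 0.25, 0.5]]

def combine(psi, A, k):
    """Belief of agent k: normalized weighted geometric mean of the
    intermediate beliefs psi[l] of its neighbors (weights A[l][k])."""
    n, H = len(A), len(psi[0])
    logs = [sum(A[l][k] * math.log(psi[l][t]) for l in range(n) if A[l][k] > 0)
            for t in range(H)]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def perron_vector(A, iters=200):
    """Power iteration for A pi = pi; the entries of pi keep summing to one
    because the columns of A sum to one."""
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(A[l][j] * pi[j] for j in range(n)) for l in range(n)]
    return pi

# Illustrative intermediate beliefs of the three agents over three hypotheses.
psi = [[0.7, 0.2, 0.1],
       [0.3, 0.4, 0.3],
       [0.1, 0.2, 0.7]]
mu = [combine(psi, A, k) for k in range(3)]
pi = perron_vector(A)
print(mu)
print(pi)
```

Skipping neighbors with zero weight in `combine` mirrors the restriction of the sum to the neighborhood of agent `k`.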
In order to capture the essence of the ASL strategy, it is useful to contrast it with traditional social learning algorithms.
We refer in particular to the useful algorithms presented in [16], [17], [18], [19]. The combination step used there is identical to (3). The fundamental difference resides in the update step (2). The earlier methods do not incorporate the exponentiation factor $\delta$. Different from the adaptive Bayesian update (2), the Bayesian update employed in [16], [17], [18], [19] is:

$$\psi_{k,i}(\theta) = \frac{\mu_{k,i-1}(\theta)\, L_k(\xi_{k,i}|\theta)}{\sum_{\theta' \in \Theta} \mu_{k,i-1}(\theta')\, L_k(\xi_{k,i}|\theta')}. \tag{6}$$

We see that the classic Bayesian update incorporates the new information into the past belief by giving equal weight to both $\mu_{k,i-1}$ and the likelihood of the new data $L_k(\xi_{k,i}|\theta)$. Contrasting this behavior with (2), we see that the ASL strategy implements instead a convex combination of probability vectors at the exponent, through the weights $1-\delta$ and $\delta$. Observe further that (2) cannot be reduced to (6) for any selection of $\delta \in (0, 1)$. Notably, such forms of convex combinations have been used in the statistical literature for a very different purpose, namely, in the definition of the Chernoff information [23].
Inspecting (2), we see that each agent performs its update by modulating, through the convex scalars 1 − δ and δ, the relative weights assigned to the past and new information.In particular, relatively large values for δ give more importance to the new data, whereas small values for δ give more importance to the past beliefs.In this way, and as we will see, the step-size parameter δ naturally endows the social learning algorithm with adaptation.
The adaptation properties of the ASL strategy are enabled by a learning mechanism that is fundamentally different from that of classic social learning. To see why, let us assume that the true hypothesis remains stable for a sufficiently long time interval. Different from what happens in classic social learning (e.g., in (6)), in the ASL strategy the belief will not converge as time $i$ increases. In contrast, the belief will oscillate indefinitely, preserving a random behavior also in the steady state. The learning performance will then be assessed by examining the statistical behavior of the beliefs in the steady state. We will provide an accurate characterization of such statistical behavior in the regime of small step-sizes, i.e., by performing an asymptotic analysis as $\delta \to 0$. Under this regime, we will be able to show that the probability of guessing the right hypothesis approaches 1 for sufficiently small step-sizes. This behavior will highlight well the adaptation/learning tradeoff: small (resp., large) values of $\delta$ mean less (resp., more) adaptation and higher (resp., lower) learning accuracy.
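The contrast between the two update rules can be explored with a small simulation. The sketch below is not the experiment of Figs. 1 and 2: it assumes a hypothetical 3-agent network, unit-variance Gaussian likelihoods with means 0, 1, 2, and a change of the true hypothesis from 0 to 2 at time 200. Beliefs are tracked as unnormalized log-beliefs, which is sufficient for the maximum-belief decision because normalization only adds a term common to all hypotheses. The classic rule corresponds to weights (1, 1) on the past log-belief and the log-likelihood, and the adaptive rule to weights $(1-\delta, \delta)$.

```python
import random

# Hypothetical left-stochastic matrix: A[l][k] is the weight agent k gives to agent l.
A = [[0.5, 0.25, 0.0],
     [0.5, 0.50, 0.5],
     [0.0, 0.25, 0.5]]
MEANS = [0.0, 1.0, 2.0]   # assumed unit-variance Gaussian models, one per hypothesis

def reaction_delay(prior_w, data_w, change_at=200, horizon=1500, seed=0):
    """Number of iterations after the change until agent 0 first selects
    the new true hypothesis (index 2); None if it never does."""
    rng = random.Random(seed)
    n, H = len(A), len(MEANS)
    log_mu = [[0.0] * H for _ in range(n)]          # uniform initial beliefs
    for i in range(1, horizon + 1):
        truth = 0 if i <= change_at else 2
        log_psi = []
        for k in range(n):
            x = rng.gauss(MEANS[truth], 1.0)
            # Update step in the log domain (Gaussian constant dropped).
            log_psi.append([prior_w * log_mu[k][t] - data_w * 0.5 * (x - MEANS[t]) ** 2
                            for t in range(H)])
        log_mu = []
        for k in range(n):
            # Combination step: weighted average of neighbors' log-beliefs.
            row = [sum(A[l][k] * log_psi[l][t] for l in range(n)) for t in range(H)]
            peak = max(row)
            log_mu.append([v - peak for v in row])  # shift common to all hypotheses
        if i > change_at and max(range(H), key=lambda t: log_mu[0][t]) == 2:
            return i - change_at
    return None

classic_delay = reaction_delay(1.0, 1.0)        # classic Bayesian update
asl_delay = reaction_delay(1.0 - 0.1, 0.1)      # adaptive update with delta = 0.1
print("classic:", classic_delay, "adaptive:", asl_delay)
```

In this toy setting the classic rule has accumulated a large log-belief gap before the change and needs many samples to undo it, whereas the adaptive rule forgets the past at a rate governed by `delta`.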

III. STATISTICAL DESCRIPTORS OF THE LEARNING PERFORMANCE
Assume that a certain hypothesis has been in force up to time $i_0$, and that the agents have correctly converged to that hypothesis. Then, from $i_0 + 1$ onward the true hypothesis changes, with the evolution of the system up to $i_0$ being summarized in the "initial" belief vectors $\mu_{k,i_0}$. Starting from $i_0$, the ASL algorithm behavior will exhibit two important phases: a transient phase, where, given the wrong initial belief, each agent must promptly adapt in order to depart from this wrong state and start learning the correct hypothesis; and a steady-state phase, where, given sufficient time to learn ($i \to \infty$), each agent must accurately learn the correct hypothesis. According to the theory of adaptive inference, the performance of an adaptive learning strategy is characterized under the steady-state regime.
The following property is relevant for the steady-state analysis. By examining the algorithm recursions (2)-(3), it is straightforward to see that, in light of Assumptions 1 and 2, the belief always remains nonzero at any $\theta$ during the algorithm evolution. Now, assume that the algorithm is running under stationary conditions up to time $i_0$, and that the distribution then changes and stays stable for a sufficiently long time. In order to perform a steady-state analysis from $i_0 + 1$ onward, we need to consider $\mu_{k,i_0}$ as the initial state. Since we have observed that the beliefs are always nonzero, we can see that the initial belief vector $\mu_{k,i_0}$ fulfills Assumption 2.
In summary, for the purpose of the steady-state analysis and without loss of generality, we will assume that the steady-state analysis starts at time $i_0 = 0$ and consider an initial belief vector $\mu_{k,0}$ that fulfills Assumption 2. The true hypothesis $\theta_0$ is kept constant over time, yielding:

$$\xi_{k,i} \sim L_k(\xi|\theta_0), \qquad i = 1, 2, \ldots \tag{7}$$

Therefore, for the purpose of the steady-state analysis, we will always imply that expectations and probabilities are evaluated under the distributions $L_k(\xi|\theta_0)$. Note also that, under the steady-state regime, the data $\{\xi_{k,i}\}$ are independent and identically distributed (i.i.d.) over time, i.e., over the index $i$. We will assume that they can have different distributions across the agents, i.e., across the index $k$. Statistical independence across the agents will not be assumed in general, but will be used to prove some of the forthcoming results (Theorems 3 and 4 further ahead).

A. Log-Belief Ratios and Error Probabilities
In order to characterize the learning performance, it is convenient to introduce the logarithm of the ratio between the belief evaluated at $\theta_0$ and the belief evaluated at a generic hypothesis $\theta \neq \theta_0$:

$$\lambda_{k,i}(\theta) = \log \frac{\mu_{k,i}(\theta_0)}{\mu_{k,i}(\theta)}, \tag{8}$$

which is well-defined since, as noticed, the belief remains nonzero at any $\theta$ during the algorithm evolution. Before continuing, it is important to make a notational remark. With the symbol $\boldsymbol{\lambda}_{k,i}$, we will be referring to the $(H-1) \times 1$ vector of log-belief ratios, namely,

$$\boldsymbol{\lambda}_{k,i} = \big[\lambda_{k,i}(\theta_1), \lambda_{k,i}(\theta_2), \ldots, \lambda_{k,i}(\theta_{H-1})\big]^{\top}, \tag{9}$$

where the elements in the set of wrong hypotheses have been indexed as:

$$\Theta \setminus \{\theta_0\} = \{\theta_1, \theta_2, \ldots, \theta_{H-1}\}. \tag{10}$$

One natural way for the agents to choose a hypothesis is to select the hypothesis that maximizes the belief. Therefore, the error probability can be expressed as

$$p_{k,i} = \mathbb{P}\Big[\arg\max_{\theta \in \Theta} \mu_{k,i}(\theta) \neq \theta_0\Big]. \tag{11}$$

It is useful to rewrite the error probability as a function of the log-belief ratios. To this end, observe that the event within brackets in (11) corresponds to saying that the belief is not maximized at $\theta_0$, which in turn corresponds to saying that the log-belief ratios in (8) are less than or equal to zero for at least one $\theta \neq \theta_0$. Therefore, the error probability can be equivalently rewritten as:

$$p_{k,i} = \mathbb{P}\big[\exists\, \theta \neq \theta_0 : \lambda_{k,i}(\theta) \leq 0\big]. \tag{12}$$

Finally, we introduce the steady-state error probability:

$$p_k^{(\delta)} = \lim_{i \to \infty} p_{k,i}. \tag{13}$$

There are two fundamental questions related to the concept of steady-state error probability. The first question regards its existence, which is in principle not guaranteed. Theorem 1 will provide an affirmative answer to this question by characterizing the steady-state behavior of the log-belief ratios. The second question regards the evaluation of the steady-state error probability. An exact evaluation is generally a formidable task. Therefore, to tackle this critical problem, we will perform an asymptotic analysis in the regime of small $\delta$, which will allow us to obtain reliable predictions of the steady-state performance.
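The equivalence between the maximum-belief decision and the sign of the log-belief ratios can be illustrated with a short sketch. The helper names `log_belief_ratios` and `is_error` and the belief vectors are hypothetical; ties at zero are counted as errors, in line with the "less than or equal to zero" condition stated above.

```python
import math

def log_belief_ratios(mu, theta0):
    """Log-ratio between the belief at theta0 and the belief at each wrong
    hypothesis; `mu` is a belief vector with strictly positive entries."""
    return {t: math.log(mu[theta0] / mu[t])
            for t in range(len(mu)) if t != theta0}

def is_error(mu, theta0):
    """Error event: at least one log-belief ratio is <= 0, i.e., the belief
    is not (strictly) maximized at the true hypothesis."""
    return any(v <= 0.0 for v in log_belief_ratios(mu, theta0).values())

# Illustrative belief vectors with true hypothesis theta0 = 0.
mu_correct = [0.7, 0.2, 0.1]   # belief peaks at the true hypothesis
mu_wrong = [0.3, 0.6, 0.1]     # belief peaks at a wrong hypothesis

for mu in (mu_correct, mu_wrong):
    decision = max(range(len(mu)), key=lambda t: mu[t])
    print(decision, is_error(mu, theta0=0))
```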
In Fig. 3 we show an example of the evolution of the error probability of two agents in a network implementing the ASL strategy. All the probabilities are estimated empirically by Monte Carlo simulation. We see how the instantaneous error probability $p_{k,i}$ converges to a steady-state nonzero value $p_k^{(\delta)}$ as $i$ increases. It is useful to remark that this behavior is different from that of classic social learning, where, under stationary conditions, the error probability of each agent vanishes as time elapses. This is one instance of the adaptation/learning tradeoff: non-adaptive strategies can increase their accuracy indefinitely under stationary conditions. However, astronomically low values of the error probabilities lead to a detrimental inertia in responding to possible changes.

B. Log-Likelihood Ratios
For $k = 1, 2, \ldots, N$, $i = 0, 1, \ldots$, and $\theta \neq \theta_0$, we introduce the log-likelihood ratio:

$$x_{k,i}(\theta) = \log \frac{L_k(\xi_{k,i}|\theta_0)}{L_k(\xi_{k,i}|\theta)}, \tag{14}$$

and its expectation:

$$d_k(\theta) = \mathbb{E}\big[x_{k,i}(\theta)\big], \tag{15}$$

namely, the KL divergence between $L_k(\xi|\theta_0)$ and $L_k(\xi|\theta)$, which is finite in view of Assumption 1, implying that the log-likelihood ratios cannot diverge (but for an ensemble of realizations with zero probability). We recall that the expectation in (15) is computed assuming that the random variable $\xi_{k,i}$ is distributed according to the model $L_k(\xi|\theta_0)$. Since we focus on the steady state, this distribution is constant over time, which explains why $d_k(\theta)$ does not depend on $i$. Furthermore, since the true hypothesis $\theta_0$ is held fixed during the steady-state analysis, in order to avoid a heavier notation we are not emphasizing the dependence of the KL divergence $d_k(\theta)$ on $\theta_0$.
We continue by introducing an average variable that will play a role in the forthcoming results, namely, the network average of log-likelihood ratios, for all $\theta \neq \theta_0$:

$$x_{\mathrm{ave},i}(\theta) = \sum_{\ell=1}^{N} \pi_\ell\, x_{\ell,i}(\theta). \tag{16}$$

The random variable $x_{\mathrm{ave},i}(\theta)$ appearing in (16) is obtained by combining linearly the local log-likelihood ratios $x_{\ell,i}(\theta)$. The combination weight assigned to the log-likelihood ratio of the $\ell$-th agent is given by the limiting combination weight, i.e., by the $\ell$-th entry, $\pi_\ell$, of the Perron eigenvector. We will see in the following that the asymptotic properties of the ASL strategy as $\delta \to 0$ can be directly related to the vector of average variables, $\boldsymbol{x}_{\mathrm{ave},i}$.

IV. STEADY-STATE ANALYSIS
As we have remarked in the introduction, different from the classic social learning setting, in the adaptive setting the belief will not converge as $i \to \infty$. In contrast, the belief of each agent will preserve an oscillatory behavior. This everlasting oscillation is critical to keep adaptation alive. On the other hand, it makes the steady-state analysis more difficult, since the beliefs preserve a random character even when $i \to \infty$. In order to carry out a meaningful steady-state analysis, the fundamental preliminary step then becomes to establish whether such random oscillations lead to stable random variables as $i \to \infty$. Theorem 1 further ahead ascertains that this is the case.
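Before stating the theorem, let us examine the evolution of the log-belief ratios. Exploiting (2) and (3), we end up with the following recursion, for every $\theta \neq \theta_0$:

$$\lambda_{k,i}(\theta) = \sum_{\ell=1}^{N} a_{\ell k} \Big[(1-\delta)\, \lambda_{\ell,i-1}(\theta) + \delta\, x_{\ell,i}(\theta)\Big], \tag{17}$$

which can be rewritten as a two-step recursion, (18) and (19): an adaptation step, in which each agent $\ell$ forms the local quantity $(1-\delta)\lambda_{\ell,i-1}(\theta) + \delta x_{\ell,i}(\theta)$, followed by a combination step, in which agent $k$ averages these local quantities over its neighborhood with the weights $a_{\ell k}$. The time-evolution of the log-belief ratios in (18) and (19) is in the form of a diffusion algorithm with constant step-size $\delta$ (see, e.g., [22]). This is why we referred to $\delta$ as the step-size.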
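Developing the recursion in (17) and recalling that $A = [a_{\ell k}]$ is the combination matrix, we can write, for all $\theta \neq \theta_0$:

$$\lambda_{k,i}(\theta) = (1-\delta)^{i} \sum_{\ell=1}^{N} [A^{i}]_{\ell k}\, \lambda_{\ell,0}(\theta) + \delta \sum_{m=0}^{i-1} \sum_{\ell=1}^{N} (1-\delta)^{m} [A^{m+1}]_{\ell k}\, x_{\ell,i-m}(\theta). \tag{20}$$

Since the transient term (the first term on the right-hand side, which depends on the initial state) dies out as $i \to \infty$, in order to evaluate the steady-state behavior of $\lambda_{k,i}(\theta)$ we can ignore it and rewrite, with slight abuse of notation:

$$\lambda_{k,i}(\theta) = \delta \sum_{\ell=1}^{N} \sum_{m=0}^{i-1} (1-\delta)^{m} [A^{m+1}]_{\ell k}\, x_{\ell,i-m}(\theta), \tag{21}$$

where we have further exchanged the order of summation for later convenience.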
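The unrolling of the recursion can be checked numerically. The sketch below assumes the log-belief recursion has the adapt-then-combine form implied by the update and combination steps (each agent mixes its past log-belief ratio and its new log-likelihood ratio with weights $1-\delta$ and $\delta$, then averages over neighbors), and compares it with the explicit sum made of a transient term plus a weighted sum of past data. The matrix `A`, the synthetic data, and the function names are illustrative assumptions.

```python
import random

# Hypothetical left-stochastic matrix: A[l][k] is the weight agent k gives to agent l.
A = [[0.5, 0.25, 0.0],
     [0.5, 0.50, 0.5],
     [0.0, 0.25, 0.5]]
N = len(A)

def mat_mul(X, Y):
    return [[sum(X[r][j] * Y[j][c] for j in range(N)) for c in range(N)]
            for r in range(N)]

def by_recursion(x, lam0, delta):
    """Iterate: local mix with weights (1 - delta, delta), then combine."""
    lam = list(lam0)
    for x_i in x:
        local = [(1 - delta) * lam[l] + delta * x_i[l] for l in range(N)]
        lam = [sum(A[l][k] * local[l] for l in range(N)) for k in range(N)]
    return lam

def by_unrolled_sum(x, lam0, delta):
    """Transient term plus delta * sum_m (1-delta)^m [A^(m+1)]_{lk} x_{l,i-m}."""
    i = len(x)
    power = A                                  # holds A^(m+1)
    lam = [0.0] * N
    for m in range(i):
        x_recent = x[i - 1 - m]                # data at time i - m (x[0] is time 1)
        for k in range(N):
            lam[k] += delta * (1 - delta) ** m * sum(
                power[l][k] * x_recent[l] for l in range(N))
        if m < i - 1:
            power = mat_mul(power, A)
    # After the loop, `power` equals A^i, as needed by the transient term.
    return [lam[k] + (1 - delta) ** i * sum(power[l][k] * lam0[l] for l in range(N))
            for k in range(N)]

rng = random.Random(1)
x = [[rng.gauss(0.5, 1.0) for _ in range(N)] for _ in range(50)]  # synthetic log-likelihood ratios
lam0 = [2.0, -1.0, 0.5]                                            # arbitrary initial state
gap = max(abs(u - v) for u, v in zip(by_recursion(x, lam0, 0.1),
                                     by_unrolled_sum(x, lam0, 0.1)))
print(gap)
```

The factor $(1-\delta)^i$ in front of the initial state is what makes the transient term negligible for large $i$.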

A. Steady-State Log-Belief Ratios
The goal of the steady-state analysis is to evaluate the performance (i.e., the error probability) for large $i$. For this evaluation to be meaningful, we must ascertain that the error probability in (12) converges as $i \to \infty$. To this end, we will now establish that there exists a certain limiting random vector, $\boldsymbol{\lambda}_k^{(\delta)}$, such that the probability distribution of the vector of log-belief ratios, $\boldsymbol{\lambda}_{k,i}^{(\delta)}$, converges, as $i \to \infty$, to the probability distribution of $\boldsymbol{\lambda}_k^{(\delta)}$. This notion of convergence can be formally defined as follows.
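We say that the sequence (over the index $i$) of random vectors $\boldsymbol{\lambda}_{k,i}^{(\delta)}$ converges in distribution, or weakly, as $i \to \infty$ if we can define a random vector $\boldsymbol{\lambda}_k^{(\delta)}$ such that [24]:

$$\lim_{i \to \infty} \mathbb{P}\big[\boldsymbol{\lambda}_{k,i}^{(\delta)} \in B\big] = \mathbb{P}\big[\boldsymbol{\lambda}_k^{(\delta)} \in B\big] \tag{22}$$

for all measurable sets $B$ whose boundary $\partial B$ has zero probability under the limiting distribution, namely, for all measurable sets $B$ fulfilling the condition:

$$\mathbb{P}\big[\boldsymbol{\lambda}_k^{(\delta)} \in \partial B\big] = 0. \tag{23}$$

In the following, weak convergence will be compactly denoted as:

$$\boldsymbol{\lambda}_{k,i}^{(\delta)} \ \overset{i \to \infty}{\rightsquigarrow}\ \boldsymbol{\lambda}_k^{(\delta)}, \tag{24}$$

and the vector $\boldsymbol{\lambda}_k^{(\delta)}$ will be referred to as the steady-state log-belief vector, since it provides the statistical characterization of the log-belief vector $\boldsymbol{\lambda}_{k,i}^{(\delta)}$ as $i \to \infty$. We are now ready to present the theorem that establishes the existence of steady-state log-belief ratios.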
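Theorem 1 (Steady-state log-belief ratios). Let Assumptions 1 and 2 hold, and let

$$\widetilde{\lambda}_{k,i}(\theta) = \delta \sum_{\ell=1}^{N} \sum_{m=0}^{i-1} (1-\delta)^{m} [A^{m+1}]_{\ell k}\, x_{\ell,m+1}(\theta) \tag{25}$$

be the random sum obtained from (21) by taking the summands in reversed order.

First, we have that all the $N$ inner sums in (25) are almost-surely absolutely convergent as $i \to \infty$, implying that $\widetilde{\lambda}_{k,i}(\theta)$ converges almost surely to the random series:

$$\lambda_k^{(\delta)}(\theta) = \delta \sum_{\ell=1}^{N} \sum_{m=0}^{\infty} (1-\delta)^{m} [A^{m+1}]_{\ell k}\, x_{\ell,m+1}(\theta). \tag{26}$$

Second, we have that the vector of log-belief ratios $\boldsymbol{\lambda}_{k,i}^{(\delta)}$ (with the original, i.e., non-reversed, ordering of summation) converges in distribution to the vector $\boldsymbol{\lambda}_k^{(\delta)}$:

$$\boldsymbol{\lambda}_{k,i}^{(\delta)} \ \overset{i \to \infty}{\rightsquigarrow}\ \boldsymbol{\lambda}_k^{(\delta)}. \tag{27}$$

Proof: See Appendix B.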
It is useful to make some comments on Theorem 1. First, finiteness of only the expectation of x k,i is required (through Assumption 1) to guarantee the existence of a steady-state random variable.No assumption is made on higher-order moments.
Second, it is important to notice that (26) does not correspond to letting $i \to \infty$ in the summation in (21). In order to explain why, let us compare two random sums: the sum in (28), which takes the summands in the original order as in (21), and the sum in (29), which takes them in reversed order as in (25). In Fig. 4 we examine a sample path for these sums, and we can see that they exhibit different behavior. The random sum in (28), displayed with a solid line in Fig. 4, oscillates indefinitely as $i \to \infty$. In contrast, the random sum in (29), displayed with a dashed line, is (almost-surely) convergent, namely, it converges to the (random) value $\lambda_k^{(\delta)}(\theta)$ defined in (26). Both behaviors are consistent with what we have already shown in Theorem 1. These profoundly different behaviors depend on the different ordering of the summands in (28) and (29). In particular, in (29) the most recent term, $x_{\ell,i}(\theta)$, takes the smallest weight $(1-\delta)^{i-1}$, which lets the remainder of the series vanish (almost surely). In contrast, in (28) the most recent term, $x_{\ell,i}(\theta)$, takes the highest weight $(1-\delta)^{0} = 1$, which keeps oscillation (and, hence, adaptation) alive.
Even though the sums in (28) and (29) exhibit a markedly different behavior in terms of their time-evolution (i.e., on the sample paths), one notable conclusion from Theorem 1 is that their probability distributions converge to the same distribution, that is, the distribution of the limiting variable $\lambda_k^{(\delta)}$. This equivalence can be explained as follows. With reference to the top panel in Fig. 4, consider a sufficiently large $i$ (say, $i = 300$) and take the corresponding values, at hypothesis $\theta = 2$, of the dashed curve (reversed ordering) and of the solid curve (original ordering). These values are different. However, if we now repeat the experiment in Fig. 4 several times, the realizations of the dashed-curve value across different experiments will be distributed in the same way as the realizations of the solid-curve value.
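The role of the ordering can be illustrated with a simplified scalar sketch that drops the network weights (an assumption made only to keep the example short): the original ordering gives the newest sample the weight $\delta$, while the reversed ordering gives it the weight $\delta(1-\delta)^{i-1}$. The function names and the Gaussian samples are hypothetical. The last check uses the fact that feeding a time-reversed data block to the original-order sum reproduces the reversed-order sum exactly, which is why the two sums share the same distribution at any fixed $i$ when the data are i.i.d.

```python
import random

def original_order(x, delta):
    """Path of delta * sum_{m=0}^{i-1} (1-delta)^m x_{i-m}, computed through
    s_i = (1 - delta) * s_{i-1} + delta * x_i (newest sample weighs most)."""
    path, s = [], 0.0
    for x_i in x:
        s = (1 - delta) * s + delta * x_i
        path.append(s)
    return path

def reversed_order(x, delta):
    """Path of delta * sum_{m=0}^{i-1} (1-delta)^m x_{m+1}
    (newest sample weighs least, so the increments die out)."""
    path, r, w = [], 0.0, delta
    for x_i in x:
        r += w * x_i
        w *= (1 - delta)
        path.append(r)
    return path

rng = random.Random(7)
delta = 0.05
x = [rng.gauss(1.0, 1.0) for _ in range(600)]   # synthetic i.i.d. samples

orig = original_order(x, delta)
rev = reversed_order(x, delta)

late_swing_original = max(orig[-100:]) - min(orig[-100:])   # keeps fluctuating
late_step_reversed = abs(rev[-1] - rev[-2])                 # essentially frozen
same_law_check = abs(original_order(x[::-1], delta)[-1] - rev[-1])
print(late_swing_original, late_step_reversed, same_law_check)
```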
The existence of a limiting distribution for the log-belief vector $\boldsymbol{\lambda}_{k,i}^{(\delta)}$ makes the definition of a steady-state error probability meaningful, since from Eqs. (12) and (13) we see that the steady-state error probability can be computed as (see footnote 4):

$$p_k^{(\delta)} = \mathbb{P}\big[\exists\, \theta \neq \theta_0 : \lambda_k^{(\delta)}(\theta) \leq 0\big]. \tag{30}$$

However, it should be noticed that Theorem 1 constitutes only a first, albeit fundamental, step towards the characterization of the ASL performance, since it establishes only the existence of a steady-state error probability without providing any explicit characterization thereof. Such a characterization is in general not available. In the next sections we tackle this challenging problem by focusing on an asymptotic characterization of $\boldsymbol{\lambda}_k^{(\delta)}$ in the regime of small $\delta$.

V. SMALL-δ ANALYSIS
We have ascertained that it makes sense to define steady-state random variables characterizing the log-belief ratios.
Then, the steady-state learning performance can be determined by examining the probability that these random variables fulfill certain conditions.For example, the steady-state probability that an agent learns the truth is the probability that the steady-state log-belief ratio of that agent is positive only at the true value θ 0 .However, in general the exact characterization of these steady-state variables is a formidable task.For this reason we will resort to an asymptotic analysis in the regime of small δ.We will provide three types of asymptotic results.
1) Weak law of small step-sizes (Theorem 2).We will show that, for small δ, the steady-state vector λ k concentrates around the weighted average of the agents' KL divergences defined in (31).This concentration property guarantees that, with high probability as δ → 0, the true hypothesis is chosen by each agent.This result will require only finiteness of the first moments of the log-likelihood ratios, i.e., finiteness of the KL divergences.
2) Asymptotic normality (Theorem 3). We will obtain a Central Limit Theorem (CLT) that will provide a normal approximation, holding for small δ, for the error probabilities of each individual agent. This result will be proved assuming independence across agents and will require finiteness of the variance of the log-likelihood ratios. We remark that previous results of asymptotic normality for adaptive distributed detection assumed finiteness of higher-order moments [26]. To the best of our knowledge, the result in Theorem 3 (which is based on part 5 of Lemma 1) is the first result that assumes the minimal requirement of finiteness of second moments.

Footnote 4: According to the definition of convergence in distribution, the result in (30) holds provided that the limiting random variable $\boldsymbol{\lambda}_k^{(\delta)}$ has no point mass at 0. We rule out this pathological case, which is in practice the exception rather than the rule.
3) Large deviations analysis (Theorem 4). We will characterize the exponential rate of decay of the error probabilities as δ → 0. This result will be proved assuming independence across agents and will require the existence of the moment generating function of the log-likelihood ratios.

Notably, the above three steps reflect a classic path in asymptotic statistics. However, in order to avoid misunderstandings, it is necessary to clarify one fundamental difference between the small-δ analysis and classic results. In order to illustrate this difference let us refer, for example, to the CLT result. In the traditional setting of asymptotic statistics, one examines the asymptotic behavior of sums of random variables when the number of terms of the sum goes to infinity. In contrast, the CLT proved in this work does not affirm that the sums involved in (21) converge to a Gaussian as $i \to \infty$. As a matter of fact, we have shown in Theorem 1 that the sums in (21) converge to certain random variables, but these variables are not Gaussian, in general. The CLT that we prove deals instead with the behavior, as δ goes to zero, of the steady-state random vector $\boldsymbol{\lambda}_k^{(\delta)}$. The same distinction applies to the other two types of asymptotic results, namely, the weak law and the large deviations analysis. For this reason, as explained in [20], the correct way to deal with the asymptotic regime of small step-sizes in the adaptation context is made of two steps:

1) First, introduce a proper steady-state vector $\boldsymbol{\lambda}_k^{(\delta)}$, which already embodies the effect of combining an infinite number of summands. This steady-state vector will be non-degenerate (i.e., no weak law as $i \to \infty$), will be non-Gaussian (i.e., no CLT as $i \to \infty$), and will be non-vanishing (i.e., no large deviations as $i \to \infty$).

2) Then, characterize the asymptotic behavior of the steady-state random vector $\boldsymbol{\lambda}_k^{(\delta)}$ as δ goes to zero.

It is worth noticing that, in the adaptation literature, the critical role of the first step is usually not emphasized. This is because the adaptation literature mostly focuses on estimation problems, where one usually quantifies the performance by evaluating convergence of the moments [22]. In contrast, when dealing with decision problems (as in our case), the performance is quantified through probabilities of certain events. In order to evaluate probabilities at the steady state, it is critical to obtain first a representation of the steady-state random variables [20].

VI. CONSISTENT SOCIAL LEARNING
We will establish that the ASL strategy achieves consistent social learning under the following standard assumption of global identifiability.
Assumption 3 (Global identifiability). For each wrong hypothesis $\theta \neq \theta_0$, there is at least one agent that has strictly positive KL divergence.
Let us provide some intuition behind Assumption 3. Consider agent $k$ and hypothesis $\theta \neq \theta_0$. Now, if the likelihoods $L_k(\xi|\theta)$ and $L_k(\xi|\theta_0)$ are equal, $\theta$ is not distinguishable from $\theta_0$ at agent $k$, i.e., the classification problem is locally non-identifiable. Clearly, if there exists a hypothesis $\theta$ that is indistinguishable from $\theta_0$ at all agents, there is no hope for the system to classify correctly, because the agents will be necessarily uncertain between $\theta$ and $\theta_0$. Therefore, a minimal requirement for global identifiability is that, for each $\theta \neq \theta_0$, there exists at least one agent for which the model $L_k(\xi|\theta)$ is distinct from $L_k(\xi|\theta_0)$. This is exactly what Assumption 3 requires. It is also useful to highlight that Assumption 3 does not imply in any manner that agent $k$ would be able to classify locally. In fact, saying that agent $k$ is able to distinguish $\theta$ from $\theta_0$ does not mean that it can distinguish $\theta_0$ from the remaining hypotheses $\theta' \notin \{\theta, \theta_0\}$.
We are now ready to state the theorem that establishes achievability of consistent learning. To this end, it is useful to introduce the expectation of the average log-likelihood ratio in (16):

$$m_{\mathrm{ave}}(\theta) = \mathbb{E}\big[x_{\mathrm{ave},i}(\theta)\big] = \sum_{\ell=1}^{N} \pi_\ell\, d_\ell(\theta), \tag{31}$$

which does not depend on $i$ owing to the identical distribution over time implied by the steady-state analysis.
Theorem 2 (Consistency of ASL). Under Assumptions 1 and 2, we have the following convergence in probability:

$$\boldsymbol{\lambda}_k^{(\delta)} \ \xrightarrow[\delta \to 0]{\ \mathrm{prob.}\ }\ m_{\mathrm{ave}}. \tag{32}$$

Since under Assumption 3 all entries of $m_{\mathrm{ave}}$ are strictly positive, Eq. (32) implies that each agent learns correctly the true hypothesis as $\delta \to 0$, namely, the steady-state error probability of all agents $k = 1, 2, \ldots, N$ converges to zero as $\delta$ approaches zero:

$$\lim_{\delta \to 0} p_k^{(\delta)} = 0. \tag{33}$$

The result of Theorem 2 relies on the weak law of small step-sizes proved in Lemma 1, part 3. Technically, this law requires finiteness of only the first moments $d_\ell(\theta)$, which is guaranteed by Assumption 1. Moreover, the result of Theorem 2 requires that $m_{\mathrm{ave}}(\theta) > 0$ for all $\theta \neq \theta_0$. Since the entries of the Perron eigenvector are all strictly positive, we see that $m_{\mathrm{ave}}(\theta)$ is strictly greater than zero for every $\theta \neq \theta_0$ if, for every such $\theta$, there exists at least one agent $\ell$ for which the KL divergence $d_\ell(\theta)$ is strictly positive. In other words, in order to achieve consistent learning, it is sufficient that at least one of the first moments (i.e., the KL divergences) is nonzero, which is guaranteed by Assumption 3.
Therefore, we see that Assumption 3 provides one important motivation for agents' cooperation in social learning. In fact, we assume that the learning problem can be non-identifiable (i.e., can be singular) locally, meaning that an individual agent can have one or more hypotheses $\theta \neq \theta_0$ that are indistinguishable from the true one (zero KL divergence). If this happens, an individual agent is not able to learn properly. On the other hand, under a global identifiability condition, the network is able (as shown in Theorem 2) to identify the true hypothesis by fusing the information coming from distinct agents.
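The role of global identifiability can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the three-agent Perron vector and the means are hypothetical values, and the likelihoods are taken to be unit-scale Laplace densities, for which the KL divergence between two densities whose means differ by $\Delta$ is $|\Delta| + e^{-|\Delta|} - 1$. It shows that the Perron-weighted average divergence stays strictly positive even when most agents cannot distinguish the two hypotheses.

```python
import math


def laplace_kl(mean_true, mean_alt):
    """KL divergence between two unit-scale Laplace densities with the given means."""
    gap = abs(mean_true - mean_alt)
    return gap + math.exp(-gap) - 1.0


def average_divergence(perron, divergences):
    """Perron-weighted average of the local KL divergences, i.e. m_ave(theta)."""
    return sum(p * d for p, d in zip(perron, divergences))


# Hypothetical 3-agent example (values chosen for illustration only).
perron = [0.5, 0.3, 0.2]       # positive entries that sum to one
means_true = [0.0, 0.0, 0.0]   # means under theta_0
means_alt = [0.0, 0.0, 0.4]    # means under a competing hypothesis theta

local_kl = [laplace_kl(a, b) for a, b in zip(means_true, means_alt)]
m_ave = average_divergence(perron, local_kl)
print(local_kl, m_ave)
```

Agents 1 and 2 have zero divergence (local non-identifiability), yet `m_ave` is positive because agent 3 contributes with a positive Perron weight.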
We have shown that the ASL strategy allows correct learning of the true hypothesis for sufficiently small step-sizes. In other words, we have established that the error probability vanishes as $\delta \to 0$. On the other hand, we have not established how it vanishes. There are at least two good reasons to examine the way this probability converges to zero. The first reason is to get manageable formulas for the evaluation of the social learning performance. The second reason is to characterize the fundamental scaling laws of the system. We will see that the ASL strategy is characterized by an exponential law, since the error probability of each individual agent decays exponentially fast as a function of the inverse step-size $1/\delta$.

VII. NORMAL APPROXIMATION FOR SMALL δ
We will now prove a central limit theorem (CLT) for the steady-state random vector $\lambda_k^{(\delta)}$. To this end, we will assume finiteness of second-order moments for the log-likelihoods, and statistical independence across the agents.
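In order to state the CLT, it is convenient to define some useful quantities. First, we introduce the covariance between the log-likelihood ratios at $\theta$ and $\theta'$, that is:
$$ \rho_\ell(\theta,\theta') \triangleq \mathbb{E}\Big[\big(x_{\ell,i}(\theta) - d_\ell(\theta)\big)\big(x_{\ell,i}(\theta') - d_\ell(\theta')\big)\Big]. \tag{34} $$
Then we introduce the covariance between the average variables $x_{\mathrm{ave},i}(\theta)$ and $x_{\mathrm{ave},i}(\theta')$ which, exploiting independence across agents, can be evaluated as:
$$ c_{\mathrm{ave}}(\theta,\theta') = \sum_{\ell=1}^{N} \pi_\ell^{2}\, \rho_\ell(\theta,\theta'). \tag{35} $$
Next, it is necessary to examine the behavior of the first two moments of the log-belief ratios. In view of Lemma 1, part 2, the expectation of the steady-state random vector $\lambda_k^{(\delta)}$ admits a series representation whose leading behavior is:
$$ m_k^{(\delta)}(\theta) \triangleq \mathbb{E}\big[\lambda_k^{(\delta)}(\theta)\big] = m_{\mathrm{ave}}(\theta) + O(\delta), \tag{36} $$
where $O(\delta)$ is a quantity such that the ratio $O(\delta)/\delta$ remains bounded as $\delta \to 0$. Likewise, using part 4 of Lemma 1, we conclude that the covariance of the steady-state random vector $\lambda_k^{(\delta)}$ behaves as:
$$ c_k^{(\delta)}(\theta,\theta') = \frac{\delta}{2}\, c_{\mathrm{ave}}(\theta,\theta') + O(\delta^{2}). \tag{37} $$
Equations (36) and (37) can be rewritten in vector and matrix form, respectively, as:
$$ m_k^{(\delta)} = m_{\mathrm{ave}} + O(\delta), \qquad C_k^{(\delta)} = \frac{\delta}{2}\, C_{\mathrm{ave}} + O(\delta^{2}), \tag{38} $$
where $C_k^{(\delta)} = [c_k^{(\delta)}(\theta,\theta')]$ and $C_{\mathrm{ave}} = [c_{\mathrm{ave}}(\theta,\theta')]$ are the matrices that collect the individual covariances. We see from (38) that, as $\delta \to 0$, there is a leading term that does not depend on the agent index $k$ (whose impact is implicitly included in the higher order corrections, i.e., the $O(\cdot)$ terms).

The first relation in (38) reveals that the expectation vector of the steady-state log-belief ratios, $m_k^{(\delta)}$, approximates, for small $\delta$, the expectation vector of the average log-likelihood ratios, $m_{\mathrm{ave}}$. In comparison, the second relation in (38) reveals that the covariance matrix of the steady-state log-belief ratios, $C_k^{(\delta)}$, goes to zero as $C_{\mathrm{ave}}\,\delta/2$, where $C_{\mathrm{ave}}$ is the covariance matrix of the average log-likelihood ratios, namely,
$$ C_{\mathrm{ave}} = \mathbb{E}\Big[\big(x_{\mathrm{ave},i} - m_{\mathrm{ave}}\big)\big(x_{\mathrm{ave},i} - m_{\mathrm{ave}}\big)^{\top}\Big]. \tag{39} $$
We are now ready to state our central limit theorem.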
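Theorem 3 (Asymptotic normality). Assume that the data $\{\xi_{k,i}\}$ are independent across the agents (recall that they are always assumed i.i.d. over time), and that the log-likelihood ratios have finite variance. Then, under Assumptions 1, 2 and 3, the following convergence holds:
$$ \frac{\lambda_k^{(\delta)} - m_{\mathrm{ave}}}{\sqrt{\delta}} \;\underset{\delta \to 0}{\rightsquigarrow}\; \mathcal{G}\Big(0, \tfrac{1}{2} C_{\mathrm{ave}}\Big), \tag{40} $$
where the symbol $\rightsquigarrow$ denotes convergence in distribution, and $\mathcal{G}(0, C)$ is a zero-mean multivariate Gaussian with covariance matrix equal to $C$.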
Proof: See Appendix D.
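Theorem 3 entails the following approximation in distribution, holding for $\delta \approx 0$:
$$ \lambda_k^{(\delta)} \approx \mathcal{G}\Big(m_{\mathrm{ave}}, \tfrac{\delta}{2} C_{\mathrm{ave}}\Big). \tag{41} $$
We see that such approximation does not depend on the agent index $k$. As shown in [20], in order to capture differences in performance across the agents, it is possible to replace the limiting expectation vector $m_{\mathrm{ave}}$ and the limiting covariance matrix $C_{\mathrm{ave}}\,\delta/2$ with their exact counterparts $m_k^{(\delta)}$ and $C_k^{(\delta)}$, i.e., with the series appearing in (36) and (37), yielding the refined approximation:
$$ \lambda_k^{(\delta)} \approx \mathcal{G}\Big(m_k^{(\delta)}, C_k^{(\delta)}\Big). \tag{42} $$
The approximations in (41) and (42) will be tested in the section devoted to numerical experiments.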
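To make the approximation in (41) concrete, the following sketch evaluates the Gaussian estimate of the probability that a single log-belief ratio is nonpositive, i.e., that one wrong hypothesis $\theta$ is not rejected. It is a minimal illustration, not the procedure used for the figures: the values of `M_AVE` and `C_AVE` are hypothetical, and only the scalar marginal for one $\theta$ is considered (the full error probability involves all $\theta \neq \theta_0$ jointly).

```python
import math


def gaussian_tail(x):
    """Q-function: probability that a standard Gaussian exceeds x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def pairwise_error_approx(m_ave, c_ave, delta):
    """Approximate P[lambda(theta) <= 0] when lambda(theta) ~ N(m_ave, c_ave*delta/2)."""
    std = math.sqrt(c_ave * delta / 2.0)
    return gaussian_tail(m_ave / std)


M_AVE = 0.05   # hypothetical entry of m_ave
C_AVE = 0.10   # hypothetical diagonal entry of C_ave

probs = [pairwise_error_approx(M_AVE, C_AVE, delta) for delta in (0.5, 0.1, 0.02)]
print(probs)
```

As the step-size decreases, the variance shrinks proportionally to $\delta$ while the mean stays fixed, so the estimated probability decreases.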

VIII. LARGE DEVIATIONS FOR SMALL δ
In this section we focus on another relevant type of asymptotic analysis, namely, a large deviations analysis [27], [28].
The basic aim of a large deviations (LD) analysis is to estimate the exponential decay rate of the probabilities associated with certain rare events. In our setting, the rare event is that an agent opts for the wrong hypothesis. We will show that, at the steady state, this type of event becomes in fact rare as $\delta$ approaches zero.
More formally, denoting by $p_k^{(\delta)}$ the steady-state error probability of agent $k$, the LD analysis furnishes the following type of representation [27], [28]:
$$ p_k^{(\delta)} \doteq e^{-\Phi/\delta}, \tag{43} $$
where the notation $\doteq$ means equality to the leading exponential order (as $\delta \to 0$) or, more explicitly:
$$ \lim_{\delta \to 0} \delta \log p_k^{(\delta)} = -\Phi, \tag{44} $$
for a certain value $\Phi$ that is called the error exponent. We remark that the equality at the leading exponential order in (43) does not imply in any way that we can approximate the probability of error as $e^{-\Phi/\delta}$, namely, it does not follow that $p_k^{(\delta)} \approx e^{-\Phi/\delta}$. This is because any LD analysis neglects sub-exponential corrections. For example, it is immediate to check that the probabilities $e^{-\Phi/\delta}$ and $100\, e^{-\Phi/\delta}$ have the same LD exponent (equal to $\Phi$), but the second probability is two orders of magnitude larger. In order to compensate for sub-exponential corrections, a refined LD framework exists, usually referred to as "exact asymptotics", which has been applied to binary adaptive detection in [20], [29].
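The remark on sub-exponential corrections can be checked directly. The sketch below (with a hypothetical exponent `PHI`) works in the log domain to avoid numerical underflow and shows that $\delta \log p$ approaches $-\Phi$ for both $e^{-\Phi/\delta}$ and $100\, e^{-\Phi/\delta}$, even though the two probabilities always differ by a factor of 100.

```python
import math

PHI = 0.05  # hypothetical error exponent


def scaled_log_prob(log_prob, delta):
    """Return delta * log(p), the quantity whose limit defines the exponent."""
    return delta * log_prob


rows = []
for delta in (0.1, 0.01, 0.001):
    log_p1 = -PHI / delta                    # log of exp(-PHI/delta)
    log_p2 = math.log(100.0) - PHI / delta   # log of 100*exp(-PHI/delta)
    rows.append((delta,
                 scaled_log_prob(log_p1, delta),
                 scaled_log_prob(log_p2, delta)))

for row in rows:
    print(row)
```

The gap between the two scaled logarithms equals $\delta \log 100$, which vanishes with $\delta$; this is exactly the information that an LD analysis discards.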
In summary, the aim of a large deviations analysis is to evaluate the asymptotic decay rate of the error probabilities, which is a meaningful and significant index of the inferential performance. Since the error exponent is a compact statistical descriptor of the learning performance, it can be used to compare different systems (e.g., ASL strategies with different network graphs) and/or to optimize some system parameters (e.g., the network graph) to achieve the fastest learning rate.
Before stating the main result about the LD analysis, it is necessary to introduce the Logarithmic Moment Generating Function (LMGF), a.k.a. cumulant generating function, of the log-likelihood ratios:
$$ \Lambda_k(t;\theta) \triangleq \log \mathbb{E}\big[e^{\,t\, x_{k,i}(\theta)}\big]. \tag{46} $$
We recall that, in the steady-state regime, the expectation is computed under the true model $L_k(\xi|\theta_0)$, which does not change over time, and this explains why $\Lambda_k(t;\theta)$ does not depend on $i$. It is also useful to introduce the LMGF of the average variable $x_{\mathrm{ave},i}(\theta)$ which, under the assumption that the data are independent across the agents, is:
$$ \Lambda_{\mathrm{ave}}(t;\theta) = \sum_{\ell=1}^{N} \Lambda_\ell(\pi_\ell\, t;\theta). \tag{47} $$

Theorem 4 (Error exponents). Assume that the data $\{\xi_{k,i}\}$ are independent across the agents (recall that they are always assumed i.i.d. over time), and that the logarithmic moment generating function of $x_{k,i}(\theta)$ exists everywhere, namely, for all $k = 1, 2, \ldots, N$ and $\theta \neq \theta_0$:
$$ \Lambda_k(t;\theta) < +\infty, \quad \forall t \in \mathbb{R}. \tag{48} $$
Then, under Assumptions 1, 2 and 3 we have the following two results holding for every agent $k = 1, 2, \ldots, N$. First, for every $\theta \neq \theta_0$, we have that:
$$ \mathbb{P}\big[\lambda_k^{(\delta)}(\theta) \leq 0\big] \doteq e^{-\Phi(\theta)/\delta}, $$
where the exponent $\Phi(\theta)$ is computed from $\Lambda_{\mathrm{ave}}(t;\theta)$. Second, the error probability is dominated by the worst-case (i.e., smallest) exponent:
$$ p_k^{(\delta)} \doteq e^{-\Phi/\delta}, \qquad \Phi = \min_{\theta \neq \theta_0} \Phi(\theta). $$
Proof: See Appendix E.

The main message conveyed by Theorem 4 is that the steady-state error probability of each individual agent converges to zero as $\delta \to 0$, exponentially fast as a function of $1/\delta$. This exponential law provides a universal law for adaptive social learning, which reflects the universal scaling law of distributed adaptive detection (see [20]). The exponent $\Phi$ governing such an exponential decay is computed from the logarithmic moment generating function of the average log-likelihood, where the weights of this average are the limiting weights, i.e., the entries of the Perron eigenvector.
The need for cooperation has been already motivated in relation to social learning problems that are locally non-identifiable. Theorem 4 implies another potential benefit of cooperation, namely, that cooperation improves the learning accuracy. We will illustrate this aspect through one example. Assume the most favorable case where all agents could learn the true hypothesis individually. Consider then a doubly-stochastic combination matrix, yielding a Perron eigenvector with uniform entries $\pi_\ell = 1/N$ for all $\ell = 1, 2, \ldots, N$. Exploiting (51), we can easily see that in this particular case the error exponent of the network is given by:
$$ \Phi = N\, \Phi_{\mathrm{ind}}, \tag{52} $$
where $\Phi_{\mathrm{ind}}$ is the error exponent of an individual agent. According to (52), we see that the network error exponent is $N$ times larger than the individual error exponent, which in turn implies an $N$-fold exponential improvement in the learning accuracy.
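The $N$-fold gain can be sanity-checked in a simplified setting. The sketch below is not the exponent of Theorem 4: it only uses the Gaussian approximation $\mathcal{N}(m, c\,\delta/2)$ for a single log-belief ratio, for which $\delta \log \mathbb{P}[\lambda \leq 0]$ tends to $-m^2/c$. With $N$ identical, independent agents and uniform Perron weights, the mean of the average variable is unchanged while its variance is divided by $N$. The values `D_IND` and `RHO_IND` (individual mean and variance of the log-likelihood ratio) are hypothetical.

```python
import math


def log_gaussian_tail(x):
    """log of the Q-function, evaluated through erfc."""
    return math.log(0.5 * math.erfc(x / math.sqrt(2.0)))


def scaled_log_error(mean, cov, delta):
    """delta * log P[N(mean, cov*delta/2) <= 0]."""
    std = math.sqrt(cov * delta / 2.0)
    return delta * log_gaussian_tail(mean / std)


N = 10
D_IND, RHO_IND = 1.0, 1.0    # hypothetical individual mean and variance
m_ave = D_IND                # uniform weights: same mean as one agent
c_ave = RHO_IND / N          # sum of (1/N)^2 * RHO_IND over N agents

delta = 0.02
individual = scaled_log_error(D_IND, RHO_IND, delta)
network = scaled_log_error(m_ave, c_ave, delta)
ratio = network / individual
exponent_ratio = (m_ave ** 2 / c_ave) / (D_IND ** 2 / RHO_IND)
print(individual, network, ratio, exponent_ratio)
```

At a finite step-size the ratio of the scaled logarithms is slightly below $N$ because of sub-exponential terms, while the ratio of the limiting Gaussian exponents equals $N$.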

IX. ILLUSTRATIVE EXAMPLES
We consider the strongly-connected network of $N = 10$ agents displayed in Fig. 5. We assume that all agents have a self-loop (not displayed in the figure). Besides, the combination matrix is designed using an averaging rule, resulting in a left-stochastic matrix [22]. The network is faced with the following statistical learning problem. We consider a family of Laplace likelihood functions with scale parameter 1, seen in Fig. 6. Formally, we are given three Laplace densities: for $n \in \{1, 2, 3\}$. The likelihoods of the data collected by the agents are chosen from among these Laplace densities.
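The construction of the combination matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the four-agent path graph `ADJ` is hypothetical (it is not the network of Fig. 5), and the averaging rule is taken to assign weight $a_{\ell k} = 1/n_k$ to each of the $n_k$ neighbors of agent $k$ (self-loop included), so that every column sums to one. The Perron eigenvector is then obtained by power iteration.

```python
def averaging_rule(adjacency):
    """Left-stochastic matrix: a[l][k] = 1/n_k if l is a neighbor of k, else 0."""
    n = len(adjacency)
    sizes = [sum(adjacency[l][k] for l in range(n)) for k in range(n)]
    return [[adjacency[l][k] / sizes[k] for k in range(n)] for l in range(n)]


def perron_vector(A, iterations=500):
    """Power iteration for the vector pi with A pi = pi and entries summing to one."""
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(A[l][k] * pi[k] for k in range(n)) for l in range(n)]
        total = sum(pi)
        pi = [p / total for p in pi]
    return pi


# Hypothetical undirected path graph on 4 agents, with self-loops on the diagonal.
ADJ = [[1, 1, 0, 0],
       [1, 1, 1, 0],
       [0, 1, 1, 1],
       [0, 0, 1, 1]]

A = averaging_rule(ADJ)
pi = perron_vector(A)
print(pi)
```

For an undirected graph with this rule, the Perron entries are proportional to the neighborhood sizes, here $(2, 3, 3, 2)/10$.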
To make things more interesting, we assume that the inference problem is not locally identifiable, since we consider the following setup for each agent's family of likelihood functions:
• For agents $k = 7$-$10$,
In summary, the data $\{\xi_{k,i}\}$ are i.i.d. (across time and agents) unit-scale Laplace random variables, with expectations that depend both on the agent $k$ and the hypothesis $\theta$. Accordingly, we will use the notation $e_k(\theta)$ to denote the expectation of $\xi_{k,i}$, computed under likelihood $L_k(\xi|\theta)$. For example, using Eqs. (53)-(56), we see that: (57)

We are now ready to describe the numerical experiments in detail. In all of them, we explore the steady-state regime in the following way. We let the ASL algorithm run for a sufficiently long time $i$, obtaining empirical values of the log-belief vector $\lambda_{k,i}$. Then, we test how the empirical data match the theoretical predictions in Theorems 1-4.

A. Consistency
We consider that all agents are running the ASL algorithm for a fixed $\theta_0 = 1$ over 8000 time samples (after which we consider that they achieved the steady state). From Theorem 2, we saw that as $\delta$ approaches zero, all agents $k$ are able to consistently learn (see (32)). In order to show this effect, for each value of $\delta$ (50 sample points in the interval $\delta \in [0.001, 1]$ are taken), we consider a different realization of the observations. In Fig. 7, for agent 1 and $\theta = 2, 3$, we show how the log-belief ratios $\lambda_1^{(\delta)}(\theta)$ behave for decreasing values of $\delta$. We see the weak law of small step-sizes arising, since the limiting log-belief ratios tend to concentrate around $m_{\mathrm{ave}}$.
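A consistency experiment of this kind can be sketched in a few lines. The code below rests on assumptions that are not stated in this section: it takes the ASL recursion in log-belief-ratio form as $\lambda_{k,i} = \sum_\ell a_{\ell k}\,[(1-\delta)\lambda_{\ell,i-1} + \delta\, x_{\ell,i}]$, with $x_{\ell,i}$ the log-likelihood ratio of $\theta_0$ against $\theta$, and it uses a hypothetical three-agent network with unit-scale Laplace data rather than the setup of Fig. 5. The time-averaged log-belief ratios are compared with $m_{\mathrm{ave}}$.

```python
import math
import random


def laplace_sample(mean, rng):
    """Unit-scale Laplace sample: the difference of two Exp(1) variables is Laplace(0, 1)."""
    return mean + rng.expovariate(1.0) - rng.expovariate(1.0)


def log_likelihood_ratio(xi, mean_true, mean_alt):
    """log L(xi|theta_0) - log L(xi|theta) for unit-scale Laplace likelihoods."""
    return abs(xi - mean_alt) - abs(xi - mean_true)


def run_asl(A, means_true, means_alt, delta, iterations, burn_in, rng):
    """Time-average of the log-belief ratio at each agent after the burn-in."""
    n = len(A)
    lam = [0.0] * n
    totals = [0.0] * n
    for i in range(iterations):
        # Adaptive update (assumed form): discount the past, weight new data by delta.
        psi = []
        for l in range(n):
            xi = laplace_sample(means_true[l], rng)
            x = log_likelihood_ratio(xi, means_true[l], means_alt[l])
            psi.append((1.0 - delta) * lam[l] + delta * x)
        # Combination step: agent k averages its neighbors with weights a[l][k].
        lam = [sum(A[l][k] * psi[l] for l in range(n)) for k in range(n)]
        if i >= burn_in:
            for k in range(n):
                totals[k] += lam[k]
    return [t / (iterations - burn_in) for t in totals]


# Hypothetical doubly-stochastic matrix, so the Perron entries are all 1/3.
A = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
MEANS_TRUE = [0.0, 0.0, 0.0]
MEANS_ALT = [0.0, 1.0, 1.0]   # agent 0 cannot distinguish the two hypotheses

kl = [abs(a - b) + math.exp(-abs(a - b)) - 1.0 for a, b in zip(MEANS_TRUE, MEANS_ALT)]
m_ave = sum(kl) / 3.0
averages = run_asl(A, MEANS_TRUE, MEANS_ALT, delta=0.01,
                   iterations=6000, burn_in=1000, rng=random.Random(0))
print(m_ave, averages)
```

Even the agent with zero local KL divergence ends up with a log-belief ratio close to the positive value $m_{\mathrm{ave}}$, which is the cooperative effect described by Theorem 2.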

B. Asymptotic Normality
We consider 10000 time samples, where again all agents are collecting data under the true hypothesis $\theta_0 = 1$. From Theorem 3, we saw that in steady state we can approximate the distribution of the log-belief ratios by a multivariate Gaussian pdf, see Eqs. (41) and (42). In Fig. 8, we assume that the ASL algorithm has reached the steady state at $i = 10000$, and display the log-belief ratios corresponding to instant $i = 10000$. The experiment is repeated over 100 Monte Carlo runs, such that we obtain 100 realizations of the steady-state variable $\lambda_k^{(\delta)}$. Of the two ellipses drawn around the samples, the larger one encompasses 95% of them. In red dotted lines, we see the corresponding ellipses for the limiting theoretical Gaussian approximation seen in (41), with the red cross indicating the limiting theoretical expectation $m_{\mathrm{ave}}$. Note how, as $\delta$ decreases, the ellipses tend to be smaller, which is in accordance with the scaling of the covariance matrices by $\delta$ in (41) and (42), and the distributions tend to overlap, which is in accordance with the behavior predicted by Theorem 3.

C. Error Exponents
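We start by evaluating the theoretical exponents for the Laplace example at hand. To this aim, we need to compute first the logarithmic moment generating function of the log-likelihood ratios $x_{k,i}(\theta)$ in (14). Since the data follow a Laplace distribution with scale parameter 1, the log-likelihood ratio is:
$$ x_{k,i}(\theta) = \log \frac{L_k(\xi_{k,i}|\theta_0)}{L_k(\xi_{k,i}|\theta)} = |\xi_{k,i} - e_k(\theta)| - |\xi_{k,i} - e_k(\theta_0)|. \tag{58} $$
Before we proceed to characterize the random variable $x_{k,i}(\theta)$, let us define the auxiliary quantity:
$$ \Delta_{k,\theta} \triangleq e_k(\theta) - e_k(\theta_0). \tag{59} $$
We also introduce the centered variable $\tilde{\xi}_{k,i} = \xi_{k,i} - e_k(\theta_0)$, and therefore we can write:
$$ x_{k,i}(\theta) = |\tilde{\xi}_{k,i} - \Delta_{k,\theta}| - |\tilde{\xi}_{k,i}|. \tag{60} $$
For the case in which $\Delta_{k,\theta} > 0$, the random variable $x_{k,i}(\theta)$ depends on the random variable $\tilde{\xi}_{k,i}$ in the following manner:
$$ x_{k,i}(\theta) = \begin{cases} \Delta_{k,\theta}, & \tilde{\xi}_{k,i} \leq 0, \\ \Delta_{k,\theta} - 2\tilde{\xi}_{k,i}, & 0 < \tilde{\xi}_{k,i} < \Delta_{k,\theta}, \\ -\Delta_{k,\theta}, & \tilde{\xi}_{k,i} \geq \Delta_{k,\theta}. \end{cases} \tag{61} $$
We can then express the cumulative distribution function of $x_{k,i}(\theta)$ as
$$ \mathbb{P}[x_{k,i}(\theta) \leq x] = \begin{cases} 0, & x < -\Delta_{k,\theta}, \\ \mathbb{P}\big[\tilde{\xi}_{k,i} \geq \tfrac{\Delta_{k,\theta} - x}{2}\big] = \tfrac{1}{2}\, e^{-(\Delta_{k,\theta} - x)/2}, & -\Delta_{k,\theta} \leq x < \Delta_{k,\theta}, \\ 1, & x \geq \Delta_{k,\theta}, \end{cases} \tag{62} $$
where $\mathbb{P}[A]$ is the probability of event $A$, computed from the distribution of $\tilde{\xi}_{k,i}$. Note that the probability density function of $\tilde{\xi}_{k,i}$ is given by $L_k(\xi + e_k(\theta_0)|\theta_0)$, which is a Laplace distribution with zero mean and scale parameter 1.

From the cumulative distribution function in (62), we can derive the density function of $x_{k,i}(\theta)$ as:
$$ f_x(x) = \tfrac{1}{2}\,\delta(x - \Delta_{k,\theta}) + \tfrac{1}{2}\, e^{-\Delta_{k,\theta}}\,\delta(x + \Delta_{k,\theta}) + \tfrac{1}{4}\, e^{-(\Delta_{k,\theta} - x)/2}\, \mathrm{rect}\Big(\frac{x}{2\Delta_{k,\theta}}\Big), \tag{63} $$
where $\mathrm{rect}(\cdot)$ is the rectangle function, i.e., it is equal to 1 in the interval $(-\tfrac{1}{2}, \tfrac{1}{2})$ and 0 elsewhere. Also, we should distinguish the notation $\delta(x)$, which represents the Dirac delta-function, from the notation $\delta$, which refers to the step-size parameter.

The LMGF of variable $x_{k,i}(\theta)$, whose expression was seen in (46), can be explicitly computed using (63):
$$ \Lambda_k(t;\theta) = \log\left\{ \tfrac{1}{2}\, e^{t\Delta_{k,\theta}} + \tfrac{1}{2}\, e^{-(t+1)\Delta_{k,\theta}} + \frac{e^{t\Delta_{k,\theta}} - e^{-(t+1)\Delta_{k,\theta}}}{2(2t+1)} \right\}. \tag{64} $$
If similar steps are followed for the case $\Delta_{k,\theta} < 0$, we would find the same expression with $\Delta_{k,\theta}$ replaced by $|\Delta_{k,\theta}|$:
$$ \Lambda_k(t;\theta) = \log\left\{ \tfrac{1}{2}\, e^{-t\Delta_{k,\theta}} + \tfrac{1}{2}\, e^{(t+1)\Delta_{k,\theta}} + \frac{e^{-t\Delta_{k,\theta}} - e^{(t+1)\Delta_{k,\theta}}}{2(2t+1)} \right\}. \tag{65} $$
Assuming that the true state is $\theta_0 = 1$, we can then evaluate numerically $\Phi(\theta)$ by employing the expressions in Theorem 4, for $\theta = 2$ and $\theta = 3$, from which we obtain $\Phi(2) = 0.03348$ and $\Phi(3) = 0.05051$. Finally, the error probability dominant exponent is given by:
$$ \Phi = \min_{\theta \neq \theta_0} \Phi(\theta) = \Phi(2) = 0.03348. \tag{66} $$

Now we illustrate the details of the numerical experiments. We consider that the true state of nature is set as $\theta_0 = 1$, and we let all agents execute the ASL algorithm for 3000 iterations and for 20 values of $\delta$ in the interval $[1/150, 1/10]$. We run 20000 Monte Carlo experiments and we compute the steady-state empirical probability of error for each agent and each value of $\delta$. The data samples for agents 1, 3, 6, 8, 10 can be seen in Figure 9, where we compare the trend of the data samples with the theoretical slope predicted by Theorem 4 and with the error probability in (12) computed using the Gaussian approximation in (41).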
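The LMGF of a Laplace log-likelihood ratio can be cross-checked numerically. The sketch below makes its assumptions explicit: the data are unit-scale Laplace, the log-likelihood ratio is written as $x = |\tilde{\xi} - \Delta| - |\tilde{\xi}|$ with $\tilde{\xi}$ a zero-mean unit-scale Laplace variable and $\Delta > 0$ the gap between the two means, and the value `GAP` is hypothetical. The closed form obtained by splitting the expectation over the three regions $\tilde{\xi} \leq 0$, $0 < \tilde{\xi} < \Delta$ and $\tilde{\xi} \geq \Delta$ is compared with a trapezoidal integration.

```python
import math


def lmgf_closed_form(t, gap):
    """log E[exp(t*x)] for x = |xi - gap| - |xi|, xi ~ Laplace(0, 1), gap > 0, t != -1/2."""
    mgf = (0.5 * math.exp(t * gap)
           + 0.5 * math.exp(-(t + 1.0) * gap)
           + (math.exp(t * gap) - math.exp(-(t + 1.0) * gap)) / (2.0 * (2.0 * t + 1.0)))
    return math.log(mgf)


def lmgf_numeric(t, gap, half_width=40.0, step=1e-3):
    """Trapezoidal approximation of the same expectation over [-half_width, half_width]."""
    n = int(round(2.0 * half_width / step))
    total = 0.0
    for j in range(n + 1):
        xi = -half_width + j * step
        x = abs(xi - gap) - abs(xi)
        weight = 0.5 if j in (0, n) else 1.0
        total += weight * 0.5 * math.exp(-abs(xi)) * math.exp(t * x)
    return math.log(total * step)


GAP = 0.4  # hypothetical distance between the means under theta and theta_0
checks = [(t, lmgf_closed_form(t, GAP), lmgf_numeric(t, GAP)) for t in (-1.5, -1.0, 0.7)]
for row in checks:
    print(row)

# The slope of the LMGF at t = 0 is the mean of x, i.e. the KL divergence.
h = 1e-5
slope_at_zero = (lmgf_closed_form(h, GAP) - lmgf_closed_form(-h, GAP)) / (2.0 * h)
kl = GAP + math.exp(-GAP) - 1.0
print(slope_at_zero, kl)
```

Two properties are worth noting: the LMGF vanishes at $t = -1$, since $\mathbb{E}[e^{-x}]$ is the integral of the competing likelihood, and its slope at the origin equals the KL divergence $\Delta + e^{-\Delta} - 1$.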

X. CONCLUDING REMARKS
Social learning is a relevant inferential paradigm lying at the core of many multi-agent systems. Under this paradigm, the agents are able to learn progressively an underlying state of nature by continually updating their beliefs based on the incoming streaming data and the beliefs exchanged with their neighbors.
Several social learning implementations are currently available. However, these implementations are not designed to deal with nonstationary data. For example, even if the agents have learned the true state well, when this state changes at a certain instant, in the traditional social learning implementations the agents tend to be stubborn and keep on believing in the old state. In this work we proposed an Adaptive Social Learning (ASL) strategy that overcomes this issue, and we examined its performance and convergence guarantees in detail. The key insight is the introduction of an adaptive Bayesian update depending on a step-size parameter $\delta$ that makes it possible to tune the degree of adaptation.
A careful analysis of the system performance has been provided. Specifically, with focus on the small step-size regime, we have ascertained that the ASL strategy is able to learn consistently, and we have provided a reliable characterization of the learning performance at each individual agent.

One relevant extension is to consider the case where the underlying distribution does not necessarily belong to the assumed family of distributions. This setting can be relevant in distributed machine learning problems, where the agents construct the class of admissible likelihoods during a training stage and, due to finiteness of the training set, their knowledge of the admissible models cannot be perfect. New fundamental questions arise, including: the links between the accuracy of the training phase and the achievability of consistent social learning; the interplay between training, adaptation, and prediction performance; and the interplay between nonstationarity in the training set and in the streaming data.

APPENDIX A
In the following, the symbols $S^{o}$ and $\bar{S}$ denote the interior and the closure of set $S$, respectively.

Lemma 1 (Asymptotic properties of random series useful for adaptation). For $m = 0, 1, \ldots$, let $\{z_m\}$ be a sequence of i.i.d. integrable random variables with mean $m_z$. Let also $0 < \delta < 1$, and consider the partial sums $s_i(\delta)$ defined in (68), whose weights satisfy $0 < \alpha_m < 1$, with $\alpha_m$ converging to some value $\alpha$ and obeying the following upper bound for all $m$:
$$ |\alpha_m - \alpha| \leq \kappa\, \beta^{m}, \tag{69} $$
for some constant $\kappa > 0$ and for some $0 < \beta < 1$. Then, we have the following asymptotic properties.

1) Steady-State Stability. The partial sums in (68) are almost-surely absolutely convergent, namely, we can define the (almost-surely) convergent series $s(\delta)$ as the limit of $s_i(\delta)$ for $i \to \infty$.

5) Asymptotic normality. If $z_m$ has finite variance $\sigma_z^2$, then, after centering and scaling by $1/\sqrt{\delta}$, the random variable $s(\delta)$ converges in distribution to a zero-mean Gaussian random variable and, hence, $s(\delta)$ is asymptotically normal as $\delta \to 0$.

6) Large deviations. Assume that $z_m$ is nondeterministic and has LMGF $\Lambda_z(t) \triangleq \log \mathbb{E}[e^{t z_m}]$, finite for all $t \in \mathbb{R}$. Denoting by $\Lambda_s(t)$ the LMGF of $s(\delta)$, we have that $\delta\, \Lambda_s(t/\delta)$ converges, as $\delta \to 0$, to a limiting function $\phi(t)$ that depends on $\Lambda_z$ and on the value $\alpha$ defined in (69). Let
$$ \phi^{\star}(\gamma) \triangleq \sup_{t \in \mathbb{R}}\, [\gamma t - \phi(t)] $$
be the Fenchel-Legendre transform of $\phi(t)$, which is an extended real number. Then the following Large Deviations Principle (LDP) holds for any measurable set $S$ (the infimum over an empty set is taken as $+\infty$):
$$ -\inf_{\gamma \in S^{o}} \phi^{\star}(\gamma) \leq \liminf_{\delta \to 0} \delta \log \mathbb{P}[s(\delta) \in S] \leq \limsup_{\delta \to 0} \delta \log \mathbb{P}[s(\delta) \in S] \leq -\inf_{\gamma \in \bar{S}} \phi^{\star}(\gamma). $$
The function $\phi^{\star}(\gamma)$ is lower semi-continuous and convex. Moreover, letting
$$ D \triangleq \{\gamma \in \mathbb{R} : \phi^{\star}(\gamma) < \infty\}, $$
then $D^{o} = (z^{-}, z^{+})$, where $z^{-}$ and $z^{+}$ are the extremes of the support of $z_m$, and the function $\phi^{\star}(\gamma)$ is smooth and strictly convex on $D^{o}$. Finally, $\phi^{\star}(\gamma) = 0$ if, and only if, $\gamma = m_z$. A typical shape of the rate function is illustrated in Fig. 10. Exploiting the aforementioned regularity properties of $\phi^{\star}(\gamma)$, from (79)-(80) we have in particular that, for any $\gamma \in D^{o}$:

Fig. 10. Typical shape of the rate function.

Proof: We prove sequentially the six parts of the lemma.
Part 1. In view of (67), the series of (absolute) expectations associated with the partial sums in (68) is convergent. In view of [25, Lemma 3.6'], convergence of the series of absolute first moments implies that the random series $s_{\mathrm{abs}}(\delta)$ is almost-surely finite, which in turn implies that so is $s(\delta)$, and part 1 is proved.

Part 2. Since the series of (absolute) expectations is convergent, so is the series of expectations. On the other hand, by the triangle inequality, the partial sums $s_i(\delta)$ are upper bounded in absolute value by $s_{\mathrm{abs}}(\delta)$. Now we observe that $s_{\mathrm{abs}}(\delta)$ is a proper random variable in view of part 1. Furthermore, it is an integrable random variable from Beppo Levi's monotone convergence theorem [26], thanks to the convergence of absolute expectations in (86).
We conclude that the random sequence $s_i(\delta)$ is upper bounded by an integrable random variable. Therefore, the dominated convergence theorem [26] implies that the expectation of the a.s. limit $s(\delta)$ is equal to the convergent series of expectations, and the first equality in (72) follows. Moreover, the series of expectations can be split as in (88). In view of (69), the absolute value of the first summation on the RHS of (88) is dominated by the bound in (89). We conclude from (86), (88) and (89) that the second equality in (72) holds.
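The behavior of the first moment for small step-sizes can be illustrated with a deterministic computation. The sketch below assumes that the partial sums have the exponentially weighted form $s_i(\delta) = \delta \sum_{m=0}^{i} (1-\delta)^m \alpha_m z_m$ (an assumption, since the defining equation is not reproduced here), and it picks hypothetical weights $\alpha_m = \alpha + \kappa \beta^m$ that meet the bound in (69) with equality. Dividing the expectation by the mean of $z_m$ leaves the deterministic factor $\delta \sum_m (1-\delta)^m \alpha_m$, whose distance from $\alpha$ should scale as $O(\delta)$.

```python
def mean_weight(delta, alpha, kappa, beta, terms):
    """delta * sum_{m < terms} (1 - delta)^m * alpha_m, with alpha_m = alpha + kappa*beta^m."""
    total = 0.0
    discount = 1.0   # (1 - delta)^m
    decay = 1.0      # beta^m
    for _ in range(terms):
        total += discount * (alpha + kappa * decay)
        discount *= 1.0 - delta
        decay *= beta
    return delta * total


ALPHA, KAPPA, BETA = 0.3, 0.5, 0.6   # hypothetical values with 0 < alpha_m < 1
deltas = (0.1, 0.01, 0.001)
gaps = []
for delta in deltas:
    terms = int(20 / delta)          # enough terms for (1 - delta)^terms to be negligible
    gaps.append(mean_weight(delta, ALPHA, KAPPA, BETA, terms) - ALPHA)

for delta, gap in zip(deltas, gaps):
    print(delta, gap, gap / delta)
```

The ratio `gap / delta` stays bounded as the step-size decreases, which is the meaning of the $O(\delta)$ correction.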
Part 3. Consider the centered variables $\tilde{z}_m$ and the corresponding centered partial sums which, in view of parts 1) and 2), converge in distribution to $\tilde{s}(\delta)$ as $i \to \infty$. By Lévy's continuity theorem, the corresponding characteristic functions must converge [27]. Since the $z_m$'s are i.i.d., the characteristic function $\varphi_{\tilde{s}}(t)$ of $\tilde{s}(\delta)$ factorizes as in (93), where $j = \sqrt{-1}$. We want to show that $\tilde{s}(\delta)$ converges in probability to 0 as $\delta \to 0$. In view of Lévy's continuity theorem, this is tantamount to showing that $\varphi_{\tilde{s}}(t)$ converges to 1 as $\delta \to 0$, which can be addressed by combining (93) with the inequality in footnote 5. Consider, without loss of generality, a positive $t$. Since the random variables $\tilde{z}_m$ have finite expectation, the first derivative of the characteristic function, $\varphi'_{\tilde{z}}(t)$, is a continuous function, and by the mean-value theorem we can write (since in particular $\mathbb{E}[\tilde{z}_m] = 0$) the relation in (96), which holds for some $t_m \in (0, \zeta_m t)$.

Footnote 5: The following inequality is known for complex numbers $x_m$, $y_m$, with $|x_m| \leq 1$ and $|y_m| \leq 1$ [27]:
$$ \Big| \prod_{m} x_m - \prod_{m} y_m \Big| \leq \sum_{m} |x_m - y_m|. $$
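The inequality recalled in footnote 5, in its standard form $|\prod_m x_m - \prod_m y_m| \leq \sum_m |x_m - y_m|$ for complex numbers of modulus at most one, can be spot-checked numerically. The sketch below draws random points in the unit disc (the sample sizes and the seed are arbitrary choices) and records the smallest observed slack between the two sides.

```python
import cmath
import random


def product(values):
    """Product of a sequence of complex numbers."""
    result = 1.0 + 0.0j
    for v in values:
        result *= v
    return result


def random_unit_disc_point(rng):
    """Complex number with modulus strictly below one and uniformly random phase."""
    return cmath.rect(rng.random(), rng.uniform(0.0, 2.0 * cmath.pi))


rng = random.Random(0)
worst_slack = float("inf")
for _ in range(1000):
    xs = [random_unit_disc_point(rng) for _ in range(8)]
    ys = [random_unit_disc_point(rng) for _ in range(8)]
    lhs = abs(product(xs) - product(ys))
    rhs = sum(abs(x - y) for x, y in zip(xs, ys))
    worst_slack = min(worst_slack, rhs - lhs)

print(worst_slack)
```

In the proof, this bound is what allows a product of characteristic functions to be compared with 1 term by term.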
6) Large deviations.Assume that z m is nondeterministic and has LMGF: Denoting by ⇤ s (t) the LMGF of s( ), we have that: where ↵ is defined in (69).Let be the Fenchel-Legendre transform of (t), which is an extended real number.Then the following Large Deviations Principle (LDP) holds for any measurable set S (the infimum over an empty set is taken as +1): The function ?( ) is lower semi-continuous and convex.Moreover, letting: then D o = (z , z + ), where z and z + are the extremes of the support of z m , and the function ?( ) is smooth and strictly convex on D o .Finally, tic normality.If z m has finite variance 2 z , following convergence in distribution holds: ce, s( ) is asymptotically normal as !0.
deviations.Assume that z m is nonistic and has LMGF: by ⇤ s (t) the LMGF of s( ), we have that: is defined in (69).Let enchel-Legendre transform of (t), which is ded real number.Then the following Large s Principle (LDP) holds for any measurable infimum over an empty set is taken as +1): tion ?( ) is lower semi-continuous and oreover, letting: = (z , z + ), where z and z + are the of the support of z m , and the function smooth and strictly convex on D o .Finally, cally normal as !0.
e that z m is non- of s( ), we have that: sform of (t), which is en the following Large olds for any measurable pty set is taken as +1): ] inf ]  inf 2S ?( ). (82) r semi-continuous and e z and z + are the z m , and the function convex on D o .Finally, mptotic normality.If z m has finite variance 2 z , n the following convergence in distribution holds: , hence, s( ) is asymptotically normal as !0.
rge deviations.Assume that z m is nonerministic and has LMGF: noting by ⇤ s (t) the LMGF of s( ), we have that: ere ↵ is defined in (69).Let the Fenchel-Legendre transform of (t), which is extended real number.Then the following Large viations Principle (LDP) holds for any measurable S (the infimum over an empty set is taken as +1): function ?( ) is lower semi-continuous and vex.Moreover, letting: n D o = (z , z + ), where z and z + are the remes of the support of z m , and the function ) is smooth and strictly convex on D o .Finally, ality.If z m has finite variance 2 z , g convergence in distribution holds: is asymptotically normal as !0.
ns. Assume that z m is nonhas LMGF: t) the LMGF of s( ), we have that: egendre transform of (t), which is l number.Then the following Large iple (LDP) holds for any measurable over an empty set is taken as +1): ( ) is lower semi-continuous and r, letting: and, hence, s( ) is asymptotically normal as !0. 6) Large deviations.Assume that z m is nondeterministic and has LMGF: Denoting by ⇤ s (t) the LMGF of s( ), we have that: where ↵ is defined in (69).Let be the Fenchel-Legendre transform of (t), which is an extended real number.Then the following Large Deviations Principle (LDP) holds for any measurable set S (the infimum over an empty set is taken as +1): The function ?( ) is lower semi-continuous and convex.Moreover, letting: We prove sequentially the six parts of the lemma.
view of (67), the following series of (absolute) s is convergent: [25, Lemma 3.6 0 ], convergence of the series of st moments implies that the random series s abs ( ) rely finite, which in turn implies that so is s( ), is proved.
nce the series of (absolute) expectations is conis the series of expectations: her hand, by triangle inequality we have the pper bound: bserve that s abs ( ) is a proper random variable part 1.Furthermore, it is an integrable random om Beppo Levi's monotone convergence theoanks to the convergence of absolute expectations We conclude that the random sequence s i ( ) is upper bounded by an integrable random variable.Therefore, the dominated convergence theorem [26] implies that the expectation of the a.s.limit s( ) is equal to the convergent series of expectations, and the first equality in (72) follows.Moreover, we can write: In view of (69), the absolute value of the first summation on the RHS in (88) is dominated by: We conclude from (86), ( 88) and ( 89) that the second equality in (72) holds.
and consider the following centered variables: In view of parts 1) and 2), the centered partial sums: converge in distribution to e s( ) as i ! 1.By Lévy's continuity Theorem, the corresponding characteristic functions must converge [27].Since the z m 's are i.i.d.we can write: where j = p 1. We want to show that e s( ) converges in probability to 0 as !0. In view of Lévy's continuity Theorem this is tantamount to showing that ' s(t) converges to 1 as !0. Using (93) we can write: 5 Consider, without loss of generality, a positive t.Since the random variables e z m have finite expectation, the first derivative of the characteristic function, ' 0 z (t), is a continuous function, and by the mean-value theorem we can write (since in particular E[e z m ] = 0): ), for some t m 2 (0, ⇣ m t).(96) 5 The following inequality is known for complex numbers xm, ym, with |xm|  1 and |ym|  1 [27] where 0 < α m ≤ 1, with α m converging to some value α > 0 and obeying the following upper bound for all m: for some constant κ > 0 and for some 0 < β < 1.Then, we have the following asymptotic properties.
2. First moment.The expectation of s(δ) is: where O(δ) is a quantity such that the ratio O(δ)/δ remains bounded as δ → 0.
3. Weak law of small step-sizes.The series s(δ) converges to α m z in probability as δ → 0, namely, for all > 0 we have that: 4. Second moment.If: then: 5. Asymptotic normality.If z m has finite variance σ 2 z , then the following convergence in distribution holds: and, hence, s(δ) is asymptotically normal as δ → 0.
In view of [30,Lemma 3.6 ], convergence of the series of absolute first moments implies that the random series s abs (δ) is almost-surely finite, which in turn implies that so is s(δ), and part 1 is proved.
Part 2. Since the series of (absolute) expectations is convergent, so is the series of expectations: On the other hand, by the triangle inequality we have the following upper bound: Now we observe that $s^{\rm abs}(\delta)$ is a proper random variable in view of part 1. Furthermore, it is an integrable random variable from Beppo Levi's monotone convergence theorem [31], thanks to the convergence of absolute expectations in (86).
We conclude that the random sequence $s_i(\delta)$ is upper bounded by an integrable random variable. Therefore, the dominated convergence theorem [31] implies that the expectation of the a.s. limit $s(\delta)$ is equal to the convergent series of expectations, and the first equality in (72) follows. Moreover, we can write: In view of (69), the absolute value of the first summation on the RHS in (88) is dominated by: We conclude from (86), (88) and (89) that the second equality in (72) holds.
Part 3. We introduce the quantities $\zeta_m$ in (90) and consider the following centered variables: In view of parts 1 and 2, the centered partial sums: converge in distribution to $\tilde{s}(\delta)$ as $i \to \infty$. By Lévy's continuity theorem, the corresponding characteristic functions must converge [32]. Since the $\tilde{z}_m$'s are i.i.d., we can write: where $j = \sqrt{-1}$.

We want to show that $\tilde{s}(\delta)$ converges in probability to 0 as $\delta \to 0$. In view of Lévy's continuity theorem, this is tantamount to showing that $\varphi_{\tilde{s}}(t)$ converges to 1 as $\delta \to 0$. Using (93) we can write: Consider, without loss of generality, a positive $t$. Since the random variables $\tilde{z}_m$ have finite expectation, the first derivative of the characteristic function, $\varphi'_{\tilde{z}}(t)$, is a continuous function, and by the mean-value theorem we can write (since in particular $\mathbb{E}[\tilde{z}_m] = 0$): for some $t_m \in (0, \zeta_m t)$ (96). Accordingly we can write: where the latter inequality follows from the fact that $\zeta_m \le \delta$, see (90). Applying (97) to (95) we get: On the other hand, since $\varphi'_{\tilde{z}}(0) = \mathbb{E}[\tilde{z}_m] = 0$, from the continuity of $\varphi'_{\tilde{z}}(t)$ it follows that: which proves that $s(\delta)$ converges to $\mathbb{E}[s(\delta)]$ in probability as $\delta \to 0$. The claim in (73) then follows from (72).
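The proof of part 3 compares products of characteristic functions, for which a classical tool is the inequality $|\prod_m x_m - \prod_m y_m| \le \sum_m |x_m - y_m|$, valid for complex numbers with $|x_m| \le 1$ and $|y_m| \le 1$. The following Python sketch is only a numerical sanity check of that inequality on random points of the unit disk; the helper names `random_unit_disk` and `product_gap_and_bound` are hypothetical and are not part of the paper.

```python
import cmath
import math
import random


def random_unit_disk(rng):
    """Draw a complex number with modulus at most 1."""
    radius = math.sqrt(rng.random())
    angle = 2.0 * math.pi * rng.random()
    return cmath.rect(radius, angle)


def product_gap_and_bound(xs, ys):
    """Return (|prod(xs) - prod(ys)|, sum of |x - y|) for equal-length lists."""
    prod_x = complex(1.0)
    prod_y = complex(1.0)
    for x, y in zip(xs, ys):
        prod_x *= x
        prod_y *= y
    gap = abs(prod_x - prod_y)
    bound = sum(abs(x - y) for x, y in zip(xs, ys))
    return gap, bound


rng = random.Random(0)
for _ in range(1000):
    n = rng.randint(1, 20)
    xs = [random_unit_disk(rng) for _ in range(n)]
    ys = [random_unit_disk(rng) for _ in range(n)]
    gap, bound = product_gap_and_bound(xs, ys)
    # Small slack for floating-point rounding.
    assert gap <= bound + 1e-12
```

In the proof, the $x_m$ play the role of the characteristic functions of the individual terms, whose magnitude never exceeds 1, so a small term-by-term difference controls the difference of the products.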
Part 4. Since the variables $z_m$ have common finite variance $\sigma_z^2$ and are independent, it is immediate to see that: Consider now the squared and centered variables: In view of parts 1 and 2, the quantity on the LHS converges almost surely, as $i \to \infty$, to: Given the convergence of the variance of the partial sums in (100), by Fatou's lemma we conclude that [31]: i.e., the limiting variable $s(\delta)$ has finite variance. But since the limiting variable $s(\delta)$ can be written as: with the two quantities on the RHS being statistically independent, the variance of $s(\delta)$ cannot be smaller than the variance of $s_i(\delta)$ for all $i$, implying that: Combining (103) with (105) we see that the variance of the a.s. limiting variable $s(\delta)$ is equal to the convergent series of variances, which is the first equality in (75).

Footnote 5: The following inequality is known for complex numbers $x_m$, $y_m$, with $|x_m| \le 1$ and $|y_m| \le 1$ [32]: $\left|\prod_m x_m - \prod_m y_m\right| \le \sum_m |x_m - y_m|$.
In order to prove the second equality in (75) we write: Reasoning as done to prove part 2, we can easily show that the first summation on the RHS in (106) is $O(\delta^2)$. The second summation is instead equal to: and the second equality in (75) follows.
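The first- and second-moment claims of parts 2 and 4 can be checked numerically once an explicit form of the series is fixed. The sketch below assumes, purely for illustration, that $s(\delta) = \delta \sum_{m \ge 0} (1-\delta)^m \alpha_m z_m$ with weights $\alpha_m = \alpha - \kappa \beta^m$ (so that $|\alpha_m - \alpha| \le \kappa \beta^m$); under this assumption the mean should equal $\alpha m_z + O(\delta)$ and the variance $\alpha^2 \sigma_z^2 \delta / 2 + O(\delta^2)$. The series form, the weights, the function name `series_moments`, and all parameter values are assumptions of this sketch rather than statements taken from the text.

```python
def series_moments(delta, m_z, sigma_z, alpha, kappa, beta, tol=1e-15):
    """Mean and variance of s(delta) = delta * sum_m (1 - delta)^m * alpha_m * z_m.

    The z_m are i.i.d. with mean m_z and standard deviation sigma_z, and
    alpha_m = alpha - kappa * beta**m (an illustrative choice of weights).
    The sum is truncated once the geometric weight falls below tol.
    """
    mean = 0.0
    var = 0.0
    weight = delta  # equals delta * (1 - delta)^m at iteration m
    m = 0
    while weight > tol:
        alpha_m = alpha - kappa * beta ** m
        mean += weight * alpha_m * m_z
        var += (weight * alpha_m * sigma_z) ** 2
        weight *= 1.0 - delta
        m += 1
    return mean, var


m_z, sigma_z, alpha, kappa, beta = 2.0, 1.5, 0.8, 0.3, 0.5
for delta in (0.1, 0.01, 0.001):
    mean, var = series_moments(delta, m_z, sigma_z, alpha, kappa, beta)
    mean_ratio = (mean - alpha * m_z) / delta
    var_ratio = (var - alpha ** 2 * sigma_z ** 2 * delta / 2.0) / delta ** 2
    # Both ratios should stay bounded as delta shrinks.
    print(f"delta={delta}: mean_ratio={mean_ratio:.3f}, var_ratio={var_ratio:.3f}")
```

The printed ratios are the remainders divided by $\delta$ and $\delta^2$, respectively; their staying bounded as $\delta$ decreases is what the $O(\delta)$ and $O(\delta^2)$ terms express.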
Part 5. Let $\sigma_{\lim}$ be the quantity defined in (108). The claim in (76) is equivalent to proving that the random variable $\frac{s(\delta) - \alpha m_z}{\sqrt{\delta}\,\sigma_{\lim}}$ converges in distribution to a standard Gaussian.
On the other hand, we have that: Since the second term in (109) converges to zero in view of (72), from Slutsky's theorem [24] it suffices to show that the random variable $\frac{s(\delta) - \mathbb{E}[s(\delta)]}{\sqrt{\delta}\,\sigma_{\lim}}$ converges in distribution to a standard Gaussian. To this end, we start by introducing, with slight abuse of notation w.r.t. (90) and (91), the quantities: and: We notice that $\tilde{z}_m$ has zero mean and unit variance.
We will now show that $\tilde{s}(\delta)$ converges in distribution to a standard Gaussian. In view of Lévy's continuity theorem, this claim is equivalent to the convergence, as $\delta \to 0$, of the characteristic function of $\tilde{s}(\delta)$ to the characteristic function $e^{-t^2/2}$. From (71), (108), (110) and (111) we see that: Reasoning as done to compute (93), the characteristic function of $\tilde{s}(\delta)$ in (111) can be written as: Using the triangle inequality for complex numbers we can write: Now, the fact that the second term on the RHS of (114) converges to zero follows from part 4, since from (75) and the definition of $\zeta_m$ in (110) we conclude that:

Let us now focus on the first term on the RHS of (114). Since the characteristic functions have magnitude not greater than 1, in view of (94) and (113) we can write: where in the latter step we applied the triangle inequality. Now, the last term in (116) converges to zero since for any positive $s$ we have $|e^{-s} - 1 + s| \le s^2/2$, and since it is immediate to show that (see the proof in [26]):

On the other hand, using [31, Lemma 3.3.7] we can write, for an arbitrarily small $\epsilon > 0$: where $\mathbb{I}\{\mathcal{E}\}$ is the indicator of event $\mathcal{E}$, and the last inequality follows because $\zeta_m \le \sqrt{2\delta}/\alpha$; see (110). Let now: Owing to the identical distribution of $\tilde{z}_m$ across index $m$, the function $g(\delta)$ does not depend on $m$. Since $\tilde{z}_m$ has finite variance, we have that $g(\delta) \to 0$ as $\delta \to 0$. In view of (118), recalling that the magnitude of the expectation is upper bounded by the expectation of the magnitude, and that $\tilde{z}_m$ has zero mean and unit variance, we have that: and, hence, finally implying, due to the arbitrariness of $\epsilon$, that $\varphi_{\tilde{s}}(t)$ converges to $e^{-t^2/2}$ as $\delta \to 0$. We have therefore shown that $\tilde{s}(\delta)$ in (111) converges to a standard Gaussian as $\delta \to 0$, and this completes the proof of part 5.
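The asymptotic normality of part 5 can be illustrated with a small Monte Carlo experiment. The sketch below assumes, for illustration only, the series form $s(\delta) = \delta \sum_{m \ge 0} (1-\delta)^m \alpha z_m$ with constant weights $\alpha_m = \alpha$, variables $z_m$ uniform on $(-1, 1)$ (so that $m_z = 0$ and $\sigma_z^2 = 1/3$), and the limiting variance $\sigma_{\lim}^2 = \alpha^2 \sigma_z^2 / 2$ implied by that form. All function names and parameter values are hypothetical.

```python
import math
import random


def sample_series(delta, alpha, n_terms, rng):
    """Draw one truncated realization of delta * sum_m (1 - delta)^m * alpha * z_m."""
    total = 0.0
    weight = delta  # equals delta * (1 - delta)^m at iteration m
    for _ in range(n_terms):
        total += weight * alpha * rng.uniform(-1.0, 1.0)  # z_m ~ Uniform(-1, 1)
        weight *= 1.0 - delta
    return total


def normalized_samples(delta, alpha, n_samples, seed=0):
    """Samples of (s(delta) - alpha * m_z) / (sqrt(delta) * sigma_lim), with m_z = 0."""
    rng = random.Random(seed)
    n_terms = int(20.0 / delta)  # the neglected tail weight is about exp(-20)
    sigma_z = 1.0 / math.sqrt(3.0)  # standard deviation of Uniform(-1, 1)
    sigma_lim = alpha * sigma_z / math.sqrt(2.0)
    scale = math.sqrt(delta) * sigma_lim
    return [sample_series(delta, alpha, n_terms, rng) / scale for _ in range(n_samples)]


def summarize(samples):
    """Return (mean, standard deviation, fraction of samples with |x| <= 1)."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
    inside = sum(1 for x in samples if abs(x) <= 1.0) / n
    return mean, std, inside


mean, std, inside = summarize(normalized_samples(delta=0.05, alpha=0.8, n_samples=2000))
print(f"mean={mean:.3f}, std={std:.3f}, fraction within one unit={inside:.3f}")
```

For a standard Gaussian the three summary values would be 0, 1, and about 0.683; the empirical values are expected to be close to these for small $\delta$, up to Monte Carlo fluctuations.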
Next we focus on the regularity properties of the Fenchel-Legendre transform $\phi^\star(\gamma)$. Following the development used in [26, Appendix C], we can prove that $\mathcal{D}^{o}$ is an interval, that $\phi^\star(\gamma)$ is smooth and strictly convex for $\gamma \in \mathcal{D}^{o}$, and that $\phi^\star(\gamma) \ge 0$ with equality if, and only if, $\gamma = \alpha m_z$.
Thus, it remains to characterize the boundaries of $\mathcal{D}^{o}$ and the behavior of the rate function at these boundaries. To this end, it is sufficient to prove the claim with $\alpha = 1$ and for the right boundary, since the proof for other values of $\alpha$ and for the left boundary is simply obtained using the scaling and reflection properties of the LMGF [27], [28]. Now, since it has been shown in [26, Appendix C] that the right boundary of $\mathcal{D}^{o}$ is equal to $\lim_{t \to \infty} \Lambda_z(t)/t$, we must now prove that this limit equals $z_+$ (recall that we are working with $\alpha = 1$). We start by noticing that, letting $z_- < z < z_+$, the LMGF $\Lambda_z(t)$ can be written as: From (122) we get, for all $t > 0$: where we set $q = \mathbb{P}[z_m > z]$. We remark that $0 < q < 1$ since $z$ is internal to the support of $z_m$. From (123) we get: If $z_+ = +\infty$ the result is proved due to the arbitrariness of $z$. If $z_+ < +\infty$, we can choose $z = z_+ - \epsilon$, and conclude that the limit inferior in (124) is equal to $z_+$. The fact that the corresponding limit superior is equal to $z_+$ follows by observing that, in view of (122), for all $t > 0$ the quantity $\Lambda_z(t)/t$ is upper bounded by $z_+$.
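The limit $\lim_{t \to \infty} \Lambda_z(t)/t = z_+$ is easy to visualize for a bounded discrete random variable. The following Python sketch evaluates the ratio for an illustrative three-point distribution with $z_+ = 2$; the distribution and the function name `lmgf` are assumptions chosen for the example, and the log-sum-exp form is used only to avoid numerical overflow at large $t$.

```python
import math


def lmgf(t, values, probs):
    """Log-moment generating function log E[exp(t * z)] of a discrete variable."""
    exponents = [t * v + math.log(p) for v, p in zip(values, probs)]
    peak = max(exponents)
    # Log-sum-exp: factor out the largest exponent before exponentiating.
    return peak + math.log(sum(math.exp(e - peak) for e in exponents))


values = [-1.0, 0.5, 2.0]  # support of z, so that z_+ = 2
probs = [0.5, 0.3, 0.2]
for t in (1.0, 10.0, 100.0, 1000.0):
    ratio = lmgf(t, values, probs) / t
    # The ratio never exceeds z_+ and approaches it as t grows.
    print(f"t={t}: Lambda_z(t)/t={ratio:.5f}")
```

For large $t$ the ratio behaves like $z_+ + \log \mathbb{P}[z_m = z_+]/t$ in this discrete example, which makes both the upper bound by $z_+$ and the convergence to $z_+$ visible.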
Finally, we characterize the behavior of the rate function at the boundaries of $\mathcal{D}^{o}$. We focus again on the right boundary $z_+$. When $z_+ = +\infty$, it suffices to notice that the rate function $\phi^\star(\gamma)$ is strictly convex in $\mathcal{D}^{o}$ and is strictly increasing for $\gamma > m_z$ (see Fig. 10) to conclude that the rate function diverges to $+\infty$ as $\gamma \to z_+$.
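As an illustration of the rate function's shape, the Fenchel-Legendre transform can be evaluated numerically in a case where a closed form is available. The sketch below assumes Gaussian $z_m$ (so that $z_+ = +\infty$) and assumes that $\phi(t) = \int_0^{\alpha t} \Lambda_z(\tau)/\tau \, d\tau$, which is the limit of $\delta \Lambda_s(t/\delta)$ when $s(\delta) = \delta \sum_m (1-\delta)^m \alpha z_m$; this expression for $\phi$ is an assumption of the example, since the corresponding display is not reproduced above. Under these assumptions $\phi^\star(\gamma) = (\gamma - \alpha m_z)^2 / (\alpha^2 \sigma_z^2)$. All function names are hypothetical.

```python
def lmgf_gauss(tau, m_z, sigma_z):
    """LMGF of a Gaussian variable with mean m_z and standard deviation sigma_z."""
    return m_z * tau + 0.5 * (sigma_z * tau) ** 2


def phi(t, alpha, m_z, sigma_z, n_steps=50):
    """Midpoint-rule value of the integral of Lambda_z(tau)/tau over [0, alpha * t]."""
    upper = alpha * t
    if upper == 0.0:
        return 0.0
    h = upper / n_steps  # negative when t < 0, which yields the signed integral
    total = 0.0
    for k in range(n_steps):
        tau = (k + 0.5) * h  # midpoints never hit tau = 0
        total += lmgf_gauss(tau, m_z, sigma_z) / tau
    return total * h


def rate_function(gamma, alpha, m_z, sigma_z, t_max=20.0, n_grid=4001):
    """Grid approximation of sup over t of [gamma * t - phi(t)]."""
    best = 0.0  # the value attained at t = 0
    for i in range(n_grid):
        t = -t_max + 2.0 * t_max * i / (n_grid - 1)
        best = max(best, gamma * t - phi(t, alpha, m_z, sigma_z))
    return best


alpha, m_z, sigma_z = 0.8, 1.0, 1.5
for gamma in (0.8, 1.5, 2.0, 3.0):
    numeric = rate_function(gamma, alpha, m_z, sigma_z)
    closed_form = (gamma - alpha * m_z) ** 2 / (alpha * sigma_z) ** 2
    print(f"gamma={gamma}: numeric={numeric:.4f}, closed form={closed_form:.4f}")
```

In this example the rate function vanishes at $\gamma = \alpha m_z$ and grows without bound as $\gamma$ increases, in line with the behavior described for the case $z_+ = +\infty$.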
The following property, which is relevant for condition (69) to hold in the context of our ASL strategy, will be repeatedly used in the forthcoming proofs.

APPENDIX B
Proof of Theorem 1: We are interested in characterizing, for each agent $k$, the joint behavior of the random variables $\lambda_{k,i}(\theta)$ for all values of $\theta \neq \theta_0$. To this end, it is useful to consider the $(H-1) \times 1$ vector $\boldsymbol{\lambda}^{(\delta)}_{k,i}$ defined in (9). We also introduce, for a fixed time epoch $i$, the $N \times (H-1)$ data matrix $X_i$, whose entries, for $\ell = 1, 2, \ldots, N$ and $\theta \neq \theta_0$, are: In light of (21) we can write: converges in distribution to $g$.
When dealing with convergence in distribution of random vectors, the standard path is to reduce the vector problem to a scalar problem through the following argument.In view of Lévy's continuity theorem for random vectors, convergence in distribution takes place if, and only if, convergence of the pertinent (multivariate) characteristic functions takes place [24].
This implies that⁶ our claim will be proved if we show that the convergence in (141) holds for any sequence of real numbers $t(\theta_1), t(\theta_2), \ldots, t(\theta_{H-1})$. Obviously, the linear combination on the RHS in (141) is a Gaussian random variable with zero mean and with variance: Let us now examine the LHS in (141). Using (135) we get: whereas using (15) we have: Let us now set: We observe that the variance of $z^{(\ell)}_m$ is: Exploiting Eqs. (145)-(148), the LHS in (141) can be cast in the form: We see from Eqs. (145)-(147) that the random variables $s^{(\ell)}(\delta)$ match the structure of the random series used in Lemma 1.
We now verify that $s^{(\ell)}(\delta)$ fulfills the conditions of part 5 in Lemma 1, for every $\ell = 1, 2, \ldots, N$. First, we note that $z^{(\ell)}_m$ has finite variance since it is a linear combination of random variables that have finite variance. Second, we see that condition (69) is verified in view of Property 1. We conclude then from part 5 of Lemma 1 that the following convergence in distribution holds: Since the data are independent across agents, we have that the random variables $s^{(\ell)}(\delta)$ are independent across index $\ell$.
For this reason, and in view of (151), we conclude that the LHS in (141) is asymptotically normal, with zero mean and with variance given by: where we have used (149).Since the RHS in (152) coincides with the variance in (142), the proof is complete.

APPENDIX E
Proof of Theorem 4: In light of (12), the error probability, i.e., the probability of not choosing $\theta_0$, can be bounded as follows (with the lower bound holding for every $\theta \neq \theta_0$): where the upper bound is the union bound. At the steady state, Eq. (153) implies: One key point to prove the claim of the theorem is the exponential characterization of the probability involving $\lambda^{(\delta)}_k(\theta)$, which yields: Now, part 6 of Lemma 1 would provide the required exponential characterization for the individual variable $s^{(\ell)}(\delta)$. We need instead the characterization for $\lambda^{(\delta)}_k(\theta)$, which is the sum of the (independent) variables $s^{(\ell)}(\delta)$. Let us elaborate on this aspect. The starting point to prove part 6 in Lemma 1 is the convergence in (78). Exploiting additivity of the LMGF for independent variables, we conclude that the LMGF of λ

Fig. 1. Classic social learning algorithm. Top panel. Belief evolution of agent 1, with a state of nature drifting at time i = 200. Bottom panel. The instantaneous decision taken by agent 1 by choosing the hypothesis that maximizes the current belief.

Fig. 2. Adaptive social learning algorithm proposed in this work. Top panel. Belief evolution of agent 1, with a state of nature drifting at time i = 200.

Fig. 3. Evolution of the error probability of two agents in a network running the ASL algorithm.

Fig. 7. Consistency of the ASL strategy (Theorem 2). According to the weak law of small step-sizes, the steady-state log-belief ratios for agent 1 concentrate around the predicted expectation $m_{\rm ave}$ as δ approaches zero.

Fig. 8. Distribution of data samples at steady state compared with the limiting and empirical Gaussian distributions.
