Disentangling Sources of Influence in Online Social Networks

Information propagation in online social networks is facilitated by two types of influence - endogenous (peer) influence that acts between users of the social network and exogenous (external) that corresponds to various external mediators such as online news media. However, inference of these influences from data remains a challenge, especially when data on the activation of users is scarce. In this paper we propose a methodology that yields estimates of both endogenous and exogenous influence using only a social network structure and a single activation cascade. Our method exploits the statistical differences between the two types of influence - endogenous is dependent on the social network structure and current state of each user while exogenous is independent of these. We evaluate our methodology on simulated activation cascades as well as on cascades obtained from several large Facebook political survey applications. We show that our methodology is able to provide estimates of endogenous and exogenous influence in online social networks, characterize activation of each individual user as being endogenously or exogenously driven, and identify most influential groups of users.

Popularity of online social networks allows us to investigate dynamics of social interactions on a scale that was previously unattainable [1][2][3][4][5][6][7][8], while at the same time raising ethical concerns not previously encountered [9,10]. One particular type of social interaction is an information cascade - a spread of information between the users of a social network [11,12]. Information cascades are instrumental in investigating social influence, which can be defined as the degree to which the behavior of individuals changes the behavior of their peers [13]. Although mathematical modeling of social influence and information cascades has been an active field of research in sociology for decades [11,12], it only recently became technologically feasible to apply it to a wide range of domains such as viral marketing [14], information diffusion [15], behavior adoption [16] and epidemic spreading [17].
The most commonly used information diffusion models were inspired by epidemiological models of how a disease spreads in a population [18][19][20]. However, their utility is sometimes hindered by their use of latent states which are unobservable in data. In such cases it is more appropriate to use the Independent Cascade (IC) model [21] and the Linear Threshold (LT) model [11,22], which feature two observable states, active and inactive, that denote whether a user was already exposed to the piece of information or not. These models are popular for their simplicity, which facilitates theoretical analysis [23] and statistical inference from data [24], and they can also be used as building blocks for more complex applications such as influence maximization [25]. However, there are several crucial differences between epidemic spreading and information diffusion [26]. Epidemic spreading is better modeled with a simple contagion model where endogenous factors play a dominant role, and the activation probabilities are independent of the neighborhood structure and the state of activated users in it. On the other hand, information diffusion is better modeled with complex contagion due to the common presence of exogenous factors [27] and more complex forms of endogenous influence, which include various social reinforcement mechanisms such as reciprocity [28], social feedback [29] and homophily [30]. These additional factors are often neglected in modeling.
Presence of exogenous factors is particularly problematic as it is confounded with the endogenous factors, and the two can be hard to differentiate using observational data alone [31]. Ideally, one would want to perform a study where exogenous influence is negligible [27], but this is often not possible and exogenous influence has to be explicitly accounted for [32][33][34]. In fact, exogenous influence is instrumental for understanding information spreading, as information can propagate through multiple channels simultaneously, many of which are exogenous to the online social network itself - news media websites, direct communication via email and instant messengers, and even offline word-of-mouth transmission. In addition, external events such as political unrest [1,35] and natural disasters [36] are often strong mediators of information cascades. These exogenous influences are usually not directly observable in the online social network itself, although they can be inferred from the available data. Understanding how endogenous and exogenous forces influence the information diffusion in online social networks could help us estimate to what extent these networks are vulnerable to manipulation by various interest groups such as organized individuals, news media and government agencies [37].

arXiv:1811.10372v3 [cs.SI] 27 Mar 2019
In this paper we present a new methodology for estimation of endogenous and exogenous influence in online social networks. Our current model is conceptually similar to the unified model of social influence [38], which was shown to be a generalization of many popular influence models, including the complex contagion model [4], the independent cascade model [22] and the generalized threshold model [22]. In our previous work [39] we proposed a simpler method for inference of endogenous and exogenous influence that exploits statistical differences between the way the two types of influence act on users. The underlying assumption is that the endogenous influence depends on the current state of the social network and on which users are already active, while the exogenous influence is independent of these. By incorporating these assumptions in a statistical model we can infer the magnitude of endogenous and exogenous influence from empirical data.
Here, we develop a likelihood-based approach which is expressive enough to accommodate many different microscopic models of influence, and propose a maximum likelihood inference method to estimate the parameters. The inference problem is the following: given a single activation cascade and a friendship network between users, and assuming a particular form of endogenous influence, infer the parameters of endogenous and exogenous influence and estimate the magnitudes of these influences in time and on a global and user level. Similar attempts exist in the literature, including the peer and authority model [40] which, however, requires explicit modeling of the authorities responsible for exogenous influence, while in our case this is not necessary. Many of the other approaches rely on the availability of multiple activation cascades, while we use only one. Also, we use the social network structure, based on the final state of the activation cascade, directly in our inference rather than using it implicitly [8] or relying on a network statistic such as the degree distribution [41].
We evaluate our methodology on activation cascades collected via three online survey applications related to three distinct political events in Croatia (Figure 1). The first survey, which is related to the referendum on the definition of marriage in 2013, was already used in our previous work [39]. The other two surveys are related to the Croatian parliamentary elections in 2015 and 2016, and we collected them exclusively for this research. In all of our surveys the activation cascades are a series of user registrations through time. Surveys were active one week prior to the actual elections, and through them users were able to express their vote on the upcoming elections, see summary statistics for all users as well as for their online peers, and share the link to the survey through Facebook. Besides votes, we also collected Facebook friendship connections between all users that participated in our survey. In the 2013 survey we also collected demographic data, and in the other two we obtained referral links through which users visited our survey website. These referral links originate either from Facebook, which indicates endogenous influence, or from some external website, which indicates exogenous influence. This classification of referral links served as a proxy for ground truth influence and allowed us to evaluate our inference method. During data collection we followed Facebook's privacy guidelines.
The main contributions of this paper are the following: (i) We collected data on social engagement of over 20 thousand Facebook users that participated in three distinct online political surveys. Datasets where users have to provide an informed consent to collect their data are usually much smaller, and so researchers have to rely on simulated datasets in order to validate their models. (ii) We estimate the magnitude of endogenous and exogenous influence in social networks by using only a single activation cascade of users and their friendship network. Most previous research relies on the availability of multiple information cascades and rarely tackles exogenous influence directly, either leaving it as an option [38], devising experiments where it is negligible [27] or simply treating it as a nuisance [24]. (iii) We show how our methodology can be used to estimate the collective influence of various groups of users and characterize to what extent their activation was endogenously or exogenously driven. These estimates agree with both the simulated activation cascades and three realistic use cases where users' referral links served as a proxy for the ground truth labels on whether users were endogenously or exogenously activated.

Results
Crucial components of our methodology are explicit microscopic models of endogenous and exogenous influence with which we expand the Independent Cascade (IC) model. We then use these models in a log-likelihood function which gives us the probability of observing a particular activation cascade as a function of the model's parameters. Formulating our inference problem in a probabilistic way allows us to optimize for the maximum likelihood parameters and to estimate the magnitude of endogenous and exogenous influence. We apply our methodology on several simulated and empirical activation cascades in order to characterize the activation of users as being more endogenously or exogenously driven. The simulated case is easier because we know both the functional form and the parameters of the model that generated the simulated information cascade, which allows us to perform evaluation in a straightforward manner. For the empirical cases we use three Facebook datasets obtained from online political survey applications. In the end we estimate the collective influence of three groups of users: those who registered by following a link from within Facebook, those who registered by following a link from an external website, and those who followed a link from a Facebook advertisement.
Figure 1: Friendship networks (Figure 1a) and registration time series (Figure 1c) of users who registered on three of our Facebook online survey applications: referendum2013.hr (11538 registered users), sabor2015.hr (6909 registered users) and sabor2016.hr (3818 registered users). Network nodes are colored according to the user's votes, and node sizes correspond to the number of their Facebook friends that also registered on the survey application. Clustering of users into communities based on votes shows a homophily effect: users are more likely to associate with other users that share their political preferences. This suggests a potential for endogenous influence. Time series are annotated with times of major news events which reported on our online survey application, and which are used as a proxy for exogenous influence. Collected data (table in Figure 1b) include demographic information, friendships between users, and referral links through which users visited our applications. Time period refers to the period when surveys were active. Depending on whether these referral links originated within Facebook or on some external website, they could be used as indicators of endogenous and exogenous influence respectively. Time series for the sabor2015.hr and sabor2016.hr datasets in Figure 1c are additionally separated based on the type of the referral links.

Models of endogenous and exogenous influence
We assume that the activation of a user in an online social network is mediated by two influences (Figure 2): (i) endogenous influence $p_{\mathrm{peer}}$, which depends on the network structure and on which users are already active, and (ii) exogenous influence $p_{\mathrm{ext}}$, which is modeled as a time-dependent random variable and is constant across all users. An additional assumption is that parameters of endogenous influence are constant throughout the period of observation, while parameters of exogenous influence may change in time. Both sets of parameters are equal for all users. This allows us to use a very simple model for the exogenous influence - a single probability of activation $p^{(i)}_{\mathrm{ext}}(t)$ which is equal for all inactive users $i$ at each specific time step, although it can change in time. Instead of parameterizing $p^{(i)}_{\mathrm{ext}}(t)$ with a suitable closed form, we chose to evaluate it at each time step independently [33].
For the endogenous influence we choose two commonly used Independent Cascade (IC) models: (i) the susceptible-infected (SI) model $p^{(i)}_{\mathrm{SI}}(t)$ and (ii) the exponential decay (EXP) model $p^{(i)}_{\mathrm{EXP}}(t)$. IC models are an example of simple contagion: activation of users happens due to a direct influence of one of their peers, independently of the rest of the system, including the neighborhood structure and which other users are active or not. The EXP model has an added condition that peers that activated recently carry more influence than the ones that activated farther back in time, which is commonly incorporated in endogenous influence models [42,43].
Probability of endogenous activation for user $i$ in the time interval $[t - \Delta t, t]$ under the SI model is defined by Equation 1, where $N(i)$ is the set of peers of user $i$, $a_i(t)$ designates how many of them are active at time $t$, and $p_0$ is the probability of user $i$ being activated by each of its peers. The assumption of the SI model is that the probability of activating one's peers does not change in time, so once a user is activated, at every subsequent step he has the same probability $p_0$ of activating any of his peers. This assumption is more appropriate in the epidemiological setting, from where the SI model originated, than in the information propagation setting, where we would expect the influence to decay in time. This could be achieved by adding a parameter for influence decay, which leads us to the EXP model (Equation 2), where $t_j$ is the time of activation of user $j$. $p_0$ and $\lambda$ are parameters of endogenous influence which define the shape of the exponential decay of influence, with $p_0$ being the probability of user $j$ activating user $i$ at time $t = t_j$ and $\lambda$ being the half-decay of influence.
Both SI and EXP models feature independent cascades: each individual user can independently activate any of his peers. However, in social contagion it is more realistic to add a requirement of multiple interactions for the activation. This effectively models a social reinforcement mechanism, which is a known driving force for product adoption [27]. One of the simplest examples of such complex contagion models is the threshold model, where the probability of endogenous activation is related to the number of already active peers $N(i)$ of user $i$. We define one such threshold model in Equation S11 of the Supplementary and show that it can also be effectively incorporated into our inference methodology.
We now define a likelihood function $L$ which gives us the probability of observing data $D$ (network and activation times) at a particular time $t$ given some functional forms for endogenous and exogenous influence $p_{\mathrm{peer}}$ and $p_{\mathrm{ext}}$. Due to the typically small probabilities involved in these processes we actually use the log-likelihood for maximum likelihood estimation of parameters, where the product of probabilities is replaced with the sum of log-probabilities (Equation 3). The first term on the right-hand side quantifies the agreement for the users that did activate in a given time period $[t - \Delta t, t]$, as this had to be due to either endogenous or exogenous influence. The second term quantifies the agreement for the users that did not activate up to time $t$, neither through endogenous nor through exogenous influence. Time enters our inference only through the activation times of users and is used in two ways: (i) to determine which users were active or inactive in the time window $[t - \Delta t, t]$ (Equation 3), and (ii) to calculate endogenous influence decay in the EXP model (Equation 2). However, in principle it is possible to use a temporal network where friendship connections between users change in time. This would have to be encoded into the expression for endogenous influence $p_{\mathrm{peer}}$. We can remove the explicit dependence on time $t$ by evaluating $L$ nonparametrically, at each time increment $\Delta t$.
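The displayed forms of Equations 1-3 are not legible in this copy of the text. One form that is consistent with the surrounding definitions might read as follows. This is a reconstruction under stated assumptions, not a verbatim copy of the original formulas: it assumes that peers act independently, that endogenous and exogenous influence act independently, and that $\lambda$ enters as a decay rate. The set labels $A(t)$ (users activated in $[t - \Delta t, t]$) and $I(t)$ (users still inactive at time $t$) are hypothetical names introduced here.

```latex
\begin{align}
p^{(i)}_{\mathrm{SI}}(t) &= 1 - (1 - p_0)^{a_i(t)}, \tag{1}\\
p^{(i)}_{\mathrm{EXP}}(t) &= 1 - \prod_{j \in N(i),\; t_j < t}
  \left(1 - p_0\, e^{-\lambda (t - t_j)}\right), \tag{2}\\
\log L(D; t) &= \sum_{i \in A(t)} \log\!\left[1 - \left(1 - p^{(i)}_{\mathrm{peer}}(t)\right)\left(1 - p_{\mathrm{ext}}(t)\right)\right]
  + \sum_{i \in I(t)} \log\!\left[\left(1 - p^{(i)}_{\mathrm{peer}}(t)\right)\left(1 - p_{\mathrm{ext}}(t)\right)\right]. \tag{3}
\end{align}
```

If $\lambda$ is instead meant literally as the half-decay time, the exponential factor in (2) would be $2^{-(t - t_j)/\lambda}$. In either reading the first sum in (3) covers users for whom at least one of the two influences succeeded, and the second sum covers users for whom both failed.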
One issue still needs to be addressed - on which users does the exogenous influence actually act? We know that our friendship network does not contain all possible users, and so the true number of yet inactive users is probably much larger than what we actually observe. This observer bias could lead to an overestimation of the exogenous influence as we approach the end of the activation cascade and the number of eventually observed inactive users decreases towards zero, while the true number of inactive users which could possibly activate (but do not during our observation period) stays large. We correct for this by artificially increasing the part of our log-likelihood which is responsible for inactive users by a factor $c(t) = 1 + \alpha (N_{\mathrm{all}} / N_{\mathrm{inactive}}(t))$, where $N_{\mathrm{all}}$ is the number of all users in the social network and $N_{\mathrm{inactive}}(t)$ is the number of all yet inactive users at time $t$ (more details in Section S7 of the Supplementary).
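A minimal Python sketch of how the pieces described above could fit together for a single time window is given below. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, peers and the exogenous source are assumed to act independently, `lam` is treated as a decay rate, and the correction factor simply multiplies the inactive-user term as the text describes.

```python
import math

def p_si(active_peers, p0):
    """SI model: chance that at least one of `active_peers` peers activates the user."""
    return 1.0 - (1.0 - p0) ** active_peers

def p_exp(peer_times, t, p0, lam):
    """EXP model: peer j, active since t_j < t, contributes p0 * exp(-lam * (t - t_j))."""
    stay_inactive = 1.0
    for tj in peer_times:
        if tj < t:
            stay_inactive *= 1.0 - p0 * math.exp(-lam * (t - tj))
    return 1.0 - stay_inactive

def correction(alpha, n_all, n_inactive):
    """Observer-bias factor c(t) = 1 + alpha * N_all / N_inactive(t)."""
    if n_inactive <= 0:
        raise ValueError("correction is undefined once no inactive users remain")
    return 1.0 + alpha * n_all / n_inactive

def log_likelihood(p_peer_activated, p_peer_inactive, p_ext, c=1.0, eps=1e-12):
    """Log-likelihood of one time window.

    p_peer_activated: endogenous probabilities of users who activated in the window
    p_peer_inactive:  endogenous probabilities of users who are still inactive
    c:                weight on the inactive-user term (1.0 means no correction)
    """
    ll = 0.0
    for p in p_peer_activated:
        # Activated: at least one of the two influences succeeded.
        ll += math.log(max(1.0 - (1.0 - p) * (1.0 - p_ext), eps))
    for p in p_peer_inactive:
        # Still inactive: both influences failed.
        ll += c * math.log(max((1.0 - p) * (1.0 - p_ext), eps))
    return ll
```

With `alpha = 0` the factor reduces to 1 and the uncorrected log-likelihood is recovered. The `eps` clamp is only a numerical guard against taking the logarithm of zero.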

Maximum likelihood inference for endogenous and exogenous influence
We want to compute a single set of endogenous influence parameters for the whole period and a separate set of exogenous influence parameters for every time window. Our assumption is that endogenous influence parameters do not change over time, but that exogenous ones do. A direct way to do this is to perform a joint optimization of a log-likelihood that contains a single set of endogenous influence parameters and a separate set of exogenous influence parameters for each time window $[t - \Delta t, t]$. Our log-likelihood would then be $(t + 1)$-dimensional in the case of the SI model and $(t + 2)$-dimensional for the EXP model: $t$ parameters of exogenous influence, one for each time window we are considering in our inference, plus the parameters of endogenous influence ($p_0$ for the SI model and $(p_0, \lambda)$ for the EXP model). This makes the number of parameters proportional to the number of time windows, which makes a joint optimization of the log-likelihood infeasible. Instead, we use an alternating method [33] where we alternately fix either the endogenous influence parameters or the exogenous influence parameters and optimize the other until both converge. In addition, we never optimize all of the $t$ parameters of the exogenous influence jointly but do it one by one. This yields a nonparametric estimate for exogenous influence, meaning that we have a separate estimate of exogenous influence $p_{\mathrm{ext}}(t)$ at each time step $t$. Although the number of parameters we have to infer is still proportional to the number of time windows we are considering in our inference, this strategy is much more efficient than joint inference and provides reliable estimates, even though there is no formal guarantee that the estimates will actually converge. In our experiments, however, we did not experience any problems with convergence. Figure 2b shows the initialization step of the alternating procedure on a simple simulated activation cascade, where parameters for endogenous and exogenous influence are inferred separately for each time step $t$.
Using efficient optimization routines allows our method to scale to networks of over 10000 users with a resolution of 100 time steps. In our experiments we use a truncated Newton algorithm [44] for maximum likelihood estimation, although in principle any suitable optimization algorithm could be used (more details in the Methods section and in Section S4 of the Supplementary). The total number of users activated due to endogenous and exogenous influence (in Figures 3a and 4) is calculated through the exogenous responsibility measure (Equation 4), which is derived from the inferred parameters and quantifies the extent to which each user's activation is due to endogenous or exogenous influence. This estimate is normalized with the total number of user activations in a given time interval, which is an observable quantity.
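The alternating scheme described above can be sketched in a few lines of Python. This is a hedged illustration rather than the procedure used in the paper: it assumes the SI model with independent influences, it represents each time window only by the active-peer counts of the users who activated and of those still inactive, and it swaps the truncated Newton optimizer for a plain golden-section search on one parameter at a time. All function names are hypothetical.

```python
import math

def window_ll(p0, p_ext, act_counts, inact_counts, eps=1e-12):
    """SI-model log-likelihood of one window, given active-peer counts per user."""
    ll = 0.0
    for a in act_counts:      # users who activated in this window
        stay = (1.0 - p0) ** a * (1.0 - p_ext)
        ll += math.log(max(1.0 - stay, eps))
    for a in inact_counts:    # users who are still inactive
        stay = (1.0 - p0) ** a * (1.0 - p_ext)
        ll += math.log(max(stay, eps))
    return ll

def argmax_1d(f, lo=1e-6, hi=1.0 - 1e-6, iters=60):
    """Golden-section search for the maximiser of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:           # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = f(d)
        else:                 # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = f(c)
    return (a + b) / 2.0

def alternate(windows, p0=0.05, max_rounds=50, tol=1e-6):
    """windows: list of (act_counts, inact_counts), one pair per time window."""
    p_ext = [0.05] * len(windows)
    for _ in range(max_rounds):
        # Step 1: hold p0 fixed and update each p_ext(t) separately, one by one.
        new_ext = [argmax_1d(lambda x, w=w: window_ll(p0, x, *w)) for w in windows]
        # Step 2: hold every p_ext(t) fixed and update the single shared p0.
        new_p0 = argmax_1d(
            lambda p: sum(window_ll(p, x, *w) for x, w in zip(new_ext, windows)))
        change = max([abs(new_p0 - p0)] + [abs(x - y) for x, y in zip(new_ext, p_ext)])
        p0, p_ext = new_p0, new_ext
        if change < tol:      # both parameter sets have stopped moving
            break
    return p0, p_ext
```

The one-dimensional search is adequate here because, with the other parameters held fixed, this SI log-likelihood is concave in each single parameter. For the EXP model, which adds a second endogenous parameter, a multivariate optimizer would be needed in step 2.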

Inference of endogenous and exogenous influence on simulated data
Figure 2: Our assumption is that information propagation in an online social network is mediated by two types of influence - endogenous (peer) influence, which acts between the users of the social network, and exogenous influence, which is external to it (Figure 2a). The estimated endogenous influence on the newly activated user $i = 1$ should be higher because more of his peers are already active, as compared to user $i = 2$. Figure 2b shows the normalized likelihood function (similar to Equation 3, which shows the log-likelihood) at two distinct time steps in the simulated activation cascade using the SI model for endogenous influence. The SI model features only two parameters at each time step - a parameter of endogenous influence $p_{\mathrm{peer}}$ ($p_0$ in Equation 1) and a parameter of exogenous influence $p_{\mathrm{ext}}$. The shape of the likelihood function suggests that these two parameters are correlated, as each provides part of the explanation for the observed data, and if one is weaker the other must compensate. Also, when we have more data (time 21) the shape of the log-likelihood function is more concentrated than when we have less (time 50), resulting in more confident estimates. In this simulation we are estimating the parameters of endogenous and exogenous influence at each time step separately, which corresponds to the initialization stage of the actual inference procedure which we use in the simulated (Figure 3) and empirical (Figure 4) cases. In our full inference procedure we infer a single set of endogenous influence parameters for the whole observation period instead of having a separate estimate for each time step as in this example (more details in the Methods section and in Section S4 of the Supplementary). Here we are using a truncated Newton algorithm [44] for optimizing a log-likelihood function in order to obtain a maximum likelihood solution, although in practice any suitable optimization method could be used.
Our simulations are designed to approximate, as well as possible, the conditions in which real data were collected. However, instead of using one of the empirical social networks which we collected, we decided to simulate on a configuration model of the referendum2013 Facebook friendship network so that our results are reproducible using only a degree sequence, which is a much more compact and anonymous representation in comparison to the whole empirical network. The configuration model of a network preserves the number of connections each user has, but these connections are permuted randomly across all users. This destroys mesoscale structures such as communities, but is still preferable to other permutation methods where either times of activation are permuted (destroying the order of activity) or connections themselves are permuted between the users (destroying the degree distribution by changing it to binomial) [45]. The simulation starts with a small number of active users and progresses in discrete steps following one of the endogenous influence models (Equations 1-2). Figure 3 shows the results using the EXP model (Equation 2) for endogenous influence. At three distinct times we also simulate an exponentially decaying exogenous influence which acts equally on all inactive users. This resembles a typical situation in which a distinct exogenous information source activates some of the users [46], which we also observe in our dataset (Figure 1c). However, our methodology works equally well for other shapes of exogenous influence (Figures S8 and S9 in the Supplementary). Using just the activation times of all users and their friendship network we are able to estimate the parameters of the assumed endogenous and exogenous influence models as well as the absolute number of users activated predominantly due to one or the other. In addition, using a measure of external responsibility (Equation 4) we are able to infer, for each user, the extent to which endogenous or exogenous influence was responsible for activation. Instead of using a single threshold to classify users we calculated the whole receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) score to evaluate the performance (Figure 3b). We compare our method to a simple baseline commonly used in previous work [33,47], where an activation is considered exogenous if the activated user had no other active peers at the time of the activation. However, as more and more users become active, it becomes increasingly likely that a user is connected with at least one other active user by pure chance. This underestimates the number of users activated by exogenous influence and consequently underestimates the overall exogenous influence. We obtain similar results (Figure S6 in the Supplementary) for the SI endogenous influence model and for an additional threshold model which we define in Equation S11 of the Supplementary. The inference itself is fast and scales well to networks of over ten thousand users (Section S6 in the Supplementary).
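The threshold-free evaluation mentioned above can be illustrated with a small Python sketch. The AUC equals the probability that a randomly chosen exogenously activated user receives a higher score than a randomly chosen endogenously activated one, which is what the function below computes directly. The function names and the choice to score the baseline as the negated number of active peers are assumptions made for illustration. The quadratic pairwise loop is kept for clarity and would be replaced by a rank-based computation on large datasets.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 1/2).

    scores: higher means "more likely exogenous"
    labels: truthy for exogenously activated users, falsy otherwise
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def baseline_scores(active_peer_counts):
    """Baseline: fewer active peers at activation time means more likely exogenous."""
    return [-count for count in active_peer_counts]
```

Scoring the baseline by the full active-peer count yields a whole ROC curve. Thresholding it at zero active peers gives the special case in which only users without any active peers are labeled as exogenously activated.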

Inference of endogenous and exogenous influence on empirical datasets
In order to investigate social interactions between users of a large online social network we developed three online surveys that use the Facebook API for collection of data. The surveys were related to three distinct political events in Croatia: 1) referendum2013.hr for the referendum on the definition of marriage, 2) sabor2015.hr for the parliamentary elections in 2015, and 3) sabor2016.hr for the parliamentary elections in 2016. Figure 1 shows the collected friendship networks between Facebook users and the number of registrations in 30-minute intervals for each of the survey applications during the week preceding the actual elections. The table in Figure 1b shows summary statistics for each of the datasets. The referral links provide information on whether each user followed a link originating from a post on Facebook, which indicates endogenous influence, or from some external website reporting on our survey, which indicates exogenous influence. We use this information to evaluate our estimates of endogenous and exogenous influence acting on users. More details on the datasets and the methodology of data collection are available in the Methods section and in Sections S1 and S2 of the Supplementary.
Figure 3: Inference on a simulated activation cascade. We use our methodology to infer which users activated due to endogenous or exogenous influence in a simulated activation cascade following the exponential decay (EXP) endogenous influence model. In real world applications only the total number of activated users (black line) is actually observed, along with the friendship network between users (Figure 3a). We use a configuration model of the referendum2013 social network to make our results reproducible even without the whole empirical network. We see that our measure is able to differentiate the absolute numbers of endogenously and exogenously activated users throughout the whole cascade period and to correctly infer the parameters of endogenous influence, $p_{\mathrm{peer}}$ and $\lambda$, and of exogenous influence $p_{\mathrm{ext}}(t)$ for every time period $t$. We also infer the activation type for each user individually by using the exogenous responsibility measure $R^{(i)}(t)$ (Equation 4), as shown in Figure 3b, and achieve an AUC of 0.93. We compare this with the baseline method where, instead of exogenous responsibility, we use the number of active peers at the time of activation. A special case of this baseline is where we consider users without any active peers as exogenously activated, which is the baseline that we use in Figure 3a. This baseline method underestimates the exogenously activated users towards the end of the observation period, which is due to the fact that more and more users are active and it is increasingly likely that at least one of the peers is active by chance alone. In Figure 3b we show a histogram of the number of active peers and compare it with exogenous responsibility to demonstrate that no reasonable threshold could serve as a classification measure, which is also confirmed by the relatively low AUC score of 0.86. The results for the SI endogenous influence model are similar and are available in Figure S6 in the Supplementary.
Figure 4 shows the results of applying our inference methodology to estimate the magnitude of endogenous and exogenous influence during these three activation cascades. In this experiment we use the EXP model as the endogenous influence model because it performed best on average over all three empirical datasets, with and without correction for the observer bias. The results for other models are included in Figures S12 and S13 of the Supplementary. As our methodology operates in discrete time (Equation 3), we discretized the activation times of users into 30-minute time intervals to determine which users were active or inactive during each specific interval. Considering the duration of the data collection for each of the surveys, this corresponds to 333 time intervals for the referendum2013 dataset, 327 intervals for the sabor2015 dataset and 328 intervals for the sabor2016 dataset. Each user that registered on one of the online survey applications using his Facebook credentials is considered activated in the given time period. The referral link from which a user visited the website of the survey application is used as a proxy for endogenous and exogenous influence: referral links from Facebook are considered endogenous and those from external websites exogenous. We later use this information for evaluation of our methodology. We estimate the magnitudes of endogenous and exogenous influence and characterize each user as being endogenously or exogenously activated. We use the AUC score to evaluate the predictive performance of our inferred model on the sabor2015 and sabor2016 datasets, for which we had data on the referral links from which users visited our survey application. This served as a proxy for the ground truth labels which we needed for calculating the AUC scores. The purpose of the model is to estimate the magnitude of endogenous and exogenous influence on each given user, given the available data and provided that the underlying assumptions of our statistical methodology are satisfied. As in the simulated experiments, we compare our methodology with a baseline method that simply estimates the number of exogenously activated users as all those who did not have any active peers at the time of their own activation, and again we observe that it underestimates the number of exogenously activated users, especially near the end of the observation period. Our estimates of endogenously activated users (Figure 4) closely resemble the true number of users activated by following another user's share, which is the strongest indication of endogenous influence we have. On the other hand, it might seem that our method overestimates exogenously activated users by declaring many of the users originating from Facebook as exogenously activated. However, relying on Facebook referrals alone is not a reliable proxy for endogenous activation, as many users might be activated through other means of indirect communication available through Facebook - by following an advertisement, or by directly visiting the Facebook page of the survey application.
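The discretization of registration times into 30-minute intervals can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming activation times are given in seconds and that a user counts as active in every interval after the one in which they registered.

```python
def discretize(activation_times, start, width=1800.0):
    """Map activation times (seconds) to 0-based indices of `width`-second intervals."""
    return [int((t - start) // width) for t in activation_times]

def activations_per_interval(indices, n_intervals):
    """Count how many users activated in each interval (the observable time series)."""
    counts = [0] * n_intervals
    for k in indices:
        if 0 <= k < n_intervals:
            counts[k] += 1
    return counts

def active_peer_count(user, interval, indices, friends):
    """Number of friends of `user` that activated strictly before `interval`.

    indices: dict mapping user -> activation interval index
    friends: dict mapping user -> iterable of friends
    """
    return sum(1 for f in friends.get(user, ()) if indices.get(f, interval) < interval)
```

As a consistency check on the figures quoted in the text, 333 intervals of 30 minutes span 166.5 hours, which is just under the one week during which the surveys were active.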
We observe that the magnitude of exogenous influence increases as we approach the end of the activation cascade period.This effect is due to the fact that we only observe the friendship network of users that eventually registered on our application, which is only a small subset of the whole Facebook network.However, one of our assumption is that exogenous influence acts uniformly on all users in the friendship network, not just the subset of them, and this manifests in the increased exogenous influence as the activation cascade approaches the size of the network.This observer bias can be corrected by adding a correction factor c to our log-likelihood function (Equation 3), which is regulated with parameter α.The results of applying the correction term on the empirical data are shown on Figure 4, while more detailed experiments are available in Figure S5 of the Supplementary).However, because less and less users got activated near the end of the observation period this observer bias does not influence our final estimates by much.However, we still believe that correction is warranted and useful, especially for estimates near the end of the observation period, and in other use cases where observation period is shorter and observer bias might be more pronounced.
For evaluation (Figure 4) we again calculate the AUC score, which uses the exogenous responsibility measure $R^{(i)}(t)$ (Equation 4) to classify users into endogenously and exogenously activated. The AUC scores achieved by our method ($AUC_{our}$) for the sabor2015 and sabor2016 datasets are 0.76 and 0.82 respectively. This is higher than the baseline measure based on the number of active peers at the time of activation, which achieves AUC scores ($AUC_{base}$) of 0.68 and 0.78 for the sabor2015 and sabor2016 datasets respectively. Using the exponential decay model for endogenous influence allows us to calculate the half-life of endogenous influence, which is 10.1 hours for the sabor2015 dataset. This value is consistent with what we would expect, as it means that endogenous influence diminishes to a small fraction of its initial value within a day or two and requires an influx of new users to be sustained.
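The 10.1-hour value reported above can be related to a decay rate with a few lines of arithmetic. The sketch below assumes that endogenous influence decays as exp(-λ Δt); the symbol λ and the function names are illustrative and are not taken from our implementation.

```python
import math

def decay_rate_from_half_life(half_life_hours):
    """Rate lambda of an exponential decay exp(-lambda * dt) with the given half-life."""
    return math.log(2) / half_life_hours

def remaining_fraction(half_life_hours, elapsed_hours):
    """Fraction of the initial influence that is left after elapsed_hours."""
    lam = decay_rate_from_half_life(half_life_hours)
    return math.exp(-lam * elapsed_hours)

half_life = 10.1  # hours, the value reported for the sabor2015 dataset
for hours in (10.1, 24.0, 48.0):
    fraction = remaining_fraction(half_life, hours)
    print(f"after {hours:5.1f} h: {fraction:.3f} of the initial influence")
```

Under this assumption roughly a fifth of the influence remains after one day and only a few percent after two days.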

Collective influence
Once we have characterized the activation of each user as being endogenously or exogenously driven, we can estimate the extent to which each user contributed to the activation of its peers by excluding the portion of the influence attributed to exogenous factors. We do not have a deterministic propagation path for our activation cascade - we do not know who influenced whom directly, so we cannot deterministically incorporate the influence of all users in a transitive manner 48. Nevertheless, our measure of influence incorporates all possible endogenous propagation paths to estimate an influence for each user (Figure 5a and Equation 5). If we then average this influence over a group of users we get their collective influence. Instead of using our estimates of endogenous and exogenous activation for each user, we could also estimate influence directly from data by using the referral links from which users visited our application. Figure 5 shows the comparison of our methodology with estimates of influence obtained from raw data for different groups of users that activated due to: endogenous factors, exogenous factors, advertisements. Our question was: which channel of communication is the most influential, that is, recruits users with higher collective influence? The results of our experiments (Figure 5b) on the two datasets for which we had data on referral links show no clear pattern of influence. Different groups of users are more influential depending on the dataset. However, regardless of the model of endogenous influence (SI or EXP), our estimates are robust and proportional to the ones obtained from raw data. It is important to emphasize again that our methodology does not use any information on referral links or external influence whatsoever, but rather infers this from the dynamics of the user activations. More details are available in Section S5 of the Supplementary.
S12 and S13 of the Supplementary. On the bottom panels we see the effect of the correction for the observer bias (α = 0.1) as compared to no correction (α = 0) - it reduces the overestimate of exogenous influence near the end of the observation period. AUC scores for using exogenous responsibility as a measure for classifying users into endogenously and exogenously activated ($AUC_{our}$), for the datasets where we have information on referral links for evaluation - sabor2015 and sabor2016 - are 0.76 and 0.82 respectively. This is higher than those achieved with a baseline measure of the number of active friends, which are 0.68 and 0.78 for the sabor2015 and sabor2016 datasets respectively. A more direct comparison with the baseline is available in Figure S14 of the Supplementary. Facebook referrals alone are not discriminating enough, as there are multiple possible ways by which Facebook users might reach our application, including visiting the webpage of our application directly or through an advertisement, both of which are more similar to exogenous than to endogenous influence.
Figure 5. Individual and collective influence. In the example in Figure 5a we estimate the influence $I^{(1)}$ of user i = 1 as the extent to which he is responsible for the endogenous activation $p_{peer}^{(j)}$ of all of his peers j = {3, 4, 5, 6, 7} which activated after him. Only three of his peers, j = {4, 5, 7}, activated due to endogenous influence, but he has to share part of this claim with two users, i = {0, 2}, which are their shared peers. The total individual influence of user i = 1 in the above example is the sum of his shares of $p_{peer}^{(j)}$ over these three peers. The type of activation (endogenous or exogenous) for each user can be estimated with our methodology or taken from raw data by using the referral links from which users visited our application, in which case $p_{peer}^{(j)}$ simply takes values 0 or 1.
Figure 5b shows a comparison of influence estimates obtained from our methodology and from raw data for different groups of users - those activated due to endogenous (peer) influence, exogenous (external) influence and advertisements (ads). Ads are similar to exogenous influence in that they target a large number of users independently of their friendship connections, but they act within the Facebook social network itself.

Discussion
Unlike traditional survey methods where data is manually entered either by a respondent or experimenter 49, online social networks provide an opportunity to collect much larger amounts of data on user activity. However, due to their nature they pose challenges to experimental design 50. Observational studies without explicit consent are regularly performed within companies for marketing purposes, which is regulated by the company's privacy policy, and in some cases this research can be used for academic purposes 29. Still, academic publication of such research could raise ethical concerns 2,51. On the other hand, conducting a study where explicit consent is mandatory heavily restricts the amount of data that can be collected, even when researchers have direct access to the whole online social network and are in a position to present their experiment automatically to a large number of users. For example, a study by Aral and Walker 52 on a sample of 1.3 million Facebook users managed to collect responses of only 7730 users. However, major publicized events such as elections and referendums can serve as catalysts for mobilizing users. Users are usually willing to participate in a study if through it they receive information or a service which they perceive as valuable and which could not be easily obtained in some other way.
Despite inherent difficulties in collecting data, we decided to conduct several online surveys using our own web applications and Facebook's API, which allowed us to collect activation cascades and friendship connections of over 20 thousand users in total. Although computational social science is in its infancy, with standards and practices still taking shape, we tried to protect the privacy of the users and follow current recommended ethical practices 9,10. Conducting a survey through an online social network means that the recruitment happens organically from person to person as a form of snowball sampling and not through some unbiased randomized procedure, so it is the most eager persons that are recruited first. The number of mobilized users mostly depends on highly connected and willing individuals who mobilize less willing users. This effect might easily dominate the one from mass media 53.
Using this data we demonstrate how to estimate exogenous and endogenous influence using only information on the friendship connections between users and a single activation cascade which corresponds to the times of user registration. Our methodology exploits the different ways in which exogenous and endogenous influence propagate - endogenous influence propagates between users and as such is dependent on the friendship structure, while exogenous influence acts uniformly on all users regardless of the social network structure. Our method is not able to reconstruct an exact propagation pathway, as these inevitably include pathways external to the particular online social network as well as pathways that are inherently unobservable, such as word-of-mouth communication. Still, our method is able to give probabilistic estimates of these two influences given minimal assumptions. Any additional information on the activation cascade or the social network could be included in our methodology, most probably along the lines of the unified model of social influence 38. The advantage of such likelihood-based approaches is that inference is performed in a probabilistically consistent manner, instead of relying on aggregated statistics to choose among competing models of influence 54. The availability of efficient numerical solvers means our method can easily scale to large networks of over 10000 users. Computational scalability was already addressed for the unified model 55, however, only for the modeling and not for inference. Our methodology could be applied for characterizing the types of influence in information spreading, for example the role of external factors in the spreading of fake news over online social networks such as Facebook or Twitter 56. There might also be applications outside the domain of social networks, as the paradigm of endogenous and exogenous effects could be applied in the wider context of dynamical systems modeling 32.
Our methodology suffers from several limitations, which also indicate potential paths for future research. First, we do not elucidate the mechanisms by which endogenous and exogenous influence arise. The form of the endogenous influence is predefined, and choosing between several possible candidates is possible. In our case, we evaluate different endogenous influence models by their prediction on empirical data, but other methods are possible, including information-theoretic approaches. Second, we assume exogenous influence acts equally on all users, and that parameters of endogenous influence are equal for all users. This was necessary in our case because we only have one activation cascade available for inference 34, and without imposing additional constraints our statistical inference would be infeasible 57,58. In cases where multiple activation cascades are available, it should be possible to relax these assumptions and allow for different values of endogenous and exogenous influence parameters for various groups of users. Third, we do not try to correct for the confounding effect arising from unobserved or observed characteristics of users. For example, it is expected that users respond differently to influences, both exogenous and endogenous, from entities that share their political orientation as compared to those that do not. Again, including additional parameters in our model would increase the uncertainty of our estimates.

Alternating method for inference
Our two main assumptions during statistical inference are: (i) both endogenous and exogenous influence are equal for all users at any given time, and (ii) endogenous influence does not vary in time while exogenous influence does. This leads us to an inference algorithm where we seek a single set of parameters for the endogenous influence $p_{peer}$ and a set of parameters for the exogenous influence $\{p_{ext}\}_t$ for each time step t. Optimizing all of them jointly would make the dimensionality of our log-likelihood proportional to the number of time steps we use for inference, which would be hard to optimize numerically. Instead, we use an alternating method 33 where we alternately fix either $p_{peer}$ or $\{p_{ext}\}_t$ and optimize for the other. The inference procedure is the following:
1. Estimate maximum likelihood values for $p_{peer}$ and $p_{ext}$ for every time window separately.
2. Fix $p_{ext}$ to the values obtained from each time window and estimate a single maximum likelihood value $p_{peer}$ for the whole period.
3. Fix $p_{peer}$ to the single value obtained for the whole period and estimate the maximum likelihood value $p_{ext}$ for every time window separately.
4. Repeat from step 2 until the estimates for $p_{peer}$ and $p_{ext}$ converge.
The actual maximum likelihood estimation in steps 1 to 3 is performed with a truncated Newton algorithm that is Hessian-free and uses conjugate gradients to iteratively compute parameter updates 44, although in principle any suitable optimization algorithm could be used (more details in Section S6 of the Supplementary). A full pseudocode for this alternating method is available in Section S4 of the Supplementary.
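The four steps above can be sketched in a few dozen lines. This is a minimal illustration and not our actual implementation: it assumes an SI-type model in which a user with a active peers stays inactive in a window with probability (1 - p_ext)(1 - p_peer)^a, and the function names, the data layout (one pair of arrays per time window) and the synthetic data are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

EPS = 1e-9  # keeps probabilities strictly inside (0, 1)

def neg_log_lik(p_peer, p_ext, a, y):
    """Negative log-likelihood of one time window (assumed SI-type model).

    a[i] is the number of active peers of user i, y[i] is 1 if i activated.
    """
    # log P(stay inactive) = log(1 - p_ext) + a_i * log(1 - p_peer)
    log_stay = np.log1p(-p_ext) + a * np.log1p(-p_peer)
    log_act = np.log(-np.expm1(log_stay))  # log(1 - exp(log_stay))
    return -np.sum(np.where(y == 1, log_act, log_stay))

def fit(fun, x0):
    """Bounded truncated Newton fit; gradients are approximated numerically."""
    res = minimize(fun, x0, method="TNC", bounds=[(EPS, 1 - EPS)] * len(x0))
    return res.x

def alternating_inference(windows, tol=1e-6, max_iter=50):
    """windows: list of (a, y) array pairs, one pair per time window."""
    # Step 1: joint fit of (p_peer, p_ext) in every window separately.
    init = [fit(lambda x, a=a, y=y: neg_log_lik(x[0], x[1], a, y), [0.05, 0.05])
            for a, y in windows]
    p_peer = float(np.mean([x[0] for x in init]))  # starting point only
    p_ext = np.array([x[1] for x in init])
    for _ in range(max_iter):
        # Step 2: fix p_ext, fit a single p_peer for the whole period.
        new_peer = fit(lambda x: sum(neg_log_lik(x[0], pe, a, y)
                                     for pe, (a, y) in zip(p_ext, windows)),
                       [p_peer])[0]
        # Step 3: fix p_peer, fit p_ext in every window separately.
        new_ext = np.array([
            fit(lambda x, a=a, y=y: neg_log_lik(new_peer, x[0], a, y), [pe])[0]
            for pe, (a, y) in zip(p_ext, windows)])
        # Step 4: stop once both sets of estimates have converged.
        delta = max(abs(new_peer - p_peer), np.max(np.abs(new_ext - p_ext)))
        p_peer, p_ext = new_peer, new_ext
        if delta < tol:
            break
    return p_peer, p_ext

# Synthetic check: three windows with different exogenous influence.
rng = np.random.default_rng(0)
true_peer, true_ext = 0.05, [0.01, 0.10, 0.03]
windows = []
for pe in true_ext:
    a = rng.integers(0, 6, size=5000)
    p_act = 1 - (1 - pe) * (1 - true_peer) ** a
    windows.append((a, (rng.random(5000) < p_act).astype(int)))
p_peer_hat, p_ext_hat = alternating_inference(windows)
print(p_peer_hat, p_ext_hat)
```

Each sub-problem in steps 2 and 3 is low-dimensional, which is the reason for alternating instead of optimizing all parameters jointly.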

Inference of activation types
Because our model gives us probabilities for endogenous and exogenous activation for each user individually, we can use this information to estimate the activation type for each of the users. For this we define a single measure of exogenous responsibility $R^{(i)}$ which quantifies to what degree the activation of user i is due to the exogenous (external) influence:
$$R^{(i)} = \frac{p_{ext}(t)}{p_{ext}(t) + p_{peer}^{(i)}(t)} \qquad (4)$$
where t is the time of activation of user i. Values close to zero indicate dominating endogenous influence, and values close to one indicate dominating exogenous influence. An extreme value of zero is achieved for users who activated at a time when there was no exogenous influence acting in the network. An extreme value of one is achieved for users who, at the time of their activation, did not have any active peers. Note that it is not possible for both $p_{ext}(t)$ and $p_{peer}^{(i)}(t)$ to be 0, and consequently for the value of responsibility to be undefined, because that would mean the activation of this user is evaluated as impossible by our model in Equation 3. In principle, we could also use the pure activation probabilities $p_{peer}^{(i)}$ or $p_{ext}^{(i)}$ as measures of influence, but experiments on simulated data showed that exogenous responsibility is the most sensible (more details in Supplementary Information).
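As an illustration, the responsibility can be computed as below. The sketch assumes that the measure is the share of the exogenous activation probability in the sum of both probabilities, p_ext / (p_ext + p_peer), which is consistent with the extreme cases described above; the function name is hypothetical.

```python
def exogenous_responsibility(p_ext, p_peer):
    """Assumed form R = p_ext / (p_ext + p_peer), taken at the activation time."""
    if not (0.0 <= p_ext <= 1.0 and 0.0 <= p_peer <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    total = p_ext + p_peer
    if total == 0.0:
        # The model would assign zero probability to this activation,
        # so the responsibility is undefined.
        raise ValueError("p_ext and p_peer cannot both be zero")
    return p_ext / total

print(exogenous_responsibility(0.0, 0.3))  # no exogenous influence at that time
print(exogenous_responsibility(0.2, 0.0))  # no active peers at activation
print(exogenous_responsibility(0.1, 0.3))  # mixed case
```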

Individual and collective influence of users
Our assumption is that each user is, to some extent, responsible for the endogenous activation of all of his peers that activated after him. This influence extends beyond the user's immediate peers. However, as we do not have a deterministic activation path (we do not know who shared information with whom), it is not straightforward to transitively incorporate influence from far away users as is usually done 48. This is why we express the influence $I^{(i)}$ of user i (Equation 5) as the extent to which user i is responsible for the activation of his peers j:
$$I^{(i)} = \sum_{j:\, t_i < t_j} \frac{I^{(i \to j)}}{\sum_{k:\, t_k < t_j} I^{(k \to j)}} \; p_{peer}^{(j)}(t_j) \qquad (5)$$
where $I^{(i \to j)}$ is the fraction of the endogenous influence that user i can claim for user j. In our case we define it as $I^{(i \to j)} = 1$ if i and j are peers, and 0 otherwise. This means that all users are credited equally for the activation of their peers, regardless of how far away in time they themselves activated. For an alternative formulation which involves time see Equation S8 in the Supplementary. As shown in Figure 5a, each user can claim part of the peer activation probability $p_{peer}^{(j)}(t_j)$ for each of his peers j that activated after him, $t_i < t_j$. As we do not have a deterministic activation path, this is really just a potential for responsibility, and so the user has to share part of his claim $I^{(i \to j)}$ with all other m peers of j. For the SI model we can set $I^{(i \to j)}$ to 1, meaning that we consider all peers equally responsible regardless of the time of their activation. Each user would then be assigned 1/m of the peer activation probability $p_{peer}^{(j)}$ for each of his peers j that activated after him, where m is the number of peers of user j that activated before j. For the EXP model we can weight this with the times of activation - users can claim a larger part of the influence for peers that activated close in time to their own activation (more details in Section S5 of the Supplementary). The collective influence of a group of users G is just the average influence of all users in the group, $\frac{1}{|G|} \sum_{i \in G} I^{(i)}$.
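The equal-sharing variant described above can be sketched as follows. The data structures (a symmetric dictionary of peer sets, a dictionary of activation times and a dictionary of peer activation probabilities) and the function names are assumptions made for this illustration only.

```python
def individual_influence(friends, t_act, p_peer):
    """Split p_peer[j] of every activated user j equally among the peers of j
    that activated before j (the 1/m sharing used for the SI model)."""
    influence = {user: 0.0 for user in friends}
    for j, peers in friends.items():
        if j not in t_act:
            continue  # user j never activated, nothing to claim
        earlier = [i for i in peers if i in t_act and t_act[i] < t_act[j]]
        for i in earlier:
            influence[i] += p_peer[j] / len(earlier)
    return influence

def collective_influence(influence, group):
    """Average individual influence over a group of users."""
    group = list(group)
    return sum(influence[i] for i in group) / len(group)

# Toy network of four users; friendships are symmetric.
friends = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
t_act = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}    # activation times
p_peer = {0: 0.0, 1: 0.6, 2: 0.9, 3: 0.3}   # probability of endogenous activation

influence = individual_influence(friends, t_act, p_peer)
print(influence)
print(collective_influence(influence, [1, 2]))
```

In this toy example the individual influences add up to the total of the peer activation probabilities, because every user who activated with a nonzero peer probability has at least one earlier-activated peer to receive the credit.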

Evaluation
Instead of using a single threshold for the exogenous responsibility to classify users into endogenously and exogenously activated, we calculate the entire receiver operating characteristic (ROC) curve and the associated area under the curve (AUC) score. This allows us to compare different endogenous influence models regardless of the chosen threshold. In order to calculate the ROC curve and AUC score we also need some sort of a gold standard label for each user, for which we use the referral links available for the sabor2015 and sabor2016 datasets. Depending on the referral link we classify users into one of three categories (Figure 4): (i) strong endogenous influence for users whose referral link originates from a Facebook share, (ii) potential endogenous influence for users whose referral link originates from Facebook and (iii) strong exogenous influence for users whose referral link originates from an external web site. Users who do not have a referral link are considered as unknown. For the purpose of evaluation we consider users from category (i) as endogenously activated and users from category (iii) as exogenously activated.
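A threshold-free evaluation of this kind can be sketched with a rank-based AUC, which equals the probability that a randomly chosen exogenously activated user receives a higher responsibility than a randomly chosen endogenously activated one. The category names "share", "facebook" and "external" and the scores below are hypothetical; only categories (i) and (iii) are used as labels, as described above.

```python
def referral_labels(referrals):
    """Map referral categories to labels: 0 endogenous, 1 exogenous, None unused."""
    mapping = {"share": 0, "external": 1}
    return [mapping.get(r) for r in referrals]

def auc_score(scores, labels):
    """Rank-based AUC; ties between a positive and a negative count as 1/2.

    Quadratic in the number of users, which is fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one user of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

referrals = ["share", "external", "facebook", "share", "external", None]
responsibility = [0.10, 0.90, 0.50, 0.40, 0.35, 0.70]

# Keep only users with a strong endogenous or strong exogenous label.
pairs = [(s, y) for s, y in zip(responsibility, referral_labels(referrals))
         if y is not None]
auc = auc_score([s for s, _ in pairs], [y for _, y in pairs])
print(auc)
```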

Data collection
Our online survey applications were web applications which used the Facebook Graph API 59 for authentication of users. Some sort of user authentication was necessary to prevent multiple voting. In addition, the Graph API allowed us to collect Facebook friendship relationships between users registered on our application. Furthermore, with referendum2013.hr we collected basic demographic information such as age and gender, and with the other three applications we collected the referral links through which users visited our web application. These we collected through our own web server which hosted the survey application, not through the Graph API. Before users registered they had to accept the privacy policy of the application, which was in complete alignment with Facebook's platform policy 60 (more details in Section S2 of the Supplementary). Facebook's Graph API assigns application-specific IDs to each user, so it is not possible to associate users from different datasets. After they registered, users were able to see summary voting statistics of their friends as well as of all registered users. These statistics were displayed after the user cast his vote in order to minimize the influence on his choice. We also provided an additional incentive to share the link to the application through Facebook and other social media by displaying to each user the number of users who registered on the application after following the referral link from their share, and comparing this to other users.

Descriptive analysis of the collected datasets
As a part of our research we collected three large datasets on Facebook users that registered on one of our online political survey applications (Tables 1 and 2) related to three distinct political events in Croatia: 1) the referendum on the definition of marriage (referendum2013 dataset), 2) the parliamentary elections in Croatia in 2015 and 2016 (sabor2015 and sabor2016 datasets). We already used the referendum2013 dataset in our previous research 1. Depending on the survey we collected different data - for the referendum2013.hr application we collected demographic data but no referral links, while for all the subsequent applications we collected referral links but no demographic data. In our case referral links are much more useful because they allow us to evaluate our methodology for estimation of endogenous and exogenous influence. On the other hand, demographic data could be used to build a more complex model of influence by correcting for potential confounder variables. However, as we decided to restrict ourselves to a simple model of influence that only takes into consideration friendship connections between users and their times of registration, we decided not to collect demographic data starting from the sabor2015 survey application.
Because we had collected demographic information of users only for the referendum2013 dataset, we decided to perform a more detailed exploratory analysis on that dataset. Exploratory analysis of the friendship network of voters immediately reveals large homophily with respect to votes (Figure 1a) and age (Figure 2). Homophily with respect to votes is the strongest, with the majority of users having 80% or more friends who voted the same as they did, which indicates the potential presence of endogenous influence between users. On the other hand, homophily with respect to gender is almost nonexistent, with users being equally likely to friend users of both genders. These statistics are consistent with a study performed on much larger Facebook friendship networks 2. Table 2 shows the first few lines of the sabor2015 dataset which stores the information on the information cascade. More information on the collected data and our data sharing policy is available in the main paper.
Figure 1b shows the Facebook friendship network for the referendum2013 dataset colored by three attributes - vote (blue for "for" voters and red for "against" voters), age (pale blue for voters below 30 years of age, pale yellow for middle age voters and orange-red for voters above 50 years of age), and gender (pink for female voters and blue for male voters). The size of the nodes corresponds to their degree - the number of friends they have. Figure 1a shows homophily with respect to gender and with respect to votes - users are much more likely to friend other users that share their voting preferences, while they are equally likely to friend users of both genders. Homophily with respect to age is visible on the network of users (also in Figure 1b).
Figure 2. Exploratory analysis of the referendum2013 dataset. Self-reported locality information (top left) shows that the majority of users come from Zagreb, Croatia. We did not restrict participation in the survey based on location, as we were also interested in the opinions of Croatia's citizens living abroad. The language of the survey was Croatian, so we believe this served as the most effective filter. Age distribution (top right) and age of friends for users of different ages (bottom left) show that the average user is much younger than expected from the population census. However, we were more interested in obtaining a representative sample of the Facebook population than a representative sample of the population itself, and these statistics are qualitatively consistent with the ones obtained from the whole Facebook network 2. The degree distribution of the number of friends for all three datasets (bottom right) is also consistent with what is expected in online social networks.
Table 2. First few lines of the sabor2015 data on user sessions. All users are identified by unique survey-specific ids which cannot be traced back to their actual Facebook identities. Times are recorded as minutes from a reference time, which is usually a short time before the survey applications went online. Time of login corresponds to the time of the first login by the user to our survey application. We sometimes refer to this as the user's registration time. If a user shared a link to the survey application through his Facebook account, this time is also recorded as "Time share", otherwise it is −1. If a user visited the survey application by following a referral link from another user's share, this user's id is also recorded, otherwise it is −1. The fact that a user followed a share is a strong indication of potential endogenous influence between users, and we supplement this with a general category of the referrer - Facebook if the referral originated from Facebook (not necessarily from a share), and the specific name of an external website (usually an online news site) if the referral originated from there. Information on the total number of Facebook friends ("Friend count") and the vote on the survey ("Election list id" in the case of the sabor2015 dataset) is not used explicitly in the inference. Along with this information on user sessions, our inference method also uses a social network dataset on the Facebook friendship relations between users. These are available in GML format and as an edge list.

Facebook application for collecting data and survey methodology
We developed the online survey applications as separate web pages which used the Facebook Graph API 1 to allow Facebook users to register with their Facebook accounts. The survey was hosted on an independent server with its own database, with authorization as a crucial component which allowed us to uniquely track the identity of users. Otherwise it would be impossible to know whether a particular person registered multiple times. The Graph API allowed us to retrieve, for each user, all Facebook friends that also registered on our application. From this data we constructed the friendship graph used in our inference methodology. It also allowed us to retrieve additional demographic data on users: their age, gender and hometown. We collected these for our first application referendum2013.hr, but later decided to collect only the friendship network and the referral links for our subsequent applications. It should be noted that users still had to give explicit permission for each of these variables in order for us to collect them. Permissions were given through the Facebook API's interface. As we do not use demographic data in our inference, we decided not to collect it for any survey application after referendum2013. The unnecessary collection of demographic data entails a potential security risk due to the possibility of deanonymization - identifying specific users in the dataset. Deanonymization strategies often rely on combining user data obtained from different sources, for example different Facebook applications or even different social networks. Any piece of information which is shared between these sources, demographics being the most notable one, increases the risk of partial or even full deanonymization. On our side, the only information that is shared by both Facebook and us is a subset of the friendship network. Registration times of the users and the referral links from which they visited our web page are specific to our application and as such cannot aid much in a deanonymization attempt. We should note that Facebook also changed its Graph API several times in the last couple of years due to security concerns. Most notably, it is no longer possible to collect data on a user's friends, only the absolute number of friends for each user and the friendship connections toward all other users that are also users of the same application.
Figure 3. News media coverage and a screenshot of the survey application. An example of two online news media articles that reported on the referendum2013.hr application - jutarnji.hr (Figure 3b) and vecernji.hr (Figure 3c), with the number of users registered in 30-minute windows (Figure 3a), which is also the window size we used in our inference methodology (more details in the main paper). Sudden peaks in user registrations are closely aligned with the publication of some of the news articles, which indicates possible exogenous influence. Figure 3d shows the sabor2015.hr online survey application, which had a similar interface to the referendum2013.hr application.
All three applications allowed Facebook users to cast a vote for the upcoming elections and to see the votes of their Facebook friends and summary statistics for all registered users. Also, each Facebook application is given a unique identifying number for each user, and these numbers are not shared among applications. This prevents owners of multiple applications from directly cross-identifying users between different applications. Once registered, users were given an opportunity to cast their votes in the survey. The web applications did not display any summary statistics before this point so as not to influence the user's vote. After voting, users were shown summary statistics of all users and of their Facebook friends separately. To protect the privacy of their friends, we did not show statistics on friends if fewer than a specified number of their friends had voted in the survey. Users were able to share the link to the survey application through Facebook. To additionally motivate users to vote and share, we displayed the number of users who registered by following their Facebook share, and their rank among the top sharers based on this number. This required the user to regularly visit the survey application in order to track his status among the sharers and the statistics of his friends, which prolonged the user's engagement with the application. Survey applications were typically active only during the week prior to the actual elections (Table 1). This probably aided in attracting new users, as the attention of the news media as well as the general public was focused on the possible outcomes of the elections, and our surveys provided a way to satisfy this curiosity.
In addition to the data collected by the Facebook Graph API and our own web server, we also used Google Analytics to manually collect data on the news coverage which reported on our application. This was especially important for referendum2013 - our first survey application - as there we did not yet collect any referral links from the users. The exact times and the number of users coming from these web sites (many of whom did not register on our application in the end) give us a qualitative estimate of the magnitude of exogenous influence at these specific times.
During the design and execution of these surveys we tried to follow recommended ethical guidelines for digital social research 3,4. Every application that uses the Graph API has to comply with Facebook's Platform Policy 2, which states the conditions under which data can be collected through the API and the responsibilities of the application owners. It specifies the conditions under which user data can be shared with third parties. For example, it explicitly forbids selling the user's data to third parties. It also requires application owners to have a privacy policy which is displayed to the users before they authorize the application with their Facebook credentials. Our privacy policy, which we displayed to the users, clearly stated that we will use the user's data for research purposes only, and that anonymized data might be made available to the research community in the future. The privacy policies (in Croatian) for the first two survey applications - referendum2013 and sabor2015 - are available on a Github open source code repository (Table 1), along with the source code of the survey application itself. In the main paper we outlined our data sharing policy, where we provide access to data upon explicit request and after the interested authors fill in the appropriate agreement. This is becoming an established practice 5 which aims to satisfy the requirements of reproducible research in cases where free distribution of collected data is restricted by the service provider.

Implementation details
Code for statistical inference is vectorized as much as possible by using Scipy's compressed sparse column (CSC) matrix (scipy.sparse.csc_matrix) to store the adjacency matrix of our friendship network. The CSC format provides efficient addition and multiplication of matrices as well as fast matrix-vector products. In addition, we sort the matrix elements by the activation times of users in order to exploit the fact that we often slice the friendship matrix into subnetworks of users that activated within a certain time window. This gives performance benefits, as slicing a predefined range of a matrix is more efficient than random indexing.

Note that in our original likelihood function we could use a more general, user-dependent form of exogenous influence $p^{(i)}_{\mathrm{ext}}(t)$, but instead we assume it is the same for all users at each time step $t$. As multiplication of many small probabilities would soon result in numerical underflow, we exchange multiplication for summation by log-transforming our likelihood, which gives the log-likelihood (Equation 2). This does not change the value of the maximum likelihood parameters, due to the monotonicity of the logarithm.

For additional numerical stability we also slightly change the way we calculate the log-likelihood. Exogenous influence $p_{\mathrm{ext}}$ is assumed to be equal for all users within a specific time frame. The crucial factor here is $p^{(i)}_{\mathrm{peer}}$, which is specific to each user and takes a different form for the SI and EXP models. It is expressed in terms of $N(i)$, the set of peers of user $i$, $a_i$, the number of activated friends of user $i$, and $p_{\mathrm{SI}}$, a parameter of endogenous influence. However, due to the small probabilities involved, we actually calculate the equivalent expressions using the sum-log-exp trick (Equations 5 and 6). For this we can use the special log1p function in Numpy for the calculation of $\log(1 + x)$, which is more precise when $x$ is small. As for the LOG endogenous influence model, which we introduce in Section 8 (Equation 11), we do not use any additional numerical tricks as in Equations 5 and 6.

For optimization we use Scipy's standard scipy.optimize.minimize function with the option method='TNC', which implements a truncated Newton algorithm [6] that behaves well with a large number of parameters and is Hessian-free, so we do not have to provide second derivatives of our log-likelihood function (the gradient can be approximated numerically when it is not supplied). The truncated Newton algorithm uses conjugate gradients to iteratively update parameter values, and the inner solver is run for only a limited number of iterations (hence truncated). More details on the scalability analysis are available in Section 6.
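The numerical issue described above can be illustrated with a minimal sketch. It assumes that the SI peer probability has the common independent-activation form $1 - (1 - p_{\mathrm{SI}})^{a_i}$, which is an assumption made for illustration and not a reproduction of the original equations. The function names are hypothetical, and Python's standard-library math.log1p and math.expm1 are used in place of Numpy's log1p only to keep the sketch dependency-free.

```python
import math


def peer_prob_si_naive(p_si, n_active):
    # Direct evaluation of 1 - (1 - p)^a; precision is lost when p is tiny.
    return 1.0 - (1.0 - p_si) ** n_active


def peer_prob_si_stable(p_si, n_active):
    # Same quantity in log space: 1 - (1 - p)^a = -expm1(a * log1p(-p)).
    return -math.expm1(n_active * math.log1p(-p_si))


def log_not_activated(p_ext, p_si, n_active):
    # Assumed form: log[(1 - p_ext) * (1 - p_si)^a], written as a sum of logs
    # so that many such terms can be added without underflow.
    return math.log1p(-p_ext) + n_active * math.log1p(-p_si)


if __name__ == "__main__":
    tiny = 1e-17
    print(peer_prob_si_naive(tiny, 3))   # 1 - tiny rounds to 1.0, signal is lost
    print(peer_prob_si_stable(tiny, 3))  # close to 3e-17
```

For probabilities of ordinary size the two functions agree; they differ only when $p_{\mathrm{SI}}$ is so small that $1 - p_{\mathrm{SI}}$ rounds to 1.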

Alternating method for faster convergence of inference
Algorithm 1 gives the pseudocode of the alternating procedure for inference of endogenous $p_{\mathrm{peer}}$ and exogenous $\{p_{\mathrm{ext}}\}_t$ influence that we use in our experiments. As already mentioned in the main paper, we are interested in a single set of endogenous influence parameters $p_{\mathrm{peer}}$, which are assumed to be constant in time, and a separate set of exogenous influence parameters $\{p_{\mathrm{ext}}\}_t$ for each time step. Both sets of parameters are assumed to be equal for all users at any given time step. In the first part of the algorithm (steps 2-4) we optimize $p_{\mathrm{peer}}$ and $p_{\mathrm{ext}}$ for every time window separately; these values then serve as initial values for the alternating procedure. The optimization is designated with a generic MAP (Maximum A Posteriori) procedure, which takes as arguments the parameters that are held fixed and outputs the values of the remaining parameters that maximize the log-likelihood (Equation 2). As already mentioned in Section 3, the actual optimization is performed with a truncated Newton algorithm, although in principle any suitable optimization method could be used. The second part of the algorithm is the actual alternating procedure (steps 5-11), where we first optimize for a single set of endogenous parameters $p_{\mathrm{peer}}$, conditioning on the exogenous parameters $\{p_{\mathrm{ext}}\}_t$ we obtained for each time window (step 6). We then optimize the exogenous parameters $\{p_{\mathrm{ext}}\}_t$ for each window separately, conditioning on the single set of endogenous parameters $p_{\mathrm{peer}}$ we obtained in the previous step (step 7). We alternate between steps 6 and 7 until the values of $p_{\mathrm{peer}}$ and $\{p_{\mathrm{ext}}\}_t$ converge. The differences between the values for the current and the previous iteration are calculated in steps 8 and 9, and the convergence itself is checked in step 5.
Algorithm 1. Alternating method for joint inference of influence. The procedure returns the parameters of endogenous and exogenous influence.

Even when using this alternating method, our inference could still fail to converge, especially for two-parameter models such as the exponential decay (EXP) model, which includes the decay parameter $\lambda$, and the logistic threshold (LOG) model $p_{\mathrm{LOG}}(t) = 1/(1 + e^{-k(a_i(t) - a_0)})$, whose parameters are $(k, a_0)$. In that case we can choose several reasonable values for a single parameter, for example $\lambda$, and optimize the log-likelihood separately for each of these cases, with the value of this parameter held fixed. We can then choose among these the parameter value which yields the best log-likelihood. In this way we are effectively optimizing multiple 1-parameter models instead of a single 2-parameter model. This could even be done in parallel in order to gain speed benefits. The same method was used in [7], where a 2-parameter model for an exposure curve, which describes an individual's susceptibility to endogenous influence, was reduced to a 1-parameter model.
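The alternating procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions and not the original implementation: the likelihood is an assumed SI-type form in which a user stays inactive with probability $(1 - p_{\mathrm{ext}})(1 - p_{\mathrm{peer}})^{a_i}$, a golden-section search stands in for the truncated Newton optimizer, and the per-window initialization stage is replaced by fixed initial guesses. All function and variable names are hypothetical.

```python
import math

EPS = 1e-9


def window_loglik(p_peer, p_ext, window):
    """Log-likelihood of one time window under the assumed SI-type model.

    `window` holds one (n_active_peers, activated) pair per user who was
    still inactive at the start of the window.
    """
    total = 0.0
    for n_active, activated in window:
        # Log-probability that neither influence activates the user.
        log_stay = math.log1p(-p_ext) + n_active * math.log1p(-p_peer)
        total += math.log(-math.expm1(log_stay)) if activated else log_stay
    return total


def argmax_1d(f, lo=EPS, hi=1.0 - EPS, iters=60):
    """Golden-section search for the maximum of a unimodal function."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if f(c) < f(d):
            a = c
        else:
            b = d
    return 0.5 * (a + b)


def alternating_inference(windows, p_peer=0.1, p_ext0=0.1, tol=1e-6, max_iter=200):
    p_ext = [p_ext0] * len(windows)
    for _ in range(max_iter):
        # One p_peer for the whole period, exogenous parameters held fixed.
        new_peer = argmax_1d(lambda p: sum(
            window_loglik(p, e, w) for e, w in zip(p_ext, windows)))
        # One p_ext per window, endogenous parameter held fixed.
        new_ext = [argmax_1d(lambda e, w=w: window_loglik(new_peer, e, w))
                   for w in windows]
        # Largest parameter change since the previous iteration.
        delta = max([abs(new_peer - p_peer)] +
                    [abs(x - y) for x, y in zip(new_ext, p_ext)])
        p_peer, p_ext = new_peer, new_ext
        if delta < tol:
            break
    return p_peer, p_ext


if __name__ == "__main__":
    # Made-up toy data: two windows of (active peers, activated) pairs.
    toy = [
        [(0, True), (0, False), (0, False), (1, True), (1, False), (2, True)],
        [(0, True), (0, False), (0, False), (0, False), (1, False), (3, True)],
    ]
    print(alternating_inference(toy))
```

Each one-dimensional subproblem in this sketch is concave, so a bracketing search is sufficient here; a model with several endogenous parameters would need a multivariate optimizer in the first step.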

Calculation of individual and collective influence
Our assumption regarding individual influence is that each user is, to some extent, responsible for the endogenous activation of all of his peers. To illustrate this, we demonstrate how to calculate influence in a simple example, shown in the figure below, where we have a social network of five users ($u_1$, $u_2$, $u_3$, $u_4$ and $u_5$) that activated due to endogenous or exogenous influence, each at a specific time $t_i$. Arrows indicate the potential for endogenous influence: connections point from users that activated earlier towards users that activated later. We start with the input data: the adjacency matrix $A$, which encodes the friendship connections between users, an array of activation times for all users, and an array which encodes whether a user activated due to endogenous (1) or exogenous (0) influence:

activation times = 1 2 3 4 5
endogenous activation = 0 0 1 1 1

Out of these, the endogenous activation is actually not available in empirical data and has to be inferred. For the purpose of our example we assume that we have somehow estimated it, either from raw data or using an inference methodology such as ours. Let us calculate several measures which will help us in our calculation: the number of peers which were active before each user's activation and the number of peers which activated after each user's activation:

number of active peers at activation = 0 1 1 1 2
number of active peers after activation = 2 2 0 1 0

Users activated sequentially. For user $u_1$, there are two of his peers that activated after him, but only one due to endogenous influence (user $u_4$). There are no other peers of user $u_4$ that activated before him, so user $u_1$ gets full credit for his endogenous activation, and his individual influence is 1.0. User $u_2$ has two peers that activated after him, users $u_3$ and $u_5$, and both activated due to endogenous influence. User $u_3$ has no other peers that activated before him, so user $u_2$ gets full credit for him (individual influence 1.0). User $u_5$ has one more peer who activated before him (user $u_4$), so user $u_2$ gets only half of the credit for him (individual influence 0.5), making a total of 1.5. The final individual influence $I$ for all users is:

I = 1.0 1.5 0.0 0.5 0.0

In the above example we have used the simplest expression for the calculation of the individual influence $I^{(i)}$ of user $i$, which we already introduced in the main paper:

$$I^{(i)} = \sum_{j \in N(i)} \frac{I^{(i \to j)}}{\sum_{m \in N(j)} I^{(m \to j)}} \, p^{(j)}_{\mathrm{peer}} \qquad (7)$$

In this case, the quantity $I^{(i \to j)}$ is simply 1, meaning that each user has equal influence on all of his peers which activated after him, regardless of how far away in time this actually is. A more realistic formulation is to make $I^{(i \to j)}$ dependent on time, so that the users who activated closest in time to their peers are credited with more influence. One way of introducing time dependency is with a simple exponentially decaying influence, in which case $I^{(i \to j)} = e^{-\lambda (t_j - t_i)}$. The new equation for the influence of user $i$ is now:

$$I^{(i)}_{\mathrm{EXP}} = \sum_{j \in N(i)} \frac{e^{-\lambda (t_j - t_i)}}{\sum_{m \in N(j)} e^{-\lambda (t_j - t_m)}} \, p^{(j)}_{\mathrm{peer}} \qquad (8)$$

We should note that the choice of how we calculate individual influence, either through $I^{(i)}$, $I^{(i)}_{\mathrm{EXP}}$ or some other formulation, is independent of the choice of a particular endogenous influence model $p_{\mathrm{peer}}$. In our experiments on empirical data we use the formulation of individual influence from Equation 7 for both the SI and EXP endogenous influence models.
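The worked five-user example can be reproduced with a short sketch. The edge list below is reconstructed from the peer counts and the credit assignments in the example (an assumption, since the adjacency matrix itself is not reproduced here), the endogenous activation indicator is used in place of $p^{(j)}_{\mathrm{peer}}$, and only peers that activated before user $j$ share the credit for $j$. The function name and the `decay` argument are hypothetical; `decay=0` corresponds to equal credit and `decay>0` to the exponentially decaying variant.

```python
import math

# Friendships reconstructed from the worked example (assumption), 0-based ids:
# u1-u2, u1-u4, u2-u3, u2-u5, u4-u5.
EDGES = [(0, 1), (0, 3), (1, 2), (1, 4), (3, 4)]
TIMES = [1, 2, 3, 4, 5]
ENDOGENOUS = [0.0, 0.0, 1.0, 1.0, 1.0]  # stands in for p_peer of each user


def individual_influence(edges, times, p_peer, decay=0.0):
    n = len(times)
    peers = [set() for _ in range(n)]
    for u, v in edges:
        peers[u].add(v)
        peers[v].add(u)
    influence = [0.0] * n
    for j in range(n):
        # Peers of j that activated before j share the credit for j.
        earlier = [m for m in peers[j] if times[m] < times[j]]
        weights = {m: math.exp(-decay * (times[j] - times[m])) for m in earlier}
        total = sum(weights.values())
        for m in earlier:
            influence[m] += weights[m] / total * p_peer[j]
    return influence


if __name__ == "__main__":
    print(individual_influence(EDGES, TIMES, ENDOGENOUS))             # equal credit
    print(individual_influence(EDGES, TIMES, ENDOGENOUS, decay=1.0))  # time decay
```

With `decay=0` the result matches the vector I given in the example. With a positive decay, user $u_4$ receives a larger share of the credit for $u_5$ than user $u_2$, because $u_4$ activated closer in time to $u_5$.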

Scalability of inference
We test the scalability of our method by running inference on simulated cascades on increasingly larger networks. We construct the networks with NetworkX's powerlaw_cluster_graph function, which implements the Holme and Kim algorithm for generating networks with a power-law degree distribution and a desired average clustering. At each step of the algorithm we add a node with 3 new edges (m=3) and set the clustering probability to 0.1 (p=0.1). We explore graphs of sizes ranging from 10 to 1000 (Figure 4). Execution times are almost linear with respect to the size of the networks on which inference is being done. The inference was run on a 64-bit Intel i5-2500 CPU at 3.3 GHz with 8 GB of RAM, using Python 3.6.1 as part of the Anaconda distribution.

Correction for the observer bias
Due to the particular way we collected our survey data (our friendship network consists only of users who eventually activated during the one-week period in which the survey application was active), we tend to overestimate exogenous influence as we approach the end of the observation period. The reason for this observer bias is that our set of inactive users is getting smaller, while the rate of registrations is independent of this, and so we overestimate exogenous influence. We correct for this by artificially extending our friendship network of inactive users by a certain fraction $\alpha$ of the total friendship network through a factor $c(t)$, which is expressed in terms of $N_{\mathrm{all}}$, the number of all users in the social network, and $N_{\mathrm{inactive}}(t)$, the number of users still inactive at time $t$. We include this factor in our log-likelihood function (Equation 2) by modifying the part with inactive users. In the expression for $c(t)$, $\alpha$ is the size of the virtual friendship network with which we want to expand our observed network, expressed as a fraction of the observed friendship network. In the case of $\alpha = 0$ we are not making any expansion at all, and we can expect the observer bias effect to exaggerate exogenous influence as the size of the activation cascade approaches the size of the observed friendship network.
Figure 5 shows the effect of the correction for the observer bias on the empirical datasets we collected. We only perform the experiments on the sabor2015 and sabor2016 datasets, because for them we have information on the referral links from which users visited our survey application, which effectively gives us the means to evaluate our estimates of endogenous and exogenous influence for each individual user. We see that corrections with $\alpha = 0.1$ and higher successfully reduce the artificially high values of exogenous influence near the end of the observation period. However, the effect of the correction on the predictive power of our model is not large, as judged by the AUC scores. This is mainly due to the fact that the activations fall off near the end of the observation period, so the effect of correcting them is almost negligible.

Experiments on simulated data (extended)
In this section we show the results of the inference on simulated activation cascades following the susceptible-infected (SI) model (Figure 6a). We also introduce one additional model, a logistic threshold (LOG) model, where we do not assume independence of activation cascades in the calculation of the endogenous activation probability but rather derive it from the number of already active peers among the peers $N(i)$ of user $i$. The LOG model is an example of a complex contagion model and we define it in the following way:

$$p_{\mathrm{LOG}}(t) = \frac{1}{1 + e^{-k (a_i(t) - a_0)}} \qquad (11)$$

where $a_i(t)$ designates how many of user $i$'s peers are active at time $t$, and $a_0$ and $k$ are parameters of endogenous influence which define the shape of the sigmoidal activation function, with $a_0$ being the number of active friends needed for the probability of activation to reach 0.5. As in the experiment with the exponential decay (EXP) model, our methodology is able to correctly infer the parameters of both the SI (Figure 6a) and LOG (Figure 6b) endogenous influence models. We also estimate the absolute number of users activated due to endogenous or exogenous influence, and characterize the activation of each user as being driven dominantly by one or the other influence. In theory, any microscopic model of endogenous influence can be used, as long as it can be efficiently computed given information on the user's friendship connections and the activation state of his peers. This includes simple contagion models, where the activation probability of users is independent of the rest of the system (in our case these are the SI and EXP models), as well as complex contagion models, where the probability of activation depends on the state of the user's neighborhood of peers (in our case this is the LOG model).
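The shape of the logistic threshold model can be checked with a minimal sketch that evaluates the sigmoidal expression directly, without any additional numerical tricks. The function name and the parameter values $k = 2$ and $a_0 = 3$ are hypothetical and chosen only for illustration.

```python
import math


def p_log(n_active, k, a0):
    """Logistic threshold probability 1 / (1 + exp(-k * (n_active - a0)))."""
    return 1.0 / (1.0 + math.exp(-k * (n_active - a0)))


if __name__ == "__main__":
    # Illustrative (made-up) parameters: steepness k = 2, threshold a0 = 3.
    for a in range(7):
        print(a, round(p_log(a, k=2.0, a0=3), 4))
```

The probability equals 0.5 exactly when the number of active peers equals $a_0$, stays low below that threshold, and saturates towards 1 above it, which is the behaviour that distinguishes this complex contagion model from the simple contagion models.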
We also perform simulated experiments where we use an alternative measure for estimating exogenous influence: instead of exogenous responsibility, we use the normalized exogenous activation probability $p_{\mathrm{ext}}$ directly. Figure 7 shows the results of using the exogenous activation probability instead of the exogenous responsibility for estimating the magnitude of exogenous influence. The parameters of the endogenous influence and the shape of the exogenous influence are kept the same as in the previous simulated experiment in Figure 6. We see that using the exogenous activation probability alone gives coarser estimates of the influence, reflecting the fact that the ground truth exogenous influence assumes just a few distinct values during the several time windows. Also, the AUC scores are worse than when using the exogenous responsibility measure, which incorporates endogenous influence as well.
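The AUC comparison above rests on ranking users by a score and checking how well that ranking separates the two classes. A minimal sketch of such a computation is given below, under the assumption that label 1 marks exogenously activated users and that the score is whichever measure is being evaluated. The function name and the example numbers are hypothetical and are not taken from our experiments.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up scores for four users; label 1 = exogenously activated.
    scores = [0.9, 0.4, 0.6, 0.2]
    labels = [1, 1, 0, 0]
    print(auc(scores, labels))
```

A score that takes only a few distinct values produces many ties, which pulls the AUC towards 0.5 for the tied pairs.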
In our main paper we decided to use a spike-shaped exogenous influence in our synthetic experiments. This shape closely resembles what we would expect in real cases, when a sudden surge in user registrations is caused by news media coverage coming from an external source, which then dissipates exponentially in time. Exponentially decaying spikes of user activity are also observed in Google search queries related to sudden catastrophic events [8]. However, our inference methodology can easily handle exogenous influence with other shapes, for example constant, exponentially decaying and sinusoidal (Figure 8). In theory, as our method is nonparametric and the exogenous influence is calculated at each time step, we could perform a correct inference even in cases when exogenous influence is functionally dependent on some of the dynamically changing properties of the activation cascade, for example the number of activated users.
Figure 10 shows experiments with the SI model of endogenous influence but with three different definitions of exogenous responsibility $R^{(i)}(t)$, which summarizes the information in the probability of endogenous activation $p^{(i)}_{\mathrm{peer}}(t)$ at time $t$ for each user $i$ and the probability of exogenous activation $p_{\mathrm{ext}}(t)$ at time $t$, which is equal for all users. The first is the original formulation which we use in the main paper.

Figure 2. Maximum likelihood inference of endogenous and exogenous influence. Our assumption is that information propagation in an online social network is mediated by two types of influence: endogenous (peer) influence, which acts between the users of the social network, and exogenous influence, which is external to it (Figure 2a). The estimated endogenous influence on the newly activated user i = 1 should be higher than on user i = 2, because more of his peers are already active. Figure 2b shows the normalized likelihood function (similar to Equation 3, which shows the log-likelihood) at two distinct time steps in the simulated activation cascade using the SI model for endogenous influence. The SI model features only two parameters at each time step: a parameter of endogenous influence $p_{\mathrm{peer}}$ ($p_0$ in Equation 1) and a parameter of exogenous influence $p_{\mathrm{ext}}$. The shape of the likelihood function suggests that these two parameters are correlated, as each provides part of the explanation for the observed data, and if one is weaker the other must compensate. Also, when we have more data (time 21) the shape of the log-likelihood function is more concentrated than when we have less (time 50), resulting in more confident estimates. In this simulation we estimate the parameters of endogenous and exogenous influence at each time step separately, which corresponds to the initialization stage of the actual inference procedure which we use in the simulated (Figure 3) and empirical (Figure 4) case. In our full inference procedure we infer a single set of endogenous influence parameters for the whole observation period instead of having a separate estimate for each time step as in this example (more details in the Methods section and in Section S4 of the Supplementary). Here we use a truncated Newton algorithm [44] for optimizing the log-likelihood function in order to obtain a maximum likelihood solution, although in practice any suitable optimization method could be used.

Figure 3. Inference on a simulated activation cascade. We use our methodology to infer which users activated due to endogenous or exogenous influence in a simulated activation cascade following the exponential decay (EXP) endogenous influence model. In real-world applications only the total number of activated users (black line) is actually observed, along with the friendship network between users (Figure 3a). We use a configuration model of the referendum2013 social network to make our results reproducible even without the whole empirical network. We see that our measure is able to differentiate the absolute numbers of endogenously and exogenously activated users throughout the whole cascade period and to correctly infer the parameters of endogenous influence, $p_{\mathrm{peer}}$ and $\lambda$, and of exogenous influence $p_{\mathrm{ext}}(t)$ for every time period $t$. We also infer the activation type for each user individually by using the exogenous responsibility measure $R^{(i)}(t)$ (Equation 4), as shown in Figure 3b, and achieve an AUC of 0.93. We compare this with the baseline method where, instead of exogenous responsibility, we use the number of active peers at the time of activation. A special case of this baseline is where we consider users without any active peers as exogenously activated, which is the baseline that we use in Figure 3a. This baseline method underestimates the exogenously activated users towards the end of the observation period, which is due to the fact that more and more users are active and it is increasingly likely that at least one of the peers is active by chance alone. In Figure 3b we show a histogram of the number of active peers and compare it with exogenous responsibility to demonstrate that no reasonable threshold on it could serve as a classification measure, which is also confirmed by a relatively low AUC score of 0.86. The results for the SI endogenous influence model are similar and are available in Figure S6 in the Supplementary.

Figure 1. Homophily and the Facebook friendship networks for the referendum2013 dataset. Figure 1b shows the Facebook friendship network for the referendum2013 dataset colored by three attributes: vote (blue for "for" voters and red for "against" voters), age (pale blue for voters below 30 years of age, pale yellow for middle-aged voters and orange-red for voters above 50 years of age), and gender (pink for female voters and blue for male voters). The size of the nodes corresponds to their degree, that is, the number of friends they have. Figure 1a shows homophily with respect to gender and with respect to votes: users are much more likely to friend other users that share their voting preferences, while they are equally likely to friend users of either gender. Homophily with respect to age is visible on the network of users (also in Figure 1b).

Figure 4. Scalability analysis. Scalability analysis is based on inference experiments performed on simulated activation cascades following the SI and EXP models. Execution times rise almost linearly with respect to the size of the networks.

Figure 5. Correction for the observer bias. Inference of endogenous and exogenous influence on the sabor2015 and sabor2016 empirical datasets with various values $\alpha$ of the observer bias correction factor and SI and EXP as the assumed endogenous influence models. We tried four different values of $\alpha$ ranging from 0.0 (no correction) to 0.3, which corresponds to an increase of the virtual friendship network of users by 0 to 30%. Observer bias arises due to the fact that we only collect friendship relations between users that registered on our application until the end of the collection period: the pool of inactive users is much larger than what we actually observe, and this influences our estimates of exogenous influence. The effect of observer bias is that the exogenous influence artificially increases as we approach the end of the observation period. However, because there are fewer and fewer registered users as we approach the end of the observation period, the correction by itself does not significantly increase the predictive power of our inference methodology, as judged by the AUC scores.

Figure 6. Inference on simulated activation cascades on SI and LOG models. We performed additional inference experiments using SI (Figure 6a) and LOG (Figure 6b) models for endogenous influence, which are representative examples of simple (SI) and complex (LOG) contagion models. We are able to infer correct values of endogenous and exogenous parameters as well as the total estimates for users activated due to one or the other influence. As in the experiments in the main paper, we modeled the exogenous influence so that it resembles a typical situation in which an external news source activates some of the users, with a spiked, exponentially decaying shape.

Figure 13. Inference on Facebook activation cascades with SI model. Inference of endogenous and exogenous influence on activation cascades derived from the referendum2013, sabor2015 and sabor2016 online survey applications, with SI as the assumed endogenous influence model. Referrals originating from Facebook shares diminish in time, which closely resembles the decline in endogenously activated users as estimated by our method. The AUC scores for the exogenous responsibility measure, which classifies users into endogenously and exogenously activated, show a satisfactory level of predictive performance. The AUC scores of our method ($\mathrm{AUC}_{\mathrm{our}}$) for the sabor2015 and sabor2016 datasets are 0.75 and 0.83 respectively, which is higher than the baseline ($\mathrm{AUC}_{\mathrm{base}}$) scores of 0.68 and 0.78 respectively. Facebook referrals alone are not discriminating enough, as there are multiple possible ways by which Facebook users might reach our application, including visiting the web page of our application directly or through an advertisement, both of which are more similar to exogenous than to endogenous influence.

Figure 14. Inference on Facebook activation cascades with EXP model: comparison with baseline. Comparison with the time series of activations obtained from users' referral links for the sabor2015 and sabor2016 datasets shows that referrals originating from Facebook shares diminish in time (red line labeled "from referrals"), which closely resembles the decline in endogenously activated users as estimated by our method (blue line labeled "our"). With no correction for the observer bias ($\alpha = 0$) and with exogenous responsibility as the criterion for classifying users into endogenously and exogenously activated, the achieved AUC scores are 0.76 and 0.82 for the sabor2015 and sabor2016 datasets respectively. This is higher than the baseline, where we use the number of active peers as the criterion instead, which achieves AUC scores of 0.68 and 0.78. The effect of applying the correction for the observer bias ($\alpha = 0.1$) is minor and raises the AUC for the sabor2016 dataset only from 0.82 to 0.83. This is probably due to the fact that the correction is strongest at the end of the observation period, when there are fewer and fewer users activated due to endogenous influence, which we approximate by observing which users followed referral links originating from a Facebook share. The effect of the diminishing correction for the observer bias is also visible in Figure 5.
Inference on Facebook activation cascades with EXP model.Inference of endogenous and exogenous influence on activation cascades derived from referendum2013, sabor2015 and sabor2016 online survey applications, with EXP model as assumed endogenous influence model.The results for the SI endogenous influence model are in Figures

Table 1. Expanded version of the summary statistics for the collected datasets. Time period refers to the period when surveys were active, which is typically one week prior to the actual elections. Friendships and demographics were collected through the Facebook Graph API, while referral links were collected from our own web server that was hosting the survey application. The source code of the first two survey applications is published on GitHub and is freely available.