Estimating the Bot Population on Twitter via Random Walk Based Sampling

Social bots, which are used for marketing, political intervention, and the spread of fake news, have been on the rise. Analysis methods for the characteristics of Twitter bots have been developed for third-party researchers who have limited access to Twitter data. Here, we propose a method for estimating the bot population on Twitter based on a random walk. The proposed method addresses two major problems in estimating the bot population on Twitter based on a random walk. First, the maximum number of retrievable friends or followers of a user per query is limited. Second, there is a certain percentage of private users who do not publish personal content, e.g., friends, followers, and tweets. We conduct a simulation analysis using directed social graph datasets to validate whether the proposed estimator is effective on the real Twitter follow graph. Then, we present three different estimates of the bot population on Twitter using the proposed estimator based on three sample sequences of 25,000 users, each collected in 2.5 weeks. The three estimates consistently suggest that 8%–18% of Twitter users during April–June 2021 were bots.


I. INTRODUCTION
The rise of social bots (i.e., accounts that automatically generate content) has been noted in recent years [1]. Social bots are created for a variety of purposes, including marketing [2], engaging in political intervention [3]–[5], and spreading fake news [6]–[8]. Many studies (e.g., [9]–[12]) have investigated the characteristics of bots on Twitter, which is a popular social networking service with hundreds of millions of daily active users. The bot population (i.e., the percentage of bot users) on Twitter is one of the basic properties for understanding the impact of bots on individuals' opinions and behaviors [9], [11]. Twitter officially reported the percentage of bot users on Twitter in 2014 [13]. Nevertheless, it is crucial to allow third-party researchers who are not data holders to independently estimate the bot population on a given platform [6].
Estimating the bot population on Twitter is challenging for third-party researchers because they have limited access to the data. Uniform independent sampling generally yields unbiased estimates. However, it is practically difficult to sample user IDs uniformly at random because Twitter user IDs are sparsely distributed over the 64-bit integer space as of December 2021 [14]. One approach to sampling Twitter users is to stream recent public tweets through application programming interfaces (APIs) and collect the users who posted them [9], [11]. However, this method introduces a sampling bias toward users who actively post tweets. Correcting this bias typically requires sampling a large number of users (e.g., hundreds of thousands to tens of millions of users [9], [11]). Another method for sampling Twitter users is to perform a crawling method (e.g., breadth-first search or a random walk) on the follow graph (i.e., a directed graph consisting of a set of nodes representing users and a set of directed edges representing follow relationships) [15]–[17].
Unbiased estimators using a random walk for the percentage of nodes with a specific label on a social network have been studied [18]–[20]. In general, online social networks provide APIs to retrieve neighboring users of a queried user [15], [18], [21], [22]. By repeatedly utilizing this function for a randomly selected neighbor, one obtains a sequence of sampled users via a random walk on the network. Then, one obtains an unbiased estimate by re-weighting each sampled user to correct the sampling bias derived from the Markov chain analysis. As of December 2021, Twitter provides APIs that allow retrieval of the friends (i.e., users who are followed by a user) and followers (i.e., users who follow a user) of a queried user [23], [24]. Therefore, one can attempt to estimate the bot population on Twitter using the existing estimators in which the label of the node is set to a binary label indicating whether the user is a bot or not. However, the existing estimators do not address two major problems in estimating the bot population on Twitter. First, the maximum number of retrievable friends or followers per query is restricted to 5,000 as of December 2021. Second, there is a certain percentage of private users who do not publish personal content, e.g., friends, followers, and tweets, when they are queried.
In the present study, we propose an estimator for the bot population on Twitter based on a random walk. In each step of our random walk, we perform only one query each to retrieve the friends and followers of a sampled user and then move to a node randomly selected from the retrieved friends and followers. We theoretically show that the proposed estimator has approximately no bias caused by private users when all friends and followers of a user are retrievable using a single query each. Then, we conduct a simulation analysis using datasets of directed social graphs to provide numerical and empirical evidence that the proposed estimator is effective for the real Twitter follow graph. Finally, we present three different estimates of the bot population on Twitter during April–June 2021. We obtained three sample sequences of 25,000 users by performing our random walk on Twitter for 2.5 weeks each. The three estimates consistently suggest that 8%–18% of Twitter users are bots, which is consistent with existing studies. Our code for estimating the bot population on Twitter is available at https://github.com/meipipo/twitterbot-population-estimator.

II. RELATED WORK
A. CHARACTERIZATION STUDIES OF TWITTER BOTS
The impact of Twitter bots on individuals' opinions and behaviors has been actively studied. There have been many studies on the characteristics of activities of social bots during the U.S. presidential election [3]–[5]. For example, right-leaning bots noticeably shifted their tweets to topics encouraging people to vote for Republicans before and during the Democratic National Convention, even though there was no significant difference in the content of tweets from human accounts [4]. It has also been reported that Twitter bots systematically spread fake news and amplify conspiratorial information [6]–[8]. The role of Twitter bots during COVID-19 has also been studied recently. Yang et al. reported that bots tweet links to unreliable sources more than their proportion in overall tweets [25]. Because misinformation spreads faster than information from reliable sources, tweets from social bots are more likely to be retweeted, which can inhibit the spread of correct information and discussion [26].
Several studies have been conducted on classifying bot and non-bot accounts. Chu et al. analyzed 500,000 Twitter accounts to classify human, bot, and cyborg (between human and bot) accounts [10]. Stieglitz et al. found differences in characteristics between bots and humans, such as the number of followers, the number of retweets per day, and the number of links to web pages used per day [27]. In DARPA's Twitter bot detection competition, several teams developed techniques for detecting malicious organizational activity using machine learning [28]. A software tool called Botometer [29], [30] enables accurate classification of specified Twitter users as automated bots or humans through APIs. Botometer has been frequently used in previous studies to investigate Twitter bots [7], [8], [12]. In this study, we use Botometer to determine whether each sampled user is a bot when estimating the bot population on Twitter.
Several studies estimated the bot population on Twitter. Varol et al. suggested that up to 15% of active Twitter accounts were bots in 2015 [9]. Luceri et al. examined the period during the 2016 and 2018 elections and classified 12.6% of accounts as bots [11]. These previous studies used hundreds of thousands to tens of millions of users sampled over several months. In contrast, the proposed estimator realizes a reasonable estimate of the bot population using samples obtained over just 2.5 weeks. Our estimate is also consistent with the value of 8.5%, which is the percentage of bots that Twitter officially reported in 2014 [13].

B. RANDOM WALK BASED GRAPH SAMPLING
A random walk is an effective sampling technique for online social networks where neighboring users of a queried user are retrievable using APIs [18], [19], [31]. Ribeiro et al. proposed a sampling algorithm based on a random walk, called DURW, for directed graphs in which only the outgoing edges of a queried node are retrievable [31]. The DURW algorithm has been used for sampling the Twitter retweet graph [32]. For the Twitter follow graph as of December 2021, we are allowed to retrieve both the incoming and outgoing edges of a queried user by calling the endpoints GET friends/ids and GET followers/ids [23], [24]. Therefore, an estimation of the bot population on the Twitter follow graph is reduced to the problem of estimating the percentage of nodes with a specific label on an undirected social graph [18]- [20].
The existing estimators for a percentage of nodes with a specific label assume that all neighbors of a queried node are retrievable via a single query [18]–[20]. However, the maximum number of retrievable friends or followers per query is limited to 5,000 on Twitter as of December 2021 [23], [24]. In the present study, we propose a sampling algorithm based on a random walk in which we perform just one query each to retrieve the friends and followers of a sampled user and then move to a node randomly selected from the retrieved friends and followers. We numerically compare the proposed random walk with the existing random walk, in which one queries each sampled node until all of the friends and followers of the node have been retrieved. Our experimental results show that the proposed algorithm improves the estimation accuracy using the same number of queries as the existing algorithm.
The existing estimators [18]–[20] also do not address private users on Twitter, who do not publish personal content, e.g., friends, followers, and tweets, when they are queried. There is typically a certain percentage of private users in online social networks (e.g., 27% on the Facebook network [33] and 34% on the Pokec network [34]). Although Twitter does not officially report the percentage of private users as of December 2021, we empirically found that there is a certain number of private users on Twitter. Private users prevent us from performing a simple random walk (i.e., repeatedly traversing to a randomly chosen neighbor), which biases estimators [35]. Nakajima and Shudo recently designed a framework for estimators based on a random walk considering private nodes on undirected social graphs [35]. In the present study, we extend this framework [35] to the estimation of a percentage of nodes with a specific label on a directed social graph involving private nodes. Furthermore, we theoretically show that the proposed estimator has little bias caused by private users when all friends and followers of a user are retrievable using a single query each.

A. NOTATIONS
We represent the Twitter follow graph as a directed graph G = (V, E), where V = {v_1, . . . , v_n} is a set of nodes (users) and E is a set of directed edges (follow relationships). We denote by n the number of nodes. In the Twitter terminology, if a directed edge (v_i, v_j) exists, node v_j is called a friend of v_i, and node v_i is called a follower of v_j [23], [24]. The graph G has no multiple directed edges or self-loops. We denote by d_i^in and d_i^out the in-degree (i.e., the number of followers) and the out-degree (i.e., the number of friends), respectively, of node v_i. We define d_max as the larger of the maximum out-degree and the maximum in-degree over all nodes. Each node v_i has a bot label l_bot(i) ∈ {bot, non-bot}. We denote by 1{cond} an indicator function that returns 1 if the condition cond holds and 0 otherwise. We define the proportion of bots as p_bot = (1/n) Σ_{v_i ∈ V} 1{l_bot(i) = bot}. As in the previous study [35], we introduce notations and definitions with respect to private nodes, which do not publish personal content, e.g., friends, followers, and tweets, when they are queried. Each node v_i also has a privacy label l_pri(i) ∈ {private, public}. Suppose that we query a node v_i. If v_i is a public node, we are allowed to retrieve the friends, followers, and tweets of v_i; otherwise, we are not. We refer to connected subgraphs consisting of public nodes in G as public clusters. We denote by C* = (V*, E*) the largest public cluster. For example, the directed graph shown in Fig. 1 has three public clusters, and the largest public cluster is C* = C_1. In Fig. 1, node 6 is a private node, and the others are public nodes.
We denote by k the number of retrievable friends or followers per query. We assume that we sequentially retrieve groups of k friends or followers of a node from the back of the list of friends or followers using a single query. For example, we consider the case of k = 100. If we want to obtain all 520 friends and 450 followers of a node, we need to perform the query ⌈520/100⌉ + ⌈450/100⌉ = 6 + 5 = 11 times. Note that the case of k = d_max (i.e., all friends or followers are available using one query) is identical to the assumption adopted in previous studies [18]–[20], [35].
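The query count in the example above can be sketched in a few lines of Python (the function name `num_queries` is ours, used only for illustration):

```python
import math

def num_queries(num_friends: int, num_followers: int, k: int) -> int:
    """Queries needed to retrieve all friends and followers of a node
    when at most k IDs are returned per query."""
    return math.ceil(num_friends / k) + math.ceil(num_followers / k)

# The example from the text: 520 friends and 450 followers with k = 100.
print(num_queries(520, 450, 100))  # 6 + 5 = 11 queries
```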

B. ASSUMPTIONS
We design a method for estimating the bot population p bot under the following assumptions.
1. G is weakly connected. Previous studies of undirected social graphs similarly assumed that the original undirected graph is connected [18], [20]. There is also empirical evidence that the Twitter follow graph is mostly weakly connected; e.g., 99.94% of all nodes with at least one friend or follower belonged to the largest weakly connected component of the Twitter follow graph as of 2012 [36].
2. G remains static during the duration of our random walk. Previous studies on undirected social graphs similarly assumed that the original graph is static during the sampling of the graph data [18], [19]. Gjoka et al. verified that this assumption sufficiently holds for the Facebook graph [18].
3. The indices of all friends and followers of a queried public node are retrievable. For example, if node 8 is queried on the graph shown in Fig. 1, we are allowed to retrieve its friends, i.e., node 6 and node 9, and its follower, i.e., node 9. We confirmed that the Twitter APIs satisfy this assumption as of December 2021 [23], [24].
4. Each node independently becomes private with probability p and public otherwise, where 0 ≤ p < 1. That is, we assume that private nodes are uniformly distributed over the social graph. Nakajima and Shudo validated this assumption for social graphs [35].
5. The initial node of our random walk is located in the largest public cluster of G. We do not count the queries used to select an initial node from the largest public cluster because this number is sufficiently small (see [35] for the validity of this assumption).

IV. PROPOSED ESTIMATOR
In this section, we propose an estimator based on a random walk for the bot population on the Twitter follow graph. First, we select an initial node v_{x_1} ∈ V* from the largest public cluster of G, where x_i denotes the index of the i-th sampled node (i = 1, 2, . . .). Second, we query the friends and followers of node v_{x_i} once each. We denote the retrieved subsets of the friends and followers of v_{x_i} by out_k(x_i) and in_k(x_i), respectively. We denote by Γ_k(x_i) the set of candidate nodes to be traversed from v_{x_i}. We define Γ_k(x_1) = out_k(x_1) ∪ in_k(x_1) and Γ_k(x_i) = out_k(x_i) ∪ in_k(x_i) ∪ {v_{x_{i−1}}} for i ≥ 2. If v_{x_i} has been traversed for the first time, we record Γ_k(x_i) to avoid querying v_{x_i} again in future steps. When we revisit v_{x_i}, we replace the recorded set with Γ_k(x_i) ∪ {v_{x_{i−1}}} and then use it. Third, we randomly select a node u from the set Γ_k(x_i). If u is a public node, we move to u as the next sampled node v_{x_{i+1}}; otherwise, we randomly reselect u from the set Γ_k(x_i). We repeat this transition procedure r − 1 times, where r is the sample size. Finally, we obtain an estimator of the bot population p̂_bot as

p̂_bot = [Σ_{i=1}^{r} 1{l_bot(x_i) = bot} / |Γ_k(x_i)|] / [Σ_{i=1}^{r} 1 / |Γ_k(x_i)|].   (1)

The reason why we include the previous node v_{x_{i−1}} in the set Γ_k(x_i) when i ≥ 2 is to avoid the situation in which all retrieved friends and followers (i.e., out_k(x_i) ∪ in_k(x_i)) are private nodes. Fig. 2(a) shows the retrievable friends and followers of each node on the graph presented in Fig. 1 when k = 1. When we start our random walk from node 1, we query the friends and followers of node 1 once each and obtain out_k(1) = {5} and in_k(1) = {5}. Then, we move to node 5, query the friends and followers of node 5 once each, and obtain out_k(5) = {6} and in_k(5) = {6}. However, the walk cannot proceed because node 6 is a private node. To address this problem, we include node 1, which was sampled in the step before node 5, in the set Γ_k(5). This operation ensures that the set Γ_k(x_i) contains at least one public node if i ≥ 2.
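The sampling and re-weighting procedure above can be sketched as follows. This is a simplified illustration on a hypothetical in-memory toy graph, not the authors' released implementation; `retrieve` stands in for a single API call returning at most K IDs, and the graph, bot labels, and privacy labels are made up:

```python
import random

# Toy directed graph (hypothetical): node -> friends (outgoing) and
# followers (incoming). Node 4 is private; node 3 is labeled as a bot.
friends = {1: [2, 3], 2: [3], 3: [1], 4: [1, 3]}
followers = {1: [3, 4], 2: [1], 3: [1, 2, 4], 4: []}
private = {4}
bots = {3}
K = 2  # at most K IDs retrieved per query

def retrieve(ids, k):
    """One query: return at most k IDs from the back of the list."""
    return ids[-k:]

def random_walk_estimate(start, r, seed=0):
    """Run the random walk for r steps and return the re-weighted
    bot-population estimate."""
    rng = random.Random(seed)
    retrieved = {}  # cache: one friends query + one followers query per node
    cur, prev = start, None
    samples = []
    for _ in range(r):
        if cur not in retrieved:
            retrieved[cur] = set(retrieve(friends[cur], K)) | set(retrieve(followers[cur], K))
        # Candidate set: retrieved friends/followers plus the previous node.
        gamma = retrieved[cur] | ({prev} if prev is not None else set())
        samples.append((cur, len(gamma)))
        nxt = rng.choice(sorted(gamma))
        while nxt in private:  # reselect until a public candidate is drawn
            nxt = rng.choice(sorted(gamma))
        prev, cur = cur, nxt
    num = sum(1 / size for node, size in samples if node in bots)
    den = sum(1 / size for node, size in samples)
    return num / den

print(random_walk_estimate(start=1, r=2000))
```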
If the initial node v_{x_1} is a node such that Γ_k(x_1) consists of only private nodes, such as node 5 in the example above, we reselect the initial node from the largest public cluster of G. However, unless k is too small, it is rare to choose such nodes (see Section VI-A and Fig. 5 for details).
In general, when k (i.e., the number of friends or followers retrievable via a single query) is less than d_max, the number of public nodes reachable via our random walk depends on the initial node v_{x_1}. For example, we consider the case of k = 1 in the directed graph shown in Fig. 1. Fig. 2(a) shows the lists of retrievable friends and followers of each node in this case. When we select node 1 as the initial node in Fig. 2(a), we have out_k(1) = {5} and in_k(1) = {5}. Therefore, the set of reachable public nodes is {1, 5} (see Fig. 2(b)). When we select node 2 as the initial node in Fig. 2(a), the set of reachable public nodes is {1, 2, 4, 5} (see Fig. 2(c)). However, we empirically find that unless k is too small compared with d_max, the variance in the number of reachable public nodes depending on the initial node is quite small, and most public nodes in the largest public cluster are reachable (see Section VI-A and Fig. 5 for details).
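The dependence of reachability on the initial node can be reproduced on a toy graph with a breadth-first search over the k-restricted candidate sets (a sketch under our assumptions; the graph and its list orders are hypothetical, and the previous-node rule is ignored for simplicity):

```python
from collections import deque

# Hypothetical lists of friends/followers per node. With K = 1, only the
# last entry of each list is retrievable, so reachability depends on the
# list order and on the initial node.
friends = {1: [5], 2: [3, 4], 3: [], 4: [2, 5], 5: [1]}
followers = {1: [5], 2: [4], 3: [2], 4: [2], 5: [4, 1]}
private = set()
K = 1

def reachable_public(start):
    """Public nodes reachable when only the last K friends/followers of
    each node can be retrieved."""
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for u in set(friends[v][-K:]) | set(followers[v][-K:]):
            if u not in seen and u not in private:
                seen.add(u)
                queue.append(u)
    return seen

print(reachable_public(1))  # {1, 5}: only a small part of the graph
print(reachable_public(2))  # {1, 2, 4, 5}: another initial node reaches more
```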

V. THEORETICAL ANALYSIS
We theoretically analyze the bias of the proposed estimator caused by private nodes when k = d_max. Our theoretical analysis is an extension of the theoretical analysis for undirected social graphs presented in [35]. Unless otherwise stated, we assume that k = d_max in this section.
First, we introduce some notations. We denote by G̃ = (Ṽ, Ẽ) the undirected graph generated by removing the directions of the edges in the original directed graph G, where Ẽ is the set of edges in the undirected graph. An edge (v_i, v_j) exists in G̃ if and only if at least one directed edge between v_i and v_j exists in G. We refer to the nodes connected to node v_i in G̃ as the neighbors of v_i. We denote the degree (i.e., the number of neighbors) of node v_i in G̃ by d̃_i. Let C̃* = (Ṽ*, Ẽ*) denote the largest public cluster of G̃. We refer to the public nodes connected to node v_i ∈ Ṽ* as the public neighbors of v_i. Let d̃*_i denote the public degree (i.e., the number of public neighbors) of node v_i ∈ Ṽ*. Let D̃* = Σ_{v_i ∈ Ṽ*} d̃*_i denote the sum of public degrees.
Lemma 1: When k = d_max, our random walk on G is equivalent to the random walk proposed in [35] on G̃.
Proof: First, G̃ is connected because G is weakly connected (Assumption 1). Second, the initial node of our random walk belongs to the largest public cluster of G̃ because of Assumption 5. Third, for each sampled node v_{x_i}, the set of nodes that can be traversed from v_{x_i} in G, i.e., Γ_{d_max}(x_i), is equivalent to the set of neighbors of v_{x_i} in G̃ because all the friends and followers of v_{x_i} are retrievable by querying v_{x_i} just once each when k = d_max. Therefore, our random walk first chooses an initial node v_{x_1} that belongs to the largest public cluster of G̃ and then repeatedly moves to a randomly chosen public neighbor of the current node. This is equivalent to performing the random walk proposed in [35] on G̃.
Lemma 1 allows us to apply the theoretical results shown in [35] to the undirected graph G̃. Let Pr[A] denote the probability that event A occurs. We define the distribution induced by the sequence of r indices of sampled nodes as π_r = (Pr[x_r = i])_{v_i ∈ Ṽ}. According to Lemma 3.1 in [35], each node in the largest public cluster, v_i ∈ Ṽ*, is sampled from the following stationary distribution.
Lemma 2: The vector π_r converges to the stationary distribution π = (π_i)_{v_i ∈ Ṽ} after many steps of our random walk on G, where π_i = 1{v_i ∈ Ṽ*} d̃*_i / D̃*.
Then, we obtain the following lemma regarding the convergence value of the proposed estimator p̂_bot.
Lemma 3: The proposed estimator p̂_bot converges to

p̃_bot = [Σ_{v_i ∈ Ṽ*} (d̃*_i / d̃_i) 1{l_bot(i) = bot}] / [Σ_{v_i ∈ Ṽ*} (d̃*_i / d̃_i)]

after many steps of our random walk on G.
Proof: We define the quantity Φ_bot as Φ_bot = (1/r) Σ_{i=1}^{r} 1{l_bot(x_i) = bot} / |Γ_k(x_i)|. First, we calculate the expectation of Φ_bot with respect to the stationary distribution π as follows:

E_π[Φ_bot] = Σ_{v_i ∈ Ṽ*} π_i 1{l_bot(i) = bot} / d̃_i = (1/D̃*) Σ_{v_i ∈ Ṽ*} (d̃*_i / d̃_i) 1{l_bot(i) = bot}.   (2)

The first equation holds because of the linearity of expectation. The second equation holds because |Γ_{d_max}(i)| = d̃_i holds for each node v_i and Lemma 2 holds. Then, we define the quantity Ψ_bot as Ψ_bot = (1/r) Σ_{i=1}^{r} 1 / |Γ_k(x_i)|. We similarly obtain the expectation of Ψ_bot with respect to π as

E_π[Ψ_bot] = (1/D̃*) Σ_{v_i ∈ Ṽ*} d̃*_i / d̃_i.   (3)

The quantities Φ_bot and Ψ_bot converge to their respective expectations, i.e., (2) and (3), after many steps of our random walk because of the strong law of large numbers for a Markov chain [35]. Therefore, we conclude that p̂_bot = Φ_bot / Ψ_bot converges to p̃_bot = E_π[Φ_bot] / E_π[Ψ_bot].
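Lemma 3 can also be checked numerically. On a toy undirected graph with no private nodes and k = d_max (so that the candidate set of each node is exactly its neighbor set), a long simple random walk re-weighted by 1/degree should approach the closed-form convergence value, which in this setting equals the true bot fraction (a sketch; the graph and labels are made up):

```python
import random

# Toy undirected graph with no private nodes: node -> neighbors.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
bots = {3}  # hypothetical bot label

# With no private nodes, d*_i = d_i, so the convergence value of Lemma 3
# reduces to the true bot fraction: 1 bot out of 4 nodes = 0.25.
p_tilde = len(bots) / len(adj)

rng = random.Random(42)
cur, num, den = 0, 0.0, 0.0
for _ in range(200_000):
    d = len(adj[cur])
    num += (cur in bots) / d  # 1{bot}/|Γ| term
    den += 1 / d              # 1/|Γ| term
    cur = rng.choice(adj[cur])

print(num / den)  # approaches p_tilde = 0.25
```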
The following proposition holds because C̃* is equivalent to G̃ when there are no private nodes in G.
Proposition 1: When there are no private nodes in G, the convergence value p̃_bot of the proposed estimator is equal to the true value p_bot.
We now show that private nodes cause approximately no bias between the convergence value p̃_bot of the proposed estimator and the true value p_bot. To this end, we calculate the expectation of p̃_bot with respect to the set of privacy labels of nodes. We denote by L_pri = {l_pri(i)}_{v_i ∈ V} the set of privacy labels of nodes. Let E_pri[X] denote the expectation of a random variable X with respect to the set L_pri. We approximate E_pri[p̃_bot] under the condition that all public nodes belong to the largest public cluster of G̃. Under this condition, it holds that Pr[v_i ∈ Ṽ*] = Pr[l_pri(i) = public] = 1 − p because of Assumption 4. We also approximate E_pri[p̃_bot] as the fraction of the expectations of the numerator and denominator of p̃_bot.
Theorem 1: Under the condition that all public nodes belong to the largest public cluster, we have

E_pri[p̃_bot] ≈ p_bot.

Proof: We define a random variable X_bot = Σ_{v_i ∈ Ṽ*} (d̃*_i / d̃_i) 1{l_bot(i) = bot}, i.e., the numerator of p̃_bot. The expectation of X_bot with respect to L_pri is given by

E_pri[X_bot] = Σ_{v_i ∈ V} Pr[v_i ∈ Ṽ*] E_pri[d̃*_i | v_i ∈ Ṽ*] 1{l_bot(i) = bot} / d̃_i   (4)
= Σ_{v_i ∈ V} (1 − p)^2 d̃_i 1{l_bot(i) = bot} / d̃_i = (1 − p)^2 n p_bot.   (5)

Equation (4) holds because of the linearity of expectation and the law of total expectation. Equation (5) holds because E_pri[d̃*_i | v_i ∈ Ṽ*] = (1 − p) d̃_i (see Lemma 4.5 in [35] for the proof). We note that 1{l_bot(i) = bot} and the degree d̃_i are constants with respect to the set L_pri for each node v_i. Similarly, the expectation of the denominator of p̃_bot, Y_bot = Σ_{v_i ∈ Ṽ*} d̃*_i / d̃_i, is E_pri[Y_bot] = (1 − p)^2 n. Therefore, E_pri[p̃_bot] ≈ E_pri[X_bot] / E_pri[Y_bot] = p_bot.
Theorem 1 implies that the expectation of p̃_bot with respect to the set L_pri is almost equal to the true value p_bot when k = d_max. Through the following simulation analysis, we also find that the proposed estimator realizes an accurate estimate of p_bot independent of the proportion of private nodes unless k is too small compared with d_max.

VI. SIMULATION ANALYSIS
In this section, we conduct numerical simulations using four directed social graph datasets to validate whether the proposed estimator is effective on the real Twitter follow graph. We use the largest weakly connected component of the original directed graph for each dataset. Table 1 shows the number of nodes, the number of edges, and the larger of the maximum in-degree of the node and the maximum out-degree of the node for each directed graph used in our simulations.

A. PERFORMANCE OF THE PROPOSED ESTIMATOR
We examine how the accuracy of the proposed estimator varies with the number of retrievable friends or followers per query, i.e., k. We could not find publicly available datasets of directed social graphs including real bot labels of users. Therefore, we consider three methods of synthetically labeling nodes as bots on the graph: Random, LowDegree, and HighDegree. In the Random labeling method, we label node v_i as a bot with uniform probability, i.e., 1/n. In the LowDegree labeling method, we label node v_i as a bot with probability (1/d̃_i) / Σ_{v_j ∈ V} (1/d̃_j). In the HighDegree labeling method, we label node v_i as a bot with probability d̃_i / Σ_{v_j ∈ V} d̃_j. In each method, we repeatedly label a node chosen according to the corresponding probability until the percentage of labeled nodes reaches a given value. We run a single simulation as follows. First, we label 10% of all nodes by using one of the three methods (i.e., Random, LowDegree, or HighDegree). Second, we set the privacy label of each node to private with the given probability p or to public otherwise, according to Assumption 4. Third, we randomly shuffle the orders of the lists of friends and followers of each node independently. Fourth, we select the initial node for our random walk from the largest public cluster uniformly at random. Fifth, we perform our random walk with the sample size r = 10,000 for the given number of retrievable friends or followers per query, i.e., k. We vary k from 1 to d_max and the probability p from 0.0 to 0.3. We use the normalized root mean square error (NRMSE), namely, (E[(p̂_bot/p_bot − 1)^2])^{1/2}, to measure the accuracy of the proposed estimator and its variance, as in [19], [35]. We calculate the NRMSE of the proposed estimator over 100 independent runs. Fig. 4 shows the NRMSE of the proposed estimator as a function of k. The following observations apply to all the datasets.
First, for all three labeling methods, the proposed estimator achieves almost the same NRMSE when the proportion of private nodes varies from 0.0 to 0.3 at k = d_max, which supports our theoretical result in Section V. We also found that the proposed estimator introduces almost no bias due to private nodes when k ≥ 10 for all the labeling methods. Second, the NRMSE is large for small values of k (e.g., k = 1) for all the labeling methods. This is because the number of public nodes reachable from the initial node is considerably small when k is too small, causing bias in the proposed estimator. Third, for all the labeling methods, the NRMSE decreases as k becomes larger, and when k exceeds a certain value, the NRMSE for k is comparable to that for d_max. This critical value of k differs among the labeling methods. For example, in the ego-Twitter dataset, k ≈ 5 for the Random labeling method, k ≈ 100 for the LowDegree labeling method, and k ≈ 300 for the HighDegree labeling method. The intuitive reason why the critical value of k differs among the labeling methods is as follows. When k < d_max, the stationary distribution of nodes is typically biased compared to the vector π shown in Lemma 2 because the set of edges that can be traversed from the initial node is restricted. In the Random labeling method, the degree d̃ of the labeled nodes is uniformly distributed, so the NRMSE rapidly decreases as the value of k increases. In contrast, in the LowDegree and HighDegree labeling methods, the degree d̃ of the labeled nodes is biased toward low and high degrees, respectively, and hence the NRMSE decreases relatively slowly as the value of k increases. In particular, the NRMSE is typically high for small values of k in the HighDegree labeling method because a node with a higher degree d̃ has a larger bias in its stationary distribution.
Fig. 5 shows the maximum and minimum proportions of reachable public nodes on the largest public cluster, depending on the initial node, on the ego-Twitter dataset as a function of k. When k = 1, at most 20% of the nodes in the largest public cluster are reachable. The variance in the proportion is also relatively large when k < 10; e.g., the proportion varies from 0.0 to 0.6 when k = 3. This small number of reachable nodes causes the bias of the estimator using our random walk. When k reaches a certain value, e.g., k ≥ 100, most of the nodes are reachable, and the variance becomes small.
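The error metric and the three labeling methods used in this subsection can be written compactly as follows (a sketch; `nrmse` and `label_bots` are our illustrative names, and the degree data in the example are made up):

```python
import math
import random

def nrmse(estimates, true_value):
    """Normalized root mean square error over independent runs:
    sqrt(E[(estimate/true - 1)^2])."""
    return math.sqrt(sum((e / true_value - 1) ** 2 for e in estimates) / len(estimates))

def label_bots(degrees, fraction, method, seed=0):
    """Label a given fraction of nodes as bots.
    Random: uniform; LowDegree: prob. proportional to 1/degree;
    HighDegree: prob. proportional to degree."""
    rng = random.Random(seed)
    nodes = list(degrees)
    weights = {"Random": [1.0] * len(nodes),
               "LowDegree": [1.0 / degrees[v] for v in nodes],
               "HighDegree": [float(degrees[v]) for v in nodes]}[method]
    target, labeled = int(fraction * len(nodes)), set()
    while len(labeled) < target:  # resample until the fraction is reached
        labeled.add(rng.choices(nodes, weights=weights)[0])
    return labeled

print(nrmse([0.11, 0.09, 0.10], 0.10))  # ≈ 0.0816
```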

B. COMPARISON WITH THE EXISTING RANDOM WALK
We compare the NRMSE of the estimator for the bot population defined in equation (1) using our random walk with that using the existing random walk [35] when we use the same number of queries. In the existing random walk, one repeatedly queries each sampled node until all of the friends and followers of the node are retrieved and then moves to a public node randomly chosen from the union of the friends and followers. When we visit node v_i for the first time, our random walk uses two queries for the node, whereas the existing random walk uses ⌈d_i^in/k⌉ + ⌈d_i^out/k⌉ queries for the given k.
We run a single simulation on each dataset as follows. First, we synthetically label 10% of all nodes as bots by using one of the three methods (i.e., Random, LowDegree, or HighDegree). Second, we set the privacy label of each node to private with the given probability p or to public otherwise, according to Assumption 4. We set p = 0.2 because we confirmed that the proposed estimator introduces almost no bias due to private nodes at this value (see Fig. 4). Third, we randomly shuffle the orders of the lists of friends and followers of each node independently. Fourth, we select an initial node for our random walk and the existing random walk from the largest public cluster uniformly at random. Fifth, we perform our random walk and the existing random walk using a given number of queries. For each dataset, the value of k is set to the smallest value such that Pr[d̃ > k] ≤ 0.05 or 0.01 holds, where d̃ = max(d^in, d^out) (i.e., the 95th and 99th percentiles of the larger of the in-degree and out-degree of a node). We calculate the NRMSE of each estimator over 100 independent runs. Fig. 6 shows the NRMSE of each estimator as a function of the number of queries for each labeling method on the ego-Twitter dataset. The number of queries for the ego-Twitter dataset varies from 1,000 to 5,000 because of the small size of the graph. Fig. 7 shows the NRMSE of each estimator using 20,000 queries for each labeling method on the YouTube, Higgs, and Flickr datasets. The following observations apply to all the datasets. First, for the Random and LowDegree labeling methods, the estimator using our random walk typically achieves a smaller NRMSE than that using the existing random walk for both values of k when the same number of queries is used. For example, our random walk improves the NRMSE by 77% and 69% on the Flickr dataset for the Random and LowDegree methods, respectively, when k = 50.
Second, for the HighDegree labeling method, the NRMSE of the estimator using our random walk is larger than that using the existing random walk for both values of k. This is because, when k is small, our random walk biases the stationary distribution of labeled nodes that have a high degree d̃.

C. DISCUSSION
Based on our numerical experiments, we found that the proposed estimator is effective when the following two conditions are satisfied. First, the bot nodes should be either distributed uniformly throughout the graph or biased toward small numbers of friends and followers. Second, for the given number of retrievable friends or followers per query, i.e., k, the proportion of nodes with more than k friends or followers (i.e., nodes satisfying d̃_i > k) should be sufficiently small.
We consider that the real Twitter follow graph satisfies these two conditions, based on the following empirical evidence. First, several studies have empirically shown that Twitter bots tend to have small numbers of friends and followers. Stieglitz et al. found that bots tend to have fewer followers than humans [27]. A study of a group of over 350,000 bots [40] found that these bots had very few friends and followers: fewer than 31 friends and fewer than 10 followers. Second, several studies have reported that the percentage of users with more than 5,000 friends or followers is quite small. Note that the Twitter APIs restrict the maximum number of retrievable friends or followers per query to 5,000 as of December 2021 [23], [24]. According to a previous study [15], fewer than 0.1% of all users had 5,000 or more friends or followers on Twitter as of 2009. Myers et al. reported that 5% of all users had 470 or more friends or followers on Twitter as of 2012 [36], which suggests that the percentage of users with 5,000 or more friends or followers is well below 5%. Therefore, based on our simulation analysis, we expect that the proposed estimator provides a reasonable estimate of the bot population on Twitter.
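As a rough self-check, the two conditions can be tested on arrays of node degrees and bot labels. This is only an illustrative sketch: the function name is hypothetical, and comparing bots' mean degree against the overall mean degree is our own simplification of condition 1, not the paper's criterion.

```python
import numpy as np

def satisfies_conditions(degrees, is_bot, k, eps=0.01):
    """Heuristic check of the two conditions, given each node's
    d_i = max(in-degree, out-degree) and a boolean bot label."""
    d = np.asarray(degrees, float)
    bot = np.asarray(is_bot, bool)
    # Condition 2: the proportion of nodes with d_i > k is small.
    cond2 = np.mean(d > k) <= eps
    # Condition 1 (proxy): bots are not biased toward high-degree nodes,
    # i.e., their mean degree does not exceed the overall mean degree.
    cond1 = d[bot].mean() <= d.mean()
    return bool(cond1 and cond2)
```

Under this proxy, a LowDegree-style labeling passes the check, while a HighDegree-style labeling fails it, mirroring the simulation results above.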

VII. ESTIMATION ON TWITTER
A. METHOD
We performed our random walk three separate times on the real Twitter follow graph. We selected the initial users as the first author's account, Jack Dorsey's account, and Biz Stone's account. We collected r = 25,000 sampled users each time. Then, we discarded the first 5,000 sampled users to sufficiently remove the dependency on the initial user, as in the previous case study of random-walk sampling on the Facebook graph [18]. The periods of data collection were April 13–30, May 6–23, and May 23–June 9, 2021.
We used three endpoints of the Twitter APIs. We called the endpoints GET friends/ids and GET followers/ids to retrieve the friends and followers of a public user [23], [24]. As of December 2021, we are allowed to send such queries up to 15 times per 15 minutes. We identified whether a user was private or public by calling the endpoint GET statuses/show/:id, which allows 900 calls per 15 minutes as of December 2021 [41].
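Under this rate limit (15 friend/follower queries per 15-minute window), the query budget works out as follows; a back-of-the-envelope sketch assuming one such query per sampled user and continuous querying.

```python
# 15 queries per 15-minute window, and 96 such windows per day.
queries_per_day = 15 * (24 * 60 // 15)    # 1,440 queries per day
days_for_25k = 25_000 / queries_per_day   # ~17.4 days, i.e., about 2.5 weeks
```

This matches the roughly 2.5-week collection periods reported above for each sample sequence of 25,000 users.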
Twitter does not officially provide an API to determine whether a given Twitter user is a bot. Therefore, we used Botometer v4 [30] to determine whether each sampled public user was a bot. Botometer calculates a score that gives the conditional probability that accounts with an equal or greater score are automated. We labeled users with scores equal to or greater than a given threshold θ as bots. We labeled users whose scores could not be calculated due to problems such as an insufficient number of tweets as non-bots. Fig. 8 shows the estimates calculated from the 1st to the r-th sampled users (1,000 ≤ r ≤ 20,000) for the three sample sequences, with the first 5,000 sampled users of each discarded.

FIGURE 8. Estimates of the bot proportion on Twitter using our random walk as a function of the sample size. We set the threshold θ for the Botometer score to the given value.
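The labeling rule above can be sketched as follows; `label_bots` is a hypothetical helper, with a score of None standing for users whose Botometer score could not be computed.

```python
def label_bots(scores, theta):
    """Map user IDs to bot labels: a user is a bot iff its score exists
    and is at least the threshold theta; unscorable users are non-bots."""
    return {uid: (s is not None and s >= theta) for uid, s in scores.items()}
```

For example, `label_bots({"u1": 0.97, "u2": 0.50, "u3": None}, theta=0.95)` labels only "u1" as a bot.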

B. RESULTS
We found that all three estimates converge to ≈ 0.08 when we set θ = 0.95 and to ≈ 0.18 when we set θ = 0.90, illustrating the robustness of the proposed estimator with respect to the choice of initial user. Our estimate of 8%–18% is broadly consistent with Varol et al.'s estimate of 9%–15% as of 2015 [9] and Luceri et al.'s estimate of 12.6% as of 2019 [11].
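The estimates above correct for the random walk's preference for high-degree nodes. The following sketch shows the standard degree-reweighted form of such a correction; the paper's own estimator (defined earlier in the paper) may differ in detail, so this is an assumption-laden illustration rather than a reimplementation.

```python
import numpy as np

def reweighted_bot_fraction(degrees, is_bot):
    """Estimate the bot proportion from a random-walk sample by weighting
    each sampled node by 1/degree, which corrects for the random walk's
    stationary probability being proportional to degree."""
    w = 1.0 / np.asarray(degrees, float)
    return float(np.sum(w * np.asarray(is_bot, float)) / np.sum(w))
```

For instance, a sample of three nodes with degrees [1, 1, 2], of which only the first is a bot, yields an estimate of 0.4 rather than the naive 1/3.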
We performed the existing random walk [35] independently multiple times, for 10 days each, in October and November 2021. However, each run of the existing random walk collected only a few dozen sampled users, which suggests that the existing random walk is much less query-efficient than our random walk. This is because the existing random walk requires a large number of queries to retrieve all the friends or followers of a user with an enormous number of friends or followers. In fact, the existing random walk visited Katy Perry, who has over one hundred million followers, and retrieving all of her followers via the Twitter APIs took more than 5 days [24]. The existing random walk requires an exorbitant number of queries in total because it tends to frequently traverse users with enormous numbers of friends or followers.
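The cost of one such visit can be estimated with simple arithmetic, assuming 5,000 followers retrieved per query and roughly one query per minute under the rate limit; actual runtimes depend on rate-limit handling, so this is only an order-of-magnitude sketch.

```python
followers = 100_000_000        # over one hundred million followers
queries = followers // 5_000   # 20,000 queries to page through all followers
days = queries / (24 * 60)     # at ~1 query per minute: ~13.9 days
```

This is consistent with the multi-day retrieval reported above for a single high-degree user.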

VIII. CONCLUSION
In this study, we proposed a random-walk-based estimator of the bot population on Twitter. We addressed two major problems in estimating the bot population on Twitter: (i) the maximum number of retrievable friends or followers per query is restricted to 5,000, and (ii) a certain percentage of users are private. We obtained estimates of the bot population on Twitter between 8% and 18%, which is consistent with the estimates reported in previous studies. Our future work includes improving the accuracy of bot detectors (e.g., Botometer), which would in turn improve the estimation accuracy of the proposed method.