Cross-Country Analysis of User Profiles for Graph-Based Location Estimation

Social media is used all over the world. Although the home location attribute is available anywhere, few studies have been conducted on the differences in the performance of home location estimation between countries. Because social media is used differently in different countries, the difference between countries in terms of users’ relationships needs to be investigated. We consider the performance of graph-based home location estimation in capturing users’ relationships, as it is affected by users’ social media usage. We analyze user profiles in 10 countries by graph-based estimation and find that some profile attributes, in particular, contribute to the estimation performance for all the countries examined.


I. INTRODUCTION
Social media is widely used everywhere in the world. People obtain and post news online. Social media data have been used for detecting phenomena such as influenza epidemics [1], earthquakes [2], and global mobility patterns [3]. In these cases, users' home locations are utilized to combine real-world information with information obtained online. Aggregated tagged locations in users' posts or self-declared locations in users' profiles are ways of determining users' home locations [4]. However, information regarding users' home locations is sometimes missing; therefore, researchers have developed methods for estimating home locations from other information [5].
The social media outlet Twitter is used to post short texts called "tweets." Because users post about themselves, a user's home location can be estimated from the location-related words included in their posts [6]. This estimation method is known as a content-based approach. Users also tend to interact and develop relationships with people similar to them on social media [7]. On the basis of this characteristic, a graph-based approach has been proposed for estimating users' home locations [8]; it relies on a closeness assumption, according to which users on social media connect with other users located geographically close by. Location estimation methods utilize these characteristics alone or in combination; the combined form is called a hybrid approach.
The associate editor coordinating the review of this manuscript and approving it for publication was Yufeng Wang.
Alonso-Lorenzo et al. [9] reported that estimation performance using a hybrid approach differs by country. The differences in estimation performance are affected by language and social media usage. A content-based approach depends on the language used in tweets, and different languages are used in different countries. Therefore, it is necessary to employ text processing suited to the users' language, such as segmentation or tokenization. Unlike a content-based approach, the same graph-based method is expected to be applicable even if the users use different languages. Because behaviors on social media differ by user demographics [10], the relationships among users should differ depending on the country. Therefore, the performance of graph-based estimation needs to be investigated.
To analyze the difference between countries in terms of user relationships, we compare user profiles with the performance of location estimation. We assume that a user's profile represents the characteristics of the user and that the user's relationships are explained by the user's profile because these are created by or for the user. We then use graph-based location estimation to measure the characteristics of the user's relationships.
Our research questions are as follows:
RQ1 Does the performance of graph-based location estimation vary by country?
RQ2 Is there a profile attribute that contributes to estimation performance? Is there a difference in the profile attributes contributing to estimation performance between countries?
RQ3 What characteristics of profile attributes contribute to estimation performance after considering the correlation between attributes?
RQ4 Can the language used explain the differences in the profile attributes contributing to estimation performance between countries?
We collect more than 1,000,000,000 geotagged tweets of more than 10,000,000 users and select more than 6,000,000 users for analysis. The number of unique places is more than 90,000. We select the 10 countries with the most users. We focus on 12 profile attributes, including the indirect attribute follow ratio, which is obtained from two other attributes that can be directly retrieved from the Twitter API. We roughly compare countries using filters based on profile attributes. We find that at least one attribute behaves similarly across all 10 countries.

II. RELATED WORK
Many home location estimation methods have already been proposed [4], [5]. These methods employ tweet contents or social graphs. In addition, other profile attributes such as location text [11], description [12], and time zone [11]-[13] can be employed to estimate a user's home location. Profile attributes also contribute to estimation performance.
Some studies have mentioned users whose relationships are not related to geographic distance, many of whom are celebrity users who have many followers or are mentioned by many users. McGee et al. [14] found that users with many followers tend to be distant from connected users. Rahimi et al. [15] showed that celebrities, who are frequently mentioned users, interfere with graph-based estimation. Ebrahimi et al. [16] categorized celebrities into global and local types and showed that their approach, excluding global celebrities who are mentioned worldwide, outperforms state-of-the-art estimation methods. Tian et al. [17] also proposed an algorithm to find global celebrities.
Some users who pose estimation difficulties have other characteristics. Hironaka et al. [18] compared estimation difficulty with users' centrality and found that users with high hub and authority scores calculated by the HITS algorithm had difficult-to-estimate locations. Users with high authority scores are considered celebrities, whereas those users with high hub scores are the ones who follow many celebrities and constitute a different type of user. McGee et al. [14] also found that users who follow many users tend to be distant from connected users.
Location estimation depends on the behaviors of users on social media. Sloan and Morgan [19] investigated the use of geotagging and user demographics, such as the languages used. They showed that the rate of posting geotagged tweets varies depending on the language used by the user. However, the country in which the user lives has not been analyzed because this information cannot be directly obtained from the profile. Hedayatifar et al. [20] analyzed the geographical structure of the Twitter communication network. They found that languages and national borders appeared as breaks in communication. Tominaga et al. [21] surveyed self-disclosure in profile, usage objectives, and anonymity consciousness on Twitter, targeting users in the United States, India, and Japan, and investigated how self-disclosure is influenced by usage objectives and anonymity consciousness. They suggested that cultural differences exist on social media and that social media is used in different ways depending on the culture.
In this paper, we analyze the differences in graph-based location estimation between countries. Previous studies have focused either on a single country or the whole world. In contrast, we compare estimation performance between countries.

III. DATA
We prepare datasets for each country. A dataset contains home location data used for training and testing, social graph data constructed from relationships between users, and profile data that are used to obtain user characteristics.
Data collection proceeds as follows: First, we collect geotagged tweets using the Twitter Streaming API and assign a home location to each user. Next, we collect their follower-following relationships and construct a social graph. Finally, we extract profile data from the collected geotagged tweets.

A. HOME LOCATION
Home location is a user location attribute. In this paper, we assume that users use social media around the area where they actively post geotagged tweets. We consider the area where the user posted most of their tweets as the home location of the user.
Area data are provided by some governments, but collecting these for all countries is difficult. In this paper, we used the place object (footnote 1) that is assigned to a geotagged tweet as an area. A place object contains a place ID, name, country, place type, and bounding box. A place type likely indicates an area's granularity (footnote 2). We assign city-level home locations to users using place objects whose place type is "city."
We collected 1,495,799,237 geotagged tweets from January 1 to December 31, 2019, using the Twitter Streaming API (footnote 3). The number of users who posted geotagged tweets with a city-level place type at least once was 14,917,399, and the number of users who posted geotagged tweets at least 10 times was 7,047,133 (47.2%). We then calculated the number of tweets per area for each user. For users who posted at least 10 times, we assigned as the home location the area where the user posted most frequently. A few tweets with place objects whose country_code or bounding_box was null were excluded. There were no other exclusion rules, for example for auto-posting users or celebrities. Finally, we obtained home location data for 6,685,718 users, which included 92,461 unique places.
Footnote 1: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/objectmodel/place (viewed 2021-04-08)
Footnote 2: The official documentation does not explain the possible values of place type. Five types (country, admin, city, neighborhood, and poi) appear in the data collected in this paper. We assumed that the size of the area would decrease in this order.
Footnote 3: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filterrealtime/api-reference/post-statuses-filter (viewed 2021-04-08)
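The assignment rule described above (count tweets per area for each user, require at least 10 tweets, take the most frequent area) can be sketched as follows. This is a minimal illustration, not the authors' code: the input format of (user_id, place_id) pairs and the function name are assumptions, and the text does not say how ties between equally frequent areas are resolved.

```python
from collections import Counter, defaultdict

MIN_TWEETS = 10  # minimum number of city-level geotagged tweets, as stated in the text


def assign_home_locations(tweets, min_tweets=MIN_TWEETS):
    """Map each user to the place ID they posted from most often.

    `tweets` is an iterable of (user_id, place_id) pairs, one per
    city-level geotagged tweet (a hypothetical, simplified input format).
    """
    counts = defaultdict(Counter)
    for user_id, place_id in tweets:
        counts[user_id][place_id] += 1

    home = {}
    for user_id, per_place in counts.items():
        if sum(per_place.values()) < min_tweets:
            continue  # too few tweets to assign a home location
        # Assumption: ties go to the place seen first; the paper does not specify this.
        home[user_id] = per_place.most_common(1)[0][0]
    return home
```

With this sketch, a user with seven tweets in one city and three in another receives the first city as the home location, while a user with only nine tweets receives none.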
The geographic distribution of the users with home locations is shown in Figure 1. Each user's coordinate is calculated as the centroid of the bounding box of the user's home location. Figure 2 shows the ranking of the number of users by country (Top 40). In this paper, we created datasets for the top 10 countries by the number of users: in order, the United States, Brazil, the United Kingdom, Japan, the Philippines, Turkey, Indonesia, India, Mexico, and Saudi Arabia. The number of users in each dataset is |V |, as shown in Table 1. The United States had the most users, followed by Brazil.

B. SOCIAL GRAPH
We collect data on the relationships between users with home locations and construct a social graph based on these relationships. First, we collect the followees and followers of the users with assigned home locations using the Twitter API. 4 Then, we create an edge between two users with assigned home locations if they follow each other.
We collected data and constructed a social graph for each country. Table 1 shows the basic statistics of the social graphs.
Here, |V| is the number of nodes (users), |I| is the number of users who did not have any mutual following relationships (isolated nodes), |E| is the number of edges, K_out is the average degree (the number of mutual friends), S_out is the variance of the degree, and M_out is the median of the degree. Users in the United States tend to have fewer mutual followings than users in Brazil. India has more isolated users than the other countries.
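The graph construction in this subsection (an undirected edge only where two located users follow each other) can be illustrated with a short sketch. The input structure `followees` and both function names are assumptions made for the example; the actual collection through the Twitter API is not shown.

```python
def build_mutual_graph(followees, users):
    """Return an undirected adjacency dict with one edge per mutual follow.

    `followees[u]` is the set of accounts that u follows, and `users` is the
    set of users with an assigned home location (both hypothetical inputs).
    """
    graph = {u: set() for u in users}
    for u in users:
        for v in followees.get(u, ()):
            # Keep the edge only if v is also a located user and follows u back.
            if v in graph and v != u and u in followees.get(v, ()):
                graph[u].add(v)
                graph[v].add(u)
    return graph


def count_isolated(graph):
    """Number of nodes without any mutual-follow edge (|I| in Table 1)."""
    return sum(1 for neighbors in graph.values() if not neighbors)
```

Representing neighbors as sets keeps the graph undirected and makes the later sub-graph filtering a simple set intersection.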

C. PROFILE ATTRIBUTES
We use the given user profiles for analysis. Each tweet has a user profile (User object 5 ) attached as of the time of posting. We extract the newest profile of each user from the geotagged tweets collected in Section III-A.
We employ the following profile attributes as user characteristics: length of the username (screen_name), length of the name (name), length of the location (location), length of the description (description), number of followees (friends), number of followers (followers), number of likes (favorites), number of lists (listed), and number of total tweets (statuses). Text length is counted as the number of UTF-8 characters, including spaces. In addition, we employ the following calculated attributes: the number of days since the account was created (spent_days), average number of tweets per day (avg_tweets), and ratio of followees to followers (follow_ratio). spent_days is defined as the number of days until January 1, 2020, and follow_ratio is defined as (friends + 1)/(followers + 1).
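The three calculated attributes can be written out as a small sketch. The definitions of spent_days and follow_ratio follow the text; computing avg_tweets as statuses divided by spent_days, and guarding against a zero-day account age, are assumptions, since the text only calls it the average number of tweets per day.

```python
from datetime import date

REFERENCE_DATE = date(2020, 1, 1)  # spent_days is counted up to this date


def derived_attributes(friends, followers, statuses, created_on):
    """Compute spent_days, avg_tweets, and follow_ratio for one profile."""
    spent_days = (REFERENCE_DATE - created_on).days
    # Assumption: avg_tweets = statuses / spent_days, with a one-day floor
    # so that an account created on the reference date does not divide by zero.
    avg_tweets = statuses / max(spent_days, 1)
    # The +1 terms keep the ratio defined for accounts with no followers.
    follow_ratio = (friends + 1) / (followers + 1)
    return {
        "spent_days": spent_days,
        "avg_tweets": avg_tweets,
        "follow_ratio": follow_ratio,
    }
```

For example, an account created on January 1, 2019 with 99 followees, 9 followers, and 730 tweets has spent_days = 365, avg_tweets = 2.0, and follow_ratio = 10.0.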
These profile attributes are categorized into three groups: degree-related attributes, text-related attributes, and activity-related attributes. The degree-related attributes are friends, followers, and follow_ratio. The text-related attributes are screen_name, name, location, and description. The activity-related attributes are favorites, listed, statuses, spent_days, and avg_tweets.
We consider degree-related attributes as directly affecting location estimation performance because these attributes are related to the shape of the social graph. It has been reported that the location of users with too many followers or followees is difficult to estimate [22]. The degree is also used to detect celebrities whose location is difficult to estimate [16].
We consider text-related attributes to relate to users' self-expression. Text length was employed because many languages are used in the data. Within the same language, a larger number of characters carries a larger amount of information. We assume that the amount of self-declared information is related to the number of online relationships. Wesslen et al. [23] reported that users add words to or remove words from their description. For example, Trump supporters add the word "Trump" to their descriptions, so we assume that users with a long description have more context. Shima et al. [24] also reported that users change their description and name. They found that users add additional information such as the current state or trend to their name. As with descriptions, longer names contain more information. We consider text-related attributes to depend on the habits and cultures of each country.
We consider that the level of activity on social media also affects location estimation. For example, an influencer's high activity may make their location difficult to estimate. In contrast, the locations of users with low activity are also difficult to estimate.

IV. EXPERIMENTAL METHOD
Home location estimation is a task that estimates an unknown user's home location from a social graph and known home location labels. User profiles of all users are given. We conduct two experiments to answer the research questions. In the first experiment, we evaluate the estimation performance for each country. In the second experiment, we investigate how estimation performance relates to profile attributes.
The experiments proceed in the following steps. First, we filter the users in the social graph using the user profile attributes. In the first experiment, the filter is not applied. Second, we construct a social graph with the filtered users. This social graph is a sub-graph of the social graph in the dataset. Then, we estimate the home locations of the users in the test data using the location labels in the training data and the sub-graph and evaluate the performance with the test data.

A. FILTERING SOCIAL GRAPH
The filter described here filters given users based on a single profile attribute. Then, the filtered users are used to construct a filtered social graph.
In this paper, we use the following two filters based on profile attributes: a high-cut filter, which removes users whose attribute value is higher than the threshold θ, and a low-cut filter, which removes users whose attribute value is lower than the threshold θ. Each filter has the threshold θ as a parameter. The filters are expressed as follows:
$$\mathrm{HighCut}(U, \theta) = \{\, u \in U \mid a_u \le \theta \,\}$$
$$\mathrm{LowCut}(U, \theta) = \{\, u \in U \mid a_u \ge \theta \,\}$$
where U is the set of users targeted by the filtering, and a_u is the profile attribute value of user u. The filtered social graph is the sub-graph of the social graph in the dataset whose nodes are the filtered users. Excluded users are not in the graph. Edges that connect to excluded users are also excluded from the graph.
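As a rough illustration of the two filters and the induced sub-graph, the following sketch keeps the users that survive a filter and then drops every edge that touches an excluded user. The function names and the dict-of-sets graph representation are assumptions for the example, and treating a value equal to θ as kept is one reading of the verbal definition.

```python
def high_cut(users, attr, theta):
    """Keep users whose attribute value is not higher than theta."""
    return {u for u in users if attr[u] <= theta}


def low_cut(users, attr, theta):
    """Keep users whose attribute value is not lower than theta."""
    return {u for u in users if attr[u] >= theta}


def induced_subgraph(graph, kept):
    """Sub-graph on `kept`: excluded nodes and their edges are removed.

    `graph` maps each user to the set of users they are connected to.
    """
    return {u: graph[u] & kept for u in graph if u in kept}
```

For instance, applying a high-cut filter on followers with θ = 1000 removes accounts with more than 1000 followers, and `induced_subgraph` then removes their edges from every remaining user's neighbor set.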

B. HOME LOCATION ESTIMATION METHOD
Graph-based home location estimation estimates a user's home location from relationships between users while assuming that connected users live geographically close to each other. In this experiment, we use the method proposed in a previous paper [22]. This method selects, as the estimated location, the home location that appears most frequently among the home locations of connected users. The method is simple and has been reported to be accurate [8].
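The majority-vote rule described above can be sketched in a few lines. This is an illustrative reading of the description rather than the implementation of [22]: the function name and data structures are assumptions, and ties are broken by the smallest label only to make the example deterministic.

```python
from collections import Counter


def estimate_home(user, graph, home):
    """Return the most frequent home location among the user's neighbors.

    `graph` maps users to sets of connected users and `home` maps users to
    known home location labels. Returns None when no neighbor has a label,
    in which case the user's location cannot be estimated.
    """
    counts = Counter(home[v] for v in graph.get(user, ()) if v in home)
    if not counts:
        return None
    # Assumption: ties are broken by the smallest label; the text does not
    # state a tie-breaking rule.
    return min(counts, key=lambda label: (-counts[label], label))
```

Because the estimate uses only the neighbors' labels, hiding the target user's own label, as in leave-one-out evaluation, requires no change to the function.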

C. EVALUATION METRICS
We use precision, recall, F1, and coverage to measure estimation performance. Precision is the ratio of the number of users whose home location has been correctly estimated to the number of users whose locations are estimated. Recall is the ratio of the number of users whose home location has been correctly estimated to the number of users in the test data. F1 is the harmonic mean of precision and recall. Coverage is the ratio of the number of users whose locations are estimated to the number of users in the test data.
The metrics are expressed as follows:
$$\mathrm{Precision} = \frac{|\{\, u \in E \mid \hat{l}_u = l_u \,\}|}{|E|}$$
$$\mathrm{Recall} = \frac{|\{\, u \in E \mid \hat{l}_u = l_u \,\}|}{|T|}$$
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{Coverage} = \frac{|E|}{|T|}$$
where T is the set of users in the test data, E is the set of users whose home locations are estimated (E ⊆ T), l_u is the correct home location of user u, and \hat{l}_u is the estimated home location of user u (every u ∈ E has \hat{l}_u).
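The four metrics can be computed directly from the set definitions. The sketch below assumes two dictionaries, one with the correct labels of the test users T and one with the estimates for the users in E; the function name and the zero-division conventions are choices made for the example.

```python
def evaluate(true_home, estimated):
    """Compute precision, recall, F1, and coverage.

    `true_home` maps every test user (T) to the correct label and
    `estimated` maps the users with an estimate (E, a subset of T)
    to the estimated label.
    """
    n_test = len(true_home)
    n_est = len(estimated)
    n_correct = sum(1 for u, label in estimated.items() if true_home[u] == label)

    # Assumption: a metric with an empty denominator is reported as 0.0.
    precision = n_correct / n_est if n_est else 0.0
    recall = n_correct / n_test if n_test else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    coverage = n_est / n_test if n_test else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "coverage": coverage}
```

Note that recall equals precision multiplied by coverage, so a filter that raises precision while sharply lowering coverage can still lower recall.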

V. RESULTS
We first compared the estimation performance between countries. Next, we analyzed the correlations between profile attributes. Then, we analyzed the relationship between user profiles and their relationships using home location estimation.

A. ESTIMATION PERFORMANCE BY COUNTRY
In this experiment, we evaluated the performance of home location estimation for each country. Estimation performance was evaluated by leave-one-out cross-validation using the home location data and social graph data in each country's dataset. The results of each dataset are shown in Table 2.
We also show the number of unique labels |A| and the average size of the areas in the table. Size is defined as the average diagonal length, in kilometers, of the areas' bounding boxes. Estimation is assumed to become more difficult as the number of unique labels or the size increases. In Table 2, the highest F1 of 0.60 is achieved by Brazil, and the lowest F1 of 0.13 is obtained for Indonesia. The United States and the United Kingdom have more unique labels than other countries. However, the F1s of Japan and Indonesia are lower than those of the United States and the United Kingdom. The smallest number of unique labels is 172 in Saudi Arabia, which also has the second highest F1. The sizes of Brazil and Saudi Arabia are greater than those of the other countries and their F1s are relatively high, although Mexico's F1 is not.
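The size measure can be illustrated as the great-circle distance between the south-west and north-east corners of a bounding box. The paper does not state which distance formula was used, so the haversine formula and the Earth radius below are assumptions, as is the (west, south, east, north) argument order.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def bbox_diagonal_km(west, south, east, north):
    """Haversine distance between the SW and NE corners, in kilometers.

    Coordinates are in decimal degrees (longitude for west/east,
    latitude for south/north).
    """
    lat1, lat2 = math.radians(south), math.radians(north)
    dlat = lat2 - lat1
    dlon = math.radians(east - west)
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def average_size_km(bboxes):
    """Average diagonal length over a non-empty list of bounding boxes."""
    return sum(bbox_diagonal_km(*b) for b in bboxes) / len(bboxes)
```

Averaging this diagonal over all unique areas of a country gives one number per country that is comparable across datasets.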
The granularity of place information from Twitter differs by country, even when only city-level places are chosen. Saudi Arabia is not a small country and has about 10% as many users as the United States, but its number of unique labels is only 1% of that of the United States, which is less than expected. Moreover, the total area of the United Kingdom is about one-fortieth of that of the United States, but its number of unique labels is two-thirds that of the United States. Accordingly, the number of unique labels and the average size of areas differ by country. Although the number of unique labels and the average size of areas are expected to affect the estimation performance to a certain extent, the difference in estimation performance shown in Table 2 cannot be completely explained by them. Hence, we conclude that estimation performance differs by country.
If the relationships among users had similar characteristics in these countries, the difference in estimation could have been explained by the number of areas and the area size. However, this was not the case, suggesting that further analysis of users' relationships is required. For this purpose, the absolute value of the performance of home location estimation may not be suitable.

B. CORRELATION BETWEEN PROFILE ATTRIBUTES
We focus on profile attributes to analyze users' relationships. Some combinations of attributes are assumed to be strongly correlated, such as friends and followers, and statuses and avg_tweets. To understand these results, correlations between attributes need to be considered before further analysis.
We calculated Spearman's rank correlation coefficient between the profile attributes in each dataset. Owing to space limitations, we only show the results of the United States, Japan, and India in Figure 3. Among the degree-related attributes, friends and followers are moderately correlated (0.687). Among the activity-related attributes, avg_tweets and statuses are strongly correlated (0.925), and avg_tweets and favorites are moderately correlated (0.641). We categorized the profile attributes into three groups in Section III-C. Correlation tends to be high within the same group. There is little correlation between the text-related attributes and other attributes.
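For reference, Spearman's rank correlation is the Pearson correlation of the rank-transformed values. The sketch below is a plain standard-library version with average ranks for ties; it is meant to show the computation, not to reproduce the tooling used for Figure 3, and the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks.

    Assumes both sequences have the same length and are not constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

A rank-based coefficient is a natural fit here because counts such as followers and statuses are heavily skewed, and ranks are unaffected by monotone transformations such as the logarithm.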
Analysis showed that correlations between profile attributes were similar among different countries. It is thus reasonable to conduct an analysis based on the increases or decreases in the value of profile attributes.

C. PROFILE ATTRIBUTES CONTRIBUTING TO ESTIMATION
To analyze the profile attributes that contribute to estimation for each country, we filtered the social graphs using the two filters described in Section IV-A and evaluated precision using the filtered graph by leave-one-out cross-validation. The two filters require a threshold θ to filter users. We changed the threshold θ from the minimum value to the maximum value of each attribute. In the experiment, all possible thresholds of screen_name, name, location, and description were selected. The threshold range of spent_days was divided into 100 equal sections. The threshold ranges of friends, followers, favorites, statuses, listed, avg_tweets, and follow_ratio were divided into 100 equal sections after they had been logarithmically transformed. We combined two filters, 100 thresholds, and all attributes.
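The two threshold grids can be sketched as follows. Several details are assumptions because the text does not give them: the sketch returns the 101 section boundaries rather than 100 thresholds, and it uses log(1 + x) so that zero counts (for example, zero followers) remain valid on the logarithmic scale.

```python
import math


def linear_thresholds(lo, hi, sections=100):
    """Boundaries of `sections` equal-width sections between lo and hi."""
    step = (hi - lo) / sections
    return [lo + i * step for i in range(sections + 1)]


def log_thresholds(lo, hi, sections=100):
    """Boundaries that are equally spaced after a log transform.

    Assumption: log1p/expm1 are used so that a minimum value of 0 is allowed.
    """
    a, b = math.log1p(lo), math.log1p(hi)
    step = (b - a) / sections
    return [math.expm1(a + i * step) for i in range(sections + 1)]
```

On the logarithmic grid the thresholds are dense among small values and sparse among large ones, which suits heavy-tailed attributes such as followers.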
The results are shown as follows. First, we find the threshold when the maximum precision is achieved and coverage is greater than 0.05 for each filter and attribute. Then, we test whether the precision is significantly improved by the filtering. The result is marked as ''+'' if the precision is significantly better than the precision without filter for a given condition. If we find that the precision is significantly improved, we consider the attribute as contributing to the estimation performance.
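The selection step (the threshold with the highest precision among those whose coverage exceeds 0.05) can be written as a short helper. The tuple layout and the function name are assumptions for the example.

```python
MIN_COVERAGE = 0.05  # coverage must be greater than this value, as stated in the text


def best_threshold(results, min_coverage=MIN_COVERAGE):
    """Pick the (theta, precision, coverage) tuple with the highest precision.

    `results` is an iterable of (theta, precision, coverage) tuples for one
    filter and one attribute. Returns None if no threshold has enough coverage.
    """
    eligible = [r for r in results if r[2] > min_coverage]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r[1])
```

The coverage condition prevents a threshold from being selected when it yields high precision only because almost all users were removed.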
TABLE 4. Testing result of precision improvement by filtering using text-related attributes. "+" means that precision with filtering using the attribute was improved over precision without filtering. Long name and description contributed to difficulty in estimation.

TABLE 5. Testing result of precision improvement by filtering using activity-related attributes. Activity-related attributes contributed to estimation performance.

Significance testing is conducted as follows. Let n be the number of users whose location is estimated, and x be the number of users whose home location is correctly estimated. Precision p is calculated as p = x/n. When n users are randomly sampled, the distribution of the precision follows a normal distribution with mean p and variance p(1 − p)/n: p ∼ N(p, p(1 − p)/n). Then, we calculate the confidence interval of p. We judge that the precision has improved if the precision is better than that without filtering and the 95% confidence interval does not overlap with that without filtering.
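The test can be sketched with the normal approximation described above. The two-sided 95% critical value and the function names are assumptions for the example; the decision rule (higher precision and non-overlapping intervals) follows the text.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% critical value of the standard normal


def precision_interval(x, n, z=Z_95):
    """Normal-approximation confidence interval for precision p = x / n."""
    p = x / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def significantly_improved(x_filtered, n_filtered, x_base, n_base):
    """True if filtered precision is higher and the 95% intervals do not overlap."""
    lower_filtered, _ = precision_interval(x_filtered, n_filtered)
    _, upper_base = precision_interval(x_base, n_base)
    higher = x_filtered / n_filtered > x_base / n_base
    return higher and lower_filtered > upper_base
```

For example, with 1000 estimated users in both conditions, an increase from 600 to 700 correct estimates is judged an improvement, whereas an increase from 600 to 610 is not, because the intervals overlap.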
The results for degree-related, text-related, and activity-related attributes are shown in Tables 3, 4, and 5, respectively.
In Table 3, precision is improved using the degree-related attributes friends, followers, and follow_ratio. The users with high friends and low followers are difficult to estimate in many countries. The results indicate that the home locations of users with high follow_ratio are difficult to estimate in all countries. All attributes related to the shape of the social graph are effective in the estimation. In particular, follow_ratio is the only attribute that is effective in all countries.
As shown in Table 4, precision is improved using the text-related attributes name, location, and description. The results show that the locations of users with long names, location texts, and description texts are difficult to estimate in certain countries. This result indicates that text length affects the estimation performance regardless of language. The results also show that the locations of users with short location texts and descriptions are difficult to estimate in some countries. In addition, screen_name contributes to estimation performance in a few countries. The results suggest that older users are more likely to have shorter usernames and new users tend to have longer usernames. However, new users can choose a short username, and there is not much difference between countries.
In Table 5, precision is improved using the activity-related attributes favorites, listed, statuses, spent_days, and avg_tweets. The locations of users with high favorites, low listed, low statuses, high spent_days, and low avg_tweets tend to be difficult to estimate. The results indicate that either too low or too high activity on social media can make estimation difficult.

VI. DISCUSSION
We compared the estimation performance between countries to evaluate each country's dataset. We then analyzed the relationships between user profiles and their relationships using home location estimation. We found that estimation performance varies by country, and that degree-related attributes, text-related attributes, and activity-related attributes affect the difficulty of estimation.
In this section, we discuss the background of the attributes contributing to graph-based location estimation. First, we discuss why each profile attribute contributed to the estimation performance. In addition, we discuss the fact that different attributes were obtained for different countries, focusing on the language used in each country.

A. FACTOR OF ESTIMATION DIFFICULTY
We can break down the difficulty of estimation into two factors. One is that there are few clues for estimation, and the other is that clues for estimation do not satisfy the closeness assumption.
The graph-based home location estimation method cannot estimate any home location if there is no edge, which is the estimation clue [15]. The results in Table 3 indicate that the home locations of users with low degrees (low friends or low followers) are difficult to estimate. Low activity, such as low statuses and avg_tweets, is associated with low followers. Therefore, users with low activity would have few estimation clues. Although our datasets contain only users who posted tweets at least 10 times, the results showed that the locations of users with lower activity than others tend to be more difficult to estimate. The effects of these characteristics on difficulty in estimation hold in almost all countries.
Even if there are edges, the estimation will fail if the relationship does not meet the assumption. For example, celebrities or influencers may have relationships that do not relate to geographic distance because they have many followers who are not friends. We found that high follow_ratio is effective in all countries. A previous study [18] reported that location estimation is difficult in the cases of two types of users, hub users and authority users, as described in Section II. Authority users are considered to be celebrities or influencers. Hub users are users with high friends. Thus, our results are consistent with the previous study.
Locations of other users besides celebrities and influencers were also difficult to estimate. The home locations of users with long self-declared text (i.e., high name or description) were difficult to estimate in some countries. We suppose that users easily make online relationships if they reveal more information about themselves. Twitter can be viewed as an interest network: users with the same interests connect online. However, online relationships do not reflect geographic distance, so their locations are difficult to estimate. Our results suggest that the amount of self-declared information is related to the tendency to develop online relationships, which affects the difficulty of location estimation in some countries.
We found two factors affecting estimation difficulty. Moreover, the characteristics of users whose locations cannot be estimated owing to lack of estimation clues were clarified. Further research is needed on the characteristics of users whose locations are difficult to estimate because they do not meet closeness assumptions, other than celebrities or influencers.

B. LANGUAGE USAGE
There are no borders on the Internet, but language can be a barrier to the distribution of information. We investigated languages used in each country and counted the number of tweets and their language information for each country from the geotagged tweets in 2019 collected in Section III-A. Language information was collected from the lang field of the Tweet object, and country information was collected from the country field of the Place object. Table 6 shows the top two languages used in each country. We observe that 80% of the tweets are posted in the official language of the country, except for India and the Philippines. In India, 44% of tweets are in English and 28% are in Hindi. In the Philippines, 58% are in Tagalog and 28% are in English. Both India and the Philippines are unique in that they have more than one official language, including English.
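The per-country language shares can be tallied as in the following sketch. The input of (country, lang) pairs stands in for the country field of the Place object and the lang field of the Tweet object; the function name and the output layout are assumptions for the example.

```python
from collections import Counter, defaultdict


def top_languages(tweets, k=2):
    """Return, per country, the k most used languages with their tweet shares.

    `tweets` is an iterable of (country, lang) pairs, one per geotagged tweet.
    """
    counts = defaultdict(Counter)
    for country, lang in tweets:
        counts[country][lang] += 1

    result = {}
    for country, per_lang in counts.items():
        total = sum(per_lang.values())
        result[country] = [(lang, n / total) for lang, n in per_lang.most_common(k)]
    return result
```

The default k = 2 mirrors Table 6, which lists the top two languages used in each country.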
The countries in which a large share of tweets are in English are the United States, the United Kingdom, India, and the Philippines. If the same language is used in these countries, the profile attributes contributing to the estimation, especially text-related attributes, should be similar. However, not all the results are the same. The agreement between the results of profile attributes contributing to estimation difficulty in the United States and the United Kingdom is two-thirds. This suggests that there are cultural differences among countries in which the same languages are used.

VII. CONCLUSION
In this paper, we collected tweet data from 10 countries to investigate user profiles with characteristics of users' relationships through graph-based home location estimation. First, (RQ1) we found that the absolute value of location estimation accuracy differs by country. The number of areas in the country or the average size of areas could not completely explain the difference. Second, we focused on the profile attributes contributing to the estimation. (RQ2) We found that profile attributes contributed to estimation difficulty and that the attributes varied between countries. To understand the results, we analyzed the rank correlation between attributes and found that the correlation between the attribute groups tends to be small. Then, we interpreted the result. (RQ3) We found two factors of estimation difficulty: lack of estimation clues and unmet estimation assumption. The home location of users with low degrees or low activity tended to be difficult to estimate owing to a lack of clues. This held true in the cases of almost all countries. Nevertheless, we found that follow_ratio uniformly affects estimation difficulty. We also found that the amount of self-declared information is correlated with the difficulty of estimation because users who share more information tend to make more online relationships. Because these text-related attributes did not correlate with follow_ratio, some cultural aspects of the corresponding countries are captured. We then investigated the languages used in each country. (RQ4) The text-related profile attributes contributing to estimation difficulty are different even between countries where English is used. This suggests that profile attributes can differ owing to cultural differences among countries in which the same language is used.
SHIORI HIRONAKA received the Ph.D. degree in engineering from the Toyohashi University of Technology, Japan, in 2021. She is currently a Project Research Associate with the Department of Computer Science and Engineering, Toyohashi University of Technology. Her research interests include social media mining and computational social science.
MITSUO YOSHIDA received the Ph.D. degree in engineering from the University of Tsukuba, Japan, in 2014. He is currently an Assistant Professor with the Department of Computer Science and Engineering, Toyohashi University of Technology. He is also the Founder of TechTech Inc., which provides a news search engine and others. His research interests include the science of science, computational social science, and natural language processing.
KYOJI UMEMURA (Member, IEEE) received the B.E., M.E., and Ph.D. degrees in engineering from The University of Tokyo, Japan, in 1981, 1983, and 1991, respectively. He is currently a Professor with the Department of Computer Science and Engineering, Toyohashi University of Technology. His research interests include information retrieval, Lisp and symbolic computation, compiler, operating systems, and natural language processing.