An Entity-Based Fine-Grained Geolocalization of User Generated Short Text

Recently, the fine-grained geolocalization of user-generated short text (UGST), which can benefit many location-based applications, has been attracting the attention of academia. Most existing work seldom exploits the semantic information in UGST, which reduces the effectiveness of existing methods. To address this issue, we propose an entity-based fine-grained geolocalization of UGST, which consists of the following steps. (1) We employ a location-based social network to model the coupling between entities and locations, which introduces much semantic information. (2) We extract entities from a non-geotagged UGST and discard this UGST if it has no location-related entities. (3) Otherwise, we utilize the built coupling model to rank the candidate locations for this UGST and then select the top $n$ locations as the result. The experiments demonstrate that our method shows marked improvement on $Accuracy\text{@}1km$ and average error distance compared to the state-of-the-art FRV, WMV and LW methods.


I. INTRODUCTION
With the boom of social sites, such as Twitter and WeChat, millions of user-generated short texts (UGSTs) appear every day. These UGSTs cover almost all aspects of users' lives, including daily routines, news stories, political opinions [1], etc. The value of UGSTs has been attracting considerable attention. Furthermore, UGSTs with fine-grained geolocation [2] can benefit many location-based applications, such as smart health [3], emergency analysis [4], [5], event detection [6], and user identification [7]-[9].
Most operators of social sites have recognized the value of UGSTs with fine-grained location and provided a geotagging function to their users. However, due to privacy or other reasons, few users have adopted the geotagging feature. Existing work has shown that exceedingly few UGSTs are geocoded with fine-grained location [9]-[12]. For example, of over 1 billion tweets, only 0.58% are geocoded [12]. In this situation, it is very difficult to fully exploit the value of UGSTs and seize the business opportunities. Therefore, the geolocalization of UGSTs has become a problem that needs to be addressed. We focus on this issue, which differs from existing methods of coarse-grained geolocalization. Generally, these coarse-grained geolocalization methods work by linking the UGSTs to city- or time-zone-level locations [10], [11], which are less useful for applications than are fine-grained locations.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
In existing fine-grained geolocalization work, Kinsella et al. [13] created the language models of locations using coordinates extracted from geotagged tweets and then employed the content similarity to geolocalize the non-geotagged tweets. Paraskevopoulos and Palpanas [14] considered the time-evolution characteristics to improve the above method. Gonzalez Paule et al. [2] presented a solution for the fine-grained geolocalization of tweets, which utilizes a ranking algorithm combined with majority voting of tweets weighted based on the source credibility. Chong and Lim [10], [11] leveraged three types of information from locations, users and peers to rank the fine-grained geolocalization. Gao et al. [15] utilized the weight probability model to geolocalize UGST.
Existing work relies heavily on GPS- or human-annotated UGSTs. However, as mentioned above, because users are seldom willing to actively geocode their UGSTs [12], fine-grained geolocalization becomes a very challenging issue. To address this problem, Lee et al. [12] introduced Foursquare as a source for building probabilistic models for locations using location-coupled words in tweets, and then proposed a Filtering-Ranking-Validating method for tweet location prediction. Intuitively, we believe that an entity carries more accurate location information than a single word. We take the instance shown in Figure 1 to illustrate this point. In the tweet "Just a southern gal living in the Big Apple", the entity "Big Apple" clearly refers to the location New York, and it contains more location information than the word "big" or "apple". Based on the above analysis, we propose an entity-based Fine-grained Geolocalization of user-generated Short Text based on a Location-based social network (LBSN), abbreviated as FGST-L. We first build the probability model for each location using location-coupled entities. Based on the built model, we geolocalize a non-geotagged UGST in the following steps. We identify the entities in the non-geotagged UGST. For an UGST with location-related information, we rank the candidate locations for it and then select the top n (n ≥ 1) locations as the result. For an UGST without any location-related information, we consider its location unpredictable. Our contributions are summarized as follows.
(1) We propose an entity-based solution for the fine-grained geolocalization of UGST, which introduces more semantic information to improve the method performance.
(2) We employ a location-based social network to build the coupled model of entity and location, which introduces more semantic information than existing work. To the best of our knowledge, this is an early work towards exploiting the coupled relation between entity and location.
(3) We present a novel entity-based method to filter out UGSTs without any location-related information, which has a better filtering effect and eliminates interference at an early stage.
(4) We conduct experiments on three ground-truth datasets, and the results illustrate the superiority of FGST-L over the state-of-the-art methods.
It should be mentioned that we presented the main idea of FGST-L at RecNLP 2019. Following the comments from workshop attendees, we revised and extended that presentation into this paper. The rest of this paper is organized as follows. Section II introduces the related work. Section III first describes the preliminary concepts and then provides the problem formulation and details our proposed method. Section IV presents the experiments on three ground-truth datasets and the result analysis. Section V concludes this paper and discusses future work.

II. RELATED WORK
Recently, the geolocalization of UGST has been attracting significant attention from many scholars. The related work can be divided into two categories. One is coarse-grained geolocalization, which focuses on predicting the country, state, or city of each UGST or its user. The other is fine-grained geolocalization, which focuses on predicting the street or place of interest of an UGST. In this section, we discuss these two categories of related work.

A. COARSE-GRAINED GEOLOCALIZATION OF UGST
For these methods on coarse-grained geolocalization, the basic idea is building probability models for each country, state or city using region-specific terms and then predicting the location of the UGST or its user according to location-related words in UGST or UGSTs of a user. Concretely, Cheng et al. [16] presented a solution for predicting the city of a Twitter user. After building the probability model for every city using tweets associated with that city, the probabilities of a user being located in every city are estimated and ranked, and then the city with the highest probability is selected as the city of that user. Hecht et al. [17] utilized the selected region-specific terms to build the probability model and then employed a multinomial naive Bayes model to predict the country and/or state of the Twitter user. Mahmud et al. [18] built a set of classifiers for predicting the home of a Twitter user and then created an ensemble of these classifiers to improve the accuracy. Huang and Carley [19] integrated the text and user profile into a single model using a convolutional neural network to predict a Twitter user's country-or city-level location based on the information in a single tweet. Kinsella et al. [13] used the coordinates extracted from geotagged tweets to create the probability models of locations at multiple granularities, ranging from the zip code to the country level, and then predicted the location of a single tweet. Ebrahimi et al. [20] first proposed a solution for categorizing celebrities as local or global and then used local celebrities as location indicators. A label propagation algorithm was employed over the social network for geolocalization at the city level. Finally, a text-based method was integrated into the network-based proposed approach to improve inference accuracy. 
The difference between our work and these coarse-grained geolocalization methods is that we predict the fine-grained location of a given UGST, such as a street or a specific restaurant.

B. FINE-GRAINED GEOLOCALIZATION OF UGST
Most existing work on fine-grained geolocalization focuses on predicting the location of each UGST at the place-of-interest (PoI) level. Similarly, their fundamental ideas also include building a probability model for each PoI using PoI-specific terms and then predicting the location of the non-geotagged UGST according to location-related words in the UGST. Li et al. [21] predicted the PoI tag of a tweet based on its textual content and time of posting. They considered fine-grained geolocalization as a ranking problem and then ranked a set of candidate PoIs by language and time models. Kinsella et al. [13] also created the language models of locations using coordinates extracted from geotagged tweets and then inferred the tweet locations based on content similarity. Paraskevopoulos and Palpanas [14] improved the method of Kinsella et al. [13] by considering time-evolution characteristics in the matching algorithm. Ikawa et al. [22] presented a method to learn associations between a location and its relevant keywords from past microblogs and inferred the location where a microblog was generated by using its textual content. Lee et al. [12] introduced Foursquare as a source for building the probabilistic models for locations using location-coupled words in tweets and then geocoded the non-geotagged tweets. Li and Sun [23] extracted PoI-level locations mentioned in tweets with temporal awareness. To formulate the PoIs' formal names and their informal abbreviations, they also introduced the crowd wisdom of the Foursquare community into the proposed method. Chong and Lim [10], [11] proposed several models that leverage three types of signals from locations, users and peers to infer the locations of non-geotagged tweets. Gonzalez Paule et al. [2] presented a ranking algorithm combined with majority voting for tweets weighted based on source credibility to predict the fine-grained locations of tweets.
VOLUME 8, 2020
Whereas most relevant existing methods are based on the probabilistic models for locations using location-coupled words in UGSTs, our proposed method attempts to build the probabilistic models for locations using location-coupled entities in UGSTs because we intuitively believe that entities contain more location information than do words.
In addition, Ghaffari et al. [24] developed a deep-learning solution for fine-grained home location prediction, and Xu et al. [25] proposed a deep-learning method for fine-grained location recognition. These two methods have different goals from our work. Besides, other methods focus on inferring the geographical origins of online content [12], such as photographs [26], web pages [27] and web search query logs [28].

III. FINE-GRAINED GEOLOCALIZATION OF USER GENERATED SHORT TEXT
We first list some notations which are used in this paper in Table 1, and then formulate the problem of fine-grained geolocalization of user-generated short text. Finally, we detail the proposed approach.

A. PROBLEM FORMULATION
Intuitively, in an UGST, the word group "Big Apple" contains more semantic information about location than does the single word "big" or "apple". Such a word group is more helpful for the geolocalization of UGST. In FGST-L, we focus more on word groups than on words. For convenience, we call such a word group an entity.
Definition 1: Entity. An entity is defined as a set of words which represents the name of a subject or object. An entity is further formalized as $e = \{w_1, w_2, \ldots, w_n\}$, where $w_i$ is the $i$-th word of the name.
An UGST $t$ is further denoted by $t = \{e_1, e_2, \ldots, e_m\}$, where $e_i$ is the $i$-th entity in $t$. Our goal is to geolocalize the UGST $t$ in a fine-grained manner based exclusively on the entities contained in $t$.
Problem Formulation 1: Fine-Grained Geolocalization of UGST. Suppose we are given an UGST $t$ and a set of fine-grained candidate locations $L = \{l_1, l_2, \ldots, l_k\}$. The task of FGST-L is to select $n$ ($n \ge 1$) locations from $L$ as the geolocation of UGST $t$.
The key issue in Problem 1 is to calculate the probability $p(l_i|t)$, $\forall l_i \in L$, that the geolocation of $t$ is $l_i$. After we calculate the probability $p(l|t)$ for each candidate location, we rank these locations according to their probabilities and then select the top $n$ locations as our results. We detail this key issue in the following subsections. Clearly, this problem is easily generalized to the coarse-grained geolocalization of UGST.

B. OVERVIEW OF FGST-L
Figure 2 shows the framework of FGST-L, which consists of four key components.
1) Building the coupled probability model of entity and location: we employ an LBSN, such as Foursquare, as the source to build the coupled relationship between entities and locations, which allows us to introduce more semantic information.
2) Extracting entities in UGST: we extract the entities in UGST $t$.
3) Filtering the UGSTs: we filter out UGSTs without any location-related entities, which are considered unpredictable UGSTs.
4) Ranking the candidate locations: given a predictable UGST $t$, we calculate the probability $p(l|t)$ for each candidate location, rank these candidate locations, and select the top $n$ ($n \ge 1$) locations for $t$.
We detail the four components as follows.

C. BUILDING COUPLED PROBABILITY MODEL OF ENTITY AND LOCATION
Foursquare contains numerous Points of Interest (PoIs) and a large number of tips, which makes it very helpful for building high-quality probability models of locations and entities [12].
FIGURE 2. Framework of FGST-L. FGST-L includes two stages. One is the stage of pre-training where we employ LBSN to model the coupling between locations and entities. The other is the stage of entity-based geolocalization. The latter consists of three parts: 1) extracting entities in UGST; 2) filtering the UGSTs without any location-related entities; 3) for the remaining UGST, computing the probability of candidate locations based on the built coupling model, and ranking the candidate locations.
We denote the PoIs in Foursquare as $L = \{l_1, l_2, \ldots, l_k\}$. To express the coupling of entities and locations, we build the conditional probability model for every PoI based on its related tips. We assume that the set of tips tied to $l_i$ is $T(l_i) = \{t_1, t_2, \ldots, t_m\}$. In practice, a popular PoI has more tips, and thus its model is of higher quality. We assume that entity $e$ occurs $tf(e, t)$ times in UGST $t$ and $c(e, l)$ times in $T(l)$. We compute the probability of entity $e$ occurring in PoI $l$ by maximum likelihood estimation, as shown in Eq. (1).
From Eq.(1), we can easily find that a zero-probability problem, p(e|l) = 0, occurs when c(e, l) = 0. To address this issue, we further define p(e|l) by the Laplace smoothing method as follows.
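With the counts defined above, the two estimates plausibly take the following standard forms. This is a sketch under assumptions: $E$ (the set of distinct entities observed in the training tips) and the expression of $c(e, l)$ as a sum of $tf(e, t)$ over $T(l)$ are notation introduced here for illustration only.

```latex
% Assumed form of the maximum likelihood estimate
c(e, l) = \sum_{t \in T(l)} tf(e, t), \qquad
p(e \mid l) = \frac{c(e, l)}{\sum_{e' \in E} c(e', l)}

% Assumed form of the Laplace (add-one) smoothed estimate
p(e \mid l) = \frac{c(e, l) + 1}{\sum_{e' \in E} c(e', l) + |E|}
```

With add-one smoothing, $p(e \mid l) > 0$ even when $c(e, l) = 0$, and the probabilities still sum to one over $E$.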
Furthermore, some entities in UGST $t$ may share common word(s), which makes it difficult to compute the probability of UGST $t$ being tied to PoI $l$. However, because an UGST is very short, it is relatively rare for words to be shared between entities. We therefore assume that the entities in $t$ are independent and approximate $p(t|l)$ as follows.
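Under the independence assumption, $p(t|l)$ factorizes into a product of $p(e_i|l)$ terms, which agrees with the sum of $\ln p(e_i|l)$ in Eq. (11). The model-building stage might be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the `tips_by_poi` input layout, and the use of the training vocabulary size in the smoothing denominator are all assumptions.

```python
import math
from collections import Counter


def build_coupling_model(tips_by_poi):
    """Count entity occurrences per PoI.

    tips_by_poi maps a PoI id to a list of tips; each tip is a list of
    entity strings (hypothetical input layout).
    Returns (counts, vocab): counts[poi][entity] plays the role of c(e, l).
    """
    counts = {poi: Counter(e for tip in tips for e in tip)
              for poi, tips in tips_by_poi.items()}
    vocab = set()
    for c in counts.values():
        vocab.update(c)
    return counts, vocab


def p_entity_given_poi(entity, poi, counts, vocab):
    """Laplace-smoothed p(e|l), assuming add-one smoothing over the vocabulary."""
    c = counts[poi]
    return (c[entity] + 1) / (sum(c.values()) + len(vocab))


def log_p_text_given_poi(entities, poi, counts, vocab):
    """ln p(t|l) under the independence assumption: sum of ln p(e_i|l)."""
    return sum(math.log(p_entity_given_poi(e, poi, counts, vocab))
               for e in entities)


tips = {"A": [["pizza", "big apple"], ["pizza"]], "B": [["museum"]]}
counts, vocab = build_coupling_model(tips)
print(p_entity_given_poi("pizza", "A", counts, vocab))   # smoothed estimate for a seen entity
print(p_entity_given_poi("museum", "A", counts, vocab))  # non-zero even though c(e, l) = 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.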

D. EXTRACTING ENTITIES IN UGST
In FGST-L, we break each UGST into words, stem them, and remove stop words. After that, an UGST $t$ is denoted by a set of words, $t = \{w_1, w_2, \ldots, w_i, \ldots, w_n\}$.
We utilize the Stanford NLP tool [29] to find all possible entities in $t$ based on Microsoft Probase [30]. As a result, we obtain $t = \{e_1, e_2, \ldots, e_m\}$, where $e_i = \{w_k, w_{k+1}, \ldots, w_l \mid 1 \le k \le l \le n\}$. The extracted entities are restricted to the repository we selected. We selected Probase for extracting entities because it includes a very large concept space and many concept clusters.
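The idea of restricting entities to a repository can be illustrated with a simple dictionary lookup. The sketch below does not use the Stanford NLP tool or Probase; it substitutes a greedy longest-match search over word n-grams against a small, hypothetical set of known entity names, and it omits stemming and stop-word removal for brevity.

```python
import re


def extract_entities(text, known_entities, max_len=4):
    """Greedy longest-match lookup of contiguous word groups.

    known_entities is a set of lower-case entity names standing in for the
    entity repository (an assumption for illustration).
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    entities, i = [], 0
    while i < len(words):
        # Try the longest candidate starting at position i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in known_entities:
                entities.append(candidate)
                i += n
                break
        else:
            i += 1  # no entity starts here; move to the next word
    return entities


repo = {"big apple", "apple", "new york"}
print(extract_entities("Just a southern gal living in the Big Apple", repo))
print(extract_entities("It is a good day", repo))
```

Matching the longest word group first ensures that "big apple" is kept as one entity rather than being reduced to "apple".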

E. FILTERING UGSTs
Geolocalization of an UGST depends heavily on the location information it contains. For example, it is very difficult to geolocalize the UGST "It is a good day". Before predicting the geolocation of an UGST, we first determine whether this UGST contains location-related information. We filter out the UGSTs without any location indication.
In some situations, the location-related information occurs explicitly in the UGST. For instance, the entity "Northwestern Polytechnical University" is an explicit location indication in the UGST "I am at Northwestern Polytechnical University now". To express these cases clearly, we define the indicator function $I_{ex}$ for entity $e_i \in t$.
We further express whether UGST t contains explicit location-related entities by Eq.(5).
In other situations, the location-related information appears implicitly. For instance, the entity "Big Apple" can implicitly represent the location New York in UGSTs. To express these cases, we employ the idea of TF-IDF to identify local words [12] and define the following equation.
where $df(e)$ is the number of locations with entity $e$. We consider entity $e$ to be $l$-related when $f_{tfidf}(e, l) \ge \theta$, where θ is a given threshold. Clearly, the greater the value of θ, the smaller the number of local entities. The indicator function of implicit location-related entities is defined over $f_{tfidf}(e, l)$ as shown in Eq. (7).
We can depict whether UGST $t$ contains implicit location-related entities by Eq. (8). If $I(t, L) = 1$, $t$ is considered to contain location-related information. Otherwise, it is filtered out.
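The filtering decision can be sketched in Python as follows. The scoring formula is an assumption: the code uses a normalized TF-IDF (relative frequency of the entity at the PoI, multiplied by an inverse document frequency scaled to [0, 1]), which is only one plausible reading of $f_{tfidf}(e, l)$. The names `tfidf`, `is_predictable`, and `poi_names` are hypothetical.

```python
import math


def tfidf(entity, poi, counts):
    """Assumed normalized TF-IDF score of an entity for a PoI, in [0, 1]."""
    c = counts[poi]
    total = sum(c.values())
    # df(e): number of locations whose tips contain the entity
    df = sum(1 for other in counts.values() if other.get(entity, 0) > 0)
    if total == 0 or df == 0 or len(counts) < 2:
        return 0.0
    tf = c.get(entity, 0) / total
    idf = math.log(len(counts) / df) / math.log(len(counts))
    return tf * idf


def is_predictable(entities, counts, poi_names, theta):
    """Return True if some entity is an explicit or implicit location indication."""
    for e in entities:
        if e in poi_names:  # explicit: the entity is itself a location name
            return True
        if any(tfidf(e, poi, counts) >= theta for poi in counts):  # implicit
            return True
    return False


counts = {"A": {"pizza": 2, "big apple": 1}, "B": {"pizza": 1, "museum": 1}}
print(is_predictable(["big apple"], counts, set(), theta=0.3))
print(is_predictable(["pizza"], counts, set(), theta=0.1))
```

An entity that appears at every PoI, such as "pizza" here, receives a score of zero and therefore never marks an UGST as predictable on its own.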

F. RANKING CANDIDATE LOCATIONS
We rank the PoIs for each remaining UGST. Given UGST $t = \{e_1, e_2, \ldots, e_m\}$ and the candidate PoIs $L = \{l_1, l_2, \ldots, l_k\}$, we rank the candidate PoIs based on the naive Bayes model. Therefore, the probability that the location of $t$ is $l$ is shown by Eq. (10).
where $N(l)$ indicates the occurrences of $l$. Eq. (10) is further defined as Eq. (11):
$$\ln p(l|t) \propto \sum_{e_i \in t} \ln p(e_i|l) + \ln p(l) \qquad (11)$$
After calculating the probability $p(l_i|t)$, $\forall l_i \in L$, we rank the candidate locations $\{l_i \mid 1 \le i \le k\}$ according to their probabilities and obtain a ranking list $\{l_{r_1}, l_{r_2}, \ldots, l_{r_k}\}$. The top $n$ ($1 \le n \le k$) locations, $\{l_{r_1}, l_{r_2}, \ldots, l_{r_n}\}$, are selected as the possible geolocations of UGST $t$.
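The ranking step of Eq. (11) might look as follows in Python. This is a sketch under assumptions: `occurrences` stands in for $N(l)$, the prior is taken as $N(l)$ divided by the sum over all candidates, and $p(e_i|l)$ uses add-one smoothing with a given vocabulary size. The function name and argument layout are hypothetical.

```python
import math


def rank_locations(entities, counts, occurrences, vocab_size, n_top=1):
    """Return the n_top PoIs with the highest ln p(l) + sum_i ln p(e_i|l)."""
    total = sum(occurrences.values())
    scores = {}
    for poi, c in counts.items():
        denom = sum(c.values()) + vocab_size
        # Smoothed log-likelihood of the entities given this PoI
        log_lik = sum(math.log((c.get(e, 0) + 1) / denom) for e in entities)
        # Assumed prior: relative occurrence of the PoI
        scores[poi] = math.log(occurrences[poi] / total) + log_lik
    return sorted(scores, key=scores.get, reverse=True)[:n_top]


counts = {"A": {"pizza": 2, "big apple": 1}, "B": {"museum": 3}}
occurrences = {"A": 1, "B": 1}
print(rank_locations(["big apple"], counts, occurrences, vocab_size=3))
print(rank_locations(["museum"], counts, occurrences, vocab_size=3, n_top=2))
```

Each call scores every candidate PoI once, which is consistent with a running time that grows linearly in the number of candidate locations.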
The overall proposed method can be summarized in two stages. One is the stage of building the coupled model of entity and location, as shown in Algorithm 1. The other is the stage of entity-based geolocalization of UGST, as shown in Algorithm 2. The first stage is a pre-training process, in which we use the UGSTs and locations in the LBSN to build the coupled model. Its time complexity depends on |L|, |T(l)| and the number of entities in an UGST. In the datasets we obtained from real social sites, the vast majority of |T(l)| values are less than 50, and almost all UGSTs contain fewer than 10 entities. In other words, the running time of the first stage mainly depends on the number of locations, |L|. From Algorithm 2, it is easy to see that the time complexity of the second stage is O(|L|). The state-of-the-art method FRV [12] is similar in idea to FGST-L, and its performance is the closest to that of FGST-L. Its computational complexity also depends on |L|, |T(l)| and the number of words in an UGST. Because the number of words and the number of entities in an UGST have the same order of magnitude, FRV and FGST-L have the same time complexity. In short, the running time of both FGST-L and FRV depends heavily on the number of candidate locations.

IV. EXPERIMENTS

A. DATASETS
We collected the PoIs of New York City and the related tips from Foursquare and obtained 74,942 PoIs and 498,722 tips. For convenience, we call this dataset TrainingTips. The number of tips is unevenly distributed over all PoIs, as shown in Table 2.
To illustrate the generalization of FGST-L, we also collected the UGSTs generated in New York City from Twitter and Facebook, and ultimately obtained 19,231 tweets and 6,699 posts. In total, 32.4% of tweets and 16.7% of posts are geocoded by the PoI-level location. In addition to the tips in TrainingTips, we obtained additional tips and their PoIs from Foursquare for evaluation and manually selected 12,000 tips, of which 6,000 tips contain hints about locations and 6,000 tips do not contain any hints about locations. The three datasets are named TW, FB, and FS, respectively.

B. EXPERIMENTAL SETTINGS
In FGST-L, there are three key parameters: the predefined threshold θ, the number of ranked locations selected for $t$, and the number of tips used for building the probability model. For convenience, we denote the three parameters by θ, $n_{top}$, and $n_{tip}$, respectively. We will discuss the effects of the three parameters.
We compare FGST-L with the following similar methods.
- FRV [12]: a filtering-ranking-validating technique for the fine-grained geolocalization of tweets, which is very similar to FGST-L. FGST-L is an entity-based method, while FRV is a word-based method.
- LW [10], [11]: a location-indicative weighting scheme for the fine-grained geolocalization of tweets, which assigns more weight to location-indicative words.
- WMV [2]: a weighted majority voting algorithm for the fine-grained geolocalization of tweets, which estimates the location of a tweet by collecting the votes of the geotagged tweets that are similar in content to that tweet.
In the experiments, we first use the PoIs and tips in TrainingTips to model the coupling between entities and locations. Then, we use Algorithm 2 to geolocalize each $t$ in TW, FS and FB, respectively. Based on the geolocalization results, we employ the widely used Accuracy@1km and average error distance (km) [2] to evaluate all algorithms.
Average Error Distance (km): we only consider the UGSTs that have not been filtered out and compute the distance on Earth between the predicted location and the real coordinates of the UGST.
Accuracy@1km: After filtering, all UGSTs are divided into two categories: UGSTs that have been filtered out and UGSTs that have not been filtered out. We assume that the number of UGSTs that have been filtered out correctly is $n^{1}_{1km}$. For UGSTs that have not been filtered out, the number of UGSTs whose predicted location lies within a radius of 1 km from the real location is denoted by $n^{2}_{1km}$. Accuracy@1km is measured as follows.
$$\mathrm{Accuracy@1km} = \frac{n^{1}_{1km} + n^{2}_{1km}}{n_T} \qquad (12)$$
where $n_T$ is the number of all UGSTs for testing.
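Both metrics can be computed with a short Python routine. The sketch below assumes that each test record is a pair of (latitude, longitude) tuples, that `None` as a prediction means the UGST was filtered out, and that `None` as ground truth means the UGST is location-free. The haversine formula with a mean Earth radius of 6371 km is used as the distance on Earth. These conventions and the function names are illustrative assumptions.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def evaluate(records, radius_km=1.0):
    """Return (accuracy within radius_km, average error distance in km)."""
    n1 = n2 = 0
    errors = []
    for predicted, truth in records:
        if predicted is None:
            if truth is None:
                n1 += 1  # correctly filtered out
            continue
        if truth is None:
            continue  # prediction for a location-free UGST: never counted as correct
        d = haversine_km(predicted, truth)
        errors.append(d)
        if d <= radius_km:
            n2 += 1
    accuracy = (n1 + n2) / len(records) if records else 0.0
    avg_error = sum(errors) / len(errors) if errors else float("nan")
    return accuracy, avg_error


records = [
    ((40.0, -74.0), (40.0, -74.0)),  # exact hit
    (None, None),                    # correctly filtered out
    ((40.0, -74.0), (41.0, -74.0)),  # about 111 km off
    (None, (40.0, -74.0)),           # wrongly filtered out
]
print(evaluate(records))
```

Passing a larger `radius_km` gives the relaxed variants of the accuracy metric.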

C. PERFORMANCE OF FGST-L W.r.t θ
Given UGST $t$, it is filtered out if $f_{tfidf}(e, l) < \theta$, $\forall e \in t$. In other words, the value of θ determines whether $t$ is filtered out. Therefore, the number of UGSTs that are not filtered out varies with the value of θ, as shown in Table 3.
As the value of θ becomes larger, more UGSTs are filtered out. The UGSTs that are not filtered out are considered predictable. Naturally, the threshold, θ, exerts a strong influence on the results. We conduct the experiments with the different values of θ to study its effect on the results, where we set n top = 1 and n tip = 20. The results are shown in Figure 3.  Figure 3(a) shows the Accuracy@1km w.r.t. θ . For FS and TW, the accuracy of FGST-L first rises and then declines with the increase in θ . Its accuracy exhibits the best performance when 0.3 < θ < 0.4. Generally, many UGSTs do not contain location-related information. When θ takes a small value, many UGSTs without location-related entities are mistaken for UGSTs with location information. This reduces the accuracy of FGST-L. Similarly, when θ takes a large value, FGST-L incorrectly filters out the UGSTs with location-related information, which also reduces the accuracy. However, the accuracy curve exhibits a different trend for dataset FB. The curve first rises quickly and then increases only slightly. Because most UGSTs in FB are location-free, an increasing number of location-free UGSTs are correctly filtered out with the increase in θ at the beginning. However, after θ > 0.4, most location-free UGSTs have been filtered out, which slows the increasing trend. The accuracy for the FB dataset always continues to increase, which is primarily due to its severe data skew. When θ = 0.7, FGST-L filters most UGSTs, and its accuracy reaches 83.3%.
The average error distance w.r.t. θ is shown in Figure 3(b). When 0.3 < θ < 0.4, this metric is optimal. Although it reaches a minimum value when θ > 0.6, most UGSTs are filtered out in this case, which is undesirable.
The above results show that FGST-L has the optimal performance when 0.3 < θ < 0.4. Therefore, we can set the value of parameter θ in the interval [0.3, 0.4].
To clearly show the effect of filtering, we illustrate the comparison between the unfiltered results (θ = 0.0) and filtered results (θ = 0.35) in Figure 3(c) and Figure 3(d), respectively. Whether for accuracy or average error distance, the filtered results are significantly better than the unfiltered results. This finding demonstrates that filtering is an essential step of FGST-L.

D. PERFORMANCE OF FGST-L W.r.t n tip
Intuitively, if a PoI is tied to more tips, the probability model built for this PoI is of higher quality. Therefore, we conduct experiments to study the effect of $n_{tip}$, where $n_{top} = 1$ and θ = 0.35. Figure 4 shows the experimental results. We find that the accuracy of FGST-L rises slightly as $n_{tip}$ increases, while the average error distance first rises and then declines for $n_{tip} \le 20$. In other words, the number of tips has some impact on the coupling model, but the impact is not very significant for $n_{tip} \le 20$. This result is not in accordance with our intuition. In particular, when $n_{tip} > 20$, it has little effect on the probability model. The reasons for this result are as follows: 1) in the experimental datasets, many UGSTs are location-free, which interferes with our prediction, and 2) for the UGSTs tied to a PoI, the entities in 20 UGSTs cover most entities in all UGSTs. As a result, we can build an accurate coupling model of entities and a PoI with only approximately 20 UGSTs. This could reduce the need for computing resources and help us obtain a much more accurate probability model. Therefore, we recommend that $n_{tip}$ be set to the number of UGSTs covering most entities. In our experiments, we set $n_{tip} = 20$.

E. PERFORMANCE OF FGST-L W.r.t n top
From the above experimental results, we easily find that the best accuracy is approximately 80%, as shown in Figure 3 and Figure 4. This could be caused by selecting the top-ranked location as the location of UGST t. Instead, in many cases, the ground-truth location of t is on the k th (k ≥ 2) place of the ranking list, not the top-ranked place. Intuitively, if we select the top n top locations as the possible locations of t, the accuracy should improve. We conduct experiments to demonstrate the effect of n top , where n tip = 20 and θ = 0.35. The results are shown in Figure 5.
From Figure 5(a), we can easily observe that the accuracy improves significantly. The detailed percentage of accuracy improvement (= (Accuracy@Top$n_{top}$ − Accuracy@Top1) / Accuracy@Top1 × 100%) is shown in Table 4. As $n_{top}$ increases, the accuracy gradually improves, but the rate of improvement gradually decreases. Before $n_{top} = 8$, the improvement is relatively large. These results meet our expectations.
Among the datasets, the percentage of accuracy improvement on FS is the highest, while the percentage on TW is the lowest. We have conducted further analysis and found reasonable explanations for these results. 1) The locations in TW are much coarser. When $n_{top} = 1$, its accuracy is relatively high, as shown in Figure 3(a). As $n_{top}$ increases, the change in the percentage is not obvious. 2) The location granularity of FS is the finest. The candidate locations close to the ground-truth location of $t$ readily interfere with our inference results. Figure 3(a) supports this statement. Obviously, when we select the top $n_{top}$ locations, the accuracy for FS improves remarkably. 3) For the FB dataset, only 16.7% of posts are geocoded with fine-grained location. Similar to the reason for TW, the change in the percentage is not apparent after $n_{top} > 3$. In future work, we will extend the datasets and study the relationship between the number of UGSTs with fine-grained locations and the accuracy of FGST-L.
Figure 5(b) illustrates the remarkable change in average error distance. The percentages of average error distance improvement (= (AveErrDist@Top1 − AveErrDist@Top$n_{top}$) / AveErrDist@Top1 × 100%) with $n_{top}$ are detailed in Table 5. When selecting the top $n_{top}$ locations, we take the location closest to the real location as the predicted location. Clearly, as $n_{top}$ gradually becomes larger, the value of average error distance becomes smaller. The average error distance improves remarkably for the three datasets, particularly before $n_{top} = 10$. As mentioned above, the ratio of posts without location information is significantly larger than the ratio of tips or tweets, so the percentage of average error distance improvement for FB is relatively small.
In summary, we present the comprehensive FGST-L experimental results with $n_{top} = 1$ and $n_{top} = 5$, as shown in Table 6 and Table 7, respectively. Whether $n_{top} = 1$ or $n_{top} = 5$, we easily reach the conclusion that FGST-L performs well when $20 < n_{tip} < 30$ and 0.3 < θ < 0.4. When $n_{top}$ takes other values, the performance of FGST-L exhibits a similar pattern.
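The top-$n_{top}$ evaluation rule and the improvement percentage for average error distance can be expressed compactly in Python. This is a sketch with hypothetical function names; the default planar distance `math.dist` is only a stand-in for the distance on Earth used in the experiments.

```python
import math


def closest_candidate(candidates, truth, distance=math.dist):
    """Among the top-ranked candidates, pick the one closest to the real location."""
    return min(candidates, key=lambda c: distance(c, truth))


def error_improvement_pct(err_top1, err_topn):
    """(AveErrDist@Top1 - AveErrDist@Topn) / AveErrDist@Top1 * 100."""
    return (err_top1 - err_topn) / err_top1 * 100.0


top3 = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(closest_candidate(top3, (1.0, 2.0)))
print(error_improvement_pct(2.0, 1.5))
```

Because the closest of the selected candidates is always used, the error distance can only stay the same or shrink as more candidates are admitted.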
As n top gets bigger, FGST-L gets better. However, the number of possible locations for t has also become larger, which makes it more difficult for the user to choose one. Therefore, we set n top ≤ 5.

F. PERFORMANCE OF FGST-L UNDER RELAXED CONDITIONS
In the above experiments, we use the metric Accuracy@1km to evaluate the accuracy of FGST-L. If the radius error between the predicted location and the real location is less than 1 km, the inference result is considered correct. Intuitively, the radius error should have a significant impact on the accuracy of FGST-L. We relax the conditions for calculating the metrics of FGST-L to study this issue. Due to limited space, we only demonstrate the results of FS w.r.t. radius error and $n_{top}$, as shown in Figure 6. From the results, we find that the accuracy of FGST-L increases slightly as the radius error becomes larger when $n_{top}$ is given. Similarly, the accuracy of FGST-L also increases with the increase in $n_{top}$ when the radius error is given. However, the increase in the latter case is significantly greater than the increase in the former case. This illustrates that the location predicted by FGST-L is fine-grained. To increase the feasibility of FGST-L, we recommend selecting the top $n_{top}$ locations as locations of UGST $t$.

G. COMPARISON WITH EXISTING WORK
Figure 7 shows the comparison between FGST-L and existing work. The results demonstrate the effectiveness of FGST-L due to its superiority over the other baseline methods. These results stem from the fact that we 1) build the probability models with entities instead of words, where entities contain more semantic information than do words, and 2) filter out the UGSTs without location-related entities.
FRV exhibits better results than the other baseline methods, WMV and LW, which further indicates that filtering the location-free UGSTs is an effective step that reduces noise interference.
Both FGST-L and FRV exhibit better results on the FB dataset than on the FS and TW datasets. As analyzed above, approximately 83.3% of the UGSTs in FB contain few location-related entities. FGST-L and FRV filter them out, which increases their accuracies. This result further demonstrates that determining whether a UGST includes location-related entities is easier than geocoding its PoI. In addition, all methods exhibit lower accuracies on FS than on FB and TW. This is because the locations in FS are more fine-grained.
We perform a t-test on the results of FGST-L and FRV, and find no significant difference at the 0.05 significance level. A reasonable explanation is that FRV employs the n-gram model to extract words, which include most of the entities. However, some words that are not entities may introduce noise into the geolocalization of UGSTs.
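A significance check of this kind can be sketched as follows. The text does not state which variant of the t-test was used, so the paired form here is an assumption, and the accuracy values are hypothetical placeholders rather than results from this paper. The sketch computes the paired t statistic and compares it with the standard two-sided critical value for 4 degrees of freedom at the 0.05 level.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t-test on equal-length samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical accuracies over five runs (NOT the paper's numbers).
method_a = [0.62, 0.58, 0.65, 0.60, 0.63]
method_b = [0.61, 0.59, 0.63, 0.61, 0.62]

# Two-sided critical value of Student's t at alpha = 0.05 with 4 degrees of freedom.
T_CRITICAL_DF4 = 2.776

t_stat, df = paired_t_statistic(method_a, method_b)
significant = abs(t_stat) > T_CRITICAL_DF4
```

With these placeholder values the statistic stays well below the critical value, so the difference would not be judged significant at the 0.05 level.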
To further validate our method, we remove all location-free UGSTs from the three datasets and rerun the four methods. The results are shown in Table 8, where n top = 5, n tip = 20 and θ = 0.35. As in Fig. 7, both FGST-L and FRV still show better accuracy, but their advantage is significantly reduced. The four methods show very similar results because they rely on similar information, namely location-indicative words or entities. However, the entities used by FGST-L and the n-grams used by FRV are more location-indicative, so these two methods perform better. The results on average error distance (km) change only slightly.
We present some examples to show the effects of using FGST-L. Table 9 displays three tweets posted at Joe's Shanghai, New York. Within each tweet, the entities are italicized. For instance, FGST-L easily recognizes the entity joe's shanghai in tweet t 1 , and this entity is the name of a Chinese restaurant. Therefore, FGST-L easily geolocalizes t 1 . For tweet t 3 , FGST-L cannot distinguish any entity, and considers t 3 a location-free tweet. However, t 2 is more difficult for FGST-L to geolocalize. Our method can recognize the entity soup dumplings, which is a representative food in Chinese restaurants. In this case, FGST-L will incorrectly geolocalize t 3 with high probability. Similarly, the other three methods are also helpless for tweet t 3 .
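The three cases discussed above can be illustrated with a toy sketch. Everything in it is hypothetical: the coupling table, its weights, the substring-based entity extractor, the additive scoring rule, and the example texts are stand-ins chosen for illustration and are not the probability model or the tweets used in this paper.

```python
from collections import defaultdict

# Hypothetical entity -> {location: weight} coupling (illustrative names and weights).
COUPLING = {
    "joe's shanghai": {"Joe's Shanghai, New York": 0.9},
    "soup dumplings": {
        "Joe's Shanghai, New York": 0.2,
        "Restaurant A": 0.2,
        "Restaurant B": 0.2,
    },
}


def extract_entities(text):
    """Toy extractor: substring match against known entities.
    A stand-in for a real entity recognition step."""
    lowered = text.lower()
    return [entity for entity in COUPLING if entity in lowered]


def geolocalize(text, n_top=5):
    """Return up to n_top ranked candidate locations, or None if the text
    contains no location-related entity (a location-free UGST)."""
    entities = extract_entities(text)
    if not entities:
        return None
    scores = defaultdict(float)
    for entity in entities:
        for location, weight in COUPLING[entity].items():
            scores[location] += weight  # additive scoring is an assumption
    # Sort by descending score; break ties by name so the output is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [location for location, _ in ranked[:n_top]]


specific = geolocalize("Dinner at Joe's Shanghai tonight")  # a venue name
generic = geolocalize("Craving soup dumplings")             # a dish served at many venues
no_entity = geolocalize("What a day")                       # no known entity
```

A venue name maps to a single candidate, a text without entities is discarded, and a generic dish name yields several equally weighted candidates, which is the situation in which an entity-based ranking is likely to pick the wrong venue.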

V. CONCLUSION AND FUTURE WORK
Recently, the value of the tremendous number of UGSTs in social networks, particularly the UGSTs tagged with fine-grained locations, has been recognized by a growing number of business organizations. However, due to privacy issues or special purposes, most users seldom adopt the geotagging functions provided by social sites. To fully exploit the value of UGSTs, the fine-grained geolocalization of UGSTs has been receiving great attention from academia. Most existing methods are word-based and thus rarely utilize the semantic information about a location, which degrades their performance. To address this problem, we present an entity-based fine-grained geolocalization of UGST based on LBSNs. We introduce LBSNs, such as Foursquare, as sources to tightly couple entities and locations, which captures more semantic information about locations than the word-based methods do. After filtering out the UGSTs without any location-related entities, we rank the candidate locations for each remaining UGST based on the coupling model and then select the top n (n ≥ 1) locations as results. The experiments on three ground-truth datasets validate the effectiveness of the proposed method.
To geolocalize UGSTs more accurately, we will extend our method by incorporating more information sources in future work. One extension could be to use the UGSTs posted by a user in an LBSN to predict the locations of UGSTs posted by the same user on other social sites. For example, if a user visits a shopping mall at 12 o'clock and simultaneously posts two similar UGSTs on Foursquare and Twitter, then we can accurately predict the location of the UGST on Twitter with the help of the PoI on Foursquare. Another possible extension could be to introduce the location history of a user, which could reduce the search space of candidate locations.

XING GAO received the B.Sc. degree in communication and information engineering from the Xi'an University of Science and Technology, China, in 2016, where she is currently pursuing the M.S. degree in computer science and technology. Her research interest includes social computing.

VOLUME 8, 2020