Leveraging Social Media as a Source of Mobility Intelligence: An NLP-Based Approach

This work presents a deep learning framework for analyzing urban mobility by extracting knowledge from messages collected from Twitter. The framework, which is designed to handle large-scale data and adapt automatically to new contexts, comprises three main modules: data collection and system configuration, data analytics, and aggregation and visualization. The text data is pre-processed using NLP techniques to remove informal words, slang, and misspellings. A pre-trained, unsupervised word embedding model, BERT, is used to classify travel-related tweets using a unigram approach with three dictionaries of travel-related target words: small, medium, and big. Public opinion is evaluated using VADER to classify travel-related tweets according to their sentiments. The mobility of three major cities was assessed: London, Melbourne, and New York. The framework demonstrates consistently high average performance, with a Precision of 0.80 for text classification and 0.77 for sentiment analysis. The framework can aggregate sparse information from social media and provide updated information in near real-time with high spatial resolution, enabling easy identification of traffic-related events. The framework is helpful for transportation decision-makers in operational control, tactical-strategic planning, and policy evaluation. For example, it can be used to improve the management of resources during traffic congestion or emergencies.

media are nowadays the main drivers of big data analytics [6].
The effective processing and utilization of social media data require a well-designed data mining and business intelligence system [7]. This work addresses the transportation sector, evaluating whether georeferenced messages extracted from social networking sites can be a good (perhaps even better) way to obtain a city's traffic information.
Transportation data provided by social media can be leveraged to extract valuable insights into various travel aspects, ranging from individual preferences to travel patterns and traffic bottlenecks. These aspects can be obtained with broad spatial-temporal detail and utilized by policymakers and transportation providers to optimize the day-to-day operations of transportation systems, such as adjusting signal times, identifying reliable alternative routes, or pinpointing areas with inadequate or inefficient transportation services, informing the need for investments in transportation infrastructure and/or services. Additionally, this information can facilitate the evaluation of transportation policies and encourage citizen participation in the policy-making process, leading to more inclusive and effective transportation planning and decision-making [8].
The integration of social media data into transportation planning and management has the potential to revolutionize the way we approach urban mobility. By leveraging the collective intelligence of social media users, we can gain valuable insights that were previously unattainable using traditional data sources. This has the potential to lead to more efficient, cost-effective, and user-centric transportation systems. However, extracting and classifying transportation data from a wide range of user opinions is a challenge.
While social media has been investigated for various purposes, including urban mobility, the existing studies are often fragmented, focusing only on part of the process. The main objective of this work is to create a robust framework capable of representing and handling large-scale data and adapting automatically to new contexts. By identifying patterns and trends in user feedback, the framework will provide transport operators and authorities with valuable insights that can be used to make data-driven decisions about the management of transport services. Ultimately, this knowledge will enable them to optimize their services and improve the overall transportation experience for users. Therefore, the main goals of this work can be summarized as follows.
• To define a scalable methodological framework comprising data collection and system configuration, data analytics, and aggregation and visualization modules able to efficiently aggregate sparse information for transport management.
• To overcome the limitations of traditional approaches.
• To demonstrate the use of social media as a valuable source of data.
The remainder of this manuscript is structured as follows. First, a comprehensive literature review is performed, identifying gaps in current research on the analysis of transportation using social media data. Then, the overall framework is presented, describing the data and applied models. Next, the main results are outlined and discussed, highlighting the model limitations and main practical implications in mobility control and management. The document ends with remarks on the main conclusions and an overview of future directions.

II. LITERATURE REVIEW
Text mining classification comprises text classification and sentiment analysis (also referred to as opinion mining). Text classification uses keyword search to extract related messages, whereas sentiment analysis estimates the polarity of people's opinions, sentiments, evaluations, attitudes, and emotions from written language [9].
In the domain of transportation, text mining studies have focused on the assessment of near real-time traffic information [10] and public transportation [11], [12], [13], [14]. The main purpose of these studies has been the assessment of travel mode choice [15], [16], transport flow [12], [13], customer satisfaction [11], [14], [17], [18], transport safety [19], and the quality of services [20], [21]. Abdul-Rahman et al. [22] built a framework to pre-process location-based social media big data, Azhar et al. [23] tested lexicon and deep learning approaches to identify accident-related tweets, combining social media with weather and geocoding information, and Kuflik et al. [24] defined a framework for mining transport-related information from social media.
Some studies using social media data focus only on georeferenced messages. However, in some social media networks such as Twitter, the georeferencing of messages is optional. Thus, keywords associated with the name of the transport system, such as the codes of bus stops or the numbers of bus services, have been used to better identify transport services (e.g., [30]). Besides social media data, other data sources, such as news [19], [25], announcements [19], traveling sites [25], and transport agencies [31], have also been used in text mining studies related to transportation.
Social media data from other cities worldwide have also been used to gain knowledge about transportation systems, particularly in megacities from Brazil, such as São Paulo [30] and Rio de Janeiro [30], the United Kingdom, such as London [13], [25] and Manchester [34], and China, such as Shanghai [35], Nanjing [36], and Shenzhen [27].
In the context of text mining, information filtering assumes a pivotal role by focusing on the most relevant data. Search-based methods are commonly employed for information filtering. Some methods filter based on search parameters or on users and pages (e.g., [11], [19]), while others rely on text classification (e.g., [23], [28], [39]). Filtration by search parameters is performed during the extraction phase and typically consists of selecting content matching one [38] to four entries [17] from a list of words and/or hashtags [40], [41]. On the other hand, filtration by pages or users identifies accounts of interest and extracts all their content [32] and/or comments [42]. However, these techniques have strong limitations since they are restricted to a specific page or set of pages.
Text classification serves as an effective filtration process that overcomes the limitations of traditional approaches by relying on dictionaries. However, its application in the domain of traffic management is limited. Few dictionaries of travel-related target words have been used in the literature for text mining classification. The number of words used in these studies is quite distinct for each study, ranging from one single word [37] to 500 words [19], [25]. The most popular travel-related words in these studies are associated with the identification of transport modes [15], [21], [26], [39], type of public transport [12], public transport companies [43], [44], [45], and public transit riders' satisfaction [14], [17], [21]. Some of these dictionaries were validated with the help of transportation experts [46].
The earlier mentioned machine learning models have been used considering two different frameworks: one-hot encoding and word embedding [25]. The first and most popular one-hot encoding framework is the Bag-of-Words (BoW). This model represents a document as a dictionary containing all words occurring in the document. To overcome the limitations of this method for representing large-scale data, a probabilistic approach based on a distributed representation has been proposed. Word embedding uses a very low-dimensional vector and semantic meaning to represent each word. In the domain of transportation, the most common approach is BoW [12], [21], [30], [37]. However, in recent years, word embedding has become more popular [21], [37], mainly using unsupervised techniques such as word2vec and doc2vec [25], [39].
The sentiment of messages has been analyzed using machine learning methods [12], [17], [20] and lexicon-based methods [25], [37]. Examples of machine learning methods used in the domain of transportation are Support-Vector Machines [12], [17], [20], Naïve Bayes models [17], Decision Trees [17], and, more recently, Google's main BERT models or their adaptations [31], [38], [42]. Whereas machine learning methods need a training dataset to learn the model from corpus data and a testing dataset to verify the model built, lexicon-based methods rely on a dictionary of words and phrases with an assigned classification value. Supervised learning approaches to sentiment evaluation are generally neither as fine-grained nor as high-performing as lexicon- and rule-based techniques. Nevertheless, some sentiment analysis research efforts have shown better results [14], [19], [21], [25], [37].
The most widely used lexicon is SentiWordNet [12], [25]. However, when we have different senses of words, this approach may not obtain good results [25]. Currently, many tools, such as Linguistic Inquiry and Word Count (LIWC), used by Kahn et al. [49], offer the means of extracting advanced features from texts. However, most of these tools require some major programming knowledge and are difficult to interpret. Some recent sentiment analysis lexicons are the Valence Aware Dictionary and sEntiment Reasoner (VADER) [50], the TextBlob [50], and the Afinn [51].
VADER is a rule-based tool specifically attuned to sentiments expressed in social media, since it works well with slang and allows classifying messages with multiclass sentiment analysis [50]. TextBlob, on the other hand, assigns both polarity and subjectivity scores to the data and performs strongly with more formal language [51]. Afinn contains a dictionary with 3,300 words and their respective sentiment scores [51]; it is one of the simplest and most popular lexicons. Although earlier studies have shown good results, it may now be considered outdated in comparison to more recent methods such as VADER and TextBlob [52].
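As a minimal sketch of how a lexicon such as Afinn operates, the snippet below sums per-word valence scores over a message. The tiny dictionary and its scores are invented for illustration only and are not the actual AFINN word list.

```python
# Minimal sketch of lexicon-based sentiment scoring in the style of Afinn:
# each lexicon word carries an integer valence, and the message score is
# the sum of the valences of the words it contains.
# NOTE: TOY_LEXICON is a fabricated illustration, not the real AFINN list.
TOY_LEXICON = {"good": 3, "great": 3, "delay": -2, "terrible": -3, "stuck": -2}

def lexicon_score(message):
    """Sum the valence of every lexicon word found in the message."""
    tokens = message.lower().split()
    return sum(TOY_LEXICON.get(tok, 0) for tok in tokens)

print(lexicon_score("great service but terrible delay"))  # 3 - 3 - 2 = -2
```

Real lexicons add rules on top of this summation (negation handling, intensifiers), which is precisely what distinguishes VADER from a plain word-score lookup.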
The literature review, outlined in Table 1, highlights several gaps in the current research on transport analysis using social media data. First, several frameworks and methods have been applied to get insight into transport using social media data with good results (e.g., [24], [25], [46]). However, some of these efforts are based on frameworks that are not capable of representing large-scale data (e.g., BoW), or supervised methods that are not capable of adapting automatically to a new context. Moreover, although one study used word embedding to identify transport-related words [39], the application of word embedding is not fully explored in the transport domain. Second, several sets of transport-related words have been analyzed ranging from 1 to 500, but the best-balanced dictionaries are not identified. Third, although some frameworks are proposed to gain insights into transport using social media (e.g., [24], [25]), their spatiotemporal analyses are poorly demonstrated. In this work, we address all these gaps using as support a preliminary study conducted by the team [53].

III. MATERIALS AND METHODS
A framework for extracting knowledge from social media is defined following the general platform proposed in Ulloa et al. [37]. This framework includes three main modules: (i) data collection and system configuration, (ii) data analytics, and (iii) aggregation and visualization.
The data collection and system configuration module includes the extraction of social media data and its storage in a local database. This step also includes a set of text pre-processing operations that prepare the messages for analysis by removing everything irrelevant to this study.
The data analytics module includes all the text classification procedures of the framework. First, a pre-trained word embedding model is implemented to select travel-related messages based on three distinct dictionaries of target words. Then, sentiment analysis is applied to analyze the opinions and emotions present in the messages. The performance of these two models is assessed to determine the degree of confidence in the results. This analysis allows for a better understanding of the different aspects that may influence the models.
The aggregation and visualization module consists of the aggregation and plotting of the results in a way to support near real-time decisions. Figure 1 shows an overview of the proposed framework, while the next sections present the implementation details of each module.

A. DATA COLLECTION
The extracted messages were collected from Twitter, which is a great source of information about public opinion and emotion [17]. Twitter is a microblogging platform that allows users to share and access messages within a group of followers. A tweet is a message with up to 280 characters,1 images, and links to other content. Tweets are classified into original tweets, replies, and retweets, each composed of several indicators. The most relevant indicators are the text content, the geographic information, and the timestamp of the tweet. Messages were extracted from the Twitter API using the default bounding boxes for the cities of London, Melbourne, and New York. Table 2 shows the corresponding South-West and North-East coordinates.
1. In November 2017, Twitter doubled the maximum allowed tweet length from 140 to 280 characters.
Tweets were collected for these three cities between May 16 and July 6, 2017. We collected all tweets written during this period inside these bounding boxes. All tweets have a geographical reference: some had specific coordinates associated, while others had a bounding box associated.
As the extracted data are in the JSON format, the JSON Encoder and Decoder library 2 was used to read JSON files.
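As a sketch of this decoding step, the snippet below reads a stored tweet with Python's standard json module. The field names follow the classic Twitter v1.1 payload (text, created_at, coordinates); the sample record is fabricated for illustration and is not real collected data.

```python
import json

# Decode a stored tweet (JSON) and pull out the three indicators used in
# this work: text content, timestamp, and geographic coordinates.
# The record below is a fabricated example, not real data.
raw = '''{"text": "Stuck on the tube again",
          "created_at": "Tue May 16 08:15:00 +0000 2017",
          "coordinates": {"type": "Point", "coordinates": [-0.1278, 51.5074]}}'''

tweet = json.loads(raw)
text = tweet["text"]
timestamp = tweet["created_at"]
lon, lat = tweet["coordinates"]["coordinates"]  # GeoJSON order: [lon, lat]
print(text, timestamp, lon, lat)
```

Note that GeoJSON stores coordinates in (longitude, latitude) order, which is easy to confuse when plotting.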
Among the three analyzed cities, New York is the smallest in area but has the highest number of inhabitants (8.6 M). London has a population comparable to New York but twice the area (1,572 km²). The main mode of transport in both New York and London is the metropolitan railway (subway). On the other hand, Melbourne has about half the population of the other two cities (4.4 M), and its popular means of transport are the tram, the train, and the bus. Of the three cities, London is the most traffic-congested. However, its levels of traffic congestion have been decreasing in the last few years [59]. The main language used in all three cities is English. Table 3 shows an overview of the main characteristics of these three cities.
A NoSQL database, MongoDB,3 was used to store the collected and processed data. This database system was selected for its convenience, since it is compatible with the JSON files collected from Twitter. MongoDB is also compatible with the Python programming language.

B. PRE-PROCESSING
Texts are represented by vectors of attribute values that quantify the number of times that a word or a group of words occurs in the document. In order to avoid the creation of vectors with a very large dimension, some adjustments may be made, such as only considering words that appear, at least, a pre-defined number of times and ignoring some grammatical conjunctions [10].
The main objective of data pre-processing is to prepare the messages for text classification. For that, we used Python's NLTK4 (Natural Language Toolkit), an easy-to-use NLP tool. NLTK incorporates most standard tasks, such as tokenization, stemming, lemmatization, punctuation handling, character count, and word count. Therefore, for each message, a set of standard pre-processing operations was conducted:
• Replace: replacing integer occurrences with textual representations and expanding contractions (e.g., 'it's' is replaced by 'it is').
• Cleaning: removing hashtags, URLs, user mentions, non-ASCII characters (e.g., emojis), retweet markers (RT), punctuation, and stopwords, since stopwords do not add value to the text (e.g., in, a, an, the).
• Lower-casing: every word is converted into lowercase.
• Lemmatization: grouping together the inflected forms of a word (only performed for verbs).
Stemming was not performed because it changes the root form of some words. For example, the word happier after stemming becomes happi instead of happy. This behavior conflicts with BERT tokenization, which relies on the WordPiece token vocabulary [60] and the embeddings presented in the next section.
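The steps above can be sketched with plain regular expressions, as below. This is an illustrative stand-in for the NLTK-based pipeline: the tiny contraction table and stopword list are abbreviated examples, not the full resources used in this work.

```python
import re
import string

# Illustrative sketch of the pre-processing operations listed above.
# STOPWORDS and CONTRACTIONS are tiny stand-ins for NLTK's full resources.
STOPWORDS = {"in", "a", "an", "the", "is", "on", "to"}
CONTRACTIONS = {"it's": "it is", "don't": "do not"}

def preprocess(text):
    """Lower-case, expand contractions, strip noise, and drop stopwords."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+", " ", text)       # URLs
    text = re.sub(r"[@#]\w+", " ", text)            # user mentions, hashtags
    text = re.sub(r"\brt\b", " ", text)             # retweet marker
    text = text.encode("ascii", "ignore").decode()  # non-ASCII (e.g., emojis)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(preprocess("RT @user: It's jammed on the A40 #traffic https://t.co/x"))
# → ['it', 'jammed', 'a40']
```

Lemmatization is omitted from this sketch; in the actual pipeline it would be applied to the verb tokens via NLTK before the stopword filter.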

C. TEXT CLASSIFICATION
An unsupervised pre-trained word embedding model was applied for selecting the travel-related messages: the BERT model (Bidirectional Encoder Representations from Transformers). This model, released in 2018, is able to pre-train deep bidirectional representations from unlabelled text [61]. Unlike previous models such as Word2Vec or FastText, which represented words either as uniquely indexed values (one-hot encoding) or as fixed-length feature embeddings, BERT produces dynamic word representations that depend on the context (nearby words) [62].
The BERT model has been used in several NLP applications, such as information retrieval, keyword expansion, and semantic search. BERT generates word and sentence embedding vectors that are high-quality feature inputs to downstream models, providing the translation of text into the numerical representations required by NLP models.
The BERT architecture was built on top of the Transformer architecture5 with some variants. The first two sizes of the BERT model made available to the public were [48]:
4. http://www.nltk.org/
5. The Transformer architecture is an encoder-decoder network that uses self-attention on the encoder side and attention on the decoder side [61].
• BERT-Base: the original BERT model, with 12 layers of transformer blocks, 768 hidden units, 12 attention heads, and 110 million parameters, trained on a corpus of 800 million words.
• BERT-Large: an extended version of the BERT model, with 24 layers of transformer blocks, 1024 hidden units, 16 attention heads, and 340 million parameters, trained on a corpus of 3.3 billion words.
Subsequently, Google released twenty-four smaller models [63] that consume fewer resources and take less time to execute without compromising performance too much. The smallest one is BERT-Tiny, which has 2 transformer layers and 128 hidden units per layer. Other examples of these smaller models are BERT-Mini, BERT-Small, and BERT-Medium. Besides these variants, which only change in dimensions, there are several others, including RoBERTa [36], ALBERT [64], DistilBERT [65], and ELECTRA [66]. These last models were developed by different companies, such as Facebook and Google AI Language, to reduce the memory footprint of the model and make it more efficient for deployment on devices with limited resources. Each of these models has unique advantages and may be more suitable for certain use cases depending on the specific requirements of a given task.
The BERT model can also be classified as "cased" or "uncased." The "cased" model is helpful where accents and capitalization play an important role.
Every input embedding is formed by three different embeddings: (i) Position Embeddings, which express the position of words in a sentence; (ii) Segment Embeddings, which distinguish the sentences in a sentence pair; and (iii) Token Embeddings, the specific token embeddings from the WordPiece token vocabulary [61].
The BERT model, at its core, was trained on the BooksCorpus (800M words) and English Wikipedia (2,500M words) over approximately 40 epochs, using a dropout of 0.1 on all layers and the GELU activation [61]. The results demonstrate that this unsupervised pre-training model can successfully tackle a broad set of NLP tasks.
Subsequently, Google also released a BERT-Base model specifically trained on the Chinese language and a Google BERT-Base Multilingual model, which, unlike its previous version, is cased. This Multilingual version comprises the "top 100 languages with the largest Wikipedias," according to its developers.6
In this work, we use the BERT-Base-uncased model to extract high-quality language features from the text (feature-based approach). Although fine-tuning the model can boost performance further [48], it was not used in order to maintain simplicity, speed, adaptability, and interpretability. This approach is well-suited for straightforward analysis, limited computational resources, and applications where transparency is important. The "uncased" model was selected because messages posted on Twitter are short and informal, and not always in their correct form.
For the input text, BERT expects a tokenization including two special tokens: one that marks the beginning of the sentence ([CLS]) and another that marks the end of the sentence or the separation between two sentences ([SEP]). So, the tokenized text must have the format [CLS] sentence [SEP].
In the field of computational linguistics, an n-gram is a sequence of n items from a given sample of text. The items can be characters, sequences of characters (e.g., words), or sequences of words. The length of the samples defines the n in the n-gram. In this work, each item corresponds to a word, so a unigram approach is used.
6. https://github.com/google-research/bert
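The n-gram view and the special-token wrapping described above can be sketched as follows. This is a self-contained illustration over whitespace tokens; actual BERT tokenization further splits words into WordPiece sub-tokens, which is omitted here.

```python
# Sketch of n-gram extraction over word tokens; with n = 1 this reduces to
# the unigram approach used in this work, where each item is a single word.
def ngrams(tokens, n):
    """Return the list of n-grams (tuples of n consecutive tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the bus is late".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# BERT-style input wrapping with the special [CLS] and [SEP] tokens.
bert_input = ["[CLS]"] + tokens + ["[SEP]"]

print(unigrams)    # [('the',), ('bus',), ('is',), ('late',)]
print(bert_input)  # ['[CLS]', 'the', 'bus', 'is', 'late', '[SEP]']
```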
The BERT model was assessed for three different dictionaries of travel-related target words: (#1) a small dictionary with ten words focused on the names of the main urban transport modes; (#2) a medium dictionary with thirty-five words that includes all the words from the small dictionary and a few more travel-related target words; and (#3) a big dictionary with three hundred and forty-four words, adapted from a semi-automatically constructed transport lexicon created by Gal-Tzur et al. [67]. These dictionaries were defined to be scalable; therefore, no site-specific target words, such as the names of specific subway facilities, were included. Table 4 presents these three dictionaries of travel-related target words.
The cosine function was used to measure the similarity (cosine similarity) between the embedding of each message and those of the chosen dictionary. This similarity was then used to decide whether the content was related to transportation or not.
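A minimal sketch of this similarity measure is given below. The 3-dimensional vectors are toy values for illustration; actual BERT-Base embeddings have 768 dimensions.

```python
import math

# Cosine similarity between two embedding vectors: the dot product divided
# by the product of the vector norms. Values close to 1 indicate that the
# message embedding points in the same direction as the target-word embedding.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors standing in for 768-dimensional BERT embeddings.
message_vec = [0.2, 0.8, 0.1]
target_vec = [0.25, 0.75, 0.05]
score = cosine_similarity(message_vec, target_vec)
print(round(score, 3))
```

In the framework, a message whose similarity to the dictionary exceeds a chosen threshold is labeled travel-related.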

D. SENTIMENT ANALYSIS
Sentiment analysis is performed using a lexicon- and rule-based tool called VADER7 (Valence Aware Dictionary and sEntiment Reasoner), implemented in Python [50]. VADER is optimized for social media data and is based on a list of lexical features, each with an assigned sentiment value. According to their semantic orientation, VADER then calculates the text sentiment [68].
VADER uses a set of rules, called heuristics, to better quantify the magnitude of the sentiment in a sentence.
• Punctuation can increase the magnitude of the sentiment (e.g., using exclamation points).
• Capitalization (e.g., using ALL-CAPS), in the presence of other non-capitalized words, can increase the magnitude of the sentiment intensity.
• Degree modifiers (also called booster words, degree adverbs, or intensifiers) can either increase or decrease the sentiment intensity (e.g., "extremely beautiful").
• Sentiment polarity shifts due to the use of conjunctions, with the sentiment of the text following the contrastive conjunction being dominant (e.g., when "but" is used).
• Polarity negation, where negation flips the polarity of the text, is caught by checking a contiguous sequence of three items preceding a sentiment-laden lexical feature.
7. https://pypi.org/project/vaderSentiment/
VOLUME 4, 2023 669
A compound score metric is used to calculate the sum of all the lexicon ratings (valence scores). The metric is then normalized between -1 and +1, where the maximum value means the most positive sentiment.
VADER returns the probability of the sentiment of a sentence being negative (compound score <= -0.05), neutral (-0.05 < compound score < 0.05), or positive (compound score >= 0.05). The sum of the three probabilities must be equal to 1.
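The compound-score cut-offs described above can be sketched as a simple labeling function; the three example scores are illustrative values, not outputs from the actual corpus.

```python
# Sketch of the compound-score thresholds: VADER's compound value lies in
# [-1, 1], and the conventional cut-offs at +/-0.05 split it into negative,
# neutral, and positive labels.
def label_from_compound(compound):
    if compound <= -0.05:
        return "negative"
    if compound >= 0.05:
        return "positive"
    return "neutral"

print(label_from_compound(0.62))   # positive
print(label_from_compound(-0.31))  # negative
print(label_from_compound(0.01))   # neutral
```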
In this work, no words or expressions crucial for the sentiment analysis were found missing from the lexicon, nor were any lexicon entries removed.

E. PERFORMANCE
The performance evaluation of the classifiers measures the agreement between their results and human judgment. In this study, the evaluation was conducted for both the text classification and sentiment analysis models. A total of 1,000 tweets were selected for each city and each target word dictionary: 500 travel-related and 500 non-related tweets for text classification, and 500 positive and 500 negative tweets for sentiment analysis. A travel-related tweet refers to a message that provides information, recommendations, opinions, or updates about various aspects of travel. These messages often cover a wide range of topics, including transportation modes, service quality, travel itineraries, and personal travel experiences. Additionally, they may include discussions or updates on specific travel-related events, traffic incidents, congestion situations, or other notable occurrences that impact travel. Examples of traffic-related tweets are:
• Positive traffic-related tweets:
-"Loving the efficient public transportation options. Just hop on the subway or catch a bus!"
-"Where is the traffic congestion today?"
• Negative traffic-related tweets:
-"Stuck in traffic again on my way to the airport. Will probably miss my flight. So frustrating!"
-"I've always wanted to experience a slow-motion race. NY is perfect for that!"
In the case of sentiment analysis, neutral tweets were excluded from the performance evaluation. This was done because neutral tweets have a compound value within the range of -0.05 to 0.05, which is considered "non-labeled" according to the VADER sentiment analysis (the compound value ranges between -1 and 1). This evaluation procedure aligns with similar research efforts [20], [26], [54].
For the classification of tweets, it is important to highlight that, beyond transportation-related aspects, a multitude of factors can contribute to their sentiment. These factors may include personal characteristics (such as age), individual opinions (such as a preference for a specific travel mode), the travel purpose (e.g., leisure or work), the occurrence of events (e.g., accidents or attending a concert), or environmental circumstances such as meteorological conditions (e.g., sunny or rainy days). Therefore, these factors may act as confounders of the sentiment expressed in tweets. Addressing this concern would require a holistic approach that takes into account the broader context of sentiment analysis, which is out of the scope of this work.
The selection was completely random and designed to capture tweets created in different contexts. This set of messages was manually labeled to confirm whether each was travel-related or not, and whether its sentiment was positive or not. Two human annotators performed this classification. No specific training was provided to these labelers. The classifications were compared, and the sentences with divergent classifications were checked again.
Accuracy is the proportion of correctly labeled tweets in the entire sample. Precision is the proportion of positive classifications that are actually correct and recall is the proportion of actual positives that are classified correctly. F1 is the harmonic mean of precision and recall [26]. The closer the performance of all measures is to 1, the better the performance of the models.
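These four measures can be sketched directly from the confusion counts, as below. The counts in the example are hypothetical values chosen for illustration, not results from this study.

```python
# Sketch of the four evaluation measures computed from the counts of
# true positives (tp), false positives (fp), false negatives (fn), and
# true negatives (tn) on a manually labeled sample.
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical confusion counts for a 1,000-tweet evaluation set.
acc, prec, rec, f1 = classification_metrics(tp=450, fp=110, fn=50, tn=390)
print(acc, prec, rec, f1)
```

The closer each value is to 1, the better the model's agreement with the human labels.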

F. VISUALIZATION
The visualization consists of the aggregation and plotting of the results in a simple and objective way so they can be easily read and interpreted. This module allows visualizing the knowledge extracted from social media messages taking into account the variations of space and time.
For graphical plotting, first, some data pre-processing is conducted. For each city and each dictionary of target words (Table 4), some analyses are performed, such as counting the travel-related tweets written per hour of the day and per day of the week, or assembling the location of each tweet.
For tweet mapping, the sending coordinates of the messages are used. However, instead of having specific coordinates assigned, some tweets are associated with bounding boxes. In these cases, an approximate coordinate was assigned to those tweets by calculating the middle point of the bounding box.
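The midpoint assignment can be sketched as below. The corner coordinates are illustrative values approximating London's bounding box, in the (longitude, latitude) order used by the Twitter payload, and are not the exact values from Table 2.

```python
# Sketch of assigning an approximate point to a tweet that only carries a
# bounding box: take the midpoint of its South-West and North-East corners.
# Coordinates are in (longitude, latitude) order.
def bbox_midpoint(south_west, north_east):
    return ((south_west[0] + north_east[0]) / 2,
            (south_west[1] + north_east[1]) / 2)

# Illustrative corners roughly matching London's bounding box (lon, lat).
mid = bbox_midpoint((-0.510, 51.28), (0.334, 51.69))
print(mid)
```

Averaging corners is a coarse approximation: for city-scale boxes the error is acceptable, but the resulting point should not be treated as an exact sending location.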
Matplotlib,8 a free and open-source Python library, was used for tweet visualization. Matplotlib is a low-level graph plotting library that serves visualization purposes.

IV. RESULTS
The text classification and sentiment analysis results are presented for three different real case studies (London, Melbourne, and New York), considering three sets of travel-related target words: a small dictionary, a medium dictionary, and a big dictionary (Table 4). Sentiment analysis is performed for the dictionary of target words with the highest performance. The results are analyzed considering the temporal and spatial aspects of the tweets. For a better understanding of the impact of the proposed framework, a specific traffic event in New York City is analyzed.
Table 5 presents an overview of the number of tweets collected and the number of tweets remaining after the pre-processing phase. Similar results were obtained among the three analyzed cities. On average, 0.9% of the data was discarded during the pre-processing phase.
8. https://matplotlib.org/

A. PRE-PROCESSING
About 30% of the tweets have hashtags and more than 50% contain URLs. The number of tweets with hashtags and URLs is similar among the three cities analyzed. However, the percentage of tweets with user mentions is much higher in London (45.6%) and Melbourne (44.2%) than in New York (31.1%), which denotes a different social interaction and environment.
Table 6 showcases the performance results of the BERT and VADER models. The VADER model results are presented for the medium dictionary of target words (Table 4, #2), which achieved the best performance results for the BERT model.

B. PERFORMANCE ASSESSMENT
The medium dictionary (Table 4, #2) demonstrated high performance across the three assessed cities. The average Precision achieved is 0.80, while both Accuracy and F1-score attained 0.90. The average Recall, at 0.99, stands out as the best achieved. These results are consistent across all three cities.
The results are consistent for the three cities. However, it is worth noting that, due to the use of a unigram approach, some words listed in the medium dictionary, such as train, station, and underground, were sometimes wrongly associated with mobility. Despite this ambiguity, the overall performance remains consistently good, with all performance measures for the case studies surpassing 0.75.
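The unigram matching that produces this ambiguity can be illustrated with a toy version of the dictionary lookup. The word list below is an assumed subset, not the paper's actual medium dictionary.

```python
# Illustrative subset of travel-related target words (assumed, not the full medium dictionary)
MEDIUM_DICT = {"train", "station", "bus", "traffic", "road", "underground", "commute"}

def is_travel_related(tweet, dictionary=MEDIUM_DICT):
    """Unigram approach: a tweet is flagged as travel-related if any token is a target word."""
    tokens = (tok.strip(".,!?;:") for tok in tweet.lower().split())
    return any(tok in dictionary for tok in tokens)
```

Note how a gym-related use of "train" would be flagged as travel-related, exactly the kind of false positive discussed above.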
Text classification systems for social media face many challenges. The main one is certainly dealing with large vocabularies, since the lexical variation caused by informal text is significant and the messages are short. Managing this problem requires careful handling in the pre-processing stage to mitigate the lexical variation.
A benchmark analysis shows that the results of the BERT model are in agreement with the ones obtained in the literature (Section II). Table 6 summarizes the comparison of the different models. Such a comparison was performed for New York City.
Better results were obtained with the BERT model than with supervised machine learning techniques for classification, such as Support-Vector Machines, Logistic Regression, and Random Forests, namely using BoW. Although the highest Precision values were obtained using the Linear SVM (0.864) and Logistic Regression (0.849), in both cases using BoE, the BERT model (dictionary #2) achieves similar values (0.824). The highest Recall (0.990) and F1-score (0.900) values are obtained using the BERT model, and these values are far above those obtained with the other assessed models, which denotes the model's superiority.
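For reference, a BoW baseline of the kind compared here can be assembled in a few lines with scikit-learn. This is a generic sketch, not the exact configuration benchmarked in the study.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def bow_logreg_baseline():
    """Bag-of-Words unigram features fed to a Logistic Regression classifier."""
    return make_pipeline(
        CountVectorizer(ngram_range=(1, 1), lowercase=True),
        LogisticRegression(max_iter=1000),
    )
```

Swapping `LogisticRegression` for `LinearSVC` or `RandomForestClassifier` yields the other baselines mentioned above; unlike the dictionary-based BERT approach, all of them require a manually labeled training set.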
For the small and big dictionaries of travel-related target words, the performance of the BERT model was similar to each other and, in general, worse than that obtained with the medium dictionary. This happens because the small dictionary is too short and incomplete, so the BERT model is unable to label a great number of travel-related messages as such, and the number of false negatives is high. On the other hand, the big dictionary contains many ambiguous words that increase the number of false positives compared to the medium dictionary.
To illustrate the classification, Table 7 shows examples of both travel and non-travel-related tweets classified according to the confusion matrix. Notably, the model correctly classifies overt transport-related discussions, such as a delayed journey to the airport, while also accurately identifying non-transport-related discourse, as seen with the tweet about a promotion at work. However, the model falters when the language used is either metaphorical or ambiguous. For instance, the model misclassifies a motivational tweet about a "train at 5 pm" as non-transport-related due to the surrounding context of self-improvement. Similarly, it mistakes a metaphoric mention of roads in a resilience-related statement as transport-related. These examples underscore the inherent challenges in text classification tasks, where accuracy depends heavily on the model's ability to discern nuanced language use and contextual cues.
The VADER model consistently yielded good results for the three evaluated cities. London exhibited the highest ratings, with values over 0.80 for Accuracy, Recall, and F1-score. The average Precision score was 0.77, while Recall, Accuracy, and F1-score hovered around 0.78. These results demonstrate the model's reliability across the three assessed cities. For examples of sentiment-classified travel-related tweets, along with their corresponding confusion matrix, please refer to Table 8. The model effectively identifies straightforward positive and negative sentiments, as evidenced by accurately classifying a tweet praising public transportation and a complaint about traffic jams. However, the model stumbles when faced with linguistic subtleties. It erroneously classifies a neutral or mildly positive query about the absence of traffic congestion as negative, indicating potential difficulties in handling indirect or implicit sentiments. Additionally, the model fails to interpret the sarcasm in a tweet critiquing slow traffic in New York, incorrectly classifying it as positive. These results underscore the complexity of sentiment analysis tasks, highlighting the need for models to better understand linguistic nuances, context, and subtleties such as irony or sarcasm for improved accuracy.
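The VADER-style decision rule can be sketched as follows. The miniature lexicon and the crude score normalization are stand-ins for VADER's roughly 7,500-entry scored lexicon and its heuristics, but the ±0.05 compound thresholds mirror the conventional ones.

```python
# Tiny illustrative lexicon (assumed scores); the real VADER lexicon is far larger.
LEXICON = {"great": 3.1, "love": 3.2, "good": 1.9, "jam": -1.0,
           "terrible": -2.1, "delay": -1.3, "stuck": -1.5, "hate": -2.7}

def classify_sentiment(text, pos_th=0.05, neg_th=-0.05):
    """Lexicon rule-based polarity: mean word score squashed into [-1, 1],
    then mapped to a label with VADER-style compound thresholds."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    compound = sum(scores) / (4 * max(len(scores), 1))  # crude normalization
    if compound >= pos_th:
        return "positive"
    if compound <= neg_th:
        return "negative"
    return "neutral"
```

As the examples in Table 8 suggest, a purely lexical scorer like this has no mechanism for sarcasm or implicit sentiment, which is exactly where VADER's misclassifications occur.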

C. TEMPORAL ANALYSIS
Among the three analyzed cities, New York is the one that has the highest average number of daily tweets (approximately 93 000 daily tweets). The average period with the most tweets posted is in the evening (20:00-21:00); however, no pattern related to the day of the week is identified.
Regarding the travel-related tweets collected with the medium dictionary, we found 186 251 tweets in London, 26 744 tweets in Melbourne, and 440 877 tweets in New York matching the words selected in Table 4 (#2). These values represent around 3% (for London and Melbourne) and 5% (for New York) of the total messages collected. The number of travel-related tweets is in accordance with other studies (e.g., [69]). Figures 2 and 3 show the average number of tweets per hour of the day and day of the week, respectively, using box plots to illustrate both the average and deviations. The inclusion of average values and deviations in the box plots offers additional insights into the variation and consistency of tweet frequencies within each category. The results are shown for each dictionary of travel-related words (small, medium, and big dictionaries) and city (London, Melbourne, and New York).
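The hourly distributions behind these figures reduce to a simple aggregation over tweet timestamps; a minimal sketch:

```python
from collections import Counter

def hourly_shares(timestamps):
    """Percentage of tweets posted in each hour of the day (0-23)."""
    counts = Counter(ts.hour for ts in timestamps)
    total = sum(counts.values())
    return {h: 100 * counts.get(h, 0) / total for h in range(24)}
```

Grouping by `ts.weekday()` instead of `ts.hour` gives the day-of-week breakdown in the same way.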
The analysis of the distribution of travel-related tweets throughout the day shows similar patterns among the analyzed cities. Consistency is observed for each city and each dictionary of target words. However, it is worth noting that a high number of outliers is recorded when using the small dictionary (Table 4, #1) and the big dictionary (Table 4, #3) of travel-related target words, particularly noticeable in London.
Tweet activity related to travel is higher during working hours, representing approximately 5-6% of the total daily posted tweets during this period. Throughout the day, an hourly deviation of around 35% from the mean values is observed, with the highest standard deviation usually recorded during or around rush hours. The minimum hourly number of travel-related tweets is reached during dawn.
The maximum number of tweets is reached at 16:00 in New York (6.0%), at 17:00 in London (6.5%), and at 18:00 in Melbourne (7.6%). According to the TomTom traffic congestion index, these peaks coincide with the afternoon rush hour or occur one hour after it [59]. The patterns recognized in this analysis are consistent with those identified in the TomTom index. According to this index, the traffic rush hours are from 05:00 to 09:00 and from 13:00 to 20:00 in the three cities [59].
Between the different days of the week, no relevant differences were found, which suggests that the day of the week may not have a great influence on travel-related messages posted on Twitter. However, on weekends, the number of travel-related tweets is slightly lower than on weekdays. Figure 4 shows the average number of travel-related tweets per day of the week and hour of the day, respectively, considering their sentiment: positive, negative, and neutral. The results are shown for London, Melbourne, and New York, according to the medium dictionary of target words (Table 4, #2).
The percentage of positive and neutral travel-related tweets is almost equal in London and Melbourne. However, in New York, the number of neutral travel-related tweets is higher than the number of positive ones. In the three cities, the percentage of negative travel-related tweets is around 50% of the positive ones.
The hourly variations are consistent for each hour of the day. Per hour, the sum of positive and negative tweets represents around 80% of the total tweets observed. The majority of outliers are neutral. Negative outliers usually start after lunchtime and their intensity decreases around 21:00. However, this is observed only in London and New York.
Throughout the week, Wednesday and Thursday are the most popular days. The number of positive and negative travel-related tweets reaches, during these days, the maximum values of the week in all three analyzed cities (Figure 4). This might suggest that these days are the busiest. This is in line with the information provided by the TomTom traffic index, according to which Thursday is one of the most congested days of the week in these cities [59].
Weekends are, in general, the period of the week with the fewest travel-related tweets. This trend is clearly observed in the biggest cities (London and New York), but not in Melbourne, which can be related to the low number of travel-related tweets recorded there daily, with around 100 positive tweets and around 50 negative tweets.
Soft modes and public travel-related words are popular among London and Melbourne citizens. While in these two cities the two most popular travel-related target words are train and walk, in New York the most popular words are street and avenue. This is in accordance with the mobility modes most commonly used in these three cities. The word traffic occupies a similar position in the list of popular words for the three cities: 10th in London, 7th in Melbourne, and 9th in New York. However, the word accident ranks 11th in New York City and 16th in London and Melbourne, which suggests that, in general, the levels of congestion in New York City are high. According to the TomTom traffic index, in 2017 London and New York City were in the 34th and 35th positions, respectively, of the most congested cities of the world, while Melbourne was in 86th. Figure 5 shows the number of tweets with travel-related words for the three cities.

D. SPATIAL ANALYSIS
The geographical plot is important to highlight the busiest zones of a city. Therefore, Figure 6 represents the geographical distributions of total and travel-related tweets for London, Melbourne, and New York. Tweets were mapped according to the medium dictionary of target words (Table 4, #2).
There is a sharp difference between the global tweet plots and the travel-related tweet plots. The travel-related plots show lower density than the global tweet plots because of the filtering performed by the BERT model. The city center of each city is the zone with the highest number of tweets, and it corresponds to the zone with the highest population density. Moreover, the maps on the right of Figure 6 identify the areas with high mobility and traffic congestion levels. For example, in the city center of London, the closer one gets to the river, the denser the plot becomes. This is expected, as the city center and its adjacent suburban areas are the most populated zones in London, where around 9 million people live [70].
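Density maps like these come down to binning coordinates into grid cells. A minimal sketch, with an assumed cell size of 0.01 degrees (roughly 1 km at these latitudes):

```python
from collections import Counter

def grid_density(points, cell=0.01):
    """Count (lon, lat) points per square grid cell of side `cell` degrees."""
    return Counter((int(lon // cell), int(lat // cell)) for lon, lat in points)

def densest_cell(points, cell=0.01):
    """Grid cell with the highest tweet count — a proxy for the busiest zone."""
    counts = grid_density(points, cell)
    return counts.most_common(1)[0]
```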
To analyze the distributions of travel-related tweets, maps of positive and negative tweets were produced for the city of New York, for May 17th, 2017 (Thursday). Figure 8 and Figure 7 plot the distribution from 06:00 to 00:00. These maps were computed from the results of the VADER model obtained using the medium dictionary of travel-related target words (Table 4, #2).
As can be observed, the distribution of travel-related tweets changes throughout the day, with the density of tweets increasing in the morning. Between 06:00 and 12:00, fewer tweets are identified than during the rest of the day. However, as it gets closer to 21:00, the density of tweets grows considerably, before decreasing from 21:00 to 00:00.
Regarding the sentiment, as the density of tweets grows, more negative tweets appear, approaching the number of positive ones. During morning and afternoon rush hours, there is a trend for the number of negative tweets to approach the number of positive tweets, suggesting that traffic-related issues and congestion during peak travel times might contribute to a higher occurrence of negative sentiment expressed on social media.
In order to go further in the spatial analysis of the proposed framework, a specific traffic incident in the city of New York was analyzed. At 11:53 EDT on May 18th, 2017, a car crashed in Times Square [71]. One person died and 20 others were injured. To analyze this occurrence, the model output is plotted for the hour before and the three hours after the accident. Figure 9 presents the spatial distribution of positive and negative travel-related words over Manhattan and the list of top travel-related words from the medium dictionary (Table 4, #2) across this period.
During the hour before the crash (from 11:00 to 12:00), the flow of tweets was low and mostly positive in sentiment (Figure 9a). However, right after the hour of the crash, the density of tweets grows. A stain of negative messages appears in the center of Manhattan (right behind the black dot), suggesting the occurrence of a traffic incident at this location. During this period, positive tweets still hold a clear advantage over negative ones. However, the negative pattern quickly spreads across all of Manhattan as well as its surroundings (Fig. 9b, c, and d). A red stain covers Manhattan, which can be related to the crash reported in [71]. Three hours after the accident, the number of negative tweets increases even more, which suggests that the accident caused a general increase in traffic congestion in the city.

An analysis of the most popular words used in travel-related messages allowed us to identify the causes of the traffic incident. The word accident should be highlighted: from 11:00 to 12:00 it was used 6 times, but right after the crash, from 12:00 to 13:00, it was the most used word, with 74 appearances, more than twelve times the previous hour's count. Some words related to the accident are also present in the group of the most used words after the crash, such as driver, car, and street. Regarding the sentiment, one word that appeared mostly after the crash and has a negative impact (besides the word accident) is the word traffic.
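Frequency jumps of this kind can be detected by comparing word counts across adjacent time windows. A minimal sketch, with illustrative window contents:

```python
from collections import Counter

def word_spikes(words_before, words_after, min_count=5):
    """Words whose count rises between two windows, with their growth ratio.

    `min_count` filters out words too rare in the later window to be reliable."""
    before, after = Counter(words_before), Counter(words_after)
    return {
        w: after[w] / max(before[w], 1)
        for w in after
        if after[w] >= min_count and after[w] > before[w]
    }
```

Applied to the tokens of travel-related tweets before and after 12:00, such a comparison would surface "accident" as the dominant spike.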

V. DISCUSSION
Passenger transit is projected to rise, with respect to 2005, by around 34% in 2030 and 51% by 2050 [59]. Consequently, the social costs of accidents, noise, and pollution will continue to rise dramatically worldwide. In Europe, congestion costs amount to about 1% of gross domestic product (GDP) each year [72], and these costs are expected to increase by about 50% by 2050 [73]. In this context, crowd-sensing will become increasingly important to reduce costs and time, as well as to complement traditional surveying methods. Although traditional surveying methods and crowd-sensing methods are different in essence, they have complementary characteristics. Traditional measures of urban transport management are mostly based on automatic traffic counters, usually used to monitor mobility flows at strategic points of the city, and on surveys used to define origin-destination matrices. Crowd-sensing methods can provide a good additional source of information.
The framework proposed provides contextual transport information in near real-time with high spatial detail, using social media data. The traditional tools used for transport management are highly reliable, as they are periodically calibrated and monitored. However, in high-density areas, crowd-sensing methods based on a large amount of social media data can also be reliable, as a significant group of users recurrently validates the veracity of the provided information. In this sense, the users of social media act as semantic sensors. In addition to the high diversity of user types on social media, which depends on the gender, age, and race of users, among other characteristics, social media have a high posting cadence and wide geographical spread. Such characteristics allow us to classify them as an excellent data source for mobility intelligence applications, not only at the operational control level but also for tactical-strategic traffic management.
Twitter users are mostly younger, more affluent, and more educated than the general population as a whole [74]. Younger people have high mobility levels, allowing them to report transport information over a wider spatial area and during longer periods. Although they are not representative of the overall population, this is not critical for the identification of transport bottleneck points. On the other hand, elderly people may have physical, cognitive, or sensory limitations that reduce their capacity to report clear information quickly.
While some studies have attempted to analyze transportation information from social networks, they typically rely on the use of keywords (e.g., [14], [21]) or specific data sources related to the topic (e.g., [20], [42]). Other studies have extracted general social media messages (e.g., [26], [34], [56]), but they used supervised machine learning models that require the creation of a manually classified training dataset, which is not only time-consuming but also limits the model's ability to adapt to new realities.
The use of dictionaries to identify messages from a specific context has several advantages. It is a straightforward and transparent method that does not require the creation of a manually classified training dataset. This is because dictionaries contain pre-defined target words already associated with a domain, simplifying the classification process. In addition, dictionary-based approaches can achieve high accuracy in classifying messages, as they rely on specific target words that are strongly associated with a domain, making it easier for the algorithm to classify a message as relevant to transportation. Finally, using dictionaries allows for the customization of the classification process, enabling users to create dictionaries tailored to specific contexts or related topics, thus increasing the accuracy of the classification process.
Despite the advantages of using dictionaries, few studies have utilized them to classify transportation-related messages, and to the best of our knowledge, our study is the first to compare different dictionaries for this purpose. In this research, we employ BERT to classify travel-related tweets using a unigram approach and three dictionaries of varying sizes: small (N = 10), medium (N = 35), and large (N = 344).
The medium dictionary showed the best performance results for the three analyzed cities. This happens because the medium dictionary is the least ambiguous and the most balanced of the three dictionaries. In a unigram approach, some words listed in the medium dictionary, such as train and station, were sometimes wrongly associated with mobility. However, this performance may be improved with the use of an n-gram approach.
Although the three dictionaries did not contain target words for specific subway facilities such as stations, the model's consistently high performance across all three cities demonstrates its great versatility and adaptability to different contexts.
The preprocessing steps transform messages into a representation that can be effectively classified. By enabling the models to understand and extract meaningful features from the text data, these preprocessing steps contribute to achieving accurate classification results. However, it is important to acknowledge that certain preprocessing steps, such as the handling of capitalization, may have an impact on further operations, including sentiment assessment. Although it is feasible to recover the original tweet formatting before the sentiment assessment, determining the best approach for incorporating this information in the sentiment evaluation is not straightforward. Future research should focus on addressing this aspect, as in some cases it may significantly influence the accuracy of the results obtained.
The results obtained suggest that the proposed framework is particularly useful in an urban environment, where the number of travel-related tweets is high. The performed analysis, particularly for Manhattan (Figure 9), demonstrates that the BERT-Base-uncased version of the BERT model detects with high reliability the location and time of a traffic incident, such as a vehicle crash, using data from social media. The model is also able to monitor the spatial-temporal trends of mobility. However, as we move away from the city centers, the number of tweets decreases, lowering the confidence level of the knowledge extracted from these tweets. Therefore, for suburban and rural areas, further analysis must be conducted in order to understand the degree of reliability. Although the framework developed was used to demonstrate a relationship between the number of tweets and traffic congestion at a macroscopic level, it is noteworthy that this platform has the capability to provide analysis at various levels, including at the road section level. This microscopic analysis could offer more detailed insights into transportation patterns; however, it is important to acknowledge the trade-offs associated with conducting such analysis.
Social media data does not always provide precise location information, and sometimes the available location can be incomplete or inaccurate. On May 17th, 2017, 61.5% of users provided precise location information (coordinates) for their travel-related messages. For tweets without this information, some recently developed tools (e.g., [58]) can be applied. In addition, the delay between an event and the posting of a message can impact the accuracy of temporal analysis, as the sentiment expressed may not be immediate or reflect real-time conditions. Similarly, sentiment analysis may be influenced by this temporal gap, as user emotions and perceptions can change over time. The time lag between an incident occurring and its subsequent reporting can therefore introduce bias into the analyses, creating uncertainties when performing detailed and localized analyses in the context of intelligent transportation systems. Caution must thus be exercised when relying solely on social media data for such analyses.
By integrating additional data sources, as done in the work of Wan et al. [58], the accuracy and reliability of transportation planning and analysis can be enhanced. Suitable techniques to account for the time lag and mitigate biases in sentiment analysis can also be used to further improve the effectiveness of the application. It is worth noting that, while the potential integration of these approaches is acknowledged, they were not the primary focus of this work. Considering these aspects in future research can further enhance the application's capabilities and overall performance in transportation planning and analysis.
The proposed framework efficiently collects a large volume of social media messages, accurately filters and identifies travel-related messages, and provides their sentiment in a simple manner. The framework's adaptability to different contexts is demonstrated by its successful application in three case studies. The spatial-temporal distribution of travel-related messages, and their sentiment, can provide decision-makers with insights into the causes of traffic incidents. The system's ability to provide contextual details in near real-time enables transport engineers, urban planners, researchers, and policymakers to effectively manage resources and increase operational efficiency and effectiveness. This is particularly important in situations such as traffic congestion, vehicle crashes, terrorist attacks, or severe weather events.

VI. CONCLUSION
The work harnesses social media as a means to improve mobility intelligence, capturing insights into transportation patterns and behaviours. By treating social media users as semantic sensors, the study enhances the understanding of mobility trends, thereby informing future transportation planning and empowering decision-makers. The work offers several distinct contributions that set it apart from existing methods in the field.
Firstly, this work introduces a comprehensive framework for effective text mining of social media data. The proposed methodological framework encompasses data collection, system configuration, data analytics, and aggregation and visualization modules, forming a complete pipeline for efficiently aggregating sparse information. NLP techniques were used to extract and pre-process the data, including cleaning the text, which is full of informal words, slang, and misspellings.
Secondly, the framework has been designed to handle the challenges associated with extracting valuable insights from the vast amount of information available on social media platforms, addressing the issues usually encountered by traditional approaches. Thus, a scalable framework was designed.
The framework collects pervasive data from social networks and uses filtering techniques (rather than specific pages or users) to identify key messages. By using dictionaries for this purpose, the framework enables quick identification of messages from a specific context with high accuracy, offering customization of the classification process.
For text classification, the unsupervised pre-trained word embedding model BERT-Base-uncased was used with a unigram approach and three dictionaries of travel-related target words: small (N=10), medium (N=35), and big (N=344). No location-specific words (e.g., the name of a bus stop) were used in these dictionaries, which makes them generalizable to other travel-related tasks. Moreover, the use of this learning method avoids the need for manually classified training datasets or specific data sources, allowing the framework to adapt and improve its performance over time. Sentiment analysis was performed using VADER, a lexicon- and rule-based tool.
Lastly, the framework was successfully utilized to demonstrate the practical utility of social media as a valuable data source for traffic management. Data from Twitter was collected and assessed for three major worldwide cities: London, Melbourne, and New York. The framework showed consistently high performance for the three cities, which demonstrates its adaptability to new contexts. The analysis included an assessment of public satisfaction levels during a car crash event in Manhattan. The results showed a drastic increase in dissatisfaction after the crash. The location of the accident is clearly identified, as well as the spread of congestion around the city over time. The visualization tool allowed the analysis of both the temporal and spatial distribution of the travel-related tweets.
The study demonstrated that the proposed framework efficiently aggregates sparse information from social media to identify travel-related aspects with high spatial resolution in near real-time, which is an improvement compared to traditional tools of transport management. The information provided by this framework can be used to support decision-making for transport authorities, including transport engineers, urban planners, researchers, and policymakers. This information is particularly useful for operational traffic control, as well as operational, tactical, and strategic traffic management, as it allows for better resource management and evaluation of transport policies using more participatory approaches.