InÉire: An Interpretable NLP Pipeline Summarizing Inclusive Policy Making Concerning Migrants in Ireland

Reaching marginal and other migrant communities to elicit their political views and opinions is a well-known challenge. Social media has enabled a certain amount of online activism and participation, especially in societies with abundant multicultural identities. However, it can be quite challenging to isolate the voice of the migrant in English-speaking countries, especially with an abundance of content in English on social media. In this paper, we pursue a case study of Ireland’s Twitter landscape, specifically migrant and native activists. We present a methodology that can accurately ($>80\%$) isolate the Irish migrant voice with as little as 25 English tweets without relying on user metadata and using simple, highly explainable, out-of-the-box machine learning methods. Using this, we distil (via sentiment analysis) polarities of views, segment (via BERT-based topic modelling) and summarise (via ChatGPT) differentiated views in a consumable manner for policymakers. Our approach enables policymakers to further their understanding of multicultural communities and use this to inform their decision-making processes.


I. INTRODUCTION
Social media has become an essential day-to-day communication platform [1], [2] that can be employed to study specific social contexts and processes through the lens of the user-generated content it encompasses. While having a multi-faceted role as a communications platform, it has also become a platform for knowledge sharing, activism, and journalism. The scale of the underpinning network structure allows social media researchers to answer questions that perhaps more traditional methods cannot, for example, when a specific user (sub)population is hard to reach. This is often the case when trying to study and understand the issues or challenges immigrants face: their voice is often politically underrepresented, especially if they are outside mainstream society [3] or systemically excluded from national or international policy-making forums [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Yichuan Jiang.
Ireland has recently grown into a multicultural society [5] with an increase in migration of foreign working professionals, international students, asylum seekers and refugees. Due to its welcoming policies confirming diversity and inclusion, Ireland has embraced these communities as their own [6]. Correspondingly, there has also been a significant growth in online activism on popular social networks, such as Twitter [7]. Hence, social media should be a valuable tool in trying to understand the non-native experience in countries like Ireland. This is our objective in this work: to try and ascertain a methodology that can be used to isolate (and later summarise) the voice of the non-native populations of Ireland to support more inclusive policy decisions.
In any society where multicultural identities are present, there remains a boundary between natives and migrants in various aspects. More specifically, over the years, in cases of increased migration to European Union countries, there remains a certain level of scepticism on policies proposed for integrating migrants into society [8]. The general consensus, however, is that the policies need to be more effective in achieving their goals. The general problem with this consensus is that there has been limited quantitative analysis of migrants' lived experiences. One step towards addressing this is a better means of data collection and automated tools to assemble anonymous aggregated data that shed some light on any lived experiences. Nevertheless, this is not so straightforward.
Ironically, the over-representation of English on social media platforms becomes problematic. 1 In countries where English is not the dominant language, proficiency in the target language can be used as an indicator (e.g. [9]). This is not necessarily the case in English simply because its use is so widespread. Instead, researchers often focus on the meta-data of user accounts to help classify specific sociodemographic properties of users, for example, geo-tagged location data [10], [11] or other aspects of the user profile [12]. However, the main issue with this type of approach is that user metadata is often unreliable (due to issues such as self-representation where users present a socially stylised view of themselves [13], which can sufficiently distort metadata potentially biasing such approaches) or large amounts of user data are needed to perform classification, which can be time-consuming to gather at scale (i.e. many users to process).
In this paper, we illustrate that telltale linguistic signs can be used to differentiate natives from immigrants in English-speaking countries such as Ireland. We hypothesise that how users express themselves online (i.e., how they use language rather than what they say) can be used to differentiate native Irish from non-natives. We also illustrate that this is achievable without the need for complex machine learning methods, large samples of data, or even user profile metadata. The latter (large data samples or user profile data) rather make models more accurate.
We focus on the case study of the Twitter landscape in Ireland, specifically examining the perspectives of migrant and native populations. We introduce a methodology that can accurately isolate the voice of Irish migrants using as few as 25 English tweets. We then use sentiment analysis, BERT-based topic modeling, and ChatGPT, to segment and summarize diverse perspectives in a way that policymakers can easily comprehend. Our methodology empowers policymakers to better comprehend multicultural communities and utilize this understanding to inform their decision-making.
The contributions of this paper can be summarised as follows:
• Methodology for isolating the voice of non-native populations: We propose a methodology that effectively isolates the voice of Irish migrants on social media platforms, specifically focusing on Twitter. By analysing linguistic patterns in English tweets, we demonstrate that it is possible to differentiate between native Irish users and non-native immigrants without relying on complex machine learning methods or extensive user profile metadata.
• Accurate classification with minimal data: Our approach showcases that as few as 25 English tweets are sufficient for the accurate classification of users as natives or migrants. This highlights the potential for utilising limited data to gain valuable insights into the perspectives and experiences of immigrant populations, enabling a more comprehensive understanding of multicultural communities.
• Sentiment analysis and topic modelling for diverse perspective summarisation: We employ sentiment analysis, BERT-based topic modelling, and ChatGPT to segment and summarise diverse perspectives within the Twitter landscape of Ireland. This allows policymakers to easily comprehend the sentiments and concerns of both native and migrant populations, facilitating more inclusive policy decision-making.
• Empowering policymakers with actionable insights: Our research empowers policymakers to gain a deeper understanding of multicultural communities and their experiences in Ireland. By providing policymakers with accessible and interpretable summaries of diverse perspectives, our work supports the formulation of inclusive policies that address the specific needs and challenges faced by migrants.
Overall, our paper contributes to the field of inclusive policy-making by offering a practical and efficient NLP pipeline, InÉire, which supports inclusive policy-making concerning migrants in Ireland. Through our methodology and findings, we aim to bridge the gap between policymakers and migrant communities, fostering a more inclusive and informed decision-making process.
This paper is structured as follows. In Section II, we present and discuss key aspects of related work to situate our work in the literature. In Section III-B, we discuss (briefly) some key ethical considerations of this work that are necessary given its research objectives and context.
We then present our methodology in Section III. In Section III-A, we discuss (briefly) the dataset used in this paper, which we presented in [12]. In Section III-C, we present our approach to linguistically classifying Twitter users with a view to minimising the amount of user-generated content needed, and using only simple (yet interpretable) machine learning methods. In Section III-D, we describe our method of data segmentation via topic modeling; Section III-E explains how we leverage human-in-the-loop techniques to improve our methodology; Section III-F discusses the use of a summarization tool to generate human-understandable reports; and in Section III-G we describe our data preparation pipeline, which is the first and critical step in deploying data mining techniques in NLP tasks. We discuss how our data preparation pipeline is tailored to the specific task in each step.
88808 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
To further illustrate the utility of our data processing pipeline, we conduct a case study on Irish social media content in Section V, which distills social media content into actionable insights for policymakers. Finally, in Section VI, we conclude the paper with an outlook to future work.

II. RELATED WORK
Whilst concerns over excluding valuable insights (mainly when the target sub-population lacks sufficient internet connectivity [14]) exist, many studies have shown that this concern is rapidly waning. In the context of marginalised groups (which immigrants often fall into), social technologies (which include social media) are often essential [15], [16], [17], acting as a ''safe place'' due to aspects of anonymity and control over the intended audience [18]. The general availability of social technologies has also been high among marginalised groups for some time now [18], [19].

A. SOCIAL MEDIA AS A LENS ON SOCIETY
In studies that leverage social media, we can easily distinguish studies that investigate more large-scale macro-level events, such as political campaigning (e.g. [20]), riots and civil unrest (e.g. [21]), event detection (e.g. [22], [23]), and large-scale news events like COVID-19 (e.g. [24]). Here, researchers often use social media as a lens to understand the perceptions and viewpoints of a subset of society or seek to identify key events that occur during the period(s) of observation. In these cases, it is generally easy to think about how to access and curate a meaningful and large corpus of text content. This is, however, not quite the case for studying immigrants: they are, as a hard-to-reach and arguably marginalised group, hard(er) to locate and study with social media. In fact, when it comes to marginalised groups, there is a general observation that researchers lack robust methodological guidelines [25].
There has been a lot of work (e.g. [26], [27], [28], [29]) studying hate, toxicity, and cyberbullying, which is often directed towards immigrants and marginalised groups via social media. Though useful from a methods perspective, these studies often have a strong content bias towards content stemming from North America and the United Kingdom [30]. [31] outlines how social media can be used to specifically study marginalised groups and discusses a number of the associated challenges. The key here is to ensure that the corpus is representative and not just an echo-chamber of a few (obvious) selected topics.
As discussed in [12] and [31] there are many strategies for curating a meaningful dataset for studying marginalised and immigrant communities. We reuse the dataset collected in [12] for this work to enable an illustrated comparison between an approach that utilises content and metadata from one that leverages linguistic properties of text. In [12], we adopted the approach of assembling a manually curated set of users (as proposed in [32]) with the ''activism'' of a user being the key focus. This enables the study of a diverse range of topics emanating from marginalised communities versus privileged ones, with previous such studies limiting their study around an event, hashtag or group of similar users [33], [34], [35]. However, there is still the challenge of accessing new users. In such cases, we do not necessarily wish to extract a large portion of users' Twitter history for classification purposes. Herein lies the main objective of this work: to classify a user with as little of their user-generated content as possible.

B. KEY WORK IN COMPUTATIONAL LINGUISTICS
The objectives of this article mean that our work is related to studies on Native Language Identification (NLI). The goal of NLI is to detect the native language of an author, given a piece of her/his writing in a foreign language. Most of the research on NLI has been performed on identifying the native language of non-native English writers [36]. Most of the works on NLI use classification algorithms along with a range of features.
In order to identify (or classify) a social media user as a native or non-native, we need to cast the problem as a binary classification problem. There has been significant work on text classification with social media data. [37], [38] provide a good introduction to the general methods of social media analytics, and our work maps well to their general implementation architecture. Twitter has been the platform of choice for many years, and the findings of [39] illustrate this to still be the case 2 with a major emphasis on binary classification tasks and simple off-the-shelf machine learning techniques (i.e. Decision Trees, Support Vector Machines, and Naïve Bayes). We also know from [40] that simple models are often sufficient for a large variety of machine learning tasks. While there is an increased interest in deep learning, it is both more demanding of training data and can obscure the learning process, which limits interpretability.
When looking at methodological guidance, [41] provides a general review of text-based analysis of social media data, specifically with an in-depth review of Twitter analyses. Yet, our work suffers from the challenge of a lack of labelled data; a problem shared by hate speech detection (e.g. [21], [26], [27], [42], [43]), extremist content (e.g. [44]), and other forms of distress-based classification (e.g. [45], [46]) all of which can help inform our approach. Yet, we also have the challenge of small user samples, a challenge shared with the task of dialect classification (e.g. [47]). Hence, we build on our previous work (and data curated) in [12].
Returning to NLI, most approaches have been trained and evaluated using the TOEFL11 dataset [48], which has been the standard dataset for this task. TOEFL11 contains about 13,000 English essays written by English learners with 11 different first languages for the TOEFL test. Recent approaches to NLI have focused on social media content to detect the native language of users. Our work falls into this line of research. Since the proficiency level of English authors on social media sites is much higher than that of the learner datasets (such as TOEFL11) [49], the NLI task using social media is much more challenging than in the learner datasets.
Reference [50] built a dataset containing English tweets written by native speakers of 12 languages other than English. They use this corpus to train a model to predict the native language of users based on their English tweets. Reference [51] attempts to predict the native language of Reddit users using both linguistically-motivated features and the characteristics of the social media outlet. Reference [52] uses BERT (Bidirectional Encoder Representations from Transformers) to detect the native language of Reddit authors. Reference [50] predicts the native language of users based on about 173 tweets on average. References [51] and [52] try to infer the native language of a user based on 100 Reddit posts. Here, it is also useful to try and capture some discussion on data quantity. In this work, we predict whether or not a user is a native speaker of English (or close to it, for example, if they have lived in an English-speaking country for a long time), based solely on 25 tweets and using only linguistic properties of the tweet content. We do so based on the observations of [13], who noted that small amounts of social media content (in this case, 50 words) are sufficient to linguistically profile users even for quite complex machine learning tasks (personality profiling in [13]).
In terms of basic text-mining methodology for classification, our approach is well grounded in the literature following [53]: pre-processing and feature extraction using LIWC [54], [55], followed by a simple machine learning model (based on [40]). We note that our use of LIWC allows us to focus not on what users say, but rather on how they express themselves. This follows on from our previous work [13], [20], where the use of LIWC as a method of feature engineering permitted complex research questions to be addressed with simple machine learning methods. It has also been shown that combining machine learning methods with LIWC improves performance [56]. LIWC identifies portions of text that belong to specific linguistic categories of text used in the tweet. This means our classifier leverages a feature set that sits at the intersection of part-of-speech tagging (e.g. nouns, pronouns, verbs etc.), affect analysis (forms of emotion in the text, beyond just positive and negative) and other linguistically-orientated traits in the text that can be indicative of different characteristics of users (e.g. average word length, use of singular vs. plural pronouns, presence of non-fluencies, use of function words etc.) [57], [58], [59], which have been used to build machine learning models for complex computational social science problems.

III. INÉIRE DATA PROCESSING METHODOLOGY
In this section, we present a data pipeline that can support policymakers in discovering and analysing the topics that migrants and natives in Ireland discuss on social media, which has the potential to inform the policies that affect migrants and other marginalised populations in Ireland. Figure 1 shows an overview of the proposed approach capturing four key steps: 1) classification of social media users with simple and explainable supervised machine learning; 2) content segmentation via sentiment polarity to differentiate between positive and negative content; 3) clustering of content into topics to enable further segmentation of the data into positive vs negative topics of discussion; employing a human in the loop to review topics generated (i.e. remove meaningless topics or topics of no relevance); and finally, 4) we leverage ChatGPT as a summarisation tool to distil each topic into a brief summary of the main themes. In the subsequent subsections, we discuss each step of our methodology in detail, but first, we present the dataset used throughout this work.

A. DATASET
The Twitter dataset [12] used in this study comprises both native (or near to it, for example, if the user has lived in Ireland for a long time) and non-native to Ireland Twitter users. 3 To curate the dataset, we employed a twofold approach. Firstly, we utilised the dataset provided by Younus et al. [12], which comprises tweets from native and non-native users in Ireland. This dataset allowed us to gather a comprehensive collection of tweets related to Irish migrant and native activism. Secondly, we curated a list of Twitter activists from both migrant and native communities in Ireland. The curation process was performed by one of the authors familiar with Ireland's social justice landscape and capable of distinguishing between native and migrant users. Since Ireland is an English-speaking country, we consider Irish users as native English speakers, and the migrants are regarded as non-native English speakers. 4 This yielded a dataset containing (mostly) English tweets, 5 arranged into native and migrant tweets. The following defines the inclusion/differentiation curation strategy of natives and migrants on the Twitter dataset:
• Checking the surnames of the users against Irish surnames. 6
• Reading the biography field of the user and checking for Irish terms in addition to flags of various countries. 7
• Reading the last 20-100 tweets of a user to see whether there is any explicit mention of belonging to any country.
This curation process resulted in a dataset consisting of tweets classified as either native or migrant tweets. We aimed to maintain a balanced representation of 66 native and 66 migrant activists within the dataset. Although the sample size is relatively small, it provides a fair representation of the Irish Twitter landscape, considering the country's size. The dataset encompasses mostly English tweets, with non-English tweets and retweets filtered out to retain original English user-generated content. Additionally, we focused on non-native speakers who have resided in Ireland, ensuring a higher level of English proficiency among the migrant user subset.
To ensure data privacy, we employed standard practices and adhered to ethical guidelines when collecting and processing the data. As an academic research project, we obtained access to the Twitter API for research purposes and extracted the last 3000 tweets from each user in the curated list, resulting in approximately 600,000 tweets from both native and migrant users.
Following standard supervised machine learning practices, we use a holdout strategy to generate three independent training, validation (or tuning) and testing subsets of the data containing 400,000, 100,000, and 100,000 tweets, respectively. Note that this sampling is at the user level instead of the Tweet level, i.e. a user (and their collected Tweets) is placed into one of three sets, i.e., train, validation, or test set. We note here the findings of [60], i.e., that resampling our dataset could yield structurally different results. Nevertheless, as our objective is to (partially) explore the robustness of the approach, it does not induce significant concerns.
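A user-level holdout of this kind can be sketched as follows. This is a minimal illustration on toy data rather than the code used in this work: the function names `split_users` and `partition_tweets`, the seed, and the choice to apply the 4:1:1 ratio to users (rather than to tweet counts) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_users(user_ids, fractions=(4 / 6, 1 / 6, 1 / 6), seed=0):
    """Assign every user to exactly one of train / validation / test.

    The 4:1:1 ratio mirrors the 400,000 / 100,000 / 100,000 tweet split;
    applying it to users instead of tweets is an assumption of this sketch.
    """
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)
    n_train = round(fractions[0] * len(users))
    n_val = round(fractions[1] * len(users))
    return {
        "train": set(users[:n_train]),
        "validation": set(users[n_train:n_train + n_val]),
        "test": set(users[n_train + n_val:]),
    }


def partition_tweets(tweets, assignment):
    """Route each (user_id, text) pair to the subset its author belongs to."""
    user_to_split = {u: name for name, users in assignment.items() for u in users}
    subsets = defaultdict(list)
    for user_id, text in tweets:
        subsets[user_to_split[user_id]].append((user_id, text))
    return dict(subsets)


# Toy data: 12 hypothetical users with 5 tweets each.
tweets = [(f"user{u}", f"tweet {t} of user{u}") for u in range(12) for t in range(5)]
assignment = split_users(u for u, _ in tweets)
splits = partition_tweets(tweets, assignment)
```

Because the assignment is made per user, no author contributes tweets to more than one subset, which avoids the classifier being evaluated on the writing style of users it has already seen.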

B. ETHICAL AND BIAS DELIBERATION
The presence of specific ethical concerns and/or other biases when using social media data has been discussed at length in the literature (for example [14], [38], [61], [62]). The source of many ethical concerns around social media often stems from the ability to gather and therefore analyse large quantities of data [62]. Yet, taken out of context, this data may lose its meaning [63] or be otherwise warped beyond its original communicative intent. This is key, as social media is often leveraged for information of the now, but when used historically (as in this paper), it can be difficult to access representative data [64]. Many studies (e.g. [60], [63], [64], [65]) have discussed this at length. Essentially, the key takeaways here are that data may be missing, and that changing the sample of data used (e.g. due to it being missing) can fundamentally change the findings and the interpretation of the data.
A severe grey area in social media research is the issue of data availability: generally, data is available and easy to access, which gave rise to many controversial studies (like the Cambridge Analytica Scandal). Many problems have stemmed from data being publicly available (on social media platforms) and freely ''browsable'' but where researchers had permission from the platform, though not the users themselves, to process the data. In this sense, there may be ethical tensions in the use of social media content as a means to study potentially vulnerable populations such as migrants. It may also be difficult to predict potential harms from the analyses [31], [63]. A key area of ethical debate in social media research is informed consent. Unlike other research methodologies, social media users are themselves not really ''participants'' in a traditional sense, and are often unaware that their publicly available data is being used. As discussed by [63] and [66], it is rather important that data that would otherwise have been private remains so. This affects the extent to which informed consent of users should be sought in order to comply with data protection laws. This does not mean that data in the public sphere is fair game because a user has ''agreed'' to a set of terms and conditions from the platform. Instead, it is key to ensure anonymity and minimise any potential risk(s) of harm [67], which, in this case, informs the extent to which data can (or even should) be made available following on from any analysis. Thus, in our context, care is needed to ensure that users cannot be spotlighted or identified as a result of this work. With these concerns in mind, we are acutely aware that high levels of caution are needed in studies such as this. We cannot be too invasive into the (sub)population(s) studied, yet we still need to garner meaningful results, albeit at an aggregate level. 
We also note, at this point, that this work has been reviewed by our institutional ethics committees. 8
Related to the ethical deliberation of this work are considerations of potential biases. One example is self-representation: the promotion of an often stylised ''voice'' of the individual catered for the expected audience (see [13] for an overview). Self-representation is akin to a common method bias for social media researchers, and there are not many easy-to-use tools to combat it. Whilst not specific to this work, it would be naïve to not expect some amount of self-representation or social censoring in the sense of not voicing specific concerns publicly. Thus, while we can be confident that there is some value in exploring the migrant voice via social media, it should not be considered as a complete representation of this voice. Similarly, purely content-based analyses need to be considerate of such challenges.
We must also be aware that migration is often a hotly discussed topic. Specifically for studies that seek to leverage social media as a source of data for scientific study, there is an abundance of additional challenges related to the quality and/or veracity of information communicated, i.e. misinformation. This is not necessarily always malicious, however, there are many cases where it is, as shown by [68], [69], [70], [71], and [72]. Taking all these factors into account, we look for telltale linguistic signs (using LIWC as a method of feature engineering) that focus on how a social media user writes as opposed to what they actually say when classifying users into specific categories.

C. USER CLASSIFICATION
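We aim to select a set of $k$ tweets from a user and determine whether this user is a native of Ireland or a migrant. To define the problem formally, assume that we have a set of $k$ tweets from user $i$ as in Equation 1, where $\mathit{Tweet}_{i,j}$ is the $j$th tweet of user $i$.
$$\mathit{TweetSet}_i = \{\mathit{Tweet}_{i,1}, \mathit{Tweet}_{i,2}, \ldots, \mathit{Tweet}_{i,k}\} \tag{1}$$
Each $\mathit{Tweet}_{i,j}$ contains a sequence of tokens as shown in Equation 2, where $t_{i,j,m}$ is the $m$th token of the $j$th tweet of user $i$, and $l$ is the number of tokens in $\mathit{Tweet}_{i,j}$.
$$\mathit{Tweet}_{i,j} = [t_{i,j,1}, t_{i,j,2}, \ldots, t_{i,j,l}] \tag{2}$$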
We aim to map each $\mathit{TweetSet}_i$ to a $\mathit{label} \in \{0, 1\}$, where $\mathit{label} = 1$ indicates that the user is native and $\mathit{label} = 0$ indicates that the user is a migrant. We employ supervised machine learning to predict the label for each user and solve the stated problem. Supervised machine learning algorithms require a reasonable amount of annotated data. For our task, the annotated data are in the form $(\mathit{Tweet}_{i,j}, \mathit{label}_i)$, where $\mathit{Tweet}_{i,j}$ is the $j$th tweet of user $i$, and $\mathit{label}_i \in \{0, 1\}$ shows whether user $i$ is a native or a migrant. We built a dataset containing 600,000 tweets from native and migrant Twitter users in Ireland.
Our proposed method consists of three main components: (1) Feature Extraction: extracting linguistic detail from users' tweets; (2) Classification: training a (simple) machine learning model at the tweet level, i.e., does a specific tweet appear to belong to a native or a migrant; and (3) Majority Voting: each of the k tweets from one user casts one vote, i.e. native or migrant, and the votes are combined by a decision aggregation function. This is summarised in Figure 2.
We use Linguistic Inquiry and Word Count (LIWC) to extract features from the tweet texts. LIWC is a text analysis tool that extracts features from natural language text [54], [59]. The features extracted by LIWC are summary language features (analytical thinking, clout, authenticity, and emotional tone), standard linguistic features (e.g. frequency of pronouns, articles, function words, etc. in the text), word categories tapping psychological constructs (e.g., affect, cognition, biological processes, etc.), personal concern categories (e.g., work, home, leisure activities), informal language markers (assent, fillers, netspeak, swear words) and punctuation categories (periods, commas, etc.) [55]. They capture more of how a user communicates rather than what their communication act or message is.
We use LIWC because we aim to model the writing style of users, and LIWC extracts various linguistic features suitable for this purpose. We used LIWC 2017 and extracted 93 features from each tweet's text in our dataset. The feature extraction component receives $\mathit{Tweet}_{i,j} = [t_{i,j,1}, t_{i,j,2}, \ldots, t_{i,j,l}]$ instances in our dataset as input, and produces a feature vector $\mathit{FeatureVector}(\mathit{Tweet}_{i,j})$ containing the 93 LIWC features.
We cast the problem as a supervised binary classification task and used a decision tree, an interpretable method, for the classification. We opt for a decision tree because it is relatively simple, fast, and highly interpretable. The latter is particularly pertinent as it will also reveal which LIWC categories (features of $\mathit{Tweet}_{i,j}$) are most indicative. The decision tree receives $\mathit{FeatureVector}(\mathit{Tweet}_{i,j})$ as input and predicts $\mathit{label}_{i,j}$ to suggest whether $\mathit{Tweet}_{i,j}$ is written by a native of or a migrant to Ireland. We implemented the decision tree using Python's sklearn library [73]. It is also worth reiterating here that there is no significant effort to use complex machine learning methods; the classification task operates at the tweet level, and we sample k tweets from a user to generate a view of their linguistic profile. This reduces the content biases that can be encoded, since we do not look at content and we use multiple tweets to profile a user.
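The tweet-level classification step can be illustrated with a short sketch. Since LIWC is a proprietary tool, the example below substitutes randomly generated 93-dimensional vectors for the LIWC output and uses a toy labelling rule; the `max_depth` and `random_state` values are likewise assumptions for illustration and are not the settings used in this work.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in for LIWC output: one 93-dimensional feature vector per tweet.
# Real LIWC features would replace this synthetic matrix.
rng = np.random.default_rng(0)
n_tweets, n_features = 400, 93
X = rng.random((n_tweets, n_features))

# Toy labelling rule (assumption): feature 0 alone decides native (1) vs migrant (0).
y = (X[:, 0] > 0.5).astype(int)

# Hypothetical hyperparameters; a shallow tree keeps the model easy to inspect.
clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X, y)

# One predicted label per tweet; these feed the majority vote at the user level.
tweet_labels = clf.predict(X)

# Feature importances indicate which (here synthetic) categories drive the splits.
importances = clf.feature_importances_
top = np.argsort(importances)[::-1][:5]
```

In a real setting, mapping the indices in `top` back to LIWC category names is what provides the interpretability discussed above.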
As mentioned earlier, we aim to get a set of k tweets from a user and determine whether this user is native or not. Since the non-native users might have some tweets written exactly in the same way as natives (for example, typical tweets such as: ''Thanks!'', ''Great!'', ''Congrats'', etc.), we use majority voting on the labels predicted for the k tweets of each user. This affords a degree of robustness against ''outlier'' or unusual tweets a user may have, and similarly, very short or uninformative (from the perspective of classification) tweets.
To this end, we keep the original order of the tweets and segment the dataset into chunks of $k$ tweets. Each chunk contains $k$ consecutive tweets written by the same user, as in Equation 1. Then we predict the label of each $\mathit{Tweet}_{i,j}$ using the decision tree described earlier. Finally, we use majority voting to aggregate the labels of the tweets and predict the final overall label for the user. If most of the $k$ tweets in $\mathit{TweetSet}_i$ had $\mathit{label} = 1$, we consider the final label as native. Otherwise, we consider it as migrant. We concede that there is a potentially important decision (or hyperparameter) here: the extent of the majority. In such cases, it may, in fact, be easier (initially) to just increase the size of $k$. Nevertheless, the size of $k$ compared to the extent of the majority is a valid concern that we can explore empirically.
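The chunking and voting procedure described above can be sketched in a few lines. The helper names `chunk` and `majority_vote`, the `threshold` parameter (which exposes the extent of the majority as a tunable value), and the decision to drop a trailing incomplete chunk are assumptions of this sketch.

```python
def chunk(labels, k):
    """Split one user's tweet-level labels (in original order) into
    consecutive chunks of k; a trailing incomplete chunk is dropped
    (an assumption of this sketch)."""
    return [labels[i:i + k] for i in range(0, len(labels) - k + 1, k)]


def majority_vote(chunk_labels, threshold=0.5):
    """Return 1 (native) if the share of native votes exceeds the
    threshold, otherwise 0 (migrant)."""
    share = sum(chunk_labels) / len(chunk_labels)
    return 1 if share > threshold else 0


# Hypothetical tweet-level predictions for one user (1 = native, 0 = migrant).
predicted = [1] * 15 + [0] * 10 + [0] * 20 + [1] * 5 + [1] * 3
user_votes = [majority_vote(c) for c in chunk(predicted, k=25)]
```

With an odd k such as 25, ties cannot occur at the default threshold; raising `threshold` demands a stronger majority before a chunk is labelled native.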

D. DATA SEGMENTATION VIA SENTIMENT AND TOPIC MODELING
After classifying Twitter users into native and migrant groups, we apply sentiment analysis to their tweet corpora (using VADER [74]). This enables us to differentiate online discourse into different polarities aligned to different (sub)populations and obtain four tweet sets: (1) Natives-Positive, (2) Natives-Negative, (3) Migrants-Positive, and (4) Migrants-Negative. On these tweet corpora, we perform topic modelling to find potentially relevant topics that could be used to inform policymakers.
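As a rough sketch of this segmentation, the snippet below groups tweets into the four sets from a predicted user group and a sentiment score. The function name is hypothetical, and several details are assumptions that the text does not specify: the score is taken to be a VADER-style compound value in [-1, 1], the cut-off of 0.05 is the conventional VADER threshold rather than a value reported here, and neutral tweets are simply dropped.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def segment_tweets(
    records: Iterable[Tuple[str, float, str]],
    threshold: float = 0.05,  # conventional VADER cut-off (assumption)
) -> Dict[str, List[str]]:
    """Group tweets into '<Group>-<Polarity>' sets.

    Each record is (group, compound_score, text), where group is the
    predicted class ('Natives' or 'Migrants') and compound_score is a
    precomputed sentiment score in [-1, 1].
    """
    segments: Dict[str, List[str]] = defaultdict(list)
    for group, compound, text in records:
        if compound >= threshold:
            polarity = "Positive"
        elif compound <= -threshold:
            polarity = "Negative"
        else:
            continue  # neutral tweets are left out of this sketch (assumption)
        segments[f"{group}-{polarity}"].append(text)
    return dict(segments)
```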
There are various techniques to perform topic modelling on social media content, such as tweets. The most popular topic modelling techniques are Latent Dirichlet allocation (LDA) [75] and its variations. LDA is a statistical approach to topic modelling. In LDA, each tweet is considered a probabilistic mixture of topics, and each topic is considered a probabilistic mixture of words. LDA uses the bag-of-words representation of the tweets to find the corresponding topic. BERTopic [76] is a recent approach to topic modelling which leverages BERT embeddings and class-based TF-IDF of the tweets to generate coherent topics. In our preliminary experiments, we examined both LDA (and its variations) and BERTopic. BERTopic produced more coherent and stable topics, motivating its use in this study. Applying BERTopic generates 15 topics for each of our four tweet sets and also identifies the tweets that correspondingly encompass each topic. Thus, we use it more as a means to cluster tweets according to their content to facilitate the exploration of topical discourse among the two user populations. For the next steps and further analysis, we choose the most sensible topics among the 15 topics for each tweet set.

E. HUMAN IN THE LOOP
BERTopic is a powerful unsupervised topic modeling tool that has been shown to outperform other state-of-the-art techniques. However, relying solely on automatic methods can lead to including irrelevant topics, which can result in inaccurate and incomplete results.
To address this issue, we employ a human-in-the-loop (HitL) step to review and filter the topics identified by BERTopic. In this step, we involve domain experts who have a deep understanding of immigration-related issues and can provide valuable insights into which topics are relevant for policymakers. The experts review the topics and investigate the relationships among them, helping to identify the relevant topics that policymakers are most likely to be interested in.
After identifying the relevant topics, we feed the tweets that correspond to these topics into a summarisation tool to generate summaries of the discussions related to the identified topics. These summaries provide a concise and comprehensive overview of the key points discussed in the tweets, allowing policymakers to quickly gain insights into the related issues among migrant and native populations. Whilst for this work we use one of the authors as a domain ''expert'', in a real deployment of the methodology we could replace this with a committee of immigration experts to review and select topics for further processing and consideration in the pipeline.

F. SUMMARIZATION
As mentioned earlier, the ultimate goal of this study is to enable policymakers and other related agencies to further their understanding of native and migrant communities in Ireland. Topic modelling extracts the topics discussed by natives and migrants on social media. At the tweet level, topic modelling determines the topic discussed in each tweet. We aggregate the tweets based on their topics and use the aggregated tweets within a topic to establish what users say about a specific topic.
To produce reports that are comprehensible to policymakers, we use a summarisation tool and summarise the aggregated tweets. Specifically, we employ ChatGPT. ChatGPT is a large pre-trained language model developed by OpenAI which is able to generate human-like responses in natural language conversations. We provide the tweets in each topic as input to the ChatGPT prompt and ask it to summarise the tweets. Given that ChatGPT is restricted by input length, we divide the tweets into segments, each containing approximately 70 tweets, and prompt ChatGPT to summarise each segment. We note that as the API for ChatGPT (and related tools) develops, this sample size can be increased. Similarly, paid variants of the API can also support larger numbers of tweets for summarisation. However, our experience (even with arguably small samples of tweets) has been positive, as we will discuss in Section V.
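The segmentation of a topic's tweets into prompt-sized pieces can be sketched as below. This is illustrative only: the function names are hypothetical, the prompt wording is an assumption (the exact prompt is not given here), and no call to ChatGPT is made; each returned prompt would be submitted separately.

```python
from typing import List, Sequence


def segment_for_summary(tweets: Sequence[str], segment_size: int = 70) -> List[List[str]]:
    """Split a topic's tweets into consecutive segments of about segment_size.

    The last segment may be shorter than segment_size.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [list(tweets[i:i + segment_size]) for i in range(0, len(tweets), segment_size)]


def build_prompt(segment: Sequence[str]) -> str:
    """Build a summarisation prompt for one segment (hypothetical wording)."""
    body = "\n".join(f"- {tweet}" for tweet in segment)
    return "Summarise the following tweets:\n" + body


def prompts_for_topic(tweets: Sequence[str], segment_size: int = 70) -> List[str]:
    """Return one prompt per segment for a given topic."""
    return [build_prompt(seg) for seg in segment_for_summary(tweets, segment_size)]
```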

G. DATA PREPARATION
Data preparation is the first and critical step in deploying data mining techniques in NLP tasks. It has a significant impact on the success of the data mining process. As outlined in this section, our data pipeline consists of four main steps: Classification, Sentiment Analysis, Topic Modelling, and Summarisation. Based on the specific task in each step, we use a different pipeline for data preparation, as illustrated in Figure 3.
First, we exclude retweets, as they are not authored by the users themselves. Then we perform word tokenisation and split the text of the tweets into tokens. Finally, we remove mentions from the tweet texts (as they are typically not going to be useful later in the pipeline).
In the Classification step, we aim to select a set of tweets from a user and determine whether this user is a native of Ireland or a migrant. In this step, we utilise the linguistic patterns in the tweet text to separate natives' and migrants' tweets. Hence, we do not perform any further pre-processing tasks to keep the writing style of the users.
For the Sentiment Analysis step, we remove URLs and Hashtags from tweets since they do not have a direct impact on the sentiment of the tweets. For the topic modeling step, we remove URLs, emojis, smilies, and stop words. We keep the hashtags because they assist topic modeling in detecting the correct topics of the tweets (they are after all used by users to tag the content of their tweet). We remove stop words because they are frequent in the text, and topic modelling techniques may misdiagnose them as the main topics of the text. For the summarisation step, we remove URLs, emojis and smilies. We keep hashtags and stopwords because summarisation techniques require them to comprehend the meaning of the text and summarise it. Figure 3 shows the data preparation pipeline. Table 1 shows the tools we used for performing each of the pre-processing steps.
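The step-specific rules above can be summarised in a short sketch. It is not the tooling listed in Table 1: the regular expressions, the tiny stop-word list, the emoji ranges and the function name `prepare` are all illustrative assumptions, whitespace is normalised for convenience, and retweet exclusion and tokenisation are assumed to have happened beforehand.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji ranges; an approximation, not a complete emoji matcher.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# A few common text smileys such as :) ;-) :D, only when standing alone.
SMILEY_RE = re.compile(r"(?<!\w)[:;=]-?[)(DPp](?!\w)")
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}  # tiny illustrative list

STEPS = {"classification", "sentiment", "topic", "summarisation"}


def prepare(text: str, step: str) -> str:
    """Apply the preparation rules for one pipeline step to a tweet's text."""
    if step not in STEPS:
        raise ValueError(f"unknown step: {step}")
    text = MENTION_RE.sub(" ", text)  # mentions are removed for every step
    if step != "classification":  # classification keeps the writing style intact
        text = URL_RE.sub(" ", text)
    if step == "sentiment":
        text = HASHTAG_RE.sub(" ", text)
    if step in {"topic", "summarisation"}:  # hashtags are kept for both
        text = EMOJI_RE.sub(" ", text)
        text = SMILEY_RE.sub(" ", text)
    tokens = text.split()
    if step == "topic":  # stop words are removed only for topic modelling
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return " ".join(tokens)
```

Keeping the rules in one function makes the differences between steps easy to audit: only the sentiment step drops hashtags, and only the topic step drops stop words.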

IV. CLASSIFICATION PERFORMANCE
We perform a binary classification task that predicts a label (native, non-native) for each tweet. We use four popular evaluation metrics over tweets: Accuracy, Precision, Recall, and $F_1$. The classifier is trained on the training set and evaluated on both the validation and test sets using a standard holdout strategy, employing a train-validation-test split of 400k/100k/100k tweets. The final label for each user is determined based on k consecutive tweets, and in our experiments, we investigate the effect of the value of k (from 1 to 200) on the performance of the classifier. Ideally, we aim for a small value of k to enable confident classification of new users without accessing a large portion of their Twitter history. Figures 4a, 4b, 4c and 4d show the impact of k on the accuracy, precision, recall, and $F_1$ of the classifier on the validation set.
It can be seen that increasing the value of k increases the classifier's performance for all evaluation metrics (accuracy, precision, recall, and $F_1$). All figures show that having more tweets from a user improves classifier performance, which is not surprising. Yet, increasing the value of k from 1 to 25 has the greatest impact; there is arguably only marginal improvement above k = 25, and certainly only slight improvement above k = 50. This yields an interesting finding: with about 25 tweets, we can predict reasonably well (an accuracy of about 80%) whether a user is a native of Ireland or not. Figure 4b shows that for all values of k, the precision of the classifier for the ''native'' class is higher than its precision for the ''non-native'' class. For k > 25 the precision of the ''native'' class exceeds 90%. This shows that when the classifier labels a user as ''native'', it does so with high confidence. Similarly, Figure 4c shows that for all values of k, the recall of the classifier for the ''non-native'' class is higher than for the ''native'' class. For k > 25 the recall of the ''non-native'' class exceeds 90%. This means that the classifier misses very few ''non-natives''. Figure 4d shows that the $F_1$ score for the ''non-native'' class is better than for the ''native'' class, illustrating that the balance between precision and recall is better for this class. This is key for our purposes, specifically with the objective of being able to better understand the migrant view(s) of Ireland. Choosing k = 25 and evaluating the model on the test set yields the results illustrated in Table 2: the model performs well on the unseen test set.
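To make the per-class reading of precision, recall and $F_1$ concrete, the sketch below computes them from true and predicted labels. The function name is hypothetical and the labels in the usage example are invented toy data, not results from this study; they merely reproduce the qualitative pattern of high ''native'' precision alongside high ''non-native'' recall.

```python
from typing import Sequence, Tuple


def per_class_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Invented toy labels: the classifier is conservative about predicting "native".
y_true = ["native"] * 4 + ["non-native"] * 4
y_pred = ["native"] * 2 + ["non-native"] * 6

native_scores = per_class_metrics(y_true, y_pred, "native")
non_native_scores = per_class_metrics(y_true, y_pred, "non-native")
```

In this toy case every ''native'' prediction is correct (precision 1.0) but half of the natives are missed (recall 0.5), whereas every ''non-native'' is found (recall 1.0) at the cost of lower precision.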
In contrast to the results presented in [12], which employed a larger dataset consisting of Twitter data and associated metadata, this study adopts a smaller sample size. Our approach achieves a slightly lower accuracy (80% vs. 85-87%) but a better native precision and migrant recall. Notably, we achieve these improved metrics using significantly less Twitter data and without relying on metadata. This suggests the efficacy of our approach in isolating the migrant voice using an interpretable NLP technique.
We emphasise that our study focuses specifically on the task of isolating the Irish migrant voice in the Twitter landscape, utilising a distinct methodology and emphasising interpretability. The work by Younus et al. [12], which generated the dataset, differs in terms of its objectives and the inclusion of metadata. While [12] provides valuable insights into a broader set of research questions, our work aims to address the specific challenge of supporting inclusive policy-making concerning migrants in Ireland. By leveraging a simpler and more interpretable approach, we aim to distil differentiated views in a consumable manner for policymakers.
As mentioned in Section III-C, we used all 93 features extracted by LIWC for the decision tree. LIWC produces linguistic, thematic and psychological features to describe the text. This allows us to also investigate the importance of each LIWC feature in distinguishing between native and non-native tweets. To this end, we inspect the feature importance of the produced model. The higher the value, the more important the feature is in labelling the tweets. Table 3 shows the importance of different LIWC features.
The 10 most important features in Table 3, along with their descriptions, are shown in Table 4. Interestingly, all of the 10 most important features are standard linguistic features, informal language markers and punctuation categories. Features corresponding to psychological constructs and personal concern categories are not among the important features.
If we further inspect the 10 most important LIWC features, we see some key takeaways of interest. Natives usually write longer tweets, use terms more likely to be found in the LIWC dictionary, and use more function words and prepositions. In contrast, non-natives usually use more netspeak, and also have a higher usage of personal pronouns and punctuation.
This yields an interesting observation that distinguishing between native and non-native English speakers has been performed mostly based on the linguistic features of the tweet text, i.e. how a user uses language is highly correlated to their corresponding class label. This would also suggest that the model is generalisable in the sense that it is relatively content agnostic and also not dependent on potentially (socially) stylised metadata as well as requiring only a small k ≈ 25 tweets.

V. CASE STUDY ON IRISH SOCIAL MEDIA CONTENT
To illustrate the utility of the InÉire data processing pipeline, we conducted a small-scale case study using all components. The case study aims to highlight the practical implications for policymakers, i.e. distil a set of social media content (tweets) into actionable insights: topics, concerns, and issues discussed on social media platforms, in this case, Twitter. We note that the ''findings'' of this case study will be biased towards the sample of users in the test set and, therefore, not necessarily representative of Irish sentiment or discourse online. Thus, caution is needed in interpreting topics presented here as outcomes of this process. We leverage three parts of the InÉire methodology: topic extraction (V-A) combined with human-in-the-loop (V-B), topic filtering and summarisation (V-C).

A. TOPIC MODELING AND EXTRACTION
We use BERTopic to extract the topics that migrants and natives discuss. BERTopic extracted 15 topics from each tweet set (Migrants-Positive, Migrants-Negative, Natives-Positive, Natives-Negative). Table 5 shows the most sensible topics extracted by BERTopic for each subcategory: positive and negative polarity tweets for the Migrants and Natives sub-populations. Note that this categorisation is determined in this instance by the classification model on only the test sample (III-C) and not by the ground truth label in an attempt to be as authentic an orchestration of the methodology as possible. Here, we define ''sensible'' topics as ones that are obviously meaningful in the presence of a little localised domain knowledge. For example, topics with disproportionately large numbers of tweets associated with them or with several ''filler'' or stop words in their generated title would not be considered ''sensible''.

B. HUMAN IN THE LOOP
Following the extraction of topics from the segmented (by type and polarity) corpora, we review topics for content that is potentially thematically relevant. For example, in Table 5, positive topics #1 and #4 for the Migrants are clearly irrelevant to policymakers, yet topics #2 and #3 likely are (Seanad Éireann is the name of the Irish Senate). We can add a little further context of the socio-political climate in Ireland at the time that the data was collected to give meaning to the remaining topics that are ''of interest''.
Regarding the socio-political climate, the war in Ukraine was underway; as such, it is no surprise to see content around this humanitarian crisis and, similarly, with an influx of refugees, content around its impact would also be expected (e.g. asylum seeking, deportation, etc.). Also, Ireland (especially the capital, Dublin) is in the wake of a housing crisis (a lack of affordable housing for purchase and rent), which has been exacerbated by both rises in the cost of living (war in Ukraine, and the associated energy crisis and rise in inflation) and additional pressures due to an increased population in already densely populated major cities (more international students: relaxation of COVID restrictions, potential impact(s) of Brexit, and higher levels of immigrants and asylum seekers). Similarly, school teachers (both primary and secondary) are also currently in short supply. There is also debate around vaccines, especially COVID boosters, which currently have a low uptake in the 35-49 age range. Finally, as a point of reference, the term ''traveller'' refers to an indigenous ethnocultural group in Ireland, often in receipt of stereotypical forms of racism.

C. TOPIC SUMMARIZATION
Once topics have been determined as thematically relevant, we use word clouds to create a simple visualisation of the topics: word clouds capture the relative frequency of terms in the corpus; bigger means more frequent. To illustrate the differences between relevant, irrelevant and potentially relevant topics, we can compare the word clouds displayed in Figure 5.
Here, it is already obvious that the football topic (Figure 5a) is not relevant to policymakers, thus necessitating the human-in-the-loop part of the process. While word clouds are useful for a quick sanity check of a topic, they do not really capture the nature of the discourse beyond the sentiment tag (positive or negative) and a set of frequent words. Thus, to provide more clarity and meaning for policymakers, we leverage ChatGPT as a more sophisticated content summarisation tool. Table 6 shows the summaries generated by ChatGPT for the topics captured in Figure 5. As the number of tweets corresponding to a topic can be quite high, we randomly sample (n = 150) without replacement from the topic cluster. We note the findings of [60] in doing so, i.e. that the sample can dramatically affect the interpretation. Yet, our goal here is not to perfectly summarise the content but rather to give policymakers an insight into social media discourse and specifically highlight or explain potential areas that their policy decisions should consider, specifically around differences between migrant and native populations. In other words, we try to make them aware of what they are (potentially) not aware of in terms of potential new policy considerations.
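The sampling step can be sketched as follows. The function name and the optional seed are assumptions added for reproducibility; returning the whole cluster when it holds no more than n tweets is also an assumption about how small topics would be handled.

```python
import random
from typing import List, Optional, Sequence


def sample_topic_tweets(
    tweets: Sequence[str], n: int = 150, seed: Optional[int] = None
) -> List[str]:
    """Draw n tweets from a topic cluster uniformly at random, without replacement."""
    if len(tweets) <= n:
        return list(tweets)  # small clusters are used in full (assumption)
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(list(tweets), n)
```

Because the sample can change the resulting summary, recording the seed alongside each summary would make it possible to regenerate or compare summaries later.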

D. DETAILED INSIGHTS FROM THE INÉIRE INTERPRETABLE NLP PIPELINE
We can already see some key differences in Table 5. The positive topics are a collection of largely hedonistic and eudaimonic topics, with some politically focused ones as well. Yet, there are quite striking differences in both the positive and negative topics. In terms of positive topics, migrants' focus lies on the day-to-day joys of life in Ireland, particularly evident from positive topics #1, #3 and #6 (from Table 5), whereas natives appreciate the creation of better cycling routes (natives' topic #1 from Table 5) along with Ireland's commitment towards Ukraine and its firm stance against Russia (natives' topic #3 from Table 5). On the other hand, in terms of negative topics by migrants, there is a clear focus on racism with diverse themes being covered, such as a comparison of their experiences to that of Ireland's travellers' community (migrants' topic #1 from Table 5), the life of black people in Ireland (migrants' topic #2 from Table 5), Islamophobia in Ireland (migrants' topic #7 from Table 5) and finally, living conditions of asylum seekers in Ireland (migrants' topic #10 from Table 5). From a policy viewpoint, migrants and natives care about justice and fairness. However, it is observed that natives are more concerned with a foreign sense of equality, while migrant voices focus more on domestic equality. Similarly, migrants discuss lived experiences, and natives tend to discuss more administrative issues that Ireland faces; both views offer insights into what Ireland means to both communities.
We also note migrant positive topic #7 (from Table 5), which is a discussion on the book Hani and Ishu's Guide to Fake Dating by Adiba Jaigirdar, a past pupil of a Dublin secondary school. Specifically, it is a discussion surrounding the LGBT themes of the book in a setting often viewed as taboo within its sociocultural setting. It is nonetheless interesting to see this discussion as more prevalent in the migrant than in the native community, i.e. the book is not receiving the same level of recognition among natives, given its successful delivery of a complex, emotively charged aspect of adolescent life.

E. VALUE ADDITION VIA CHATGPT SUMMARIES
The summaries generated by ChatGPT (despite often being sampled) provide a reasonable summarisation of the topic discourse segmented by subpopulation, i.e. migrants vs natives, and content polarity, i.e. positive vs negative. We can see very clear indications of forms of action that could be taken by Irish policymakers; here, we draw specific attention to the first and third summaries in Table 6, where ChatGPT explicitly (without being prompted to do so, although we note that more precise ''prompt engineering'' could focus it further) highlights potential policy directions that could be taken on the basis of the narrative contained in the tweet topic corpus. For completeness, we also include an irrelevant topic (the second summary) to maintain the argument for the necessity of a human-in-the-loop component of the methodology, and the fourth summary to illustrate that the summary is meaningful, as the war in Ukraine is a well-known topic that can be compared to the other summaries in terms of the type of content that can be generated. We note here too that ChatGPT was (at the time of writing) trained on data that precedes Russia's invasion of Ukraine, and as such, it should not be in a position to supplement the content it has been asked to summarise. Yet, this, in combination with the other examples (in Table 6), illustrates the potential utility of the InÉire methodology for assisting policymakers via summaries that explain the topical discourse.

F. LIMITATION STATEMENT
While our study provides valuable insights into the experiences of migrant communities in Ireland, it is important to acknowledge the limitations inherent in our focus on Twitter data. By solely relying on this particular platform, we recognise that our findings may not be fully representative of the diverse range of migrant experiences in the country. Several key limitations are worth highlighting:
• Selection bias: The use of Twitter as our primary data source may introduce a selection bias, as it only captures the perspectives of individuals who actively engage with the platform. Migrants who do not use Twitter or who are not proficient in English may be underrepresented in our sample, potentially skewing our results towards certain segments of the migrant population.
• Digital divide: Our study assumes that Twitter usage is evenly distributed among migrants in Ireland. However, we acknowledge that variations in digital literacy, access to technology, and socioeconomic factors may contribute to a digital divide, limiting the inclusivity of our sample.
• Linguistic limitations: Our analysis is based on English-language tweets, which may exclude migrants who primarily communicate in languages other than English. This linguistic limitation could restrict the representation of certain cultural groups and impact the generalizability of our findings.
• Geographic focus: Our study specifically focuses on the Twitter landscape in Ireland. While this provides insights into the experiences of migrants within this context, it may not capture the perspectives of migrants residing in other regions or countries.
While these limitations should be taken into account when interpreting our results, we believe that our study still provides valuable insights into the experiences and perspectives of a specific subset of migrants within the Twitter community. Future research endeavours should aim to employ diverse methodologies and data sources to capture a more comprehensive understanding of the broader migrant population in Ireland.

VI. CONCLUSION AND FUTURE WORK
Having isolated this set of users, other more common NLP methods (e.g. topic modelling, content summarisation, sentiment analysis, etc.) can be employed to reveal specific challenges or themes of interest that could be relayed to policymakers. This constitutes our main direction for future work (along with curating a larger user dataset): to exploit the main findings of this paper to derive, in near real time, the thematic issues that Ireland's migrant populations face, in a manner that can be consumed by Irish policymakers and related agencies. As such, we view this work as a series of first steps towards isolating the migrant voice in Ireland via online social media.
This work does have some limitations that also warrant further discussion. First, we (have had to) assume that all meaningful content is expressed in English. This will likely not be the case. Yet, for the purposes of identifying users of different migration status, the assumption is sufficient, provided a user has enough English content (ca. 25 tweets). There is also the assumption that migrant users will comment at all on social media platforms about lived experiences. Here, we note the work of [15], [16], [17], and [18], who discuss that social platforms (like social media) are often essential for marginalised communities. We would also argue that policymakers have very little access to such discourse, and we hope in subsequent iterations of this work to uncover something of value that can help make policies in Ireland more inclusive.
In this paper, we have argued that to understand the views of migrant populations in Ireland, we need a robust methodology to isolate the migrant voice. Twitter and other social media platforms have seen a large increase in online activism, yet it can still be difficult to differentiate between the voice of the activist and the individuals they represent. We have illustrated that it is possible to accurately classify non-natives of Ireland with as few as 25 tweets without diving into the actual content they generate, relying instead on how they use language. We also argue that because this approach operates at the level of linguistic markers in the text, it should be transferable to other social media platforms, with small changes to the manner in which the text is vectorised, i.e. prepared for our voting mechanism.
Future studies should aim to incorporate a wider range of data sources to capture the perspectives of migrant communities who may not be active on Twitter or proficient in English. While Twitter data provides valuable insights, it is crucial to employ complementary research methods that embrace the diversity of migrant experiences. Surveys can offer a broader understanding of the sentiments, attitudes, and lived experiences of migrants, allowing for quantitative analysis and statistical generalisation. Interviews and focus groups can provide a more nuanced exploration of individual stories, allowing researchers to delve into the intricacies of personal experiences and capture the multifaceted aspects of migration. Ethnographic approaches can provide an immersive understanding of migrant communities by observing their daily lives, interactions, and cultural practices. By combining these diverse data sources, researchers can achieve a comprehensive understanding of the challenges, aspirations, and needs of migrant populations in Ireland, ensuring that policy-making initiatives are more inclusive and responsive to their diverse perspectives.