Framework to Study Migration Decisions Using Call Detail Record (CDR) Data

This article addresses the challenges of using call detail record (CDR) data to study migration. Repurposing CDR data for this task has many advantages, including the lower costs of data collection and the potential for contemporaneous analysis. We present a framework for the repurposing and analysis of CDR data. We identify the home location of a subscriber, with corresponding confidence measures, and determine if the subscriber is a definite migrant, likely migrant, likely nonmigrant, or definite nonmigrant. A predictive model then uses mobility and social network features, extracted from the CDR data, to predict the individual decision to migrate. We are the first to address the challenging task of predicting the migration decision at the individual level. We also provide insight into features that can have an impact on the decision to migrate. An in-depth evaluation using CDR data from two provinces in Sri Lanka provides a granular map of migrant inflow and outflow. The success of our prediction model and the insights gained from the evaluation prepare the way for the repurposing of CDR data for social good with a focus on migration.


Viren Dias, Member, IEEE, Lasantha Fernando, Member, IEEE, Yusen Lin, Member, IEEE, Vanessa Frias-Martinez, Member, IEEE, and Louiqa Raschid, Fellow, IEEE

Index Terms: Call detail record (CDR) data, migration patterns, migration prediction.

I. INTRODUCTION
THE worldwide adoption of mobile devices, and human activity within the cyber-physical space, has provided a valuable and ubiquitous stream of digital traces, known as call detail record (CDR) data. In particular, this is a log of the location of a mobile device as recorded by contacts with base transceiver stations (BTSs). There have been many successful applications of repurposing CDR data, for example, in the areas of transportation and urban computing [1]-[3]. More recently, there has been awareness of these resources in the wider research community, in particular among computational social scientists who are interested in repurposing CDR data to study human behavior. Agencies such as the World Bank and UNICEF have also been actively engaged in utilizing CDR data through collaborations with mobile service providers.
Internal migration refers to the migration of individuals from one region to another within the same geopolitical entity, typically within the same country [4]-[6]. In recent years, there has been an increase in the volume, types, and complexity of human internal migration in many countries, mostly due to economic crises, political instability, and various types of natural disasters [6], [7]. Economists have developed econometric models to predict individual internal migration decisions [8], [9]. These models typically rely on census data, which is often unavailable, out of date, or difficult and costly to generate. Recognizing these limitations, there has been research on repurposing ubiquitous data generated in a passive manner, e.g., email, Web, and social media data [10]-[12]. A limitation of these new approaches is that they may suffer from a bias problem. To explain, the demographic and economic backgrounds of email, Web, and social media users may not be representative of the population at large [13].
In an attempt to overcome this limitation of biased data, there has been more recent research to repurpose digital mobility trace data. As mentioned, CDR data are a log of the location of a device as recorded by BTSs within a cellular network. An alternate source of location data is harvested by location intelligence companies (via the mobile application developer SDK) when smartphone-based mobile applications share their location with the GPS network. Of the two sources, CDR data are known to have much higher penetration rates across diverse population groups, and research has shown that CDR data can be representative of the population at large [14], [15]. CDR data have been used to model behaviors during pandemics, such as the H1N1 flu outbreak, and during natural disasters [16], [17]. Research has also successfully demonstrated the use of CDR data to identify migrants and measure the volume and direction of flow of internal migration, for example, in Rwanda and Namibia, respectively [18], [19]. Much of this research has been limited to aggregate analysis (volume and flow) and has not studied an individual's migration decision.
To summarize, census data provide accurate models for migration, but the data may be outdated and are expensive to generate. Email, Web, and social media data, as well as smartphone location data, may be biased. CDR data, in contrast, have been shown to be representative. We present a holistic framework for the repurposing and analysis of CDR data to better understand migration and determine if the subscriber is a migrant. We are the first to develop a prediction model that uses mobility and social network features to predict the individual decision to migrate. To identify the home location of a subscriber, we extend the heuristic approach in [20]. Our extension includes metrics to consider the confidence of a BTS being the home location of a subscriber. Using the home location and corresponding confidence measures, we label a subscriber as a definite migrant, likely migrant, likely nonmigrant, or definite nonmigrant. Our work is the first to determine confidence in migration and provide nuanced labeling of migration status. The confidence feature is important since migration maps need to provide an accurate migrant status that will be trusted by the social science community that studies migration. The current gold standards for studying migration are the (more expensive to produce) census data and self-reported survey data; both have a reputation for providing accurate migration and demographic information [8], [21]. We validate the correctness of our predicted CDR-based results against the gold standard census data.
The confidence measures can also be used toward the identification of a migration window. Under extreme situations, such as natural disasters or climate change, CDR-based migration maps could also be computed continuously and contemporaneously, producing frequent statistics that would be expensive to collect using surveys. To illustrate this, we produce a map of migrant inflow and outflow for two administrative areas in Sri Lanka circa 2013. One, the Western Province, is the most populous and most developed in the country. In contrast, the second area, the Northern Province, emerged from a 30-year civil war in 2009. We identify the areas with high levels of inflow and/or outflow, i.e., churn, and areas with high levels of net gain or net loss of migrants.
We are the first to develop a prediction model for the individual's decision to migrate. Our objective is twofold: 1) to predict whether a subscriber will migrate and 2) to understand the role that behavioral features, including mobility and/or social network relationships, may play in that decision. We build upon previous work exploring the role of spatial dynamics [23] and social networks [24] in migration. Further research showed that migrants rely on social support to deal with their migration experiences [24]. Our model also builds on extensive research on features extracted from CDR data [25], [26]. Of special note, our model identifies a social relationship feature, reflecting the presence of migrants as close contacts in an individual's network, as relevant to the migration decision.
The main contributions of this article are given as follows: 1) an end-to-end framework for CDR-based analysis of migration that can be contemporaneous; 2) an enhanced algorithm to identify the home location of a subscriber, confidence measures, and nuanced labeling of migration status; 3) a novel model to predict the individual decision to migrate; 4) an in-depth evaluation using CDR data from the Western and Northern Provinces of Sri Lanka.

A. Mobility Trace Data and Human Mobility
The ubiquitous presence of mobile devices has generated a valuable stream of digital traces containing location information that has been used to model human activity within the cyber-physical space [57], [60], [61]. For example, researchers have explored the use of location information extracted from social media, e.g., geotagged tweets or Foursquare visits, to predict the next location visited by a person [58]. Researchers have used GPS data collected from mobile applications to predict trip purpose and route choice [64]. CDRs have been used to approximate origin-destination (OD) matrices that characterize aggregate flows between two locations, so as to model travel behaviors, such as commuting patterns [59]. CDR data have also been used to predict evacuation patterns during natural disasters [62].
Some of these tasks, such as identifying the next location visited and inferring trajectories, require real-time processing, which is often done on a mobile device. As a result, novel architectural approaches need to be designed to ensure the efficient functioning of the mobile network [63]. We build on these works to model migration behaviors using location information extracted from CDR data. However, our specific tasks, to predict the home location and to predict migration, do not require real-time data or processing. These tasks can be executed offline, using archival data, and they do not impose real-time performance restrictions on the mobile device or the network.
A range of computational methods has been used for human mobility predictive tasks, as reported in the literature. These include logistic regression (LogRed); random forest or XGBoost [64]; radiation models [65]; Markov models [62]; and deep learning approaches [66]. Typically, the deep learning approaches may show some (often limited) performance advantage over other methods, and this depends on the specific task. We use LogRed in our research for the following reasons: 1) the lack of spatiotemporal data complexity, i.e., predicting the migration decision only requires aggregate features, and this task will not benefit from deep learning and 2) the importance of model transparency. Deep learning approaches do not provide clear insights into the specific features that play a role in a person making a decision to migrate [67]. However, such insights are of utmost importance in the domain of migration research. LogRed is a popular modeling approach that can, indeed, provide such insights.

B. Mobility Trace Data and Migration
There exist two distinct types of internal migration: 1) circular, repetitive, and nonpermanent moves, such as migrant workers who move periodically between cities and rural areas [4] and 2) permanent migration, when individuals remain in their final destinations [5]. Researchers have extensively studied internal migration movements, so as to develop policies that try to prevent migrants from being left behind and allow them to adapt to their new settings. These policies can help enhance work opportunities, lifestyle, and family and social relationships [29]. A comprehensive review of internal migration patterns across 15 countries in Asia, including Sri Lanka, is given in [6] and [30].
A majority of the research uses census data. Recognizing the limitations of relying on census data, there has been research on repurposing ubiquitous data generated in a passive manner, e.g., email, Web, and social media data [10]-[12].
Email service logs have been used to identify international migration rates [10]. Anonymized log data from Yahoo! users have been used to generate short- and medium-term migration flows across countries [11]. Research using Twitter data studied internal and international migrations to determine flow direction and volume [12]. Much of this research has been limited to aggregate analysis, i.e., volume and flow, and has not studied individual migration decisions. A more serious limitation of all these approaches is that they suffer from a bias problem. To explain, the demographic and economic backgrounds of email, Web, and social media users are typically not representative of the population at large [13].
In an attempt to overcome this limitation of biased data, there has been more recent research to repurpose digital mobility trace data. CDR data are a log of the location of a device as recorded by BTSs within a cellular network; they are collected by providers and used for billing purposes. Location data are also harvested by location intelligence companies via the mobile application developer SDK; they are collected when the smartphone-based mobile applications share locations with the GPS network. These data have been widely used for digital marketing. GPS data collection requires smartphone ownership, which is not as common as cell (mobile) phone ownership, especially among lower income groups. GPS data can potentially incur a higher selection bias in comparison to CDR data [14], [15]. More relevant to our research, CDR data can also be used to reconstruct each subscriber's social network. This is typically not possible using mobile application-based location data from the GPS network. Our research will show the benefits of both mobility and social network features in predicting the decision to migrate.

C. Individual Migration Decision Prediction
There exists extensive work on the use of econometric models to predict whether an individual will migrate to a different region and to understand the reasons behind that decision. Related work has studied the role that expected earnings, housing prices, crime, weather, and employment might play in individual migration decisions [8], [9], [21]. A majority of this research relies on the existence of census data or migration-focused surveys. Discrete choice models assess whether an individual would migrate to a given region. For example, LogRed-based classifiers were used to predict whether an individual would migrate to one of the 324 metropolitan areas in the U.S. [27]. They used a set of individual demographic and socioeconomic variables extracted from the individual U.S.-based Public Use Microdata Survey, as well as housing costs, climate, crime, and topography variables.
When census data are available, these models are indeed excellent. However, collecting census data in general, and highly granular individual data in particular, is expensive and out of reach for many resource-constrained countries. Furthermore, available census data may be outdated since more detailed granular data are collected decennially.

D. Home Location for Migration Identification
There exists an important body of work that uses CDR data to determine migration flows, i.e., to approximate the number of people that migrate to a region based on the number of home location changes computed using CDR data. For example, a large dataset of 72 billion CDR data records collected from October 2010 to April 2014 in Namibia was used to determine internal migrant flows within 13 regions [19]. The estimated flows were then compared with census-derived migration statistics to assess the accuracy of using CDR-based home location changes as a proxy for migration flows. CDR data from the 2005-2008 period were used to approximate internal migration flows in Rwanda and assess the role that population characteristics might play in migration [18].
Most of the current studies define the subscriber's home location as the BTS where the subscriber was observed most frequently during a given interval [20], [26], [35], [36]. While such methods have the advantage of simplicity, they can also be prone to error. Our research considers the confidence of a BTS (or a group representing a BTS entity) being the home location of a subscriber. We label a subscriber as a definite migrant, likely migrant, likely nonmigrant, or definite nonmigrant. Our work is the first to determine confidence in migration using CDR data. We posit that confidence in the home location identification is necessary for CDR-based predictive models to be adopted widely in the relevant research community. That community typically uses census data with accurate migration statistics and demographic information that is self-declared via surveys [8], [21].
In addition, our work is the first to show preliminary results toward the identification of a migration window that characterizes the time interval during which the migration has taken place. These insights will be important when assessing migrations during natural disasters, where collecting specific migration dates via surveys is a much harder task [43].

E. CDR-Based Migration Features
CDR data have been widely used to model mobility and social network behaviors that could be indicative of migration intentions [2], [16], [44], [45]. For example, spatial dynamic features, such as entropy or radius of gyration, and social ties features, such as the number of contacts or communication entropy, have been used to characterize postmigration behaviors [25], [26]. CDR-based social network data were used to evaluate the evolution of social ties during the migration processes [34].
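To make the two families of features named above concrete, the following is a minimal sketch of a radius of gyration and a Shannon entropy computed for one subscriber. The function names, the planar (x, y) coordinates, and the one-entry-per-call weighting are illustrative assumptions; the article does not specify the exact formulas used in [25], [26].

```python
import math
from collections import Counter

def radius_of_gyration(points):
    """Root-mean-square distance of visited locations from their centroid.

    `points` is a list of (x, y) positions in a planar projection (assumed),
    one entry per observed call, so frequently used places weigh more.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution over labels.

    Passing BTS identifiers gives a spatial entropy; passing contact
    identifiers gives a communication entropy.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A subscriber whose calls are split evenly between two BTSs has a spatial entropy of one bit, while a subscriber observed at a single BTS has an entropy of zero.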
We extend prior work with novel spatial and social diversity measures to incorporate more nuanced modeling of the relationships. Of special note is that we consider the presence of migrants as contacts in an individual's network.
We note that the features used in this article were at the granularity of individual subscribers. We also computed features that required an aggregation over the complex features of each subscriber in each ego network. This level of granularity and complexity was typically much higher compared to prior research, in particular transportation analysis, where features may be aggregated over each BTS.

III. DATASETS AND METHODOLOGY FOR MIGRATION IDENTIFICATION
We first describe the CDR dataset. We then describe an enhanced algorithm to identify the home location of a subscriber and corresponding confidence measures. We also highlight how the confidence measure can help identify a migration window.

A. Call Detail Records
We use CDR data collected by a telecommunications company from January to September 2013 in the Western and Northern Provinces of Sri Lanka. The first level of administrative decentralization in Sri Lanka is into nine provinces; our data are from two of these provinces, namely, the Western and Northern Provinces. Each province is further subdivided into districts. There are six districts in the Northern Province, three districts in the (most populous) Western Province, and 25 districts overall. A district is further subdivided into divisional secretariat divisions (commonly known as DSDs). There are a total of 331 DSDs in Sri Lanka; the Northern Province contains 34, and the Western Province has 40. We define the migrants within the dataset as those subscribers who have different home locations and different DSDs when considering the first time period of January to March 2013 and the second time period of July to September 2013.
As shown in Table I, the dataset contains approximately five million subscribers in the Western Province and over one million in the Northern Province. We note that the CDRs are granular records at the level of individual subscribers. However, the data are pseudonymized, and we do not have access to any information about the subscriber. This lack of ground truth clearly complicates our tasks of prediction and validation. The CDR data only include voice calls; while the same provider and device are used for a range of other applications, these data were not included. Data cleaning included the removal of duplicate records, subscribers associated with BTSs that were not in the dataset, and so on.
The Spark Graph API was not optimal to handle some computations. Spark API-based computations had to be tuned to use memory efficiently; a similar computation using Hadoop was not memory bound, but there was often a tradeoff with the speed of execution using Hadoop. Computing the extensive set of features used for both home location identification and migration prediction, using the Spark dataset API, took up to two days on a small cluster of five nodes, each with four cores.
The two provinces that are used for the evaluation are very different. The Western Province is the most populous and most developed. In contrast, the Northern Province had emerged from a long civil war that ended in 2009. In the Western Province, the average count of calls per subscriber is 3.50 on working days; this goes down to 3.02 on nonworking days. For subscribers in the Northern Province, the average count of calls, for both working and nonworking days, is higher at 4.08 and 3.97, respectively. Subscribers in the Western Province may be accessing apps more frequently instead of making voice calls. We note that, while we may not be capturing all the activity for these subscribers in the Western Province, we do not believe that this data gap will have a negative impact on the accuracy of the home location algorithm or on the features that impact the decision to migrate. This is, indeed, validated by comparing our results of migrant counts with census-based results for Sri Lanka [30].

B. Base Transceiver Stations
There are 1557 individual BTSs in the Western and Northern Provinces of Sri Lanka. The distribution of BTSs in some parts of the Western Province is very dense. Furthermore, calls are often switched between BTSs in close proximity for load-sharing purposes. For these reasons, we made the decision to associate each BTS with a map display segment. This enabled us to consider a group of closely located BTSs as a single BTS entity when generating features for the purposes of home location identification and migration prediction.
We define a map display segment as a 1 km × 1 km grid cell. The initial placement of the grid was random, and a total of 100 possible grid placements were considered by shifting the initial grid laterally and longitudinally by 100-m increments. The final placement of the grid was determined by minimizing the error term, i.e., the cumulative Euclidean distance between each BTS and the center of the grid cell containing the BTS.
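The grid placement search described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes BTS positions are already projected to planar coordinates in meters, and it interprets the 100 placements as 10 shifts along each axis; the names `placement_error` and `best_placement` are hypothetical.

```python
import math

CELL_M = 1000   # cell side from the text: 1 km
STEP_M = 100    # shift increment from the text: 100 m

def placement_error(bts_xy, dx, dy):
    """Cumulative Euclidean distance from each BTS to the center of its cell."""
    total = 0.0
    for x, y in bts_xy:
        # Index of the cell containing the BTS for a grid shifted by (dx, dy).
        col = math.floor((x - dx) / CELL_M)
        row = math.floor((y - dy) / CELL_M)
        cx = dx + (col + 0.5) * CELL_M
        cy = dy + (row + 0.5) * CELL_M
        total += math.hypot(x - cx, y - cy)
    return total

def best_placement(bts_xy):
    """Try the assumed 10 x 10 = 100 offsets; return the (dx, dy) with least error."""
    offsets = [(i * STEP_M, j * STEP_M) for i in range(10) for j in range(10)]
    return min(offsets, key=lambda o: placement_error(bts_xy, *o))
```

Once an offset is chosen, the same `col` and `row` indices identify the cell, and hence the BTS entity, that each BTS belongs to.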
All BTSs contained within the same grid cell were considered to be a single BTS entity. The location of the BTS entity was defined as the center of the grid cell. The coverage area of the grid cell was the union of the areas defined by the Voronoi tessellation of each individual BTS within the grid cell. As a result of this process, we consolidated the data into 918 BTS entities for the two provinces.

C. Home Location and Migration Identification
To estimate the home location of a subscriber, we used an extended version of the algorithm described in [20] with additional metrics calculated to determine the confidence of our estimation. In particular, for each BTS associated with each subscriber, we computed the metrics defined in Table II. Then, we computed the following two confidence measures.

1) BTS Confidence: The ratio of the count of nights during which calls were made via a given BTS to the cumulative night count over all BTSs.
2) Neighborhood Confidence: The ratio of the count of nights during which calls were made via a given BTS, or any of its neighbors within the BTS entity, to the cumulative night count over all BTSs.
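The two ratios can be illustrated with a short sketch for a single subscriber. The input layout (a set of night identifiers per BTS and a BTS-to-entity map) and the function name are assumptions made for illustration. The sketch also assumes that a night on which both a BTS and one of its neighbors were used is counted once in the neighborhood numerator; the text does not state this explicitly.

```python
def confidences(nights_by_bts, entity_of):
    """Return {bts: (bts_confidence, neighborhood_confidence)} for one subscriber.

    nights_by_bts: {bts_id: set of night identifiers with at least one call}
    entity_of:     {bts_id: BTS-entity id, i.e., the grid cell of the BTS}
    """
    # Cumulative night count over all BTSs (the shared denominator).
    total = sum(len(nights) for nights in nights_by_bts.values())
    result = {}
    for bts, nights in nights_by_bts.items():
        # Nights seen at this BTS or at any neighbor in the same BTS entity.
        hood = set()
        for other, other_nights in nights_by_bts.items():
            if entity_of[other] == entity_of[bts]:
                hood |= other_nights
        result[bts] = (len(nights) / total, len(hood) / total)
    return result
```

By construction, the neighborhood confidence of a BTS is never lower than its BTS confidence, and the two coincide when the BTS is alone in its grid cell.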
Definitions are given in Table II. We filtered the data to exclude BTSs as a potential home location for each subscriber based on the filtering criteria in Table II. The lower bound thresholds for BTS exclusion were chosen after a careful examination of each of the distributions to eliminate the long tail of BTSs that were not frequented by some subscribers. We verified that, after this exclusion, over 90%-95% of the data across all subscribers was still utilized. A partial sensitivity analysis was also conducted to ensure that the final BTS selected as the home location was not impacted by the choice of the threshold.
The heuristic for home location identification in [20] only considered the top-one ranked BTS. Our enhancement is to consider the top-three BTSs, ranked by confidence, in determining if a migration (nonmigration) is definite or likely. We ranked all candidate BTSs by neighborhood confidence and then by BTS confidence. We used a standard ranking with ties, where we skipped the relevant count of tied positions in the ranking; this is informally referred to as a "1-2-2-4" ranking. In the event of a tie for the top-one rank, we first checked if the tied BTSs were neighbors within a BTS entity. If true, a BTS from the BTS entity was selected as the home location. If false, the subscriber was not further included in our experiments since we were unable to determine a home location for that subscriber. The next step, also unique to our approach, compared the top-three BTSs across the two time periods and labeled a subscriber's home location as follows using the decision criteria in Table III: 1) definitely changed; 2) likely changed; 3) likely unchanged; and 4) definitely unchanged. For example, all of the top-three BTSs have to be identical, and in the same order, in both time periods to label the home location as definitely unchanged and the subscriber as a definite nonmigrant.
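The "1-2-2-4" ranking with ties, and the one labeling rule that is spelled out in the text, can be sketched as below. The function names and the input layout are illustrative assumptions; the remaining decision criteria of Table III are not reproduced here because they are not given in the text.

```python
def competition_rank(candidates):
    """Rank candidates by (neighborhood, BTS) confidence, "1-2-2-4" style.

    candidates: {bts_id: (neighborhood_confidence, bts_confidence)}
    Returns {bts_id: rank}; tied candidates share a rank and the positions
    they occupy are skipped for the next distinct candidate.
    """
    ordered = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_key, prev_rank = None, 0
    for position, (bts, key) in enumerate(ordered, start=1):
        rank = prev_rank if key == prev_key else position
        ranks[bts] = rank
        prev_key, prev_rank = key, rank
    return ranks

def is_definitely_unchanged(top3_first, top3_second):
    """Rule stated in the text: same top-three BTSs, in the same order."""
    return len(top3_first) == 3 and list(top3_first) == list(top3_second)
```

Sorting on the tuple compares neighborhood confidence first and falls back to BTS confidence only to break ties, which matches the ordering described above.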
We note that a home location change might represent a move of a short distance within a given city; this is considered to be a relocation but not a migration. Thus, to disentangle relocation behavior from actual migration, we only considered migrants whose home location has changed from one DSD to another between the two time periods. To determine the DSD for a subscriber, we assign a BTS to a DSD such that the BTS location (center of the grid cell) lies within a DSD boundary.
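The assignment of a BTS entity to a DSD is a point-in-polygon test on the grid cell center. The sketch below uses a standard ray-casting test on simple polygons in planar coordinates; the boundary representation (a list of vertices per DSD) and the function names are assumptions, and a production pipeline would more likely rely on a GIS library.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast to the right of (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def assign_dsd(center, dsd_polygons):
    """Return the id of the first DSD whose boundary contains the cell center."""
    for dsd_id, polygon in dsd_polygons.items():
        if point_in_polygon(center[0], center[1], polygon):
            return dsd_id
    return None

def is_migration(home_first, home_second, dsd_polygons):
    """True only if the two home locations fall in two different DSDs."""
    a = assign_dsd(home_first, dsd_polygons)
    b = assign_dsd(home_second, dsd_polygons)
    return a is not None and b is not None and a != b
```

A home location change that stays within one DSD returns `False` here, which corresponds to the relocation case described above.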
We then considered whether this change in home location resulted in a change in DSD as well and labeled each subscriber as follows: 1) definite migrant; 2) likely migrant; 3) likely nonmigrant; and 4) definite nonmigrant. The descriptions are outlined in Table IV.

D. Results and Validation
We computed the home locations for all subscribers in the Western and Northern Provinces of Sri Lanka for the two time periods: January to March 2013 and July to September 2013. Figs. 1 and 2 provide plots of the neighborhood confidence of the two most prominent BTSs for four sample subscribers. As we can observe, the neighborhood confidence plots in Fig. 1 clearly identify that the home location has shifted, from solid to dashed, for both subscribers. More importantly, such distributions also allow us to pinpoint the actual window when the home location has shifted, i.e., the actual window of migration. In contrast, the confidence distributions in Fig. 2 do not clearly identify if there is a shift in the home location for the corresponding two subscribers.
Tables V and VI show the counts for the number of subscribers in the Western and Northern Provinces broken down by a change in home location and DSD, and migration class, respectively. The percentages of definite migrants identified in the Western and Northern Provinces are 1.62% and 2.84%, respectively. When considering all of the CDR data (not just the two provinces reported in this article), the definite migrant rate is 1.25%, and the likely migrant rate is 1.99%.
To validate our results, we compare our statistics with the closest census data statistics for Sri Lanka. First, we summarize some known limitations. We note that the cell phone penetration rate in Sri Lanka is reported as ≈60% circa 2019 [38]. We, therefore, expect that some fraction of migrations will be undetected due to the lack of cell phone ownership. There is also potential bias in our sample; cell phone subscribers that use the device consistently to provide sufficient history may also be associated with a higher socioeconomic status. We further note that we are not differentiating long- and short-term migrations.
Our validation is with respect to census data from 2008, which reports internal migration rates of ≈1.8% at the country level [37]. The aggregate crude migration intensity (ACMI), the percentage of permanent changes of address, for the five years prior to 2012, is reported as 8% across all of Sri Lanka [6], [30]. This translates to an average yearly value of ≈1.6%. Overall, our estimates from the CDR data (of 1.99%) and the statistics reported from the census data counts (between 1.6% and 1.8%) appear to be reasonably consistent.

E. Migration Observations in Two Provinces
Fig. 3 shows a map of net migration, as a proportion of the population at the DSD level of granularity, for 1) the Western Province and 2) the Northern Province. Fig. 3 and Table VII highlight different migration patterns in the Western and Northern Provinces. The Western Province has 40 DSDs; approximately half experience low churn and half medium churn. In contrast, for the 34 DSDs of the Northern Province, approximately half experience medium churn and half high churn. The net migration loss and gain show similar differences. Seven DSDs in the Northern Province experience high net gain, while 11 experience high net loss. In contrast, the net gain or net loss in the Western Province is limited to medium or low values. To summarize, the Northern Province is characterized by high or medium churn, as well as high and medium net loss or gain in migration. In contrast, the Western Province is characterized by medium or low churn and medium or low net loss or gain in migration.
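The per-DSD quantities discussed above can be computed from a list of origin-destination pairs, one per identified migrant. This sketch assumes that churn is the sum of inflow and outflow and that all values are normalized by the DSD's subscriber count; the thresholds that separate low, medium, and high levels in Table VII are not given in the text and are therefore not modeled.

```python
from collections import Counter

def dsd_flows(moves, population):
    """Per-DSD inflow, outflow, churn, and net migration as proportions.

    moves:      list of (origin_dsd, destination_dsd), one per migrant
    population: {dsd: subscriber count used as the denominator}
    """
    inflow = Counter(dst for _, dst in moves)
    outflow = Counter(src for src, _ in moves)
    flows = {}
    for dsd, pop in population.items():
        i, o = inflow[dsd], outflow[dsd]
        flows[dsd] = {
            "inflow": i / pop,
            "outflow": o / pop,
            "churn": (i + o) / pop,   # assumed definition: inflow plus outflow
            "net": (i - o) / pop,     # positive = net gain, negative = net loss
        }
    return flows
```

A DSD can show high churn with a net value near zero when inflow and outflow are both large and roughly balanced, which is why the two measures are reported separately.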
To provide some context for these observations, we note that these two provinces experience the highest levels of both churn and net migration in Sri Lanka. The Western Province is the most populous and developed in Sri Lanka. While this can attract migrants, the high cost of living and other issues can dampen enthusiasm. The Northern Province represents a very different scenario. In 2013, at the time of data collection, the Northern Province was about four years out of a three-decade-long civil war and was experiencing significant socioeconomic changes. These contextual observations could provide some intuition into the drivers for the different migration patterns.
To summarize, our evaluation was specific to a single country, but it spanned two very different provinces based on demographic and socioeconomic metrics. We also validated our evaluation against the closest census data statistics for Sri Lanka. We expect that our methodology to determine the confidence in the home location and label the migration status will have similar performance accuracy across other datasets; however, the actual mobility behavior will vary based on the region, country, and local conditions. Some limitations of the approach are summarized under conclusions and future work.

IV. INDIVIDUAL MIGRATION PREDICTION
Our objective is twofold: 1) to predict whether a subscriber will migrate and 2) to understand the role that behavioral features including mobility and/or social network relationships may play in predicting that migration decision.
We frame the migration identification problem as a classification task. Recall that we define the internal migrants within the dataset as those subscribers who have changed home locations and changed DSDs when considering the first time period of January-March 2013 and the second time period of July-September 2013. We use the set of features from the first time period to predict whether a subscriber will migrate between the first and second time periods.
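The classification setup described above can be sketched as follows: the label depends on the home locations in both periods, while the features come from the first period only. The function and argument names are hypothetical, and the sketch ignores the confidence-based distinction between definite and likely migrants.

```python
def migration_label(home_bts_p1, home_bts_p2, bts_to_dsd):
    """Return 1 if the home location and its DSD both changed, else 0."""
    if home_bts_p1 == home_bts_p2:
        return 0
    return int(bts_to_dsd[home_bts_p1] != bts_to_dsd[home_bts_p2])


def build_examples(features_p1, homes_p1, homes_p2, bts_to_dsd):
    """Pair first-period features with the migration label.

    features_p1: subscriber -> feature vector from the first period.
    homes_p1, homes_p2: subscriber -> home BTS in each period.
    Subscribers without a home location in both periods are skipped.
    """
    examples = []
    for subscriber, features in features_p1.items():
        if subscriber in homes_p1 and subscriber in homes_p2:
            label = migration_label(
                homes_p1[subscriber], homes_p2[subscriber], bts_to_dsd
            )
            examples.append((features, label))
    return examples
```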
While classification tasks are ubiquitous, the prediction of individual human decisions, such as migration, is both novel and challenging. The difficulty increases as we consider feature extraction from noisy big data, such as CDR data. There has been prior research on migration prediction, but it has relied on accurate survey and demographic features [27]. The closest prediction research using CDR data predicts BTS locations that a subscriber may have visited, in order to uncover trips made by the subscriber that are hidden in the CDR data [3]; we note that such trip completion predictions are simpler since they consider a limited subset of the data. There is comparable research on predicting human decisions using noisy social media data. For example, migration across social media sites is estimated using features extracted from these sites [39]. This research does not predict at the individual level. Individual-level prediction models are presented in [40] and [41]. User behavior in creating a new social media post is studied in [40], and linking to a post is studied in [41].
To summarize, predicting individual human decisions, such as migration, is challenging. Prior prediction research has relied on accurate survey and demographic data. Our research is the first to predict these decisions using noisy CDR data. We note that an accurate, continuous, and contemporaneous prediction would allow planners to stage interventions, offer customized services, and so on to mitigate the costs and stresses associated with migration.
A variety of methods have been explored in the research in [3], [40], and [41], including random forest, ranked support vector machines, logistic regression (LogReg), and deep learning using CNNs and RNNs. A deep learning approach slightly outperformed other methods for the task of uncovering hidden trips [3]. However, a deep learning approach only has a performance advantage when the CDR features/dataset is disaggregated at the granularity of a single trip or a single day, and when the data are noisy and sparse, as in the case of uncovering hidden trips. For our task of migration prediction, however, we aggregate features over the entire training period and provide a single count or value of that feature for each subscriber. In this setting, a deep learning approach offers at most a marginal performance advantage.
On the other hand, a significant shortcoming of deep learning is the difficulty of interpreting the models. For our migration problem, social scientists have a strict requirement that they must understand the models driving human migration patterns. Thus, in our setting, the marginal potential performance advantage of prediction accuracy from deep learning is clearly offset by the difficulty of interpreting such models. With that in mind, we chose to address our task using a simpler LogReg model that is straightforward to interpret. We compare the results with XGBoost [33], which has been shown to be accurate and efficient across many diverse datasets.

A. Prediction Model Features
We describe the set of features that will be used as predictors of whether a subscriber will migrate. We build upon features from the literature that characterize human behavior. Table VIII shows the set of features grouped into three classes.
1) Calling Patterns: These features are straightforward summary statistics, including the count of calls, duration of calls, count of distinct social contacts, and so on. For a given ego subscriber, we define a social contact as another subscriber with whom the ego has had at least one incoming call and one outgoing call.
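The social contact definition above requires calls in both directions. A minimal sketch, assuming call records are available as (caller, callee, duration) tuples; this record layout and the function names are hypothetical and are not taken from the dataset.

```python
def social_contacts(ego, calls):
    """Return subscribers with at least one incoming and one outgoing call."""
    outgoing = {callee for caller, callee, _ in calls if caller == ego}
    incoming = {caller for caller, callee, _ in calls if callee == ego}
    return outgoing & incoming


def calling_summary(ego, calls):
    """Return simple calling-pattern statistics for one ego subscriber."""
    own_calls = [c for c in calls if ego in (c[0], c[1])]
    return {
        "call_count": len(own_calls),
        "call_duration": sum(duration for _, _, duration in own_calls),
        "contact_count": len(social_contacts(ego, calls)),
    }
```

Under this definition, a subscriber who only ever receives calls from the ego, or only ever calls the ego, is not counted as a social contact.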
2) Social and Spatial Diversities: We extend previous research and capture the behavioral diversity of a subscriber. To do so, we consider social and spatial diversity measures based on the Shannon entropy as defined by [32]. The measures are defined as follows.
1) Social Diversity (social_entropy): This is defined by considering the diversity in the total call volume occurring between a subscriber and the members of their social network. For a subscriber i with K contacts, the value is the Shannon entropy, as defined by [32], of the proportions

$$p_{ik} = \frac{V_{ik}}{\sum_{k'=1}^{K} V_{ik'}}$$

where $V_{ik}$ is the total call volume between subscribers i and k.

2) Spatial Diversity (spatial_entropy): This is defined by considering the diversity in the total call duration to each BTS within the ego social network of a subscriber. For a subscriber i who has contacts in M distinct BTS areas, the value is the Shannon entropy, as defined by [32], of the proportions

$$p_{im} = \frac{D_{im}}{\sum_{m'=1}^{M} D_{im'}}$$

where $D_{im}$ is the total call duration for subscriber i to BTS area m.

The preceding features capture diversity across contacts and their associated BTSs. However, they do not consider where those contacts are located relative to the subscriber. For example, an individual subscriber could have a higher diversity with local contacts who share the same home location, or they may have higher diversity with contacts who live in other DSDs or other provinces. To account for this, we propose the following three novel diversity measures:

1) Home Location-Based Spatial Diversity (hl_spatial_entropy): We determine the home location of the alter contacts of the ego node. We consider the time spent by an ego node communicating with a contact who lives in a given BTS area to measure hl_spatial_entropy. For a subscriber i who has contacts with home locations in A distinct BTS areas, the value is computed in the same manner from the proportions

$$p_{ia} = \frac{D_{ia}}{\sum_{a'=1}^{A} D_{ia'}}$$

where $D_{ia}$ is the total call duration for subscriber i to BTS area a.

2) Province-Based Spatial Diversity (prov_spatial_entropy): The definition is the same as for spatial diversity; however, only the BTS areas within the home location province of the ego subscriber are considered.

3) Long Distance Spatial Diversity (long_spatial_entropy): The definition is the same as for spatial diversity; however, only the BTS areas outside the home location province of the ego subscriber are considered.

3) Social Relationships: Prior research on predicting individual migration used census and survey data but did not consider social characteristics [8], [9], [21], [27]. Social relationship features, such as the count of contacts, were used to characterize postmigration behavior [25], [26]. CDR-based social network features were used to study the evolution of social ties during the migration process [34]. Our work is the first to use social relationship features for migration prediction. In particular, we consider the presence of migrants as contacts in a subscriber's network.

TABLE VIII FEATURES USED TO PREDICT MIGRATION
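The entropy-based diversity measures described in this subsection share one computation and differ only in which totals are supplied. The sketch below illustrates that idea under stated assumptions: the helper names are hypothetical, the natural logarithm is used, and the optional normalization by the logarithm of the number of categories is an assumption, not a choice confirmed by the text.

```python
import math


def shannon_entropy(totals, normalize=False):
    """Shannon entropy of the proportions implied by non-negative totals.

    totals: call volumes (social) or call durations (spatial) per category.
    normalize: if True, divide by log(number of non-empty categories).
    """
    positive = [t for t in totals if t > 0]
    if len(positive) <= 1:
        return 0.0
    total = sum(positive)
    entropy = -sum((t / total) * math.log(t / total) for t in positive)
    if normalize:
        entropy /= math.log(len(positive))
    return entropy


def province_split_entropies(duration_by_area, area_province, home_province):
    """Return (prov_spatial_entropy, long_spatial_entropy) for one subscriber.

    duration_by_area: BTS area -> total call duration for the subscriber.
    area_province: BTS area -> province containing that area.
    """
    inside = [d for a, d in duration_by_area.items()
              if area_province[a] == home_province]
    outside = [d for a, d in duration_by_area.items()
               if area_province[a] != home_province]
    return shannon_entropy(inside), shannon_entropy(outside)
```

Passing call volumes per contact gives a social_entropy-style value, while passing call durations per BTS area gives a spatial_entropy-style value; hl_spatial_entropy would use durations keyed by the home BTS area of each contact.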

B. Prediction Model Results
We provide some details about the experiment settings as follows.
1) We characterize each definite migrant, likely migrant, likely nonmigrant, and definite nonmigrant with all the behavioral features described above.
2) We use collinearity tests to eliminate highly correlated features (with a correlation estimate > 0.7).
3) We use tenfold cross-validation to train and test the migration classification model.
4) We report on the area under the ROC curve (AUC) metric, averaged across the ten folds, for the following subsets.
a) The entire dataset of definite migrants, likely migrants, likely nonmigrants, and definite nonmigrants.
b) A restricted dataset of only definite migrants and definite nonmigrants.
c) A balanced dataset of equal numbers of definite migrants and definite nonmigrants, generated by randomly downsampling the larger number of nonmigrants and repeated 50 times. Note that the AUC scores (averaged across all 50 iterations) did not show an improvement over the restricted dataset.
5) We also report on the AUC scores at different migration distance thresholds, i.e., models trained considering only migrants that migrate a minimum distance of 5, 10, and 15 km.
Table IX shows the performance of the LogReg model and XGBoost as measured by the AUC. The values in parentheses following the AUC are the population counts. Considering the entire dataset and LogReg, the AUCs are 0.70 and 0.73 for the Western and Northern Provinces, respectively; the AUCs for XGBoost are 0.73 and 0.76, respectively. These values improve when we consider larger migration distance thresholds. For distances greater than 15 km, the AUC for LogReg can go up to 0.82 and 0.80 for the Western and Northern Provinces, respectively, and 0.86 and 0.83 for XGBoost. The trend is that the AUC increases with longer migration distances, i.e., the predictive algorithm appears to be more robust at detecting long-distance home location changes.
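Two of the experiment settings above, the collinearity filter and the AUC metric, can be sketched in plain Python. This is an illustration under assumptions and not the exact procedure used here: the filter is a greedy pass using the absolute Pearson correlation, the feature names call_count and call_duration in the usage note are hypothetical, and the AUC is computed from its rank interpretation on a single set of scores rather than averaged over folds.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def drop_collinear(features, threshold=0.7):
    """Keep a feature only if |r| <= threshold with every feature kept so far.

    features: feature name -> list of values, in order of preference.
    """
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


def auc(scores, labels):
    """Probability that a random positive is scored above a random negative.

    Ties count as one half. labels are 1 (migrant) or 0 (nonmigrant).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))
```

For example, if call_count and call_duration are nearly proportional across subscribers, drop_collinear keeps only the one listed first, so the order of the input dictionary encodes which of two correlated features is preferred.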
The prediction model for the restricted dataset with only definite migrants and definite nonmigrants shows better results, as expected. We observe that the LogReg AUC increases to 0.77 and 0.80 (from 0.70 and 0.73) for the Western and Northern Provinces, respectively. The XGBoost AUC increases to 0.80 and 0.83 (from 0.73 and 0.76). The downsampling experiments to create a balanced dataset for the count of definite migrants and definite nonmigrants did not produce any significant improvement for AUC; we do not report on AUC for the balanced dataset in Table IX. We note that the downsampling does have an impact on the significance of features in prediction, as discussed in Section IV-C. To summarize, both LogReg and XGBoost perform well, with a slight advantage for XGBoost. All trends are consistent across both methods. The accuracy is slightly higher for the following across both methods: 1) the restricted dataset of definite migrants and nonmigrants; 2) the Northern Province; and 3) long-distance home location changes. There is no performance improvement for the balanced dataset.

C. Feature Analysis
We report on the significance of features across the various experimental settings for LogReg. To do so, we report on the odds ratio (OR) for each feature and setting. In LogReg, the OR represents the effect of a predictor variable on the likelihood that the outcome will occur. In our migration prediction setting, the OR value allows us to explore the effect of calling patterns, social and spatial diversities, and social relationships on the individual migration decision. Features with OR values greater than one have a positive impact on the migration decision, OR values less than one indicate a negative impact, and OR values close to one indicate little to no impact. OR values can also be interpreted in terms of unit increases: a one-unit increase in a given feature multiplies the odds of a migration decision by the OR value. We note that interpreting the results is not as straightforward for XGBoost, since feature contributions have to be aggregated and normalized across an ensemble of boosted trees.
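The relationship between a logistic regression coefficient, its OR, and the resulting change in probability can be made concrete with a short worked example. The function names are hypothetical, and the baseline probability of 0.10 in the usage note is an illustrative value, not a figure from the evaluation.

```python
import math


def odds_ratio(coefficient):
    """OR of a logistic regression feature is exp(coefficient)."""
    return math.exp(coefficient)


def updated_probability(baseline_probability, or_value, delta=1.0):
    """Probability after raising one feature by delta units.

    All other features are held fixed, so the odds are multiplied by
    or_value ** delta.
    """
    odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = odds * or_value ** delta
    return new_odds / (1.0 + new_odds)
```

With a baseline migration probability of 0.10 and an OR of 1.5, a one-unit increase in the feature raises the odds from 1/9 to 1/6, so updated_probability(0.10, 1.5) is 1/7, or about 0.143. The OR scales the odds, not the probability itself.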
OR values for selected features across experiment subsets are reported in Table X. Recall that this includes the entire dataset, a restricted dataset of only definite migrants and definite nonmigrants, and a subsampled dataset of equal counts of definite migrants and definite nonmigrants. In addition, we use the migration distance thresholds of 5, 10, and 15 km.
1) Calling Patterns: Several calling patterns showed a significant positive impact on the likelihood of being a migrant, i.e., the OR values were greater than one. The contact_distance is the distance between the home BTSs of a subscriber and their contacts. Table X shows that, for a unit increase in the average contact distance, the OR value across settings takes values in a range from 1.225 to 1.782. Thus, subscribers with contacts living further away have an increased probability of migration. The OR values for this feature are greater than one for the entire, restricted, and balanced datasets; the positive impact also holds across all of the distance thresholds.
The contact rate is the count of calls exchanged with a contact. The OR value for contact_rate is greater than one and takes values in the range of 1.109-1.262 across the settings, i.e., those who have more intense interactions with contacts in their social network may be better positioned to migrate. We also observe that the value of the OR increases as the distance threshold increases and when moving from the entire dataset to a balanced dataset. This suggests that more intense interactions with contacts have a significant positive impact, in particular when considering longer migration distances.
The radius of gyration is the distance between a home BTS and any visited BTS, weighted by visit frequency. Table X shows that a unit increase in the value leads to OR values in the range of 1.142-1.408. These numbers reflect that subscribers exhibiting greater mobility might be more likely to make a migration decision.
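A frequency-weighted radius of gyration about the home BTS can be sketched as follows. The text gives only the verbal definition, so the root-mean-square form, the haversine distance, and the function names here are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def radius_of_gyration_km(home, visits):
    """Root-mean-square distance from home, weighted by visit counts.

    home: (lat, lon) of the home BTS.
    visits: list of ((lat, lon), visit_count) for each visited BTS.
    """
    total = sum(count for _, count in visits)
    if total == 0:
        return 0.0
    weighted = sum(count * haversine_km(home, location) ** 2
                   for location, count in visits)
    return math.sqrt(weighted / total)
```

A subscriber whose activity is confined to the home BTS has a radius of zero, while frequent visits to distant BTSs increase the value.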
2) Social and Spatial Diversities: Two of the three novel entropy measures proposed have a small but significant effect on the migration outcome, i.e., the OR values for the entire dataset are slightly above one. Small increases in the OR value are observed for the Western Province for distance thresholds of greater than 5 km. Specifically, for each unit increase in the home-location-based hl_spatial_entropy, the OR values are in the range of 1.071-1.119. Hence, communications with subscribers who have diverse home locations are associated with an increase in migration. A similar outcome is observed for the other novel feature, the province-based prov_spatial_entropy; a unit increase results in OR values in the range of 1.081-1.122.
3) Social Relationships: A very significant feature is network_5_migrant_count. As explained in Table VIII, this feature measures the count of contacts who have already migrated and who remain in close contact. Close contact was defined as exceeding five weekly calls between the subscriber and the contact. Fig. 4 provides the distribution of the count for this feature, for migrants and nonmigrants, for the two provinces. As can be observed, for both provinces, the median count of this feature for migrants exceeds that for nonmigrants. Furthermore, there is less variance in the count for migrants. The median count for the Northern Province is higher, both for migrants and nonmigrants, in comparison with the Western Province. The high OR values for this feature reflect its strong positive impact on a subscriber's migration decision; the impact holds across migration distance thresholds and provinces. In the Northern Province, for each unit increase in this count, the OR values are in the range of 1.303-1.601. For the Western Province, this feature has OR values in the range of 1.199-1.347. This result reflects that subscribers who maintain strong connections with other migrants are more likely to make the individual decision to migrate.
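A feature in the spirit of network_5_migrant_count can be sketched as a filtered count over a subscriber's contacts. The input layout, the function name, and the use of an average weekly call rate are assumptions; the text specifies only that close contact means exceeding five weekly calls.

```python
def migrant_contact_count(weekly_calls, migrants, min_weekly_calls=5):
    """Count contacts who are migrants and remain in close contact.

    weekly_calls: contact -> average weekly calls exchanged with the ego.
    migrants: set of subscribers already labeled as migrants.
    A contact qualifies only if the rate strictly exceeds min_weekly_calls.
    """
    return sum(
        1
        for contact, rate in weekly_calls.items()
        if contact in migrants and rate > min_weekly_calls
    )
```

The strict inequality mirrors the wording "exceeding five weekly calls", so a contact with exactly five weekly calls is not counted.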

A. Conclusion
This research is the first holistic end-to-end approach to repurpose CDR data to study migration. Our research identified the home locations of subscribers together with corresponding confidence measures. We identified definite migrants, likely migrants, definite nonmigrants, and likely nonmigrants. We created detailed maps and identified migration patterns at the DSD level for two provinces in Sri Lanka. Our research is the first to predict the individual migration decision using CDR-based features. Of note is that the social relationship of staying in close contact with migrants is a strong indicator of future migration.

B. Limitations
Our dataset is from a single carrier circa 2013; at the time, the count of mobile cellular subscriptions was approximately 2.36 million with a total population of 20.32 million [42]. This may limit our coverage of the population, and it may be biased toward urban and mid- to high-income subscribers because of barriers to cell phone adoption in low-income communities. Nevertheless, a GSMA report from 2013 revealed that competition between operators in Sri Lanka lowered prices, potentially giving access to higher numbers of low-income individuals [56]. Our approach to determine confidence in the home location filtered out subscribers that did not have significant call activity at night, or whose activity levels are low overall. We are, thus, likely to miss the behavior of migrants with low cellular usage patterns. While some providers collect data at higher frequencies (a.k.a. network data) that may cover subscribers with low activity, such data are rarely made available for research purposes. Our evaluation was specific to a single country, but it spanned two very different provinces based on demographic and socioeconomic metrics. We expect that our methodology to determine the confidence in the home location and to label the migration status will have similar performance accuracy across other datasets.

C. Open Challenges
We would first like to study migration patterns in depth. This includes differentiating long- and short-term migrants, circular migration patterns, and so on. We would like to extend our novel work on determining confidence in the home location to more precisely determine the actual migration window. This will enhance the granularity, and utility, of the migration maps. More importantly, it may be a valuable tool in predicting migration, as an earlier migration decision may have a later cascading impact through social relationships.
A holistic end-to-end framework for CDR-based analysis must be based upon an architecture and infrastructure for contemporaneous processing that can provide close to real-time migration maps to social scientists and policymakers. For privacy reasons, CDR data are not freely available. Nevertheless, cell phone companies have developed numerous collaborations with academic and nonprofit partners, such as the World Bank and UNICEF, to use CDR data in high-stakes settings, including poverty [48] and health [49]. In addition, data challenges have also been proposed to give broader access to aggregated CDR datasets to the larger research community, e.g., the Syrian refugees challenge [31] or the D4D in Senegal [51]. Finally, we note that the COVID-19 pandemic has spurred significant interest in the potential use of CDR data to monitor and predict community spread [52], [53].
Finally, the end-to-end framework will have to support the shipment of sensitive CDR data from providers and must satisfy multiple requirements around privacy. Although the CDR data are pseudonymized, under some (limited) circumstances, CDR data have been reverse engineered to reidentify individuals [54]. GDPR-compliant approaches have been proposed recently to implement ethically founded CDR-based frameworks [55]. Recommendations from these studies could be incorporated into the proposed framework to transform it into a CDR-based GDPR-compliant framework. We will explore this transformation in future work.

Fig. 1. Neighborhood confidence distribution with a clear home location shift.

Fig. 2. Neighborhood confidence distribution with an unclear home location shift.

Fig. 3. Net migration (inflow − outflow) as a proportion of the population per DSD for the Western Province on the left and the Northern Province on the right.

Fig. 4. Histogram of five call migrant counts for migrants and nonmigrants in the Western and Northern Provinces.

TABLE II FEATURES USED FOR HOME LOCATION IDENTIFICATION, CALCULATED FOR EACH BTS VIA WHICH CALLS WERE MADE BY EACH SUBSCRIBER
count, night count, neighborhood night count, and day span.

TABLE VII BREAKDOWN OF DSDS WITH RESPECT TO CHURN (INFLOW + OUTFLOW) AND NET MIGRATION (INFLOW − OUTFLOW) AS A PROPORTION OF THE POPULATION. THE LABELS WERE DETERMINED BASED ON THE ABSOLUTE VALUES OF THEIR DISTRIBUTIONS FOR BOTH PROVINCES COMBINED: "HIGH" REPRESENTS THE TOP QUARTER, "MED" REPRESENTS THE INTER-QUARTILE RANGE, AND "LOW" REPRESENTS THE BOTTOM QUARTER. (a) WESTERN PROVINCE (TOTAL 40 DSDS). (b) NORTHERN PROVINCE (TOTAL 34 DSDS)

TABLE IX PERFORMANCE AS MEASURED USING THE AUC FOR A LOGREG MODEL AND XGBOOST FOR MIGRATION CLASSIFICATION FOR TWO PROVINCES. THE VALUES IN PARENTHESES ARE THE POPULATION COUNT IN THOUSANDS IN EACH PROVINCE. (a) WESTERN PROVINCE. (b) NORTHERN PROVINCE

TABLE X ORS. (a) WESTERN PROVINCE. (b) NORTHERN PROVINCE