Vaccinated, What Next? An Efficient Contact and Social Distance Tracing Based on Heterogeneous Telco Data

The demand for safety-boosting systems is always increasing, especially to limit the rapid spread of COVID-19. Real-time social distance preservation is an essential application toward containing the pandemic outbreak. A few systems have been proposed, but they require infrastructure setup and high-end phones, which limits their ubiquitous adoption. Cellular technology enjoys widespread availability and is supported by commodity cellphones, which suggests leveraging it for social distance tracking. However, users sharing the same environment may be connected to different telecom providers with different network configurations. Traditional cellular-based localization systems usually build a separate model for each provider, leading to a drop in social distance performance. In this article, we propose CellTrace, a deep learning-based social distance preserving system. Specifically, CellTrace finds a cross-provider representation using a deep learning version of canonical correlation analysis. Different providers' data are highly correlated in this representation, which is used to train a localization model for estimating the social distances. In addition, CellTrace incorporates different modules that improve the deep model's generalization against overtraining and noise. We have implemented and evaluated CellTrace in two different environments with a side-by-side comparison with the state-of-the-art cellular localization and contact tracing techniques. The results show that CellTrace can accurately localize users and estimate the contact occurrence, regardless of the connected providers, with a submeter median error and 97% accuracy, respectively. In addition, we show that CellTrace has robust performance in various challenging scenarios.


body because a pathogenic infection of COVID-19 mainly comes from direct physical contact with confirmed cases. Therefore, contact tracing approaches have received significant attention, as contact tracing is the most effective approach for breaking chains of viral transmission [1]. It involves identifying who may have had contact with an infected person, with a recursive tracing of their contacts. Despite the presence of vaccines in some countries, we are moving to a post-COVID-19 world where social distancing is the new normality [2]. This confirms the need for an automatic social distance preserving system to ensure such social distancing, leading to safe environments, especially indoors, e.g., schools and universities.

Toward realizing contact tracing, Wi-Fi-based systems [3] have been proposed due to the prominence of Wi-Fi-based localization systems [4]. These systems leverage the signals received from the installed Wi-Fi access point (AP) infrastructure to estimate the locations of users sharing the same environment. These systems perform well only if the area of interest is densely covered with Wi-Fi APs. Nevertheless, neither are all environments well covered with Wi-Fi networks nor do all cell phones enable Wi-Fi by default, limiting the ubiquitous adoption of such systems. A number of systems

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

In this article, we propose CellTrace: a novel cellular-based social distance and contact tracing system trained with multiple providers' data.
To achieve fine-grained accuracy, CellTrace builds a deep neural network (DNN) to learn the nonlinear relations between the received signal strength (RSS) measured by the users' devices and the corresponding locations in the area of interest and, thus, calculate the social distance between them. Since different users can be connected to different providers, CellTrace extracts cross-provider features, enabling the training of a single deep-learning-based localization model. Specifically, CellTrace employs deep canonical correlation analysis (DeepCCA) to learn the complex nonlinear transformations of the RSS of different providers and then project them into a space in which different providers' data at the same location are highly correlated. Then, these projected features are used to train the localization model to detect the locations of different users and their interdistance.

We test CellTrace using different Android phones in two indoor environments (small and large sizes). The evaluation results show that CellTrace can achieve consistent median localization errors of 0.4 and 0.7 m in the small and large environments, respectively. The system can also correctly detect the exact social distance between two users 97% of the time. This is better than the baseline cellular-based localization system by more than 235% and 117% when used for contact tracing purposes. Moreover, we show that our system is robust when tested across different network operators, lower cell tower densities, device heterogeneity, and unseen locations.

The rest of this article is structured as follows. We briefly describe key concepts relevant to this work in Section II. A literature review is carried out in Section III. In Section IV, we provide an overview of the proposed system, while Section V presents its details.
In Section VI, we describe the data collection process and how CellTrace is tested. We discuss the system limitations in Section VII. Finally, we conclude this article in Section VIII.

II. BACKGROUND

In this section, we provide a brief background on the traditional canonical correlation analysis (CCA) on which DeepCCA is built. The details of our DeepCCA algorithm are given in Section V.

CCA [15], [16] is a standard, highly versatile statistical method for finding a common correlation between two multivariate sets of variables (vectors) that describe the same situations. In particular, CCA linearly projects the input sets onto another space in which these sets are maximally correlated. This helps in studying the strength of the relationship between two quantitative variables and how they are related. An appealing property of CCA for prediction tasks is that, if there is noise in either set, the learned representations should not contain that noise in the new space.
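To make the linear projection concrete, the following is a minimal NumPy sketch of classical CCA, computed by whitening each view and taking an SVD of the whitened cross-covariance. It is an illustration only and not the implementation used in CellTrace: the function name `linear_cca`, the ridge term `reg`, and the synthetic two-view data are assumptions introduced here.

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-6):
    """Classical linear CCA (illustrative sketch).

    X: (n, p) samples of view 1, Y: (n, q) samples of view 2.
    Returns projections A (p, k), B (q, k) and the top-k canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Covariances; a small ridge term (assumed) keeps them invertible.
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    A = Sxx_is @ U[:, :k]
    B = Syy_is @ Vt.T[:, :k]
    return A, B, s[:k]

# Synthetic example: two noisy views driven by one shared latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[1.0, -0.5, 0.3, 0.8]]) + 0.1 * rng.normal(size=(500, 4))
Y = latent @ np.array([[0.7, 0.2, -1.0]]) + 0.1 * rng.normal(size=(500, 3))
A, B, corrs = linear_cca(X, Y, k=1)
u = (X - X.mean(axis=0)) @ A[:, 0]  # projected view 1
v = (Y - Y.mean(axis=0)) @ B[:, 0]  # projected view 2
```

In this toy setting the two projected views `u` and `v` are strongly correlated even though the raw views have different dimensions, which is the property the cross-provider representation relies on.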

Since α is scaling-invariant, we can rewrite the correlation

To find the optimum solution for (3)

CellTrace, on the other hand, utilizes only the ubiquitous cellular technology, which is supported by all mobile phones by definition. In addition, contact tracing can be realized at the provider side without either incurring extra processing or fooling the localization system at the edge.

2) Wi-Fi-Based Systems: Wi-Fi is widely available indoors and has been widely adopted for indoor localization due to its relatively short transmission range. To achieve high localization accuracy, Wi-Fi-based techniques usually depend on building a Wi-Fi radio map of the overheard Wi-Fi APs, which can be leveraged to identify the user location by matching the received signals. This matching can be either deterministic [23]

On the other hand, CellTrace leverages the widespread cellular technology, whose network configurations rarely change due to the high expense and complexity of frequent reconfiguration. In addition, unlike Wi-Fi networks, cellular signals have a longer propagation range and are less affected by variations in indoor environments [28]-[31]. This leads to a stable infrastructure for localization-based safety systems.

3) Cellular-Based Systems: Cellular-based localization has recently gained a lot of attention because cellular technology has the most widespread infrastructure and is supported by the vast majority of mobile devices. Therefore, cellular-based localization systems have been adopted for both outdoor and indoor use cases. In this technique, a model is built and trained during the offline phase to learn the relations between the collected RSS measurements and the user locations. Then, during the online phase, this model must be able to discriminate between different locations in the area of interest. Outdoor cellular-based systems [12], [32], [33] have been proposed as energy-efficient and ubiquitous alternatives to GPS.
Furthermore, cellular-based localization has recently been well realized in indoor settings leveraging the computational power of deep learning [

privacy concerns due to its requirement for clients to share contact logs to a central reporting server [36], [39]. Thereby,

project adopts similar principles on the operating system level [41], [42], which has been widely adopted [43].

A few approaches [44], [45] were recently proposed to track passengers on public transportation depending on the smartphones' inertial sensors (e.g., magnetometer, accelerometer, and gyroscope) using dead reckoning. However, the inherent noise in sensor data leads to an error that accumulates quickly

In contrast, cellular-based contact tracing has been shown to be feasible due to its availability and ubiquity, which encouraged

interdistance location estimation phase. The offline phase starts with the data collection process using a client-side application running on the user's cell phone. The application is carefully designed to record the time-stamped cellular information from the overheard towers at sparse predefined points, called reference points, in the considered area.¹ These collected measurements are uploaded to our online running service for further processing. The preprocessor module is used to handle the noise in the input data and prepare the low-level RSS feature vector of each considered provider.² As a result, a fixed-size RSS vector is obtained across all the recorded samples that fits as input to the localization model. The RSS vectors from different providers are then further processed by the deep feature extractor module to learn complex nonlinear transformations and project the original low-level features to a cross-provider feature space. The module is based on a combination of a DNN and a CCA process, ensuring that data from different providers are highly correlated in the common space, as described in Section V-B. Thereafter, the projected RSS features are fed to the localization model; hence, we can calculate the social distance between each pair of users based on an accurate estimate of their locations.
The output of this offline phase is two trained models (i.e., the deep correlation model and the localization model), which are saved for later use in the online phase.

number of neurons corresponding to the number of surveyed reference points in the area of interest considered in this phase. The network is trained to operate as a classifier such that each reference point represents a class. Unlike equivalent regression models, classification models usually have a simpler data collection process (i.e., they permit collection at low-density reference points). Therefore, a softmax activation function is leveraged at the output layer. This leads to a probability distribution over the different predefined reference points given an input. In particular, the network outputs the probability that the input sample (the latent representation) comes from a specific reference point. More formally, given a total number of training samples $m$, where $z_i \in \mathbb{R}^v$ is the projected latent representation of each cell scan $s_i \in \mathbb{R}^q$ that is fed to the model, the sample $z_i$ has corresponding discrete outputs (i.e., logits) $a_i = (a_{i1}, a_{i2}, \ldots, a_{in})$, which capture the score for each of the $n$ possible reference points to be the estimated point. The logit scores $a_{ij}$ (for sample $i$ to be at reference point $j$) are converted into probabilities using the softmax function as

$$p(a_{ij}) = \frac{e^{a_{ij}}}{\sum_{k=1}^{n} e^{a_{ik}}}.$$

For training purposes in the offline phase, we encode the ground-truth label of each sample using one-hot encoding. The encoded output vector has a probability of one for the correct reference point and zeros for the others. We used the Adam optimizer [57] and categorical cross-entropy as our loss function.
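The output-layer computation described above can be sketched in a few lines of NumPy. This is a hedged illustration rather than the CellTrace code: the helper names (`softmax`, `one_hot`, `categorical_cross_entropy`), the example logits, and the choice of three reference points are all hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def one_hot(labels, n_classes):
    """Encode integer reference-point indices as one-hot rows."""
    return np.eye(n_classes)[labels]

def categorical_cross_entropy(probs, targets, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and one-hot targets."""
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=1)))

# Two samples scored against three hypothetical reference points.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
probs = softmax(logits)                 # each row sums to one
targets = one_hot(np.array([0, 2]), 3)  # ground-truth reference points
loss = categorical_cross_entropy(probs, targets)
```

In a full training loop, this loss would be minimized with an optimizer such as Adam over the network weights that produce the logits.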

To avoid overtraining (i.e., overfitting), CellTrace employs the early stopping regularization technique, which automatically selects the optimal number of training epochs. Specifically, early stopping monitors the model's performance on a held-out validation set at every epoch during training and terminates the training as soon as the performance stops improving [58].

The goal of this phase is to pinpoint the users sharing the same environment and, thus, detect the contact occurrence. Initially, each user's device identifies the provider, measures the received cell signals from the hearable towers in the area of interest, and forwards the scan to our running service to process and extract the corresponding feature vector. Specifically, the RSS vector is submitted to the single view of the trained DeepCCA that corresponds to the user's network provider to extract a cross-provider feature vector, as described in Section V-B. This vector is then fed to the trained localization model (regardless of the provider) to get a location estimate as one of the reference points already defined at the calibration phase. Then, the user's location $l^*$ is estimated as the one that has the maximum probability given the input vector $z$. That is, we want to find

$$l^* = \arg\max_{l} \, P(l \mid z). \tag{7}$$

One advantage of designing the localization model to operate as a classifier rather than a regressor is reducing the number of required reference points and, thus, the data collection

represents the probability that the input vector is coming from the $i$th reference point $l_i$. $P_i$ is formulated as follows:

Thus, the fine-grained location coordinates are defined as

where $l_{ix}$ and $l_{iy}$ are the coordinates of reference point $i$.

contact tracing efficacy, encouraging some governments to adopt it at least for short periods [48]. Each provider is connected to the shared contact tracing server, which processes anonymized client cellular measurements, and thus, contacts can be identified, as shown in Fig. 4. Then, notification messages are sent from the provider to users upon contact occurrence of infected cases. However, this approach may face some challenges in adoption. For instance, privacy concerns may hinder the adoption in some countries, as end users have not consented to their providers using their data for contact tracing [48], [59]. Additionally, obtaining consent from different providers to deploy a common contact tracing system is rather difficult, even in an emergency. However, increased privacy comes with tradeoffs in the effectiveness of contact tracing and exposure notification apps [48], [59]. In particular, the effectiveness of the privacy-first apps might be impossible to evaluate due to the lack of recorded data [59]. Nevertheless, CellTrace can be extended for further privacy protection by anonymizing users' data, employing differential privacy [60], or inheriting the decentralization concept from other techniques, e.g., DP-3T [40] or exposure notification [41], [42].
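As a rough illustration of the online location estimation described earlier in this section, the sketch below applies the argmax rule in (7) and then refines the estimate in continuous space. The refinement is assumed here to be a probability-weighted average of the reference-point coordinates, which is one common choice; the exact formulation of $P_i$ used by CellTrace is not reproduced. The function names, the reference-point grid, and the 1 m contact threshold (taken from the contact definition in Section VI) are assumptions for this example.

```python
import numpy as np

def argmax_location(probs, ref_points):
    """Rule (7): pick the reference point with the highest probability."""
    return ref_points[int(np.argmax(probs))]

def weighted_location(probs, ref_points):
    """Assumed refinement: probability-weighted centroid of reference points."""
    probs = np.asarray(probs, dtype=float)
    return (probs[:, None] * ref_points).sum(axis=0) / probs.sum()

def is_contact(loc_a, loc_b, threshold_m=1.0):
    """Flag a contact when two estimated locations are within the threshold."""
    return float(np.linalg.norm(loc_a - loc_b)) <= threshold_m

# Hypothetical 2 m grid of four reference points (x, y) in meters.
ref_points = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
probs_a = np.array([0.7, 0.2, 0.1, 0.0])  # classifier output for user A
probs_b = np.array([0.6, 0.1, 0.3, 0.0])  # classifier output for user B

loc_a = weighted_location(probs_a, ref_points)
loc_b = weighted_location(probs_b, ref_points)
distance = float(np.linalg.norm(loc_a - loc_b))
contact = is_contact(loc_a, loc_b)
```

With the argmax rule alone, both users would snap to the same reference point and their distance would collapse to zero; the weighted estimate keeps the two locations distinct, which matters when the quantity of interest is the distance between users.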

VI. EVALUATION
In this section, the data collection setup and tools used are described first. Then, we show how the system performs by varying the different system parameters. Finally, we compare the performance of CellTrace to the state-of-the-art techniques.

A. Data Collection

We deployed our system in two indoor environments with different sizes and characteristics (as described in Table I).

across different days. To scan cell towers in the area of interest, we developed a scanning application using the Android SDK. To evaluate the learned model and confirm its generalization ability, we adopted k-fold cross-validation (typically k = 5). The training set is partitioned into k subsets, where each subset includes the data collected from two providers.⁶ Each time, k − 1 subsets are used to form a two-view training set, and the remaining one is used as the validation set. Hence, every subset appears in the validation set exactly once and in a training set k − 1 times. Then, the average error across all k folds is reported and is used to select the model parameters. This significantly reduces the impact of the bias-variance problem due to the interchange of the training and validation sets.

In this section, we study the effect of the deep models' different hyperparameters, CellTrace parameters, and the different techniques used to learn nonlinear transformations for achieving the maximum correlation between the data views on the overall system performance. These parameters include the number of layers, the effect of the feature extraction method, and the size of the feature vector. The default parameters' values used throughout the evaluation section are reported in Table II.

Fig. 7 shows the effect of changing the number of layers on CellTrace accuracy.
The figure shows how increasing the number of layers of the location estimation network increases its accuracy until it reaches an optimal value at four layers. This can be justified as increasing the number of layers increases the model computing

⁶ Without loss of generality, we got permission to use provider-side data for two providers only to ensure the system's validity on both sides, i.e., the client side and the provider side. However, CellTrace can work with any number of available providers by creating a view for each provider.
⁷ https://colab.research.google.com

the raw features and classic CCA, the proposed DeepCCA method gives an improvement of 235% and 185% in the estimation accuracy of social distance. These results confirm the efficacy of DeepCCA in capturing the correlated signatures of different providers better than other methods, which facilitates locating users sharing the same environment.

also shows that a feature vector z of ten dimensions yields the best performance. Beyond that, including new dimensions (i.e., features) leads to no further performance enhancement. This can be justified by two opposing effects: 1) the additional features reduce the correlation between different providers' data and 2) on the other hand, the location discriminative power of the localization model is boosted by increasing the number of features in the latent space.

In this section, we evaluate the robustness of CellTrace under varying environmental conditions.

1) Resilience to Provider Heterogeneity: In this section, we evaluate the performance of the CellTrace system when tested with two different providers individually compared to the heterogeneous providers' scenario. Fig. 10 shows the performance of CellTrace when all users are connected to either only A or only B compared to A and B together (heterogeneous providers).
It is worth noting that the operators cover the area of interest with different densities of serving towers: 16 and 9 for operators A and B, respectively.

Fig. 11. Effect of testing with unseen locations on the CellTrace system performance.

In this section, we evaluate the end-to-end performance of CellTrace in terms of localization performance and contact tracing accuracy, and compare it to the state-of-the-art cellular location estimation and contact tracing systems.

Figs. 13 and 14 show the CDF of social distance error for the two techniques in test bed 1 and test bed 2, respectively. Fig. 13 illustrates that CellTrace outperforms the baseline, enhancing the median error by 280%. Similarly, for the second test bed, CellTrace achieves an improvement in median localization error of 117% compared to the baseline [10]. This can be explained by noting that, unlike CellTrace, the baseline technique, which relies on the original signal features, does not consider the interoperability between different operators when their connected clients share the same spatial environment. It is worth noting that a slight drop in accuracy is observed in test bed 2, which can be attributed to the increase in spatial uncertainty in a larger space compared to test bed 1. This can be handled by space partitioning. Nevertheless, the results in the two test beds confirm the superiority of CellTrace over the baseline due to its ability to capture the correlated features across different providers.

2) Contact Tracing: In this section, we evaluate the overall accuracy of CellTrace in contact detection. Table III summarizes the performance metrics in contact tracing.
It is worth noting that positive means that contact is detected by the system, which can be either a correct (true) or an incorrect (false) detection, and similarly for the negative detection. A false-positive case occurs when users are more than 1 m apart (no physical contact) while the system reports a contact, leading to a false alarm and, thus, a bad user experience. In addition, the false-negative case occurs when the

1) Cellular technology may raise privacy concerns in some countries [48], despite its feasibility as a base technology for reliable contact tracing, as discussed in Section V-E. This can be handled by anonymizing users' data, employing differential privacy [60], or inheriting the decentralization concept from other techniques (DP-3T) [40].

2) The fingerprinting approach is challenging in 3G and 4G networks due to the reduction of the available cell information, which only includes the associated serving cell and sometimes the strongest neighboring cells

In this article, we aimed to realize a flexible solution for contact tracing that can be operated by the provider, the client, or even a third party by handling data from different sources, i.e., different providers. We presented the design, implementation, and evaluation of the CellTrace system as a ubiquitous contact and social distance tracing system using cellular signals. As part of the CellTrace design, we introduced a novel feature extraction module based on DeepCCA, which yields cross-provider features. These features are further utilized for training a deep localization model that tracks users and calculates their social distance regardless of their connected providers. Furthermore, we showed how CellTrace includes provisions in the model to avoid overfitting and boost the model generalization ability. CellTrace achieved a promising localization and contact tracing performance of submeter median distance error and 97% accuracy, respectively. Nevertheless, CellTrace still has to handle privacy-associated issues to ensure effective contact tracing while maintaining privacy. In addition, we plan to study the system performance at scale, i.e., increasing the number of phones, users, providers, and so on.