Toward a Better Understanding of Mobile Users’ Behavior: A Web Session Repair Scheme

Using mobile devices to browse the Internet has become increasingly popular over the years. However, the risk of being exposed to malicious content, such as online scams or malware installations, has also increased significantly. In this study, we collected smartphone data from volunteer users by monitoring their use of the Web and the applications they install on their devices. However, the collected data is sometimes incomplete due to the technical limitations of mobile devices. Thus, we propose a data repair scheme to restore incomplete data by inferring missing attributes. Here, the restored data represent the browsing history of a mobile user, which can be used to determine if and how the user has been the victim of web or mobile-specific attacks to compromise their sensitive data. The accuracy of the proposed data repair scheme was evaluated using a machine learning algorithm, and the results demonstrate that the proposed scheme properly reconstructed a user's browsing history data with an accuracy of 95%. The usability of the repaired data is demonstrated by a practical use case. The user's browsing history was correlated with other types of data, such as received SMSs and the applications installed by the user. The results demonstrate that a user can fall victim to SMS-based phishing (SMShing) attacks, where the attacker sends an SMS message to a user to trick them into installing a malicious application. We also present a case of a social engineering attack, where the victim was manipulated into providing their Amazon credentials and credit card details.

incomplete malicious information, it may become impossible to identify and understand the corresponding security incident. Thus, we propose a data repair scheme that can infer missing attributes. Using the repaired data, we find that a user has installed malware due to an SMShing attack. We also find traces of a user falling victim to a social engineering attack to reveal their credit card details and Amazon credentials. We believe that our findings can help other researchers and serve as a basis for the design of detection and prevention systems for mobile users. Possible applications include data augmentation for security analysts, basic forensic capabilities for end users on their own devices, and data mining results for academic researchers. Our primary contributions are summarized as follows.

1) Various data are collected from mobile devices using a custom-made sensor application. These data include web-browsing history information, web links obtained from received SMSs, installed and uninstalled applications, and information about applications currently running in the foreground on the device.

2) We propose a data repair scheme, and its usability to reconstruct the collected information is demonstrated in a practical use case. In contrast to the work described in [8], the data collection process employed in this study does not include a repair application programming interface (API); thus, other features are exploited. This scheme differs from existing methods because mobile devices (in contrast to PC browsers) cannot access typical information such as referrer headers or a source tab ID. Instead, we augment the dataset with additional features to perform the data repair process. Specifically, the collected data are augmented with similarity and technical features related to the visited websites.
3) A machine learning classifier is designed to evaluate the accuracy of the proposed data repair scheme, and experimental results demonstrate that 95% classification accuracy was achieved. Here, the training data were obtained by correcting a subset of the reconstructed sessions manually and determining whether each entry fits the current session. In addition, the training data were generated to facilitate more complex future interpretations of the results obtained by the proposed data repair scheme.

4) A case study was conducted to demonstrate the usability of the repaired data. Here, the browsing sessions were correlated with other elements collected by the sensor to investigate several security incidents. Traces showing users being manipulated by malicious webpages or by successful SMShing attacks to compromise sensitive data are presented. For example, an SMShing attack prompts a user to click on a malicious link and install a malicious application. We also demonstrate how a user may fall victim to a phishing website by exposing their credit card information.

To the best of our knowledge, this study represents the first work on reconstructing web-browsing sessions using collected user data. The goal of this study is to improve security analyses by correlating information about malicious webpage visits with SMS attacks and application installations, which has typically been unavailable to researchers.

The remainder of this paper is organized as follows. Related work is reviewed in Section II. The dataset and proposed data repair scheme are introduced in Section III. The machine learning model designed to evaluate the accuracy of the proposed session-repair scheme is presented in Section IV. The findings of our investigation of the dataset are presented in Section V.
A discussion of the limitations and hypotheses of this study is presented in Section VI. Finally, Section VII concludes the paper.

[29]. Some of these studies included participants who consented to install a sensor for monitoring purposes. Investigations of user security awareness and expertise have demonstrated that high expertise does not necessarily correlate positively with computer security.

In addition, surveys can suffer from insufficient sample sizes or low response rates [5], [26]. A seminal work on user behavior prediction was proposed by Canali et al. [9], where the web histories of 100,000 users, obtained from data provided by a major antivirus vendor, were analyzed. Then, a set of features that can be used by machine learning models was extracted. This study achieved a reasonable level of accuracy (up to 87%) in determining which users were likely to be the victims of web attacks.

Sharif et al. [5] proposed an extensive analysis of HTTP logs from a mobile network carrier. This log analysis process was combined with a mobile user survey, and a prediction model was designed to determine whether a user is likely to be exposed to malicious content over a long period. However, the performance of this model was slightly worse than that reported in [9]. Sharif et al. [5] claimed that, in contrast to [9], they used a more limited set of features, which is also less computationally intensive. Finally, they built a classifier that predicts the probability of a user being exposed to malicious content within a 30-second time frame.

In addition, a study based on chain-redirection reconstruction has been reported [10], and this study is similar to our approach. In [10], crowdsourcing was used to examine how users go through redirection chains before reaching the final webpage. Then, individual chains were combined into redirection graphs, which were used to distinguish legitimate webpages from malicious webpages. Although that study exploited the referrer header to construct the corresponding graph, the necessary information was inaccessible and could not be used in our current study, which is explained in further detail later in the paper.

Another study by Takahashi et al. [8] used a Chrome extension [6] developed to monitor the browsing behavior of PC users. This extension records each tab opened by a user and captures page redirections and bookmark usage. Using these data, the authors concluded that the predominant source of user compromise is actually bookmarks. Indeed, users regularly visit malicious websites, e.g., illegal streaming services, and add such sites to their bookmarks. The authors also designed a graph analysis tool that considers the transitions between legitimate and malicious domains, and they implemented a filtering solution to block such transitions.

Kovacs [14] investigated a problem similar to that considered in our study. Rather than repairing incomplete browsing sessions, they predicted the time spent on each website and which tab was in focus during a given browsing session. They also exploited the Google API through a browser extension to collect data from volunteer users. The features used in that study were divided into three categories, i.e., time-based features, the number of visits, and the referrer ID. In addition, they employed a random forest classifier and achieved approximately 80% classification accuracy. Note that this study falls outside the cybersecurity domain; however, its methodology yielded valuable insights relative to browsing session reconstruction.

In contrast to previous works, Szurdi et al. [13] emulated users to analyze the behaviors of malicious websites. They implemented a platform to generate browsing patterns involving various devices, browsers, whether a proxy is employed, etc. This study also represents one of the few studies that targeted mobile users (both emulated and a real human agent). Their platform included multiple evasion mechanisms designed to avoid the countermeasures set up by malicious websites. They collected a total of 2 TB of data, and they discussed an automated labeling scheme based on HTTP features to determine whether a given webpage is malicious or benign.

Bhagavatula et al.'s findings demonstrated that users are not actively engaged in researching security incidents, despite the fact that they might be affected directly by such events; e.g., out of the 59 users potentially affected by the Equifax breach, only 15 actually read about the event. Bhagavatula et al. also emphasized the need to better disseminate information related to security incidents and to attract the reader's attention.

Other than the studies listed in Table 1, one study used additional data sources to help label their data, and then applied clustering algorithms to the labeled data. By grouping domains with similar communications, they found that they could effectively determine the type of malware using these domains.

As shown in Figure 1, among the participants, 107 users provided entries flagged by GSB as malicious, i.e., they were exposed to malicious content. Thus, we limited the scope of our study to those 107 users; therefore, approximately 5% of the users in the dataset were exposed to malicious content.

The recruitment campaign was run similarly to the one run in [8]. We argue that this is a sufficient number of users, comparable to the sample sizes in similar works, e.g., 185 users in [14] and 50 users in [12]. Our population size is only 10% of that used in [5], partly because our recruitment campaign targeted users who were more tech-savvy (similar to [8]) than the general population.

The data collected by the sensor included web accesses, application installations and uninstallations, received SMSs, changes in the foreground application, authorization for installation of unknown applications, and other physical device information (the latter were not used in this study). Some of the collected data contained attributes that can be exploited by GSB to detect potential threats. The information collected for each data type is described as follows.

The volume of data generated by SMSs was much less than that generated by web browsing.

We worked with our internal review board to ensure that the usage of the logs was ethical and respectful of the users' privacy. We also accepted the terms and conditions associated with the use of the mobile sensor [6]. All collected data conformed to these terms and conditions, which stipulate that the data collected by the sensor are used only for research purposes (i.e., to detect and prevent access to malicious URLs). All users required to install this browser extension agreed to these terms and conditions.

The collected data included privacy-related details; thus, they were used under strict restrictions. Any personally identifiable user information was deleted or coded before the records were stored on the servers. The user ID recorded in the log was an internal number unique to each user that could not be used to reveal any personally identifiable information. In addition, raw URLs could not be shared with external parties. Thus, we did not use VirusTotal [33], which requires the submission of raw URLs. Instead, we used GSB to evaluate the maliciousness of URLs because this tool does not require the upload of raw URLs. We also deleted all records of users who requested them to be deleted.

The logs used in the analysis were stored on a server in a secure facility. Only registered users from registered machines with adequate security measures were permitted to access the log data. In addition, no permission was given to copy the raw data off these machines. Thus, all analyses were conducted on secure servers, and only the aggregated results were exported from the secure server for further analysis.

C. SESSION-REPAIR SCHEME
Here, we discuss the algorithms, assumptions, and metrics employed to repair the browsing sessions. As described in Section I, some entries in the dataset were incomplete due to several limitations. When reconstructing browsing sessions, the most important field is the tab hash field. In the following, entries with a missing tab hash value are referred to as orphan entries.

A simple approach to reconstructing browsing sessions is to group entries with the same tab hash value and sort them in chronological order. Algorithm 1 describes a naive implementation of a browsing session-repair algorithm. This algorithm begins by initializing a dictionary containing a list of sessions for each user. Then, for each user, a dictionary with the tab hash as the key and the matching entries as the value is initialized. Next, the algorithm loops through the corresponding user's data. If the tab hash is not already a key of the dictionary, a corresponding key is created; otherwise, the data are inserted in chronological order. Note that the use of this algorithm is limited because it drops all orphan entries. Thus, we designed the proposed approach, which considers parameter values other than the tab hash value, i.e., textual and chronological similarity metrics. The following assumptions were considered in the design of our metrics and the corresponding repair algorithm.

In this section, we describe the evaluation score designed for the proposed repair algorithm. The purpose of this score is to evaluate the probability of a specific web access fitting in a given browsing session. Following the assumptions described in Section III-C1, the coherence of an entry to a browsing session is proportional to the URL similarity of its chronological neighbors but inversely proportional to the temporal distance between its neighbors.
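As a plain-Python illustration, the naive grouping of Algorithm 1 can be sketched as follows; the entry field names `tab_hash` and `timestamp` are our illustrative assumptions rather than the sensor's actual schema.

```python
from collections import defaultdict

def naive_sessions(entries):
    """Sketch of Algorithm 1: group entries by tab hash, sorted chronologically.

    Entries with a missing tab hash (orphan entries) are dropped,
    which is precisely the limitation the proposed scheme addresses.
    """
    sessions = defaultdict(list)
    for entry in entries:
        if entry.get("tab_hash") is None:
            continue  # orphan entry: discarded by the naive algorithm
        sessions[entry["tab_hash"]].append(entry)
    # Keep each session in chronological order
    for tab_hash in sessions:
        sessions[tab_hash].sort(key=lambda e: e["timestamp"])
    return dict(sessions)
```

In practice, this grouping would be applied per user, matching the per-user dictionary described above.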

Here, the temporal distance between entries x_1 and x_2 is denoted dur(x_1, x_2), and the similarity ratio between the URLs of entries x_1 and x_2 is denoted sim(x_1, x_2). In addition, prev_entry(x) (respectively, next_entry(x)) represents the entry that chronologically precedes (respectively, succeeds) entry x in the current browsing session. The following formula is proposed to evaluate a specific entry x:

score(x) = (1 + sim(x, prev_entry(x)) + sim(x, next_entry(x))) / (1 + |dur(x, prev_entry(x))| + |dur(x, next_entry(x))|)    (1)

We designed an algorithm to repair each user's browsing sessions. Here, the first step is to extract the existing browsing sessions using the tab hash field (Algorithm 1) and retrieve the orphan entries for further processing. Then, for each orphan entry, the sessions that the entry may fit in are retrieved. The score for each session is computed, and the orphan entry is inserted into the session with the highest score. This process is shown in Algorithm 2, and the computation of the corresponding score is described in Algorithm 3. Algorithm 2 begins by collecting the reconstructed sessions acquired by Algorithm 1 and the orphan entries. Then, a Boolean session_found serves as an end condition for the search over candidate sessions.

Thus, we propose an approach to evaluate the performance of the session-repair scheme using ML classifiers and evaluate each orphan entry. Here, we discuss the construction of the training dataset, which features were considered, which models were tested, and the results of using outlier removal and feature pruning.
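A minimal sketch of Eq. (1) and the orphan-insertion step of Algorithms 2 and 3, under the simplifying assumptions that each entry is a (timestamp, URL) pair and that `sim` is any URL-similarity ratio in [0, 1]; this is a simplified reading of the algorithms, not the exact implementation.

```python
def score(entry_ts, entry_url, prev, nxt, sim):
    """Eq. (1): neighbor URL similarity over temporal distance."""
    numer = 1 + sim(entry_url, prev[1]) + sim(entry_url, nxt[1])
    denom = 1 + abs(entry_ts - prev[0]) + abs(entry_ts - nxt[0])
    return numer / denom

def insert_orphan(orphan, sessions, sim):
    """Insert an orphan (timestamp, url) entry into the best-scoring session.

    Each session is a chronologically sorted list of (timestamp, url) pairs.
    Returns True if a fitting session was found.
    """
    best_score, best_session, best_pos = -1.0, None, None
    for session in sessions:
        # Candidate position: between the orphan's chronological neighbors
        for i in range(len(session) - 1):
            prev, nxt = session[i], session[i + 1]
            if prev[0] <= orphan[0] <= nxt[0]:
                s = score(orphan[0], orphan[1], prev, nxt, sim)
                if s > best_score:
                    best_score, best_session, best_pos = s, session, i + 1
    if best_session is not None:
        best_session.insert(best_pos, orphan)
    return best_session is not None
```

With a similarity function returning 1.0 for identical URLs and 0.0 otherwise, an orphan visit to the same site is inserted into the session that shares its URL and time window, as Eq. (1) intends.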

First, a fraction of the results obtained from the dataset generated by the proposed repair scheme were evaluated manually. These corrected data represent 0.3% of the total dataset (over 2,800 data points) and were used as training data for the proposed classifier. This amount of data seems appropriate with regard to the works presented in Section II. The details are summarized in Table 2. The number of entries in the dataset corresponds to the total number of entries in the browsing sessions of the users in our study. In addition, the number of participants corresponds to the number of humans who manually labeled the training data. The insertion of each entry can be classified into the following three categories.

During the evaluation, we assumed that entries having URLs in the same domain and a small time difference were inserted coherently. If the domains did not match, we attempted to identify whether the domain belongs to an ad provider because the entry may have been generated by a redirection or a pop-up. If it does not, we checked the website using a search engine to determine whether its topic is coherent with the neighboring entries. Note that some URLs were no longer valid; thus, they could not be verified directly. However, they could still be investigated using traffic analysis or malware detection websites. It was also assumed that an adult website is a plausible source of redirection to malicious content. Here, each entry not labeled correct is considered incorrect by the binary classifier. However, in the future, we intend to use ternary classification to reconstruct browsing sessions not recorded by the sensor, as described in Section VI. The data corresponding to the manual evaluation of the proposed repair scheme are summarized in Table 2.
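The first-pass labeling heuristic (same domain and a small time difference imply a coherent insertion) could be approximated as follows; the 60-second threshold and the use of the raw hostname as the domain are illustrative assumptions, not values taken from the study.

```python
from urllib.parse import urlparse

def likely_coherent(entry, neighbor, max_gap_s=60):
    """First-pass check used before deeper manual inspection.

    Assumes each entry is a dict with a "url" and a "ts" (seconds) field.
    Entries on the same host within max_gap_s seconds are presumed coherent;
    anything else falls through to the ad-provider and search-engine checks.
    """
    dom_a = urlparse(entry["url"]).netloc
    dom_b = urlparse(neighbor["url"]).netloc
    same_domain = dom_a == dom_b
    small_gap = abs(entry["ts"] - neighbor["ts"]) <= max_gap_s
    return same_domain and small_gap
```

A production version would compare registrable domains (e.g., via a public-suffix list) rather than raw hostnames, so that subdomains of the same site match.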
Some excerpts generated by the proposed repair scheme were selected due to their relevance to our study (Tables 9 and 10).

We extracted several features from the browsing sessions, and these features can be divided into similarity features and technical features. Here, the similarity features represent the chronological and URL similarities between the current entry and its neighbors in the session, and the technical features represent the specificity of an entry. The features used by the classifier are described in Table 3.

The class instances of the dataset were balanced using the synthetic minority oversampling technique (SMOTE).

The imbalanced-learn Python library [43] was used for this purpose. The evaluation of the classifiers was complemented using both feature selection and outlier removal. Here, feature selection was implemented using extremely randomized trees [44], and features below an arbitrary importance threshold of 0.01 were removed. Outliers were removed from the dataset using the isolation forest algorithm [45]. The scores of the classifiers were averaged over 30 iterations to account for randomness in the training process.
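A sketch of the feature selection and outlier removal steps using scikit-learn; the SMOTE resampling step (from the imbalanced-learn library) would be applied before classifier training and is omitted here, and the hyperparameters shown are assumptions rather than the study's exact settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, IsolationForest

def prune_and_clean(X, y, importance_threshold=0.01, random_state=0):
    """Drop low-importance features, then remove outlier rows.

    Mirrors the preprocessing described above: extremely randomized
    trees for feature importances (threshold 0.01), then an isolation
    forest to filter outliers. Returns cleaned X, y and the feature mask.
    """
    # Feature selection with extremely randomized trees
    trees = ExtraTreesClassifier(n_estimators=100, random_state=random_state)
    trees.fit(X, y)
    keep = trees.feature_importances_ >= importance_threshold
    X_sel = X[:, keep]
    # Outlier removal with an isolation forest (-1 marks outliers)
    iso = IsolationForest(random_state=random_state)
    inliers = iso.fit_predict(X_sel) == 1
    return X_sel[inliers], y[inliers], keep
```

The feature mask makes it easy to report which features (e.g., type of threat, multiple redirect) were pruned, as discussed below.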

The classifier selected for the evaluation was an optimized XGB model combined with a KNN classifier on the unpruned dataset, even though the pruned dataset obtained better scores (Table 4). In fact, the pruned dataset only offered a 0.21% improvement, which can be attributed to the low volume of training data. As shown in Table 5, the proposed repair scheme achieved 95% accuracy, where all metrics performed similarly. We considered this to be an acceptable accuracy level to proceed with the case study (Section V). The Jaccard index was selected as the similarity metric because it performed best.

The feature pruning process removed the type of threat and the multiple redirect features from the web accesses. These features are not expected to impact the reconstruction process because they are only relevant to security incidents. However, the access type was retained, which is reasonable because there should be continuity in access types within a single web-browsing session. For the sake of comparison, Table 4 includes the performance of classifiers without resampling of the imbalanced class using SMOTE. During training, most samples were classified as inserted correctly, which improved the performance of even the worst performing classifiers. This is easily explained because most training samples belonged to the majority class; thus, the classifiers would overfit. Similarly, an analysis of the outliers demonstrated that most of them belong to the majority class. Note that these outliers did not include the synthetic data generated using SMOTE, and approximately half of the outliers were incomplete entries reconstructed by the proposed scheme. Thus, the other half of the outliers belonged to complete data that were collected without issue.

The Jaccard index was the most accurate similarity metric among the compared metrics; however, its accuracy was less than 1% better than that of the worst performing metric, i.e., the SequenceMatcher metric. In addition, the SequenceMatcher metric was the slowest to compute (by a factor of five compared to all other metrics). This was not an issue for the amount of data processed in the current study; however, it could become problematic if the scope of the evaluation were extended to also consider users who did not trigger the GSB sensor, which represents a factor of 20 more data per run of the algorithm. Finally, we observed that the number of browsing sessions was quite similar among the metrics, which implies a certain degree of stability in the splitting of the browsing sessions.
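For illustration, the two extreme metrics of this comparison can be sketched as follows; here the Jaccard index is computed over character trigram sets of the URLs (our assumption about the tokenization), while SequenceMatcher is Python's difflib implementation, whose quadratic-time matching explains the slowdown noted above.

```python
from difflib import SequenceMatcher

def jaccard(url_a, url_b, n=3):
    """Jaccard index over character n-gram sets of the two URLs."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    a, b = grams(url_a), grams(url_b)
    if not a and not b:
        return 1.0  # two very short URLs: treat as identical
    return len(a & b) / len(a | b)

def seq_ratio(url_a, url_b):
    """difflib.SequenceMatcher ratio (the slowest metric compared above)."""
    return SequenceMatcher(None, url_a, url_b).ratio()
```

Both functions return a ratio in [0, 1] and can be passed directly as the `sim` argument of the scoring formula in Eq. (1).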

The results are summarized in Table 5.

In the previous section, we discussed how the proposed

mobile applications are monitored by the sensor. Table 6 summarizes the tab distributions for both legitimate and malicious accesses. As can be seen, Chrome accesses are nearly equally represented among all entries and malicious entries, whereas SMSs represent only a small fraction (11%) of the malicious accesses in the entire dataset.

GSB identifies three types of threats in the dataset, i.e., Malware, Social_Engineering, and Unwanted_Software. Table 7 summarizes the distribution between threats and legitimate entries for both web accesses and links in SMS messages. We found that no SMS message triggered the Unwanted_Software flag; however, the data sample was too small to conclude that SMS messages are not commonly used to share such software. In addition, malware represents the majority of the malicious content found in SMS messages. In terms of web entries, we found that social engineering content makes up the majority of attacks, which is the opposite of the trend observed with SMSs. As will be discussed in Section V-B, if a user is going to install malware received through an SMS, they will typically do so shortly after receiving the message.

In this section, we discuss how different dataset elements are correlated to investigate the attacks made on users. In [5], the authors stated that they could not evaluate the actual infection rate due to the limitations of their dataset. Thanks to the user-side collection process used to construct our dataset, it is possible to overcome some of these limitations. Note that we do not intend to present a detection methodology for mobile attacks; the purpose of the following is to complement the discussion about the impact of attacks on cell phone users. This decision was motivated by the fact that only GSB was used by the sensor to classify entries as malicious or legitimate.

Social engineering attacks cover any attack that targets a user's personal information using deception. Here, we considered attacks coming from either an SMS or typical web-browsing activities. Note that the technical and ethical limitations of this study (as described in Section VI-E) did

Table 9 shows that a user was redirected to a localhost page using

The main limitation of the proposed repair scheme is that it does not differentiate between incorrectly inserted single entries and sets of entries belonging to a different browsing session that was not captured. The reasons for missing information in this other session may vary. For example, the user may not have authorized the sensor to record this information, or the sensor may have failed to capture the event. Thus, we may consider orphan entries as a block of entries rather than inserting them individually into multiple sessions. This could produce more meaningful sessions with less noise during the data repair process.

In Section III-C1, it was assumed that the similarity between two URLs was sufficient to assume continuity in a browsing session. Similarly, two similar URLs may indicate that the corresponding webpages include similar topics. However, this assumption should be reinforced using third-party tools, e.g., SimilarWeb [46], that provide contextual analyses and related keywords to achieve better evaluations of the semantic similarity between webpages.

The scope of the current study was limited to defining malicious content based on the output of GSB. However, GSB may not be the optimal choice for detecting malicious content. In fact, it has been pointed out [5] that GSB can take some time to include a reference in its database. Confidentiality issues with using VirusTotal to analyze the content were also outlined, due to the permanent recording of submitted data. However, VirusTotal can be used to analyze a URL's domain without including specifics, e.g., subdomains or parameters in the URL. We used VirusTotal manually to verify the web domain in an SMS link and found that a user had installed malware, which was then used to remotely control their smartphone.

The dataset constructed in this study also has some limitations. First, the total number of users was low compared to that reported in [5]. In addition, a smaller percentage of users were victims of mobile attacks. Note that we obtained a response rate similar to that reported in [8]. Our user base also suffered from the same bias, as the communication campaign for the mobile sensor was similar to that run by [8]. In addition, some domains were intentionally not recorded by the sensor (e.g., accesses made to bank-related websites or email accounts). Privacy concerns were also considered carefully throughout the course of this study, and users could arbitrarily decide to exclude specific domains from the scope of the sensor.

The sensor itself also suffers from some technical limitations.

However, it appears that these markers are insufficient to divide long sessions. Although the median numbers shown in Table 5 are coherent with what can be expected of mobile

to gain insights about how to improve the proposed repair scheme and the security of mobile users. More precisely, we found that some limitations in the data collection process may prevent us from obtaining insights into a user's behavior, which should be investigated further in future work. Finally, we demonstrated the usability of the proposed repair scheme by performing an exploratory data analysis, and we outlined risky behaviors leading to successful attacks, e.g., malware installation via an SMShing attack and the leakage of credit card details via a phishing website. This analysis was made possible by using the proposed repair scheme to recover missing entries in the web-browsing session information.