Prediction of Application Usage on Smartphones via Deep Learning

Smartphones have proven to be a transformative tool that helps users perform various tasks such as online banking, chatting, sending an email or SMS, and online shopping. However, with the growing number of available applications and people downloading new applications at a high rate, managing such a large number of applications is becoming an increasing concern, complicating the organization of smartphone screens and folders. This paper investigates to what extent the usage of those applications can be predicted. The proposed methodology utilizes a deep learning algorithm (long short-term memory) to predict the probability that a given application will be used by the smartphone user after a sequence of application usage. The experimental results show that application usage can be forecast with an accuracy of approximately 80%.


I. INTRODUCTION
The use of smartphones in our daily lives has grown steadily, especially with hardware improvements and effective programming capabilities. More specifically, smartphones are used to perform activities such as sending emails, transferring money via mobile Internet banking, making calls, texting, surfing the Internet, viewing documents, storing medical, confidential, and personal information, shopping online, and playing games. For instance, the Deloitte Mobile Consumer Survey (2018) demonstrates that 95% of smartphone owners in the UK use their devices daily. Furthermore, UK market penetration figures by device type clearly show that smartphone ownership has gradually increased compared with laptops (Deloitte Mobile Consumer Survey, 2018). In addition, with the increasing features of smartphones, mobile applications have evolved and become ubiquitous and widely used by smartphone owners. According to Statista (2019), app downloads will surpass 258.2 billion in 2022. Furthermore, the number of available apps in the Google Play app store was about 2 million in 2018, compared with 1,962,576 in Apple's App Store. Interestingly, Statista (2019) projected that mobile apps would generate 188.9 billion U.S. dollars in revenue via app stores in 2020.
The growth in smartphone usage has led to increased user concerns regarding privacy and security (Lamiche et al., 2018). More specifically, traditional authentication mechanisms for smartphones such as PINs, patterns, and passwords suffer from several issues; for instance, secret knowledge-based credentials can be shared, forgotten, or easily guessed, and users struggle to remember and manage a significant number of different accounts (Mahfouz et al., 2017). In addition, after the point of entry, using techniques such as a PIN or password, the device user can perform almost all tasks, of different risk levels, without having to re-authenticate periodically to re-validate the user's identity. Furthermore, current point-of-entry authentication mechanisms consider all the applications on a mobile device to have the same level of importance and so do not apply any further access control rules (Ledermuller & Clarke, 2011). As a result, with the rapid growth of smartphone use in daily life, securing the sensitive data stored upon them makes authentication paramount.
In this context, behavioral profiling (service utilization) attempts to identify and discriminate users based upon how they interact with applications and services; specifically, which applications they access, at what time of day, and for how long (Clarke, 2011). The advantage of this mechanism is that it gathers user data in the background, without requiring any dedicated activity from the user, by regularly checking user behavior to provide continuous monitoring for smartphone protection.
In summary, the main contributions of this paper are as follows: • A review of the most recent literature on predicting smartphone application usage. • A prediction model that can estimate the probability of a given application being used by the smartphone user after a sequence of application usage. • An evaluation and experimentation showing that application usage can be forecast with an accuracy of approximately 80%.
The remainder of the paper is structured as follows. The following section presents related work and the state of the art of smartphone behavior profiling biometrics. This is followed by an outline of a novel approach for predicting the subsequent application usage of smartphone users, including the data collection phase and experimental methodology, in section 3. Section 4 presents the experimental results, and section 5 concludes the paper.

II. RELATED WORK
With the rising usage of machine learning and artificial intelligence (AI), researchers have explored the new era of predicting the next app using machine learning algorithms and contextual models. This section reviews the related work on smartphone app prediction.

A. APP PREDICATION USAGE
In the literature, researchers have done substantial work on app prediction by studying users' behavior while using their smartphones. This section reviews different methods of app prediction. Shin

B. PREVIOUS STUDIES DATASET ON NEXT APP PREDICTION
These studies of smartphone app prediction are summarised across different types of datasets, users, and apps. Android was the most commonly used OS in app prediction studies because it does not require jailbreaking, unlike iOS on the iPhone (Cao & Lin, 2017). This review covers the methods used, the number of users, the predicted applications, the datasets used, and the accuracy/performance achieved.

III. EXPERIMENTAL METHODOLOGY

A. DATASET
A real application usage dataset is necessary to provide scientific rigor and a basis for evaluating the application usage pattern. Furthermore, it helps identify whether the application usage pattern could support a reliable assumption for predicting the smartphone user's next application. Therefore, this study recruited 76 participants (18 years or older) at the University of Plymouth from February to July 2017. Ethical approval for this research project was obtained from the university's Research Ethics Committee to fulfill University of Plymouth research ethics requirements. Participants were asked to read and sign a consent form and information sheet regarding data collection before starting the experiment. In addition, the research was conducted, and the data were stored, within the Centre for Security, Communications and Network Research on the university premises. Although the study collected applications' logs/metadata, no sensitive material was involved. Participants were asked to use their smartphones normally for at least one month.
After one month of commitment to the experiment, participants were asked to provide their devices for data extraction. In addition, the investigation was carried out only on individuals who use Android-based mobile phones.
Only application access and action metadata were collected. For the scope of this paper, only application usage (access) is analyzed. A script was developed to automate the extraction of log files from a backup file of participants' devices utilizing the Android Debug Bridge (ADB). ADB is a command-line tool that allows communication between a connected Android device and a computer (Android, 2018). For each examined application, the backup file was extracted, and then the developed script retrieved the metadata stored in each application's local database (SQLite), as illustrated in

B. DATA PRE-PROCESSING
After acquiring the raw application usage data, normalization and standardization transformations were examined for transforming the raw data. Transforming the raw data into rescaled values makes training prediction algorithms faster and reduces the chance of getting stuck in local optima (Jason Brownlee, 2019). Data normalization ensures that each feature is treated equally when applying supervised learners. In machine learning, we can handle various types of data, e.g., audio signals and pixel values for image data, and these data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean in the numerator) and unit variance. This method is widely used for normalization in machine learning algorithms (e.g., support vector machines, logistic regression, and artificial neural networks). The samples are normalized by scaling the input vectors individually to unit norm (vector length), while the standardization approach rescales the features by removing the mean and scaling to unit variance.
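As an illustration, the two transformations can be sketched with NumPy on a toy feature matrix; the values below are arbitrary and purely illustrative, not the study's data:

```python
import numpy as np

# Toy usage matrix: rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: zero mean and unit variance per feature (column).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization: scale each sample (row) to unit Euclidean norm.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
```

Note that standardization operates per feature while unit-norm normalization operates per sample; which of the two helps most depends on the learner being trained.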

C. NEXT APPLICATION CLICK PREDICTION MODEL
The prediction model aims to train a machine-learning algorithm to predict, with high probability, the next application the user will click/use. As we have 12 applications to predict (one is predicted at a time), the investigated problem can be treated as a classification problem, predicting a category (class). Two main variations of LSTM are used: unidirectional and bidirectional. FIGURE 1 illustrates a bidirectional RNN in which two independent RNNs are combined. The input sequence is fed in normal time order to one network and in reverse time order to the other. The outputs of the two networks are usually concatenated at each time step, though there are other options, e.g., summation. This structure allows the network to have both backward and forward information about the sequence at every time step.
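The bidirectional idea can be illustrated with a minimal vanilla tanh RNN in NumPy; the weights and dimensions below are random placeholders for illustration only, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4                 # sequence length, input size, hidden size
x = rng.normal(size=(T, D))       # toy input sequence

def rnn_pass(seq, Wx, Wh):
    """Simple tanh RNN returning the hidden state at every time step."""
    h = np.zeros(H)
    states = []
    for x_t in seq:
        h = np.tanh(x_t @ Wx + h @ Wh)
        states.append(h)
    return np.stack(states)       # shape (T, H)

Wx_f, Wh_f = rng.normal(size=(D, H)), rng.normal(size=(H, H))
Wx_b, Wh_b = rng.normal(size=(D, H)), rng.normal(size=(H, H))

h_fwd = rnn_pass(x, Wx_f, Wh_f)              # normal time order
h_bwd = rnn_pass(x[::-1], Wx_b, Wh_b)[::-1]  # reverse order, then realigned
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)  # shape (T, 2H)
```

Concatenation doubles the state size at every time step, so each position carries a summary of both what came before it and what comes after it.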

FIGURE 1. Generic Bidirectional Recurrent Neural Network
Each network unit (represented as A and A') is a recurrent unit. A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho et al. (2014). It combines the forget and input gates into a single "update gate." It also merges the cell and hidden states and makes other changes, as illustrated in FIGURE 2.

FIGURE 2. GRU Recurrent Unit
FIGURE 3 explains the shapes and arrow types that control the data flow in the GRU unit. Each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations (Olah, 2015).

FIGURE 3. Shapes and Arrow Types of a GRU Unit
As the GRU is a recurrent model, the prediction at the current time depends on all past inputs. For each layer, the GRU processes the input at time t by computing the following equations:

z_t = σ(W_z x_t + U_z h_(t-1) + b_z)
r_t = σ(W_r x_t + U_r h_(t-1) + b_r)
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_(t-1)) + b_h)
h_t = (1 − z_t) ⊙ h_(t-1) + z_t ⊙ h̃_t

where σ is the sigmoid function, z and r are the update and reset gates, and h is the hidden state. A unidirectional LSTM (or GRU) uses only past information, whereas a BiLSTM can also take advantage of future information: each BiLSTM layer contains a forward pass and a backward pass.
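A single GRU time step can be sketched directly in NumPy; the weight matrices below are random placeholders for illustration, where a trained model would learn them:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 3, 4                                  # input and hidden sizes
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Random placeholder weights (a trained model would learn these).
Wz, Uz = rng.normal(size=(D, H)), rng.normal(size=(H, H))
Wr, Ur = rng.normal(size=(D, H)), rng.normal(size=(H, H))
Wh, Uh = rng.normal(size=(D, H)), rng.normal(size=(H, H))

def gru_step(x_t, h_prev):
    z = sigmoid(x_t @ Wz + h_prev @ Uz)              # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur)              # reset gate
    h_tilde = np.tanh(x_t @ Wh + (r * h_prev) @ Uh)  # candidate state
    return (1 - z) * h_prev + z * h_tilde            # new hidden state

h = np.zeros(H)
for x_t in rng.normal(size=(5, D)):                  # unroll over 5 steps
    h = gru_step(x_t, h)
```

Because the new state is a convex combination (weighted by the update gate) of the previous state and a tanh-bounded candidate, each hidden value stays within [-1, 1].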
The model architecture used in this study consists of five layers, as illustrated in FIGURE 4. The first layer has five inputs, as the sequence length is five: given five consecutive application usages, the model should predict the sixth. This layer is followed by a dropout layer with a ratio of 0.20. Dropout removes some of the hidden nodes according to the predefined ratio (Srivastava et al., 2011) and has been found to help neural network-based models generalize better. The network output layer activation is processed with a softmax function. Softmax assigns decimal probabilities to each class in a multi-class problem; those probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would (Janocha et al., 2017). The application with the highest probability is determined as the predicted class among the other applications.
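The softmax output step can be sketched as follows, assuming 12 classes (one per examined application); the logit values are arbitrary and purely illustrative:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0,
                   1.5, -0.5, 0.3, 0.8, -2.0, 0.2])  # one logit per app
probs = softmax(logits)                 # decimal probabilities, sum to 1.0
predicted_app = int(np.argmax(probs))   # class with the highest probability
```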

D. HYPER-PARAMETERS
The batch size is 100. The Adam optimization algorithm is used instead of the traditional stochastic gradient descent procedure to update network weights iteratively based on the training data. Early stopping is applied over a maximum of 100 training epochs. The initial learning rate is 0.01. When training converges, it is stopped, and the checkpoint with the best validation accuracy is used to evaluate the test accuracy (Zeng et al., 2019).
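The early-stopping and best-checkpoint logic can be sketched as follows; the validation-accuracy trace and patience value below are made up for illustration, and a real run would obtain them from the training loop:

```python
# Hypothetical validation-accuracy trace over epochs (illustrative only).
val_accuracy = [0.40, 0.55, 0.62, 0.70, 0.69, 0.71, 0.70, 0.69, 0.68, 0.67]

patience = 3                    # stop after 3 epochs without improvement
best_acc, best_epoch, wait = 0.0, -1, 0

for epoch, acc in enumerate(val_accuracy):
    if acc > best_acc:
        # A real loop would save a model checkpoint here.
        best_acc, best_epoch, wait = acc, epoch, 0
    else:
        wait += 1
        if wait >= patience:    # monitored accuracy stopped improving
            break

# The checkpoint from best_epoch is the one used to evaluate test accuracy.
```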

IV. EXPERIMENTAL RESULTS

A. EVALUATION METRICS
Using only an accuracy metric does not fully reveal overlap and false-positive rates among the classes (the predicted application names), as it computes the ratio of correctly predicted labels to the total examined samples, which makes it insensitive to unbalanced classes. Therefore, an F score is computed, which is interpreted as the harmonic mean of precision and recall: F = 2 × (precision × recall) / (precision + recall), in which the relative contributions of precision and recall are equal. An F score of 1.0 is the best and 0.0 the worst. It is worth mentioning that the F score is commonly used for binary classification problems; however, adapting the metric to a multiclass problem is achieved by using one label versus all other labels.
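A one-vs-rest (macro-averaged) F score can be computed from scratch as follows; the labels below are a toy example, not the study's data:

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F score, treating each class one-vs-rest."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

y_true = ["email", "sms", "email", "phone", "sms", "email"]
y_pred = ["email", "sms", "phone", "phone", "sms", "email"]
score = macro_f1(y_true, y_pred)
```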

B. EXPERIMENTAL ENVIRONMENT AND TESTBED
The experiment and analysis of this study were mainly conducted using Google Colaboratory (Colab, 2019).
Colaboratory is a research tool for machine learning research projects. It is a Jupyter notebook environment that requires no setup to use. For modelling the deep learning network, Keras is utilised, as it is a high-level neural networks API, written in Python and capable of running on top of TensorFlow 2.0 (TensorFlow, 2019). We strongly recommend that researchers leverage Google Colaboratory, as it was developed with a focus on enabling fast experimentation, enables collaboration among researchers, and facilitates documenting the developed code boxes/functions on the go.

C. EXPERIMENTAL ANALYSIS AND RESULTS
FIGURE 5 illustrates the density of those applications for all users in the collected dataset. Email is mostly used in the late morning to midday, peaking around 10:00-13:00. This is not surprising, as most people are at work/university at that time, checking or sending emails. In contrast, the YouTube application is mostly used in the evening, peaking around 18:00-20:00. Overall, usage of most of these applications increases gradually from midday until bedtime (21:00-22:00), when it declines. However, not all users have the same pattern: some use only a subset of the selected applications, while others use all 12. To show how different users have different usage patterns, FIGURE 6 shows the usage timelines of two users across the 12 applications. User number 1 (a) uses all the examined applications, while the second user does not use application 1 or 2. In addition, they vary in their timeline usage. The proposed model is trained independently for each user. FIGURE 7 illustrates the model accuracy on both the train and validation sets in predicting the application the user will use, for a selected user (e.g., user 52). This specific user was selected randomly to show the train and validation accuracy over epochs. After epoch number 50, the model starts to overfit. Leaving the learning process to continue longer could lead to an overfitting issue, in which the neural network is so closely fitted to the training set that it becomes difficult to generalize and make predictions for new data. Although the dropout technique is used after each layer, preventing overfitting is not always an easy and obvious task. Therefore, the early stopping approach is used, which stops the training process when the monitored accuracy has stopped improving epoch after epoch.

FIGURE 7. Train and Validation Accuracy
Likewise, FIGURE 8 illustrates the model loss on both the train and validation sets in predicting the user's application. In this case, the sparse_categorical_crossentropy loss function (i.e., the objective function), presented in the following equation, is used, as the predicted target of the network is treated as an integer number that corresponds to an application name:

L(w) = −(1/N) Σ_i log ŷ_(i, y_i)

where w refers to the model parameters (e.g., the weights of the neural network), y is the true label, and ŷ is the predicted probability. The overall accuracy of the app prediction ranges between 60% and 90% across users. From FIGURE 9, email has the highest prediction rate, with an accuracy of around 95%. Phone calls and SMS appear to have almost the same rates.
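A sparse categorical cross-entropy computation can be sketched in NumPy as follows; the probability rows below are illustrative, standing in for the softmax outputs of the trained network:

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_prob):
    """Negative log-probability assigned to the true class, averaged over
    samples; y_true holds integer class indices (application indices)."""
    probs = y_prob[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(probs)))

# Two samples over three classes; each row sums to 1 (softmax output).
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
y_true = np.array([0, 1])   # integer labels, e.g. application indices
loss = sparse_categorical_crossentropy(y_true, y_prob)
```

Because the labels are integers rather than one-hot vectors, only the probability at each true index enters the loss, matching how Keras treats integer targets for this loss.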

FIGURE 9. Overall Accuracy per Examined Application
In terms of individual users' prediction performance, FIGURE 10 illustrates the accuracy across all users for all 12 applications. The average accuracy is nearly 80%, similar to the average application prediction. The finding provides evidence that the application usage patterns of some users can be predicted with high confidence.

FIGURE 10. Overall Achieved Accuracy per User
Although the experimental results have shown that the proposed approach achieves an overall accuracy of around 80%, there are several limitations and open investigations that need to be examined. For example, the study focused on only 12 selected applications, these being the most used applications by the study sample. However, application usage prediction should be extended to include and predict whatever the user typically uses, without filtering the applications. Predicting the application usage, or what the user will open next, can then be defined as an n-class problem, where n is the number of applications that the smartphone user has installed and frequently uses on the device. Although measuring the frequency of application usage is not a straightforward task, it is an aspect that needs to be considered, as some of the installed applications are rarely used, such as those that come pre-installed with the device. Further to that, the evaluation of this study was mainly conducted offline (using an experimental environment). It has not been thoroughly tested on a live device (a smartphone in this case) to measure other operational metrics, such as computational overheads, memory consumption, and the time required for the whole pipeline to be completed, starting from acquiring usage patterns to pre-processing, and finally inference, where the next app is predicted. In addition, the collected dataset was acquired from Android-based smartphones only. Investigating other devices, such as iOS-based devices, could reveal how similar or different users' usage patterns are between such operating systems. Future work could also explore other factors, such as identifying the minimum number of seconds and samples required per individual to train a user-dependent prediction model that can successfully match a given usage pattern sequence with the application that the user will use.

V. CONCLUSION
The increasing number of mobile applications can make it difficult to find a specific application promptly. For this reason, this research study presents a novel methodology to predict the next mobile app that a user is going to open, based on supervised machine learning algorithms. This approach might improve the user experience and thereby make the smartphone system more efficient and user-friendly. The proposed methodology utilizes a deep learning algorithm (long short-term memory) to predict the probability that a given application will be used by the smartphone user after a sequence of application usage. The experimental results show that application usage can be forecast with an accuracy of approximately 80%.

VI. ACKNOWLEDGMENTS
The author would like to thank the Deanship of Scientific Research at Majmaah University for supporting this work under Project Number 5453.