Using Machine Learning to Optimize Web Interactions on Heterogeneous Mobile Multi-cores

The web has become a ubiquitous application development platform for mobile systems. Yet, energy-efficient mobile web browsing remains an outstanding challenge. Prior work in the field mainly focuses on the initial page loading stage but fails to exploit the opportunities for energy-efficiency optimization while the user is interacting with a loaded page. This paper presents a novel approach for performing energy optimization for interactive mobile web browsing. At the heart of our approach is a set of machine learning models, which estimate at runtime the frames per second achieved for a given user interaction input when the computation-intensive web render engine runs on a specific processor core at a given clock speed. We use the learned predictive models as a utility function to quickly search for the optimal processor setting, carefully trading response time for reduced energy consumption. We integrate our techniques into the open-source Chromium browser and apply them to two representative mobile user events: scrolling and pinching (i.e., zoom in and out). We evaluate the developed system on the landing pages of the top-100 hottest websites and two big.LITTLE heterogeneous mobile platforms. Our extensive experiments show that the proposed approach reduces system-wide energy consumption by over 36% on average and up to 70%. This translates to an over 10% improvement in energy efficiency over a state-of-the-art event-based web browser scheduler, with significantly fewer quality-of-service violations.


I. INTRODUCTION
In recent years, portable mobile devices like smartphones and tablets have become the dominant personal computing platform [8]. Concurrent with this mobile computing evolution is the wide adoption of web technology as a development platform for many mobile applications like web browsing, social networking, news reading and online banking. Indeed, the web has become a major information portal for mobile systems and even accounted for two-thirds of the mobile traffic in the US [37].
Energy and performance optimization for mobile web browsing is an open problem. Mobile users want their devices to appear fast and responsive but at the same time to have long-lasting batteries. Achieving both at once is difficult. As with many other mobile applications, the performance-energy trade-off is a critical issue for interactive mobile web browsing, because users expect a degree of responsiveness for web browsing, but also want low energy consumption when interacting with their battery-powered devices.
This work has been submitted to IEEE Access for review on 14 June 2019. © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Recently, efforts have been made to improve the energy efficiency for mobile web browsing by focusing on the initial page loading phase [34], [35]. These prior approaches exploit the performance-energy elasticity provided by the heterogeneous multi-core hardware design to trade page loading time for lowered power consumption. Although impressive results were shown, such approaches do not consider the impact of user interactions after the initial page loading stage. However, as we will show later in the paper, user interactions can account for a large portion of power consumption for mobile web browsing and thus cannot be ignored for energy optimization.
Some of the more recent studies like PES [15] and eBrowser [52] have attempted to address the energy optimization problem for interactive mobile web browsing. While promising, they have critical drawbacks. Specifically, PES employs an analytical model to choose the operating frequency of the processor to reduce energy consumption. Developing an effective analytical model requires deep knowledge of the underlying hardware and the application domains [46]. As a result, PES offers poor hardware portability because tuning a model for a new hardware platform could involve significant overhead. While eBrowser avoids the pitfall of using an analytical model, it performs energy optimization by simply dropping some of the user events (which is not ideal as it can miss some important user inputs) and does not explore the rich optimization space offered by the increasingly popular heterogeneous multi-core design. Since heterogeneous multi-cores like ARM big.LITTLE [23] have become the de facto design choice for mobile systems, eBrowser leaves much room for improvement.
In this work, we aim to close the gap of energy optimization for interactive mobile web browsing. Our goal is to design an adaptive scheme to unlock the potential of heterogeneous multi-cores to better perform energy optimization for interactive mobile web browsing. Our key insight is that a slight delay in responding to a user input might be imperceptible to or acceptable by the user [?], and this provides chances for reducing the energy consumption of a mobile system by running the computation-intensive rendering process on the right processor with an appropriate (but not necessarily the highest) clock speed. In other words, if we can carefully trade the processing time to provide "just good enough" responsiveness, we can then reduce the energy consumption of the entire system without significantly compromising the quality of service (QoS).
While intuitive, translating this high-level idea into a practical system is not trivial. The key challenge here is how to develop an effective scheme that is portable across different mobile platforms. Given the diversity of today's mobile devices and the constantly evolving nature of hardware design (where the number of processor cores and capabilities of a mobile device are likely to change from one generation to the next), it is important to make sure that whatever strategies we develop today can be easily ported to, and deliver good performance on, a new hardware architecture tomorrow.
We address the portability issue by employing machine learning to automatically learn how to best configure the underlying heterogeneous multi-core hardware for a given web page to meet the QoS requirement for a specific user and event. Our machine-learning models are automatically built offline using training web pages, and the learned models can be applied to any new, unseen web content. This automatic learning-based scheme removes the need for manually rewriting an analytical model every time the hardware changes. By reducing the expert involvement, our approach reduces the cost of model construction and offers better generalization ability and performance portability.
To evaluate our approach, we have developed a working prototype based on Chromium [2], the open-source backbone of two mainstream web browsers: Google Chrome and Microsoft Edge-for-ARM64. We apply the developed system to the landing pages of the top-100 popular websites ranked by www.alexa.com [1] and consider two representative mobile interactions: scrolling and pinching (i.e., zoom in and out). We evaluate our work on two distinct heterogeneous mobile platforms: Odroid Xu3 and Jetson TX2. Experimental results show that our approach consistently outperforms the state-of-the-art on all evaluation metrics, delivering over 10% more energy reduction with fewer QoS violations.
In summary, the key contribution of this paper is a new approach for optimizing interactive mobile web browsing on heterogeneous multi-core mobile platforms. Compared to existing solutions, our approach has the benefits of being low-cost for model construction and portable across hardware architectures. We show that these benefits do not come at the cost of performance penalties. On the contrary, our approach delivers consistently better performance than the state-of-the-art across web content and evaluation platforms.

II. BACKGROUND

A. Problem Scope
Mobile web browsing includes two distinct phases [31]: initial page loading and responding to user inputs. During the page loading phase, web content is fetched and parsed to construct a Document Object Model (DOM) tree for rendering. Because a web page often cannot fit into a single screen view on a smartphone, only the currently visible area is rendered by the browser at a given moment. In the user interaction phase, the web browser responds to user events (e.g., scrolling) to render and update the visible viewport accordingly. In this paper, we focus solely on the latter, interactive phase. We consider two typical mobile interactive events, scrolling and pinching, but our approach can be applied to other user gestures too.

B. Hardware architecture
Our work targets the ARM big.LITTLE heterogeneous multi-core design, the de facto hardware architecture for modern mobile systems. The big.LITTLE architecture integrates an energy-tuned CPU processor (little) with a faster but more power-hungry processor (big). Such a heterogeneous design gives the software the flexibility to choose a processor core to run a given task depending on the energy and time constraints.
Specifically, our work is evaluated on two representative ARM big.LITTLE implementations: Odroid Xu3 and Jetson TX2, which integrate two different generations of the big.LITTLE architecture. Having two distinct platforms allows us to evaluate the performance portability and generalization ability of our approach. We provide further details of our hardware platforms when describing our experimental setup in Section V.

III. MOTIVATION OF THE WORK
To show the need for interactive mobile web browsing optimizations, we measured the energy spent by the Chromium browser for responding to scrolling events when processing the landing pages of the top-100 hottest websites ranked by alexa.com. We automatically generate scrolling events to ensure all content of each page is shown on the screen, but ensure that each scrolling session leads to only a full-screen update. In this experiment, we use an Odroid Xu3 mobile development board that implements the hardware architecture used by the Samsung Galaxy S5 smartphone. The Odroid Xu3 integrates a Cortex-A15 (big) CPU cluster, a Cortex-A7 (little) CPU cluster, and a Mali-T628 GPU. Figure 1 compares the average energy and time spent during the loading and the interacting phases when browsing a web page. The min-max bars of the diagram show the variance across different pages. For each page, we profile the time and energy multiple times until the 95% confidence interval falls within a 5% range, to ensure the measurement is statistically sound. Note that our measurement excludes time and energy during idle time. Interactions on average are 94% (up to 3x) longer and consume 44% (up to 3.5x) more energy than the initial page loading phase. This large gap between the two phases, in both energy consumption and processing time, suggests that previous work which only focuses on the initial page loading phase misses a massive opportunity for energy optimization. This work aims to close this gap by performing energy optimization during the interacting phase. Our key insight is that we can reduce the power consumption by responding to user input using a lower CPU clock frequency, and that a slight delay in screen updating might be tolerable to the user. To elaborate on our point, consider Figure 2, which shows the achieved energy reduction and frames per second (FPS) under different CPU clock frequencies for the landing page of cnn.com.
In this example, we map the computation-intensive render process to run on a big or a little CPU cluster under different clock frequencies, and the rest of the browser processes to run on the other cluster. Running the render process on the big CPU cluster with the highest clock speed gives an FPS of 55. Our user study presented in Section V-E shows that the FPS strongly correlates to the minimum acceptable responsiveness of a user, but a typical user would be unable to tell the difference between 25 FPS and a higher screen update rate for web browsing. This finding is in line with the observation presented in a prior study on mobile user experience [?]. This suggests that it would be unnecessary to run the render process at a frequency higher than 1 GHz on the big cluster, which already gives a "good enough" FPS of 28 for this example.
The question here is: what is the optimal processor setting to use? The right answer depends on which CPU cluster (big or little) we choose to use and at what frequency the CPU cluster will operate. The choice also depends on the incoming user event rate and the web content, because they determine how long it will take to update a screen view. Unfortunately, choosing the right processor setting is not trivial, as an inappropriate setting can lead to either an unacceptable FPS (spoiling the QoS) or unnecessarily high energy consumption (wasting battery life). In the next section, we describe how to develop an adaptive scheme based on machine learning to maximize energy reduction without necessarily compromising the QoS.

A. Overview
This work aims to reduce energy consumption during interactive mobile web browsing without compromising the QoS. We achieve this by dynamically choosing the optimal CPU cluster and processor frequency to run the computation-intensive render process.
At the core of our approach is a set of regression models, each tuned for a specific user event. Our models are trained offline using training data. The trained models are then used to make predictions for new, unseen webpages. Each model estimates the FPS under a specific processor frequency, taking into consideration the web content to be rendered, the incoming user event and the event rate. The predictive model is used as a utility function to quickly search for the optimal processor setting. Predictions are made based on a set of numerical values, or feature values, described in Section IV-C.
Some prior works on optimizing webpage loading follow a constrained approach, using a classifier to choose from a set of processor frequencies [34], [35]. However, due to the nature of classification algorithms, these methods can only apply to the set of FPS and frequencies seen in the training data. We avoid this drawback by employing an unconstrained approach: our regression-based model can be used for arbitrary processor frequencies (even those that were not present during training), because the model takes the frequency as an input.

Figure 3: Training predictive models. Our models are trained offline so that the user does not pay the profiling cost during runtime deployment.

B. Predictive Modeling
Our model for determining the optimal processor setting is a collection of artificial neural networks (ANNs). As we target two interactive events in this work, we have two ANNs, one for each event. We choose ANNs because they deliver better and more robust performance than alternative modeling techniques like support vector machines and decision trees (see Section VI-F).
We follow the well-known 3-step supervised learning process to build and deploy our models: (i) generate training data and model the problem, (ii) train a model on the training data, and (iii) use the model on unseen data. We describe each of the steps in detail in the following subsections.

C. Generate Training Data and Problem Modeling
Training data generation: Figure 3 illustrates the process for training an ANN model, which applies to all machine-learning models we evaluate in this work. Our models are built offline using training webpages. In this work, we apply cross-validation by using a corpus of 80 webpages for training and 20 webpages for testing (see Section V-D). We make sure that the testing webpages are different from the training webpages and come from different websites. We collected the training and testing webpages from the landing pages of the top-100 hottest websites ranked by alexa.com (see also Section V-B). To generate training data, we use reran [16], an open-source record and replay framework, to automatically generate the two user events we target. For each training webpage, we generate different training scenarios by varying the duration and speed of a target event. In each training scenario, we exhaustively execute the rendering process under different processor settings and record the achieved FPS. For each webpage, we also extract its web feature values from the DOM tree constructed during the page loading phase.
Model structure: Figure 4 depicts the architecture of our models, which is a fully connected, feed-forward ANN with five hidden layers, where each hidden layer has 80 neurons. The model takes as input the web feature values, a measured event rate, a label indicating where to run the render process (big or little), and the clock speed of the given CPU cluster. It produces the estimated FPS as a real value. The output layer of the ANN uses a rectified linear unit (ReLU) activation function to aggregate the results of the previous layer into a single output, i.e., the expected FPS under a processor setting. We stress that keeping the network structure simple is essential for achieving fast prediction and for learning an effective model from a relatively small training dataset. We also evaluate various neural network structures; this is discussed in Section VI-F.
Figure 4: Our FPS predictor is a multi-layer neural network. The input to the network includes a web-feature vector of real values, the measured incoming event rate, a label indicating the big or little processor cluster, and the real value of the processor speed. The network outputs a real value of the estimated FPS for the given web content with an input event rate under a specific processor setting.
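To make the input and output of the FPS predictor concrete, the following is a minimal, untrained sketch of the forward pass of a network with five hidden layers of 80 neurons. It is an illustration under stated assumptions, not the prototype's code: the function names, the random weight initialization, the 0/1 encoding of the cluster label, and the use of ReLU in the hidden layers are all assumptions (the text only states that the output layer uses ReLU).

```python
import random


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer; `weights` holds one row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_network(n_inputs, hidden_layers=5, hidden_units=80, seed=0):
    """Build randomly initialized (untrained) layers: 5 x 80 hidden units, 1 output."""
    rng = random.Random(seed)
    sizes = [n_inputs] + [hidden_units] * hidden_layers + [1]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / fan_in) ** 0.5  # assumed initialization, not from the paper
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def predict_fps(layers, web_features, event_rate, on_big_cluster, freq_ghz):
    """Assemble the input vector and run the forward pass.

    Input = web features + event rate + cluster label (assumed 1.0 for big,
    0.0 for little) + clock speed. ReLU is applied after every layer
    (hidden-layer ReLU is an assumption), so the estimate is non-negative.
    """
    x = list(web_features) + [event_rate, 1.0 if on_big_cluster else 0.0, freq_ghz]
    for weights, biases in layers:
        x = relu(dense(x, weights, biases))
    return x[0]
```

With 49 web features (the number of principal components reported in the feature reduction step), the input vector has 52 entries. The weights here are random, so the returned value is meaningless until the network is trained.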
Model Features: One of the key aspects in building a successful predictor is finding the right features to characterize the program space and the input. To that end, we started from 1084 raw features that can be collected at runtime from Google Chromium. Table I groups our raw web features into categories. The features were chosen based on previous work on optimizing mobile web browsing [35], as well as our intuitions.
Feature reduction. To learn a useful model, supervised learning typically requires the number of training samples to be an order of magnitude larger than the number of model inputs (i.e., features). Given that our training dataset size (i.e., 80 webpages) is less than the number of raw features, we need to find ways to reduce the dimensionality of the feature space. We do so by first applying Principal Component Analysis (PCA) [12] to the raw features, and then choosing the top 49 principal components (PCs), which account for around 95% of the variance of the original feature space. We record the PCA transformation matrix and use it to transform the raw features of a new webpage to PCs during runtime deployment. PCA is a standard statistical method for reducing the dimensionality of data. By reducing the feature dimension, we also improve the generalizability of our models, i.e., we reduce the likelihood of over-fitting on our training data.
Contributions of raw features. To obtain some insights into the usefulness of each raw feature, we apply the Varimax rotation [24] to the feature space after applying PCA. This technique quantifies the contribution of each feature to each PC in terms of variance. Figure 5 shows the top 7 dominant features based on their contributions to the PCs. Features like the webpage size and the number of DOM nodes make significant contributions to the PCA space and are hence considered to be important. This is not surprising because the larger the webpage size and the number of DOM nodes are, the longer the processing time will be. Other features, like # CSS rules and # Tag.img, also make great contributions to the variance in the PCA space. This is because they determine how the webpage should be presented and thus correlate with the rendering overhead.
By employing an automatic feature selection and tuning process, our approach has the advantage of having better portability when targeting a new hardware architecture where the cost of web processing and the importance of web features may change. Later in Section VI-D, we provide a further analysis on the feature importance via a Hinton diagram.
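The rule for choosing how many principal components to keep (enough to cover around 95% of the variance) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name is hypothetical, and it assumes the PCA eigenvalues (per-component variances) have already been computed by some PCA routine.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Return the smallest k such that the k largest eigenvalues
    explain at least `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)
```

Applied to the eigenvalues of the raw web features, a rule of this kind would yield the number of PCs to retain; the text reports 49 for its dataset.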

D. Train A Model
The feature values of the target web content, the event speed, and the processor frequency, together with the measured FPS, are passed to a supervised learning algorithm to learn an ANN for each target event. The learning algorithm then tries to update the weights of the ANN to closely map the model input to the measured FPS.
Our models are trained using back-propagation with stochastic gradient descent (SGD). For a set of training examples $X_1 \ldots X_n$, the SGD algorithm tries to find a set of network parameters $\Theta$ that minimize the output of a loss function:
\[ \Theta = \arg\min_{\Theta} \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, \Theta) \]
where $\ell(x_i, \Theta)$ is the squared logarithmic error between the model's output, $\hat{x}_i$, and the expected value, $x_i$, so that the averaged objective is the mean squared logarithmic error (MSLE):
\[ \ell(x_i, \Theta) = \left( \log(x_i + 1) - \log(\hat{x}_i + 1) \right)^2 \]
In this work, we train an ANN for each type of event. Since we target two types of events, scrolling and pinching, we build two ANNs. Note that an alternative is to have a single model for all event types. However, this strategy offers little flexibility for updating and extension, as it would require retraining the whole model when targeting a new event. Furthermore, this alternative strategy not only incurs expensive re-training overhead but is also likely to be less effective than a specialized model [25].
Training overhead: The time for model construction is dominated by training data generation. In our case, training data are automatically generated using automated scripts. It took us less than a week to collect all the training data using a single mobile device. In comparison, processing the profiling data and running the learning algorithm can be done in minutes. Since training is only performed once at the factory, it is a one-off cost, and the end-user will not experience it during runtime deployment.
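As a concrete reading of the MSLE loss named above, the following minimal sketch computes it for a batch of measured and predicted FPS values. The function name is hypothetical, and the formula is the standard MSLE definition rather than code from the prototype.

```python
import math


def msle(expected, predicted):
    """Mean squared logarithmic error between expected and predicted values.

    Uses log(1 + v) so that zero values are handled; inputs must be > -1.
    """
    if not expected or len(expected) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(math.log1p(x) - math.log1p(x_hat)) ** 2
               for x, x_hat in zip(expected, predicted)]
    return sum(squared) / len(squared)
```

Because the error is taken on a logarithmic scale, this loss penalizes relative rather than absolute differences, so a miss of 5 FPS counts for more at 20 FPS than at 60 FPS.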

E. Use The Trained Models
Our models are implemented in the Python scikit-learn machine learning package. The trained models are encapsulated in a Python library to be invoked by the web browser (via a browser extension in our prototype) for any webpage that is not seen in the training phase. For this work, we have developed a working prototype based on Chromium. Our implementation requires small changes to the web browser: in total, we have modified around 700 lines of code. Figure 6 shows how the trained models can be used during the interactive phase to determine the processor clock frequency. Feature values are extracted from the DOM tree during the page loading phase, after the downloaded web contents are parsed to construct the DOM tree. The extracted feature values are re-used throughout the interactive stage unless the DOM tree has changed significantly due to, e.g., content reloading. Specifically, if there is more than a 30% difference in the number of nodes between the previous and the current DOM trees, we update the feature values by performing feature extraction on the current DOM tree.
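The 30% rule for refreshing feature values can be sketched as a small predicate. This is an illustrative sketch only: the function name is hypothetical, and measuring the difference relative to the previous DOM tree's node count is an assumption, since the text does not state the reference point.

```python
def needs_feature_refresh(prev_nodes, curr_nodes, threshold=0.30):
    """Return True when the DOM node count changed by more than `threshold`.

    Assumption: the change is measured relative to the previous node count.
    """
    if prev_nodes <= 0:
        # No usable previous tree: refresh as soon as any nodes exist.
        return curr_nodes > 0
    return abs(curr_nodes - prev_nodes) / prev_nodes > threshold
```

For example, a page growing from 1,000 to 1,200 nodes would keep its cached feature values, while one shrinking from 1,000 to 600 nodes would trigger a new feature extraction.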

Figure 6: Using the trained predictive model to find the optimal processor configuration during the interactive phase.
The prediction and frequency configuration are triggered when one of the target user inputs is detected. To make a prediction, we first choose a model for the input event. The chosen model is then used to estimate the achieved FPS under different processor settings to find the optimal setting. The predicted setting is passed to the runtime scheduler to perform task scheduling and hardware configuration. We note that the runtime scheduler only reconfigures the hardware if the predicted setting is different from the current one.
The pseudocode in Algorithm 1 describes our binary-search-based processor setting search algorithm. The search engine uses the predictive model (line 8), pred, to quickly find a desired processor setting, $c_{opt}$, from a range of available options, C[]. The goal is to find a processor configuration that leads to an FPS as close as possible to the minimum acceptable FPS, $FPS_{min}$. It is possible that none of the predicted FPS values, $FPS_{pred}$, exactly matches the minimum acceptable FPS, $FPS_{min}$. In this case, we return the one that gives the closest FPS value (line 21), but we always choose the next higher frequency setting, C[low + 1], to increase the likelihood of meeting $FPS_{min}$.
It is also worth mentioning that the overhead of feature extraction, model prediction, and processor frequency searching and configuration is small. It is less than 10 ms, and it is included in our experimental results.
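The idea of using the predictor as a utility function inside a binary search can be sketched as below. This is not a reproduction of Algorithm 1: it is a simplified lower-bound search that assumes the settings of one CPU cluster are sorted by ascending frequency and that the predicted FPS does not decrease as the frequency rises. The function name and the toy predictor in the usage note are hypothetical.

```python
def find_setting(settings, predict_fps, fps_min):
    """Return the lowest setting whose predicted FPS meets `fps_min`.

    `settings` must be sorted by ascending frequency, and `predict_fps` is
    assumed to be non-decreasing over them. If no setting is predicted to
    meet the target, the fastest setting is returned.
    """
    if not settings:
        raise ValueError("settings must not be empty")
    low, high = 0, len(settings) - 1
    best = high  # fall back to the fastest setting
    while low <= high:
        mid = (low + high) // 2
        if predict_fps(settings[mid]) >= fps_min:
            best = mid       # meets the target; try a slower setting
            high = mid - 1
        else:
            low = mid + 1    # too slow; move to higher frequencies
    return settings[best]
```

As a usage example with a made-up predictor that returns 28 FPS per GHz, `find_setting([0.6, 0.8, 1.0, 1.2, 1.4], lambda f: 28 * f, 25)` selects the 1.0 GHz setting. Picking the lowest setting that still meets the target mirrors the text's preference for the next higher frequency when no prediction matches the target exactly.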

V. EVALUATION SETUP
We now describe our experimental setup and evaluation methodology.

A. Hardware and Software Platforms
We evaluate our approach on two distinct mobile platforms: an Odroid Xu3 and a Jetson TX2. Table II gives detailed information about the evaluation platforms. Both platforms implement the widely used ARM big.LITTLE mobile architecture but with different CPU generations and frequency setting knobs. The Odroid Xu3 implements the Exynos 5410 SoC that was released in 2013, and thus represents a low to medium end mobile spec. Note that a recent study published in 2019 [51] suggests that 75% of today's smartphones still use a CPU design that was released before 2013. Therefore, including the Odroid Xu3 in our evaluation ensures that our approach is evaluated on a platform that represents a wide range of mobile devices. In contrast to the Odroid Xu3, the Jetson TX2 integrates a more recent SoC (released in 2017), and has larger RAM and more powerful CPUs. Therefore, it represents a higher end, more recent smartphone spec.
We use the onboard energy sensors provided by both systems to measure the power consumption of the entire system. These sensors and power meters have been proven to be accurate in prior work [35].

B. Web Workloads
Throughout this work, we use the landing page of the top 100 hottest websites (as of April 2019) from www.alexa.com. We use the mobile version of a website if available. Figure 7 shows the CDF of the number of DOM nodes and web content sizes. The DOM node and webpage sizes range from small (4 DOM nodes and 40 KB) to large (over 8,000 DOM nodes and 6 MB). The wide distribution of webpages indicates that our test data cover a diverse set of webpages.

C. Baseline and Competitive Approach
Baseline. As a baseline, we use interactive as the default CPU frequency governor. This is a standard power management policy used by the Android system for interactive applications. We use the default setting of the interactive governor, described as follows. The governor samples the CPU load within a window of 80 ms. It raises the frequency if the CPU utilization is above 85%; after that, it waits for at least 20 ms before re-sampling the CPU to decide whether to lower or raise the frequency.
State-of-the-art. We compare our approach against eBrowser [52], the most closely related recent work. eBrowser reduces the energy consumption for a given user event by putting the rendering process to sleep for some time. This essentially reduces the number of events to be processed, as some of the user events within an interaction window are dropped by the browser during sleep. eBrowser uses a linear regression model to model the acceptable event rate on a per-user basis. However, it requires statistical data to be sent to a remote server to learn a model and relies on the operating system for power management. By contrast, our approach does not drop user events (as doing so could miss important inputs) and actively participates in power management by using the knowledge of the web workloads to determine the processor configuration. We port the open-source implementation of eBrowser to the latest version of Chromium used in our experiments.

D. Evaluation Methodology
Predictive model evaluation. We use five-fold cross-validation in our experiments. Specifically, we partition our 100 webpages into 5 sets where each set contains 20 webpages. We keep one set as the validation data for testing our model, and use the remaining 4 sets as training data to learn a model. We repeat this process five times (folds) to make sure that each of the 5 sets is used exactly once as the validation data. We then report the averaged accuracy achieved across the five validation sets. This is a standard evaluation methodology, providing an estimate of the generalization ability of a machine-learning model in predicting unseen data.
Evaluation metrics. In our evaluation, we use two metrics: energy reduction and QoS violation. Energy reduction is normalized to the energy measurement when using the interactive CPU governor. QoS violation is calculated as $\delta / FPS_{min}$, where $\delta$ is the amount by which the achieved FPS falls below the minimum acceptable FPS, $FPS_{min}$. If the resulting FPS is greater than the minimum acceptable FPS, we consider there to be no QoS violation (although this may lead to higher energy consumption, which is reflected in the reported energy savings).
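The five-fold partitioning described above can be sketched as follows. This is a minimal illustration with hypothetical names; the shuffling and its seed are assumptions, as the text does not say how webpages are assigned to folds.

```python
import random


def five_fold_splits(pages, folds=5, seed=0):
    """Yield (training, validation) lists so that each page is used for
    validation exactly once across the folds."""
    shuffled = list(pages)
    random.Random(seed).shuffle(shuffled)  # assumed: random fold assignment
    chunks = [shuffled[i::folds] for i in range(folds)]
    for i in range(folds):
        validation = chunks[i]
        training = [page for j, chunk in enumerate(chunks) if j != i
                    for page in chunk]
        yield training, validation
```

With 100 webpages, each iteration yields 80 training pages and 20 validation pages, matching the split described in the text.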
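The QoS violation metric can be written as a short function. This sketch assumes that δ is the shortfall of the achieved FPS below the minimum acceptable FPS, which is one reading of the definition above; the function name is hypothetical.

```python
def qos_violation(fps, fps_min):
    """Return delta / fps_min, where delta is the FPS shortfall below fps_min.

    An FPS at or above the minimum acceptable FPS counts as no violation (0.0).
    """
    if fps_min <= 0:
        raise ValueError("fps_min must be positive")
    delta = max(0.0, fps_min - fps)
    return delta / fps_min
```

For instance, achieving 20 FPS against a minimum acceptable FPS of 25 gives a violation of 0.2 (20%), while achieving 30 FPS gives 0.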
Measurements. To measure energy consumption, we developed a lightweight runtime to take readings from the onboard energy sensors at a frequency of 100 samples per second. We then matched the energy readings against the timestamps in an interactive window to calculate the energy consumption. For the FPS, we develop a web extension to record the number of request calls processed by the browser per second, by counting the number of invocations of the Chromium window.requestAnimationFrame() API.
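One way to turn timestamped sensor readings into the energy of an interaction window is sketched below. It assumes that each reading is an instantaneous power value in watts and uses trapezoidal integration; both are assumptions, since the text does not specify the sensor's output format or the integration method. The function name is hypothetical.

```python
def energy_in_window(samples, start, end):
    """Integrate power over [start, end] using the trapezoidal rule.

    `samples` is a list of (timestamp_seconds, power_watts) pairs sorted by
    time; only samples that fall inside the window are used. Returns joules.
    """
    inside = [(t, p) for t, p in samples if start <= t <= end]
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(inside, inside[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy
```

At 100 samples per second, a constant 2 W draw over a one-second window integrates to roughly 2 J.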
Performance report. Unless stated otherwise, we report the geometric mean across experimental settings. We note that the geometric mean has been shown to be better than the arithmetic mean at minimizing the impact of performance outliers, and is a preferred metric for performance reporting [14]. To collect run-time and energy consumption, we run each approach on a testing input repeatedly until the variance under a 95% confidence per input is smaller than 2%. This repeated running strategy is essential for obtaining statistically sound results. Finally, to isolate the impact of network latency, all the testing webpages are downloaded and loaded from the disk. We also disable the cache of the web browser to ensure consistent results across different runs of the same page. We consider this a reasonable setting, as our work focuses on the interactive phase where most of the content would have already been downloaded.
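The effect of reporting the geometric rather than the arithmetic mean can be illustrated with a few lines of code. The function name and the sample values below are illustrative only and are not measurements from this work.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

For the made-up values 1, 1, 1 and 100, the arithmetic mean is 25.75 while the geometric mean is about 3.16, which shows how a single outlier dominates the former far more than the latter.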

E. Quantifying QoS
To quantify the QoS during web interactions, we conducted a user study. Our user study involved 20 participants (10 females) who were students at our institution at the time this work was conducted. The participants were under 30 years old and were frequent users of web-related mobile applications. In our experiment, we automatically replay the user interactions on 100 webpages for each of the target gestures (scrolling and pinching) under eight event rates (quantified by the number of pixels per second touched by the finger). In this user study, we display the content under various on-screen update speeds (measured by the FPS). We then ask each user to score the experience using a Likert scale of 5 scores, where scores of 0, 3 and 5 mean very dissatisfied, acceptable and very satisfied, respectively.
Our user study suggests that the FPS strongly correlates to the QoS. For the same user, the acceptable QoS for a given event rate for a gesture corresponds to more or less the same FPS (with a standard deviation of less than 4.4). However, the minimum acceptable FPS varies across users and events, indicating that adaptive optimization is required. Our observations are in line with the findings reported by prior work [52]. Figure 8 plots the minimum acceptable FPS for scrolling and pinching, averaged across testing webpages and users. The min-max bar shows the variation across different users.
In our experiments, we use the results of this user study as the minimum acceptable FPS guidelines. When reporting QoS violations, we measure the performance of each scheme for each testing page under each user-specific acceptable FPS. For reproducibility, we automatically generate eight different event rates for each testing page using a script. We then report the energy savings and QoS violations across 100 webpages, 20 user-specific minimum acceptable FPS settings and eight event rates.
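To illustrate how results might be aggregated over this grid of webpages, users and event rates, the sketch below averages a per-scenario violation score. This section does not define the violation metric, so the relative FPS shortfall used here is an assumption, and the function names and dictionary layouts are hypothetical.

```python
from itertools import product
from statistics import mean


def qos_shortfall(achieved_fps, min_fps):
    """Relative FPS shortfall in [0, 1]; 0 means the user's target is met.

    Assumption: violation is measured as the shortfall relative to the
    user's minimum acceptable FPS.
    """
    return max(0.0, (min_fps - achieved_fps) / min_fps)


def mean_violation(achieved, user_min_fps):
    """Average shortfall over every (page, user, event rate) scenario.

    achieved:     dict mapping (page, event_rate) -> achieved FPS
    user_min_fps: dict mapping (user, event_rate) -> minimum acceptable FPS
    """
    pages = sorted({p for p, _ in achieved})
    rates = sorted({r for _, r in achieved})
    users = sorted({u for u, _ in user_min_fps})
    return mean(
        qos_shortfall(achieved[(p, r)], user_min_fps[(u, r)])
        for p, u, r in product(pages, users, rates)
    )
```

With 100 pages, 20 users and eight event rates, the product enumerates 16,000 scenarios per gesture.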

VI. EXPERIMENTAL RESULTS
In this section, we first report the overall results of our experiments, showing that our approach consistently outperforms the state-of-the-art across hardware architectures and evaluation metrics. We then provide details on the working mechanism of predictive modeling, including the prediction accuracy and distribution, feature importance, overhead and alternative modeling techniques.
As a highlight, our key findings are:
• Our approach delivers consistently higher energy savings with fewer QoS violations than the state-of-the-art on both of our evaluation platforms (Section VI-A).
• Our approach gives consistently good performance for predicting the resultant FPS under a given processor setting, with a low average prediction error of less than 15% (Section VI-B).
• We provide a detailed analysis of the working mechanism of our approach to justify the design choices (Sections VI-D to VI-F).

A. Overall Performance
Figure 9a compares the energy reduction of our approach against eBrowser on Odroid Xu3 and Jetson TX2, where the baseline is the default interactive CPU frequency governor. The min-max bars show the variance of energy reduction. By trading responsiveness for energy, both approaches are able to lower the energy consumption for processing user events. eBrowser gives an average energy reduction of 36.9% and 22.6% on Odroid Xu3 and Jetson TX2 respectively. By exploiting the processor frequency and heterogeneous architecture design, our approach gives a higher energy saving of 47.6% (up to 70%) and 36.4% (up to 60%) on Odroid Xu3 and Jetson TX2 respectively. These translate to an improvement of 10.7 and 13.8 percentage points in energy reduction over eBrowser on Odroid Xu3 and Jetson TX2 respectively. Figure 9b shows the distribution of energy reduction across testing webpages for each of our evaluation platforms. The min and max bars represent the lowest and the highest energy reduction found across 100 webpages for meeting the QoS metric of 20 users. Our approach consistently outperforms eBrowser, not only with a larger average energy reduction but also with a better improvement for 80% of the webpages. On only 20% of the webpages, our approach gives marginally lower energy savings (less than 10%), but, unlike eBrowser, our approach does not miss or drop any user event. Figure 9c compares the QoS violation of our approach against eBrowser on both evaluation platforms for scrolling and pinching. We observe a higher QoS violation for pinching than for scrolling. This is because scrolling often lasts longer than pinching, which offers more room for scheduling and predictions.
Figure 10: A Pareto efficiency diagram showing the QoS-energy trade-off given by different scheduling policies when interacting with the landing page of cnn.com (Section III). Energy consumption is normalized to the interactive CPU governor. Our approach gives the best trade-off.
While eBrowser can reduce the energy consumption by processing fewer user inputs, it incurs an average QoS violation of 19.5% (up to 52.3%) and 19% (up to 47.5%) on Odroid Xu3 and Jetson TX2 respectively. By contrast, our approach has a lower QoS violation of less than 12.5% and 16% on Odroid Xu3 and Jetson TX2 respectively. This suggests that our approach can reduce energy consumption while maintaining a higher level of QoS than eBrowser.

Finally, Figure 10 shows the Pareto efficiency of our approach, eBrowser, and the Interactive and Ondemand CPU governors when processing the landing page of cnn.com (see Section III). From the diagram, we see that our approach gives the best trade-off among all schemes for trading response time for energy reduction.
Figure 11 (panels: Jetson TX2, Odroid Xu3): The FPS prediction errors. The thick line shows where 50% of the data lies and the white dot shows the median value. Our approach gives a low prediction error of less than 15% on both platforms.
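The notion of Pareto efficiency used for the Figure 10 comparison can be made concrete with a short sketch. It assumes each scheduling policy is summarized by two objectives that are both minimized (normalized energy and QoS violation); the policy names and numbers in the usage note are purely illustrative and are not measurements from this paper.

```python
def dominates(a, b):
    """True if point a is no worse than b in both objectives and strictly
    better in at least one (lower is better for both)."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points):
    """Return the policies that no other policy dominates.

    points: dict mapping policy name -> (normalized_energy, qos_violation)
    """
    return {
        name: p
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }
```

For example, with the illustrative input `{"a": (1.0, 0.0), "b": (0.6, 0.1), "c": (0.7, 0.3), "d": (0.5, 0.4)}`, policy `c` is dominated by `b` and the remaining three form the front.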

B. FPS Prediction Accuracy
The violin plots in Figure 11 show the error rate for FPS value prediction for scrolling and pinching under the most frequently used processor setting of each platform. The error, $e$, is calculated as:
$$e = \frac{|FPS_{measured} - FPS_{pred}|}{FPS_{measured}} \times 100\%$$
where $FPS_{measured}$ and $FPS_{pred}$ are the measured and predicted FPS respectively.
Figure 12: How often (as percentages) a processor configuration is considered to be optimal by our model. There is no single configuration that is considered to be optimal for more than 20% of the testing scenarios, suggesting the need for an adaptive scheme.
In the diagram, the thick line shows where 50% of the data lies. The white dot is the position of the median. Our predictive models are highly accurate in predicting the FPS, with a mean error of less than 15% on both evaluation platforms. The prediction accuracy can be further improved by providing more training data to the learning algorithm, which also permits the use of a larger number of features to better capture the application behavior. Nonetheless, our approach can give good results using as few as 80 training webpages.
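As a small worked example of the error metric, the sketch below computes the per-sample and mean prediction error. It assumes the error is the absolute difference between measured and predicted FPS, expressed as a percentage of the measured FPS; the function names are hypothetical.

```python
def fps_error(measured, predicted):
    """Prediction error as a percentage of the measured FPS.

    Assumption: relative absolute error, with measured > 0.
    """
    return abs(measured - predicted) / measured * 100.0


def mean_fps_error(pairs):
    """Mean percentage error over (measured, predicted) FPS pairs."""
    errors = [fps_error(m, p) for m, p in pairs]
    return sum(errors) / len(errors)
```

Under this definition, a measured 60 FPS against a predicted 54 FPS gives an error of about 10%.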

C. Processor Setting Distribution
The heat maps in Figure 12 depict how frequently a processor setting is chosen for pinching and scrolling on each of our evaluation platforms. In the diagram, we use the notation <Rendering CPU cluster-frequency, frequency of the other CPU cluster> to denote a processor configuration. For example, a configuration of <A7-0.2, 0.6> on Odroid Xu3 means that the render process runs on the A7 core (little cluster) at 200MHz and the remaining processes run on the A15 core (big cluster) at 600MHz; similarly, a configuration of <A57-1.1, 0.8> on Jetson TX2 means that the render process runs on the A57 core (big cluster) at 1.1GHz and the remaining processes run on the Denver2 core (little cluster) at 800MHz.
As can be seen from the diagram, no single processor configuration is considered to be optimal for more than 20% of our testing scenarios, and how often a configuration is optimal varies across hardware platforms. The results reinforce our claim that a single "one-size-fits-all" model is unlikely to deliver good performance across hardware architectures. Our work avoids this drawback by developing a portable approach using machine learning.
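To make the configuration notation and the adaptive selection concrete, here is a minimal sketch of how a learned FPS predictor could be used as a utility function when searching over processor configurations. It is not the scheduler implemented in this work: `predict_fps` and `power_rank` are hypothetical callables standing in for the trained model and for some ordering of configurations by energy cost, and the fallback rule is an assumption.

```python
from typing import Callable, NamedTuple, Sequence


class Config(NamedTuple):
    """A processor configuration such as <A7-0.2, 0.6>."""
    render_cluster: str   # cluster running the render process, e.g. "A7"
    render_ghz: float     # clock frequency of the rendering cluster
    other_ghz: float      # clock frequency of the other cluster


def pick_config(
    candidates: Sequence[Config],
    predict_fps: Callable[[Config], float],
    min_fps: float,
    power_rank: Callable[[Config], float],
) -> Config:
    """Pick the cheapest configuration predicted to meet the FPS target.

    If no candidate is predicted to reach min_fps, fall back to the one
    with the highest predicted FPS (an assumption of this sketch).
    """
    feasible = [c for c in candidates if predict_fps(c) >= min_fps]
    if feasible:
        return min(feasible, key=power_rank)
    return max(candidates, key=predict_fps)
```

Because `min_fps` is an argument, the same search can be rerun for each user-specific minimum acceptable FPS and for each hardware platform's own candidate list.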

D. Feature Importance
In an attempt to visualize which features are important for predicting the FPS, we plot a Hinton diagram in Figure 13. In the diagram, the larger the box, the more significantly a particular feature contributes to the prediction accuracy on a given platform. The importance is calculated through the information gain ratio. It can be observed that HTML tags and attributes (e.g. webpage size, #DOM nodes, DOM tree depth) and style rules are useful when determining the processor configurations on both platforms, but the importance varies across hardware architectures. We also observe that some features, like HTML tag.IMG and HTML tag.Script, are useful for Odroid Xu3 but are less important for Jetson TX2. This is because Odroid Xu3 takes longer to process images and JavaScript than Jetson TX2 due to its less powerful computation capability. This diagram suggests that a generic, platform-independent optimization model [15] is unlikely to be effective across a diverse set of architectures.

E. Overhead
Figure 14 gives a breakdown of the runtime overhead of our approach (which was already included in our experimental results). The overhead of our approach includes feature extraction, prediction and searching, and task mapping and processor configuration. Feature extraction typically only needs to be performed once after the DOM tree has been constructed. Task migration and processor frequency setting account for the majority of the overhead, but are less than 0.3% of the end-to-end turnaround time. Such a small overhead can be easily amortized by improved energy efficiency. We note that the user does not experience the training overhead as training data generation and learning were performed off-line.

F. Alternative Predictive Modeling Techniques
We compare our ANN-based FPS predictor against two widely used regression techniques: linear regression (LR) and support vector regression (SVR). For a fair comparison, we train and evaluate all techniques on the same dataset. Figure 15 shows the mean prediction error given by each modeling technique. Our approach gives the most accurate prediction results with the lowest mean error across testing webpages, which is 76% and 91% lower than the LR and SVR counterparts respectively. Figure 16 shows how the FPS prediction error changes when different numbers of hidden layers are used for our ANN model. Increasing the number of layers leads to a slightly improved prediction accuracy, but it reaches a plateau at five layers. Using more than five layers would lead to a drop in accuracy, which is mainly attributed to our relatively small training dataset. In this paper, we choose an ANN of five layers as it gives the smallest prediction error and does not require a large training dataset to learn.

VII. DISCUSSIONS AND FUTURE WORK
Our work represents a new attempt at energy-efficient mobile web interactions through the use of machine learning. As with most research, there is room for further work and improvement. In this section, we discuss a few points.
Multi-tasking mobile workloads. Our work assumes the user is interacting with one webpage at a time. This is a reasonable assumption for mobile applications as unlike desktop PCs, there is typically only one foreground task which the user is dealing with; background programs on mobile devices are typically put into a suspended (sleeping) or closed status. Nonetheless, our approach can be extended to a multi-tasking computing environment that consists of multiple concurrently running workloads. This can be achieved by triggering our scheduler when a web view is presented to the user.
Impact of network latency. Our work focuses on the interactive stage after a page has been loaded and processed to construct the DOM tree. Hence, we do not consider the impact of network latency. It is possible that a user interaction might trigger a new download activity, e.g., loading a new image. This is not explicitly modeled by our approach. However, there is work on energy optimization for page loading, which considers the impact of networks [35]. Such work is orthogonal to our approach. Extending our work to consider the impact of network latency during user interactions is our future work.
Impact of GPU frequency settings. Our work does not model the impact of GPU frequency. Instead, we rely on the default GPU frequency governor to do so. However, our approach can be extended to dynamically adjust the GPU frequency. This would require us to collect empirical data to learn how the GPU frequency setting affects the FPS. Training data collection and learning can be performed automatically in the same way as we did throughout the work, and our methodology for model training and deployment can remain unchanged. We leave this as our future work.
Dynamic web workloads. Our techniques were evaluated on static web content consisting primarily of HTML files and images, which remain the dominant content for mobile web applications. To target dynamic content such as JavaScript content or video streaming, we will need new features to capture the workloads and a mechanism for constant monitoring and frequency adjustment. Given the dynamic nature of the problem, it might be interesting to investigate whether a reinforcement learning based approach [38] can better capture the behavior of the application domain.
Impact of displays. Our work does not leverage the correlation between the web content and on-screen displays to further reduce energy consumption. Nonetheless, our approach can be easily integrated with a display-based energy optimization scheme (which utilizes the color and brightness settings to save energy), as the processor setting, in general, is independent of the colors and brightness of the screen. We will investigate such a holistic approach in our future work.

VIII. RELATED WORK
Our work lies at the intersection of the following four areas but qualitatively differs from prior works within each area.

A. Energy optimization
Energy and power optimization for embedded and mobile systems is an intensely studied field. There is a wide range of activities on exploiting compiler-based code optimization [45], [27], runtime task scheduling [39], [40], or a combination of both [46] to optimize different workloads for energy efficiency. Other relevant work in web browsing optimization exploits application knowledge to batch network communications [20], [22], and parallel downloading [36], which primarily target the initial page loading phase. Our work is complementary to prior works by targeting the low-level optimization, and we do so by utilizing the hardware configuration knobs to perform energy optimization during the interacting phase.

B. Optimization for Web Browsing
Our work is closely related to research on optimizing web browsing. Prior works have shown that by carefully choosing the processor frequency, one can reduce the energy consumption for the initial page loading phase [55], [34], [35]. PES [15] and eBrowser [52] are most closely related to our work. PES employs an analytical model to choose the optimal processor frequency and does not consider the impact of web content on the response time. Developing an effective analytical model requires in-depth knowledge of the underlying hardware [46], which makes it difficult for the approach to be adopted by new hardware architectures. Our work avoids this pitfall by using machine learning to automatically learn a portable approach for how to best optimize for interactive events. Furthermore, we show that web content can have a significant impact on the processor response time, and cannot be ignored. For this reason, our approach explicitly captures and models the impact of web content. Like eBrowser, our approach also trades response time for energy efficiency. Our work advances eBrowser by exploiting the heterogeneous mobile architecture and processor frequency settings to reduce energy consumption. By exploring a larger optimization space, we achieve better energy efficiency than eBrowser. A recent work [31] proposed a phase-aware power management scheme to control the processor power state of different web browser phases like loading and touching. This approach considers a fixed response latency threshold for a given phase. Unlike [31], we offer a more flexible, personalized approach by considering the impact of web content on the user-perceived latency and the diverse expectations across different users.
Our work builds upon and directly benefits from past foundations on web workload characterization [41], [6]. Other studies exploit the interplay between the web server and browser client to improve rendering speed and user experience [5], [28], or reconstruct the web browser architecture [33], [4]. These works are thus orthogonal to our approach.

C. Task Scheduling
As heterogeneous multi-cores are becoming the norm of computing systems, how to effectively schedule application tasks on such architectures has attracted intensive attention. There is considerable work on designing better heuristics or models to schedule application tasks for performance and energy optimization [3], [26], [10], [7]. Our work targets an important domain of mobile web browsing. It builds upon these past results to develop a novel approach that exploits the characteristics of application workloads and hardware to better optimize interactive mobile web browsing. The main advantage of our machine learning based approach over a hand-crafted model or heuristic is better portability: machine learning enables one to automatically build a model for a new hardware design to adapt to the change of hardware.

D. Machine Learning for System Optimization
Machine learning has quickly emerged as a powerful design methodology for systems optimization. Prior works have demonstrated the success of applying machine learning to a wide range of systems optimization tasks, including modeling personal preference on wearables [21], human activity recognition [32], [53], code optimization [44], [42], [47], [48], [19], [49], [43], [29], [11], [30], [9], [54], task scheduling [17], [13], [18], processor resource allocation [50], and many others. In this work, we employ machine learning techniques to develop an automatic and portable approach to optimize interactive mobile web browsing for energy efficiency. We want to highlight that our work does not seek to advance the machine learning algorithm itself; instead, it exploits and applies a well-established method of statistical reasoning to tackle an important systems optimization problem, in a way that has not been attempted before.

IX. CONCLUSIONS
In this paper, we show that the computation spent on responding to user interactions can consume a significant amount of energy on mobile platforms. As a result, we propose a novel approach to perform energy optimization for interactive mobile web browsing. Our work specifically targets heterogeneous mobile multi-core architecture as it has become the de facto hardware architecture for mobile systems.
At the heart of our approach is a set of machine-learning-based regression models for predicting the resultant FPS under a given task-to-core mapping and processor frequency setting. The predictive models are first trained offline using training webpages and then used at runtime as a cost function to quickly search for the optimal processor setting for new, unseen web content. We demonstrate that by carefully trading the response time, one can significantly reduce the energy consumption during the interaction phase of mobile web browsing. We show that such energy reduction can be achieved without significantly compromising the user-perceived latency or QoS.
We apply our approach to two representative mobile interactive events. We implement our methods in the open-source Chromium web browser, and thoroughly evaluate the developed system on two distinct heterogeneous mobile platforms using the landing pages of the top-100 popular websites. Experimental results show that our approach outperforms the state-of-the-art across webpages, evaluation platforms and criteria. On average, our approach reduces the energy consumption by 10% over the state-of-the-art, and it achieves this with fewer QoS violations.