A Survey of Online Data-Driven Proactive 5G Network Optimisation Using Machine Learning

In fifth-generation (5G) mobile networks, proactive network optimisation plays an important role in meeting exponential traffic growth and more stringent service requirements, and in reducing capital and operational expenditure. Proactive network optimisation is widely acknowledged as one of the most promising ways to transform the 5G network based on big data analysis and cloud-fog-edge computing, but there are many challenges. Proactive algorithms will require accurate forecasting of highly contextualised traffic demand and quantification of the uncertainty to drive decision making with performance guarantees. Context in Cyber-Physical-Social Systems (CPSS) is often challenging to uncover, unfolds over time, and is even more difficult to quantify and integrate into decision making. The first part of the review focuses on mining and inferring CPSS context from heterogeneous data sources, such as online user-generated content. It examines the state-of-the-art methods currently employed to infer location, social behaviour, and traffic demand through a cloud-edge computing framework, and combines them to form the input to proactive algorithms. The second part of the review focuses on exploiting and integrating the demand knowledge for a range of proactive optimisation techniques, including the key aspects of load balancing, mobile edge caching, and interference management. In both parts, appropriate state-of-the-art machine learning techniques (including probabilistic uncertainty cascades in proactive optimisation), complexity-performance trade-offs, and demonstrative examples are presented to inspire readers. This survey couples the potential of online big data analytics, cloud-edge computing, statistical machine learning, and proactive network optimisation in a common cross-layer wireless framework.
The wider impact of this survey includes better cross-fertilisation of the academic fields of data analytics, mobile edge computing, AI, CPSS, and wireless communications, as well as informing the industry of the promising potential in this area.


I. INTRODUCTION
The 5G mobile network is the foundation of the future Cyber-Physical-Social Systems (CPSS), supporting three highly heterogeneous services: enhanced mobile broadband (eMBB), ultra-reliable and low latency communications (uRLLC), and massive machine type communications (mMTC). 5G and beyond-5G services need to support a 600x to 2500x capacity increase [1], sub-1 ms round-trip latency [1], and 10,000 or more low-rate devices per cell site [2]. Such significant improvements translate to a sharp rise in the operational expenditure (OPEX) and optimisation complexity (≈ 60× increase [3]), which is not desirable. As a result, the expenditure of some leading mobile operators may exceed revenues if no effective action is taken by the end of this decade [4]. Accordingly, there are widespread concerns that the ambitious quality of the 5G heterogeneous network will be considerably disadvantaged by the OPEX hike.
The associate editor coordinating the review of this manuscript and approving it for publication was Dakai Zhu.
To alleviate the threat of complexity and OPEX explosion, proactive optimisation has the potential to transform the trade-off between performance and revenue in a fundamental way. One example of successful proactive optimisation in 3GPP is video context-aware scheduling based on user-side attention information, which is a promising implementation in CPSS [5] associated with cloud-edge computing techniques. Data-driven proactive optimisation aims to closely couple network resource allocation dynamics in cyber and physical spaces with predicted consumer dynamics in social spaces, allowing the network to respond before a negative situation (e.g. signal outage or network congestion) arises and leads to a negative consumer experience. For example, the 3GPP Release 15 System Architecture for the 5G System includes building user device mobility patterns for the Access and Mobility Management Function to benefit prioritisation in radio resource management [6]. The transformation from passive/reactive to proactive optimisation requires several new network elements that have not previously been seen in 1G/2G/3G/4G or in current 5G standards.
These are:
• Data mining and analytics to infer consumer demand, context, and experience. They require the fusion of qualitative and quantitative analytics in cyber, physical and social spaces;
• Coupling between forecasting and network optimisation functions. This ideally would use the posterior probability of forecasts to inform risks in optimisation.
Current research efforts encounter bottlenecks both in collecting the data for retrieving consumer demand and behaviour context, and in forecasting at a sufficiently high resolution across slices with uncertainty quantification. Whilst an increasing amount of online information (from the Internet) is generated and accessed, accurate information is sparse, and its reliability is subject to varying platform bias. So while online-data analytics is promising for breaking the bottleneck for proactive optimisation, much evidence-based work is needed to show the envelope of its applicability. This survey represents an attempt to couple the potential of big data analytics of online information and proactive network optimisation with machine learning, and to show its potentially useful areas using demonstration examples.

A. FROM REACTIVE TO PROACTIVE NETWORK OPTIMISATION
Radio resource management (RRM) and network deployment are the primary focus areas of the survey. The many underlying optimisation functions include scheduling, mobile edge caching, backhaul optimisation, interference management, load balancing, as well as many aspects of coverage and capacity optimisation. From a historical perspective, RRM and network optimisation have moved from engineering-expertise-based configuration (e.g. human-expertise-driven manual configuration in the late 1990s) to reactive numerical optimisation (e.g. expertise-driven numerical functions with parameter inference after 2010). With increased complexity and the need for real-time analytics that are personal to consumers, they now need to evolve into big-data-driven proactive self-optimisation. We will first briefly review the historical development of optimisation before diving into the enabling technologies for proactive optimisation.

1) REACTIVE NETWORK OPTIMISATION
In the early days of the 2G network, radio engineers monitored the network statistics and tuned the network to improve key performance indicators (KPI). Engineers used their field knowledge and previous experience (e.g. drive testing) to diagnose the origin of problems [7]. However, it took a long time for engineers to manually detect and diagnose a problem, and the network might need several hours to recover after a problem occurred. For 3G optimisation, researchers and operators tried to reduce the human-machine interaction. For example, the 3GPP proposed Minimisation of Drive Testing (MDT) in [10] and designed Markov Decision Tree-based optimisation to maximise traffic offload in Wide-band Code Division Multiple Access (WCDMA) [8]. However, each optimisation algorithm typically still required frequent configuration by engineers and was not personal to individual consumers, but oriented towards a service area (e.g. city council or a shopping mall) or a service genre (e.g. maximum rate or proportional fair). Besides, even in the automated examples, the schedulers still required more than one hour to converge (76 minutes in [9]).

2) DEVELOPMENTS IN SELF / PROACTIVE OPTIMISATION
In the 4G period, the 3GPP stated the significance of implementing automatic optimisation and introduced the Self-Organising Network (SON) in Release 8 [16]. In the past decade, a significant number of SON implementations have been developed to enable cell sites to self-optimise their coverage and capacity [11], energy savings [32], and load balance [12], [15]. The commonly used optimisation methods included reinforcement learning [32], fuzzy controllers [11], and regression trees [15]. One challenge with machine learning approaches is that the integrated data is typically low-dimensional (e.g. channel estimation or QoS reporting), so both the contextual information needed to personalise services and the forecasting capability needed to enable proactive optimisation are missing. As such, even advanced SON engines typically reacted more than 10 minutes after a severe event [15], [17].
In 5G, to meet the fast-changing context in dense network deployment, the SON decision process has to converge to a satisfactory solution in a very short time. The optimisation time is influenced by algorithm time complexity, computing ability, the time to trigger the algorithm, as well as the uncertainty of the decision's benefit (regret function). This is important because many optimisation algorithms are time-sensitive, in that the time scale strongly influences QoS delivery. For example, in a study of Power Load Sharing (PLS) [17], the user dissatisfaction rate would increase by nearly 20% if the time to trigger the algorithm was delayed from one minute to one hour. This is because the network recovers only after performance has degraded. The outage period increases user dissatisfaction [17] and risks increased customer complaints and lower customer loyalty to their network [33]. As we can see in the 'Time Scale' column of Table 1, the optimisation time has moved closer to real-time and even proactive operation to save cost and improve loyalty. Achieving this requires heterogeneous data sources to sense the fast-changing network context. On the one hand, current developments in cloud-fog-edge computing have laid the foundation to enable large-scale big-data analysis on the cloud and small-scale streaming-data analysis on the edge [34]. On the other hand, proactive optimisation in 5G and beyond 5G offers low-latency and reliable communication services to transfer data in CPSS. The widely used machine learning algorithms aim to configure model-free optimisation for reducing real-time complexity [3], [24], [35], and a current trend for improving the SON decision time horizon is to trigger the algorithm in advance (i.e. proactively). Table 1 summarises the general development of network optimisation from 2G to 5G and identifies the gaps in current optimisation.
VOLUME 8, 2020
TABLE 1. Developments of cellular network optimisation from passive to proactive, data-driven, and self-optimisation.

3) GAPS IN DATA USAGE
Current research on RRM and network optimisation tends to use mobile network data (subscriber-level and cell-level) as the only data source [14], [23], [24], [36]. General user behaviour is needed but is abstracted into homogeneous mathematical models or statistical assumptions, which are limited in representing the changing complexity and diversity of the real world. The vast majority of these approaches self-simulate a single user-side measurement as a proxy for more complex behaviour aspects [3], [14], [24]. This indicates the difficulty of collecting and integrating multi-dimensional data from different sources, especially if it involves both structured and unstructured metadata. The lack of consumer context causes a cold-start problem, in which the algorithms require some time to collect information before optimisation can start. Furthermore, in current passive load balancing, the inter-cell user distribution is unknown, so all neighbouring cells become active for offloading. This is inefficient when the overloaded cell has a skewed user distribution.

B. ENABLERS FOR PROACTIVE OPTIMISATION
The 3GPP Mobile Data Applications Impacts (Release 11) [18] mentioned that network optimisation will be boosted if it can understand user behaviour and spatial-temporal traffic patterns through application data. In that case, the optimisation needs the status information of concrete entities in the social space (e.g., UEs) or virtual entities in the cyber space (e.g., software). Such user-centric meta-information is called context in [37], [38]. The context represents all the user information indicating spatial-temporal network traffic characteristics for a user-centric network, including geolocation and user behaviour. The methods used to gain the context are termed context-aware or context-awareness methods, and the computation relies on cloud-fog-edge computing. The optimisation algorithm requires a context-aware module which automatically collects and analyses data from different sources (e.g., online data and personal devices), then supplies context for adequately re-allocating communication resources [17].

1) ONLINE DATA
Many users on online social networks voluntarily generate personal data, and the data is interleaved with geographical, public, and other information [39]. Online data connects directly and tightly to users' intents, and is therefore appropriate for transforming network optimisation to be proactive and user-centric. As stated in [40], [41], CPSS aims to offer services that are not only high-quality but also proactive and personalised, which results in an irreplaceable role for online big-data analysis. Online data can be divided into different types: social networks, video/photo sharing sites, online forums, product reviews/ratings, and wikis. This survey focuses on three of them: social networks (e.g., Twitter, Facebook and Instagram), comments and reviews (e.g., Amazon customer reviews), and multimedia hosting sites (e.g., YouTube). These types of online data inherently contain substantial hidden information about users and have different merits and pitfalls.

2) SOCIAL NETWORK DATA
Online social platforms change users from content viewers to content creators and distributors. They not only hold plenty of shared information about the public and individuals [17] but also supply real-time details for forecasting spatial and temporal attributes of future events. Social network data consists of four data formats: geolocation, timeline, text, and photos/videos. Text and photos carry plentiful but noisy information about users' desires; by contrast, timeline and geolocation have a clear, structured format but limited information. The social network also records social relationships/ties, which benefits the estimation of the weights of D2D links. For example, the forecasting of social ties in [42], [43] enables D2D caching delivery by finding the most influential user [42] and by sharing with friends [43].

C. THE FRAMEWORK OF DATA-DRIVEN PROACTIVE OPTIMISATION
To make the enablers drive proactive optimisation, we propose a framework in Fig. 1 that involves data acquisition, integration, and using forecasting to drive RRM and network optimisation. The complex computation relies on cloud and edge computing. Example frameworks combining cloud-edge computing and CPSS big data analysis are put forward in [34], [40], [44]. In these works, a cloud plane is responsible for global, large-scale and long-term big data (cloud computing), while local data is processed by an edge plane (edge computing) because of its small scale and short term. Our work follows a similar idea: the remote cloud and the edge cloud work as the cloud and edge planes respectively. In Fig. 1, the remote cloud (data-driven proactive optimisation module) is placed between the Gateway and the core network. An example of this implementation is shown in [45], which placed the caching module close to the small BS and between the core network and edge network. This benefits data storage, analysis and learning, but the precondition is that the latency is acceptably low. The following parts illustrate the functions of the framework.

1) DATA
This step includes data collection, cleaning, and storage. For a more detailed process, the work [34] further divides it into organisation, representation, cleaning, reduction and integration. These aspects have been successfully implemented thanks to the support of edge computing and are well accepted in CPSSs. Data collection is the process of gathering information on variables of interest through multiple available online and offline sources, such as the Internet, vehicles, satellites, base stations, user devices, government and business databases, and sensors. The combined data may vary in sparsity and resolution across urban and industrial areas [46]. For Internet data, providers offer Application Programming Interfaces (APIs) for third parties to access the open data. More details of data collection can be found in Section III.A.1. The raw data can be noisy and redundant, so it needs a cleaning process (e.g., sorting, filtering) before it is organised in a systematic fashion and stored in the edge data centre or cloud data centre for further analytics. Tensor-based methods are effective for analysing big data by focusing on typical features [47]. Extensible-order tensors can represent unstructured and structured data. Typical applications of this method are in [48], [49]. These works illustrate how tensors work in cloud-edge computing. Big streaming data is also a challenge for real-time big data processing. High-order singular value decomposition has proved efficient in avoiding redundancy (see examples in [50]).
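The cleaning step described above (filtering and sorting before storage) can be illustrated with a minimal sketch. The `Record` type and its field names are hypothetical and are not taken from any surveyed work; the filtering rules (drop records without a valid geotag or with empty text, drop exact duplicates) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical geo-tagged online record; field names are illustrative."""
    user_id: str
    timestamp: float            # seconds since epoch
    lat: Optional[float]
    lon: Optional[float]
    text: str


def clean(records):
    """Filter unusable records, drop exact duplicates, and sort by time."""
    seen = set()
    kept = []
    for r in records:
        if r.lat is None or r.lon is None:
            continue            # no geotag: cannot contribute location context
        if not (-90.0 <= r.lat <= 90.0 and -180.0 <= r.lon <= 180.0):
            continue            # coordinates outside the valid range
        if not r.text.strip():
            continue            # empty content
        if r in seen:
            continue            # redundant copy of an earlier record
        seen.add(r)
        kept.append(r)
    return sorted(kept, key=lambda r: r.timestamp)
```

In practice the rules would depend on the data source and on the context being inferred; the sketch only shows the general shape of a filter-deduplicate-sort pass.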

2) PREDICT CONSUMER BEHAVIOURS
The network traffic fluctuates according to consumer behaviour. This step builds models to predict demand changes. The input is the online data from the Internet, and the output is the probability of different demand levels across various behaviour contexts and slices. Understanding the posterior distribution of predictions will generate a spatial-temporal consumer demand distribution that helps predict the network KPI.

3) CORRELATE TO NETWORK KPI
The correlation step models the mapping from predicted behaviours to network KPIs, such as the network traffic. Polynomial regression (in the figure) and statistical analysis (e.g., Pearson correlation) are two commonly used approaches. The model input is therefore the behaviour probability, and the output is the probability of a KPI outcome, such as the probability of high-load occurrence.
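The two approaches named above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the fit uses ordinary least squares through the normal equations (adequate only for low polynomial degrees and small data), and any input series would in practice come from the prediction step rather than from the toy values used in the note below.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns coeffs with coeffs[i] * x**i."""
    n = degree + 1
    # Normal equations (A^T A) c = A^T y for the Vandermonde matrix A.
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(ata[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (aty[i] - s) / ata[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))
```

A possible usage is to first check `pearson(activity, load)` on historical pairs of behaviour level and cell load, and, if the association is strong, fit `polyfit(activity, load, 2)` and evaluate `polyval` on the predicted behaviour level to obtain the expected KPI.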

4) PROACTIVE NETWORK OPTIMISATION
The predicted network KPIs are fed into this function. The optimisation algorithm needs to configure the parameters in advance of the upcoming condition changes to achieve its targets. An example of optimising the network in a proactive and energy-efficient way is proposed in [51]. The authors presented a framework that implements a big-data-aware intelligent platform between the core network and the Baseband Unit (BBU) pool for analysing user behaviour and network patterns to output control strategies. Note that these analytics and learning can be carried out by both the remote cloud plane and the edge plane. Cloud computing can generate a general trend context of public behaviours, such as city traffic in rush hours, which suggests a macroscopic optimisation. At the same time, edge computing processes personalised context with the help of the edge data centre, edge tensors, and edge data management and analysis. Finally, the network configuration in the physical space is decided according to both the trend and the individual context.

D. SURVEY OBJECTIVES
The core concept of this survey is to identify the user-oriented contexts that proactive network optimisation requires, then categorise the available online data and methods to acquire these contexts, and finally characterise the relationship between different proactive network optimisation and each online data. In this case, this survey probes further into these questions:
• Which user and demand contexts can help personalise proactive network and RRM optimisation?
• How to retrieve or infer the CPSS contexts from online data?
• What are the inputs and outputs of the Cloud-Fog-Edge computing?
• What is the supply-demand relationship between online data and proactive network optimisation?
• What are the open research areas in proactive network optimisation for 5G?
The main academic contributions of this survey are as follows:
• To help readers understand the categories and characteristics of the available online data.
• To review and summarise the useful CPSS contexts provided by online data and the spatial-temporal traffic pattern.
• To clarify the role of Cloud-Fog-Edge computing in proactive network optimisation.
• To analyse how each network optimisation can be made proactive with regard to the required contexts.
• To build a supply-demand business relationship between online data and network optimisation.
• To provide applications of proactive optimisation by the cloud and edge computing in CPSS.
The remainder of this survey is organised as follows: Section II discusses related survey papers about online data, contexts, and network optimisation. Then, Section III summarises the contexts, data sources, and analytic methods. In Section IV, the authors analyse each proactive network optimisation and build a link between contexts and optimisation. After that, an online data-proactive network optimisation supply-demand relationship map is put forward. Section V proposes open challenges and future research directions. Finally, Section VI concludes this survey.

II. RELATED WORK
This section discusses relevant surveys in three aspects (data-driven optimisation, consumer context, and online data), and summarises and compares them with our survey (in Table 3).
There are several data-driven network optimisation works focusing on exploiting big data [55], [57], constructing SON engine architecture [56], and developing machine learning [31], [97]. The authors of [55] exploited big-data-driven 5G network optimisation. They first presented a general framework to integrate operator big data, then introduced optimisation cases (e.g., resource allocation, interference coordination, and cache deployment). Furthermore, self-optimisation has been emerging in recent years. The survey [56] was one of the pioneering works reviewing cellular self-optimisation for tutorial purposes. Readers were provided with projects, features, standards, taxonomy, solutions and design guidelines in detail to start their research in this area. The hot topic of applying machine learning in SON was then reviewed in [31]. Moreover, the work [97] provided a comprehensive tutorial on applying artificial neural networks (ANN) in wireless communication scenarios (e.g., proactive caching). Several key types of neural network were presented, such as feed-forward, recurrent, spiking, and deep neural networks. Besides, the authors in [57] presented a review and two case studies on data-driven small cell RRM and deployment. Although the above works provide detailed reviews of data-driven optimisation, to the best of our knowledge, there is still no work that reveals the paths for taking advantage of online data in proactive network optimisation. Compared with other works (as summarised in Table 3), ours mainly contributes to using online data to make user behaviour predictable in order to drive proactive optimisation.
The methods for forecasting consumer behaviours were reviewed in [53], [54], including user geolocation, link quality, network traffic, and social information. The work [53] aimed at constructing a predictive framework to obtain intelligence for detecting user behaviour and network environmental changes. The authors presented a detailed context classification of prediction techniques but did not introduce data sources. Isolated examples of network optimisation were briefly discussed, such as mobility prediction for network offloading. The reviews of data sources from online platforms can be found in [39], [52]. Moreover, the researchers in [54] investigated existing big-data mobility works for geolocation prediction. The authors summarised the basic principles and common methods for mining users' distribution and mobility patterns. Furthermore, GPS, Global System for Mobile Communications (GSM), and WiFi data records could all be used to track popular regions, so the authors compared them in terms of data analysis and model building. The above surveys demonstrated the feasibility of applying online data for forecasting consumer behaviours (demand), whereas it is still not clear how to couple these techniques with proactive network optimisation. This survey also contributes to this aspect by integrating the demand knowledge for a range of proactive optimisation techniques.

III. ONLINE DATA FOR BEHAVIOUR PREDICTION
Users' behaviour can be fully understood and even predicted based on meta-information such as time, history, geolocation, identity, social status, schedule, and events. In this section, we show how to forecast user behaviour and infer data demand. Table 4 presents an overview of the common data-analytics tools in the literature reviewed below. We divide these methods into two main categories: data collection and data analytics.

1) DATA COLLECTION
The online data consists of three major categories: social networks (e.g., Twitter and Facebook), media hosting sites (e.g., YouTube and Instagram), and comments and reviews (e.g., e-commerce reviews and topic talk). Geo-tagged social network data is the source for extracting the geo-location context. Forecasting social behaviours requires different kinds of online data for modelling user activities and the corresponding meta-information. The traditional method for collection is using an Application Programming Interface (API).
The API-collection method is widely used (see application examples in Table 4, 'Search API' and 'Stream API' columns). It allows automatic data collection from service providers in an economical way. The search API and the stream API both fall into this category. However, it still has some challenges, such as poor efficiency and difficulty in gathering historical data. Also, service providers may limit the collection.
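The collection limits mentioned above are often handled by pacing requests and capping their number. The following is a minimal sketch and does not use any real provider API: `fetch_page` is a hypothetical callable supplied by the reader that wraps the provider's search endpoint and returns `(items, next_cursor)`, with `next_cursor` set to `None` on the last page. The parameter names and default values are likewise assumptions.

```python
import time


def collect(fetch_page, max_requests, min_interval_s=1.0, sleep=time.sleep):
    """Yield items page by page while respecting a simple request budget.

    fetch_page(cursor) -> (items, next_cursor) is a hypothetical wrapper
    around a provider's paginated API; it is not a real library call.
    """
    cursor = None
    for request_no in range(max_requests):
        if request_no > 0:
            sleep(min_interval_s)   # pause between requests (rate limit)
        items, cursor = fetch_page(cursor)
        yield from items
        if cursor is None:          # last page reached
            return
```

Passing `sleep` as a parameter keeps the generator easy to exercise without real waiting; real providers usually document their own quota and pagination rules, which should take precedence over the fixed interval assumed here.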
Building one's own dataset, such as by collecting from volunteers or purchasing from service providers, can avoid the above limitations. This method provides a nearly complete dataset and allows a flexible setting of the environmental variables. A popular method is sending friend requests to other users with a statement that researchers are collecting the data for research with privacy protection. The volunteers are free to accept or reject the friend requests. For example, in [98], 19,484 invited users agreed to join the experiment. Sometimes, incentives can improve collection performance, with contributors rewarded with payment according to their contribution [99], [100]. This method can be costly and requires ethical approvals; alternatively, one can use open datasets to reduce this cost.
Open datasets can reduce the cost of data collection. Many organisations make their datasets available for transparency or research purposes, such as Kaggle datasets [101] and European Union Open Data. One typical example is the Italia Telecom operator dataset (see data from [102]), which is used to forecast cellular traffic patterns [103]. Some public projects also open their data to other researchers. For example, the EC H2020 RISE Project DAWN4IoE has opened datasets such as cellular traffic data [104] and Twitter density [95], [105]. Researchers are required to choose their methods for collection and pre-process the raw data for further data analytics.

2) MOST COMMON ONLINE-DATA ANALYTICS TECHNIQUES
Scalable machine learning has emerged in recent years as it gives the computer system the ability to learn user behaviour from online data. We compare the most commonly used machine learning methods with regard to proactive optimisation requirements in Table 5 and list the sections where these methods have been used. The complexity refers to the number of computation operations that must be performed to achieve the desired result. The training data and time indicate the required amount of data and the training efficiency. The accuracy indicates the performance that the algorithms are expected to deliver. Finally, the evaluated levels (low/fair/high) are based on the previous literature cited after each name.
From Table 5, we can find some common characteristics of machine learning usage. For example, geo-location modelling usually requires unsupervised clustering methods, such as K-means and Density-based Spatial Clustering of Applications with Noise (DBSCAN). This has the advantage of not requiring labels in a sophisticated problem setting, and it relies on topological features as a compressed representation of high-dimensional attributes. However, the ill-defined nature of clustering means that the initial parameterisation is highly dependent on researcher bias or intuition (see details in Section III. B). In contrast, social behaviours usually have limited categories (e.g., positive and negative in user sentiment), so supervised classification is commonly selected, such as Support Vector Machine (SVM) and K Nearest Neighbour (KNN) (Section III. C). For time series forecasting, regression and Markov methods are good at predicting network traffic (Section IV.A). In network optimisation, more sophisticated methods are chosen, such as reinforcement learning (Section IV. B) and neural networks (Section IV. C). However, many of the non-Bayesian methods face the challenges of catastrophic forgetting and dealing with high-dimensional inputs. That requires more advanced learning methods, such as the deep Gaussian process, meta learning, deep reinforcement learning, and neuro-evolution deep learning. These approaches provide quantitative uncertainty estimates, high-dimensional feature capture, and/or improved adaptation to the environment.
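To make the role of the initial parameterisation in unsupervised geo-location clustering concrete, the following is a minimal pure-Python sketch of DBSCAN over (latitude, longitude) pairs. It is an illustration only: the use of the haversine great-circle distance and the parameter names `eps_km` and `min_pts` are assumptions made here, and the brute-force neighbour search is suitable only for small inputs.

```python
import math

NOISE = -1


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def dbscan(points, eps_km, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force region query; includes the point itself.
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE           # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster     # border point reached from a core point
            if labels[j] is not None:
                continue                # already assigned
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:    # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels
```

Changing `eps_km` or `min_pts` can merge hotspots or turn them into noise, which is the sensitivity to researcher choice noted above.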

3) PREDICTION ERRORS AND OPTIMISATION OVERHEAD
The machine learning approaches discussed thus far produce prediction errors because of the scarcity of training data or the mismatch of prediction functions. The probability of error can be described by the uncertainty in the predictions. In the subsequent optimisation, this uncertainty incurs system overhead. We draw lessons from prediction and optimisation systems in other areas of science and engineering.
In prediction systems, such as those in climate science and structural mechanics, big data helps to inform the likelihood of the outcomes of predictions that arise from dynamical systems. Probabilistic numerics translate input uncertainty into output uncertainty (e.g., probabilistic finite element). The uncertainty caused by data/estimation errors needs to be quantified in order to monitor and control the computational overhead. Along these lines, the paper [121] illustrated the use of probabilistic numerics to describe the uncertainty and diagnose error sources in computations. However, the Gaussian Process needs to be coupled with deep learning to face more complex tasks, which is why the deep Gaussian process has emerged.
The deep Gaussian process acts as a deep neural network but with Gaussian Processes governing the mappings between layers. It gives an empirical confidence interval to quantify the uncertainty. Higher uncertainty could mean a higher potential to cause overhead. In network optimisation, forecasts associated with a high potential for overhead could be discarded from decision making. For example, the work [111] successfully learnt natural human motion with the Deep Gaussian Process even with scarce data. Besides the uncertainty (overhead) quantification, a parallel system offers a useful structure to improve the robustness.
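The idea of discarding highly uncertain forecasts can be sketched as a simple gate on the width of the confidence interval. This sketch does not implement a deep Gaussian process; it assumes that some upstream model has already produced a predictive mean and standard deviation per cell and, as a simplifying assumption, treats each predictive distribution as Gaussian. The function names and the `max_width` threshold are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(mean, std, level=0.95):
    """Two-sided interval of a Gaussian predictive distribution (assumed)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * std, mean + z * std


def usable_forecasts(forecasts, max_width, level=0.95):
    """Keep forecasts whose interval is no wider than max_width.

    forecasts: iterable of (cell_id, predictive_mean, predictive_std).
    Returns a list of (cell_id, predictive_mean, (lower, upper)).
    """
    kept = []
    for cell_id, mean, std in forecasts:
        lower, upper = confidence_interval(mean, std, level)
        if upper - lower <= max_width:
            kept.append((cell_id, mean, (lower, upper)))
    return kept
```

Cells whose forecasts are filtered out would then be left to reactive handling rather than being reconfigured in advance.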
A parallel system has a reliability-wise parallel structure: the system functions as long as any one of its mechanisms is working. For example, if the network unexpectedly operates in a biased condition, it still has time to switch to reactive optimisation. Such a method works as the parallel system introduced in [122] to improve reliability. Based on the above methods, this survey proposes a framework to create a regret function for poor performance in proactive optimisation (see Section IV.A.6, The Quantification of Uncertainty in Proactive Optimisation).
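As a minimal sketch of the parallel structure, the reliability of a system that works whenever at least one mechanism works is one minus the product of the individual failure probabilities, under the assumption that the mechanisms fail independently. The fallback rule below, including the `tolerance` parameter and the function names, is a hypothetical illustration rather than the method of [122].

```python
from math import prod


def parallel_reliability(reliabilities):
    """Reliability of a parallel system, assuming independent failures."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def choose_mode(predicted_load, observed_load, tolerance):
    """Fall back to reactive optimisation when the forecast is too far off."""
    if abs(predicted_load - observed_load) > tolerance:
        return "reactive"
    return "proactive"
```

For instance, a proactive mechanism that is right 90% of the time combined with a reactive one that succeeds 80% of the time gives `parallel_reliability([0.9, 0.8])`, that is 1 - 0.1 × 0.2 = 0.98, under the independence assumption.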

4) PRIVACY AND DATA UTILITY
Privacy is a critical problem in data analysis. The challenge is to gain high utility from data while ensuring confidentiality, integrity, and availability [123]. In addition, network operators and data providers should not only encrypt all data but also enforce strict access control to prevent unwanted data collection, storage, and usage. The trade-off between data utility and privacy therefore needs to be addressed in three aspects.
Firstly, users are currently often unaware that their personal data is being collected, which causes anxiety about potential fraud and hurt feelings. Appropriate notices and requests for authorisation can relieve this anxiety during personal data collection. For example, in the Internet of Things (IoT), users are notified about IoT privacy properties [124]. This 'right to know' alleviates anxiety and gives users a choice. An example is the current usage of Internet cookies (see cookie consent under the EU General Data Protection Regulation [125]), but it requires that consumers trust how the data is stored and used.
Secondly, stored data must be provided with both privacy and authenticity. For this purpose, encryption schemes transform the data into a ciphertext with a symmetric-key mechanism that satisfies both requirements (see Authenticated Encryption [126]).
Moreover, data protection requires not only encrypting information but also protecting it from attacks, which is a classification problem. For example, legitimate and illegitimate users can be classified using radio channel information [127]. The work [128] used Recurrent Neural Networks (RNN) to detect various attack variations, and [129] provides a panoramic survey of security in cyber-physical systems.
Finally, during data usage, researchers need to protect sensitive latent information while preserving utility. Such a trade-off is studied in [130] by measuring data utility loss and latent-data privacy matrices. Another way to secure outsourced data analytics is to apply homomorphic encryption [131].

5) SUMMARY OF FINDINGS AND LESSONS LEARNED
In summary, the main findings and lessons learned from this sub-section include:
• Collecting online data from APIs is the most economical option, but the service providers may limit the process. In contrast, building a dedicated dataset can suit the requirements well, but it costs time or money to find or reward the contributors. A third option is to use public datasets, which are increasingly available due to transparency and reproducibility drives.
• Geo-location modelling is often an ill-defined unsupervised clustering challenge. In contrast, behaviour modelling is usually a supervised classification problem. Traffic prediction can be addressed by regression methods, including polynomial regression, Gaussian Process, and RNN. In network optimisation, the parameters and principles become dynamic and numerous, so high-complexity methods (e.g., reinforcement learning and neural networks) are becoming increasingly suitable.
• The prediction errors cause undesirable overhead in the proactive optimisation modules. It is necessary to quantify such overhead and also take into account other utility functions such as privacy and security.

B. MODELLING GEOLOCATION
Geolocation represents the real-world measured location of users, which provides the spatial traffic distribution. It contains three components: observation time, moving objects, and geolocation records [54], which can be mined from GPS, Base Station (BS), and landmark data. Online data is therefore suitable for providing the context of the popular region [94] or forecasting the nodes of the personal trajectory [132].

1) POPULAR REGION
A popular region is a specific place with the potential to generate high communication traffic, where a group of location records gathers around a centre at a particular time. Such a region can be attractive all the time, as in commercial and tourist areas. Fig. 2, from [95], presents the spatial correlation between 3G traffic and population density. This verifies the hypothesis that popular regions (with high population) have a high probability of generating high demand. Network optimisation schemes should allocate resources to these areas to serve the imbalanced traffic distribution, especially during events.
In network optimisation, the prediction of popular regions identifies the upcoming hotspots, whose popularity is correlated with social network traffic (e.g., Twitter). It benefits resource deployment [95], load balancing [94], and caching [133] by finding the places with high demand. Hotspots generally need to be predicted two hours in advance [94], [95], a horizon at which high accuracy is achievable (correlation > 0.85 in [95]). In contrast, the geographic resolution required for hotspot prediction depends on the network optimisation in question. For example, in [133], the predicted hotspot resolution decided the flying height (332 m) of the flying BSs for proactive caching. Moreover, the work [94] achieved load-resource matching with a 120-metre resolution. Popular regions (hotspots) are usually modelled by clustering, such as k-means in [133]. Clustering maximises similarity within a group and makes the objects assigned to different clusters as different as possible.

a: K-MEANS BASED SPATIAL MODEL
With this model, researchers manually choose the number of clusters (k), and the algorithm then groups GPS coordinates around k centroids, combined with map information. For example, users can be clustered into groups to guide the location and height of the flying BSs that cover them (see the research in [133]). However, k has to be determined manually, and the cluster extent cannot be controlled. To determine the number of clusters and a cluster radius, Ashbrook and Starner [61] used a variant of K-means that simulated the radius against the number of clusters and picked the k at the point where convergence starts.
In practice, the extent of popular regions can vary widely in both size and shape, which calls for automatic range optimisation and methods to reduce the computation cost. In addition, this model cannot avoid the influence of noisy data. This motivated the use of DBSCAN.
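The K-means grouping described in this subsection can be sketched as below. This is a minimal illustration of the basic algorithm (Lloyd's iterations), not the variant in [61] or the implementation in [133]. It assumes the location records have already been projected to planar coordinates in metres, and it seeds the centroids with the first k points for reproducibility. The function name and the sample points are hypothetical.

```python
import math


def kmeans(points, k, iterations=50):
    """Group 2-D points into k clusters; returns (centroids, labels)."""
    # Assumption: the first k points serve as initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
    return centroids, labels


# Synthetic records forming two groups of users.
records = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centres, assignment = kmeans(records, k=2)
print(centres, assignment)
```

Every point is assigned to some cluster, including isolated ones, which is why noisy records pull the centroids away from the true hotspot centres.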

b: DBSCAN BASED SPATIAL KERNEL
DBSCAN is a density-based algorithm that groups points with many nearby neighbours and treats points lying alone in low-density areas as outliers (noise). This model requires no prior knowledge of the number of clusters or of their radius, and it yields clusters that fit the shape of the data. Researchers choose only a neighbourhood range and the minimum number of points within this range. A cluster with at least this minimum density is then generated with an arbitrary shape. For instance, the work [135] used this method to search for popular regions, considering the diversity of users and adaptive density. We applied the DBSCAN-based method to a Twitter dataset in London. The result is shown in Fig. 3. The density parameter is set to the average density, so the popular regions (high-density clusters) have higher density and smaller size. For convenience in visualisation, a Voronoi diagram is used, so only the centres and borders are displayed. Popular regions, such as the City of London, contain denser, smaller clusters, indicating high traffic demand. Although spatial clustering algorithms are widely applied to find popular areas, clustering in the spatial dimension alone is not enough for tasks such as recognising sub-areas as an event evolves. Therefore, researchers considered using a spatial-temporal model.
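The density-based grouping described above can be sketched as follows. This is a minimal, quadratic-time illustration of the basic DBSCAN procedure, not the adaptive-density method of [135] or the pipeline behind Fig. 3. The parameter names `eps` (neighbourhood range) and `min_pts` (minimum number of points in that range) are conventional names assumed here, and the coordinates are assumed to be planar.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE marks low-density outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Synthetic records: two dense groups and one isolated record.
records = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(records, eps=1.5, min_pts=3))
```

Unlike K-means, the number of clusters is an output rather than an input, and the isolated record is reported as noise instead of distorting a cluster.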

c: SPATIAL-TEMPORAL CLUSTERING MODEL
Geolocation clustering has two main sub-categories, spatial clustering and temporal clustering, which group objects along the location and time dimensions respectively. The temporal dimension must be considered to find popular areas that change over time. For example, the temporal dimension can be added to an extension of DBSCAN so that regions are separated in both space and time. K. Tamura and T. Ichimura presented such an approach in [58] by analysing Twitter data.

d: EVENT-DETECTION BASED MODEL
A place with an attractive event becomes popular during a particular period. Detecting events means retrieving the necessary information about a planned public occasion, such as its schedule, topics, and attendance. Thanks to online information, the occurrence of events can be detected automatically [59], [62].
Statistical methods are used to estimate the regularity and detect the events. First, the city region is partitioned into sub-areas by clustering. Then, in each sub-area, the geographical regularity, that is, the usual pattern of crowd movement, is estimated. Finally, a statistical method, such as a boxplot, is used to find the outliers. For example, in [62], R. Lee and K. Sumiya developed such an event detection algorithm to identify festivals by analysing Twitter data. Fig. 4 provides an example in which we use this method to detect an event in London. It displays the density of geo-tagged Tweets before and during the event.
The popular region has changed from the regular hotspot to the event hotspot. The network optimisation schemes need to be adapted to fit the newly emerging event hotspot.
However, the extracted features of events can be incomplete when only one data source is used. To address this, H. Becker et al. proposed an approach [59] for identifying scheduled events not only from social networks (e.g., Twitter, Flickr) but also from media hosting sites (e.g., YouTube). Another challenge is that the majority of online data is not geo-tagged, which caps the achievable detection precision. K. Watanabe et al. proposed a real-time local event detection system in [63] using both geo-tagged and non-tagged Tweets.
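The regularity-then-outlier procedure of this subsection can be sketched for a single sub-area as below. This is only an illustration of the boxplot rule, not the algorithm of [62]. The function name, the conventional whisker factor of 1.5, and the daily Tweet counts are assumptions made for the example.

```python
import statistics


def boxplot_outliers(counts, whisker=1.5):
    """Return the indices whose count exceeds the upper boxplot whisker.

    counts: Tweet counts of one sub-area, one value per time slot.
    whisker: multiple of the inter-quartile range (1.5 is the usual rule).
    """
    q1, _median, q3 = statistics.quantiles(counts, n=4)
    upper = q3 + whisker * (q3 - q1)
    return [i for i, c in enumerate(counts) if c > upper]


# Synthetic daily counts: seven regular days followed by an event day.
daily_tweets = [20, 22, 19, 21, 23, 20, 22, 95]
print(boxplot_outliers(daily_tweets))
```

Only the upper whisker is tested because an event is expected to raise, not lower, the number of geo-tagged Tweets in its sub-area.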

2) PERSONAL TRAJECTORY
The personal trajectory describes an individual's moving path, that is, a time-ordered sequence of stops that a user passes or stays at [54]. With this context, one can maximise network performance through adequately allocated resources. Social network data contains GPS records from which users' staying locations, such as home and workplaces, can be found.
In the wireless network, the personal trajectory directs efficient resource distribution for continuous optimisations, such as cooperative caching. Mobile users may hand over to other BSs before the content transmission finishes. In that case, it is better to predict user movement and distribute the cached segments along the trajectory. The prediction accuracy required for caching to be beneficial is above 75% [116]. In proactive load balancing, the re-configuration of the handover margin has to be finished before the high traffic (crowd) arrives. For example, when a group of users moved to a destination cell within 5 minutes, the network recovered from overload in less than 48 minutes with trajectory prediction, compared with more than one hour without it [17].
Moreover, the forecasting of high-mobility users would reduce the handover frequency in the HetNet [136].

a: CLUSTERING BASED MODEL
A clustering-based model groups personal location records into stops or nodes and assigns a context tag, such as home or work, to each location. The stops indicate the locations where users would require services (e.g., network access) while staying longer than a minimum stop duration. The first step is to collect the locations in the testing scenario. Clustering is then applied to model the sub-regions containing stops. The research [137] used affinity propagation clustering following this procedure. Another work [138] also regarded the trajectory as a sequence of stops and moves. However, for slow-moving activities, such as a city tour, these methods miss some interesting places because the objects keep moving. Palma et al. [70] took speed into account through spatial-temporal clustering to discover unknown stops.
Another challenge is to avoid the influence of half-way data: location records collected in transit are sparsely distributed and correspond to places where users spend little time. These half-way coordinates can be filtered out of the raw data through time-based clustering, in which a time interval is set to pick out the places with a longer stay duration. J. H. Kang et al. studied this method in [68], exploiting the time characteristic of a real WiFi-based location system. WiFi-based trajectory modelling is also a kind of indoor fingerprinting localisation and tracking system, whose accuracy can be improved by the Kalman filter [139].
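The time-based filtering of half-way records can be sketched as follows. This is a simplified stay-point extraction under stated assumptions, not the algorithm of [68]: records are time-ordered tuples of seconds and planar coordinates in metres, and `max_radius` and `min_stay` are hypothetical thresholds.

```python
import math


def stay_points(records, max_radius, min_stay):
    """Extract stops from time-ordered (t_seconds, x, y) records.

    Consecutive records within max_radius of the first record of a group
    are merged; the group is kept as a stop only if it lasts min_stay.
    Returns a list of (centre_x, centre_y, arrival, departure).
    """
    stays = []
    i = 0
    while i < len(records):
        j = i + 1
        # Extend the group while the user remains near record i.
        while (j < len(records)
               and math.dist(records[i][1:], records[j][1:]) <= max_radius):
            j += 1
        group = records[i:j]
        duration = group[-1][0] - group[0][0]
        if duration >= min_stay:
            cx = sum(r[1] for r in group) / len(group)
            cy = sum(r[2] for r in group) / len(group)
            stays.append((cx, cy, group[0][0], group[-1][0]))
        # Short groups are half-way records and are dropped.
        i = j
    return stays


# Synthetic trace: a stop, three records in transit, then another stop.
trace = [(0, 0, 0), (300, 1, 0), (600, 0, 1),
         (660, 50, 50), (720, 100, 100),
         (780, 200, 200), (1080, 201, 200), (1500, 200, 201)]
print(stay_points(trace, max_radius=5, min_stay=600))
```

The minimum stay duration acts as the time interval mentioned above: transit records never accumulate enough dwell time to qualify as a stop.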

b: MARKOV MODEL
In the Markov model, the future state depends only on the current state and not on any previous states. It is therefore a natural choice for modelling the personal trajectory probabilistically. Each node is a location, and a transition between two nodes carries the probability that the user moves between those two locations. D. Ashbrook and T. Starner [61] forecast the movement of multiple users with a Markov model composed of nodes and transitions. However, GPS data is less accurate indoors than outdoors, so other data sources are still needed to compensate for this loss. H. Farooq and A. Imran [69] recently used BS handover data to predict students' mobility with a Semi-Markov model.
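A first-order Markov trajectory model of this kind can be sketched as below. It estimates transition probabilities from an ordered list of visited stops and predicts the most likely next stop. This illustrates the general technique only, not the models of [61] or [69], and the function names and the visit sequence are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transitions(visits):
    """Estimate P(next stop | current stop) from an ordered visit list."""
    counts = defaultdict(Counter)
    for current, nxt in zip(visits, visits[1:]):
        counts[current][nxt] += 1
    return {
        place: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for place, c in counts.items()
    }


def predict_next(model, current):
    """Return the most probable next stop, or None for an unseen stop."""
    options = model.get(current)
    if not options:
        return None
    return max(options, key=options.get)


# Synthetic sequence of stops for one user.
visits = ["home", "work", "cafe", "work", "home", "work", "gym", "home"]
model = fit_transitions(visits)
print(model["work"])
print(predict_next(model, "home"))
```

Because only the current stop conditions the prediction, two users who arrive at "work" along different routes receive the same forecast, which is the limitation of the first-order assumption.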

c: SENTIMENT MODEL
The stops of a trajectory have both physical and semantic meanings. As reviewed, online data can be used to detect both of them [64], [65], such as eating at home or giving presentations at the workplace. The literature reviewed so far concentrated on the physical trajectory and paid less attention to the semantic meanings of interesting places (e.g., home, work). The researchers in [66] utilised Bayesian networks to investigate GPS temporal patterns and find the semantic meanings of frequently visited places. Such a personal trajectory model tracks visited places to predict future regular visits. However, we need to know not only the regular movements but also how to judge the outliers of a daily trajectory. For example, cognitively-impaired elders or blind people may encounter problems if they make such irregular movements. Q. Lin et al. studied this problem in [67] by mining historical GPS data. Moreover, the trajectory during events can also differ from daily movements. The authors of [53] modelled the user's movements as an event-based trajectory that records new geolocations whenever new events are detected.

3) SUMMARY OF FINDINGS AND LESSONS LEARNED
In summary, the main findings and lessons learned from the modelling of geolocation include:
• The popular regions indicate the distribution of hotspots in a spatial traffic pattern. Proactive optimisation needs this context to decide the most profitable region for resource allocation and infrastructure deployment.
• The clustering methods K-means and DBSCAN are widely chosen for popular region modelling. K-means is easy to implement with low complexity but requires a manual selection of the cluster number k, leading to a degree of arbitrary parameterisation based on user bias/intuition. Some variants of K-means mitigate this problem by re-simulating a series of k values, but they still suffer from the negative influence of noisy samples. In that case, the DBSCAN-based spatial kernel is selected to reduce the noise and highlight the high-density areas. The minimum density setting determines which popular regions can be found.
• Clustering is good at modelling the stops in a trajectory, but it has difficulty tracking slow-moving objects that have few stops. For this reason, the speed of objects is taken into consideration. In addition, half-way locations with short staying times need to be ignored. It is therefore better to also analyse the data in the temporal dimension by setting a minimum stay threshold. Another problem is that trajectory prediction also requires the transition probability between stops, which the Markov model can provide. As this model assumes that the next stop is only correlated with the current one, it ignores the potential influence of previous stops.

C. MODELLING SOCIAL BEHAVIOUR
Humans have diverse behaviours and attitudes towards a specific object [140]. Understanding user behaviour can help choose the suitable communication service type (e.g., video, audio). Moreover, the attitude becomes the key to guide operators to improve service quality. Social behaviours include content popularity, preference, and relationship.

1) CONTENT POPULARITY
Frequently accessed content generates the majority of the network load. Content popularity tells which content is liked, accessed, and shared by a large number of users. Modelling content popularity has two targets: judging which content will be popular, and predicting the level of popularity [141]. Content popularity decides the content placement, transmission, and storage in proactive caching. Traditional statistical popularity models (e.g., the Zipf distribution) involve no parameter optimisation of the kind that gradient descent provides in machine learning. One promising improvement is to model popularity with machine learning on online data instead of Zipf distributions, for example by applying regression methods to the popularity of published YouTube videos to guide the proactive caching of new videos [142]. The error tolerance in caching is low (error rate < 0.5% [116]) because errors incur a high backhaul cost. Moreover, the models need to be updated continuously to keep the overhead (cost) low [116].

a: CLASSIFICATION BASED MODEL
Classification is a supervised-learning method that identifies which category a new observation belongs to; its training process is based on a data set whose categories are already known. The work [142] extracted video features as vectors and used SVM to classify videos and build the popularity model. Videos classified into the same category should have a similar popularity trend. Alternatively, contents can be tagged as popular or not popular, described by their popularity in the first hours and in the later days, and then classified to model the content popularity. The authors of [73] gave an example, using the Naive Bayes classifier to recognise stably popular and highly popular YouTube videos based on popularity patterns and content-requesting times.

b: REGRESSION BASED MODEL
The content popularity varies along with the changes of some independent variables, such as comments and visit counts.
A regression-based model captures this variation. For example, after content is published, its first-hour popularity can reflect its popularity level after a month. Such a correlation can be described by a linear regression [72]. The authors of [72] predicted the popularity of online newspaper articles using readers' comments on Digg. However, online content in text form occupies far fewer cellular resources than online videos. The popularity of online videos was studied in [76].
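The linear relation between early and later popularity can be sketched with ordinary least squares as below. This is a minimal illustration, not the model of [72] or [76]. The function name and the view counts are synthetic, and in practice the counts might be log-transformed before fitting, which is an assumption not stated in the text.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic training data: first-hour views and views after one month.
first_hour = [10, 20, 30, 40]
after_month = [105, 205, 305, 405]

slope, intercept = fit_line(first_hour, after_month)

# Forecast the one-month popularity of a new item from its first hour.
new_first_hour = 50
print(slope * new_first_hour + intercept)
```

The forecast can then rank newly published items for cache placement before their demand peaks.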

2) USER SENTIMENT
User sentiment is a user's perception of certain contents or services (positive or negative), and it may affect the user's subsequent actions (e.g., stopping usage or complaining). Understanding user sentiment helps to fit user tastes, increase QoS, and even cache the preferred content in advance. In network optimisation, user sentiment usually indicates the areas with poor network performance and the content that a consumer likes or dislikes. Users may complain about their network experience through social networks, and such behaviour offers an opportunity to classify the complaints into different service categories. In a work on detecting network coverage blackspots [33], an F1 score above 0.6 was achieved with a training size of 80. In that work, we used Twitter data as the source to detect spatial user sentiment and provided a case study of London Bridge, as shown in Fig. 5. Consumer requests also reflect their sentiment towards the contents. The works [116], [133] cached the content of interest to each user in the BSs along the user's trajectory.

a: LEXICON DICTIONARY
Sentiment analysis mines affective states and subjective information. It combines natural language processing, text analysis, and unsupervised learning. An example is given in [33], which built an NLP-based model using Twitter data to help telecom operators find QoE blackspots where network optimisation or better deployment is required. The authors filtered the words related to complaints and located the blackspots with the geo-tags in the Tweets. Such blackspots can be translated to coverage holes caused by uneven BS deployment, power-exhausted BSs, or BSs in emergency conditions under extreme weather [143]. Coverage self-healing of sensors provides a path to solving the coverage problem, such as repairing the coverage holes of mobile edge nodes [144], [145].

b: SUPERVISED LEARNING BASED MODEL
Supervised learning produces an inferred model that maps the training data to its labels. This model can then assign the existing labels to new observations according to the estimated probability. The authors of [116], [133] predicted the content-requesting probability of videos on Youku (a platform similar to YouTube) by using a neural network. The input of this model was a vector of user context, and the output was the content request distribution. Other researchers estimated user sentiment from feedback and ratings in [81] using methods such as Naive Bayes, Neural Network, and Nearest Neighbour algorithms. Similar works apply this model to different data, such as the investigations of search result preference [82], [83]. However, individual preference is not stable over time because of the influence of environments, experiences, and education. Generally, for machine intelligence at the individual level, calibration is needed on a daily basis [146].

3) USER RELATIONSHIP
Online platforms record the relationships between users, such as Facebook friendship or Twitter subscription. One can use friendship, interaction, latent, and following graphs to represent them [147]. These relationships not only indicate the role of a user in society but also provide a model that reveals information-spreading rules to benefit the dissemination of content [148]. According to [149], [150], the diffusion of information in Social-Physical Networks can be modelled by the strength of social ties. In network optimisation, sharing cached contents within circles of friends dramatically alleviates the pressure on the core network. This can be done by finding the most influential user in the social ties [42]. An example of using Facebook friends in cooperative caching is [43].

a: INTERACTION GRAPH-BASED MODEL
Online relationships are less valuable than offline ones [89] because some online friends have few interactions after the connection is made. This skewed distribution of online friendship makes it challenging to estimate close relationships. Therefore, the interaction graph takes visible interactions into account to build the user relationship model. The proactive caching work [43] considered both connections and interactions on Facebook to find users with similar requests. In [85], interactions such as wall posts and photo comments were used to improve traditional friendship graphs on Facebook.

b: LATENT GRAPH-BASED MODEL
However, not all interactions are as visible as comments. A close relationship is reflected not only in interactions but also in geolocation: an effective close relationship should show overlapping trajectories in daily life; otherwise, its value is discounted in network optimisation. The work [88] studied friendship and location, using statistics to estimate a probability of friendship that is roughly proportional to the inverse of the distance. Similarly, [90] used a Gaussian distribution and Expectation-Maximisation (EM) to fit a periodic and social mobility model that relates friendship to mobility.

c: FOLLOWING GRAPH-BASED MODEL
On Twitter, users subscribe to others and see all of their public posts. This indicates a weaker offline relationship than on Facebook but reflects more power in spreading information at the level of news media. H. Kwak et al. modelled this relationship as the following graph in [91] and found that Twitter has a broadcasting nature, which verified its role as an emerging news medium. The information-spreading ability is influenced by the number of followers and by influential user rankings. Understanding the spatial properties of the broadcasting graph can enable proactive content caching and joint D2D and P2P optimisation [151].

4) SUMMARY OF FINDINGS AND LESSONS LEARNED
In summary, the main findings and lessons learned from modelling the social behaviours include:
• In proactive caching, popularity prediction is critical for guiding the deployment, storage, and transmission of segments so as to maximise the cache hit ratio. Forecasting content popularity becomes a classification problem that matches the known popularity of published contents to the unknown popularity of similar new contents. The Naive Bayes and SVM classifiers are both popular supervised choices. It should be noted that the dimensions used in the classification must be selected to maximise the difference between the diverse kinds of content in order to avoid misclassification. Visualising the data sets along these dimensions is useful. In addition, popularity correlates with the number of visits and comments.
• Users' sentiments represent their tastes for different content or their satisfaction with their experience. In proactive optimisation, this enables sliced virtual networks with functionality specific to the service or customer. Such a sentiment analysis model consists of natural language processing, text analysis, and supervised learning, and it has already been used successfully to detect cellular blackspots. The problem is that the results rely heavily on a pre-defined text corpus, which can introduce ambiguity and understanding errors. For modelling user preference, supervised learning based models can take the user context as input and produce a distribution of the user's future content requests. However, individual sentiment is unstable, so it requires frequent calibration via user prompts [100], which is subject to inconsistent user participation.
• Modelling user relationships helps to reveal the information-spreading rules required for disseminating contents in proactive caching, especially at the D2D level. There are three kinds of representations: interaction, latent, and following graphs. The intensity of interactions reflects the relationship between users and the potential for overlapping trajectories and similar requests. It thus provides a ratio for adjusting the strength of users' relationships. However, not all interactions are visible, such as glancing at posts or social mobility. Closer locations and similar behaviours increase the probability of a friendship. A Gaussian distribution can be used to fit the social mobility, where the probability of friendship is proportional to the inverse of the geo-distance. The above models mainly focus on friendship on Facebook, but there is also the subscription relationship on Twitter. The following graph-based model can determine the most influential user, which may benefit the broadcasting of contents in proactive caching.

D. PREDICTIVE USER BEHAVIOUR

1) SEASONALITY IN USER BEHAVIOUR
The previous models represent the connection between online data and user context. Most of them aim to model the seasonality (regularities) in user behaviour, such as places visited daily and preferred content. Regular behaviour is unlikely to change, which makes prediction convenient. To track the habits in network usage, it is better to forecast the regular spatial-temporal pattern from historical data [94], [132]. In general, these results indicate that regular behaviours are predictable because they repeat over time.

2) ANOMALY DETECTION IN USER BEHAVIOUR
However, anomalies (irregularities) also exist in user behaviour, where traffic bursts occur randomly on the timeline. Such behaviour, for example during parades, is difficult to model over time. For example, in the Gaussian Process model of Fig. 6, the prediction of Tweets per hour $y_*$ is based on the observations $y$ before 18:00, 22/02/2016. The prediction follows a Gaussian distribution $y_* \mid y \sim \mathcal{N}(\bar{y}_*, \mathrm{var}(y_*))$ [152], in which $\bar{y}_*$ is the mean, indicating the best estimate of $y_*$, and $\mathrm{var}(y_*)$ is the variance, representing the uncertainty. The 95% confidence interval is $\bar{y}_* \pm 1.96\sqrt{\mathrm{var}(y_*)}$. This model helps to understand the seasonal characteristics in historical data. However, the burst on the event day does not obey the seasonal trend. The key is to use online content to detect future popular events. The event is the dominating disturbance, and it enters early in the Tweeting process. Therefore, the ARMAX (ARMA with exogenous terms) model can be used to track the irregular burst caused by the event. The events are regarded as unusual outliers in the regular traffic pattern. For example, as shown in Fig. 4, the hottest traffic spots change when the event is approaching, and the term frequency also shifts to event-related words, as shown in Table 6. Therefore, we need to find irregular conditions within the regularity. Machine learning, such as SVM, can be used to classify the Tweets along the temporal, spatial, and textual dimensions. The algorithm then detects upcoming popular events and predicts the irregular behaviour of UEs. The paper [17] proposed context-aware load balancing based on predicting an event in simulation, and [153] studied the prediction and control of unexpected real-time road traffic based on the Tweets of waiting drivers.
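For reference, the predictive mean and variance of Gaussian Process regression take the following standard closed form. The text does not specify the model, so this sketch assumes a zero-mean prior with kernel $k$ and independent Gaussian observation noise of variance $\sigma_n^2$. The symbols $K$, $\mathbf{k}_*$, $x_i$, $x_*$, and $\sigma_n^2$ are introduced here for illustration, with $K_{ij} = k(x_i, x_j)$ over the observation times and $[\mathbf{k}_*]_i = k(x_i, x_*)$.

```latex
\begin{align}
\bar{y}_* &= \mathbf{k}_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} \mathbf{y}, \\
\mathrm{var}(y_*) &= k(x_*, x_*) + \sigma_n^2
  - \mathbf{k}_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} \mathbf{k}_* .
\end{align}
```

Under these assumptions, the variance depends only on where the observations lie, not on their values. An event-day burst therefore does not widen the interval by itself, which is consistent with the need for exogenous event information.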

3) SUMMARY OF FINDINGS AND LESSONS LEARNED
In summary, the main findings and lessons learned from the predictive user behaviour include:
• The majority of the prediction models focus on the seasonality (regularity) in behaviours because the demand components have periodic variations. Machine learning approaches attempt to balance fitting the parameters well against overfitting, for example via Bayesian methods. As an application example, small cells can be turned on or off according to the periodic variations of network traffic. However, a challenge appears when the network traffic does not follow the seasonality. In that condition, random components break the rules of the seasonality and cause prediction errors.
• The non-periodic random components in social behaviours cannot be predicted from training data that only reflects the regularity. We applied the Gaussian Process and the ARMA model to examine the degree to which uncertainty exists. The conclusion is that extra information is required to highlight the times at which an anomaly occurs. For example, in event detection, the event information is often posted several weeks or months ahead. We can use this information to determine the anomaly period and apply the detection methods to track the irregularity. In that case, the proactive load balancing strategy has to balance the hotspots of the newly emerging event, which can be very different from the daily hotspots (see Fig. 4). This proactive action improves the convergence of the optimisation, but it also needs to deal with the prediction errors and the associated overhead.
• We summarise the reviewed papers of the whole of Section III in Table 7. This table classifies the papers by the models used, the data types, and the data amount.
In this section, we also introduce how each context is required in the proactive optimisation, which is the link between Section III and Section IV. For example, the popular region prediction needs to satisfy the requirements of proactive load balancing with a minimum spatial granularity of 120 m and 2 hours ahead. Such a prediction can be executed by using the clustering methods on the geo-tagged social network data. The proactive caching also needs the user trajectory prediction with accuracy > 75%. Such requirements can be achieved by analysing the GPS data using the neural networks.

IV. DATA-DRIVEN PROACTIVE NETWORK OPTIMISATION USING CELLULAR AND ONLINE INFORMATION
The context from online data analysis is the enabler for inferring predictive user behaviours, which in turn enable cellular traffic prediction and shift current reactive network optimisation towards proactive optimisation. This section analyses how to implement these predictions to achieve proactive optimisation.

A. CELLULAR TRAFFIC PREDICTION
Network traffic prediction is essential to proactive optimisation, especially proactive load balancing. Traditional algorithms construct regression models for one-step prediction based on historical records, at either the core-network level [154] or the cell level [155]. However, this line of research faces a bottleneck because its resolution is limited to the cell level. One solution is to analyse high-resolution GPS data from heterogeneous datasets and then correlate them with cellular traffic. The work [95] verified that the network traffic and the volume of online data are both positively correlated with the number of network users involved.
In that way, online social networks could not only predict flash crowds' needs but also offer operators traffic forecasts for resource allocation [87]. This part reviews these developments and presents the findings on online-data-driven traffic prediction.

1) NETWORK-LEVEL TRAFFIC PREDICTION
Network-level traffic is the amount of information exchanged through the backbone network. Such data record the past traffic as a vector in the temporal dimension, which serves as the training data for neural networks. The trained network then forecasts the traffic volume at the next time stamp. The studies in [154], [156] followed this approach, using a feedforward deep neural network or a Long Short-Term Memory (LSTM) recurrent neural network. The prediction results are satisfactory, but these models only provide one-step prediction, which means that the network needs to be re-trained for multi-step prediction. This may delay subsequent optimisation. If the traffic seasonality and random spikes can be decomposed in the training process, multi-step traffic prediction can be transformed into a combination of seasonal prediction and external information about the random components. The Non-linear Auto-Regressive model with exogenous inputs (NARX) makes predictions in this way and solves the one-step problem [157].
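The idea of combining a seasonal prediction with exogenous information can be sketched as follows. This is a deliberately simplified linear stand-in, not the NARX model of [157]: the hourly profile, the binary event flag, and the synthetic traffic are assumptions introduced for illustration.

```python
import numpy as np

def fit_seasonal_exogenous(traffic, event_flag, period=24):
    """Least-squares fit of an hourly seasonal profile plus one exogenous event gain."""
    n = traffic.size
    hours = np.arange(n) % period
    X = np.zeros((n, period + 1))
    X[np.arange(n), hours] = 1.0   # one-hot hour of day (seasonality)
    X[:, period] = event_flag      # exogenous input (known event schedule)
    coef, *_ = np.linalg.lstsq(X, traffic, rcond=None)
    return coef[:period], coef[period]

def forecast(seasonal, event_gain, start_index, future_event_flag):
    """Multi-step forecast: seasonal profile plus the scheduled exogenous effect."""
    steps = np.arange(start_index, start_index + future_event_flag.size)
    return seasonal[steps % seasonal.size] + event_gain * future_event_flag

# Synthetic backbone traffic: 14 days, with two evening events (hypothetical data).
rng = np.random.default_rng(3)
hours = np.arange(14 * 24)
event = np.zeros(hours.size)
event[5 * 24 + 18: 5 * 24 + 22] = 1.0
event[10 * 24 + 18: 10 * 24 + 22] = 1.0
traffic = (100 + 40 * np.sin(2 * np.pi * hours / 24)
           + 60 * event + rng.normal(0, 3, hours.size))

seasonal, event_gain = fit_seasonal_exogenous(traffic, event)

# Forecast 48 steps ahead in one shot, with an event announced for the first evening.
future_event = np.zeros(48)
future_event[18:22] = 1.0
prediction = forecast(seasonal, event_gain, hours.size, future_event)
```

All 48 steps are produced without re-training, because the spike is supplied as external information rather than learnt from the previous step.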

2) CELL-LEVEL TRAFFIC PREDICTION
This traffic includes both spatial (BS location) and temporal dimensions. The granularity of prediction is usually at the hour-cell level so that the seasonality is stable. In other words, based on the hourly data collected from the BSs, the cell traffic is modelled by statistical models or machine learning methods. The temporal traffic consists of trend, seasonality, and random components. The trend indicates the overall direction in which the traffic is developing. The seasonality captures the regular and predictable changes in traffic that recur every calendar day or over other periods. The cell-level traffic therefore becomes predictable once the trend and seasonality are modelled, which can be easily implemented by an ARMA model or exponential smoothing [158], [159]. However, these models only consider a fixed time range, described by a 'window', so the long-term memory of the full training data is neglected. The LSTM is designed to solve this problem: it uses feedback connections in the RNN to process the entire data sequence while selectively remembering patterns. Its forget gate discards uninformative information in the recurrent states, such as random fluctuations in the traffic pattern. One successful example is [155], but this technique also has limitations. One of them is that the knowledge learnt from one cell cannot be shared with other cells, so the training effort is repeated for each cell. A promising direction is meta learning, which reuses the results of other learning tasks. The records of other learning tasks are stored and help the current training in different cells, for learning both temporal and spatial traffic.
The spatial traffic can be described by a probability distribution whose parameters are adjusted to fit the training data with minimum error. The traditional method is to formulate it using mathematical statistics, such as the Zipf distribution [160]. This method finds the relationship between traffic and locations, and the significance of this relationship. For example, the work [161] designed an α-stable traffic model with parameter tuning at a city-wide scale. The common shortcoming of this method is that it approximates the parameters without optimisation such as gradient descent, so some important details are ignored. With the development of machine learning, this shortcoming has been overcome by neural networks (e.g., LSTM), which fine-tune their weights with methods such as back-propagation [155]. Although neural networks perform better in prediction accuracy, they lack the ability to quantify uncertainty because the mappings between layers are governed by weights rather than by a stochastic process. The Gaussian Process addresses this problem [155]. This non-parametric method trains its hyper-parameters to produce a posterior distribution of the prediction with the uncertainty quantified. Although the Gaussian Process may not surpass the performance of neural networks, it can quantify the risks via the posterior distribution. It is therefore promising to use the deep Gaussian process to couple the advantages of both deep learning and the Gaussian Process [111].
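The decomposition of temporal traffic into trend, seasonality, and random components can be illustrated with a short sketch. The linear trend, the per-hour averaging, and the synthetic hourly cell traffic are simplifying assumptions; they are not the ARMA or exponential smoothing models of [158], [159].

```python
import numpy as np

def decompose(traffic, period=24):
    """Additive decomposition: linear trend + hourly seasonality + random residual."""
    t = np.arange(traffic.size)
    slope, intercept = np.polyfit(t, traffic, 1)  # overall direction of the traffic
    trend = slope * t + intercept
    detrended = traffic - trend
    # Average the detrended traffic for each hour of the day.
    profile = np.array([detrended[h::period].mean() for h in range(period)])
    seasonal = profile[t % period]
    residual = traffic - trend - seasonal         # what remains is treated as random
    return trend, seasonal, residual

# Synthetic hourly cell traffic over ten days (hypothetical data).
rng = np.random.default_rng(4)
t = np.arange(10 * 24)
traffic = 200 + 0.5 * t + 50 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 5, t.size)

trend, seasonal, residual = decompose(traffic)
```

A large value in `residual` at a given hour marks traffic that neither the trend nor the seasonality explains, which is where external context becomes necessary.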

3) TRAFFIC PREDICTION USING ONLINE DATA
The problem with the previous traffic prediction methods is the lack of predictive user behaviour, so it is difficult to explain and predict random traffic spikes caused by changed services. Heterogeneous data therefore become more important because they contain not only cellular information but also individual-level geolocation and behaviour (see models in Section III). In general, the traffic is decomposed into trend, seasonality, and random components. The random components may be holiday traffic or traffic spikes during popular events. If such traffic is estimated, the three components are combined for the final prediction [162], [163]. This gap is filled by estimating cellular traffic from geo-tagged social network data using machine learning, such as linear regression [95], [96]. Before training, the dimensions of interest must be selected, such as cellular traffic and the number of Tweets. The data are then fitted by regression models that minimise the residuals, for example using least squares [96]. The work [96] found a strong correlation between Tweets and mobile traffic in a stadium even though Twitter traffic accounts for only a small proportion of the total traffic. However, the correlation is not fixed when the temporal or spatial scale changes. The strength of the correlation increases as the spatial resolution decreases, so there is a trade-off between better spatial resolution and higher correlation. The current approach is to re-calculate the correlation at different resolutions and pick an acceptable one [94].
Based on the positive correlation, network traffic becomes predictable using heterogeneous data. The regression methods provide the optimised parameters for the fitted correlation, and the model can be formulated according to these parameters. For example, the work [95] predicted spatial-temporal traffic based on the estimated correlation (Fig. 7) between 3G network load and Tweets using log-linear regression.
The estimated Down-Link (DL) traffic load $\hat{r}_{\mathrm{DL}}$ in cluster $c$ in time interval $t$ is obtained from this regression, where $a_{\mathrm{DL}} = 0.88$ kb/Tweet, $b_{\mathrm{DL}} = 2.37$ kbps, and $\tau$ is the ratio between the time interval and one second (e.g., in this work $\tau = 3600$ s/hour). This formulation couples Tweets and cellular traffic but considers only general conditions. Sometimes, anomalous traffic emerges without an obvious signal such as a holiday. Therefore, anomaly detection in traffic prediction emerges as another research direction.
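A least-squares fit in log-log space, as used for coupling Tweet counts to traffic load, can be sketched as follows. The synthetic data and the coefficients `true_slope` and `true_intercept` are hypothetical; they are not the parameters reported in [95], and the functional form below is only one common way of writing a log-linear regression.

```python
import numpy as np

# Synthetic paired observations per cluster and hour (hypothetical data).
rng = np.random.default_rng(1)
tweets_per_hour = rng.integers(5, 500, size=200).astype(float)
true_slope, true_intercept = 0.9, 1.5
log_traffic = (true_intercept + true_slope * np.log10(tweets_per_hour)
               + rng.normal(0, 0.05, tweets_per_hour.size))

# Ordinary least squares on the log-transformed variables.
log_tweets = np.log10(tweets_per_hour)
slope, intercept = np.polyfit(log_tweets, log_traffic, 1)

# Strength of the correlation at this spatial and temporal resolution.
correlation = np.corrcoef(log_tweets, log_traffic)[0, 1]

def predict_traffic(tweets):
    """Map a Tweet count to an estimated traffic load using the fitted line."""
    return 10 ** (intercept + slope * np.log10(tweets))
```

Repeating the fit after aggregating the data to coarser clusters would show how `correlation` changes with the spatial resolution, which is the trade-off noted above.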

4) ANOMALY DETECTION
Anomalous traffic does not follow the model of the trend or the seasonality because user behaviour differs under irregular conditions. The general method is to model the regular traffic first and then detect the outliers based on the modelled regularities. Finally, the outliers are treated as a separate group, with another model fitted to their traffic. We therefore need two machine learning models, one for regular and one for anomalous conditions, usually a combination of clustering and regression. The clustering methods automatically distinguish the regular and anomalous conditions in the selected dimension. For example, using K-means to group the BSs with similar traffic reveals the BSs with extremely high or low load [164], [165]. The extreme values are useful for proactive optimisation, such as load balancing for extremely high-load cells and BS switch-off for extremely low-load cells. The anomalous traffic is then usually modelled by regression methods, such as the Gaussian Process, neural networks, or the NARX model [157]. These methods perform well but face a challenge in optimising the weights: gradient descent cannot be guaranteed to avoid local optima. This is because the starting point of the descent is randomly placed in the search space, and the iterations simply walk to a nearby local optimum and converge there. One solution is to apply evolutionary algorithms to parameter optimisation. The scheme is inspired by biological evolution, in which each generation goes through reproduction, mutation, recombination, and selection. Mutation provides chances to escape local optima. It is therefore promising to use neuro-evolution deep learning to better tune network weights.
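The clustering step, grouping BSs by load to expose extremely high-load and low-load cells, can be sketched with a one-dimensional K-means. The number of clusters, the quantile-based initialisation, and the synthetic load values are assumptions made for illustration and are not taken from [164], [165].

```python
import numpy as np

def kmeans_1d(values, k=3, iterations=100):
    """Lloyd's algorithm on scalar loads, initialised at evenly spaced quantiles."""
    centres = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iterations):
        labels = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
        updated = np.array([values[labels == j].mean() if np.any(labels == j)
                            else centres[j] for j in range(k)])
        if np.allclose(updated, centres):
            break
        centres = updated
    labels = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
    return labels, centres

# Synthetic normalised BS loads: 40 regular, 5 overloaded, 5 nearly idle (hypothetical).
rng = np.random.default_rng(5)
loads = np.concatenate([
    rng.normal(0.50, 0.05, 40),
    rng.normal(0.95, 0.02, 5),
    np.clip(rng.normal(0.05, 0.02, 5), 0.0, None),
])

labels, centres = kmeans_1d(loads, k=3)
overloaded = np.flatnonzero(labels == np.argmax(centres))  # load-balancing candidates
idle = np.flatnonzero(labels == np.argmin(centres))        # switch-off candidates
```

The traffic of the BSs in `overloaded` and `idle` would then be passed to a separate regression model, following the two-model scheme described above.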

5) THE QUANTIFICATION OF UNCERTAINTY IN PROACTIVE OPTIMISATION
To estimate the overhead caused by data scarcity and malicious attacks, the certainty of the prediction needs to be quantified so that the regret of poor performance can be evaluated. We describe this process in Fig. 8. The detailed steps are as follows:
(i) Gaussian Process and Bayesian learning can generate the posterior distribution based on the observations [111], [166]. This distribution describes the certainty (confidence region) of the demand prediction.
(ii) In the proactive optimisation, the input is the set of demand (e.g., traffic) samples generated from the posterior distribution, and the output is the corresponding network quality, which can be estimated statistically by histograms or Kernel Density Estimation after several simulation runs. In general, the simulator outputs the QoS metric according to the traffic posterior distribution and provides a cascaded QoS distribution.
(iii) This cascaded distribution quantifies the confidence region of the proactive optimisation for comparison with the reactive optimisation. In other words, its confidence region describes how the network may operate in the future and with what probability.
If the QoS of proactive optimisation is estimated to be worse than that of reactive optimisation, the regret of poor performance occurs. In contrast, the region of better performance becomes the benefit or profit. Based on this framework, the final decision is made according to the difference between profit and cost. In general, the uncertainty offers a probabilistic numerical estimate of the profit of proactive optimisation decisions when facing data scarcity or malicious attacks.
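Steps (i) to (iii) can be sketched as a small Monte Carlo cascade. The Gaussian demand posterior, the two toy QoS functions, and every numeric value below are hypothetical stand-ins for a real predictor and network simulator; only the structure (sample demand, simulate QoS, compare profit with regret) follows the text.

```python
import numpy as np

def qos_proactive(demand, provisioned):
    """Toy QoS: fraction of demand served when capacity is provisioned in advance."""
    return np.minimum(1.0, provisioned / demand)

def qos_reactive(demand, base_capacity, reaction_gain=0.5):
    """Toy QoS: a reactive scheme only partly catches up with excess demand."""
    capacity = base_capacity + reaction_gain * np.maximum(0.0, demand - base_capacity)
    return np.minimum(1.0, capacity / demand)

# (i) Posterior of the demand prediction (assumed Gaussian, e.g. from a GP).
posterior_mean, posterior_std = 120.0, 15.0
rng = np.random.default_rng(2)
demand_samples = np.maximum(1.0, rng.normal(posterior_mean, posterior_std, 10_000))

# (ii) Cascade the demand samples through the simulator to get a QoS distribution.
q_pro = qos_proactive(demand_samples, provisioned=posterior_mean)
q_rea = qos_reactive(demand_samples, base_capacity=100.0)
density, edges = np.histogram(q_pro, bins=20, range=(0.0, 1.0), density=True)

# (iii) Compare with reactive optimisation: profit where proactive wins, regret otherwise.
gain = q_pro - q_rea
profit = np.mean(np.maximum(gain, 0.0))
regret = np.mean(np.maximum(-gain, 0.0))
prob_regret = np.mean(gain < 0.0)
decide_proactive = profit > regret
```

In this toy setting the regret arises only when the demand is far above the posterior mean, so `prob_regret` is the tail probability that the proactive provisioning was too small.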

6) SUMMARY OF FINDINGS AND LESSONS LEARNED
In summary, the main findings and lessons learned from traffic prediction are highlighted in the following items. We also provide a summary in Table 8 to compare the methods and suggest solutions to current pitfalls.
• Network-level traffic prediction can be addressed by regression methods. These methods provide high-accuracy results in one-step forecasts but accumulate errors in multi-step prediction. The problem is that network optimisation requires multi-step prediction to reduce redundant re-training effort. The solution is to use the NARX model, which treats random spikes as exogenous inputs and combines the multi-step seasonality prediction with the exogenous information. In that way, it alleviates the influence of errors caused by random components.
• In the temporal dimension, the traffic is decomposed into the trend, the seasonality, and the random components. The first two are predictable using ARMA or exponential smoothing, but only part of the training data is used to produce the prediction. This requirement for flexible long-term memory makes the LSTM a feasible choice. Its forget gate is trained to remember the meaningful items. Moreover, one future research direction is to use meta learning to avoid catastrophic forgetting of knowledge between BSs.
• In spatial traffic prediction, traditional methods model the traffic by mathematical statistics (e.g., Zipf distribution and α-stable). Compared with machine learning (e.g., neural networks), the traditional methods have no parameter optimisation (e.g., gradient descent). Instead, the parameters are determined using general approaches, such as maximum likelihood estimation, which approximates the parameters without finding a path to the minimum gradient. The problem with current machine learning is that the predictions are generated without quantifying the uncertainty. This makes it difficult for future decision making to quantify the cost and profit when considering potential errors. This problem is expected to be solved by the Gaussian Process or the deep Gaussian process, which produce predictions as posterior distributions (uncertainty).
• The random components in traffic are difficult to explain and predict due to the lack of meta-information on user behaviour. This difficulty drives data analytics from cellular-only data to heterogeneous data (e.g., Twitter data). Current methods concentrate on using linear regression to formulate the positive correlation between cellular traffic and social network data. Some statistical methods can quantify the strength of the correlation but cannot formulate the model. The current gap in predicting random traffic spikes is that many anomalous conditions are unknown in advance (e.g., non-periodic events such as protests). Anomaly detection is needed to distinguish regular and anomalous conditions automatically. The general method combines a clustering method (for distinguishing) with a regression model (for traffic modelling), such as a combination of K-means and neural networks. However, weight selection in the neural network may become trapped in a local optimum under gradient descent. In that case, neuro-evolution deep learning can escape the local optimum, which makes it a promising method to model traffic with fine-tuned weights in the long term.

B. PROACTIVE LOAD BALANCING
Load balancing is required to cope with the imbalanced distribution of users' demand [31]. Specifically, the goal is to hand over UEs at the edge of overlapping or adjacent cells from a congested cell to an idle cell by optimising handover offset values [169], thresholds [12], and the number of handovers [170]. Subsequently, SON algorithms were developed that take advantage of Fuzzy logic controllers to auto-tune handover margins [13], [171]. However, these methods face some challenges:
• The controller-based methods are reactive to random traffic spikes, which delays convergence and limits their ability to adapt to fast-changing load [17].
• The controller-based methods are prone to oscillations, which may cause target cells to become overloaded again [171].
Several proactive load-balancing works use machine learning to deal with the above challenges by deriving, predicting, and adjusting key parameters (e.g., call blocking ratio (CBR)).

1) BALANCE LOAD WITH PREDICTIVE TRAFFIC
The first requirement for being proactive is forecasting the cell load (similar to the number of UEs and the BS traffic). Machine learning methods are selected to forecast the cell load and decide the offsets according to their correlation. The offset is adjusted automatically based on the cell load, with the objective of minimising the packet loss ratio. The knowledge learnt from the training process highlights profitable choices of parameter adjustment. For example, [19] used polynomial regression to formulate the relationship and adjust the small cell offset value. However, the relationship can be complicated and can involve many parameters. Reinforcement learning performs better than polynomial regression in complex scenarios. Q-learning is a reinforcement learning method that solves the problem by learning a state-action table from training data. The state is indicated by the cell load, and the action represents the optimisation decision (e.g., offset values or antenna down-tilt).
With this model, the Reference Signal Received Power (RSRP) margin can be continuously adjusted according to the state-action table to maximise the user QoS. The works [21], [172], [173] followed this approach to self-tune the cell margin or the antenna down-tilt.
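A tabular Q-learning loop of the kind described above can be sketched on a toy environment. The five discretised load levels, the three offset actions, the reward, and the random load drift are all hypothetical; the sketch does not reproduce the simulators or parameters of [21], [172], [173].

```python
import numpy as np

N_STATES = 5           # discretised cell load: 0 = idle ... 4 = overloaded
ACTIONS = (-1, 0, 1)   # decrease, keep, or increase the handover offset
TARGET = 2             # balanced load level

def step(state, action_index, rng):
    """Toy dynamics: raising the offset hands UEs over and lowers the load level."""
    u = rng.random()
    drift = -1 if u < 0.1 else (1 if u > 0.9 else 0)  # random traffic fluctuation
    next_state = int(np.clip(state - ACTIONS[action_index] + drift, 0, N_STATES - 1))
    reward = -abs(next_state - TARGET)                # penalise load imbalance
    return next_state, reward

def train(episodes=2000, steps=20, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Learn the state-action table with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(episodes):
        state = int(rng.integers(N_STATES))
        for _ in range(steps):
            if rng.random() < epsilon:
                a = int(rng.integers(len(ACTIONS)))
            else:
                a = int(np.argmax(q[state]))
            nxt, r = step(state, a, rng)
            # Standard Q-learning update.
            q[state, a] += alpha * (r + gamma * q[nxt].max() - q[state, a])
            state = nxt
    return q

q_table = train()
# Greedy offset adjustment for each load level.
policy = [ACTIONS[int(np.argmax(q_table[s]))] for s in range(N_STATES)]
```

Because the table is learnt under these regular toy dynamics only, it would not anticipate an event-driven load pattern, which is the limitation that motivates adding user-context information.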

2) BALANCE LOAD WITH CALL BLOCK RATIO OR UE LOCATIONS
The CBR is an indicator of cell load because it increases as the cell load rises. Another indicator is the average distance between the BS and the neighbouring UEs. If the UEs are far from the BS, the cell load is estimated to be low. Similar state-action tables can be trained from these indicators. The work [20] used such a table to update the CBR-offset fuzzy rules. The work [174] implemented a distance-based target BS-selection algorithm in the UE to hand over to the BS with maximum QoS. The problem with these methods is that the state-action table is generated under regular traffic conditions, whereas networks also experience event-like random conditions with very different traffic patterns. One solution is to consider extra user-context information, such as the event traffic pattern.

3) MACRO CELL OFF-LOADING
Another way to balance load is to offload traffic from macro cells to small cells, WiFi, and D2D connections to maximise energy efficiency. The offloading process has been coupled with proactive content caching, with the finding that most of the energy savings came from prediction and delay tolerance. The first step is to forecast context information, such as network traffic, user mobility, and preference. Then, the offloading objective function is formulated to maximise energy efficiency or QoS. For example, the work [175] studied this in a 3G scenario, collecting data from YouTube and Apple iTunes to forecast consumer mobility and preference in order to form the hot zones. Moreover, the papers [176], [177] focused on offloading to small cells and D2D connections for high energy efficiency.
The Q-learning method builds a table of states (indicated by the cell load) against actions (the offloading decisions) in the training process and suggests the best action for the next step [178]. However, it does not consider social relationships and content popularity, which can forecast repetitive content downloads. Rerouting this mobile traffic to other access networks is a good way to offload the macro cell and the core network. For example, [179] denoted the offload ratio of content $j$ of user $i$ as $w_{ij}$, the size of content $j$ as $L_j$, the number of requests for content $j$ from user $i$ as $m_{ij}$, and the backhaul capacity as $C_{\mathrm{BN}}$. The backhaul utilisation $U_{\mathrm{BN}}$ is determined by these quantities. It can be forecast if the user preference $m_{ij}$ is predicted (such prediction models can be found in Section III), and the target is to minimise the backhaul utilisation in future actions.
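To make the role of the symbols concrete, the sketch below computes a backhaul utilisation from $w_{ij}$, $L_j$, $m_{ij}$, and $C_{\mathrm{BN}}$. The expression used, $U_{\mathrm{BN}} = \sum_i \sum_j (1 - w_{ij})\, m_{ij}\, L_j / C_{\mathrm{BN}}$, is an assumption chosen to be consistent with the definitions above; it is not confirmed to be the exact formulation of [179], and the example values are hypothetical.

```python
import numpy as np

def backhaul_utilisation(offload_ratio, requests, content_size, capacity):
    """Assumed form: traffic that is not offloaded, divided by the backhaul capacity.

    offload_ratio[i, j] -- w_ij, share of content j of user i that is offloaded
    requests[i, j]      -- m_ij, number of requests for content j from user i
    content_size[j]     -- L_j, size of content j
    capacity            -- C_BN, backhaul capacity over the same period
    """
    residual = (1.0 - offload_ratio) * requests * content_size[None, :]
    return residual.sum() / capacity

# Two users and three contents (hypothetical values, sizes in MB).
w = np.array([[0.0, 0.5, 1.0],
              [0.0, 0.0, 0.5]])
m = np.array([[2, 4, 1],
              [1, 0, 2]], dtype=float)
L = np.array([10.0, 20.0, 5.0])
C_BN = 200.0

u_with_offload = backhaul_utilisation(w, m, L, C_BN)
u_no_offload = backhaul_utilisation(np.zeros_like(w), m, L, C_BN)
```

With a predicted `m`, the same function can be evaluated for candidate offloading decisions `w` to pick the one with the lowest forecast utilisation.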

4) SMALL CELL SWITCH-OFF
The switch-off algorithm can disable idle devices for a sleeping interval [32], [180], which saves energy in low-load scenarios. Resource allocation for this purpose is similar to market models with bidding for maximum profit. The objectives are mapped to the trade-off among conflicting financial interests, such as cost (e.g., power) and profit (e.g., capacity). In detail, the network needs to 'bid' for the resources of third parties to carry its low loads. Therefore, we need to forecast the offloading traffic in order to 'buy' the requested capacity. The work [181] used this model, considering time-varying traffic, to switch off HetNet nodes and offload the remaining UEs to third parties' cells.
Traffic prediction decides which cells are involved and when to wake them up. In temporal traffic prediction, the low-load intervals provide the basis for a system to determine which cells need to be turned off and for how long. The work [182] took a similar approach with the target of maximising energy efficiency. Traffic prediction is not the only requirement: besides load prediction, the sleeping intervals should be set in advance according to overlapped areas, battery condition, and previous settings. The work [183] studied this problem considering extra context information.
The appropriate number of nodes to be switched off depends on the traffic pattern. The traffic drops to a low level at night, so its forecast suggests the maximum number of nodes to be switched off [184]. In a cell, the average distance between the BS and all the UEs can indicate the power consumption because the BS has to increase its power to maintain stable links to more distant users. Therefore, researchers choose to switch off the cells with the highest average distance while avoiding QoS degradation [185].
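The selection rule in the last paragraph can be sketched as a simple ranking. The load threshold, the cap on the number of switched-off cells, and the example values are hypothetical; the sketch does not model the QoS check or the schemes of [184], [185].

```python
import numpy as np

def select_cells_to_switch_off(forecast_load, mean_ue_distance,
                               load_threshold=0.2, max_off=2):
    """Among cells forecast to be lightly loaded, prefer those with distant UEs."""
    candidates = np.flatnonzero(forecast_load < load_threshold)
    # Sort the candidates by average BS-UE distance, largest first.
    ranked = candidates[np.argsort(-mean_ue_distance[candidates])]
    return ranked[:max_off].tolist()

# Forecast night-time load (normalised) and average BS-UE distance in metres.
forecast_load = np.array([0.10, 0.60, 0.15, 0.05, 0.30])
mean_ue_distance = np.array([120.0, 80.0, 200.0, 150.0, 300.0])

cells_off = select_cells_to_switch_off(forecast_load, mean_ue_distance)
```

Cell 4 has the most distant UEs but is excluded because its forecast load exceeds the threshold, which is a crude stand-in for avoiding QoS degradation.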

5) DATA-DRIVEN USER ASSOCIATION
The users in the overlapping area become the load to be balanced. Because the cell's capacity is limited, these users still occupy scarce resources in the crowded cell even though they could be transferred to the adjacent idle cell without much increase in pathloss. Such traditional user-cell association schemes lose the opportunity to connect users on the far side. To solve this problem, better user-cell association improves the total utility of resources by making room for more potential candidates.

FIGURE 9. A comparison of convergence speeds adapted from our work [57]. The data-driven user association scheme has the best performance in balancing load in the cell overlapping areas.

FIGURE 10. A comparison of total network utility with different loads adapted from our work [57]. The proposed method achieved the highest utility, especially under heavy load.
To maximise the total utility of users, the users in overlapping areas should be carefully associated so that the overall maximum is achieved, even if the pathloss is slightly worse. In detail, each pair of overlapping cells cooperatively re-associates users based on analysing the QoE data. Our work [57] proposed an iterative optimisation method to implement this idea. The analysed QoE information was shared through the X2 interface or cloud message exchange, and the involved small cells were then triggered to optimise the current association profile. Fig. 9 presents the convergence speed and compares it with three other methods: two reinforcement learning methods and a social best response method. It shows fast convergence and a high sum utility (above 30). These advantages make the data-driven methods compatible with proactive optimisation. Under different loads, as shown in Fig. 10, the data-driven method also results in the most profitable association.

6) PROACTIVE LOAD BALANCING BASED ON ONLINE INFORMATION
The above proactive algorithms can relieve the pressure of oscillations and re-overload. However, the learning process is driven only by cellular data, which has some problems:
• There is unnecessary cell expansion for the cells far from a hotspot.
• It lacks user behaviour information, such as events.
These problems limit the performance ceiling of existing proactive algorithms. To address them, we need proactive load balancing based on the analysis of online information.
The context-aware module in this algorithm should provide the predicted user distribution of potentially high-loaded areas [8]. The research [17] proposed a heterogeneous data-driven, distribution-aware proactive load balancing scheme aiming to solve the unnecessary cell-expansion problem. In the design, the context-aware module collected and consolidated data from Twitter and GPS, and then output the user distribution, which also served as the forecast of the load difference between the loaded cell $i$ and its nearest idle cell $j$, $LR_{\mathrm{diff}}(i, j)$ (the input of the integration module). In the integration module, the forecast of $LR_{\mathrm{diff}}(i, j)$ prompted the loaded cell $i$ to share its load more quickly with its nearest idle cell $j$. For example, the loaded cell $i$ reduced its transmission power by $k \times P_{\mathrm{TX}}(i)$. The power reduction $P_{\mathrm{TX}}(i)$ was adjusted according to the controller's output. In addition, the integration module could adjust the strength $k$ to enhance or weaken the effect of the load balancing algorithm according to the predicted environmental changes.
Compared with the Fuzzy controller-based optimisation, the context-aware optimisation reduced the user dissatisfaction rate by at least 1.3% more and the convergence time by nearly 50%. This research indicates a path to couple proactive load balancing with online information.

7) SUMMARY OF FINDINGS AND LESSONS LEARNED
In summary, the main findings and lessons learned from proactive load balancing include:
• The cell offset needs to be determined according to the learnt correlation between the offset and the cell traffic, subject to achieving the minimum packet loss. Polynomial regression and Q-learning are the commonly chosen tools, where the cell load is the input and the adjustment of the offset is the output. One problem with this approach is that the regular conditions and the random components are mixed in the modelling, which causes slow convergence in an anomalous condition (because the anomaly is not fully learnt). In that case, anomaly detection needs to be implemented to learn the anomalous traffic and alert the system to fit it.
• The behaviour information will enable the network to fit customers' demands and forecast upcoming changes in traffic. The localisation systems provide a high-resolution intra-cell traffic pattern, which will suggest enhancing or weakening the effect of the controller-based load balancing according to the forecast load difference between cells. Currently, the context-aware module is not yet well designed, due to uncertain predictions and the difficulty of quantifying the cost of taking risks when optimising the network according to the predictions. One method to avoid a potentially heavy cost is to use a parallel model that operates proactive optimisation while reserving the option to shift back to traditional optimisation to minimise the risks. This method is reliable but does not quantify risk in a probabilistic framework. Another way is to quantify the cost based on the posterior distribution of predictions. Addressing the concerns about overhead with the Gaussian Process or the deep Gaussian process is a promising direction.
• In Table 9, we summarise the topics, learning methods, optimised parameters, and findings of the proactive optimisation studies, together with how online information can fill the gaps. For example, current proactive methods need to activate all neighbouring BSs to participate in load balancing. This problem can be solved with high-resolution hotspot detection, enabling only the nearest cell for this work. Such hotspot modelling methods can be found in Section III. B. 1. Popular Region and Section IV. A. 5. Social Network-based Traffic Prediction.

C. PROACTIVE CACHING
Mobile edge caching stores relevant data at nearby BSs based on predicted user demand. This technique improves network performance in terms of throughput and latency. To maximise the hit ratio of the cached content, the issues of content selection, content placement, content delivery, and storage utilisation should be addressed in a proactive way [31].

1) CONTENT PLACEMENT
To maximise the profit, systems prefer to cache the most popular content at the routers (BSs) closest to the user side and to distribute content packets along the path. This is popularity-based content dissemination, which aims to improve system performance in terms of the server-hit rate and the expected round-trip time. The content popularity represents the probability that the UEs associated with a BS will request the content in future. The Zipf distribution is a traditional model for forecasting popularity as the accumulated request probability in simple conditions (e.g., slowly changing popularity). The contents are cached according to their estimated popularity [188]. To improve performance in complex scenarios, reinforcement learning is another way to execute actions (cache contents) according to a trained state table (service delay). The reward is given for decreasing the delay (see the example in [189]). Generally, machine learning methods are used to model user-file correlations to guide the cache deployment [190]. However, it is inefficient for BSs with similar popularity to repeat the same accumulation work. Instead, it is better to update the estimate according to the knowledge of other BSs. The BSs can be grouped by clustering methods according to traffic, content preference, and storage. Then, the modelled popularity is shared through the control plane interface (e.g., X2 interface) and updated according to each BS's own preference [191]. This scheme also benefits cooperative caching, which shares content between adjacent BSs, and distributed caching, which fetches content from multiple BSs (see the example in [45]). The general idea is to follow the predicted trajectories of users and deploy the segments accordingly. Nevertheless, an optimisation is required to maximise profit according to the learnt knowledge. It is a matching game that considers both BS preference (most popular contents) and server preference (low transmission time).
The best result is to cache the most-requested content with the lowest transmission time to minimise the backhaul usage. For example, [192] treated this problem as a many-to-many matching game.
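Popularity-based placement under a Zipf model can be illustrated with a short sketch. The Zipf exponent of 0.8, the catalogue of 100 contents, and the cache sizes are hypothetical values, and the sketch assumes independent requests with static popularity.

```python
import numpy as np

def zipf_popularity(n_contents, exponent=0.8):
    """Request probability of each content, ranked from most to least popular."""
    ranks = np.arange(1, n_contents + 1, dtype=float)
    weights = ranks ** (-exponent)
    return weights / weights.sum()

def expected_hit_ratio(popularity, cache_size):
    """Hit ratio when the cache holds the cache_size most popular contents."""
    most_popular = np.sort(popularity)[::-1][:cache_size]
    return float(most_popular.sum())

popularity = zipf_popularity(100, exponent=0.8)
hit_ratios = [expected_hit_ratio(popularity, size) for size in (5, 10, 20, 100)]
```

The values in `hit_ratios` grow with the cache size but with diminishing returns, which is why placement is usually optimised jointly with transmission time and storage rather than by popularity alone.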

2) CONTENT DELIVERY
Content delivery refers to an efficient strategy for preparing contents before they are requested. This process is usually scheduled during off-peak time [45]. The traditional way is to broadcast the most popular contents at off-peak time [192]. This is efficient at the beginning but becomes costly when the popular contents change quickly. The development direction is user-oriented delivery. If the cache placement is at the UE level, the D2D link becomes more efficient for content delivery. UEs with close social ties are potentially interested in the same contents, so content delivery based on social ties has become a new research direction. It is useful to identify the influential users by using a centrality metric to estimate their social influence (higher centrality means more influential). Moreover, content dissemination can be modelled by a stochastic Dirichlet process, such as the Chinese restaurant process [42].
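As a minimal illustration of ranking users by a centrality metric, the sketch below uses degree centrality on a small social graph. Degree centrality is only one possible metric and is chosen here as an assumption; the user identifiers and ties are hypothetical.

```python
def degree_centrality(social_ties):
    """Fraction of the other users that each user is directly tied to."""
    n = len(social_ties)
    return {user: len(friends) / (n - 1) for user, friends in social_ties.items()}

# Undirected social ties between four UEs (hypothetical graph).
ties = {
    "u1": {"u2", "u3", "u4"},
    "u2": {"u1", "u3"},
    "u3": {"u1", "u2"},
    "u4": {"u1"},
}

centrality = degree_centrality(ties)
# The most influential user is a natural seed for D2D content delivery.
seeds = sorted(centrality, key=centrality.get, reverse=True)[:1]
```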

3) STORAGE UTILISATION
The issue of storage utilisation arises because cache storage is limited. Some contents have to be removed to make room for up-to-date contents. A traditional method is Least Recently Used (LRU). The LRU algorithm maintains a ranking from the most recently used content to the least recently used content and deletes the bottom item when extra storage is needed [27]. However, continuously tracking the access information of content is expensive. A preferable solution is to forecast the required cache size and delete the least-popular items. This scheme preserves local memory for upcoming popular contents and combines the prediction of storage with that of traffic for flexible transmission. Reinforcement learning can build a storage-cache table with the aim of maximising storage usage [193].
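For reference, the LRU policy described above can be sketched in a few lines. This is a generic illustration of the eviction rule, not the implementation used in [27]; the class and method names are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used item."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest entry last

    def get(self, key):
        """Return the cached value, or None on a miss; a hit refreshes recency."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Insert or update an item, evicting the oldest entry if the cache is full."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)

    def keys(self):
        """Cached keys ordered from least to most recently used."""
        return list(self._items)
```

Every hit reorders the ranking, which is the bookkeeping cost mentioned above; a popularity-forecasting scheme avoids it by deciding evictions from predicted demand instead of access history.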

4) DYNAMIC CACHING WITH MOBILITY PREDICTION
The probability of a content request varies with time and area. Testing caching strategies under dynamic popularity and user behaviour is therefore becoming more important, even though current simulations are still executed under many idealised assumptions [194]. User mobility is one of the influential factors. The predictions of cell transition and cell sojourn time suggest the best segment to store and the time to delete it [195].
Generally, mobility prediction continuously provides a ranking of content request percentages. The contents are then cached at the edge BSs (e.g., Remote Radio Heads (RRHs)) according to this ranking. Additionally, with mobility prediction, UEs that are difficult to serve through BS caching can be served by caches on Unmanned Aerial Vehicles (UAVs). An RNN is a useful tool for modelling temporally dynamic behaviour, such as mobility. The works [116], [133] applied this approach with Echo State Networks (ESNs) because of their easier training process and lower computational effort. Future improvements could combine the ESN with reinforcement learning to build a deep reinforcement learning network that takes better cache actions in complex scenarios.
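To make the notion of cell-transition prediction concrete, the sketch below uses a first-order Markov chain estimated from observed trajectories. This is a deliberately simpler stand-in for the RNN and ESN models of [116], [133], not a reproduction of them; the cell labels and function names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transitions(trajectories):
    """Count cell-to-cell transitions over a list of visited-cell sequences."""
    counts = defaultdict(Counter)
    for path in trajectories:
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    return counts


def next_cell_probabilities(counts, current):
    """Empirical probability of each next cell given the current cell."""
    total = sum(counts[current].values()) if current in counts else 0
    if total == 0:
        return {}  # no outgoing transitions observed for this cell
    return {cell: c / total for cell, c in counts[current].items()}


def rank_cells_for_prefetch(counts, current):
    """Candidate cells ordered from most to least likely next stop."""
    probs = next_cell_probabilities(counts, current)
    return sorted(probs, key=lambda cell: (-probs[cell], cell))
```

The ranking returned by `rank_cells_for_prefetch` indicates where the next content segment could be placed; a sojourn-time estimate would additionally be needed to decide when to delete it.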

5) COLD START PROBLEM
The cold start problem means that the system starts with no prior information and needs some time to collect sufficient information and converge. Traditional methods cache random content, or even all files, at the start in order to collect the popularity of this content [27], [42], [190]. Such cold-start popularity estimation requires time to reach sufficient accuracy and has difficulty supporting frequent user movement.
To overcome the cold start problem, transfer learning is one option. The transferred knowledge can be learnt from other communications (e.g., D2D interactions) to support caching at the initial stage with content popularity and social ties. An example of using transfer learning is given in [196]. However, it is hard to specify, for all cases, which knowledge should be transferred. For example, the knowledge of rural areas can be very different from that of urban areas, but the distinction between rural and urban cannot be recognised unless it is manually defined. Another method avoids this problem by directly gaining prior knowledge from online data, such as video popularity prediction. If the content selection and placement are completed during off-peak time, there is no cold start problem during peak time [116]. Here, we present an example to show the performance of a data-driven method in avoiding the cold start problem. A spatial popularity map is analysed and provided to the caching resource deployment. Traditionally, meta-heuristic algorithms, such as Simulated Annealing, are chosen for deployment, but they waste effort searching low-demand spaces. The spatial popularity map effectively reduces the search space and the number of iterations used. The computation iterations are compared in Fig. 11 as a Cumulative Distribution Function (CDF): an average reduction of 53.85% in computation iterations is achieved.

6) PROACTIVE CACHING BASED ON ONLINE INFORMATION
Online information provides the popularity of Internet videos, which can be represented as $p_i(t)$ for video $i$ at time $t$, with total size $s_i$ and cached portion $\alpha_i(t)$. The backhaul load can then be expressed as $\sum_i (1 - \alpha_i(t))\, s_i\, p_i(t)$, with $0 \le \alpha_i(t) \le 1$. The caching algorithms aim to increase $\alpha_i(t)$ so as to minimise the backhaul load according to the content popularity $p_i(t)$.
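The backhaul-load expression above can be evaluated directly. The sketch below also adds a simple greedy fill under a total cache capacity; that capacity constraint and the function names are assumptions introduced for illustration. Because each unit of storage spent on video i removes p_i units of load, filling the cache in descending order of popularity minimises the expression when partial caching is allowed.

```python
def backhaul_load(sizes, popularity, cached_fraction):
    """Expected backhaul load: sum over i of (1 - alpha_i) * s_i * p_i."""
    return sum((1.0 - a) * s * p
               for s, p, a in zip(sizes, popularity, cached_fraction))


def greedy_fill(sizes, popularity, capacity):
    """Cache videos in descending popularity; the last one may be cached partially.

    Assumes every size is strictly positive and capacity uses the same unit as sizes.
    """
    alpha = [0.0] * len(sizes)
    remaining = capacity
    for i in sorted(range(len(sizes)), key=lambda idx: -popularity[idx]):
        if remaining <= 0:
            break
        stored = min(sizes[i], remaining)
        alpha[i] = stored / sizes[i]
        remaining -= stored
    return alpha
```

With sizes [4, 2, 4], popularities [0.5, 0.3, 0.2] and a capacity of 5, the first video is fully cached and half of the second, which reduces the load from 3.4 to 1.1 in the units of size times request probability.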
The general method for forecasting the popularity of a new video is to classify it into the categories of published videos. Firstly, the features of the unpublished video are extracted by a CNN. Then, feature clustering is applied by treating multiple published videos with similar features as a single video category with a representation vector. Next, the high-dimensional features extracted from the unpublished video are transformed into a representation vector to be classified. Finally, the popularity is predicted by a regression model. The researchers in [142] followed this approach and reported the offloading ratio as a function of time. In the 'with prediction' condition, the popularity of a published video is updated over time, so its performance approaches the genie case. The current method uses the perception ability of the CNN to learn video features, but it still falls short in deciding the optimal cache placement scheme. In contrast, reinforcement learning is good at decision making. Deep reinforcement learning, which couples deep learning with reinforcement learning, is therefore a promising way to improve cache decisions.

7) SUMMARY OF FINDINGS AND LESSONS LEARNED
Table 10 summarises the details of the literature reviewed in this subsection. The main findings and lessons learned from proactive caching include:
• Content placement is a delay-minimising problem solved by reinforcement learning or by a matching game that best links contents to BSs. However, traditional methods train the model without knowledge sharing between BSs. This problem is expected to be addressed by cooperative caching based on popularity and mobility. Content delivery then requires modelling traffic seasonality to forecast the upcoming off-peak time. If the content is to be delivered at the UE level, social ties and highly influential users enable cache delivery through D2D links or the broadcast of the most popular content. Finally, storage management has a similar target: to save the transferred content and to free space for other high-popularity content. We find that proactive caching needs to predict not only content popularity but also mobility and user relationships. Fortunately, all of them are available in online data.
• Mobility prediction enables dynamic caching that is updated with the user trajectory predicted by an RNN. General RNN methods require high computational effort and long training time, so the ESN is designed to improve on this aspect. The limitation of using only an ESN is that its decision-making ability is not as good as that of reinforcement learning. Deep reinforcement learning is therefore a promising way to combine the advantages of both learning methods. Besides, the cold start problem exists in current proactive caching schemes. One can solve it by transferring knowledge from other networks, but the 'appropriate' principles need to be pre-defined to avoid learning useless knowledge. Online data analytics can avoid this problem by learning directly from the requested services.
• Online information is the basis for forecasting the popularity of new content. Based on the popularity, the cache size is optimised by the caching schemes to minimise the backhaul load. However, some challenges remain for further development. For example, prediction errors or fast-changing popularity can place more burden on the fronthaul. The decision should be made not only according to the probability of occurrence but also according to the certainty of the prediction. A Gaussian process is expected to be a feasible solution to this problem.
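The last point, deciding on both the predicted popularity and its certainty, can be sketched as follows. The sketch assumes that some probabilistic forecaster (a Gaussian process is one option) already supplies a predictive mean and standard deviation per content item; the scoring rule (mean minus k standard deviations) and all names are illustrative assumptions rather than a method from the reviewed literature.

```python
def lower_confidence_bound(mean, std, k=1.0):
    """Pessimistic popularity estimate: predictive mean minus k standard deviations."""
    return mean - k * std


def select_with_uncertainty(forecasts, cache_slots, k=1.0):
    """Choose contents to cache from {content: (mean, std)} forecasts.

    k = 0 ranks by predicted popularity only; larger k penalises uncertain forecasts.
    """
    scores = {content: lower_confidence_bound(mean, std, k)
              for content, (mean, std) in forecasts.items()}
    # Sort by descending score, breaking ties by content name for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))[:cache_slots]


# Hypothetical forecasts: v1 has the highest mean but is also the least certain.
forecasts = {"v1": (0.30, 0.20), "v2": (0.25, 0.02), "v3": (0.10, 0.01)}
risk_neutral_choice = select_with_uncertainty(forecasts, 1, k=0.0)
risk_averse_choice = select_with_uncertainty(forecasts, 1, k=1.0)
```

With these numbers, the risk-neutral rule caches v1 while the risk-averse rule prefers v2, whose slightly lower forecast is far more certain.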

V. FUTURE RESEARCH DIRECTION: ONLINE DATA ANALYTICS-BASED 5G PROACTIVE NETWORK OPTIMISATION
In a 5G network, explosive communication demand urges network optimisation to shift from passive to proactive. Specifically, a proactive algorithm dynamically follows changes in circumstances and executes optimisation strategies to maximise the efficiency of resource usage. How to monitor rapidly changing environments (contexts) is therefore the first problem to be solved. Online data is a valuable source for this purpose, with all users acting as environment-monitoring sensors that provide the essential context for future network optimisation. In our survey, an optimisation-context-data map is proposed to clarify which kind of proactive optimisation requires which contexts, and which online data must be analysed to obtain them. However, despite the work on data analysis (extracting contexts from online data) and on simulated proactive optimisation (assuming that the contexts are already known), there are still open issues and challenges in closing the loop and making online data a reliable data source. In this section, we discuss these challenges and issues along with future research directions.

A. REALISATION IN 5G
In recent years, many researchers have discussed the key technologies of the 5G network, such as millimetre wave, network densification, and massive Multiple-Input Multiple-Output (MIMO). Among them, densification is expected to bring significant changes to the current cellular network because of the increase in inter-cell and intra-cell handovers [29]. Therefore, the proactive trend should fit into the 5G paradigm change through a platform that provides security, collection, analysis, measurement, and efficiency. This section presents some research directions towards achieving 5G proactive network optimisation.

1) CONTEXT GRANULARITY
Granularity is a measure of the distinguishable scale of detail in the context. Specifically, in popular area prediction, various levels of location granularity are provided depending on the data size, geo-tag accuracy, and data-mining method. For example, in [62] festival areas were estimated at the scale of a ward or town, while in [33] the Quality of Experience (QoE) blackspots and high-traffic zones were detected at a much finer granularity (within London Bridge Station). There is no single best granularity for every optimisation, but real-world data sets do have granularity limitations. Therefore, the first challenge is to judge whether the online data source can achieve the expected granularity. Furthermore, the granularity of the context should be consistent with the requirements of the proactive network optimisation. Consider the load balancing examples in [23] and [17]: in [23] the granularity of the cell load (traffic or area popularity) was at the small-cell level for cell-offloading research, while the traffic distribution granularity in [17] was finer, at the intra-small-cell level, for more efficient offloading. In other words, each network optimisation has a required granularity, so it is necessary to determine it before data analysis.
In 5G, the cell size varies, and small cells are densely deployed. The context granularity therefore becomes a prerequisite, and the context extracted from data sets should reach at least the small-cell level. Besides, the authors expect real-time network data to be collected and used, so new challenges emerge regarding prediction errors.

2) PREDICTION ERROR IMPACT
Prediction is the process of future context deduction according to experience, so the prediction error refers to the context information that is irrelevant for the future. Such unwanted variation always exists during sampling, training and testing.
Moreover, both noisy data and rapidly changing circumstances can cause it. For example, in geolocation prediction based on Twitter data, there are many GPS coordinate shifts caused by weak signals, especially for indoor Tweets. In that case, these indoor Tweets provide a location range (approximately a rectangular area) if Twitter cannot estimate an accurate coordinate. Such indoor data accounts for a large share of all geo-tagged Tweets. On the one hand, the geolocation prediction is inaccurate if we consider all location records. On the other hand, the prediction cannot represent the indoor users' context after filtering.
Accordingly, reducing prediction errors and limiting the impact of unavoidable ones deserve research effort. Online data is unavoidably noisy and leads to uncertain predictions. Therefore, future proactive optimisation algorithms based on online data have to take the impact of prediction errors into consideration.

B. AGGREGATION OF ONLINE DATA AND CELLULAR KPIS
Cellular network KPIs are used for radio network performance monitoring, performance degradation detection, and network resource optimisation. Such data is usually regarded as the first choice for cellular big-data mining because of its accurate measurements, large volume, and tidy format. However, network KPIs have two shortcomings. Firstly, the data source is not open to the public because of privacy and security concerns. Secondly, KPIs cannot reflect users' intents, so they cannot detect demand bursts over a long-term period. Therefore, the following two research directions are proposed to alleviate these shortcomings.

1) COLLABORATION WITH DATA PROVIDERS
If the network is expected to learn from past user behaviour, the choice of data collection becomes critical. Most researchers collect non-real-time and limited data by crawling. For example, researchers mainly collected Twitter data through two APIs: the Twitter search API (searching Tweets with keywords in the previous seven days) and the Twitter streaming API (up to 1% of all Tweets) [197]. This method is neither sufficient nor efficient. Besides, the crawled data is not representative because the characteristics of different regions, cultures, and languages differ considerably [198]. We therefore need to build collaboration with service providers.
Another essential data source is the network KPIs owned by mobile network operators. They directly reflect the network status over a short period and are the only indicator available to network-optimisation specialists. In other words, these parameters are the foundation of current passive network optimisation and will remain so for future proactive optimisation. However, network operators forbid external access to such data, so current researchers rely on either historical cellular data [33], [94], [95] or simulation. Therefore, exploring collaboration with network operators is also necessary.
Traditional mobile network operators only provide communication services, so they face difficulties in collecting the meta-information needed to accurately forecast context-aware demand for proactive optimisation. However, as Internet traffic comes to dominate daily usage, mobile operators and Internet service providers are increasingly cooperating and merging. For example, in China, the Mobile Virtual Network Operators (e.g., Tencent, Weibo, and Taobao) are also Internet service companies in the realms of social networks, video, and e-commerce [199]. Such a combination of the service provider and the mobile provider makes heterogeneous data analysis more convenient. Besides, mobile network operators, such as China Unicom, also cooperate with service providers to offer users application-specific bundles, such as unlimited mobile data for Youku videos. In that case, the operators can obtain traffic data for social behaviour analysis.
Future proactive optimisation should take advantage of both network KPIs and online data. Accordingly, finding a reliable and sufficient data collection method becomes another challenge.

2) OTHER OPTIMISATION SCHEMES
There are other aspects of network optimisation, such as interference management, coverage and capacity, and resource optimisation. Most current attempts are real-time self-optimisation, but online data analysis offers opportunities to benefit them. For example, H. Claussen et al. proposed femtocell self-optimisation of coverage and capacity in [22]. They simulated user mobility in an indoor scenario as the context for coverage adaptation, which can be modelled by a personal trajectory with context (including way-points and time spent). In interference management, researchers use polynomial regression [14], [200] or a neural network-based cognitive engine [201] to model the relationship between traffic distribution and transmission power. With traffic distribution prediction, the system can select potential high-interference areas and trigger interference management. For reference, we summarise the context and potential applications as a map (Fig. 12) with possible connections for all proactive optimisation.

VI. CONCLUSION
In summary, this survey demonstrated the essential state-of-the-art technologies in online-data analytics that can offer promising drivers to shift network optimisation from passive to proactive in 5G.
Increasing amounts of online data contain rich meta-information about individual demand context (e.g., personal trajectory, user preference, and user relationship) and the wider social context (e.g., popular region, content popularity, and network traffic). When appropriately processed through machine learning mechanisms and combined with environmental data, this information can be used to forecast traffic patterns across multiple population and spatial scales. In turn, the forecast information provides the capability for proactive optimisation that allocates resources in advance to suit diverse service requirements and complex dynamics.
In this survey, to reveal the potential mapping from data to optimisation, the authors visualised the context as connecting paths to help readers find the most valuable context and its available data sources. Different models are proposed to retrieve predictive user behaviour and to further correlate it with network KPIs.
The authors strongly believe that in future 5G networks, the optimisation will be proactive, service-oriented as well as user-oriented. In that case, online data becomes an indispensable source to increase QoE and reduce OPEX. However, open challenges still exist, such as context granularity in the 5G scenario, prediction errors, real-time data analytics, and taking full advantage of both online and cellular data.