Cognitive Caching at the Edges for Mobile Social Community Networks: A Multi-Agent Deep Reinforcement Learning Approach

Content caching in current commercial content delivery networks (CDNs) reduces duplicate traffic and improves QoS and QoE, but it still suffers from surges of content traffic, network congestion, high user mobility and dynamic user content request patterns, which may result in high content access latency. With the increasing interest of large companies in providing next-generation mobile edge applications and services that users can use despite potentially sparse, non-uniform connectivity, it is becoming increasingly important to provide efficient, smart content caching services at the edges to help with scalable storage and processing of local data as well as sharing data both at the edges and in the cloud. We propose a novel multi-agent deep reinforcement learning approach, CognitiveCache, in which edges adaptively learn their best caching policies while collaborating with neighbouring edges to better understand whether they can be used for the cache content placement optimisation problem in dynamic environments. We show that CognitiveCache can respond and adapt to the spatial-temporal locality of dynamically changing content workloads and resources, improve the reliability and scalability of content sharing, enhance QoE for users and decrease operational costs in mobile social community networks. We perform an extensive multi-criteria evaluation of our proposal against four benchmark and competitive protocols over two different real-world scenarios in New York and London, in the face of different mobility and user interest patterns, to show that CognitiveCache achieves higher cache hit ratios and lower delays while reducing resource consumption.


I. INTRODUCTION
Current large-scale networking systems have been evolving to adapt to the increasing complexity and dynamics of both the underlying networking infrastructures and applications. Content caching in content delivery networks (CDNs) [63], [64], such as AWS Cloudfront [25] and Azure CDN [61], improves users' QoS and QoE, but it still suffers from sparse network coverage, disconnections, network congestion and highly dynamic user mobility and query patterns [1], [2], [6]. Many applications, such as remote health care and mobile social networks, need to be supported by next-generation mobile edge predictive content services which allow localized content storage and processing close to the users interested in that content [1], [6], [7], [32]-[34].
The associate editor coordinating the review of this manuscript and approving it for publication was Miguel López-Benítez .
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
State-of-the-art edge services hosted in mobile edge devices bring local data management, computation and inference capabilities to the edges to reduce the delay and improve the performance of data transport for end-users [67]. However, they still have limited support for surges of traffic, especially video streaming, and for dynamically changing networks caused by users' mobility and content request patterns [1], [2]. To enable fully local network, interest and privacy awareness, self-organised multi-layer cognitive edge clouds have been proposed in [32], [34] to host various services. The integration of edge and fog computing [3], [41], [50] with various technologies, including device-to-device (D2D) communication [30], [43], content-centric architecture [1], [11], small cells [48] and caching [1]-[3], [8], has been proposed to support complex networking data services. Intelligent caching services at the edges are envisaged to be an important solution, providing more localised and more responsive content services to mobile users compared to traditional CDN approaches. We formulate the objective function of the cache content placement optimization problem in a dynamic environment and investigate whether machine learning (ML), and in particular reinforcement learning (RL), approaches can be helpful for caching services that compute cache content placement among the edges such that the aggregate benefit is maximized. RL algorithms are traditionally used in statistics and neuroscience and have been successfully applied in a variety of applications ranging from robotics [49], energy and resource management [22], [50], recommendation systems [19], [51] and transportation [52] to software-defined networks [21].
Reinforcement learning [15], [16] is suitable for our unreliable network environments with no global knowledge and a certain degree of dynamics where we can quantify different variables and states of the network environments. We argue that application of RL's core concept where a mobile edge learns behaviour through ''trial-and-error'' interactions in a dynamic environment is feasible in the considered scenario (even though it is not for other scenarios such as e.g. high speed police chase which we do not consider in this paper).
In this paper, we propose a multi-agent deep reinforcement learning (DRL) model for the caching framework CognitiveCache at the edges, which adaptively and collaboratively captures and predicts the dynamically changing networks and complex users' content request patterns to improve the accuracy of caching decisions and place the most suitable contents in the most suitable edges. We model the mobile edge-cloud network environment with a novel state space and a set of actions that edges can take to collaborate with other nodes and interact with the environment in real time to maximise a cumulative reward. Traditional single-agent DRL-based caching approaches [54], [55] have been proposed where a single edge (e.g. a single base station or an access point) learns to make the most suitable caching decisions based on the states of the environment and the rewards. In our context with multiple edges (e.g. femtocells, access points, mobile users, vehicles, etc.), single-agent DRL-based caching approaches are not applicable because i) every individual edge should learn its own caching policy based on its observed user content request patterns and the current state of its caching storage, and ii) a single central agent could have limited scalability, as the increasing number of distributed edges will result in a massive action space [56]. Therefore, we investigate a DRL caching approach for mobile edge-cloud scenarios where multiple edges can collaborate with each other. We utilise and extend the actor-critic based approach [5], [57] for multi-agent reinforcement learning [18], [24], in which the actor network controls the caching decisions and the critic network evaluates and gives feedback on the chosen ones. In CognitiveCache, each edge considers its caching strategy together with the caching policy of its neighbouring edges as part of the environment [38].
This implies that all edges have some effect on the environment and similar action of a single edge can have different outcomes depending on what the other edges are doing. We consider mobile social community networks which are characterized by multi-network layer [1], [3], [29], [45] including physical geo-temporal network topologies, resources, social geo-temporal communities and content geo-temporal interests. Note that in this paper, our edges are assumed to be privacy-aware by design as proposed in [32] and [34]. Malicious behaviour between the edges is out of the scope of our paper.
We show that CognitiveCache outperforms benchmark and state-of-the-art caching protocols across a range of metrics, achieving higher cache hit ratio, lower latency and lower transmission cost in the face of realistic dynamic users' mobility and data demand patterns. We drive our experiments with real-world mobility traces and users' interest traces of more than $10^3$ mobile edges and $10^5$ content interests from the New York Foursquare and London Twitter datasets. The remainder of the paper is organized as follows. Section 2 provides a systematic overview of the related work. Section 3 describes key features of our multi-agent deep reinforcement learning caching framework and provides pseudo-code. Section 4 discusses the multi-criteria evaluation of CognitiveCache against other benchmark and competitive protocols across a range of metrics for realistic dynamic users' mobility and data demand patterns. Section 5 gives conclusions and discusses future work.

II. RELATED WORK
Machine learning [15], [16], a subset of artificial intelligence, has attracted increasing attention from scientific communities in recent years as it allows a specific task to be performed without explicit instructions. Machine learning approaches can be grouped into three fundamental paradigms [15], [16]: (1) supervised learning, which infers a learning function from labelled data; (2) unsupervised learning, which involves mining information and learning from unlabelled or raw data; (3) reinforcement learning (RL), which explores how agents take suitable actions to maximise rewards in an environment. The environment is typically stated in the form of a Markov decision process [15], [46]. In reinforcement learning [15], [16], given a goal, an agent learns how to achieve the goal through a trial-and-error interactive process with its environment. Both supervised and unsupervised learning are more suitable for off-line learning scenarios since all the inputs are received at once, while reinforcement learning can be classified as an online learning approach as it relies on being able to continuously monitor the response to the actions taken and measure it against a definition of a reward [16]. One of the well-known reinforcement learning algorithms is Q-learning [17], in which an agent in a given state estimates the expected reward if it chooses a specific action and enters another state. The core of the algorithm is a Bellman value iteration update [17], which is used to find optimal action-selection strategies. The authors of [67] argue that, to reduce the delay in data processing and minimize the privacy risks of revealing raw data to service providers, future AI and machine learning-based services could be deployed on users' devices at the edges rather than placing everything in the cloud, although moving ML-based data analytics from the cloud to edge devices brings a series of challenges.
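The Q-learning update mentioned above can be illustrated with a minimal tabular sketch. The function name `q_update`, the learning rate `lr`, the discount `gamma` and the toy state and action labels are all assumptions chosen for illustration; they are not taken from [17] or from CognitiveCache.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, lr=0.1, gamma=0.9):
    """Apply one tabular Q-learning (Bellman) update and return the new Q-value.

    q is a mapping from (state, action) pairs to estimated returns.
    lr and gamma are illustrative hyper-parameters, not values from the paper.
    """
    # Best value reachable from the next state under the current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target: immediate reward plus discounted future value.
    td_target = reward + gamma * best_next
    # Move the current estimate a small step towards the target.
    q[(state, action)] += lr * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical usage: two caching actions and two coarse cache states.
q_table = defaultdict(float)
q_update(q_table, "low_cache", "cache", 1.0, "medium_cache", ["cache", "skip"])
```

Repeating the update over many observed transitions moves the table towards action values from which a greedy policy can be read off; deep variants replace the table with a neural network.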
In recent years, different deep reinforcement learning based caching approaches at the network edges have been proposed, leveraging the powerful capacity of DRL to learn from historical events. The authors of [54] propose a DRL based framework for content caching at the base station as a single agent. They utilize the actor-critic method and train the policy using Deep Deterministic Policy Gradient (DDPG) [58] to improve the caching decisions and maximise the long-term cache hit rate without knowledge of the content popularity distribution. However, [54] is mostly confined to centralized learning within one base station as a single agent, which is not scalable in mobile edge-cloud scenarios [56]. The authors of [55] propose deep reinforcement learning-based caching in hierarchical content delivery networks. The proposed framework, DQNCache, relies on Deep Q Networks to learn the optimal caching policy in an online manner. DQNCache tries to find the optimal value function, which is a mapping between an action and a value, in order to find better actions with higher values. DQNCache is a value-based algorithm, which is different from policy-based algorithms such as REINFORCE [59]. While value-based approaches are more sample-efficient and stable, policy-based approaches are better suited to continuous, stochastic environments and converge faster. In this paper, we utilize the actor-critic method [5], [57], which takes advantage of both value-based and policy-based algorithms while eliminating their drawbacks. Similar to [54], the framework in [55] is built on a single-agent DRL approach where a parent node makes caching decisions by observing aggregated requests from all leaf nodes. Different from existing single-agent DRL approaches [54], [55], our CognitiveCache leverages multi-agent DRL to predict and adapt in real time to the dynamically changing spatial-temporal locality of networks and user requests [3], [4] and to achieve collaborative intelligence between edge caches in mobile edge-cloud networks.
Each CognitiveCache edge has its own learning model and exchanges knowledge with its neighbours to provide better scalability compared to centralised DRL caching approaches. Authors in [56] propose a multiagent DRL based caching for video contents which leverages independent Q-learning, advanced actor-critic method integrated with long short-term memory network to adaptively learn most optimal local caching policy in conjunction with other edges. In this paper, we propose to use soft actor-critic method [5], [57] which promises to be more sample efficient and more robust to brittleness in convergence compared to [56]. CognitiveCache has a fine-grained state design that allows CognitiveCache to adaptively capture and forecast the popularity of content and surge of content demand.
Reinforcement learning has also been applied in different areas of mobile edge cloud networks. Authors in [37] propose a delay-tolerant congestion-control framework that automatically modifies its operations based on the dynamic changings of the underlying network using reinforcement Q-learning. [37] shows that it continuously adapts to the dynamics of the target environment in a variety of DTN applications and scenarios with adequate performance. However, [37] only takes into account resource availability as states of nodes without considering other dimensionalities such as connectivity and content traffic, thus it does not sufficiently represent the multilayer multidimensional characteristics of mobile network environments. Authors in [68] propose a distributed and energy-efficient control framework based on convolutional neural network (CNN) that utilizes distributed multi-agent DRL approach to solve a challenge of mobile crowdsensing. Reference [68] takes energy availability, spatial coordinates and remaining data as states for each node in the network. The solution explores the spatio-temporal nature of the considered scenario for better cooperation and competition between nodes to maximize the data collection ratio, geographic fairness and energy efficiency. Authors in [19] leverage the deep learning techniques for sequential modelling and correlation identification of user interests influenced by their social circles and centrality. Reference [19] shows that they significantly improve prediction accuracy to predict user interests based on their sociality compared to widely used baseline methods. However, [19] assumes centralized knowledge and does not support fully distributed decision making as we do in our paper. 
Authors in [20], [21] utilize deep reinforcement Q-learning to propose an integrated framework that can enable dynamic orchestration of not only networking but also caching and computing resources in order to improve the performance of next-generation vehicular networks. However, the framework's complexity is very high when the network states, caching states and computational resource states are jointly considered. Authors in [22] propose an intelligent deep reinforcement learning-based offloading mechanism for vehicular edge where the states of communication, mobility and computation are modelled by finite Markov chains. Similar to [20], [21], task scheduling and resource allocation strategy is formulated as a joint optimization problem to maximize users' QoE.
Many predictive analytics, collaborative heuristics and utility-based mobile edge caching models have been proposed in recent years. Research in [1] models mobile edge-cloud networks as a bargaining game to propose a collaborative adaptive caching framework, CafRepCache. The authors formulate content discovery, caching and retrieval as an optimization problem and prove it is a typical Integer Programming problem which is NP-Complete [1], [6]. CafRepCache serves subscribers from its local cache or by redirecting a request to a nearby collaborative cache, rather than forwarding it to the original publisher. CafRepCache is built on multilayer predictive analytics and heuristics to capture and predict content interests coming from dynamically changing clusters [1]-[3], [14], [31], [39] of subscribers in both random and scale-free networks, and thus reduces content retrieval delay, improves cache efficiency and reduces resource consumption while enabling responsiveness to heterogeneous dynamically changing network topology, congestion avoidance and varying patterns of content publishers/subscribers. The authors of [30] present InfluentialCache, a proactive caching approach in small cellular device-to-device communication networks. InfluentialCache models the D2D cellular network as a social graph in order to utilize its spatial structure. The influential users and their clusters are identified using eigenvector centrality and CC-GA clustering algorithms [42]. The authors assume that the content requested, generated or accessed by the influential users in a community will become popular within this community, and thus proactively cache this content to serve the others. The authors of [8] propose SocialCache, a caching algorithm that chooses caching carriers based on the social relationships of nodes in the network. Content popularity is calculated based on the frequency and freshness of content requests.
Authors in [12] propose LocationCache that uses function of distances between subscribers and caching points and other features such as time stamp or number of content requests to classify contents and replace them when cache memory is full. Least frequently used (LFU) [10] measures the frequency of content requests at a caching node to make caching decisions. The content receiving more frequent requests will need to be cached because it will have higher chance of being requested again in the future. Least recently used (LRU) [9] records the time stamp of a content request locally in order to make caching decisions based on how recent the content requests are. ProbCache [60] keeps a copy of content in a cache along a path with a probability which is calculated based on path lengths and multiplexes content flows. In this paper, we compare CognitiveCache against state-of-the-art and benchmarking caching algorithm including the least recently used (LRU) [9], least frequently used (LFU) [10], ProbCache [60] and DQNCache [55].
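As an illustration of the LRU baseline described above, the following is a minimal sketch built on Python's standard `collections.OrderedDict`. The class name `LRUCache` and its `get`/`put` interface are assumptions made here for illustration; they do not correspond to the implementation evaluated in this paper or in [9].

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used content."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, most recent last

    def get(self, key):
        """Return the cached value, or None on a cache miss."""
        if key not in self._items:
            return None
        # A hit makes this content the most recently used one.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Insert or refresh a content item, evicting the oldest if full."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # popitem(last=False) removes the least recently used entry.
            self._items.popitem(last=False)
```

An LFU variant would keep a request counter per content and evict the entry with the smallest count instead of the oldest access time.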
In previous work, we have proposed and deployed fully-distributed real-time multi-layer mobile edge cloud architectures that enable multiple services for smart vehicles, drones, cities and agriculture applications. These span MODiToNeS [32], mobile personal edge-clouds [66] and the Raspberry Pi based personal clouds RasPiPCloud [32]-[34], which support multiple on-demand virtual containers (e.g. LXC, Docker) to host different services and applications that collect, store, analyse, predict and share data with other edges while retaining complete control and ownership of their data.

III. COGNITIVECACHE - A MULTI-AGENT DEEP REINFORCEMENT LEARNING BASED CACHING FRAMEWORK
A. COGNITIVECACHE FRAMEWORK AND SYSTEM MODEL
We envisage a distributed edge caching scenario for heterogeneous content services, such as video streaming and file downloading, used by dynamic groups of mobile users who request contents in real time. In this context, multiple CognitiveCache edges (e.g. femtocells, access points, mobile users, vehicles, etc.) are located in different areas and provide processing and caching capacity. For example, a group of students in a city area acting as subscribers send their interests in information about nearby parking spaces, local coffee shops or bus schedules to CognitiveCache edges. CognitiveCache edges can communicate with neighbouring edges and can collaboratively retrieve the requested contents from each other (Figure 1). While the edges have limited caching storage and thus can only cache a certain amount of content, the CDN server is assumed to have sufficient caching capacity, with all the requested contents already cached, and is able to provide requested contents to the users. However, we assume that retrieving contents from the CDN server has a significantly higher delay compared to retrieving contents from the edges (e.g. BSs). More specifically, we propose that serving a request has three phases: local edge cache hit, neighbour edge cache hit and CDN cache hit. Local edge cache hit: when a request arrives at the local edge, the edge sends the cached content to the user if the requested content is found in its local caching storage. Neighbour edge cache hit: when the local edge has not cached the requested content, it attempts to retrieve the content from its neighbour edges, which causes extra latency but is still quicker and more cost-beneficial than retrieving the content from the CDN server. CDN cache hit: when the requested content cannot be found in either the local or the neighbour edges' cache storage, the local edge fetches it from the CDN server as a lowest-priority back-up solution. We assume the CDN's latency is the same for all users.
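The three-phase lookup described above can be sketched as follows. This is a minimal illustration only: the function name `serve_request`, the representation of caches as Python sets and the returned labels are assumptions, and the sketch ignores latency, cost and the order in which neighbours would be queried in practice.

```python
from typing import Dict, Set, Tuple


def serve_request(content: str,
                  local_cache: Set[str],
                  neighbour_caches: Dict[str, Set[str]]) -> Tuple[str, str]:
    """Return (hit_type, source) for a content request arriving at a local edge.

    hit_type is one of "local", "neighbour" or "cdn"; source names the node
    that supplies the content. The CDN is assumed to hold every content item.
    """
    # Phase 1: local edge cache hit.
    if content in local_cache:
        return ("local", "self")
    # Phase 2: neighbour edge cache hit (first neighbour that has the content).
    for name, cache in neighbour_caches.items():
        if content in cache:
            return ("neighbour", name)
    # Phase 3: fall back to the CDN server as the lowest-priority option.
    return ("cdn", "cdn")
```

For instance, with a local cache holding only `"bus_schedule"` and one neighbour holding `"parking_map"`, a request for `"parking_map"` would be resolved at the neighbour, while a request for an uncached item would fall through to the CDN.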
We propose a multi-agent deep reinforcement learning (DRL) model for the caching framework CognitiveCache at the edges with a novel design of states, actions and rewards. We utilise the independent Q-learning [38] approach to solve the multi-agent reinforcement learning problem, where each CognitiveCache edge considers its caching strategy together with the caching behaviours of its neighbour edges as part of the environment. CognitiveCache relies on an actor-critic based RL approach [5], [57] and utilises Long Short-Term Memory (LSTM) [40] as the (deep) recurrent neural network architecture for the actor-critic networks to learn a model of the environment. Figure 2 shows an overview of the CognitiveCache framework. Note that we use the terms edge, agent and node interchangeably. At every epoch t, a CognitiveCache edge adaptively and collaboratively captures and predicts the dynamically changing network environment, such as cache availability and complex users' content request patterns, using the predictive analytics and heuristics proposed in [1]-[3]. As a result, a CognitiveCache edge can form a state of the network environment observed not only by itself but also by its neighbour edges. In the actor-critic approach [5], [57], the actor network controls how the edge behaves by learning the optimal policy: it takes the state as input and outputs the best caching decisions, i.e. whether to cache or evict/drop each content in a list of contents. The critic network evaluates and gives feedback on the selected caching decisions to keep improving the caching decision policy in real time. After taking the caching decisions, each CognitiveCache edge receives a reward at the next epoch t+1 based on the popularity of the cached content, the content transmission latency and the costs.
The reward and the next state observed by CognitiveCache at t+1 help to keep improving the caching decisions so as to maximise the cumulative reward and eventually improve the average cache hit ratio and reduce latency and transmission cost. One disadvantage of the standard actor-critic based RL approach is that the search for the best caching actions is undirected and slow to converge [65]. Thus, CognitiveCache utilises a (deep) recurrent neural network (specifically, LSTM [40]) as the architecture for the actor-critic networks to learn a model of the environment that helps to capture long-term temporal dependencies and predict the next observations/states and rewards based on current observations/states and actions in the partially observable environment. As a result, CognitiveCache converges quickly to a good caching decision policy. Figure 3 shows an example of states and caching action transitions of CognitiveCache. A CognitiveCache edge maintains, and exchanges with its neighbour edges, the state of its caching storage (i.e. which contents have been cached) and the list of content popularity it observes. A CognitiveCache edge in the high cache state has high cache availability: it has not cached much and still has plenty of space in its caching storage. Thus the edge may freely decide to cache content of high, medium or low popularity. This causes the transitions from high cache to medium and low cache. When the edge is in the medium or low cache state, it is more selective about which contents to cache, so that it only caches content of high (or medium) popularity, as it needs to carefully drop old cached contents (i.e. ones which have expired or have the lowest content popularity) to make more caching space. When a CognitiveCache edge does not receive many content requests (i.e. low interest), it may decide not to cache the new content and, at the same time, drop its old (or expired) cached contents, which increases the cache availability.
We model the CognitiveCache system as a network $G$ that consists of a set $N$ of edges $n_i$ ($n_i \in N$) and a local CDN server denoted as $C$. We define the set of neighbours of each edge $n_i$ as $NE_i$. We assume that each CognitiveCache edge $n_i \in N$ in the network has a cache of size $CS_i$. We denote with $O$ the set of content files that can be requested in the network. Each content $o_k \in O$ has size $\delta_k$. Content $o_k$ consists of an array of chunks $o_{k,l}$. For simplicity and without loss of generality, we assume all chunks $o_{k,l}$ of a single content $o_k$ have the same chunk size $\delta_{k,l}$. We also denote $r_k^t$ as the interest in content $o_k$ at time $t$. We denote $p_{i,k}^t$ as the popularity of content $o_k$ observed and predicted by the edge $n_i$ during the time interval $t$. Content popularity expresses the probability that a content will be requested in a period of time. Contents are globally popular if they have been requested by a high number of subscribers coming from different areas in the network, while localised highly popular contents are those which have been requested by subscribers from the same location. Each edge $n_i \in N$ in the network receives requests for content $o_k$ at time $t$ at a rate denoted as the local content request rate $q_{i,k}^t$. In addition, $z_{i,k}^t$ denotes the aggregated request rate of content $o_k$ observed from all the collaborative neighbours of $n_i$ at time $t$.
We denote $x_{i,k}^t \in \{0, 1\}$ as whether edge $n_i$ has a cache of content $o_k$ at time $t$ or not. We denote $y_{i,j,k}^t$ as whether a content $o_k$ requested within the area of edge $n_i$ is cached by $n_j$ at time $t$, where $n_j \in \{NE_i, C\}$ is either a neighbour edge of $n_i$ or the CDN server. Table 1 summarises the main notations used in this paper. We measure the caching performance as the product of cache hit ratio, reduction in latency and transmission cost. The objective of our optimization is to compute the cache content placement among the edges such that the aggregate benefit is maximized. We formulate the optimal cache content placement problem as follows in equation 1.

max: (1)
Subject to: (2), (3)
Equation 2 ensures that each edge does not exceed its caching capacity. Equation 3 restricts the caching decision to a binary value indicating whether to cache content or not. $u_{i,j,k}^t$ is the beneficial utility value when caching content $o_k$ at the edge $n_j \in \{NE_i, C\}$ which is requested from the area of the edge $n_i$. In short, $u_{i,j,k}^t$ is the beneficial utility of caching content $o_k$ at the edge $n_j$ for the edge $n_i$. Note that $i = j$ implies a local edge cache hit. $u_{i,j,k}^t$ is defined as inversely proportional to the additive value of latency and transmission cost, as in equation 4: (4)
where $l_{j,i}^t$ and $c_{j,i}^t$ are the latency and transmission cost between $n_j \in \{NE_i, C\}$ and the edge $n_i$, and $\alpha$ and $\beta$ weight the importance of latency and transmission cost. As discussed above regarding the relation between local hit, neighbour hit and CDN hit, we summarise the relation of $l_{j,i}^t$ and $c_{j,i}^t$ as in equations 5 and 6: (5), (6)
Our cache content placement optimisation problem is a typical Integer Programming problem which is NP-Complete [1], [6]. Achieving a real-time solution with global optimality is non-trivial [6], [18], [56] given the partial network knowledge and the dynamics of mobile users and their content requests. In this paper, we explore how to develop a multi-agent DRL caching approach for the mobile edge-cloud scenario where not only can an individual edge capture and predict the spatial-temporal locality of content traffic patterns [3], [4] to make data-driven caching decisions, but multiple edges can also collaborate with each other to solve this optimisation problem.
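As a reading aid, the formulation described in words above can be written out as follows. This is a sketch inferred from the surrounding definitions and not a verbatim statement of equations 1 to 6. In particular, the popularity weighting and the summation ranges in (1) and the strict inequalities in (5) and (6) are assumptions; (2) to (4) follow more directly from the prose.

```latex
\begin{align}
\max_{x,\,y} \quad & \sum_{n_i \in N} \; \sum_{n_j \in \{n_i\} \cup NE_i \cup \{C\}} \; \sum_{o_k \in O}
    p_{i,k}^t \, u_{i,j,k}^t \, y_{i,j,k}^t \tag{1} \\
\text{subject to} \quad & \sum_{o_k \in O} \delta_k \, x_{i,k}^t \le CS_i,
    \qquad \forall n_i \in N \tag{2} \\
& x_{i,k}^t \in \{0, 1\}, \qquad \forall n_i \in N,\; o_k \in O \tag{3}
\end{align}
% Utility and ordering relations described in the text (assumed forms):
\begin{align}
u_{i,j,k}^t &= \frac{1}{\alpha \, l_{j,i}^t + \beta \, c_{j,i}^t} \tag{4} \\
l_{i,i}^t &< l_{j,i}^t < l_{C,i}^t, \qquad n_j \in NE_i \tag{5} \\
c_{i,i}^t &< c_{j,i}^t < c_{C,i}^t, \qquad n_j \in NE_i \tag{6}
\end{align}
```

Under this reading, a local hit ($i = j$) yields the largest utility because both its latency and its transmission cost are the smallest, and a CDN hit yields the smallest.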

B. COGNITIVECACHE -A MULTI-AGENT DEEP REINFORCEMENT LEARNING-BASED CONTENT CACHING AT THE EDGES
We propose our novel state, action, reward design followed by the algorithm and architecture of our multi-agent deep reinforcement learning caching framework CognitiveCache.

1) COGNITIVECACHE STATE, ACTION AND REWARD DESIGN
Each CognitiveCache edge in the network maintains a historical record of state-action-reward tuples $(s_0, a_0, r_0), (s_1, a_1, r_1), \dots$ of its own and of its neighbour edges. Our multi-agent caching environment is not globally observable but rather partially observable: each edge is able to capture the environment state and communicate with its neighbours. We describe our novel design of the state and action spaces and the reward function of the CognitiveCache agent as follows.
State space: CognitiveCache maintains the state $s_i^t$ of edge $n_i$ at time $t$ as $s_i^t = \{x_i^t, p_i^t\}$, where $x_i^t = \{x_{i,0}^t, x_{i,1}^t, \dots, x_{i,k}^t\}$ is the list of binary values indicating whether edge $n_i$ has a cache of each content $o_k \in O$ at time $t$ and $p_i^t = \{p_{i,0}^t, p_{i,1}^t, \dots, p_{i,k}^t\}$ is the popularity of each content $o_k \in O$ observed and predicted by the edge $n_i$ at time $t$. To capture and infer the content request demand, simply logging the number of content requests is not sufficient. We utilise the content predictive analytics proposed in [3] to capture the spatial-temporal locality of content requests more accurately and responsively. Specifically, $p_{i,k}^t$ is resolved by the combination of temporal (request frequency, recency, betweenness [3]) and spatial content heuristics [3] in order to allow CognitiveCache to capture and predict the locality trend of content request patterns over time in different locations and avoid losing valuable contents by reducing the caching of one-timer contents [3]. The input state of an edge includes its own observed state and its neighbours' states, denoted as: (7)
Action space: At every epoch, after observing the input state of the environment (i.e. cache storage state and content request popularity), CognitiveCache takes an action based on its policy. The action $a_i^t$ of edge $n_i$ at time $t$ is defined as $a_i^t = \{a_{i,0}^t, a_{i,1}^t, \dots, a_{i,k}^t\}$ (8), in which, at every iteration, edge $n_i$ has to decide a list of contents to be cached ($a_{i,k}^t = 1$) and a list of contents not to be cached ($a_{i,k}^t = 0$), or to be removed if the cache storage is full. Each edge decides the best actions based on the input state. CognitiveCache seeks high entropy in its policy to explicitly encourage exploration, assigning roughly equal probabilities to actions that have nearly the same Q-values. This also prevents CognitiveCache from repeatedly selecting a particular caching action that could exploit some inconsistency in the approximated Q function.
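The state design above can be illustrated with a small sketch that assembles the per-edge state from a cache indicator list and a popularity list, then concatenates the neighbours' states to form the input state. The function names, the flat list encoding and the toy catalogue are assumptions made here for illustration; the paper's own encoding may differ.

```python
from typing import Dict, Iterable, List, Set


def edge_state(cached: Set[str],
               popularity: Dict[str, float],
               catalogue: List[str]) -> List[float]:
    """Build the state of one edge as [x_0..x_k, p_0..p_k] over the catalogue."""
    # x: 1.0 if the content is currently cached at this edge, else 0.0.
    x = [1.0 if k in cached else 0.0 for k in catalogue]
    # p: observed/predicted popularity, defaulting to 0.0 for unseen content.
    p = [popularity.get(k, 0.0) for k in catalogue]
    return x + p


def input_state(own: List[float],
                neighbours: Iterable[List[float]]) -> List[float]:
    """Concatenate an edge's own state with the states of its neighbours."""
    flat = list(own)
    for state in neighbours:
        flat.extend(state)
    return flat


# Hypothetical usage with a three-item catalogue and one neighbour.
catalogue = ["video_a", "map_b", "news_c"]
own = edge_state({"video_a"}, {"video_a": 0.5, "news_c": 0.2}, catalogue)
nbr = edge_state({"map_b"}, {"map_b": 0.9}, catalogue)
observation = input_state(own, [nbr])
```

An action for this catalogue would then be a binary list of the same length as the catalogue, with 1 marking content to cache and 0 marking content not to cache.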
Reward space: After taking the caching actions, each CognitiveCache edge receives a reward $r_i^t$. We define the reward $r_i^t$ of edge $n_i$ at time $t$ after taking a list of actions as: (9)
in which $p_{i,k}^t$ is the popularity of contents in the next epoch and $u_{i,j,k}^t$ is the beneficial utility value when caching content $o_k$ at the edge $n_j$ for the edge $n_i$. As shown in Equation 4, $u_{i,j,k}^t$ is inversely proportional to the additive value of latency and transmission cost. $u_{i,i,k}^t$ means a local cache hit. In Equation 9, we consider the rewards of both the local edge and its neighbours when improving the caching policy. This is because the local edge can serve requests from its neighbours and acquire some reward value. In our model, we assume $\alpha > \beta$, i.e. the local edge reward has a higher weight than the neighbour reward, so the policy update is driven more towards local cache hits, which have lower latency and transmission cost for delivering content to subscribers.
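To make the weighting between local and neighbour rewards concrete, the sketch below assumes one possible form of the reward: a weighted sum of popularity times utility over the contents an edge has chosen to cache, with the local term weighted more heavily than the neighbour term. This form, the function names and the weight values are assumptions for illustration only and are not the paper's Equation 9. The weights are named `w_local` and `w_neigh` here to avoid confusion with the latency and cost weights of Equation 4.

```python
from typing import Iterable, Tuple

# Each term is (action, popularity, utility) for one content item.
Term = Tuple[int, float, float]


def utility(latency: float, cost: float,
            w_latency: float = 0.5, w_cost: float = 0.5) -> float:
    """Utility as the inverse of weighted latency plus transmission cost."""
    return 1.0 / (w_latency * latency + w_cost * cost)


def edge_reward(local_terms: Iterable[Term],
                neighbour_terms: Iterable[Term],
                w_local: float = 0.7,
                w_neigh: float = 0.3) -> float:
    """Assumed reward: weighted sum of local and neighbour-serving benefit."""
    if w_local <= w_neigh:
        raise ValueError("local weight is assumed to exceed neighbour weight")
    # Benefit from requests served out of the edge's own cache.
    local = sum(a * p * u for a, p, u in local_terms)
    # Benefit from serving requests that originate in neighbouring areas.
    neigh = sum(a * p * u for a, p, u in neighbour_terms)
    return w_local * local + w_neigh * neigh
```

Because the local weight is larger, two contents with equal popularity and utility contribute more reward when they satisfy local requests than when they satisfy neighbour requests, which matches the stated preference for local cache hits.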

2) COGNITIVECACHE ARCHITECTURAL OVERVIEW AND PSEUDO-CODE
In our context with multiple edges (e.g. BSs, APs, mobile devices, vehicles, etc.), a single centralised learning agent is not applicable: every individual edge should have its own caching policy driven by its observed user content request patterns, and a single central agent has limited scalability because the action space grows explosively with the number of distributed edges [18], [56]. Fully-cooperative approaches also suffer from poor scalability and stability [18]. Therefore, we propose the CognitiveCache framework based on multi-agent independent DRL [38], where each CognitiveCache edge adaptively learns its own caching strategy while collaborating with its neighbours, such that the input state of an edge comprises its own state together with the states of its neighbours. Figure 4 shows the CognitiveCache multi-agent deep reinforcement learning-based caching framework.
We propose to utilise the soft actor-critic (SAC) method [5], [57] for multi-agent RL [18], [24], which optimises a stochastic policy in an off-policy manner. The actor network controls the caching decisions/actions, and the critic network evaluates the chosen caching decisions and gives feedback to update the caching policy. Soft actor-critic [5], [57] is more sample efficient and less brittle in convergence than other approaches such as [56]. We base our work on [5], [57], which was originally designed for continuous action spaces, to provide an alternative version of the SAC algorithm that is applicable to discrete action settings. In addition to searching for maximum rewards, the SAC algorithm maximises the entropy of the policy.
Regarding the architecture of the deep neural network for pre-training the actor and critic networks, we utilise a recurrent neural network (RNN), more specifically long short-term memory (LSTM) [40], as shown in Fig. 5. LSTM is a state-of-the-art learning model that is typically used for time series prediction. This allows CognitiveCache to capture and explore the hidden temporal patterns in users' content requests as well as to address the problem of a large input space compared to other traditional deep neural networks [56]. We describe the SAC objective function, consisting of both a reward term and an entropy term [5], [57], in Equation 10 as follows:
$$\pi^{*} = \arg\max_{\pi} \sum_{t=0}^{T} \mathbb{E}_{(s_i^t, a_i^t) \sim \tau_\pi}\left[\gamma^{t}\left(r_i^t + \alpha H(\pi(\cdot|s_i^t))\right)\right] \quad (10)$$
where $\pi$ is a policy, $\gamma \in [0, 1]$ is the discount rate, $\tau_\pi$ is the distribution of trajectories induced by policy $\pi$, and $\alpha$ is the parameter indicating the importance of the entropy term versus the reward [5], [57]. $H(\pi(\cdot|s_i^t))$ is the entropy of the policy $\pi$ at state $s_i^t$, with $H(\pi(\cdot|s_i^t)) = -\log \pi(\cdot|s_i^t)$. In the actor-critic based caching approach, the actor network controls how the edge behaves by learning the optimal policy: it takes the state as input and outputs the best caching decisions. The critic network evaluates the action by computing the value function. CognitiveCache utilises SAC, which makes use of three functions: a state value function $V$, a soft Q-function $Q$, and a policy function $\pi$. We train the three function approximators in line with [5]. The soft state value function for the discrete action space [5] is given in Equation 11. We train the soft Q-function, parameterized by $\theta$, by minimising the error function in Equation 12 [5], where $D$ is the experience replay buffer [5], [57]. The policy is then updated in a direction that maximises the potential rewards. Finally, we train the policy network $\pi$, parameterized by $\phi$, by minimising the error function in Equation 13 [5]. We provide CognitiveCache pseudo-code in Table 2. CognitiveCache updates all the network functions of every individual edge during each epoch in an experience-replay manner.
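The three quantities trained above can be sketched for a single state with a finite set of caching actions. The forms used here are the ones commonly given for discrete-action SAC and are an assumption, not a recovery of the paper's Equations 11-13: a soft value $V(s) = \sum_a \pi(a|s)\,[Q(s,a) - \alpha \log \pi(a|s)]$, a squared soft Bellman residual for the Q-function, and a policy loss $\sum_a \pi(a|s)\,[\alpha \log \pi(a|s) - Q(s,a)]$. Function names and the example numbers are hypothetical.

```python
import math

def soft_value(probs, q_values, alpha):
    """V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s))."""
    return sum(p * (q - alpha * math.log(p))
               for p, q in zip(probs, q_values) if p > 0.0)

def q_loss(q_sa, reward, next_value, gamma):
    """0.5 * (Q(s,a) - (r + gamma * V(s')))^2 for one replayed transition."""
    target = reward + gamma * next_value
    return 0.5 * (q_sa - target) ** 2

def policy_loss(probs, q_values, alpha):
    """sum_a pi(a|s) * (alpha * log pi(a|s) - Q(s,a)); for one state this equals -V(s)."""
    return sum(p * (alpha * math.log(p) - q)
               for p, q in zip(probs, q_values) if p > 0.0)

# Hypothetical example: two actions (drop, cache) with a uniform policy.
probs, q_values, alpha = [0.5, 0.5], [1.0, 3.0], 0.2
v = soft_value(probs, q_values, alpha)
print(v, policy_loss(probs, q_values, alpha), q_loss(1.0, 0.5, 2.0, 0.99))
```

In a full implementation these scalars would be averaged over minibatches drawn from the replay buffer $D$ and minimised with respect to $\theta$ and $\phi$ by gradient descent; the next-state value is often computed from a slowly updated target network.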
After the actor-critic based training of CognitiveCache, the actor network can be utilised to make caching decisions for every single edge. More specifically, the CognitiveCache framework consists of two phases: 1) Offline training: the actor and critic networks are constructed and pre-trained with a sufficient number of historic transition samples in order to obtain good initial parameters for phase 2. 2) Online control: starting from the set of parameters bootstrapped in phase 1, in each epoch $t$, if the requested content is already cached (local cache hit), the edge immediately sends the requested content to the subscribers. If the requested content is not cached, the edge observes the state $s_i^t$ of itself and its neighbours, resolved based on [3], and obtains the Q-value from the actor-critic networks. Then, a list of actions $a_i^t$ is selected based on the policy $\pi$, deciding whether to cache the content or evict/drop it. The CognitiveCache edge is encouraged to explore different possible actions by assigning equal probabilities to actions that have the same or close Q-values. After the action $a_i^t$ is executed, the edge observes the reward $r_i^t$ and the next state $s_i^{t+1}$, based on which the action policy is updated for the next epoch $t+1$. The transition $(s_i^t, a_i^t, r_i^t, s_i^{t+1})$ is stored in the CognitiveCache memory at the end of each time period.
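The online control phase can be sketched as a small request handler with a replay memory. This is a simplified illustration under stated assumptions: `EdgeAgent`, its `policy` callable (returning a probability of caching for a given state) and the arbitrary eviction step are hypothetical placeholders, whereas in the paper the trained actor network decides both what to cache and what to drop.

```python
import random
from collections import deque

class EdgeAgent:
    def __init__(self, capacity, policy, buffer_size=10000, seed=0):
        self.capacity = capacity
        self.cache = set()
        self.policy = policy                      # hypothetical: state -> P(cache)
        self.replay = deque(maxlen=buffer_size)   # bounded experience replay memory
        self.rng = random.Random(seed)

    def handle_request(self, content, state):
        """Return True on a local cache hit; otherwise make a caching decision."""
        if content in self.cache:
            return True                           # serve immediately from the local cache
        # Sample the caching action from the stochastic policy.
        if self.rng.random() < self.policy(state):
            if len(self.cache) >= self.capacity:
                self.cache.pop()                  # placeholder: evicts an arbitrary content
            self.cache.add(content)
        return False

    def store(self, state, action, reward, next_state):
        """Record the transition (s, a, r, s') for later experience-replay updates."""
        self.replay.append((state, action, reward, next_state))

# Hypothetical usage: a policy that always caches, on an edge holding one content.
agent = EdgeAgent(capacity=1, policy=lambda state: 1.0)
print(agent.handle_request("bus-schedule", state=None))  # miss, then cached
print(agent.handle_request("bus-schedule", state=None))  # local hit
```

Using a bounded `deque` keeps the memory footprint of each edge fixed, with the oldest transitions discarded first once the buffer is full.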

IV. EVALUATION
This section provides a rich multi-criteria evaluation of CognitiveCache, first describing realistic experimental datasets with a case study, then introducing a set of benchmark and state-of-the-art caching policies as competitive caching algorithms.
We use Foursquare [26] and Twitter [28] datasets as real traces to drive user content requests in two very different network scenarios: New York [26] and London [28]. The Foursquare New York dataset [26] was collected through the location-based service Foursquare API (https://developer.foursquare.com/) and describes the spatial-temporal locality of content requests in terms of user interests at public venues. We assume the contents represent 14,550 requests of 5101 users in different locations of New York City during a period of one week. Each record is associated with its timestamp, its GPS coordinates and its semantic meaning.
Similarly, the Twitter London data [28] was collected over one week and contains 15,602 geotagged tweets posted by 5869 users. Each tweet consists of a timestamp and GPS coordinates (latitude and longitude).
We focus on a particular edge-cloud community of users: students on large university campuses in New York and London. Students have heterogeneous mobility patterns, interests and content request usage. This gives us statistically sufficient diversity to evaluate the performance of CognitiveCache and competitive caching protocols in different contexts. Without loss of generality, we split the chosen area into multiple small grids of 1 km × 1 km and assume that a CognitiveCache edge positioned at the centre of each grid serves the content requests coming from its area. Students' requests may range from texts (e.g. bus schedule information) to pictures and videos. Based on our trace analysis, the requested contents are highly skewed, so that a small number of contents are requested more frequently by the users/subscribers. This is because the students may share some common interests such as bus schedules, travel, nightlife, restaurants, shopping, cinema, etc. Tables 5 and 6 show the content topic distribution in the New York and London scenarios: the three most popular topics in New York are college, coffee shop and subway, while those in London are transportation, college and coffee shop. As shown in Figures 5-7, content requests show a certain level of locality in each area. Moreover, our trace analysis shows certain similarities among neighbouring edges, which can be leveraged for collaboration between the edges.
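The assignment of geotagged requests to 1 km × 1 km grids can be sketched as below. This is a minimal illustration assuming an equirectangular approximation, which is adequate at city scale; the origin coordinates, the constant `KM_PER_DEG_LAT` and the function name `grid_cell` are hypothetical choices, not details given in the paper.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude in km

def grid_cell(lat, lon, origin_lat, origin_lon, cell_km=1.0):
    """Map a GPS point to the (row, col) of a cell_km x cell_km grid anchored at the origin."""
    north_km = (lat - origin_lat) * KM_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    east_km = (lon - origin_lon) * KM_PER_DEG_LAT * math.cos(math.radians(origin_lat))
    return math.floor(north_km / cell_km), math.floor(east_km / cell_km)

# Hypothetical origin in lower Manhattan; the point lies about 1.5 km to the north.
print(grid_cell(40.7135, -74.02, origin_lat=40.70, origin_lon=-74.02))
```

Each request is then served by the edge assumed to sit at the centre of the cell it falls into, and neighbouring edges are the edges of adjacent cells.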
We design the CognitiveCache learning model using Python and TensorFlow [23], running on a machine with a GTX 1050 Ti GPU card, an Intel i7 3.6 GHz CPU and 16 GB of memory. As shown in Table 3, we set the discount factor to 0.99 and the learning rate for both the actor and critic networks to 1e-4. The hidden layers' size is 256. The number of iterations is 20000. We use 70% of the data for training and 30% for evaluation. We perform the evaluation across a range of criteria: local edge (cache) hit ratio, neighbour (cache) hit ratio, latency and transmission cost, in the face of vastly different mobility, workloads and content traffic patterns, against multiple state-of-the-art and benchmark protocols: LRU [9], LFU [10], ProbCache [60] and DQNCache [55]. Least frequently used (LFU) [10] measures the frequency of content requests at a caching node to make caching decisions. Content that receives more frequent requests is cached because it has a higher chance of being requested again in the future. Least recently used (LRU) [9] records the timestamp of each content request locally in order to make caching decisions based on how recent the content requests are. ProbCache [60] keeps a copy of content in a cache along a path with a probability that is calculated based on path lengths, and multiplexes content flows. DQNCache [55] is a deep reinforcement learning-based caching approach for hierarchical content delivery networks. DQNCache relies on Deep Q-Networks to learn the optimal caching policy in an online manner. It is a value-based algorithm: it tries to find the optimal value function, a mapping from actions to values, and selects the actions with higher values.
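The two benchmark policies described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the evaluation: the class names are hypothetical, every missed content is admitted, capacity is assumed to be at least 1, and the LFU variant keeps request counts for contents that are not currently cached.

```python
from collections import Counter, OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()            # ordered from least to most recently used

    def request(self, key):
        hit = key in self.items
        if hit:
            self.items.move_to_end(key)       # mark as most recently used
        else:
            if len(self.items) >= self.capacity:
                self.items.popitem(last=False)  # evict the least recently used content
            self.items[key] = True
        return hit

class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = set()
        self.freq = Counter()                 # request counts, kept for all contents seen

    def request(self, key):
        self.freq[key] += 1
        hit = key in self.items
        if not hit:
            if len(self.items) >= self.capacity:
                victim = min(self.items, key=lambda k: self.freq[k])
                self.items.discard(victim)    # evict the least frequently requested content
            self.items.add(key)
        return hit

lru, lfu = LRUCache(2), LFUCache(2)
print([lru.request(k) for k in "abacb"])
print([lfu.request(k) for k in "aabca"])
```

Both policies use only local request history, which is why they cannot exploit the neighbour state or predicted popularity that CognitiveCache and DQNCache learn from.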
Local edge (cache) hit ratio is the proportion of requests served directly from the local edge's cache. Neighbour (cache) hit ratio is the proportion of requests served indirectly from the neighbour edges; together, the local edge hit ratio and the neighbour hit ratio show the proportion of content requests satisfied by the edges instead of the CDN server. Latency means the end-to-end average latency of all content requests in a time period. Note that we assume the transmission latency from any edge to the CDN is 3 times the latency between any two neighbouring edges. Transmission cost is the total traffic cost of forwarding requests to, and receiving contents from, neighbour edges or the CDN server. We describe our experiments with increasing cache capacity, which is the storage capacity constraint defining the maximum number of contents that an edge can cache. A smaller cache capacity forces the edge to be more selective about cached contents and thus requires more accurate and more robust caching algorithms. All experiments are repeated ten times and averaged. The detailed simulation parameters are shown in Table 4.
Figure 6 shows the overall temporal trends of users' content traffic during weekdays and the weekend: if content is requested at a certain interval of time, it is highly likely that it will be requested again in the near future. Contents are not requested randomly and independently over time but at a certain time interval, before their popularity gradually fades out. Foursquare New York experiences surges of traffic at peak hours during weekdays and a low average number of requests during the weekend. In contrast, Twitter London has a more uniform distribution of content traffic, although it still experiences surges of traffic twice a day throughout the week. Figure 7a and Figure 7b show the user content request distribution in New York Foursquare [26] and London Twitter [28].
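The hit-ratio and latency metrics defined in this section can be computed from a log of per-request outcomes, as in the sketch below. The latency values for local and neighbour hits are hypothetical placeholders; only the assumption that the edge-to-CDN latency is 3 times the neighbour latency comes from the text.

```python
def evaluate(outcomes, neighbour_latency=10.0, local_latency=1.0):
    """Summarise a list of per-request outcomes: 'local', 'neighbour' or 'cdn'."""
    n = len(outcomes)
    latency = {
        "local": local_latency,              # hypothetical value for a local hit
        "neighbour": neighbour_latency,      # hypothetical value for a neighbour hit
        "cdn": 3 * neighbour_latency,        # edge-to-CDN latency, as assumed in the text
    }
    return {
        "local_hit_ratio": outcomes.count("local") / n,
        "neighbour_hit_ratio": outcomes.count("neighbour") / n,
        "avg_latency": sum(latency[o] for o in outcomes) / n,
    }

# Hypothetical log: 6 local hits, 3 neighbour hits and 1 request sent to the CDN.
print(evaluate(["local"] * 6 + ["neighbour"] * 3 + ["cdn"]))
```

The share of requests that reach the CDN is one minus the sum of the two hit ratios, so raising either hit ratio lowers both average latency and the traffic sent to the CDN server.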
We show that the locations of mobile subscribers imply different degrees of similarity in content requests. Subscribers within a community, or from two clusters that are relatively close to each other, are more likely to have similar content request patterns than subscribers who are far apart or in distant regions. This captures the interplay between the geographical diversity of the users and their content request patterns. New York Foursquare closely follows a power-law degree distribution [13], such that there is a small but important number of nodes which are highly connected and a trailing tail of nodes with very few connections [13]. In London Twitter, although the node degrees still follow the power-law model, the topology has a more uniform connectivity distribution compared to New York Foursquare, such that popular nodes may become far less popular, and newly emerging nodes may become highly popular, in a very short time. London Twitter has shorter average paths and lower clustering compared to New York Foursquare. It also has a lower publisher-subscriber density than New York Foursquare (35.1 users/km² and 64.8 users/km² respectively).
We investigate the influence of edge caching storage capacity on a range of metrics. We vary the edge caching storage capacity from 200 contents (equivalent to 1% of the total content population) to 1200 contents (equivalent to 6% of the total content population). Figure 8 and Figure 9 show the cache hit ratio of highly popular contents in the local edge and in neighbour edges for both the New York and London scenarios. We show that CognitiveCache outperforms all other competitive approaches, improving the cache hit ratio by more than 52% for New York Foursquare and 88% for London Twitter. CognitiveCache performs well both in New York Foursquare, a high-degree scale-free network community where a small number of contents are requested very often, and in the London Twitter scenario, where the structure of the community and user requests changes significantly over time. A larger cache space leads to a bigger gap between CognitiveCache and the other approaches. CognitiveCache allows 91% of highly popular content to be cached and served to subscribers within the local edge. When the local cache space is relatively small, highly popular contents which are not cached locally are redirected and served via the neighbour edges rather than being sent to the CDN server. This is because CognitiveCache is able to collaboratively learn the time-series and spatial content request patterns from historical observations while taking into account the states of the edge itself and of its neighbour edges. The neighbour hit ratio of CognitiveCache decreases when the capacity increases, because most highly popular requests are served locally when the cache space is increased. As a result, the percentage of neighbour cache hits is expected to fall. CognitiveCache outperforms single-agent DQNCache [55] in both the New York and London scenarios, as DQNCache [55] only aims to maximise the individual performance of each edge without considering the state of its neighbours.
The probability-based caching algorithm ProbCache [60] performs better than the benchmark LRU and LFU caching algorithms.
Figure 10 and Figure 11 show the cache hit ratio in the local edge and in neighbour edges for low-popularity contents in the New York and London scenarios. We show that while CognitiveCache allows a majority of highly popular contents to be served by the local edge, it also achieves a cache hit ratio of 80% for low-popularity contents in both New York Foursquare and London Twitter, while DQNCache, ProbCache, LRU and LFU achieve 51%, 33%, 27% and 21% respectively. CognitiveCache preserves most of its cache space for predicted highly popular contents while still being able to serve 39% of low-popularity requests locally and an additional 41% via the caches of its collaborating neighbour edges. When the cache space gets larger, each CognitiveCache edge considers not only its local hits but also offers a proportion of its cache space to serve its neighbours. DQNCache, ProbCache, LRU and LFU have poor caching performance for low-popularity content, especially in the London scenario with highly dynamic mobile subscribers and user requests.
Figure 13 shows the average latency in the local edge and in neighbour edges for low-popularity contents in the New York and London scenarios. CognitiveCache reduces latency by 33%, 47%, 66% and 71% compared with DQNCache, ProbCache, LRU and LFU respectively. In Figures 8-11, we show that CognitiveCache can successfully serve 91% of highly popular contents within 9 ms and 80% of low-popularity contents within 19 ms for the New York scenario, while the corresponding values for London are 15.7 ms and 19.4 ms respectively. CognitiveCache achieves lower delay than competitive caching protocols because it benefits from its well-identified states, consisting of its predictive request demand and its caching state, which allow it to learn and adapt faster and more accurately to dynamic changes in mobile subscribers and their requests.
Figure 14 shows the average transmission cost of sending requests and receiving contents from the local edge, neighbour edges or the CDN server in the New York and London scenarios. Note that we assume that the transmission cost of a local hit is lower than that of a neighbour hit, which in turn is lower than that of retrieval from the CDN server. We show that CognitiveCache reduces transmission cost by 23%, 75%, 83% and 87% compared with DQNCache, ProbCache, LRU and LFU respectively. This is because CognitiveCache is able to serve a majority of highly popular content locally while allowing low-popularity content to be cached in neighbour edges, thus minimising the number of requests sent to the CDN server. Since obtaining contents, especially video, from a local edge or even a neighbouring edge is quicker and more cost-beneficial than obtaining them from the CDN server, CDN content retrieval should have the lowest priority.
Figure 15 shows how CognitiveCache captures, predicts and responds to user requests in real time in two very different scenarios: New York Foursquare and London Twitter. The historical content request patterns offer valuable resources for our data-driven caching solution: we show that CognitiveCache can capture and predict the temporal-spatial locality of users' content requests, which is leveraged for highly accurate, responsive and cost-effective caching decisions. CognitiveCache is more responsive to the rising trend of newly popular contents and the fading out of older contents over time, and it avoids caching one-timer contents [47] and mitigates the flash crowd effect [3].

V. CONCLUSION
In this paper, we proposed the CognitiveCache caching framework based on multi-agent deep reinforcement learning, which can tackle the complex challenges of bringing contents as close as possible to mobile users and improve the quality of service in mobile edge-cloud networks. We designed a novel space of states and actions to leverage the temporal-spatial locality of content requests, which enables more accurate and responsive caching decisions. We evaluated our CognitiveCache proposal against benchmark and competitive caching models, DQNCache [55], ProbCache [60], LRU [9] and LFU [10], over two very different real-world network topologies: New York [26] and London [28]. We showed that our caching framework consistently outperforms the benchmark and state-of-the-art algorithms, increasing the cache hit ratio while minimising latency and transmission costs.
In future work, we plan to explore and propose a novel incentive mechanism that incentivises all edges in the network to collaborate and share their caching space, while addressing fairness [6] and security concerns and avoiding the selfish and malicious behaviours of users in the real world. In addition, we will investigate a new privacy-aware and energy-aware CognitiveCache by building on and extending the works in [32], [44] for edge privacy awareness and [36], [50] for edge energy efficiency. Furthermore, while complex ML/RL techniques and algorithms help to analyse a huge amount of historical data to gain deeper insight into network environments, predictive analytics and heuristic-based approaches [1]-[3], [29] allow predictive adaptive response to changing local conditions in real time and at low cost. This opens up opportunities for future work to innovate and redefine ML/RL-based caching algorithms assisted by real-time predictive analytics and heuristics, to improve the accuracy, scalability and efficiency of content services in mobile heterogeneous networks.