Introduction
The Internet of Things (IoT) is an emerging domain dedicated to connecting ubiquitous objects to the Internet, and the number of connected devices is forecast to reach 28 billion by 2021 [1]. As IoT requirements evolve, the grid paradigm is also changing, and the distributed concept is forcing traditional grids to adapt to the new situation [2]. Smart grids have replaced traditional networks by using distributed power control and communication technologies (such as 5G) to improve operational efficiency [3]. Distributed smart grids integrate many IoT devices and upload information to the Internet in time to avoid problems such as failures and capacity limitations. Many service concepts have been introduced into smart grids, including smart meters (SMs), advanced metering infrastructure (AMI), distributed generators (DGs), and so on [4]. The service communication network can be represented by a hierarchical multi-layer architecture composed of the user local area network (IAN), the neighborhood area network (NAN), and the wide area network (WAN), classified by data rate and coverage [5].
In smart grids, synchronous power grid monitoring tends to operate with high precision to realize real-time fault monitoring. However, the stringent delay requirements pose a significant challenge to the communication infrastructure [6]. Some mission-critical applications have tight delay constraints; for example, distribution automation deployed in substations requires information transmission within 4 ms. Other devices, such as some smart meters, send data at long intervals, e.g., every 15 minutes [7]. In order to reduce service delay and support service-oriented resource allocation in smart grids, edge computing (EC) and deep reinforcement learning (DRL) are introduced.
A. Smart Grids and Edge Computing
EC provides a distributed paradigm for computing and caching in smart grids by deploying servers at the edge. In order to relieve the burden imposed by the growing number of power devices, edge nodes (ENs) are given the ability to perform computing and caching at the edge, allowing services to be handled by edge servers deployed close to users [8]. A multi-layer cloud radio access network has been designed, and a cooperative resource allocation algorithm has been proposed to reduce service delay in edge networks and optimize throughput [9]. Heuristic algorithms [10], greedy algorithms [11], and game theory [12] have also been applied to optimize resource allocation in smart grids.
However, many EC problems remain to be tackled, such as 1) which EN handles a task; 2) whether an EN offloads tasks to the cloud or receives tasks from the cloud; and 3) how to allocate resources for these tasks in smart grids. In the EN resource allocation problem, it is necessary not only to optimize communication, computing, and caching resources, but also to select the appropriate EN, so many researchers model this problem as NP-hard. Q. He et al. argued that edge user allocation (EUA) is NP-hard and proposed a game-theoretic approach to solve it in [13]. In order to meet the delay requirements of users, T. Ouyang et al. used Lyapunov optimization to decompose the complex problem into real-time optimization problems that were NP-hard and proposed a Markov approximation algorithm to solve them [14]. The state and action spaces grow exponentially with the number of user requests and devices. In addition, due to the dynamics of the network environment, e.g., the requested contents and the locations of the devices, resource allocation in smart grids can no longer be tackled by traditional one-shot optimization algorithms. Therefore, in the next subsection, DRL is applied to tackle this problem.
B. Smart Grids and Deep Reinforcement Learning
The expansion of smart grid scale and the rapid growth of the number of users have led smart grids to operate in more uncertain and complex environments. Because of these uncertain factors, traditional methods cannot keep pace with the development of smart grids or the requirements of customers. On the other hand, the extensive deployment of AMI [15], WAMS [16] and power system nodes produces massive data, which can not only provide a data basis for DRL training but also reduce the impact of uncertain factors. Thanks to its robust learning ability and interaction with the environment, DRL can extract information from large amounts of data and make adaptive decisions [17]. The information from users and devices is often not available in advance, but DRL can make offloading decisions without prior experience [18]. Agent strategies can be divided into single-agent and multi-agent strategies. Single-agent algorithms, such as Q-learning and DQN, rely on an experience replay buffer. In a multi-agent system, the agents update their policies in parallel, which enables the use of replay buffers to train independent learners [19], or a single globally shared network is trained to output different policies for homogeneous agents [20]. Both cooperative multi-agent DRL [21] and non-cooperative methods [22] can be used to optimize resource allocation strategies.
The uncertainty and complex operating environment of smart grids bring challenges to the application of DRL. For example, the different control methods and constraints of power devices make the model more complex, and multiple entities with different objectives coexist in smart grids, which makes the design of reward functions more difficult [23]. In view of these issues, several designs and modifications are required to adapt to different scenarios: (1) the reward function determines the efficiency of the algorithm, so it needs to be designed according to the actual problem; (2) information sharing between the state space and the action space determines the efficiency of decision-making; (3) the scheduling and updating strategies of the agents need to adapt to the service characteristics of smart grids.
This paper focuses on the joint optimization of computing and caching resource allocation, which requires the ENs to have communication, computing, and caching capabilities so that they can process services independently. A resource allocation model for smart grids based on EC is proposed. In this model, the service delay consists of the network transmission delay and the computing delay. In order to reduce the service delay, an algorithm based on DRL is proposed to explore the optimal resource allocation strategy. The main technical contributions are summarized as follows:
We first design an EC system framework with three layers in smart grids that includes the service layer, edge layer, and cloud layer, where the objective of the proposed framework is to minimize the total service delay and obtain the optimal resource allocation strategy under the varying service requirements.
An agent polling update deep reinforcement learning (APUDRL) algorithm is proposed to optimize the communication, computing, and caching strategy. It combines a neural network with a polling-based DRL method and explores the strategy in the designed environment. To handle a large number of services with different delay requirements (millisecond, second, and minute), the constraints of the optimization problem are mapped to penalty factors in the reward function. The SumTree sampling method is used to improve the efficiency of data sampling, and the separation of the value function and the advantage function is used to improve the learning efficiency.
The numerous services in smart grids bring huge cache pressure, so a cache update strategy that considers both content popularity and cache time, namely popularity and cache time (PACT), is proposed to improve the cache hit rate. The content popularity represents the frequency of users' requests for the content, and the cache time represents the length of time the content has been cached.
Extensive simulations are performed under varying scenarios in the proposed EC system in order to verify the effectiveness of the proposed system model and the APUDRL algorithm. The numerical simulation results show that the proposed algorithm outperforms three baseline algorithms by at least 72.85%, 61.65%, and 58.84%, respectively. The cache hit rate of PACT also surpasses that of the two baseline schemes.
The remainder of the paper is organized as follows. The system model is introduced in Section II. The APUDRL algorithm is proposed in Section III. In Section IV, simulation results demonstrate the efficiency and adaptability of the proposed algorithm. Finally, the conclusion is drawn in Section V.
System Model and Problem Formulation
The system structure is shown in Fig. 1, where an EC system framework with three layers in smart grids is considered. The service layer is composed of users and power devices and is used to send and receive services. Based on a variety of smart grid use cases and selected standards, three services with delay tolerances of milliseconds, seconds, and minutes are considered, which form the service set $\mathcal{S}^{\mathrm{ser}}$.
In this paper, the delay requirement refers to the acceptable window of delay from the transmitter to the receiver. It is assumed that each user can send two types of services with different delay tolerances in the service layer. The edge layer represents the ENs with specific communication, computing, and caching capabilities, which are regarded as agents. The cloud layer represents the network structure of the cloud, which includes the service request list (SRL), the deployment of ENs, and cloud control (CC). These components have communication, computing, and storage capacities and can cooperate to complete cloud services.
The deployment location of the ENs determines the services they carry out. For example, the nodes deployed in substations perform services such as transmission line monitoring and power dispatching automation, while the nodes deployed on the device side can assume the role of smart meters and perform services such as data transmission and service analysis. When the amount of data or the number of services increases, an edge node may be limited by its own capacity and fail to meet the service delay, which violates the original intention of introducing edge computing. Therefore, it is necessary to optimize the resources of edge nodes to reduce the service delay under limited resources.
By deploying ENs close to users, services are handled by ENs first. There are $K$ ENs, and the capability of EN $k$ is characterized by its caching, computing, communication, and position attributes, expressed as \begin{equation*} \mathcal {C}_{k}=[C^{\mathrm {cac}}_{k},C^{\mathrm {cop}}_{k},C^{\mathrm {com}}_{k},C^{\mathrm {pos}}_{k}].\tag{1}\end{equation*}
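For concreteness, the capability vector in (1) can be represented as a simple data structure. The following minimal Python sketch is illustrative only; the field names and units are our own assumptions rather than notation from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class EdgeNode:
    """Capability vector C_k of an EN, following Eq. (1).

    Units are assumptions: bits for caching, CPU cycles/s for computing,
    Hz for communication bandwidth, and (x, y) metres for position.
    """
    cache_capacity: float              # C_k^cac
    compute_capacity: float            # C_k^cop
    bandwidth: float                   # C_k^com
    position: Tuple[float, float]      # C_k^pos

# Example EN: 500 MB cache (4e9 bits), 2 GHz CPU, 20 MHz bandwidth
en = EdgeNode(cache_capacity=4e9, compute_capacity=2e9,
              bandwidth=20e6, position=(120.0, 80.0))
```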
A. Communication Model
There are $U$ user equipments (UEs) in the service layer. The signal-to-interference-plus-noise ratio (SINR) of the link between UE $u$ and EN $k$ at time $t$ is \begin{equation*} \mathrm {SINR}_{u,k}=\frac {g_{u,k}(t)p_{u,k}(t)}{\sigma ^{2}_{u}(t) +\sum \limits _{i=1,i\neq {k}}g_{u,i}(t)p_{i}(t)},\tag{2}\end{equation*} where $g_{u,k}(t)$ is the channel gain, $p_{u,k}(t)$ is the transmit power, and $\sigma^{2}_{u}(t)$ is the noise power.
The total bandwidth of EN $k$ is $B_{k}$, which is divided into $M$ sub-channels, and $b_{u,k,m}$ denotes the bandwidth of sub-channel $m$ allocated to UE $u$. Under the finite-blocklength regime, the achievable rate of UE $u$ served by EN $k$ is \begin{align*}&\hspace {-.5pc}r_{u,k}={\sum _{m=1}^{M}x_{u,k}^{\mathrm {sub}}[m]b_{u,k,m}} \cdot [\log _{2}(1+\mathrm {SINR}_{u,k}) \\&-\,\sqrt {{V_{u,k}}/{L_{u,k}}}f_{Q}^{-1}(\varepsilon)],\tag{3}\end{align*} where $x_{u,k}^{\mathrm {sub}}[m]$ is the binary sub-channel allocation indicator, $V_{u,k}$ is the channel dispersion, $L_{u,k}$ is the blocklength, $f_{Q}^{-1}(\cdot)$ is the inverse of the Gaussian Q-function, and $\varepsilon$ is the decoding error probability.
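As an illustration of (2) and (3), the short sketch below computes the SINR and the finite-blocklength rate. The channel-dispersion approximation and all numeric values are assumptions chosen for the example, not parameters taken from the paper.

```python
import math
from statistics import NormalDist

def q_inv(eps: float) -> float:
    """Inverse Gaussian Q-function: Q^{-1}(eps) = Phi^{-1}(1 - eps)."""
    return NormalDist().inv_cdf(1.0 - eps)

def sinr(gain: float, power: float, noise: float, interference: float) -> float:
    """SINR of the UE-EN link, Eq. (2)."""
    return gain * power / (noise + interference)

def achievable_rate(allocated_bw, snr, dispersion, blocklength, eps):
    """Short-packet achievable rate of Eq. (3) (illustrative units: bit/s).

    allocated_bw : sum over allocated sub-channels of x[m] * b[m], in Hz
    dispersion   : channel dispersion V_{u,k}
    blocklength  : blocklength L_{u,k}
    eps          : decoding error probability
    """
    penalty = math.sqrt(dispersion / blocklength) * q_inv(eps)
    return allocated_bw * (math.log2(1.0 + snr) - penalty)

# Example: 1 MHz allocated, 10 dB SINR, eps = 1e-5, L = 200 symbols
s = sinr(gain=1e-7, power=0.1, noise=1e-9, interference=0.0)   # SINR = 10
V = 1.0 - 1.0 / (1.0 + s) ** 2   # a common dispersion approximation (assumption)
print(achievable_rate(1e6, s, V, 200, 1e-5))
```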
B. Computation Model
The computing capability of EN $k$ is denoted by $C^{\mathrm{cop}}_{k}$, and $c_{u,k}$ denotes the computing resources that EN $k$ allocates to UE $u$.
If the service of UE $u$ is processed at EN $k$, the binary offloading indicator $x_{u,k}^{\mathrm{cop}}=1$; otherwise, the service is offloaded to the cloud, whose processing delay is $D_{u}^{\mathrm{cop}}$. The computation delay of UE $u$ is therefore \begin{equation*} d_{u,k}^{\mathrm {cop}}={d_{u,k}^{T}}+d_{u,k}^{\mathrm {P}}x_{u,k}^{\mathrm {cop}} +D_{u}^{\mathrm {cop}}(1-x_{u,k}^{\mathrm {cop}}),\tag{4}\end{equation*} where $d_{u,k}^{T}$ is the transmission delay and $d_{u,k}^{\mathrm{P}}$ is the processing delay at the EN.
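A minimal sketch of the computation-delay model in (4) follows; the millisecond values in the usage example are arbitrary assumptions.

```python
def computation_delay(d_trans: float, d_edge: float, d_cloud: float,
                      offload_to_edge: bool) -> float:
    """Computation-related delay of Eq. (4).

    d_trans  : transmission delay d^T_{u,k}
    d_edge   : processing delay at the EN, d^P_{u,k}
    d_cloud  : processing delay when the task goes to the cloud, D^cop_u
    offload_to_edge : the binary indicator x^cop_{u,k}
    """
    x = 1 if offload_to_edge else 0
    return d_trans + d_edge * x + d_cloud * (1 - x)

print(computation_delay(0.8e-3, 1.5e-3, 6.0e-3, offload_to_edge=True))   # 0.0023 s
print(computation_delay(0.8e-3, 1.5e-3, 6.0e-3, offload_to_edge=False))  # 0.0068 s
```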
C. Cache Model
A caching strategy can improve the communication efficiency among nodes. For example, monitoring, alarm, and control systems need to communicate with each other across multiple nodes. Since running the monitoring and alarm systems requires a large number of signals (information), the signal transmission incurs a high signaling cost. If popularity-based content is cached in the appropriate nodes, it can not only reduce the signaling cost but also improve the efficiency of information transmission.
ENs are connected to the cloud through the backhaul link, and the cache states are shared among the devices. The caching capacity of EN $k$ is denoted by $C^{\mathrm{cac}}_{k}$. The content popularity is modeled by a Zipf distribution with skewness parameter $\alpha_{k}$, so the expected number of requests for content $p$ at EN $k$ within a period with request arrival rate $\lambda$ is \begin{align*} E(n_{k,p}^{T})=&\lambda \cdot {f_{k,p}(\mathrm{Zipf})} \\=&\lambda \cdot \frac{p^{-\alpha_{k}}}{\sum_{i=1}^{P} i^{-\alpha_{k}}}.\tag{5}\end{align*}
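Assuming the standard Zipf form for the content popularity, the expected request count of (5) can be sketched as follows; the parameter values in the example are assumptions.

```python
def zipf_popularity(rank: int, num_contents: int, alpha: float) -> float:
    """Zipf request probability of the content ranked `rank` (1-based)."""
    norm = sum(i ** (-alpha) for i in range(1, num_contents + 1))
    return rank ** (-alpha) / norm

def expected_requests(rank: int, num_contents: int, alpha: float,
                      arrival_rate: float) -> float:
    """Expected number of requests E(n^T_{k,p}) over one period, Eq. (5)."""
    return arrival_rate * zipf_popularity(rank, num_contents, alpha)

# 100 contents, skewness alpha_k = 0.8, 1000 requests per period (assumed values)
print(expected_requests(rank=1, num_contents=100, alpha=0.8, arrival_rate=1000))
```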
SMs and AMI have different importance according to their deployment locations, so the centrality of device $k$ is defined as \begin{equation*} c(k) = \frac {{f({\mathcal{ D}}(k))}}{K},\tag{6}\end{equation*} where $\mathcal{D}(k)$ denotes the set of devices connected to device $k$.
The parameter $\tau(p)$ characterizes the cache time of content $p$ and is defined as \begin{equation*} \tau (p)= \frac {f(T(p))}{P},\tag{7}\end{equation*} where $T(p)$ is the length of time content $p$ has been cached and $P$ is the total number of contents.
The cache decision is jointly determined by the centrality of devices $c(k)$, the content popularity, and the cache time factor $\tau(p)$ defined above.
It is assumed that the cloud stores all the content and does not need to be updated. When the cached content of an EN reaches its maximum capacity, the buffer pool needs to be updated. Considering the requirements of some services on the cache time of content, such as video-based intelligent monitoring of general electric equipment [29], [30], a method combining popularity and cache time (PACT) is proposed to update the content. In PACT, the content popularity represents the frequency with which the content is requested, and the cache time represents the length of time the content has been cached, which can be expressed as \begin{equation*} \Pi ^{\mathrm {cac}}=T_{p}^{\mathrm {cac}}{\sum _{i=1}^{p}{\omega ^{\mathrm {pop}}_{k,p} [{\alpha _{k}}]}}/{\omega ^{\mathrm {pop}}_{k,i}[{\alpha _{k}}]}\sum _{i=1}^{p}{T_{i}^{\mathrm {cac}}},\tag{8}\end{equation*}
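To make the PACT update concrete, here is a minimal cache sketch whose eviction score combines request popularity with cache age. The specific scoring rule and the use of wall-clock time are illustrative assumptions and not the exact form of Eq. (8).

```python
import time

class PACTCache:
    """Minimal sketch of a PACT-style cache: eviction considers both content
    popularity and how long the content has been cached. The scoring rule
    below (normalized popularity divided by normalized age) is an assumption
    for illustration, not the exact metric of Eq. (8)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = {}                    # content_id -> (request_count, cached_at)

    def request(self, content_id) -> bool:
        """Return True on a cache hit; otherwise insert the content."""
        now = time.time()
        if content_id in self.store:
            count, cached_at = self.store[content_id]
            self.store[content_id] = (count + 1, cached_at)
            return True
        if len(self.store) >= self.capacity:
            self._evict(now)
        self.store[content_id] = (1, now)
        return False

    def _evict(self, now: float) -> None:
        total_count = sum(c for c, _ in self.store.values())
        total_age = sum(now - t for _, t in self.store.values()) or 1.0

        def score(item):
            count, cached_at = item[1]
            popularity = count / total_count
            age = (now - cached_at) / total_age
            return popularity / (age + 1e-9)   # old, unpopular content scores lowest

        victim = min(self.store.items(), key=score)[0]
        del self.store[victim]

cache = PACTCache(capacity=2)
for cid in ["meter_01", "video_07", "meter_01", "alarm_03"]:
    print(cid, "hit" if cache.request(cid) else "miss")
```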
When UE $u$ requests content, the cache-related delay consists of the transmission delay and the content acquisition delay $d^{Q}_{u,k}$, i.e., \begin{equation*} d^{\mathrm {cac}}_{u,k}={d_{u,k}^{T}}+d^{Q}_{u,k}.\tag{9}\end{equation*}
D. Optimization Model
The ENs make decisions on UE scheduling, computation offloading, and content caching to reduce the delay of task processing. They receive service requests from users, including the type of service and the SINR. The ENs have computing and caching capabilities and determine the resource allocation, whether to cache content, whether to update the cache, and where to perform tasks.
The location of UE $u$ is denoted by $l_{u}(x_{i},y_{i})$ and that of EN $k$ by $l_{k}(x_{j},y_{j})$. A UE can only be served by an EN within the communication radius $R_{u\rightarrow{k}}$, i.e., \begin{equation*} \|l_{u}(x_{i},y_{i})-l_{k}(x_{j},y_{j})\| < R_{u\rightarrow {k}}.\tag{10}\end{equation*}
Any service can only be allocated to one EN for processing, i.e., \begin{equation*} \sum \limits _{k = 1}^{K}\sum \limits _{m = 1}^{M} x_{u,k}^{\mathrm {sub}}[m]\in {\{0,1\}},\quad \forall {u}.\tag{11}\end{equation*}
The delay of each service $j$ must not exceed the delay tolerance $\mathrm{D}_{i}$ of its service type, i.e., \begin{equation*} (d_{u,k}^{\mathrm {cop}}(j)+d^{\mathrm {cac}}_{u,k}(j)) < \mathrm {D}_{i},\quad i\in \mathcal {S}^{\mathrm {ser}}.\tag{12}\end{equation*}
The allocated bandwidth cannot exceed the available bandwidth resources, yielding the following constraint \begin{equation*} \sum \limits _{u = 1}^{U}{\sum \limits _{m = 1}^{M} x_{u,k}^{\mathrm {sub}}[m]b_{u,k,m}\leq B_{k}},\quad \forall {k}.\tag{13}\end{equation*}
The total computing resources allocated to users by EN $k$ cannot exceed its computing capacity, i.e., \begin{equation*} \sum \limits _{u = 1}^{U} c_{u,k}\leq {C_{k}},\quad \forall {k}.\tag{14}\end{equation*}
The optimization goal of this paper is to minimize the expected total delay of the $N$ services, i.e., \begin{align*}&\mathop {\arg \min }\limits _{\left \{{ d }\right \}} \mathbb {E}\left[{\sum _{j=1}^{N}d_{u,k}^{\mathrm {cop}}(j)+d^{\mathrm {cac}}_{u,k}(j)}\right], \\&{\mathrm {s}}{\mathrm {.t}}{\mathrm {.:~~}} (10) -(14).\tag{15}\end{align*}
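The feasibility side of problem (15) can be checked with a short helper. The function below is a sketch under the assumption that all quantities are given per service for a single EN; the names are ours.

```python
import math

def within_range(ue_pos, en_pos, radius):            # Eq. (10)
    """True if the UE lies inside the EN's communication radius."""
    return math.dist(ue_pos, en_pos) < radius

def feasible(assignments, delays, deadlines,
             bw_alloc, bw_capacity, cpu_alloc, cpu_capacity):
    """Check constraints (11)-(14) for one EN; inputs are per-service lists.

    assignments       : 0/1 indicators, at most one EN per service   (Eq. 11)
    delays, deadlines : per-service total delay and delay tolerance  (Eq. 12)
    bw_alloc, cpu_alloc : allocated bandwidth / CPU per service      (Eqs. 13-14)
    """
    one_en = all(a in (0, 1) for a in assignments)
    in_time = all(d < dl for d, dl in zip(delays, deadlines))
    bw_ok = sum(bw_alloc) <= bw_capacity
    cpu_ok = sum(cpu_alloc) <= cpu_capacity
    return one_en and in_time and bw_ok and cpu_ok

print(within_range((0.0, 0.0), (60.0, 40.0), radius=100.0))
print(feasible([1, 1], [2e-3, 0.5], [4e-3, 60.0],
               [5e6, 2e6], 20e6, [1e9, 0.5e9], 2e9))
```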
Proposed Algorithm
In this part, we model the resource allocation problem among multiple devices based on DRL. The goal of the algorithm is to optimize the resource scheduling strategy and reduce the service delay.
A. Reinforcement Learning Algorithm
In the traditional RL algorithm, the Q-value function is obtained by iterating the Bellman equation, and it is expressed as \begin{equation*} Q_{i+1}(s,a)=E_{s}[r+\gamma {\max _{a'}Q_{i}(s',a')|s,a}].\tag{16}\end{equation*}
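For intuition, the sample-based form of the Bellman update in (16) for a tabular Q-function looks as follows; the toy random transitions merely exercise the update and stand in for the smart-grid environment.

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, lr=0.1, gamma=0.9):
    """One sample-based Bellman update approximating Eq. (16)."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += lr * (target - Q[(s, a)])

Q = defaultdict(float)
actions = [0, 1, 2]
# Toy interaction: random transitions just to show the update mechanics.
for _ in range(1000):
    s, a = random.randrange(5), random.choice(actions)
    r, s_next = -random.random(), random.randrange(5)
    q_learning_update(Q, s, a, r, s_next, actions)
```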
In this paper, the channel states, computing resources, and cache states are all dynamic. When the system state changes, the size of the action and state spaces cannot be estimated, and the cost of solving the Q-value function with the Bellman equation is too high. With the rapid development of deep Q-learning, complex high-dimensional data can be used as input, and a deep Q-network (DQN) then makes decisions according to the input data. For DRL, the state, action, and reward spaces are defined by the resource allocation environment described above.
To avoid correlation between samples, an experience replay mechanism (ERM) is introduced. The motivation of ERM is to break the correlation within samples by storing the transition tuples $(s,a,r,s')$ in a replay buffer and sampling from it during training.
B. Agent Polling Update Deep Reinforcement Learning
The requirements that services in smart grids impose on ENs are often different. Traditional multi-agent algorithms store the states and actions of all agents centrally, which is conducive to global optimization; however, in the face of a large number of unknown services, some ENs need special operations such as computation offloading, so the traditional method is highly limited. We use Agent Polling Update Deep Reinforcement Learning (APUDRL) to store each agent's action separately and store the agents' states centrally. APUDRL can increase the flexibility of the network on the premise of sharing resources among multiple ENs. For example, in a physical environment with $K$ ENs, each EN maintains its own action while the observed states are shared among all ENs.
When a service needs to be processed by a single device, each EN can be regarded as an independent learner that only needs to apply the APUDRL algorithm to explore the best strategy, update its operation, and obtain the reward. If the service needs to be processed by multiple devices, the APUDRL algorithm observes its own state and the states of the other agents, and the operation is selected according to the joint states. When the APUDRL algorithm learns the best strategy or reaches the maximum number of iterations, the network obtains the expected cumulative return. The Markov decision process (MDP) is a general framework for DRL; by mapping the optimization problem to an MDP, DRL can be used to solve it. The multi-service resource optimization problem in this paper is therefore described as an MDP with several key elements, including the state, action, and reward, detailed as follows.
1) State
The state is the observation of the current environment, based on which the agent makes action decisions (see below). The observation includes the information of the devices and the cache storage. The information of the devices includes the location of EN $k$, its remaining computing resources, and its sub-channel allocation, and the cache state of EN $k$ is expressed as \begin{align*} C_{k}^{\mathrm {cac}}=[[\varrho _{k,1}^{\mathrm {con}},\varrho _{k,2}^{\mathrm {con}},\ldots,\varrho _{k,P}^{\mathrm {con}}],~[\omega ^{\mathrm {pop}}_{k,1},\omega ^{\mathrm {pop}}_{k,2},\ldots,\omega ^{\mathrm {pop}}_{k,P}]].\tag{17}\end{align*}
For some delay-sensitive services, the high requirements on the agent make a single agent unable to meet the service demand, so multiple agents are required to handle the services cooperatively. To improve the observation information of the system, it is sensible for an agent to observe some information about the results of the other involved agents. If all the state information is regarded as a vector, the states can be expressed as \begin{align*} \mathbf {s}=&[[l_{1},C_{1}^{\mathrm {cop}},{x^{\mathrm {sub}}_{u,1}[m]},C_{1}^{\mathrm {poo}}], \\&~[l_{2},C_{2}^{\mathrm {cop}},{x^{\mathrm {sub}}_{u,2}[m]},C_{2}^{\mathrm {poo}}],\ldots, \\&~[l_{K},C_{K}^{\mathrm {cop}},{x^{\mathrm {sub}}_{u,K}[m]},C_{K}^{\mathrm {poo}}]].\tag{18}\end{align*}
2) Action
The action means selecting the appropriate agent to process the service. If the action of offloading to the ENs or the cloud is selected, resources are allocated to these services. If the resources of an EN are insufficient, the other ENs or the cloud can provide help. In the proposed model, the agent selects the appropriate ENs for the arriving service. According to the caching information and the remaining computing resources, each agent chooses an action from the action space under the current observed state. The action vector can be expressed as \begin{align*} \mathbf {a}=&[a_{u,1}^{\mathrm {com}},a_{u,2}^{\mathrm {com}},\ldots,a_{u,k}^{\mathrm {com}},\ldots,a_{u,K}^{\mathrm {com}}, \\&~a_{u,1}^{\mathrm {cop}},a_{u,2}^{\mathrm {cop}},\ldots,a_{u,k}^{\mathrm {cop}},\ldots,a_{u,K}^{\mathrm {cop}}, \\&~a_{u,1}^{\mathrm {cac}},a_{u,2}^{\mathrm {cac}},\ldots,a_{u,k}^{\mathrm {cac}},\ldots,a_{u,K}^{\mathrm {cac}}],\tag{19}\end{align*}
3) Reward
The reward function gives the reward brought by the selected action in the current state, and the agent adjusts the exploration policy according to the reward. In the APUDRL algorithm, the agent interacts with the environment to obtain the state, selects an action, and receives the reward \begin{align*} r_{t}\!=\!\alpha _{1}\alpha _{2}{d_{u,k}^{T}} \!+\!\alpha _{3}[d_{u,k}^{\mathrm {P}}x_{u,k}^{\mathrm {cop}}\!+\!D_{u}^{\mathrm {cop}}(1\!-\!x_{u,k}^{\mathrm {cop}})]+\alpha _{4}d^{Q}_{u,k}-\kappa,\tag{20}\end{align*} where $\alpha_{1},\ldots,\alpha_{4}$ are weighting coefficients and $\kappa$ is the penalty factor obtained by mapping the constraints of the optimization problem into the reward function.
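A sketch of how the delay terms and the penalty factor of (20) could be combined in code follows; the weights, the deadline check used for the penalty, and its sign convention are assumptions for illustration.

```python
def reward(d_trans, d_edge, d_cloud, d_cache, offload_to_edge,
           deadline, alphas=(1.0, 1.0, 1.0, 1.0), penalty=10.0):
    """Delay-based reward following the structure of Eq. (20).

    The agent minimizes the cumulative reward, so smaller is better. Here the
    delay-tolerance constraint (12) is mapped to the penalty term kappa:
    a violated deadline inflates the reward (an assumed sign convention).
    """
    a1, a2, a3, a4 = alphas
    x = 1 if offload_to_edge else 0
    total_delay = (a1 * a2 * d_trans
                   + a3 * (d_edge * x + d_cloud * (1 - x))
                   + a4 * d_cache)
    kappa = 0.0 if total_delay < deadline else -penalty
    return total_delay - kappa   # "- kappa" adds `penalty` on a violation

# Millisecond-class service processed at the EN (values are assumptions)
print(reward(0.8e-3, 1.5e-3, 6.0e-3, 0.5e-3, True, deadline=4e-3))
```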
The long-term accumulative reward is defined as the sum of all rewards obtained by the agent, where the reward at each step is discounted. In order to achieve the goal of minimizing the task delay, the long-term cumulative reward is defined as $R_{t}=\sum_{i=0}^{\infty}\gamma^{i}r_{t+i}$, where $\gamma$ is the discount factor.
4) Network
The structure of APUDRL is shown in Fig. 2. The algorithm collects different service characteristics from the environment and transmits the observation values to each agent. Each agent interacts with the environment and selects actions from the action space, trying to minimize the cumulative reward. An experience replay buffer based on SumTree stores the transitions, which are sampled according to the temporal-difference (TD) error \begin{equation*} \delta =r+\gamma {Q^{\mathrm {tar}}}-{Q^{\mathrm {val}}},\tag{21}\end{equation*} where $Q^{\mathrm{tar}}$ and $Q^{\mathrm{val}}$ are the outputs of the target network and the evaluation network, respectively.
A high TD error indicates that the sample has a high update efficiency for the network. The principle of the prioritized experience replay mechanism is as follows. First, the absolute value of the TD error is used to evaluate the learning value of an experience, which is already computed in reinforcement learning algorithms such as SARSA [32]. Then, by ranking the experiences in the replay buffer according to the absolute value of their TD errors, those with large TD errors are replayed more frequently. The change of the accumulated weight is defined as \begin{equation*} \Delta {w}\leftarrow \Delta {w}+w^{\mathrm {cur}}\delta \cdot {\nabla {Q^{\mathrm {val}}}},\tag{22}\end{equation*} where $w^{\mathrm{cur}}$ is the importance-sampling weight of the current sample.
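The SumTree used for priority sampling can be sketched as below; the leaf priorities would typically be set from the absolute TD error of (21), while the importance-sampling weights of (22) and (23) are omitted for brevity. This is a generic implementation, not code from the paper.

```python
import random

class SumTree:
    """Minimal SumTree for prioritized experience replay: leaves hold
    priorities (e.g., |TD error|), internal nodes hold sums, so sampling a
    transition proportionally to its priority takes O(log N)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)   # binary tree stored in an array
        self.data = [None] * capacity
        self.write = 0

    def add(self, priority: float, transition) -> None:
        idx = self.write + self.capacity - 1
        self.data[self.write] = transition
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, idx: int, priority: float) -> None:
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:                           # propagate the change upwards
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def sample(self):
        """Draw one transition with probability proportional to its priority."""
        s, idx = random.uniform(0.0, self.tree[0]), 0
        while True:
            left, right = 2 * idx + 1, 2 * idx + 2
            if left >= len(self.tree):            # reached a leaf
                break
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = right
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

tree = SumTree(capacity=8)
for i in range(8):
    tree.add(priority=abs(0.5 * i) + 0.01, transition=("s", i, -i, "s_next"))
idx, priority, sample = tree.sample()
```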
The evaluation network is trained by minimizing the weighted loss \begin{equation*} f^{\mathrm {loss}}=\frac {1}{K}\sum _{i=1}^{K}{w_{i}}\cdot \delta ^{2},\tag{23}\end{equation*} where $w_{i}$ is the importance-sampling weight of the $i$-th sample in the mini-batch.
The state-action value function is decomposed into a state value function $V(s)$ and an advantage function $A'(s,a)$, i.e., \begin{equation*} Q(s,a)=V(s)+A'(s,a),\tag{24}\end{equation*} which improves the learning efficiency by separating the evaluation of states from the evaluation of actions.
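A dueling Q-network head realizing the decomposition in (24) might look as follows in PyTorch; the layer sizes are assumptions, and centring the advantage by its mean is one common way to make the decomposition identifiable (the exact form of $A'(s,a)$ is not specified further here).

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A'(s, a), cf. Eq. (24).
    Layer sizes are illustrative assumptions."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                # V(s)
        self.advantage = nn.Linear(hidden, action_dim)   # A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.feature(state)
        v = self.value(h)
        a = self.advantage(h)
        # Subtracting the mean advantage keeps the decomposition identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

q_net = DuelingQNetwork(state_dim=12, action_dim=6)
print(q_net(torch.randn(4, 12)).shape)   # torch.Size([4, 6])
```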
Therefore, the APUDRL algorithm can be expressed as Algorithm 1.
Algorithm 1 Agent Polling Update Deep Reinforcement Learning Algorithm
Initialize the scenario and services.
Initialize the number of ENs $K$, the number of UEs $U$, and the SumTree replay memory.
Initialize the evaluation action-value function $Q^{\mathrm{val}}$ and the target action-value function $Q^{\mathrm{tar}}$ with random weights.
for each episode do
Initialize a series of random actions and receive the initial state $\mathbf{s}_{1}$.
for each time step $t$ do
for each polled agent do
Select the action with exploration ($\epsilon$-greedy).
Acquire the action $\mathbf{a}_{t}$, the reward $r_{t}$, and the next state $\mathbf{s}_{t+1}$.
Store the transition $(\mathbf{s}_{t},\mathbf{a}_{t},r_{t},\mathbf{s}_{t+1})$ in the SumTree.
Get a mini-batch of data from the SumTree based on the sample priorities.
Activate part of the agents, and get the target output $F^{\mathrm{tar}}$.
\begin{align*} y_{j}= \begin{cases} r_{j}, & \text {if the episode terminates at step}~j+1\\ r_{j}+\gamma {\max F^{\mathrm {tar}}}, & \text {otherwise} \end{cases} \end{align*}
Use the loss function (23) to update the evaluation network.
Use the replacement rate to update the parameters of the target network.
if the number of activated agents is not sufficient then
Activate the next agent by polling and continue.
else
break.
end if
end for
end for
end for
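To summarize the polling idea of Algorithm 1 in executable form, the toy sketch below lets agents take turns updating while sharing the observed state; the random environment, the tabular Q-tables, and all hyperparameters are stand-in assumptions that replace the neural networks and the smart-grid simulator.

```python
import random
from collections import defaultdict

def polling_training(num_agents=3, num_states=5, num_actions=4,
                     episodes=50, steps=20, eps=0.1, lr=0.1, gamma=0.9):
    """Schematic of the agent-polling update on a toy problem: at each step,
    only one agent (selected in round-robin order) updates its Q-table, while
    all agents share the observed state."""
    Q = [defaultdict(float) for _ in range(num_agents)]

    def env_step(state, action):                  # toy environment (assumption)
        reward = -random.random()                 # delay-like negative reward
        return reward, random.randrange(num_states)

    for _ in range(episodes):
        state = random.randrange(num_states)
        for step in range(steps):
            agent = step % num_agents             # polling: agents take turns
            if random.random() < eps:             # epsilon-greedy exploration
                action = random.randrange(num_actions)
            else:
                action = max(range(num_actions), key=lambda a: Q[agent][(state, a)])
            reward, next_state = env_step(state, action)
            target = reward + gamma * max(Q[agent][(next_state, a)]
                                          for a in range(num_actions))
            Q[agent][(state, action)] += lr * (target - Q[agent][(state, action)])
            state = next_state
    return Q

Q_tables = polling_training()
```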
Simulation Results and Analysis
In this section, we conduct extensive simulations to verify the performance of the proposed scheme. In particular, we provide the simulation settings in Part A, including the model parameters and network parameters. Part B demonstrates the impact of the network structure and the sampling method on performance. Part C analyzes the performance of the resource allocation strategy in terms of computation offloading and cache hit rate.
A. Simulation Settings
The important parameters in performance evaluation are listed in Table 1, and the specific parameters are analyzed as follows.
The system environment of a single macro cell with a radius of 5 km is considered. It is assumed that the maximum communication distance of the ENs is 100 m [5]. The ENs are randomly located in the system environment according to a uniform distribution. The wireless transmission capacity (WiFi) and the wired transmission capacity (power line communication) are set to 2.5 Mbps and 15 Mbps, respectively. The maximum amount of cache capacity is
The initial value of the sample importance (importance-sampling exponent) is 0.4, and its growth rate is 0.001. The agent is composed of two neural networks, namely the target network and the evaluation network. The target network keeps random exploration and trains the network, while the evaluation network evaluates the results of the target network. The parameters of the two networks are synchronized at a fixed replacement rate.
B. Training Performance Evaluation
In this part, we focus on the network performance of the algorithm and compare it with three baseline algorithms, namely Double DQN [34], Prioritized Replay DQN [35], and Dueling DQN [36]. The algorithms converge after a number of iterations, and the performance is evaluated by the reward function and the loss function. The three baseline algorithms differ in network structure and sampling method, but they are all commonly used for resource allocation problems with discrete actions.
1) Double DQN
Double DQN uses two Q-networks with the same structure. By decoupling the selection of the target action from the calculation of the target Q-value, it mitigates the over-estimation problem of DQN.
2) Prioritized Replay DQN
Prioritized Replay DQN uses SumTree-based priority sampling to improve prediction accuracy.
3) Dueling DQN
Dueling DQN tries to improve the algorithm by optimizing the structure of the neural network. It divides the Q-network into two parts: one is the state value function $V(s)$ and the other is the advantage function $A(s,a)$.
Fig. 4 shows the total reward per episode of the proposed algorithm and the three baseline algorithms in the same environment. The total reward per episode of all algorithms first decreases as the episodes increase; then, due to the influence of the random learning strategy, the rewards show some small fluctuations, but they finally converge. It can be seen that the gain of the APUDRL algorithm in terms of reward value is approximately 72.85%, 61.65%, and 58.84% compared with Double DQN, Prioritized Replay DQN, and Dueling DQN, respectively. The convergence value under the Double DQN algorithm reaches 266.35, which is the worst performance, while the best value of 72.24 is achieved by the proposed algorithm. The reasons are as follows. First, the two algorithms rely on different neural network structures. Specifically, Double DQN uses a uniform sampling method, which adapts poorly to newly taken actions, resulting in unsatisfactory returns. In contrast, the proposed algorithm introduces SumTree to optimize data storage and employs an advantage function to evaluate the currently taken actions. Furthermore, the sampling method of the proposed algorithm is efficient, and the advantage function improves the accuracy of decision-making; for newly added actions, the network can learn quickly and estimate the reward in advance.
In Fig. 5, the loss function of the algorithms decreases as the episodes increase. For Double DQN, the fluctuation is severe and the loss does not reach a convergence value in the end, because its sampling method and network structure lead to an inaccurate strategy. As can be seen in the figure, the total loss per episode of the proposed algorithm fluctuates around 5 and provides a lower bound for the other algorithms. As expected, the fluctuation range and convergence value of the loss function of the proposed algorithm significantly outperform those of the other three algorithms, since the agents in the proposed algorithm benefit from an efficient sampling method and accurate decision-making.
C. Performance of Computing and Caching
In this part, we evaluate the performance of computing and caching based on the proposed algorithm. Considering the different service delay requirements in smart grids, we analyze the service delay of the three kinds of services in different proportions. The ratios of millisecond, second, and minute services are set to 1:1:1, 1:1:3, 1:3:1, and 3:1:1. As shown in Fig. 6, the number of ENs is varied from 0 to 20. When the network resources are sufficient, the total processing delay presents an approximately linear downward trend as the number of ENs increases. The delay decreases rapidly when the proportion of millisecond services increases and decreases slowly when the proportion of second and minute services increases. The results show that the proposed algorithm can adapt to services with different delays and adaptively change the strategy according to the service delay requirements to ensure the convergence of the results.
Fig. 7 shows the total delay of the proposed algorithm under different numbers and proportions of services, with three different computing offloading modes [37]:
ENs Computing: All services are delivered to the ENs for processing. This is suitable for situations in which the service load is light and the ENs have abundant resources to finish the tasks in time.
Cloud Computing (CC): All services are delivered to CC for processing. Compared with ENs computing, CC has more resources and is suitable for services with less stringent latency requirements, because this transmission costs more resources and time.
ENs and CC: Each service can be processed at the ENs or delivered to the cloud for processing, which is suitable for a variety of services with different delay requirements.
Fig. 7. System delay versus the number of services with different computing offloading modes and service ratios.
As shown in Fig. 7, the delay of the single service is always lower than that of the mixed service regardless of the number of services. The single service refers to the case in which the service requested by the user consists of only one type of service in $\mathcal{S}^{\mathrm{ser}}$, while the mixed service contains several types with different delay tolerances.
Fig. 8 shows the comparison of cache hit rates among OT, OP, and PACT, where OT only considers the cache time and OP only considers the content popularity. As can be seen from Fig. 8, OT has the lowest cache hit rate, which is always lower than that of the other two strategies. This illustrates that OT can only adapt to the caching requirements of a few services, such as power monitoring video. The cache hit rate of PACT is always higher than that of OT and OP. The results show that, by considering both the content popularity and the content cache time, the cache hit rate is significantly improved under different cache capacities. Compared with OT and OP, the cache hit rate of PACT is improved by 34.42% and 18.71%, respectively.
Conclusion
We design an EC system framework with three layers in smart grids that combines EC and CC to allocate computing and caching resources, which is suitable for the large number of services with different delay requirements in smart grids. A DRL algorithm based on a polling method adapted to smart grids is proposed, which allows the agents to perform the polling mechanism according to the requirements of the services to optimize the neural network, and the constraints of the optimization problem are mapped into penalty factors of the reward function. Numerical experiments show that, compared with the baseline algorithms, the proposed algorithm achieves superior long-term utility, reflected in the smaller convergence values of the reward function and the loss function. In the face of the growing number of services with different delay requirements, the algorithm also surpasses the baseline schemes in terms of delay and cache hit rate.