Robust Network Slicing: Multi-Agent Policies, Adversarial Attacks, and Defensive Strategies

In this paper, we present a multi-agent deep reinforcement learning (deep RL) framework for network slicing in a dynamic environment with multiple base stations and multiple users. In particular, we propose a novel deep RL framework with multiple actors and a centralized critic (MACC) in which the actors are implemented as pointer networks to fit the varying dimension of the input. We evaluate the performance of the proposed deep RL algorithm via simulations to demonstrate its effectiveness. Subsequently, we develop a deep RL based jammer with limited prior information and a limited power budget. The goal of the jammer is to minimize the transmission rates achieved with network slicing and thus degrade the network slicing agents' performance. We design a jammer with both listening and jamming phases and address jamming location optimization as well as jamming channel optimization via deep RL. We evaluate the jammer at the optimized location, generating interference attacks in the optimized set of channels by switching between the jamming phase and the listening phase. We show that the proposed jammer can significantly reduce the victims' performance without direct feedback or prior knowledge of the network slicing policies. Finally, we devise a Nash-equilibrium-supervised policy ensemble mixed strategy profile for network slicing (as a defensive measure) and jamming. We evaluate the performance of the proposed policy ensemble algorithm by applying it to the network slicing agents and the jammer agent in simulations to show its effectiveness.


I. INTRODUCTION
Network slicing in 5G radio access networks (RANs) allows enhancements in service flexibility for novel applications with heterogeneous requirements [1]-[4]. In network slicing, the physical cellular network resources are divided into multiple virtual network slices to serve the end user, and thus network slicing is a vital technology to meet the strict requirements of each user by allocating the desired subset of network slices [5]. Instead of model-based optimization of resource allocation that assumes the knowledge of traffic statistics [6]-[8], deep reinforcement learning (deep RL) [9], [10] as a model-free decision making strategy can be deployed to optimize slice selection and better cope with challenges such as the dynamic environment, resource interplay, and user mobility [11]-[14]. Most existing studies in the literature assume the slice state to be identical for all different slices over time, and hence the selection problem reduces to assigning the number of slices to each request.

The authors are with the Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, 13244 (e-mail: fwang26@syr.edu, mcgursoy@syr.edu, svelipas@syr.edu). The material in this paper was presented in part at the 2022 IEEE International Conference on Communications (ICC).
In this paper, we consider a more practical scenario of a cellular coverage area with multiple base stations and a dynamic interference environment, analyze resource allocation in the presence of multiple users with random mobility patterns, and develop a multi-agent actor-critic deep RL agent to learn the slice conditions and achieve cooperation among base stations. We propose a learning and decision-making framework with multiple actors and a centralized critic (MACC) that aims at maximizing the overall performance instead of local performance. In the machine learning literature, the MACC framework has been proposed for general RL tasks [15]-[17]. In our setting, this multi-agent system requires the actors at each base station to communicate with a centralized critic at the data center/server to share the experience and update parameters during the training process. It subsequently switches to decentralized decision making following the training.
Typically, when 5G RAN achieves faster transmission rates with higher frequency bands, it also has smaller coverage, necessitating a relatively dense cellular network architecture. In such a setting, the dynamic environment with user mobility might have a significant impact on the network slicing performance. Due to this, we introduce the pointer network [18] to implement the actor policy to handle the varying observations of the deep RL agents.
It is important to note that due to being highly data driven, deep neural networks are vulnerable to minor but carefully crafted perturbations, and it is known that adversarial samples with such perturbations can cause significant loss in accuracy, e.g., in inference and classification problems in computer vision [19]-[21]. Given the broadcast nature of wireless communications, deep learning based adversarial attacks have also recently been attracting increasing attention in the context of wireless security [22], [23]. In particular, deep RL agents are vulnerable to attacks that lead to adversarial perturbations in their observations (akin to adversarial samples in classification and inference problems). In wireless settings, jamming is an adversarial attack that alters the state and/or observations of decision-making RL agents. Motivated by these considerations, we also design a deep RL based jammer agent with jamming and listening phases. Different from most existing works, we analyze how the jamming location and channel selection are optimized without direct feedback on the victims' performance. We further analyze the performance degradation in network slicing agents in the presence of jamming attacks and identify their sensitivity in an adversarial environment.
Finally, we note that there is also a growing interest in developing defense strategies against adversarial attacks [24]-[26], and specifically jamming attacks [27], [28]. One of the most intriguing defense strategies is to ensemble several different policies, explore alternative strategies, and provide stable performance [17], [29]. In our context, we consider the mobile virtual network operator (MVNO) and the jammer as two players in a zero-sum incomplete information game. Several existing studies within this framework focus on applications, e.g., involving video games or poker games, in which random exploration does not lead to physical loss or penalty. In contrast, wireless users may experience disconnectivity in such cases. Therefore, an efficient and prudent exploration plan with fast convergence is desired. Most existing works also restrict the scope to the performance of the considered player or its performance against a certain type of adversary, failing to consider the nature of the zero-sum game against an unknown opponent that is potentially adaptive. In this paper, we propose an approach we refer to as Nash-equilibrium-supervised policy ensemble (NesPE). NesPE utilizes the optimized mixed strategy profile to supervise the training process of the policies in the ensemble so that they fully explore the environment, and it leaves no improvement space for the opponent, in pursuit of the global equilibrium over all possible policy ensembles. We evaluate the performance of NesPE by applying it to both the aforementioned network slicing victim agent and the jammer agent in a competitive context, and compare its significantly improved performance with two other policy ensemble algorithms.
The remainder of the paper is organized as follows. In Section II, we introduce the wireless network virtualization framework and dynamic channel model, and formulate the network slicing problem. In Section III, we propose the MACC deep RL algorithm with its pointer network architecture for actor implementation, and describe the network slicing agents. In this section, we also evaluate the performance of the proposed algorithm. Subsequently, we devise the deep RL based jammer agent in Section IV, introduce the two operation phases, jamming location optimization, jamming channel optimization, and the actor-critic implementation, and evaluate the performance. We then design the NesPE algorithm in Section V, describe the steps of the algorithm, analyze its performance in both non-competitive and competitive environments, and compare it with other policy ensemble algorithms by applying it to both the victim agent and the jammer agent. Finally, we conclude the paper in Section VI.

II. SYSTEM MODEL AND PROBLEM FORMULATION
In this section, we introduce wireless network virtualization (WNV) and the interference channel model, and formulate network slicing as an optimization problem.

A. Wireless Network Virtualization
WNV is well-known for enhancing the spectrum allocation efficiency. As shown in Fig. 1, infrastructure providers (InPs) own the physical infrastructure (e.g., base stations, backhaul, and core network), and operate the physical wireless network. WNV virtualizes the physical resources of InPs and separates them into slices to efficiently allocate these virtualized resources to the mobile virtual network operators (MVNOs). Therefore, MVNOs deliver differentiated services via slices to user equipments (UEs) with varying transmission requirements.
In this paper, we consider a service area that contains $N_B$ base stations of InPs with inter-base-station interference, and MVNOs that require services from the InPs. These MVNOs provide services to $N_u$ UEs that are within the coverage area of at least one base station, and may move from the coverage of one base station to that of another. Users in the coverage area of multiple base stations that serve the MVNO can be assigned to any of these base stations. Without loss of generality, we discuss the station-user pair for a single MVNO in the remainder of this paper. Next, we introduce the dynamic channel environment.

B. Interference Channel Model
We consider a dynamic environment where $N$ channels are available for each MVNO. The fading coefficient of the link between base station $b$ and a UE $u$ in a certain channel $c$ at time $t$ is denoted by $h_c^{b,u}(t) \sim \mathcal{CN}(0, 1)$, an independent and identically distributed (i.i.d.) circularly symmetric complex Gaussian (CSCG) random variable with zero mean and unit variance. Each fading coefficient varies every $T$ time slots as a first-order complex Gauss-Markov process, according to the Jakes fading model [30]. Therefore, at time $(n + 1)T$, the fading coefficient can be expressed as

$$h_c^{b,u}((n+1)T) = \rho\, h_c^{b,u}(nT) + \sqrt{1 - \rho^2}\; e_c((n+1)T) \tag{1}$$

where $e_c$ denotes the channel innovation process, which is also an i.i.d. CSCG random variable. Furthermore, we have $\rho = J_0(2\pi f_d T)$, where $J_0$ is the zeroth order Bessel function of the first kind and $f_d$ is the maximum Doppler frequency.
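To make the channel dynamics concrete, the following is a minimal simulation sketch of a first-order complex Gauss-Markov process. It assumes the standard update $h \leftarrow \rho h + \sqrt{1-\rho^2}\,e$ with a unit-variance innovation; the helper names (`bessel_j0`, `cscg`, `simulate_fading`) are illustrative and not taken from the paper, and $J_0$ is evaluated by its power series, which is adequate only for small arguments such as the ones used here.

```python
import math
import random


def bessel_j0(x, terms=30):
    """Zeroth-order Bessel function of the first kind via its power series.

    Accurate for small |x|; a library routine would be preferable for large |x|.
    """
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        # Ratio of consecutive series terms: -(x/2)^2 / (k+1)^2
        term *= -((x / 2.0) ** 2) / ((k + 1) ** 2)
    return total


def cscg(rng):
    """Draw one CN(0, 1) sample: real and imaginary parts each have variance 1/2."""
    s = math.sqrt(0.5)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def simulate_fading(f_d, T, blocks, rng):
    """Return rho and the fading coefficients h(0), h(T), ..., h(blocks * T)."""
    rho = bessel_j0(2.0 * math.pi * f_d * T)
    h = [cscg(rng)]
    for _ in range(blocks):
        e = cscg(rng)  # innovation, independent of the past (assumed unit variance)
        h.append(rho * h[-1] + math.sqrt(1.0 - rho ** 2) * e)
    return rho, h


# Values of f_d and T taken from the numerical results section.
rho, h = simulate_fading(f_d=1.0, T=0.02, blocks=1000, rng=random.Random(0))
```

With these values, $\rho$ is very close to 1, so consecutive fading coefficients are strongly correlated and the recorded rate history remains informative for a number of slots.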

C. Network Slicing Problem Formulation
In the aforementioned environment, resource allocation is performed in a first-come first-served fashion. Each base station may serve at most $N_r$ requests from different users simultaneously, and the request queue can hold at most $N_q$ requests. For each request, multiple channels can be allocated, and the transmission power is $P_B$ in each channel. Different requests served by the same base station do not share the same channels, while requests to different base stations may share the same channel at the cost of inflicting interference. For all requests to each base station, transmission is allowed in no more than $N_c$ channels, and consequently the total power is limited to $N_c P_B$. In this setting, if request $k$ from user $u$ is allocated the subset $C_k$ of channels at base station $b$, the sum rate $r_k$ for this request is the sum of the per-channel rates over the allocated channels,

$$r_k = \sum_{c \in C_k} r_c^{b,u}$$

where each per-channel rate $r_c^{b,u}$ is determined by the transmission power $P_B$, the fading coefficient $h_c^{b,u}$, the noise, the interference from other base stations $b'$ transmitting in the same channel (indicated by $\mathbb{1}_c^{b,b'}$), and the path loss between base station $b$ and user $u$. In the path loss, $h_B$ is the height of each base station, $\alpha$ is the path loss exponent, and $\{x_b, y_b\}$ and $\{x_u, y_u\}$ are the 2-D coordinates of base station $b$ and user $u$, respectively. Therefore, for each base station, the resource assignment including the selected subsets $C_k$, the number of selected channels $n_c$, the number of served requests $n_r$, and the number of requests in queue $n_q$ must satisfy the following constraints:

$$C_k \cap C_{k'} = \emptyset, \quad \forall k \neq k' \tag{6}$$
$$n_c = \sum_{k} |C_k| \tag{7}$$
$$n_c \le N_c \tag{8}$$
$$n_r \le N_r \tag{9}$$
$$n_q \le N_q \tag{10}$$

Above, (6) indicates that no channel is shared among different requests at the same base station, and (7) defines the number of channels being used. The constraints in (8)-(10) are the upper bounds on the number of channels, the number of requests being served simultaneously, and the number of requests in the queue, respectively. If $n_r = N_r$ and $n_q = N_q$ at a base station, any incoming request to that base station will be denied service.
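As an illustration of the per-base-station constraints described above (no channel shared between requests, and upper bounds on channels, served requests, and queued requests), the following sketch checks whether a candidate assignment is feasible. The function name and the dictionary-based data layout are assumptions made for this example.

```python
def is_feasible(assignment, n_q, N_c, N_r, N_q):
    """Check the resource constraints for one base station.

    assignment: dict mapping request id -> set of channel indices (C_k)
    n_q:        number of requests currently waiting in the queue
    N_c, N_r, N_q: upper bounds on channels, served requests, queued requests
    """
    used = set()
    for channels in assignment.values():
        if used & channels:   # a channel would be shared by two requests
            return False
        used |= channels
    n_c = len(used)           # number of channels in use
    n_r = len(assignment)     # number of requests being served
    return n_c <= N_c and n_r <= N_r and n_q <= N_q


# Two requests on disjoint channels with one request queued.
ok = is_feasible({1: {0, 3}, 2: {5}}, n_q=1, N_c=8, N_r=4, N_q=2)
```

Here `ok` is `True`; assigning channel 3 to both requests, or queuing a third request when $N_q = 2$, would make the check fail.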
The features of each request $k$ consist of the minimum transmission rate $m_k$, the lifetime $l_k$, and the initial payload $p_k$ before the transmission starts. At each time slot $t$ during transmission, the constraints are given as follows:

$$r_k(t) \ge m_k \tag{11}$$
$$l_k(t) > 0 \tag{12}$$

where $l_k(t)$ denotes the remaining lifetime at time $t$. Note that with this definition, we have $l_k(0) = l_k$. If any request being processed fails to meet constraint (11) or (12), it will be terminated and marked as failed. Any request in the queue that fails to meet constraint (12) will also be removed and marked as failed. Otherwise, the lifetime will be updated for each request being processed and each request in the queue as follows:

$$l_k(t+1) = l_k(t) - 1 \tag{13}$$

Each request being processed will transmit $r_k(t)$ bits of the remaining payload:

$$p_k(t+1) = \max\{p_k(t) - r_k(t),\, 0\} \tag{14}$$

Note that the initial payload is $p_k(0) = p_k$. If the payload is completed within the lifetime (i.e., the remaining payload is $p_k(t+1) = 0$ for $t + 1 \le l_k$), request $k$ is completed and marked as a success. For all aforementioned cases, if the request is completed successfully in time, the network slicing agent at base station $b$ receives a positive reward $R_k$ equal to the initial payload $p_k$. Otherwise, it receives a negative reward of $R_k = -p_k$. Each user can only send one request at a time.
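The request lifecycle described above (rate and lifetime checks, lifetime decrement, payload reduction, and a reward of $\pm p_k$) can be sketched as a small state update. This is an illustrative reading of the text rather than the authors' implementation; the class and function names are hypothetical, and a request is treated as expired once its remaining lifetime reaches zero.

```python
from dataclasses import dataclass


@dataclass
class Request:
    payload: float           # p_k: initial payload
    min_rate: float          # m_k: minimum transmission rate
    lifetime: int            # l_k: initial lifetime in time slots
    remaining: float = None  # p_k(t), filled in below
    time_left: int = None    # l_k(t), filled in below

    def __post_init__(self):
        self.remaining = self.payload    # p_k(0) = p_k
        self.time_left = self.lifetime   # l_k(0) = l_k


def step(req, rate):
    """Advance a request that is being processed by one time slot.

    Returns +p_k on success, -p_k on failure, or None while still in progress.
    """
    if rate < req.min_rate or req.time_left <= 0:
        return -req.payload                        # terminated and marked as failed
    req.time_left -= 1                             # lifetime update
    req.remaining = max(req.remaining - rate, 0)   # transmit up to `rate` bits
    if req.remaining == 0:
        return req.payload                         # completed within the lifetime
    return None


req = Request(payload=10, min_rate=2, lifetime=3)
outcomes = [step(req, 4) for _ in range(3)]
```

In this example the request needs three slots at rate 4 to deliver a payload of 10, which exactly fits its lifetime of 3, so `outcomes` ends with the positive reward.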
Afterwards, base station $b$ records the latest transmission rate history of $r_c^{b,u}$ in a two-dimensional matrix $H_b$, indexed by user and channel. In each time slot $t$, for request $k$ from user $u$ that is allocated the subset $C_k$ of channels at base station $b$, the base station updates the entries corresponding to the channels that were selected for transmission:

$$H_b[u, c] = r_c^{b,u}(t), \quad \forall c \in C_k \tag{15}$$

Note that the location $\{x_u, y_u\}$ of user $u$, the fading coefficient $h_c^{b,u}$, and the interference corresponding to $\mathbb{1}_c^{b,b'}$ vary over time, and therefore $H_b$ is only a first-order approximation of the potential transmission rate $r_c^{b,u}(t+1)$. The goal of the network slicing agent at each base station is to find a policy $\pi$ that selects $n_c$ channels and assigns them to $n_r$ requests, so that the discounted sum reward of all requests at all base stations is maximized over time. In this objective, referred to as (16), the rewards $R_k$ are summed over all base stations $b$ and over $K'_b$, the set of completed or terminated requests for base station $b$ at time $t$, and $\gamma \in (0, 1)$ is the discount factor.

III. MULTI-AGENT DEEP RL WITH MULTIPLE ACTORS AND CENTRALIZED CRITIC (MACC)
To solve the problem in (16), we propose a multi-agent deep RL algorithm with multiple actors and a centralized critic (MACC), where the actors (one at each base station) utilize the pointer network to attain the maximal reward over all base stations by choosing the optimal subset of channels in processing each request. In the remainder of this paper, we denote the full observation at base station $b$ as $O_b$, the observation over all base stations as $O = \cup_{b=1}^{N_B} O_b$, and the channel selection at base station $b$ as a matrix of actions $A_b \in \{0, 1\}^{n_r \times N}$. In each time slot when $n_c(t)$ channels are selected, the entries of $A_b$ sum to $n_c(t)$. In this section, we first introduce the MACC framework, then describe the pointer network as the actor structure, and finally analyze the implementation on the considered network slicing problem.

A. Deep RL with Multiple Actors and Centralized Critic
In this section, we briefly discuss the actor-critic algorithm [31]. This deep RL algorithm utilizes two neural networks, the actor and the critic. The two networks have separate neurons and utilize separate backpropagation, and they may have separate hyperparameters.
The multi-agent extension of the actor-critic algorithm, in which each agent has a separate actor and critic that aim at maximizing each individual reward, is referred to as independent actor-critic (IAC) [15], [16]. When the action of each agent interferes with the others and the goal is to maximize the sum reward over all agents, deep RL with MACC may be preferred [16]. In this framework, the decentralized actors with parameter $\theta$ and policy $f_\theta(O_b)$ are the same as in IAC, while there is only one centralized critic with parameter $\phi$ and policy $g_\phi(O)$ aiming at the sum reward by updating each decentralized actor. Therefore, MACC agents are more likely to learn coordinated strategies through interaction between agents and the mutual environment, despite the scarcity of information sharing at the training phase. In the framework of MACC, the critic maps $O$ to a single temporal difference (TD) error

$$\delta(t) = \sum_{b=1}^{N_B} \sum_{k \in K_b} R_k + \gamma\, g_\phi(O(t+1)) - g_\phi(O(t)) \tag{18}$$

where $K_b$ is the set of requests being processed and requests in the queue at time $t$, and $\gamma \in (0, 1)$ is the discount factor. Then, the critic parameters $\phi$ are optimized by least-squares temporal difference, i.e., by minimizing the squared TD error, and the actor parameters $\theta$ are optimized by policy gradient. For a wireless resource allocation problem, fast convergence in the training of neural networks is highly important. Therefore, in our implementation, we not only recommend offline pre-training via simulations, but also speed up learning by sharing the neural network parameters among the agents, i.e., we use all the information to update one actor and one critic (as in IAC) during training and share the parameters among all agents.
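The centralized critic update can be illustrated with a toy example. The sketch below assumes the usual one-step TD error (summed reward plus discounted next value minus current value) and, purely for illustration, a linear critic in place of the neural network $g_\phi$; all function names are hypothetical.

```python
def td_error(rewards, value_now, value_next, gamma):
    """One-step TD error for a centralized critic.

    rewards:    rewards collected across all base stations for this sample
    value_now:  critic output for the joint observation O(t)
    value_next: critic output for the joint observation O(t+1)
    """
    return sum(rewards) + gamma * value_next - value_now


def critic_update(weights, features, delta, lr):
    """One semi-gradient step that reduces the squared TD error of a linear
    critic g(O) = weights . features (the bootstrap target is held constant)."""
    return [w + lr * delta * x for w, x in zip(weights, features)]


delta = td_error(rewards=[1.0, 2.0], value_now=0.5, value_next=1.0, gamma=0.9)
new_weights = critic_update([0.0, 0.0], [1.0, 2.0], delta, lr=0.1)
```

Because a single critic scores the joint observation, the same `delta` can be used to weight the policy-gradient step of every actor, which is what couples the decentralized actors during training.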

B. Pointer Network
In this section, we introduce the pointer network to implement the actor policy $f_\theta$ for IAC and MACC.
Traditional sequence-processing recurrent neural networks (RNNs), such as sequence-to-sequence learning [32] and neural Turing machines [33], have a limit on the output dimensionality, i.e., the output dimensionality is fixed by the dimensionality of the problem, so it is the same during training and inference. However, in a network slicing problem, the actor's input dimensionality $\{N, n_c, n_r\}$ may vary over time, and thus the expected output dimension of $A_b$ also varies according to the input. As opposed to the aforementioned sequence-processing algorithms, the pointer network learns the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in the input sequence [18], and therefore the dimension of action $A_b$ in each time slot depends on the length of the input.
Another benefit of pointer networks is that they reduce the cardinality of $A_b$. General deep RL algorithms typically list out each combination as an element in action $A'_b$, and each element indicates picking $n_c$ channels out of $N$ channels and assigning each to one of $n_r$ requests. Therefore, $A'_b$ has the dimension of $\binom{N}{n_c} n_r^{n_c}$. When the dimensions of $N$, $n_c$, and $n_r$ are high, this will require a prohibitively large network to give such an output, and require exponentially longer time to train. Comparatively, $A_b$ generated from our proposed pointer network actor has only $n_r N$ output elements.
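To give a sense of scale, the following sketch compares the two output sizes, reading the flat action count as "choose $n_c$ of $N$ channels, then assign each chosen channel to one of $n_r$ requests". The function names are illustrative; the parameter values are those used in the numerical results ($N = 16$, $N_c = 8$, $N_r = 4$).

```python
import math


def flat_action_count(N, n_c, n_r):
    """Number of joint actions when every combination is enumerated:
    choose n_c of N channels, then give each chosen channel to one of n_r requests."""
    return math.comb(N, n_c) * n_r ** n_c


def pointer_output_count(N, n_r):
    """Number of elements of the n_r x N pointer-network action matrix."""
    return n_r * N


print(flat_action_count(16, 8, 4))   # 843448320
print(pointer_output_count(16, 4))   # 64
```

Even at this moderate problem size, an enumerated action space would need an output layer with hundreds of millions of units, whereas the pointer network output has 64 elements.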
Therefore, we here introduce the pointer network as the novel architecture of the actor of the proposed agents as shown in Fig.
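Furthermore, the pointer network computes the conditional probability array $P_c \in [0, 1]^{n_r}$ for each channel $c$ using an attention mechanism as follows:

$$u_k^c = v^\top \tanh(W_1 e_k + W_2 d_c), \quad k \in \{1, \ldots, n_r\}, \qquad P_c = \mathrm{softmax}(u^c) \tag{22}$$

where $e_k$ and $d_c$ are the outputs of the encoder and decoder, respectively, softmax normalizes the vector of $u_k^c$ to be an output distribution over the dictionary of inputs, and the vector $v$ and matrices $W_1$ and $W_2$ are trainable parameters of the attention model.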
We note that the $n_r \times N$ matrix of the pointer network output $P = [P_1, P_2, \ldots, P_N]$ typically consists of non-integer values, and we obtain the decision by selecting the maximum element in each column:

$$A_b[k, c] = \begin{cases} 1, & k = \arg\max_{k'} P_c[k'] \\ 0, & \text{otherwise} \end{cases} \tag{23}$$

In our implementation, we use two Long Short Term Memory (LSTM) [34] RNNs of the same size (i.e., $h_e = h_d$) to model the encoder and decoder. During the backpropagation phase of neural network training, the partial derivative of the error function with respect to weights in layers (encoder and decoder in our case) that are far from the output layer may be vanishingly small, preventing the network from further training; this is called the vanishing gradient problem [35]. However, the encoder and decoder are essential to the function of the pointer network while we demand fast convergence, so we set $W_1$ and $W_2$ in (22) as trainable vectors and set $v$ as a single constant to reduce the depth of the pointer network and speed up the training of both LSTMs.
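The attention and decision steps above can be sketched in a few lines. This example assumes the standard additive pointer-network attention from [18], $u_k^c = v^\top \tanh(W_1 e_k + W_2 d_c)$, with full matrices $W_1$ and $W_2$ rather than the simplified vector/constant parameterization used in the paper's implementation. The helper names are hypothetical, the encoder and decoder states are supplied directly instead of being produced by LSTMs, and restricting the selection to $n_c$ channels is outside the scope of the sketch.

```python
import math


def softmax(xs):
    m = max(xs)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def pointer_attention(E, D, W1, W2, v):
    """E: n_r encoder states e_k; D: N decoder states d_c.

    Returns P as a list of N columns; column c is a distribution over requests.
    """
    P = []
    for d_c in D:
        w2d = matvec(W2, d_c)
        u = []
        for e_k in E:
            w1e = matvec(W1, e_k)
            u.append(sum(vi * math.tanh(a + b) for vi, a, b in zip(v, w1e, w2d)))
        P.append(softmax(u))
    return P


def decide(P):
    """Column-wise argmax: a one-hot column per channel (1 marks the chosen request)."""
    cols = []
    for p_c in P:
        k_star = max(range(len(p_c)), key=p_c.__getitem__)
        cols.append([1 if k == k_star else 0 for k in range(len(p_c))])
    return cols


I2 = [[1.0, 0.0], [0.0, 1.0]]
P = pointer_attention(E=I2, D=I2, W1=I2, W2=I2, v=[1.0, 1.0])
A_cols = decide(P)
```

Note that `decide` returns the action column by column (one list per channel), i.e., the transpose of the $n_r \times N$ layout. Because the attention scores are computed per (request, channel) pair, the same parameters apply to any number of requests or channels, which is the property that lets the actor handle inputs of varying dimension.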

C. Network Slicing Agent
In this subsection, we introduce the action and state spaces for MACC deep RL with pointer networks employed for the aforementioned network slicing problem.
1) Centralized Critic: The centralized critic agent $g_\phi(O)$ is implemented as a feed-forward neural network (FNN) that uses the ReLU activation function for each hidden layer. This FNN takes the observation $O$ as input, where the observation $O_b$ at each base station $b$ is composed of the following quantities: $A_b(t-1)$ is the action matrix as described in (23) in the last time slot, $H_b(t)$ is the transmission rate history described in (15), $U_b(t)$ is the set of users corresponding to the $n_r$ requests served by base station $b$, and $C$ is the set of all channels $\{1, 2, \ldots, N\}$. $I_b$ is the information of requests, including the remaining payload $p_k$, the minimum rate $m_k$, the remaining lifetime $l_k$, and the absolute value of the reward $|R| = p_k$ of the requests being processed and the requests in the queue. The output of $g_\phi(O)$ is a single value that aims at minimizing the error $\delta(t)$ in (18). Note that in our implementation, the sum reward is not the instantaneous reward at time $t$, but the rewards of $K_b$, the set of all requests being processed and requests in the queue at time $t$. Therefore, we do not get the reward value or use the sample for training until all requests being processed and those in the queue are completed. Compared to the instantaneous reward, we use this reward assignment to strengthen the correlation between the action $A_b$ and the outcome, and thus speed up the convergence.

... and the output is a vector of two states with length $2h_e$.
At each step, the encoder LSTM receives the input associated with one request $k$, and the decoder LSTM receives the input associated with one channel $c$. The two components of the decoder input are the transmission rate history corresponding to all users $U_b$ in channel $c$, and the information of all requests being processed.
Therefore, as in (22), the outputs $e_k$ and $d_c$ of the encoder and decoder are fed into the attention mechanism to produce the conditional probability $P$, and thus we obtain the output action $A_b$ via (23).
Apart from the pointer network actor $f_\theta(O_b)$, we consider two other decision modes to explore different actions and train MACC policies: the random mode and the max-rate mode. The random mode randomly assigns $n_c$ out of $N$ channels to $n_r$ requests without considering the observation, and the max-rate mode generates $A_b$ via the transmission rate history matrix $H_b[U_b, C]$ instead of $P$. The agent follows an $\epsilon$-$\epsilon_m$-greedy policy as shown in Algorithm 1 to update the neural network parameters. Specifically, it initially starts with an exploration probability $\epsilon = \epsilon_0 = 1$ and chooses the max-rate mode with probability
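The three decision modes can be sketched as follows. The probability split is an assumption made for illustration: the agent explores with probability $\epsilon$, and within exploration it uses the max-rate mode with probability $\epsilon_m$ and the random mode otherwise (so random actions occur with probability $\epsilon(1-\epsilon_m)$). The decay schedule of $\epsilon$ and the restriction to $n_c$ channels are omitted, and the function names are hypothetical.

```python
import random


def choose_mode(eps, eps_m, rng):
    """Pick the decision mode for one step.

    Assumed split: explore with probability eps; within exploration use
    max-rate with probability eps_m and random otherwise.
    """
    if rng.random() < eps:
        return "max_rate" if rng.random() < eps_m else "random"
    return "actor"


def max_rate_action(H_sub):
    """H_sub: n_r x N rate history restricted to the served users.

    Assigns each channel to the request with the highest recorded rate,
    i.e., the column-wise maximum applied to the history instead of P.
    """
    n_r, N = len(H_sub), len(H_sub[0])
    A = [[0] * N for _ in range(n_r)]
    for c in range(N):
        k_star = max(range(n_r), key=lambda k: H_sub[k][c])
        A[k_star][c] = 1
    return A


rng = random.Random(0)
mode = choose_mode(eps=1.0, eps_m=0.0, rng=rng)   # always the random mode
A = max_rate_action([[1.0, 5.0], [3.0, 2.0]])
```

With $\epsilon = 1$ at the start of training, the actor is never consulted, which matches the low initial performance reported in the numerical results.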

D. Numerical Results
As shown in Fig. 3, in the experiments we consider a service area with $N_B = 5$ base stations and $N_u = 30$ users, and each cell radius is 2.5 km. There are $N = 16$ channels available, and each base station picks at most $N_c = 8$ channels for transmission. The ratio between the transmission power and the noise in each channel is $P_B/\sigma_k^2 = 6.3$. Each base station allows the processing of $N_r = 4$ requests, and keeps at most $N_q = 2$ requests in the queue. When a request from one user is terminated or completed, the user will not be able to send another request within $T_r = 2$ time slots.
The channel varies with maximum Doppler frequency $f_d = 1$ and dynamic slot duration $T = 0.02$ s, and the location of users follows a random walk pattern. Each user may transfer between the coverage of different base stations, while staying within the coverage area of at least one of the base stations. All other parameters are listed in Table I.
In Fig. 4, we compare the moving average of the sum reward achieved by the network slicing agents utilizing the proposed MACC with pointer networks against other algorithms. As in [36], we introduce three statistical algorithms: max-rate, which always executes the max-rate mode; FIFO, which selects channels with high channel rate history but always assigns enough channel resources to the requests that have arrived earlier; and hard slicing, which assigns channels with high channel rate history such that the number of channels assigned to each request is proportional to the corresponding minimum transmission rate $m_k$. We also compare four deep RL agents, each of which deploys the IAC or MACC framework using either FNNs or pointer networks as the actor. While the proposed MACC with pointer network agent starts with a lower performance at the beginning of the training phase due to random exploration, it outperforms all other algorithms during the test phase. In the initial training phase of $t < 50000$, the proposed agent starts with random parameters and employs the $\epsilon$-$\epsilon_m$-greedy policy, which starts with a large probability $\epsilon(1 - \epsilon_m)$ to explore with random actions, and consequently the performance starts low but gradually improves. We observe that in the test phase from $t = 50000$ to $t = 200000$, the sum reward eventually converges to around 20 bits/symbol, while slightly oscillating due to the challenging dynamic environment with random channel states, varying user locations, and randomly arriving requests. We further note that the proposed network slicing agents complete over 95% of the requests, and hence attain a very high completion ratio as well.

IV. DEEP RL BASED JAMMER
We have introduced the deep RL based network slicing agents in the previous section. Since deep RL is vulnerable to adversarial attacks that perturb the observations, the proposed network slicing agents can be open to attack by intelligent jammers, and it is critical to determine their sensitivity and robustness under such jamming attacks. In this section, we introduce an actor-critic deep RL agent that performs jamming attacks on the aforementioned victim network slicing agents/users (introduced in Section III) and aims at minimizing the victims' transmission rate. We assume the jammer has the geometric map of all BSs, but it does not have any information on the channel states, users' locations, requests, victim reward, or the victim policy. This deep RL jammer may jam multiple channels to reduce the transmission rate, potentially leading to request failures, or observe the environment and record the interference power in each channel to speculate on the victims' actions. We demonstrate a jammer that can significantly degrade the aforementioned victim users' performance even though it lacks critical information on the victims.

A. System Model
In the considered channel model, the fading coefficient of the link from the jammer to the user equipment (UE) $u$ in a certain channel $c$ is denoted by $h_c^{J,u}$, and the fading coefficient of the link from base station $b$ to the jammer in a certain channel $c$ is denoted by $h_c^{b,J}$. We consider $h_c^{b,u}$, $h_c^{J,u}$, and $h_c^{b,J}$ to be independent and identically distributed (i.i.d.) and to vary over time according to the Jakes fading model [30]. Once the jammer is initialized at horizontal location $\{x_J, y_J\}$ with height $h_J$, it can choose in any given time slot one of two operational phases: the jamming phase and the listening phase.
1) Jamming phase: In this phase, the jammer jams $n_J \le N_J$ channels simultaneously with jamming power $P_J$ in each channel without receiving any feedback. With the additional interference from the jammer, the transmission rate $r_{c,J}^{b,u}$ from base station $b$ to UE $u$ in channel $c$ is the base-2 logarithm of one plus the signal-to-interference-plus-noise ratio, in which the jamming interference is added to the interference and noise. Here, $P_J$ is the jamming power, $N_c^{b,J}$ is the jamming interference in channel $c$, $\mathbb{1}_c^{b,J}$ is the indicator function for both base station $b$ and the jammer choosing channel $c$, and $L^{J,u}$ is the path loss from the jammer to UE $u$. By degrading the transmission rate $r_{c,J}^{b,u}$ with the jamming interference in channel $c$, the jamming attack may lead to a number of request failures. The jammer may further amplify its impact by intelligently choosing a preferable subset of channels to jam. The listening phase is introduced to learn such information from the environment.
2) Listening phase: In this phase, the jammer does not jam any channel, but only listens to the (interference) power $N_c^{\text{listen}}$ in each channel $c$ among the $N$ channels. This power is collected from all base stations transmitting in channel $c$, where $\mathbb{1}_c^{b}$ is the indicator function that has a value of 1 if there is a transmission at base station $b$ to any UE in channel $c$, and $L^{b,J}$ is the path loss from base station $b$ to the jammer. Due to the jammer's lack of prior information, we consider an approximation and assume that the listened power $N_c^{\text{listen}}$ from all base stations transmitting in channel $c$ is a rough estimate of the sum of jamming interferences $N_{c,\text{est}}^{b,J}$ that would result if the jammer were in the jamming phase and chose channel $c$ to inject interference. Therefore, with this assumption, the jammer anticipates that the higher the observed (listened) power $N_c^{\text{listen}}$, the more likely it is that jamming in channel $c$ degrades the victim users' performance. Given this, we introduce how we optimize the subset of channels to attack during the jamming phase in Section IV-C.
Another benefit of the listening phase is that no jamming power is consumed in this phase, and consequently the average power consumption is reduced. In the remainder of this paper, we assume that the jammer only switches from the listening phase to the jamming phase at the end of each period of $T_J \in \mathbb{R}^+$ time slots, and thus it has an average power consumption of $n_J P_J / T_J$.
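The listening-phase measurement and the resulting duty cycle can be illustrated with a short sketch. It assumes that the listened power in a channel is the sum, over the base stations active in that channel, of the transmit power scaled by a link gain (path loss times squared fading magnitude); receiver noise is ignored, and the function names and data layout are hypothetical.

```python
def listened_power(active, gain, P_B):
    """Power the jammer hears in each channel during a listening phase.

    active[b][c]: 1 if base station b transmits in channel c, else 0
    gain[b][c]:   assumed link gain from b to the jammer in channel c
                  (path loss times squared fading magnitude)
    """
    n_bs, n_ch = len(active), len(active[0])
    return [sum(active[b][c] * P_B * gain[b][c] for b in range(n_bs))
            for c in range(n_ch)]


def average_jamming_power(n_J, P_J, T_J):
    """One jamming slot per period of T_J slots, n_J channels at power P_J each."""
    return n_J * P_J / T_J


power = listened_power(active=[[1, 0], [1, 1]],
                       gain=[[0.5, 0.2], [0.1, 0.4]],
                       P_B=2.0)
avg = average_jamming_power(n_J=4, P_J=1.0, T_J=2)
```

In this toy case channel 0 is used by both base stations and is heard more strongly than channel 1, so it would be the more attractive target; jamming 4 channels every second slot halves the average power relative to continuous jamming.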

B. Jamming Location Optimization
The jammer aims at minimizing the performance of victim users, but it does not have any information on channel fading, UE locations, or rewards provided to different requests. Therefore, the jamming location is optimized by minimizing the expected sum transmission rate for a given UE $u$, integrated over the service area, when the channels for transmission coincide with the channels being jammed. In this optimization, the expectation with respect to the set of fading coefficients $\{h\}$ considers $\forall b, u, c$: $h_c^{b,u}, h_c^{J,u}, h_c^{b,J}$ i.i.d. $\sim \mathcal{CN}(0, 1)$, $D_h^b$ is the subset of the coverage area with maximal transmission rate $r_{c,J}^{b,u}$ from base station $b$ given $\{h_c^{b,u}, h_c^{J,u}, h_c^{b,J}\}$, and $\forall b, b'$: Additionally, we note that $P_B$, $h_B$, $|\alpha|$, and $\sigma$ with arbitrary positive values will not affect the optimized jamming location.

C. Jamming Channel Optimization
After the jammer is initialized in the true environment and has listened to the interference power $N_c^{\text{listen}}(t-1)$ within the listening phase at time $t-1$, it will decide the subset of channels $C_J(t) \subset C$ to jam during the jamming phase at time $t$, where $|C_J(t)| = n_J(t)$ and $C$ is the set of all channels $\{1, 2, \ldots, N\}$. According to (30), channels with higher $N_c^{\text{listen}}$ are more likely to be better choices, but this information is not available in the jamming phase at time $t$ (since jamming is performed rather than listening). Therefore, $C_J(t)$ can only be evaluated via $N_c^{\text{listen}}(t-1)$ and $N_c^{\text{listen}}(t+1)$. Note that $N_c^{\text{listen}}(t+1)$ is not available at time $t$, and this challenge will be addressed via deep RL in the next subsection (i.e., by essentially introducing a reward to train the neural network at time $t+1$ and having that reward depend on $N_c^{\text{listen}}(t-1)$ and $N_c^{\text{listen}}(t+1)$). Hence, with the deep RL approach, the action will depend only on observations before time $t$.
In the absence of information on the requests, we assume a model in which each request arrives and is completed independently. Thus, the state at the current time slot $t$ can be estimated as a linear interpolation (or a weighted average) of $N_c^{\text{listen}}(t-1)$ and $N_c^{\text{listen}}(t+1)$, denoted by $N_c^{\beta}(t)$ as in (33), in which $\beta(t)$ describes the impact of the jammer on the victims. Therefore, the optimized subset of channels to jam can be determined from $N_c^{\beta}(t)$. Typically, a request takes multiple time slots to complete. When it is jammed, there are two possibilities. On the one hand, it may fail to meet the minimum transmission rate limit and get terminated immediately. In such a case, the next request in the queue is processed, and the network slicing agent rearranges and distributes channels into a new set of slices to be allocated to different requests from different users, and thus the listened interference $N_c^{\text{listen}}(t+1)$ may change dramatically. In this case, we are likely to have $\beta(t) > 1$. On the other hand, if the request under attack has a lower transmission rate but the minimum transmission rate limit and the lifetime constraint are satisfied, the transmission will last longer, and the interference $N_c^{\text{listen}}(t+1)$ is less likely to vary from that at time $t$. In this case, we are likely to have $\beta(t) < 1$. Therefore, the value of $\beta(t)$ should be determined via experience, from the quantity $d_c^{\text{listen}}$ evaluated on two sets of time points: $T''$ is a set of time points in the jamming phase where each $t'' \in T''$ is close to time $t$, and $T'$ is a set of time points where each $t' \in T'$ is in successive listening phases without a jamming attack. If $T_J > 3$, $d_c^{\text{listen}}(T')$ can be collected from the successive listening phases in every period during training. Otherwise, if $T_J \le 3$, $d_c^{\text{listen}}(T')$ has to be collected before jamming starts.
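The interpolation and channel-selection idea can be sketched as follows. The exact weighting used in the paper is not recoverable from the text, so this example assumes a simple weighted average with weights 1 and $\beta$ on the measurements before and after the jamming slot, followed by picking the $n_J$ channels with the largest interpolated power. The function names are hypothetical.

```python
def interpolate(prev, nxt, beta):
    """Weighted average of the powers listened before and after a jamming slot.

    Assumed form: (prev + beta * nxt) / (1 + beta), so beta > 1 leans toward
    the later measurement and beta < 1 toward the earlier one.
    """
    return [(p + beta * n) / (1.0 + beta) for p, n in zip(prev, nxt)]


def top_channels(scores, n_J):
    """Indices of the n_J channels with the largest scores, in ascending order."""
    order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(order[:n_J])


estimate = interpolate(prev=[1.0, 3.0], nxt=[3.0, 1.0], beta=1.0)
targets = top_channels([0.1, 0.9, 0.5, 0.7], n_J=2)
```

Since the later measurement is unavailable when the decision is made, such an interpolated target can only serve as a training signal after the fact, which is the role it plays in the reward of the actor-critic jammer.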
Again, it is important to note that when the jammer agent makes decisions at time t, N_c^listen(t + 1) is not available. To address this, we propose a deep RL agent that uses the actor-critic algorithm to learn the policy.
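The channel-selection idea in this subsection can be illustrated with a short Python sketch. It assumes one particular weighted-average form for the interpolated interference, (β N_c^listen(t − 1) + N_c^listen(t + 1)) / (β + 1), chosen only because a larger β then puts more weight on the last listening phase while β = 0 uses only the next one; the exact expression in the paper may differ. The function names `interpolate_interference` and `select_channels` and the toy values are hypothetical.

```python
from typing import List


def interpolate_interference(prev: List[float], nxt: List[float], beta: float) -> List[float]:
    """Weighted average of interference listened at t-1 and t+1 (assumed form)."""
    if beta < 0:
        raise ValueError("beta must be non-negative")
    return [(beta * p + n) / (beta + 1.0) for p, n in zip(prev, nxt)]


def select_channels(estimate: List[float], n_jam: int) -> List[int]:
    """Return the indices of the n_jam channels with the highest estimated interference."""
    ranked = sorted(range(len(estimate)), key=lambda c: estimate[c], reverse=True)
    return sorted(ranked[:n_jam])


# Toy example with N = 4 channels (illustrative values, not from the paper).
n_prev = [0.9, 0.1, 0.4, 0.2]  # listened at t - 1
n_next = [0.1, 0.8, 0.3, 0.2]  # listened at t + 1, known only in hindsight
estimate = interpolate_interference(n_prev, n_next, beta=1.0)
chosen = select_channels(estimate, n_jam=2)
print(chosen)
```

Because N_c^listen(t + 1) is known only in hindsight, such a computation can serve as a training target after the fact, not as the decision rule at time t, which is the role the learned policy plays.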

D. Actor-Critic Jammer Agent
Our proposed jammer agent utilizes an actor-critic deep RL algorithm to learn the policy that optimizes the output C_J(t) to minimize the victims' expected sum rate. The jammer works with a period T_J: it switches from the listening phase to the jamming phase only at the end of each period, and uses the policy to make the decision C_J(t). We next introduce the observation, action, reward, and the actor-critic update of this agent.
1) Observation: At each time slot, the jammer records its instantaneous observation as a vector O_J ∈ R^N, whose definition in a listening phase uses the indicator function 1. At the beginning of time slot t in the jamming phase, the full observation O_J(t) is fed as the input state to the actor-critic agent.
2) Action: At the beginning of time slot t in a jamming phase, given the input state O_J(t), the actor neural network outputs a vector of probabilities P_J(t) ∈ [0, 1]^N. From this probability vector, the decision C_J(t), i.e., the subset of channels to jam, is derived and described as the action A_J(t) ∈ {0, 1}^N in (37).
3) Reward: Following the jamming phase at time t, the reward used to train the critic is received after the next listening phase at time t + 1. This reward aims at encouraging the policy to produce an action that imitates N_c^β(t), the linear interpolation of listened interference in (33). Therefore, we set the reward as the negative of the mean squared error with respect to N_c^β(t).
4) Actor-Critic Update: At the beginning of a jamming phase at time t, the actor with parameter θ_J and policy f_θJ(O_J) maps the input observation O_J to the output probability P_J, which is similar to a Q-value generator. The critic with parameter ϕ_J and policy g_ϕJ(O_J) maps O_J to a single temporal difference (TD) error, in which γ_J ∈ (0, 1) is the discount factor. For each training sample, the critic is updated towards the optimized parameter ϕ*_J that minimizes the least-squares TD error, and the actor is updated towards the optimized parameter θ*_J along the policy gradient. Both networks are updated alternately to attain the optimal actor-critic policy. Note that during each parameter update with the bootstrap method, the mini-batch of training samples (i.e., action-reward pairs) should be randomly drawn from a longer history record for faster convergence.
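The action, reward, and TD-error steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the action rule (mark the n_J most probable channels), the scaling of the target to [0, 1] before taking the mean squared error, and the standard one-step TD error are all assumptions, and the helper names are hypothetical.

```python
from typing import List


def action_from_probs(probs: List[float], n_jam: int) -> List[int]:
    """Binary action in {0,1}^N: mark the n_jam most probable channels (assumed rule)."""
    top = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:n_jam]
    chosen = set(top)
    return [1 if c in chosen else 0 for c in range(len(probs))]


def neg_mse_reward(action: List[int], target: List[float]) -> float:
    """Negative mean squared error between the action and the target scaled to [0, 1]."""
    peak = max(target)
    scaled = [x / peak if peak > 0 else 0.0 for x in target]
    return -sum((a - s) ** 2 for a, s in zip(action, scaled)) / len(action)


def td_error(reward: float, value_now: float, value_next: float, gamma: float) -> float:
    """Standard one-step TD error: r + gamma * V(next) - V(now)."""
    return reward + gamma * value_next - value_now


# Toy example with N = 4 channels and n_J = 2 (illustrative values only).
action = action_from_probs([0.7, 0.2, 0.6, 0.1], n_jam=2)
good = neg_mse_reward(action, [2.0, 0.0, 2.0, 0.0])  # action matches the target
bad = neg_mse_reward(action, [0.0, 2.0, 0.0, 2.0])   # action misses the target
delta = td_error(bad, value_now=0.5, value_next=0.5, gamma=0.9)
```

In this sketch the reward is largest (zero) when the jammed channels coincide with the channels of highest interpolated interference, which is the imitation behavior the reward is meant to encourage.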

E. Numerical Results
As shown in Fig. 5, in the experiments we consider a service area with N_B = 5 base stations, N_u = 30 users, and a jammer. The theoretically optimized location of the jammer {x*_J, y*_J} is determined according to (31) and lies at the center {0, 0}, while the actual location is moved slightly away from the base station tower. There are N = 16 channels available, and the jammer picks at most N_J = 8 channels for jamming. The jamming power in each channel equals the transmission power in each channel: P_J = P_B. The jammer has a phase switching period of T_J = 2 time slots.
In the experiments, the network slicing agents are initialized as the well-trained MACC agents with pointer network actors as detailed in Section III, and time is initiated at t = 0. The jammer's actor and critic policies are implemented as two feedforward neural networks (FNNs), each with one hidden layer of 16 hidden nodes. The jammer is initialized at t = 100, at which point it begins jamming and performs online updating. During the training phase 100 ≤ t < 10000, the jammer follows an ϵ-greedy policy to update the neural network parameters θ_J and ϕ_J with learning rate 10^−5. It starts by fully exploring random actions, and the probability of choosing random actions linearly decreases to 0.01, thus eventually leading the agent to mostly follow the actor policy f_θJ(O_J). This probability is fixed to 0.01 in the testing phase from t = 10000 to t = 20000.
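The exploration schedule described above can be written as a small Python helper. This sketch assumes that the linear decay spans the whole training phase (from t = 100 to t = 10000), which the text does not state explicitly; the function name `exploration_prob` is hypothetical.

```python
def exploration_prob(t: int, t_start: int = 100, t_end: int = 10000,
                     eps_start: float = 1.0, eps_final: float = 0.01) -> float:
    """Probability of taking a random action at time slot t.

    Decays linearly from eps_start at t_start to eps_final at t_end
    (assumed decay interval), and stays at eps_final afterwards.
    """
    if t <= t_start:
        return eps_start
    if t >= t_end:
        return eps_final
    frac = (t - t_start) / (t_end - t_start)
    return eps_start + frac * (eps_final - eps_start)


# Fully random at the start of training, 0.01 throughout the testing phase.
schedule = [exploration_prob(t) for t in (100, 5050, 10000, 20000)]
```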
In Fig. 6, we compare the performance of the proposed actor-critic jammer, which approximates N_c^β(t), with four other scenarios in terms of the victims' sum reward in the testing phase.
The figure shows the testing phase after the victim agent adapts to the jamming attack, and as a result its action policy becomes stable. The slight fluctuations in the curves are due to randomly arriving requests and varying channel states. We also note that the vertical axis is the victim's reward, and hence a jammer with better performance leads to a lower victim reward. The first case is the setting with no jammer, and hence the performance is that of the original network slicing agent in terms of the sum reward in the absence of jamming attacks. The second scenario is with a last-interference jammer agent that is positioned at the same location (i.e., the origin {0, 0}) with the same power budget. However, this jammer agent does not utilize any machine learning algorithm, and chooses the subset of channels with the highest listened interference power levels in the last listening phase. Consequently, the last-interference jammer, which focuses on the last time slot, is equivalent to the proposed jammer when β → ∞. The third scenario is with the next-interference jammer agent, which is also an actor-critic jammer agent but whose reward has β = 0, so it concentrates on the next time slot. Furthermore, we provide performance comparisons with a max-rate jammer that is assumed to know the channel environment between the base stations and users perfectly (which implies a rather strong jammer). It thus obtains every potential channel rate regardless of the interference from other users and picks the maximum in each channel. Given the N maximum potential rates in the N different channels, the jammer randomly picks channel c to jam with a probability that depends on these rates. To better evaluate the performance among different jammers, we assume that all jammers have the same power budget. We observe that all other jammers are less efficient in suppressing the victim sum reward compared to our proposed actor-critic jammer, which aims at estimating N_c^β(t) and therefore has better performance. Specifically, we observe in Fig. 6 that the proposed jammer results in the smallest sum reward values for the network slicing agents and has the most significant adversarial impact.
In order to give a comprehensive comparison, in Fig. 7 we also illustrate the performance of the same set of jammers where the jamming power is 60 dB higher than the transmission power, i.e., P_J = 10^6 P_B. We notice that since the victim user receives a negative reward when a request fails, some curves take negative values at times in this harsh jamming interference environment.
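As a quick check of the power budgets used in these experiments, the decibel figures can be converted to linear power ratios with the standard relation ratio = 10^(dB/10). The helper names below are hypothetical and serve only as an arithmetic illustration.

```python
import math


def db_to_linear(db: float) -> float:
    """Convert a power ratio in decibels to a linear ratio."""
    return 10.0 ** (db / 10.0)


def linear_to_db(ratio: float) -> float:
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(ratio)


# 0 dB corresponds to P_J = P_B, and 60 dB to P_J = 10^6 P_B.
low_budget_ratio = db_to_linear(0.0)
high_budget_ratio = db_to_linear(60.0)
```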
Since some curves with similar performance are hard to distinguish, we provide the numerical results of average reward and request completion ratio for both experiments in Table II. We observe that the victim under the actor-critic jamming attack completes far fewer requests and obtains the smallest average reward in both scenarios, and thus the actor-critic jammer outperforms all other jammers, including the max-rate jammer that knows the channel status. Additionally, in the first scenario where P_J = P_B, we notice that the base station at coordinates (0, 0) in Fig. 5, which is the closest one to the jammer's location, completes only 67.25% of the requests under the proposed jamming attack. In comparison, this base station completes about 80% of the requests under the other types of attack. This base station also completes fewer requests under our proposed attack when P_J = 10^6 P_B. These observations indicate that the proposed jammer agent is more likely to learn from the received interference.

V. NASH-EQUILIBRIUM-SUPERVISED POLICY ENSEMBLE FOR JAMMING AND DEFENSE
As demonstrated in the previous section, an actor-critic based jammer with limited information and power budget can lead to significant degradation in the performance of network slicing. In this section, instead of jamming detection or jamming pattern-based defensive strategies, we propose the Nash-equilibrium-supervised policy ensemble (NesPE) as a strategy that automatically explores the underlying available actions to achieve the optimal performance in both competitive and non-competitive contexts. We show that NesPE can be utilized as the victims' defensive strategy and can also be applied to jamming attack design.

A. Nash-Equilibrium-Supervised Policy Ensemble
We first describe the details of NesPE as a scheme that enhances the performance of a player in a multi-player zero-sum stochastic game [37] in which the player does not have direct observation of the opponents' action choices. The actions of the optimal mixed strategy depend on the prior observation, and this relationship is to be determined via deep RL or other machine-learning-based algorithms. To train the policy ensemble with the deep RL algorithm of the player and to determine the optimal mixed strategy profile, we consider the tuple G = ⟨O, E, (f_θe)_{e∈E}, ê, L, (O_l)_{l∈L}, u⟩, where
• O is the prior observation before decision-making.
• E is the finite index set of policies, i.e., E = {1, . . ., E}.
• f_θe is the actor function of policy e with parameter θ_e, and is regarded as the player's strategy. It maps the prior observation to the player's action A_e = f_θe(O) ∈ {0, 1}^N, where the chosen elements have value 1 and the others are set to 0.
• ê is the index of the policy chosen to execute the action.
• L is the finite index set of subsequent observations obtained after decision-making, i.e., L = {1, . . ., L}.
• O_l is the subsequent observation, and l indirectly represents the opponents' action choice. Thus it is regarded as the opponents' strategy.
• u is the utility function of the player when it chooses policy e and the interaction with the opponents results in the observation O_l. The mapping function is unknown to both players, and its outcome is available only to the considered player after decision-making.
According to [38], every finite strategic-form game has a mixed strategy equilibrium. In this stochastic game without prior knowledge of the utility matrix, the considered player can alternatively obtain the utility matrix by recording the experienced utility and taking the average. We maintain a matrix H of queues with finite length (specifically, each element of this matrix is a queue with a few recorded utilities), and the matrix has E rows and L columns. In each time slot, the experienced utility u_ê,l(t) is appended to the queue H(ê, l). At the beginning of the next time slot, the player uses the average of each queue to form a new matrix with E × L elements as the utility matrix for calculating the optimal profile at Nash equilibrium σ = {σ_1, . . ., σ_E}, which is the set of probabilities to choose and execute each strategy at a time.
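The bookkeeping described above (a matrix of bounded queues, averaged into a utility matrix, from which a mixed strategy is computed) can be sketched in Python as follows. The equilibrium step here uses fictitious play, a simple iterative approximation for zero-sum matrix games that is swapped in for illustration; the paper does not specify which solver is used, and an exact linear-programming solver would be a natural alternative. The function names and the fallback value for empty queues are assumptions.

```python
from collections import deque
from typing import Deque, List


def make_history(n_policies: int, n_classes: int, maxlen: int) -> List[List[Deque[float]]]:
    """E x L matrix H whose entries are bounded queues of recorded utilities."""
    return [[deque(maxlen=maxlen) for _ in range(n_classes)] for _ in range(n_policies)]


def utility_matrix(history: List[List[Deque[float]]], default: float = 0.0) -> List[List[float]]:
    """Average each queue; empty queues fall back to `default` (an assumption)."""
    return [[sum(q) / len(q) if q else default for q in row] for row in history]


def fictitious_play(u: List[List[float]], iters: int = 5000) -> List[float]:
    """Approximate the row player's equilibrium mix in a zero-sum matrix game."""
    n_rows, n_cols = len(u), len(u[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] = 1
    col_counts[0] = 1
    for _ in range(iters):
        # Row player maximizes utility against the column player's empirical mix.
        row_pay = [sum(u[i][j] * col_counts[j] for j in range(n_cols)) for i in range(n_rows)]
        # Column player minimizes utility against the row player's empirical mix.
        col_pay = [sum(u[i][j] * row_counts[i] for i in range(n_rows)) for j in range(n_cols)]
        row_counts[max(range(n_rows), key=lambda i: row_pay[i])] += 1
        col_counts[min(range(n_cols), key=lambda j: col_pay[j])] += 1
    total = sum(row_counts)
    return [c / total for c in row_counts]


# Bounded queue: with maxlen = 3, the oldest utility (1.0) is dropped.
H = make_history(n_policies=2, n_classes=2, maxlen=3)
for value in (1.0, 2.0, 3.0, 4.0):
    H[0][0].append(value)
U = utility_matrix(H)

# Matching pennies has no dominating action, so the mix is close to uniform.
sigma_mixed = fictitious_play([[1.0, -1.0], [-1.0, 1.0]])
# When the first row dominates, the profile collapses to a pure strategy.
sigma_pure = fictitious_play([[2.0, 3.0], [0.0, 1.0]])
```

The two toy games mirror the two contexts analyzed later in this section: a dominating action yields a pure strategy, while its absence yields a genuinely mixed profile.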
However, although the probability of choosing each given policy is optimized, the parameters of the policies (θ_e)_{e∈E} may not be optimized, and need further training with the received utility u_ê,l(t) after each time slot t. Most existing works on policy ensembles use similar rewards to train different policies, and thus there is no guarantee that the policy ensemble explores different strategies. In NesPE, we consider a dual problem that takes both the utility reward and the correlation of the policy ensemble [23] into account. The expected correlation between policy e_1 and policy e_2 is defined in (44). The correlation between the two policies indicates the similarity of their decision-making patterns. To fully explore all possible strategies and increase the policy variety that feeds the mixed strategy, lower correlations between the policies are desired.
Thus, in order to maximize the expectation of the utility reward u_ê,l while keeping the correlation below a certain threshold D, at each time slot t we consider a non-linear programming problem for policy e, where ρ(e, e′, t) is the instantaneous correlation between policies e and e′ at time t.
We analyze the performance of NesPE in two contexts. On the one hand, we consider a set of opponents and an environment in which a dominating action exists, i.e., an action for which no other action achieves a higher utility. Due to random initialization, there will be one policy e in NesPE that has the highest σ_e, and thus we have |R(e, t) − u_e,l(t)| < |R(e_1, t) − u_e1,l(t)|, ∀e_1 ≠ e. Therefore, the reward of policy e focuses on maximizing E(u_e,l), while the other policies are encouraged to choose actions different from the one chosen by policy e. Such iterative training will lead to lim_{t→∞} σ_e(t) = 1. Therefore, NesPE in this context will converge to a pure strategy with a single policy e that fully optimizes E(u_e,l).
On the other hand, if there is no dominating action and a mixed strategy is preferred, σ_e will be upper bounded. Each policy e_1 is encouraged to find a balance between optimizing u_e1,l(t) and diverging from the others, while the better policies will be less affected by the correlation minimization. Thus, the player may obtain the optimized mixed strategy via the converged Nash equilibrium profile.

B. Numerical Results
In this section, we apply NesPE to both the network slicing victim agent and the jammer agent, and compare the performances in terms of the victims' average reward and completion ratio during the testing phase. All parameters other than the ensemble are the same as in Section IV-E.
For the policy ensemble of the victim user agent, we consider E = 5 policies in the ensemble, and L = 5 different classes of subsequent observations. In this case, at each base station b, each element H_b[u, c] of the transmission rate history is a queue with a maximal length of Q instead of a single number. The subsequent observation is the set of experienced transmission rates r_c^{b,u}(t) at time t, and it is classified into type l using the rates over U_b, the set of served users at base station b, and C_u, the set of channels in the slice assigned to user u.
For the policy ensemble of the jammer agent, we consider E_J = 2 policies in the ensemble, and L_J = 2 different classes of subsequent observations. In this case, a queue of finite length holding the listened differences d_c^listen is kept, and we use its average value to classify the subsequent observation.
For comparison, we also show the results obtained with two other types of policy ensemble methods that aim at robustness, namely agents with policy ensembles (APE) [17] and cooperative evolutionary reinforcement learning (CERL) [29]. The key idea of APE is to randomly initialize different policies, select one policy at a time, maintain a separate replay buffer, and maximize the expected reward. On the other hand, CERL sets different policies with different hyper-parameters, uses a shared replay buffer, and applies a neuro-evolutionary algorithm.
In Table III, we compare the victim agents' average rewards and completion ratios in the presence of a jammer agent when the proposed NesPE and other algorithms are employed at both the network slicing agents and the jammer. In rows from top to bottom, we have the performance with no jammer, the original jammer with a single policy, and NesPE, APE, and CERL based jammers, respectively. For each jammer, we present the performance with both the lower power budget of 0 dB, where P_J = P_B, and the higher power budget of 60 dB, where P_J = 10^6 P_B. Across different columns from left to right, we have the performance when the victim utilizes a single policy, and NesPE, APE, and CERL based policies, respectively. In the first row, we notice that each type of victim agent has a similar performance (in terms of both average reward and completion ratio), which indicates that different victim agents are able to identify and utilize the dominating strategy. In the following rows, we see that the NesPE based victim network slicing agent achieves a better performance than the other victim agents against all different types of the jammer agent. In the third row, which provides the performance of the victim network slicing agents in the presence of a NesPE based jammer, we observe that all victims under this NesPE based jamming attack have lower rewards and completion ratios compared to those under the other types of jamming attacks. With the higher jamming power budget of 60 dB, we observe similar trends when comparing different algorithms. Hence, the NesPE based jamming agent performs better in suppressing the performance of network slicing agents. This lets us conclude that NesPE based strategies outperform the other algorithms in a competitive environment where an adversary exists, and both the victim and the jammer perform more favorably when they employ NesPE.

VI. CONCLUSION
In this paper, we designed network slicing agents using MACC multi-agent deep RL with pointer network based actors. We considered an area covered by multiple base stations and addressed a dynamic environment with time-varying channel fading, mobile users, and randomly arriving service requests from the users. We described the system model, formulated the network slicing problem, and designed the MACC deep RL algorithm and pointer network based actor structure. We demonstrated the proposed agents' performance via simulations and showed that they can achieve high average rewards and complete around 95% of the requests.
Subsequently, we developed a deep RL based jammer that aims at minimizing the network slicing agents' (i.e., victims') transmission rate and thus degrades their performance without prior knowledge of the victim policies and without receiving any direct feedback. We introduced the jamming and listening phases of the proposed jammer and addressed the jamming location optimization. We also studied jamming channel optimization by designing an actor-critic agent that decides which channels to jam. We demonstrated the effectiveness of the proposed jammer via simulation results, and quantified the degradation in the performance of the network slicing agents compared to the performance achieved in the absence of any jamming attacks. We also provided comparisons among actor-critic based jammers with different assumptions on how to decide which channels to jam (e.g., based on the last or estimated next interference, or a linear interpolation of the two).
Finally, we designed the NesPE algorithm for a competitive zero-sum game between the victim agents and the jammer agent. By applying NesPE to both the victim agents and the jammer agent and comparing with two other policy ensemble algorithms, we have shown that NesPE not only adapts to a highly dynamic environment with time-varying fading and mobile users, but also adapts to a competitive environment against an adversary's policy ensemble agent with an optimal mixed strategy. Thus, both the victim network slicing agents and the jammer should apply NesPE to attain improved performance levels, and the interaction between them converges to the Nash equilibrium over all possible policy ensembles.

Fig. 1. WNV system model for allocating virtualized physical resources via MVNO. N cb denotes the number of slices available at InP b.

Similar to sequence-to-sequence learning, the pointer network utilizes two RNNs, called the encoder and the decoder, whose output dimensions are h_e and h_d. The former encodes the sequence {O_b^{e1}, O_b^{e2}, . . ., O_b^{e n_r}} with O_b^{ek} ⊂ O_b (where the index k ∈ {1, 2, . . ., n_r} corresponds to requests), and produces the sequence of vectors {e_1, e_2, . . ., e_{n_r}} with e_k ∈ R^{h_e} at the output gate in each step. The decoder inherits the encoder's hidden state and is fed by another sequence {O_b^{d1}, O_b^{d2}, . . ., O_b^{dN}}, O_b^{dc} ⊂ O_b (where the index c ∈ {1, 2, . . ., N} corresponds to channels), and produces the sequence of vectors {d_1

Fig. 2. Pointer network structure. At each step, the encoder RNN (red) converts each element of the input O_b^{ek} into the output e_k, while the hidden state and cell state are fed to the decoder RNN (blue). The decoder converts each element of the input O_b^{dc} into the output d_c at each step. The final output of the pointer network is generated by the attention mechanism as the conditional probability P_c, and each element is generated via e_k and d_c.
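The attention step that turns the encoder outputs e_k and a decoder output d_c into the conditional probability P_c can be illustrated with the following Python sketch. It assumes the additive attention commonly used in pointer networks, score_k = v · tanh(W_enc e_k + W_dec d_c) followed by a softmax over the requests k; the source text does not give the exact scoring function, and the weight names `w_enc`, `w_dec`, and `v` are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def matvec(m: List[List[float]], x: List[float]) -> List[float]:
    """Matrix-vector product for small dense lists."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


def pointer_probs(enc: List[List[float]], dec: List[float],
                  w_enc: List[List[float]], w_dec: List[List[float]],
                  v: List[float]) -> List[float]:
    """Distribution over encoder positions for one decoder step (assumed additive attention)."""
    proj_dec = matvec(w_dec, dec)
    scores = []
    for e_k in enc:
        proj_enc = matvec(w_enc, e_k)
        scores.append(sum(vi * math.tanh(a + b) for vi, a, b in zip(v, proj_enc, proj_dec)))
    return softmax(scores)


# Toy example: identity weights, two-dimensional encoder and decoder outputs.
identity = [[1.0, 0.0], [0.0, 1.0]]
p_two = pointer_probs([[1.0, 1.0], [-1.0, -1.0]], [0.0, 0.0], identity, identity, [1.0, 1.0])
p_five = pointer_probs([[0.0, 0.0]] * 5, [0.3, -0.2], identity, identity, [1.0, 1.0])
```

The output length follows the number of encoder steps, which is how a pointer network accommodates a varying number of requests without changing its parameters.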

2) Decentralized Actor with Pointer Network: The decentralized actor agent f_θ(O_b) at each base station b is implemented as a pointer network. The initial hidden state and cell state of the encoder LSTM are generated by a separate FNN without hidden layers, whose input is the information of the requests in the queue {I_b^{n_r+1}, I_b^{n_r+2}, . . ., I_b^{n_r+n_q}}, whose components are the last action corresponding to request k, the transmission rate history corresponding to the user U_b[k] of request k, and the information of request k.

Fig. 3. Coverage map of service area with 5 base stations and 30 users.
Fig. 4. Performance comparison in terms of sum reward for the proposed MACC agent with pointer network based actors against different algorithms.

Fig. 5. Coverage map of service area with 5 base stations, 30 users, and the jammer.

Fig. 6. Comparison of victims' sum reward in the testing phase achieved in the absence of jamming attack, and also achieved under attacks by the last-interference jammer, next-interference jammer, and the proposed actor-critic jammer. The jamming power equals the transmission power.

Fig. 7. Comparison of victims' sum reward in the testing phase achieved in the absence of jamming attack, and also achieved under attacks by the last-interference jammer, next-interference jammer, and the proposed actor-critic jammer. The jamming power is 60 dB higher than the transmission power.
To solve the problem, we can convert it to maximizing a Lagrangian dual function, and set the reward of the deep RL algorithm as
R(e, t) = u_ê,l(t) − ζ Σ_{e′≠e} σ_e′(t) ρ(e, e′, t),   (47)
where ζ is the dual variable, and R(e, t) is appended to the replay buffer for the training process to update the parameters of policy e. We note that the instantaneous correlation ρ(e, e′, t) is weighted by the strategy profile σ_e′ of the other policy e′. Therefore, the more desired policies with higher σ_e are less affected by the correlation with other policies, while the less desired policies are encouraged to diverge from the former. The proposed NesPE strategy is obtained by iteratively repeating the aforementioned process over time as shown in Algorithm 2.

Algorithm 2
  Initialize the parameters θ_e of each policy
  Randomly initialize the utility history H
  for t in range do
    Acquire prior observation O(t)
    Obtain mixed strategy profile σ from the average of H
    Randomly choose one policy ê according to σ
    Execute action A_ê = f_θê(O(t))
    Classify the latter observation O_l(t) into category l
    Append the payoff u_ê,l(t) to the queue H(ê, l)
    Append the information at time t to the replay buffer B
    Randomly sample a mini-batch B′ from B
    for (G′, A_ê′, u_ê′,l(t′)) in B′ do
      for e′ in E′ do
        update θ_e′ with reward R(e′, t′) in (47)
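The reward in (47) can be illustrated with a short Python sketch. The correlation measure used here (the fraction of positions on which two binary action vectors agree) is only an assumed stand-in for the instantaneous correlation ρ(e, e′, t), whose exact definition is not reproduced in this section; the function names and toy values are hypothetical.

```python
from typing import List


def instantaneous_correlation(a1: List[int], a2: List[int]) -> float:
    """Fraction of positions where two binary actions agree (assumed stand-in for rho)."""
    return sum(1 for x, y in zip(a1, a2) if x == y) / len(a1)


def nespe_reward(e: int, utility: float, sigma: List[float],
                 actions: List[List[int]], zeta: float) -> float:
    """R(e, t) = u - zeta * sum over e' != e of sigma[e'] * rho(e, e', t), following (47)."""
    penalty = sum(sigma[k] * instantaneous_correlation(actions[e], actions[k])
                  for k in range(len(actions)) if k != e)
    return utility - zeta * penalty


# Toy ensemble of three policies acting on N = 4 channels (illustrative values).
actions = [[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1]]
sigma = [0.6, 0.3, 0.1]   # mixed strategy profile
u_shared = 1.0            # utility received by the executed policy
rewards = [nespe_reward(e, u_shared, sigma, actions, zeta=0.5) for e in range(3)]
```

In this toy case, policy 0 has the highest σ and its reward deviates least from the shared utility, whereas policy 1, which overlaps with the favored policy, is penalized most, matching the qualitative behavior described for (47).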

TABLE III
AVERAGE REWARD AND COMPLETION RATIO COMPARISON OF DIFFERENT POLICY ENSEMBLE ALGORITHMS AND DIFFERENT JAMMING POWER BUDGET DURING TEST PHASE