Antenna Beamwidth Optimization in Directional Device-to-Device Communication Using Multi-Agent Deep Reinforcement Learning

Exploiting the millimeter wave (mmWave) band is an attractive solution to accommodate the bandwidth-intensive applications in device-to-device (D2D) communications. The directional nature of communications at mmWave frequencies and mobility of devices require beam alignment at both transmitter and receiver ends. The beam alignment signaling overhead leads to a loss in the network’s throughput. There exists a trade-off between antenna beamwidth and the achievable throughput. Although a narrower antenna beam increases the directivity gain, it leads to a higher signaling overhead and less stable D2D links which reduce the network’s throughput. Therefore, optimizing the antenna beamwidth is crucial to maintain the D2D users’ quality-of-experience (QoE). In this paper, we propose a novel distributed antenna beamwidth optimization algorithm based on multi-agent deep reinforcement learning. We model D2D links as agents that interact with the communication environment concurrently and learn to refine their antenna beamwidth policies. Agents aim to maximize the network sum-throughput and maintain reliable communication links while taking into account the application-specific quality-of-service (QoS) requirements and the cost associated with beam alignment. Online deployment of the proposed algorithm is distributed and does not require any coordination among users. The performance of the proposed antenna beamwidth optimization algorithm is compared with other widely used baseline algorithms. Numerical results show that our proposed algorithm improves the network performance significantly and outperforms existing approaches.


I. INTRODUCTION
Device-to-device (D2D) communication allows user equipments (UEs) to communicate over direct links rather than traversing the cellular infrastructure. D2D communication is envisioned to improve the network's performance by offloading the cellular network and providing ubiquitous coverage for commercial, public safety and critical communication applications [1]. However, implementation of D2D communication is limited mainly due to spectrum scarcity in the sub-6 GHz band. Exploiting the abundant unlicensed spectrum in the millimeter-wave (mmWave) band for D2D communications is seen as an attractive solution to addressing the spectrum scarcity bottleneck [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wang .
Radio propagation at mmWave band encounters several obstacles such as severe path-loss and sensitivity to blockages [3]. The small wavelength of mmWave signals, however, facilitates the implementation of large directional and high-gain antenna arrays on D2D devices, which helps to compensate for the additional path-loss [3]. This, in turn, introduces a new challenge to D2D communications. Achieving the maximum directivity gain in a highly directional mmWave band system requires the transmitters and receivers to be aligned. Beam alignment incurs significant signaling overhead, which reduces the network's throughput considerably. There exists a trade-off between antenna beamwidth and achievable throughput [4]. Selecting a narrower antenna beam, although it leads to higher antenna gain, incurs a longer beam alignment overhead and reduces the link stability time. Therefore, one needs to optimize the antenna beamwidth prior to data transmission to maximize the network's sum-throughput and maintain users' quality-of-experience (QoE). Modeling the antenna beamwidth optimization problem and finding a systematic algorithm to reach the optimal solution is even more challenging in the D2D environment, as the mobility of the devices and the diverse quality-of-service (QoS) requirements make the network topology highly dynamic. In this paper, we focus on the antenna beamwidth optimization problem in a mmWave D2D network where D2D UEs optimize their antenna beamwidth based on their context information to maximize the network sum-throughput and to maintain reliable D2D links. Despite the recent advances in antenna beamwidth tuning technologies [5] and the significant impact of antenna beamwidth optimization on network performance [6], this remains a fairly unexplored research area.
There exist few studies in the literature that discuss antenna beamwidth optimization [6]- [10]. Nevertheless, most of the existing works suffer from several limitations that make them inappropriate for D2D communications. For example, works in [6]- [11] are centralized and computationally expensive, [6]- [8], [10] increase the communication overhead significantly and [9], [10] are not robust against changes in the network topology, thus, cannot be applied to the D2D communication framework.
Compared with those methods, deep reinforcement learning (DRL) based algorithms present the most promising mechanisms to tackle complex optimization problems in communication networks. DRL enables an agent to make complex online decisions in dynamic and uncertain environments, given only sequences of observations and rewards, without increasing the overall system overhead. However, existing DRL-based techniques such as [12], [13] implement an independent Q-learning (IQL) approach [14] through which each agent learns a policy based on its own actions and observations and treats other agents as a part of the environment. Nevertheless, multi-agent environments are non-stationary since agents are learning and updating their policies concurrently [15]. Therefore, IQL is not suitable for multi-agent domains, as it causes an agent's locally optimal action to become a globally non-optimal joint action [16]. In addition, the non-stationarity introduced by IQL inhibits exploiting experience replay, which is crucial in speeding up and stabilizing the DRL training process. Therefore, addressing the antenna beamwidth optimization problem using a distributed and low-overhead DRL-based algorithm that accounts for the non-stationarity of the multi-agent environment remains an open problem in the literature.
In this paper, our goal is to maximize the network sum-throughput by optimizing the D2D UEs' antenna beamwidth. The interaction among D2D UEs, along with the mobility of UEs and the QoS requirements of various D2D applications, makes antenna beamwidth optimization a challenging problem. Since ignoring user interactions and the non-stationarity of the environment leads to non-optimal solutions, we model the beamwidth optimization problem as a multi-agent problem and exploit the recent progress in multi-agent DRL to develop a distributed antenna beamwidth optimization algorithm. The proposed algorithm enables D2D UEs to maximize the network's sum-throughput while maintaining reliable communication links to support various D2D commercial and public safety applications. The proposed algorithm simultaneously considers mmWave propagation characteristics, directional communication, D2D users' mobility, payload size, QoS requirements, and the trade-off between achievable throughput and antenna beamwidth. The main contributions are summarized as follows: • We propose a multi-agent DRL-based antenna beamwidth optimization algorithm to maximize the network sum-throughput. In addition, the D2D UEs' joint antenna beamwidth policy maintains the reliability of the D2D links by ensuring that D2D links transmit their payload successfully within the required time budget.
The proposed algorithm has two phases: training and decentralized deployment. The training phase is performed offline, under different network topologies using a shared reward function. The training algorithm enables distributed agents to optimize their antenna beamwidth during the online implementation without requiring any inter-agent communications.
• We model the antenna beamwidth optimization problem as a cooperative multi-agent DRL problem, since implementing IQL does not guarantee convergence to an efficient joint solution. Fingerprint-based learning [17] is implemented to enable agents to track their fellow agents' policies and reach the optimal joint-action solution.
Using fingerprint learning facilitates experience replay which expedites and stabilizes the training phase of the multi-agent DRL-based antenna beamwidth optimization algorithm.
• The performance of the proposed algorithm is compared with the existing methods such as IQL [12], [13], random antenna beamwidth selection [18], and constant antenna beamwidth selection [19]. The simulation results show that our proposed algorithm outperforms other existing methods and improves the network sum-throughput and D2D links' reliability significantly. The remainder of this paper is organized as follows. Section II reviews the relevant related work. The system model and assumptions along with network sum-throughput maximization problem formulation are described in Section III. A novel distributed antenna beamwidth optimization algorithm based on multi-agent DRL for solving the network sum-throughput maximization problem is proposed in Section IV. Simulation results are presented in Section V and finally, conclusions and the future work directions are discussed in Section VI.

II. RELATED WORK
Directional transmissions are used in the mmWave band to compensate for the high path-loss [3]. Therefore, beam alignment must be implemented at the transmitter and receiver ends in order to establish high-throughput physical links. Beam alignment between transceivers requires sending and receiving multiple pilot signals [4], which reduces the D2D links' throughput as D2D transceivers cannot transmit data during the beam alignment phase. Although reducing antenna beamwidth increases the directivity gain, it requires longer beam alignment overhead and is more prone to misalignment. Therefore, it is necessary to optimize antenna beamwidth according to D2D UEs context information. Despite its importance, antenna beamwidth optimization has yet to be explored properly. Existing antenna beamwidth optimization techniques can be categorized into the following groups: particle swarm optimization [6]- [8], dynamic programming [9], non-linear programming [10], deep learning [11], and DRL-based methods [12], [13].
The particle swarm optimization (PSO) algorithm is used in [6], [7] for improving system throughput of a vehicular communication network and a relaying small-cell network. In addition, the PSO algorithm is used in [8] for interference management in the D2D network by proposing a device association and beamwidth selection. Beam management is performed in [9] with the goal of maximizing network throughput using backward dynamic programming. A framework is proposed in [10] to simultaneously control the transmission power and the beam-level beamwidths of indoor mmWave transceivers to maximize the energy efficiency of the network using non-linear programming. In [11], a deep learning-based beam management and interference coordination (BM-IC) is proposed to maximize the sum-rate of a dense mmWave network. These techniques cannot be applied to mmWave band D2D networks since they suffer from several limitations. First, existing techniques such as [6]- [11] are centralized and computationally expensive as they require an online central controller to optimize the antenna beamwidth, thus, cannot be applied to the D2D communication framework. Moreover, most of the existing techniques such as [6]- [8], [10] require coordination and information exchange among network entities, which increases the communication overhead significantly and makes these approaches not scalable. Furthermore, existing techniques such as [9], [10] are not robust against changes in the network topology, where the dynamicity of the network entities can negatively impact the system performance.
Recently, reinforcement learning (RL) has been shown to be a useful tool for tackling several complex optimization problems in communication networks [20], such as dynamic spectrum access [21] and resource allocation [22]. However, the learning process in RL is time-consuming. DRL takes advantage of multi-layer neural networks to expedite the learning process, thereby improving the learning speed and the performance of RL algorithms. A DRL-based approach is proposed in [12] that simultaneously optimizes the beamwidth and transmit power of transceivers in the network. A self-tuning sectorization algorithm is proposed in [13] that optimizes base station MIMO broadcast beams for each cell. The authors in [23] addressed the problem of optimizing relay selection and antenna power allocation using a centralized hierarchical DRL algorithm. However, these works implement an IQL approach through which each agent learns a policy based on its own actions and observations and treats other agents as a part of the environment. Using IQL causes an agent's locally optimal actions to become a globally non-optimal joint action [16]. The non-stationarity introduced by IQL also inhibits exploiting experience replay, which is crucial in speeding up the DRL training process.
Therefore, addressing the antenna beamwidth optimization problem using a decentralized and low-overhead DRL algorithm that considers user interactions and environment non-stationarity is lacking in the literature. To address these gaps, we propose a novel distributed antenna beamwidth optimization algorithm based on multi-agent DRL. Unlike [6]-[11], the proposed algorithm is decentralized, low-overhead, and robust to changes in the network topology. In addition, our proposed algorithm implements fingerprint learning to account for the interaction among users and the non-stationarity of the environment. Therefore, the proposed algorithm reaches an efficient joint solution, unlike [12], [13]. Moreover, experience replay is facilitated through fingerprint learning to expedite the learning process significantly. Our goal is to maximize the D2D network's sum-throughput and maintain reliable communication links while taking into account the application-specific QoS and the cost associated with beam alignment. Table 1 compares the proposed antenna beamwidth optimization algorithm with other relevant schemes.

III. SYSTEM MODEL AND PROBLEM FORMULATION
This section describes the system model for mmWave D2D network and introduces the main elements that impact the antenna beamwidth policies. In addition, we formulate the network sum-throughput optimization and the D2D link reliability problem.

A. NETWORK TOPOLOGY
We consider a network of mobile UEs that communicate through D2D links established in the mmWave frequency band operating under time division duplexing (TDD). A co-channel deployment with bandwidth W, uniform transmit power, and half-duplex mode is assumed. Let L = {1, . . . , L} denote the set of D2D links in the network, where each D2D link comprises a D2D transmitter and a D2D receiver. In this scenario, D2D links are already established using peer association mechanisms such as [24], [25]. Also, each D2D transmitter has a payload B_l in its buffer that must be transmitted within a limited time budget T_l according to its application's QoS requirement. D2D users move at variable speeds and directions.

B. D2D CHANNEL MODELING
To model the mmWave channel, the distance-dependent path-loss model for peer-to-peer communication proposed in [26] is adopted. Under this model, the path-loss is defined as PL(d_l) = C d_l^{−α}, where C denotes the path-loss intercept, α is the path-loss exponent, and d_l represents the link length of a given D2D link l ∈ L. Each communication link experiences i.i.d. small-scale Nakagami fading with parameter N_h. Hence, the received signal power can be modeled as a gamma random variable, h_l ∼ Γ(N_h, 1/N_h).
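As a minimal sketch of this channel model, the snippet below samples the path-loss and unit-mean gamma fading described above. The numerical constants (intercept, exponent, Nakagami parameter) are illustrative assumptions, not values taken from the paper.

```python
import random

# Sketch of the adopted D2D channel model. All numerical constants below are
# illustrative assumptions; the paper does not fix them in this section.
C = 1e-7      # path-loss intercept (assumed value)
ALPHA = 2.45  # path-loss exponent (assumed value)
N_H = 3.0     # Nakagami fading parameter N_h (assumed value)

def path_loss(d_l: float) -> float:
    """Distance-dependent path loss PL(d_l) = C * d_l**(-ALPHA)."""
    return C * d_l ** (-ALPHA)

def fading_sample(rng: random.Random) -> float:
    """Received-power fading h_l ~ Gamma(N_h, 1/N_h), i.e. unit mean."""
    return rng.gammavariate(N_H, 1.0 / N_H)

rng = random.Random(0)
mean_h = sum(fading_sample(rng) for _ in range(20000)) / 20000
```

Note that the Gamma(N_h, 1/N_h) parameterization keeps the fading unit-mean, so it rescales rather than biases the average received power.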

C. ANTENNA PATTERN
We assume that each D2D UE is equipped with a directional antenna and can rotate its antenna bore-sight toward the desired direction with a simple rotation around its location. Each D2D transceiver can pick a beamwidth from the set of its available beamwidths, Φ_l. Without loss of generality, we assume that the D2D transceivers on a given link l adopt the same antenna beamwidth; this can be extended to the case where D2D transceivers implement different strategies.
The directional antenna pattern is modeled using the Gaussian antenna model as

G(θ) = G_m e^{−ρθ²/ϕ²} for |θ| ≤ ϕ, and G(θ) = G_s otherwise, (1)

where ρ = 2.028 ln(10) and 2ϕ is the antenna half-power beamwidth. θ denotes the antenna angle relative to the antenna's bore-sight direction. G_m = π10^{2.028}/(42.64ϕ + π) and G_s = 10^{−2.028} G_m are the maximum main-lobe gain and the side-lobe gain, respectively [27].
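The Gaussian pattern above can be sketched directly from the stated constants. The piecewise main-lobe/side-lobe split at |θ| = ϕ is our reconstruction (it makes the pattern continuous, since e^{−ρ} = 10^{−2.028}), so treat it as an assumption rather than the paper's exact definition.

```python
import math

# Gaussian antenna pattern sketch using the constants given in the text:
# rho = 2.028*ln(10), G_m = pi*10**2.028/(42.64*phi + pi), G_s = 10**-2.028*G_m.
# The piecewise main-lobe/side-lobe form is our reconstruction (assumption).
RHO = 2.028 * math.log(10)

def main_lobe_gain(phi: float) -> float:
    """Maximum main-lobe gain G_m for half-beamwidth phi (radians)."""
    return math.pi * 10 ** 2.028 / (42.64 * phi + math.pi)

def side_lobe_gain(phi: float) -> float:
    """Side-lobe gain G_s = 10**-2.028 * G_m."""
    return 10 ** -2.028 * main_lobe_gain(phi)

def gain(theta: float, phi: float) -> float:
    """Antenna gain at angle theta from bore-sight."""
    if abs(theta) <= phi:
        return main_lobe_gain(phi) * math.exp(-RHO * theta ** 2 / phi ** 2)
    return side_lobe_gain(phi)  # outside the main lobe
```

A narrower beam (smaller ϕ) yields a larger bore-sight gain, which is the directivity-versus-overhead tension the paper optimizes.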

D. BEAM ALIGNMENT OVERHEAD
Achieving the maximum antenna gain in a highly directional mmWave band system requires the transceivers to be precisely aligned by finding the best transmit and receive antenna directions. Beam alignment between transceivers requires sending and receiving multiple pilot signals. In this work, the hierarchical beam alignment method is considered, where first the best wide-beam pair is found through an exhaustive search, and then the search is refined using a narrower beam level within the subspace of the best wide-beam pair [28]. Assuming the antenna wide-beam pairs are already aligned, the narrow-beam alignment time [4] can be written as

T^A_l = (ψ_l/ϕ_l)² T_P, (2)

in which ψ_l and ϕ_l denote the wide- and narrow-level beamwidths of the D2D transceivers on link l, and T_P represents the pilot signal transmission time.
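The overhead expression above can be sketched as follows; the (ψ/ϕ)² pilot count reflects each side sweeping its ψ_l/ϕ_l narrow beams within the best wide-beam pair, which is our reading of the hierarchical search, so take the exact form as an assumption.

```python
# Hierarchical beam-alignment overhead sketch (reconstruction of Eq. (2)):
# after the best wide-beam pair is fixed, each side sweeps psi/phi narrow
# beams, giving (psi/phi)**2 pilot transmissions; T_P is the pilot duration.
def alignment_time(psi: float, phi: float, t_p: float) -> float:
    """Narrow-beam alignment time T^A = (psi/phi)**2 * T_P."""
    n_narrow = psi / phi          # narrow beams per wide beam, per side
    return (n_narrow ** 2) * t_p  # transmitter and receiver sweeps combined
```

Halving the narrow beamwidth quadruples the alignment time, which is the overhead side of the beamwidth trade-off.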

E. LINK STABILITY TIME
A D2D link is stable and appropriate for data transmission as long as its D2D transmitter and receiver antennas stay aligned. Misalignment in directional communication, due to the users' mobility, occurs when the received power drops below a certain ratio of its aligned value, denoted by α ∈ [0, 1]. Consider D2D link l whose receiver and transmitter are located at points A and B, respectively, as shown in Figure 1. Assume that the transceivers' antenna beams are aligned and the antenna main-lobe direction is fixed. Also, the receiver is moving with relative velocity V_l in the direction of the relative angle µ_l (with respect to its antenna bore-sight direction). Since the bore-sight angle of the D2D transceivers is fixed, the movement causes beam misalignment. The pointing error of the D2D receiver toward its transmitter t seconds later, denoted by ∆µ_l, can be obtained using the law of sines in triangle ABA′ as

sin(∆µ_l) = (V_l t / d_l) sin(µ_l), (3)

where d_l denotes the D2D link distance. Note that although the receiver movement changes the distance d_l, the impact of the distance difference is neglected and only the impact of movement on the angular difference is considered. Also, we assume that V_l t ≪ d_l. For small ∆µ_l, we estimate sin(∆µ_l) ≈ ∆µ_l, therefore,

∆µ_l ≈ (V_l t / d_l) sin(µ_l). (4)

By definition, the link is stable if the relative antenna gain at the receiver is above the ratio α ∈ [0, 1]. Using (3) and (4) together with the antenna model (1), the link stability time, denoted by T^S_l, can be written as

T^S_l = (ϕ_l d_l / (V_l sin(µ_l))) √(ln(1/α)/ρ). (5)
It can be seen that higher antenna beamwidth and lower gain threshold increase the link stability time. Moreover, lower relative speed guarantees D2D links to be stable for longer.
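These qualitative dependencies can be checked numerically with a small sketch of the reconstructed stability-time expression; the closed form below follows from the Gaussian main-lobe profile and is our reconstruction, not a formula quoted from the paper.

```python
import math

# Link-stability-time sketch (reconstruction of Eq. (5)): the pointing error
# grows as d_mu ~ V*t*sin(mu)/d, and the link counts as stable while the
# relative Gaussian main-lobe gain exp(-RHO*d_mu**2/phi**2) stays above alpha.
RHO = 2.028 * math.log(10)

def stability_time(phi: float, d: float, v: float, mu: float, alpha: float) -> float:
    """T^S = (phi * d / (v * sin(mu))) * sqrt(ln(1/alpha) / RHO)."""
    max_error = phi * math.sqrt(math.log(1.0 / alpha) / RHO)  # tolerable pointing error
    return max_error * d / (v * math.sin(mu))
```

The monotonicities match the text: wider beams, lower gain thresholds, and lower relative speeds all lengthen the stability time.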

F. PROBLEM FORMULATION
The beam misalignment of D2D transceivers caused by the relative movements of UEs, or the availability of payloads with different QoS requirements, requires D2D UEs to perform beam management, including beam alignment and antenna beamwidth optimization, to maintain or improve their QoE. We consider a time-slotted communication framework with a slot duration of τ, as shown in Figure 2. D2D UEs are allowed to perform beam management at the beginning of each time slot. Beam management is triggered upon antenna misalignment, availability of a new payload in the D2D transmitter's queue, or a change in the QoS requirements. Beam alignment leads to a loss in the D2D links' throughput due to the time consumed to align the transceivers' antenna beams, as explained in Section III-D. In other words, there exists a trade-off between antenna beamwidth and the D2D links' throughput. Selecting a narrower antenna beam leads to higher antenna gain based on (1), but it incurs a higher beam alignment overhead as per (2). Consequently, a narrower antenna beam reduces the data transmission time and the D2D links' throughput. Also, narrow antenna beams are less stable and more prone to misalignment according to (5). Therefore, to maintain the QoE (D2D link reliability) and increase the network's sum-throughput, D2D transceivers are required to optimize their antenna beamwidth according to the network conditions and context information.
The throughput on a given D2D link l with bandwidth W during time slot k can be defined as

t_l(k) = (1 − λ_l(k) η) W log₂(1 + ξ_l(k)), (6)

where λ_l(k) is the beam alignment parameter: λ_l(k) = 1 indicates that beam alignment is performed at time slot k, and λ_l(k) = 0 otherwise. η = T^A_l/τ̂ captures the impact of the beam alignment overhead, where τ̂ represents the effective time slot duration for payload transmission (Figure 2). Since D2D transceivers are not allowed to transmit data on unstable D2D links, the effective time slot duration for payload transmission is defined as the minimum of the link stability time and the maximum allowed time slot duration, i.e., τ̂ = min(T^S_l, τ). The achieved signal-to-interference-plus-noise ratio (SINR) ξ_l(k) on D2D link l can be written as

ξ_l(k) = P G^t_l(θ^t_l) G^r_l(θ^r_l) h_l PL(d_l) / ( Σ_{j∈L, j≠l} P G^t_j(θ^t_{j,l}) G^r_l(θ^r_{j,l}) h_{j,l} PL(d_{j,l}) + σ² ), (7)

where P represents the transmit power, and G^r_l(θ^r_l) and G^t_l(θ^t_l) are the D2D receiver and transmitter antenna gains on link l, respectively. The leftmost term in the denominator represents the aggregated interference received at the receiver of link l from all other D2D transmitters, and σ² denotes the noise power. We assume that the duration of a time slot is shorter than the channel coherence time; therefore, the channel is considered non-varying during a time slot.
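The overhead-discounted throughput in the reconstructed (6) can be sketched as below; all numerical inputs are illustrative, not the paper's simulation parameters.

```python
import math

# Per-slot throughput sketch following the reconstructed Eq. (6): when beam
# alignment happens this slot (lambda_l = 1), a fraction eta = T_A/tau_hat of
# the effective slot is forfeited; the rest carries W*log2(1+SINR).
def slot_throughput(w: float, sinr: float, aligned_this_slot: bool,
                    t_align: float, tau_eff: float) -> float:
    eta = t_align / tau_eff                       # alignment overhead fraction
    overhead = eta if aligned_this_slot else 0.0  # lambda_l(k) * eta
    return (1.0 - overhead) * w * math.log2(1.0 + sinr)
```

In the extreme case where the alignment time consumes the whole effective slot, the slot carries no payload at all, which is exactly why (9b) bounds the feasible beamwidth from below.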
Let q_l(k + 1) be the remaining payload of the D2D transmitter on link l at the beginning of time slot k + 1, defined as

q_l(k + 1) = q_l(k) − δ_l(k), (8)

where δ_l(k) = t_l(k) τ̂ denotes the amount of payload transmitted during time slot k. The QoS constraint of D2D link l is modeled as a limited time budget T_l within which the D2D payload B_l must be transmitted. The reliability of D2D link l, as the measure of the users' QoE, is defined as the ratio of the payload transmitted during the time budget T_l = N_l τ of N_l time slots to the total payload, and can be written as

Λ_l = (1/B_l) Σ_{k=1}^{N_l} δ_l(k).

The problem we address in this work can be formulated as designing an antenna beamwidth selection policy that maximizes the network sum-throughput:

max_Φ Σ_{l∈L} Σ_{k=1}^{N_l} t_l(k), (9a)
s.t. T^A_l ≤ τ̂, ∀l ∈ L, (9b)
ϕ_l ≤ ψ_l, ∀l ∈ L, (9c)

where Φ = {ϕ_1, . . . , ϕ_L} is the joint antenna beamwidth selection policy of the D2D links. Constraint (9b) represents the lower bound of the feasible antenna beamwidth, which holds since the beam alignment time must be less than the effective time slot duration, i.e., T^A_l ≤ τ̂. Constraint (9c) gives the antenna beamwidth upper bound.
The optimization problem (9) is difficult to solve analytically and is computationally hard due to the interaction among D2D links, especially in the D2D environment, which requires low-overhead distributed solutions. In this paper, we propose a solution based on multi-agent deep reinforcement learning to tackle this problem. The proposed framework considers the non-stationarity of the multi-agent environment and the interaction among users. Our goal is to enable D2D UEs to learn a joint antenna beamwidth optimization policy that maximizes the network sum-throughput under various network dynamics, based only on their local observations, without online coordination or message exchange. Moreover, the reliability of the antenna beamwidth optimization policy must be assessed to assure that under such a policy all D2D links' payloads are successfully transmitted, Λ_l ≥ 1, ∀l ∈ L.

IV. PROPOSED SOLUTION USING MULTI-AGENT DEEP REINFORCEMENT LEARNING
In this section, we describe the proposed solution to solve the network sum-throughput maximization problem through optimizing D2D UEs' antenna beamwidth in a mmWave D2D dynamic environment using cooperative multi-agent DRL. First, we explain the multi-agent DRL framework, where multiple agents interact in a common environment, take an action and try to learn a policy to maximize their shared reward. Then, we explain the details of the proposed antenna beamwidth optimization algorithm. The proposed algorithm is based on the multi-agent DRL framework and is used to solve the optimization problem (9). We define the states, actions, and rewards in the mmWave D2D multi-agent environment.

A. BACKGROUND ON MULTI-AGENT DEEP REINFORCEMENT LEARNING AND Q-LEARNING
A cooperative multi-agent DRL framework is a setting where agents concurrently interact with a shared environment and learn to coordinate to achieve a common objective [29]. Agents interact with the environment according to a partially observable Markov decision process (POMDP), in which the system dynamics are determined by an MDP but the agents cannot directly observe the underlying state. A POMDP is defined as the tuple (L, S, U, T, R, Z, O), in which L is the set of agents, S denotes the state space, U = ×_l U_l is the joint action space, and Z = ×_l Z_l is the joint observation space. Each agent executes an action u_l ∈ U_l based on its policy π_l, forming a joint action u that makes the current state s ∈ S transition to ŝ with probability T(s, u, ŝ). Agents in a partially observable environment receive observations of the latent state, denoted z_l, with joint observation probability O(z, ŝ, u), where z = (z_1, . . . , z_L). Consequently, at each time step k, agents receive a shared reward r(k) = R(s(k), u(k)). Agents aim to maximize the expected discounted return R(k) = Σ_{t=0}^{H} γ^t r(k + t + 1) with horizon H, where γ ∈ [0, 1) is the discount factor, by finding the optimal policy π*_l. Q-learning [30] is used to find the best policy by estimating the Q-values of policies, Q^π_l(z_l, u_l) = E_π[R(k) | z(k) = z_l, u(k) = u_l]. In multi-agent DRL, agents interact with the environment without explicit knowledge of the POMDP model. Due to the partial observability and local non-stationarity of the POMDP, learning the underlying model is complicated. Therefore, instead of learning the functions T, R, and O, agents directly learn Q-values or policies.
Q-learning iteratively estimates the optimal Q-value function using backups. The optimal policy π* maximizes the Q-value function, Q^{π*}_l(z_l, u_l) = max_π Q^π_l(z_l, u_l). Deep Q-learning uses a deep neural network, known as a deep Q-network (DQN) parameterized by θ_l, Q(z_l, u_l; θ_l), to estimate Q-values. DQN relies on experience replay to accelerate and stabilize the training process. During training, actions are chosen according to an ε-greedy policy that selects the currently estimated best action with probability 1 − ε, and takes a random exploratory action with probability ε. Each agent's experience, including the current observation, action, reward, and next observation, is stored as the tuple (z_l(k), u_l(k), r_l(k), z_l(k + 1)) in its replay memory. The replay memory is a first-in-first-out queue containing the set of the latest experience tuples. The parameters θ_l are iteratively updated using stochastic gradient descent (SGD) by sampling batches of b experiences from the replay memory to minimize the squared temporal-difference (TD) error

L_l(θ_l) = Σ (y^DQN − Q(z_l, u_l; θ_l))², with y^DQN = r_l + γ max_{u'} Q(z_l(k + 1), u'; θ^−_l),

where θ^−_l are the parameters of a target network periodically copied from θ_l and kept constant for a number of iterations. The replay memory stabilizes learning, prevents the network from overfitting to recent experiences, and improves sample efficiency.
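The moving parts described above (ε-greedy exploration, a FIFO replay memory, TD targets from a periodically copied target network) can be sketched in a few lines. Tabular Q-values stand in for the DQN, and the toy environment and all constants are assumptions purely for illustration.

```python
import random
from collections import deque

# Minimal sketch of the training machinery: epsilon-greedy action selection,
# FIFO replay memory, and TD updates toward a periodically copied target.
# Tabular Q-values stand in for the DQN; the toy reward is an assumption.
N_STATES, N_ACTIONS, GAMMA = 4, 3, 0.9

q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
q_target = [row[:] for row in q]                  # theta^- : frozen copy
replay = deque(maxlen=100)                        # first-in-first-out memory

def act(state: int, eps: float, rng: random.Random) -> int:
    if rng.random() < eps:                        # explore with prob. eps
        return rng.randrange(N_ACTIONS)
    row = q[state]                                # exploit current estimate
    return row.index(max(row))

def td_update(batch, lr: float = 0.1) -> None:
    for (z, u, r, z_next) in batch:
        y = r + GAMMA * max(q_target[z_next])     # TD target uses theta^-
        q[z][u] += lr * (y - q[z][u])             # step on squared TD error

rng = random.Random(1)
for step in range(500):
    z = rng.randrange(N_STATES)
    u = act(z, eps=0.2, rng=rng)
    r = 1.0 if u == 0 else 0.0                    # toy reward: action 0 best
    replay.append((z, u, r, rng.randrange(N_STATES)))
    td_update(rng.sample(replay, min(8, len(replay))))
    if step % 50 == 0:                            # periodic target copy
        q_target = [row[:] for row in q]
```

After training, the greedy policy should pick the rewarded action in every state, illustrating the backup-driven convergence the text describes.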
The widely used approach to solve multi-agent DRL is Independent Q-learning (IQL) [14], where each agent learns its DQN parameters only based on its observations and actions while treating other agents as a part of the environment. However, since all agents are learning and affecting the environment simultaneously, using IQL makes the environment non-stationary from the perspective of any individual agent. Non-stationarity and local observability of multi-agent environments cause locally optimal action to become globally non-optimal joint action [16]. In addition, the non-stationary nature of the environment makes the experience replay memory samples obsolete and negatively impacts the training performance [17].
The non-stationarity can be resolved if each agent's observation state is augmented with an estimate of the other agents' policies. One possible solution is augmenting each agent's observation space with its fellow agents' DQN parameters. However, this method is intractable in practice, since the large number of DQN parameters complicates the learning process. Also, sharing and updating DQN parameters among agents increases the signaling overhead significantly, which consequently reduces the D2D links' throughput. To overcome this problem, low-dimensional estimates (i.e., fingerprints) of the other agents' policies can be added to the agents' experience [17]. It is shown that augmenting the agents' experience tuple with fingerprints, including the training iteration number e and the exploration rate ε, disambiguates the age of training samples and stabilizes the replay memory significantly. Recently, this method has been used to address the non-stationarity of the environment in multi-agent wireless networks, such as spectrum sharing [31] and dynamic power allocation [32]. In this work, we use fingerprint-based learning to address the non-stationarity of the mmWave D2D environment.
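The fingerprint idea amounts to attaching the low-dimensional training state (e, ε) to each observation so replayed samples can be "dated". A minimal sketch, with field names of our own choosing:

```python
from typing import NamedTuple, Tuple

# Fingerprint-augmented observation sketch: the local observation is extended
# with the training iteration number e and exploration rate epsilon, so two
# samples with identical local state remain distinguishable by training age.
# Field names here are illustrative, not taken from the paper.
class Observation(NamedTuple):
    local_state: Tuple[float, ...]  # interference, queue, budgets, ...
    episode: int                    # fingerprint: training iteration e
    epsilon: float                  # fingerprint: exploration rate

def augment(local_state, episode: int, epsilon: float) -> Observation:
    return Observation(tuple(local_state), episode, epsilon)

# Same local state, sampled early vs. late in training:
old = augment([0.1, 0.5], episode=10, epsilon=0.9)
new = augment([0.1, 0.5], episode=900, epsilon=0.05)
```

Because the fingerprint is only two scalars, it dates the samples without the overhead of sharing full DQN parameter vectors.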

B. PROPOSED FRAMEWORK
We model the D2D framework in Section III as a POMDP and propose a beamwidth optimization algorithm based on the multi-agent DRL framework to enable D2D UEs to solve the optimization problem in (9a)-(9c). In this framework, the set of D2D links L are modeled as agents that are assigned the common objective of maximizing the network sum-throughput. D2D links interact with the communication environment to gain experience, which enables them to learn the optimal joint antenna beamwidth policy. The proposed framework has two phases: centralized training and distributed deployment. Each D2D link has a DQN that must be trained. During the training phase, each D2D link takes an action (selecting an antenna beamwidth) based on its observation and receives a shared reward from the environment, which directs it toward learning the optimal policy through training its DQN. During the deployment phase, D2D links select an antenna beamwidth based on their observations using their trained DQNs, without online coordination or message exchange. The schematic of the proposed framework is shown in Figure 3.

1) TRAINING PHASE
Since the optimal antenna beamwidth selection policy is unknown to the D2D links at the beginning of the training process, we consider the training process to be an episodic DRL task where learning is a continuing task over a time horizon of T = N_max τ. The D2D links' DQNs are trained by running multiple episodes. At the beginning of each training episode, the environment parameters are randomly initialized, including the D2D UEs' velocities, directions of movement, and antenna main-lobe directions and beamwidths. Also, D2D transmitters are loaded with a payload that lasts until the end of the episode. Algorithm 1 presents the proposed offline antenna beamwidth selection training algorithm.
At the beginning of each time slot, if a payload exists for transmission, q(k) > 0, the D2D transmitter and receiver on each established D2D link examine the antenna alignment. If beam alignment is required (i.e., the transceivers' antennas are misaligned), the D2D UEs on each link must select an antenna beamwidth and perform beam alignment, λ(k) = 1. Otherwise, the D2D UEs stick to their previous policy, λ(k) = 0, and the D2D transmitter continues to transmit its payload (lines 5-13).

a: ACTIONS AND OBSERVATIONS
The D2D transmitter and receiver on each established D2D link select an action, i.e., an antenna beamwidth u_l ∈ Φ_l, based on their local observations and context information using an ε-greedy policy (lines 6-13). The action space of each user is the set of antenna beamwidths that can be generated by the user's antenna array; however, to satisfy (9c), the antenna beamwidth must be less than the wide-level antenna beamwidth, i.e., u_l < ψ_l. We define the observation state of a D2D link l as z_l(k) = {I_l(k), q_l(k), T_l(k), T^S_l(k), λ_l(k), e, ε}, where I_l(k) denotes the measured interference power at the D2D receiver on link l. The interference can be accurately estimated by the receiver of each D2D link at the beginning of each time slot. We assume it is also available instantaneously at the transmitter through a delay-free feedback channel.
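The per-link ε-greedy beamwidth choice under the wide-beam constraint can be sketched as follows; the Q-value lookup is a stand-in for the trained DQN, and all values are illustrative.

```python
import random

# Sketch of per-link epsilon-greedy beamwidth selection: candidates are
# restricted to beamwidths narrower than the wide-level beamwidth psi_l, as
# required by (9c). The q_values mapping stands in for the DQN's estimates.
def select_beamwidth(beamwidths, psi_l, q_values, eps, rng):
    feasible = [b for b in beamwidths if b < psi_l]  # enforce u_l < psi_l
    if rng.random() < eps:                           # explore
        return rng.choice(feasible)
    return max(feasible, key=lambda b: q_values[b])  # exploit best estimate
```

During deployment ε is set to zero, so each link simply picks the feasible beamwidth with the highest estimated value from its own trained network.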

The remainder of Algorithm 1 (lines 17-28) proceeds as follows: the central node computes the shared reward r_s(k) based on (11) and broadcasts it; each D2D link receives the new observation z_l(k+1); then, for each D2D link l ∈ L, the tuple (z_l(k), u_l(k), r_l(k), z_l(k+1)) is stored in the replay memory, a batch of b samples is drawn from the replay memory, the target value y_i^DQN is calculated, and SGD is performed to minimize the loss L_l(θ_l).

A selfish action selection that solely maximizes each D2D link's own throughput cannot guarantee global optimality. Therefore, the D2D links are trained with a shared reward, which turns the environment into a fully cooperative 2 one. At the end of each time slot, each D2D transmitter evaluates the amount of transmitted payload δ_l(k) and forwards it to a central node (line 15). Then, the central node calculates the network average data throughput and broadcasts it to all agents (lines 17-21). The shared reward function is given in (11), where |·| denotes set cardinality. The central node can be the base station or a UE selected by the D2D UEs. Note that, as soon as an agent has delivered its entire payload, its reward becomes a constant, C. This constant should be large enough to ensure that the algorithm encourages reliable D2D links. The observation tuple and the reward function parameters are selected to ensure that the optimization problem in (9) is satisfied. First, the reward function (11) is in line with the objective function (9), maximizing the network sum-throughput. Second, the parameters T_l(k) and T_l^S(k) in the observation tuple (10) ensure that the selected beamwidth provides sufficient time for successful data transfer within the required time budget, as in (9b). Finally, monitoring q_l(k) in the observation tuple, together with the constant reward C, encourages D2D link reliability for every l ∈ L.
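Since the shared reward in (11) is not reproduced here, the following is a hypothetical sketch assembled only from the textual description: every link still transmitting receives the network-average throughput of the slot, and a link that has delivered its whole payload receives the constant C instead. The function name, signature, and the exact normalization are our assumptions, not the paper's formula.

```python
def shared_reward(delta, remaining, tau, C):
    """Hypothetical shared reward in the spirit of (11):
    delta[l]    -- payload transmitted by link l in this slot
    remaining[l]-- payload left in link l's queue, q_l(k)
    tau         -- slot duration
    C           -- constant reward for links that finished."""
    n_links = len(delta)                         # |L|, the set cardinality
    avg_throughput = sum(delta) / (n_links * tau)
    # Finished links (empty queue) get the constant C; active links
    # share the network-average throughput, making the game cooperative.
    return [C if q == 0 else avg_throughput for q in remaining]
```

The key design point the sketch illustrates is that every active agent receives the same network-wide quantity, so improving a neighbor's throughput is never against an agent's own interest.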
At the end of each time slot, experiences are stored in the replay memory, and the D2D links' DQNs are trained using samples from their experience replay memories. In order to stabilize learning, the parameter set of the target DQN, θ−, is duplicated from the training DQN parameter set θ every N_u episodes and kept fixed in between [17] (lines 25-31).
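The target-network bookkeeping described above can be sketched as follows. The class name and the list-based parameter representation are our own simplification; real DQN parameters would be tensors, but the sync logic (copy θ into θ− every N_u episodes, freeze it in between) is exactly what the text describes.

```python
import copy

class DQNPair:
    """Sketch of training/target DQN parameter bookkeeping:
    theta_minus is a frozen copy of theta, refreshed every n_u episodes."""
    def __init__(self, theta, n_u):
        self.theta = theta                        # training parameters
        self.theta_minus = copy.deepcopy(theta)   # frozen target parameters
        self.n_u = n_u

    def maybe_sync(self, episode):
        # Duplicate theta into the target network every n_u episodes;
        # between syncs, theta_minus stays fixed to stabilize the targets.
        if episode % self.n_u == 0:
            self.theta_minus = copy.deepcopy(self.theta)
```

Freezing θ− between syncs keeps the regression targets y^DQN stationary for several episodes, which is the standard remedy for the moving-target instability of Q-learning with function approximation.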

2) DEPLOYMENT PHASE
During the deployment phase, at each time step, each D2D link assesses its context information, including payload availability, beam alignment, and received interference. Using the acquired information, the D2D UEs form the observation tuple in (10) and feed it to the trained DQN, as illustrated in Figure 3. Then, the D2D UEs on each link select the antenna beamwidth with the maximum Q-value. Finally, all D2D links transmit their payload using their selected antenna beamwidth.
The centralized training procedure, which requires high computational capacity, is performed offline under various network topologies and different initial conditions, allowing the D2D links to perform well during decentralized execution, even under strongly non-stationary conditions.

2 By cooperative we mean that the agents aim to maximize a shared objective.

V. NUMERICAL RESULTS
To demonstrate the effectiveness of the proposed algorithm, we compare its performance with two baseline models, i.e., random antenna beamwidth selection [18] and constant antenna beamwidth [19]. Also, to demonstrate the impact of non-stationarity of the multi-agent environment, the performance of the proposed antenna beamwidth optimization algorithm is compared with the classical IQL [12], [13], in which D2D links do not learn to cooperate and treat their fellow agents as a part of the environment.
We custom-built our simulator, which consists of the D2D interaction environment and the D2D links' DQNs used to learn the antenna beamwidth selection policy. The D2D interaction environment is an area of size 1 km × 1 km, which, given the transmit power of the D2D users, is large enough to avoid boundary effects. D2D UEs are located uniformly in the simulation area. For each D2D transmitter, we assume there exists a corresponding D2D receiver at distance d_l. The D2D UEs move according to the random walk model, with trajectories (speed and direction of movement) drawn from i.i.d. uniform random variables. D2D transceivers are equipped with a directional antenna for data transmission in the mmWave band, and we assume that all D2D transmitters transmit at the same power. The simulation parameters shown in Table 3 are used, unless otherwise specified.
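A single step of the mobility model can be sketched as below. The function name, the wrap-around boundary handling, and the default area size are our assumptions; the paper only states that speed and direction are i.i.d. uniform and that the area is large enough for boundary effects to be negligible.

```python
import math
import random

def random_walk_step(x, y, speed_range, tau, area=1000.0):
    """One slot of an i.i.d. random walk: draw a uniform speed (m/s)
    and direction, move for tau seconds inside an area x area square.
    Positions are wrapped at the border (a modeling assumption; the
    paper's 1 km x 1 km area makes boundary handling rarely matter)."""
    speed = random.uniform(*speed_range)
    direction = random.uniform(0.0, 2.0 * math.pi)
    x = (x + speed * tau * math.cos(direction)) % area
    y = (y + speed * tau * math.sin(direction)) % area
    return x, y
```

After each such step, the simulator checks whether the moved transceivers' beams are still aligned and triggers beam alignment only if they are not.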
The DQN of each D2D link comprises three fully connected hidden layers of 500, 250, and 120 neurons, respectively. The rectified linear unit (ReLU), f(x) = max(0, x), is used as the activation function, and the Adam optimizer is used to update the network parameters with a learning rate of 0.001. The agents' DQNs are trained offline using an ε-greedy policy for a total of 3000 episodes. We want the D2D transceivers on each link to find the best policy quickly; however, committing to a policy without sufficient exploration could trap the D2D links in a locally optimal policy. To address this trade-off between exploration and exploitation, the exploration rate ε is linearly annealed from 1 to 0.02 over the first 2400 episodes and remains constant afterward. The DQN hyper-parameter values in Table 4 are tuned through an informal search. It is worth noting that in the training phase the payload size of all D2D links is the same, whereas the payload size and QoS requirements of the D2D links vary during the deployment phase. Each training episode contains N_max time slots of duration τ. D2D transmitters are loaded with a payload of size B at the beginning of each episode, which must be transmitted by the end of the episode. At the beginning of each training episode e, the D2D UEs' velocity, direction of movement, and channel condition are set randomly. L D2D links are established between D2D transceivers in the network environment, and the D2D UEs align their antennas towards their corresponding peers. At the beginning of each time slot k, the D2D links' channel conditions are updated according to a Gamma random variable, and the locations of the D2D UEs are updated according to the random walk model. Based on the new locations of the users, beam alignment is performed if the antennas are misaligned.
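The annealing schedule above (ε from 1 to 0.02 over the first 2400 of 3000 episodes, then constant) is simple enough to state as code; the helper name is ours, the numbers are the paper's.

```python
def exploration_rate(episode, eps_start=1.0, eps_end=0.02, anneal_episodes=2400):
    """Linearly anneal epsilon from eps_start to eps_end over the first
    anneal_episodes training episodes, then hold it constant."""
    if episode >= anneal_episodes:
        return eps_end
    frac = episode / anneal_episodes
    return eps_start + frac * (eps_end - eps_start)
```

Early episodes are thus almost pure exploration, while the last 600 episodes run at ε = 0.02, i.e., near-greedy fine-tuning of the learned policy.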
At the beginning of each time slot, the D2D UEs on each D2D link gather their context information, including the remaining payload in the D2D transmitter's queue, the amount of interference, the remaining link stability time, and the time budget to transmit the payload. Using the gathered information and the fingerprint, consisting of the current training episode number and exploration rate ε, the D2D UEs form the observation tuple in (10). The D2D UEs take an action according to the ε-greedy policy and receive a reward and a new observation. This information is stored in the replay memory, which is used to train the DQN. The size of the experience memory is limited; during training, older samples are gradually replaced by new ones.
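The bounded replay memory with gradual overwriting of old samples can be sketched in a few lines; the class and method names are our own, but the behavior (fixed capacity, oldest-first eviction, uniform mini-batch sampling) is as described.

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded experience buffer: once full, the oldest stored
    transitions are overwritten by new ones."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)      # deque drops oldest items

    def store(self, z, u, r, z_next):
        # One transition: observation, action, reward, next observation.
        self.buffer.append((z, u, r, z_next))

    def sample(self, batch_size):
        # Uniformly sample a mini-batch of stored transitions for SGD.
        return random.sample(self.buffer, batch_size)
```

Uniform sampling from a sliding window of recent experience breaks the temporal correlation between consecutive transitions, which is the usual motivation for experience replay in DQN training.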
To verify that the D2D UEs' joint policy maximizes the network sum-throughput while maintaining reliable D2D links, we evaluate the performance of the proposed antenna beamwidth optimization algorithm and the related baseline models in terms of D2D link reliability and throughput. Using (8), the network's link reliability is defined as the ratio of D2D links that transmit their payload successfully within the limited time budget (specified by the QoS requirement).

Figure 4 compares the performance of the proposed algorithm, in terms of network link reliability, with the existing approaches. The results are averaged over 200 Monte-Carlo runs to suppress noise. It can be seen that increasing the payload size decreases the D2D links' reliability, as fewer D2D links can successfully transmit their payload. However, our proposed antenna beamwidth optimization algorithm maintains the D2D link reliability at an acceptable level: more than 90% of D2D links transmit their payload successfully when the payload size is less than 35 MB. The performance of random and constant beamwidth selection deteriorates significantly as the payload size increases, since these approaches do not adapt the antenna beamwidth to the context information. It should also be noted that IQL fails to guarantee D2D link reliability, which underlines the importance of accounting for the non-stationarity of a multi-agent environment and enabling D2D links to track their fellow agents' policies to reach the best joint beam management policy. Figure 5 compares the proposed training algorithm with IQL in terms of D2D link throughput and reliability; the results are shown for four D2D links during an episode of the distributed deployment.
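The reliability metric used in this comparison reduces to a one-line computation; the function name and the queue-based success test are our illustration of the definition above (a link is successful if its transmit queue is empty when the time budget expires).

```python
def link_reliability(remaining_payloads):
    """Fraction of D2D links that delivered their whole payload within
    the time budget, i.e., whose remaining payload is zero at the end."""
    successful = sum(1 for q in remaining_payloads if q == 0)
    return successful / len(remaining_payloads)
```

For example, if three of four links empty their queues in time, the network link reliability is 0.75.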
Figures 5a and 5b compare the D2D transmitters' queue status during 100 time slots of a deployment scenario. The proposed algorithm enables all D2D UEs to maintain reliable links by successfully transmitting their whole payload within the required time budget, whereas under IQL none of the D2D UEs could transmit its payload in time. Figures 5c and 5d show the changes in the D2D links' throughput while transmitting their payload during the same deployment scenario. These figures illustrate that, with the proposed algorithm, the D2D links learn to cooperate rather than acting selfishly as in IQL. As seen in Figure 5c, the D2D links take turns sending their payloads according to their observation tuples. In this example, D2D link 2 initially achieves a high throughput to transmit its payload, while the other D2D links keep their transmissions low to avoid causing interference and deteriorating D2D link 2's throughput. The throughput of D2D link 2 drops to zero upon finishing its payload transmission at time slot 45.

FIGURE 5. Reliability of the D2D links using the proposed multi-agent DRL-based algorithm compared with the single-agent algorithm, where the non-stationarity of the environment is neglected. The status of four D2D links is shown during 100 time slots within an implementation episode: (a) remaining D2D links' payload using the proposed algorithm, (b) remaining D2D links' payload using IQL, (c) D2D links' throughput using the proposed algorithm, and (d) D2D links' throughput using the single-agent algorithm.
As also shown in Figure 5a, D2D links 1, 4, and 3 subsequently take turns transmitting their payload in the same manner, while the others keep their activity at a minimum. Meanwhile, Figure 5d shows that using the IQL algorithm and ignoring the non-stationarity of the environment results in competition among the D2D links, each trying to increase its individual throughput in every time slot. The resulting interference among the D2D links leads to throughput reduction and failed payload transmissions. Compared to IQL, the proposed algorithm keeps the network sum-throughput high and provides each D2D link with a high throughput (about 10 Mbps), whereas the throughput of D2D links under IQL is relatively low (about 3 Mbps). These results confirm that IQL is not a suitable approach for non-stationary multi-agent environments, since under IQL the D2D UEs merely take actions based on their own observations while other users are treated as part of the environment.

Figure 6 compares the normalized reward received by the D2D links using the proposed algorithm and IQL; the graph shows the reward of a given D2D link over 3000 training episodes. Note that in IQL the agents are not trained with a shared reward function. The growing trend of the reward under our proposed algorithm indicates its efficiency in enabling the agents to cooperate and increase the shared reward, while under IQL the D2D link fails to refine its policy. Also, the relatively stable and tight convergence of the proposed algorithm's reward highlights its ability to find an effective joint policy, whereas IQL's large fluctuations show that the D2D link cannot converge to a good policy. Moreover, this graph is a reliable measure to verify the sufficiency of the number of training episodes: the fast and tight convergence of the proposed algorithm indicates that 3000 episodes are sufficient to train the D2D links.

VI. CONCLUSION AND FUTURE WORKS
In this paper, we proposed a novel multi-agent DRL-based algorithm to optimize D2D UEs' antenna beamwidth in a directional D2D network in the mmWave band. The proposed algorithm considers D2D UEs' mobility, payload size, QoS requirements, beam alignment cost and non-stationarity of the multi-agent environment. The proposed algorithm enables D2D links to learn an optimized antenna beamwidth selection policy to increase the network sum-throughput while maintaining the D2D link reliability. D2D links are trained offline using a shared reward function while the deployment of the proposed algorithm is distributed and does not require any online coordination. The training algorithm is based on the multi-agent DRL, and the non-stationarity of the environment is addressed by augmenting users' observation with a low dimensional fingerprint. Finally, the performance of the proposed antenna beamwidth optimization algorithm is evaluated through extensive simulations. Also, a performance comparison is performed with existing approaches, such as IQL and random beamwidth selection. Results show that our proposed algorithm improves network performance significantly and outperforms other approaches.
In the future, we plan to investigate adapting the proposed algorithm to indoor applications. In such environments, D2D users need to perform beam alignment more often due to shorter beam stability times. The proposed beamwidth selection algorithm can compensate for the shorter stability time by selecting the proper antenna beamwidth, rendering higher performance gains.