A Deep Learning-Based Transmission Scheme Using Reduced Feedback for D2D Networks

In this study, we investigate frequency division duplex (FDD)-based overlay device-to-device (D2D) communication networks. In overlay D2D networks, D2D communication uses a dedicated radio resource to eliminate cross-interference with cellular communication, and multiple D2D devices share this dedicated resource to mitigate the scarcity of radio spectrum, which causes co-channel interference, one of the challenging problems in D2D communication networks. Many radio resource management problems for D2D communication networks cannot be solved by conventional optimization methods because they are modelled as non-convex optimization problems. Recently, various studies have relied on deep reinforcement learning (DRL) as an alternative method to maximize the performance of D2D communication networks while overcoming co-channel interference. These studies showed that DRL-based radio resource management schemes can achieve almost optimal performance and even outperform state-of-the-art schemes based on non-convex optimization. Most DRL-based transmission schemes inevitably require feedback information from D2D receivers to build input states, especially in FDD networks where channel reciprocity between uplink and downlink does not hold. However, the effect of feedback overhead has not been well investigated in previous studies using DRL, and none have reported on reducing the feedback overhead of DRL-based transmission schemes for FDD-based D2D networks. In this study, we propose a DRL-based transmission scheme for FDD-based D2D networks in which input states are built from reduced feedback information, thereby lowering the feedback overhead. The proposed DRL-based transmission scheme using reduced feedback achieves the same average sum-rates as the scheme using full feedback, while reducing the feedback overhead significantly.

D2D networks cannot be solved by conventional optimization methods because most of them are non-convex [14], and many studies have thus relied on DRL to solve non-convex problems in D2D networks [15], [16], [17]. A previous study investigated a joint problem of resource block allocation and power control and proposed a DRL-based scheme to solve it [15]. A joint problem of channel

Even though novel DRL-based approaches can achieve almost optimal performance or outperform state-of-the-art conventional schemes without using non-convex optimization methods, as shown in [15], [16], and [17], they require agents to collect channel information to build input layers. Specifically, the input states in [15] consist of the instantaneous signal channel gain and interference channel gains between D2D transmitter and receiver, and interference channel gains between the cellular user and the D2D receiver. Only local channel state information and outdated non-local channel information are used for input states in [16]. The input layer of the DRL used in [17] is also formed by the channel gain matrix. In time division duplex (TDD)-based D2D networks, each D2D transmitter can easily obtain channel information thanks to the channel reciprocity of uplink and downlink, i.e., each D2D transmitter can estimate downlink channel information from uplink channel information obtained by measuring uplink sounding symbols transmitted by D2D receivers. In frequency division duplex (FDD)-based D2D networks, each D2D receiver is required to send downlink channel information to D2D transmitters to enable them to build input states because the channel reciprocity of uplink and downlink does not hold, which inevitably results in tremendous feedback overhead.

Despite various studies on D2D networks, to the best of our knowledge, there has been no investigation of feedback reduction for D2D networks, because meaningful research on reducing feedback overhead has focused on multi-antenna networks [18], [19]. A method that significantly reduces the required feedback load by utilizing a small number of receive antennas at each mobile was proposed in [18]. On the other hand, a transmit antenna selection scheme for downlink transmission in massive antenna systems was proposed to reduce the feedback required by cellular base stations [19]. Despite their superiority, these schemes cannot be applied to D2D networks because of the structural differences between D2D and cellular networks. Thus, we investigate feedback reduction schemes for FDD-based D2D communication networks. Each agent builds its input states by exploiting only local channel information instead of global channel information to reduce feedback overhead. Furthermore, we propose two feedback schemes, namely the partial feedback scheme and the binary feedback scheme, to reduce the feedback overhead further. In the partial feedback scheme, each D2D receiver feeds back its signal channel gain and the interference channel gains that are greater than its signal channel gain. In the binary feedback scheme, each D2D receiver feeds back indicators of the interference channel gains that are greater than its signal channel gain, instead of the real values of the channel gains. Our numerical results show that the partial and binary feedback schemes can both achieve approximately optimal sum-rates in power- or interference-limited environments.

The main contributions of this paper are summarized as follows. The problem of feedback overhead for D2D networks is formulated, and two feedback reduction schemes are proposed based on this formulation. In addition, the proposed feedback reduction schemes are incorporated into a new DRL-based transmission scheme to prevent performance degradation caused by the reduced feedback. The proposed feedback reduction schemes can significantly reduce the feedback overhead while achieving the same sum-rates as the full feedback scheme. They also enable each D2D to

where $N_0$ denotes the thermal noise power and $\gamma = P/N_0$. Hereafter, $\gamma$ is referred to as the signal-to-noise ratio (SNR) for notational simplicity. Our goal is to enable each D2D transmitter $i$ to autonomously determine its action $a_i^t$ for maximizing the sum-rate $\sum_{i=1}^{K} c_i^t$ while reducing the feedback overhead.
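Equation (1) defining $c_i^t$ is not reproduced in this excerpt. Assuming the standard Shannon rate under co-channel interference (an illustrative assumption, not necessarily the paper's exact (1)), the sum-rate objective can be sketched as follows; the function name and interference model are our own:

```python
import numpy as np

def sum_rate(H, a, gamma):
    """Sum-rate of K D2D links.

    H[i, j] = |h_ij|^2, the gain from transmitter j to receiver i;
    a[i] in {0, 1} is the binary transmit action; gamma = P / N_0 is the SNR.
    Assumes c_i^t is the Shannon rate (the paper's (1) is not shown here).
    """
    K = H.shape[0]
    rates = np.zeros(K)
    for i in range(K):
        if a[i] == 0:
            continue  # silent transmitters contribute no rate
        signal = gamma * H[i, i]
        interference = gamma * sum(H[i, j] * a[j] for j in range(K) if j != i)
        rates[i] = np.log2(1.0 + signal / (1.0 + interference))
    return rates.sum()
```

With `H = np.eye(2)` (no cross-interference) and both links active at $\gamma = 1$, each link attains $\log_2 2 = 1$ bit/s/Hz.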

A. DRL-BASED TRANSMISSION SCHEME

We investigate a DRL-based transmission scheme in which each D2D transmitter autonomously determines whether to transmit data based on the feedback from its associated D2D receiver. Fig. 2 illustrates the architecture of the DRL-based transmission scheme. $s_i^t$ denotes the input state of D2D transmitter $i$ at time $t$, which consists of the signal channel gain and interference channel gains that receiver $i$ feeds back. $a_i^t$ is the output, the binary action of D2D transmitter $i$ at time $t$ denoting whether to transmit data. In many environments, input states and actions are bilaterally correlated: actions are chosen based on input states, and the chosen actions affect the next input states. In our case, however, actions and input states are only unilaterally correlated, because the chosen actions do not change the next input state, which consists of channel gains, even though actions are chosen based on input states. We therefore use a dueling deep Q-network (DQN) as the learning model, because it is particularly useful when actions and input states are unilaterally correlated [20], [21]. As shown in Fig. 2, the dueling DQN comprises three main layers: the feature layer, the state-value layer, and the advantage layer, denoted by $\theta$, $\alpha$, and $\beta$, respectively. Each main layer has three fully connected sub-layers, and each fully connected sub-layer is followed by a rectified linear unit (ReLU) activation. The top stream, which consists of the feature layer and the state-value layer, yields a scalar output $V(s_i^t; \theta, \alpha)$ that estimates the value of the input state. The bottom stream, which consists of the feature layer and the advantage layer, yields an array output $A(s_i^t, a; \theta, \beta)$ that estimates the advantage of each action $a$.
For this study, $a \in \{0, 1\}$ and the output size of the bottom stream is 2. The top and bottom streams are aggregated to yield the final state-action values
$$Q(s_i^t, a; \theta, \alpha, \beta) = V(s_i^t; \theta, \alpha) + A(s_i^t, a; \theta, \beta) - \mathbb{E}_a\!\left[A(s_i^t, a; \theta, \beta)\right],$$
where $\mathbb{E}_a[A(s_i^t, a; \theta, \beta)]$ is the average value of $A(s_i^t, a; \theta, \beta)$ over $a$ and is included to resolve the lack of identifiability of the Q value [20]. Without $\mathbb{E}_a[A(s_i^t, a; \theta, \beta)]$, we cannot recover the $V$ and $A$ values uniquely for a given $Q$ value, which leads to poor performance in practical cases. The D2D transmitter $i$ then chooses the action with the greatest state-action value at time $t$,
$$a_i^t = \arg\max_{a} Q(s_i^t, a; \theta, \alpha, \beta).$$
If $a_i^t = 1$, transmitter $i$ transmits data to receiver $i$, which broadcasts the data-rate $c_i^t$ defined in (1). Otherwise, between the action-state value for the selected action $a_i^t$,
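As a concrete illustration, the dueling architecture and the aggregation above can be sketched as a plain NumPy forward pass. The layer widths and weight initialization below are illustrative assumptions (the paper does not list them), and NumPy stands in for a full deep learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(sizes, rng):
    # Three fully connected sub-layers per main layer, as in Fig. 2.
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    # Each fully connected sub-layer is followed by a ReLU activation.
    for W, b in layers:
        x = relu(x @ W + b)
    return x

# Hypothetical widths; the paper does not specify them.
state_dim, hidden = 8, 16
theta = init_mlp([state_dim, hidden, hidden, hidden], rng)  # feature layer
alpha = init_mlp([hidden, hidden, hidden, 1], rng)          # state-value layer
beta = init_mlp([hidden, hidden, hidden, 2], rng)           # advantage layer, |{0,1}| = 2

def q_values(s):
    f = forward(theta, s)
    V = forward(alpha, f)        # scalar V(s; theta, alpha)
    A = forward(beta, f)         # A(s, a; theta, beta) for a in {0, 1}
    return V + A - A.mean()      # identifiable aggregation, as in the equation above

s = rng.standard_normal(state_dim)
Q = q_values(s)
action = int(np.argmax(Q))       # a_i^t = argmax_a Q(s_i^t, a)
```

Note that the mean-subtraction makes the decomposition identifiable: the average of `Q` over actions always equals `V`, so `V` and `A` can be recovered uniquely from `Q`.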

where $\eta$ denotes the discount factor for future rewards. In contrast to typical environments, the environment in this study is unilateral and the actions do not affect the subsequent input states; $\eta$ is thus set to 0, as in [22] and [21]. The optimizer gradually trains $\alpha$, $\beta$, and $\theta$ by gradient descent to minimize the mean squared error, where $\nu$ denotes the learning rate.
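Because $\eta = 0$, the temporal-difference target collapses to the immediate reward, so each update is plain regression of the chosen Q value onto the observed data rate. A minimal sketch (function names and values are illustrative):

```python
import numpy as np

def td_target(reward, next_q, eta):
    # Standard TD target; with eta = 0 the future term vanishes
    # and the target is the immediate reward itself.
    return reward + eta * np.max(next_q)

def mse_loss(q_pred, target):
    return (q_pred - target) ** 2

# Unilateral environment: eta = 0, so next_q is ignored entirely.
target = td_target(reward=1.5, next_q=np.array([0.3, 0.9]), eta=0.0)
loss = mse_loss(q_pred=1.2, target=target)  # regression toward the reward
```

This is why the scheme tolerates outdated feedback: no bootstrapped future value enters the loss, only the current reward.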
where $j$ denotes the identifier of D2D transmitter $j$. To reduce the amount of feedback information given in (7), we propose two feedback schemes: the partial feedback scheme and the binary feedback scheme. In the partial feedback scheme, each D2D receiver $i$ feeds back the channel gains that are equal to or greater than its signal channel gain $|h_{ii}|^2$. Thus, the feedback information of receiver $i$ can be described by
$$F_{\text{partial},i}^t = \left\{ |h_{ii}^t|^2 \right\} \cup \left\{ |h_{ij}^t|^2 : |h_{ij}^t|^2 \ge |h_{ii}^t|^2,\; j \ne i \right\}.$$
In the binary feedback scheme, each D2D receiver $i$ only feeds back the identifiers of the D2D transmitters whose channel gains are equal to or greater than the signal channel gain $|h_{ii}|^2$, as follows:
$$F_{\text{binary},i}^t = \left\{ j : |h_{ij}^t|^2 \ge |h_{ii}^t|^2,\; j \ne i \right\}.$$
The feedback information in the binary feedback scheme consists of the same number of elements as in the partial feedback scheme; however, the binary feedback scheme can further reduce the feedback overhead because only the identifiers of the channel gains are fed back, without the real-valued channel gains. After transmitting data, D2D transmitter $i$ receives the feedback information $F_i^t$ transmitted by receiver $i$ and builds the next input state $s_i^{t+1}$. If $F_i^t = F_{\text{partial},i}^t$, the D2D receiver $i$ feeds back only the $|h_{ij}^t|^2$ that are greater than or equal to $|h_{ii}^t|^2$, and the input state is described as in (14). The overhead ratios of the partial and binary feedback schemes compared to the full feedback scheme can be calculated accordingly. By taking the derivative of (17) over $x$, we can obtain (18). Then, (14) can be simplified by using (18) and (21).
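The two feedback sets can be constructed directly from a receiver's measured gains. A minimal sketch in plain Python, with `h_row[j]` standing for $|h_{ij}|^2$ and the entry at index `i` being the signal gain (function names are our own):

```python
def partial_feedback(h_row, i):
    """F_partial,i: the signal gain plus every interference gain
    that is equal to or greater than |h_ii|^2."""
    sig = h_row[i]
    return {i: sig,
            **{j: g for j, g in enumerate(h_row) if j != i and g >= sig}}

def binary_feedback(h_row, i):
    """F_binary,i: only the identifiers of the strong interferers,
    with no real-valued gains."""
    sig = h_row[i]
    return [j for j, g in enumerate(h_row) if j != i and g >= sig]

gains = [0.2, 0.5, 0.1, 0.9]          # |h_0j|^2 measured at receiver 0
print(partial_feedback(gains, 0))     # {0: 0.2, 1: 0.5, 3: 0.9}
print(binary_feedback(gains, 0))      # [1, 3]
```

Both sets flag the same strong interferers; the binary set simply replaces each 32-bit real value with a ⌈log2 K⌉-bit identifier, which is the source of its extra savings.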

If D2D nodes are uniformly distributed, they have different path losses but the same distribution [25]. For two channel gains with the same distribution, $p = \tfrac{1}{2}$ by Lemma 1.
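This value of $p$ is easy to verify numerically: for any two i.i.d. continuous gains, each is equally likely to be the larger. A quick Monte Carlo check under i.i.d. Rayleigh fading (so $|h|^2$ is exponentially distributed; the fading model is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# Rayleigh fading: the channel power gain |h|^2 is exponentially distributed.
sig = rng.exponential(scale=1.0, size=n)    # |h_ii|^2 samples
intf = rng.exponential(scale=1.0, size=n)   # |h_ij|^2 samples
p = np.mean(intf >= sig)                    # estimate of P(|h_ij|^2 >= |h_ii|^2)
```

The estimate lands within a fraction of a percent of 1/2, consistent with Lemma 1.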

We analyze the performance of the DRL-based distributed transmission schemes with the simulation parameters listed in Table 1. The Opportunistic scheme selects the D2D transmitter with the greatest signal channel gain to transmit data. The No Control scheme is optimal in power-limited environments, while the Opportunistic scheme is optimal in interference-limited environments [27]. As shown in Fig. 3-(a), with K = 10, the No Control scheme outperforms the Opportunistic scheme for γ ≤ −3 dB because the interference level is low. The DRL-based transmission schemes achieve the same average sum-rates as the No Control scheme for γ ≤ −14 dB and outperform it for −14 dB ≤ γ ≤ −3 dB, regardless of the feedback scheme. For γ ≥ −3 dB, the Opportunistic scheme outperforms the No Control scheme because the co-channel interference among D2D transmitters becomes significant. For −3 dB ≤ γ ≤ 5 dB, the DRL-based transmission schemes outperform the Opportunistic scheme. For γ ≥ 5 dB, the Opportunistic scheme outperforms the DRL-based transmission schemes; however, considering that the Opportunistic scheme requires an extra centralized node to gather all signal channel gains and to select one D2D transmitter to transmit data, the difference in average sum-rates is marginal. In addition, the partial and binary feedback schemes achieve the same average sum-rates as the full feedback scheme for all γ values, despite the reduced feedback overhead. Figs. 3-(a) and (b) show that as K increases from 10 to 50, the range of γ where the No Control scheme outperforms the Opportunistic scheme shrinks, and the gap between the two schemes' average sum-rates grows, because the co-channel interference among D2D transmitters becomes more significant as K increases. In addition, the DRL-based transmission schemes are superior to the No Control and Opportunistic schemes over a wider range of γ values.
Specifically, the DRL-based transmission schemes achieve average sum-rates equal to or greater than those of the No Control and Opportunistic schemes for γ ≤ 5 dB when K = 10, and outperform both schemes for γ ≤ 4 dB when K = 50. In addition, the partial and binary feedback schemes achieve the same average sum-rates as the full feedback scheme even as K increases. Each real-valued channel gain is fed back in the floating-point format in IEEE 754 [28]. As K increases, ρ_partial decreases and converges to 0.5, as in Theorem 1. ρ_binary increases slightly but remains lower than ρ_partial. More specifically, ρ_partial ≈ 0.5 and ρ_binary ≈ 0.
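The convergence of ρ_partial toward 0.5 can be reproduced with a small simulation. The bit-accounting below rests on illustrative assumptions (32-bit IEEE 754 values for real-valued gains, ⌈log2 K⌉-bit identifiers, i.i.d. Rayleigh fading); the paper's exact expressions (14)-(21) are not reproduced in this excerpt:

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def overhead_ratios(K, trials=2000, bits=32):
    """Monte Carlo estimate of the partial/binary feedback overhead
    relative to full feedback of all K gains as 32-bit values."""
    id_bits = math.ceil(math.log2(K))
    rp, rb = [], []
    for _ in range(trials):
        g = rng.exponential(size=K)            # g[0] plays the role of |h_ii|^2
        strong = int(np.sum(g[1:] >= g[0]))    # interferers that get fed back
        full = K * bits
        rp.append((1 + strong) * bits / full)  # signal gain + strong gains
        rb.append(strong * id_bits / full)     # identifiers only
    return float(np.mean(rp)), float(np.mean(rb))

rho_partial, rho_binary = overhead_ratios(K=50)
```

For K = 50 this yields ρ_partial near 0.5 and ρ_binary well below it, matching the trend reported above: on average about half of the interference gains exceed the signal gain, and replacing each 32-bit value with a 6-bit identifier shrinks the binary overhead further.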