Optimal Power Allocation With Multiple Joint Associations in Multi-User MIMO Full-Duplex Systems

Optimum power allocation is an effective way to mitigate residual self-interference and inter-user interference in multiple-input multiple-output full-duplex (FD) systems. However, current research considers only some of the influencing factors and assumes fixed service models. Given this, we jointly address three aspects in a novel power allocation method: muting management (MM) and the assignment of both base station antennas and subcarriers in the FD system. We then formulate an optimization problem to maximize the total spectrum efficiency. According to the categories of variables in the nonconvex objective function, we first propose a hierarchical algorithm, which is decomposed into the first-order Taylor approximation (FOTA) method and the greedy algorithm. The subproblems in the continuous and discrete variables are solved through FOTA and the greedy algorithm, respectively, where the greedy algorithm replaces the traditional exhaustive search. Considering the high complexity of the greedy algorithm, we further introduce deep reinforcement learning (DRL) to solve the discrete subproblem instead. To this end, two Double Deep Q-learning Networks are constructed to train on the samples in each sub-slot. Simulation results validate that the hybrid DRL-convex method outperforms the hybrid greedy-convex method. Meanwhile, the scheme with MM achieves a more evident performance gain than the scheme without MM in many scenarios.

because of inaccurate SI channel estimation and hardware impairment [5]. Moreover, residual SI is further aggravated by the widely used multiple-input multiple-output (MIMO) technology, which makes residual SI compound and challenging to eliminate [6]. When multiple users are located in the cellular network, interference from users to the base station (BS) and from user to user also exists. All the aforementioned interference together deteriorates the FD performance.
As is known, raising the transmit power can increase capacity, but it can also decrease capacity by intensifying interference. The net gain depends on the relative weights of the desired signal and the interference, which makes this a typical allocation problem. Therefore, appropriate power allocation can effectively address the tradeoff between performance gain and loss, and is mainly fulfilled through an objective function (e.g., maximizing SE or energy efficiency (EE)). In view of this, many scholars focus on power allocation methods to alleviate the FD performance reduction caused by multiform interference [7]. Moreover, the diversity gain can be improved by using smart antennas rather than traditional fixed antennas [8], which further enhances the FD technique. Since spectrum resource scarcity is an international problem [1], implementing power allocation along with the FD technique within limited spectrum resources is particularly meaningful.

A. MOTIVATION
Despite the aforementioned benefits of power allocation in FD systems, current research seldom considers both smart antennas and spectrum resource scarcity in a single power allocation method [9], [10], [11], [12], [13], [14], [15]. Meanwhile, other scholars optimize the objective function under the precondition that users' service models are fixed, which may not yield an overall optimum [16]. Inspired by this, we propose a power allocation method with multiple joint associations (i.e., smart antennas, scheduled users, and subcarriers) to improve FD performance from each aspect. Specifically, the smart antennas raise the diversity gain. The muting management (MM) for scheduled users restrains the interference caused by the subset of users that generate more disturbance than others. The rational subcarrier assignment aims to reduce competition for spectrum resources. All three elements are bound up with SE performance. In our work, MM is realized through a newly designed frame structure, which considers both muting and compensation for a small group of users. This aspect differs from previous work. Additionally, we integrate the assignment of antennas and subcarriers into power allocation and solve the joint optimization problem through a hierarchical algorithm, which is another scheme not found in existing works.

B. MAIN CONTRIBUTIONS
In this paper, we review the current investigations about power allocation in FD systems and, accordingly, propose a novel power allocation method. The main contributions of our work are summarized as follows:
1. Considering the different types of interference, we divide the scheduled users by service type. To enable services at the user level, we define the user identifier and devise MM by adding a trigger region and a muting indication in sub-slot 1.
2. To integrate the affecting factors, namely MM and the assignment of antennas and subcarriers, we formulate an objective function for the power allocation method that considers all three elements to optimize the overall SE.
3. The proposed optimization problem is decomposed into two subproblems in terms of continuous and discrete variables. With the first-order Taylor approximation (FOTA) method, the continuous part is converted into a convex problem. We then employ the greedy algorithm, in place of the traditional exhaustive search, to tackle the discrete part as a benchmark.
4. Because deep reinforcement learning (DRL) is more appropriate for solving nonconvex problems with discrete variables, we design another hybrid method based on two Double Deep Q-learning Networks (DDQNs) instead of the greedy algorithm. Simulations demonstrate that the hybrid DRL-convex method outperforms the hybrid greedy-convex method. Also, our proposal with MM achieves a performance enhancement in comparison to that without MM.
The remainder of this work is organized as follows. The related works are presented in Section II. Our system model, followed by MM and the uplink (UL)/downlink (DL) interference model in each sub-slot, is described in Section III. According to the system model, we formulate the optimization problem for maximizing SE in Section IV. Section V presents the two proposed hierarchical algorithms for the nonconvex problem, together with a detailed complexity analysis. Section VI presents numerical results that validate our proposal. Section VII concludes our work.

II. RELATED WORKS
Based on existing SIC technology, optimizing a formulated objective function within a power allocation method has recently become the mainstream approach to improving the performance of FD systems.
Some scholars design optimized power allocation methods in MIMO FD relaying. In [17], to satisfy the requirements of each user's signal to interference plus noise ratio (SINR), along with saving the allocated power at the FD relay, the authors formulate an EE-optimization problem in a block fading channel and work the issue out through the geometric programming method. Based on [17], the authors in [18] mainly focus on the relationship between antenna number and SE/EE. The optimal number of antennas to maximize the objective function is derived. They demonstrate that a performance bottleneck confines the FD antenna scale due to the distortion noise.
Moreover, other scholars apply the power control method in FD cognitive radio networks to further promote the degrees of freedom in the MIMO system. The authors in [19] put forward four cognitive radio modes, where the secondary users adopt different power strategies. Additionally, the authors make a performance contrast between FD and HD in four schemes, respectively.
The authors in [17], [18], and [19] fix the operational modes of the FD antennas, which has prompted other scholars to study adaptive FD antennas in power allocation. Unlike [18], the authors in [9] give an optimum ratio between emitting and receiving antennas, instead of equal numbers, to maximize the sum rate in the power allocation method. Some scholars study power allocation with flexible antennas under secure transmission in FD systems. In [10], an antenna selection coefficient is put forward to regulate the number of emitting/receiving antennas in the FD system. Then a power allocation method based on quantum calculation is applied to maximize the security capability and EE.
The authors in [9] and [10] regard the emitting/receiving antennas as a group, while others in [11] and [12] treat each FD antenna as an individual. In [11], the authors propose a scheme that can dynamically select each emitting/receiving antenna according to various channel conditions, thus raising the FD diversity gain for SE enhancement in power allocation. To further research the diversity gain of FD, the authors in [12] introduce a binary matrix to define the operating modes of each antenna. Using the assignment matrix, they construct a two-stage SE objective function, which is solved through successive convex approximation.
None of the above studies considers FD networks with scarce spectrum resources. Some scholars take bandwidth or subcarriers as a power allocation factor. The authors in [13] adopt a three-stage Stackelberg game in power allocation, which takes bandwidth and EE as pricing and utility, respectively, and attempt to acquire the optimum utility value through the game. In [14], the authors introduce auxiliary variables and penalty factors to handle the discrete subcarrier assignment variables. After this problem reconstruction, the EE-optimal solution in power allocation is acquired through the Lagrange method. In [15], the authors propose a power allocation method based on successive convex approximation in an FD distributed antenna system. The system includes several user-centric virtual cells that share limited subcarriers.
The sequential decision problem is known to be solvable by reinforcement learning (RL) [20]. Since RL can find an appropriate compromise between performance and complexity in the case of massive samples [21], it has attracted tremendous attention from academia. Therefore, many scholars have attempted to solve the power allocation problem with RL [22], [23], [24], which has also been used in FD systems recently [25], [26], [27], [28], [29], [30]. To name a few, in [25], based on the underlay mode referring to [19], the authors employ DRL for power control, which increases the secondary user's SINR by improving its perception accuracy. The number of times the capacity requirement is satisfied at both primary and secondary users is defined as the reward, which can be maximized through a training process. The authors in [26] focus on a pair of terminals with FD capability. By setting applicable states, actions, and rewards, they propose a hybrid RL scheme to maximize the sum of SE and energy transmission efficiency. Meanwhile, the influence of different antenna numbers and power budgets on performance is also studied. In [27], the authors adopt RL in an unmanned aerial vehicle FD relay scenario to maximize secrecy capacity. At the same time, different RL techniques are compared in terms of secrecy rate and convergence.
FIGURE 1. Example of a multi-user cellular network with four scheduled users (i.e., two UL and two DL users) and one unscheduled user (i.e., non-service user) at the moment.
In view of the above research, the researchers in [9], [10], [11], and [12] aim to improve the spatial diversity gain in power allocation but do not consider the case in which both emitting and receiving antennas are shared. Subcarrier assignment is also not included in their objective functions. Although the authors in [13], [14], and [15] consider subcarrier assignment, their FD antennas are fixed. To the best of our knowledge, the joint optimization of FD antenna and subcarrier assignment in power allocation has not been investigated as a whole. Also, the investigations in [9], [10], [11], [14], and [15] assume fixed user service models, ignoring the option of muting. In response to this, we put forward our proposal. Meanwhile, considering the advantage of RL, we adopt DRL to enhance the algorithm in the FD system.

III. SYSTEM MODEL
Fig. 1 depicts a BS working in FD mode, equipped with $N$ smart antennas in the cell. Let $\mathcal{N} = \{1, 2, \ldots, N\}$ denote the set of BS antennas. Each antenna is connected to an analog circulator device that isolates the radio UL and DL. As a result, the operational mode of single transmitting, single receiving, or co-transmitting co-receiving in the same band can be selected [31], [32].
We suppose that $Z$ users are uniformly distributed in the cellular network. The set of users, represented by $\mathcal{Z} = \{1, 2, \ldots, Z\}$, is classified into the subsets of UL users, DL users, and non-service users at the moment. Each user is equipped with one antenna and, because it works in HD mode, transmits and receives data at different times. The network spectrum resources are divided into $M$ mutually orthogonal subcarriers, the set of which is denoted as $\mathcal{M} = \{1, 2, \ldots, M\}$. We assume that scheduled users on different subcarriers do not interfere with each other. As the number of scheduled users (i.e., service users) is larger than the number of subcarriers, scheduled users reuse part of the subcarriers, which incurs interference. In view of this, we describe MM in the following subsection.

A. MUTING MANAGEMENT
The authors in [16] propose the concept of Interference Aware Muting, which forces a mobile terminal to turn off when it causes severe interference to the BS. We call such users jamming users (JUs). As shown in Fig. 1, if one UL and one DL user are deemed JUs, the UL JU aggravates the UL to DL and UL to UL interference, while the DL JU exacerbates the SI and the DL to DL interference. Muting the JUs is therefore a tractable and explicit strategy. However, the muting process interrupts the service of the JU, which suffers a performance loss. In order to minimize this side effect, the muted JUs return to regular service at the next time slot. That is, the muting orders become invalid at the subsequent slot until new orders arrive. Under these operations, although performance is partially reduced by the outages, we still aim to reach a state in which the advantage outweighs the drawback compared with the unmuted case. Motivated by this, we introduce MM into the system model.
Because the service type of users depends on the service scheduled by the BS, we assume for simplicity that each time slot contains a control region and a data region [33]. The schedule information, which determines the service type, is monitored by a user in the control region. When a user detects downlink control information that is relevant to the UL or DL schedule information, the data is transferred during the related data region. The affiliated data region is subject to the specific frame pattern that the BS has configured [34]. To better evaluate the proposed MM, we combine two consecutive time slots (called sub-slots 1 and 2) into one schedule unit, in which the schedule information for the two time slots stays the same to ensure users' service continuity for a while. A trigger region and a muting indication are added to sub-slot 1 on top of the primitive frame structure, as shown in Fig. 2, whereas sub-slot 2 keeps the default structure. This configuration is intended to keep the change as small as possible in order to ensure compatibility. The BS makes the service muting decisions through power policy adjustment in the trigger region, and the muting indication carries the muting order that is delivered to the related scheduled users. During sub-slot 2, the muting order does not apply, so that the silenced users can restore their service. Meanwhile, appropriate compensation should be considered at sub-slot 2 in terms of fairness. The offset process is unrelated to the muting order as long as sub-slot 2 has acquired the information about the muted users. The operation in sub-slot 2 is thus aligned with the design framework.
FIGURE 2. Frame structure of sub-slots 1 and 2 for each user. The dark-blue and red-brown frames stand for DL and UL subframes, respectively. The ellipsis indicates the specific frame pattern, which is not our focal point in this study.
To manage muting at the user level, we define service identifiers for each user in the cell.
For instance, the set in (1) collects the service identifiers of the $Z$ users at the $\kappa$th schedule unit, where $\tau_\kappa \in \{t_\kappa, t_\kappa + t\}$. Here $t$ is the length of a sub-slot, and $\tau_\kappa = t_\kappa$ or $t_\kappa + t$ means sub-slot 1 or 2 in the $\kappa$th unit. $\alpha_z^u(\tau_\kappa)$ and $\alpha_z^d(\tau_\kappa)$ signify the UL and DL service identifiers for user $z$ at the related sub-slot, respectively. For ease of writing, sub-slot 1 or 2 at the $\kappa$th unit is recorded as $t_{\kappa,1}$ or $t_{\kappa,2}$. In this paper, we mainly analyze one unit, so we abbreviate $t_{\kappa,1}$ and $t_{\kappa,2}$ to $t_1$ and $t_2$, respectively.
In conclusion, the identifiers for user $z$ at sub-slots 1 and 2 can be expressed as in (2), in which $\chi \in \{u, d\}$ expresses the service type of UL or DL, and Mu and Sch are short for Muting and Schedule, respectively. Since users work in HD mode, the muting indicator does not depend on the specific service type. $\mathrm{Mu}_z(t_1) = 1$ or $0$ denotes that the muting order has or has not been delivered to user $z$ at sub-slot 1. Similarly, $\mathrm{Sch}_z^\chi(t_{1\,\mathrm{or}\,2}) = 1$ or $0$ indicates that user $z$ is or is not scheduled for service type $\chi$ over the whole unit. Notably, users cannot be scheduled for two types of service simultaneously because they operate in HD mode.
To sum up, $\alpha_z^\chi(t_1) = 1$ or $2$ means that the scheduled user has or has not been silenced at sub-slot 1, respectively. $\alpha_z^\chi(t_2) = 2$ guarantees the continuity of the same schedule information in a unit. In addition, $\alpha_z^\chi(t_{1\,\mathrm{or}\,2}) = 0$ indicates that the user is not scheduled in the unit.
Following the above discussion, the scheduled users can be mathematically categorized into two types. One is the set of UL users, and the other is the set of DL users, denoted by $\mathcal{J}$ and $\mathcal{K}$ in (3), respectively.
Let $\mathcal{G}$ denote the set of scheduled users, which satisfies $\mathcal{G} = \mathcal{J} \cup \mathcal{K} = \{1, 2, \ldots, G\}$. Note that we only consider users in $\mathcal{G}$ in the interference model below.
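The identifier values described above (0 for not scheduled, 1 for scheduled but muted at sub-slot 1, 2 for scheduled and active) can be illustrated with a minimal sketch. The function names `service_identifier` and `split_scheduled_users` and the dictionary-based schedule are hypothetical and chosen only for illustration; the sketch encodes the stated values rather than the paper's own closed-form expression.

```python
from typing import Dict, Set, Tuple


def service_identifier(scheduled: bool, muted: bool, sub_slot: int) -> int:
    """Return the identifier value for one user and one service type.

    Assumed encoding, taken from the values stated in the text:
    0 = not scheduled, 1 = scheduled but muted (sub-slot 1 only),
    2 = scheduled and active.
    """
    if sub_slot not in (1, 2):
        raise ValueError("sub_slot must be 1 or 2")
    if not scheduled:
        return 0
    # The muting order is only valid during sub-slot 1.
    if sub_slot == 1 and muted:
        return 1
    return 2


def split_scheduled_users(schedule: Dict[int, str]) -> Tuple[Set[int], Set[int]]:
    """Split scheduled users into UL and DL sets.

    `schedule` maps a user index to 'u' or 'd'. Using one entry per user
    mirrors the half-duplex rule that a user has a single service type.
    """
    for user, chi in schedule.items():
        if chi not in ("u", "d"):
            raise ValueError(f"user {user}: service type must be 'u' or 'd'")
    ul_users = {z for z, chi in schedule.items() if chi == "u"}
    dl_users = {z for z, chi in schedule.items() if chi == "d"}
    return ul_users, dl_users


# Example: five users, of which users 1 and 4 are UL, user 2 is DL.
ul, dl = split_scheduled_users({1: "u", 2: "d", 4: "u"})
scheduled_users = ul | dl  # plays the role of the set of scheduled users
```

In this sketch a muted user automatically returns to the value 2 at sub-slot 2, which matches the requirement that the schedule information stays the same within a unit.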

B. INTERFERENCE MODEL
In this paper, we apply a composite fading channel and can acquire complete channel state information (CSI) [35], [36]. Considering the channel's frequency characteristic, we suppose that the CSI between two nodes in one schedule unit will remain unchanged [10]. Accordingly, the difference in transmission between two sub-slots pivots on MM.

1) TRANSMISSION AT SUB-SLOT 2
We first construct the mathematical model at sub-slot 2 for ease of analysis, because no MM is applied there.
For modeling purposes, we initially assume that all users in $\mathcal{G}$ share the same subcarrier, and that the BS fixes $N_t$ emitting and $N_r$ receiving antennas, satisfying $N_t = N_r = N$.
The signal received at the BS from user $j$ at sub-slot 2 is given by (4), where $\mathbf{h}_j^u(t_2)$ represents the channel vector from user $j$ to the BS. $\mathbf{d}_j \in \mathbb{R}^{1 \times N}$ denotes the distance vector from user $j$ to each BS antenna. $\mathbf{a}_j(t_2) \in \mathbb{R}^{1 \times N}$ and $\mathbf{w}_j(t_2) \in \mathbb{C}^{1 \times N}$ indicate the lognormal shadow fading and small-scale fading vectors, respectively. Both $\mathbf{a}_j$ and $\mathbf{w}_j$ are independent and identically distributed as $\mathbf{a}_j, \mathbf{w}_j \sim \mathcal{CN}(0, \mathbf{1}_{1 \times N})$ [15], where $\mathbf{1}_{1 \times N}$ stands for the $1 \times N$ vector whose elements are all 1.
The first term of (4) is the desired signal. The second term is the interference caused by UL users other than user $j$ (namely, the UL to UL interference), and the third term is the residual SI, which has been mitigated by DL precoding (i.e., DL power allocation). $p_j(t_2)$ in the first term represents the transmitted power of user $j$, which satisfies (5), where $\mathcal{P}(t_2)$ is the set of transmitted powers of all UL users.
Meanwhile, $x_j^u(t_2)$ stands for the transmission symbol from user $j$, which follows (6). The residual SI matrix in the third term follows (7), where $a$ is the Rician factor and $\sigma_{SI}^2$ is the SI power ratio of pre-SIC to post-SIC [35]. Additionally, $x_k^d(t_2)$ denotes the symbol received by user $k$ from the BS, which satisfies an analogous condition, and $\mathcal{W}(t_2)$ denotes the set of DL precoding vectors of all DL users.
The last term $\mathbf{n}^u$ is the additive white Gaussian noise (AWGN) vector related to user $j$ at the BS.
The signal received at user $k$ from the BS at sub-slot 2 is similarly expressed as (9), where $\mathbf{h}_k^d(t_2) \in \mathbb{C}^{N \times 1}$ denotes the channel vector from the BS to user $k$.
Similar to (4), the first term in (9) represents the expected signal. The second term is the interference caused by receiving other DL users' signals, which is the aforementioned DL to DL interference. The third term means that user $k$ is interfered with by UL users, namely, the UL to DL interference. In the third term, $g_{k,j}(t_2)$ represents the channel gain from UL user $j$ to DL user $k$. The final term $n_k^d(t_2) \sim \mathcal{CN}(0, \sigma_{d,k}^2)$ stands for the AWGN at user $k$.
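To make the four DL terms concrete, the following sketch computes the power of the desired signal, the DL to DL interference, and the UL to DL interference for one DL user, and combines them into an SINR. It is only an illustration under stated assumptions: unit-power symbols, a plain (unconjugated) product sum between the channel and precoding vectors, and $g_{k,j}$ treated as a complex amplitude coefficient. The names `inner` and `dl_sinr` are hypothetical.

```python
from typing import Dict, List

Vector = List[complex]


def inner(h: Vector, w: Vector) -> complex:
    """Product sum of channel and precoder entries (conjugation convention assumed)."""
    if len(h) != len(w):
        raise ValueError("channel and precoder must have the same length")
    return sum(a * b for a, b in zip(h, w))


def dl_sinr(
    k: int,
    h_k: Vector,
    precoders: Dict[int, Vector],
    ul_power: Dict[int, float],
    g_k: Dict[int, complex],
    noise_var: float,
) -> float:
    """SINR of DL user k, assuming unit-power transmit symbols.

    precoders: DL user index -> precoding vector
    ul_power:  UL user index -> transmit power
    g_k:       UL user index -> channel coefficient towards DL user k
    """
    desired = abs(inner(h_k, precoders[k])) ** 2
    # DL to DL interference: signals intended for the other DL users.
    dl_to_dl = sum(abs(inner(h_k, w)) ** 2 for i, w in precoders.items() if i != k)
    # UL to DL interference: co-channel UL users heard by DL user k.
    ul_to_dl = sum(p * abs(g_k[j]) ** 2 for j, p in ul_power.items())
    return desired / (dl_to_dl + ul_to_dl + noise_var)


# Toy example with N = 2 BS antennas, two DL users and one UL user.
example_sinr = dl_sinr(
    k=1,
    h_k=[1 + 0j, 0j],
    precoders={1: [2 + 0j, 0j], 2: [1 + 0j, 0j]},
    ul_power={3: 4.0},
    g_k={3: 0.5 + 0j},
    noise_var=1.0,
)
```

In the toy example the desired power is 4, each interference term contributes 1, and the noise variance is 1, so the SINR is 4/3.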
According to (4) and (9), the target UL or DL user signal mingles with different categories of interference, as shown in Fig. 1, thus inducing undesirable channel conditions. This results in an optimization bottleneck of FD performance when all BS radiating/receiving antennas are inflexible [32]. Hence, we refine each BS antenna's working mode, which covers the independent reception/transmission modes and the coexistence mode.

VOLUME 11, 2023

The smart antennas are modeled with an assignment vector $Q = [\mathbf{q}^u, \mathbf{q}^d]$, in which $\mathbf{q}^u$ and $\mathbf{q}^d$ are the subvectors of receiving and emitting antennas, respectively. The subvectors at sub-slot 2 are written as

$$\mathbf{q}^\chi(t_2) = \left[q_1^\chi(t_2), q_2^\chi(t_2), \ldots, q_N^\chi(t_2)\right], \quad \chi \in \{u, d\}, \tag{10}$$

where $q_l^\chi(t_2)$ is the state of antenna $l$ ($\forall l \in \mathcal{N}$), defined as

$$q_l^\chi(t_2) = \begin{cases} 1, & \text{antenna } l \text{ is used for service } \chi, \\ 0, & \text{antenna } l \text{ is not used for service } \chi. \end{cases} \tag{11}$$

Accordingly, the vector $Q$ acts on the channel vectors as in (12) and (13) to realize the adaptive antennas.
Moreover, the interference model only applies to users that share the same subcarrier. We therefore reconstruct the UL/DL interference model by assigning BS antennas and subcarriers, so that (4) and (9) are rewritten as (12) and (13), in which the binary indicator in (14) is the assignment state of subcarrier $m$ to user $z$ for service $\chi$. Evidently, a scheduled user is a user that has been assigned a subcarrier, and vice versa. The assignment states of all users constitute the subcarrier allocation matrix in (15), which contains one submatrix for each user $z$.
Note that for an arbitrary subcarrier $m$, one and only one $y_{j,m}^u(t_2)$ has practical significance, because each scheduled user is assigned only one subcarrier. Hence, we substitute the expression $y_j^u(t_2)$ for $y_{j,m}^u(t_2)$ in the remainder of the paper for simplification. Similarly, $y_{k,m}^d(t_2)$ is simplified to $y_k^d(t_2)$.
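The two assignment mechanisms above can be sketched in a few lines. The sketch assumes that the binary antenna states act on a channel vector as an element-wise mask, which is one natural reading of how $Q$ "acts on the channel vector", and it groups users by their single assigned subcarrier to find who interferes with whom. The helper names `apply_antenna_mask` and `co_channel_users` are hypothetical.

```python
from typing import Dict, List, Sequence


def apply_antenna_mask(h: Sequence[complex], q: Sequence[int]) -> List[complex]:
    """Zero the channel entries of antennas that are not used for this service.

    Assumption: the antenna state q_l in {0, 1} multiplies the l-th channel entry.
    """
    if len(h) != len(q):
        raise ValueError("channel and antenna-state vectors must have equal length")
    if any(state not in (0, 1) for state in q):
        raise ValueError("antenna states must be binary")
    return [h_l * q_l for h_l, q_l in zip(h, q)]


def co_channel_users(assignment: Dict[int, int]) -> Dict[int, List[int]]:
    """Group scheduled users by subcarrier.

    `assignment` maps a user index to its single subcarrier index. Only users
    in the same group enter each other's interference terms.
    """
    groups: Dict[int, List[int]] = {}
    for user, subcarrier in sorted(assignment.items()):
        groups.setdefault(subcarrier, []).append(user)
    return groups


# Antennas 1 and 3 receive, antenna 2 is not used for UL.
masked = apply_antenna_mask([1 + 1j, 2 + 0j, 3j], [1, 0, 1])
# Users 1 and 2 reuse subcarrier 1; user 3 is alone on subcarrier 2.
groups = co_channel_users({1: 1, 2: 1, 3: 2})
```

An antenna in coexistence mode would simply have state 1 in both the receiving and the emitting subvector.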

2) TRANSMISSION AT SUB-SLOT 1
Since the transmission at sub-slot 1 involves an additional factor related to MM, we reformulate the UL/DL interference models at sub-slot 1 according to (12) and (13) as (16) and (17), where $e_z(t_1)$ is the service muting state defined in (18), in which the value 0 means muting. Similar to the subcarrier assignment, $e_z(t_1)$ is an element of the vector $\mathbf{e}(t_1) = [e_1(t_1), e_2(t_1), \ldots, e_Z(t_1)]$.
In view of the correlation, we introduce a new parameter $b_z(t_1)$ called the service-enabled state, which is defined in (19) and satisfies

$$b_z(t_1) = \begin{cases} 1, & \text{user } z \text{ is not muted for the assigned subcarrier at sub-slot 1}, \\ 0, & \text{user } z \text{ is muted for the assigned subcarrier at sub-slot 1}. \end{cases} \tag{20}$$

The service-enabled states of all users also form a vector, $\bar{B}(t_1)$. We rewrite (16) and (17) using the parameter $\bar{B}(t_1)$ and readily obtain new expressions at sub-slot 1 that resemble (12) and (13) at sub-slot 2. The difference between the UL/DL interference models at the two sub-slots lies in the parameters $B$ and $\bar{B}$, as expressed in (21a) and (21b), where $f_j$ and $g_k$ are the functions of the received UL and DL signals at the matched sub-slot, respectively.
To simplify (21a) and (21b), we define a new variable $\hat{B}$ that substitutes for $B$ and $\bar{B}$. As a result, we obtain a single standard formula instead of two expressions, one per sub-slot, for the sake of problem formulation.

IV. PROBLEM FORMULATION
The UL SINR of user $j$ at sub-slot 1 or 2 is written as (24), in which the UU term and the SI term are the covariance matrices of the UL to UL interference and the SI, respectively. Similarly, the DL SINR of user $k$ is expressed as (29), where $\varphi_{DD}(t_i)$ and $\varphi_{UD}(t_i)$ are the variances of the DL to DL and UL to DL interference, respectively. Finally, we substitute (24) and (29) into the Shannon formula to acquire the SE of UL user $j$ and DL user $k$, respectively, where $\det(\cdot)$ is the determinant operator. The total SE of all scheduled users in the cell over the schedule unit is then defined in (35). From (35), it is noteworthy that the total SE depends on multiple influencing factors. The parameters $P$ and $W$ act directly on UL and DL power allocation to mitigate interference. In contrast, the parameters $Q$ and $\hat{B}$ affect it indirectly, as explained below.
In (24), the power of another UL user $j'$ or of a DL user $k$ raises the UU or SI term of user $j$, which reduces the SE of user $j$. Similarly, in (29), another DL user $k'$ or a UL user $j$ decreases the SE of user $k$ by increasing $\varphi_{DD}$ or $\varphi_{UD}$. For ease of presentation, consider two users as a pair: one user, whose signal appears in the numerator of (24) or (29) and who has the higher ratio of throughput to power (also called EE), is interfered with by the other. If the performance loss of the higher-EE user exceeds the SE obtained by the lower-EE user, the total performance degrades.
Hence, an appropriate parameter $\hat{B}$ that mutes the lower-EE user can effectively mitigate the residual SI (the SI term) or the multi-user interference (the UU term, $\varphi_{DD}$, or $\varphi_{UD}$). Also, parameter $Q$ is correlated with the composite channel gains, i.e., the antenna-assigned counterparts of $\mathbf{h}_j^u$, $\mathbf{h}_{j'}^u$, and $\mathbf{h}_k^d$, which directly affect the powers. Adapting $Q$ yields better performance than the fixed channel gains $\mathbf{h}_j^u$, $\mathbf{h}_{j'}^u$, and $\mathbf{h}_k^d$ when the latter are in poor condition. In essence, the parameters $Q$ and $\hat{B}$ come down to a power allocation issue.
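The pairwise argument about muting can be checked numerically with a toy model. The sketch below assumes scalar links, SE equal to $\log_2(1 + \mathrm{SINR})$, and made-up gains; `total_se` and all numbers are hypothetical and serve only to show when muting the lower-EE user raises the total SE.

```python
import math
from typing import Dict, Iterable, Tuple


def total_se(
    power: Dict[str, float],
    own_gain: Dict[str, float],
    cross_gain: Dict[Tuple[str, str], float],
    active: Iterable[str],
    noise: float = 1.0,
) -> float:
    """Sum of log2(1 + SINR) over the active (unmuted) users.

    cross_gain[(j, i)] is the power gain with which user j interferes with user i.
    Muted users neither receive service nor create interference.
    """
    active = list(active)
    se = 0.0
    for i in active:
        interference = sum(
            power[j] * cross_gain[(j, i)] for j in active if j != i
        )
        se += math.log2(1.0 + power[i] * own_gain[i] / (noise + interference))
    return se


# User A has a strong link; user B has a weak link and interferes heavily with A.
power = {"A": 1.0, "B": 1.0}
own_gain = {"A": 100.0, "B": 1.0}
cross_gain = {("B", "A"): 10.0, ("A", "B"): 0.1}

se_unmuted = total_se(power, own_gain, cross_gain, active=["A", "B"])
se_b_muted = total_se(power, own_gain, cross_gain, active=["A"])
muting_helps = se_b_muted > se_unmuted
```

With these illustrative numbers the loss that B inflicts on A exceeds the SE that B itself obtains, so muting B increases the total. With a small cross gain from B to A the comparison flips, which is why the muting decision is part of the optimization rather than a fixed rule.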
In order to implement a comprehensive power allocation method, we take (35) as the objective function (36a) and formulate the optimization problem (36). Constraints (36b) and (36c) restrict the two sub-slots in one schedule unit and the two types of service, respectively. (36d) means that each user can be assigned at most one subcarrier. (36e) and (36f) imply that only the scheduled users can be silenced, subject to the identifiers in (36e). (36g) and (36h) determine the work modes of each BS antenna. $p_{\max}$ and $P_{\max}$ are the maximum powers of the user and the BS, respectively. Thus, (36i) and (36j) are the maximum power constraints for the DL and UL users, respectively. (36k) and (36m) are the quality of service (QoS) constraints for the unmuted UL and DL users at sub-slot 1 or 2, while (36l) and (36n) are the QoS constraints at sub-slot 2 for the resumed UL and DL users that were muted at sub-slot 1. The compensation coefficient $\beta$ is used to remedy the performance loss of the muted users.
We integrate the abovementioned three elements through (36a). By solving the optimization problem, we can acquire the maximum SE with the optimal UL/DL power allocation, which also accounts for MM and the assignment of BS antennas and subcarriers.

V. ALGORITHM DESCRIPTION
Apart from the binary constraints, the objective function (36a) and the constraints (36k)-(36n) are all nonconvex. Hence, this is a non-deterministic polynomial hard (NP-hard) optimization problem [37]. Furthermore, the binary variables $Q$ and $\hat{B}$, coupled with the UL and DL power allocation, make traditional solutions even more impractical. Since $Q$ and $\hat{B}$ are discrete variables while $W$ and $P$ are continuous, we present two hierarchical methods in this section to solve problem (36) according to the variable types. Each hierarchical method splits the problem into two subproblems and loops over them to solve the initial problem.

A. HYBRID GREEDY-CONVEX METHOD
A practicable method for the subproblem in the continuous variables is to construct an approximate function that is easier to solve than the original NP-hard problem. Several approximation algorithms, such as successive convex approximation and majorization-minimization, are used to address this issue [38], [39]. Considering the nature of the SE equations, we apply another approximation method, the FOTA, in this paper.
First, we reformulate problem (36) for fixed $Q$ and $\hat{B}$ to realize the problem decomposition as problem (37), of which the constraint (37f) reads

$$R_k^d(W, P, t_2) \ge \beta R_{req}^d, \quad \forall k \in \mathcal{D},\ \beta > 1, \tag{37f}$$

where the expansion of (37a) through the logarithmic property is written as (38), with its components given in (39) and (40). The newly defined expressions on the right-hand sides of (39) and (40) are given in (41) and (42), in which $C_j^u(t_i)$ and $D_k^d(t_i)$ are defined separately. From (39) and (40), we can see that both $R_1(W, P)$ and $R_2(W, P)$ are concave logarithmic functions. As the form of (40) is simpler than that of (39), we only need to analyze (40) mathematically.
The FOTA of (40) with multiple iterations will converge to $R_2(W, P)$ due to the function concavity [40]. Accordingly, we can acquire the approximate value of $R_2(W, P)$ by taking derivatives. To facilitate partial differentiation, we convert $J_j^u(t_i)$ and $K_k^d(t_i)$ in (40) to a form with only two direct variables. As a result, at the $n$th iteration we obtain the FOTA of (40) as

$$R_2(W, P) \approx R_2\big(W^{(n)}, P^{(n)}\big) + \nabla R_2\big(W^{(n)}\big)\big(W - W^{(n)}\big) + \nabla R_2\big(P^{(n)}\big)\big(P - P^{(n)}\big), \tag{50}$$

where the two gradient terms are given in (51) and (52). Obviously, (51) and (52) are affine functions with respect to $W$ and $P$, respectively. Hence, (50) is approximately transformed into an affine function. Substituting (50) into (38), we acquire the concave objective function of problem (37) accordingly.
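The linearization step can be illustrated on a one-variable toy rate. The sketch assumes a single user whose own power also creates interference, so that the rate $\log_2(1 + gp/(\sigma + cp))$ splits into a difference $R_1 - R_2$ of two concave logarithms. Replacing $R_2$ by its first-order Taylor expansion at $p^{(n)}$ gives a concave lower bound that touches the true rate at $p^{(n)}$. The functions `rate`, `fota_surrogate`, and `fota_iterate`, the gains, and the grid search that stands in for a convex solver are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def rate(p: float, g: float, c: float, sigma: float = 1.0) -> float:
    """True rate log2(1 + g*p / (sigma + c*p)), written as R1(p) - R2(p)."""
    return math.log2(sigma + (g + c) * p) - math.log2(sigma + c * p)


def fota_surrogate(p: float, p_n: float, g: float, c: float, sigma: float = 1.0) -> float:
    """Concave surrogate: keep R1, replace R2 by its first-order Taylor expansion at p_n."""
    r1 = math.log2(sigma + (g + c) * p)
    r2_at_pn = math.log2(sigma + c * p_n)
    slope = c / ((sigma + c * p_n) * math.log(2.0))  # dR2/dp evaluated at p_n
    return r1 - (r2_at_pn + slope * (p - p_n))


def fota_iterate(
    grid: Sequence[float], p0: float, g: float, c: float, iterations: int = 5
) -> Tuple[float, List[float]]:
    """Repeatedly maximize the surrogate over a feasible grid (stand-in for a convex solver)."""
    p_n = p0
    history = [rate(p_n, g, c)]
    for _ in range(iterations):
        p_n = max(grid, key=lambda p: fota_surrogate(p, p_n, g, c))
        history.append(rate(p_n, g, c))
    return p_n, history


g, c = 4.0, 1.0
grid = [0.1 * i for i in range(1, 51)]  # feasible powers in (0, 5]
p_star, history = fota_iterate(grid, grid[0], g, c)
```

Because $R_2$ is concave, its tangent lies above it, so the surrogate never exceeds the true rate and the true rate cannot decrease from one iteration to the next as long as the starting point is feasible.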
Similarly, each of the nonconvex constraints (37c)-(37f) can be decomposed into a difference of two logarithmic functions, so we obtain concave approximations of the constraint functions with the assistance of the FOTA method.
Consequently, we convert problem (37) into an approximate convex optimization problem. With the Matlab convex tool [41], we can solve this convex optimization problem and obtain the optimal solution for $W$ and $P$ together with the corresponding total SE.
Since the optimum solution for the discrete variables $Q$ and $\hat{B}$ cannot be acquired through ordinary differentiation, a direct approach to choosing an appropriate configuration is an exhaustive search, as in [14]. Nevertheless, as the problem parameters grow, a global search such as the exhaustive search [10] becomes impractical due to the curse of dimensionality with two discrete variables. By contrast, the greedy algorithm searches only several local optimums (namely, candidates) instead of the global optimum, and then selects the best candidate from the candidate list to approximate the global optimum [42]. Since the greedy algorithm adopts a top-down structure in which backtracking is unnecessary, its efficiency is improved to some extent compared with the exhaustive search. We therefore apply the greedy algorithm to find a suboptimal configuration. In order to reduce the number of samples to traverse, we evenly pick $i_Q$ and $i_{\hat{B}}$ samples at a given sample rate from the universal sets of $Q$ and $\hat{B}$, respectively. The greedy selection rule follows two steps: 1) initialize the configuration with all emitting/receiving antennas shared and all scheduled users unmuted; 2) gradually decrease the share level of the antennas and increase the number of muted users in a random process. In addition, we set a tolerance threshold $\omega_{th}$ to accelerate the search for the suboptimal candidate.
Through the updates of the outer loop (namely, the greedy method) and the inner loop (namely, the convex method), a relatively optimum solution can be acquired. Accordingly, the hybrid greedy-convex method is summarized in Algorithm 1.
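The interplay of the outer and inner loops can be sketched as follows. This is not Algorithm 1 itself, which is not reproduced here, but a minimal illustration under assumptions: the candidate list is already ordered from "all antennas shared, nobody muted" towards less sharing and more muting, the inner convex solver is represented by a hypothetical callable `inner_se` that returns the total SE for a fixed configuration, and the tolerance threshold is interpreted as "stop once the next candidate improves the incumbent by less than `omega_th`".

```python
from typing import Callable, Sequence, Tuple, TypeVar

Config = TypeVar("Config")


def greedy_search(
    candidates: Sequence[Config],
    inner_se: Callable[[Config], float],
    omega_th: float,
) -> Tuple[Config, float]:
    """Outer greedy loop over discrete configurations.

    candidates: sampled (antenna, muting) configurations in greedy order.
    inner_se:   stand-in for the inner convex step; returns the total SE.
    omega_th:   assumed stopping rule, i.e. minimum improvement to continue.
    """
    if not candidates:
        raise ValueError("candidate list is empty")
    best_cfg = candidates[0]
    best_se = inner_se(best_cfg)
    for cfg in candidates[1:]:
        se = inner_se(cfg)
        if se - best_se < omega_th:
            break  # no backtracking: later candidates are never examined
        best_cfg, best_se = cfg, se
    return best_cfg, best_se


# Each candidate is (number of shared antennas, tuple of muted users).
candidates = [(4, ()), (3, (2,)), (2, (2,)), (1, (2, 5))]
se_table = {(4, ()): 1.0, (3, (2,)): 2.0, (2, (2,)): 2.05, (1, (2, 5)): 5.0}
best_cfg, best_se = greedy_search(candidates, se_table.__getitem__, omega_th=0.1)
```

In this made-up table the search stops at the third candidate and returns the second one, even though the last candidate has a higher SE. The example therefore also shows how a search without backtracking can settle on a local optimum.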
Nevertheless, the greedy method can easily be trapped in a local optimum for nonconvex problems even when it explores every candidate up to the last one, because the candidate list is limited. Since the RL technique has a significant advantage in tackling a vast amount of data, we adopt the DRL technique, building on the previous proposal.

B. HYBRID DRL-CONVEX METHOD
It is known that a Markov decision process (MDP) is a tuple of four elements: the sets of current states $s_t$, next states $s_{t+1}$, actions $a_t$, and rewards $r_{t+1}$, where t denotes the time step. In our system model, the work mode of the BS antennas, the assignment of subcarriers, and the user MM are all handled by the FD BS; we therefore take the BS as the agent. Because the set of actions is finite, we use the discrete variables Q and B as the action $a = (a_q, a_b)$, in which $a_q$ represents the BS antenna assignment and $a_b$ denotes the joint subcarrier assignment and MM. Considering that the BS-agent adopts two sets of actions, one for each sub-slot in a schedule unit, we mainly focus on the agent behavior at sub-slot 1. This is because the action at sub-slot 2 is a pared-down version of that at sub-slot 1, as no muting orders are involved. Therefore, analyzing the realization at sub-slot 1 reasonably covers the implementation at sub-slot 2.
The action space is denoted by A, where $A_1$ and $A_2$ are the total numbers of combinations of $a_q$ and $a_b$, respectively. The action determined at time step t from A acts on the constrained multi-user interference model and the FOTA algorithm, which together form the environment. Subsequently, the environment outputs the SINR of each scheduled user and the total SE at time step t + 1, which are treated as the next state $s_{t+1}$ and the reward $r_{t+1}$, respectively. Evidently, once the BS-agent chooses a specific action at step t, each scheduled user transits from the current state $s_t$ to the next state $s_{t+1}$, which is calculated based on the determined action, and the BS-agent is rewarded at the same time. Correspondingly, the state transition probability is $P(s_{t+1} | s_t, a_t)$. With the new state and reward, the BS-agent adapts its policy via trial and error and repeatedly makes a new round of decisions. Note that the learning process of the BS-agent is directed by a reward that follows the constraints of (36) in the environment.
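Since the joint action $(a_q, a_b)$ is drawn from a finite set of $A_1 A_2$ combinations, a Q-network with one output per action needs a flat index for each pair. A minimal sketch of such a mapping is given below; the helper names `encode_action` and `decode_action` and the row-major ordering are assumptions for illustration and do not come from the paper.

```python
def encode_action(a_q, a_b, num_b):
    """Map a joint action (a_q, a_b) to a flat index in [0, A1 * A2).

    num_b is A2, the number of combinations of a_b (row-major order assumed).
    """
    return a_q * num_b + a_b


def decode_action(index, num_b):
    """Recover (a_q, a_b) from a flat action index."""
    return divmod(index, num_b)
```

With this convention, the reduced action space at sub-slot 2 (no muting orders) simply corresponds to a smaller `num_b`.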
The interaction between BS-agent and environment is visualized in Fig. 3 and deemed an MDP, which is a discrete decision problem on the time sequence.
In an MDP, the state value is determined by the Bellman optimality equation [43]. An optimal reward table (i.e., the Q table) can therefore be acquired from the state and action values. Each element of the Q table is the expected return from the MDP, in which γ denotes the discount factor applied to future step rewards. If γ is set close to 1 (0), the agent concentrates on the long-term (short-term) reward [44]. The above flow is called Q learning, which is suitable for solving nonconvex problems with discrete variables. However, for traditional Q learning, the BS must maintain a Q table of size $A_1 A_2 \times A_1 A_2$, which occupies excessive memory. Building on Q learning, the deep Q-learning network (DQN) uses a deep neural network to estimate the Q value instead of a lookup table [45], thereby avoiding a Q table that is too large to be looked up. DDQN is an improvement of DQN that contains two Q-networks: an online Q-network for action selection and a target Q-network for action evaluation. It avoids the overestimation that arises when selection and evaluation are performed in the same DQN [46]. Given these advantages of DDQN, we propose another hierarchical solution, the hybrid DRL-convex method, in which DDQN replaces the greedy algorithm. Fig. 3 presents a macroscopic view of the interaction process, whereas we next introduce DDQN from a microscopic view to show the training process of the BS-agent.
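For reference, the tabular Q-learning step that DQN and DDQN replace can be sketched as below. This is the textbook update rule rather than an equation recovered from the paper; the function names, the list-of-lists table layout, and the step size `lr` are assumptions for illustration. The second helper only restates the table size quoted above.

```python
def q_learning_update(q_table, s, a, r, s_next, gamma, lr):
    """One textbook Q-learning update on a table indexed as q_table[state][action]."""
    td_target = r + gamma * max(q_table[s_next])  # discounted best future value
    q_table[s][a] += lr * (td_target - q_table[s][a])
    return q_table[s][a]


def q_table_entries(num_q, num_b):
    """Entries in a table of size (A1*A2) x (A1*A2), as stated in the text."""
    return (num_q * num_b) ** 2
```

The quadratic growth in `q_table_entries` is the memory problem that motivates estimating Q values with a neural network instead of a lookup table.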
In DDQN, the online Q-network and the target Q-network are represented as $Q(s_t, a_t; \theta_t)$ and $Q(s_t, a_t; \theta_t^-)$, respectively, where $\theta_t$ and $\theta_t^-$ are the weighting factors to be trained.
The BS-agent initializes the online Q values for each action and state. The action $a_t$ (namely, Q and B at time step t) is then selected as given in (61). The related reward $r_{t+1}$ (the total SE) and the new state $s_{t+1}$ (i.e., the UL/DL SINR of each user) are then obtained through interaction with the environment. Together, $s_t$, $a_t$, $r_{t+1}$, and $s_{t+1}$ constitute a tuple that is stored in a replay buffer R.
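The replay buffer R can be sketched as a bounded queue with uniform random sampling. The class name `ReplayBuffer`, the fixed `capacity`, and the eviction of the oldest tuple when the buffer is full are assumptions for illustration; the paper does not state these details.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores (s_t, a_t, r_{t+1}, s_{t+1}) tuples and samples them uniformly."""

    def __init__(self, capacity, seed=0):
        self._buf = deque(maxlen=capacity)  # oldest tuples are dropped first
        self._rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self._buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement, capped at the buffer size.
        k = min(batch_size, len(self._buf))
        return self._rng.sample(list(self._buf), k)

    def __len__(self):
        return len(self._buf)
```

Sampling at random rather than in arrival order breaks the correlation between consecutive time steps, which is the usual reason for using a replay buffer.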
For the learning process, tuples are randomly sampled in batches from the replay buffer R. The BS-agent determines the next action through the online Q-network by applying the max operation based on $s_{t+1}$. The determined action is fed into the target Q-network to acquire the true (target) value. Then, the BS-agent calculates the loss function as the mean squared error between the true values and the predicted values over the sampled tuples. Later, the weighting factors $\theta_t$ are updated in each step through backpropagation based on the gradient descent method, where v denotes the learning rate, which decides how much of the deviation is learned in each update. The Q value is also updated in each time step. By comparison, the weighting factors $\theta_t^-$ are updated only once per episode through the Polyak averaging method, where the hyperparameter ρ decides the strength of the soft update. Fig. 4 shows the training process in one step or episode. The training over all the above episodes (called one epoch) acts on sub-slot 1, where the BS-agent considers only the partial constraints (36k) and (36m) in the environment. For sub-slot 2, the BS-agent should switch to the reduced action space (i.e., the muting orders are excluded) and retrain the network with constraints (36k)-(36n). Since the environment differs between the two learning processes, the predictions output by the neural network differ between the sub-slots. As a consequence, two DDQNs should be trained separately, one for sub-slot 1 and the other for sub-slot 2, to maintain two groups of weighting factors. Considering that the training process at sub-slot 2 is similar to that at sub-slot 1, the procedure is not described in detail. In summary, one training iteration is equivalent to a single pass for one time step, that is, one interaction with the environment (at sub-slot 1 or 2) through the given combined action at the corresponding time step. Fig. 5 shows the information transmission between the two DDQNs, where the muted-user information is exchanged after DDQN 1 finishes training so that the muted users can be compensated. At the end of a session, the DDQNs acquire convergent weighting factors for each neural network. With the trained networks, the BS-agent has learned from the environment to select a suboptimal action. Meanwhile, the real-time performance is guaranteed because the samples can be trained off-policy.
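The three quantities in the learning step (target value, loss, and soft update of the target weights) can be sketched in a few lines. These follow the standard DDQN formulation rather than the paper's own equations; the function names are hypothetical, Q values are passed in as plain lists, and the Polyak convention $\theta^- \leftarrow \rho\,\theta^- + (1-\rho)\,\theta$ is an assumption, since some texts swap the roles of ρ and 1 − ρ.

```python
def ddqn_target(reward, q_online_next, q_target_next, gamma):
    """Target value: the online network selects, the target network evaluates."""
    # Index of the best next action according to the online Q-network.
    a_star = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[a_star]


def mse_loss(targets, predictions):
    """Mean squared error between target values and predicted Q values."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


def polyak_update(theta_target, theta_online, rho):
    """Soft update of target weights (assumed convention, see lead-in)."""
    return [rho * t + (1.0 - rho) * o for t, o in zip(theta_target, theta_online)]
```

Decoupling selection from evaluation in `ddqn_target` is what distinguishes DDQN from DQN, where the same network would both pick and score the next action.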
The hybrid DRL-convex method is summarized in Algorithm 2, where $\theta_{t,D}$ ($\theta_{t,D}^-$) represents the weighting factors in the online (target) Q-network of DDQN D (D ∈ {1, 2}), the italic $R_D$ indicates the SE at sub-slot D, and $A_D$ stands for the action space at sub-slot D.

C. COMPLEXITY ANALYSIS
For expanded parameters, we select $2(2N)^{r_Q}$ and $(2ZM)^{r_B} + (2Z^2M)^{r_B}$ samples from the universal sets in the greedy algorithm. Additionally, the time complexity of the FOTA method is $O(nMJ(J+K))$. In sum, the time complexity of the hybrid greedy-convex method (HGC) is $O(nI_{\max}MJ(2(2N)^{r_Q} + (2ZM)^{r_B} + (2Z^2M)^{r_B})(J+K))$. Evidently, $r_Q$ and $r_B$ mainly determine the exponential computational complexity [47]. To evaluate the effectiveness of HGC, we take the hybrid exhaustive-convex method (HEC) as a baseline. Since an exhaustive search cannot traverse all combinations for the expanded parameters, we keep the number of iterations the same for the exhaustive search and the greedy algorithm, both to keep the method feasible and for the sake of fairness. In this regard, the time complexity of HGC equals that of HEC.
As the training process of the DRL method depends on many factors (e.g., the kernel size, the size of the feature map, and the number of input and output channels), it is difficult to provide an exact complexity. From the viewpoint of each iteration, however, the time complexity depends on the number of episodes (E) and steps (T). Thus, the time complexity of the hybrid DRL-convex method (HDC) is $O(2nETT_{iter}MJ(J+K))$, where $T_{iter}$ indicates the complexity of one iteration.
Compared with the exponential computational complexity of HGC/HEC, the complexity of HDC is much lower in high-dimensional scenarios. Moreover, once DDQN 1 is well trained, it need not be retrained unless the interference model changes significantly. Accordingly, the time complexity drops to $O(nETT_{iter}MJ(J+K))$. Since the parameter dimensionality is closely related to the number of neurons and hidden layers in the deep neural network, we tend to apply a more sophisticated network to cover complex parameters and improve performance.
In the following section, we additionally introduce HDC/HGC/HEC without MM for comparison. The related complexities are $O(nETT_{iter}MJ(J+K))$ for HDC and $O(nI_{\max}MJ((2N)^{r_Q} + (2ZM)^{r_B})(J+K))$ for HGC (HEC). Although removing MM reduces the complexity of HGC (HEC) more than that of HDC, the complexity of HGC (HEC) without MM is still higher.
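To make the sample-count terms concrete, the sketch below evaluates the number of greedy samples with and without MM. It assumes that the extracted expressions are powers, i.e., $2(2N)^{r_Q} + (2ZM)^{r_B} + (2Z^2M)^{r_B}$ with MM and $(2N)^{r_Q} + (2ZM)^{r_B}$ without MM; the function name and the example values in the usage note are hypothetical.

```python
def greedy_sample_count(N, Z, M, r_Q, r_B, with_mm=True):
    """Number of sampled configurations in the greedy search (assumed reading).

    N: BS antennas, Z: users, M: subcarriers,
    r_Q, r_B: sample-rate exponents for the universal sets of Q and B.
    """
    q_term = (2 if with_mm else 1) * (2 * N) ** r_Q
    b_term = (2 * Z * M) ** r_B
    if with_mm:
        b_term += (2 * Z ** 2 * M) ** r_B  # extra term introduced by muting
    return q_term + b_term
```

For example, with N = 2, Z = 2, M = 1 and both exponents equal to 1, the count is 20 with MM and 8 without MM; the difference of 12 matches the extra term $(2N)^{r_Q} + (2Z^2M)^{r_B}$.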

VI. PERFORMANCE EVALUATION
A. SIMULATION PARAMETERS
In this section, we present multiple numerical results to evaluate the performance of our proposal. We assume that Z users are uniformly distributed in a square with a side of 50 m, with the FD BS located at the center. The FD BS is equipped with N smart antennas, while each user has only one HD antenna. To simplify the experiment, we suppose that all users are scheduled by the BS (i.e., $\alpha_z^{\chi}(t_1) = 1, 2$, $\alpha_z^{\chi}(t_2) = 2$, $\forall z \in Z$), and that a randomly chosen half of the scheduled users receive $Sch_z^d(t_1) = Sch_z^d(t_2) = 1$, while the other half get $Sch_z^u(t_1) = Sch_z^u(t_2) = 1$. M mutually orthogonal subcarriers are reused in the network. Detailed parameters of the interference model are given in Table 1.
For the DRL method, we train the proposed DDQNs using Python 3.6, TensorFlow-gpu 1.14, and Keras 2.1.6 for 5000 episodes with 2500 steps each. Each DDQN has three fully connected hidden layers with dropout applied. The hidden layers contain 1024, 512, and 256 neurons, respectively, each followed by the ReLU activation function. The hyperparameters and other parameters are also listed in Table 1.
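As a rough indication of the model size implied by this architecture, the sketch below counts the trainable weights and biases of a fully connected network with hidden layers of 1024, 512, and 256 neurons. The input dimension `state_dim`, the output dimension `num_actions`, and the presence of a final linear output layer with one unit per action are assumptions, since the paper does not list them here; dropout adds no parameters.

```python
def dense_param_count(state_dim, num_actions, hidden=(1024, 512, 256)):
    """Weights plus biases of a fully connected Q-network (illustrative)."""
    sizes = [state_dim, *hidden, num_actions]
    # Each dense layer has fan_in * fan_out weights and fan_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
```

For a hypothetical `state_dim` of 8 and 4 actions this gives 666,372 parameters, most of them in the 1024-to-512 layer, so the network size is dominated by the hidden layers rather than by the size of the action space.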

1) ANALYSIS OF CONVERGENCE SPEED
Because HGC/HEC involve no training process and both DDQNs follow nearly the same training process, we present only HDC at sub-slot 1 in Fig. 6.
From Fig. 6, we can see that the reward per user for each configuration gradually increases and eventually converges to a relatively steady maximum value. When the reward stops growing, the neural network has been trained well. Moreover, more complicated parameters incur a lower training speed and a more fluctuating final reward. Clearly, the convergence of the algorithm is tied to the dimensionality of the parameters. For instance, a neural network with G = 8, N = 2, M = 6 takes 300 episodes to stabilize, while 4000 episodes are required to train a neural network with G = 28, N = 12, M = 6.
In the following subsections, we will further emphasize the superiority of the HDC algorithm with MM for various parameters in detail.
Note that the HDC, HGC, and HEC algorithms mentioned in the previous sections include MM by default. In the following subsections, the case without MM is regarded as the reference, so we explicitly state whether MM is introduced.

2) ANALYSIS OF THE NUMBER OF SCHEDULED USERS
In Fig. 7, since no JU appears at G = 8, the total SE in a schedule unit in this case serves as the baseline against which the cases with other numbers of scheduled users are compared, thus highlighting the total SE gain. It can be seen that the number of scheduled users and the total SE gain do not grow in proportion. For example, the total SE gain of HDC without MM (namely, HDC-WOM) at G = 28 is only 131.5%, while the number of scheduled users rises to 350% of that at G = 8. On the other hand, from G = 8 to 16, more scheduled users in the current cell increase the total SE. However, this upward tendency stops at G = 20. These two observations suggest that the total SE is seriously restricted by the tight resource situation, in which the spectrum resources are insufficient to support the scheduled users. Mutual interference thus becomes the main cause of the performance degradation.
Compared with HDC-WOM, HDC with MM (HDC-WM) shows a relatively robust performance advantage. The reason is that muting the JUs helps alleviate the competition for spectrum resources; for example, five JUs are muted at G = 28. This action brings HDC-WM more incremental gains. It also shows that the probability of JUs appearing increases as the number of scheduled users grows. Consequently, the proposed MM successfully strikes a tradeoff between the total SE and the total number of scheduled users by muting several JUs.
For the convenience of performance comparison, we adopt the average total SE per scheduled user instead of the total SE as the performance metric below. Compared with Fig. 7, Fig. 8 presents the relationship between the performance and the number of scheduled users from another point of view. Because of the scarce spectrum resources, the performance of each algorithm decreases monotonically as the number of scheduled users rises. As expected, HDC-WM outperforms HDC-WOM. Specifically, the performance difference becomes more evident (from 0 to a 25.8% gain) as the number of scheduled users increases. The reason is that the number of JUs determines the marginal increment brought by MM (see Fig. 7).
Generally, both HDC algorithms perform better than the HGC/HEC algorithms, since HDC converges more robustly owing to its lower complexity. The gap between HGC with MM (namely, HGC-WM) and HGC without MM (namely, HGC-WOM) is inconspicuous, at most 3 bps/Hz, and so is the discrepancy between HEC with MM (i.e., HEC-WM) and HEC without MM (i.e., HEC-WOM).
For the greedy algorithm in HGC, the limited candidates lead to entrapment in a local optimum, thus diminishing the advantage of MM. Notably, HGC-WM is outperformed by HGC-WOM at G = 20 and cannot even meet the minimum QoS for all users at G = 28. This is caused by the higher complexity of HGC-WM compared with HGC-WOM as the dimensionality expands. Specifically, the extra complexity of HGC-WM over HGC-WOM is $O(nI_{\max}MJ((2N)^{r_Q} + (2Z^2M)^{r_B})(J+K))$. In our experimental settings, the excess part can be rewritten as $O(nI_{\max}MJG((2N)^{r_Q} + (2G^2M)^{r_B}))$, which outweighs the profit brought by MM in the case of large parameters. Since HEC has the same complexity as HGC, the handling ability of HEC is similar to that of HGC. This also shows that MM in HGC or HEC no longer offers an advantage for complicated parameters. It is noteworthy that, for a learning method, the complexities of HDC-WM and HDC-WOM are nearly the same; the difference depends only on $T_{iter}$, which can be ignored in a learning process. Thus, the advantage of MM is well displayed in HDC.
The performance of HEC is worse than that of HGC. Although their complexities are equal, the exhaustive search in HEC seeks the global optimum directly, with randomness and blindness, and is therefore less efficient than the top-down design adopted in HGC. As a result, HEC cannot cope with G = 24.

3) ANALYSIS OF THE NUMBER OF ANTENNAS
Fig. 9 shows that the performance is positively correlated with the number of antennas at the BS. That is to say, configuring a large number of antennas is a straightforward means of promoting spatial diversity gain. HDC-WM achieves a 6 bps/Hz marginal increment compared with HDC-WOM, and the performance gap between them remains virtually constant as the number of antennas varies. Combined with Fig. 8, this indicates that the MM performance gain depends mainly on the number of scheduled users.
The margin between HGC-WM and HGC-WOM is indistinct at N = 2/4/6/8. Similarly to Fig. 8, excessive computation degrades the HGC performance when the number of antennas is large, so HGC shows only moderate performance growth. Additionally, at N = 12, HGC-WM is surpassed by HGC-WOM by over 11 bps/Hz, because the benefit of MM is nullified by the extra complexity. It is worth noting that the disparity between HEC-WM and HEC-WOM is more conspicuous than that between HGC-WM and HGC-WOM at N = 12. The result illustrates that, when tackling large parameters, HEC-WM is inferior to HGC-WM.
The performance difference between HDC and HGC/HEC increases with the number of antennas. The minimum gap between them is less than 10 bps/Hz at N = 2, while the peak discrepancy is over 60 bps/Hz at N = 12. The result demonstrates that HDC is superior in handling multiple antennas.

4) ANALYSIS OF THE NUMBER OF SUBCARRIERS
Fig. 10 shows that relatively abundant subcarriers in the current network lead HDC to a better SE. More spectrum resources evidently reduce all kinds of interference. Nevertheless, the discrepancy between HDC-WM and HDC-WOM narrows as the number of subcarriers increases. The reason is that the influence of JUs on the whole system declines owing to the relatively adequate spectrum resources. Specifically, the gap is 9 bps/Hz at M = 1, and it drops by 3 bps/Hz at M = 6. Our proposed MM is thus more appropriate in scenarios of scarce spectrum resources.
Although HGC, both with and without MM, also displays a growing performance trend from M = 1 to 4, the ascending trend is suppressed or reversed at M = 5, 6. This is because, for HGC, the profit derived from spectrum resources is less than the loss due to computation at the related configurations. Significantly, the fact that HDC-WOM begins to outperform HGC-WM at M = 5 reinforces the conclusion drawn from Fig. 8.
For HEC, the deteriorating performance trend is more acute when more subcarriers are assigned. This shows that the traditional exhaustive search cannot be fully applied to the scenario of two discrete variables.
From Figs. 7-10, we can see that the algorithm performance is highly correlated with the size of the parameters, such as the number of scheduled users, antennas, and subcarriers. To fully compare the proposed and traditional algorithms, we set the upper bound of the parameters as G = 28, N = 12, and M = 6 in our simulations. The numerical results are presented in Table 2. They validate that only the proposed HDC with MM gains better performance than its counterpart without MM in the case of extended parameters. Although the proposed HGC is worse than HDC, it shows relative robustness compared with HEC owing to its top-down structure.
VOLUME 11, 2023

5) ANALYSIS OF RESIDUAL SELF-INTERFERENCE
The above analysis focuses mainly on three factors, which are variable-related parameters in the objective function. In the following simulations, we further analyze other influencing factors, such as the parameter $\sigma_{SI}^2$ of the interference model and the parameters $p_{\max}$ and β in the constraints. Fig. 11 displays the average SE per user for different residual SI power ratios. It can be observed that the impact of residual SI on performance is weak when the ratio is less than −110 dB, whereas performance deteriorates drastically when the ratio exceeds −90 dB. This demonstrates the necessity of SIC technology and shows that stronger residual SI worsens the FD communication system. The performance gap between HDC-WM and HDC-WOM is 5 bps/Hz at first and gradually increases as the ratio increases. Nevertheless, the growth is not obvious until the ratio exceeds −105 dB. On the other hand, the performance gap between the two algorithms reaches 13 bps/Hz at −80 dB. This finding means that strong residual SI strengthens the adverse effects of JUs. As a result, applying MM yields a prominent performance gain in the presence of stronger residual SI.
The performance tendency of HGC/HEC is similar to that of HDC. The main distinction is that the performance gain brought by MM in HGC/HEC is lower than that in HDC. This also results from the higher complexity of HGC-WM/HEC-WM compared with HGC-WOM/HEC-WOM, as presented in the complexity analysis.

6) ANALYSIS OF MAXIMUM TRANSMIT POWER
Fig. 12 shows the effect of different $p_{\max}$ settings on SE. As is well known, the degree of interference depends directly on the transmit power: a lower transmit power makes the interference more negligible. With interference largely negligible, the non-learning methods can easily find the best candidate without much traversal. Meanwhile, the influence of JUs is minimal. Accordingly, there is little difference among the algorithms at $p_{\max}$ = 6 dBm. Thereafter, the performance profit brought by increasing power still outweighs the cost of interference. At $p_{\max}$ = 10 dBm in particular, the performance boost is remarkable. This means that a low power level accompanied by a low interference floor has greater potential for performance improvement. This circumstance ends at $p_{\max}$ = 22 dBm, where the SE nearly reaches its peak for each algorithm. If $p_{\max}$ continues to increase, interference becomes the dominant factor affecting performance. For this reason, the transmit power of each user and the BS should not be raised further, so as to avoid performance degradation.
Note that the performance gap between HDC-WM and HDC-WOM gradually increases until convergence as the power budget increases. For instance, the gap between the two HDCs is 0 and 7 bps/Hz at $p_{\max}$ = 6 and 22 dBm, respectively. The reason is that the side effect of JUs at a high power level is much greater than that at a comparatively low level. Correspondingly, applying MM offers a clear advantage in the case of strong JUs. On the other hand, when interference becomes stronger, the exhaustive search and the greedy method struggle to handle MM, especially the exhaustive search, so the advantage of MM is inconspicuous for them.
For the different types of algorithms (i.e., HEC, HGC, and HDC), the performance is mainly restricted by the parameter sizes of scheduled users, antennas, and subcarriers, while the parameters of residual self-interference and maximum transmit power are irrelevant to complexity. As a result, the variation tendencies among the different algorithms in Fig. 11 and Fig. 12 are more consistent than those in Fig. 8, Fig. 9, and Fig. 10.

7) ANALYSIS OF COMPENSATION COEFFICIENT
In Fig. 13, we show the effect of each compensation coefficient on performance. Because compensation factors do not work on algorithms without MM, these algorithms maintain the same SE regardless of the coefficients. Note that the performance of each algorithm with MM decreases monotonically as the compensation coefficient increases. This demonstrates that compensating the muted users for the interruption intensifies interference at sub-slot 2. That is, the compensation attends to the JUs at the expense of the whole system.
Since more scheduled users in the current cell give rise to more JUs, the decay of SE with an increasing compensation coefficient is more significant at a larger G. At G = 28, the performance of HDC-WM is even worse than that of HDC-WOM at β = 2. In view of the principle of fairness, the value cannot exceed an upper limit. We set β = 1.4 in an attempt to balance the interests of the majority and the individuals.

VII. CONCLUSION
In this paper, a novel power allocation method for the FD multi-user MIMO system has been studied. Specifically, three major factors related to power allocation (namely, smart antennas, scheduled users, and subcarriers) have been considered. To further improve performance at the user level, we introduce MM by modifying the frame structure to alleviate the interference related to JUs. After formulating the optimal power allocation problem, we propose a hierarchical algorithm that handles the different types of variables separately. That is, the subproblem of continuous variables is solved through FOTA, while the subproblem of discrete variables is addressed through the greedy algorithm, which replaces the traditional exhaustive search. Considering the high computational complexity of the greedy algorithm when applied with extended parameters, we devise a joint DRL method to obtain better performance. The DRL method contains two DDQNs: one is used to train samples at sub-slot 1, and the other is applied at sub-slot 2. Simulation results reveal that HDC outperforms HGC/HEC in three main aspects. Meanwhile, compared with the case without MM, the case with MM achieves a performance enhancement by reducing the side effect of JUs. In conclusion, our proposal offers a new way to improve SE in multi-user MIMO FD systems. One possible extension of this work is to develop an improved DRL scheme to further optimize the performance in scenarios with massive users and antennas.