Distributed Learning-Based Resource Allocation for Self-Organizing C-V2X Communication in Cellular Networks

In this paper, we investigate a resource allocation problem for a cellular vehicle-to-everything (C-V2X) network with the aim of improving the energy efficiency of the system. To address this problem, we propose self-organizing mechanisms for joint and disjoint subcarrier and power allocation, both performed in a fully distributed manner. A multi-agent Q-learning algorithm is proposed for the joint power and subcarrier allocation. In addition, for the sake of simplicity, the problem is decoupled into two sub-problems: a subcarrier allocation sub-problem and a power allocation sub-problem. First, a distributed Q-learning method is proposed to allocate the subcarriers among users. Then, given the optimal subcarriers, a dynamic power allocation mechanism is proposed in which the problem is modeled as a non-cooperative game and solved with a no-regret learning algorithm. The proposed approaches are evaluated against other learning mechanisms, and the comparison is presented in Fig. 8. Simulation results show that the multi-agent joint Q-learning algorithm yields energy efficiency gains of up to about 11% and 18% compared to the proposed disjoint mechanism and a third, disjoint Q-learning mechanism for allocating power and subcarriers to each user, respectively; however, the multi-agent joint Q-learning algorithm uses more memory than the disjoint methods.

In this paper, we aim to maximize the energy efficiency of an uplink power-domain non-orthogonal multiple access (PD-NOMA) system. To reduce the delay of vehicular communication, device-to-device (D2D) communication is introduced in the V2X environment. In the proposed system, D2D pairs share the same uplink resources with other vehicles, and the resulting interference degrades the system performance. Thus, we focus on intra-cell interference and use the successive interference cancellation (SIC) technique to manage the interference among the users in a cellular frequency band [5]. The optimization problem is formulated as a nonlinear integer programming problem. Since users autonomously select their subcarriers based on environmental information about the subcarriers, machine learning methods are attractive for reducing both signaling overhead and equipment costs in the system. Q-learning is a reinforcement learning algorithm that compares the expected utility of the available actions without requiring a model of the environment. It has emerged as a valuable machine learning technique for distributed self-organizing networks (SONs) because of its low complexity and its convergence to an optimal point. Furthermore, SONs allow systems to configure themselves automatically without manual intervention [6]. We show that, through our distributed Q-learning, D2D users not only learn their resources in a self-organized way but also achieve better system performance than with traditional methods. The Q-learning method is therefore selected to solve the resource allocation problem, which leads to an optimal policy in the sense of maximizing the expected value of the total reward function for the considered system model [7].
In this paper, we propose two machine learning approaches. In the first approach, a multi-agent Q-learning algorithm is applied to the joint power and subcarrier allocation. In the second approach, the problem is decoupled into two sub-problems: a power allocation sub-problem and a subcarrier allocation sub-problem. We propose a distributed Q-learning method to allocate subcarriers among users. Given an optimal subcarrier allocation, the power allocation sub-problem is modeled as a non-cooperative game. To solve the game, a no-regret algorithm that can be executed in a distributed manner is used. To evaluate the performance of our proposed approaches, we apply the Q-learning-based mechanism presented in [8] to our power allocation problem.

A. RELATED WORKS
Several related works have studied resource allocation for C-V2X communication. In [1], a coalition formation game was proposed to maximize the system sum rate in a statistical average sense for cellular users and multiple C-V2X users. That work considers an OFDMA-based cellular network with a specific frequency band for each user. Since using a fixed frequency band for each user does not appear to be optimal in terms of energy, we propose a NOMA-based system and learn the optimal subcarriers for the users. In [9], the authors studied a coalition formation game to address the uplink resource allocation problem for multiple cellular users and C-V2X users. In [10], the main contribution was a non-cooperative game and a real-time mechanism based on deep reinforcement learning to deal with the energy-efficient power allocation problem in C-V2X networks. In [11], the authors studied the energy-efficient channel assignment problem for a self-organizing D2D network and proposed a distributed game-theory-based solution. In [12], a game-theory-based learning approach was developed to solve the joint power control and subchannel allocation problem for D2D uplink communications. In [13], the authors studied the behavior of two devices attempting to communicate with a base station from the perspective of non-cooperative game theory, specifying both pure and mixed Nash equilibria. In [14], to address a resource allocation problem in which C-V2X links use resources common to multiple cells, a new game-theory-based mechanism was proposed, which indicated that each player had an incentive to conceal its information to improve its profit.
Although the papers mentioned above used game-theory-based mechanisms, they did not address the energy efficiency issue in C-V2X networks with reinforcement learning mechanisms. We exploit a non-cooperative game to model the power allocation sub-problem in an energy-efficient PD-NOMA system and use the no-regret learning method to solve it. C-V2X players learn their resources independently in a self-organizing manner, which leads to faster convergence to a Nash equilibrium than other methods. Moreover, in contrast to the approach developed in [15], we use a Gibbs sampling scheme, which is a probabilistic method, to solve the proposed game.
In [16], the authors developed a carrier sensing multiple access (CSMA) based algorithm to find the optimal distributed channel allocation of D2D networks. In [17], a multi-agent reinforcement learning-based autonomous mechanism was proposed to achieve optimal channel allocation and effective co-channel interference management for D2D pairs. In [18], to improve the spectral efficiency of a C-V2X network, a spectrum sharing scheme was proposed to provide ad-hoc multi-hop access to a network. In contrast, we propose a distributed Q-learning method for allocating subcarriers, which finds the optimal resources for the users in terms of maximizing the energy efficiency of the C-V2X network.
In [19], an efficient power control algorithm was proposed to maximize the sum rates. In [20], the authors discussed recent advances in the C-V2X communication system design paradigm from the perspective of a socially aware resource allocation scheme. In [21], the authors first analyzed the main streams of the cellular vehicle-to-everything (C-V2X) technology evolution within the Third Generation Partnership Project (3GPP), with a focus on the sidelink air interface. They then provided a comprehensive survey of the related literature, classified and dissected, considering both the Evolution-based solutions and the latest 5G New Radio-based advancements that promise substantial improvements in terms of latency and reliability. In [22], the authors addressed the problem of optimizing the energy efficiency of the system by allocating the power and subcarriers in SC-FDMA wireless networks. The subcarriers are allocated to the users by adopting a multilateral bargaining model. Then, an optimization problem with respect to the users' uplink transmission power is formulated and solved. In contrast, we investigate the energy efficiency of a C-V2X communication network based on a PD-NOMA system, using the SIC technique to manage the interference among the users.
Reference [23] presents Open C-V2X, the first publicly available, open-source simulation model of the third generation partnership project (3GPP) release 14 Cellular Vehicle to everything (C-V2X) sidelink, which forms the basis for 5G NR mode 2 under later releases. In [15], the authors proposed an energy-efficient self-organized cross-layer optimization scheme in an OFDMA-based cellular network to maximize the energy-efficiency of a D2D communication system, without jeopardizing the quality-of-service (QoS) requirements of other tiers. In [24], the authors studied interference management in hybrid networks consisting of D2D pairs and cellular links, and they proposed a distributed approach that required minimal coordination yet achieved a significant gain in throughput. In [25], a two-phase resource sharing algorithm was proposed for a D2D communication system whose computational complexity could be adapted according to the network condition. In [26], the authors used the concept of convolution to derive a two-parameter distribution that represented the sum of two independent exponential distributions to enhance the performance of the system. In [27], the authors investigated a power-efficient mode selection and power allocation scheme based on an exhaustive search of all possible mode combinations of devices in a D2D communication system. Note that we utilize an exhaustive search method for joint power and subcarrier allocation problem to compare the results of the proposed methods with the optimal results for resource allocation problem. In [28], the use of self-organized D2D clustering was advocated over the physical random access channel (PRACH), and two D2D clustering schemes were proposed to solve the problem. In [8], the authors employed a Q-learning method to jointly address the channel assignment and power allocation problem to improve the system capacity. In [29], the authors have pointed out D2D based vehicular communication in the V2X environment. 
In that work, device discovery was established using two techniques: direct discovery and direct communication.
Most of these technologies are listed in Table 1. However, the aforementioned works did not address the energy efficiency issue in C-V2X networks through optimizing power and subcarrier allocations in a distributed manner. In addition, they did not consider a PD-NOMA system with SIC techniques for interference management under QoS constraints. Moreover, using fifth-generation (5G) technology increases the accuracy and speed of reaching the optimal results compared with previous works. Compared to other Q-learning based approaches, our proposed model uses a novel reward function to maximize the overall sum rate of cells and guarantee minimum interference among users. Finally, simulation results show better performance than the Q-learning method adopted from [8], the GABS-Dinkelbach algorithm adopted from [30], and the VD-RL algorithm and the meta-training mechanism with the VD-RL algorithm in [31]; these comparisons are shown in Fig. 8.

B. CONTRIBUTION
The main contribution of this paper is a framework for an energy efficiency optimization problem in a C-V2X network that allocates subcarriers and power among users [32]. Furthermore, the SIC technique is applied in the PD-NOMA system to reduce interference among users [33], [34]. To develop this framework, we present two approaches [35].
• In the first approach, a distributed joint Q-learning mechanism for power and subcarrier allocation is proposed. Vehicles and D2D pairs select their transmit power levels based on a Gibbs probability distribution. Optimal actions are determined according to the optimal current policy of the proposed multi-agent Q-learning method.
• In the second approach, the optimization problem is divided into two sub-problems, a subcarrier allocation sub-problem and a power allocation sub-problem, because it involves both binary and continuous optimization variables.
- In the subcarrier allocation sub-problem, a distributed Q-learning algorithm is proposed to allocate the subcarriers. The value of this method lies in the design of the reward function, which accounts for the SIC technique, the probability of each subcarrier, and the energy efficiency of the system. All users in the coverage area of the BS choose subcarriers as their actions. In each iteration, the action with the maximum reward is selected for each user, and whenever an agent selects a new subcarrier as an action, the current state changes. Accordingly, the optimal subcarriers are determined according to the optimal current policy of the Q-learning method.
- In the power allocation sub-problem, we use a distributed no-regret learning algorithm. In each iteration, each user selects its strategy independently. This distributed approach does not require a control channel for information sharing, which decreases the signaling overhead. It is suitable when the number of users varies over time and there is no centralized controller. Furthermore, centralized approaches rely on a single controller; if the controller is compromised, failures can propagate throughout the network.
The advantage of the first, multi-agent joint algorithm is its simplicity and convergence rate relative to the second, disjoint Q-learning approach, which requires feedback from UEs. In particular, the proposed multi-agent joint method is about 17% less complex than the second, disjoint Q-learning method. However, when the number of subcarriers increases beyond 15, the complexity of the first, multi-agent joint algorithm becomes about 26% higher than that of the second, disjoint Q-learning method. Moreover, we can show intuitively that the second approach manages the power among UEs more effectively, because it receives more information from the users during the game. Thus, we can choose the solution that best fits the priorities of the system.

C. ORGANIZATION
The rest of the paper is organized as follows. In Section II, we present the system model and formulate the resource allocation problem. In Section III, we propose a multi-agent joint distributed Q-learning algorithm and a distinct algorithm for allocating the power and subcarrier to each user. We analyze the convergence and complexity of the proposed algorithms in Sections IV and V, respectively. In Section VI, we present simulation results. Finally, conclusions are given in Section VII.

A. SYSTEM MODEL
We consider a single-cell PD-NOMA system consisting of vehicles and D2D pairs, as shown in Fig. 1, and model the interference among users in this system. Considering multiple base stations would only increase the interference produced in the system, and the results would be predictable. Thus, to avoid complicating the interference computation, we investigate the energy efficiency problem with one base station (BS) located in the center of the area and equipped with omni-directional antennas for cellular communications. We assume there are $K$ vehicles, denoted by the set $C = \{c_1, c_2, \ldots, c_K\}$, which share their uplink resources with D2D pairs. We denote the set of devices by $D = \{d_1, d_2, \ldots, d_M\}$. We define a binary variable $x_{d_i,n}$ for C-V2X frequency assignment: if $x_{d_i,n} = 1$, subcarrier $n$ is assigned to device $d_i$; otherwise, $x_{d_i,n} = 0$. Similarly, for vehicles, the binary variable $\eta_{c_i,n}$ determines the subcarrier assignment [36].
The set of all subcarriers is denoted by $N$, and the total available system bandwidth $B$ is divided into $|N|$ subcarriers, each with bandwidth $w = B/|N|$. In a PD-NOMA system, each subcarrier can be assigned to more than one user, and the corresponding signals are detected by the SIC technique [5]. In this technique, the signal with the highest strength is decoded first and subtracted from the combined signal, after which the signal with weaker strength is decoded from the remainder. Furthermore, we assume that the SIC technique is performed successfully for user $i$ if condition (1) holds. Since each D2D pair shares the same spectrum with the vehicles or with other D2D pairs, system performance is reduced; therefore, we focus on the intra-cell interference generated by the users sharing the same frequency band. Three kinds of system interference are described here:
• The vehicle and its corresponding D2D pairs interfere with each other because they share the same uplink spectrum resources.
• The received signals at the BS from the vehicle $c_i$ interfere with the transmitters of the D2D communication system sharing the same spectrum resources in the C-V2X environment.
• The signal at the D2D receiver $d_i$ interferes with the vehicle $c_j$ and the other C-V2X links sharing the same spectrum resources.
The interference power received at vehicle $c_i$ on subcarrier $n$ is defined in (2). Parameter $h_{d_i,b,n}$ is a complex Gaussian random variable with zero mean and unit variance that represents the channel coefficient between D2D pair $d_i$ and the BS on subcarrier $n$. Let $G_{c_j}$ denote the transmit antenna gain of vehicle $c_j$ and $G_b$ the receive antenna gain of the BS. The signal-to-interference-plus-noise ratio (SINR) of vehicle $c_i$ over subcarrier $n$ is given by (3).
The C-V2X receiver $d_i$ suffers interference from the vehicle $c_i$ and from other D2D pairs sharing the same spectrum resources. Therefore, we use the parameter $P^{int}_{d_i,n}$, defined in (4), to denote the interference power at the D2D receiver $d_i$. Here, $h_{c_i,d_i,n}$ is a complex Gaussian random variable with zero mean and unit variance that represents the channel coefficient between D2D pair $d_i$ and vehicle $c_i$, $G_{d_i}$ is the transmit antenna gain of D2D pair $d_i$, and $G_{d_j}$ is the receive antenna gain of D2D pair $d_j$. The SINR of user $d_i$ over subcarrier $n$ is given by (5). The problem of allocating resources among D2D users in the C-V2X environment to maximize the energy efficiency of the system is formulated in the following section.
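The SIC decoding order described in this subsection can be illustrated with a minimal Python sketch. It is not the interference model of (2) to (5): it assumes that the receiver knows the received power of every signal superimposed on one subcarrier and that cancellation is perfect. The function name `sic_sinrs` and the numeric values are hypothetical.

```python
def sic_sinrs(received_powers, noise_power):
    """Return the post-SIC SINR of each signal, in the input order.

    Signals are decoded from strongest to weakest. When a signal is
    decoded, only the weaker (not yet decoded) signals act as
    interference; stronger signals are assumed to be perfectly removed.
    """
    order = sorted(range(len(received_powers)),
                   key=lambda i: received_powers[i], reverse=True)
    remaining = sum(received_powers)
    sinrs = [0.0] * len(received_powers)
    for i in order:
        remaining -= received_powers[i]  # power of signals weaker than i
        sinrs[i] = received_powers[i] / (remaining + noise_power)
    return sinrs


# Two users on one subcarrier with received powers 4 and 1, noise power 1.
print(sic_sinrs([4.0, 1.0], 1.0))
```

In this toy case the stronger user sees the weaker one as interference, while the weaker user is decoded interference-free once the stronger signal has been subtracted.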

B. OPTIMIZATION FRAMEWORK
In this section, we formulate an outage-based energy efficiency optimization problem, given in (6), that allocates resources effectively to each user while guaranteeing the QoS requirements of both D2D pairs and vehicles in the C-V2X environment. The system constraints are determined accordingly.

C. SYSTEM CONSTRAINTS
Here, we describe the system constraints, including subcarrier allocation and power allocation constraints, separately.

1) SUBCARRIER ALLOCATION CONSTRAINTS
We define the subcarrier allocation constraints in (7) and (8), where (7) restricts the cellular and D2D subcarrier assignment variables to binary values, and (8) indicates that each D2D pair can be assigned to at most one subcarrier.
The SIC technique guarantees that each subcarrier can be reused by at most $L_T$ users, which is expressed as a further constraint. The system complexity increases as the value of $L_T$ increases. Parameter $L_T$ depends on the signal processing delay in the SIC technique and the receiver's design complexity.
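The three subcarrier constraints described above (binary assignment variables, at most one subcarrier per D2D pair, and at most $L_T$ users per subcarrier) can be checked with a short Python sketch. It assumes that the reuse limit counts D2D pairs and vehicles together on each subcarrier; the function name and the matrix layout are hypothetical and chosen only for illustration.

```python
def satisfies_subcarrier_constraints(x, eta, l_t):
    """Check a candidate subcarrier assignment.

    x[i][n]   is 1 if subcarrier n is assigned to D2D pair i, else 0.
    eta[k][n] is 1 if subcarrier n is assigned to vehicle k, else 0.
    l_t       is the maximum number of users that may share a subcarrier.
    """
    rows = list(x) + list(eta)
    # Assignment variables must be binary.
    if any(v not in (0, 1) for row in rows for v in row):
        return False
    # Each D2D pair can be assigned to at most one subcarrier.
    if any(sum(row) > 1 for row in x):
        return False
    # Each subcarrier is reused by at most l_t users (assumed to count
    # both D2D pairs and vehicles).
    num_subcarriers = len(rows[0]) if rows else 0
    return all(sum(row[n] for row in rows) <= l_t
               for n in range(num_subcarriers))
```

For example, two D2D pairs and one vehicle that all pick the first of two subcarriers violate the reuse limit for `l_t = 2` but satisfy it for `l_t = 3`.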

2) POWER ALLOCATION CONSTRAINTS
Parameters $p_{c_i,n}$ and $p_{d_i,n}$ must satisfy constraints (12) and (13), which limit the transmit power of each D2D pair and cellular user to the maximum thresholds $P^{max}_{d_i,n}$ and $P^{max}_{c_i,n}$, respectively.

3) QUALITY OF SERVICE CONSTRAINTS
The QoS constraints of all users are expressed in (14) and (15) on the basis of the minimum SINR demands of the D2D pairs and cellular users, according to (3) and (5).
In the resulting problem (16), EE denotes the energy efficiency of the system. The optimization problem (16) has a non-convex objective function and both integer and continuous variables. Therefore, it is an NP-hard problem, and the available methods for solving convex optimization problems cannot be applied directly. Furthermore, the formulated problem in its original form is not easy to address in a distributed manner [37]. For simplicity, we break problem (16) down into two sub-problems: a subcarrier allocation sub-problem and a power allocation sub-problem.

1) SUBCARRIER ALLOCATION
The subcarrier allocation sub-problem for vehicles and D2D pairs in the C-V2X environment can be written as (17), with (14) and (15) among its constraints.
First, we investigate the joint subcarrier and power allocation problem in Section III-A. We then investigate sub-problems (17) and (18) and propose distributed learning algorithms for solving them in Sections III-B and III-C, respectively.

III. MULTI-AGENT JOINT POWER AND SUBCARRIER ALLOCATION

A. MULTI-AGENT JOINT POWER AND SUBCARRIER ALLOCATION
In this section, we apply a distributed Q-learning mechanism for joint power and subcarrier allocation based on reinforcement learning. Reinforcement learning is an area of machine learning in which agents interact with the environment to reach an optimal solution in an autonomous manner [38]. We use a multi-agent extension of the Markov decision process (MDP) to model multi-agent reinforcement learning. In an $N$-agent Markov game, every agent $i$ takes actions $a^i_t$ based on its policy $\pi^i$. We define the set of transmit power levels for vehicles and D2D pairs as $P_L = \{P_{min}, aP_{min}, a^2P_{min}, a^3P_{min}, \ldots, P_{max}\}$, where $P_{max}$ and $P_{min}$ represent the maximum and minimum transmit power for all vehicles and D2D pairs, respectively.
Parameter $a > 1$ is the ratio between consecutive power levels, which corresponds to a fixed step in the dBm domain. At first, each agent selects one power level with uniform probability $\pi_{p_l}(t)$ for the vehicles and D2D pairs. Then, in each iteration, the probability of each power level is updated. Because the proposed method draws power levels from a Boltzmann-Gibbs distribution, it estimates power levels according to a specific probability law, which noticeably changes the system behavior. The training process for the vehicles occurs at the BS, and the D2D pairs obtain the trained weights for the actions from the BS in the C-V2X environment. After taking an action, agent $i$ transits to a new state $s^i_{t+1}$ and receives a reward $r^i_t$. The accumulated reward $R^i_t$ over time $t$ is the discounted sum of the rewards, where $0 < \beta < 1$ is the discount factor. Since no user has enough information about the optimal performance of the network, the learner tries to learn the optimal strategy $\pi^*$ that maximizes the accumulated expected reward over time $t$ [8], [39], [40]. When the states are selected, the expected return can be obtained, and the policy for the state-action pair of agent $i$ can be defined accordingly. We developed non-cooperative mechanisms in a distributed manner to reduce the signaling overhead in the system. In this regard, the reward function is designed so that each agent learns independently of the other agents; it therefore captures only local observations and may yield sub-optimal solutions.
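The geometric power-level set $P_L$ and the uniform initial selection probability introduced in this subsection can be generated as in the following sketch. The function name `power_levels` and the numeric values are hypothetical, and the sketch assumes that the last level is always $P_{max}$, even when $P_{max}$ is not an exact power of $a$ times $P_{min}$.

```python
def power_levels(p_min, p_max, a):
    """Build the grid {p_min, a*p_min, a^2*p_min, ..., p_max}."""
    if p_min <= 0 or p_max < p_min or a <= 1:
        raise ValueError("need 0 < p_min <= p_max and a > 1")
    levels = []
    p = p_min
    while p < p_max:
        levels.append(p)
        p *= a  # a constant ratio is a constant step in dBm
    levels.append(p_max)  # the last level is the maximum power
    return levels


levels = power_levels(1.0, 8.0, 2.0)
# Before any learning, every level is equally likely.
initial_probabilities = [1.0 / len(levels)] * len(levels)
```

With floating-point ratios that do not divide the range exactly, the level just below `p_max` can end up very close to it; a tolerance could be added if that matters for the experiment.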
According to the optimal policy $\pi^*$, we can define the optimal Q-function. The Q-function for the expected state-action value is then updated with the learning rate $\alpha$, as given in (21).
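As an illustration of the Q-function update, the following sketch implements the standard tabular Q-learning rule with learning rate $\alpha$ and discount factor $\beta$. It assumes that (21) takes this standard form; the dictionary-based Q-table, the function names, and the toy states are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha, beta):
    """Apply one tabular Q-learning step and return the new Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + beta * best_next
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * target
    return q[(state, action)]


def greedy_action(q, state, actions):
    """Return the action with the largest Q-value in the given state."""
    return max(actions, key=lambda a: q[(state, a)])


q_table = defaultdict(float)  # unseen state-action pairs start at 0
subcarriers = [0, 1]          # toy action set
q_update(q_table, "s0", 0, 1.0, "s1", subcarriers, alpha=0.5, beta=0.9)
```

Each agent would keep its own table in the distributed setting, so no Q-values need to be exchanged between agents.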
The optimal value of the action for state $s$ is given by (22) [41], [42]. Here, we define the agents, states, actions, and reward function.
• Agents. All the vehicles and D2D pairs.
• Actions. At each step, each agent $i$ takes an action $a_t \in A$, which selects a subcarrier according to a decision policy $\pi^i$. The set of all actions is expressed as $A = \{a_1, a_2, \ldots, a_N\}$, where $a_i$ represents the subcarrier of agent $i$ at time slot $t$. A second case is also studied, in which the combination of power level and subcarrier is selected as an action. The subcarrier distribution depends statically on the BS decisions, whereas the power levels depend only on a probability model. Therefore, this action may not maximize the energy efficiency.
• States. The state of the network environment is determined mainly by the channel and the transmit power of the players. The QoS of users is restricted by the network environment. We consider a state set built from $U$, $A$, and $P_L$, where $U = \{\ldots, u_{c_K}\}$ represents the set of all users, $A = \{a_1, a_2, \ldots, a_N\}$ represents the set of actions, and $P_L = \{p_{l_1}, p_{l_2}, \ldots, p_{l_L}\}$ represents the set of power levels for the vehicle and D2D users.
Here, $s_t$ is the system state at time $t$, defined as $s_t = (u_i, a_j, p_{l_q})$, which indicates that the $j$th subcarrier and the $q$th transmit power level are assigned to the $i$th player at time $t$. In other words, the allocation of power and subcarrier to user $u_i$ defines the current state. Hence, the state space $S$ contains $NL(M+K)$ states. To maximize the energy efficiency of the system, we define a distributed local reward function related to the energy efficiency of the system, in which $p(u_i|a_j)$ indicates the probability of the presence of $u_i$ in subcarrier $j$. To evaluate the system performance at the end of each epoch, we define $\varepsilon$ as the threshold of a new state. Whenever the network satisfies $\phi > \varepsilon$, it starts a new round of training based on the current state of the system [38].
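The size of the joint state space can be made concrete by enumerating it. The sketch below builds every (user, subcarrier, power level) triple for a toy instance; the user and subcarrier labels are hypothetical placeholders, and the numbers are chosen only to keep the example small.

```python
from itertools import product


def joint_state_space(users, subcarriers, power_levels):
    """List every (user, subcarrier, power level) state of the joint problem."""
    return list(product(users, subcarriers, power_levels))


users = ["d1", "d2", "c1"]   # M = 2 D2D pairs and K = 1 vehicle
subcarriers = ["a1", "a2"]   # N = 2
levels = [0.1, 0.2]          # L = 2
states = joint_state_space(users, subcarriers, levels)
print(len(states))           # N * L * (M + K)
```

Because the joint table has a factor of $L$ more states than the subcarrier-only table, this enumeration also shows where the extra memory of the joint method comes from.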

B. Q-LEARNING SUBCARRIER ALLOCATION
In this section, we apply distributed learning methods to solve the primary problem by simplifying it into subproblems. Some subcarrier parameters are optimized at each step, while others remain fixed. We propose an iterative Q-learning mechanism for the subcarrier allocation and we describe the action, state, and reward functions here.
• States. The states are defined over the users and the set of actions (subcarriers) [38]. Here, $s_t$ is the system state at time $t$, defined as $s_t = (u_i, a_j)$, where $1 \le i \le M + K$ and $1 \le j \le N$. It indicates that the $j$th subcarrier is assigned to the $i$th user at time $t$. Hence, the state space contains $N(M + K)$ states: $S = \{(u_1, a_1), \ldots, (u_1, a_N), \ldots, (u_{M+K}, a_N)\}$.
• Reward function. To maximize energy efficiency and guarantee the QoS of the system, we define a reward function related to the SINR constraints of all users. If the SINR constraints are satisfied, the reward function is positive; otherwise, it is negative. Accordingly, the reward function for D2D pairs in the C-V2X environment at time $t$ is defined in terms of $\lambda$, $p(u_i|a_j)$, and $\sigma(u_i|a_j)$. Here, $\lambda$ is the SINR coefficient of the reward function; it is positive if (14) and (15) are satisfied and equals $-1$ otherwise (26). The term $p(u_i|a_j)$ indicates the probability of the presence of $u_i$ in subcarrier $j$, and $\sigma(u_i|a_j)$ is a binary parameter that enforces the SIC constraint, as defined in (27).

1) 5G NR INTERFACE DECISION
Note that vehicles use the NR V2X PC5-interface for selecting the subcarriers. C-V2X employs two complementary transmission modes, and vehicles autonomously select their sub-channels in C-V2X mode 4. Therefore, C-V2X users would be allocated resources according to the environment information in Q-learning method [43]. In each iteration, the feedback report includes information of the transmission and retransmissions of the subcarriers, and cellular users report an ACK to the base station. After receiving feedback report, the BS evaluates if it has to allocate new subcarrier resources to that C-V2X user or not [7], [44], [45]. After each transmission, new resources or sub-channels must be selected and reserved. New resources must also be selected if selected resources do not fit in the resources previously reserved or do not maximize the energy efficiency of the system. As a result, all the C-V2X users are allocated subcarriers according to decisions of the BS.

C. GAME THEORY BASED FRAMEWORK FOR POWER ALLOCATION
In this section, we aim to solve (18), assuming that optimal subcarriers have been assigned to the users by the proposed Q-learning subcarrier allocation method. In the proposed approach, we model the competition among vehicles and D2D pairs as a non-cooperative game, in which the vehicles and D2D pairs are the players and their transmit power levels are selected independently. Then, we apply a no-regret learning approach to solve the sub-problem. We model sub-problem (18) as a non-cooperative game over the player set $U$, in which $S_u = \{s_{u,1}, \ldots, s_{u,|S_u|}\}$ is the strategy set of player $u$ and $s_{u,i}$ denotes the $i$th pure strategy of player $u$. The players, strategy sets, and payoff functions are defined as follows:
• Players: These include D2D pairs and vehicles.
• Strategy sets: The transmit power thresholds of the players form the strategy sets of the players.
• Payoff functions: The energy efficiency of the system (6) is defined as the payoff function.
A common method for updating the probability distribution assigned to each player $u_{d_i}$ and $u_{c_i}$ at time $t$ is the Boltzmann-Gibbs probability distribution [46], [47]. It depends on the energy of each state and on the temperature of the system. In this distribution, EE plays the role of the energy of the system in state $s_t$, and the constant $k\tau$ is the product of Boltzmann's constant $k$ and the thermodynamic temperature $\tau$. If $k\tau \to \infty$, the distribution over the strategy set of a player becomes uniform; if $k\tau \to 0$, the player selects the strategy that is most often reported by the users [48].
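The effect of the temperature $k\tau$ on a Boltzmann-Gibbs distribution can be seen in a small numerical sketch. It assumes the usual exponential form, in which the probability of a strategy is proportional to the exponential of its payoff divided by $k\tau$; the exact expression used in the paper may differ, and the function name and payoff values are hypothetical.

```python
import math


def boltzmann_gibbs(payoffs, temperature):
    """Return selection probabilities proportional to exp(payoff / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtracting the maximum keeps the exponentials numerically stable
    # without changing the resulting probabilities.
    top = max(payoffs)
    weights = [math.exp((u - top) / temperature) for u in payoffs]
    total = sum(weights)
    return [w / total for w in weights]


payoffs = [1.0, 2.0, 3.0]                       # toy energy efficiency values
hot = boltzmann_gibbs(payoffs, temperature=1e9)   # nearly uniform
cold = boltzmann_gibbs(payoffs, temperature=1e-3) # nearly deterministic
```

Under this assumed form, a very large temperature spreads the probability almost evenly over the strategies, while a very small temperature concentrates it on the strategy with the largest payoff.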

1) NO-REGRET BASED LEARNING ALGORITHM
In a no-regret learning algorithm, players learn their environment in order to choose transmission power levels that maximize the energy efficiency of the system. The regret is defined as the difference between the average payoff achieved by the strategies of the given algorithm until time $t$ and the payoff that would be obtained by another fixed sequence of decisions after a change in strategy [49], where $s_{-b}$ denotes the strategy of the other players. Given a non-cooperative game $G = (B, S_{b,i}, u_b\ \forall b \in B)$, we can define the correlated strategy $p(s)$ as a probability distribution over the strategy profile $s_i \in S_b$. Given these basic notions, the concept of an $\epsilon$-coarse correlated equilibrium is stated in the next theorem.
Theorem 1: Given a game $G$, a distribution $p(s) = p(s_{b,i}, s_{b,-i})$ is an $\epsilon$-coarse correlated equilibrium if no player can ever expect to unilaterally gain by deviating from their recommendation, assuming the other players follow their recommendations [50], [51]. This condition is formalized in (30), with an analogous condition for D2D pairs.
Players estimate their payoff functions by balancing the minimization of their regret against the average payoff over all their strategies. Therefore, for each D2D player and each $s_{d,i} \in S_d$, the payoff estimate $\hat{u}_{d,s_{d,i}}(t+1)$ is computed from $\hat{u}_{d,s_{d,i}}(t)$ [49], [52]. Similarly, for each vehicle and each $s_{c,j} \in S_c$, the payoff estimate $\hat{u}_{c,s_{c,j}}(t+1)$ is computed from $\hat{u}_{c,s_{c,j}}(t)$ [49], [52]. Here, $\hat{u}_{d,s_{d,i}}(t+1)$ and $\hat{u}_{c,s_{c,j}}(t+1)$ denote the estimated D2D and cellular payoff functions at time $t+1$. Only the strategy played in the last iteration has its estimated payoff updated, and each update is carried out independently. To calculate the regret, each player needs a learning tool to update the estimated regret [53]. Each D2D player estimates its regret for each $s_{d,i} \in S_d$, and each vehicle estimates its regret for each $s_{c,i} \in S_c$. Finally, the probability $\pi_{d,s_{d,i}}(t+1)$ assigned to each strategy $s_{d,i} \in S_d$ of the D2D players is updated from $\pi_{d,s_{d,i}}(t)$ [49], and the probability assigned to each strategy $s_{c,i} \in S_c$ of the cellular users is updated in the same way.
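The three coupled updates described above (payoff estimate, regret estimate, and strategy probability) can be sketched for a single player as follows. The update equations of the paper are not reproduced in this text, so the sketch assumes a common stochastic-approximation form with step sizes $1/t^{\gamma}$, $1/t^{\zeta}$, and $1/t^{\nu}$ and a Boltzmann-Gibbs map from positive regrets to probabilities. The class name, the default exponents, and the temperature are hypothetical.

```python
import math
import random


def gibbs(values, temperature):
    """Boltzmann-Gibbs probabilities for a list of values."""
    top = max(values)
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


class NoRegretPlayer:
    """One player (D2D pair or vehicle) choosing among power levels."""

    def __init__(self, num_strategies, temperature=0.1,
                 gamma=0.6, zeta=0.7, nu=0.8):
        self.n = num_strategies
        self.temperature = temperature
        # Assumed exponents in (0.5, 1) with gamma < zeta < nu.
        self.gamma, self.zeta, self.nu = gamma, zeta, nu
        self.payoff_est = [0.0] * self.n
        self.regret_est = [0.0] * self.n
        self.probs = [1.0 / self.n] * self.n
        self.t = 1

    def choose(self, rng=random):
        """Draw a strategy index from the current probabilities."""
        return rng.choices(range(self.n), weights=self.probs)[0]

    def update(self, played, payoff):
        """Update estimates after observing the payoff of the played strategy."""
        g, z, v = (1.0 / self.t ** e for e in (self.gamma, self.zeta, self.nu))
        # Only the strategy just played has its payoff estimate updated.
        self.payoff_est[played] += g * (payoff - self.payoff_est[played])
        # Regret: estimated gain of each strategy over the observed payoff.
        for i in range(self.n):
            gap = self.payoff_est[i] - payoff
            self.regret_est[i] += z * (gap - self.regret_est[i])
        # Move the probabilities toward a Gibbs map of the positive regrets.
        target = gibbs([max(r, 0.0) for r in self.regret_est], self.temperature)
        self.probs = [p + v * (q - p) for p, q in zip(self.probs, target)]
        self.t += 1
```

A player would call `choose` to pick a power level, observe its own energy efficiency, and pass that value to `update`; no information about other players is required, which matches the distributed setting described in the text.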

Algorithm 1
Input: N, u(t), p(u_i|a_j), Q^i_t(s, a), r^i_t, ∀u_i ∈ U, π_{p_l}(t), p_l ∈ P_L
Output: u(t), P_c, P_d, X_d, η_c
Initialization: t = 1, T, D = {1, ..., |D|}, C = {1, ..., |C|}
All agents receive initial observation states S_0 = {s^1_0, ..., s^N_0}
while t ≤ T_max do
    for all d_i ∈ D and c_j ∈ C do
        Select p_{d_i}(t), p_{c_j}(t) using π_{p_l}(t)
    end for
    All agents select actions a^i_t according to the current policy
    for all d_i ∈ D and c_j ∈ C do
        Calculate υ_{c_j,n}(t), υ_{d_i,n}(t) according to (3), (5)
    end for
    Calculate u(t) according to (6)
    if λ > 0 then
        All agents observe immediate reward r^i_t and next state s_{t+1}
        Update the Q table according to (21)
    end if
    All agents choose actions with maximum Q-value (22)
    if the SIC constraint is satisfied according to (27) then
        Adjust X_d, η_c according to the optimal action: x^n_{d_i} = 1, η^n_{c_j} = 1
        Save (s^i_t, a^i_t, r^i_t, s^i_{t+1})
    end if
end while

IV. CONVERGENCE ANALYSIS
In this section, we investigate the convergence of learning algorithms.

A. Q-LEARNING ALGORITHM
For the Q-learning algorithms, $Q_t(s, a)$ converges to an optimal value if the following two conditions are satisfied: (1) the learning rate is suitably reduced to 0; (2) each state-action pair is visited infinitely often [8], [54], [55].
Theorem 2: Given a finite MDP model, the Q-learning algorithm given by the update rule (21) converges to the optimal Q-function if the two conditions above, on the learning rate and on the visits to each state-action pair, hold.
Theorem 3: In the proposed Q-learning methods, each agent $i$ takes an action $a_i \in A$ with a decision policy $\pi_i$. Since the learning rate satisfies $0 < \alpha_t(s_t, a_t) < 1$ and all state-action pairs of the users can be visited infinitely often in (21), Algorithms 1 and 2 converge to a fixed point.
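For reference, the requirement that the learning rate be "suitably reduced to 0" is usually formalized through the standard stochastic-approximation conditions below. This is the textbook form and is stated here as an assumption about what the verbal condition amounts to.

```latex
\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty,
\qquad
\sum_{t=0}^{\infty} \alpha_t^{2}(s, a) < \infty,
\qquad \forall (s, a).
```

For example, the schedule $\alpha_t(s,a) = 1 / (1 + n_t(s,a))$, where $n_t(s,a)$ counts the visits to the pair $(s,a)$ up to time $t$, satisfies both conditions provided every pair is visited infinitely often.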

B. NO-REGRET LEARNING ALGORITHM
The no-regret learning algorithm is based on stochastic approximation theory and uses a Boltzmann-Gibbs distribution to allocate the initial transmit power. For the mechanism to converge, the set of learning-rate exponents $\iota = \{\gamma, \zeta, \nu\}$ must satisfy the conditions given in [56], [57]. Accordingly, the learning rates should be large enough to overcome any undesirable conditions and small enough to guarantee the convergence of the no-regret algorithm. We choose all $\iota = \{\gamma, \zeta, \nu\} \in (0.5, 1)$ such that $\zeta > \gamma$ and $\nu > \zeta$; the strategies converge when the learning-rate exponents satisfy these criteria. To obtain an optimal result, the convergence of the utility function and the stopping criteria should also be verified.
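One iteration of a no-regret learner with three coupled learning rates can be sketched as follows. The update forms used here (a payoff estimate refreshed only for the played strategy, a regret estimate, and a probability vector pulled toward a Boltzmann-Gibbs distribution over positive regrets) follow a common formulation in the literature and are an assumption; they are not a reproduction of (32)-(37). The function names, the temperature `kappa`, and the rates $t^{-\gamma}$, $t^{-\zeta}$, $t^{-\nu}$ are likewise hypothetical choices.

```python
import math
import random


def boltzmann_gibbs(regrets, kappa=10.0):
    """Map the positive parts of the regrets to a probability vector."""
    weights = [math.exp(kappa * max(r, 0.0)) for r in regrets]
    total = sum(weights)
    return [w / total for w in weights]


def valid_exponents(gamma, zeta, nu):
    """Check the choice stated in the text: all in (0.5, 1) and gamma < zeta < nu."""
    return all(0.5 < x < 1.0 for x in (gamma, zeta, nu)) and gamma < zeta < nu


def no_regret_step(t, played, utility, u_hat, regret, prob,
                   gamma=0.6, zeta=0.7, nu=0.8, kappa=10.0):
    """One iteration (t >= 1) for a single player; returns (u_hat, regret, prob)."""
    lr_u, lr_r, lr_p = t ** -gamma, t ** -zeta, t ** -nu
    # Only the strategy that was actually played has its payoff estimate refreshed.
    u_hat = [u + lr_u * (utility - u) if k == played else u
             for k, u in enumerate(u_hat)]
    # Regret: estimated payoff of each strategy minus the payoff just obtained.
    regret = [r + lr_r * (u_hat[k] - utility - r) for k, r in enumerate(regret)]
    # Move the mixed strategy toward the Boltzmann-Gibbs target.
    target = boltzmann_gibbs(regret, kappa)
    prob = [p + lr_p * (g - p) for p, g in zip(prob, target)]
    return u_hat, regret, prob


def roulette_wheel(prob, rng):
    """Sample a strategy (power level) index with the given probabilities."""
    return rng.choices(range(len(prob)), weights=prob, k=1)[0]
```

Because each probability update is a convex combination of two probability vectors, the mixed strategy remains a valid distribution at every iteration.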

V. COMPUTATIONAL COMPLEXITY ANALYSIS
In each iteration, the computational complexity depends on the number of subcarriers ($N$) and the number of vehicles and D2D pairs ($M+K$). Furthermore, the overall complexity depends on the number of iterations ($T$) needed for convergence. Here, we calculate the complexity of each proposed algorithm.

A. SUBCARRIER ALLOCATION
The complexity of the exhaustive search algorithm for the subcarrier allocation sub-problem can be calculated as follows:

$$\mathcal{O}\!\left(\binom{N(M+K)}{M+K}\right), \tag{44}$$

which denotes all the possible combinations of selecting $(M + K)$ states from the $N(M + K)$ existing states. For the Q-learning algorithm, there are $N(M + K)$ states, and the complexity can be represented in the following way:

$$\mathcal{O}\big(TN(M + K)\big). \tag{45}$$

Algorithm 2 Training Subcarrier Allocation Q-Learning
    Select an initial state $s_0$ randomly
3: while $t \le T_{\max}$ do
4:   Select an action $a_t$ based on the strategy
5:   Calculate: $\upsilon_{c_j,n}(t)$, $\upsilon_{d_i,n}(t)$ according to (3), (5)
6:   Observe $\lambda$
7:   if $\lambda\,\sigma(u_i \mid a_t) > 0$ then
8:     Obtain immediate reward $r_t^i$ and next state $s_{t+1}$
9:     Update the Q table according to (21)
10:  end if
11:  Choose the action for the user $u_i$ with maximum Q-value (22)
12:  Adjust $X_d$, $\eta_c$ according to the optimal action $x_{d_i}^n = 1$, $\eta_{c_j}^n = 1$
13:
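To give a sense of the gap between the two complexity expressions above, the following sketch evaluates both counts for a given problem size. The function names are hypothetical, and the counts are orders of growth rather than exact operation counts.

```python
from math import comb


def exhaustive_search_count(n_subcarriers, n_users):
    """Ways of choosing n_users states out of n_subcarriers * n_users states."""
    return comb(n_subcarriers * n_users, n_users)


def q_learning_count(n_iterations, n_subcarriers, n_users):
    """Order of T * N * (M + K) Q-table operations."""
    return n_iterations * n_subcarriers * n_users
```

For instance, `exhaustive_search_count(10, 15)` can be compared against `q_learning_count(1000, 10, 15)`, where the iteration count of 1000 is an arbitrary illustrative value; the combinatorial term grows far faster than the linear one as users are added.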

C. MULTI-AGENT JOINT POWER AND SUBCARRIER ALLOCATION
In this mechanism, all the agents take the actions with a maximum Q-value according to the optimal policy. Hence, the corresponding space complexity is reduced. The above analysis provides the computational complexity for the proposed algorithms [58], [59]. We can observe a trade-off between the performance and convergence speed of the proposed algorithms. The results are shown in Table 3 and Table 4.

Algorithm 3 No-Regret Power Allocation Algorithm
    Update: $X_d$, $\eta_c$
3:   Select: $p_{d_i,n}(t)$ using $\pi_{s_{d_i},n}(t)$
5:   Select: $p_{c_j,n}(t)$ using $\pi_{s_{c_j},n}(t)$
6: end for
7: for $\forall d_i \in D \lor \forall c_j \in C$ do
8:   Calculate: $\upsilon_{c_j,n}(t)$, $\upsilon_{d_i,n}(t)$ according to (3), (5)
9: end for
10: if $\upsilon_{c,n}(t) > \gamma_c \land \upsilon_{d,n} > \gamma_d$ then
11:   Calculate: $u(t)$ according to (6)
12: end if
13: for $\forall c_j \in C$ do
14:   Update: $u_{s_{c_j},n}(t+1)$, $R_{s_{c_j},n}(t+1)$, $\pi_{s_{c_j},n}(t+1)$ according to (33), (35), (37)
15: end for
16: for $\forall d_i \in D$ do
17:   Update: $u_{s_{d_i},n}(t+1)$, $R_{s_{d_i},n}(t+1)$, $\pi_{s_{d_i},n}(t+1)$ according to (32), (34), (36)
18: end for
19: $t = t + 1$
20: end while

VI. SIMULATION RESULTS
We consider a single-cell scenario, where D2D pairs and vehicles are uniformly distributed over an area of 500 × 500 m² with the BS located at the center of the C-V2X environment. We consider a fixed number of vehicles and D2D pairs determined according to the closest distance. When two D2D users are physically close, a Rayleigh C-V2X communication channel is established. For a fixed number of vehicles and D2D pairs, we ran 500 independent simulations, and we present the average of these results. The pathloss model and shadow fading were considered for C-V2X links, and we set the pathloss exponent in the free-space propagation model to 2. Furthermore, we vary the number of vehicles and D2D pairs and observe the performance of the system. The simulation parameters are summarized in Table 5.
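The simulation layout described above can be sketched as follows: users are dropped uniformly in the square area, the BS sits at the center, and each link gain combines distance-based pathloss with a Rayleigh-fading power gain. This is a minimal illustration under stated assumptions: shadow fading is omitted, the minimum distance of 1 m is an added safeguard, and all function names are hypothetical.

```python
import math
import random

AREA_SIDE_M = 500.0        # side of the square area, from the text
PATHLOSS_EXPONENT = 2.0    # free-space pathloss exponent, from the text
BS_POSITION = (AREA_SIDE_M / 2, AREA_SIDE_M / 2)  # BS at the center


def drop_users(n_users, rng):
    """Place users uniformly at random inside the square area."""
    return [(rng.uniform(0, AREA_SIDE_M), rng.uniform(0, AREA_SIDE_M))
            for _ in range(n_users)]


def channel_gain(tx, rx, rng, min_distance_m=1.0):
    """Distance-based pathloss multiplied by a Rayleigh-fading power gain."""
    distance = max(math.dist(tx, rx), min_distance_m)  # avoid division by zero
    fading_power = rng.expovariate(1.0)  # |h|^2 is Exp(1) under Rayleigh fading
    return fading_power * distance ** (-PATHLOSS_EXPONENT)


def average_gain_to_bs(n_users, n_runs, seed=0):
    """Average user-to-BS gain over independent drops (the text uses 500 runs)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        users = drop_users(n_users, rng)
        total += sum(channel_gain(u, BS_POSITION, rng) for u in users) / n_users
    return total / n_runs
```

Averaging over independent drops, as in `average_gain_to_bs(5, 500)`, mirrors the way the reported results are averaged over 500 simulations.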
In Figs. 2-4, we investigate our proposed disjoint approach for allocating the subcarriers to each user by varying the number of subcarriers. To evaluate the results of the Q-learning method for allocating the subcarriers, we use the exhaustive search method to find the optimal subcarriers and compare the two sets of results.
In Fig. 2, the proposed Q-learning algorithm for subcarrier allocation converges approximately as fast as the exhaustive search method. We noted only a 14.5% difference between the two algorithms in terms of the energy efficiency at the convergence point, while the Q-learning algorithm has a much lower complexity than the exhaustive search method.
In Figs. 3 and 4, we varied the number of subcarriers to demonstrate its impact on the performance of our proposed Q-learning algorithm. We set the number of D2D pairs and cellular users to 10 and 5, respectively. As we can see in Fig. 3, increasing the number of subcarriers from 5 to 12 brings about a significant performance gain, because more subcarriers are allocated to the users. Adopting the proposed Q-learning algorithm for allocating the subcarriers results in much better performance for the cellular and D2D links. The proposed Q-learning approach comes within 9% of the exhaustive search method in average energy efficiency.
In Fig. 4, we can see that an increase in the number of subcarriers increases the spectrum available to users and decreases the interference among users in the system, which in turn increases the data rate of the system. The difference from the exhaustive search results is only about 13%.
In Figs. 5-7, we show that the proposed no-regret learning algorithm for power allocation in the non-cooperative game achieves better performance. For the sake of simplicity, we set the number of subcarriers and vehicles to 10 and 5, respectively, and we vary the number of D2D pairs from 5 to 19. However, the numbers of users and subcarriers are variable parameters in the proposed algorithms and can be set to larger values. Furthermore, we compare our proposed self-organizing mechanism with three benchmark references.
Fig. 5 shows the average utilities achieved by the different methods, which increase with the number of D2D pairs. The proposed method, which uses the Boltzmann-Gibbs distribution to assign the probability to each subcarrier, achieves a higher value than the algorithms using the two other methods for selecting the power level. This is because the proposed method is based on a probability law: it estimates the power level with a specific probability distribution, which causes a noticeable change in the system. Moreover, when the power threshold level (Pdmax) is increased from 18 dBm to 22 dBm, the average energy efficiency of the system decreases. This is because increasing the power level may increase the energy consumption and thereby decrease the energy efficiency.
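The effect of the power threshold on energy efficiency can be illustrated with a single-link calculation: the rate grows only logarithmically with transmit power, while the consumed power grows linearly. The sketch below is a simplified, interference-free illustration; the gain-to-noise ratio, the circuit power, and the function names are assumed values chosen for the example and are not taken from Table 5.

```python
import math


def dbm_to_watt(p_dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((p_dbm - 30) / 10)


def spectral_efficiency(p_watt, gain_to_noise):
    """Shannon spectral efficiency (bit/s/Hz) of one interference-free link."""
    return math.log2(1 + p_watt * gain_to_noise)


def energy_efficiency(p_dbm, gain_to_noise=1000.0, circuit_power_watt=0.05):
    """Spectral efficiency divided by total power (transmit plus circuit)."""
    p = dbm_to_watt(p_dbm)
    return spectral_efficiency(p, gain_to_noise) / (p + circuit_power_watt)
```

Under these assumed values, transmitting at 22 dBm gives a higher rate than at 18 dBm but a lower energy efficiency, which is the qualitative trend described above.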
As we can see in Fig. 6, the average data rate of the system achieved with these methods increases with the number of D2D pairs. Furthermore, with the roulette wheel method, which is based on the probability distribution law, the Nash equilibrium is reached faster than with the other methods in the simulation. Simulation results show that the first algorithm, using the roulette wheel method, attains data rates 3% and 5% higher than the maximum and random power level methods, respectively. It can also be observed that increasing the power threshold level (Pdmax) increases the average system sum rate. This is due to strong management of interference among the users. Since the proposed mechanism performs well at a power threshold of 22 dBm, it yields an average result about 32% higher than at a power threshold of 18 dBm.
Fig. 7 shows the power consumption of the system achieved by these three methods. Using the roulette wheel method for selecting the transmit power level results in faster convergence and consumes less energy than the other two methods, which use the maximum and random power levels. In addition, increasing the power threshold level (Pdmax) from 18 dBm to 22 dBm increases the average energy consumption of the system. This is because the number of strategies in the game increases, which may lead to greater competition among users to achieve an optimal power level, thereby using more energy.
In Figs. 8-10, we compare the performance of our two proposed algorithms, the multi-agent joint Q-learning and the disjoint Q-learning algorithm, with each other. To evaluate the performance of our proposed joint and disjoint algorithms, we use the Q-learning method adopted from [8], the GABS-Dinkelbach algorithm adopted from [30], and the VD-RL algorithm and the meta-training mechanism with the VD-RL algorithm from [31].
Moreover, to evaluate the optimality of the proposed methods, the results are compared with the exhaustive search method for allocating the optimal subcarriers and powers to the users. Results show that increasing the power threshold level from 10 dBm to 40 dBm brings about a significant performance gain; however, increasing the power threshold beyond 40 dBm achieves only marginal benefits in the above algorithms. We compare our proposed algorithms with the following benchmark references:
• Multi-agent joint Q-learning. This algorithm is executed to allocate the power and subcarriers jointly.
• No-regret disjoint algorithm. This algorithm is proposed for power allocation. If the Q-learning algorithm is implemented for allocating the subcarriers, it is labeled as (Disjoint no-regret power, Q-learning). If the exhaustive search method is implemented for subcarrier allocation, it is labeled as (Disjoint no-regret power, Exhaustive).
• Q-learning disjoint algorithm. This algorithm is developed in [8] and used for power allocation. If the Q-learning algorithm is implemented for allocating the subcarriers, it is labeled as (Disjoint Q-learning power, Q-learning). If the exhaustive search method is implemented for subcarrier allocation, it is labeled as (Disjoint Q-learning power, Exhaustive).
• Disjoint GABS-Dinkelbach algorithm. This algorithm is developed in [30] and used for power allocation. The exhaustive search method is implemented for subcarrier allocation, and the combination is labeled as (Disjoint GABS-Dinkelbach power, Exhaustive).
• Disjoint VD-RL algorithm. This algorithm is used in [31] for power allocation. The exhaustive search method is implemented for subcarrier allocation, and the combination is labeled as (Disjoint VD-RL power, Exhaustive).
• Disjoint meta learning VD-RL power. This algorithm is developed in [31] and used for power allocation. The exhaustive search method is implemented for subcarrier allocation, and the combination is labeled as (Disjoint meta learning VD-RL power, Exhaustive).
• Exhaustive search algorithm. This algorithm is executed to allocate the power and subcarriers jointly.
We evaluate the performance of our proposed algorithms in terms of different power levels. Fig. 8 shows the average energy efficiency of the system. The energy efficiency reaches its maximum for power thresholds in the range of 18-20 dBm; increasing the power threshold beyond 20 dBm widens the choice of transmit power strategies and leads to higher energy consumption, so the energy efficiency drops slowly. The proposed multi-agent joint Q-learning algorithm converges to an optimal point faster than the disjoint algorithms. This is due to the simultaneous allocation of resources and low complexity.
Accordingly, the second proposed disjoint algorithm, which uses the no-regret algorithm for power allocation, has a faster convergence rate than the disjoint Q-learning method and the GABS-Dinkelbach algorithm, which are taken from other papers. It has a greater influence on the energy efficiency of the system because the no-regret algorithm uses the regret function and a probability-based strategy selection, which increases the convergence rate. The multi-agent joint Q-learning algorithm yields a higher average energy efficiency, by up to 11%, 14% and 18%, than the proposed disjoint mechanism with no-regret learning, the GABS-Dinkelbach algorithm and other Q-learning methods for power allocation, respectively.
The results also show that the meta-training mechanism with the VD-RL algorithm proposed in [31] can find an optimal solution in an unseen environment with a faster convergence speed than the VD-RL algorithm alone. However, there is a difference of about 14% between the proposed joint Q-learning algorithm and the meta-learning VD-RL method. This is because the proposed joint Q-learning method is competitive, and users learn their strategies in a distributed manner without information about the others. Moreover, past data from the meta-training method can be recycled to adapt the policy to a new task in the proposed joint Q-learning method, which in turn leads to more efficient results than the meta-learning method. Therefore, the proposed Q-learning method compares favorably with the state of the art in meta-RL.
Furthermore, in order to evaluate the optimality of the proposed methods, we use the exhaustive search method to find the optimal convergence point. There is only an 8% difference between the proposed joint method and the exhaustive search method in terms of the energy efficiency at the convergence point, while the Q-learning algorithm has a much lower complexity than the exhaustive search method.
Fig. 9 shows the average throughput as the power threshold level increases. Increasing the power threshold increases the average throughput. The main reason for this is that the D2D links use the same radio frequency band as the cellular links in the adjacent zones. Therefore, the throughput of a D2D link is affected by the transmission power of the cellular link and the surrounding D2D links. Thus, if the transmission power of the D2D link becomes greater than that of the cellular link, the throughput of the system increases. For instance, the joint multi-agent mechanism yields up to 18%, 26% and 35% improvement in throughput relative to the proposed disjoint mechanism with no-regret learning, the GABS-Dinkelbach algorithm and other Q-learning methods for power allocation, respectively. Furthermore, as the power threshold increases, the disjoint GABS-Dinkelbach algorithm achieves almost the same average throughput as the no-regret algorithm for power allocation.
Fig. 10 shows the average power consumption as the power threshold level increases. As Pdmax increases, energy consumption increases because the interference becomes stronger and users require more power to meet the QoS constraints. The multi-agent joint Q-learning algorithm consumes about 5.3%, 10.2% and 15.2% less energy than the proposed disjoint mechanism with no-regret learning, the GABS-Dinkelbach algorithm and other Q-learning methods for power allocation, respectively.
This is because allocating the subcarrier and power simultaneously in a distributed manner requires minimal human intervention and complexity. Moreover, for a given Pdmax, the second and third disjoint algorithms with the proposed Q-learning method for subcarrier allocation consume less energy than the other approaches, which involve exhaustive search methods for subcarrier allocation. However, they have almost the same performance.
In Fig. 11, we show the performance of our two proposed methods in terms of power consumption: first, the multi-agent joint power and subcarrier allocation algorithm and, second, the disjoint distributed learning algorithm. Varying the number of subcarriers from 5 to 20 yields significant performance gains for the joint algorithm owing to the more efficient management of interference among users with the Q-learning method. There is a gap of about 12% between the results of the joint and disjoint algorithms in terms of the energy efficiency of the system. This is because the feasible region for finding the optimal values of the variables in the joint multi-agent Q-learning method is larger than that of the disjoint learning method. Thus, it is reasonable that the joint method gives a larger value than the disjoint method. Note that the disjoint method searches for the optimal values in a smaller region (because in each sub-problem one variable is fixed and the other is optimized), so it obtains a lower EE value.
However, although the proposed multi-agent joint method has about 16.2% lower complexity than the second disjoint Q-learning method, increasing the number of subcarriers beyond 20 increases the memory usage and the complexity of the joint algorithm by about 11% over the second proposed disjoint method.
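The argument that the joint method searches a larger region than the disjoint method can be illustrated with a toy example. The sketch below compares an exhaustive joint search over (subcarrier, power) pairs with a two-stage search that first fixes the power to pick a subcarrier and then optimizes the power. The utility table and function names are hypothetical and serve only to show that the joint optimum can never be lower than the disjoint one.

```python
import itertools


def joint_search(utility, subcarriers, powers):
    """Search all (subcarrier, power) pairs together."""
    return max(utility(n, p) for n, p in itertools.product(subcarriers, powers))


def disjoint_search(utility, subcarriers, powers, reference_power):
    """Pick the subcarrier at a fixed reference power, then optimize the power."""
    best_n = max(subcarriers, key=lambda n: utility(n, reference_power))
    return max(utility(best_n, p) for p in powers)


# Hypothetical utility values indexed by (subcarrier, power level).
TOY_TABLE = {(0, 1): 3, (0, 2): 4, (1, 1): 2, (1, 2): 6}


def toy_utility(n, p):
    """Look up the utility of a (subcarrier, power level) pair."""
    return TOY_TABLE[(n, p)]
```

With this table, the two-stage search commits to subcarrier 0 at the reference power level 1 and therefore misses the pair (1, 2), which the joint search finds.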

VII. CONCLUSION AND FUTURE WORK
In this paper, we investigated the resource allocation problem for a C-V2X network to improve the energy efficiency. We proposed two approaches using machine learning. In the first, a multi-agent Q-learning algorithm was applied for joint power and subcarrier allocation. In the second, we broke the problem down into two sub-problems: a subcarrier allocation sub-problem and a power allocation sub-problem. To allocate the subcarriers among users, a distributed Q-learning algorithm was proposed. Then, given the optimal subcarrier allocation, we modeled the power allocation sub-problem as a non-cooperative game, which was solved with a no-regret learning algorithm that can be executed in a distributed manner. Moreover, we compared the results with a third Q-learning algorithm for power allocation. Simulation results showed that the multi-agent joint Q-learning approach yielded significant performance gains of about 36% and 27% in terms of energy efficiency and sum rate, respectively, over other disjoint learning algorithms. In addition, our no-regret learning approach for power allocation was shown to improve the average energy efficiency and average throughput by about 14% and 16%, respectively, compared with a disjoint benchmark algorithm that uses Q-learning for power allocation. In future work, it would be interesting to consider multiple base stations, which increase the interference produced in the system, and to optimize the resource allocation in that setting.

C. PROOF OF THEOREM 3
Algorithm (1) solves (16) by alternating between selecting the maximum Q-value and calculating the energy efficiency of the system. Since the maximum reward function maximizes the Q-function, we want to show that the reward function in Algorithm (1) does not increase the objective value of (16). According to line (16) of Algorithm (1), computational resource allocation does not increase the objective value of (16). In addition, based on (38) and (39), the convergence of Algorithm (1) is guaranteed.
In the $i$-th iteration of Algorithm (1), the energy efficiency of the system depends on the number of users and their power levels. It equals $EE_k$ for cellular and D2D users when the number of users is larger than the maximum acceptable value. Therefore, we have $EE_k^{+}$ and need to show that $EE_k$ does not increase after the $i$-th iteration. If $EE_k = \max_N EE$ after $i$ iterations, increasing the number of users beyond $N_i$ increases the power consumption and decreases the energy efficiency of the system. Thus, $EE_{\max}$ does not exceed $EE_k$ when the number of users increases in other iterations.
Moreover, the learning rate is suitably reduced to 0, which is vital for convergence of the algorithm (1). As a result, the objective value of (16) is non-increasing in each iteration, and since it is lower bounded by zero, Algorithm (1)