Deep Reinforcement Learning Based Joint Allocation Scheme in a TWDM-PON-Based mMIMO Fronthaul Network

In next-generation centralized or cloud radio access networks (C-RANs), the time and wavelength division multiplexed passive optical network (TWDM-PON) is well recognized as a promising candidate for building the mobile fronthaul. Given the stringent bandwidth efficiency, latency, and cost requirements in C-RAN, an efficient bandwidth and wavelength allocation scheme is highly desirable for TWDM-PON-based fronthaul. In particular, in the massive multiple-input multiple-output (mMIMO)-enabled beamforming scenario, the radio resource must be allocated jointly with the bandwidth and wavelength resources in the TWDM-PON. In this paper, we formulate the joint allocation problem as an integer linear programming model and propose a deep reinforcement learning (RL)-based joint allocation scheme with an energy-efficient architecture for the TWDM-PON-based mMIMO fronthaul network. The proposed scheme couples a heuristic radio resource allocation algorithm with an RL-based wavelength allocation model to jointly optimize the fronthaul bandwidth, radio resource, and wavelength utilization efficiencies in the downstream direction. Simulation results show that the proposed scheme simultaneously achieves high bandwidth efficiency and high radio resource block utilization across different traffic loads while reducing the wavelength usage compared with the benchmark.


I. INTRODUCTION
With the advent of next-generation mobile communications, the exponentially growing number of user devices causes ever-increasing traffic requests. To tackle this, the massive multiple-input multiple-output (mMIMO)-enabled beamforming technique provides high transmission capacity and spectral efficiency [1], [2], [3]. However, the large data volume caused by the beamforming technique puts tremendous pressure on data transmission in centralized, or cloud, radio access networks (C-RANs), which has aroused much research interest from both academia and industry in optimal architectures and resource allocation schemes. In the traditional C-RAN architecture, a base station (BS) is composed of a baseband unit (BBU) and a remote radio head (RRH) to perform digital signal processing and transport analog signals [4]. In next-generation radio access networks (NG-RANs), a mobile BS is further split into a central unit (CU), a distributed unit (DU), and a radio unit (RU) [5], [6] to meet different transmission bandwidth and latency requirements. Fig. 1 shows a typical C-RAN architecture. The link between a DU and an RU is called the fronthaul, while the connection between a CU and a DU is called the midhaul. There are several functional split options for the split point among the CU, DU, and RU. Different layer split options bring various network features and characteristics regarding latency, bandwidth [7], and multicell coordination [5]. Due to the stringent latency and bandwidth requirements, split options 6 and 7 are the main candidates for the low-layer split (LLS) [8]. For next-generation mobile communications, although selecting split option 6 or 7 as the LLS reduces the required bandwidth compared to traditional C-RAN, it is still challenging to accommodate such heavy traffic between the DU and the RU. In this regard, the fronthaul bandwidth capacity and resource allocation efficiency must be guaranteed.
The time and wavelength division multiplexed passive optical network (TWDM-PON) offers an economical solution for realizing the fronthaul network infrastructure in C-RAN and can provide a large capacity [9]. Current commercial TWDM-PON systems have a single-wavelength capacity of 10 Gbps, which is expected to be upgraded to 25 Gbps [10] and even 50 Gbps [11] per wavelength in the future. Fig. 2 illustrates a typical TWDM-PON system for fronthaul in the C-RAN. In a TWDM-PON system, an optical line terminal (OLT) is connected to several optical network units (ONUs), which share the same optical fiber link in between. Each ONU is equipped with tunable transmitters and receivers, in which each transmitter is tunable to any one of the four upstream wavelengths, and each receiver is equipped with tunable filters that can be tuned to any one of the four downstream wavelengths [12]. An ONU can only choose one wavelength in a time slot, while a wavelength can be shared among the ONUs in different time slots. Although the TWDM-PON can provide large capacity, the operational expenditure (OpEx) and bandwidth efficiency problems remain to be solved.
Several research efforts have addressed efficiently accommodating fronthaul traffic and reducing the OpEx in a PON-based fronthaul [13], [14], [15], [16], [17], [18], [19], [20], [21]. The work in [13] introduced the basic idea of reducing the deployment cost and the energy consumption of the TWDM-PON-based fronthaul, where the number of active wavelength channels was minimized. An energy-efficient framework has also been proposed in [14] and [15] to optimize the active wavelengths and reduce energy consumption. Low-latency allocation schemes for TWDM-PON-based fronthaul have also been investigated. The work in [16] proposed a dynamic wavelength and bandwidth allocation (DWBA) algorithm to schedule the upstream wavelength channels while providing a low average packet delay and tackling the frame re-sequencing problem. An online gated service DBA algorithm for NG-EPON was proposed in [17] to assign a flexible number of wavelengths and a dynamic grant size on each upstream wavelength. In [18], the authors proposed and analyzed several online scheduling schemes, leading to improved utilization of the network capacity and reduced frame delay. Also, in [19], the authors proposed a novel wavelength and bandwidth allocation algorithm that could minimize the required number of active wavelength channels, considering the high burstiness and delay requirement of the fronthaul data transmission. Besides, to jointly support 5G fronthaul and best-effort data services in the same PON channel, the authors in [20] proposed a novel dynamic bandwidth allocation (DBA) algorithm for NG-PON networks, reducing the management coordination requirement. A wavelength and bandwidth preemption mechanism was further introduced in [21] to reduce the latency in a fixed and mobile converged TWDM-PON. Although the works above considered the issues in TWDM-PON-based fronthaul, few studies paid attention to the mMIMO scenario.
The mMIMO-enabled beamforming allows multiple antennas to transmit the same data to a user to improve signal strength and quality [22], [8]. However, redundant data transmission over the fronthaul may occur due to beamforming [23]. Recently, a joint allocation scheme with three heuristic algorithms was proposed in [24] for mMIMO fronthaul, optimizing fronthaul bandwidth and radio resource block (RB) utilization. To further optimize the allocation performance, deep reinforcement learning (RL) approaches have been employed to improve the allocation efficiency [25], [26]. By training with enough data in advance, RL-based approaches can quickly provide a solution that approximates the optimal one. In [26], the authors established a 3D beam antenna array mapping model and proposed a DRL model to jointly optimize 2D antenna sub-array selection and radio RB allocation. Although a lower fronthaul bandwidth was attained in [26], the employed network architecture was not flexible enough to utilize the bandwidth efficiently, and the energy consumption of the active wavelength channels was not considered. In this regard, a more flexible system architecture and a more efficient allocation scheme are desired.
In this paper, we propose an RL-based allocation scheme and a more flexible architecture to improve the resource block and bandwidth efficiency and, meanwhile, reduce the energy consumption of the wavelength channels in the downstream direction. Compared with our previous work in [25], we consider a more stringent latency constraint and further combine the RL model with the heuristic algorithm to allocate the radio resource and wavelength channels jointly. The joint allocation problem is formulated as a mathematical model using integer linear programming (ILP). We also analyze the scalability of the proposed downstream allocation scheme with different network scales and various traffic types.
The rest of the paper is organized as follows. Section II describes the allocation problem and presents the proposed network architecture. Section III presents the ILP formulation, while Section IV discusses the proposed reinforcement learning-based allocation scheme. Section V presents and analyzes the simulation results to evaluate our design objectives. Finally, Section VI concludes the paper.

II. ARCHITECTURE AND PRINCIPLE
To provide sufficient bandwidth at low cost for an mMIMO fronthaul network, a TWDM-PON-based architecture is proposed to connect the DU and the RU, as shown in Fig. 3. This study focuses on the downstream data transmission. For the upstream direction, an additional cooperative interface needs to be employed to guarantee the strict latency requirement [27], where the cooperative interface can utilize the scheduling information of the wireless domain to perform the TWDM-PON allocation in advance. For the functional split options of NG-RAN, option-7.3 and option-2 are chosen for the low-layer split and the high-layer split, respectively [28]. The radio resource control (RRC) and the packet data convergence protocol (PDCP) are processed at the CU, while the radio link control (RLC), the medium access control (MAC) function, and the forward error correction (FEC) encoding are processed at the DU. The modulation (Mod), resource element mapping (RE-Map), beamforming port expansion (BP-Exp), beamforming pre-coding, and IFFT with cyclic-prefix processing are incorporated into the RU to reduce the required bandwidth of the fronthaul. Unlike the fixed architecture (FIX-ARCH) adopted in [24], which employed a wavelength de-multiplexer and multiple power splitters, a general TWDM-PON architecture (GEN-ARCH) with a single splitter, as introduced in Section I, is adopted as the fronthaul infrastructure for the mMIMO fronthaul network. By employing GEN-ARCH, the wavelengths assigned at the ONUs can attain all possible permutations, making the wavelength assignment more flexible. We focus on the mMIMO scenario with large phased array antennas (PAAs), where beamforming is performed in the radio frequency (RF) domain. As Fig. 3(a) shows, each ONU is connected to an RU, which is in turn connected to a sub-PAA composed of multiple antennas from the PAA. Each ONU has its corresponding RU and sub-PAA. In this case, each RU independently processes its attached sub-PAA. Fig. 3(b) presents a beamforming example for the multiple-user scenario. We assume that the PAA can generate multiple beams simultaneously and that the whole system is omnidirectional. In this regard, the PAA can attain flexible 360° coverage. The 360° coverage area is divided into 24 regions, considering that each beam can achieve a 15° beamwidth angle. Users in the same region can be allocated different beams using different wireless resources (Beam 1 and Beam 2), while users in different regions can be allocated beams using the same wireless resources (Beam 3 and Beam 4).
The allocation program is executed in the DU, which sends control signals to the RU via the control plane channel to control the modulation, antenna mapping, and beamforming. A fair number of antennas and resource blocks (RBs) is assigned to each beam, where the RB is the smallest unit of radio resources that can be allocated to a user and is usually 180 kHz wide in frequency and one slot long in time. The assigned beams create bandwidth requirements for the ONUs that connect to the sub-PAAs containing the allocated antennas. The bandwidth requirement of each ONU is proportional to the number of RBs allocated to the corresponding beams. The detailed mapping between the allocated RBs and the required traffic bit rate over the fronthaul is given in Section III. In the downstream direction, after the OLT transmits data to the ONUs, the data is replicated in the RU before being launched via several antennas within the connected sub-PAA for a single beam request. The beam direction can be controlled by adjusting the beamforming pre-coder for the beam transmission. Fig. 4 shows an example of resource allocation for several beam requests, where each small block represents an RB. As shown in Fig. 4, every ONU connects to an RU, which is in turn connected to a sub-PAA. Each sub-PAA can generate multiple beams allocated with different RBs, e.g., Beam 1 and Beam 2 in Fig. 4. A beam can be allocated antennas from different sub-PAAs, e.g., Beam 3 and Beam 4; however, this may cause redundant data transmission over the fronthaul. For instance, Beam 3 utilizes antennas from both sub-PAA 1 and sub-PAA 2. Since the corresponding ONUs for sub-PAA 1 and sub-PAA 2 are assigned two different wavelengths, the data has to be transmitted on both wavelengths over the fronthaul. In contrast, although Beam 4 encounters the same antenna assignment problem as Beam 3, it does not cause redundant transmission because ONU 2 and ONU 3 use the same wavelength.
Therefore, when allocating resources to a particular beam, we need to consider wavelengths, antennas, and RBs simultaneously. An appropriate allocation of wavelengths, antennas, and RBs can reduce the fronthaul bandwidth demand and minimize the number of active sub-PAAs and active wavelengths.

III. PROBLEM FORMULATION
In this section, we formulate the resource allocation problem for TWDM-PON-based mMIMO fronthaul as an ILP. The TWDM-PON GEN-ARCH depicted in Fig. 2 is adopted. Since Option-7.3 is chosen for the low-layer split, the fronthaul traffic bandwidth demand is proportional to the wireless traffic and, therefore, proportional to the allocated RBs. For each beam, the fronthaul bandwidth requirement in each allocated wavelength channel can be calculated by (1) [29],
where $N_{mcs}$, $N_{sym}$, $N_{sc}$, $N_{RB}$, and $N_{mimo}$ are the modulation order, the number of symbols within the transmission time interval (TTI), the number of subcarriers per RB, the number of RBs per user equipment (UE), and the number of MIMO streams, respectively. A fully digital mMIMO system and a time-division duplex system are assumed, where the traffic is highly bursty. The ILP formulation for the bandwidth, RB, and wavelength channel allocation problem is presented below. The corresponding parameters and variables are shown in Tables I and II.
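As an illustration of how the quantities defined above combine, the following minimal sketch assumes that the per-beam fronthaul bit rate is the number of bits carried in one TTI divided by the TTI duration. This product form, the function name `fronthaul_rate_bps`, and the modulation order used in the example are assumptions made here for illustration; any overhead or scaling factors in (1) are not modeled. The subcarrier, symbol, and TTI values are those of the simulation setup.

```python
def fronthaul_rate_bps(n_mcs, n_sym, n_sc, n_rb, n_mimo, tti_s):
    """Assumed form: bits carried per TTI divided by the TTI duration."""
    bits_per_tti = n_mcs * n_sym * n_sc * n_rb * n_mimo
    return bits_per_tti / tti_s


# 16 RBs, 12 subcarriers per RB, 7 symbols per TTI, 100 us TTI.
# The modulation order of 6 bits per symbol is a hypothetical choice.
rate = fronthaul_rate_bps(n_mcs=6, n_sym=7, n_sc=12, n_rb=16, n_mimo=1, tti_s=100e-6)
print(f"{rate / 1e6:.2f} Mbps")
```

Under this assumed form, doubling the number of allocated RBs doubles the fronthaul rate, which matches the proportionality stated above.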

A. Objective
The objective is to minimize the fronthaul bandwidth and the number of active ONUs (and hence active sub-PAAs) and to reduce the number of active wavelength channels. Equation (2) is the objective function, in which the first term minimizes the fronthaul bandwidth demand, the second term maximizes radio resource utilization, and the third term minimizes the number of active wavelength channels. If a beam is allocated to different sub-PAAs using the same wavelength, the fronthaul bandwidth is counted only once.
B. Constraints
1) Wireless Allocation Constraints: Equation (3) is the fundamental constraint for the wireless RB allocation, ensuring that the sum of RBs allocated to a beam equals the total number of requested RBs. Equation (4) ensures that a specific RB cannot be assigned to multiple beams simultaneously. Equation (5) guarantees that a beam antenna array (BAA) is allocated the same RBs on all assigned antennas. Equation (6) requires that two beam requests from the same site be allocated different RBs to avoid interference. Equation (7) guarantees that the RBs allocated to a beam are contiguous.
2) Bandwidth and Wavelength Allocation Constraints: Equation (8) is the fundamental constraint for the bandwidth allocation, which indicates that an ONU can only choose one wavelength in a time slot. Equation (9) ensures that a wavelength channel is active as long as one ONU is allocated to it.
Once the wireless RB allocation variable $X^{m}_{i,j,r}$ is determined by (3)-(5), the ONU allocation for the beams is also determined. In other words, the specific ONU used by each beam is already fixed by $X^{m}_{i,j,r}$; (10) expresses this relationship. Equation (11) ensures that an ONU is used as long as one beam employs it.
Similar to (10), once the RB allocation and the wavelength allocation for an ONU are decided, the wavelength assignment for the beam is also determined, as expressed in (12). Equation (13) gives the wavelength capacity constraint.
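To make the constraints concrete, the following minimal sketch checks a candidate allocation against a simplified reading of them: request satisfaction, RB contiguity, disjoint RBs for beams in the same region, and per-wavelength capacity with a beam's bandwidth counted once per wavelength. The dictionary layout, the field names, and the function `check_allocation` are hypothetical; the per-antenna variables of the ILP are not modeled, and the one-wavelength-per-ONU rule is enforced implicitly by the ONU-to-wavelength mapping.

```python
from collections import defaultdict


def check_allocation(beams, onu_wavelength, capacity_gbps):
    """Return a list of violation messages; an empty list means feasible."""
    errors = []
    for name, b in beams.items():
        rbs = sorted(b["rbs"])
        # Request satisfaction: allocated RB count equals the requested count.
        if len(rbs) != b["requested_rbs"]:
            errors.append(f"{name}: RB count differs from request")
        # Contiguity: the sorted RB indices must form one unbroken run.
        if rbs and rbs != list(range(rbs[0], rbs[0] + len(rbs))):
            errors.append(f"{name}: RBs are not contiguous")
    # Beams in the same region must not share any RB.
    names = list(beams)
    for i, a in enumerate(names):
        for c in names[i + 1:]:
            same_region = beams[a]["region"] == beams[c]["region"]
            if same_region and set(beams[a]["rbs"]) & set(beams[c]["rbs"]):
                errors.append(f"{a}/{c}: same region shares RBs")
    # Capacity: a beam loads each distinct wavelength of its ONUs exactly once.
    load = defaultdict(float)
    for name, b in beams.items():
        for w in {onu_wavelength[o] for o in b["onus"]}:
            load[w] += b["rate_gbps"]
    for w, total in sorted(load.items()):
        if total > capacity_gbps:
            errors.append(f"wavelength {w}: {total} Gbps exceeds capacity")
    return errors


# ONU 1 uses one wavelength; ONU 2 and ONU 3 share another (as in the Fig. 4 example).
onu_wavelength = {1: "w1", 2: "w2", 3: "w2"}
beams = {
    "beam3": {"rbs": [0, 1, 2, 3], "requested_rbs": 4, "region": 5,
              "onus": {1, 2}, "rate_gbps": 2.0},
    "beam4": {"rbs": [4, 5, 6], "requested_rbs": 3, "region": 9,
              "onus": {2, 3}, "rate_gbps": 1.5},
}
print(check_allocation(beams, onu_wavelength, capacity_gbps=25.0))
```

In this example the hypothetical beam3 loads both wavelengths because its ONUs use different ones, while beam4 loads a single wavelength once.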
IV. ALLOCATION SCHEME
Resource allocation problems are commonly solved with the ILP technique. However, solving the ILP is impractical because of its prohibitively long computation time [24]. This paper combines the heuristic algorithm with the RL-based model to jointly optimize the fronthaul bandwidth, the radio resource utilization, and the number of employed wavelengths. The proposed allocation scheme couples the RB allocation for beams with the wavelength allocation for ONUs to attain high bandwidth efficiency and low energy cost.

A. Bidirectional Heuristic Algorithm for RBs Allocation
The work in [24] proposed an RB allocation scheme for mMIMO fronthaul with beamforming. However, it did not utilize the bandwidth efficiently enough. Algorithm 1 provides a more efficient RB allocation algorithm, which can be coupled with the wavelength assignment to further optimize the bandwidth efficiency. Using the wavelength flexibility at the ONU and the bidirectional allocation, the proposed scheme can attain superior performance in both RB and bandwidth utilization efficiencies. The proposed allocation algorithm is described as follows.
Line 1 initializes the allocation variable sets, where $M$, $Q$, $A$, and $RB$ are the sets containing the allocated beams, sub-PAAs, antennas, and RBs, respectively. Lines 2-4 do the preparatory work, sorting the requests by volume and calculating a threshold γ. The whole RB allocation process is divided into two stages, and the allocation direction for RBs changes dynamically within each stage, where the first stage includes lines 5-23, and the second stage includes lines 24-29. In the first stage, a single beam request can only be served by antennas from a single sub-PAA, and at most γ sub-PAAs can be used. Therefore, some beam requests with a small requested RB volume may not be served by the γ ONUs, and these beam requests are put into a waiting list L for the second-stage allocation. During the allocation, we need to guarantee that beam requests from the same region are not assigned the same RBs (lines 6-7) and that the RBs allocated to a single beam are consecutive (lines 9-17). The RB allocation direction also changes dynamically to maximize the RB utilization ratio, and is determined by whether the index of the currently used sub-PAA is even or odd. In the second stage (lines 25-30), we perform the RB allocation for the beam requests in the waiting list L.
In this stage, a single beam can be served by antennas from multiple sub-PAAs, while the allocation criterion is the same as that in the first stage (lines 5-24). Besides, within each TTI, we prioritize packets with more rigorous latency requirements. Finally, an RL model, which combines the pointer network and the actor-critic (PtrNet-AC) algorithm, is employed to tackle the wavelength assignment problem based on the RB allocation results.
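The first-stage idea described above can be illustrated with a deliberately simplified sketch. It is not Algorithm 1: antennas are ignored, each sub-PAA is treated as a single RB axis, the second stage is omitted, and the names `find_block` and `allocate_rbs` are hypothetical. It keeps only the ingredients named in the text: requests sorted by volume, contiguous RB blocks, no shared RBs within a region, at most γ sub-PAAs, an allocation direction set by the parity of the sub-PAA index, and a waiting list for unserved requests.

```python
def find_block(free, blocked, n, reverse):
    """Find the start of n contiguous usable RBs, scanning up or down."""
    if reverse:
        starts = range(len(free) - n, -1, -1)
    else:
        starts = range(len(free) - n + 1)
    for s in starts:
        if all(free[r] and r not in blocked for r in range(s, s + n)):
            return s
    return None


def allocate_rbs(requests, n_sub_paa, n_rb, gamma):
    """requests: list of (beam_id, region, n_rbs).

    Returns (assignment, waiting), where assignment maps a beam to
    (sub_paa_index, first_rb, last_rb) and waiting lists deferred beams.
    """
    free = [[True] * n_rb for _ in range(n_sub_paa)]
    region_rbs = {}  # RB indices already used by beams of each region
    assignment, waiting = {}, []
    # Serve the largest requests first.
    for beam, region, n in sorted(requests, key=lambda r: r[2], reverse=True):
        blocked = region_rbs.setdefault(region, set())
        for q in range(min(gamma, n_sub_paa)):
            # Even sub-PAA indices fill upward, odd indices fill downward.
            start = find_block(free[q], blocked, n, reverse=(q % 2 == 1))
            if start is not None:
                for r in range(start, start + n):
                    free[q][r] = False
                    blocked.add(r)
                assignment[beam] = (q, start, start + n - 1)
                break
        else:
            waiting.append(beam)  # left for a second-stage allocation
    return assignment, waiting


requests = [("b1", 0, 10), ("b2", 0, 8), ("b3", 1, 6), ("b4", 1, 12)]
assignment, waiting = allocate_rbs(requests, n_sub_paa=2, n_rb=16, gamma=2)
print(assignment, waiting)
```

In this toy run the two largest requests are placed at opposite ends of the RB axis on the two sub-PAAs, and the two smaller requests are deferred to the waiting list.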

B. RL-based Wavelength Assignment
The PtrNet-AC model is adopted for the wavelength assignment problem to increase the bandwidth efficiency and reduce the energy cost of active wavelength channels. To evaluate it, we compare it with another RL-based method, the deep Q network (DQN). Besides, the First-Fit and Best-Fit algorithms are employed as two benchmark algorithms. The details of the DQN and PtrNet-AC methods are given as follows.
1) DQN-based Wavelength Assignment: For the DQN-based algorithm, the wavelength allocation is modeled as a Markov decision process. All ONU bandwidth requests are first put into a queue. The DQN model iteratively chooses a wavelength for the first ONU in the waiting queue. Once the first ONU is allocated, it is popped, and the former second ONU becomes the head of the queue. The iterative assignment process continues until all the ONU groups in the waiting queue have been popped. The state space and the action space are modeled as follows, while the reward function is elaborately designed.
State space: The bandwidth demand for all ONUs and the remaining available candidate wavelength capacity.
Action space: The candidate wavelengths, $a \in W_s = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$.
2) PtrNet-AC Model-Based Wavelength Assignment: The adopted PtrNet-AC model directly produces a sequential index permutation of all ONUs, representing the order in which the ONUs are selected. Fig. 5 illustrates the architecture and the mechanism of the pointer network (PtrNet) [30], which consists of two long short-term memory (LSTM) network modules, one as the encoder (LSTM Network1) and the other as the decoder (LSTM Network2). The dimension of the input vector of the pointer network is variable, which enables our model to deal with various numbers of active ONUs. Given an input sequence of the ONUs' traffic requests, the PtrNet points to a specific position in the input sequence rather than predicting an index value from a fixed-size vocabulary, and finally, it gives the sequence of indices of the input ONUs. The state space and the action space are modeled as follows.
State space: The bandwidth demand for all ONUs and the maximal wavelength capacity, $s = (r_1, r_2, \ldots, r_m, c_{max})$.
Action space: The permutation of the ONU indices, $\pi$.
The detailed procedure of the PtrNet-AC model-based wavelength assignment is shown in Fig. 5. The ONU request information is first transformed, via a linear transformation, into a d-dimensional embedding of a two-dimensional point $x_i$ for all nodes before being fed into the pointer network. As shown in Fig. 5, the encoder network (LSTM Network1) sequentially reads and transforms the input information $x_i$ into a sequence of latent memory states $e_i$. Once the encoding finishes, a d-dimensional vector $g$ is fed into the decoder network (LSTM Network2) as a signal to trigger the decoding. Similarly, a sequence of latent memory states $d_i$ is maintained in the decoder network at each step $i$. For each decoding step, the attention mechanism produces a distribution $u_i$ for the next ONU request to select, where $v$, $W_1$, and $W_2$ are the learnable parameters of the pointer network.
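For reference, the attention step can be written in the standard pointer network form of [30]. This is the usual formulation rather than a quotation of the authors' equation, and the index $j$ over the encoder states is notation assumed here:

```latex
u^{i}_{j} = v^{\top} \tanh\left( W_1 e_j + W_2 d_i \right), \qquad j = 1, \ldots, m
```

Here $u^{i}_{j}$ scores how strongly decoding step $i$ points at the $j$-th ONU request, and $m$ is the number of ONU requests in the input sequence.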
Then, the softmax function is employed to normalize the scores into the probability distribution $p_i$ among the available ONU requests, which is expressed as $p_i = \mathrm{softmax}(u_i)$. The PtrNet selects one ONU request at each decoding step. Once an ONU request is selected, it is passed as the input to the next decoding step. Finally, we obtain a sequence of indices of the input ONU requests. After obtaining the index permutation of all ONUs, we iteratively select the ONUs according to the index order, starting from the first empty wavelength. A new empty wavelength is used when the current wavelength cannot provide sufficient bandwidth for the currently selected ONU request.
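The two mechanical parts of this procedure can be sketched as follows: normalizing the scores over the ONU requests that are still available, and converting a selection order into a wavelength count by opening a new wavelength whenever the current one cannot hold the next request. The function names `masked_softmax` and `wavelengths_for_order`, and the treatment of already-selected requests as zero-probability entries, are assumptions for illustration; the attention scores are taken as given rather than computed by an LSTM.

```python
import math


def masked_softmax(scores, available):
    """Softmax over the entries flagged available; others get probability 0."""
    top = max(s for s, a in zip(scores, available) if a)
    exps = [math.exp(s - top) if a else 0.0 for s, a in zip(scores, available)]
    total = sum(exps)
    return [e / total for e in exps]


def wavelengths_for_order(demands, order, capacity):
    """Count the wavelengths used when ONUs are packed in the given order."""
    used, remaining = 0, 0.0
    for idx in order:
        if demands[idx] > capacity:
            raise ValueError("a single ONU demand exceeds the wavelength capacity")
        if demands[idx] > remaining:
            used += 1              # open a new empty wavelength
            remaining = capacity
        remaining -= demands[idx]
    return used


demands = [15.0, 10.0, 10.0, 15.0]  # hypothetical ONU demands in Gbps
print(wavelengths_for_order(demands, [0, 1, 2, 3], capacity=25.0))
print(wavelengths_for_order(demands, [0, 3, 1, 2], capacity=25.0))
print(masked_softmax([1.0, 2.0, 3.0], [True, False, True]))
```

The two orders over the same demands need different numbers of wavelengths, which is why the selection order itself is a meaningful quantity to learn.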
We further optimize the parameters of the PtrNet using the actor-critic algorithm [31]. In this regard, two PtrNets are constructed, one called the actor and the other called the critic. The actor PtrNet is responsible for generating the allocation policy $\pi$, which is the ONU selection order that requires the minimum number of active wavelengths; the critic PtrNet provides an unbiased estimate of the required number of active wavelengths. The training objective of the actor PtrNet is to minimize the expected number of active wavelengths resulting from policy $\pi$. Here, $\theta$ denotes the parameters of the actor PtrNet, $s$ is the input state, $D(\pi|s)$ is the total number of active wavelengths obtained by the allocation policy $\pi$ with the input state $s$, and $p_\theta(\cdot|s)$ is the stochastic policy given by the actor PtrNet.
We then resort to policy gradient methods and stochastic gradient descent to optimize the parameters. The corresponding gradient is given in (17), where $b(s)$ is the output of the critic PtrNet, representing the estimated number of active wavelengths.
The PtrNet parameters are updated based on the allocation result. The gradient is approximated by Monte Carlo sampling, drawing a single allocation for each task, i.e., $\pi_i \sim p_\theta(\cdot|s_i)$.
The critic PtrNet, parameterized by $\theta_v$, is trained with stochastic gradient descent on a mean squared error objective between its predictions $b_{\theta_v}(s)$ and the actual number of active wavelengths sampled by the most recent policy. The training procedure is shown in Algorithm 2.
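The quantities defined above match the standard policy-gradient formulation with a learned baseline. Under that assumption, the actor objective, its gradient, the Monte Carlo estimate, and the critic loss can be written as follows; the batch size $B$ and the sample index $i$ are notation introduced here, and the expressions are a reconstruction from the surrounding definitions rather than a quotation:

```latex
\begin{align}
J(\theta \mid s) &= \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)} \left[ D(\pi \mid s) \right] \\
\nabla_\theta J(\theta \mid s) &= \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)} \left[ \left( D(\pi \mid s) - b(s) \right) \nabla_\theta \log p_\theta(\pi \mid s) \right] \\
\nabla_\theta J(\theta) &\approx \frac{1}{B} \sum_{i=1}^{B} \left( D(\pi_i \mid s_i) - b(s_i) \right) \nabla_\theta \log p_\theta(\pi_i \mid s_i) \\
\mathcal{L}(\theta_v) &= \frac{1}{B} \sum_{i=1}^{B} \left( b_{\theta_v}(s_i) - D(\pi_i \mid s_i) \right)^2
\end{align}
```

Subtracting the baseline $b(s)$ leaves the expected gradient unchanged while reducing its variance, which is the usual motivation for training a critic alongside the actor.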

C. Computational Complexity
The

A. Simulation Setup
We adopted a setup similar to that in [24] for the simulation, where an RB has a 180 kHz frequency range, and each RB has 12 subcarriers and 7 OFDM symbols within a TTI. The duration of one TTI was set to 100 μs. Both large-scale and small-scale networks were considered in this work. Table III shows the simulation parameters for the two cases. In the small-scale scenario, for instance, an exhibition hall, there are fewer user devices, and their distribution is relatively scattered. The bandwidth demand is relatively small, and the system bandwidth of the wireless part was set to 3.2 MHz. There are a total of 64 antennas equally distributed among 8 sub-PAAs, and every antenna has the same 16 RBs. In contrast, in large-scale scenarios, e.g., a football stadium or a music concert, a large number of user devices are densely distributed. In this regard, a large bandwidth is desired, and the wireless bandwidth for large-scale scenarios was set to 200 MHz, where all the antennas share the same 1000 RBs. The total number of antennas in the large-scale network grows to 512, and each sub-PAA contains 32 antennas. Each sub-PAA is connected to an RU, which is in turn connected to an ONU. We also evaluated the effect of antenna deployment with cases of 64/128 antennas per sub-PAA. A static traffic matrix was considered, in which the number of antennas and the RB demand per beam were generated randomly, following the uniform distribution shown in Table III. We divided the space into 24 regions for beamforming, and all user equipment was uniformly distributed among these 24 regions. We utilized the TWDM-PON setup introduced in Fig. 2. There are 4 wavelengths in a TWDM-PON system in the downstream direction, and the capacity of each is assumed to be 25 Gbps. Since we considered the downstream direction, the laser tuning time is zero. The simulation results of our proposed GEN-ARCH and RL-based allocation algorithm were compared with the FIX-ARCH and the three allocation algorithms presented in [24]. The results in this section were averaged over 1000 consecutive time slot simulations. The time duration of each simulation is one TTI. The PtrNet-AC model was employed for the wavelength assignment, with the training batch size set to 512. The embedding size and the hidden layer size were both set to 128, and Adam was employed as the optimizer. The initial learning rate was set to 1e-3, and the learning rate decays every 500 steps with a decay factor of 0.96.
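The learning rate schedule stated above can be written out as a short sketch. It assumes a staircase schedule, in which the rate is multiplied by 0.96 once per completed block of 500 steps; the function name `learning_rate` is hypothetical, and a smooth exponential decay would be an equally plausible reading of the text.

```python
def learning_rate(step, base_lr=1e-3, decay=0.96, decay_every=500):
    """Staircase decay (assumed): multiply by `decay` every `decay_every` steps."""
    return base_lr * decay ** (step // decay_every)


for step in (0, 499, 500, 1000, 75000):
    print(step, learning_rate(step))
```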

B. Simulation Results for Small-Scale Network
Fig. 6(a) shows the required fronthaul bandwidth against the traffic load. Since the results are affected jointly by the radio resource and wavelength allocation, the number of beam requests is used to represent the traffic load in the performance evaluation. "RL" refers to the combination of the bidirectional RB allocation and the PtrNet-AC model-based wavelength assignment in the following discussion. The proposed algorithm was compared with three heuristic algorithms from [24], named "Inseparable Beam antenna array Based Allocation" (IBBA), "Separable Beam antenna array Based Allocation" (SBBA), and "Threshold Separable Beam antenna array Based" (TSBBA). IBBA aims to minimize the fronthaul bandwidth, while SBBA aims to utilize the resource fragments of the antennas. TSBBA strikes a balance between the two. The performance of the proposed architecture (GEN-ARCH) was also compared with its counterpart in [24] (FIX-ARCH). As shown in Fig. 6(a), all four algorithms required less bandwidth with the GEN-ARCH than with the FIX-ARCH. This is because the proposed architecture can assign the same wavelength to all occupied ONUs and therefore does not cause redundant transmission. The results also show that the proposed RL-based algorithm attained a bandwidth efficiency competitive with that of IBBA, which outperformed the others. Fig. 6(b) shows the antenna resource utilization ratio versus varying network load conditions. The antenna resource utilization ratio was calculated by dividing the employed RBs by the total RBs of the antennas in a sub-PAA. Since the network architecture does not affect the antenna resource utilization ratio, we only investigated the effect of the four algorithms. As Fig. 6(b) shows, the proposed RL scheme attained a high antenna resource utilization ratio. Although IBBA achieved a high bandwidth efficiency, it performed relatively poorly in the antenna resource utilization ratio. Our proposed RL scheme achieved a high bandwidth efficiency and a high radio resource block utilization simultaneously. In a small-scale network, a single wavelength channel can support all requests. Therefore, the wavelength channel utilization efficiency was not compared here.

C. Simulation Results for Large-Scale Network
Fig. 7 compares the results obtained from various algorithms and architectures for the large-scale network. Fig. 7(a) shows the required fronthaul bandwidth against varying load conditions. Similar to the small-scale network results, both the proposed GEN-ARCH and the RL-based algorithm improved bandwidth efficiency compared with the FIX-ARCH and the benchmark algorithms, respectively. When the number of beam requests reaches 300, the required fronthaul bandwidth for the SBBA algorithm exceeds 120 Gbps, which is beyond the maximum network capacity of 100 Gbps (4 × 25 Gbps), while the proposed RL-based algorithm only requires 90 Gbps. Besides, the slope of the curve decreased slightly as the number of beam requests increased. This is because some requests could not be served due to constraint (6). If too many beam requests came from the same site region, some of them could not be served due to the limited wireless bandwidth. Fig. 7(b) shows the antenna resource utilization ratio versus varying network load conditions for large-scale networks. The RL-based algorithm achieved a high antenna resource utilization ratio. A high antenna resource utilization ratio means that fewer sub-PAAs are active, reducing the OpEx. Combining Fig. 7(a) and (b), the proposed RL-based allocation scheme could simultaneously attain a high fronthaul bandwidth efficiency and a high antenna resource utilization ratio for large-scale networks.
The network scalability of the proposed architecture and allocation algorithm was also investigated in this paper. Fig. 8 shows the simulation results when the size of the sub-PAA changes. Fig. 8(a) exhibits the effect of antenna deployment on bandwidth efficiency. The required fronthaul bandwidth decreased as the number of antennas per sub-PAA increased. The reason is that, with more antennas contained in a sub-PAA, a beam is more likely to be served by a single sub-PAA and, therefore, to use only one wavelength channel. In this case, the
The influence of different types of traffic was also investigated, including human-type communication (HTC) and machine-type communication (MTC). The statistical properties of HTC traffic and MTC traffic are different. Unlike the HTC traffic, the MTC traffic is heavy-tailed [32], [33], [34]. We analyzed the situation with different composition ratios of HTC and MTC. In the simulation, a uniform distribution was employed for HTC bandwidth request generation, while a Pareto distribution was employed for MTC traffic. Fig. 9 shows the antenna resource utilization ratio versus varying network load conditions for different proportions of HTC and MTC traffic. As Fig. 9 shows, when the number of beam requests is small, the higher the proportion of MTC traffic, the lower the antenna resource allocation ratio is. This is because the RBs of the selected antennas in the last sub-PAA may not be fully utilized due to the heavy-tailed property of MTC traffic. When the number of beam requests increased, the allocation performance of MTC traffic approached 87%. Since the number of employed sub-PAAs increased, the unused RBs in the last sub-PAA had less impact on the overall performance. The proposed scheme can support different types of traffic and can achieve a high antenna resource allocation ratio for all mixing proportions with a large number of beam requests.
Figs. 10-12 show the results for wavelength allocation. We compared the PtrNet-AC model-based wavelength allocation scheme with the DQN, ILP, First-Fit, and Best-Fit algorithms. To verify the wavelength allocation performance, we evaluated the proposed PtrNet-AC model under different traffic load conditions; some of these cases are beyond the capacity of the previously employed TWDM-PON system and serve as analyses for larger-capacity systems. In addition, to ensure the performance of the PtrNet-AC model, we trained it with a larger range of traffic demand. Fig. 10 shows the training performance for 300 beam requests, where the average number of active ONUs is around 12. It shows the average number of wavelengths used over 51200 randomly generated requests after a given number of training steps (each step uses a batch of 512 samples). PtrNet-AC outperformed First-Fit and Best-Fit after about 75000 training steps and approached the optimal value (ILP) after about 600000 training steps.
We investigated the PtrNet-AC model under different traffic load conditions. Since the dimension of the input vector of the PtrNet is variable, we can train and test the model with various traffic loads (from 15 to 400 beam requests, with the number of active ONUs varying from 1 to 16). In this case, the PtrNet-AC model converged after about 1100000 training steps. Therefore, only one PtrNet model is required for different numbers of active ONUs. In contrast, the DQN-based model does not have such flexibility: it requires a separate model for every case, so 16 models are needed when the number of active ONUs varies from 1 to 16. Fig. 11 shows the wavelength demand versus the number of beam requests for the different allocation algorithms. The vertical axis is the optimality gap, which represents the difference between the result of each algorithm and the optimal value (from ILP). The results were averaged over 1000 consecutive time-slot simulations. The PtrNet-AC model-based allocation outperformed the DQN, First-Fit, and Best-Fit algorithms under all traffic load conditions, with up to an 83% reduction in the optimality gap and a 9% reduction in active wavelength usage compared with the First-Fit and Best-Fit algorithms. The energy consumption can be significantly reduced by using the PtrNet-AC model-based wavelength allocation scheme. Moreover, owing to the characteristics of the pointer network, the trained PtrNet-AC model can be extended to more complex tasks. For instance, although the PtrNet model was trained with 1 to 16 active ONUs, the trained model can also be applied to cases in which the number of active ONUs is higher than 16. Table IV shows the extensibility of the PtrNet-AC model: the trained model could also be applied to cases in which the number of active ONUs varies from 20 to 50 and still attained superior performance compared with First-Fit and Best-Fit.
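To make the First-Fit and Best-Fit baselines and the optimality gap concrete, the following is a minimal sketch that treats wavelength allocation as packing per-ONU bandwidth demands onto wavelengths of 25 Gbps each (the per-wavelength capacity quoted earlier). It assumes that each active ONU is served by exactly one wavelength; the function names and the example demands are hypothetical and are not taken from the paper.

```python
def first_fit(demands, capacity=25.0):
    """Place each ONU demand on the first wavelength with enough spare capacity."""
    loads = []  # loads[i] is the total demand carried by wavelength i
    for d in demands:
        if d > capacity:
            raise ValueError("a single demand exceeds the wavelength capacity")
        for i, load in enumerate(loads):
            if load + d <= capacity:
                loads[i] += d
                break
        else:
            loads.append(d)  # no existing wavelength fits: activate a new one
    return loads


def best_fit(demands, capacity=25.0):
    """Place each ONU demand on the feasible wavelength that is already fullest."""
    loads = []
    for d in demands:
        if d > capacity:
            raise ValueError("a single demand exceeds the wavelength capacity")
        feasible = [i for i, load in enumerate(loads) if load + d <= capacity]
        if feasible:
            tightest = max(feasible, key=lambda i: loads[i])
            loads[tightest] += d
        else:
            loads.append(d)
    return loads


def optimality_gap(wavelengths_used, optimal_wavelengths):
    """Difference between an algorithm's wavelength count and the optimal count."""
    return wavelengths_used - optimal_wavelengths


if __name__ == "__main__":
    demands = [10, 18, 7, 15]  # hypothetical per-ONU demands in Gbps
    print(len(first_fit(demands)), len(best_fit(demands)))
```

For the hypothetical demands above, First-Fit activates three wavelengths while Best-Fit activates two, which is also the optimum, so the optimality gaps are 1 and 0, respectively. Both heuristics decide greedily in arrival order, which is why a learned ordering such as the one produced by a pointer network can reduce the gap.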
The computational complexity of these algorithms was evaluated in Fig. 12. Although the ILP always attains the optimal result, it suffers from high computational complexity. As shown in the subfigure of Fig. 12, the runtime of the ILP increased exponentially with the number of active ONUs. The average runtime of the PtrNet-AC model increased gently, and its average execution time was within 1 millisecond.
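The exponential growth of exact methods can be illustrated with a small backtracking solver that finds the minimum number of wavelengths. This is a stand-in for illustration only and is not the ILP formulation of the paper: it assumes that each ONU demand is carried on a single wavelength of fixed capacity, and the name `min_wavelengths` is hypothetical.

```python
import math


def min_wavelengths(demands, capacity=25):
    """Exact minimum number of wavelengths, found by backtracking.

    The search is exponential in the number of ONUs in the worst case.
    """
    items = sorted(demands, reverse=True)  # placing large demands first prunes earlier
    if any(d > capacity for d in items):
        raise ValueError("a single demand exceeds the wavelength capacity")
    if not items:
        return 0

    def fits(k):
        """Return True if all demands can be packed onto k wavelengths."""
        loads = [0] * k

        def place(idx):
            if idx == len(items):
                return True
            tried = set()
            for w in range(k):
                # Wavelengths with identical load are interchangeable: try one only.
                if loads[w] in tried or loads[w] + items[idx] > capacity:
                    continue
                tried.add(loads[w])
                loads[w] += items[idx]
                if place(idx + 1):
                    return True
                loads[w] -= items[idx]
            return False

        return place(0)

    k = math.ceil(sum(items) / capacity)  # capacity lower bound on the answer
    while not fits(k):
        k += 1
    return k


if __name__ == "__main__":
    print(min_wavelengths([12, 12, 13, 13]))
```

Such an exact search returns the reference value against which a heuristic or learned policy can be compared, but its cost grows quickly with the number of active ONUs, whereas a trained model needs only a forward pass at run time.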

VI. CONCLUSION
In next-generation radio access networks, TWDM-PON is expected to be a leading candidate for mobile fronthaul due to its large capacity and flexibility. In this paper, the TWDM-PON-based fronthaul for the mMIMO-enabled beamforming scenario was investigated. A flexible fronthaul architecture was proposed, and an ILP was formulated for the joint radio, wavelength, and bandwidth resource allocation problem in the TWDM-PON-based mMIMO fronthaul. We coupled the heuristic RB allocation algorithm with the RL-based wavelength allocation model to jointly optimize the fronthaul bandwidth, the radio resource utilization, and the number of employed wavelengths. Simulations were conducted for both large-scale and small-scale networks. The results show that our approach can simultaneously realize a high fronthaul bandwidth efficiency and a high resource block utilization ratio for different traffic loads. By employing the PtrNet-AC model, the wavelength usage was further optimized to save energy, attaining up to an 83% reduction in the optimality gap and up to a 9% reduction in wavelength usage compared with the benchmark algorithms. Moreover, the trained PtrNet-AC model can outperform the benchmarks even when the number of active ONUs exceeds the range on which it was trained.

Fig. 3. An illustration of the network scenario. (a) TWDM-PON-based fronthaul for an mMIMO system with a potential functional split option and the processing stages. (b) An allocation example for multiple beams generated by a PAA.

Fig. 4. An example of data transmission in a TWDM-based mMIMO system.
The computational complexity of the proposed bidirectional RBs algorithm is O(|M||Q||A||RB|), where |M|, |Q|, |A|, and |RB| are the number of beam requests, the number of active sub-PAAs, the number of antennas per sub-PAA, and the number of available resource blocks, respectively. The complexity of the PtrNet-AC model-based wavelength assignment is O(|Q|^2), where |Q| is the number of active sub-PAAs, which is also the number of active ONUs. Therefore, the overall time complexity of the RL-based joint allocation scheme is O(|M||Q||A||RB| + |Q|^2), which is similar to that of the benchmark algorithms in [24].
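As a rough worked example of the two terms in this bound, the sketch below evaluates them for 300 beam requests and 12 active sub-PAAs, the load reported for Fig. 10. The numbers of antennas per sub-PAA (64) and of resource blocks (100) are assumed values used only for illustration, and the function name is hypothetical. The counts are operation-count proxies that ignore constant factors.

```python
def joint_allocation_cost(n_beams, n_sub_paas, n_antennas, n_rbs):
    """Return the two terms of the bound O(|M||Q||A||RB| + |Q|^2)."""
    rb_allocation_term = n_beams * n_sub_paas * n_antennas * n_rbs
    wavelength_term = n_sub_paas ** 2
    return rb_allocation_term, wavelength_term


if __name__ == "__main__":
    # 64 antennas per sub-PAA and 100 RBs are illustrative assumptions.
    rb_term, wl_term = joint_allocation_cost(300, 12, 64, 100)
    print(rb_term, wl_term)
```

Under these assumed sizes the RB allocation term is 23,040,000 while the wavelength assignment term is 144, so the bound is dominated by the RB allocation step.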

Fig. 11. Wavelength demand versus the number of beam requests.

Fig. 12. Average runtime versus the number of beam requests.

Algorithm 1: Bidirectional RBs Allocation.
1: Input: Beam set M, sub-PAA set Q, antenna set As, and RB set RBs
   Output: Allocated sets (M'; Q'; A'; RB')
2: Set the output sets to the empty set ∅;
3: Sort all beam requests m ∈ M in descending order of (rb_m × a_m);
4: ... (Σ_m rb_m × a_m)/(|RB| × |A|)
5: for m ∈ M do
6:   Find a subset K in M, where each beam request in K is in the same region as m. RB_K is the RB set allocated to K.
7-9: ...
10:      ... a_m antennas with consecutive RBs beginning with rb such that {RB_S} ∩ {RB_K} = ∅
11:      Allocate these RBs to beam m and update (M'; Q'; RB') if there exists one active sub-PAA that can provide the antennas with consecutive RBs
12:      Continue if no available consecutive RBs exist
13:    end for
14:  else if N is odd then
15:    for rb = R - rb_m + 1; rb > 0; rb-- do
16:      Repeat lines 10-11
17:    end for
18:  end if
19:  if beam m is not allocated and N < γ - 1 then
20:    Allocate a new sub-PAA and update Q
21:  else if beam m is not allocated then
22:    Put beam m into waiting list L
23:  end if
24: end for
25: for m ∈ L do
26:   Repeat lines 6-14 and allow antennas from multiple sub-PAAs to be allocated to serve a single beam request m
27:   if beam m is not allocated then
28:     Allocate a new sub-PAA and update Q
29:   end if
30: end for
31: Employ the RL-based wavelength allocation model to assign wavelengths to the active ONUs connecting the sub-PAAs in Q and compute the required fronthaul bandwidth.

TABLE IV. SIMULATION RESULT FOR WAVELENGTH ALLOCATION