Heterogeneous Doppler Spread-based CSI Estimation Planning for TDD Massive MIMO

Massive multiple-input multiple-output (massive MIMO) has been recognized as a key technology to meet the demand for higher data capacity and massive connectivity. Nevertheless, the number of active users is restricted by the training overhead and the limited coherence time. Current wireless systems assume the same coherence slot duration for all users, regardless of their heterogeneous Doppler spreads. In this paper, we exploit this neglected degree of freedom to address the training overhead bottleneck. We propose a new uplink training scheme in which the periodicity of pilot transmission differs among users based on their actual channel coherence times. Since changes in the wireless channel are primarily due to movement, uplink training decisions are optimized over long time periods while considering the evolution of the users' channels and locations. Owing to the different rates at which the wireless channel and user locations evolve, a two-time-scale control problem is formulated. On the fast time scale, an optimal training policy is derived by choosing which users are requested to send their pilots. On the slow time scale, location estimation decisions are optimized. Simulation results show that the derived training policies provide a considerable improvement in the cumulative average spectral efficiency, even with partial location knowledge.


I. INTRODUCTION
Future wireless networks have to address an exponentially increasing demand for high data rates. In this context, several technologies have been proposed to improve the overall performance of wireless networks: dense small-cell deployment, millimeter-wave communications and massive MIMO, among others [1]. Massive MIMO has been identified as one of the most promising technologies to meet this requirement. Originally introduced by Marzetta [2], massive MIMO exploits a large number of base station (BS) antennas in order to enable the spatial multiplexing of a large number of devices. By coherently processing the signals over the BS antennas, transmit precoding can be used to concentrate each signal at its intended terminal, and receive combining can be used to discriminate between the signals of different users.
Massive MIMO has been thoroughly studied and has been shown to improve the network's spectral efficiency (SE) and energy efficiency (EE), in addition to providing a high capacity per unit area [3]. These gains are conditioned on accurate channel state information (CSI) at the BSs. In this paper, we focus on Time Division Duplexing (TDD) systems, where CSI can be acquired using uplink training with orthogonal pilot sequences [4]. A major issue in TDD systems is that a number of these pilot sequences are reused, resulting in pilot contamination [2], [4], [5].
Another reason for CSI inaccuracy is channel aging. This phenomenon results from the variation of the channel between the instant when it is learned and the instant when it is used for signal processing. This time variation is due to user mobility and processing delays at the BS.
Performance degradation due to channel aging was studied in a MIMO system with coordinated multi-point transmission/reception (CoMP) in [6]. The authors showed that the impact of channel aging is mitigated when channel prediction filters are used in the low-mobility regime.
The authors of Truong et al. [7] provide an analysis of rate performance in the presence of channel aging and prediction. They showed that, although channel aging degrades the performance of massive MIMO systems, channel prediction can overcome this issue. In Papazafeiropoulos et al. [8], [9], the effect of channel aging combined with channel prediction was investigated in scenarios with regularized Zero Forcing (ZF) precoders and minimum mean-square-error (MMSE) receivers, respectively. In Kong et al. [10], lower bounds on the sum-rate for both Maximum Ratio Combining (MRC) and ZF receivers, with and without channel prediction, were derived for an arbitrary number of BS antennas and users. The impact of channel aging and prediction on the power scaling law was also studied: the authors demonstrated that the transmit power scaling is affected neither by aged CSI nor by channel prediction.
Channel aging can also be leveraged in order to optimize uplink training. In Vu et al. [11], two spectrally efficient multiuser models for massive MIMO systems were proposed. The main idea comes from the observation that users with low velocity need not send training sequences with the same periodicity as faster-moving users, owing to the resulting heterogeneous coherence times. The two proposed models were shown to achieve significant SE gains.
In this paper, we aim at increasing the SE by exploiting the heterogeneous channel aging among users. In the current literature, the number of scheduled users is limited by the fixed length of the uplink training reference signal. A more appropriate approach would be to define the needed training resources dynamically, at each time slot. We aim at adapting uplink training to the actual coherence times. This means that, at a given slot, if the correlation between the estimated CSI and the actual channel has not been considerably degraded by aging, the network is not required to re-estimate it. Doing so spares part of the training resources, which can be used for data transmission or to schedule more users. This is in accordance with the concept of dynamic TDD, which is already considered in the development of the 5G standard [21].
Channel aging results primarily from mobility, with speed being an important parameter. Consequently, developing an uplink training policy that takes into consideration the second-order channel statistics is of paramount importance. Developing such a policy requires accurate estimates of user locations, which can be rather complicated to obtain in practice. In fact, localizing all scheduled users requires non-negligible signaling if it is done through the localization capabilities of the network (OTDOA [13], for example). The Global Positioning System (GPS) can also be used, but it raises the problem of the battery life of mobile devices [18]. Consequently, we suppose that the network is able to estimate the locations of a limited set of users. Adapting to the change in the large-scale fading coefficients and optimizing uplink training decisions based on the channel's autocorrelation should occur on two different time scales [14]. In fact, the two optimizations are based on information that changes over heterogeneous time scales. In order to achieve the maximum cumulative average SE over time spans larger than the large-scale fading coherence block, a two-time-scale control problem is considered.
In the fast time scale, an optimal training policy is derived. By taking into consideration the evolution over time of the correlation between the estimated CSI and the actual channel, the network is able to optimize its decisions to schedule users for uplink training over a finite time horizon. Taking the time dimension into account allows the network to be more efficient, since it becomes able to predict the impact of its decisions on long-term performance.
Deriving such a policy can naturally be formulated as a discrete planning problem over a finite time horizon [15]. The optimal training decisions are derived for a predefined time duration, denoted here by H, during which the large-scale fading coefficients are assumed to be constant. This is quite advantageous since it allows training to be optimized over time without requiring the actual channel estimates. Results show that the derived training policy provides a substantial performance increase. Since deriving the optimal policy can be computationally prohibitive for large optimization horizons, we provide a combinatorial optimization framework that enables an approximate training policy to be derived with reduced running time.
In the slow time scale, the network adapts to user mobility by deciding which users are required to feed back their locations. Estimating the exact location of all users requires a non-negligible signaling overhead. Consequently, efficiently selecting the users that are required to feed back their locations is important. Since locations are estimated in a periodic manner, we consider user locations that evolve according to independent Markovian stochastic processes [20]. The location estimation problem introduced above, with locations evolving in a Markovian fashion, can be formulated as a Partially Observable Markov Decision Process (POMDP) [19]. Simulations show that the combined optimization, on the two time scales, provides an efficient training strategy that improves the achievable cumulative average SE even with partially erroneous geolocation. This paper is organized as follows. We describe the considered system model in Section II. We discuss the advantages of coherence-time-based training in Section III. Two-time-scale training strategy learning is discussed in Section IV. Finally, in Section V, numerical results are presented.

II. SYSTEM MODEL AND PRELIMINARIES
We consider the uplink of a multi-cell multiuser massive MIMO system constituted of C macro BSs operating in TDD mode. Each macro BS is equipped with M omnidirectional antennas and serves K mobile devices, each equipped with a single omnidirectional antenna. We will refer to the latter as users. All users in the network move with different speeds and directions. Consequently, their signals are subject to heterogeneous Doppler spreads, which results in different wireless channel autocorrelations in time. We consider a system where time is slotted, t ∈ {0, 1, ...}, and the duration of each time slot t is given by D_c. We note that D_c is the channel coherence time, which depends on the maximum Doppler spread supported by the network, see for instance Toufik et al. [13]. We also consider the corresponding coherence interval T_s. The wireless channel of each user can be decomposed as a product of small- and large-scale fading coefficients. The wireless channel from user k (in cell c) to BS j, at time slot t, is

g^[j]_kc(t) = sqrt(β^[j]_kc) h^[j]_kc(t),

for all k = 1, ..., K and j, c = 1, ..., C, where h^[j]_kc(t) ∈ C^{M×1} is the fast fading vector and β^[j]_kc ∈ R+ models the large-scale effects, including shadowing and pathloss, which are assumed to remain constant during large-scale coherence blocks of T_β OFDM symbols.
Remark 1. In Sections IV and V we will consider a system where β_kc evolves according to a Markovian model.

A. Channel Estimation
As introduced above, in this paper we focus on a TDD system, where the entire frequency band is used for downlink and uplink transmission by all BSs and users. The BSs acquire CSI estimates using orthonormal training sequences (i.e., pilot sequences) in the uplink. We consider a pilot reuse factor of 1, i.e., the same set of pilot sequences is used in all cells.
We also consider that, during each coherence interval, a maximum of τ users are scheduled for uplink training in each cell, with τ ≤ K. For that, we consider a set of orthonormal training sequences, that is, sequences q_i ∈ C^{τ×1} such that q_i† q_j = δ_ij (with δ_ij the Kronecker delta). During uplink training of slot t, the l-th BS receives the pilot signal

Y_p^[l](t) = sqrt(P_p) Σ_{c=1}^{C} Σ_{k=1}^{τ} g^[l]_kc(t) q_k† + W_p(t),

where W_p(t) ∈ C^{M×τ} refers to an additive white Gaussian noise matrix with i.i.d. CN(0, 1) entries and P_p refers to the training signal power. The l-th BS then uses the orthogonality of the training sequences in order to obtain the MMSE estimate of the channel of user k, l [5] as

ĝ^[l]_kl(t) = (sqrt(P_p) β_kl / (1 + P_p Σ_{c=1}^{C} β_kc)) Y_p^[l](t) q_k.

Note that the MMSE channel estimate ĝ^[l]_kl(t) follows a CN(0, σ̂²_kl I_M) distribution, with σ̂²_kl = P_p β²_kl / (1 + P_p Σ_{c=1}^{C} β_kc). The wireless channel between user k (in cell l) and BS l can then be decomposed as

g^[l]_kl(t) = ĝ^[l]_kl(t) + g̃^[l]_kl(t),

where g̃^[l]_kl(t) represents the estimation error, follows a CN(0, (β_kl − σ̂²_kl) I_M) distribution, and is independent of ĝ^[l]_kl(t) [5].
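As a concrete illustration of the de-spreading and MMSE scaling steps above, the following minimal sketch simulates a single cell (so pilot contamination is absent and the sum over cells reduces to one term); the antenna count, pilot power and large-scale coefficients are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, tau, P_p = 64, 8, 10.0           # antennas, pilot length, pilot power
beta = rng.uniform(0.2, 1.0, tau)   # large-scale coefficients (assumed known)

# Orthonormal pilot sequences: columns of a scaled DFT matrix, q_i^H q_j = delta_ij
Q = np.fft.fft(np.eye(tau)) / np.sqrt(tau)

# True channels g_k ~ CN(0, beta_k I_M), one column per user
G = (rng.standard_normal((M, tau)) + 1j * rng.standard_normal((M, tau))) \
    * np.sqrt(beta / 2)

# Received pilot block Y = sqrt(P_p) * G Q^H + W
W = (rng.standard_normal((M, tau)) + 1j * rng.standard_normal((M, tau))) / np.sqrt(2)
Y = np.sqrt(P_p) * G @ Q.conj().T + W

# De-spread with each pilot, then apply the per-user MMSE scaling
G_hat = np.empty_like(G)
for k in range(tau):
    y_k = Y @ Q[:, k]                # = sqrt(P_p) g_k + unit-variance noise
    G_hat[:, k] = np.sqrt(P_p) * beta[k] / (1 + P_p * beta[k]) * y_k

mse = np.mean(np.abs(G - G_hat) ** 2, axis=0)
print(mse.mean())                    # small when P_p * beta is large
```

In this single-cell case the per-element error variance reduces to β_kl / (1 + P_p β_kl), which is what the empirical MSE above approximates.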

B. Channel aging
In practice, the wireless channel varies between the time when it is learned and the time when it is used for precoding in the downlink and decoding in the uplink. This variation is due mainly to user movement and processing delays. This phenomenon is referred to as channel aging. Its impact can be captured by a time-varying wireless channel model. To this end, we consider a stationary ergodic Gauss-Markov block fading process (i.e., an auto-regressive model of order 1) [16]. The evolution of the channel vector of user k, l between slots t − 1 and t is expressed as

h^[l]_kl(t) = ρ_kl h^[l]_kl(t − 1) + ε^[l]_kl(t),

where ε^[l]_kl(t) denotes a temporally uncorrelated complex white Gaussian noise process with zero mean and variance (1 − ρ²_kl) I_M, and ρ_kl represents the temporal correlation parameter of the channel of user k, l. This parameter is given by the Jakes model [16] and reads

ρ_kl = J_0(2π f_kl D_c),

where J_0(•) is the zeroth-order Bessel function of the first kind and f_kl represents the maximum Doppler shift of user k in cell l. In our work, we adopt a realistic setting in which mobile users have different frequency shifts, since we consider heterogeneous movement velocities and directions. For every user k in cell l, the maximum Doppler shift is given by

f_kl = ν_kl f_c cos(θ_kl) / c,

where ν_kl is the velocity of user k in cell l in meters per second, c = 3 × 10^8 m/s is the speed of light, f_c is the carrier frequency and θ_kl represents the angular difference between the direction of the mobile device's movement and the incident wave. Taking into consideration the combined effects of estimation error and channel aging, the wireless channel of user k, l at time t, given a CSI estimate acquired d slots earlier, can be expressed as

g^[l]_kl(t) = ρ^d_kl ĝ^[l]_kl(t − d) + ẽ^[l]_kl(t),

where ẽ^[l]_kl(t) aggregates the estimation error and the channel aging innovation terms.
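The Jakes autocorrelation and its effect on aged CSI can be sketched as follows. The Bessel function is evaluated with a truncated power series so that the snippet needs only the standard library, and the speeds, carrier frequency and slot duration are illustrative assumptions.

```python
import math

def bessel_j0(x, terms=25):
    """Truncated power series for J0(x); accurate for moderate |x|."""
    return sum((-1) ** m * (x / 2) ** (2 * m) / math.factorial(m) ** 2
               for m in range(terms))

def temporal_rho(v_mps, f_c_hz, theta_rad, D_c_s, c=3.0e8):
    """Jakes correlation rho = J0(2*pi*f_D*D_c) with f_D = v*f_c*cos(theta)/c."""
    f_d = v_mps * f_c_hz * math.cos(theta_rad) / c
    return bessel_j0(2 * math.pi * f_d * D_c_s)

# Illustrative assumptions: 2 GHz carrier, 1 ms slots, movement toward the BS
rho_ped = temporal_rho(1.5, 2e9, 0.0, 1e-3)    # pedestrian, 1.5 m/s
rho_car = temporal_rho(30.0, 2e9, 0.0, 1e-3)   # vehicle, 30 m/s

# After d slots without re-training, the correlation with the last
# estimate decays as rho**d under the AR(1) model
d = 5
print(rho_ped, rho_car, rho_ped ** d, rho_car ** d)
```

The gap between the two users after a few slots is exactly the heterogeneity that the proposed training scheme exploits.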

III. AN ADAPTIVE UPLINK TRAINING APPROACH FOR MASSIVE MIMO TDD SYSTEMS
In current massive MIMO models, the same coherence interval T_s is considered for all users. T_s is defined as a system parameter based on the maximum Doppler spread supported by the network [13]. This results in a suboptimal use of the time-frequency resources and a loss of flexibility that could be leveraged otherwise. In practice, users experience heterogeneous Doppler spreads; consequently, their channels do not age at the same rate. Forcing all users to perform uplink training with the same periodicity causes needless redundancy and a loss of resources. A more efficient approach should adapt the periodicity of each user's CSI estimation to its actual coherence time [12], [11]. This means that, at a given slot, if the correlation between the estimated CSI and the actual channel has not been considerably degraded, the network is not required to re-estimate it. Doing so spares part of the training resources, which can be used for data transmission or to schedule more users. In both cases, this results in an increase in SE. In this section, we present a novel approach to uplink training that leverages the users' heterogeneous channel coherence times. We present a detailed analysis of its impact on the achievable SE with MRC receivers. We also derive an important condition which ensures that the proposed scheme improves performance.

A. An adaptive coherence time-based uplink training scheme
We consider a massive MIMO system in which CSI estimation is adapted to the actual users' coherence times. The network groups users according to their channel autocorrelation coefficients into N_G copilot user groups λ_g, g = 1, ..., N_G. The users in each group are either scheduled for uplink training synchronously, using the same pilot sequence, or not scheduled at all. This requirement guarantees that copilot users always have the same CSI delay and a similar channel aging effect. For each copilot group λ_g, the CSI delay is denoted by d_g, g = 1, ..., N_G. At each slot, all N_G copilot user groups are scheduled for data transmission and a maximum of τ (τ < N_G) copilot groups are selected for uplink training. The rest have their signals processed using the last estimated version of their CSI.
The proposed TDD protocol consists of the following seven steps.
1) At the beginning of each large-scale coherence block, the BSs estimate the large-scale fading and channel autocorrelation coefficients, i.e., β_kc and ρ_kc.
2) Next, the CP clusters users according to their autocorrelation coefficients using K-means, see Young et al. [17]. Each resulting cluster is characterized by an average autocorrelation coefficient, or equivalently an average Doppler spread, and by the variance of its users' autocorrelation coefficients. The considered number of clusters is N_c. Defining N_c is of paramount importance. In this work, we choose N_c according to

N_c = ⌈D_max / D_c⌉,        (9)

where D_max represents the maximum coherence time. (9) guarantees that the average coherence time per cluster is approximately equivalent to a multiple of D_c. This is needed in order to appropriately define the CSI estimation periodicity as a function of this parameter.

3) Next, the CP allocates all users in the network (K per cell) to N_G copilot groups. Each group contains at most C users, drawn from the same autocorrelation-based cluster and from different cells. These N_G copilot groups are formed so as to minimize the variance of the autocorrelation coefficients within each group. This guarantees that copilot users experience a similar channel aging impact. The justification for this grouping is discussed in Section III.C.

4) At each coherence slot, the network schedules at most τ copilot groups for uplink training synchronously. Depending on the main key performance indicator (KPI) to optimize, different scheduling algorithms can be used to select these copilot groups. In this paper, we propose a scheduling algorithm that exploits the aforementioned user grouping in order to derive an optimal CSI estimation policy. This is the focus of Section IV and represents one of the main contributions of the present work.
5) All N_G copilot groups transmit their uplink signals in a synchronous manner.
6) The BSs process the received pilot signal and estimate the channels of the users active during uplink training using MMSE estimators. The BSs then decode and precode the uplink and downlink data signals, respectively, using the last estimated version of each user's CSI.
7) All BSs synchronously transmit downlink data signals to the N G copilot groups.
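Step 2) of the protocol can be sketched with a plain Lloyd's (K-means) iteration on the scalar autocorrelation coefficients. The three synthetic user populations and the cluster count below are illustrative assumptions, not the paper's simulation setup.

```python
import numpy as np

rng = np.random.default_rng(2)
# Per-user autocorrelation coefficients (known after step 1); three
# synthetic mobility classes, purely for illustration.
rho = np.concatenate([rng.normal(0.99, 0.003, 20),   # pedestrians
                      rng.normal(0.90, 0.010, 20),   # cyclists
                      rng.normal(0.65, 0.020, 20)])  # vehicles
rho = np.clip(rho, 0.0, 1.0)

def kmeans_1d(x, n_clusters, iters=50):
    """Plain Lloyd's algorithm on scalar data."""
    centers = np.quantile(x, np.linspace(0.1, 0.9, n_clusters))
    for _ in range(iters):
        # Assign each user to the nearest cluster center
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean()
    return labels, centers

labels, centers = kmeans_1d(rho, n_clusters=3)
for k in range(3):
    print(k, round(float(centers[k]), 3), round(float(rho[labels == k].std()), 4))
```

The small within-cluster variance is precisely what step 3) relies on when forming copilot groups with homogeneous channel aging.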

B. Spectral efficiency with outdated CSI
In what follows, we analyze the impact of the aforementioned training procedure on the achievable SE with an MRC receiver. We also explain why an adaptive Doppler-based training can be more efficient. Moreover, we provide a condition that ensures that the spectral efficiency of all users is improved when the aforementioned training procedure is used. For the sake of analytical tractability, we consider that the N_G copilot groups contain exactly C users each.
We henceforth refer to each user by its copilot group and serving BS indexes. During uplink data transmission at time slot t, BS l receives the data signal

Y^[l]_u(t) = sqrt(P_u) Σ_{c=1}^{C} Σ_{k=1}^{N_G} g^[l]_kc(t) s_kc(t) + W_u(t),

where W_u(t) ∼ CN(0, I_M) is the additive noise, s_kc denotes the uplink signal of user k, c, k = 1, ..., N_G, c = 1, ..., C, and P_u denotes the reverse link transmit power. Each BS applies an MRC receiver based on the latest available CSI estimates. BS l, l = 1, ..., C, detects the signal of user g, g = 1, ..., N_G, within the same cell by applying the combining vector

u_gl(t) = ĝ^[l]_gl(t − d_g),

where ĝ^[l]_gl(t − d_g) denotes the latest available channel estimate of user g in cell l. The resulting average achievable SE in the system with MRC receivers is given in Theorem 1.
Theorem 1. For N_G active copilot groups, τ of which are scheduled for uplink training, and using an MRC receiver u_gl(t) based on the latest available CSI estimate of each user g, l, the average achievable uplink spectral efficiency R^MRC_u is lower bounded by

R^MRC_u ≥ ((T_s − τ)/T_s) Σ_{g=1}^{N_G} Σ_{l=1}^{C} log_2(1 + SINR_gl(d_g)),        (12)

where d_g, g = 1, ..., N_G, are the copilot groups' CSI delays and SINR_gl(d_g) depends on the aging factor ρ^{2d_g}_gl, on the pilot contamination term I^p_gl and on the interference-plus-noise term I^n_gl.
Proof: See appendix A.
Equation (12) provides further insights into the impact of channel aging on the achievable average SE as a function of the CSI time offset. We can clearly see that the SE of a user decreases as a function of its CSI time offset. This is an intuitive result, since the correlation between the estimated CSI and the actual channel fades over time. Equation (12) also shows that, for the same CSI time offset, the degradation due to channel aging is higher for users with lower autocorrelation coefficients. Although outdated CSI causes an SINR degradation, the resources spared from uplink training can lead to an increase in SE.
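A stylized numeric sketch of this trade-off follows. All parameter values are illustrative assumptions, and ρ^{2d} is used as a stand-in for the full SINR degradation of Theorem 1: a slow user keeps almost all of its SINR after several slots, so the spared pilot symbols win, whereas a fast user loses more SINR than the spared overhead is worth.

```python
import math

# Stylized trade-off behind the discussion above (illustrative numbers).
T_s = 200                 # symbols per coherence interval
snr0 = 100.0              # post-MRC SINR with fresh CSI

def rate(tau, d, rho):
    """(1 - tau/T_s) * log2(1 + SINR(d)), SINR decaying as rho**(2*d)."""
    return (1 - tau / T_s) * math.log2(1 + snr0 * rho ** (2 * d))

# Slow user (rho close to 1): training every 5th slot with a smaller
# pilot budget beats training every slot.
slow_full = rate(20, 0, 0.995)   # trained every slot, tau = 20
slow_skip = rate(10, 4, 0.995)   # trained every 5th slot, tau = 10
# Fast user (low rho): the same skipping strategy hurts.
fast_full = rate(20, 0, 0.80)
fast_skip = rate(10, 4, 0.80)
print(slow_full, slow_skip, fast_full, fast_skip)
```

This is why the gain of the proposed scheme hinges on grouping users by their autocorrelation coefficients rather than skipping training uniformly.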

C. Asymptotic Performance
We now analyze the potential gain that the proposed training approach can provide. To do so, we compare it with a reference model that follows a classical TDD protocol, in which all N_G copilot groups are scheduled for uplink training at each time slot. We consider a worst-case scenario with random delays and random copilot group allocation. In this scenario, each user is assumed to experience the lowest channel autocorrelation coefficient among its copilot users, i.e., each user suffers the heaviest channel aging impact in its copilot group. Under this assumption, the proposed scheme improves the SE of every user whenever condition (14) below holds. Proof: See Appendix B.
Condition (14) ensures that the SE of each user increases when outdated CSI is used. It shows that the resources spared thanks to the reduced training overhead are a defining parameter: the SE is improved as long as the SINR degradation is compensated for by the resources spared from training. It also shows the importance of the ratio between the minimum and maximum autocorrelation coefficients in a copilot group. A high ratio is required in order to achieve the needed SE gain, and this requirement becomes tighter as the CSI time offset increases. Condition (14) thus shows that the proposed procedure can improve the achievable SE even with random delays and random pilot sequence allocation.
Remark 2. In order to satisfy condition (14), copilot users need to have similar autocorrelation coefficients. This explains Steps 2) and 3) of the protocol in Section III.A. In fact, clustering users based on their autocorrelation coefficients and grouping them accordingly results in copilot user groups with homogeneous channel aging within each group. This makes it possible to tolerate higher CSI time offsets. Condition (14) also shows that the aforementioned training procedure can improve the achievable SE of the network even with random pilot allocation. Consequently, one can do better if a coherence-time-adaptive scheduling for uplink training is implemented. More importantly, the proposed scheme shows the impact of the time dimension. This fact justifies the need for a time-aware training optimization, which is the focus of the next section.

IV. OPTIMAL TRAINING STRATEGY WITH OUTDATED CSI AND USER MOBILITY: A TWO-TIME SCALE DECISION PROCESS
We proved that adapting the uplink training periodicity to the actual channel coherence time can provide a considerable increase in network performance, even with random pilot sequence allocation. Nevertheless, a higher performance gain can be obtained if a more sophisticated and better-adapted scheduling policy is used. Developing such a policy is the focus of this section.
As a matter of fact, knowing that the CSI estimation periodicity should depend on the rate of channel aging, it makes sense to develop an uplink training policy that takes into consideration the evolution of the difference between the estimated CSI and the actual wireless channels. In contrast to a per-slot uplink training optimization, such a policy takes into consideration the impact of past scheduling decisions on long-term performance. User mobility should also be included. In fact, channel aging results primarily from mobility, with velocity being a defining parameter. Consequently, developing an uplink training policy that takes into consideration the evolution of the large-scale fading coefficients, in addition to channel aging, is of paramount importance. Developing such a strategy requires accurate estimates of user locations, which can be rather complicated to obtain in practice. As a matter of fact, localizing all covered users requires a non-negligible signaling overhead and energy consumption [13], [18]. Consequently, this problem should be addressed while assuming only partial knowledge of the user positions.
Adapting to the change in user locations and optimizing uplink training decisions based on the channels' autocorrelation coefficients should occur on two different time scales [14]. In fact, the two optimizations are based on information that changes over heterogeneous time scales (the wireless channel changes faster than the user positions). Consequently, a two-time-scale control problem should be formulated. This is the focus of the present section. We now model the two-time-scale system introduced above as a POMDP [19]. We assume finite action and state spaces in both time scales (see Figure 1).
We consider that, in the slow time scale (upper level), the position of each user evolves according to a Markovian mobility model [20] within its serving cell. These position variations occur at decision times n = 0, 1, .... Let ℓ_g(n) be the combination of the positions of the users from copilot group g at time n. Considering the combination of copilot users' positions, instead of each individual one, reduces the complexity of the present model. We assume, for the sake of simplicity, that all copilot groups have L possible position combinations, hence ℓ_g(n) ∈ {1, ..., L}. Building this model requires a partitioning of the coverage area of each cell into a number of disjoint regions. The area of each region is chosen such that the variation of the large-scale fading coefficients can be considered negligible within the region. For copilot group g, each position ℓ_g(n) ∈ {1, ..., L} corresponds to a combination of regions, one in each cell.
The transition probabilities are characterized by the matrix P_g = (p_g(i, j))_{i,j∈{1,...,L}} for copilot group g. The large-scale fading coefficients of user g in cell l, i.e., β^[j]_gl, j ∈ {1, ..., C}, depend on the users' positions. In the previous sections, we assumed that these values were constant; in this section, we add a time dependency to them, namely β^[j]_gl(n). Acquiring information on the positions of all users can be really expensive in terms of processing overhead and energy consumption [13], [18]. Consequently, we consider that only a limited number of users can feed back their positions to the network. In particular, we assume that, at every decision epoch, the users from U_max copilot groups can feed back their positions (with U_max < N_G). The CP can therefore only acquire the positions of the users from U_max copilot groups at each time n.
The positions of the rest of the users are inferred from previous estimations. This estimation is characterized by the belief state vector. The belief state vector of copilot group g at decision time n is denoted by b_g(n), where the i-th entry of b_g(n) is the probability that the users of copilot group g are in position combination i. We define by X_g the set of all belief states for copilot group g and we let X = X_1 × ... × X_{N_G} be the state space of the upper level.
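A minimal sketch of how the belief state vector b_g(n) can be maintained, assuming a known mobility transition matrix P_g: groups that feed back their position get a collapsed (one-hot) belief, while the others are propagated through the Markov model. The transition matrix below is randomly generated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
L = 4                         # position combinations per copilot group

# Row-stochastic mobility transition matrix P_g (illustrative assumption)
P = rng.random((L, L))
P /= P.sum(axis=1, keepdims=True)

def belief_update(b, P, observed_position=None):
    """One slow-time-scale step of the belief vector b_g(n).

    If the group fed back its position, the belief collapses to a
    one-hot vector; otherwise it is propagated through the Markov
    mobility model: b(n+1) = b(n) P.
    """
    if observed_position is not None:
        b = np.zeros_like(b)
        b[observed_position] = 1.0
        return b
    return b @ P

b = np.full(L, 1.0 / L)                       # uninformative initial belief
b = belief_update(b, P)                       # no feedback: predict
b = belief_update(b, P, observed_position=2)  # feedback: belief collapses
b = belief_update(b, P)                       # next epoch: row 2 of P
print(b)
```

Because the groups move independently, the joint belief over all N_G groups factorizes into per-group updates of this form.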
A remark on the notation is now in order.
Remark 3. The state in the upper level, x ∈ X, is an L × N_G matrix whose columns are the belief state vectors of the copilot groups.

In the upper level, at every decision epoch n = 0, 1, ..., the decision is to select which U_max copilot groups out of the N_G will transmit their positions to the BSs. That is, we consider actions u(n) = (u_1(n), ..., u_{N_G}(n)), with u_g(n) ∈ {0, 1} and Σ_g u_g(n) ≤ U_max, where u_g(n) = 1 means that the users in copilot group g feed back their positions at decision epoch n. At decision epoch n, the transition probability from belief state matrix x(n) ∈ X to belief state matrix x(n + 1) ∈ X factorizes as

P(x(n + 1) | x(n), u(n)) = Π_{g=1}^{N_G} P(b_g(n + 1) | b_g(n), u_g(n)).

The latter is satisfied because all users move independently. Recall that each position combination of the users in copilot group g is characterized by a set of large-scale fading coefficients β^[j]_gl, j ∈ {1, ..., C}, l ∈ {1, ..., C}. In the fast time scale, we define the state space by X = {0, ..., H − 1}^{N_G}, that is, the set of all possible delay vectors d = (d_1, ..., d_{N_G}), where d_g is the CSI delay of all users in copilot group g, i.e., λ_g.

The action space is A = {a = (a_1, ..., a_{N_G}) ∈ {0, 1}^{N_G} : Σ_{g=1}^{N_G} a_g ≤ τ}, where a_g = 1 means that copilot group g is scheduled for uplink training.

The decision times at the fast time scale (lower level) will be denoted by t = {t_0, t_1, ...}, with t_nH = n for all n = 0, 1, ... and H the finite time horizon in the lower level. Moreover, we make the assumption that the decision u(n + 1), in the slow time scale, is made right after the decision at time t_nH. We denote by d(0) = d_0 the initial state in the fast time scale at n = 0 and by x_0 ∈ X the initial state in the slow time scale. In this particular model, the fast-time-scale transitions from time t_nH until time t_{(n+1)H−1}, for all n ≥ 0, are deterministic. Namely,

d_g(t_{j+1}) = (1 − a_g(t_j)) (d_g(t_j) + 1), g = 1, ..., N_G,        (20)

that is, the CSI delay of a group scheduled for training is reset to zero, while the delay of every other group grows by one slot. At the fast time scale, we therefore encounter a finite-state, finite-horizon deterministic sequential decision problem [15]. The reward in this level, at time t and with MRC receivers, is

R_low(d, x, u) = ((T_s − τ)/T_s) Σ_{g=1}^{N_G} Σ_{l=1}^{C} log_2(1 + SINR_gl(d_g)),        (21)

where x ∈ X and u ∈ A are fixed and SINR_gl(d_g) is the SINR of user g in cell l with an MRC receiver; I^p_gl and I^n_gl are given in Theorem 1. Note that the reward function at the lower level, i.e., R_low, depends on the belief state and the decision in the upper level. We now define the sequence π_low = {φ_low,n}, n = 0, 1, ..., where, for each n, φ_low,n = (φ_low,t_nH, φ_low,t_nH+1, ..., φ_low,t_{(n+1)H−1}).
Each function φ_low,t_nH+j : X × X × A → A prescribes the action to be taken at decision time t_nH+j (in the lower level), for all n ≥ 0 and all 0 ≤ j ≤ H − 1. For this model, we only consider decision rules π_low that are stationary with respect to the upper level, such that φ_low,n(d, x, u) = φ_low,n′(d, x, u) for all n and n′, given d, x ∈ X and u ∈ A. That is, for fixed d, x and u, the optimal decision rule in the lower level is independent of the decision epoch n in the slow time scale. This consideration is in accordance with most existing literature and can also be justified by the considered setting. The set of all possible lower-level decision rules is denoted by Π_low, i.e., π_low ∈ Π_low. Moreover, we drop the dependency on n, since we only consider policies that are n-independent, and we denote by Φ_low the set of all H-horizon policies φ_low, i.e., φ_low ∈ Φ_low. We now define Φ_low,x,u ⊂ Φ_low as the set of all H-horizon policies given the initial belief state matrix x and the upper-level action u. Note that, in the definition of Φ_low,x,u and to introduce the policy φ_low,x,u, we use the decision times t_0, ..., t_{H−1}; this is without loss of generality, since these policies are independent of n. Next we define the reward in the upper level, namely the lower-level reward accumulated over the H-horizon,

R_up(d, φ_low,x,u, x, u) = Σ_{j=0}^{H−1} R_low(d(t_j), x, u),

where d is the delay state vector at time t_nH and the delays d(t_j) evolve according to (20) under the policy φ_low,x,u. We remark that none of the upper-level decisions incurs an immediate cost. Let us denote by Φ_up the set of all possible stationary decision rules in the upper level, such that π_up ∈ Φ_up, π_up : X × X → A.
Consequently, the objective is to find π_up ∈ Φ_up and π_low ∈ Π_low that maximize the expected long-term discounted reward; we refer to this problem as (26). The latter problem is a POMDP [19]. To see this, it suffices to note that the slow-time-scale sequential decision-making problem is just a POMDP with a reward that depends on the fast-time-scale deterministic decision-making problem. Therefore, the standard theory of Bellman's optimality equations applies. The optimal decision rule for this POMDP can be obtained as a solution of the optimality equation, for a discount factor 0 < α < 1,

V(d, x) = max_{u ∈ A, φ_low ∈ Φ_low,x,u} [ R_up(d, φ_low, x, u) + α Σ_{x′∈X} P(x′ | x, u) V(d′, x′) ],        (27)

where d′ is the delay state reached at the end of the H-horizon and V(d, x) denotes the value function [19], which refers, in our case, to the long-term SE. We will now make an assumption that simplifies the model significantly. We define Φ̄_low ⊂ Φ_low, where φ_low,t_j : X × {x} × {u} → A for j = 0, ..., H − 1, and φ_low,t_0 = (1, ..., 1).
For all φ_low ∈ Φ̄_low, the policy is such that, in the first stage of the H-horizon problem, all copilot groups are scheduled for uplink training. This allows us to start every slow-time-scale epoch with the same delay state d(nH) = (0, ..., 0) for all n = 0, 1, .... Eq. (27) then reduces to a form in which the delay argument is fixed, with R_up(φ_low,x,u, x, u) = R_up((0, ..., 0), φ_low,x,u, x, u). If we further denote

R_max(x, u) = max_{φ_low ∈ Φ̄_low} R_up(φ_low,x,u, x, u),        (30)

we then obtain a standard one-time-scale POMDP, whose optimality equation reduces to

V(x) = max_{u ∈ A} [ R_max(x, u) + α Σ_{x′∈X} P(x′ | x, u) V(x′) ].

POMDPs have long been studied in the literature. It has been shown that the complexity of exact POMDP algorithms grows exponentially with the number of state variables [27]. Even for simpler finite-horizon POMDPs, finding the optimal policy is PSPACE-hard [27]. This means that deriving an optimal policy for (26) directly is too complex, since belief-state monitoring is infeasible for large problems. We therefore decompose the problem and tackle the two time scales separately (see Figure 1). Indeed, in order to solve (26), two decision policies, each associated with one time scale, are needed. While, in the fast time scale, a finite-horizon training policy is derived, in the slow time scale, an infinite-horizon position estimation policy is required. The combination of the two provides a solution to (26). In the slow time scale, at each decision epoch n, the network estimates the locations of at most U_max copilot groups and updates the large-scale fading coefficients accordingly. In the fast time scale, between two upper-level decision epochs (n and n + 1), a finite-horizon training policy is derived based on the updated user locations that result from the upper-level optimization.

B. Fast time scale: learning an optimal training strategy for finite horizon
In this subsection, we focus on solving the lower-level planning problem in order to derive R max (x, u), see Eq. (30). We consider a deterministic sequential decision making problem with the reward given in Eq. (21). The actions of the network on the fast time scale are optimized assuming a given belief state x, an initial state d(0) = (0, ..., 0) and a given upper-level action u. The control horizon H is selected to be equal to the large-scale fading coherence block. Without loss of generality, we consider n = 0. The problem of optimal user scheduling for uplink training can be formulated as follows: with Σ g a g (t) ≤ τ, ∀t = t 1 , ..., t H−1 and d(0) = (0, ..., 0).
A naive approach to solving problem (32) is to generate all H-length sequences of actions and then select the sequence that results in the highest CASE after H slots (brute force). Clearly, this approach is computationally prohibitive when the action space and the optimization horizon are large. A more appropriate approach is the Dynamic Programming (DP) algorithm, more precisely value iteration, see [25] (based on the Bellman equation). The DP approach applies to sequential decision making problems such as the one in Eq. (32).
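The DP approach can be made concrete with a small backward-induction sketch. The sizes (N_G, TAU, H) and the per-slot reward below are toy assumptions chosen so the example runs quickly, not the paper's quantities; the reward simply discounts each group's contribution by rho^(2d) as its CSI ages, in the spirit of the aging model.

```python
import itertools
from functools import lru_cache

N_G, TAU, H, RHO = 3, 1, 3, 0.9   # toy sizes (illustrative assumptions)

def slot_reward(delays):
    # Stand-in for the per-slot average SE: aged CSI is discounted as rho^(2d).
    return sum(RHO ** (2 * d) for d in delays)

# Action space: schedule at most TAU copilot groups for uplink training per slot.
ACTIONS = [a for a in itertools.product([0, 1], repeat=N_G) if sum(a) <= TAU]

def step(delays, action):
    # Deterministic delay transition: training resets a group's CSI delay to 0,
    # otherwise the delay grows by one slot.
    return tuple(0 if a else d + 1 for d, a in zip(delays, action))

@lru_cache(maxsize=None)
def value(delays, t):
    # Backward induction on the Bellman equation: V_t(d) is the best
    # cumulative reward obtainable from slot t to the end of the horizon.
    if t == H:
        return 0.0
    return max(slot_reward(step(delays, a)) + value(step(delays, a), t + 1)
               for a in ACTIONS)

best = value((0,) * N_G, 0)   # optimal CASE-like objective from a fresh state
```

With TAU = 1, the optimal schedule found here is round-robin training, which matches the intuition that refreshing the stalest group first limits aging losses.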
Remark 4. We note that solving (32) using the DP approach can be computationally expensive for large optimization horizons H, with a running time of O((H − 1)|X||A|). Consequently, we provide a lower-complexity algorithm that derives an approximate policy reaching a guaranteed fraction of the optimal solution.
As mentioned in Remark 4, the DP approach results in a long running time that can hinder the uplink training procedure. Consequently, we adopt an alternative approach and treat problem (32) via combinatorial optimization. We now express the CSI delays d(t j ) = (d 1 (t j ), ..., d N G (t j )) as a function of the action vectors a(t) = (a 1 (t), ..., a N G (t)), ∀t = t 0 , ..., t j−1 . Recall the definition of the deterministic fast time scale delay transition (20). The delay d g (t j ), ∀g = 1, ..., N G , can then be written accordingly, and the objective function in problem (32) can be transformed into a maximization over a(t 0 ), ..., a(t H−1 ), with I p gl and I n gl defined by combining Eq. (13) and Eq. (33). The following theorem helps to derive an efficient algorithm for problem (34). The structure of problem (34) is quite convenient: even though the objective function is not monotone, efficient approximation algorithms exist for the non-monotone submodular set function case. In this work, we use the approximation algorithm proposed in [22], which provides a 1/(k + 2 + 1/k + ε)-approximation of the optimal solution under k matroid constraints. In our case, we have H − 1 matroid constraints, each associated with a given optimization stage t, t = t 1 , ..., t H−1 . Consequently, the algorithm proposed in this subsection provides a 1/(H + 1 + 1/(H − 1) + ε)-approximation of the optimal cumulative average spectral efficiency with a running time of (N G (H − 1)) O(H−1) [22]. The detailed algorithm is given in Table I. We define the ground set X = {v gt }, where each element v gt represents the scheduling of copilot group g for training at slot t. We also define the sets I t , t = t 1 , ..., t H−1 .
Each I t contains the selected elements at stage t with |I t | ≤ τ .

Table I: Algorithm for Approximate Finite horizon training strategy
Input: Ground set X of elements
1. Set v ← argmax u∈X f(u) and S ← {v}
2. While one of the following local operations applies, update S accordingly:
• Delete operation on S:
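A minimal, self-contained sketch of this style of local search is given below. It keeps only the add and delete moves, starts from the best singleton as in step 1 of Table I, and accepts any strictly improving move; the algorithm of [22] additionally uses swap operations and an improvement threshold to bound the running time. The objective f and the per-stage capacity below are toy stand-ins, with the capacity check playing the role of the partition-matroid constraints |I t | ≤ τ.

```python
def local_search(ground, f, feasible):
    # Step 1 (Table I): start from the single best element.
    current = {max(ground, key=lambda v: f({v}))}
    improved = True
    while improved:                          # step 2: apply local moves while they help
        improved = False
        for v in list(current):              # delete operation on S
            cand = current - {v}
            if cand and f(cand) > f(current):
                current, improved = cand, True
        for v in list(ground - current):     # add operation on S (feasible adds only)
            cand = current | {v}
            if feasible(cand) and f(cand) > f(current):
                current, improved = cand, True
    return current

# Toy instance: ground set of (group, slot) pairs, at most TAU pilots per stage.
GROUPS, STAGES, TAU = 3, (1, 2), 1
ground = {(g, t) for g in range(GROUPS) for t in STAGES}

def feasible(S):
    # Partition-matroid style constraint: at most TAU scheduled groups per stage.
    return all(sum(1 for (_, t) in S if t == stage) <= TAU for stage in STAGES)

def f(S):
    # Toy non-monotone objective: diminishing returns in distinct groups
    # covered, minus a linear scheduling cost.
    return 2.0 * len({g for (g, _) in S}) - 0.5 * len(S)

S = local_search(ground, f, feasible)
```

On this toy instance the search settles on one distinct group per stage, which is the feasible optimum of f under the capacity constraint.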

C. Slow time scale: adapting to user mobility
Once the fast time scale planning problem is solved, we tackle the infinite-horizon positioning problem of the slow time scale. Since we have chosen to decompose (26) into two levels, the combination of the policies on the two time scales provides an infinite-horizon policy that solves (26). The mobility of each copilot group g is modeled by an L-state Markov chain. The positions of the users in each copilot group g remain the same for a period equal to the large-scale fading coherence block and then evolve according to the probability transition matrix P g .
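As an illustration of this mobility model, the sketch below propagates a location belief through such an L-state chain between estimation epochs. The 3-state matrix is an arbitrary example, not one of the randomly generated matrices P g used later in the simulations (which use L = 5).

```python
L = 3  # toy number of position states (illustrative assumption)

# Example transition matrix for one copilot group (each row sums to 1).
P_g = [[0.8, 0.1, 0.1],
       [0.1, 0.8, 0.1],
       [0.1, 0.1, 0.8]]

def propagate_belief(belief, P):
    # One large-scale coherence block elapses: belief' = belief * P.
    return [sum(belief[i] * P[i][j] for i in range(L)) for j in range(L)]

b = [1.0, 0.0, 0.0]            # location known exactly right after estimation
b = propagate_belief(b, P_g)   # uncertainty grows with each skipped estimation
```

Repeated application of `propagate_belief` without a new position estimate drives the belief toward the chain's stationary distribution, which is exactly the growing confusion that the slow-scale decisions must manage.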
Solving the slow time scale control problem directly becomes intractable for a large number of users and possible positions, owing to the resulting complexity of belief-state monitoring [27]. Nevertheless, practical methods exist if policy optimality is traded for convergence speed. We adopt the approximate approach of Nourbakhsh et al. [28], which solves a POMDP by exploiting its underlying Markov decision process (MDP). This is done by ignoring the agent's confusion (uncertainty about user locations) and assuming that it is in its most likely state (MLS). Replacing a complicated POMDP by its underlying MDP considerably reduces complexity, since the belief space is replaced by a smaller, more practical state space.
We now discuss in more detail how the upper-level policy is derived. In our case, the state of the underlying MDP at a given decision epoch, s ∈ S, is an N G × 1 vector whose elements represent the locations of all copilot groups, that is, s = (ℓ 1 , ..., ℓ N G ). The most likely positions of the users, for each decision epoch n = 0, 1, ..., are obtained as in (36). Recall that the belief position at decision epoch n depends on the belief-state transition given in (18). Using (36), the agent's uncertainty about user locations is removed and the upper-level planning problem is transformed into a more practical MDP. The resulting MDP is solved using value iteration [15]. At each iteration, the CP updates its belief state (according to (18)) and assumes that the users are in their most likely positions (according to (36)). Then, a training policy is derived in the fast time scale based on the assumed positions, using the algorithm in Table I. This provides the upper-level reward, which is equivalent to the H-horizon lower-level reward R max (x, u) = max φ̃ low x,u ∈Φ̃ low {R up (φ̃ low x,u , x, u)}. The same procedure is repeated until the best position estimation decision is derived for each most likely state. Although the derived policy provides only an approximate location estimation strategy, it nevertheless makes tractable a problem that is otherwise intractable in realistic scenarios.

V. NUMERICAL RESULTS
In this section, we provide numerical results to validate the analytical expression derived in Section III and to demonstrate the performance of the proposed training/copilot group scheduling scheme. We also showcase the performance of the proposed uplink training learning procedures. We compare the obtained results with a reference model where all scheduled users take part in uplink training. Consequently, the reference model is characterized by zero CSI delay for all users and a higher training overhead, which is taken to be equal to the number of scheduled users per cell. We consider C = 7 hexagonal cells, each with a radius of 1.5 km. The possible positions of the mobile users are generated randomly in each cell with a minimum distance of 10 m to their serving BSs. The movement velocities and directions are generated randomly for all users. User speeds are drawn randomly from [4 km/h, 80 km/h]; this interval covers pedestrian, public transportation and urban car speeds. The angles separating the movement directions of the mobile devices and the directions of their incident waves are drawn from [0, 2π]. The path-loss exponent is set to 3.5. A coherence slot of T s = 200 samples is assumed, with a coherence time of 1 ms. The system operates over a bandwidth of 200 MHz, as considered for 5G systems [26]. Once the copilot groups are formed, we consider L = 5 possible position combinations for each group. The transition probability matrices P g , g = 1, ..., N G , are also generated randomly with Σ L j=1 p g (i, j) = 1, ∀g = 1, ..., N G , ∀i = 1, ..., L. Figure 2 examines the tightness of the proposed analytical lower bound given in Theorem 1. As can be observed, the proposed lower bound almost overlaps with the simulation curve.
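The speed range [4 km/h, 80 km/h] used above is precisely what makes per-user training periodicity attractive: coherence time scales inversely with Doppler spread. The sketch below quantifies this with the common rule of thumb T c ≈ 1/(2 f D); the 2 GHz carrier frequency and the rule of thumb itself are our assumptions for illustration, not parameters stated in the paper.

```python
C_LIGHT = 3e8          # speed of light, m/s
F_CARRIER = 2e9        # carrier frequency in Hz (assumed for illustration)

def doppler_spread(speed_kmh):
    # Maximum Doppler shift f_D = v * f_c / c for a user moving at speed v.
    v = speed_kmh / 3.6
    return v * F_CARRIER / C_LIGHT

def coherence_time_ms(speed_kmh):
    # Rule-of-thumb coherence time T_c ~ 1 / (2 f_D), in milliseconds.
    return 1e3 / (2 * doppler_spread(speed_kmh))

slow = coherence_time_ms(4)     # pedestrian end of the simulated range
fast = coherence_time_ms(80)    # urban-car end of the simulated range
ratio = slow / fast             # coherence-time spread across users
```

Under these assumptions a pedestrian's channel stays coherent roughly 20 times longer than a fast vehicle's, so forcing both to retrain at the vehicle's pace wastes most of the pedestrian's pilot transmissions.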
In addition, we readily see that using outdated CSI, with the implied decrease in training resources, increases the SE by 6.91 bit/s/Hz for M = 50. This gain reaches 11.2 bit/s/Hz for M = 150. We also compare the proposed policies with per-slot optimization, in which the evolution over time of the correlation between the estimated CSI and the actual channel is not taken into consideration, and the scheduling of copilot groups for uplink training is optimized to maximize the ASE at each slot. Figure 4 shows that, using value iteration and the approximation algorithm in Table I, the gain in CASE is maintained, reaching 41.99 bit/s/Hz and 38.7 bit/s/Hz, respectively, at the final stage of the optimization horizon H. Although per-slot optimization also achieves a gain in CASE, it performs poorly in comparison with the proposed policies, which shows the paramount importance of taking the time dimension into account when optimizing uplink training decisions. Finally, given its good performance, the approximate method (the algorithm in Table I) represents an efficient, low-complexity substitute for the more computationally prohibitive DP approach.
In Figure 5, we illustrate the achievable CASE gain after 3 upper-level decision epochs with H = 4. In this example, 3 values of U max were considered. As can be readily observed, decreasing U max results in a lower CASE gain. This is quite intuitive, since a lower U max results in more confusion about the user locations. In fact, the CP commits more errors when inferring user positions from its belief states for lower U max values. Nevertheless, despite the positioning errors, the proposed two time scale learning approach is able to provide a considerable CASE gain of 110.26 bit/s/Hz, 97.863 bit/s/Hz and 70.73 bit/s/Hz with U max1 , U max2 and U max3 , respectively.

VI. CONCLUSION
In this paper, we analyzed the performance of an adaptive uplink training scheme for TDD massive MIMO systems, taking into account the actual coherence times of the wireless channels, the impact of channel aging and user mobility. The idea is to adapt the periodicity of CSI estimation to the actual coherence times. We proposed a two time scale control problem that allows the network to learn the best uplink training policy while taking into consideration user mobility, channel coherence time and practical signaling overhead limitations. In the fast time scale, the network learns an optimal training policy by choosing which users are requested to send their pilot signals over a predefined optimization horizon. In the slow time scale, owing to practical signaling and processing overhead limitations, the network chooses which users are required to feed back their positions, based on their belief states. The present work shows that this approach leverages the time evolution of the correlation between the wireless channel and the estimated CSI, and provides a considerable increase in the achievable cumulative average spectral efficiency that cannot be obtained otherwise. Future work includes the investigation of similar procedures with fairness considerations and user traffic awareness.

A. Proof of Theorem 1
The network serves N G copilot groups, τ of which are scheduled for uplink training. At reception, each BS uses MRC receivers based on the latest CSI estimates. BS l detects the signal of user g in cell l by applying the filter u gl (t) = ĝ [l] gl (t − d g ) / ‖ĝ [l] gl (t − d g )‖, t ≥ d g , where ĝ [l] gl (t − d g ) denotes the latest available CSI estimate for user g in cell l. Consequently, the detected signal of user g in cell l is given by Equation 37. We note that I 1 (•) refers to the useful signal, I 2 (•) represents the impact of pilot contamination, and I 3 (•) regroups the impact of white noise, estimation error, uncorrelated interference due to users with different pilot sequences, and the impact of aging. The instantaneous SE attained by user g in cell l follows. We now define R g,l as the average achievable sum rate of user g in cell l; the last equality follows from the law of total expectation. Defining R 0 g,l accordingly, we have R g,l = E(R 0 g,l ). Based on the convexity of log(1 + 1/(x + a)) and Jensen's inequality, we obtain the stated lower bound. We now aim to compute the conditional second moments of I j (t) given ĝ [l] gl (t − d g ) for j = 2, 3. To do so, we start by obtaining an alternative expression for I 2 (t). The equality follows from the following four properties: (i) S kc • S ic = 0 for all k ≠ i and all c, c' ∈ {0, ..., C}; (ii) E(ZW u (t)) = E(Z)E(W u (t)) = 0 for all random variables Z that are independent of W u (t) (zero-mean complex Gaussian noise); (iii) similarly, E(Zε ic (t)) = 0 (zero-mean complex white Gaussian noise); and (iv) the channels g kc are independent for all (k, c) ≠ (k', c'). We now compute the four terms in Equation 48, starting with the last term, E(|W u (t)| 2 ). We then compute the third term in Equation 48; for the second equality, we have used the expression for finite geometric sums, since (ρ [l] gc ) 2 < 1 for all g and c. Next, we compute the second term in Equation 48; the latter holds due to the variance of g for all g and c. We are left with the first term in Equation 48. Combining all four terms, that is, Equations 49, 50, 51, 52 and 48, we obtain (55) with F = |u † gl (t)ĝ [l] gl (t − d g )| 2 . (56) Now, we apply Jensen's inequality to the right-hand side (RHS) of Eq. 42. Consequently, the mean values I p gl and I n gl are given below. Summing the achievable SE of all grouped users concludes the proof.

B. Proof of Theorem 2
We consider the asymptotic regime where the number of BS antennas M grows large. In this case, the lower bound on the SE of each user (g, l) converges to the limit given below. The proposed framework is compared with a reference massive MIMO system where all scheduled users participate in uplink training. The lower bound on the achievable SE of each user (g, l) in the reference system converges to the corresponding limit. The aim here is to improve the achievable SE of each scheduled user. Consequently, the SEs of each user in the two considered systems should verify, ∀g = 1, ..., N G , l = 1, ..., C, the inequality below, which is equivalent to the following condition:

Theorem 2. In the asymptotic regime (M grows large), with ρ g [min] and ρ g [max] denoting, respectively, the minimum and maximum autocorrelation coefficients in copilot group g, g = 1, ..., N G , the proposed training framework improves the SE of each user when (14) is satisfied.

Figure 1: A two time-scale planning problem

Theorem 3. Problem (34) is equivalent to maximizing a submodular set function subject to matroid constraints.
Proof: See Appendix C.

Figure 3 presents a comparison of the CDFs of the achievable SE between the reference model and the proposed training scheme for different numbers of BS antennas. For 50 BS antennas, the proposed training scheme achieves a gain in the 5%-outage rate of 6 bit/s/Hz. For 150 antennas, the gain in the 5%-outage rate grows to 8 bit/s/Hz. This increase in performance is mainly due to the reduced training resources, which can be used to transmit more data.

Figure 2: Spectral efficiency for varying values of M

Figure 4: CASE gain for different lower level algorithms

From Equations 42 and 55, we obtain R g,l = E(R 0 g,l ) = E[(1 − τ/T) log(1 + (ρ [l] gl ) 2d g G)], where G = F |u † gl (t)ĝ [l] gl (t − d g )| 2 . The bound is then expressed in terms of the minimum and maximum channel autocorrelation coefficients in group g, which amounts to assuming the worst-case scenario for each user. Finally, by considering the SINR, we obtain (14), which finishes the proof.

Table II: Approximate local search procedure

Combining the results in Equations 56 and 58, we obtain the desired lower bound, that is,