Multi-cell Multi-beam Prediction using Auto-encoder LSTM for mmWave systems

Millimeter wave (mmWave) systems rely on communication in narrow beams for directional and spatial multiplexing gains. A key challenge in realizing these systems is beam tracking, particularly in environments with high mobility and blockage. Additionally, in wide-area mmWave cellular systems, user equipment (UE) devices must often simultaneously track signals from multiple cells, since links to individual cells can be unreliable. Models of the channel dynamics across multiple cells and multiple beams are difficult to derive from first principles. In this work, we propose a fully data-driven approach based on a novel auto-encoder integrated long short term memory (LSTM) network, which predicts multiple beams from multiple cells, one time step in the future. The key innovation is to use an auto-encoder pre-processing step, which reduces the dimensionality of the input – the main challenge in multi-cell, multi-beam tracking. The prediction capability of the proposed network is verified and compared to common baseline predictors as well as popular machine learning (ML) based neural network predictors in realistic system-level simulations using a commercial ray-tracer. We observe that predictions from the proposed network, which utilizes auto-encoders for dimensionality reduction, offer significantly better best beam accuracy and lower beam misalignment loss than common baseline approaches. We also discuss outage prediction and proactive beam switching as applications of the multi-cell multi-beam prediction.


A. Motivation
Millimeter wave (mmWave) wireless systems have emerged as a key component of fifth generation (5G) cellular standards [2]. The abundance of available bandwidths at these frequencies can enable both massive broadband and ultra-low latency communications for use cases including vehicle to everything (V2X) communications, robotics, drones, healthcare, augmented reality and virtual reality.
A well-known challenge of mobile communication at these frequencies is beam tracking. To overcome the high isotropic path loss in the mmWave frequencies, both the transmitter (TX) and receiver (RX) must typically communicate in narrow, steerable directional beams. Tracking these beams at a high angular resolution is challenging, particularly in high mobility environments. In addition, mmWave signals are highly susceptible to blockage from humans, hands and many everyday building materials [3]-[6]. Thus, small changes in the orientation of the device or appearance of blockers can result in a rapid degradation of link quality in any given direction. As a result, mmWave systems often need to track and predict link quality along multiple directions to guarantee reliable communication. Moreover, most mmWave systems rely on dense cell deployments combined with multi-connectivity to provide macro-diversity resistance to blockage [7]. Multi-connectivity can be supported via carrier aggregation [8] where a mobile (UE) can be simultaneously connected to multiple cells. Hence, the mobile needs to track and predict link quality not only from multiple directions (beamforming) but also multiple directions from multiple cells (beamforming coupled with macro-diversity).

The authors are with NYU WIRELESS, Tandon School of Engineering, New York University, Brooklyn, NY 11201, USA (Email: {s.hashim, srangan}@nyu.edu). This work was supported by the National Science Foundation under Grants 1302336, 1564142, and 1547332, NIST, SRC and the industrial affiliates of NYU WIRELESS. A part of this paper was presented at the IEEE International Conference on Communications (ICC) held in Dublin, Ireland [1].
The broad goal of this work is to understand the problem of tracking multiple beams from multiple cells. We will refer to the signal path from one cell to a UE along a particular TX and RX direction pair as a link. Each link has a time-varying quality. The problem is to use past measurements to estimate a set of future link qualities as well as the best link indices (that have the best quality) with high accuracy. These future estimates can help mobile mmWave wireless systems accurately track links and proactively switch them when needed.
Traditional statistical prediction approaches are difficult, since link statistics are complex and difficult to model from first principles. The link qualities in particular can have intricate statistical relationships between different angles and base stations. We thus propose a machine learning approach where the prediction algorithms can be trained from data. Specifically, we formulate the multi-beam multi-cell prediction problem as a vector-valued sequence-to-sequence problem and solve it using recurrent neural networks (RNNs) and auto-encoders. In our work, we use a well-known RNN called long short term memory (LSTM), which has worked for similar problems [9], [10]. LSTMs can capture long-term dependencies and have been successful in a range of problems, particularly in natural language processing (NLP), speech recognition and robotics. Auto-encoders are used to reduce the input dimension of the LSTM. Our contributions in this paper are:
(a) Novel neural network architecture with an auto-encoding pre-coder: We propose an auto-encoder integrated LSTM network capable of predicting all link qualities with correct indices, one time step ahead in the future (e.g., 20 ms, the typical period for reference signals in 5G NR [11]). A key challenge in these architectures is the large raw signal dimension, particularly when the number of base stations and directions is high, as discussed later in Section I-B. The high dimension can result in poor generalization and a high computational cost. We thus propose a novel auto-encoder based pre-coder for initial dimensionality reduction. We demonstrate in simulations that the auto-encoder based LSTM can offer a reduction of approximately 260% in the number of parameters with improved performance compared to standard dimensionality reduction methods such as principal component analysis (PCA).
(b) Ray tracing evaluation: To validate the method, we generate traces of link quality, measured as signal to noise ratio (SNR), using a commercial ray-tracer at mmWave frequencies. The traces mimic a car with a multiple antenna receiver moving in downtown Rosslyn, which is connected to multiple cells with multiple antennas. These traces capture most of the major channel characteristics like multi-path, mobility and blockages from buildings. To make the scenario more realistic, we also implement a measurement-based hand blockage model on top of the ray-tracer generated SNRs. The generated data set is also useful for ML-based wireless research, as discussed in Section I-B. After data generation, we test the prediction performances of the methods mentioned in (a) on the traces. Over multiple test trajectories of the generated data, the average test error in best link prediction from the proposed predictor is less than 2 dB 90% of the time, outperforming optimally tuned baseline linear predictors (ML predictors) by at least 78% (10%) at the same percentile. Similarly, for the top 10 links, the average error is less than 2 dB for 94% of the time, which is 86% (8%) better than the linear baseline predictors (ML-based predictors). Furthermore, the error due to misalignment of best predicted beams is less than 2 dB for 98% of the time using the proposed predictor, outperforming baseline linear predictors by at least 86%.
(c) Site-specific training: An important implication of the work is that we offer a method for site-specific training, where the prediction of links from a particular collection of base stations can be optimized. Site-specific models can be run in the network (where the UE reports measurements to the network) or in the UE (where the network provides the UE parameters). This site-specific training, using an edge server, is demonstrated in Fig. 1.
(d) Applications for beam management procedures: We also discuss outage prediction and proactive beam-switching as applications of the proposed predictors. Although the predictors are not optimized for these applications, we observe that they still deliver adequate accuracy. The proposed auto-encoder integrated LSTM predictor can successfully predict outages 96% of the time (80% for the best baseline linear predictor) with a false alarm rate lower than 5% (same for the best baseline linear predictor). Similarly, the proposed predictor can proactively switch beams with a 91% accuracy (81% for the best baseline linear predictor), while keeping the false beam-switching rate lower than 2% (10% for the baseline predictor).

B. Related Work
There is now a growing body of work on deep learning methods for various forms of link prediction and channel estimation. For example, previous work on single link quality prediction has been done at sub-6 GHz frequencies for a vehicular scenario in [13]. Work on link prediction based on LTE and WiMax measurements has been done in [14]. CSI estimation using deep learning has also been addressed in [15] and tested on sub-6 GHz measurements. The work [16] uses RNNs for a very simple LTE-MIMO system (only four links) with no blockages, and [17] uses LSTMs to predict RSSIs (one link) for different sub-6 GHz interfaces, while [18] has developed neural network models for single beam estimation from non-coherent measurements and validated these in experiments. Our work, however, tackles multi-beam multi-cell prediction at the mmWave frontier, which is more complicated because of channel impediments like narrow beams, severe blockages and complex interactions with the environment. Importantly, since we consider tracking a much larger number of links, the role of the dimensionality reduction is key. As shown in [19], the number of beams increases as the carrier frequency $f_c$ increases ($\propto f_c^2$)$^{1}$. Increasing $f_c$ results in severe blockage [20] and penetration loss [21], which necessitates more macro-diversity (multi-cell connectivity), thereby increasing the input dimensions even more. As we move to next generation wireless networks (higher $f_c$) with even higher dimensional inputs, the dimensionality reduction will become more crucial for ML-aided wireless communications. To the best of our knowledge, this is the first work that tackles the increased dimensionality problem (because of increased beams and multi-cells) for mmWave wireless systems using auto-encoders.

Fig. 1: Demonstration of site-specific training. The blue arrows indicate gNBs sending data to the edge server. The edge server collects the data and trains the neural network. Once trained, the parameters of the network are broadcast to all gNBs and UEs, indicated by orange arrows. A similar architecture is proposed in [12]. The channel between the edge server and UE/gNBs is not part of training the ML network.
A related line of work [22]-[24] tackles beam and blockage predictions at mmWave, leveraging sub-6 GHz links (non-standalone mode of operation). In our work, we solve the multi-cell multi-beam prediction problem solely based on mmWave links (standalone mode of operation). Also, [25] and [26] use ML for mmWave link blockage classification and prediction, while [27] uses a gated recurrent unit (GRU) for blockage prediction and proactive hand-over in a simplistic environment. However, these works do not address the link quality (SNR) and link index prediction problem. Our work confronts the link SNR and index prediction problem in a realistic environment based on 3GPP parameters, measurement campaigns and established prior work. These prediction capabilities will help in processes like proactive beam switching, handovers [27] and adaptive rate prediction. V2X, robotics and drone communications can also benefit from these proactive applications. To the best of our knowledge, this work is unique in solving the multi-beam, multi-cell magnitude and index prediction problem for mmWave systems.
Finally, a key challenge in ML methods is the need for large quantities of training data. A common theme in many prior works, such as [17], [24], [28], has been the use of ray tracing, which enables large quantities of training points to be generated via electromagnetic simulations. Ray tracing has also been vital in training deep generative models [29], [30]. This work also uses ray tracing combined with hand blockage models to capture local effects not included in a conventional ray tracer. Since the ray tracing scenario conforms with the 3GPP NR standard at mmWave frequencies, the generated data set is essential to foster research in ML-assisted wireless communications (similar to the DeepMIMO data [28])$^{2}$.

C. Organization
Section II defines some system parameters for link measurements based on 3GPP standards, which will help us align our work with the standard. We formulate the single-step ahead prediction problem in Section III and define some performance metrics, which are useful from a wireless communications perspective. Section IV presents proposed LSTM-based predictors and an argument about the need for dimensionality reduction, which will be achieved using auto-encoders and PCA. We also introduce some other ML-based neural network (NN) predictors for comparison in this section. In Section IV-D, we present some baseline linear predictors to which performance of NN-based predictors will be compared. Discussion on a detailed and realistic simulation setup based on a commercial ray tracer is included in Section V. In Section VI, training and tuning of hyper-parameters of the proposed predictors as well as the baseline predictors are given. In Section VII-A, we compare prediction performance of all predictors and observe how the proposed predictor outperforms the baseline linear predictors as well as other NN predictors. We discuss prediction performances of all the predictors for various applications in Section VII-B. Finally, Section VIII concludes the paper with a summary.

$^{2}$ The data set can be found at https://github.com/shastpi/mmWave-ray-tracer-dataset

II. SYSTEM PARAMETERS
Although our methodology is general, to make the analysis concrete we will focus on tracking and predicting the links for 3GPP NR-like systems, which are reviewed below$^{3}$.
gNB and UE codebooks: In 5G NR terminology, the base station cell is called the gNB and the mobile is called the UE [31]. To simplify the analysis, we assume the gNB transmits from a codebook of $N_{\mathrm{TX}}$ possible directions, and the UE receives from a codebook of $N_{\mathrm{RX}}$ directions. Hence, for each gNB-UE pair there are $N_{\mathrm{TX}} N_{\mathrm{RX}}$ direction pairs. In general, we will assume that $N_{\mathrm{TX}}$ is equal to the number of TX antennas at the gNB, and $N_{\mathrm{RX}}$ is equal to the number of receive antennas at the UE. Hence, there is one codebook vector for each spatial degree of freedom. However, most of the framework can also be applied to over-sampled codebooks.
Reference signals (RS) for beam measurements: Beam tracking in 5G NR is done using reference signals such as synchronization signal blocks (SSBs) or channel state information reference signals (CSI-RS). SSBs are periodically broadcast on relatively wider beams from each 5G NR gNB for the purpose of base station discovery and downlink beam detection (usually in idle mode) [31]. CSI-RS on the other hand are sent on narrower beams during data transmission from gNB and enable beam tracking in mobile environments. The beamsweep is generally done in a hierarchical manner i.e., the SSBs with wider beams are first used to determine a coarse direction of transmission, which is then refined using reference signals like CSI-RS. However, in this work, we only consider narrow beams, which are referred to as refined beams after beam refinement. This brings us to our first assumption that reference signals only use narrow beams for beam quality measurements (SNR). Measurements over narrow beams eliminate the hierarchical aspect of beam tracking and make beam tracking more challenging.
These RSs are transmitted in bursts with some periodicity $T_{\mathrm{RS}}$. We set this interval to 20 ms, which is consistent with SSB and CSI-RS periodicity in the 3GPP NR standards for a carrier frequency of 28 GHz with a sub-carrier spacing (SCS) of 120 kHz [32]. In each RS burst, $N_{\mathrm{RS}}$ beams can be measured (typically with one TX direction for each RS). The parameters $N_{\mathrm{RS}}$ and $T_{\mathrm{RS}}$ are configurable. In simulations below, we will set $N_{\mathrm{RS}} = N_{\mathrm{TX}}$, allowing one RS in each downlink direction. The values of these and other important parameters are given in Table I.

Network Model with Carrier Aggregation: Resilience to blockage at mmWave frequencies necessitates macro-diversity, i.e., the UE must be connected to multiple cells [7], [8]. To this end, we assume that the UE is connected to $N_{\mathrm{gNB}}$ gNBs via carrier aggregation, a key feature in 3GPP systems that enables simultaneous connections to multiple cells [8]. The cells either operate in different component carriers or within the same component carrier$^{4}$; the analysis for this paper is identical. The above process does not require synchronization across cells.
The notions of RSs and carrier aggregation are introduced to justify an important assumption for our prediction method: the UE/gNB are able to measure all the beams at each discrete time interval. This discrete time interval in our case is $T_{\mathrm{RS}}$.
Our analysis can apply to both fully digital and analog beamforming at the UE. With fully digital beamforming, the UE can measure all $N_{\mathrm{RX}}$ directions every RS measurement period. Hence, after one RS burst of $N_{\mathrm{TX}}$ transmissions, all RX-TX pairs will have been measured. For analog beamforming during an RS burst allocated period, the UE can send uplink measurement signals (such as sounding reference signals, SRS) to the gNB from one of its beams, and a gNB with fully digital beamforming can measure all the beam-pairs for that particular beam$^{5}$. In this manner, the complete beam sweep for all pairs will take $N_{\mathrm{RX}}$ such instances. In either beamforming case, since we assume carrier aggregation, the UE can measure the signal from all cells in each measurement burst. Henceforth, for simplicity, we will assume that beam sweeping is done at a fully digital beamformed UE via RS bursts. Therefore, the UE measures each synchronization resource individually. The other signals will appear as interference. Since mmWave systems are wideband and generally power limited, we have neglected this interference.

III. PROBLEM FORMULATION
We index the discrete time steps (RS bursts) by $t = 0, 1, \ldots, T$. Let $\gamma(i, j, l, t)$ denote the measured channel quality (i.e., SNR) from cell $i = 1, \ldots, N_{\mathrm{gNB}}$, in TX direction $j = 1, \ldots, N_{\mathrm{TX}}$, and RX direction $l = 1, \ldots, N_{\mathrm{RX}}$ at measurement period $t$. We merge the first three dimensions of the SNR tensor so it becomes $\gamma(k, t)$, where $k = 1, \ldots, K$ and $K = N_{\mathrm{TX}} N_{\mathrm{RX}} N_{\mathrm{gNB}}$. We call each $k$ a link. The matrix $\gamma(k, t)$ thus describes the variation of the link qualities over time. The variations will in general depend on UE motion, blocking, small-scale fading, hand blockage and other channel characteristics. The SNR measurement can be a wideband average SNR or effective SNR when there is frequency-selective fading.
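The merging of the cell, TX and RX indices into a single link index can be illustrated with a short NumPy sketch. The array sizes, the random SNR values and the row-major ordering of the merged index are assumptions made only for this example (the text does not fix an ordering), and the indices here are zero-based.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration (not the paper's values).
N_GNB, N_TX, N_RX, T = 2, 4, 3, 5

rng = np.random.default_rng(0)
# gamma4[i, j, l, t]: SNR in dB from cell i, TX direction j, RX direction l at time t.
gamma4 = rng.normal(loc=10.0, scale=5.0, size=(N_GNB, N_TX, N_RX, T))

# Merge (cell, TX, RX) into one link index k, giving a K x T matrix.
K = N_GNB * N_TX * N_RX
gamma = gamma4.reshape(K, T)


def link_index(i, j, l):
    """Zero-based link index k for cell i, TX direction j, RX direction l.

    Assumes the row-major (C order) flattening used by reshape above.
    """
    return (i * N_TX + j) * N_RX + l


k = link_index(1, 2, 0)
print(gamma.shape, k)
```

With this ordering, `gamma[link_index(i, j, l), t]` returns the same value as `gamma4[i, j, l, t]`, so the merged matrix loses no information.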
We will often train on multiple trajectories where each trajectory is some route of the UE experiencing some blockage. In this case, we denote the SNR tensor for the $n$-th trajectory as $\gamma_n(k, t)$. A trajectory consists of traces of SNRs on all beam-pairs at all gNBs for $T$ time steps (refer to Section V-B for the exact definition). We consider predictors of the form

$$\gamma_p(:, t) = P[\gamma(:, t-M:t)], \qquad (1)$$

where $P[\cdot]$ is the prediction function, and we have used the python-like$^{6}$ notation to indicate that the predictor depends on all $K$ links from the previous $M$ time samples. The output is a prediction of all $K$ links. The predictor can be a simple linear one such as a moving average, or it can be more complex such as an LSTM or GRU. Given training data from the training trajectories, $\gamma_n(k, t)$, $n = 1, \ldots, N$, the predictors will be trained with the standard mean squared error (MSE) loss (2), as defined in [37], that is, the squared difference between $\gamma_n(k, t)$ and its prediction averaged over links, time steps and trajectories. The proposed ML-based technique of minimizing the loss in (2) has no theoretical guarantees, which tends to be the case in most ML works. Therefore, we adopt the methodology followed by other ML-aided wireless works [18], [38] and instead rely on developing good training and testing data sets. Moreover, even for classic algorithms, theoretical guarantees are typically only given for simplified versions of the problems [18]. For complex problems, validation on data is done to demonstrate the effectiveness of the proposed scheme, which is consistent with our approach of testing. For our work, we will use several metrics in test. For example, given a test trajectory, $\gamma(k, t)$, we can evaluate the root mean squared error (RMSE) between the true and predicted SNRs over all links and time steps. However, the RMSE loss, while good for training, is not necessarily representative of performance. We thus consider two other test metrics: the Top C link(s) prediction error and beam misalignment error.

$^{6}$ In python, indexing of a vector v using v[a : b] means that we would like to obtain the values of the vector from index range [a, b). Note that the index b is not included in the final values. So in (1), we are trying to predict the SNR values at time t based on the previous M SNR values on all links. Another notation we use from python is the colon (:). If A is an m × n matrix, A[1, :] means that we want the data from row 1 and all columns of A. This idea can be similarly extended to the tensor γ.

This article has been accepted for publication in IEEE Transactions on Wireless Communications. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/TWC.2022.3183632 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
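The slicing convention used in the predictor definition, together with the MSE and RMSE losses, can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the helper names `make_windows`, `mse` and `rmse` are hypothetical, the losses average uniformly over all links and time steps, and the "repeat the last sample" rule merely stands in for a trained prediction function.

```python
import numpy as np


def make_windows(gamma, M):
    """Split a K x T SNR matrix into inputs and targets for one-step-ahead prediction.

    inputs[s]  = gamma[:, t-M:t]  (K x M, the M samples before time t)
    targets[s] = gamma[:, t]      (length K)
    for t = M, ..., T-1, with s = t - M.
    """
    K, T = gamma.shape
    inputs = np.stack([gamma[:, t - M:t] for t in range(M, T)])
    targets = gamma[:, M:].T
    return inputs, targets


def mse(true, pred):
    """Mean squared error, averaged uniformly over all entries (an assumption)."""
    return float(np.mean((true - pred) ** 2))


def rmse(true, pred):
    """Root mean squared error."""
    return float(np.sqrt(mse(true, pred)))


# Toy data: K = 3 links, T = 6 time steps.
gamma = np.arange(18, dtype=float).reshape(3, 6)
X, y = make_windows(gamma, M=2)

# Placeholder for P[.]: repeat the most recent sample of each link.
pred = X[:, :, -1]
print(X.shape, y.shape, mse(y, pred), rmse(y, pred))
```

Because the slice `t-M:t` excludes index `t`, the target at time `t` is never part of its own input window.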
Top C link(s) prediction error, $\epsilon_C$, captures the difference between the true $C$ links with maximum SNR(s) and the $C$ predicted links with maximum SNR(s) at time $t$. Specifically, consider a test trajectory, $\gamma(k, t)$. At time $t$, let $\bar{\gamma}(u, t)$ denote the values of $\gamma(k, t)$ sorted over $k$ in descending order. Hence, $\bar{\gamma}(u, t)$ represents the link qualities in sorted order. Similarly, for the predictions $\gamma_p(k, t)$, we define $\bar{\gamma}_p(u, t)$. The top $C$ RMSE is then defined as the root mean squared difference between $\bar{\gamma}(u, t)$ and $\bar{\gamma}_p(u, t)$ over the first $C$ sorted positions, $u = 1, \ldots, C$. We will use $C \in \{1, 10\}$ in this work. This metric captures how well a predictor can predict the best $C$ links from the available links. This metric is similar to the Top-1 and Top-3 metrics discussed in [22]. The motivation for introducing this metric for $C = 1$ is that if a UE/gNB is tracking multiple links, it will always choose the link with the best quality (SNR in our case) to transmit/receive on so it can yield maximum gains during communication. One of the use cases that can be derived from this metric is proactive link rate adaptation, where the UE/gNB will adapt its transmission rate according to the predicted SNR. Likewise, a use case for $C = 10$ is proactive beam switching, i.e., if the best beam/link is predicted to be blocked, the UE/gNB could switch to any of the predicted unblocked links. Hence, $\epsilon_C$ will indicate the accuracy of the best $C$ link predictions.
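A minimal NumPy sketch of the Top C error follows. The function name `top_c_rmse` is hypothetical, and averaging the squared differences over the C sorted positions and over all time steps is an assumption about the normalisation. The toy values are invented for illustration.

```python
import numpy as np


def top_c_rmse(gamma_true, gamma_pred, C):
    """RMSE between the C largest true and the C largest predicted SNRs.

    gamma_true, gamma_pred: K x T arrays in dB. Each time step (column) is
    sorted in descending order over the link axis, so link indices are
    deliberately ignored by this metric.
    """
    true_sorted = np.sort(gamma_true, axis=0)[::-1][:C]
    pred_sorted = np.sort(gamma_pred, axis=0)[::-1][:C]
    return float(np.sqrt(np.mean((true_sorted - pred_sorted) ** 2)))


# Toy example with K = 3 links and T = 2 time steps.
gamma_true = np.array([[10.0, 0.0],
                       [5.0, 8.0],
                       [1.0, 3.0]])
gamma_pred = np.array([[4.0, 1.0],
                       [9.0, 6.0],
                       [2.0, 2.0]])

eps_1 = top_c_rmse(gamma_true, gamma_pred, C=1)
eps_2 = top_c_rmse(gamma_true, gamma_pred, C=2)
print(eps_1, eps_2)
```

In the first time step of this toy example the predicted best link differs from the true best link, yet the Top 1 error stays small, which is why a separate index-aware metric is also needed.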
Beam misalignment error $\xi$: The predictor must estimate not only the future best link quality but also the index of the best link. An instance might occur when a predictor predicts the best link quality accurately but mispredicts the index of the best link, which will result in misalignment loss in real time$^{7}$. So in order to capture how precise the best link index predictions are, we introduce $\xi$, which can be written as

$$\xi = \max_{k} \gamma(k, t) - \gamma(k_p(t), t), \qquad (4)$$

where $k_p(t)$ is the best predicted index from the estimated SNRs and can be written as

$$k_p(t) = \arg\max_{k} \gamma_p(k, t).$$

A similar metric was used in [18]. The beam misalignment error is the difference between the true SNR on the true best beam index (maximum true SNR) and the true SNR on the best predicted beam index. The beam misalignment error we define in (4) is measured in dB, not to be confused with degrees, which is another measure of beam misalignment. Measuring loss in degrees might have different implications depending on the communication architecture, e.g., a sub-6 GHz system might be able to provide a decent throughput even with a large misalignment loss in degrees, while the same is not true for mmWave systems. Due to this inconsistency of measuring misalignment loss in degrees, we opt to measure it in dB instead. Another advantage of measuring loss in dB is that it can be translated directly into other system performance metrics like throughput.
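The misalignment error can be computed per time step as in the following NumPy sketch. The function name `beam_misalignment_db` and the toy SNR values are assumptions for illustration only.

```python
import numpy as np


def beam_misalignment_db(gamma_true, gamma_pred):
    """Per-time-step misalignment loss in dB.

    gamma_true, gamma_pred: K x T arrays in dB. For each time step, the
    loss is the true SNR of the true best link minus the true SNR of the
    link that the predictor ranks best.
    """
    k_pred = np.argmax(gamma_pred, axis=0)           # predicted best index per time step
    best_true = gamma_true.max(axis=0)               # maximum true SNR per time step
    achieved = gamma_true[k_pred, np.arange(gamma_true.shape[1])]
    return best_true - achieved


# Toy example with K = 3 links and T = 2 time steps.
gamma_true = np.array([[10.0, 0.0],
                       [5.0, 8.0],
                       [1.0, 3.0]])
gamma_pred = np.array([[4.0, 1.0],
                       [9.0, 6.0],
                       [2.0, 2.0]])

xi = beam_misalignment_db(gamma_true, gamma_pred)
frac_below_2db = float(np.mean(xi < 2.0))
print(xi, frac_below_2db)
```

The loss is never negative, since no link can exceed the true maximum, and it is zero whenever the predicted index matches the true best index.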

A. LSTM
In this work, we consider an LSTM [9], which is widely used for sequence-to-sequence prediction problems. LSTM is a natural choice for wireless tracking problems due to its ability to capture short-term dependencies (e.g., multipath fading) and long-term dependencies (e.g., shadowing and blocking). Although relatively old from a research viewpoint, LSTM networks have been successful in ML-aided wireless communications [1], [39]-[41]. Moreover, LSTM based predictors are less complex in terms of architecture design compared to state-of-the-art sequence predictors like self-attention based transformers (e.g., BERT). LSTM networks are designed to circumvent the vanishing gradient problem, which is prominent in RNNs. The aforementioned success and comparatively reduced complexity make LSTM networks a suitable choice for next generation ML-aided wireless communications. The standard LSTM operation with $q$ hidden units and $d$-dimensional input is governed by the set of equations (6a)-(6g), where $x(t) \in \mathbb{R}^d$ is the input vector to the LSTM unit, $i(t) \in \mathbb{R}^q$ is the input gate, $f(t) \in \mathbb{R}^q$ is the forget gate, $o(t) \in \mathbb{R}^q$ is the output gate, $h(t) \in \mathbb{R}^q$ is the hidden state vector, $s(t) \in \mathbb{R}^q$ is the cell state vector and $g(t) \in \mathbb{R}^q$ is the cell input activation vector. The weight matrices and bias vectors of the respective gates need to be learned during the LSTM training. The input gate decides whether the current incoming data is contributing new information to the network. The forget gate flushes out unwanted data from the memory. The output gate dictates what to show at the network output [10], [42]. The cell state keeps track of the memory of the unit, which includes both short-term and long-term memories. The hidden state vector is eventually used to predict the output variable. The hidden state can extract short-term, long-term or both types of memory stored in the cell state to make the prediction. The last equation (6g) represents a fully connected NN that takes the predicted hidden state vector from the LSTM as an input and maps it to a $d$-dimensional output $z(t)$. Hence during training, the network also needs to learn the weight matrix $W_{zh} \in \mathbb{R}^{d \times q}$ and bias vector $b_z \in \mathbb{R}^d$. In the above equations, $\odot$ represents element-wise multiplication, and $\sigma$ and ReLU represent the sigmoid and rectified linear unit activation functions, respectively. These are given by $\sigma(x) = 1/(1 + e^{-x})$ and $\mathrm{ReLU}(x) = \max\{0, x\}$.
A visualization of an LSTM cell unrolling in time with all the aforementioned parameters can be found in [10].
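For readers who prefer code to gate diagrams, one LSTM time step followed by the fully connected output layer can be sketched in NumPy as below. This is a generic sketch, not the exact equation set of this paper: the parameter names are illustrative, the use of ReLU for the cell input activation and for the hidden state (in place of the more common tanh) is an assumption based on the text naming only sigmoid and ReLU, and the output layer is assumed to be linear.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(0.0, x)


def init_lstm(d, q, rng):
    """Random parameters for q hidden units and d-dimensional input/output.

    Parameter names are illustrative only.
    """
    p = {}
    for gate in ("i", "f", "o", "g"):
        p["W_" + gate + "x"] = 0.1 * rng.standard_normal((q, d))
        p["W_" + gate + "h"] = 0.1 * rng.standard_normal((q, q))
        p["b_" + gate] = np.zeros(q)
    p["W_zh"] = 0.1 * rng.standard_normal((d, q))  # fully connected output layer
    p["b_z"] = np.zeros(d)
    return p


def lstm_step(p, x, h_prev, s_prev):
    """One time step: gates, cell state, hidden state and output z(t)."""
    i = sigmoid(p["W_ix"] @ x + p["W_ih"] @ h_prev + p["b_i"])  # input gate
    f = sigmoid(p["W_fx"] @ x + p["W_fh"] @ h_prev + p["b_f"])  # forget gate
    o = sigmoid(p["W_ox"] @ x + p["W_oh"] @ h_prev + p["b_o"])  # output gate
    g = relu(p["W_gx"] @ x + p["W_gh"] @ h_prev + p["b_g"])     # assumed ReLU activation
    s = f * s_prev + i * g                                      # cell state update
    h = o * relu(s)                                             # assumed ReLU activation
    z = p["W_zh"] @ h + p["b_z"]                                # assumed linear output
    return z, h, s


d, q, M = 6, 4, 3
rng = np.random.default_rng(0)
params = init_lstm(d, q, rng)

h, s = np.zeros(q), np.zeros(q)
for x in rng.standard_normal((M, d)):  # unfold over M time steps
    z, h, s = lstm_step(params, x, h, s)

n_params = sum(v.size for v in params.values())
print(z.shape, n_params)
```

Counting the arrays in this sketch gives 4(qd + q² + q) parameters for the four gate blocks plus qd + d for the output layer, which shows how the parameter count grows with the input dimension d.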

B. Dimensionality reduction via auto-encoding
The LSTM outputs will be the predicted link qualities one time step in the future:

$$z(t) = \gamma_p(:, t).$$

For the inputs, we could use the raw measured SNR values$^{8}$:

$$x(t) = \gamma(:, t). \qquad (9)$$

The total number of parameters, $L_{\mathrm{LSTM}}$, of an LSTM network with $q$ hidden units and $d$-dimensional input grows with both $q$ and $d$. Using the raw SNR values (9) as inputs, the input and output dimensions would be $d = K = N_{\mathrm{TX}} N_{\mathrm{RX}} N_{\mathrm{gNB}}$. As we will see in Section VI-B, this number can be prohibitively large, therefore requiring a large number of LSTM parameters. The large number of parameters increases the generalization error and inference complexity. Thus, we also consider employing a dimensionality reduction with an encoder $\Phi(\cdot)$ and a decoder $\Psi(\cdot)$. The encoder transforms the $K$-dimensional SNR data in each time window $M$ to some lower dimension $d' \le K$ before it is sent to the LSTM. The LSTM predicts $x_p(t)$ one time step ahead and the decoder converts the predictions back to SNRs. The classic dimensionality reduction method is PCA, which can be trained on the set of SNR values $\gamma_n(k, t)$ over the training trajectories $n$ and times $t$. In this case, the LSTM predicts $x_p(t)$ and a decoder provides us with the predicted SNRs $\gamma_p(k, t)$. We call this method LSTM-PCA. We will quantify the performance of dimensionality reduction methods using the RMSE metric $\upsilon$ defined in (13) and the dimensionality reduction factor $\kappa$ defined in (14). Both metrics characterize the performance of a dimensionality reduction technique. The limitation of PCA is that it only performs linear dimensionality reduction: it is essentially a projection from the $d$-dimensional space to a lower $d'$-dimensional space. We thus consider an auto-encoder based approach.

$^{8}$ We use python-like notation here as well.
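The PCA encoder and decoder pair can be sketched with a plain NumPy singular value decomposition. This is an illustrative sketch under assumptions: the function names are hypothetical, the synthetic samples stand in for SNR vectors and lie close to a 3-dimensional subspace by construction, and the reconstruction RMSE computed at the end is one plausible way to score a dimensionality reduction method, not necessarily the exact metric used in this paper.

```python
import numpy as np


def fit_pca(X, d_prime):
    """Fit PCA on X (samples x K); return the mean (K,) and basis (K x d_prime)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:d_prime].T


def encode(X, mean, basis):
    """Linear encoder: project centered samples onto the principal directions."""
    return (X - mean) @ basis


def decode(Z, mean, basis):
    """Linear decoder: map low-dimensional codes back to K dimensions."""
    return Z @ basis.T + mean


rng = np.random.default_rng(0)
K, d_prime, n = 20, 3, 200

# Synthetic stand-in for SNR vectors: low-rank structure plus small noise.
latent = rng.standard_normal((n, d_prime))
mixing = rng.standard_normal((d_prime, K))
X = latent @ mixing + 0.01 * rng.standard_normal((n, K))

mean, basis = fit_pca(X, d_prime)
Z = encode(X, mean, basis)
X_hat = decode(Z, mean, basis)
recon_rmse = float(np.sqrt(np.mean((X - X_hat) ** 2)))
print(Z.shape, recon_rmse)
```

Both maps are linear, so data that lies on a curved low-dimensional manifold cannot be compressed as well by this pair, which is the limitation that motivates a nonlinear auto-encoder.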
Choice of auto-encoders: To address the dimensionality reduction, we need to choose an auto-encoder that best serves our purpose. We choose undercomplete auto-encoders [43], which compress (encode) large dimensional input data into lower dimensional signals (bottleneck). These signals are then used to recreate the original data$^{9}$. We use undercomplete auto-encoders consisting of convolutional neural network (CNN) layers, which are hence termed convolutional auto-encoders (CAEs). In CAEs, the encoding function $\Phi(\cdot)$ is realized as a CNN. In addition, we train a decoder network $\Psi(\cdot)$ that maps the low-dimensional $x(t)$ back to the original space. Several loss functions are possible, and in this case, we use the standard MSE loss between the original $\gamma_n(k, t)$ and their reconstruction. See [44] for an example. We integrate the designed auto-encoder with the LSTM and call this scheme LSTM-AC. The dimensionality of the hidden states $d'$ as well as the encoder and decoder architectures are parameters in the network. We discuss their selection and design in Section VI-A.
Regardless of the dimensionality reduction method used, the network is trained on the one-step ahead MSE prediction loss (2). In training, we use $M$ time steps of input, $x(t-M), \ldots, x(t-1)$, to generate each $z(t)$. The parameter $M$, indicating the memory of the network, dictates the number of time steps over which the LSTM network unfolds. $M$ is another parameter that needs to be tuned, and its value can be found in Section VI-B. The final design of the proposed LSTM-AC architecture is shown in Fig. 2.

C. Other ML-based NN predictors
In addition to LSTM-PCA and LSTM-AC, we also test the performance of some additional ML-based NN predictors to investigate how ML-based solutions perform for the given problem. These predictors are Vanilla RNN, Transformer and CNN. The Vanilla RNN, which is a simple predecessor of the LSTM, is tested to see how much gain the LSTM provides over simple RNNs. The Transformer-based predictor (used extensively in NLP) is much more complex than the LSTM because it uses self-attention$^{10}$, and is probed to observe how complex predictors perform. CNNs are explored to note how generic NN architectures, which are not designed for time series, perform. All the aforementioned predictors will use an auto-encoder in their architecture, since networks with large dimensional inputs are hard to train and provide poor generalization performance. The complexity, design and training of these ML-based predictors are discussed in Section VI-B.

D. Baseline linear predictors
We will also compare the prediction performance of the LSTM-based predictors to simple baseline linear predictors. The first is a simple moving average, which takes the average of the previous $M$ time steps:

$$\gamma_p(k, t) = \frac{1}{M} \sum_{m=1}^{M} \gamma(k, t-m).$$

The parameter $M$ can be optimized in the training phase. A more general estimator is a linear estimator, which takes a linear combination of the links in previous times:

$$\gamma_p(k, t) = \sum_{v=1}^{K} \sum_{m=1}^{M} W_{k,v,m}\, \gamma(v, t-m).$$

We allow the predicted link $k$ to depend on all measured links $v$. The weights in the model, $W_{k,v,m}$, can be learned by minimizing the mean squared loss. Similar to the moving average, $M$ is a parameter that needs to be optimized for the linear estimator.

$^{10}$ Transformers are built from self-attention and feed-forward layers rather than recurrent units.
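Both baselines can be sketched in a few lines of NumPy. The function names are hypothetical, the linear weights are fitted by ordinary least squares (one way to minimize the mean squared loss), and the toy trace with a linear trend is invented for illustration. For brevity the sketch fits and evaluates on the same trace, whereas in practice the weights would be learned on training trajectories and evaluated on separate test trajectories.

```python
import numpy as np


def windows(gamma, M):
    """Stack the flattened K x M history gamma[:, t-M:t] for t = M, ..., T-1."""
    K, T = gamma.shape
    return np.stack([gamma[:, t - M:t].ravel() for t in range(M, T)])


def moving_average_predict(gamma, M):
    """Predict gamma[:, t] as the mean of the previous M samples of each link."""
    K, T = gamma.shape
    return np.stack([gamma[:, t - M:t].mean(axis=1) for t in range(M, T)], axis=1)


def fit_linear_predictor(gamma, M):
    """Least-squares weights W (K x K*M): every link may depend on all links."""
    A = windows(gamma, M)      # (T-M) x (K*M) history matrix
    B = gamma[:, M:].T         # (T-M) x K targets
    W_T, *_ = np.linalg.lstsq(A, B, rcond=None)
    return W_T.T


def linear_predict(W, gamma, M):
    """Apply the fitted weights to every history window; returns K x (T-M)."""
    return (windows(gamma, M) @ W.T).T


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Toy trace: K = 2 links whose SNR rises by 2 dB per time step.
t = np.arange(12, dtype=float)
gamma = np.stack([2.0 * t, 1.0 + 2.0 * t])
M = 2

target = gamma[:, M:]
ma_rmse = rmse(target, moving_average_predict(gamma, M))
W = fit_linear_predictor(gamma, M)
lin_rmse = rmse(target, linear_predict(W, gamma, M))
print(ma_rmse, lin_rmse)
```

On this trace the moving average lags the trend by a constant 3 dB, while the linear estimator can extrapolate the trend from two past samples, which illustrates why the more general linear estimator is the stronger baseline.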

A. Scenario/Layout
A vital step in testing the prediction capabilities of different predictors is to generate a realistic data set of SNRs. The data should ideally come from real-life measurements. However, measurements that include an exhaustive beam sweep of all the links between a UE and multiple gNBs are hard to obtain and are not currently available. We therefore adopt a ray-tracer based approach, which enables much larger volumes of data. The ray tracing is accurate in that it captures paths from propagation phenomena such as diffraction, reflection and transmission. We use the commercial ray tracer from Remcom called Wireless InSite [45], which has been widely used in research communities [24], [46] and has been verified through mmWave measurements [47], [48]. This ray-tracing package has also been widely used in many ML experiments [22], [28], [49], [50].
The first step in setting up the ray-tracer is to import the scenario layout. In our case, the scenario is downtown Rosslyn, Virginia 11. The layout consists of the building locations and dimensions in the area, together with the materials from which these buildings are made, so that propagation mechanisms such as reflection, refraction and penetration are accurately captured. Once the layout is imported into the ray-tracer, we place four gNBs (labeled BS in Fig. 3a) at some of the intersections in the city. The gNBs are approximately 200 m apart, translating to a cell radius of roughly 100 m, which is consistent with the 3GPP Urban Micro (UMi) scenario [51]. These gNBs need to be assigned parameters such as the carrier frequency f_c and the transmit bandwidth BW. The ray tracer also needs the total number of paths to consider from each gNB to each receiver point. We set this number to 20 in accordance with the 3GPP UMi NLOS scenario [51]. The ray tracer is configured to return paths with a maximum of 2 reflections, 1 transmission and 0 diffractions. As described in [52], [53], mmWave systems will mostly rely on reflections for multi-path propagation, which justifies the focus on reflections. Similarly, allowing 1 transmission means we only consider penetration of a signal through one obstacle 12. The main modes of signal propagation in our work are line of sight (LOS) paths and non-line of sight (NLOS) reflected paths. The main sources of reflections are the buildings and the terrain (ground). Once all the aforementioned parameters (listed in Table I) have been set, we place receiver points over the entire layout on a grid with 0.5 m spacing along both the x and y axes. We deploy isotropic antennas at the gNBs and the receiver points; adding beamforming on top of these traces is discussed in Section V-D. We then execute the ray tracing.
The ray-tracer output provides the following propagation information for each gNB at each receiver point: (1) the received power on all the paths, (2) the spatial information of these paths (e.g., path lengths and angles of arrival and departure) and (3) the temporal information of these paths (i.e., delays). This information is sufficient to start modeling link quality (SNRs). In the next section, we discuss the addition of mobility to the current scenario.

B. Mobility
As mentioned above, the ray-tracer provides all the received signal information at the points of the layout grid (spaced 0.5 m apart). The next step is to add mobility to the scenario. The goal is to mimic a vehicle (UE) moving downtown with the velocity given in Table IV (from [51]). We use the MATLAB Navigation Toolbox [54], which implements the rapidly-exploring random tree (RRT) algorithm [55], to achieve this goal. MATLAB enables us to control various aspects of mobility: (a) generating random routes for UEs, (b) handling the UE velocity along these routes and (c) preventing collisions with obstacles (buildings) along these routes.
We start by importing the obstacle layout from the ray tracer into MATLAB. This layout is converted into a binary occupancy grid in which the length and width of each grid square are set to 0.5 m. The binary occupancy grid assigns ones to the grid points where obstacles are present and zeros otherwise. At the beginning of each route, a starting point and an end point of the UE are sampled from the uniform distribution over the grid 13. Similarly, a random velocity drawn from the distributions in Table IV is assigned to the UE. To avoid collisions, we use the Navigation Toolbox [54] from MATLAB, which operates on the binary occupancy grid and ensures that the UE does not collide with any buildings during the course of its route. The UE continues to move until a total of T = 3000 samples spaced 20 ms apart (60 s for each trajectory, in accordance with the beam measurement periodicity from 3GPP [33]) are collected. We refer to these T = 3000 samples as a trajectory 14. A total of 200 trajectories (100 for training and 100 for testing) are generated. A generated trajectory with a binary occupancy grid is shown in Fig. 3b.

C. Hand blockage modeling
So far, the link quality generated from the simulation trajectories captures the effects of multi-path, mobility and blockage by buildings. To make our simulations more realistic, we also add hand blockage to the link quality. As mentioned in Section I, hand blockage is another impairment that has to be overcome in the mmWave regime. We use a linearly interpolated hand blockage model from [56], based on measurements at 28 GHz. The model depends on the angles of arrival of the different paths. Since we use a ray tracer for our simulations, all the spatial information needed to implement the hand blockage model is available. For orientation, we choose one randomly at the start of each trajectory with equal probability. A hand blockage event on any path is triggered if the azimuth (elevation) angle of arrival ζ (θ) falls in the range [ζ_1 ± χ/2] ([θ_1 ± η/2]). The range signifies the azimuth (elevation) angular spread. Both the azimuth and the elevation angle of arrival conditions need to be true for a blockage event to be initiated. These conditions are similar to what has been proposed by the 3GPP standard to model hand blockage [51]. The values of ζ_1, χ, θ_1 and η are listed in Table II. After triggering, the time dynamics of the blockage event are controlled by the random variables τ_d3dB, τ_r3dB and τ_Block. Here, τ_D is the total blockage event time, and τ_d3dB and τ_r3dB represent the time taken for the signal to decay or rise by 3 dB, respectively. These values (in ms) are generated upon triggering a blockage event and are provided in Table III.
The blockage event on a particular path is modeled from the parameters above using linear interpolation. We define two parameters that aid this interpolation: τ_decay is the time needed for the signal level to decay from the initial signal level to A dB, and τ_rise is the time needed to rise from A dB back to the normal signal level. The 3 in the denominator of their defining expressions arises because the transition is measured every 3 dB.
13 It is ensured that the UE does not start inside any of the buildings (through a binary occupancy grid).
14 Multiple routes might be generated during the trajectory until the required number of samples is collected. Multiple routes are connected together by their end/start points, i.e., the end point of the older route becomes the start point of the new route, ensuring continuity.
From these quantities, we obtain the total time of the blockage event and the time τ_constant during which the signal level remains constant at A dB within a blockage interval. We now have all the parameters required to represent the blockage event in time. The loss due to hand blockage, ρ(τ), at time sample τ is represented as a piece-wise linear function of τ. It should be noted that A < 0 since it measures loss. Fig. 4a shows a blockage event labeled with all the parameters mentioned above. Fig. 4b shows an instance in the trajectory where a link suffers from hand blockage. The figure shows how the SNR degrades by 10 dB in 100 ms due to hand blockage alone. Other factors in the simulation setup that contribute to this degradation (not shown in the figure) are blockage by buildings and fast fading (since the coherence time of the channel is small). Together, these factors may result in very frequent degradation of a link that is being tracked. This necessitates a good predictor that can accurately predict all links so that the gNB/UE can always track the best link.
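To make the piece-wise linear shape of a blockage event concrete, the sketch below evaluates a loss profile of this kind. The specific forms used here are assumptions chosen to match the verbal description: τ_decay = (|A|/3)·τ_d3dB, τ_rise = (|A|/3)·τ_r3dB, and τ_constant equal to the total event time minus the decay and rise times. The function name `blockage_loss_db` and its argument names are hypothetical.

```python
def blockage_loss_db(tau, A, tau_d3db, tau_r3db, tau_total):
    """Piece-wise linear hand blockage loss (in dB, <= 0) at time tau since the trigger.

    ASSUMED forms (not taken verbatim from the model in [56]):
      tau_decay    = |A| / 3 * tau_d3db
      tau_rise     = |A| / 3 * tau_r3db
      tau_constant = tau_total - tau_decay - tau_rise
    A is the blockage depth in dB and is negative because it measures loss.
    """
    depth = abs(A)
    tau_decay = depth / 3.0 * tau_d3db
    tau_rise = depth / 3.0 * tau_r3db
    tau_constant = tau_total - tau_decay - tau_rise
    if tau_constant < 0:
        raise ValueError("total event time is shorter than decay plus rise time")
    if tau < 0 or tau > tau_total:
        return 0.0                              # outside the blockage event
    if tau < tau_decay:
        return A * tau / tau_decay              # linear decay from 0 dB to A dB
    if tau <= tau_decay + tau_constant:
        return float(A)                         # flat bottom at A dB
    return A * (tau_total - tau) / tau_rise     # linear recovery back to 0 dB
```

With A = −15 dB, τ_d3dB = 10 ms, τ_r3dB = 20 ms and a 300 ms event, the profile under these assumptions decays for 50 ms, stays flat for 150 ms and recovers over 100 ms.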

D. Beamforming codebook design
A 4 × 2 uniform planar array (UPA) with λ/2 antenna spacing is assumed at the UE and an 8 × 8 UPA is assumed at the gNB. These sizes for 28 GHz are similar to past capacity analyses such as [34]. We assume two identical antenna arrays at the UE and at the gNB for full 360 degree coverage, as in practical devices [57] (i.e., one array covering the front hemisphere and the other covering the rear). Each gNB (UE) transmit (receive) direction j (l) is associated with a pair of beamforming vectors, denoted F_j for the gNB, whose two members correspond to the front and rear antenna arrays, respectively; at the UE, these are f_l^(1), f_l^(2) ∈ C^{N_RX}. We consider a simple beamforming codebook based on the steering vector of a UPA, such that the main lobes of the beam patterns cover the hemisphere, equally spaced in both azimuth and elevation. We refer the reader to [19] for the expressions of the beamforming vectors.
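For illustration, a steering-vector codebook for a UPA can be sketched as below. This is a generic textbook construction and not the exact codebook of [19]: the array orientation (elements in the y-z plane), the angle conventions, the choice of equally spaced angles strictly inside (−π/2, π/2), and the names `upa_steering_vector` and `build_codebook` are all assumptions.

```python
import numpy as np

def upa_steering_vector(n_rows, n_cols, azimuth, elevation, spacing=0.5):
    """Unit-norm steering vector of an n_rows x n_cols UPA.

    spacing is the element spacing in wavelengths (0.5 for lambda/2);
    angles are in radians. The array is ASSUMED to lie in the y-z plane.
    """
    rows = np.arange(n_rows)
    cols = np.arange(n_cols)
    # Per-element phase increments along the two array axes.
    phase_y = 2 * np.pi * spacing * np.sin(azimuth) * np.cos(elevation)
    phase_z = 2 * np.pi * spacing * np.sin(elevation)
    a_y = np.exp(1j * phase_y * cols)
    a_z = np.exp(1j * phase_z * rows)
    return np.kron(a_z, a_y) / np.sqrt(n_rows * n_cols)

def build_codebook(n_rows, n_cols, n_az, n_el):
    """Stack steering vectors on an equally spaced azimuth/elevation grid."""
    az = np.linspace(-np.pi / 2, np.pi / 2, n_az + 2)[1:-1]
    el = np.linspace(-np.pi / 2, np.pi / 2, n_el + 2)[1:-1]
    return np.stack([upa_steering_vector(n_rows, n_cols, a, e)
                     for e in el for a in az])
```

Each row of the returned array is one beamforming vector; applying it to a channel matrix gives the beamformed gain for that direction.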

E. SNR calculation
Given the rays for the respective paths from the ray tracer for a trajectory n, we compute the narrowband channel matrix for the i-th gNB 15, H_{i,n}(t). We apply the beamforming vectors to compute the SNR on each link, γ_n(i, j, l, t), expressed in dB as 10 log_10 of the beamformed received signal power relative to the noise power. In the noise term, k_B is Boltzmann's constant, BW denotes the system bandwidth, NF is the noise figure, and T_0 is the temperature. We flatten the i = 1, ..., N_gNB, j = 1, ..., N_TX and l = 1, ..., N_RX dimensions into a single index k = 1, ..., K, with K = N_TX N_RX N_gNB, as is often done in machine learning problems. The resulting γ_n(k, t) is limited to a range whose lower end is γ_lower, where η_lower = 0.2344 bps/Hz is the spectral efficiency offered by the lowest modulation and coding scheme (MCS 0) according to 3GPP NR standards [62]. Hence, γ_lower is the ideal lowest SNR at which a signal can be decoded. If the UE/gNB is not able to measure the SNR on a link (because of blockage or any other reason), it reports a value of γ_lower on that link. Similarly, η_upper = 7.4063 bps/Hz dictates the upper bound on the SNR, since there is no change in throughput beyond it. ∆ indicates how far the system operates from the Shannon capacity and is set to 3 dB [35].
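The flattening and range-limiting steps can be sketched as follows. The conversion from spectral efficiency to SNR uses the ideal Shannon relation γ = 10 log_10(2^η − 1), which is an assumption about how the bounds are obtained; whether and how ∆ enters the bounds is deliberately left out. The tensor layout, the use of NaN to mark unmeasurable links, and all names are hypothetical.

```python
import numpy as np

def shannon_snr_db(eta):
    """Ideal SNR (dB) at which the Shannon capacity equals eta bps/Hz (assumed mapping)."""
    return 10.0 * np.log10(2.0 ** eta - 1.0)

ETA_LOWER = 0.2344   # bps/Hz, lowest MCS (from the text)
ETA_UPPER = 7.4063   # bps/Hz, no throughput gain above this (from the text)
GAMMA_LOWER = shannon_snr_db(ETA_LOWER)
GAMMA_UPPER = shannon_snr_db(ETA_UPPER)

def flatten_and_clip(snr_db):
    """Flatten (N_gNB, N_TX, N_RX, T) SNRs to (K, T) and limit them to the SNR range.

    Links that could not be measured are ASSUMED to be marked with NaN;
    they are reported as GAMMA_LOWER, as described in the text.
    """
    snr_db = np.asarray(snr_db, dtype=float)
    T = snr_db.shape[-1]
    flat = snr_db.reshape(-1, T)                        # K = N_gNB * N_TX * N_RX rows
    flat = np.where(np.isnan(flat), GAMMA_LOWER, flat)  # unmeasurable links
    return np.clip(flat, GAMMA_LOWER, GAMMA_UPPER)
```

Under this assumed mapping, the two bounds come out at roughly −7.5 dB and 22.3 dB.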

A. Performance evaluation of auto-encoders and PCA
The encoders and decoders of the auto-encoder are designed to reduce the dimensions of the links from K to d′_AC and then back to K. We use 50% of the SNR trajectories for training (N_train = 100; total number of training samples = N_train × T). The depth (number of layers) and width (hidden units) contribute to the number of auto-encoder parameters L_AC that are to be optimized. Too many parameters cause processing inefficiency, while too few result in information loss. The number of training epochs similarly affects the training time and over-/under-fitting of the data. We use cross-validation to roughly find the hyper-parameters that provide a good processing-accuracy trade-off. These parameters, along with the CAE architecture, are shown in Fig. 5. The proposed CAE takes an input tensor with dimensions N_train T × N_gNB × N_RX × N_TX (refer to Table IV for values), and the encoder returns a tensor of dimensions N_train T × N_gNB × N_RX × 8 17, reducing the input dimensions by a factor of 8. The tensor is then flattened into a d′_AC-dimensional vector. This flattening is necessary to make the CAE compatible with ML-based predictors, since next-step prediction happens over these flattened latent variables. For decoding, a reshape layer reshapes the flattened vector into a tensor of the dimensions mentioned above. The CAE decoder maps the compressed tensor back to the original SNR dimensions.
15 We consider narrowband since the primary synchronization signal (PSS), used for estimating link quality, is narrowband. Tracking based on wideband SNR is an interesting aspect to look at in the future.
16 Practically, this range is a function of the receiver sensitivity.
Following the discussion above, we reduce the dimensions of the SNR data from K = 2048 to d′_AC = 256. Similarly, we use PCA for dimensionality reduction over all the training trajectory SNRs. For PCA to reach the same order of accuracy as the auto-encoder (Table V), it needs more dimensions than the CAE. The number of parameters that need to be tuned for PCA also exceeds that of the CAE, i.e., L_PCA 18 > L_AC (from Table V). Although PCA has more trainable parameters than the CAE, it is easier to train because of its linear nature (e.g., using singular value decomposition). An ML-based predictor has to predict a 512-dimensional vector for PCA and a 256-dimensional vector for the auto-encoder. This difference in predictor input dimensions significantly increases the processing intensity of PCA-based predictors compared to CAE-based designs. We show this in the next sub-section, where LSTM-AC and LSTM-PCA are compared.
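The PCA encoder/decoder pair mentioned above can be obtained directly from a singular value decomposition. The following NumPy sketch illustrates this on generic data; the function names are hypothetical, and the rows of `X` stand for flattened K-dimensional SNR snapshots.

```python
import numpy as np

def pca_fit(X, d):
    """Fit PCA with d components via SVD. X has shape (num_samples, K)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:d]

def pca_encode(X, mean, components):
    """Project K-dimensional samples onto the d principal directions."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

def pca_decode(Z, mean, components):
    """Map d-dimensional codes back to the original K dimensions."""
    return np.asarray(Z, dtype=float) @ components + mean
```

Because both maps are linear, no iterative training is required; the price is that more dimensions may be needed to match the reconstruction accuracy of a non-linear auto-encoder.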

B. Training ML-based predictors
After creating an appropriate auto-encoder and PCA encoder/decoder set, we train LSTM-PCA, LSTM-AC, Vanilla RNN with AC, Transformer with AC and CNN with AC. The ML-based methods are trained over 100 training trajectories of encoded SNRs (of dimension d′_AC, obtained using the auto-encoder) to solve the one-time-step-ahead prediction problem using the MSE loss 19. LSTM-PCA is trained similarly over PCA-encoded SNRs (of dimension d′_PCA). The hyper-parameters for all these methods are found using cross-validation. The trainable parameter count (complexity) of these methods is listed in Table VI. We see that Vanilla RNN with AC is the most processing efficient, while LSTM-PCA is the most expensive. As discussed in Section IV-B, this complexity is due to the larger input dimensions for LSTM-PCA. We also see that Transformer with AC has a large number of trainable parameters, which can be attributed to the underlying self-attention mechanism of the network. Overall, we note that the RNN-based predictors that use auto-encoders (LSTM-AC and Vanilla RNN) are approximately 260% more processing efficient than LSTM-PCA. This processing efficiency justifies the use of auto-encoders for dimensionality reduction in the context of the multi-cell multi-beam prediction problem.
17 The CAE compresses the N_TX dimension because it contributes the most to the number of input dimensions.
This article has been accepted for publication in IEEE Transactions on Wireless Communications. This is the author's version which has not been fully edited and content may change prior to final publication.

C. Tuning parameters for baseline predictors
There is little room for tuning the baseline predictors except for the parameter M (the window size in this case). The training method for baseline predictors is to find a value of M that minimizes the losses in (3)-(4) 20. The values of M that provide a good trade-off between the two losses are obtained by a brute-force search over M ∈ {1, ..., 50} for all the N_train trajectories. For the moving average, the best window size turns out to be M_MVA = 14, while M_LR = 10 is the best for linear estimators. This tuning ensures that we compare the ML-based predictors against baseline predictors that are the best (at least locally) in their own domain.
19 All ML-based predictors have been trained for an equal number of epochs (20) so that the comparison is fair. Training times differ for each network based on their complexity.
20 Baseline predictors do not need any dimensionality reduction since they are already simple.
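The brute-force search over the window size can be sketched as below for the moving average. As a stand-in for the losses in (3)-(4), this sketch scores each candidate M by the one-step-ahead MSE, which is an assumption made to keep the example self-contained; the function names are hypothetical. All candidates are scored on the same time steps so that the comparison is not biased by window length.

```python
import numpy as np

def moving_average_mse(x, M, start):
    """One-step-ahead MSE of an M-step moving average, scored for t = start, ..., T-1."""
    x = np.asarray(x, dtype=float)
    preds = np.stack([x[t - M:t].mean(axis=0) for t in range(start, x.shape[0])])
    return float(np.mean((preds - x[start:]) ** 2))

def best_window(trajectories, candidates=range(1, 51)):
    """Return the candidate M with the lowest average loss, plus all losses."""
    candidates = list(candidates)
    start = max(candidates)   # every candidate is scored on the same samples
    losses = {
        M: float(np.mean([moving_average_mse(x, M, start) for x in trajectories]))
        for M in candidates
    }
    best_M = min(losses, key=losses.get)
    return best_M, losses
```

The same loop applies to the linear estimator by replacing the scoring function with a fit-and-predict step for each candidate M.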

A. Prediction performance metrics evaluation
In this sub-section, we present the generalization error analysis of all the predictors. As mentioned in Section I, we take a site-specific training approach, where a site comprises a group of gNBs and a UE in a particular environment. This training enables capturing useful correlations across time and across gNBs. The networks are trained over known trajectories within the site, as mentioned in Section VI. The predictor performance is measured over new trajectories (i.e., trajectories that the network has not seen before). This training and testing procedure is consistent with those widely used in the ML community. The generalization ability of a predictor is thus its prediction performance over new trajectories near the site. We test the predictors on N_test (= 100) 21 trajectories 22. We calculate the metrics ϵ_C and ξ (from Section III) for all test trajectories (a total of N_test points) and for all the predictors (ML-based and baseline). The performance comparison over all trajectories is captured in the form of a cumulative distribution function (CDF), F(·). The CDFs of ϵ_C and ξ for the different predictors are shown in Fig. 6 and are summarized in Table VII. (Table VII refers to (3) and (4); P(y) denotes the probability that event y occurred.)
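An empirical CDF over per-trajectory metric values, and the kind of summary read off it (the fraction of trajectories with error below a threshold), can be computed as in the following sketch. The function names and the example values are hypothetical.

```python
import numpy as np

def empirical_cdf(values):
    """Return sorted values v and F, where F[i] is the fraction of samples <= v[i]."""
    v = np.sort(np.asarray(values, dtype=float))
    F = np.arange(1, v.size + 1) / v.size
    return v, F

def fraction_below(values, threshold):
    """Empirical P(metric <= threshold) over the test trajectories."""
    return float(np.mean(np.asarray(values, dtype=float) <= threshold))

# Hypothetical per-trajectory errors in dB, for illustration only.
errors_db = [0.5, 1.0, 3.0, 1.5]
v, F = empirical_cdf(errors_db)
share_within_2db = fraction_below(errors_db, 2.0)
```

Plotting `F` against `v` for each predictor gives curves of the kind compared in the CDF figures.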
It can be observed from Table VII and Figs. 6a and 6b that LSTM-AC has the best prediction performance among the predictors tested. For example, LSTM-AC keeps the top 1 (top 10) prediction error below 2 dB 90% (94%) of the time, outperforming the transformer-based predictor by 10% (6%). This is an interesting observation because transformers generally outperform LSTMs, particularly in computer vision (CV) and NLP. The better performance of LSTMs here can be explained by the limited training data, the small number of training epochs, the limited transformer depth/width and the dependence of transformer complexity on M. In [63], [64], the authors show that LSTMs can outperform transformers when training data is scarce. Regarding training epochs, both architectures were trained for 20 epochs to make the comparison fair. The transformer might therefore not have trained enough, which impacts its prediction capacity. Additionally, we consider the simplest transformer design that converged (i.e., whose loss decreased during training). Even the simplest transformer with AC has 4.2 million parameters (400% more than LSTM-AC). Since one of the goals of this work is designing processing-efficient networks, we did not modify the width or depth of the transformer, and this restriction can result in worse prediction performance. Moreover, the complexity of the transformer increases with increasing memory (M), which is not the case for LSTM. This increased complexity is not justified in terms of the bias-variance trade-off (as compared to LSTM) and results in performance degradation 23.
Comparing dimensionality reduction techniques, we see that LSTM-AC has a 40% (13%) gain in top 1 (top 10) link prediction over LSTM-PCA. This gain is due to the better encoding/decoding performance of auto-encoders compared to PCA. LSTM-AC also outperforms the baseline linear predictors by 78% (86%). Overall, ML-based predictors perform better than linear predictors (which is expected), and auto-encoder-based predictors perform better than the PCA-based predictor. We can also observe from Table VII and Fig. 6c that all ML-based predictors have similar misalignment error performance (between 94% and 98%), outperforming the baseline linear predictors by at least 82%. This means that these ML-based predictors successfully predict beam indices within 2 dB of the best beam at least 94% of the time. The takeaway of these analyses is that NN-based methods with auto-encoder pre-processing offer significantly better performance than standard pre-processing such as PCA or linear prediction. Among the NN-based predictors, LSTM-AC in particular provides a good trade-off between processing efficiency and performance.
23 These are only a few of the reasons. The final performance evaluation is a function of the data set, choice of loss functions, metrics, hyper-parameters, etc.

B. Applications
In this sub-section, we discuss some applications of multi-cell multi-beam tracking based on the network predictions. These applications are a byproduct of the prediction problem that minimizes the loss in (2), and hence address only a subset of the core prediction problem considered in this paper. The applications themselves may be solved using a relatively simpler approach if the problem is formulated specifically for the application 24. However, we look at these applications in the context of the problem formulated in this work: based on the M previous measurements, predict the next-time-slot SNRs on all beams from all cells. The applications we discuss are outage prediction and proactive beam switching.
1) Outage prediction: As mentioned in Section I-B, a lot of work using RNNs has been done explicitly for blockage prediction. Since we predict all the beam pairs from all the gNBs (using all predictors), we can predict blockages on any link. One extreme case of blockage is outage, i.e., all the available links are blocked and the UE/gNB goes into outage. Using this definition, we declare an outage when the maximum SNR over all the links falls below a threshold of γ_lower + ∆ dB. Mathematically, we define a true outage event as a binary variable B(t), with B(t) = 1 if max_k γ_n(k, t) < γ_lower + ∆ and B(t) = 0 otherwise. We similarly define a predicted outage event B_p(t) by applying the same test to the predicted SNRs. With these definitions, the following two metrics can be used to capture the outage prediction performance of the different predictors:
• Outage detection accuracy: When B(t) = B_p(t) = 1, an outage is correctly predicted. Outage detection accuracy is the ratio of correctly predicted outages to the total number of true outages over all trajectories.
• Outage false alarm ratio: When B(t) = 0 and B_p(t) = 1, an outage is predicted when there was none. This is the ratio of falsely predicted outages to the total number of predicted outages over all trajectories.
These metrics are shown in Fig. 7a for ∆ = 3 dB.
We can observe that NN-based techniques have more than 96% outage prediction accuracy, compared to 80% for the best baseline linear predictor. The higher prediction accuracy can be attributed to the ability of NN-based predictors to foresee sudden channel variations (the triggering of a hand blockage event or a blockage caused by a building). The false alarm ratio for all the policies except LSTM-PCA is less than 6%. Hence, the designed predictors are able to correctly detect outages 96% of the time while also keeping the false alarm rate low. LSTM-AC in particular delivers an outage accuracy of more than 96% with a false alarm rate of around 4% and is comparably processing efficient. The outage prediction capacity of these predictors can be used for various proactive purposes, e.g., a UE can turn off its radio frequency front end (RFFE) to save power when it senses an outage. Similarly, a gNB can allocate resources to different UEs more intelligently based on the predicted outages.
24 For example, the proactive beam switching discussed below can be formulated as a Markov decision process, centered around optimizing beam switching based on some observed action and state space.
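The two outage metrics can be computed from true and predicted SNR arrays as in the sketch below. It assumes SNRs stored as (K, T) arrays for one trajectory (several trajectories can be concatenated along the time axis), takes the threshold γ_lower + ∆ as a single number in dB, and returns NaN when a ratio is undefined; the function names are hypothetical.

```python
import numpy as np

def outage_flags(snr_db, threshold_db):
    """B(t) = True where the maximum SNR over all K links is below the threshold."""
    return np.max(np.asarray(snr_db, dtype=float), axis=0) < threshold_db

def outage_metrics(true_snr_db, pred_snr_db, threshold_db):
    """Return (detection accuracy, false alarm ratio) for (K, T) SNR arrays."""
    b = outage_flags(true_snr_db, threshold_db)      # true outages
    bp = outage_flags(pred_snr_db, threshold_db)     # predicted outages
    hits = np.sum(b & bp)                            # B = 1 and B_p = 1
    false_alarms = np.sum(~b & bp)                   # B = 0 and B_p = 1
    detection = hits / b.sum() if b.sum() else float("nan")
    false_alarm_ratio = false_alarms / bp.sum() if bp.sum() else float("nan")
    return float(detection), float(false_alarm_ratio)
```

Note that the detection accuracy is normalized by the number of true outages, whereas the false alarm ratio is normalized by the number of predicted outages, following the definitions above.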
2) Proactive beam switching: The designed predictors can also be used for proactive link switching. We assume that at the start of every test trajectory, the UE is served by the best available link (k_0). At time t, a beam/link needs to be switched if a better link (with greater SNR) is available compared to the serving link. For the proactive link switching application, we define two events: a successful proactive link switch and a false proactive link switch. A successful proactive link switch requires three conditions to hold, among them (27): γ_n(k^p_new, t) > γ_n(k_0, t), i.e., the new predicted link index k^p_new has a better SNR.
After a successful switch, the UE updates the best-serving link, k_0 = k^p_new. A false proactive link switch can be defined similarly. We present the successful and false link switch percentages in Fig. 7b 25. We see that ML-based predictors are successful in proactive beam switching 91% of the time, compared to around 80% for the linear estimator and the moving average. Moreover, the false alarm percentage of the ML-based predictors is less than 2%, compared to 10% (16%) for the linear estimator (moving average), meaning that these predictors not only predict beam switches accurately in advance but also keep the false beam switch rate low. LSTM-AC and Vanilla RNN are preferred in this case because of their low complexity. The superior performance of ML-based predictors can be explained by the ability of neural networks to predict multiple beams from multiple cells more accurately than the baseline linear predictors. This proactive beam switching can be translated into handovers if the switched beams are from different gNBs 26.

VIII. SUMMARY AND FUTURE WORK
Beam tracking is a fundamental challenge in all mmWave systems. In this work, we have proposed an auto-encoder integrated LSTM network for multi-cell multi-beam prediction. Auto-encoders reduce the input dimensionality of the predictor, a major problem in multi-cell multi-beam tracking scenarios, and thereby enable a processing-efficient design of accurate LSTM predictors. Notably, the method can track signals from multiple cell sites and is applicable to procedures including handover and carrier aggregation with multiple cells. The method was validated on detailed ray-tracing data. There is significant opportunity to build on this work. Most importantly, we have looked at narrowband measurements similar to what is obtained with reference signals in 5G NR. A key research direction is to predict the wideband channel characteristics from intermittent narrowband measurements. A second avenue of future research is to validate the work on larger training data sets. We have already accumulated ray tracing on five large cities in our work [30], and a similar campaign can be used here. Another direction for the future is tightening the assumption from "UE is able to measure all the links to predict all the links" to "UE is able to measure a subset of links to predict all the links". This new assumption gives rise to a new problem: how to choose the subset of links from which the predictors can extrapolate the link qualities of all the links.
25 Successful link switch percentage is the ratio of the sum of successful link switches to the total number of link switches needed. False alarm switch percentage is the ratio of the sum of false link switches to the total number of link switches predicted.
26 In general, handovers are much more expensive than beam switching, but if information about UEs is exchanged among the gNBs, handovers can be handled in a similar manner to beam switching.