Distributed Machine-Learning for Early HARQ Feedback Prediction in Cloud RANs

In this work, we propose novel HARQ prediction schemes for Cloud RANs (C-RANs) that use feedback over a rate-limited feedback channel (2 to 6 bits) from the Remote Radio Heads (RRHs) to predict at the User Equipment (UE) the decoding outcome at the BaseBand Unit (BBU) ahead of actual decoding. In particular, we propose a Dual Autoencoding 2-Stage Gaussian Mixture Model (DA2SGMM) that is trained in an end-to-end fashion over the whole C-RAN setup. Using realistic link-level simulations in the sub-THz band at 100 GHz, we show that the novel DA2SGMM HARQ prediction scheme clearly outperforms all other adapted and state-of-the-art schemes. The DA2SGMM shows superior performance in terms of blockage detection as well as HARQ prediction in the no-blockage and single-blockage cases. In particular, the DA2SGMM with 4-bit feedback achieves a throughput that is on average more than 200 % higher than that of its best alternative. Compared to regular HARQ, the DA2SGMM reduces the maximum transmission latency by more than 72.4 %, while maintaining more than 75 % of the throughput in the no-blockage scenario. In the single-blockage scenario, the DA2SGMM significantly increases the throughput for most of the evaluated Signal-to-Noise Ratios (SNRs) compared to regular HARQ.


I. INTRODUCTION
THE emergence of new services, such as Vehicle-To-Everything (V2X), Virtual Reality (VR), Ultra-Reliable Low Latency Communication (URLLC) and many more, has increased the need for higher data rates and extremely low latencies. This has directed the interest of mobile communication standards toward new and higher frequency bands. The 3rd Generation Partnership Project (3GPP), the standardization body for the Fifth Generation (5G), has recently finished a new work item targeting frequencies up to 71 GHz for access [1]. In particular, recent advances in hardware have paved the way for using these bands. The sub-THz and THz bands, which reach from 100 GHz up to 3 THz, are now in the focus for beyond-5G technologies [2], [3]. However, high-frequency bands have the disadvantage of depending strongly on an unobstructed Line-Of-Sight (LOS) path and of having significantly shorter channel coherence times, which require a higher control signaling overhead due to more frequent channel measurements [2]. The latter in particular poses a bottleneck for the Channel State Information at the Transmitter (CSIT), which arrives with a delay [4]. CSIT is essential to estimate the appropriate transmission parameters, such as the Modulation and Coding Scheme (MCS), precoding, etc. Especially for highly mobile User Equipments (UEs), such as cars or trains, the CSIT is already outdated when it becomes available at the transmitter. Although exploiting geometrical properties of the environment and employing Machine Learning (ML) enables predicting the CSIT over larger time windows [4], the fast fading behavior of the channel may still make the channel estimation inaccurate.
To cope with inaccurate CSIT, physical layer retransmission mechanisms, such as Hybrid Automatic Repeat reQuest (HARQ), are used. However, regular HARQ, also known as reactive HARQ, increases the end-to-end latency because the transmitter requires feedback from the receiver after each transmission round in the form of an ACKnowledgment (ACK) or Non-ACKnowledgment (NACK). Especially for 5G URLLC use cases with end-to-end latency requirements down to 1 ms [5], reactive HARQ poses a limitation. For Sixth Generation (6G) use cases, where end-to-end latency requirements even down to 100 µs are foreseen [6], this becomes even more of an issue. This drawback is compensated by proactive HARQ, which continuously transmits further retransmissions until an ACK is received [7], [8]. Proactive HARQ combines high reliability with extremely short latencies [9]. Nevertheless, it trades these advantages for a degraded spectral efficiency due to unnecessary retransmissions [10].
The dependence on the LOS path also poses a major issue for reliable communication, as any obstruction by an object causes a severe degradation of the channel quality. As a remedy, Cloud Radio Access Network (C-RAN) architectures with multiple reception points at different locations, i.e. Remote Radio Heads (RRHs), are foreseen for sub-THz and THz communications [11]. The BaseBand Unit (BBU), which is responsible for higher layer processing, decodes the packet by combining all received signals from the different RRHs. However, in the context of C-RAN architectures, the aforementioned drawbacks of reactive HARQ and proactive HARQ become even more critical due to the significantly larger feedback delay [12]. In particular, a fronthaul latency of up to 250 µs is assumed [13]. Hence, many papers have studied ways of reducing the feedback delay using prediction mechanisms [12], [14], [15]. For architectures with a single reception point, different HARQ feedback prediction methods exist [10], [12], [14]-[25]. In contrast, for architectures with multiple reception points, i.e. C-RANs, we are only aware of the Signal-to-Noise Ratio (SNR)-based HARQ schemes proposed by Khalili and Simeone in [14] and Makki et al. in [15]. Other prediction mechanisms, such as Log-Likelihood Ratio (LLR)-based and subcode-based approaches, [10] and [20]-[25], have not yet been adapted to C-RAN architectures. Current state-of-the-art designs assume that the predictor has full knowledge of the prediction features. However, in C-RAN architectures, the RRHs only have partial knowledge and, further, the feedback channels to the UE are rate-limited. Hence, C-RAN architectures require HARQ prediction schemes that consider the locality of the information. Due to the rate limitation of the feedback channels, schemes also have to develop efficient representations of the local feedback and rules on how to combine them. Furthermore, even for the SNR-based approach, no evaluation using realistic link-level simulations in the C-RAN context, in particular considering blockage, is available. Against this background, the contributions of this paper are summarized in the following:
• To address the HARQ prediction problem in C-RANs in a holistic manner, we present a novel Dual Autoencoding 2-Stage Gaussian Mixture Model (DA2SGMM) that exploits subcode features as well as channel estimation features. Furthermore, to reduce the dimensionality of the input features, we propose, as an extension over [25] and [10], a subcarrier-based averaging of the LLRs for the DA2SGMM.
• To enable the application of state-of-the-art feedback prediction mechanisms for single reception points in C-RANs, we propose a distributed HARQ prediction setup with quantization of the feedback and a combining rule at the UE. Within this setup, we develop a distributed Logistic Regression on Log-Likelihood Ratios (LR-LLR) based on [23].
• Finally, we compare all schemes in the context of our HARQ system evaluation methodology using realistic link-level simulations. In particular, we also consider the single-blockage case, where one RRH is blocked. We show that the DA2SGMM clearly outperforms all other schemes in all experiments.

A. Related work on HARQ feedback prediction
As mentioned in the previous section, different HARQ prediction schemes have been proposed and studied in the literature. The variety of schemes ranges from simple thresholding, e.g. [12], up to complex machine learning schemes, e.g. [22] and [25]. In particular, the latter have gained interest recently, also in the context of the new Rel. 18 standardization [26]. In the following, we classify these schemes into three categories:
1) Channel-estimation-based feedback prediction: In [15], Makki et al. investigated a mixture of proactive and reactive HARQ protocols to reduce the expected latency. In the proposed scheme, the receiver accumulates the received signal until the sum channel gain, which is estimated over quantization regions, exceeds a certain threshold associated with a sufficiently high decoding probability. In case of a negative prediction, i.e. the transmitted redundancy is not sufficient for successful decoding, the receiver switches to a reactive HARQ approach. In [12], Rost and Prasad and, in [14], Khalili and Simeone put the channel-estimation-based prediction schemes into the context of C-RANs and showed the benefits of early feedback in C-RANs with non-ideal backhaul. Besides that, in [19], AlMarshed et al. proposed a Deep ML scheme that uses the received complex signal to estimate the decodability of a packet. Although it differs from the previously described schemes, we categorize the Deep ML scheme as a channel-estimation-based approach because it does not involve the computation of LLRs or any other channel-code-aware features.
2) LLR-based feedback prediction: In [20] and [21], Berardinelli et al. used a Bit Error Rate (BER) estimate based on LLRs to predict the decoding outcome ahead of the actual decoding. They empirically computed a threshold for the BER estimate to predict the decodability. In contrast to the channel-estimation-based schemes, the LLRs inherently contain a reduced form of the channel estimates. Nevertheless, different from the first category, the LLR-based schemes consider the whole received data signal for the prediction instead of relying only on pilots used for the channel estimation. As an improvement over the simple thresholding that was used by Berardinelli et al., Hummert et al. proposed in [22] to use a neural network, designated as NN ForeCast, that can mimic the decoder. A hybrid pathway was proposed by AlMarshed et al. in [23]. They combined both LLR- and channel-estimation-based features using a logistic regression. The proposed logistic regression showed a significant enhancement over other approaches that use only one of the two feature types.
3) Subcode-based feedback prediction: In [10], [24] and [25], the authors proposed a feedback prediction mechanism that observes the partial decoding behavior of so-called subcodes. The subcodes reflect dependencies between received symbols arising from the structure of the channel code. Similar to the LLR-based feedback prediction, this approach uses the LLRs as a basis. However, in contrast to the LLR-based schemes, subcode-based schemes apply additional processing to the LLRs based on the knowledge of the code structure. In [24], Göktepe et al. empirically determined thresholds for these code-aware features. As an improvement over this thresholding approach, in [25], the authors applied machine learning techniques, i.e. logistic regression, random forests, isolation forests and supervised autoencoders, to the LLR and subcode features. Especially the logistic regression and the supervised autoencoder have proven to be fruitful approaches to enhance the feedback prediction.
Apart from the previously described HARQ prediction schemes that are mainly designed for single reception points, [12], [14] and [15] presented similar approaches for implementing the channel-estimation-based feedback prediction in C-RANs. The works proposed collecting the channel gains and the SNRs over the received Redundancy Versions (RVs), respectively. In [12], Rost and Prasad applied Gallager's error exponent E_r to convert the channel estimate into an estimated error probability, as this metric is easier to work with (see [27] for more details). The conversion depends on the average SNR γ, the code rate R, the code length N, and the cut-off rate R_0 associated with the SNR. After calculating Gallager's error exponent locally at the RRHs, they proposed applying a threshold to generate a positive or negative feedback. However, they ignored the case of multiple RRHs receiving the same packet. This case was investigated by Khalili and Simeone in [14]. They proposed using a vector quantization, as described in [28], to compress the channel state at the RRHs. Following the compression, the RRHs transmit this compressed feedback over a rate-limited feedback channel to the UE, where a joint feedback is calculated. Furthermore, they also analyzed the impact of the feedback quantization on the system performance.
Notation: Throughout the paper, we use C to denote the set of complex numbers and N the set of natural numbers. Furthermore, C^n, n ∈ N, denotes the n-dimensional complex vector space. Bold letters are used to indicate vectors, while bold capital letters are used to indicate matrices. Random variables are denoted by capital letters, where random matrices are further highlighted by bold font. E[•] is the expected value. diag(•) : C^n → C^(n×n), n ∈ N, denotes the mapping of an n-dimensional complex vector to an n × n complex diagonal matrix whose diagonal elements are the components of the vector. N(µ, σ^2) designates the normal distribution with mean µ and variance σ^2, and CN(µ, σ^2) designates its circularly-symmetric complex counterpart. Given n modulation symbols, the channel relates the random vector X representing the transmitted signal to the received signal random vectors Y^(1) and Y^(2) as $Y^{(i)} = H^{(i)} X + Z^{(i)}$, i = 1, 2, where H^(i) are random fading matrices and Z^(i) are random noise vectors with each element distributed according to CN(0, 1). In our link-level simulations, we assume a spatially filtered Clustered Delay Line (CDL) channel model [29]. The assumed channel model, which is explained in more detail in Sec. III-G, can be modeled as complex channel gains that distort each symbol individually, with additive normally distributed noise on top.
To determine the final decoding outcome, the BBU combines and jointly decodes the received signal vectors Y^(1) and Y^(2) from both RRHs. The RRHs, on the other hand, calculate the feedback T^(i), i = 1, 2, which is a map into S, where S is the sample space of the feedback. T^(i) is specified differently depending on the prediction scheme. As we assume binary communication, the feedback sample space reduces to $S := \{0, 1\}^{N_b}$, where N_b is the number of bits used for the feedback transmission. Finally, the UE applies a combination rule F : S × S → {ACK, NACK}, which leads to the corresponding UE behavior, i.e. stop transmitting or transmit more redundancy.

III. DISTRIBUTED EARLY HARQ STRATEGIES
In this paper, we consider early HARQ strategies that attempt to predict the decodability of a packet ahead of the actual decoding. In particular, we take only a part of the whole transmitted signal vector into account. In contrast to early HARQ strategies that use the whole signal vector, this approach allows for providing the feedback at an earlier stage, which is crucial especially for latency-constrained use cases. In the broadest sense, the decodability prediction can be interpreted as binary statistical hypothesis testing, where the early HARQ predictor tries to discriminate between two probability distributions P and Q: the probability distribution of decodables and the probability distribution of undecodables. The feedback maps T^(1) and T^(2) have to be chosen such that the two distributions become as distinguishable as possible. Ideally, these maps are sufficient statistics for the statistical hypothesis testing problem. However, in practice, depending on the type of feedback, it is notoriously difficult to characterize these distributions exactly and hence also to find sufficient feedback maps. In terms of the system model, the prediction is based on p modulation symbols with p < n. Hence, given proj_p Y^(1) and proj_p Y^(2), the binary hypothesis testing task at the UE is to decide between the distributions of these observations under D = ACK and under D = NACK, where D ∈ {ACK, NACK} represents the decoding outcome at the BBU with n modulation symbols available at the decoder, and proj_p : C^(k×p) × C^(k×(n−p)) → C^(k×p), p, k, n ∈ N, p < n, denotes the function that maps an element of the Cartesian product of two vector spaces onto the first vector space.

A. Channel-estimation-based HARQ prediction (Q-SNR)
Channel estimation predictors focus on the estimated channel realization Ĥ^(i), i = 1, 2, at each RRH. The estimation is performed based on known parts of the transmitted signal, e.g. reference signals such as the Demodulation Reference Signal (DMRS). In the particular case of the paper's transmission model, one DMRS is located at the beginning of each RV. We use these DMRS to obtain the received SNRs, γ^(1) and γ^(2), at the respective RRHs.
For the channel-estimation-based prediction, we evaluate the scheme proposed in [15], which accumulates the quantized received SNRs from the RRHs at the UE and applies a threshold to the sum to predict an ACK or NACK. In particular, the UE compares the sum T^(1)_SNR(γ^(1)) + T^(2)_SNR(γ^(2)) against C_Q-SNR, where T^(i)_SNR(γ^(i)), i = 1, 2, are the quantized received SNRs from the respective RRHs and C_Q-SNR is a constant that controls the trade-off between false-positive and false-negative errors. In the constant power case, the received SNR is equivalent to the accumulated channel gain, which is used in [15]. Furthermore, we model the quantization by a quantization layer. We assume both quantization functions to be equal, i.e. T^(1)_SNR(γ) = T^(2)_SNR(γ) =: T_SNR(γ) for all γ, and further to be piece-wise constant functions. The constant intervals are chosen such that each interval contains approximately the same number of data points over the relevant SNR range, where the relevant range is determined by the minimum and maximum values of the training set. See "quantile" in [30] for more details. Then, T_SNR assigns the value of the interval center to each SNR that falls into that interval. This scheme is designated as Quantized Signal-to-Noise-Ratio (Q-SNR) in the following. In [14] and [15], the authors use the error probability approximation from [31, Eq. (59)] to estimate the failure probability. In particular, in [14], Khalili and Simeone apply a threshold to the estimated error probability. However, in our evaluated scenario, the Q-SNR scheme achieves the same performance as the scheme in [14] at a significantly lower complexity. Hence, we restrict ourselves to the Q-SNR scheme.
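The Q-SNR rule described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the bin edges are taken as nearest-rank empirical quantiles of a training set, the threshold comparison is assumed to be a strict ">", and all names (quantile_edges, quantize, q_snr_rule) as well as the sample values and the threshold are hypothetical and not taken from the evaluated implementation.

```python
from bisect import bisect_right

def quantile_edges(train_snrs, n_bits):
    """Equal-frequency bin edges over the training SNR range (2**n_bits bins)."""
    data = sorted(train_snrs)
    n_bins = 2 ** n_bits
    # Edge k is the k/n_bins empirical quantile (nearest rank, no interpolation).
    edges = [data[(k * len(data)) // n_bins] for k in range(n_bins)]
    return edges + [data[-1]]

def quantize(snr, edges):
    """Map an SNR to the center of its interval (values outside are clipped)."""
    idx = bisect_right(edges, snr) - 1
    idx = max(0, min(idx, len(edges) - 2))
    return 0.5 * (edges[idx] + edges[idx + 1])

def q_snr_rule(snr_1, snr_2, edges, threshold):
    """UE side: sum both quantized SNRs and compare against the threshold."""
    total = quantize(snr_1, edges) + quantize(snr_2, edges)
    return "ACK" if total > threshold else "NACK"

# Hypothetical training SNRs (in dB) and a 2-bit feedback channel.
edges = quantile_edges(range(16), n_bits=2)
decision = q_snr_rule(5.0, 13.0, edges, threshold=15.0)
```

Raising the threshold makes the rule more conservative, i.e. it trades false positives for false negatives, which is the role of the constant C_Q-SNR in the text.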

B. LLR-based HARQ prediction (LR-LLR)
The LLR-based approaches assume that each element x_i ∈ S ⊂ C, i = 1, 2, ..., n, of the transmitted signal vector x is i.i.d. and that each element of the symbol set S, which represents M bits, has the same probability. We are aware that this assumption does not hold in practice due to the channel code. Nevertheless, in the next section, we discuss a scheme that does not resort to this assumption. Using the i.i.d. assumption, the LLRs are calculated per bit from the received symbols, where r_q, q = 1, 2, ..., n, is the q-th received and equalized symbol and b_((q−1)M+j), j = 1, 2, ..., M, is the j-th bit in the q-th equalized symbol. This definition of the LLRs leads to the bit error probability in Eq. (7), which is a corrected version of the bit error probability in [20, Eq. (7)].
In order to reduce the high dimensionality of the p bit error estimates v_l, l = 1, ..., p, we average over all received bit error estimates, which yields the mean bit error estimate v^(i) at RRH i. We first apply a local logistic regression at each RRH that is fed with v^(i) and the received SNR γ^(i). We train the local logistic regression with the BBU decoding outcome as the ground truth. We apply ℓ2 regularization, see [32] for more details, and balanced class weights to the logistic regression using the liblinear solver from the scikit-learn package [33].
The local feedback function T^(i)_LLR, i = 1, 2, is given as the quantized output of the local logistic regression, where Q is the quantization function applied to that output and the parameter set of the logistic regression is learnt during training. The quantization function Q is determined analogously to the Q-SNR scheme.
The range between the minimum and maximum value from the training set is divided into uniform intervals, where each interval is assigned the value of its center. After having generated the local feedback T^(i)_LLR, we again use a logistic regression at the UE to learn the feedback combination rule F [34], with $P_{\mathrm{LLR}}(T^{(1)}_{\mathrm{LLR}}, T^{(2)}_{\mathrm{LLR}}) := \frac{\exp(\theta_{30} + \theta_3^T t)}{1 + \exp(\theta_{30} + \theta_3^T t)}$ and $t := (T^{(1)}_{\mathrm{LLR}}, T^{(2)}_{\mathrm{LLR}})^T$, where {θ_30 ∈ R, θ_3 ∈ R^2} is the learnt parameter set of the logistic regression. Then, the combination rule F_LLR(T^(1)_LLR, T^(2)_LLR) is defined by comparing P_LLR(T^(1)_LLR, T^(2)_LLR) against C_LLR, where C_LLR is an appropriately chosen constant. This scheme is referred to as LR-LLR.
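The processing chain of the LR-LLR scheme can be illustrated with a small Python sketch. It makes several assumptions that are not taken from the paper: the bit error estimate uses the textbook relation 1 / (1 + exp(|LLR|)) for the hard-decision error probability implied by an LLR (the paper's own Eq. (7) is not reproduced here), the quantizer uses uniform intervals with center reconstruction, and all function names and numeric values are hypothetical.

```python
import math

def bit_error_estimate(llr):
    """Hard-decision error probability implied by one LLR (textbook relation)."""
    # Clamp the magnitude so that math.exp cannot overflow for very large LLRs.
    return 1.0 / (1.0 + math.exp(min(abs(llr), 700.0)))

def mean_bit_error(llrs):
    """Average the per-bit error estimates into a single scalar feature."""
    return sum(bit_error_estimate(l) for l in llrs) / len(llrs)

def logistic(features, bias, weights):
    """Logistic regression output exp(z) / (1 + exp(z)), z = bias + w^T x."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def uniform_quantize(value, lo, hi, n_bits):
    """Uniform intervals on [lo, hi]; each value maps to its interval center."""
    n_bins = 2 ** n_bits
    width = (hi - lo) / n_bins
    idx = min(n_bins - 1, max(0, int((value - lo) / width)))
    return lo + (idx + 0.5) * width

# RRH side (hypothetical LLRs, SNR in dB, and parameters): feature -> local feedback.
v = mean_bit_error([4.2, -3.1, 0.4, -6.0])
local_feedback = uniform_quantize(logistic([v, 3.0], 0.0, [-8.0, 0.5]), 0.0, 1.0, 2)
```

At the UE, the same logistic function would be applied to the two local feedback values with a separate parameter set, followed by a comparison against the constant C_LLR.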

C. Subcode-based HARQ prediction
As with LLR-based schemes, subcode-based HARQ predictors take the LLRs as a basis. However, instead of assuming i.i.d. components of the signal vector X, the subcode-based prediction considers constraints defined by the parity-check matrix $P \in \mathbb{F}_2^{L \times nM}$. The relation between the bit vector $b := (b_{i,j}) \in \mathbb{F}_2^{nM}$, i = 1, 2, ..., n, j = 1, 2, ..., M, and the parity-check matrix is defined as P b^T = 0. Message passing decoders, such as the min-sum implementation, iteratively update the LLRs based on P: in the k-th iteration, the LLR of the bit b_l is updated using the check-node-to-variable-node messages δ_m,k of the check nodes m ∈ M(l), where M(l) is the set of check nodes that are associated with the bit b_l. In contrast to tree codes, where message passing decoders always converge to the best solution, for modern Low-Density Parity-Check (LDPC) codes, the evolution of the LLRs can be interpreted as a sequence which may or may not converge to a "degraded" marginalization [35]. Compared to considering only the received LLRs, this behavior provides additional information on the health of the received codeword.
1) Supervised dual autoencoding 2-stage Gaussian mixture model (DA2SGMM) for anomaly detection: In the machine learning literature, autoencoders are well established for unsupervised anomaly detection tasks [36]-[38] due to their unprecedented dimensionality reduction capabilities [39]. Also for HARQ prediction purposes, a supervised autoencoder proposed in [25] outperformed other machine learning techniques, such as logistic regression, random forests, and others. However, the approach in [25], which builds on the DAGMM architecture that was proposed for anomaly detection in [37], assumes a scenario with a single receive point. In this section, we extend this autoencoder to handle two separated RRHs. This novel approach, referred to as DA2SGMM, achieves a dimensionality reduction of the input features. Furthermore, we use independent classifiers at the RRHs to convert the compressed subcode features together with the received SNR features to a decodability feedback, which is afterwards combined at the UE classifier. In Fig. 4, we show the schematic design of the proposed DA2SGMM architecture. As can be seen, we incorporate the constraints of the architecture of the communication system directly into the setup of the DA2SGMM. The upper box represents the parts executed at the first RRH, the lower box the parts at the second RRH, and the small box on the right represents the network at the UE. We provide further details of the DA2SGMM setup and training in App. A. The subcode features are generated from a partial decoding process at the RRHs. In particular, the subcode features at RRH i are denoted by s^(i), i = 1, 2. In contrast to the previous schemes, the training is performed in an end-to-end manner. Hence, we do not have to distinguish the feedback T^(i) and the combination rule F. Instead, we can see the whole network, represented by P_DA2SGMM, as part of the combination rule F_DA2SGMM(s^(1), γ^(1), s^(2), γ^(2)), which compares the network output against C_DA2SGMM, where C_DA2SGMM again is a constant controlling the trade-off between false positives and false negatives.

D. Complexity comparison
The different HARQ prediction strategies come at different costs in terms of computations and memory. In particular, the required processing time, which results from the computational complexity, is critical for low-latency applications. In order to compare the different schemes, we assume that the SNR and LLR features themselves are available without any processing cost. In [40], the decoding latency of a flexible offset min-sum LDPC decoder is given as a function of the following parameters: N is the size of the codeword, d_v is the average variable node degree, Z is the lifting size of the code, i.e. 104, I is the number of performed iterations and f is the clock rate of the decoder. This decoder type is implementation-wise very similar to the optimized min-sum LDPC algorithm, which has been used for the simulations. Applying this to the used subcode and assuming a decoder frequency of 1 GHz, as motivated in [40], we obtain a decoding latency of 305 ns for the partial decoding that yields the subcode features from the LLRs. For the evaluation of the classifiers, we determine the amount of memory required to store the model parameters and input features and the number of elementary floating-point operations. Furthermore, to validate the estimates, we perform a processing time measurement of a single-threaded implementation of the schemes on an Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz processor. The LR-LLR scheme uses logistic regressions with 2 input features each. Hence, besides the features themselves, the parameter set of the logistic regression also has to be stored on the devices. This results in a memory consumption of C_mem,LR = 5. Furthermore, the computational complexity is given by C_comp,LR = 5 C_+,* + 1 C_/ + 1 C_exp, where C_+,* is the computational complexity of an elementary multiplication or addition, C_/ is the computational complexity of an elementary division and C_exp is the complexity of computing the exponential function. We use C_quant to designate the computational complexity of the quantization. Different from the logistic regression, the DA2SGMM scheme is built up of multiple Fully Connected (FC) layers, see App. A for more details. The overall memory consumption and the overall computational complexity of an FC layer depend on its input and output dimensions. The Softmax layer does not have any stored parameters and hence, C_mem,SM = 0. Its computational complexity is given by C_comp,SM = 2 C_+,* + 1 C_/ + 1 C_exp. Tab. II summarizes the overall complexity of all schemes. The Q-SNR scheme has the least memory consumption as well as the least computational complexity at all devices. Compared to that, the LR-LLR slightly increases the memory consumption and computational complexity at the RRHs. The DA2SGMM clearly has the highest memory consumption and computational complexity on all devices. The complexity of the elementary operations can be approximated by weights corresponding to the number of performed floating-point operations: C_+,* := 1, C_/ = 4 and C_exp = 8 [42]. For the quantization, we take PyTorch's FakeQuantize implementation as baseline [43]. Hence, the computational complexity of the quantization results in C_quant = 4 C_+,*. Tab. III shows the estimated processing times for a Raspberry Pi 3 processor and the results of actual processing time measurements on an Intel Xeon processor. In a practical implementation, the processing time may vary based on the capabilities of the processor platform, the latency of memory access, the efficiency of the implementation and many more factors. However, this impact also heavily depends on the actual implementation. This also explains discrepancies between the estimated processing times and the actual processing times. Notably, the quantization operation requires much more time on the Intel processor than estimated. In contrast to the DA2SGMM, which uses PyTorch also for the quantization, we use scikit-learn's KBinsDiscretizer for the quantization of the other schemes. Also, the implementation of the logistic regression is not optimized for the particular case and hence performs unnecessary double calculations. Furthermore, we do not consider memory access delays for our estimated processing times. However, with specialized hardware, such as GPUs or even TPUs, a significantly better performance is to be expected. In Tab. IV, we show a simple estimation of the processing times of the DA2SGMM on multi-core platforms. We assume that matrix multiplications resulting in an output vector, i.e. a linear layer, can easily be parallelized by computing each entry of the output vector separately. We observe that the processing time on the RRH scales almost linearly up to 16 cores. In contrast to that, on the UE, 16 cores only have a very small advantage compared to 8 cores. However, a UE processing time of 25.8 ns already is very small. Nevertheless, the processing times are at most in the low µs range on single-threaded platforms, which is extremely small even compared to stringent latency budgets, such as 1 ms and even 0.1 ms. In particular, specialized hardware, such as GPUs and TPUs, is widely used to run neural networks. Overall, our evaluation shows that the processing times are expected to be sufficiently small on practical platforms.
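As a worked example of the weighted operation counts used in this section, the following Python sketch evaluates the stated complexities with the weights C_+,* = 1, C_/ = 4 and C_exp = 8. The helper name weighted_cost and the dictionary layout are hypothetical; only the operation counts and weights come from the text.

```python
# Weights of the elementary operations as stated in the text.
WEIGHTS = {"addmul": 1, "div": 4, "exp": 8}

def weighted_cost(addmul=0, div=0, exp=0):
    """Weighted floating-point cost for the given numbers of operations."""
    return (addmul * WEIGHTS["addmul"]
            + div * WEIGHTS["div"]
            + exp * WEIGHTS["exp"])

cost_lr = weighted_cost(addmul=5, div=1, exp=1)       # logistic regression
cost_softmax = weighted_cost(addmul=2, div=1, exp=1)  # Softmax layer
cost_quant = weighted_cost(addmul=4)                  # quantization
```

With these weights, a single logistic regression costs 17 weighted operations, a Softmax layer 14 and a quantization 4, which shows that the exponential and the division dominate the cost of the lightweight schemes.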

E. Data transmission model
We assume an incremental redundancy HARQ protocol with up to four RVs. In our simulation setup, an RV spans 14 Orthogonal Frequency Division Multiplexing (OFDM) symbols in time, which is equivalent to 15.63 µs. The UE transmits the RVs in a consecutive manner, as depicted in Fig. 5. After receiving each RV, the RRHs provide feedback using the previously described prediction schemes. The UE decides based on a combination rule F whether further RVs are required or not. This approach enables a good trade-off between reliability and spectral efficiency since some RVs are omitted if an early decoding is successfully predicted. After having received all RVs from the UE, the RRHs forward the received signals to the BBU, where a single decoding attempt is conducted. In this work, we simulate four RVs; hence, there are three prediction points:
1) the first prediction point (Pos#1), which uses the received RV#0 to decide whether RV#2 is required or not,
2) the second prediction point (Pos#2), which uses RV#0 and RV#1 to decide whether RV#3 is required or not,
3) the blockage prediction point (Pos#3), which uses RV#0-2 to decide whether blockage is detected or not.
A positive prediction at the first prediction point causes the UE to stop transmitting further RVs. Hence, the second prediction point would not be reached in that case. However, this depends on the particular prediction scheme, and modeling this accurately would require incorporating the prediction directly into the link-level simulations, which would increase the complexity considerably. Instead, we assume that all prediction schemes correctly identify transmissions that are already decodable with the first RV and that only the remaining transmissions reach the second prediction point.
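The RV duration quoted above follows from the 5G NR numerology: with a subcarrier spacing of 15 kHz · 2^µ, a slot of 14 OFDM symbols lasts 1 ms / 2^µ. The Python sketch below reproduces this arithmetic for 960 kHz and, as an assumption about how the prediction points gate the transmission with a feedback delay of one RV, counts the RVs sent for given prediction outcomes. The function names are hypothetical.

```python
import math

def slot_duration_us(scs_khz):
    """Duration of a 14-symbol NR slot in microseconds: 1 ms / 2**mu."""
    mu = int(round(math.log2(scs_khz / 15)))
    return 1000.0 / (2 ** mu)

def rvs_transmitted(pos1_ack, pos2_ack, max_rvs=4):
    """RVs sent when Pos#1 gates RV#2 and Pos#2 gates RV#3 (assumed model)."""
    if pos1_ack:
        return 2          # RV#1 is already on air when the Pos#1 feedback arrives
    if pos2_ack:
        return 3
    return max_rvs

rv_duration_us = slot_duration_us(960)  # 15.625 µs, quoted as 15.63 µs in the text
```

Under this model, an early ACK at Pos#1 saves two of the four RVs, i.e. 31.25 µs of air time at 960 kHz subcarrier spacing.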

F. Evaluation methodology
In addition to achieving the reliability and the latency targets which are mandatory requirements, the performance of the HARQ prediction schemes can be compared in terms of their achieved throughput.In contrast to commonly used classifier performance metrics, such as precision and recall, the throughput provides a measure with direct takeover to practical scenarios.Especially, when considering edge cases with extremely low-reliability requirements, common classifier metrics may not provide a good metric for comparison [25].Based on the renewal-reward theorem [44], the throughput is expressed as , where E[R] is an expected reward and, E[L(T )] is the expected transmission latency with L(T ) as defined in ( 18) and ( 19) and T ∈ N + being the number of requested RVs.Furthermore, the reward is R := 0 in case the transmission failed within the latency budget.In case the transmission was successful, R is N packet , the size of the packet in nats.Hence, the expected reward is given as E[R] := (1 − tot )N packet , where tot is the associated total error probability.The transmission latency of the prediction schemes is composed of multiple components: where δ exc := max(δ proc + δ fb − δ RV , 0) and δ RV is the time to transmit an RV, δ proc is the processing time required by the specific prediction scheme and δ fb is the time required to transmit the feedback.Furthermore, L blk is a latency penalty.
For simplicity, we assume L_blk = T_blk δ_RV, where T_blk is a spectral efficiency penalty when blockage is detected. In contrast, the latency of a regular HARQ system, see (19), depends on δ_fh, the fronthaul round-trip time, which is the time required for transporting the received signal vectors to the BBU, decoding the packet at the BBU, and sending the result back to the RRHs. The probability distribution of T for a feedback delay of δ := 1 is determined by the performance of the different estimators, where T_max is the maximum number of transmissions, T_blk is an additional spectral efficiency penalty in case blockage is detected, ε_j, j ∈ {1, 2, ..., T_max − 1}, are the error probabilities at the (j+1)-th RV given that the previous RVs were unsuccessful, α_j, j ∈ {1, 2, ..., T_max − 1}, are the false-positive probabilities, i.e. predicting an unsuccessful decoding as an ACK, and β_j, j ∈ {1, 2, ..., T_max − 1}, are the false-negative probabilities, i.e. predicting a successful decoding as a NACK. In the definitions of the false-positive and false-negative probabilities, D_{j+1} is the decoding outcome with (j+1) RVs and F_j is the outcome of the j-th prediction. The total error performance ε_tot is determined not only by the error probabilities ε_i but also by the false-positive probabilities α_i. The total error performance in a no-blockage scenario follows [10], where the false-positive, false-negative and error probabilities are estimated from no-blockage scenarios. In a blockage scenario, however, a sufficiently low error probability cannot be maintained and hence alternative procedures, e.g. additional redundancy, switching to a lower frequency or adapting the beam, have to be initiated. Instead of making an assumption on the specific blockage recovery scheme, we use the spectral efficiency penalty T_blk to model the blockage case. This also means that an effective blockage, i.e. non-decodability of the T_max RVs, has to be detected with the same target error probability. The probability of blockage misdetection is denoted by ε_blk|sb, where the false-positive, false-negative and error probabilities are derived from single-blockage scenarios.
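The renewal-reward throughput described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, and the per-T latencies L(T) are passed in by the caller rather than computed, since the sketch does not attempt to reproduce the latency expressions (18) and (19).

```python
def excess_delay(delta_proc, delta_fb, delta_rv):
    """delta_exc := max(delta_proc + delta_fb - delta_RV, 0), as defined in the text."""
    return max(delta_proc + delta_fb - delta_rv, 0.0)


def expected_throughput(p_t, latency, eps_tot, n_packet):
    """Renewal-reward throughput E[R] / E[L(T)].

    p_t[k]     : probability that T = k + 1 RVs are requested (assumed given)
    latency[k] : latency L(T) for T = k + 1 (assumed given by the caller)
    eps_tot    : total error probability
    n_packet   : packet size (in nats in the paper)
    """
    if len(p_t) != len(latency):
        raise ValueError("p_t and latency must have the same length")
    if abs(sum(p_t) - 1.0) > 1e-9:
        raise ValueError("p_t must sum to 1")
    expected_reward = (1.0 - eps_tot) * n_packet          # E[R]
    expected_latency = sum(p * l for p, l in zip(p_t, latency))  # E[L(T)]
    return expected_reward / expected_latency
```

With this structure, a scheme with a larger processing time only pays a latency cost once δ_proc + δ_fb exceeds δ_RV, which is what the max(·, 0) in δ_exc expresses.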
The false-positive and false-negative probabilities behave in a conflicting manner, where the trade-off between both can be controlled by adjusting the bias s of the respective predictor. However, the functional relation between α_i(s) and β_i(s), i = 1, ..., T_max − 1, is not known. Hence, we determine 1000 admissible pairs (α_i(s_j), β_i(s_j)), j = 1, 2, ..., 1000, from the link-level simulations and interpolate them piece-wise linearly. The false-positive/false-negative curve of any reasonable predictor is a convex function; in the extreme case where no prediction is possible, this function becomes a straight line connecting the points (0, 1) and (1, 0). Hence, a piece-wise linear interpolation is a conservative approximation: the actual performance of the predictors at an interpolated point is expected to be better than the approximated value.
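The conservative nature of the piece-wise linear interpolation can be illustrated with a short sketch. The helper name `interpolate_beta` and the example curve β = (1 − √α)² are assumptions chosen only for illustration; they are not taken from the simulations.

```python
from bisect import bisect_right


def interpolate_beta(pairs, alpha):
    """Piece-wise linear interpolation of beta at a given alpha.

    pairs: iterable of measured (alpha_i, beta_i) operating points.
    """
    pts = sorted(pairs)
    xs = [a for a, _ in pts]
    if not xs[0] <= alpha <= xs[-1]:
        raise ValueError("alpha outside the measured range")
    k = bisect_right(xs, alpha)
    if k == len(xs):            # alpha equals the largest measured alpha
        return pts[-1][1]
    (x0, y0), (x1, y1) = pts[k - 1], pts[k]
    return y0 + (y1 - y0) * (alpha - x0) / (x1 - x0)


# Hypothetical convex trade-off curve, sampled at three operating points.
points = [(a, (1.0 - a ** 0.5) ** 2) for a in (0.0, 0.25, 1.0)]
approx = interpolate_beta(points, 0.0625)
exact = (1.0 - 0.0625 ** 0.5) ** 2
```

For a convex curve, each chord lies on or above the curve, so the interpolated false-negative probability (here `approx`) is never smaller than the true value (`exact`), i.e. the approximation is pessimistic.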
With s := (s_1, s_2, ..., s_{T_max−1}) being the vector of biases per prediction, we derive an optimization problem for the expected number of RVs under no blockage, in which ε_target corresponds to the overall reliability requirement. We use the Sequential Least Squares Programming (SLSQP) algorithm [45] with a Monte-Carlo approach to numerically find a valid solution to this optimization problem.

G. Link-level simulation setup
To compare the performance of the previously described HARQ prediction schemes, we conduct link-level simulations to collect the required LLR, subcode and channel estimation features. We choose the frame structure of the transmission, i.e. the mapping of the code block to resource elements and reference signals, e.g. DMRS, in accordance with the Rel. 16 3GPP specifications. Furthermore, we use a subcarrier spacing of 960 kHz, which is currently being specified in the "NR operation up to 71 GHz" work item [1]. Tab. V summarizes the link-level parameters. We assume that different UEs are scheduled on different orthogonal time-frequency resources and hence interference from other transmitters can be neglected. In the sub-THz frequency spectrum, the use of beamforming is necessary due to the high pathloss, even for free-space propagation. However, Multiple-Input Multiple-Output (MIMO) schemes that allow dynamic beamforming are complex in the sense that they offer a large set of tunable parameters and transmission modes. Optimizing these goes beyond the scope of this work. Hence, we use a spatially filtered Clustered Delay Line D (CDL-D) channel model that already incorporates the effects of beamforming [29], see Sec. III-G for more details.

We evaluate the performance in a single-blockage and a no-blockage scenario. We do not consider the case where both RRHs are blocked, as this would make any communication at a feasible rate impossible. In the no-blockage scenario, we assume no SNR difference between the two RRHs. In the single-blockage scenario, we assume the same SNR at the non-blocked RRH and an additional pathloss of 11.2 dB at the blocked RRH. This is the pathloss difference between a blocked and a non-blocked channel with a UE at 20 m distance from both RRHs in the UMi scenario [29]. At the RRHs, we use a spatially filtered CDL channel model. In particular, at the non-blocked RRH, we use CDL-D, which is a LOS channel model with a Rician distributed LOS component and Rayleigh distributed Non-Line-Of-Sight (NLOS) components [29]. At the blocked RRH, we use CDL-C, which is an NLOS channel model [29]. Furthermore, we apply the directional antenna pattern, as specified in [47], at both sides to generate a spatially filtered Tapped Delay Line (TDL) channel which models the effective channel between the UE and the RRHs. The spatial filtering procedure is performed in accordance with [29, Sec. 7.7.4]. For the decoding of the received signal vector, we apply an optimized min-sum algorithm [48] with 50 iterations. In contrast, the subcode prediction uses only 5 iterations. Hence, its complexity is only one tenth of the complexity of a full decoding attempt.

IV. SIMULATION RESULTS
In this section, we present the results for the different HARQ prediction approaches. We train all schemes jointly on all SNRs except 6 dB. The SNR of 6 dB is held out during training and used only for testing, in order to evaluate the generalization performance of the models.

A. False-positive and false-negative performance
The false-positive and false-negative mispredictions determine the performance of the predictor, as we can see from Eq. (20) and (24). The regime of low false-positive rates is of special interest because the cost of a false-positive misprediction is significantly higher than the cost of a false-negative misprediction. However, due to the finite size of the test sets, we have to deal with false-positive rates of zero, which makes a logarithmic scale unusable. Hence, a symmetrical logarithmic scale has been chosen for the false-positive axis. This scaling puts emphasis on the lower regime of the false-positive probabilities while keeping the zero point interpretable; however, special care has to be given to the linear scaling between zero and the first step.
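To make the behavior of such an axis concrete, the following sketch implements a simplified symmetric logarithmic mapping. It is an assumption-laden illustration: the function name `symlog` and the threshold parameter `linthresh` are hypothetical here, and the mapping is a simplification rather than the exact transform of any plotting library (Matplotlib, for instance, offers a comparable "symlog" axis scale).

```python
import math


def symlog(x, linthresh):
    """Simplified symmetric-log mapping.

    Linear for |x| <= linthresh (so that x = 0 maps to 0) and
    logarithmic beyond, continuous at the threshold.
    """
    if linthresh <= 0:
        raise ValueError("linthresh must be positive")
    if abs(x) <= linthresh:
        return x / linthresh
    return math.copysign(1.0 + math.log10(abs(x) / linthresh), x)
```

With `linthresh` set to the smallest non-zero false-positive rate resolvable on the test set, a rate of zero is drawn at the origin, the first resolvable step at 1, and each further decade adds one unit. The interval between 0 and 1 on this axis is linear, which is why values in that interval need to be read with care.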
Fig. 6 shows the SNR-averaged false-positive rate over the false-negative rate at the first and the second prediction points with a feedback size of 4 bits. In Fig. 6a, we note that all schemes achieve a better performance in the no-blockage scenario than in the single-blockage scenario. In particular, the Q-SNR and LR-LLR achieve a comparable performance in both scenarios, whereas the LR-LLR performs slightly better, except at very small false-negative rates in the no-blockage scenario. Furthermore, we note that the DA2SGMM clearly outperforms all other schemes in both scenarios. In the no-blockage scenario, it already reaches zero mispredictions on the test set at a false-negative rate of approximately 0.5. In contrast, the other schemes reach zero mispredictions only at a false-negative rate of 1. This behavior is even more pronounced at the second prediction point, shown in Fig. 6b. Here, in the no-blockage scenario, the DA2SGMM reaches zero mispredictions already below a false-negative rate of 0.2. In the single-blockage scenario, the performance of the DA2SGMM degrades slightly compared to the first prediction point. However, the performance of the other prediction schemes degrades significantly in both scenarios compared to the first prediction point.
The false-positive/false-negative curve only shows the averaged performance. Hence, we also want to compare the performance of the schemes at specific SNRs. In particular, the SNR of 6 dB is of utmost interest, as this data was excluded from the training. We therefore introduce the notion of the Area Under the false-positive Curve (AUC) as AUC := ∫_0^1 f_{α,β}(x) dx, where f_{α,β}(x) is the piece-wise linear interpolation of the false-positive/false-negative pairs (α, β). In Fig. 7, we present the AUC performance in the no-blockage and single-blockage scenarios with a feedback size of 4 bits. In the no-blockage scenario, in Fig. 7a, we observe at the first prediction point that the AUC decreases with increasing SNR, i.e. the prediction accuracy increases. In particular, for the SNR of 6 dB, we note that none of the schemes shows a particularly degraded AUC performance. For the second prediction point, we observe that the AUC tends to increase slightly with increasing SNR for all schemes except the DA2SGMM, which shows an almost flat behavior over the SNR range. As already seen in Fig. 6, the DA2SGMM achieves by far the lowest AUC at all SNRs and both prediction points. In the single-blockage scenario, in Fig. 7b, we observe that the AUC tends to increase with the SNR, except for the DA2SGMM at the first prediction point. Again, the DA2SGMM clearly outperforms the other schemes at all SNRs and both prediction points. Furthermore, as in the no-blockage scenario, no degradation of the AUC at an SNR of 6 dB can be observed. Hence, a good generalization of all models may be assumed.
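Because f_{α,β} is piece-wise linear, the integral in the AUC definition reduces exactly to a sum of trapezoids. The sketch below illustrates this; the function name `auc` is hypothetical, and the code takes generic (x, y) points, leaving it to the caller which of the two rates is used as the abscissa.

```python
def auc(points):
    """Area under the piece-wise linear interpolation of (x, y) points on [0, 1].

    For a piece-wise linear function, the trapezoidal rule is exact.
    """
    pts = sorted(points)
    if pts[0][0] != 0.0 or pts[-1][0] != 1.0:
        raise ValueError("points must cover the full interval [0, 1]")
    return sum(
        0.5 * (y0 + y1) * (x1 - x0)
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )
```

As a sanity check, the straight line between (0, 1) and (1, 0), i.e. the case in which no prediction is possible, yields an AUC of 0.5, and any convex curve below that line yields a smaller value. A lower AUC therefore indicates a better predictor.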
1) Impact of feedback size: In the previous section, we showed results for a feedback size of 4 bits. However, the question of how many feedback bits are required is important for the practicability of the HARQ prediction schemes, as more bits result in a significantly higher control signaling overhead. In Fig. 8, we show the SNR-averaged AUC over different feedback sizes in the no-blockage and the single-blockage scenarios, in Fig. 8a and Fig. 8b, respectively. In the no-blockage scenario, Fig. 8a, we clearly note a trend of lower AUC at higher feedback sizes. This matches the intuition that more accurate feedback benefits the prediction accuracy. However, in the single-blockage scenario, in Fig. 8b, we observe a lower AUC at lower feedback sizes. Although this seems counter-intuitive, this behavior is explained by the trade-off between the no-blockage AUC and the single-blockage AUC. Depending on the hyperparameters, the feedback size itself and in particular the ACK weight class for the DA2SGMM, see Sec. A, the schemes train for a different trade-off at the different feedback sizes. Besides that, we observe that the DA2SGMM achieves the lowest AUC at both prediction points and in both scenarios, even compared to higher feedback sizes of the other schemes. Furthermore, we can see that all schemes profit from more feedback bits. In particular, we observe that Q-SNR gains the most up to 3 bits and benefits only slightly from more bits. The LR-LLR and DA2SGMM mostly improve in terms of AUC up to 4 bits. The DA2SGMM, however, has an outlier for the first prediction point at 3 bits in the blockage scenario, as seen in Fig. 8b.
2) Blockage Detection: In addition to the first and second HARQ prediction points, the blockage prediction also plays a crucial role for practical applications. In Fig. 9, we show the SNR-averaged false-positive rate over the false-negative rate in the no-blockage and the single-blockage scenarios, in Fig. 9a and Fig. 9b, respectively. Similar to the previous prediction points, we generally observe a better performance for all schemes in the no-blockage scenario than in the single-blockage scenario. Again, the DA2SGMM achieves a significantly lower false-positive rate at the same false-negative rates. These results indicate a superior performance of the DA2SGMM scheme compared to the other schemes in terms of both HARQ prediction and blockage detection.

B. HARQ system performance
In the previous section, we evaluated the false-positive and false-negative performance. However, the performance in a practical setup has to be shown to prove the efficiency of a scheme. Hence, we evaluate the different prediction schemes using the evaluation methodology described in Sec. III-F.
In Fig. 10, we show the HARQ performance of the prediction schemes with 4 bits feedback size in terms of the throughput with and without the blockage side constraint as defined in (26). We assume T_blk := 4, which implies that a positive blockage detection results in 4 additionally requested RVs. Furthermore, we set δ_fb = δ_RV. Due to the downlink control signaling design, i.e. PDCCH in 5G, it is possible to achieve even smaller feedback delays. However, as is clear from (18), any smaller δ_fb would diminish the impact of the complexity differences of the schemes. For the processing latency of the schemes, we take the measurement results from the single-threaded implementation on an Intel Xeon CPU as a basis. In Fig. 10a, we observe in the no-blockage scenarios that the DA2SGMM, with a throughput of 23 to 25 Mbit/s, clearly outperforms all other prediction schemes, which achieve approximately 8 Mbit/s at all SNRs. We note that the latency required to achieve the target error rates does not differ significantly between the HARQ prediction schemes. Compared to regular HARQ, the DA2SGMM reaches approximately 20 % less throughput. Nevertheless, the higher throughput of regular HARQ comes at the cost of a significantly larger maximum transmission latency compared to the DA2SGMM. In the single-blockage scenario, we note that the additional latency due to retransmissions significantly degrades the throughput of regular HARQ. The DA2SGMM, Q-SNR and LR-LLR achieve a similar and significantly higher throughput for all SNRs except the 7 dB SNR of HARQ with δ_fh = 150 µs. In Fig. 10b, we show the throughput without the blockage side constraint. We observe that the performance improves for all prediction schemes in both scenarios. In particular, the Q-SNR and LR-LLR benefit significantly from removing this side constraint. This indicates that in the previous performance evaluations, these schemes are mainly limited by the stringent blockage detection side constraint defined in (26).

Another critical issue for machine learning schemes is the robustness against unknown channel variations. In particular, any learned scheme has to perform reliably over a larger range of channel parameters, such as the SNR. In order to test the robustness of the trained DA2SGMM, we exclude the SNR of 6 dB from training. We note that the throughput of the DA2SGMM behaves as expected also at this SNR point. Furthermore, we note that the achieved throughput at 6 dB is even closer to the throughput at 7 dB than to that at 5 dB, which suggests that the scheme also behaves as expected under unknown channel variations.

V. SUMMARY AND CONCLUSIONS
In this work, we proposed novel machine-learning assisted HARQ prediction schemes and evaluated them using link-level simulations within the context of a C-RAN scenario in the sub-THz regime, also considering blockage. In particular, we extended the LLR- and subcode-based approaches proposed in [25], enabling their usage in a C-RAN setup by introducing quantization and a feedback combination module using a logistic regression, and we proposed a novel end-to-end DA2SGMM architecture that exploits SNR and subcode features. Using realistic link-level simulations, we showed that the proposed DA2SGMM clearly outperforms other prediction mechanisms in no-blockage as well as in single-blockage scenarios within the context of the HARQ evaluation methodology. In particular, we showed that the DA2SGMM HARQ prediction achieves a more than 200 % higher throughput compared to other HARQ prediction schemes at SNRs ranging from 4 dB to 7 dB and target error rates from 1 · 10^−4 to 3.5 · 10^−6, if single-blockage is considered. Even without blockage, we showed that the throughput of the DA2SGMM is approx. 29 % higher compared to the LR-LLR and even 45 % higher compared to the Q-SNR. Compared to regular HARQ, our proposed DA2SGMM with a sufficient feedback size suffers only a throughput reduction of approx. 20 % while reducing the maximum transmission latency by a factor larger than 4. Furthermore, we showed that 4 bits for the feedback transmission are sufficient and that the schemes do not benefit from more bits. In future research, the impact of double-blockage as well as a setup with more than two RRHs may be studied. Furthermore, the impact of non-ideal channel estimation on the performance of the different schemes has to be evaluated in further studies.
≡ [Lin(x,y), BN, L-ReLU] and d is the input dimension. Furthermore, Lin(x,y) denotes a linear transformation layer, BN a Batch Normalization layer, and L-ReLU a Leaky ReLU activation layer with a slope of 0.01. Each RRH further contains a classifier that operates on the compressed form of the subcode features. The network configurations of the RRH classifiers each read as [FC(5,10), FC(10,15), FC(15,15), FC(15,10), Lin(10,2), SM], where FC(x,y) ≡ [Lin(x,y), BN, ReLU], ReLU is a ReLU activation layer and SM is a softmax activation layer. The classifiers each receive the compressed representation of s^(i) and the received SNR γ^(i) as input. To prevent the two "arms" of the network from drifting apart, the local encoders and classifiers are tied together. In particular, this means that the weights and the biases of the linear layers are updated equally at both RRHs. Furthermore, we implement the quantization layer by a FakeQuantize layer with a MinMaxObserver in PyTorch using quantization-aware training [49]. Lastly, the network configuration of the UE classifier is given by [FC(2,20), FC(20,10), FC(10,5), Lin(5,2), SM].

We train the DA2SGMM in an end-to-end fashion using a loss function L that is composed of an L2-norm term and the cross-entropy between the predicted output d̂ and the actual decoding outcome d, where N is the number of samples in a batch and ω_ACK is a weight class for ACKs. The two loss functions are combined using a fixed weight factor λ. The fixed weight factor and the batch size are found to give the best results at 15.0 and 15,000, respectively, for all prediction points. For the first and the blockage prediction point, the ACK weight class ω_ACK achieves the best performance at 0.5. For the second prediction point, the ACK weight class is chosen to be 0.1. To train the DA2SGMM under the given loss function, we use the Adam optimizer [50] with a learning rate of 0.001 and a weight decay of 10^−5. We initialize the parameters of the whole network with the Kaiming normal initialization [51]. Due to the nature of the sample data, the ratio between ACKs and NACKs is heavily imbalanced. Hence, we undersample the majority class, i.e. ACKs, to create a balance between ACKs and NACKs.
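The undersampling step can be sketched as follows. This is a minimal illustration, assuming random undersampling without replacement down to the size of the smallest class; the function name, the fixed seed and the label strings are hypothetical and not taken from the actual training pipeline.

```python
import random


def undersample_majority(samples, labels, seed=0):
    """Randomly drop samples of larger classes until all classes are equally sized.

    Returns a shuffled list of (sample, label) tuples.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # rng.sample draws without replacement
        balanced.extend((s, label) for s in rng.sample(items, n_min))
    rng.shuffle(balanced)
    return balanced


# Example: 8 ACKs and 2 NACKs are reduced to 2 ACKs and 2 NACKs.
data = undersample_majority(list(range(10)), ["ACK"] * 8 + ["NACK"] * 2)
```

Undersampling discards data from the majority class, so it trades a smaller effective training set for a balanced class ratio.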

Fig. 1. Uplink C-RAN scenario with local HARQ feedback generated at the RRHs which is combined at the UE.

Fig. 1 shows the system setup which is used throughout the paper. The definitions of commonly used variables are summarized in Tab. I. We assume an uplink scenario where a UE is transmitting a packet, which is simultaneously received by two RRHs. After partially receiving a packet, the RRHs generate a local feedback based on the evaluated prediction algorithms. The generated local feedback is transmitted to and combined at the UE. Meanwhile, the received signals from the RRHs are accumulated and jointly decoded at the BBU, which determines the final decoding outcome. Let A := ℂ^n and B := B_1 × B_2 = ℂ^p × ℂ^(n−p) = ℂ^n designate the input and output sets. Furthermore, let each channel be characterized by its respective conditional probability measure P_{Y^(1)|X} : A → B and P_{Y^(2)|X} : A → B. Given n modulation symbols, the channel probability measures associate the random vector X ∈ A representing the transmitted signal with the received signal random vectors Y^(1), Y^(2) ∈ B.

Fig. 4. A supervised DA2SGMM to compress the feedback information at the RRHs and evaluate the combined result at the UE. The yellow boxes are only used for training purposes and are removed for inference.

Fig. 6. False-positive prediction performance over false-negatives after the first RV and after the second RV with 4 bit feedback. No-blockage (circle). Single-blockage (circle). (a) First prediction point. (b) Second prediction point.

Fig. 9. Blockage prediction at the third prediction point with a feedback size of 4 bits. (a) No-blockage scenario. (b) Single-blockage scenario.

Fig. 10. Throughput (4 bits feedback) over the maximum transmission latency (δ_fb = δ_RV) with and without the blockage side constraint at the evaluated SNRs and corresponding target error rates, as provided in Tab. V. (a) With blockage constraint. (b) Without blockage constraint.

TABLE II: MEMORY CONSUMPTION IN NUMBER OF STORED FLOATING POINT VARIABLES AND COMPUTATIONAL COMPLEXITY IN TERMS OF ELEMENTARY FLOATING POINT OPERATIONS.

TABLE III: SINGLE-THREADED PROCESSING TIME ON DIFFERENT PROCESSOR PLATFORMS.