Outage Performance and Novel Loss Function for an ML-Assisted Resource Allocation: An Exact Analytical Framework

In this paper, we present Machine Learning (ML) solutions to address the reliability challenges likely to be encountered in advanced wireless systems (5G, 6G, and beyond). Specifically, we introduce a novel loss function to minimize the outage probability of an ML-based resource allocation system. A single-user multi-resource greedy allocation strategy constitutes our application scenario, for which an ML binary classification predictor assists in selecting a resource satisfying the established outage criterion. While other resource allocation policies may be suitable, they are not the focus of our study. Instead, our primary emphasis is on theoretically developing this loss function and leveraging it to train an ML model to address the outage probability challenge. With no access to future channel state information, this predictor foresees each resource's likely future outage status. When the predictor encounters a resource it believes will be satisfactory, it allocates it to the user. The predictor aims to ensure that a user avoids resources likely to undergo an outage. Our main result establishes exact and asymptotic expressions for this system's outage probability. These expressions reveal that focusing solely on the optimization of the per-resource outage probability conditioned on the ML predictor recommending resource allocation (a strategy that, at face value, looks to be the most appropriate) may produce inadequate predictors that reject every resource. They also reveal that focusing on standard metrics, like precision, false-positive rate, or recall, may not produce optimal predictors. With our result, we formulate a theoretically optimal, differentiable loss function to train our predictor. We then compare predictors trained using this loss function with predictors trained using traditional loss functions, namely binary cross-entropy (BCE), mean squared error (MSE), and mean absolute error (MAE). In all scenarios, predictors trained using our novel loss function provide superior outage probability performance. Moreover, in some cases, they outperform predictors trained with BCE, MAE, and MSE by multiple orders of magnitude. Additionally, when applied to another ML-based resource allocation scheme (a modified greedy algorithm), our proposed loss function maintains its efficacy.


I. INTRODUCTION
Wireless channels are inherently dynamic, strongly influenced by their surrounding environments and users' mobility. Their ability to support communication fluctuates with space, time, and frequency. Such random dynamicity directly impacts the quality of service experienced by the user. Hence, sophisticated control techniques are pivotal to minimizing the deterioration posed by unfavorable conditions.
Advanced generations of wireless systems, such as 6G and beyond, will predominantly operate at millimeter wave (mmWave) and terahertz (THz) bands, as this is one of the proposed solutions to handle the ever-increasing data traffic demands caused by the massive growth of connected devices [1]. High radio frequencies make communication channels more susceptible to interference from surroundings, such as obstacles and atmospheric conditions, which would have minimal impact on links operating at lower frequencies. Hence, signal degradation and unstable communications may become frequent [2]-[4]. Moreover, in advanced systems, devices are expected to play an active role in decision-making tasks (without human involvement). Thus, it is paramount that their actions be highly reliable. The need for operational reliability will be further amplified for critical, immersive, and omnipresent communications. Indeed, it may be possible to increase the communication reliability if actions are taken based on adequately predicting the future status of the link. To address these challenges, there is a requirement for agile, high-dimensional modelling techniques with real-time adaptation [5].
Machine Learning (ML) is becoming a pivotal tool across a wide range of applications, and advanced wireless communications systems (5G, 6G, and beyond) are no exception. In complex systems such as these, where intertwined phenomena, parameters, and metrics are at play in a constantly changing scenario, guaranteeing the proper quality of service over radio links is a challenge. In such cases, ML techniques arise as a handy and powerful instrument. ML offers flexibility by letting data, which describes the system and its performance, drive decision-making. Through ML, we can adapt more effectively to the dynamic nature of future systems, enabling seamless communication and enhanced problem-solving.
Demonstrating its capabilities, ML has been effective in predicting outages, identifying link blockages, and assessing link quality [6]-[13]. For example, in [10], a deep neural network was trained to map user positions and data traffic demand to their corresponding blockage status and optimal beam index. It was shown that this scheme could predict blockages in mmWave communications with 90% accuracy. Furthermore, [11] implemented a meta-learning-based framework for predicting blockages in mmWave and THz systems that trained a recurrent neural network using just a few data samples. Other methods, e.g., combining computer vision and deep learning tools, were also used to identify blockages in [12] and [13].
Further expanding its utility, ML has been used to optimise resource allocation in wireless systems and to improve channel estimation in multiple-input multiple-output (MIMO) or orthogonal frequency division multiplexing (OFDM) systems [14]-[18]. For instance, [14] proposed a deep reinforcement learning based approach for resource allocation in vehicle-to-vehicle communications. The developed algorithm could be applied to unicast and broadcast scenarios, and reduced interference in vehicle-to-infrastructure communications. In [15], the potential of deep learning for power allocation in massive MIMO systems was investigated, demonstrating a significant reduction in the complexity and processing time of the optimization process, while [16] presented a novel concept of employing deep neural networks to learn the channel-to-channel mapping between different sets of antennas and frequency bands, with the aim of optimizing performance in similar systems. In [17], deep reinforcement learning was employed to optimize energy efficiency in device-to-device enabled heterogeneous networks.
In the studies highlighted above, the underlying learning algorithms are often directed by traditional loss functions, such as binary cross-entropy (BCE) or mean squared error (MSE). At its core, a loss function evaluates the gap between the predicted outputs of the model and the actual observations. This evaluation guides the learning algorithm to minimize this difference by refining the model's parameters, commonly using approaches like gradient descent. However, our findings indicate that relying solely on these traditional loss functions can lead to marginal performance gains, potentially failing to satisfy the stringent demands of next-generation systems. Highlighting the necessity for a change in learning strategies, [19] emphasized the significance of incorporating domain-specific knowledge from communication systems into the learning procedure.
Recently, there has been a growing momentum towards this domain-centric integration, aiming for optimal system performance. To this end, tailored loss functions are often developed to ensure that the learning process meets certain constraints or highlights specific prediction aspects, refining the model to better meet the unique requirements of a specific task. As an illustration, [20] introduced three deep neural networks to approximate singular value decomposition in MIMO systems. The authors presented a tailored loss function for hybrid beamforming, considering the challenges of finite-precision phase shifters and power limitations. In [21], a deep neural network-driven resource allocation method was proposed for cell-free massive MIMO systems with hardware limitations. This deep neural network employed a tailored loss function that was designed to optimize sum rates while taking into account user power and front-haul capacity limitations. The potential of reconfigurable intelligent surfaces (RISs) to enhance signal-to-noise ratio and improve network coverage using ML was explored in [22]. Here, an unsupervised deep neural network with a tailored loss function was used to optimize RIS reflection coefficients.
A generative adversarial network was used to improve wireless channel predictions in [23] by employing a tailored loss function that aimed at preserving low-rank channel matrices.Furthermore, a strategy for resource allocation was suggested in [24] for advanced network slicing in beyond 5G settings, using statistical federated learning coupled with a tailored loss function.Additionally, both [25] and [26] demonstrated the benefits of deep unsupervised and reinforcement learning methods, in their respective studies, for optimizing resource allocation in ultra-reliable low-latency communication using a tailored loss function.
Though these investigations offer important perspectives, they do not integrate the outage probability reliability metric directly into their loss functions with the specific aim of reducing the chances of link failures. This gap is significant, given that most link failures arise at the extreme, less probable ends of the channel's distribution (learning from these events is referred to as deep-tail learning [27]; see footnote 1). Failing to effectively learn from these statistically infrequent events during model training can result in ML approaches that are inadequately designed to mitigate such failures, even though mitigating them is crucial for meeting the demanding reliability expectations of future wireless technologies.

A. CONTRIBUTIONS
To reduce outages encountered in future wireless systems, in this paper we use ML to predict and avoid link deterioration. The scenario exercised is a single-user multi-resource greedy allocation system amenable to machine intelligence. Here, the resource allocation strategy uses an ML binary classification predictor, which anticipates the future outage status of each resource. The predictor traverses the available resources, stopping when it encounters a resource it believes to be satisfactory. The goal of the predictor is to prevent the allocation of an unsatisfactory resource, which would cause outage. The main challenge in the prediction task is that the predictor does not have access to any future channel state information. Instead, it must use its predictive capacity to infer future resource outages based on each resource's historical state.
It is worth highlighting that our primary aim is to theoretically develop and showcase a custom loss function designed to enhance the outage probability performance of ML-assisted wireless systems. To achieve this, we first analyze the resource allocation system described above and derive associated outage probability results. We then construct a theoretical framework (i.e., a custom loss function) to effectively train our predictor, targeting the minimization of outage probability, which involves deep-tail learning. The main contributions are summarized as follows:
1) Novel expressions are derived for the outage probability of a single-user multi-resource greedy allocation system which uses an ML binary classification predictor for detecting outages.
2) These expressions reveal that focusing solely on the optimization of the per-resource outage probability conditioned on the ML predictor recommending resource allocation may produce inadequate predictors that reject every resource. Likewise, focusing on standard metrics, like precision, false-positive rate, or recall, may not produce optimal predictors.
3) Leveraging the exact and asymptotic outage expressions derived for this resource allocation system, a novel custom loss function is formulated.
4) A simulation-based assessment of our novel loss function is performed, and several valuable insights are obtained. Crucially, it is shown that this novel loss function significantly outperforms the conventional loss functions. In some situations, predictors trained with it outperform predictors trained using the BCE, MSE and mean absolute error (MAE) [28] by approximately two orders of magnitude (i.e., 100×).
5) It is also demonstrated that our custom loss function is particularly effective in regions experiencing infrequent outages, illustrating its capability to facilitate deep-tail learning.
6) Finally, even when applied to a different ML-based resource allocation scheme (a modified greedy algorithm), we show that our proposed loss function retains its effectiveness.

B. ORGANIZATION AND NOTATION
Section II describes the system model, which consists of the resource allocation system combined with an ML binary classification predictor. Section III provides exact and asymptotic expressions for this system's outage behaviour. Section IV builds on the theory presented in Section III to construct a novel loss function that is used to train an ML predictor for our system. Section V presents experimental results using our custom loss function. Finally, the work is concluded in Section VI. Table 1 highlights important notation used throughout the paper.

II. SYSTEM MODEL
We present our system model in three parts. The first describes the channel model adopted for a resource. The second describes the ML binary classification predictor. The third describes the role of the predictor when allocating a resource to a user.

A. CHANNEL MODEL FOR A RESOURCE
We consider a generic single-user multi-resource system. The resources could be, e.g., non-overlapping portions of the spectrum [29]-[31], eigen-channels of a spatial multiplexing transceiver [32], relays in a multi-relay system [33], etc.
Table 1. Important notation. Each entry lists the symbol, its description, and the defining equation.
$P_1(\gamma_{\mathrm{th}})$: outage probability of the system with a single resource (4)
$Q(\cdot\,; \Theta)$: the ML binary classifier with output in $[0, 1]$; $\Theta$ denotes the predictor's parameters (6)
$q_{\mathrm{th}}$: the threshold above which outages are predicted by the predictor (7)
$F_Q(\cdot)$: the cumulative distribution function of the predictor's output (8)
Subset of $\mathcal{R}$ defined in (9): the resources for which the predictor predicts no outages (9)
$P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$: outage probability of the system with $|\mathcal{R}| \to \infty$ resources (12)
$P_{|\mathcal{R}|}(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$: outage probability of the system with $|\mathcal{R}|$ resources (13) or (14)

Resource $i \in \mathcal{R}$ is assumed to have a fluctuating channel state, which can be expressed as a time series, $h_i(t)$ (1). Each unit increment in $t$ corresponds to a channel sample interval. The sampling interval is a function of the correlation bandwidth (or time or distance), which in turn is a function of the operating frequency and the velocity of the receiver. Typically, for a system operating at 5.8 GHz, the sampling interval can range from 0.1 ms to 1 ms [8], [34].
The state $h_i(t)$ is accessible by a user, Alice, before she is allocated the resource. It is assumed that $h_i(t)$ and $h_j(t)$ are independent and identically distributed (i.i.d.; see footnote 2) for $i \neq j$. It is further assumed that $h_i(t)$ and $h_i(t + \Delta)$, $\Delta \in \mathbb{R}$, are identically distributed but may be correlated. This correlation declines as $\Delta$ increases and goes to zero as $\Delta \to \infty$. For our theoretical analysis, we do not impose any requirements on how the correlations decline. Note that Alice has an ML classification predictor, the job of which will be to learn these correlations and aid her in selecting an appropriate resource for her future communication. We discuss this ML predictor and the resource allocation strategy in Section II.B and Section II.C, respectively.
With $h_i(t)$ given by (1), for $k \in \mathbb{N}$ we define the following vector of channel samples:
$$H_i(t, k) = \left[h_i(t-k+1),\, h_i(t-k+2),\, \ldots,\, h_i(t)\right]. \quad (2)$$
Here, $k$ is the length of the window of past samples of channel states for each resource $i \in \mathcal{R}$. The ability of resource $i$ to support Alice's communication over the period $t-k+1$ to $t$ is determined by its capacity $C(H_i(t, k)) \in \mathbb{R}^+$ for that period [35]. As an example, for a quasi-static Gaussian channel, where SNR represents the average signal-to-noise ratio (SNR) per sample, the capacity for resource $i$ in this period is given by (3) [36, eq. (5.80)]. If Alice requires communication at a rate less than or equal to this capacity, the resource will be satisfactory for her. Otherwise, the resource will be in outage. For the remainder of this work, we assume that Alice's required communication rate is determined by a threshold $\gamma_{\mathrm{th}}$. The outage probability of a single resource $i$ can then be written as $P_1(\gamma_{\mathrm{th}})$ in (4), where the equality across all $i$ is a consequence of the resources being i.i.d.
Footnote 2: The i.i.d. assumption simplifies our theoretical analysis. Non-i.i.d. scenarios may be considered in the future.

Problem Statement:
As mentioned previously, Alice is equipped with an ML classification predictor whose goal is to allocate a resource $r^\star \in \mathcal{R}$ to her whilst trying to avoid outages. In more detail, this ML predictor evaluates the current channel conditions $H_i(t, k)$ for a particular resource and predicts the likelihood of successful communication over the upcoming $l \in \mathbb{N}$ channel samples, $H_i(t + l, l)$, for that resource. The main challenge here is that the future channel state $h_i(t')$ for $t' > t$ is not available for any of the resources $i \in \mathcal{R}$. Our resource allocation strategy hinges on the predictions made by this ML predictor to selectively allocate resources, ensuring minimal outages. Central to our study is deriving this system's outage probability and then establishing a theoretical framework for constructing a communication system's loss function. This will optimize our predictor's training to reduce the probability of system outages. We now discuss the ML-based resource allocation predictor and the resource allocation strategy adopted here.

B. ML PREDICTOR FOR RESOURCE ALLOCATION
Without loss of generality, it is assumed that the resource index begins at $i = 1$ and that, each time the predictor predicts an outage event, $i$ is incremented by 1. For resource $i$, the predictor accepts an input vector of channel coefficients $H_i(t, k)$ defined in (2). It provides an output value in the closed interval $[0, 1]$, which is used to classify whether the following $l$ channel samples (i.e., $H_i(t + l, l)$) support communication without outage for each resource $i \in \mathcal{R}$. This paper considers the general case where the model may or may not be well calibrated. The special case corresponding to well-calibrated models should also satisfy the criterion in (5) [37], where $\Theta$ represents the predictor's parameters and $q \in [0, 1]$. The condition in (5) may or may not hold in our analysis (see footnote 3). That is, while the ML predictors in our study are designed to learn conditional probability distributions, they are not explicitly required to fulfill the additional calibration condition in (5).
In this work, the predictor is denoted by $Q(H_i(t, k); \Theta) \in [0, 1]$ (6). For some choice of the predictor's classification threshold $q_{\mathrm{th}} \in [0, 1]$, the event $Q(H_i(t, k); \Theta) > q_{\mathrm{th}}$ (7) implies that an outage is predicted for the next $l$ samples. Furthermore, the probability density function (PDF) for the predictor's output is given by $f_Q(x)$, where $F_Q(x) = \mathbb{P}\left[Q(H_i(t, k); \Theta) \le x\right]$ (8) is the cumulative distribution function of the predictor's output, and the equality across all $i$ is a consequence of the resources being i.i.d. Importantly, for $x = q_{\mathrm{th}}$, we retrieve the predictor's resource acceptance probability, $F_Q(q_{\mathrm{th}})$. Conversely, the predictor's resource rejection probability is given by $1 - F_Q(q_{\mathrm{th}})$. We next discuss the resource allocation strategy adopted here.

C. RESOURCE ALLOCATION
To illustrate the feasibility of our novel loss function, we examine a basic communication setup and resource allocation method. Specifically, we employ a greedy resource allocation policy similar to those found in [38]-[40], as they are known for their efficient implementation and reduced time complexity. Our objective is not to explore different channel allocation techniques, although this can be done. Instead, our primary focus is on introducing a novel approach to construct loss functions for communication systems and evaluating their performance. Our preliminary findings indicate that our novel approach can significantly enhance wireless system reliability when compared to commonly used loss functions. Thus, future research will explore the application of this technique in more intricate resource allocation systems. Fig. 1 shows the allocation policy we consider. In this, Alice scans the available resources $\mathcal{R}$ and selects a resource $r^\star$ based on whether an ML binary classification predictor $Q$ predicts that future communication will succeed on that resource. The search for an appropriate resource for Alice stops when the predictor predicts a no-outage event for that resource. When none of the resources are predicted to be satisfactory (referred to as the critical scenario), we explore two alternatives: in Case 1, Alice is allocated the final resource $|\mathcal{R}|$; in Case 2, Alice is allocated the resource with the lowest ML binary classification value (the best predicted resource seen).
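Accordingly, for $\mathcal{R} = \{1, 2, \ldots, |\mathcal{R}|\}$, where resource $i \in \mathcal{R}$ is the $i$th one being explored by the allocation procedure, we can summarize the adopted strategy as follows. Let (9) define the subset of resources for which the predictor predicts no outages. If this subset is non-empty, the scheme chooses its first (lowest-index) element as $r^\star$. Otherwise, it chooses the final resource, $r^\star = |\mathcal{R}|$, in Case 1, and the resource with the lowest predictor output in Case 2. In what follows, the outage probability for a system with $|\mathcal{R}|$ resources is denoted as $P_{|\mathcal{R}|}(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$, and the outage probability in the infinite resource limit, i.e., $|\mathcal{R}| \to \infty$, is given by $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}}) = \lim_{|\mathcal{R}| \to \infty} P_{|\mathcal{R}|}(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ (12). For brevity, going forward, we may drop the predictor's dependency on $\Theta$ in our notation where appropriate.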
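The greedy scan and its two fallback rules can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `allocate_resource`, the `fallback` argument, and the list `scores` (standing in for the predictor outputs of resources 1 to |R|) are hypothetical names introduced here.

```python
from typing import Sequence


def allocate_resource(scores: Sequence[float], q_th: float,
                      fallback: str = "case1") -> int:
    """Return the 1-based index of the resource allocated to the user.

    scores[i - 1] is the (hypothetical) predictor output for resource i.
    A score <= q_th means "no outage predicted", so the resource is accepted.
    """
    if not scores:
        raise ValueError("at least one resource is required")
    # Greedy scan: stop at the first resource predicted to be satisfactory.
    for index, score in enumerate(scores, start=1):
        if score <= q_th:
            return index
    # Critical scenario: every resource was rejected by the predictor.
    if fallback == "case1":
        return len(scores)  # Case 1: allocate the final resource
    if fallback == "case2":
        # Case 2: allocate the resource with the lowest predictor output.
        return min(range(1, len(scores) + 1), key=lambda i: scores[i - 1])
    raise ValueError(f"unknown fallback: {fallback!r}")
```

For example, with `q_th = 0.5` the scores `[0.9, 0.2, 0.1]` lead to resource 2 being allocated, because the scan stops at the first accepted resource even though resource 3 has a lower score.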

III. THE SYSTEM'S OUTAGE PROBABILITY
In this section, we present the main result of this work, describing how the ML binary classification predictor affects Alice's outage probability.

A. GENERAL OUTAGE EXPRESSIONS
We begin with the following theorem.
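Theorem 1. Consider the resource allocation system described in Section II, where $P_1(\gamma_{\mathrm{th}})$, $F_Q(q_{\mathrm{th}})$ and $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ are defined in (4), (8), and (12), respectively. Furthermore, assume that all resources $i \in \mathcal{R}$ are i.i.d. Then, the outage probability for a system with $|\mathcal{R}|$ resources can be expressed as
Case 1: $$P_{|\mathcal{R}|}(\gamma_{\mathrm{th}}, q_{\mathrm{th}}) = \left(1 - F_Q(q_{\mathrm{th}})\right)^{|\mathcal{R}|-1} P_1(\gamma_{\mathrm{th}}) + \left[1 - \left(1 - F_Q(q_{\mathrm{th}})\right)^{|\mathcal{R}|-1}\right] P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}}), \quad (13)$$
Case 2: the expression in (14),
where
$$P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}}) = \mathbb{P}\left[C(H_i(t+l, l)) < \gamma_{\mathrm{th}} \,\middle|\, Q(H_i(t, k)) \le q_{\mathrm{th}}\right]. \quad (15)$$
For (14) and (15), the equality across all $i$ is a consequence of the i.i.d. resources.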

Proof:
See Appendix A.
A striking observation from this result, and one that we will discuss later, is that if we wish to minimise the outage probability of the system, we should not necessarily focus on minimising the outage probability of a resource given the predictor recommends allocating that resource, $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$. Rather, it is important that we also consider the number of resources available, the probabilities associated with rejecting them all, and the unconditional probability that a single resource is in outage. As we shall see later (in the discussion following Theorem 2), if we fail to do so and focus only on minimising $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$, we may end up in the unfortunate situation where our predictor refuses to select any resource at all.
From (13), the outage probability for Case 1 can be viewed as the average over the limiting instantiations of the system when $|\mathcal{R}| = 1$ and $|\mathcal{R}| = \infty$, where the probabilities of the instantiations are, respectively, $(1 - F_Q(q_{\mathrm{th}}))^{|\mathcal{R}|-1}$ (i.e., the probability that all resources would be rejected by the predictor) and $1 - (1 - F_Q(q_{\mathrm{th}}))^{|\mathcal{R}|-1}$ (i.e., the probability that at least one resource would be accepted by the predictor). From (14), the outage probability for Case 2 can similarly be viewed as the average over the instantiations of the system when the index with the lowest predictor score is selected and when the system has $|\mathcal{R}| = \infty$ resources.
For both cases it is clear that when $|\mathcal{R}| = 1$ or $\infty$, the system's outage probability reduces to $P_1(\gamma_{\mathrm{th}})$ and $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$, respectively, as expected. It can also be seen that the outage probabilities for the two cases differ only in their first term. Consequently, as this first term becomes small (relative to the second term), we expect the two cases to provide similar performance. Notably, this first term decays exponentially with $|\mathcal{R}|$.
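The Case 1 mixture described above is easy to evaluate numerically. The following sketch assumes only the weights stated in the text, namely $(1 - F_Q(q_{\mathrm{th}}))^{|\mathcal{R}|-1}$ on the single-resource outage probability and its complement on the infinite-resource outage probability. The function name `outage_case1` and its argument names are hypothetical.

```python
def outage_case1(p1: float, f_q: float, p_inf: float,
                 num_resources: int) -> float:
    """Case 1 system outage probability as a two-term mixture.

    p1            unconditional outage probability of a single resource
    f_q           acceptance probability of the predictor, F_Q(q_th)
    p_inf         outage probability given that the predictor accepts
    num_resources number of resources |R| (at least 1)
    """
    if num_resources < 1:
        raise ValueError("num_resources must be at least 1")
    # Weight stated in the text for the "all rejected" instantiation.
    p_all_rejected = (1.0 - f_q) ** (num_resources - 1)
    return p_all_rejected * p1 + (1.0 - p_all_rejected) * p_inf


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for r in (1, 2, 4, 8, 16, 32):
        print(r, outage_case1(p1=0.1, f_q=0.5, p_inf=0.001, num_resources=r))
```

With one resource the function returns `p1`, and as the number of resources grows the first term vanishes geometrically, so the result approaches `p_inf`.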
Furthermore, for Case 1, inspecting the limiting values of $q_{\mathrm{th}} = 0$ and $1$, we observe that $P_{|\mathcal{R}|}(\gamma_{\mathrm{th}}, 0) = P_{|\mathcal{R}|}(\gamma_{\mathrm{th}}, 1) = P_1(\gamma_{\mathrm{th}})$. To see this, note that $F_Q(q_{\mathrm{th}}) = 0, 1$ for $q_{\mathrm{th}} = 0, 1$, respectively. Thus, (13) reduces to $P_1(\gamma_{\mathrm{th}})$ when $q_{\mathrm{th}} = 0$ and to $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ when $q_{\mathrm{th}} = 1$. Moreover, because $Q(H_i(t, k); \Theta) \le q_{\mathrm{th}}$ is always true when $q_{\mathrm{th}} = 1$, $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ reduces to $P_1(\gamma_{\mathrm{th}})$. Again, considering (7), this behaviour should be expected because $q_{\mathrm{th}} = 0$ or $1$ will, respectively, result in the user always being allocated the final or first resource when Case 1 is considered. For Case 2, however, inspecting the limiting values of $q_{\mathrm{th}} = 0$ and $1$, we observe different values for the outage probability, given by (16) and (17). From (16) we see that Case 2 reduces to the scenario in which the best predicted resource is always selected from the available resources $\mathcal{R}$.
Finally, it is noteworthy that the predictor's quality (i.e., the probability of communication success given predicted success) is exactly the complement of the system's outage probability in the infinite resource limit. This is because, with an infinite number of resources, a resource will almost surely be predicted to be good. When it is, the outage probability observed by the system is simply the probability that this selected resource is in outage. Because the resource is selected only when $Q(H_i(t, k)) \le q_{\mathrm{th}}$, the outage probability in the infinite resource scenario reduces to the conditional expression in (15). This is not true when only finitely many resources are available.

B. OUTAGE EXPRESSIONS IN TERMS OF THE PREDICTOR'S OUTPUTS
We next show that $P_1(\gamma_{\mathrm{th}})$, $F_Q(q_{\mathrm{th}})$ and $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$, given in (4), (8) and (15), respectively, can be expressed in terms of the predictor's true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Critically, these alternative expressions will be used to formulate a differentiable loss function for training $Q$, which will be shown to be extremely effective at optimizing the system's performance. We begin with the following definition and key remark, which help formalise TN, TP, FN, and FP.
Definition 1. Consider resource $i \in \mathcal{R}$. Also, let $f : \mathbb{R} \to [0, 1]$ and $W_n = \{(H_i(t, k), b_i)\}$, $b_i \in \{0, 1\}$, be a collection of $n$ labelled samples drawn uniformly and independently from our resource model described in Section II, where label $b_i = 0$ or $1$ implies, respectively, that $C(H_i(t + l, l)) \ge \gamma_{\mathrm{th}}$ or $C(H_i(t + l, l)) < \gamma_{\mathrm{th}}$ (i.e., a no-outage event or outage event follows $H_i(t, k)$). Then we define the functionals TN, FN, TP, and FP as in (18)-(21), respectively.
Importantly, when $f(\cdot) = \mathbb{1}(\cdot)$ (i.e., the Heaviside step function [41]) in Definition 1, (18)-(21) reduce to an intuitive understanding of TN, FN, TP, and FP (see Fig. 2).
In particular, the summands become a counter for TN, FN, TP and FP. For example, considering (18), the Heaviside step function term is equal to 1 only when $Q(H_i(t, k)) \le q_{\mathrm{th}}$, i.e., when the predictor is predicting no-outage. The second term of (18), $(1 - b_i)$, is equal to 1 only when the label $b_i = 0$, i.e., when no-outage has occurred. Thus, (18) counts all the events for which $Q(H_i(t, k)) \le q_{\mathrm{th}}$ and $b_i = 0$, i.e., the events for which the predictor has predicted no-outage and no outage has occurred. These are the TNs. Likewise, considering (20), the Heaviside step function term is equal to 1 only when $Q(H_i(t, k)) > q_{\mathrm{th}}$, i.e., when the predictor is predicting an outage. The second term of (20), $b_i$, is equal to 1 when an outage has occurred. Thus, (20) counts all the events for which $Q(H_i(t, k)) > q_{\mathrm{th}}$ and $b_i = 1$, i.e., the events for which the predictor has predicted an outage and an outage has occurred. These are the TPs. Similar arguments apply to the other equations in Definition 1 when $f(\cdot) = \mathbb{1}(\cdot)$. The construction of (18)-(21) in this way will be helpful when formulating our custom loss function. Specifically, choosing an $f(\cdot)$ that approximates $\mathbb{1}(\cdot)$ and is differentiable will allow us to effectively train a neural network predictor using back-propagation.
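The counting argument above can be made concrete with a short sketch. It assumes, since the exact form of (18)-(21) is not reproduced here, that the step-like function is applied to $q_{\mathrm{th}} - Q$ for the "no outage predicted" indicator and that its complement is used for the "outage predicted" indicator. The names `heaviside` and `confusion_terms` are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def heaviside(x: float) -> float:
    """Unit step with step(0) = 1, so a score equal to q_th counts as accepted."""
    return 1.0 if x >= 0.0 else 0.0


def confusion_terms(scores: Sequence[float], labels: Sequence[int], q_th: float,
                    f: Callable[[float], float] = heaviside
                    ) -> Tuple[float, float, float, float]:
    """Return (TN, FN, TP, FP) for predictor outputs and outage labels.

    labels use the convention of Definition 1: 0 = no outage, 1 = outage.
    A "negative" prediction (score <= q_th) means the resource is accepted.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    tn = fn = tp = fp = 0.0
    for q, b in zip(scores, labels):
        accept = f(q_th - q)    # 1 when no outage is predicted (assumed form)
        reject = 1.0 - accept   # 1 when an outage is predicted
        tn += accept * (1 - b)  # predicted no outage, no outage occurred
        fn += accept * b        # predicted no outage, outage occurred
        tp += reject * b        # predicted outage, outage occurred
        fp += reject * (1 - b)  # predicted outage, no outage occurred
    return tn, fn, tp, fp
```

Passing a smooth function as `f` instead of `heaviside` turns the four counters into soft counts, which is the property the text relies on for back-propagation.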
Alternative expressions are now presented in Theorem 2 below for $P_1(\gamma_{\mathrm{th}})$, $F_Q(q_{\mathrm{th}})$, and $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ given in (4), (8), and (15), respectively. This allows us to express the system's outage probability in Case 1 purely in terms of TN, TP, FN and FP. However, the following theorem does not offer the same benefit for Case 2, due to the complexities in formulating analogous equations. This constitutes an open problem for further investigation.
Theorem 2. Let $W_n$, TN, TP, FN and FP be given by Definition 1. Furthermore, consider $P_1(W_n; \mathbb{1})$, $F_Q(W_n; \mathbb{1})$ and $P_\infty(W_n; \mathbb{1})$ given by (22), (23) and (24), respectively. Then (4), (8), and (15) can, respectively, be expressed as (25), (26) and (27).
At this point, we are in a position to highlight the potential consequence of training an ML predictor where the focus is only on minimising $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$, which describes the predictor's ability to minimise the outage probability of a resource given it recommends allocating that resource. We see from (24) and (27) that doing so may result in a strategy that minimises the predictor's false negatives (the numerator of (24)), which can be trivially achieved by rejecting every resource. This, of course, is not desirable. Instead, considering Theorem 1, we should also consider the number of resources available, the probabilities associated with rejecting them all, and the unconditional probability that a single resource is in outage. While minimising $P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ should not be the sole objective when training our ML predictor, it will still form a core component of the optimization problem. From (24), this would correspond approximately to minimising FN/(TN + FN), i.e., the proportion of all negative predictions (both true negatives and false negatives) that were actually false negatives. Importantly, this metric is not a standard one in confusion matrix analysis (e.g., precision, recall, or false-positive rate [42]). It gives us a measure for how "risky" a negative prediction is. In the context of our system model, this makes sense because the negative events correspond to the ML predictor's recommendation to allocate a particular resource.
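The degenerate "reject everything" behaviour can be illustrated with empirical estimates built from the four counts. Only the ratio FN/(TN + FN) is stated explicitly in the text. The other two estimates used below, (FN + TP)/n for the single-resource outage probability and (TN + FN)/n for the acceptance probability, are assumptions that follow from the label and threshold conventions of Definition 1. The function name `outage_estimates` is hypothetical.

```python
from typing import Tuple


def outage_estimates(tn: float, fn: float, tp: float, fp: float
                     ) -> Tuple[float, float, float]:
    """Return empirical estimates (P_1, F_Q, P_inf) from confusion counts."""
    n = tn + fn + tp + fp
    if n <= 0:
        raise ValueError("at least one labelled sample is required")
    p1 = (fn + tp) / n   # fraction of samples labelled as outage (assumed form)
    f_q = (tn + fn) / n  # fraction of samples accepted (assumed form)
    # Risk of a negative prediction; undefined when nothing is accepted.
    p_inf = fn / (tn + fn) if (tn + fn) > 0 else float("nan")
    return p1, f_q, p_inf


if __name__ == "__main__":
    # Illustrative counts only; they are not taken from the paper.
    print(outage_estimates(tn=90, fn=1, tp=5, fp=4))   # a selective predictor
    print(outage_estimates(tn=0, fn=0, tp=6, fp=94))   # rejects every resource
```

In the second call the false-negative count is zero, yet the acceptance probability is also zero, so under Case 1 the user is always handed the final resource and the system outage probability falls back to the single-resource value.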

IV. A CUSTOM LOSS FUNCTION
In this section, we leverage the results that were presented in Section III.B to construct a custom loss function for Case 1 (see footnote 4). We then use this loss function to train a sequence-to-sequence LSTM neural network predictor [43] in Section V. Because our loss function will act as an accurate approximation to $P_{|\mathcal{R}|}(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ for Case 1 given in Theorem 1, we hypothesise that it will result in a trained classification predictor that is effective at minimising the system's outage probability. In the experiments that follow, this hypothesis does indeed appear to be true. In all scenarios tested, our novel custom loss function provides superior performance when compared to the same neural network trained with BCE, MAE, and MSE loss functions (see footnote 5), which are traditional loss functions found in the literature. In certain scenarios, the LSTM predictor, when trained using our novel custom loss function, results in a system that achieves multiple orders of magnitude improvement in the system's outage probability. As identified in the discussion following Theorem 1, when the first term of (13) and (14) becomes small (relative to the second term), we expect the two cases to provide similar performance. As such, the custom loss function formulated for Case 1 may be expected to provide exceptional performance for Case 2 as well. We now present our novel custom loss function.
Footnote 4: Recall that formulating a custom loss function for Case 2 still constitutes an open problem, as noted in the discussion preceding Theorem 2.
Footnote 5: Because the predictor is acting as a binary classifier, its output is unidimensional. This means that loss functions suitable for both regression and classification can be used.
Definition 2. With $P_1(W_n; \phi_\alpha)$, $F_Q(W_n; \phi_\alpha)$ and $P_\infty(W_n; \phi_\alpha)$ given by (22), (23) and (24), respectively, the custom loss function for our system in Case 1 is given by (29).

Similar to other loss functions, our custom loss function depends on the predictor's output and labels.
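As a rough illustration of how a loss of this kind can be evaluated from predictor outputs and labels, the sketch below builds a smooth estimate of the Case 1 outage probability. It is not a reproduction of (29): the logistic form of `soft_step` standing in for $\phi_\alpha$, the empirical estimates of $P_1$, $F_Q$ and $P_\infty$, and the way they are combined (inferred from the geometric-series argument in the proof of Theorem 1 together with the Case 1 rule that the final resource is always allocated) are all assumptions made for this example. It is written in plain Python for readability rather than with TensorFlow.

```python
import math


def soft_step(x, alpha=10.0):
    """Smooth stand-in for the indicator 1[x > 0] (assumed form of phi_alpha)."""
    return 1.0 / (1.0 + math.exp(-alpha * x))


def case1_outage_surrogate(q, b, q_th=0.5, num_resources=4, alpha=10.0):
    """Smooth estimate of the Case 1 system outage probability (illustrative).

    q: predictor outputs in [0, 1]; b: labels (1 = outage, 0 = no outage).
    """
    n = len(q)
    # Soft "allocate" decision: close to 1 when q <= q_th, close to 0 otherwise.
    accept = [soft_step(q_th - qi, alpha) for qi in q]
    soft_fn = sum(a * bi for a, bi in zip(accept, b))        # allocated, outage
    soft_tn = sum(a * (1 - bi) for a, bi in zip(accept, b))  # allocated, no outage
    p1 = sum(b) / n                          # unconditional outage probability
    f_q = (soft_tn + soft_fn) / n            # probability of recommending allocation
    p_inf = soft_fn / (soft_tn + soft_fn)    # outage probability given allocation
    # Probability that the first |R| - 1 resources are all rejected, in which
    # case the final resource is allocated regardless of the prediction.
    reject_all = (1.0 - f_q) ** (num_resources - 1)
    return p1 * reject_all + p_inf * (1.0 - reject_all)


labels = [0, 0, 1, 1]
good = case1_outage_surrogate([0.1, 0.2, 0.9, 0.8], labels)  # scores track labels
bad = case1_outage_surrogate([0.9, 0.8, 0.1, 0.2], labels)   # scores inverted
```

Under these assumptions, a predictor whose scores track the labels yields a much smaller value than one whose scores are inverted, and a predictor that rejects everything is penalised through the `reject_all` term rather than rewarded.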

V. EXPERIMENTATION AND SIMULATION RESULTS
In this section, we present a collection of experimental results that apply the resource allocation strategy from Section II. In these experiments, we train a lightweight LSTM neural network on simulated wireless channel data, in line with Clarke's 3D model [45], using different loss functions, including the custom loss function presented in Definition 2 above.

A. GENERATING DATA
There are various methods to generate data for any given random process. The DeepMIMO dataset, as presented in [46], [47], is one such example, and numerous other datasets are available in the literature. To construct our channel data, we perform the following steps. At time $t = 0$, with $\nu = 1024$, we generate a time-domain vector of zero-mean complex Gaussian variates with total variance $1/\nu$ given by

To model movement in a random direction by a mobile user, at each time increment $t$, we apply a per-element independent and uniformly distributed phase shift

and so on. In the frequency domain we, respectively, have

and (33)

Letting $h_i[t]$ denote the $i$th element of $H[t]$, we model the channel for resource $i$ at time $t$ as the narrowband system

where $x_i[t]$ is a unit-variance signal term and $w_i[t]$ is a unit-variance zero-mean complex Gaussian noise term, which yields an average SNR of

where $E[\cdot]$ is the expectation operator. Clearly, the variables $h_i[t]$ are stationary and zero-mean complex Gaussian. Also, the autocorrelation between successive channel samples is given by:

where the final line follows from the characteristic function of the uniform distribution [48, Eq. (5)]. This model is in line with the 3D Clarke's model presented in [45, Eq. (7)], where $\zeta$ corresponds to the straight-line distance travelled by the mobile user. This model⁶ is widely accepted by the wireless community.

We use the above method to generate a sequence of $k + l$ channel samples for each resource. In our experiments, we consider $k = 100$ and $l = 10$. We also set $\zeta = 0.1$ radians. The first $k$ samples, $H_i(t, k)$, are used as input to our LSTM neural network. The final $l$ samples, $H_i(t + l, l)$, are used to construct a label $b_i$, identifying whether communication could have been supported at a particular rate. Specifically, using the formula from (3) to determine a capacity, we have $b_i = 1$ if $C(H_i(t + l, l)) < \gamma_{\mathrm{th}}$ (an outage) and $b_i = 0$ otherwise.
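The statement about the autocorrelation between successive channel samples can be checked numerically with a small Monte Carlo sketch. The details below are assumptions made for illustration rather than the paper's exact construction: each time-domain tap is given variance $1/\nu$, the phase shifts are drawn uniformly from $[-\zeta, \zeta]$, a single frequency bin is computed with an explicit DFT sum, and a smaller $\nu$ than 1024 is used to keep the pure-Python run time short. Under these assumptions, the characteristic function of the uniform distribution gives a correlation of $\sin(\zeta)/\zeta$ between successive samples.

```python
import cmath
import math
import random


def successive_sample_correlation(nu=64, zeta=0.1, steps=50, runs=100,
                                  bin_index=3, seed=0):
    """Estimate the normalised correlation between successive frequency samples."""
    rng = random.Random(seed)
    # DFT weights for a single frequency bin (one resource).
    weights = [cmath.exp(-2j * math.pi * bin_index * k / nu) for k in range(nu)]
    sigma = math.sqrt(0.5 / nu)  # per-component std so each tap has variance 1/nu
    num = 0j
    den = 0.0
    for _ in range(runs):
        taps = [complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for _ in range(nu)]
        prev = sum(w * h for w, h in zip(weights, taps))
        for _ in range(steps):
            # Independent phase shift per tap, assumed uniform on [-zeta, zeta].
            taps = [h * cmath.exp(1j * rng.uniform(-zeta, zeta)) for h in taps]
            cur = sum(w * h for w, h in zip(weights, taps))
            num += cur * prev.conjugate()
            den += abs(prev) ** 2
            prev = cur
    return (num / den).real


estimate = successive_sample_correlation()
expected = math.sin(0.1) / 0.1  # characteristic function of U[-zeta, zeta] at 1
```

With $\zeta = 0.1$, the expected correlation is close to one, which is why a window of $k = 100$ past samples carries useful information about the next $l = 10$ samples.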

B. TRAINING
We employ a lightweight LSTM neural network predictor consisting of a single LSTM layer with 32 hidden units. This layer is followed by two dense layers: the first contains 10 units activated by PReLU, while the second has a single unit activated by a sigmoid function. Our model is trained on a dataset comprising (4,500 × number of resources) samples. For a 4-resource system, this gives 18,000 training samples; for an 8-resource system, 36,000 samples; and so on. This dataset, $W_n$, is generated as described in Section V.A. The validation dataset is the same size as the training dataset, and we test on 13,000 instantiations of the $|R|$-resource system, a number chosen to ensure statistically robust results. Additionally, we retrain the predictor 10 times and average the performance to obtain a single data point in each of the figures. This has the effect of performing the complete test on (130,000 × number of resources) samples. For the training, we adopt a supervised learning approach, leverage TensorFlow's Keras API, use the Adam optimizer, and employ the following loss functions: BCE, MAE, MSE, and the custom loss function from Definition 2.
The key hyperparameters chosen to train our LSTM model are provided in Table 2. Additionally, we employed early stopping during training to avoid unnecessarily long training routines and to mitigate overfitting, setting a threshold of 30 epochs based on observed performance metrics. This decision was substantiated by the convergence patterns of the training and validation loss curves, which remained closely aligned throughout the training period, as depicted in Fig. 4. This figure also makes it clear that our model displays neither overfitting nor underfitting behaviour. Finally, an independent test set was used to validate all subsequent results, ensuring that they were not influenced by any potential model overfitting. For all of our results, our predictor with the custom loss function⁷ is trained with a prediction classification threshold of $q_{\mathrm{th}} = 0.5$ and $\alpha = 10$. For a fixed rate threshold and a fixed number of resources, a single predictor took approximately 5 minutes to train on a desktop computer using an NVIDIA® GeForce® RTX 2080 Ti GPU with 11 GB of GDDR6 memory. The code that supports our work can be found in [50].

[...] generated to validate the efficacy of our loss function. It should be noted that the results to be discussed in the subsequent subsections are equally applicable to this dataset.

C. RESULTS
As mentioned in the previous subsection, each data point in the figures that follow is obtained by retraining the predictor 10 times and averaging the model's performance. In each of these iterations, the model is evaluated over 13,000 instantiations of the $|R|$-resource system, a number chosen to ensure statistically robust results.

Fig. 5 shows the outage performance of 4- and 10-resource systems for Case 1 when using different loss functions and rate thresholds, $\gamma_{\mathrm{th}}$, for a range of predictor classification thresholds, $q_{\mathrm{th}}$. We would expect to see a 'U'- or 'V'-shaped curve, indicating that values of $q_{\mathrm{th}}$ that are too small or too large perform poorly, whilst the best value lies somewhere in between. Since the predictor was trained at $q_{\mathrm{th}} = 0.5$, a reasonably low outage probability is observed at this value. At the limiting values $q_{\mathrm{th}} = 0$ and $q_{\mathrm{th}} = 1$, the outage probability of the system equals that of a single-resource system, a behaviour that was suggested in the discussion following Theorem 1. It can also be seen that the range of $q_{\mathrm{th}}$ values over which the custom loss function performs reasonably well is significantly broader than that of the BCE loss function. Furthermore, at $q_{\mathrm{th}} = 0$ and $q_{\mathrm{th}} = 1$, the predictor's performance coincides for the different loss functions. This is because $q_{\mathrm{th}} = 0$ or $1$ will, respectively, result in the user always being allocated the final or the first resource, irrespective of the predictor's output.

As highlighted in the discussion following Theorem 2, a core component of training our ML predictor is to minimise FN/(TN + FN). Importantly, this is not the same as optimising the recall (sometimes called the true-positive rate), TP/(TP + FN), the false-positive rate, FP/(FP + TN), or the precision, TP/(TP + FP), of our predictor. As such, we may not expect our ML predictor with our custom training procedure to provide standout performance with respect to these measures. It is, however, of interest to observe how it performs with respect to these more standard metrics. Fig. 6 shows an ROC curve [42] for a predictor trained using our custom loss function or BCE. Despite not having been optimised for either true-positive rate or false-positive rate, our ML predictor still performs well, providing on-par performance with BCE. Similar conclusions hold for other configurations too.

Fig. 7 shows the outage performance of a 10-resource system for Case 1 when using different loss functions and different predictor classification thresholds $q_{\mathrm{th}}$, for a range of rate thresholds $\gamma_{\mathrm{th}}$. It is clear that the custom loss function formulated in this paper provides similar or superior performance when compared to the others. For regions where outages occur infrequently, i.e., for low rate thresholds, our novel loss function becomes increasingly dominant. As an example, in both figures, our custom loss function may improve the performance of the system by approximately two orders of magnitude. In the same $\gamma_{\mathrm{th}}$ region, the BCE, MAE, and MSE loss functions perform poorly, which is most likely due to the increasing bias in the dataset as outages become rarer (i.e., as outage events move deeper into the tail of the channel's distribution). Notably, without any correction for this bias (i.e., the imbalance in the proportion of outages to non-outages), our custom loss function is able to improve the system's performance dramatically, thus demonstrating deep-tail learning. Deep-tail learning is significant because it enables our model to accurately capture the extremely low-probability tail of the channel's distribution, which is where outage events lie for low values of the rate threshold $\gamma_{\mathrm{th}}$.
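The limiting behaviour at $q_{\mathrm{th}} = 0$ and $q_{\mathrm{th}} = 1$ described for Fig. 5 can be reproduced with a toy Monte Carlo simulation of the greedy allocation in Case 1. The sketch below does not use the trained LSTM: the score distributions, the single-resource outage probability of 0.2, and the function name `simulate_case1` are hypothetical choices that merely mimic a predictor whose output is informative but imperfect.

```python
import random


def simulate_case1(q_th, num_resources=4, p_outage=0.2, trials=20000, seed=1):
    """Monte Carlo outage probability of greedy allocation (Case 1)."""
    rng = random.Random(seed)
    outages = 0
    for _ in range(trials):
        for i in range(num_resources):
            in_outage = rng.random() < p_outage
            # Toy predictor score (hypothetical): tends to be higher when the
            # resource is about to experience an outage.
            score = rng.uniform(0.3, 1.0) if in_outage else rng.uniform(0.0, 0.7)
            is_last = i == num_resources - 1
            # Allocate the first resource predicted to be good (score <= q_th);
            # in Case 1 the final resource is allocated regardless of the score.
            if score <= q_th or is_last:
                outages += in_outage
                break
    return outages / trials


for q_th in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(q_th, simulate_case1(q_th))
```

In this toy setting, $q_{\mathrm{th}} = 0$ always falls through to the final resource and $q_{\mathrm{th}} = 1$ always takes the first one, so both return roughly the single-resource outage probability, while intermediate thresholds give a lower value.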
Figs. 8 and 9 show the outage performance of a 4- and an 8-resource system, respectively, for Cases 1 and 2 when using the BCE, MSE and custom loss functions for a range of SNRs and two different rate thresholds. The outage probability of a single-resource system, i.e., $P_1(\gamma_{\mathrm{th}})$ in (4), is also plotted for comparison and is the same for all loss functions. As before, the custom loss function developed in this work demonstrates comparable or enhanced performance relative to the other loss functions, showing strong dominance in high-SNR regions. It is also evident that the custom loss function that was specifically formulated for Case 1 performs almost identically for Case 2.
Fig. 10 shows the outage performance of a 6-resource system for Case 1 when using different loss functions, $q_{\mathrm{th}} = 0.1$ and $\gamma_{\mathrm{th}} = 0.575$, for a range of output prediction lengths. Again, the custom loss function performs similarly to, or substantially improves upon, the other loss functions.

VI. CONCLUSIONS
In this work, novel expressions were presented for the outage probability of a single-user multi-resource allocation system that exploits an ML binary classification predictor to predict and avoid future resource outages. Through these expressions, we revealed that optimising the per-resource outage probability conditioned on the ML predictor recommending resource allocation may produce inadequate predictors that reject every resource. Focusing on standard metrics, such as precision, false-positive rate, or recall, may likewise produce inadequate predictors. We then proposed a novel custom loss function, which approximates the theoretical outage probability, is differentiable, and can be calculated using our dataset. Importantly, it was demonstrated that predictors trained with our novel loss function performed comparably to or better than those trained with the BCE, MAE or MSE loss functions. For example, in some cases we observed performance improvements of approximately two orders of magnitude over all other tested loss functions. Moreover, this improvement was obtained without addressing the imbalance in the proportion of outages to non-outages in the dataset, showcasing deep-tail learning.

A. PROOF OF THEOREM 1
Case 1: To prove this theorem for Case 1, we define the resource allocation event $E_i$, shown in (37). This takes two forms. The first is when $i < |R|$, i.e., when the number of remaining resources to select from is greater than 0. The second is when $i = |R|$, i.e., when there are no extra resources available, in which case the predictor's prediction is ignored.

This allows us to define the following event that corresponds to the simultaneous occurrence of resource outage and resource allocation:

Thus, for the greedy resource allocation strategy, the outage probability of the system (13) is given by the probability of the union over all possible events, $E_i$:

We now prove (15), the outage probability for an infinite-resource system. For this, see (40), (41) and (42). Here, (40) follows from $E_i$ and $E_{i'}$ being mutually exclusive⁸ for $i \neq i'$, (41) follows from (38), and (42) follows from Bayes' theorem. Since $H_i(\cdot, \cdot)$ and $H_{i'}(\cdot, \cdot)$ are independent for $i \neq i'$, (42) simplifies to

Because the resources are identically distributed, (43) simplifies to

$$P_\infty(\gamma_{\mathrm{th}}, q_{\mathrm{th}}) = \sum_{i=1}^{\infty} p \, \bigl(1 - F_Q(q_{\mathrm{th}})\bigr)^{i-1} F_Q(q_{\mathrm{th}}) \qquad (44)$$

where $p = P\left[ C(H_i(t + l, l)) < \gamma_{\mathrm{th}} \mid Q(H_i(t, k)) \le q_{\mathrm{th}} \right]$.

The infinite sum in (44) is a geometric series, which reduces to $p$. This completes the proof for the infinite-resource case given by (15) in Theorem 1.
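For completeness, the geometric-series step can be written out explicitly. The condition $F_Q(q_{\mathrm{th}}) > 0$, i.e., that the predictor recommends allocation with non-zero probability, is stated here as an assumption needed for the series to sum to one; it is not spelled out in the text above.

```latex
\begin{align*}
\sum_{i=1}^{\infty} p\,\bigl(1 - F_Q(q_{\mathrm{th}})\bigr)^{i-1} F_Q(q_{\mathrm{th}})
  &= p\,F_Q(q_{\mathrm{th}}) \sum_{j=0}^{\infty} \bigl(1 - F_Q(q_{\mathrm{th}})\bigr)^{j} \\
  &= p\,F_Q(q_{\mathrm{th}}) \cdot \frac{1}{1 - \bigl(1 - F_Q(q_{\mathrm{th}})\bigr)} \\
  &= p .
\end{align*}
```

By the same argument, truncating the sum after $|R| - 1$ terms gives $p\,\bigl(1 - (1 - F_Q(q_{\mathrm{th}}))^{|R|-1}\bigr)$, which shows how the probability of rejecting every resource enters once only finitely many resources are available.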

FIGURE 1. An example of Alice accessing available resources. Here, she scans from left to right and chooses the first good resource, $r^\star$. These resources could be portions of the spectrum in a licensed or an unlicensed band, sub-bands in frequency-division multiple access (FDMA), subcarriers in an orthogonal FDMA (OFDMA) system, or eigenchannels of a MIMO system, etc.

FIGURE 2. Pictorial representation of samples from $W_n$ (Definition 1) along with an overlay identifying how each sample is classified by the predictor. Outages occur when $b_i = 1$, and predicted outages occur for all samples lying inside the circular region. With the overlaid circular region, true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) can be identified.

FIGURE 4. Training versus validation loss curves generated for a 10-resource system with rate threshold $\gamma_{\mathrm{th}} = 0.5$, SNR = 0 dB and $q_{\mathrm{th}} = 0.5$.

FIGURE 5. Analytical and Monte Carlo simulation results for the outage probability of the system $P_{|R|}(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ in Case 1 (13), using the custom (29) or BCE loss function. Each figure considers a different number of available resources $|R|$ and rate thresholds with SNR = 0 dB. For the custom loss function, the model was trained with $q_{\mathrm{th}} = 0.5$.

FIGURE 6. ROC curve generated for a 10-resource system with rate threshold $\gamma_{\mathrm{th}} = 0.5$ and SNR = 0 dB. Each point is parametrically generated by sweeping $q_{\mathrm{th}}$ between 0 and 1.

FIGURE 7. Comparison of Monte Carlo simulation results for the outage probability of the system $P_{|R|}(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ in Case 1 (13) with 10 resources for different rate thresholds with SNR = 0 dB when using the custom (29), BCE, MAE and MSE loss functions. Each figure considers a different $q_{\mathrm{th}}$.

FIGURE 8. Comparison of Monte Carlo simulation results for the outage probability of the system $P_{|R|}(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ in Case 1 (13) and Case 2 (14) with 4 resources for different SNRs when using the custom (29), BCE, and MSE loss functions. Each figure considers a different $\gamma_{\mathrm{th}}$. $P_1(\gamma_{\mathrm{th}})$, given in (4), is also plotted for comparison and is the same for all loss functions.

FIGURE 9. Comparison of Monte Carlo simulation results for the outage probability of the system $P_{|R|}(\gamma_{\mathrm{th}}, q_{\mathrm{th}})$ in Case 1 (13) and Case 2 (14) with 8 resources for different SNRs when using the custom (29), BCE, and MSE loss functions. Here $\gamma_{\mathrm{th}} = 2$. $P_1(\gamma_{\mathrm{th}})$, given in (4), is also plotted for comparison and is the same for all loss functions.