Calibrating AI Models for Few-Shot Demodulation via Conformal Prediction

AI tools can be useful to address model deficits in the design of communication systems. However, conventional learning-based AI algorithms yield poorly calibrated decisions and are unable to quantify the uncertainty of their outputs. While Bayesian learning can enhance calibration by capturing epistemic uncertainty caused by limited data availability, formal calibration guarantees only hold under strong assumptions about the ground-truth, unknown, data generation mechanism. We propose to leverage the conformal prediction framework to obtain data-driven set predictions whose calibration properties hold irrespective of the data distribution. Specifically, we investigate the design of baseband demodulators in the presence of hard-to-model nonlinearities such as hardware imperfections, and propose set-based demodulators based on conformal prediction. Numerical results confirm the theoretical validity of the proposed demodulators, and offer insights into their efficiency in terms of average prediction set size.


A. Motivation
How reliable is your artificial intelligence (AI)-based model? The most common metric to design an AI model and to gauge its performance is the average accuracy. However, in applications in which AI decisions are used within a larger system, AI models should not only be as accurate as possible, but they should also be able to reliably quantify the uncertainty of their decisions. As an example, consider an unlicensed link that uses AI tools to predict the best channel to access out of four possible channels. A predictor that assigns the probability vector of [90%, 2%, 5%, 3%] to the possible channels predicts the same best channel (the first) as a predictor that outputs the probability vector [30%, 20%, 25%, 25%]. However, the latter predictor is less certain of its decision, and it may be preferable for the unlicensed link to refrain from accessing the channel when acting on less confident predictions, e.g., to avoid excessive interference to licensed links [1], [2].
As in the example above, AI models typically report a confidence measure associated with each prediction, which reflects the model's self-evaluation of the accuracy of a decision. Notably, neural network models implement probabilistic predictors that produce a probability distribution across all possible values of the output variable.
The self-reported model confidence, however, may not be a reliable measure of the true, unknown, accuracy of a prediction. In such situations, the AI model is said to be poorly calibrated. As illustrated in the example in Fig. 1, accuracy and calibration are distinct criteria, with neither criterion implying the other. It is, for instance, possible to have an accurate predictor that consistently underestimates the accuracy of its decisions, and/or that is overconfident when making incorrect decisions (see fourth column in Fig. 1). Conversely, one can have inaccurate predictions that correctly estimate their uncertainty (see fifth column in Fig. 1).
Deep learning models tend to produce either overconfident decisions [3], or calibration levels that rely on strong assumptions about the ground-truth, unknown, data generation mechanism [4]-[9]. This paper investigates the use of conformal prediction (CP) [10]-[12] as a framework to design provably well-calibrated AI predictors, with distribution-free calibration guarantees that do not require making any assumption about the ground-truth data generation mechanism.

B. Conformal Prediction for AI-Based Wireless Systems
CP leverages probabilistic predictors to construct well-calibrated set predictors. Instead of producing a probability vector, as in the examples in Fig. 1, a set predictor outputs a subset of the output space, as exemplified in Fig. 2. A set predictor is well calibrated if it contains the correct output with a pre-defined coverage probability selected by the system designer. For a well-calibrated set predictor, the size of the prediction set for a given input provides a measure of the uncertainty of the decision. Set predictors with smaller average prediction size are said to be more efficient [10].
This paper investigates CP as a general mechanism to obtain AI models with formal calibration guarantees for communication systems. The calibration guarantees of CP hold irrespective of the true, unknown, distribution underlying the generation of the variables of interest, and are defined either in terms of ensemble averages [10] or in terms of long-term averages [14]. CP is applied in conjunction with both frequentist and Bayesian learning, and specific applications to demodulation, modulation classification, and channel prediction are discussed.
Fig. 1. As compared to the ground-truth distribution in the second column, the first predictor (third column) is accurate, assigning the largest probability to the optimal decision (indicated as "opt" in the second column) and also well calibrated, reproducing the true accuracy of the decision; the second predictor (fourth column) is still accurate, but it is underconfident on the correct decision (for input x_1) and overconfident on the correct decision (for input x_2); the third predictor (fifth column) is not accurate, producing a uniform distribution across all output values, but is well calibrated if the data set is balanced [13]; and the last predictor (sixth column) is both inaccurate and poorly calibrated, providing overconfident decisions.
Fig. 2. Set predictors produce subsets of the range of the output variable (here { }) for each input. Calibration is measured with respect to a desired coverage level 1 − α: A set predictor is well calibrated if the true label is included in the prediction set with probability at least 1 − α. A well-calibrated set predictor can be inefficient if it returns excessively large set predictions (fourth column). In contrast, a poorly calibrated set predictor (fifth column) returns set predictions that include the true value of the label with a probability smaller than 1 − α.

C. Related Work
Most work on AI for communications relies on conventional frequentist learning tools (see, e.g., the review papers [15]-[18]). Frequentist learning is based on the minimization of the (regularized) training loss, which is interpreted as an estimate of the ground-truth population loss. When data is scarce, this estimate is unreliable, and hence the focus on a single, optimized, model parameter vector often yields probabilistic predictors that are poorly calibrated, producing overconfident decisions [3], [19]-[21].
Bayesian learning offers a principled way to address this problem [22], [23]. This is done by producing as the output of the learning process not a single model parameter vector, but rather a distribution in the model parameter space, which quantifies the model's epistemic uncertainty caused by limited access to data. A model trained via Bayesian learning produces probabilistic predictions that are averaged over the trained model parameter distribution.
This ensembling approach to prediction ensures that disagreements among models that fit the training data (almost) equally well are accounted for, substantially improving model calibration [24], [25].
Exact Bayesian learning offers formal guarantees of calibration only under the assumption that the assumed model is well specified [4], [5]. In practice, this means that the assumed neural network models should have sufficient capacity to represent the ground-truth data generation mechanism, and that the predictive uncertainty should be unimodal for continuous outputs (since conventional likelihoods are unimodal, e.g., Gaussian) [5], [23], [24]. These assumptions are easily violated in practice, especially in communication systems in which lower-complexity models must be implemented on edge devices, and access to data for specific network configurations is limited. Specific examples are provided in [21] for applications including modulation classification [45], [46] and localization [47], [48].
Robustified versions of Bayesian learning that are based on the optimization of a modified free energy criterion were shown empirically to partly address the problem of model misspecification [4], [5], with implications for communication systems presented in [21]. However, robust Bayesian learning solutions do not have formal guarantees of calibration in the presence of misspecified models.
Another family of methods that aim at enhancing the calibration of probabilistic models implements a validation-based post-processing phase. Platt scaling [49] and temperature scaling [3] find a fixed parametric mapping of the trained model output that minimizes the validation loss, while isotonic regression [50] applies a non-parametric binning approach. These recalibration-based approaches cannot guarantee calibration, as they may overfit the validation data set [51] and they are sensitive to the inaccuracy of the starting model [52].
Conformal prediction is a general framework for the design of set predictors that satisfy formal, distribution-free, guarantees of calibration [10], [11]. Given a desired miscoverage probability α, CP returns set predictions that include the correct output value with probability at least 1 − α under the only assumption that the data distribution is exchangeable. This condition is weaker than the standard assumption of "i.i.d." data made in the design of most machine learning systems.
The original work on CP, [10], introduced validation-based CP and full CP. Since then, progress has been made on reducing computational complexity, minimizing the size of the prediction sets, and further alleviating the assumptions of exchangeability. Cross-validation-based CP was proposed in [53] to reduce the computational complexity as compared to full CP, while improving the efficiency of validation-based CP. The authors of [54], [55] proposed the optimization of a CP-aware loss to improve the efficiency of validation-based CP, while avoiding the larger computational cost of cross-validation. The work [56] proposed reweighting as a means to handle distribution shifts between the examples in the data set and the test point. Other research directions include improvements in the training algorithms [57], [58], and the introduction of novel calibration metrics [59], [60]. Finally, online CP, presented in [14], [61], was shown to achieve long-term calibration over time without requiring statistical assumptions on the data generation.

D. Main Contributions
To the best of our knowledge, with the exception of the conference version [62] of this paper, this is the first work to investigate the application of CP to the design of AI models for communication systems. The main contributions of this paper are as follows.
• We provide a self-contained introduction to CP by focusing on validation-based CP [10], cross-validation-based CP [53], and online conformal prediction [61]. The presentation details connections to conventional probabilistic predictors, as well as the performance metrics used to assess calibration and efficiency.
• We propose the application of offline CP to the problems of symbol demodulation and modulation classification. The experimental results validate the theoretical property of CP methods of providing well-calibrated decisions. Furthermore, they demonstrate that naïve predictors that only rely on the output of either frequentist or Bayesian learning tools often result in poor calibration.
• Finally, we study the application of online CP to the problem of predicting received signal strength for over-the-air measured signals [63]. We demonstrate that online CP can attain the predefined target long-term coverage rate at the cost of a negligible increase in the prediction interval as compared to naïve predictors.
The conference version [62] of this work presented results only for symbol demodulation, while not providing background material on CP and not considering online CP. In contrast, this work is self-contained, presenting CP from first principles and including also online CP. Furthermore, this work investigates applications of CP to modulation classification and to channel prediction by leveraging real-world data sets [63], [64]. For reproducibility purposes, we have made our code publicly available.
The rest of this paper is organized as follows. In Sec. II, we define set predictors, and introduce the relevant performance metrics. Then, in Sec. III, naïve set predictors are introduced that do not provide guarantees in terms of calibration. Sec. IV describes conformal prediction, a general methodology to obtain well-calibrated set predictors. Sec. V details online conformal prediction, which is well suited for time-varying data. Applications to wireless communications are investigated in the following sections: Symbol demodulation is studied in Sec. VI; modulation classification in Sec. VII; and channel prediction in Sec. VIII. Sec. IX concludes the paper.

II. PROBLEM DEFINITION
This section introduces set predictors, along with key performance metrics of coverage and inefficiency. To this end, we start by describing the data-generation model and reviewing probabilistic predictors.

A. Data-Generation Model
We consider the standard supervised learning setting in which the learner is given a data set D = {z[i]}_{i=1}^{N} of N examples of input-output pairs z[i] = (x[i], y[i]) for i = 1, ..., N, and is tasked with producing a prediction on a test input x with unknown output y. Writing z = (x, y) for the test pair, data set D and test point z follow the unknown ground-truth, or population, distribution p_0(D, z). Apart from Sec. V, we further assume throughout that the population distribution p_0(D, z) is exchangeable, a condition that includes as a special case the traditional independent and identically distributed (i.i.d.) data-generation setting. Note that we will not make explicit the distinction between random variables and their realizations, which will be clear from the context.
Mathematically, exchangeability requires that the joint distribution p_0(D, z) does not depend on the ordering of the N + 1 variables {z[1], ..., z[N], z}. Equivalently, by de Finetti's theorem [65], there exists a latent random vector c with distribution p_0(c) such that, conditioned on c, the variables {z[1], ..., z[N], z} are i.i.d. Writing the conditional i.i.d. distribution as
$$p_0(\mathcal{D}, z \mid c) = p_0(z \mid c) \prod_{i=1}^{N} p_0(z[i] \mid c) \tag{1}$$
for some ground-truth sampling distribution p_0(z|c) given the variable c, under the exchangeability assumption, the joint distribution can be expressed as
$$p_0(\mathcal{D}, z) = \mathbb{E}_{p_0(c)}\big[ p_0(\mathcal{D}, z \mid c) \big], \tag{2}$$
where E_{p(x)}[·] denotes the expectation with respect to distribution p(x).
The vector c in (2) can be interpreted as including context variables that determine the specific learning task. For instance, in a wireless communication setting, the vector c may encode information about channel conditions. In Sec. V, we will consider a more general setting in which no assumptions are made on the distribution of the data.

B. Probabilistic Predictors
Before introducing set predictors, we briefly review conventional probabilistic predictors. Probabilistic predictors implement a parametric conditional distribution model p(y|x, φ) on the output y ∈ Y given the input x ∈ X, where φ ∈ Φ is a vector of model parameters. Given the training data set D, frequentist learning produces an optimized single vector φ*_D, while Bayesian learning returns a distribution q*(φ|D) on the model parameter space Φ [23], [24]. In either case, we will denote as p(y|x, D) the resulting optimized predictive distribution
$$p(y \mid x, \mathcal{D}) = \begin{cases} p(y \mid x, \phi^*_{\mathcal{D}}) & \text{for frequentist learning,} \\ \mathbb{E}_{\phi \sim q^*(\phi \mid \mathcal{D})}\big[ p(y \mid x, \phi) \big] & \text{for Bayesian learning.} \end{cases} \tag{3}$$
Note that the predictive distribution for Bayesian learning is obtained by averaging, or ensembling, over the optimized distribution q*(φ|D). We refer to Appendix A for basic background on frequentist and Bayesian learning.
From (3), one can obtain a point prediction ŷ for output y given input x as the probability-maximizing output
$$\hat{y}(x \mid \mathcal{D}) = \arg\max_{y' \in \mathcal{Y}} p(y' \mid x, \mathcal{D}). \tag{4}$$
In the case of a discrete set Y, the hard predictor (4) minimizes the probability of detection error under the model p(y|x, D). The probabilistic prediction p(y|x, D) also provides a measure of predictive uncertainty for all possible outputs y ∈ Y. In particular, for the point prediction ŷ(x|D) in (4), we have the predictive, self-reported, confidence level
$$\mathrm{conf}(x \mid \mathcal{D}) = \max_{y' \in \mathcal{Y}} p(y' \mid x, \mathcal{D}) = p\big(\hat{y}(x \mid \mathcal{D}) \mid x, \mathcal{D}\big). \tag{5}$$
As illustrated in Fig. 1, the performance of a probabilistic predictor can be evaluated in terms of both accuracy and calibration, with the latter quantifying the quality of uncertainty quantification via the confidence level (5) [3]. Specifically, a probabilistic predictor p(y|x, D) is said to be well calibrated [3] if the probability that the hard predictor ŷ = ŷ(x|D) equals the true label matches its confidence level π for all possible values of probability π ∈ [0, 1]. Mathematically, calibration is defined by the condition
$$\mathrm{P}\big( y = \hat{y} \mid p(\hat{y} \mid x, \mathcal{D}) = \pi \big) = \pi \quad \text{for all } \pi \in [0, 1], \tag{6}$$
where the probability P(·) follows the ground-truth distribution p_0(x, y). Stronger definitions, like that introduced in [66], require the predictive distribution to match the ground-truth distribution also for values of y that are distinct from (4).
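As a small illustration of the hard decision (4), the confidence level (5), and the calibration condition (6), the following Python sketch reuses the two probability vectors from the channel-access example in Sec. I. The function names and the list of hypothetical outcomes are illustrative assumptions, not part of the original work.

```python
from collections import defaultdict

def point_prediction(probs):
    """Hard decision (4) and self-reported confidence (5) for one input."""
    y_hat = max(range(len(probs)), key=lambda y: probs[y])
    return y_hat, probs[y_hat]

def accuracy_by_confidence(records):
    """Empirical accuracy for each reported confidence value, in the spirit of (6).

    records: iterable of (confidence, is_correct) pairs.
    """
    hits, counts = defaultdict(int), defaultdict(int)
    for confidence, is_correct in records:
        counts[confidence] += 1
        hits[confidence] += int(is_correct)
    return {c: hits[c] / counts[c] for c in counts}

# The two channel predictors from the motivating example: same decision,
# very different self-reported confidence.
print(point_prediction([0.90, 0.02, 0.05, 0.03]))  # (0, 0.9)
print(point_prediction([0.30, 0.20, 0.25, 0.25]))  # (0, 0.3)

# Hypothetical outcomes: confidence 0.9 is right 9 times out of 10 (calibrated),
# while confidence 0.8 is right only 5 times out of 10 (overconfident).
records = [(0.9, True)] * 9 + [(0.9, False)] + [(0.8, True)] * 5 + [(0.8, False)] * 5
print(accuracy_by_confidence(records))  # {0.9: 0.9, 0.8: 0.5}
```

A predictor is well calibrated in the sense of (6) when every entry of the returned dictionary maps a confidence value to (approximately) the same accuracy value.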

C. Set Predictors
A set predictor is defined as a set-valued function Γ(·|D) : X → 2^Y that maps an input x to a subset of the output domain Y based on data set D. We denote the size of the set predictor for input x as |Γ(x|D)|. As illustrated in the example of Fig. 2, the set size |Γ(x|D)| generally depends on input x, and it can be taken as a measure of the uncertainty of the set predictor.
The performance of a set predictor is evaluated in terms of calibration, or coverage, as well as of inefficiency. Coverage refers to the probability that the true label is included in the predicted set, while inefficiency refers to the average size |Γ(x|D)| of the predicted set. There is clearly a trade-off between the two metrics. A conservative set predictor that always produces the entire output space, i.e., Γ(x|D) = Y, would trivially yield a coverage probability equal to 1, but at the cost of exhibiting the worst possible inefficiency of |Y|. Conversely, a set predictor that always produces an empty set, i.e., Γ(x|D) = ∅, would achieve the best possible inefficiency, equal to zero, while also presenting the worst possible coverage probability equal to zero.
Let us denote a set predictor Γ(·|·) for short as Γ. Formally, the coverage level of set predictor Γ is the probability that the true output y is included in the prediction set Γ(x|D) for a test pair z = (x, y). This can be expressed as coverage(Γ) = P(y ∈ Γ(x|D)), where the probability P(·) is taken over the ground-truth joint distribution p_0(D, (x, y)). The set predictor Γ is said to be (1 − α)-valid if it satisfies the inequality
$$\mathrm{coverage}(\Gamma) = \mathrm{P}\big( y \in \Gamma(x \mid \mathcal{D}) \big) \geq 1 - \alpha. \tag{7}$$
When the desired coverage level 1 − α is fixed by the predetermined target miscoverage level α ∈ [0, 1], we will also refer to set predictors satisfying (7) as being well calibrated.
Following the discussion in the previous paragraph, it is straightforward to design a valid, or well-calibrated, set predictor, even for the restrictive case of miscoverage level α = 0. This can be, in fact, achieved by producing the full set Γ(x|D) = Y for all inputs x. One should, therefore, also consider the inefficiency of predictor Γ. The inefficiency of set predictor Γ is defined as the average prediction set size
$$\mathrm{inefficiency}(\Gamma) = \mathbb{E}\big[ |\Gamma(x \mid \mathcal{D})| \big], \tag{8}$$
where the average is taken over the data set D and the test pair (x, y) following their exchangeable joint distribution p_0(D, (x, y)).
In practice, the coverage condition (7) is relevant if the learner produces multiple predictions using independent data sets D, and is tested on multiple pairs (x, y). In fact, in this case, the probability in (7) can be interpreted as the fraction of predictions for which the set predictor Γ(x|D) includes the correct output. This situation, illustrated in Fig. 3(a), is quite common in communication systems, particularly at the lower layers of the protocol stack. For instance, the data D may correspond to pilots received in a frame, and the test point z to a symbol within the payload part of the frame (see Sec. VI). While the coverage condition (7) is defined under the assumption of a fixed ground-truth distribution p_0(D, z), in Sec. V we will allow for temporal distributional shifts and we will focus on validity metrics defined as long-term time averages (see Fig. 3(b)).
Fig. 4. A naïve probabilistic-based (NPB) set predictor uses a pre-trained probabilistic predictor to include all output values to which the probabilistic predictor assigns the largest probabilities that reach the coverage target 1 − α. This naïve scheme has no formal guarantee of calibration, i.e., it does not guarantee the coverage condition (7), unless the original probabilistic predictor is well calibrated.

III. NAÏVE SET PREDICTORS
Before describing CP in the next section, in this section we review two naïve, but natural and commonly used, approaches to produce set predictors, which generally fail to satisfy the coverage condition (7).

A. Naïve Set Predictors from Probabilistic Predictors
Given a probabilistic predictor p(y|x, D) as in (3), one could construct a set predictor by relying on the confidence levels reported by the model. Specifically, aiming at satisfying the coverage condition (7), given an input x, one could construct the smallest subset of the output domain Y that covers a fraction 1 − α of the probability assigned by model p(y|x, D). Mathematically, the resulting naïve probabilistic-based (NPB) set predictor is defined as
$$\Gamma^{\mathrm{NPB}}(x \mid \mathcal{D}) = \arg\min_{\Gamma \in 2^{\mathcal{Y}}} |\Gamma| \quad \text{s.t.} \quad \sum_{y' \in \Gamma} p(y' \mid x, \mathcal{D}) \geq 1 - \alpha \tag{9}$$
for the case of a discrete set, and an analogous definition applies in the case of a continuous domain Y. Fig. 4 illustrates the NPB for a prediction problem with output domain size |Y| = 4. Given that, as mentioned in Sec. I, probabilistic predictors are typically poorly calibrated, the naïve set predictor (9) does not generally satisfy condition (7) for the given desired miscoverage level α, and hence it is not well calibrated. For example, in the typical case in which the probabilistic predictor is overconfident [3], the predicted sets (9) tend to be too small to satisfy the coverage condition (7).
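The NPB construction can be sketched in a few lines of Python: labels are added in order of decreasing model probability until the accumulated mass reaches 1 − α. The function name and the small numerical tolerance are assumptions introduced for illustration; the probability vectors are those of the example in Sec. I.

```python
def npb_set(probs, alpha, tol=1e-12):
    """Smallest label set whose total model probability is at least 1 - alpha.

    probs: list of model probabilities p(y|x, D), indexed by label.
    tol:   guards against floating-point round-off at the boundary.
    """
    # Greedy choice is optimal here: the most probable labels reach the
    # target mass with the fewest elements.
    order = sorted(range(len(probs)), key=lambda y: probs[y], reverse=True)
    chosen, mass = [], 0.0
    for y in order:
        if mass >= 1.0 - alpha - tol:
            break
        chosen.append(y)
        mass += probs[y]
    return sorted(chosen)

print(npb_set([0.90, 0.02, 0.05, 0.03], alpha=0.1))  # [0]
print(npb_set([0.30, 0.20, 0.25, 0.25], alpha=0.1))  # [0, 1, 2, 3]
```

The set size tracks the model's self-reported uncertainty, which is exactly why the scheme inherits any miscalibration of the underlying probabilistic predictor.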

B. Naïve Set Predictors from Quantile Predictors
While the naïve probabilistic-based set predictor (9) applies to both discrete and continuous target variables, we now focus on the important special case in which Y is the set of real numbers, i.e., Y = R. This corresponds to scalar regression problems, such as for channel prediction (see Sec. VIII). Under this assumption, one can construct a naïve set predictor based on estimates of the α/2- and (1 − α/2)-quantiles y_{α/2}(x) and y_{1−α/2}(x) of the ground-truth distribution p_0(y|x) (obtained from the joint distribution p_0(D, z)). In fact, writing
$$y_q(x) = \inf\Big\{ y \in \mathbb{R} : \int_{-\infty}^{y} p_0(y' \mid x)\, \mathrm{d}y' \geq q \Big\} \tag{10}$$
as the q-quantile, with q ∈ [0, 1], of the ground-truth distribution p_0(y|x), the interval [y_{α/2}(x), y_{1−α/2}(x)] contains the true value y with probability 1 − α.
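To make the quantile interval concrete, the following sketch assumes, purely for illustration, that the ground-truth conditional distribution p_0(y|x) is Gaussian with known mean and standard deviation. In practice this distribution is unknown and the quantiles must be estimated from data, which is where the naïve predictor loses its guarantee.

```python
from statistics import NormalDist

def quantile_interval(mean, std, alpha):
    """Interval [y_{alpha/2}, y_{1-alpha/2}] for an assumed Gaussian p_0(y|x)."""
    dist = NormalDist(mu=mean, sigma=std)
    return dist.inv_cdf(alpha / 2), dist.inv_cdf(1 - alpha / 2)

lo, hi = quantile_interval(mean=0.0, std=1.0, alpha=0.1)
# Probability mass captured by the interval under the same distribution.
coverage = NormalDist().cdf(hi) - NormalDist().cdf(lo)
print(round(lo, 3), round(hi, 3), round(coverage, 3))  # -1.645 1.645 0.9
```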

IV. CONFORMAL PREDICTION
In this section, we review CP-based set predictors, which have the key property of guaranteeing the (1 − α)-validity condition (7) for any predetermined miscoverage level α, irrespective of the ground-truth distribution p_0(D, z) of the data. We specifically focus on validation-based CP [10] and cross-validation-based CP [53], which are more practical variants of full CP [10], [69]. In Sec. V, we cover online CP [14], [61].

A. Validation-Based CP (VB-CP)
In this subsection, we describe validation-based CP (VB-CP), which partitions the available set D = D^tr ∪ D^val into a training set D^tr with N^tr samples and a validation set D^val with N^val = N − N^tr samples (Fig. 5(a)). This class of methods is also known as inductive CP [10] or split CP [53].
VB-CP operates on any pre-trained probabilistic model p(y|x, D^tr) obtained using the training set D^tr as per (3). At test time, given an input x, VB-CP relies on a validation set to determine which labels y′ ∈ Y should be included in the predicted set. Specifically, for any given test input x, a label y′ ∈ Y is included in set Γ^VB(x|D) depending on the extent to which the candidate pair (x, y′) "conforms" with the examples in the validation set.
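A minimal sketch of this conformity test is given below. It assumes the log-loss as the nonconformity score and the standard split-conformal threshold, namely the ⌈(1 − α)(N^val + 1)⌉-th smallest validation score with +∞ appended. The function names and the toy probabilities are hypothetical, and the trained model is abstracted away by passing its output probabilities directly.

```python
import math

def nc_score(prob_of_label):
    """Log-loss score: large when the model gives the label low probability."""
    return -math.log(prob_of_label)

def conformal_threshold(val_scores, alpha):
    """ceil((1 - alpha)(N + 1))-th smallest validation score, with +inf appended.

    Assumes 0 < alpha < 1.
    """
    n = len(val_scores)
    k = math.ceil((1 - alpha) * (n + 1))
    return (sorted(val_scores) + [math.inf])[k - 1]

def vb_cp_set(test_probs, val_probs_of_true_label, alpha):
    """Labels whose score on the test input does not exceed the threshold.

    test_probs: {label: model probability for the test input}.
    val_probs_of_true_label: model probability assigned to the true label
        of each validation example.
    """
    val_scores = [nc_score(p) for p in val_probs_of_true_label]
    threshold = conformal_threshold(val_scores, alpha)
    return {y for y, p in test_probs.items() if nc_score(p) <= threshold}

test_probs = {0: 0.55, 1: 0.35, 2: 0.07, 3: 0.03}
val_probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(vb_cp_set(test_probs, val_probs, alpha=0.25))       # {0, 1}
# With only two validation points the threshold is +inf: every label is kept.
print(vb_cp_set(test_probs, [0.9, 0.8], alpha=0.25))      # {0, 1, 2, 3}
```

The second call shows how a very small validation set forces the predictor to fall back on the full output space, trading efficiency for guaranteed coverage.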
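This "conformity" test for a candidate pair is based on a nonconformity (NC) score. An NC score for VB-CP can be obtained as the log-loss
$$\mathrm{NC}\big( z = (x, y) \mid \mathcal{D}^{\mathrm{tr}} \big) = -\log p(y \mid x, \mathcal{D}^{\mathrm{tr}}) \tag{15}$$
or as any other score function that measures the loss of the probabilistic predictor p(y|x, D^tr) on example (x, y). It is also possible to define NC scores for quantile-based predictors as in (14), and we refer to [61] for details. Mathematically, the VB-CP set predictor is obtained as
$$\Gamma^{\mathrm{VB}}(x \mid \mathcal{D}) = \Big\{ y' \in \mathcal{Y} : \mathrm{NC}\big( (x, y') \mid \mathcal{D}^{\mathrm{tr}} \big) \leq Q_{\alpha}\big( \{ \mathrm{NC}(z^{\mathrm{val}}[i] \mid \mathcal{D}^{\mathrm{tr}}) \}_{i=1}^{N^{\mathrm{val}}} \big) \Big\}, \tag{16}$$
where the empirical quantile from the top for a set of N real values {r[i]}_{i=1}^{N} is defined as
$$Q_{\alpha}\big( \{ r[i] \}_{i=1}^{N} \big) = \lceil (1 - \alpha)(N + 1) \rceil\text{-th smallest value of the set } \{ r[i] \}_{i=1}^{N} \cup \{ +\infty \}. \tag{17}$$
As illustrated in Fig. 6, K-fold cross-validation-based CP (CV-CP) [53], referred to here as K-CV-CP, first partitions the data set D into K disjoint folds {S_k}_{k=1}^{K}, each with N/K points, i.e., ∪_{k=1}^{K} S_k = D (Fig. 6(a)), for a predefined integer K ∈ {2, ..., N} such that the ratio N/K is an integer.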
During training, the K subsets D \ S_k are used to train K probabilistic predictors p(y|x, D \ S_k) defined as in (3) (Fig. 6(b)). Each trained model p(y|x, D \ S_k) is used to evaluate the |S_k| = N/K NC scores NC(z_k | D \ S_k) for all validation data points z_k ∈ S_k that were not used for training the model (Fig. 6(c)). Unlike VB-CP, K-CV-CP requires keeping in memory all the N validation scores for testing. These points are illustrated as crosses in Fig. 6(c).
During testing, for a given test input x and for any candidate label y′ ∈ Y, CV-CP evaluates K NC scores, one for each of the K trained models. Each such NC score NC((x, y′) | D \ S_k) is compared with the N/K validation scores obtained on fold S_k. We then count how many of the N/K validation scores are larger than NC((x, y′) | D \ S_k). If the sum of all such counts, across the K folds {S_k}_{k=1}^{K}, is larger than a fraction α of all N data points, then the candidate label y′ is included in the prediction set (Fig. 6(d)). This criterion follows the same principle of VB-CP of including all candidate labels y′ that "conform" well with a sufficiently large fraction of validation points.
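The counting rule can be sketched as follows. The K trained models are abstracted away by passing precomputed NC scores, and the inclusion threshold ⌊α(N + 1)⌋ with a non-strict comparison is an assumption borrowed from the usual cross-validation+ convention; all names and numbers are hypothetical.

```python
import math

def k_cv_cp_set(candidate_scores, fold_val_scores, alpha):
    """K-fold cross-validation CP from precomputed NC scores.

    candidate_scores: {label: [score of (x, label) under model k, for each k]}.
    fold_val_scores:  [[scores of the held-out points of fold k under model k]].
    """
    n = sum(len(fold) for fold in fold_val_scores)
    min_count = math.floor(alpha * (n + 1))
    prediction_set = set()
    for label, scores in candidate_scores.items():
        # Count validation points that conform no better than the candidate.
        count = sum(
            sum(1 for v in fold if v >= s)
            for s, fold in zip(scores, fold_val_scores)
        )
        if count >= min_count:
            prediction_set.add(label)
    return prediction_set

folds = [[0.1, 0.5, 1.0, 2.0], [0.2, 0.4, 0.9, 1.5]]   # K = 2, N = 8
candidates = {"A": [0.3, 0.3], "B": [1.8, 1.2], "C": [2.5, 1.4]}
# Counts are 6, 2, and 1, against a required minimum of floor(0.25 * 9) = 2.
print(sorted(k_cv_cp_set(candidates, folds, alpha=0.25)))  # ['A', 'B']
```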
Mathematically, K-CV-CP is defined as
$$\Gamma^{K\text{-CV}}(x \mid \mathcal{D}) = \Big\{ y' \in \mathcal{Y} : \sum_{k=1}^{K} \sum_{z_k \in \mathcal{S}_k} \mathbb{1}\big( \mathrm{NC}((x, y') \mid \mathcal{D} \setminus \mathcal{S}_k) \leq \mathrm{NC}(z_k \mid \mathcal{D} \setminus \mathcal{S}_k) \big) \geq \lfloor \alpha (N + 1) \rfloor \Big\}, \tag{18}$$
where 1(·) is the indicator function (1(true) = 1 and 1(false) = 0). The left-hand side of the inequality in (18) implements the sums, shown in Fig. 6(d), over counts of validation NC scores that are larger than the corresponding NC score for the candidate pair (x, y′). K-CV-CP increases the computational complexity K-fold as compared to VB-CP, while generally reducing the inefficiency [53]. The special case of K = N, known as jackknife+ [53], is referred to here as CV-CP. In this case, each of the N folds S_k, k = 1, ..., N, uses a single cross-validation point. In general, CV-CP is the most efficient form of K-CV-CP, but it may be impractical for large data set sizes due to the need to train N models. The number of folds K should strike a balance between computational complexity, as K models are trained, and inefficiency.
The validity of CV-CP requires the NC scores to be permutation-invariant with respect to the training data. Specifically, for frequentist learning, the optimization algorithm producing the parameter vector φ*_D in (3) must be permutation-invariant. This is the case for standard methods such as full-batch gradient descent (GD), or for non-parametric techniques such as Gaussian processes. For Bayesian learning, the distribution q*(φ|D) in (3) must also be permutation-invariant, which is true for the exact posterior distribution [23], as well as for approximations obtained via MC methods such as Langevin MC [23], [31].
The requirement on permutation-invariance can be alleviated by allowing for probabilistic training algorithms such as stochastic gradient descent (SGD) [70]. With probabilistic training algorithms, the only requirement is that the distribution of the (random) output models is permutation-invariant. This is, for instance, the case if SGD is implemented by taking mini-batches uniformly at random within the training set D [70]-[72]. With probabilistic training algorithms, however, the validity condition (7) of CV-CP is only guaranteed on average with respect to the random outputs of the algorithms. Specifically, under the discussed assumption of permutation-invariance of the NC scores, by [53, Theorems 1 and 4], CV-CP satisfies the inequality
$$\mathrm{P}\big( y \in \Gamma^{\mathrm{CV}}(x \mid \mathcal{D}) \big) \geq 1 - 2\alpha, \tag{19}$$
while K-CV-CP satisfies the inequality
$$\mathrm{P}\big( y \in \Gamma^{K\text{-CV}}(x \mid \mathcal{D}) \big) \geq 1 - 2\alpha - \min\Big\{ \frac{2(1 - 1/K)}{N/K + 1}, \frac{1 - K/N}{K + 1} \Big\}. \tag{20}$$
Therefore, validity for both cross-validation schemes is guaranteed for the larger miscoverage level of 2α. Accordingly, one can achieve miscoverage level of α, satisfying (7), by considering the CV-CP set predictor Γ^CV(x|D) with α/2 in lieu of α in (18). That said, in the experiments, we will follow the recommendation in [53] and [71] to use α in (18).

V. ONLINE CONFORMAL PREDICTION
In this section, we turn to online CP. Unlike the CP schemes presented in the previous section, online CP makes no assumptions about the probabilistic model underlying data generation [14], [61]. Rather, it models the observations as a deterministic stream of input-output pairs z[i] = (x[i], y[i]) over time index i = 1, 2, ...; and it targets a coverage condition defined in terms of the empirical rate at which the prediction set Γ_i at time i covers the correct output y[i].
In the offline version of CP reviewed in the previous section, all N samples of the data set D are assumed to be available upfront (see Fig. 3(a)). In contrast, in online CP, a set predictor Γ_i for time index i is produced for each new input x[i] over time i = 1, 2, .... Specifically, given the past observations {z[j]}_{j=1}^{i−1}, the set predictor Γ_i(x[i] | {z[j]}_{j=1}^{i−1}) outputs a subset of the output space Y. Given a target miscoverage level α ∈ [0, 1], an online set predictor is said to be (1 − α)-long-term valid if the following limit holds
$$\lim_{I \to \infty} \frac{1}{I} \sum_{i=1}^{I} \mathbb{1}\big( y[i] \in \Gamma_i(x[i] \mid \{ z[j] \}_{j=1}^{i-1}) \big) = 1 - \alpha \tag{21}$$
for all possible sequences z[i] with i = 1, 2, .... Note that the condition (21), unlike (7), does not involve any ensemble averaging with respect to the data distribution. We will take (21) as the relevant definition of calibration for online learning.
Rolling conformal inference (RCI) [61] adapts in an online fashion a calibration parameter θ[i] across the time index i as a function of the instantaneous error variable
$$\mathrm{err}[i] = \mathbb{1}\big( y[i] \notin \Gamma_i(x[i]) \big), \tag{22}$$
which equals 1 if the correct output value is not included in the prediction set Γ_i(x[i]), and 0 otherwise. This is done using the update rule
$$\theta[i + 1] = \theta[i] + \gamma \big( \mathrm{err}[i] - \alpha \big), \tag{23}$$
where γ > 0 is a learning rate. Accordingly, the parameter θ is increased by γ(1 − α) if an error occurs at time i, and is decreased by γα otherwise. Intuitively, a large positive parameter θ[i] indicates that the set predictor should be more inclusive in order to meet the validity constraint (21); and vice versa, a large negative value of θ[i] suggests that the set predictor can reduce the size of the prediction sets without affecting the long-term validity constraint (21).
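The update rule is simple enough to state in code. The sketch below applies it to a given sequence of error indicators and checks a telescoping identity that follows directly from the update: after I steps, the empirical error rate equals α + (θ[I + 1] − θ[1])/(γI), so a bounded calibration parameter forces the error rate toward α. Function names and the toy error sequence are illustrative assumptions.

```python
def rci_update(theta, err, alpha, gamma):
    """One step: +gamma*(1 - alpha) after an error, -gamma*alpha otherwise."""
    return theta + gamma * (err - alpha)

def run_rci(errors, alpha, gamma, theta0=0.0):
    """Apply the update along a sequence of 0/1 error indicators."""
    theta = theta0
    for err in errors:
        theta = rci_update(theta, err, alpha, gamma)
    return theta

alpha, gamma = 0.1, 0.5
errors = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # hypothetical error pattern, rate 0.2
theta_final = run_rci(errors, alpha, gamma)

# Telescoping the updates recovers the empirical error rate from theta alone.
implied_rate = alpha + (theta_final - 0.0) / (gamma * len(errors))
print(abs(implied_rate - sum(errors) / len(errors)) < 1e-9)  # True
```

Here the error rate exceeds the target, so the parameter ends up positive, signalling that subsequent prediction sets should be widened.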
Following [61], we elaborate on the use of the calibration parameter θ[i] in order to ensure condition (21) for an online version of the naïve quantile-based predictor (14) for scalar regression. A similar approach applies more broadly (see [14], [73], and [74]). Denote as D_i = {z[j]}_{j=1}^{i−1} the data set containing all labeled data observed up to time i − 1. The key idea behind RCI is to extend the naïve prediction interval (14) depending on the calibration parameter θ[i] as
$$\Gamma_i(x \mid \mathcal{D}_i) = \big[ \hat{y}_{\alpha/2}(x \mid \phi_{\mathcal{D}_i}) - \varphi(\theta[i]),\ \hat{y}_{1-\alpha/2}(x \mid \phi_{\mathcal{D}_i}) + \varphi(\theta[i]) \big], \tag{24}$$
where ϕ(·) is the so-called stretching function, a fixed monotonically increasing mapping.
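The effect of the calibration parameter on the interval can be sketched as follows. The exponential stretching function ϕ(θ) = sign(θ)(exp(|θ|) − 1) used here is an assumption made for illustration (any fixed, monotonically increasing mapping fits the description above), and the quantile estimates are passed in as plain numbers rather than produced by a trained model.

```python
import math

def stretch(theta):
    """Assumed stretching function: sign(theta) * (exp(|theta|) - 1)."""
    return math.copysign(math.exp(abs(theta)) - 1.0, theta)

def rci_interval(q_lo, q_hi, theta):
    """Widen (theta > 0) or shrink (theta < 0) the naive quantile interval.

    Returns None when shrinking makes the interval empty.
    """
    margin = stretch(theta)
    lo, hi = q_lo - margin, q_hi + margin
    return (lo, hi) if lo <= hi else None

print(rci_interval(-1.0, 1.0, theta=0.0))            # (-1.0, 1.0)
print(rci_interval(-1.0, 1.0, theta=math.log(2.0)))  # roughly (-2.0, 2.0)
print(rci_interval(-1.0, 1.0, theta=-math.log(3.0))) # None
```

With this choice, the interval is unchanged at θ = 0 and grows exponentially fast in θ, so that a run of errors quickly enlarges the prediction set.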

VI. SYMBOL DEMODULATION
In this section, we focus on the application of offline CP, as described in Sec. IV, to the problem of symbol demodulation in the presence of transmitter hardware imperfections. This problem was also considered in [20], [75] by focusing on frequentist and Bayesian learning. Unlike [20], [75], we investigate the use of CP as a means to obtain set predictors satisfying the validity condition (7).

A. Problem Formulation
The problem of interest consists of the demodulation of symbols from a discrete constellation based on received baseband signals subject to hardware imperfections, noise, and fading. The goal is to design set demodulators that output a subset of all possible constellation points with the guarantee that the subset includes the true transmitted signal with the desired target probability 1 − α. In the context of channel decoding, this type of receiver is referred to as a list decoder [76].
To keep the notation consistent with the previous sections, we write as y[i] the i-th transmitted symbol, and as x[i] the corresponding received signal. Each transmitted symbol y[i] is drawn uniformly at random from a given constellation Y. We model I/Q imbalance at the transmitter and phase fading as in [62]. Accordingly, the ground-truth channel law connecting symbols y[i] to received samples x[i] is described by the equality
$$x[i] = e^{j\psi} f_{\mathrm{IQ}}(y[i]) + v[i] \tag{26}$$
for a random phase ψ ∼ U[0, 2π), where the additive noise is v[i] ∼ CN(0, SNR^{−1}) for signal-to-noise ratio level SNR. Furthermore, the I/Q imbalance function [77] is defined as
$$f_{\mathrm{IQ}}(y[i]) = \bar{y}_I[i] + j\, \bar{y}_Q[i], \tag{27}$$
where
$$\begin{bmatrix} \bar{y}_I[i] \\ \bar{y}_Q[i] \end{bmatrix} = \begin{bmatrix} 1 + \epsilon & 0 \\ 0 & 1 - \epsilon \end{bmatrix} \begin{bmatrix} \cos\delta & -\sin\delta \\ -\sin\delta & \cos\delta \end{bmatrix} \begin{bmatrix} y_I[i] \\ y_Q[i] \end{bmatrix}, \tag{28}$$
with y_I[i] and y_Q[i] being the real and imaginary parts of the modulated symbol y[i]; and ȳ_I[i] and ȳ_Q[i] standing for the real and imaginary parts of the transmitted symbol f_IQ(y[i]). In (26)-(28), the channel state c consists of the tuple c = (ψ, ε, δ) encompassing the complex phase ψ and the I/Q imbalance parameters (ε, δ).
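A frame of pilots under this channel model can be simulated with a short Python sketch. Several elements are assumptions made for illustration: the QPSK constellation, the function names, and the specific I/Q imbalance parameterization (amplitude mismatch ε and phase mismatch δ applied as a gain matrix times a skew matrix). The phase ψ is drawn once per frame, while ε and δ are passed in because their distribution is not specified here.

```python
import cmath
import math
import random

# Hypothetical QPSK constellation; the text only assumes "a given constellation".
QPSK = [complex(a, b) / math.sqrt(2) for a in (1, -1) for b in (1, -1)]

def iq_imbalance(y, eps, delta):
    """Assumed I/Q imbalance map with amplitude (eps) and phase (delta) mismatch."""
    y_i, y_q = y.real, y.imag
    out_i = (1 + eps) * (math.cos(delta) * y_i - math.sin(delta) * y_q)
    out_q = (1 - eps) * (-math.sin(delta) * y_i + math.cos(delta) * y_q)
    return complex(out_i, out_q)

def received_sample(y, psi, eps, delta, snr, rng):
    """Phase rotation of the impaired symbol plus complex Gaussian noise."""
    # CN(0, 1/snr): real and imaginary parts each have variance 1/(2*snr).
    sigma = math.sqrt(1.0 / (2.0 * snr))
    noise = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
    return cmath.exp(1j * psi) * iq_imbalance(y, eps, delta) + noise

def generate_frame(n, eps, delta, snr, rng, constellation=QPSK):
    """Draw one phase per frame, then n (received sample, symbol) pairs."""
    psi = rng.uniform(0.0, 2.0 * math.pi)
    symbols = [rng.choice(constellation) for _ in range(n)]
    samples = [received_sample(y, psi, eps, delta, snr, rng) for y in symbols]
    return samples, symbols

rng = random.Random(0)
samples, symbols = generate_frame(n=8, eps=0.1, delta=0.1, snr=10.0, rng=rng)
print(len(samples), len(symbols))  # 8 8
```

Because the channel state is fixed within a frame and redrawn across frames, the pilots and the payload symbol of one frame are conditionally i.i.d. given c, which matches the exchangeable data model of Sec. II-A.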

B. Implementation
As in [20], [75], demodulation is implemented via a neural network probabilistic model p(y|x, φ) consisting of a fully connected network with real inputs x[i] of dimension 2 as per (26), followed by three hidden layers with 10, 30, and 30 neurons having ReLU activations in each layer. The last layer implements a softmax classification for the |Y| possible constellation points.
We adopt the standard NC score (15), where the trained model φ_D for frequentist learning is obtained via I = 120 GD update steps for the minimization of the cross-entropy training loss with learning rate η = 0.2; while for Bayesian learning we implement a gradient-based MC method, namely Langevin MC, with burn-in period of R_min = 100, ensemble size R = 20, learning rate η = 0.2, and temperature parameter T = 20. We assume a standard Gaussian distribution for the prior distribution [31]. Details on Langevin MC can be found in Appendix A.
We compare the naïve set predictor (9), also studied in [20], [75], which provides no formal coverage guarantees, with the CP set prediction methods reviewed in Sec. IV. VB-CP uses equal set sizes for the training and validation sets. We target the miscoverage level as α = 0.1.

Figs. 7 and 8. Coverage and inefficiency for NPB (9), VB-CP (16), CV-CP, and K-CV-CP (18) with K = 4, for the symbol demodulation problem (Section VI). For every set predictor, the NC scores are evaluated using either frequentist learning (dashed lines) or Bayesian learning (solid lines). The coverage level is set to 1 − α = 0.9, and each numerical evaluation is averaged over 50 independent trials (new channel state c) with N^te = 100 test points.
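C. Results
Fig. 7 shows the empirical coverage
$$\mathrm{coverage} = \frac{1}{N^{\mathrm{te}}} \sum_{j=1}^{N^{\mathrm{te}}} \mathbb{1}\Big(y^{\mathrm{te}}[j] \in \Gamma\big(x^{\mathrm{te}}[j]\,\big|\,\mathcal{D}\big)\Big), \tag{29}$$
and Fig. 8 shows the empirical inefficiency
$$\mathrm{inefficiency} = \frac{1}{N^{\mathrm{te}}} \sum_{j=1}^{N^{\mathrm{te}}} \Big|\Gamma\big(x^{\mathrm{te}}[j]\,\big|\,\mathcal{D}\big)\Big|, \tag{30}$$
both evaluated on a test set D^te = {(x^te[j], y^te[j])}_{j=1}^{N^te} with N^te = 100, as a function of the size N of the available data set D. We average the results over 50 independent trials, each corresponding to independent draws of the variables {D, D^te} from the ground-truth distribution. This way, the metrics (29)-(30) provide an estimate of the coverage (7) and of the inefficiency (8), respectively [53].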
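As a small illustration of the two test metrics, the sketch below computes the fraction of test points whose true label falls in the predicted set and the average predicted set size. The function names and the representation of prediction sets as Python `set` objects are assumptions made for this example.

```python
import numpy as np

def empirical_coverage(pred_sets, y_test):
    """Fraction of test points whose true label is inside the predicted set."""
    return float(np.mean([y in s for s, y in zip(pred_sets, y_test)]))

def empirical_inefficiency(pred_sets):
    """Average size of the predicted sets."""
    return float(np.mean([len(s) for s in pred_sets]))

# Usage on three hypothetical test points.
demo_sets = [{0, 1}, {2}, {0, 1, 3}]
demo_labels = [0, 1, 3]
demo_cov = empirical_coverage(demo_sets, demo_labels)
demo_ineff = empirical_inefficiency(demo_sets)
```

A valid set predictor should yield a coverage of at least 1 − α on average over trials, and among valid predictors a smaller inefficiency is preferable.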
From Fig. 7, we first observe that the naïve set predictor, with both frequentist and Bayesian learning, does not meet the desired coverage level in the regime of a small number N of available samples. In contrast, confirming the theoretical calibration guarantees presented in Sec. IV, all CP methods provide coverage guarantees, achieving coverage rates above 1 − α. Furthermore, as seen in Fig. 8, coverage guarantees are achieved by suitably increasing the size of prediction sets, which is reflected by the larger inefficiency. The size of the prediction sets, and hence the inefficiency, decreases as the data set size N increases. In this regard, due to their more efficient use of the available data, CV-CP and K-CV-CP predictors have a lower inefficiency as compared to VB predictors, with CV-CP offering the best performance. Finally, Bayesian NC scores are generally seen to yield set predictors with lower inefficiency, confirming the merits of Bayesian learning in terms of calibration.

VII. MODULATION CLASSIFICATION
In this section, we propose and evaluate the application of offline CP to the problem of modulation classification [45], [46].

A. Problem Formulation
Due to the scarcity of frequency bands, electromagnetic spectrum sharing among licensed and unlicensed users is of special interest to improve the efficiency of spectrum utilization. In sensing-based spectrum sharing, a transmitter scans the prospective frequency bands to identify, for each band, whether the spectrum is occupied, and, if so, whether the signal is from a licensed user or not. A key enabler for this operation is the ability to classify the modulation of the received signal [78]. The modulation classification task is made challenging by the dimensionality of the baseband input signal and by the distortions caused by the propagation channel. Data-driven solutions [79] have been shown to be effective for this problem in terms of accuracy, while the focus here is on calibration performance. Accordingly, we aim at designing set modulation classifiers that output a subset of the set of all possible modulation schemes with the property that the true modulation scheme is contained in the subset with a desired probability level 1 − α. To this end, we adopt the data set provided by [80], which has approximately 2.5 × 10^6 baseband signals of 1024 I/Q samples, each produced using one out of 24 possible digital and analog modulations across different SNR values and channel models. We focus only on the high SNR regime (≥ 6 dB). This data set D is made out of approximately 1.28 × 10^6 (x, y) pairs, where x is the channel output signal of dimension 2048 and y is the index of one of the |Y| = 24 possible modulations. The SNR value itself is not available to the classifier.

B. Implementation
We use a neural network architecture similar to the one used in [80], which has 7 one-dimensional convolutional layers with kernel size 3 and 64 channels for all layers, except for the first layer, which has 2 channels. The convolutional layers are followed by 3 fully-connected linear layers. A scaled exponential linear unit (SELU) is used for all inner layers, and a softmax is used at the last, fully connected, layer. We assume availability of N = 4800 pairs (x, y) for the data set D, while gauging the empirical inefficiency and coverage level with N^te = 1000 held-out pairs. A total number of I = 4000 GD steps with fixed learning rate of 0.02 are carried out, and the target miscoverage rate is set to α = 0.1. VB-CP partitions its available data into equal sets for training and validation.

C. Results
In this problem, due to computational cost, we exclude CV-CP and we focus on K-CV-CP with a moderate number of folds, namely K = 6 and K = 12. In Fig. 9, box plots show the quartiles of the empirical coverage (29) and of the empirical inefficiency (30) from 32 independent runs, with different realizations of data set and test examples. The lower edge of the box represents the 0.25-quantile; the solid line within the box the median; the dashed line within the box the average; and the upper edge of the box the 0.75-quantile. As can be seen in the figure, the naïve set predictor is invalid (see average shown as dashed line), and it exhibits a wide spread of the coverage rates across the trials. On the other hand, all CP set predictors are valid, meeting the predetermined coverage level 1 − α = 0.9, and have less spread-out coverage rates.
As also noted in the previous section, VB-CP suffers from a larger predicted set size as compared to K-CV-CP, due to poor sample efficiency. A small number of folds, as low as K = 6, is sufficient for K-CV-CP to outperform VB-CP. This improvement in efficiency comes at the computational cost of training six models, as compared to the single model trained by VB-CP.

VIII. ONLINE CHANNEL PREDICTION
In this section, we investigate the use of online CP, as described in Sec. V, for the problem of channel prediction.
We specifically focus on the prediction of the received signal strength (RSS), which is a key primitive at the physical layer, supporting important functionalities such as resource allocation [81], [82].

A. Problem Formulation
Consider a receiver that has access to a sequence of RSS samples from a given device. We aim at designing a predictor that, given a sequence of past samples from the RSS sequence, produces an interval of values for the next RSS sample. To meet calibration requirements, the interval must contain the correct future RSS value with the desired rate level 1 − α. Unlike the previous applications, here the rate of coverage is evaluated as a time average, in line with the long-term validity condition (21). Of the two RSS data sets considered, the second [63] reports samples y[i], measured in dBm, on a 5.8 GHz device-to-device link without additional input. Hence, in this case, we predict the next RSS sample y[i] using the previous RSS samples y[1], . . ., y[i − 1]. Note that the prior works [63], [64] adopted standard probabilistic predictors, while here we focus on set predictors that produce a prediction interval.

B. Implementation
We build the CP set predictor by leveraging the probabilistic neural network used in [61] as the model class for the quantile predictors in (13)-(14). Each quantile predictor consists of a multi-layer neural network that pre-processes the most recent K pairs {z[i − K], . . ., z[i − 1]}; of a stacked long short-term memory (LSTM) [83] with two layers; and of a post-processing neural network, which maps the last LSTM hidden vector into a scalar that estimates the quantile used in (14). For details of the implementation, we refer to Appendix C.

C. Results
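Fig. 10 and Fig. 11 report the time-average coverage
$$\mathrm{coverage}(I) = \frac{1}{I} \sum_{i=1}^{I} \mathbb{1}\Big(y[i] \in \Gamma_i\big(x[i]\,\big|\,\{z[j]\}_{j=1}^{i-1}\big)\Big) \tag{32}$$
and the time-average inefficiency
$$\mathrm{inefficiency}(I) = \frac{1}{I} \sum_{i=1}^{I} \Big|\Gamma_i\big(x[i]\,\big|\,\{z[j]\}_{j=1}^{i-1}\big)\Big| \tag{33}$$
for online CP (24), compared to a baseline of the naïve quantile-based predictor (14), as a function of the time window size I for data sets [64] and [63], respectively. We have discarded 1000 samples for a warm-up period for both metrics (32) and (33).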
The naïve predictor is seen to fail to satisfy the coverage condition (21) for both data sets, while online CP converges to the target level 1 − α = 0.9. This result is obtained by online CP with a modest increase in inefficiency of around 8% for both data sets.
In practice, as the true posterior distribution is generally intractable due to the normalizing factor in (35), approximate Bayesian approaches are considered via VI or MC techniques (see, e.g., [23]).
In the experiments, we adopted Langevin MC to approximate the Bayesian posterior [23], [31]. Langevin MC adds Gaussian noise to each standard GD update for frequentist learning (see, e.g., [23, Sec. 4.10]). The noise has power 2η/T, where η is the GD learning rate and T > 0 is a temperature parameter. Langevin MC produces R model parameters {φ[r]}_{r=1}^{R} across R consecutive iterations. We specifically retain only the last R samples, discarding an initial burn-in period of R_min iterations. The temperature parameter T is typically chosen to be larger than 1 [86], [87]. With the R samples, the expectation term in (36) is approximated as the empirical average
$$p(y|x, \mathcal{D}) \approx \frac{1}{R} \sum_{r=1}^{R} p\big(y\,\big|\,x, \phi[r]\big).$$
We observe that Langevin MC is a probabilistic training algorithm, and that it satisfies the permutation-invariance property in terms of the distribution of the random output models discussed in Sec. IV-C.
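The sampling procedure just described can be sketched as follows. This is an illustrative version under stated assumptions: `grad_fn` is a hypothetical user-supplied function returning the gradient of the training objective (how the log-prior enters that objective is not specified here), and the default values simply mirror the settings quoted for the demodulation experiment (η = 0.2, T = 20, R_min = 100, R = 20).

```python
import numpy as np

def langevin_mc(grad_fn, phi0, eta=0.2, temperature=20.0,
                burn_in=100, ensemble=20, rng=None):
    """Noisy gradient descent; keeps the last `ensemble` iterates after burn-in."""
    rng = np.random.default_rng() if rng is None else rng
    phi = np.array(phi0, dtype=float)
    noise_std = np.sqrt(2.0 * eta / temperature)  # noise power 2*eta/T
    samples = []
    for it in range(burn_in + ensemble):
        phi = phi - eta * grad_fn(phi) + noise_std * rng.standard_normal(phi.shape)
        if it >= burn_in:                          # discard the burn-in iterates
            samples.append(phi.copy())
    return np.stack(samples)

def ensemble_predict(prob_fn, samples, x):
    """Average the predictive distributions p(y|x, phi[r]) over the retained samples."""
    return np.mean([prob_fn(x, phi) for phi in samples], axis=0)

# Usage on a toy quadratic objective (gradient equal to phi itself).
toy_samples = langevin_mc(lambda p: p, np.zeros(3), rng=np.random.default_rng(0))
```

Setting a very large temperature makes the injected noise negligible, in which case the iterates reduce to plain GD updates.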

APPENDIX B ALGORITHMIC DETAILS FOR ROLLING CONFORMAL INFERENCE
The RCI algorithm is reproduced from [61] in Algorithm 1.
The third and last network is a post-processing MLP f_post(·) with one hidden layer of 32 neurons, and with parameter vector φ_post[i], which maps the last LSTM hidden 64-length vector h into a scalar that estimates the quantile for the output y[i]. Accordingly, the time-evolving model parameter is the tuple φ[i] = (φ_pre[i], φ^1_LSTM[i], φ^2_LSTM[i], φ_post[i]). This model is instantiated twice for the regression problem: once for the α/2 lower quantile and once for the 1 − α/2 upper quantile. At every time instant i, after the new output y[i] is observed, continual learning of the models takes place by training the models with the corresponding pinball losses (11) using the new pair (x[i], y[i]), while initializing the models as the previous models at time instant i − 1.
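The continual quantile learning step can be illustrated with a deliberately simplified sketch. Here the pinball loss is written in its standard form, and the LSTM-based quantile predictor is replaced by a single scalar parameter, which is an assumption made purely to keep the example self-contained; the function names are hypothetical.

```python
import numpy as np

def pinball_loss(y, y_hat, q):
    """Standard pinball (quantile) loss at level q."""
    diff = y - y_hat
    return max(q * diff, (q - 1.0) * diff)

def pinball_grad(y, y_hat, q):
    """Subgradient of the pinball loss with respect to the prediction y_hat."""
    return -q if y > y_hat else (1.0 - q)

def track_quantile(ys, q, eta=0.01, init=0.0):
    """Online update of a scalar quantile estimate, one GD step per new sample."""
    y_hat = init
    for y in ys:
        y_hat -= eta * pinball_grad(y, y_hat, q)  # start from the previous estimate
    return y_hat

# Usage: lower (alpha/2) and upper (1 - alpha/2) estimates for alpha = 0.1.
demo_ys = np.random.default_rng(0).standard_normal(20000)
demo_lo = track_quantile(demo_ys, 0.05)
demo_hi = track_quantile(demo_ys, 0.95)
```

Instantiating the update twice, at levels α/2 and 1 − α/2, yields the two end points of a naïve prediction interval, which online CP then widens or narrows through the calibration parameter.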
The miscoverage rate was set to α = 0.1, the learning rate to η = 0.01, and we chose γ = 0.03 for the calibration parameter θ in (23).

Fig. 1. (a) Examples of probabilistic predictors for two inputs x_1 and x_2, shown against the ground-truth distribution p_0(y|x) in the second column. (b) Confidence versus accuracy for the decisions made by the corresponding predictors.

Fig. 3. (a) The validity condition (7) assumed in offline CP is relevant if one is interested in the average performance with respect to realizations (D, z) ∼ p 0 (D, z) of training set D and test variable z = (x, y).Input variable x is not explicitly shown in the figure, and the horizontal axis runs over the training examples in D and the test example z.(b) In online CP, the set predictor Γ i uses its input x[i] and all previously observed pairs z[1], . . ., z[i − 1] with z[i] = (x[i], y[i]) to produce a prediction set.The long-term validity (21) assumed by online CP is defined as the empirical time-average rate at which the predictor Γ i includes the true target variable y[i].

Fig. 4. A naïve probabilistic-based (NPB) set predictor uses a pre-trained probabilistic predictor to include all output values to which the

Fig. 5. Validation-based conformal prediction (VB-CP): (a) The data set is split into training and validation sets; (b) A single model is trained over the training data set; (c)-(d) Post-hoc calibration is done by evaluating the NC scores on the validation set (c) and by identifying the (1 − α)-quantile of the validation NC scores. This divides the axis of NC scores into a "keep" region of NC scores smaller than the threshold, and into a complementary "discard" region (d). (e) For each test input x, VB-CP includes in the prediction set all labels y′ ∈ Y for which the NC score of the pair (x, y′) is within the "keep" region.
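The calibration and prediction steps of VB-CP can be sketched as follows. Two assumptions are made for illustration: the NC score is taken to be the negative log-probability assigned by the trained model to a label, and the threshold is the ⌈(1 − α)(N_val + 1)⌉-th smallest validation score, a standard finite-sample form of the (1 − α)-quantile. The model itself is not trained here; its softmax outputs are passed in as arrays, and the function names are hypothetical.

```python
import numpy as np

def vb_cp_threshold(val_probs, val_labels, alpha=0.1):
    """Threshold separating the 'keep' and 'discard' regions of NC scores."""
    n = len(val_labels)
    # ASSUMED NC score: negative log-probability of the true label.
    scores = -np.log(val_probs[np.arange(n), val_labels])
    k = int(np.ceil((1.0 - alpha) * (n + 1)))  # finite-sample quantile index
    if k > n:
        return np.inf                          # too few points: keep every label
    return np.sort(scores)[k - 1]

def vb_cp_predict(test_probs, threshold):
    """Prediction set per test input: labels whose NC score is in the 'keep' region."""
    nc = -np.log(test_probs)
    return [set(np.flatnonzero(row <= threshold).tolist()) for row in nc]

# Usage with hypothetical softmax outputs for 4 validation and 2 test points.
val_p = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.5, 0.5]])
val_y = np.array([0, 1, 0, 1])
thr = vb_cp_threshold(val_p, val_y, alpha=0.5)
demo_sets = vb_cp_predict(np.array([[0.7, 0.3], [0.55, 0.45]]), thr)
```

A less accurate model produces larger validation NC scores, hence a larger threshold and larger prediction sets, which is how calibration is retained at the expense of inefficiency.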

B. Cross-Validation-Based CP (CV-CP)
VB-CP has the computational advantage of requiring the training of a single model, but the split into training and validation data causes the available data to be used in an inefficient way. This data inefficiency generally yields set predictors with a large average size (8). Unlike VB-CP, cross-validation-based CP (CV-CP) [53] trains multiple models, each using a subset of the available data set D. As detailed next and summarized in Fig. 6, during the training phase, each data point z[i] in the validation set is assigned an NC score based on a model trained using a subset of the data set D that excludes z[i], with i ∈ {1, ..., N}. Then, for testing, the inclusion of a label y′ in the prediction set for an input x is based on a comparison of NC scores evaluated for the pair (x, y′) with all the N validation NC scores.

Fig. 6. K-fold cross-validation-based conformal prediction (K-CV-CP): (a) The N data pairs of data set D are split into K folds, each with |S_k| = N/K samples; (b) K models are trained, each using a leave-fold-out data set of |D \ S_k| = N − N/K pairs; (c) NC scores are computed on the N/K holdout data points for each fold S_k; (d) For each test input x, all labels y′ ∈ Y for which the number of "higher-NC" validation points exceeds a fraction α of the total N points are included in the prediction set. CV-CP is the special case with K = N.
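The four steps in the caption can be sketched as below. This is a simplified illustration under several assumptions: a nearest-centroid softmax classifier (`fit_centroids`, `predict_proba`) is a hypothetical stand-in for the neural network; the NC score is taken as the negative log-probability of a label; and the inclusion rule compares the count of "higher-NC" validation points with α·N exactly as worded in the caption, omitting any finite-sample correction that the precise rule (18) may include.

```python
import numpy as np

def fit_centroids(x, y, n_classes):
    # Hypothetical stand-in for model training: one mean vector per class.
    return np.stack([x[y == c].mean(axis=0) if np.any(y == c)
                     else np.zeros(x.shape[1]) for c in range(n_classes)])

def predict_proba(centroids, x):
    # Softmax over negative squared distances to the class centroids.
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 - (-d2).max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def k_cv_cp(x, y, x_test, n_classes, k_folds=4, alpha=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    folds = np.array_split(rng.permutation(n), k_folds)    # (a) split into K folds
    higher = np.zeros((len(x_test), n_classes))
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        model = fit_centroids(x[train], y[train], n_classes)  # (b) leave-fold-out model
        p_val = predict_proba(model, x[fold])
        val_nc = -np.log(p_val[np.arange(len(fold)), y[fold]] + 1e-12)  # (c) holdout NC
        test_nc = -np.log(predict_proba(model, x_test) + 1e-12)
        # Count validation points whose NC score is at least the candidate's score.
        higher += (val_nc[None, None, :] >= test_nc[:, :, None]).sum(axis=2)
    keep = higher > alpha * n                                # (d) simplified inclusion rule
    return [set(np.flatnonzero(row).tolist()) for row in keep]
```

Setting `k_folds` equal to the number of data points recovers the leave-one-out variant, at the cost of training one model per data point.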
Online CP adjusts the naïve prediction interval (14) via the additive stretching function ϕ(θ[i]) based on the calibration parameter θ[i]. As the time index i rolls, the calibration parameter θ[i] adaptively inflates and deflates according to (23). Upon each observation of a new label y[i], the quantile predictor model parameters φ_{D[i],α/2} and φ_{D[i],1−α/2} can also be updated, without affecting the long-term validity condition (21) [61, Theorem 1]. We refer to Appendix B for further details on online CP.

Fig. 9. Coverage and inefficiency for NPB (9), VB-CP (16), and K-CV-CP (18) with K = 6 and K = 12, for the modulation classification problem (implementation details in Section VII-B).The boxes represent the 25% (lower edge), 50% (solid line within the box), and 75% (upper edge) percentiles of the empirical performance metrics evaluated over 32 different experiments, with average value shown by the dashed line.
The time-average coverage rate is computed as the fraction of previous time instants i ∈ {1, ..., t} at which the set predictor Γ_i includes the true RSS value y[i]. We consider two data sets of RSS sequences. The first data set records RSS samples y[i] in logarithmic scale for an IEEE 802.15.4 radio over time index i [64]. We further use the available side information on the time-variant channel ID, which determines the carrier frequency used at time i out of the 16 possible bands, as the input x[i]. At time i, we observe a sequence of RSS samples z[1], ..., z[i − 1] with z[i] = (x[i], y[i]), and the goal is to predict the next RSS sample y[i] via the online set predictor Γ^RCI_i.

IX. CONCLUSIONS
AI in communication engineering should not only target accuracy, but also calibration, ensuring a reliable and safe adoption of machine learning within the overall telecommunication ecosystem. In this paper, we have proposed the adoption of a general framework, known as conformal prediction (CP), to transform any existing AI model into a well-calibrated model via post-hoc calibration for communication engineering. Depending on the situation of interest, post-hoc calibration leverages either a held-out (cross) validation set or previous samples. Unlike calibration approaches that do not formally guarantee reliability, such as Bayesian learning or temperature scaling, CP provides formal guarantees of calibration, defined either in terms of ensemble averages or long-term time averages. Calibration is retained irrespective of the accuracy of the trained models, with more accurate models producing smaller set predictions. To validate the reliability of CP-based set predictors, we have provided extensive comparisons with conventional methods based on Bayesian or frequentist learning. Focusing on demodulation, modulation classification, and channel prediction, we have demonstrated that AI models calibrated by CP provide formal guarantees of reliability, which are practically essential to ensure calibration in the regime of limited data availability.

Frequentist learning obtains the model parameter vector φ*_D by tackling the following empirical risk minimization (ERM) problem
$$\phi^*_{\mathcal{D}} = \arg\min_{\phi}\; \mathbb{E}_{p_{\mathcal{D}}(x,y)}\big[-\log p(y|x,\phi)\big], \tag{34}$$
with empirical distribution p_D(x, y) defined by the data set D. Bayesian learning addresses epistemic uncertainty by treating the model parameter vector as a random vector φ with prior distribution φ ∼ p(φ). Ideally, Bayesian learning updates the prior p(φ) to produce the posterior distribution p(φ|D) as
$$p(\phi|\mathcal{D}) \propto p(\phi) \prod_{i=1}^{N} p\big(y[i]\,\big|\,x[i], \phi\big) \tag{35}$$
and obtains the ensemble predictor for the test point (x, y) by averaging over multiple models, i.e.,
$$p(y|x, \mathcal{D}) = \mathbb{E}_{p(\phi|\mathcal{D})}\big[p(y|x,\phi)\big]. \tag{36}$$

Algorithm 1: Rolling Conformal Inference (for Regression) [61]
Inputs: α = long-term target miscoverage level; θ[1] = initial calibration parameter; φ_lo[1], φ_hi[1] = initial models
Parameters: I = number of online iterations; γ = learning rate for calibration parameter; η = learning rate for model updates
Output: {Γ^RCI_i(x[i] | {z[j]}_{j=1}^{i−1})}_{i=1}^{I} = predicted sets for {x[i]}_{i=1}^{I}
for i = 1, . . ., I time instants do
    Retrieve a new data sample (x[i], y[i])
    Set prediction of new input: calculate set Γ^RCI_i(x[i] | {z[j]}_{j=1}^{i−1}) using
        [ŷ(x[i] | φ_lo[i]) − ϕ(θ[i]), ŷ(x[i] | φ_hi[i]) + ϕ(θ[i])]
    Check if prediction is unsuccessful:
        err[i] ← 1(y[i] ∉ Γ^RCI_i(x[i] | {z[j]}_{j=1}^{i−1}))
    Update calibration parameter:
        θ[i + 1] ← θ[i] + γ (err[i] − α)
    Update models using new sample:
        φ_lo[i + 1] ← φ_lo[i] − η ∇_φ ℓ_{α/2}(y[i], ŷ(x[i] | φ_lo[i]))
        φ_hi[i + 1] ← φ_hi[i] − η ∇_φ ℓ_{1−α/2}(y[i], ŷ(x[i] | φ_hi[i]))
return predicted sets {Γ^RCI_i(x[i] | {z[j]}_{j=1}^{i−1})}_{i=1}^{I}

APPENDIX C IMPLEMENTATION OF ONLINE CHANNEL PREDICTION
The architecture of the set predictor is inspired by [61], and is made out of three artificial neural networks. The first, f_pre(·), is a multi-layer perceptron (MLP) network with hidden layers of 16 and 32 neurons, parametrized by vector φ_pre[i]. It pre-processes the most recent K = 20 observed pairs {z[i − K], . . ., z[i − 1]}, which are transformed element-wise into a length-K vector w[i] = (w_1[i], . . ., w_K[i]), in which the k-th element (k = 1, . . ., K) is w_k[i] = f_pre(z[i − K + k − 1] | φ_pre[i]). Effectively, this serves as a temporal sliding K-length window, with a time-evolving pre-processing function. The second neural network, f_LSTM(·), has two layers with model parameter vectors φ^1_LSTM[i] (first layer) and φ^2_LSTM[i] (second layer), and it retains a memory via the hidden state vectors h and c, initialized at every time index i as c^1_0[i] = c^2_0[i] = h^1_0[i] = h^2_0[i] = 0. By accessing the previous K pairs via the vector w[i], this recurrent neural network extracts temporal patterns by sequentially transferring information, in the form of hidden and cell state vectors c_k[i], h_k[i], via LSTM cells with shared parameter vectors. These vectors flow along the LSTM by concatenating k = 1, . . ., K cells, forming vectors of length 32 each.