A hybrid model-based and learning-based approach for classification using a limited number of training samples

The fundamental task of classification from a limited number of training data samples is considered for physical systems with known parametric statistical models. Standalone learning-based and statistical model-based classifiers face major challenges in fulfilling the classification task using a small training set. Specifically, classifiers that rely solely on physics-based statistical models usually suffer from an inability to properly tune the underlying unobservable parameters, which leads to a mismatched representation of the system's behaviors. Learning-based classifiers, on the other hand, typically require a large number of training data samples from the underlying physical process, which might not be feasible in most practical scenarios. In this paper, a hybrid classification method, termed HyPhyLearn, is proposed that exploits both physics-based statistical models and learning-based classifiers. The proposed solution is based on the conjecture that HyPhyLearn would alleviate the challenges associated with the individual approaches of learning-based and statistical model-based classifiers by fusing their respective strengths. The proposed hybrid approach first estimates the unobservable model parameters using the available (suboptimal) statistical estimation procedures, and subsequently uses the physics-based statistical models to generate synthetic data. The training data samples are then combined with the synthetic data in a learning-based classifier that is based on domain-adversarial training of neural networks. Specifically, in order to address the mismatch problem, the classifier learns a mapping from the training data and the synthetic data to a common feature space. Simultaneously, the classifier is trained to find discriminative features within this space in order to fulfill the classification task.

I. INTRODUCTION

We revisit the problem of classification with a limited number of training data samples in this paper. The fundamental task of classification arises in various fields and is traditionally tackled within two frameworks: 1) the statistical setting, and 2) the fully data-driven setting. In the first case, the main assumption is that data generation adheres to a known probabilistic model of the underlying physical process. Subsequently, the classification problem is usually dealt with within a hypothesis testing (HT) framework aimed at testing between two (or more) hypotheses. Here, optimality in both the Bayesian sense and the Neyman-Pearson sense relies on computation of the likelihood-ratio terms, which requires clairvoyant knowledge of the probabilistic models under different hypotheses [1]. However, accurate modeling of the physical processes in increasingly complex engineered systems is either not tractable or relies on a large number of unobservable parameters, estimation of which from a limited number of data samples could be a major hurdle [2], [3]. As a result, a mismatch between the physics-based statistical models and the real physical processes is inevitable. This precludes exact computation of the likelihood-ratio values, which deteriorates the classification performance [4]. The fully data-driven (i.e., learning-based) setting, on the other hand, relies on a large number of data samples for finding an optimal mapping from the data samples to the corresponding labels.

(The authors are with WINLAB, Department of Electrical and Computer Engineering, Rutgers University, NJ, USA. Emails: {alinoora,narayan}@winlab.rutgers.edu, waheed.bajwa@rutgers.edu. This work was supported by the National Science Foundation (NSF) under grants ECCS-2028823 and OAC-1940074, and in part by ACI-1541069.)
But the availability of such data in many real-world problems, e.g., channel-based spoofing detection [5] and signal identification [6], is generally limited, which can lead to learning of a suboptimal map. Moreover, one should expect mislabeled data in many applications, since the employed labeling procedures might not be error-free. Consequently, the classification performance of data-driven models can be seriously limited in many real-world applications.
The overarching objective of this paper is to develop an algorithmic framework for classification from a limited number of training data samples in applications in which neither model-based nor learning-based approaches alone result in very good classification performance. To this end, note that learning-based approaches traditionally tend to disregard the physics-based models developed to describe physical phenomena through tractable mathematical analysis. For instance, in the context of wireless communications, numerous theoretical models for channels and resource management have been developed over the years [2], [5], [7]. Despite being approximations in many cases, these models provide important prior information about the corresponding physical systems that might be utilized to facilitate subsequent classification tasks. At the same time, physics-based models involve numerous unobservable parameters, the tuning of which is a major hurdle for complex systems [3]. For example, physical channel models in the multiple-input multiple-output (MIMO) and 5G communications scenarios rely on a large number of multidimensional parameters that are defined over a mixed set of discrete and continuous spaces [8], [9]. In such cases, the maximum likelihood estimation (MLE) of the parameters could incur a formidable computational cost [9]-[11]. Our goal in this context is to develop a classification framework that can deal with these practical considerations through a hybrid approach that consolidates physics-based and fully data-driven classification approaches. The expectation is that the hybrid approach would fuse the strengths of the two approaches towards achieving an overall superior classification performance.
Our proposed hybrid approach first employs the (necessarily) suboptimal parameter estimation methods to estimate the unobservable parameters. It then utilizes these estimates in the physics-based models to generate synthetic data, which enables us to leverage learning-based classification approaches. The mismatch between the physics-based models and the underlying physical process is addressed in a learning setting. Specifically, a neural network is trained to map the training and synthetic data to a common discriminative feature space, often referred to as a domain-invariant space in the domain adaptation literature [12], [13]. Meanwhile, a neural network-based classifier is trained on the mapped synthetic data to extract class-specific discriminative features from them. The resulting classifier is expected to perform well on both the synthetic and training data distributions.

A. Relation to prior works
In the realm of statistical model-based classifiers, the difficulties associated with estimating the parameters of physics-based models are recognized in various works [4], [14]. This is mainly attributed to the inherent difficulty of determining probability distributions from only a limited number of data samples. Along these lines, classification under the assumption of mismatched models is considered in several works [4], [14]-[16]. Specifically, [14], [16] derive bounds on the probability of classification error in the presence of mismatch via the f-divergence between the true and mismatched distributions. In contrast to these bounds, which are general in the sense that no assumption is made regarding the underlying distributions, [4] considers data that are contained in a linear subspace. This enables the authors to derive an upper bound on the classification error of the mismatched model that predicts the presence/absence of an error floor. The analyses in these works, however, do not lead to a classification algorithm for the mismatched setting, as they merely analyze the mismatch problem itself.
The mismatch problem for learning-based classifiers corresponds to cases where the distribution of the available training data differs from that of the test data. Such mismatches are primarily studied in the transfer learning (TL) and data-shift literature [13]. In particular, covariate shift [17], which is also studied under the name of transductive TL [18], refers to the case where the underlying data distributions for the test and training data are different. Concept shift [19], also known as inductive TL [18], on the other hand, deals with situations in which the posterior distribution of the labels given the data is not the same for the training and test data. A wide range of algorithms has been proposed in order to alleviate the performance loss due to such shifts. For example, the importance-weighting technique [20], [21] has been proposed for the covariate shift scenario to remove the bias from the training data. Furthermore, algorithms based on subspace mapping [22] and learning domain-invariant representations [12] have also been proposed in the literature to address the mismatch problem. The authors in [22] propose a transfer component analysis method aimed at finding a transformation under which the maximum mean discrepancy between the true and mismatched distributions is small. The work in [12] aims at finding a representation that is invariant across the training and test distributions in order to mitigate the effect of discrepancies in the subsequent learning tasks. For the specific task of classification, the authors in [23] introduce the domain-adversarial neural network (DANN) framework, which uses (deep) neural networks to extract domain-invariant representations that are discriminative for the training data in order to devise a classifier for the test data.
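As a concrete illustration of the importance-weighting idea mentioned above, the following sketch re-weights training samples by the density ratio $p_{test}(x)/p_{train}(x)$ under covariate shift. The Gaussian train/test distributions and the fixed threshold classifier are hypothetical choices for illustration only; in practice the densities (and hence the weights) must themselves be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared labeling rule p(y|x); training inputs ~ N(0,1), test inputs ~ N(1,1).
def label(x):
    return (x > 0.5).astype(int)

x_train = rng.normal(0.0, 1.0, 2000)
y_train = label(x_train)
x_test = rng.normal(1.0, 1.0, 4000)
y_test = label(x_test)

def gauss_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# Importance weights w(x) = p_test(x) / p_train(x); densities assumed known
# here purely for illustration (in practice they must be estimated).
w = gauss_pdf(x_train, 1.0) / gauss_pdf(x_train, 0.0)

# Risk of the classifier 1{x > -0.5}: the weighted training average tracks
# the test error, while the unweighted one does not.
pred_train = (x_train > -0.5).astype(int)
err_weighted = np.average(pred_train != y_train, weights=w)
err_unweighted = np.mean(pred_train != y_train)
err_test = np.mean((x_test > -0.5).astype(int) != y_test)
```

The weighted training risk is a (nearly) unbiased estimate of the test risk, which is what makes empirical risk minimization on re-weighted training data sensible under covariate shift.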
Deep transfer learning (DTL) is another prime subject related to our work that studies the transfer learning concept in the context of deep neural networks (DNNs). DTL considers a DNN that has been pre-trained on the training data as transferable knowledge useful for the test data. This knowledge can be transferred based on different strategies. The pre-trained DNNs can either be used directly on the test data, or serve as an intermediate feature-extraction step that facilitates the subsequent learning process for the test data. In another DTL strategy, called fine-tuning, the pre-trained DNN, or certain parts of it, is refined using the available test data to further improve the effectiveness of the transferred knowledge. We refer the reader to [24], [25] for surveys of DTL methods.
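To make the fine-tuning strategy concrete, the following minimal sketch pre-trains a logistic-regression "network" on abundant data from a shifted source distribution and then refines the same weights on a scarce target set. The data model, dimensions, learning rates, and step counts are illustrative assumptions, not choices made in the surveyed works.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, w0, lr=0.1, steps=500):
    # Plain gradient descent on the average logistic loss, starting from w0.
    w = w0.copy()
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

def make_data(n, shift, rng):
    # Two Gaussian classes in 2-D; `shift` models the source/target mismatch.
    y = rng.integers(0, 2, n)
    X = rng.normal(0, 1, (n, 2)) + np.where(y[:, None] == 1, 1.0 + shift, -1.0)
    return np.hstack([X, np.ones((n, 1))]), y.astype(float)  # bias column

X_src, y_src = make_data(5000, shift=0.8, rng=rng)   # abundant source data
X_tgt, y_tgt = make_data(40, shift=0.0, rng=rng)     # scarce target data
X_test, y_test = make_data(2000, shift=0.0, rng=rng)

# Pre-train on the source data, then fine-tune the same weights with a
# smaller learning rate on the scarce target data.
w_pre = train_logreg(X_src, y_src, np.zeros(3))
w_ft = train_logreg(X_tgt, y_tgt, w_pre, lr=0.02, steps=200)

acc_ft = np.mean((sigmoid(X_test @ w_ft) > 0.5) == (y_test == 1))
```

The pre-trained weights serve as a warm start, so the scarce target set only needs to correct the source-induced bias rather than learn the classifier from scratch.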
Model-based deep learning is another related line of work that aims at designing systems whose operation combines physics-based models (domain knowledge) and data. To this end, two main strategies are typically exploited in such works, known as model-aided networks and DNN-aided inference. The former results in specialized DNN architectures by identifying structure in a model-based algorithm, e.g., an iterative structure in the case of deep unfolding [26]. The latter primarily utilizes model-based methods for inference, but replaces explicit domain-specific computations with dedicated DNNs in order to facilitate operation in complex environments, e.g., using generative models for compressed sensing applications [27]. We refer the reader to [28] and references therein for state-of-the-art strategies in model-based deep learning.
There have also been previous attempts to incorporate physics-inferred information in the fully data-driven setting. In the field of wireless communications, for instance, the authors in [2] employ DTL to solve a specific resource management problem. Similarly, the task of signal classification is tackled via DTL under different practical assumptions, such as real propagation effects [29], hardware impairments [30], and weak received signal strength [31]. These works utilize abundant data from an approximate model along with limited data from the real-world model in the DTL fine-tuning approach. Closer to the idea of physics-guided machine learning (ML), a recurrent neural network (RNN) is modified in [32] to incorporate information from the physics-based model as an internal state of the RNN. Furthermore, parameters of the physics-based models are combined with sensor readings and used as input to a DNN to develop a hybrid prognostics model in [3].
We note that the aforementioned works in the domain adaptation literature do not employ any available physics-based statistical models and, consequently, rely on a large number of training data samples to deal with the mismatch problem. In addition, model-based deep learning strategies might not be applicable to the statistical classification problem in general due to the lack of algorithmic structure, such as an iterative structure. Equally importantly, DTL fine-tuning and physics-guided learning approaches do not consider the difficulties associated with estimating the physics-based parameters, which would indeed lead to inaccurate physics-based statistical models. The resulting discrepancy between the model and the underlying physical process necessitates a learning-based classifier that is capable of leveraging the data in a way that alleviates this mismatch problem.

B. Our contributions
The main contributions of this work are as follows.
• We focus on the task of classification for a physical process assuming that a limited number of training data samples, with possibly mislabeled instances, is available. We consider the case where the physical process (or its approximation) can be described by physics-based parametric statistical models. As these models tend to be complex in general, estimation of the unknown model parameters using the maximum likelihood estimation (MLE) procedure could be computationally prohibitive. We instead propose HYPHYLEARN, a novel hybrid classification method, as a solution that exploits both physics-based statistical models and learning-based classifiers. This approach makes use of (necessarily suboptimal) parameter estimation algorithms/heuristics to obtain (approximate) parameter estimates. Next, plugging these estimates into the physics-based statistical models enables us to generate synthetic data. HYPHYLEARN then relies on neural networks (NNs), which are powerful tools for finding a discriminative feature space, towards obtaining a learning-based classifier. Specifically, the learning process involves training an NN to map the training and synthetic data to a common space in which they are not distinguishable. Meanwhile, a learning-based classifier is trained on the synthetic data mapped to the new space to find discriminative class-level features. Indeed, learning the common feature space addresses the distribution mismatch between the training data samples and the generated synthetic data due to errors in parameter estimation. It is then expected that the classifier trained on the mapped synthetic data will perform well on both data distributions. We repurpose theories from the domain adaptation literature based on learning invariant representations for our specific problem to justify the proposed hybrid approach. A schematic of HYPHYLEARN for a binary classification example is illustrated in Fig. 1.
• We also consider two prototypical problems from the wireless communications literature to investigate the performance of our proposed approach and show its superiority over stand-alone statistical model-based classifiers as well as the fine-tuning approach, the best existing hybrid approach applicable to these problems. We first consider the problem of channel spoofing in the wireless communications setting, where an adversary (Eve) spoofs a legitimate transmitter (Alice) and sends a message to a legitimate receiver (Bob) [5], [34], [35]. Spoofing detection at Bob involves deciding whether an incoming message corresponds to Alice or Eve, which can be cast as a binary classification problem at Bob. Second, we revisit the problem of multiuser detection (MUD) in the uplink of a cellular network, where different users asynchronously share a channel with a base station [11]. For a $K$-user system, MUD is basically a $2^K$-ary classification problem in which the goal is to infer $K$ binary information bits from a given observation. By deriving the likelihood ratio test (LRT) for each problem, we show that statistical model-based classifiers rely heavily on the wireless channel parameters in the above problems. However, the estimation performance for these parameters suffers from both the paucity of training data and the complexity of the physics-based statistical models. In fact, these models are complex in the sense that MLEs of the corresponding parameters require an exhaustive search over the parameter space, which is not feasible for many communication scenarios, including MIMO transmissions in a 5G setting [8]. For both problems, numerical results show that HYPHYLEARN provides major improvements in classification accuracy in comparison to the best existing approaches.

C. Notation and organization
Throughout the paper, vectors are denoted by lowercase bold letters, while uppercase bold letters are reserved for matrices. Equality by definition is expressed through the symbol $\triangleq$. Non-bold letters are used to denote scalar values and calligraphic letters denote sets. Furthermore, the cardinality of a set $\mathcal{S}$ is denoted by $|\mathcal{S}|$. The spaces of real and complex vectors of length $d$ are denoted by $\mathbb{R}^d$ and $\mathbb{C}^d$, respectively. The $m$th element of a vector $\mathbf{u}$ and the trace of a matrix $\mathbf{U}$ are denoted by $\mathbf{u}[m]$ and $\mathrm{Tr}(\mathbf{U})$, respectively. The real and imaginary parts of a complex number $a$ are denoted by $\Re\{a\}$ and $\Im\{a\}$, respectively. The probability density function and expectation of a random variable $w$ are denoted by $p(w)$ and $\mathbb{E}_p(w)$, respectively, while $\mathbb{P}[\cdot]$ is used to denote the probability of an event. The Gaussian and circularly-symmetric complex Gaussian distributions are denoted by $\mathcal{N}$ and $\mathcal{CN}$, respectively, while the uniform distribution supported between two real numbers $a$ and $b$ is denoted by $\mathrm{unif}(a, b)$. We denote the $k$th standard basis vector of length $N$ in $\mathbb{R}^N$ by $\mathbf{e}_k$, and use $\|\mathbf{u}\|$ to refer to the Euclidean norm of the vector $\mathbf{u}$. We refer to the identity matrix of size $N$ and the indicator function by $\mathbf{I}_N$ and $\mathbb{1}_A(x) = \begin{cases} 1, & x \in A, \\ 0, & x \notin A, \end{cases}$ respectively. The transpose and conjugate transpose of $\mathbf{u}$ are denoted by $\mathbf{u}^T$ and $\mathbf{u}^H$, respectively. Furthermore, $\mathbf{e}_n(y)$ refers to a one-hot encoded version of a non-negative integer $y$, which equals an all-zero vector of length $n$ except for the $y$th element, which is set to 1. Also, $\bullet$ denotes the Schur (componentwise) product and $\odot$ the Khatri-Rao product, while $\otimes$ is reserved for the Kronecker product.

The rest of the paper is organized as follows. The problem is formally posed in Section II. Our proposed solution is described in Section III, which discusses the various pieces of the HYPHYLEARN approach. We introduce the first case study involving the spoofing detection problem in Section IV.
The second case study, which concerns the multi-user detection problem, is presented in Section V. We present numerical results concerning the application of our proposed approach in the above two case studies in Section VI, and contrast it with the existing methods. Finally, the paper is concluded in Section VII.

II. PROBLEM FORMULATION
Consider a physical process consisting of $C$ distinct behaviors, where the physics-based parametric statistical model for the $i$th behavior is available in the form of a parametric probability density function (PDF) denoted by the conditional prior $p_i(x; \theta_i)$ on observations $x$ that belong to an observation space $\mathcal{X}$. Assuming the true underlying parameter for the $i$th behavior is $\theta_i^*$, the data for this behavior are generated by drawing independent and identically distributed (i.i.d.) samples from $p_i(x; \theta_i^*)$. Assuming further that the $i$th behavior is chosen with prior probability $\pi_i$, our goal is to devise a decision rule that determines under which behavior a given sample $x = [x_1, \ldots, x_n]^T$ was generated. Clearly, this can be cast as a $C$-ary classification problem via $H_i: x \sim p_i(x; \theta_i^*)$, $i = 0, \ldots, C-1$. We consider the case where this decision is made by a classifier $h_\phi(\cdot)$ parameterized by $\phi \in \mathbb{R}^d$, $h_\phi(x): \mathcal{X} \to \{0, \ldots, C-1\}$, which partitions $\mathcal{X}$ into $C$ disjoint sets $\{\mathcal{X}_i\}$ and decides in favor of $H_i$ if $x \in \mathcal{X}_i$. Defining $\theta^* \triangleq [\theta_0^*, \ldots, \theta_{C-1}^*]$, we denote the probability of error associated with $h_\phi(x)$ by $P_{\theta^*}[e_\phi]$, which can be computed as
$$P_{\theta^*}[e_\phi] = \sum_{i=0}^{C-1} \pi_i \, \mathbb{P}[x \notin \mathcal{X}_i \mid H_i],$$
where $e_\phi$ indicates the event that $h_\phi(x)$ makes an erroneous decision. The optimal classifier $h_{\phi^*}(x)$ that minimizes the error probability is given by the Bayes decision rule, i.e., $h_{\phi^*}(x) = \arg\max_{i=0,\ldots,C-1} \pi_i p_i(x; \theta_i^*)$ [1]. For the specific case of $C = 2$, this rule takes the famous form of the likelihood ratio test,
$$\frac{p_1(x; \theta_1^*)}{p_0(x; \theta_0^*)} \ \underset{y=0}{\overset{y=1}{\gtrless}} \ \frac{\pi_0}{\pi_1},$$
where $y = i$ implies making a decision in favor of the $i$th behavior.
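For intuition, the Bayes decision rule above can be worked out in closed form for a simple binary example with unit-variance Gaussian hypotheses; the parameter values below are illustrative, not drawn from the paper.

```python
import numpy as np

# Two hypotheses with known parameters: H0: N(0,1), H1: N(2,1),
# and priors pi0 = 0.7, pi1 = 0.3 (illustrative values).
mu0, mu1, sigma, pi0, pi1 = 0.0, 2.0, 1.0, 0.7, 0.3

def pdf(x, mu):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_rule(x):
    # Decide H1 iff the likelihood ratio exceeds pi0/pi1.
    return (pdf(x, mu1) / pdf(x, mu0) > pi0 / pi1).astype(int)

# For equal-variance Gaussians the LRT reduces to a threshold test:
# decide H1 iff x > (mu0 + mu1)/2 + sigma^2 * log(pi0/pi1) / (mu1 - mu0).
t = (mu0 + mu1) / 2 + sigma**2 * np.log(pi0 / pi1) / (mu1 - mu0)

x = np.linspace(-3.0, 5.0, 1000)
agree = np.array_equal(bayes_rule(x), (x > t).astype(int))
```

Taking the log of the likelihood ratio and rearranging yields the threshold `t`; note how the unequal priors shift the threshold away from the midpoint of the two means.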
We focus in this paper on the case where, although the parametric model $p_i(x; \theta_i)$ is known for the $i$th behavior, one does not have access to the corresponding underlying true parameter $\theta_i^*$. Instead, only a small number of training data samples generated in an i.i.d. manner from $p_i(x; \theta_i^*)$, $\forall i$, are available. Specifically, we denote the available dataset by $D_r = \{x_{r,n}\}_{n=1}^{N_r}$, where $N_r$ is the total number of data samples. The corresponding ground-truth label for the $n$th sample is denoted by $y_{r,n}$, which is given for only $N_{r,l}$ of the data samples, where $N_{r,l} \le N_r$. Furthermore, we consider the case where the model $p_i(x; \theta_i)$ under the $i$th behavior is a non-trivial function of the underlying parameter, for which conventional estimation procedures such as maximum likelihood estimation (MLE) are either not available or are computationally prohibitive to implement. The implication of this aspect of the problem formulation is that the performance of any suboptimal parameter estimation method is bound to be limited. As a result, statistical model-based classifiers, which plug these estimates into $p_i(x; \theta_i)$, would have a deteriorated performance as well.
Unlike these classifiers, which rely heavily on the knowledge of the parametric statistical models and the estimated parameters, a purely data-driven approach can result in a classifier that disregards the available parametric models. However, as the data generation processes are governed by non-trivial models, a large number of data samples is needed in this case to extract, for each behavior, the patterns that would lead to a highly discriminative feature space. Noting that the performance of fully data-driven and statistical model-based classifiers is particularly curbed when they are used in a stand-alone fashion, we conjecture that fusing the strengths of the two can lead to a superior classification algorithm in our setting, as described in the next section.
Before delving into the proposed solution for the described problem setting, we discuss two existing approaches towards obtaining a statistical model-based classifier for the benefit of the reader. Recall that within the framework of statistical model-based classification, one would first estimate the unknown model parameters as $\hat\theta_i$, $i = 0, \ldots, C-1$, and plug them into the available models to obtain $p_i(x; \hat\theta_i)$. The resulting plug-in models are then used in practice in lieu of the true models within the optimal Bayes decision rule. The parameters $\phi$ of the resulting plug-in classifier consist solely of the parameters of the physics-based statistical models, i.e., $\phi = \hat\theta \triangleq [\hat\theta_0, \ldots, \hat\theta_{C-1}]$. Based on this fact, we denote the plug-in classifier by $h_{\hat\theta}(x)$ in the remainder of this section. The unknown model parameters can be estimated using numerous approaches. In the following, we discuss two of the most popular ways to estimate them, as well as the shortcomings of these approaches that warrant a new approach to classification.
Empirical error minimizer: Given a set of training data with their corresponding labels, $\{x_{r,n}, y_{r,n}\}_{n=1}^{N_r}$, the most natural approach for parameter estimation corresponds to the setting in which the resulting plug-in classifier $h_\theta(x)$ minimizes the empirical error probability, defined by
$$\hat{P}_{N_r}[e_\theta] = \frac{1}{N_r} \sum_{n=1}^{N_r} \mathbb{1}\{h_\theta(x_{r,n}) \neq y_{r,n}\}.$$
Specifically, for the case of $C = 2$, consider the family of classifiers for which the parameter values $\theta_0$ and $\theta_1$ are chosen from a space $\Theta$. The parameter estimates that minimize the empirical error are obtained as $\hat\theta = [\hat\theta_0, \hat\theta_1] \in \arg\min_\theta \hat{P}_{N_r}[e_\theta]$. The following lemma, which is a direct result of Corollary 16.1 in [36], relates the error probability of the plug-in classifier obtained via empirical error minimization to that of the Bayes decision rule.
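A minimal sketch of empirical error minimization for the $C = 2$ plug-in classifier, using unit-variance Gaussian models and a brute-force grid search standing in for $\Theta$ (both illustrative assumptions; as noted below, this search is expensive in general):

```python
import numpy as np

rng = np.random.default_rng(3)

# Small labeled training set from two unit-variance Gaussians with
# unknown true means theta0* = 0 and theta1* = 2 (equal priors).
n = 60
y = rng.integers(0, 2, n)
x = rng.normal(0, 1, n) + 2.0 * y

def pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def empirical_error(th0, th1):
    # Plug-in Bayes rule with equal priors: decide 1 iff p1(x) > p0(x).
    pred = (pdf(x, th1) > pdf(x, th0)).astype(int)
    return np.mean(pred != y)

# Brute-force search over a finite grid standing in for Theta.
grid = np.linspace(-1.0, 3.0, 41)
best_err, th0_hat, th1_hat = min(
    (empirical_error(a, b), a, b) for a in grid for b in grid
)
```

Even in this toy one-dimensional example the search cost grows as the square of the grid size; for the multidimensional, mixed discrete/continuous parameter spaces discussed earlier, such exhaustive minimization of the (non-smooth) empirical error is impractical.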
Lemma 1. If $\theta_0^*, \theta_1^* \in \Theta$, then, with probability at least $1 - \delta$, the error probability of the empirical error minimizer $h_{\hat\theta}(x)$ satisfies
$$P_{\theta^*}[e_{\hat\theta}] \le P_{\theta^*}[e_{\phi^*}] + O\!\left(\sqrt{\frac{b \log N_r + \log(1/\delta)}{N_r}}\right),$$
where $P_{\theta^*}[e_{\phi^*}]$ is the error probability of the Bayes decision rule and $b$ denotes the Vapnik-Chervonenkis (VC) dimension [36] of the family of classifiers $h_\theta(x)$ defined above.
The above lemma guarantees an $O(\sqrt{\log N_r / N_r})$ rate of convergence to the Bayes error for $h_{\hat\theta}(x)$ when $\hat\theta$ is chosen to minimize the empirical error. However, obtaining such a $\hat\theta$ is computationally expensive in general, as the empirical error probability might be a non-trivial function of the parameters.
Maximum likelihood estimator: In practice, the unknown model parameters are commonly replaced with their corresponding MLEs under each behavior; the resulting plug-in classifier gives rise to the well-known generalized likelihood ratio test (GLRT) for the binary case ($C = 2$) [1]. Specifically, assuming the training data and their corresponding labels are available in the form of $\{x_{r,n}, y_{r,n}\}_{n=1}^{N_i}$ for the $i$th hypothesis, the MLE of $\theta_i^*$ is obtained as $\hat\theta_i = \arg\max_{\theta_i} \prod_{n=1}^{N_i} p_i(x_{r,n}; \theta_i)$. Assuming that $(1-\pi)p_0(x; \theta_0) + \pi p_1(x; \theta_1)$ is continuous in $(\theta_0, \theta_1, \pi)$, as the parameters' estimates converge to the true values, the error of the plug-in classifier also converges to that of the Bayes decision rule. However, not only can no optimality condition be stated in general for the plug-in classifier relying on MLEs [33], obtaining such estimates might also be computationally prohibitive for systems with complex likelihood functions.
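The plug-in MLE classifier can be illustrated with unit-variance Gaussian hypotheses, for which the MLE of each mean is simply the per-class sample mean; the parameter values and sample sizes below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)

# Labeled samples per hypothesis; H_i: N(mu_i*, 1) with mu0* = 0, mu1* = 2.
x0 = rng.normal(0.0, 1.0, 50)
x1 = rng.normal(2.0, 1.0, 50)

# For a unit-variance Gaussian, the MLE of the mean is the sample mean.
mu0_hat, mu1_hat = x0.mean(), x1.mean()

# Plug-in classifier (equal priors): threshold at the estimated midpoint,
# versus the Bayes rule, which thresholds at the true midpoint.
t_plugin = (mu0_hat + mu1_hat) / 2
t_bayes = 1.0

# Compare error probabilities on a large held-out set.
xt0 = rng.normal(0.0, 1.0, 20000)
xt1 = rng.normal(2.0, 1.0, 20000)

def error(t):
    return 0.5 * np.mean(xt0 > t) + 0.5 * np.mean(xt1 <= t)
```

Here the MLE is trivial to compute, so the plug-in error stays close to the Bayes error; the point made in the text is that for complex likelihoods this maximization itself becomes the bottleneck.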

III. PROPOSED SOLUTION: HYPHYLEARN
The main deciding factor in the superiority of a solution for the problem setup introduced in Section II is the extent to which it exploits the available information, i.e., the training data and the parametric statistical models. In particular, plug-in classifiers tend not to exploit this information optimally, as the performance of the parameter estimation procedures can be curbed by the complexity of the underlying models and the lack of corresponding ground-truth labels. We instead propose a novel hybrid classification method that makes use of the available information in learning-based classifiers, which are powerful tools for finding discriminative feature spaces. Specifically, our proposed solution relies on the parametric models to generate synthetic data and combines them with the training data in a classifier that makes use of adversarial training between NNs. Next, we describe in detail the various steps of the proposed solution, which is termed HYPHYLEARN.
Step 1 - Imperfect labeling: As the available data are not assumed to be completely labeled in our problem setup, the first step in our solution assigns labels to the unlabeled data samples in $D_r$. This involves a clustering step that partitions the dataset $D_r$ into $C$ distinct groups. Then, the groups are labeled using the available $N_{r,l}$ labels. For example, a label can be assigned to a group based on the number of labeled training data samples it includes from each behavior; if the majority of such samples corresponds to the $i$th behavior, the group is labeled as $i$. Subsequently, we refer to the group assigned the label $i$ by $D_{r,i}$ for $i = 0, \ldots, C-1$. Denoting this imperfect labeling process by $g(x): \mathcal{X} \to \{0, \ldots, C-1\}$, a non-trivial labeling error over $D_r$ is associated with $g(x)$, which can be computed via
$$e_r = \frac{1}{N_r} \sum_{n=1}^{N_r} \mathbb{1}\{g(x_{r,n}) \neq y_{r,n}\}.$$
In the remainder of this paper, we refer to the number of samples in the cluster labeled as $i$ by $N_{r,i}$. The function $g(x)$ may be obtained from any one of the simple clustering algorithms in the ML literature, such as the Gaussian mixture model [37], or it may be a decision rule obtained from statistical analysis of the parametric models. For instance, for the problem of channel spoofing detection, a hypothesis test is proposed in [5] that assigns labels to unlabeled samples based on their similarity, measured in terms of the Euclidean distance, to a reference data sample.
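A minimal sketch of Step 1, assuming one-dimensional data, a simple two-means clustering, and majority-vote label assignment; both the clustering rule and the data model are illustrative stand-ins for whatever a given application admits.

```python
import numpy as np

rng = np.random.default_rng(5)

# Mostly-unlabeled dataset from two behaviors; only the first n_labeled
# samples carry ground-truth labels.
n, n_labeled = 200, 10
y_true = rng.integers(0, 2, n)
x = rng.normal(0, 1, n) + 3.0 * y_true

# Simple 1-D two-means clustering (a stand-in for the clustering step).
c = np.array([x.min(), x.max()])
for _ in range(20):
    assign = (np.abs(x - c[1]) < np.abs(x - c[0])).astype(int)
    c = np.array([x[assign == k].mean() for k in (0, 1)])

# Label each cluster by majority vote over its labeled members.
cluster_to_label = {}
for k in (0, 1):
    votes = y_true[:n_labeled][assign[:n_labeled] == k]
    cluster_to_label[k] = int(np.round(votes.mean())) if len(votes) else k
g = np.array([cluster_to_label[k] for k in assign])

label_error = np.mean(g != y_true)   # the labeling error e_r
```

The resulting `g` plays the role of the imperfect labeler $g(x)$: its error $e_r$ is non-trivial because samples near the cluster boundary are misassigned, which is exactly the imperfection the later steps must tolerate.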
Step 2 - Parameter estimation: Based on the labels assigned in Step 1 to the unlabeled data samples, we estimate the parameters of the physics-based statistical models under each behavior. To this end, we utilize $D_{r,i}$ to estimate the parameter vector $\theta_i^*$ corresponding to the $i$th behavior. Furthermore, the priors are estimated as $\hat\pi_i = N_{r,i}/N_r$. We note that the procedure for estimating $\theta_i^*$ depends on the available parametric model corresponding to the $i$th behavior, i.e., $p_i(x; \theta_i)$. We recall from our problem setup that the MLE, which is usually utilized for parameter estimation purposes, might not be employed here due to the formidable complexity of optimizing $p_i(x; \theta_i)$ over $\theta_i$. Instead, a (necessarily) suboptimal estimator $T(\cdot)$, built upon either heuristics or optimization techniques such as alternating maximization (see Sections IV-C and V-B), could be utilized to estimate the parameters as $\hat\theta_i = T(D_{r,i})$ for all the behaviors. The parameter estimation performance is therefore limited here due to both the suboptimality of $T(\cdot)$ and the presence of mislabeled samples in $D_{r,i}$, $\forall i$.
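The effect of mislabeled samples on a (necessarily suboptimal) estimator $T(\cdot)$ can be illustrated with a simple method-of-moments stand-in; the Gaussian model and the 10% mislabeling rate below are assumptions for illustration, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(8)

# Cluster D_{r,i} for behavior i: mostly correct samples from N(2, 1),
# plus 10% mislabeled samples from the other behavior, N(-2, 1).
x_i = np.concatenate([rng.normal(2.0, 1.0, 360),
                      rng.normal(-2.0, 1.0, 40)])

def T(samples):
    # Method-of-moments estimator: match the first two moments.
    return samples.mean(), samples.std()

theta_i_hat = T(x_i)   # biased by the mislabeled samples
```

Even this well-behaved estimator returns a mean pulled away from the true value 2 and an inflated spread, showing how labeling errors from Step 1 propagate into $\hat\theta_i$ and hence into the synthetic data of Step 3.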
Step 3 - Forming a synthetic dataset: The paucity of available data in our problem formulation seems to preclude the utilization of a learning-based classifier as part of the solution. However, the available physics-based statistical models, in the form of parametric PDFs, enable us to generate synthetic data to augment the available data, making it possible to exploit the discriminative power of learning-based classifiers. Having access to the estimated parameter $\hat\theta_i$ obtained in Step 2, we plug it into the available physics-based statistical model to obtain a PDF $p_i(x; \hat\theta_i)$ for the $i$th behavior. In order to generate a synthetic dataset, we first sample $w$ from a categorical distribution parameterized by $\hat\pi = [\hat\pi_0, \ldots, \hat\pi_{C-1}]$ over the sample space $\{0, \ldots, C-1\}$. Then, we sample a data point $x_{s,i}$ according to $x_{s,i} \sim p_w(x; \hat\theta_w)$ with the associated label $y_{s,i} = w$. Repeating this process $N_s$ times, we obtain a synthetic dataset in which the data samples are generated in a statistically independent fashion.
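Step 3 can be sketched directly: draw a behavior index from the estimated categorical prior, then a sample from the corresponding plug-in model. The priors and (mean, std) parameters below are illustrative placeholders for the estimates produced by Steps 1-2, with Gaussian plug-in models assumed for concreteness.

```python
import numpy as np

rng = np.random.default_rng(6)

# Estimated priors and per-behavior (mean, std) parameters; illustrative
# placeholders for the outputs of Steps 1-2.
pi_hat = np.array([0.6, 0.4])
theta_hat = [(-1.0, 1.0), (1.5, 0.8)]

# Step 3: draw w ~ Categorical(pi_hat), then x ~ p_w(x; theta_hat_w);
# repeat N_s times to build the synthetic dataset.
N_s = 10000
w = rng.choice(len(pi_hat), size=N_s, p=pi_hat)
x_syn = np.array([rng.normal(*theta_hat[k]) for k in w])
y_syn = w
```

Because $N_s$ is under our control, the synthetic dataset can be made as large as the learning-based classifier in Step 4 requires, at the cost of inheriting the estimation errors baked into `pi_hat` and `theta_hat`.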
Step 4 - Incorporating synthetic and training data in a learning-based classifier: The synthetic data generated in Step 3, besides retaining essential information about the underlying physics-based statistical models, enable us to utilize the discriminative power of learning-based classifiers. However, the errors introduced during the labeling and parameter estimation steps that precede the synthetic data generation induce a mismatch between the distributions corresponding to the training and synthetic datasets. This mismatch is bound to deteriorate the performance of a classifier trained on the synthetic data alone when it is deployed in a real-world setting. The question, then, is how a learning-based classifier can be trained to alleviate this problem. For example, in the fine-tuning approach [2], an NN-based classifier is first trained on the synthetic data, and the training data are then used to refine the weights of the corresponding NN. However, we conjecture that such learning strategies, which utilize the training and synthetic data in separate stages of training, are not the best solution here; rather, the synthetic and training data should be incorporated jointly in a learning-based classifier. To this end, inspired by works in the domain-adaptation literature, specifically feature space mapping [12], we propose to map the synthetic and training data through a data-driven function $M_\psi: \mathcal{X} \to \mathcal{Z}$, parameterized by a real vector $\psi$, into a common feature space $\mathcal{Z}$. Consequently, a classifier $h_{\phi_1}(z)$, parameterized by $\phi_1$, which is trained on the synthetic data within the space $\mathcal{Z}$, is expected to perform well on both the training and synthetic data. To this end, we choose $M_\psi$ and $h_{\phi_1}$ to be NNs, which are powerful tools for finding discriminative features from a given dataset. We discuss this step in detail in the following subsection.
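HYPHYLEARN learns the common space adversarially with NNs. As a lightweight stand-in that shows why mapping both domains into a common space helps, the following sketch aligns the first and second moments of each domain (a crude, CORAL-style substitute for the learned mapping $M_\psi$) before training a nearest-class-mean classifier on the mapped synthetic data; all distributions and the classifier choice are illustrative assumptions, and the real-data labels are used only for scoring.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample(n, mu0, mu1, rng):
    # Two 2-D Gaussian classes with per-coordinate means mu0 and mu1.
    y = rng.integers(0, 2, n)
    x = rng.normal(0, 1, (n, 2)) + np.where(y[:, None] == 1, mu1, mu0)
    return x, y

x_real, y_real = sample(300, -1.0, 1.0, rng)  # "real" data (labels only score)
x_syn, y_syn = sample(5000, -0.4, 1.6, rng)   # synthetic data with estimation error

def standardize(x):
    # Per-domain standardization: a crude stand-in for the learned map M_psi.
    return (x - x.mean(0)) / x.std(0)

z_real, z_syn = standardize(x_real), standardize(x_syn)

def ncm_predict(z, m0, m1):
    # Nearest-class-mean classifier (stand-in for h_phi1).
    return (np.linalg.norm(z - m1, axis=1) < np.linalg.norm(z - m0, axis=1)).astype(int)

# Classifier trained on raw synthetic data vs. on mapped synthetic data.
acc_raw = np.mean(ncm_predict(x_real, x_syn[y_syn == 0].mean(0),
                              x_syn[y_syn == 1].mean(0)) == y_real)
acc_common = np.mean(ncm_predict(z_real, z_syn[y_syn == 0].mean(0),
                                 z_syn[y_syn == 1].mean(0)) == y_real)
```

Aligning only two moments already moves the synthetic-trained decision boundary toward the right place on the real data; the adversarially learned $M_\psi$ generalizes this idea by matching the full mapped distributions rather than just their moments.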
HYPHYLEARN: We now present our final solution as an algorithmic framework composed of the aforementioned four steps. In a nutshell, HYPHYLEARN generates synthetic data based on the physics-based parametric statistical models and utilizes them, along with the available data, in a learning-based classifier powered by adversarial training of NNs (see the following subsection). In order to train the NNs based on their specific loss functions, described in the following subsection, we utilize the stochastic gradient descent method [37] along with mini-batches consisting of random samples from the training and synthetic datasets in an iterative manner. The details of the whole process are presented in Algorithm 1.
Input: Training dataset D_r = {x_{r,n}}_{n=1}^{N_r}; learning rates µ_{r_1}, µ_{r_2}, µ_{r_3}; number of training steps N_tr; mini-batch size N_b < N_r; number of synthetic data samples N_s to be generated. Output: The mapping M_ψ(·) and the classifier h_{φ_1}(·), parameterized by the real vectors ψ and φ_1, respectively. Within each training step, the losses are evaluated on the mini-batches, L_s ← L_s(ψ, φ_1|D_{s,b}) and L_c ← L_c(ψ, ζ|D_{r,b}, D_{s,b}), and the gradients are then computed via backward propagation.

A. Incorporating synthetic and training data in a learning-based classifier for HYPHYLEARN

To elaborate further on Step 4, we first denote the distributions corresponding to the real and synthetic data as p_{θ*}(x) and p_{θ̂}(x), which we refer to as the true and estimated distributions, respectively. For each distribution, applying the mapping M_ψ(·) to the input space X induces a distribution over the feature space Z; we denote the mapping of the true distribution by p_{ψ,θ*}(z). Assuming that X and Z are topological spaces, for any A ⊂ Z the probability of A in the space Z is p_{θ*}(M_ψ^{-1}(A)), where the pre-image M_ψ^{-1}(A) belongs to the Borel σ-algebra over X. Subsequently, the probability of error corresponding to a classifier h_{φ_1}(z), parameterized by a real vector φ_1, with respect to the mapping of the true distribution to the Z space is computed via (4), where the dependence of P on the π_i's is suppressed for notational simplicity. Similarly, the mapping of the estimated distribution to the space Z is characterized by a distribution denoted by p_{ψ,θ̂}(z), and the probability of error for a classifier h_{φ_1}(z) with respect to p_{ψ,θ̂}(z) can be computed similar to (4), which we refer to as P_{ψ,θ̂}[e_{φ_1}]. Our main goal is to learn a map M_ψ(·) and a classifier h_{φ_1}(z) such that the probability of error of h_{φ_1}(z) with respect to the mapping of the true distribution to Z, i.e., P_{ψ,θ*}[e_{φ_1}], is small.
To this end, we repurpose results from the domain-adaptation literature in the following to obtain an upper bound on P_{ψ,θ*}[e_{φ_1}], which leads to explicit loss functions for the joint learning of M_ψ and h_{φ_1}(z) using both the training and synthetic datasets. Specifically, it is desired for the mapping M_ψ(·) from X to Z to transform the true and estimated distributions in such a way that p_{ψ,θ*}(z) and p_{ψ,θ̂}(z), which are defined on the feature space Z, are similar. Mathematically, this similarity should be measured in terms of a distance metric. However, as only a limited number of samples from p_{ψ,θ*}(z) are available, we need to be able to approximate this distance from a finite number of samples. We expand on this idea by primarily focusing on binary classification in this section, although the results extend to the general classification task. We begin with the following distance definitions.

Definition 1. For a family of binary-valued functions H_Φ = {h_φ : Z → {0, 1}}, let A_Φ denote the collection of subsets A_φ = {z ∈ Z : h_φ(z) = 1} for h_φ ∈ H_Φ. The A_Φ-distance between two distributions p and q over Z is then defined as d_{A_Φ}(p, q) = 2 sup_{A∈A_Φ} |p(A) − q(A)|. (5)
³Similar to the total variation distance, it can be readily verified that d_{A_Φ} and d_{B_Φ} are also distance metrics.
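To make the finite-sample estimation of this distance concrete, the following sketch computes an empirical A_Φ-distance for the simple class of one-dimensional threshold classifiers, for which the supremum reduces to a maximum over sample points (with the factor-2 convention, twice the Kolmogorov-Smirnov statistic). The function class, sample sizes, and distributions are illustrative assumptions.

```python
import numpy as np

def empirical_A_distance(z_p, z_q):
    """Empirical A_Phi-distance for H_Phi = 1-D threshold classifiers
    h_t(z) = 1{z <= t}: d_hat = 2 * max_t |F_hat_p(t) - F_hat_q(t)|,
    i.e., twice the Kolmogorov-Smirnov statistic of the two samples."""
    thresholds = np.sort(np.concatenate([z_p, z_q]))
    f_p = np.searchsorted(np.sort(z_p), thresholds, side="right") / len(z_p)
    f_q = np.searchsorted(np.sort(z_q), thresholds, side="right") / len(z_q)
    return 2.0 * np.max(np.abs(f_p - f_q))

rng = np.random.default_rng(1)
d_same = empirical_A_distance(rng.normal(0, 1, 2000), rng.normal(0, 1, 2000))
d_shift = empirical_A_distance(rng.normal(0, 1, 2000), rng.normal(3, 1, 2000))
```

For identical distributions the estimate is near zero, while a large mean shift drives it toward its maximal value of 2, illustrating how the distance separates mismatched sample sets.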
The A_Φ-distance is also referred to by other names, such as the A-distance and the H-distance, in [23], [38]. As the following extreme choices of H_Φ show, these distances are clearly a function of the richness of the class H_Φ. For the very restrictive choice of only constant functions, i.e., H_Φ = {h_φ|h_φ(z) = 0, ∀z} ∪ {h_φ|h_φ(z) = 1, ∀z}, d_{A_Φ} is always zero, as the only possible choices for A_φ are the empty set and Z. On the other hand, for H_Φ representing all binary-valued functions on Z, d_{A_Φ} coincides with the definition of the total variation distance [39], as the sup in (5) is then effectively over the σ-algebra of subsets of the Z space. This dependence of d_{A_Φ} on the underlying family of functions makes it possible to obtain an estimate of the A_Φ-distance based on a finite set of samples from each distribution; specifically, consider the two sets Z_{ψ,θ*} and Z_{ψ,θ̂} of mapped training and synthetic samples, from which an empirical distance d̂_{A_Φ} can be computed. As the bound on P_{ψ,θ*}[e_{φ_1}] should be obtained from a finite number of training and synthetic samples, it is then of interest to see how far d̂_{A_Φ} is from d_{A_Φ}. To answer this question, one needs to rely on a measure of complexity for a given class of functions, such as the VC dimension [36] or the Rademacher complexity [40]. As we have chosen the mapping function M_ψ(x) and the classifier h_{φ_1}(z) to be NNs, we present the results based on the Rademacher complexity, defined as follows, which can be computed in closed form for certain classes of neural networks [40].
Definition 2. Let Z_1 = {z_i}_{i=1}^N be a set of i.i.d. samples drawn from a distribution p(z) that is supported on Z. For H_Φ, a family of real-valued functions over Z, the empirical Rademacher complexity of H_Φ given the dataset Z_1 is defined as R̂_{Z_1}(H_Φ) = E_σ[ sup_{h_φ∈H_Φ} (1/N) Σ_{i=1}^N σ_i h_φ(z_i) ], where the expectation is over the σ_i's, each taking a value in {−1, +1} with equal probability.
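For a finite function class, the expectation in this definition can be approximated by Monte Carlo, as in the following sketch; the threshold function class, the evaluation points, and the number of sign draws are illustrative assumptions.

```python
import numpy as np

def empirical_rademacher(H_vals, n_draws=2000, seed=0):
    """Monte-Carlo estimate of E_sigma[ sup_h (1/N) sum_i sigma_i h(z_i) ]
    for a finite class. H_vals[j, i] = h_j(z_i), shape (num_functions, N)."""
    rng = np.random.default_rng(seed)
    _, N = H_vals.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, N))   # Rademacher signs
    # correlation of every function with every sign pattern, sup over h
    sups = (sigma @ H_vals.T / N).max(axis=1)
    return sups.mean()

# Toy class: threshold classifiers h_t(z) = 1{z <= t} on N = 50 points
z = np.linspace(0.0, 1.0, 50)
H = np.stack([(z <= t).astype(float) for t in np.linspace(0.0, 1.0, 25)])
R_hat = empirical_rademacher(H)
```

A class containing only the zero function has zero complexity, while the threshold class attains a small positive value that shrinks as N grows, consistent with the role of R̂ in the bounds below.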

Lemma 2 ([40]). Consider a family of functions H_Φ mapping Z to [0, 1]. Then, for a set Z_1 of N i.i.d. samples from p(z) and any 0 < δ < 1, the following holds ∀h_φ ∈ H_Φ with probability at least 1 − δ: E_{z∼p}[h_φ(z)] ≤ (1/N) Σ_{i=1}^N h_φ(z_i) + 2R̂_{Z_1}(H_Φ) + 3√(ln(2/δ)/(2N)). Now, the difference between d̂_{A_Φ} and d_{A_Φ} can be bounded in terms of the complexity of the underlying family of functions and the number of available samples, as stated in the following lemma.
Proof. See Appendix A.
The above lemma enables us to bound the A_Φ-distance between two distributions in terms of the samples collected from each. Equipped with this result, we are able to bound the probability of error P_{ψ,θ*}[e_{φ_1}] via the following theorem.
Theorem 1. Assume that the training and synthetic datasets are mapped into the feature space Z through the mapping function M_ψ(x), with the resulting samples denoted by Z_{ψ,θ*} and Z_{ψ,θ̂}, respectively. Then, for any 0 < δ < 1 and a family of functions H_Φ, the probability of error P_{ψ,θ*}[e_{φ_1}] is bounded from above by the sum of P_{ψ,θ̂}[e_{φ_1}], the empirical distance d̂_{A_Φ} between Z_{ψ,θ*} and Z_{ψ,θ̂}, and terms that depend only on the complexity of H_Φ, the numbers of samples, and δ. Proof. See Appendix B.
The above theorem bounds the probability of error with respect to p_{ψ,θ*}(z) associated with a classifier h_{φ_1}(z) in terms of quantities that do not depend on the unknown true parameters θ*. As our primary goal is to make P_{ψ,θ*}[e_{φ_1}] as small as possible, the mapping function M_ψ(x) and the classifier h_{φ_1}(z) should be chosen so as to minimize the above upper bound. We note that the complexity-related terms in the bound are fixed for a chosen family of functions, and the bound is primarily controlled by the first two terms. In other words, M_ψ(x) and h_{φ_1}(z) should be chosen such that the probability of classification error with respect to the mapping of the estimated distribution into the Z space, i.e., P_{ψ,θ̂}[e_{φ_1}], and the approximated A_Φ-distance between the synthetic and training datasets are minimized simultaneously. To achieve this goal, we restrict ourselves to M_ψ(x) and h_{φ_1}(z) that correspond to NNs trained to minimize a loss function in accordance with the first two terms of the bound. The resulting optimization problem can be solved efficiently via the stochastic gradient descent method, as described in the following.
Joint learning of the feature map and the classifier: In terms of specifics, we assume M_ψ(x) and h_{φ_1}(z) belong to the class of feed-forward (deep) NNs whose parameters, i.e., ψ and φ_1, correspond to the weights and biases of each network. The input and output layers of the NN corresponding to M_ψ have n_x and n_z neurons, respectively, where n_x and n_z denote the dimensions of the spaces X and Z. We note that n_x is fixed by the length of the observation vector as part of the problem formulation, while n_z can be picked as a hyper-parameter to facilitate the training process.
Subsequently, the input layer of h_{φ_1} has n_z neurons, while its output layer contains C neurons whose activation function is chosen to be the softmax function σ(z), whose ith element is given by σ(z)_i = e^{z_i}/Σ_{j=1}^{C} e^{z_j}. In this way, the ith component of the vector y_{ψ,φ_1,x} = h_{φ_1}(M_ψ(x)) denotes the probability that the classifier assigns to the input x belonging to the ith class, for i = 0, . . . , C − 1. Consequently, the averaged cross-entropy loss over the synthetic dataset D_s, minimizing which leads to minimizing the classification error associated with h_{φ_1}, equals L_s(ψ, φ_1) = −(1/N_s) Σ_{n=1}^{N_s} l_{s,n}^T log(y_{ψ,φ_1,x_{s,n}}), (12) where l_{s,n} = e_C(y_{s,n}) denotes the one-hot encoded version of the label y_{s,n} corresponding to the nth sample. Regarding the computation of d̂_{A_Φ} between the two sets Z_{ψ,θ*} and Z_{ψ,θ̂}, the authors in [23], [38] suggest that the classification accuracy of a classifier trained to distinguish between samples from the two sets can be used as a surrogate for the inf part in (7), and that this surrogate can be readily computed during the learning process. To train such a classifier, we consider a NN d_ζ with n_z input neurons and 2 output neurons with the softmax activation function, which is trained to distinguish between Z_{ψ,θ*} and Z_{ψ,θ̂}, labeled as 0 and 1, respectively.
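The accuracy-based surrogate of [23], [38] can be sketched numerically as follows. A nearest-centroid rule stands in for the NN critic d_ζ, and the sample sets and dimensions are illustrative assumptions; only the recipe (train a domain classifier, plug its held-out error ε into 2(1 − 2ε)) is from the cited works.

```python
import numpy as np

def proxy_A_distance(Z_a, Z_b):
    """Proxy for the A_Phi-distance between two mapped sample sets: fit a
    domain classifier to separate them and plug its held-out error eps into
    2 * (1 - 2 * eps). A nearest-centroid rule stands in for the NN critic."""
    n = min(len(Z_a), len(Z_b)) // 2
    tr_a, te_a = Z_a[:n], Z_a[n:2 * n]       # train/test split of set a
    tr_b, te_b = Z_b[:n], Z_b[n:2 * n]       # train/test split of set b
    c_a, c_b = tr_a.mean(axis=0), tr_b.mean(axis=0)

    def predict(Z):   # 0 -> set a, 1 -> set b
        return (np.linalg.norm(Z - c_b, axis=1)
                < np.linalg.norm(Z - c_a, axis=1)).astype(int)

    eps = np.concatenate([predict(te_a) != 0, predict(te_b) != 1]).mean()
    return 2.0 * (1.0 - 2.0 * eps)

rng = np.random.default_rng(2)
d_same = proxy_A_distance(rng.normal(0, 1, (400, 2)), rng.normal(0, 1, (400, 2)))
d_far = proxy_A_distance(rng.normal(0, 1, (400, 2)), rng.normal(4, 1, (400, 2)))
```

Indistinguishable sets give ε ≈ 1/2 and a proxy distance near zero; well-separated sets give ε ≈ 0 and a proxy distance near 2.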
Consequently, by defining the two-dimensional vector d_{ψ,ζ,x} = d_ζ(M_ψ(x)), the d̂_{A_Φ} term can be approximated by the cross-entropy loss associated with d_ζ as follows: L_c(ψ, ζ) = −(1/N_r) Σ_{n=1}^{N_r} log d_{ψ,ζ,x_{r,n}}[1] − (1/N_s) Σ_{n=1}^{N_s} log d_{ψ,ζ,x_{s,n}}[2]. (14) Now, using Theorem 1, the training goal for the constituent NNs is set to simultaneously minimize the classification error corresponding to the synthetic data and the distance between the real and synthetic data, both measured in the mapped space Z. Specifically, the NNs M_ψ and h_{φ_1} should be trained to minimize the sum of the losses in (12) and (13), while the classifier d_ζ is trained to minimize (14). As M_ψ is trained to maximize L_c(ψ, ζ) despite d_ζ's goal of minimizing L_c(ψ, ζ), the learning process involves adversarial training between these two NNs. Based on the approach taken in [23] for adversarial training in the context of domain adaptation, we train the above three NNs to find the saddle points ψ̂, φ̂_1 and ζ̂ of (15), which can be achieved by utilizing the stochastic gradient descent algorithm for each minimization task. To this end, the minimization is performed over the NNs' parameters ψ, φ_1 and ζ, which are real vectors whose dimensions are determined by the architecture of each network.
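The saddle-point training just described can be sketched end-to-end with linear stand-ins for the three networks. Everything below (the data, the affine feature map, logistic models in place of the NNs M_ψ, h_{φ_1}, d_ζ, and the learning rates) is an illustrative assumption; only the alternation (classifier and critic descend their losses, the feature map descends the classification loss while ascending the critic loss) mirrors the scheme above.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

# Toy data: two classes separated along the first axis; the "synthetic"
# domain is shifted, mimicking the mismatch between true/estimated models.
def make_domain(n, shift):
    y = rng.integers(0, 2, n)
    x = rng.normal(0, 1, (n, 2))
    x[:, 0] += np.where(y == 1, 2.0, -2.0)
    return x + shift, y

Xr, yr = make_domain(400, np.zeros(2))            # real (yr used only to evaluate)
Xs, ys = make_domain(400, np.array([1.5, 1.5]))   # synthetic, labeled

W, b = np.eye(2), np.zeros(2)    # feature map M_psi(x) = W x + b
wc = np.zeros(2)                 # classifier h_phi1 (logistic stand-in)
wd = np.zeros(2)                 # domain critic d_zeta (logistic stand-in)
lr, lam = 0.02, 0.1
loss_hist = []
for it in range(800):
    Zr, Zs = Xr @ W.T + b, Xs @ W.T + b
    pc = sigmoid(Zs @ wc)                          # class posteriors on synthetic
    loss_hist.append(-np.mean(ys * np.log(pc + 1e-12)
                              + (1 - ys) * np.log(1 - pc + 1e-12)))
    gc = (pc - ys) / len(ys)                       # dCE/dlogit, averaged
    pd_r, pd_s = sigmoid(Zr @ wd), sigmoid(Zs @ wd)
    gr = pd_r / (2 * len(Xr))                      # critic grads (real -> 0)
    gs = (pd_s - 1) / (2 * len(Xs))                # critic grads (synthetic -> 1)
    # feature map: DESCEND classification loss, ASCEND critic loss (saddle point)
    dZ_cls = gc[:, None] * wc[None, :]
    dZ_dr, dZ_ds = gr[:, None] * wd[None, :], gs[:, None] * wd[None, :]
    W -= lr * (dZ_cls.T @ Xs - lam * (dZ_dr.T @ Xr + dZ_ds.T @ Xs))
    b -= lr * (dZ_cls.sum(0) - lam * (dZ_dr.sum(0) + dZ_ds.sum(0)))
    wc -= lr * gc @ Zs                             # classifier descends CE
    wd -= lr * (gr @ Zr + gs @ Zs)                 # critic descends its CE

acc_real = np.mean((sigmoid((Xr @ W.T + b) @ wc) > 0.5) == yr)
```

The real labels yr are used only for the final evaluation, matching the setting in which only synthetic data carry (imperfect) labels during training.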
B. An illustrative example: The case of two-dimensional Gaussian data

Next, we illustrate how the learning-based classifier in Section III-A performs on simple training and synthetic datasets. To this end, we consider a toy example where the true and estimated distributions are each a mixture of two bivariate Gaussian distributions with full-rank covariance matrices. In particular, we focus on the problem of binary classification where the distribution for the ith class is p_i(x; θ*_i) = N(µ_i, Σ), i = 0, 1, with equal priors. In order to investigate the effect of mismatch in the mean parameters only, the corresponding estimated distributions are assumed to have the same covariance but different means, i.e., p_i(x; θ̂_i) = N(µ̂_i, Σ) for i = 0, 1, with equal priors. For two multivariate Gaussian distributions, the authors in [39] have derived a bound on the corresponding total variation, stated in the following theorem.

Theorem 2 ([39]). Consider two d-dimensional Gaussian distributions N(µ_1, Σ_1) and N(µ_2, Σ_2), where µ_1 ≠ µ_2 and Σ_1 and Σ_2 are positive definite. Let v = µ_1 − µ_2 and let Π be a d × (d − 1) matrix whose columns form a basis for the subspace orthogonal to v, and denote the associated eigenvalues by λ_1, . . . , λ_{d−1}. Then, the total variation between the two distributions can be bounded in terms of v and these eigenvalues. We note that a bound on the total variation would also bound the A_Φ-distance, following the discussion after Definition 1. Using the above result, we can bound the total variation distance between p_i(x; θ*_i) and p_i(x; θ̂_i) as follows, which will provide useful insights in the remainder of this section about the learning process described in Section III-A.
Corollary 1. For two Gaussian distributions N(µ_0, Σ) and N(µ̂_0, Σ) with the same positive definite covariance matrix Σ, the corresponding total variation is bounded from above by (9/2) min(1, √((µ_0 − µ̂_0)^T Σ^{−1} (µ_0 − µ̂_0))).
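In the equal-covariance setting of the corollary, the total variation is in fact available in closed form: the problem reduces to two unit-variance 1-D Gaussians separated by δ = ‖Σ^{−1/2}(µ_0 − µ̂_0)‖, giving TV = 2Φ(δ/2) − 1. This is a standard fact, not a result from the paper, and the covariance and means below are illustrative values.

```python
import numpy as np
from math import erf, sqrt

def tv_gaussians_same_cov(mu1, mu2, Sigma):
    """Exact total variation between N(mu1, Sigma) and N(mu2, Sigma):
    TV = 2 * Phi(delta / 2) - 1 with delta = ||Sigma^{-1/2}(mu1 - mu2)||."""
    v = np.asarray(mu1, float) - np.asarray(mu2, float)
    delta = sqrt(float(v @ np.linalg.solve(Sigma, v)))  # Mahalanobis separation
    Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))    # standard normal CDF
    return 2.0 * Phi(delta / 2.0) - 1.0

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
tv = tv_gaussians_same_cov([2.0, 1.0], [2.5, 1.0], Sigma)
```

The exact value is zero for matched means and grows monotonically with the Mahalanobis separation, consistent with the role of the mean mismatch in Corollary 1.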
Regarding the specific architecture of the NNs utilized in Section III-A, we choose the mapping function M_ψ to be M_ψ(x) = W_{ψ,2} W_{ψ,1} x, parameterized by ψ = {W_{ψ,1} ∈ R^{20×2}, W_{ψ,2} ∈ R^{2×20}}. In particular, we have set the dimension of the space Z to n_z = 2 in order to be able to readily visualize it within the 2D coordinate system. For each of h_{φ_1} and d_ζ, we choose a two-layer NN with a softmax activation function. Training of these NNs involves finding the saddle points of (15) based on the available training and synthetic datasets, which leads to the learning-based classifier h_{φ_1}. We note that the above simple choice of the mapping function maps p_i(x; θ*_i), i = 0, 1, to Gaussian distributions in the Z space, which allows us to utilize Corollary 1 in the following for analyzing the total variation distance between these mappings.
We now resort to numerical results for further illustration of this example. To this end, we set µ_0 = [2. Let v = µ_i − µ̂_i and e_v = v/‖v‖. Assuming λ_1 and λ_2 are the eigenvalues of Σ with corresponding eigenvectors u_1 and u_2 such that λ_1 > λ_2, it is straightforward to show that for the bound of Corollary 1 to be minimized, e_v ought to be in the same direction as u_1 while ‖v‖ becomes minimal. Notably, Figs. 2b and 2c highlight the fact that finding the saddle points in (15) in part corresponds to mapping the datasets to a feature space Z that satisfies both of these criteria.

IV. CASE STUDY 1: CHANNEL-BASED SPOOFING DETECTION FOR PHYSICAL LAYER SECURITY
We now present the first case study, concerning channel spoofing detection, which can be posed as a binary hypothesis testing problem. We first obtain the likelihood ratio test based on time-variant channel models. Then, we discuss parameter estimation procedures for the likelihood functions corresponding to channel frequency responses (CFRs), and show how our proposed solution can be applied to achieve an enhanced spoofing detection system.

A. System Model
The problem of channel spoofing detection arises in a wireless communication environment where a legitimate transmitter (Alice) is transmitting signals to a legitimate receiver (Bob) in the presence of an adversary (Eve). Eve aims at spoofing the Alice-Bob channel by using Alice's MAC address [5], [7]. Bob's goal in this setting is to distinguish between the signals coming from Alice and Eve based on the corresponding CFRs.
We envision the communication parties in a 5G propagation setting relying on MIMO-OFDM wideband communications, where the numbers of antennas are N_Tx and N_Rx at the transmitter and the receiver, respectively. We first introduce the dominant-paths model suitable for MIMO-OFDM communication under a frequency-dependent array response [41]. For this scenario, the N_Rx × N_Tx channel matrix associated with the nth subcarrier (n = 1, . . . , N_f) is expressed in terms of antenna steering and response vectors, where K is the total number of dominant paths, ψ_{T,k} = (2π/λ_n) d sin(θ_{Tx,k}), λ_n denotes the wavelength at the nth subcarrier, and d refers to the distance between two antenna elements. The structure of the frequency-dependent antenna steering and response vectors a_{T,n}(ψ_{T,K−1}) and a_{R,n}(ψ_{R,K−1}) depends on the specific array structure; we consider the case of a uniform linear array (ULA) in this work. Similarly, a_{Rx,n}(ψ_{Rx,k}) can be defined for the receiver's antennas. The path-gain matrix is obtained from the complex channel gain h_k and delay τ_k of the kth path, where T_s is the sampling interval. Then, h̄ is obtained using the vec{·} operator, which stacks the columns of a matrix into a single column vector. We denote the parameters associated with the specular-paths contribution h̄, which remains constant during a coherence time T_c, via a 4K × 1 vector θ_sp, where ψ_T = [ψ_{T,0}, . . . , ψ_{T,K−1}]. For modeling the variable part of the channel, we first assume that the wide-sense stationary uncorrelated scattering (WSSUS) assumption holds, and use a multipath tapped delay line, h(t, τ) = Σ_{l=0}^{L−1} A_l(t) δ(τ − l∆τ), to model the impulse response at time t between two antennas. Here, A_l(t) and ∆τ = 1/W denote the (complex) amplitude of the lth path and the delay between two consecutive paths, respectively. Sampling the impulse response at time t = uT, followed by taking the Fourier transform w.r.t.
τ, yields a vector q_u whose nth element is q_u[n] = Σ_{l=0}^{L−1} A_{u,l} e^{−j2πn∆f l∆τ}, where ∆f and A_{u,l} denote the subcarrier width and the lth channel gain at time u, respectively. Following the exponential-decay model, which holds for the power delay profile of q_u based on various experimental observations [5], A_{u,l} is modeled with a zero-mean Gaussian distribution with variance Var(A_{u,l}) = α²(1 − e^{−2πβ})e^{−2πβl}. Here, α² and β denote the average power and the normalized coherence bandwidth, respectively. The distribution of q_u is given in the following lemma.
Proof. See Appendix C.
Next, the contribution of the measurement noise n is modelled as a zero-mean complex multivariate Gaussian random variable, n ∼ CN(0, σ²I), where σ² denotes the noise variance. This can be incorporated into the above lemma by defining ν_{q,n} = ν_q + [σ², 0, . . . , 0] and R_{q,n} = toep(ν_{q,n}, ν_{q,n}^H). We then follow the Kronecker model to obtain the covariance matrix of the CFR, which holds when the diffuse-spectrum contribution in the angular domains is independent of that in the frequency domain [9], [42]. Under the Kronecker model, the covariance matrix of the CFR can be decomposed as R = I_{N_Rx} ⊗ I_{N_Tx} ⊗ R_{q,n}. Therefore, the distribution of the CFR in (19) under the above model is h_u ∼ CN(h̄, R). We denote the parameters associated with the covariance matrix, which relate to the variable part of the CFR and the noise, by θ_vn = [α, β, L, σ]. As mentioned earlier, the mean h̄ depends solely on the specular-path parameters θ_sp.
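The diffuse-channel statistics above can be sketched numerically as follows. The decay of the tap variances with the tap index l is an assumption on the exact form of the exponential power delay profile, and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative parameters: average power alpha^2, normalized coherence
# bandwidth beta, L diffuse taps, N_f subcarriers.
alpha, beta, L, N_f = 1.0, 0.05, 8, 64

# Assumed exponential PDP: Var(A_{u,l}) = alpha^2 (1 - e^{-2 pi beta}) e^{-2 pi beta l}
var_l = alpha**2 * (1 - np.exp(-2 * np.pi * beta)) * np.exp(-2 * np.pi * beta * np.arange(L))
# complex Gaussian taps A_{u,l} ~ CN(0, var_l)
A = (rng.normal(size=L) + 1j * rng.normal(size=L)) * np.sqrt(var_l / 2)
# frequency response q_u[n] = sum_l A_l e^{-j 2 pi n l / N_f} (unit tap spacing)
n = np.arange(N_f)
q_u = np.exp(-2j * np.pi * np.outer(n, np.arange(L)) / N_f) @ A
```

Under this profile the tap variances decay geometrically by the factor e^{−2πβ}, and the total diffuse power sums to α²(1 − e^{−2πβL}).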

B. Channel spoofing detection problem
Channel-based spoofing detection is generally studied [5], [7] in the "snapshot" scenario, where Bob receives a new message claiming to be sent by Alice and has to check whether the claim is true. To this end, we assume that Bob is able to measure and store a noisy version of the CFR corresponding to a transmitting terminal. Based on the CFRs associated with the incoming messages, the goal in this scenario is to determine whether a message at time t = (u + 1)T belongs to Alice or Eve, given a reference message from Alice⁴, h^A_u, at time t = uT. In this setup, we use the terms message and CFR interchangeably. One can pose spoofing detection as a binary classification problem with two hypotheses: under the null hypothesis H_0, the message at time t = (u + 1)T belongs to Alice, while under the alternative hypothesis H_1 a spoofing attack has occurred, i.e., the message belongs to Eve.

⁴In the remainder of this section, we use A or E in the superscript of a vector or scalar to indicate that it belongs to Alice or Eve, respectively.
For the data acquisition phase, we consider a setting where Bob spots a finite number of snapshots from a coherence time and stores the observed CFRs. Furthermore, in order to label the incoming CFRs, we use a heuristic method, given by deciding in favor of H_0 if ‖h_{u+1} − h^A_u‖ ≤ η, in lieu of knowledge of the channel parameters θ_vn and θ_sp. This method can be viewed as an imperfect labeling mechanism that decides in favor of H_0 if the Euclidean distance between an incoming CFR and the reference CFR is smaller than a predefined threshold η. From a statistical perspective, the likelihood ratio test is the main approach for deciding between the two hypotheses, and it relies on knowledge of the unknown channel parameters, as obtained in the following. The likelihood ratio test for the snapshot scenario at time t = (u + 1)T is defined as in (33) for a predefined threshold ζ, where the conditional probability distribution of h_{u+1} − h^A_u serves as the likelihood function under each hypothesis. In the following, we obtain closed-form expressions for these likelihood functions, assuming the statistical dependence on the reference CFR under each hypothesis is specified via the conditional distributions p(q^A_{u+1}|q^A_u) and p(q^E_{u+1}|q^A_u) under H_0 and H_1, respectively. Specifically, the dependence of q^A_{u+1} on q^A_u is characterized through the corresponding channel gains in terms of an order-1 autoregressive (AR-1) model [5], A^A_{u+1,l} = a_A A^A_{u,l} + √((1 − a_A²) Var(A_{u,l})) w_{u+1,l}, where a_A denotes the similarity parameter and w_{u+1,l} ∼ CN(0, 1) is independent of A_{u,l}. Similarly, the lth path gains corresponding to q^E_{u+1} and q^A_u are related through an AR-1 model with similarity parameter a_E.
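The AR-1 gain evolution can be sketched as follows; the variance-preserving form of the recursion and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def ar1_gains(a, var0, n_steps):
    """AR-1 evolution of one path gain (assumed variance-preserving form):
    A_{u+1} = a * A_u + sqrt((1 - a^2) * var0) * w_{u+1}, with w ~ CN(0, 1)."""
    A = np.empty(n_steps, complex)
    A[0] = (rng.normal() + 1j * rng.normal()) * np.sqrt(var0 / 2)
    for u in range(n_steps - 1):
        w = (rng.normal() + 1j * rng.normal()) / np.sqrt(2)  # CN(0, 1) innovation
        A[u + 1] = a * A[u] + np.sqrt((1 - a**2) * var0) * w
    return A

a_A = 0.9
A = ar1_gains(a_A, 1.0, 20000)
# empirical lag-1 correlation should be close to the similarity parameter a_A
rho = np.real(np.vdot(A[:-1], A[1:])) / np.real(np.vdot(A[:-1], A[:-1]))
```

The recursion keeps the stationary variance at var0 while the lag-1 correlation equals the similarity parameter, which is the role a_A and a_E play in the likelihood functions above.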
where θ^A_q = [α_A, β_A, L_A] and the κ function is defined in Lemma 4.
Proof. See Appendix D.
where the κ function is defined in Lemma 4.
Proof. See Appendix E.
The above two lemmas enable us to obtain the likelihood functions in (33). Regarding the null hypothesis H_0, using the Kronecker model for the covariance matrix [9] and the fact that the measurement noise is independent of the other CFR components, we obtain the covariance matrix under each hypothesis, where R_{q,H_1} is given in Lemma 6.

C. Parameter estimation
In order to utilize the likelihood ratio test in (33), Bob requires knowledge of the channel parameters θ_sp and θ_vn corresponding to the Alice-Bob and Eve-Bob channels, along with the similarity parameters. In practice, these parameters have to be estimated from the data collected over the observed snapshots. Recall from Lemma 4 that a CFR associated with a terminal has a Gaussian distribution of the form CN(h̄_{θ_sp}, R_{θ_vn}). The ML estimates of the parameters are obtained via (40), which amounts to jointly maximizing a nonlinear objective function. Moreover, it can be shown that (40c) is not a convex function of θ_sp, and as a result there is no unique solution to the optimization problem in (40a). In practice, solving such a problem is far from trivial, especially since the number of nonlinear parameters (θ_sp) is large and a multidimensional exhaustive search is not feasible. As a workaround, the authors in [9], [42] propose a suboptimal procedure that breaks the problem into two sub-problems and estimates θ_sp and θ_vn separately. In our problem, there are also the similarity parameters a_A and a_E, which, similar to θ_vn, appear in the covariance matrix of a Gaussian distribution, as obtained in Lemmas 5 and 6. Therefore, following the approach taken in [9], [42], we break the parameter estimation problem into three sub-problems. Each sub-problem involves numerically maximizing an objective function of the form (40c) w.r.t. θ_sp, θ_vn, or the similarity parameters via an iterative local optimization technique, such as the Gauss-Newton algorithm. In particular, the maximizations are performed sequentially, alternating among the three sets of parameters until convergence. In the following, we elaborate on each sub-problem for the specific channel model described earlier.
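The alternating strategy can be sketched on a toy concave objective, where cycling through parameter blocks (stand-ins for θ_sp, θ_vn, and the similarity parameters) converges to the joint maximizer. The quadratic form and all numbers are illustrative assumptions, not the actual likelihood (40c).

```python
import numpy as np

# Toy concave "log-likelihood" -0.5 t'At + b't, maximized at t* = A^{-1} b;
# a stand-in for (40c), with each coordinate playing the role of one block.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])   # symmetric positive definite
b = np.array([1.0, -2.0, 0.5])

theta = np.zeros(3)               # blocks: theta_sp, theta_vn, similarity
for sweep in range(200):
    for i in range(3):            # exact maximization over block i, others fixed
        rest = A[i] @ theta - A[i, i] * theta[i]
        theta[i] = (b[i] - rest) / A[i, i]

theta_star = np.linalg.solve(A, b)   # joint maximizer, for comparison
```

For a strictly concave objective this coordinate-wise (Gauss-Seidel) alternation converges to the joint maximizer; for the non-convex likelihood in the text, only convergence to a local solution can be expected.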
1) Estimating the specular path parameters θ_sp: The main goal here is to obtain an estimate of θ_sp that maximizes (40c) for a given estimate of θ_vn. In the following, we use the N-exponential basis function. Furthermore, we recall that for arbitrary matrices A ∈ C^{N×P}, B ∈ C^{M×P}, Q = diag(q) ∈ C^{P×P}, and a vector q ∈ C^{P×1}, one can write vec{BQA^T} = (A ⊙ B)q, where ⊙ denotes the column-wise Khatri-Rao product. Utilizing this result along with the exponential basis function, we can rewrite the specular-path contribution introduced in (26) for the CFR model, which greatly simplifies the calculation of the first and second derivatives of h̄ w.r.t. θ_sp. Specifically, the Jacobian matrix for the above model is obtained via J(θ_sp) = [J_{ψ_T}, J_{ψ_R}, J_ρ, J_τ], where the Jacobian matrix's components are given accordingly. The first-order partial derivative of the log-likelihood function (40c), given an observation h, with respect to the parameters θ_sp is denoted by q(θ_sp|h, R_{θ_vn}), and for a given Jacobian matrix J(θ_sp) it can be computed as in [42]. Furthermore, the covariance matrix of this first-order derivative (the score) is the Fisher information matrix (FIM), F(θ_sp|R_{θ_vn}), which can also be expressed in terms of the Jacobian matrix J(θ_sp) [42]. Having obtained (44) and (46), a local optimization technique can be utilized to derive an iterative rule for the estimation of θ_sp. To this end, we employ the Gauss-Newton algorithm with a step length ζ, which should be chosen such that L(h|θ^{i+1}_sp, R_{θ_vn}) > L(h|θ^i_sp, R_{θ_vn}). 2) Estimating the variable part and noise parameters θ_vn: For a given estimate of θ_sp, the goal here is to estimate θ_vn. To this end, we assume N CFRs, denoted by {h_i}_{i=1}^N, are available. For each CFR, the approach presented in Section IV-C1 is utilized to estimate the corresponding specular-path parameters {θ̂_sp,i}_{i=1}^N. In order to remove the contribution of the specular paths from the CFRs, we form an M × N matrix H whose ith column equals h_i − h̄(θ̄_sp), where θ̄_sp denotes the average of the θ̂_sp,i's. In the following, we propose an estimation procedure for θ_vn using H.
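As a self-contained toy illustration of a damped Gauss-Newton iteration with the step-length rule of Section IV-C1 (halve ζ until the objective improves), the following sketch fits a simple exponential mean model; the model, the noiseless data, and all numbers are illustrative assumptions, not the paper's CFR model.

```python
import numpy as np

# Toy nonlinear mean model h(theta)[n] = theta[1] * exp(-theta[0] * n),
# a stand-in for the specular-path model h_bar(theta_sp).
n = np.arange(32.0)
h = lambda th: th[1] * np.exp(-th[0] * n)

def jac(th):
    """Jacobian of h w.r.t. theta (columns: d/d theta0, d/d theta1)."""
    J = np.empty((len(n), 2))
    J[:, 0] = -n * th[1] * np.exp(-th[0] * n)
    J[:, 1] = np.exp(-th[0] * n)
    return J

theta_true = np.array([0.1, 2.0])
y = h(theta_true)                       # noiseless observation for clarity
theta = np.array([0.2, 1.0])            # initial guess
for it in range(100):
    J, r = jac(theta), y - h(theta)
    step = np.linalg.solve(J.T @ J + 1e-9 * np.eye(2), J.T @ r)
    zeta = 1.0
    # step length zeta chosen so that the objective improves, as in the text
    while np.sum((y - h(theta + zeta * step))**2) > np.sum(r**2) and zeta > 1e-6:
        zeta *= 0.5
    theta = theta + zeta * step
```

The damping makes each iteration monotonically decrease the residual, mirroring the requirement L(h|θ^{i+1}_sp, ·) > L(h|θ^i_sp, ·) on the likelihood.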
We first note that all the parameters in θ_vn are continuous except for the number of diffuse-spectrum paths L, which is an integer. As a result, the objective function is not continuous in L, and the gradient of (40c) w.r.t. L does not exist. We therefore take a sub-optimal approach towards estimating L by separating it from the rest of the parameters in θ_vn. The authors in [43] have proposed an eigenvalue-ratio method for estimating the number of harmonics present in signals from N available observations. For our specific case, we denote the ML estimate of the covariance of H by C_H and the corresponding eigenvalues by e_i, i = 1, . . . , M. The proposed heuristic approach is to choose L̂ based on the ratios of consecutive eigenvalues of C_H for a predefined threshold η̂. We plug the estimated value L̂ into the parameter vector to obtain θ_vn = [σ, α, β, L̂]. Then, the log-likelihood function L(H|θ_vn) for the zero-mean CFRs can be written in closed form. The first-order gradient of L(H|θ_vn) w.r.t. each parameter θ_vn,i, i = 1, 2, 3, can be computed as in (50) [42]; subsequently, the (i, j)th element of the FIM corresponding to L(H|θ_vn) equals −E[∂²L(H|θ_vn)/(∂θ_vn,i ∂θ_vn,j)], as given in (51) [42]. To obtain explicit expressions for (50) and (51), one needs to compute the corresponding partial-derivative terms.

We propose to utilize HYPHYLEARN, described in Algorithm 1, for solving the spoofing detection problem, which can be seen as a binary instance of the problem formulation described in Section II with two behaviors, as described in (33). Moreover, statistical parametric models are available for each behavior, whose high complexity forces one to resort to suboptimal parameter estimation procedures. As mentioned in Section IV-B, the data corresponding to Alice and Eve are collected in the snapshot setting and subsequently (imperfectly) labeled according to (32). Then, using these collected CFRs, the underlying parameters of each likelihood function in (33) are estimated using the sub-optimal parameter estimation procedure described in Section IV-C.
Next, the estimated parameters are plugged into the available parametric models CN(0, R_{H_0}) and CN(0, R_{H_1}), which subsequently are used to generate synthetic CFRs. Finally, the collected and synthetic CFRs are incorporated in Step 4 of Algorithm 1 to train the learning-based classifier, which can then be utilized as a spoofing detector.
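The eigenvalue-ratio heuristic of Section IV-C2 can be sketched as follows; the data model, dimensions, and noise level are illustrative assumptions, and the largest-consecutive-ratio rule is one simple instantiation of the threshold-based idea.

```python
import numpy as np

rng = np.random.default_rng(6)

# Low-rank signal (L_true diffuse components) plus white noise: the sample
# covariance spectrum drops sharply after the L_true dominant eigenvalues.
M, N, L_true = 16, 400, 3
signal = rng.normal(size=(N, L_true)) @ rng.normal(size=(L_true, M)) * 2.0
H = signal + 0.1 * rng.normal(size=(N, M))            # N observations of length M
e = np.sort(np.linalg.eigvalsh(np.cov(H.T)))[::-1]    # eigenvalues, descending
L_hat = int(np.argmax(e[:-1] / e[1:])) + 1            # largest ratio e_i / e_{i+1}
```

The sharp signal-to-noise drop in the spectrum makes the ratio at the true order dominate, which is what the heuristic exploits.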

V. CASE STUDY 2: MULTI-USER DETECTION
An important problem in multipoint-to-point digital communication networks (e.g., radio networks, local-area networks, and uplink satellite channels) is the optimum centralized demodulation of the information sent simultaneously by several users through a Gaussian multiple-access channel. Even though the users may not employ a protocol to coordinate their transmission epochs, effective sharing of the channel is possible because each user modulates a different signature signal waveform that is known by the intended receiver (Code Division Multiple Access (CDMA)). In this section, we consider the uplink of a cellular communication system where K users are asynchronously sharing a channel to communicate with a base station (BS). The problem of multi-user detection in this setting amounts to inferring the information associated with each user from this multiple access channel.

A. Multi-user detection problem
Consider the uplink of an asynchronous direct-sequence (DS) CDMA system shared by K users, employing long spreading codes and bandlimited chip pulses, and operating over a frequency-selective fading channel. The baseband equivalent of the received signal may be written as r(t) = Σ_{k=0}^{K−1} A_k Σ_{p=0}^{P−1} b_k(p) (s_{k,p} ∗ c_k)(t − pT_b − τ_k) + w(t), where P is the number of transmitted packets and s_{k,p}(t) denotes the kth user's signature waveform. Furthermore, T_b is the bit-interval duration, A_k and τ_k denote the complex amplitude and timing offset of the kth user, respectively, and b_k(p) is the kth user's information bit in the pth signaling interval, whereas w(t) is the complex envelope of the additive noise, assumed to be a zero-mean, wide-sense stationary complex white Gaussian process. Moreover, c_k(t) is the impulse response modeling the channel between the BS and the kth user. Note that the channel impulse responses c_k(t) are assumed time-invariant over each transmitted frame [44], under the assumption that the channel coherence time exceeds the packet duration BT_b. Regarding the kth user's signature waveform, we have s_{k,p}(t) = Σ_{n=0}^{N−1} β^{(n)}_{k,p} h_SRRC(t − nT_c), where {β^{(n)}_{k,p}}_{n=0}^{N−1} is the pseudo-noise (PN) code employed by user k for spreading its data bits in the pth symbol interval, N is the processing gain, and T_c = T_b/N is the chip interval. Furthermore, h_SRRC(t) denotes the square-root raised-cosine waveform, a bandlimited chip pulse which, following [44], is time-limited to [0, 4T_c].
In the BS, chip-matched filtering and chip-rate sampling are performed in order to convert the received signal to the discrete-time domain. To this end, r(t) is convolved with the chip-matched filter h_SRRC(4T_c − t), followed by a sampler at rate 2/T_c (the Nyquist rate). The convolution results in a waveform y(t), where h_RC(t) represents a raised-cosine chip waveform time-limited to [0, 8T_c). As h_{k,p}(t − pT_b, τ_k) has a time-domain support of [pT_b, (p + 2)T_b + 7T_c], during the pth symbol interval I_p = [pT_b, (p + 1)T_b] the contribution of at most three bits per user, i.e., the pth, (p − 1)th, and (p − 2)th ones, is observed, assuming that τ_k + T_m < T_b, where T_m stands for the maximum delay spread. Therefore, sampling the waveform y(t) at rate M/T_c, the MN-dimensional vector y(p) collecting the data samples of the interval I_p can be expressed in terms of h_{k,p−i}(p) and n(p), which comprise the MN samples of h_{k,p−i}(t, τ_k), i ∈ {0, 1, 2}, and n(t), respectively, during I_p. We set M = 2 in the following discussion. A compact representation of y(p) can be obtained by relying on the notion of the effective chip pulse g_k(t, τ_k). Noting that h_{k,p}(t, τ_k) = Σ_{n=0}^{N−1} β^{(n)}_{k,p} g_k(t − nT_c, τ_k), and defining g_k ∈ C^{(MN+8M−1)×1} as g_k = [g_k(T_c/M, τ_k), g_k(2T_c/M, τ_k), . . .], with matrices built from the code symbols β^{(n)}_{k,p} as obtained in detail in (9)-(11) of [44], we arrive at a compact model for y(p). The multiuser detection problem can then be cast as a 2^K-ary classification problem where the goal is to find the vector of information bits b = [b_0(p), . . . , b_{K−1}(p)] given an observation vector y(p). Assuming all vectors b ∈ {0, 1}^K are a priori equiprobable, the minimum-distance rule gives the maximum a posteriori decision [45]; mathematically, multiuser detection is equivalent to solving the minimization problem b̂ = argmin_{b∈{0,1}^K} ‖y(p) − Σ_{k=0}^{K−1} A_k b_k(p) g_k‖².
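A minimal sketch of the minimum-distance rule above, enumerating all 2^K bit vectors: the random signatures, the antipodal bit mapping, and the noise level are illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

# Effective signatures: columns of G stand in for the vectors A_k * g_k.
K, M = 4, 32
G = rng.normal(size=(M, K))

def ml_detect(y):
    """Minimum-distance (MAP) rule: enumerate all 2^K bit vectors and pick
    the one whose synthesized observation is closest to y."""
    best, best_d = None, np.inf
    for bits in product([0, 1], repeat=K):
        s = 2 * np.array(bits) - 1          # {0,1} -> {-1,+1} (assumed mapping)
        d = np.linalg.norm(y - G @ s)
        if d < best_d:
            best, best_d = np.array(bits), d
    return best

b_true = rng.integers(0, 2, K)
y = G @ (2 * b_true - 1) + 0.05 * rng.normal(size=M)  # low-noise observation
b_hat = ml_detect(y)
```

The exhaustive search makes the exponential complexity in K explicit: the loop visits 2^K candidates per symbol interval.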
However, the complexity of such a detector is exponential in the number of users [45], and in practice sub-optimal methods such as the minimum mean square error (MMSE) detector [45] are utilized for performance evaluation. We note that multiuser detection methods rely on the channel parameters and the spreading codes corresponding to each user, and we assume that true knowledge of neither is available at the BS. Specifically, we consider a case where a mismatch exists between the true spreading codes [46], utilized by the users, and the corresponding ones available at the BS. Besides, we assume that the BS has access to N_T training data samples from the kth user. The channel parameters are then estimated based on the available training data using the procedure described in the following section.
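As a rough illustration of a linear sub-optimal alternative, the following is a simplified linear MMSE detector for the memoryless model y = G(A ∘ b) + n with {0,1} bits. This is a stand-in sketch under our own simplifying assumptions, not the exact detector of [45].

```python
import numpy as np

def mmse_detect(y, G, A, sigma2):
    """Linear MMSE detection for y = G @ (A * b) + n with i.i.d.
    equiprobable bits b_k in {0, 1}: form the LMMSE estimate of
    x = A * b (prior mean A/2, prior variance A^2/4 per user) and
    threshold at A_k/2."""
    m = G.shape[0]
    Cx = np.diag(A ** 2 / 4.0)          # prior covariance of x
    mx = A / 2.0                        # prior mean of x
    W = Cx @ G.T @ np.linalg.inv(G @ Cx @ G.T + sigma2 * np.eye(m))
    x_hat = mx + W @ (y - G @ mx)
    return (x_hat > A / 2.0).astype(int)
```

The complexity is polynomial in K (one matrix inversion), in contrast to the exponential exhaustive search, at the cost of optimality.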

B. Parameter estimation
The performance of multi-user detection relies heavily on the estimation of the channel parameters. We assume the channel impulse response (CIR), c_k(t), takes the form of a time-invariant multipath channel with L paths, i.e., c_k(t) = \sum_{l=0}^{L-1} \alpha_{k,l} \delta(t - \tau_{k,l}), which is parameterized by the complex path gains \alpha_{k,l} and the corresponding path delays \tau_{k,l}. The joint ML estimate of these parameters requires an exhaustive search over the continuous K-dimensional space [0, T_b)^K, which is computationally prohibitive. It is shown in [11] that even using the conventional grid-search-based scheme to find a near-ML solution is NP-hard. As a workaround, alternative sub-optimal estimation methods of low complexity have been proposed for practical settings. Notably, the authors in [44] propose a two-step approach that first estimates the CIR using the least squares (LS) criterion, and then extracts the underlying channel parameters. In particular, given the knowledge of the spreading codes and information bits for all the users in the training dataset, the overall CIR g may be directly estimated by invoking the LS estimation procedure, i.e., by minimizing the squared residual between the training observations and the signal model over g. Relying on \hat{g}, the authors in [44] propose an ad-hoc algorithm to estimate the channel parameters. Specifically, the explicit parameters to be estimated include the delays \bar{\tau}_{k,l} = \tau_{k,l} + \tau_k, the amplitudes a_{k,l} = A_k |\alpha_{k,l}|, and the phases \phi_{k,l} = \arg(A_k \alpha_{k,l}) for k = 0, \ldots, K-1 and l = 0, \ldots, L-1. We provide an overview of this ad-hoc parameter estimation procedure in Appendix F for completeness.
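The LS step can be sketched as below, assuming the training observations and the known code/bit regressors have already been stacked into `Y` and `B`. The shapes and names here are hypothetical; the exact construction of the regression matrix is given in (9)-(11) of [44].

```python
import numpy as np

def ls_cir_estimate(Y, B):
    """Least-squares estimate of the stacked effective CIR vector g:
    g_hat = argmin_g ||Y - B @ g||^2, where B is the regression matrix
    built from the known spreading codes and training bits."""
    g_hat, *_ = np.linalg.lstsq(B, Y, rcond=None)
    return g_hat
```

With noiseless, well-conditioned training data this recovers g exactly; in practice the quality of `g_hat` degrades with the code mismatch and the limited number N_T of training samples, which is precisely the source of the model mismatch HyPhyLearn is designed to absorb.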

C. HYPHYLEARN for multiuser detection
We utilize HYPHYLEARN to solve the problem of multiuser detection described in Section V-A as a 2^K-ary classification problem. In particular, due to the available statistical parametric models for each class on one hand, and the lack of a channel parameter estimation procedure that is optimal in some sense on the other hand, multiuser detection can be framed within the problem formulation setting described in Section II. We use the available data corresponding to the users in the suboptimal estimation method described in Section V-B to obtain the estimates of the channel parameters for the K users: \hat{\tau} = [\hat{\tau}_{0,0}, \ldots, \hat{\tau}_{0,L-1}, \ldots, \hat{\tau}_{K-1,L-1}], \hat{a} = [\hat{a}_{0,0}, \ldots, \hat{a}_{0,L-1}, \ldots, \hat{a}_{K-1,L-1}], and \hat{\Phi} = [\hat{\phi}_{0,0}, \ldots, \hat{\phi}_{0,L-1}, \ldots, \hat{\phi}_{K-1,L-1}]. Using these estimates, along with the imperfect knowledge of spreading codes for the training data, one can utilize the parametric model (62) to generate a synthetic data example corresponding to a sampled vector of information bits b. This synthetic data sample is subsequently added to the synthetic dataset along with its corresponding label b. Then, the synthetic dataset is combined with the available training dataset according to Step 4 of Algorithm 1 to find the learning-based classifier. Specifically, this classifier has 2^K output neurons, each corresponding to a specific information-bit vector, which enables it to serve as a data detection method for the K-user system.
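The synthetic-data generation step can be sketched as follows, assuming a simplified memoryless model y = Ĝ(Â ∘ b) + n with the estimated effective signatures stacked as columns of `G_hat`. The function name, the shapes, and the integer label encoding are our own illustrative choices, not the notation of (62).

```python
import numpy as np

def generate_synthetic_dataset(G_hat, A_hat, sigma, n_samples, rng=None):
    """Draw random bit vectors b in {0,1}^K and push them through the
    estimated model y = G_hat @ (A_hat * b) + n to obtain a labeled
    synthetic dataset; the class label is the integer encoding of b."""
    if rng is None:
        rng = np.random.default_rng(0)
    m, K = G_hat.shape
    B = rng.integers(0, 2, size=(n_samples, K))
    noise = sigma / np.sqrt(2.0) * (rng.standard_normal((n_samples, m))
                                    + 1j * rng.standard_normal((n_samples, m)))
    Y = (B * A_hat) @ G_hat.T + noise
    labels = B @ (1 << np.arange(K))    # class index in {0, ..., 2^K - 1}
    return Y, labels
```

Each row of `Y` is one synthetic observation and each entry of `labels` indexes one of the 2^K output neurons of the downstream classifier.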

VI. NUMERICAL RESULTS
In this section, we numerically evaluate the performance of our proposed solution, HYPHYLEARN, summarized in Algorithm 1, for the two case studies described in Sections IV and V. This involves comparing the resulting performance against that of existing statistical classifiers and other hybrid classification methods, and highlighting the superiority of our proposed solution for the problems under study.

A. Spoofing detection problem
In the Alice-Eve-Bob setting, we begin with a scenario where the coherence times of the Alice-Bob and the Eve-Bob channels are very large, and therefore the corresponding channel parameters are fixed between the training and testing stages. As mentioned in Section IV-B, the training data in this problem are collected by having Bob observe a finite number of snapshots. The training CFRs from each snapshot are subsequently labeled using the heuristic test (32). The number of transmit antennas at Alice and the number of receive antennas at Bob are both set to 2. Also, following the discussion in [7], we assume Eve uses the same number of antennas to impersonate Alice. The number of subcarriers is set to N_f = 20, which makes the total number of samples associated with each CFR equal to M = 80. We assume the Alice-Bob parameters are σ²_A = 20, α²_A = 200, β_A = 0.02, and a_A = 0.85, while σ²_E = 26, α²_E = 250, β_E = 0.08, and a_E = 0.65 are used for the Eve-Bob channel. Furthermore, we set L_A = 20 and L_E = 16 as the number of diffuse-spectrum virtual paths, while the number of specular paths is set to 4 for both channels in accordance with the experimental measurements reported in [8]. Fig. 3 illustrates the spoofing detection performance of different methods for the above scenario, averaged over 10^5 CFRs from each of the Alice-Bob and Eve-Bob channels at the test stage, where the x-axis denotes the number of snapshots observed during the training stage. In particular, we have evaluated the performance of HYPHYLEARN for this problem, as described in Section IV-D, and compared it with other classifiers designed based on the likelihood ratio test with plug-in estimates or existing ML algorithms. By looking at the resulting spoofing detection accuracy, it can be seen that the performance of the ML algorithms based on the support vector machine (SVM) and the Gaussian mixture model (GMM) is limited in this case due to limited (and mislabeled) training data.
We note that the GMM is used as a classifier here by assigning labels to the clusters using the available labels corresponding to the reference CFRs. Specifically, we have used the radial basis function kernel [37] for the SVM and two components for the GMM in these simulations. Furthermore, one can see that the LRT method obtained in Section IV-B can improve upon the performance of these ML algorithms by plugging the estimated parameters, as in Section IV-C, into the statistical parametric models. In these experiments, we also use the shrinkage method [47], which improves the covariance matrix estimation for each likelihood function. A performance gain can be observed for this approach in comparison to the no-shrinkage case, assuming the shrinkage parameter α is clairvoyantly chosen to maximize the spoofing detection accuracy over the test dataset. This method is labeled as 'LRT (best shrinkage)' in Fig. 3. However, in practice the parameter α has to be estimated from the training data, which-as shown in the figure with label 'LRT (shrinkage)'-could deteriorate the LRT performance as the available data include mislabeled samples.
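The shrinkage estimator of [47] linearly combines the sample covariance with a structured target. A minimal sketch is given below, assuming a scaled-identity target, which is one common choice; the actual target used in [47] may differ.

```python
import numpy as np

def shrink_covariance(X, alpha):
    """Linear shrinkage of the sample covariance toward a scaled-identity
    target: (1 - alpha) * S + alpha * (tr(S)/M) * I, with X holding one
    observation per row and alpha in [0, 1]."""
    S = np.cov(X, rowvar=False)
    M = S.shape[0]
    target = (np.trace(S) / M) * np.eye(M)
    return (1.0 - alpha) * S + alpha * target
```

For alpha = 0 this reduces to the plain sample covariance, and for alpha = 1 to the (well-conditioned) scaled identity; the 'best shrinkage' curve corresponds to sweeping alpha and keeping the value that maximizes test accuracy.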
Furthermore, we evaluate the performance for this problem of an existing hybrid classification approach known as fine-tuning [2], [24] in the deep transfer learning (DTL) literature. In this method, we first generate 5 × 10^5 synthetic data samples using the available parametric likelihood functions with plugged-in estimates. Then, a neural network with 3 hidden layers of 400 neurons each is trained to classify the synthetic data. The training data are afterwards used to refine the weights of this neural network. Notably, HYPHYLEARN is shown to outperform the aforementioned existing classification methods by relying on both the available and synthetic data and jointly using them in a learning-based classifier.
For the sake of comparison, we have also considered a variation of HYPHYLEARN that relies on a generative adversarial network (GAN) for generating synthetic data, i.e., it disregards the available physics-based models. We have observed that the performance of this approach suffers in the limited data regime, as GANs rely merely on the available training data for generating further synthetic data of a similar distribution. In fact, for this example, we have verified that the GAN-based variant of HYPHYLEARN needs to be trained on 20000 data samples in order to achieve the same level of spoofing detection accuracy as the physics-based HYPHYLEARN with 4000 samples. Regarding the specifics of the GAN, we have used a DNN with two hidden layers of 200 neurons each as the generator, and a DNN with three hidden layers of 300 neurons each as the discriminator. In our implementation of HYPHYLEARN, the number of generated synthetic data samples is set to 4 × 10^5. We have also used NNs with 3 hidden layers of 400 neurons each for M_ψ and h_φ1, while a NN with one hidden layer of 40 neurons is used for d_ζ. For all hidden layers, the ReLU activation function is used. Furthermore, the Adam optimizer [37] with a learning rate of 0.0001 is used for training in this example. We also note that the optimal Bayes decision rule, which relies on knowledge of the true parameters, results in a spoofing detection accuracy of 0.996.
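The network sizes quoted above can be sketched with plain NumPy ReLU MLPs. The 64-dimensional common feature space and the weight initialization below are our own assumptions made purely to illustrate how M_ψ, h_φ1, and d_ζ fit together; the actual training uses the domain-adversarial objective and Adam, which are omitted here.

```python
import numpy as np

def mlp_forward(x, layers):
    """Forward pass through a ReLU MLP given a list of (W, b) pairs;
    the final layer is left linear."""
    for W, b in layers[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = layers[-1]
    return x @ W + b

def init_mlp(sizes, rng):
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)
# feature extractor M_psi: input dim 80 (= M), 3 hidden layers of 400
M_psi = init_mlp([80, 400, 400, 400, 64], rng)
# label predictor h_phi1: 3 hidden layers of 400, 2 outputs (Alice/Eve)
h_phi1 = init_mlp([64, 400, 400, 400, 2], rng)
# domain discriminator d_zeta: one hidden layer of 40 neurons
d_zeta = init_mlp([64, 40, 1], rng)

feat = mlp_forward(rng.standard_normal((5, 80)), M_psi)
```

Both the real and synthetic CFRs pass through M_ψ into the shared feature space; h_φ1 classifies the features while d_ζ tries to tell the two domains apart during adversarial training.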
Next, we consider a more realistic scenario where the channels' variations cause the training and test stages to not fall within the same coherence time. In this case, Bob uses the heuristic test (32) for some time, as it does not have access to the channel parameters in this period. Afterwards, it uses the data collected in the previous coherence times to estimate the channel parameters for the current one. Fig. 4 depicts this setting, where the training stage consists of n_c coherence times corresponding to the Alice-Bob channel. Furthermore, in contrast to Alice, Eve's transmissions are assumed to be intermittent due to the uncertainty associated with Eve's behaviour. During each coherence time corresponding to the Alice-Bob channel, it is assumed that Bob collects 100 training data samples. Then, the estimation technique described in Section IV-C is utilized to estimate the channel parameters within each coherence time. Fig. 5 demonstrates the system performance as a function of the number of coherence times in the training stage. Regarding the physical setup, we have used the same system parameters as those in Fig. 3, and assumed that the coherence time of the Alice-Bob channel is 4 times that of the Eve-Bob channel for illustrative purposes. For the DTL fine-tuning approach and HYPHYLEARN, the number of synthetic data samples generated for each behavior in a coherence time is set to 20000. For these two learning-based approaches, the training specifications are chosen to be the same as the ones used in Fig. 3. The performance comparison again highlights the superiority of HYPHYLEARN over the existing statistical and data-driven methods.

B. Multi-user detection problem
In this section, we present results of numerical simulations to investigate the effectiveness of HYPHYLEARN, described in Section V-C, for the MUD problem. We choose the simulation parameters based on the setting described in [44] and consider a system with a processing gain of N = 32, where the number of users is either K = 3 or K = 5. Gold codes of length 32 are used by the BS as the pseudo-noise codes in (58), and the users' amplitudes (A_k's) are set to 2. In addition, a chip interval of length T_c = 0.001 and a sampling rate of 2/T_c are employed. A near-far ratio (NFR) of 10 dB is assumed, which means the users' amplitudes are randomly unbalanced around the nominal value of 2 within ±5 dB. The fading channel between the users and the BS consists of 3 paths, which makes the total number of unknown parameters in Section V-B equal to 9K. We further consider a setting where the BS might not have access to perfect knowledge of the pseudo-noise sequences for all the users at the time of detection, which would lead to a mismatched situation. To account for this phenomenon, we introduce a parameter ρ in order to quantify the averaged error in the pseudo-noise sequences at the BS while decoding.
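One way to realize the mismatch parameter ρ and the 10 dB NFR in simulation is sketched below. Both the independent chip-flip model for the BS's imperfect code copies and the uniform-in-dB amplitude draw are our own assumptions, not necessarily the exact mechanism used for the figures.

```python
import numpy as np

def corrupt_pn(code, rho, rng):
    """BS's imperfect copy of a +/-1 PN code: each chip is flipped
    independently with probability rho, so rho is the average
    chip-error rate of the mismatched sequence."""
    flips = rng.random(len(code)) < rho
    return np.where(flips, -code, code)

def draw_amplitudes(K, nominal=2.0, spread_db=5.0, rng=None):
    """Near-far effect: amplitudes unbalanced around the nominal value
    within +/- spread_db dB (a 10 dB near-far ratio)."""
    if rng is None:
        rng = np.random.default_rng(0)
    return nominal * 10.0 ** (rng.uniform(-spread_db, spread_db, K) / 20.0)
```

With ρ = 0 the BS codes coincide with the true ones (the 'perfect MMSE' case), while ρ = 0.2 or 0.25 reproduces the mismatch levels studied in Fig. 6.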
As the performance metric, we consider the bit error rate (BER) at the BS while decoding the users' information bits, which is of major interest in digital communication systems. As the MUD algorithm, we employ the minimum mean square error (MMSE) decoder introduced in [48], which is shown to outperform other existing detection methods, including the matched filter receiver and the box-constrained maximum likelihood detector [44]. As mentioned in Section V, MUD can also be solved by a classifier aiming to distinguish between 2^K different classes, each representing a unique decoded sequence of information bits. In this case, BER is directly related to the classification accuracy of the trained classifier. For the asynchronous system discussed in Section V, the interval I_{2p} = [pT_b, (p+2)T_b] contains most of the energy content of the information symbol b_k(p). Therefore, it is sufficient for the MUD detector to process the data in the interval I_{2p} in order to obtain estimates of the symbols b_k(p) for all k = 0, \ldots, K-1.
We present simulation results for the performance of the MMSE detector in the above setting in Fig. 6, and compare it with our proposed approach of Section V-C. Specifically, the parameter estimation procedure for HYPHYLEARN is carried out under two different levels of model mismatch, i.e., ρ = 0.2 and ρ = 0.25. Furthermore, the number of training data samples available from each user, N_T, is set to 40. As a general observation, Fig. 6 demonstrates that the performance of all the detectors deteriorates as the number of users and the value of ρ increase. The perfect MMSE refers to the case where the true pseudo-noise sequences are assumed to be known as part of the implementation of the decoder. In particular, the huge performance gap between the perfect MMSE and the MMSE decoder indicates the high sensitivity of the MMSE detector to the mismatch. On the other hand, it is also highlighted that our proposed approach can achieve a substantial gain over a wide range of SNRs by dealing with the mismatch problem. For HYPHYLEARN, the number of generated synthetic data samples is set to 10^6 for this example. We have also used NNs with 4 hidden layers of 300 neurons each for M_ψ and h_φ1 here. Also, a shallow NN with one hidden layer of 40 neurons is used for d_ζ, while the ReLU activation function is used for all the hidden layers. During training, the Adam optimizer with a learning rate of 0.0001 is utilized as the stochastic gradient descent algorithm.
In Fig. 7, the BER performance of the multi-user detectors is investigated as a function of the number of available training data samples. For this example, the SNR at the BS is assumed to be fixed at 8 dB. It is demonstrated that increasing the number of data samples does not lead to substantial performance improvements in the case of the MMSE method. This is attributed to the aforementioned mismatch phenomenon in the pseudo-noise sequences, which prevents the MMSE detector from benefiting considerably from the larger amount of data. It is further shown that the performance gap between HYPHYLEARN and the perfect MMSE shrinks as the number of data samples increases. However, the rate at which this gap shrinks is higher for the case of ρ = 0.1 in comparison to that of ρ = 0.25. Indeed, HYPHYLEARN benefits more from the data at lower levels of mismatch, where the parameter estimates enjoy higher levels of accuracy.

VII. CONCLUSIONS
We have considered the problem of hypothesis testing in the context of parametric classification, where there is a known model for each hypothesis but the corresponding parameters are unknown. Towards designing a classifier in this setting, we have taken into account several practical considerations, including the assumptions that the available training data are limited and that there could be labeling errors associated with them. Furthermore, the model under each hypothesis is assumed to be complex enough that the MLEs of its parameters are computationally intractable. In this vein, we have proposed to use sub-optimal parameter estimation algorithms for this purpose and to generate synthetic data leveraging the model knowledge. Then, we have utilized the domain-adversarial framework for learning a classifier using these synthetic data and the empirical training data. We have shown the applicability of our proposed approach in two tangible communication scenarios, namely the spoofing detection and multiuser detection problems, where rich and complex models are available for the real data. Finally, we have shown through numerical results the superiority of our proposed approach in designing a classifier under the aforementioned practical limitations relative to existing statistical and machine learning methods.