False Discovery Rate (FDR) and Familywise Error Rate (FER) Rules for Model Selection in Signal Processing Applications

Model selection is an omnipresent problem in signal processing applications. The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are the most commonly used solutions to this problem. These criteria have been found to have satisfactory performance in many cases and had a dominant role in the model selection literature since their introduction several decades ago, despite numerous attempts to dethrone them. Model selection can be viewed as a multiple hypothesis testing problem. This simple observation makes it possible to use for model selection a number of powerful hypothesis testing procedures that control the false discovery rate (FDR) or the familywise error rate (FER). This is precisely what we do in this paper in which we follow the lead of the proposers of the said procedures and introduce two general rules for model selection based on FDR and FER, respectively. We show in a numerical performance study that the FDR and FER rules are serious competitors of AIC and BIC with significant performance gains in more demanding cases, essentially at the same computational effort.


I. INTRODUCTION
Model selection is an essential problem in many signal processing applications [1], [2], [3]. Examples of such applications include selecting the order of an autoregressive predictor, the number of source signals impinging on an array of sensors, the order of a polynomial trend, the number of components of a nuclear magnetic resonance signal, the dimension of a linear regression model, the length and paths of the impulse response of a multi-path communication channel, the number of components of a sinusoidal signal, and the rank of the solution of a matrix approximation problem (the last four applications form the nucleus of the numerical example section, and will be described in detail there).
A successful class of rules for model selection is based on penalizing the model complexity (basically its number of parameters, let us say k) by adding a penalty term to the negative log-likelihood [1], [3]: where y is the data vector andθ k is the maximum likelihood estimate for the model with k parameters. Different values of c are obtained from different types of statistical or information theory considerations. For example, c = 2 for the Akaike information criterion (AIC) [4], and c = ln N for the Bayesian information criterion (BIC) [5]. The AIC is minimax optimal in a predictive sense, and the BIC is consistent in a model selection sense, but as alluded by a discussion in [6] (see also [7]) neither is optimal (in the sense of maximizing the probability of selection) and therefore, in principle, they both can be outperformed by other rules.
The main goal of the present paper is to propose model selection rules that have the well-established penalized likelihood form in (1), like AIC and BIC, but which perform better than both BIC and AIC especially in demanding cases (such as for high-dimensional models). The derivation of the new rules is based on formulating the model selection as a multi-hypothesis testing problem and using FDR and FER procedures to solve it [8], [9], [10].

Respectively. M is the Total (Known) Number of Null Hypotheses andM Denotes the (Unknown) Number of True Null Hypotheses. An IR Decision is a False Discovery/Alarm, IA is a Miss, and the Other Two are Correct Detections
Multiple hypothesis testing based on FDR and FER has found numerous applications in many fields from genomics and biomedicine to financial economics. However the use of FDR or FER for model selection in signal processing applications is almost nonexistent. We begin this paper with a short introduction of the basic FDR and FER principles [8], [11]. Then we explain how FDR and FER can be used for selecting the structure (or order) of both linear and nonlinear models with both sparse and dense parameter vectors. We end-up the paper with a numerical performance study of the FDR and FER rules, as well as a comparison with the classical rules of AIC (Akaike Information Criterion) [2], [4] and BIC (Bayesian Information Criterion) [2], [5] in the four signal processing applications mentioned above: linear regression, communication channel estimation, sinusoidal parameter estimation, and low-rank matrix approximation.
The following sections provide details, including further references, on the technical aspects of the plan laid down above. The main contribution of this paper is the proposal of FDR and FER rules based on general penalized-likelihood criteria similar to AIC and BIC and therefore directly usable in the many signal processing applications that need model selection. The main pragmatical finding of the paper is that the FDR and FER rules perform much better than AIC and BIC in applications in which the models under consideration have a wide range of possible orders, and not worse in other cases in which the classical rules are known to be consistent (BIC) or possess an optimal minimax prediction property (AIC). Given the good performance of the FDR and FER criteria and the fact that they have the same well-established penalized-likelihood form as AIC and BIC, it is our opinion that the former can be considered to be serious competitors of the latter.

II. FALSE DISCOVERY RATE (FDR)
Consider a set of M null hypotheses : H 0 1 , . . . , H 0 M . Let T k > 0 be the test statistic for H 0 k , and let p k denote the false discovery probability: (p k is also called the probability of false alarm or false positive, or simply the significance level). Table 1 summarizes the four possible outcomes when testing one of the above null hypotheses (denoted H 0 in the table).
The FDR is defined as the expected proportion of IR out of the total R: (FDR = 0 if R = 0). In many practical applications, for instance in genomics, it is important to keep the ratio IR/R under a pre-specified level, to avoid wasting lab time on investigating an unnecessarily large number of "false discoveries". Intuitively, keeping this ratio small is also important for model selection to prevent selecting unduly complex models too often. We will explain the way in which FDR can be used for model selection in the next sections. In the rest of this section we discuss a procedure for controlling the FDR in the sense that [8]: where α is a pre-specified level. Following the recommendation in [9], [10], and to simplify the exposition, we will use α = 0.01 in what follows (but note that the choice of α in (4) is an interesting aspect, see e.g [10]). Assume that the distribution of T k under H 0 k is known. Also, assume that the hypotheses H 0 k (k = 1, . . . , M ) have been ordered [8], [9], [10], [12] so that Let, where η M is the so-called harmonic number (7) Also let q p k be the following quantile of the distribution of T k : Finally, letk Then reject the hypotheses H 0 1 , . . . , H 0 k and accept H 0 k+1 , . . . , H 0 M , or reject no hypothesis if there is no k that satisfies the inequality in (9). The above procedure ensures that (4) is satisfied under general conditions on the statistics {T k } (see, e.g., [13]). If the test statistics can be assumed to be independent or positively correlated then the following larger significance levels, can be shown to ensure the control of FDR as in (4) [8], [13]. A compact self-contained proof of this result for independent statistics is presented in the Appendix. Note that if we used (6) in the case of independent statistics {T k } then FDR would be controlled at level α η m .
Remark 1: Equivalentlyk can be defined as the maximum value of k for which the p-value corresponding to the observed T k is less than the significance level p k . In this paper we prefer to directly compare {T k } to the quantiles {q k } as in (9).

III. MODEL SELECTION USING FDR
We will separately consider two types of models that are frequently encountered in signal processing applications: linear-in-the parameters models and nonlinear-in-the parameters models.

A. LINEAR MODELS
Let {x 1 , . . . , x M } be a set of N × 1 known vectors, and let the observed data vector y ∈ R N be given by: where an unknown number of the coefficients {c j } are different from zero, and the noise e is normally distributed with zero mean and covariance matrix σ 2 I : e ∼ N (0, σ 2 I). The problem is to decide which of the M possible regressors {x j } in (11) have really contributed to the data vector y, or equivalently which of the parameters {c j } are zero and which are different from zero. The maximum likelihood estimates of the unknown parameters in (11) are given by (see, e.g., chap 4 in [1] where (assuming that the inverse in (12) exists). Moreover, the covariance matrix of the estimation errors inθ has the following simple expression: It is well known (see, e.g., [1], [7]) that the statistics : (16) are asymptotically chi-square distributed with one degree of freedom: To use the above {T k } as test statistics in the procedure of Section II we must order them so that (see (5)): The thresholds {q p k } in (8) can be obtained from a table/calculator of the χ 2 (1) distribution, and then we can use (9) to find the regressors that should be included in the model. Remark 2: Alternatively we can get the quantiles {q p k }, for given {p k }, from a table (or calculator) of the standard normal distribution. Indeed t ∼ χ 2 (1) if t = z 2 with z ∼ N (0, 1). Consequently, is the cumulative distribution function of N (0, 1). Hence, we can get γ p that satisfies from a table of N (0, 1) and use q p = γ 2 p in (9). We can relate the above FDR rule to the classical AIC and BIC rules. To do so we consider the following hypotheses: H k : the model (11) contains exactly k regressors {x 1 , . . . , x k } for k = 0, . . . , M (with H 0 : y = e). Note that, because the regressors are ordered according to (18), the hypotheses H k are nested (i.e., under H k the regressors are x 1 , . . . , x k , the first (k − 1) of which are the regressors under H k−1 ). Also note that the ordering of the regressors is a departure from the traditional way in which AIC and BIC are applied to (11 Under H k the negative log-likelihood function of y in (11) is given by : where The above function is minimized by (compare with (12) and (13)):θ and the minimum value is (to within a constant) : The likelihood ratio is defined as follows: Inserting (26) in (27) yields the following simple expression for T k : We have used the same notation for the right-hand sides of (16) and (28) because these two quantities are asymptotically equivalent (see Remark 3 below). Consequently, under H k−1 the T k in (28), similarly to (16), also has an asymptotic chisquare distribution with one degree of freedom. In fact it is well known that the likelihood ratio has an asymptotic χ 2 distribution, viz.
not only for the linear model in (11) but under much more general conditions (see, e.g., [14] and chapter 11 in [1]). We will make use of this result in the next sub-section. Remark 3: It is well known and easy to show that if σ 2 is given and the regressors are orthogonal then the equivalence between (16) and (28) holds in finite samples. This equivalence also holds even if the regressors are not orthogonal but the proof is a bit more complicated. First, observe that: where and we used the fact that Next note that the numerator in (30) can be rewritten as: The matrix between curly brackets in (34) is the orthogonal projector onto the null space of the (k − 1) × k matrix it follows that the said null space is spanned by the vector R −1/2 k u. Consequently (34), and hence (30), can be written as: which coincides with (16) (for known σ 2 ).
To proceed with the discussion on the connection between FDR and penalized likelihood criteria (such as AIC and BIC), observe that the inequality T k ≥ q p k (see (9) in the description of FDR) is equivalent to (see (27)): which in turn holds if and only if The conclusion is that evaluating (38) for k = 1, 2, . . . and selecting thek at which the last minimum of the criterion occurs is equivalent to the FDR procedure based on the likelihood ratio (and also asymptotically equivalent to the FDR based on the statistics in (16)). To relate (38) to AIC and BIC (see e.g. [2], [4] and [5] for details on these two rules) it only remains to observe that for q p = 2, we obtain and for q p = ln N, we get As we will show later, the above AIC and BIC rules applied to (11) after ordering the regressors (which can be viewed as modifications of the classical rules using FDR ideas, see e.g. [10] and references therein) perform reasonably well whenever their basic assumptions hold. The main advantage of this modified way of using AIC and BIC is computational. The traditional use of AIC and BIC requires testing all 2 M possible combinations of zero and non-zero coefficients in (11), which quickly would become prohibitive as M increases.

B. NONLINEAR MODELS
A common type of model in signal processing applications has the same basic form as (11) but with the essential difference that the "regressors" {x j } depend on unknown parameters.
This means that, under H k , the model is given by : where both {c j } and {b j } (as well as σ 2 ) are unknown. Therefore, in the present case the parameter vector of the "signal" term in (41) comprises both the linear parameters {c j } and the nonlinear ones {b j }: The likelihood ratio property in (29) is valid in this general case as well, and thus we can use FDR either in the hypothesis testing form, (9), or in the model selection criterion form (38), to obtain an estimatek of the integer-valued parameter of (41). An important difference from the procedure described in the previous section is that in the present case we do not need to do anything to order the test statistics {T k } as in (18). Indeed, for (41) we can expect that an analog ranking will hold in most cases without any intervention by the user. This means that in the case of (41) both AIC and BIC as well as FDR can be used without any concern about ordering the test statistics. To explain why this is so, note that for (41) we estimate not only the coefficients {c j } by minimizing the fitting criterion but also the parameters {b j } that define the regressors. Therefore, for example for k = 1, we determine both the optimal c 1 and the optimal regressor x 1 . The greedy procedure described in the previous section chooses (for k = 1) the regressor for which T 1 in (16) is larger than T 2 , T 3 , etc. However, this is nothing but a computationally convenient surrogate for the complete procedure which would fit the regressors one by one to y in order to find the best regressor (out of all possible regressors) for the one-regressor (k = 1) model. An important aspect concerning the application of FDR (or FER, see the next section) to nonlinear models described by (41) is the choice of M: how many null hypotheses {H 0 k } do we (implicitly) test when taking a decision about the possible inclusion in (41) of an additional regressor? In the real-valued parameter case, which we consider in this paper, the set of possible regressors spanned by the nonlinear parameters in (41) is a manifold, and hence M is theoretically infinite. However for any given regressor vector there are many regressors in its vicinity that are only infinitesimally different from it and thus can be neglected. Consequently, from a practical standpoint, we can "sample" the said manifold using a finite number M of regressors that cover it well. The selection of these regressors, and therefore of M, is a problem whose solution is application dependent (see Section IV for details and examples).

IV. MODEL SELECTION USING FER
Consider, once again, the hypotheses {H 0 k } M k=1 , which have been ordered according to (5). The so-called familywise error rate is defined as (see Table 1): Bonferroni rule is the oldest and simplest procedure for controlling the FER. It uses the following significance levels: and it guarantees that Because the significance levels in (44) do not depend on k, the Bonferroni rule does not require the hypotheses to be ordered. However, it turns out that in many cases these significance levels are "too small" and hence the Bonferroni rule is too conservative (i.e., it has a high probability of miss). A more powerful rule, that controls the FER as in (45), uses the following sequence of significance levels [11]: A simple proof of the fact that the inequality in (45) Then it must hold that: and It follows that the total false alarm probability of the rule based on (46) satisfies: where the first inequality follows from Boole's union bound and the second from (50) (the equality follows from the definition of the quantiles). With this observation, the proof of (45) is concluded.
Note that, similarly to the control of FDR using (6), the control of FER as in (45)  We can directly use (46) to obtain a model selection criterion analog to the FDR criterion in (28), which we will call FER for short: (like for FDR, we choose α = 0.01 in (46)).
Next we remark on a difference between FER and FDR, which is quite relevant to model selection. Let C be defined as in (47) We can write P FA as Because for a sound selection procedure P D is close to one, it follows that in such a case Consequently α can be expected to be a tight upper bound on the P FA of a sound rule that controls FER at level α.
Interestingly such a rule also controls the FDR at level α because FDR≤ FER: FDR = E(IR/R) ≤ E(I(IR ≥ 1)) = prob(IR ≥ 1) = FER (55) On the other hand, the converse is not necessarily true: a procedure that controls FDR at level α may have a substantially larger FER (and hence P FA ) than α. Intuitively this is so since, out of the total number of R rejections in a data realization, an FDR control at level α allows on the average IR = αR of them to be incorrect; and αR incorrect rejections per data realization yield a P FA significantly larger than α. We can use Markov's inequality to lend some support to the above intuitive argument: Note that we cannot use (56) to upper bound FER by FDR≤ α because FER corresponds to λ → 0 in (56), in which case the bound in (56) becomes infinite and hence useless. However, we can heuristically make use of (56) in the following manner. For a sound procedure with P D close to 1, R can be expected to be near k 0 (a bit smaller if there are false negatives, or a bit larger when there are false positives). Using the approximation R = k 0 in (56) yields the following approximate upper bound on FER in terms of FDR: Seemingly, therefore, the P FA of an FDR controlling procedure can be significantly larger than that of a procedure that controls the FER at the same level. However, while we have empirically observed higher P FA values for the FDR rule (especially in cases with many false null hypotheses, i.e. k 0 >> 1), the increase was not as drastic as suggested by (57). A possible explanation is that the control of FDR may often be loose and therefore the achieved FDR might be considerably smaller than the bound (M − k 0 )α/M in (4), which decreases as k 0 increases; also the procedure based on (6) in fact may control FDR at a level much smaller than α, for instance at level α/η m whenever the regressors are (nearly) orthogonal. Both FDR and FER rules can be used as in (9) to select the null hypotheses to be rejected. Alternatively, they can select these hypotheses using: The rules based on (9) are called "stepup rules " because they proceed forward to test T k for k = 1, 2, . . . , up to the largest k for which T k ≥ q p k . Different from this, the rules based on (58) can be implemented in a backward fashion to test T k for k = M, M − 1, . . . , up to the smallest k for which T k < q p k and so they are called "stepdown rules". With reference to the associated model selection criterion C k , the stepdown and stepup rules correspond to picking up the first (k) and, respectively, the last (k) minimum point of the criterion. Clearlyk ≥k and thus the stepup rules have a larger probability of false alarm (and, correspondingly, a smaller probability of miss) than the stepdown rules using the same sequence of significance levels. However, significant differences between the stepup and stepdown rules usually occur only for small values of N or high noise levels, a regime in which they have rather small P D and hence are not really useful anyway. As N increases (or the noise power decreases), the criterion C k tends to be unimodal and therefore the two types of rules yield same estimatek =k. To avoid choosing between stepup and stepdown rules the user may decide to pick up the global minimum of C k , which is what we will do in the numerical experiments of the next section. It is our experience that the global minimum of C k coincides with the better one ofk andk in most realizations.
To illustrate the differences between the significance levels, quantiles and the corresponding penalties used by FDR, FER, AIC and BIC we consider an example with M = 30 and N = 1000 (α = 0.01). From the results presented in Fig. 1 we can see that the FDR and FER penalties are larger than those of BIC and AIC.

V. SIGNAL PROCESSING APPLICATIONS AND NUMERICAL ILLUSTRATIONS
We will present four signal processing applications of the FDR and FER rules, and compare their performance with that of AIC and BIC.

A. LINEAR REGRESSION
We generate the data vector y using (11) with k 0 = 10 (the true number of regressors), an X matrix with elements independently drawn from N (0, 1), M = 100, {c j } 5 j=1 = 5, {c j } 8 j=6 = 3 and {c j } 10 j=9 = 1; the indexes of the 10 non-zero coefficients are independently drawn from U [1, 100] and then fixed, and the noise variance is set to 1. The detection probability (which is the probability of selecting the 10 correct regressors) is shown as a function of N in Fig. 2(a). It can be seen from the figure that FDR and FER perform similar to each other and their performance is much better than that of AIC or BIC. The false alarm probability vs N is shown in Fig. 2(b). The false alarm probabilities of FDR and FER are quite low for all values of N; on the other hand, the false alarm probabilities of AIC and BIC are rather large and they tend to increase with N (the reason behind this behavior will be explained in the discussion at the end of the second example). In Fig. 2(c), the average model order selected by each method is shown for different values of N. It can be seen that as N increases both FDR and FER choose the correct model order. AIC, on the other hand, significantly overestimates the model order, and to some extent BIC does the same as well.

B. COMMUNICATION CHANNEL
We consider a finite-memory channel with input denoted {u(k)} N k=1 . Then the output of the channel can be written as in (11) We generate {u(k)} as a white normal sequence with zero mean and unit variance. The noise variance is σ 2 = 1. Different from the linear regression example, here we consider a case in which the sparsity of the parameter vector is more pronounced (i.e. we assume a multi-path channel): k 0 = 5, M = 100, c j = 2 for j = 1, 10, 20, 40 and 60 and N is varying in [100, 300]. The detection probability, false alarm probability, and average model order are displayed versus N in Fig. 3. Similar to the linear regression case, the detection performances of FDR and FER are similar to each other and significantly better than those of AIC and BIC. Also both FDR and FER choose the correct model order as N increases. An explanation of the poor performance of AIC and BIC in the two examples above, in particular their high probability of false alarm/discovery, runs as follows. Let us assume that the k 0 true regressors have been selected by the rule. The next step is to decide if the number of regressors should be increased to k 0 + 1. Ideally, making this decision would require comparing the likelihoods of the model with order k 0 and all models with order k 0 + 1. There are M − k 0 models with k 0 + 1 regressors that we should compare with, and it is clear that if the comparison were done with each of them then the false alarm probability, which is p for one test (see Fig. 1), would significantly increase beyond p. While we make only one comparison, we compare with the model which includes the (k 0 + 1)−th regressor selected in the first step of the procedure (according to the ranking in (18)) and that model is likely to produce a larger increase of the likelihood function than any of the other (M − k 0 − 1) models. Consequently the single comparison that we make is basically equivalent to comparing the k 0 −order model with all (M − k 0 ) models of order (k 0 + 1).
The somewhat counter intuitive increase of the probability of false alarm of AIC and BIC as N increases, see the above figures, also begs an explanation. Consider AIC as an example: it will produce a false alarm if C k 0 ≥ C k 0 +1 (k 0 being the true number of regressors). Let us say that for N >> 1 this happens in 100% of cases (i.e. noise realizations), as in Fig. 2. For a much smaller value of N, however, the random fluctuations of the negative-log-likelihood in the AIC criterion are much larger and therefore it can happen that by chance C k 0 ≤ C k 0 +1 in a number of cases, which explains why the probability of false alarm can be smaller for a smaller value of N. Obviously the probability of false alarm of both AIC and BIC can be kept at bay by increasing their penalties. However there are no clear rules for how to do that in the general case: only heuristic suggestions, which lack theoretical motivation, are available.
An interesting aspect, related to the above discussion, concerns the application of selection rules to the nonlinear model in (41) which will be considered in the next two examples. As explained in Section III-B, in the case of this model, it is the parameter estimation method that implicitly selects the "best regressor" to be added to the model when we increase the order by one (k → k + 1). However, in the case of (41), the regressor set is a manifold (spanned by the parameters {b j }) and hence M is theoretically infinite. Consequently, the situation seems to be worse than in the linear model case, unless the rule has a built-in feature that increases the penalty term to prevent a blow-up of the false-alarm probability. This feature does exist: when we increase the order from k to k + 1, the number of parameters (and hence the number of degrees of freedom of the X 2 distribution of the likelihood ratio statistics) increases by 1 + dim(b) instead of just by 1 (assuming, for simplicity, that dim(b j ) does not depend on j). This leads to a proportionate increase of the penalties of AIC and BIC. For FDR and FER the penalties also increase due to the increase of the quantiles that compose the penalty terms of the latter rules. In addition to this (as mentioned above) in the case of nonlinear models the regressor set is a manifold, which can be accurately sampled (or covered) only if M is sufficiently large. A larger value of M leads to smaller significance levels for both FDR and FER, which in turn leads to an increase of the penalty terms of these rules and therefore a reduction of the false-alarm probability.

C. SINUSOIDAL SIGNALS
This is a typical signal processing application of the nonlinear model discussed in Section III-B with where f j is the normalized frequency and φ j is the initial phase of j−th sinusoid. A generic regressor vector of the above form can be re-written as: ⎡ where a = sin(φ) and b = cos(φ). The nonlinear parameter of (61) is the frequency f whose variation in the interval [0,1] generates the regressor set/manifold. It follows from basic results in spectral analysis (see e.g. [15]) that a sampling of f using a grid with step 1/N yields a satisfactory covering of the said manifold. Consequently, in the present case we can use the significance levels of FDR and FER in (6) and (46) with M = N: From the above significance levels we can calculate the corresponding quantiles (and hence the penalties) of the two rules using a χ 2 calculator for a distribution with 3 degrees of freedom (which is the number of unknown parameters in (61):  a, b and f ).
We consider the following specific values of the parameters in this example: k 0 = 3, c 1 = 5, c 2 = 3, c 3 = 2, [10,140], and σ 2 = 1. We obtain approximate maximum likelihood estimates of { f j } k j=1 using the periodogram method (see e.g. [15]) and then use {f j } to estimate {c j , φ j } k j=1 via a simple linear least squares method. The estimate of σ 2 is obtained as in (13).
The probability of correct detection, probability of false alarm and average model order selected by the four rules are shown as a function of N in Fig. 4. In this example BIC worked almost as well as FDR and FER and the performance of AIC also improved significantly compared with what we saw in the previous examples. This was expected as BIC is known to be consistent in the present application, and AIC is known to work reasonably well too.

D. LOW-RANK MATRIX APPROXIMATION
We consider a noisy matrix whose "signal part" has a low rank : where A ∈ R m×n , B ∈ R m×k 0 , D ∈ R n×k 0 , and with k o << min(m, n). The elements of the noise matrix E in (64) are assumed to be i.i.d. normal variables with zero mean and variance denoted σ 2 . The low-rank decomposition in (64) is not unique and, to be more specific, we assume that BCD T is the singular value decomposition (SVD) of the signal part of (64), therefore B and D are semi-unitary matrices and C is positive semi-definite (B T B = D T D = I and {c k ≥ 0}). Furthermore, to simplify the notation and exposition we focus on the case of square matrices: m = n (the general case of m = n can be treated in much the same manner).
We can re-write (64) as in (41) with y = vec(A), E = vec(E), and the following unknown "regressor" vectors (with a Kronecker-product structure): where b k and d k are the k−th columns of B and D. Under H k , the negative log-likelihood function associated with (64) is given by (to within a constant): It is well known that the maximum likelihood estimateŝ B k ,Ĉ k ,D k which minimize the above function with respect to B k , C k , D k can be obtained by truncating the SVD of the matrix A keeping only the k largest singular values and associated singular vectors [16] (note that the SVD provides all estimated models for k = 1, . . . , m). The estimate of σ 2 k is then given by : The likelihood ratio statistics T k have the same expression as in (28) but with N replaced by m 2 : However in the present case T k has an asymptotic chi-square distribution with 2 m degrees of freedom (see (23)): T k ∼ χ 2 (2 m). This means that the quantiles q p k should be computed for χ 2 (2 m), with 2 m being possibly much larger than the number of degrees of freedom encountered in the previous cases. The main difference from the previous applications, however, is the fact that finding a satisfactory sampling of the regressor vector set is a bit more complicated in the present case.
Consider the FDR first. A generic vector b (and similarly for d) lies on the (m − 1)-dimensional surface of the unit sphere in R m . For 1D this "surface" consists of 2 points (+1 and −1). In 2D the said surface is a circle, and we can use 360 vectors to sample it (with 1 deg between adjacent vectors). In 3D we can cover the surface using 360 deg in longitude and 180 deg in latitude, therefore we can use 180 × 360 = 2(180) 2 vectors to sample it. Generalizing to m−dimensions leads to the conclusion that a number of 2(180) m−1 vectors can be used to sample the set of a generic b regressor (and similarly for d). Because all possible combinations between the b and d vectors can occur, the total number of regressor vectors is: Consequently the significance levels of FDR are given by: (71) where we used the approximation in (7) for the harmonic number (which holds well here because (70) is a large number).
Next we consider the FER, for which the only difference from the above discussion is as follows. For b 1 there is no difference as for FER too this vector is only restricted to lie on the surface of the unit sphere in R m , but b 2 must be orthogonal to b 1 (and similarly for d 1 and d 2 ); hence b 2 belongs to a sphere in R m−1 , b 3 to a sphere in R m−2 (because b 3 must be orthogonal to both b 1 and b 2 ), and so forth. Making use of this observation and of (70) we can write the significance levels of FER as: Remark 4: The above sampling of the spherical surface, based on patches of 1 deg in longitude and latitude, is denser close to the poles than around the equator (easy to visualize in 3D). If desired, a uniform patching can be obtained using the expression for the solid angle subtended by the spherical surface in R m : where (·) is the gamma function. The conversion of (73) from steradians to "degrees" (for example, square-degrees for 3D) can be done multiplying (73) by 180 π m−1 . For m = 1, 2 and 3 this gives m = 2, 360, and 4(180) 2 /π . The first two of these numbers are the same as the values given by our simpler formula, 2(180) m−1 , but the third one is smaller (which was expected as our sampling is denser close to the poles). We have tried using (73) in FDR and FER rules in a limited number of cases but failed to notice any significant difference from the performance of the rules based on our simpler (although less uniform) sampling formula.
Regarding the implementation of (71) and (72), note that for large values of m (m > 10), the chi-square calculator in MATLAB can fail to return meaningful quantiles. To circumvent this numerical problem we suggest the following alternative. We first calculate the quantile corresponding to p k from the standard normal distribution and then use the approximation χ 2 (2 m) ≈ N (2 m, 4 m), which holds for sufficiently large m, to obtain the corresponding quantile for the chi-square distribution. To compute the quantiles of the normal distribution (as the normal distribution calculator in MATLAB does not work well either for small p values) we used the fact that ln(p) is well approximated (for sufficiently small p) by the following function of the quantile (q) of the normal distribution [17]: In sum for given p, we first solve the above nonlinear equation using the Fsolve function in MATLAB and obtain q and next use q in the normal approximation of the chi-square distribution to obtain the corresponding chi-square quantile.
In the simulation example of (64) we consider the following specifications: k 0 = 3, m(= n) = 100, c 1 = 5, c 2 = 3, and c 3 = 3. Furthermore B and D are m × 3 semi-unitary matrices obtained from the QR decomposition of random matrices with i.i.d. normal elements. Fig. 5 shows the probability of detection, probability of false alarm, and the average model order selected by the four rules for a range of values of 1/σ 2 .
The detection probabilities of FDR and FER almost coincide with each other and they converge to one much faster than the detection probability of BIC. On the other hand, AIC has a constant detection probability over the range of noise variances considered, which is almost equal to probability of false alarm of this rule.
Remark 5: Asymptotic performance studies of model selection rules typically require that [dim(θ k )] 2 /N → 0 as N → ∞. This condition was satisfied in all three previous examples. However, in the present case the above ratio is larger than 1 even for k = 1: (2 m) 2 /m 2 = 4. Consequently the performance of the rules is not expected to increase with m. That is the reason why in this example we have plotted the performance metrics versus 1/σ 2 .

VI. CONCLUSION
Most models used in signal processing applications contain both real-valued parameters and integer-valued parameters (which determine the structure, in particular the orders, of the model). The estimation of the former is well studied and understood and there are optimal methods, such as the method of the maximum likelihood, which attain the ultimate accuracy. On the other hand, the estimation of the latter parameters (i.e., the operation of model selection), while also extensively studied, is less understood and well-established methods that always perform better than any competitor do not exist. AIC and BIC rules, without a doubt, are the workhorses of model selection in the signal processing literature and elsewhere (see, e.g. [2], [3], [4]). Under certain conditions, the former is known to overestimate the true orders with a (false alarm) probability of about 0.16, whereas the latter is consistent (as N → ∞). AIC's tendency of overestimation should not be viewed as a drawback if the model is to be used for prediction. However, if the main purpose of the modeling exercise is acquiring information about the actual data generating mechanism and model interpretability is important, then consistent (or nearly so) model selection becomes a key factor and consequently BIC is to be preferred to AIC.
The problem is that, as illustrated in the previous section, in applications in which the range of possible model orders is considerable (i.e., M >> 1), even BIC can perform rather poorly. The primary aim of this paper was to show that model selection methods, which perform much better than BIC in such applications and not worse in regular cases in which BIC is consistent, do exist. We have shown how the FDR and FER procedures of multiple hypothesis testing can be used to derive such methods (or rules) that can perform better in terms of model selection accuracy than both AIC and BIC. A secondary goal of the paper was to briefly introduce the FDR and FER principles to those readers who were less familiar with them.
The FDR and FER rules presented in the previous sections can be directly used in many signal processing applications. Furthermore, the basic ideas of these rules can be employed to develop new rules for model selection based on even more advanced results and methods from the multiple hypothesis testing literature. Indeed there exist several enhanced versions of (6) and (46) as well as more sophisticated methods for controlling the FDR or FER (see, e.g., [18], [19]). Can we use the methodology described in this paper along with those more advanced hypothesis testing methods to obtain new model selection rules with enhanced performance? This is a question that is worthy of further research in our opinion.

APPENDIX PROOF OF (4) FOR INDEPENDENT STATISTICS AND THE SIGNIFICANCE LEVELS IN (10)
Using the definitions ofk in (9) and C in (47) we can write FDR as: where I(·) denotes the indicator function (I(A) = 1 if A = true and I(A) = 0 if A = false). The expression for FDR in (76) holds because any H 0 j with j >k (and hence T j < q pk ) is accepted, whereas any H 0 j with j ≤k is rejected and its test statistics satisfies T j ≥ Tk ≥ q pk . Interestingly there can be T j s that are under their quantiles, T j < q p j , but they are rejected too if j ≤k (see (9)).
The main difficulty associated with evaluating the expectation in (76) is that T j andk are not independent of each other. To overcome this problem we will try to find ak j that is independent of T j and which is such that To do so, we replace T j by a valueT 1 >> 1 (theoretically infinite, but a valueT 1 > T 1 suffices). Then we re-order the statistics with T j replaced byT 1 : r If j >k then I(T j ≥ q pk ) = 0 and thus (77) must hold.
Note that the inequality in (77) can be strict in such a case becauseT k ≥ T k (for k = 1, . . . , j), consequently we might haveT k ≥ q p k even when T k < q p k . This means that in the present casek j ≥k with strict inequality being possible and we have q pk j < q pk ifk j >k and hence the RHS in (77) can be > 0.
r If j <k thenT k = T k for k =k, . . . , M and thusk j =k so (77) holds.
r Finally, If j =k thenT k = T k for k =k + 1, . . . , M and Tk = Tk −1 ≥ Tk ≥ q pk , so we still havek j =k and (77) holds true. Using (77) along with the fact thatk j and T j are independent of one another (ask j depends on {T k } k = j ) we can write: The proof of (4) is thus concluded.

ACKNOWLEDGMENT
The authors would like to thank Associate Editor (Prof. Usman Khan) and the four reviewers for their constructive and helpful comments.