Minimum Description Length Principle in Supervised Learning With Application to Lasso

The minimum description length (MDL) principle is extended to supervised learning. The MDL principle is a philosophy that the shortest description of given data leads to the best hypothesis about the data source. One of the key theories for the MDL principle is Barron and Cover’s theory (BC theory), which mathematically justifies the MDL principle based on two-stage codes in density estimation (unsupervised learning). Though the codelength of two-stage codes looks similar to the target function of penalized likelihood methods, parameter optimization of penalized likelihood methods is done without quantization of the parameter space. Recently, Chatterjee and Barron provided theoretical tools to extend BC theory to penalized likelihood methods by overcoming this difference. Indeed, applying their tools, they showed that the famous penalized likelihood method ‘lasso’ can be interpreted as an MDL estimator and enjoys a performance guarantee from BC theory. An important fact is that their results assume a fixed design setting, which is essentially the same as unsupervised learning. The fixed design is natural if we use lasso for compressed sensing. If we use lasso for supervised learning, however, the fixed design is considerably unsatisfactory; only the random design is acceptable. However, it is inherently difficult to extend BC theory to the random design, regardless of whether the parameter space is quantized or not. In this paper, a novel theoretical tool for extending BC theory to supervised learning (the random design setting without quantization of the parameter space) is provided. Applying this tool, we prove that when the covariates follow a Gaussian distribution, lasso in the random design setting can also be interpreted as an MDL estimator, and that lasso enjoys the risk bound of BC theory. The risk/regret bounds obtained have several advantages inherited from BC theory. First, the bounds require remarkably few assumptions.
Second, the bounds hold for any finite sample size $n$ and any finite feature number $p$, even if $n \ll p$. The behavior of the regret bound is investigated by numerical simulations. We believe that this is the first extension of BC theory to supervised learning (random design).


I. INTRODUCTION
There have been various techniques to evaluate the performance of machine learning methods theoretically. Taking lasso [1] as an example, lasso has been analyzed by nonparametric statistics [2], [3], [4], [5], empirical process theory [6], statistical physics [7], [8], [9], and so on. In general, most of these techniques require either asymptotic assumptions (the sample number $n$ and/or the feature number $p$ go to infinity) or various technical assumptions like boundedness of features or moment conditions. Some of them are too restrictive for practical use. In this paper, we try to develop another way of evaluating the performance of machine learning methods with as few assumptions as possible. An important candidate for this purpose is Barron and Cover's theory (BC theory), which is one of the most famous results for the minimum description length (MDL) principle. The MDL principle [10], [11], [12], [13], [14] claims that the shortest description of a given set of data leads to the best hypothesis about the data source. A famous model selection criterion based on the MDL principle was proposed by Rissanen [10]. This criterion corresponds to the codelength of a two-stage code, in which one first encodes a statistical model used to encode the data and then encodes the data with that model. In this case, an MDL estimator is defined as the minimizer of the total codelength of this two-stage code. BC theory [15] guarantees that the risk of the MDL estimator in terms of the Rényi divergence [16] is tightly bounded from above by the redundancy of the corresponding two-stage code. Because this result means that the shortest description of the data by the two-stage code yields the smallest risk upper bound, it gives a mathematical justification of the MDL principle. Furthermore, BC theory holds for finite $n$ without any complicated technical conditions. However, BC theory has been applied to supervised learning only approximately or in limited settings. The original BC theory seems to be widely believed to be applicable to
both unsupervised and supervised learning. Though this is not false, BC theory actually cannot be applied to supervised learning without a certain condition (Condition 1, defined in Section III). This condition is critical in the sense that its absence breaks a key technique of BC theory. The literature [17] is, to our knowledge, the only example of an application of BC theory to supervised learning. That work assumed a specific setting in which Condition 1 can be satisfied. However, the resulting risk bound may not be sufficiently tight because Condition 1 is imposed forcibly, which will be explained in Section III. Another well-recognized disadvantage is the necessity of quantization of the parameter space. Barron et al. proposed a way to avoid the quantization and derived a risk bound of lasso [18], [19] as an example. However, their idea cannot be applied to supervised learning in general. The main difficulty stems from Condition 1, as explained later, and is thus essentially hard to resolve. Actually, their risk bound of lasso was derived for the fixed design only (i.e., an essentially unsupervised setting). The fixed design, however, is not satisfactory for evaluating the generalization error of supervised learning. In this paper, we propose an extension of BC theory to supervised learning without quantization in random design cases. The derived risk bound inherits most of the advantages of the original BC theory. The main term of the risk bound again has the form of the redundancy of a two-stage code. Thus, our extension also gives a mathematical justification of the MDL principle in supervised learning. It should be remarked, however, that an additional condition is required for an exact redundancy interpretation. We also derive new risk and regret bounds of lasso with random design as an application, under normality of the features. This application is not trivial at all and requires much more effort than both the above extension itself and the derivation in fixed design cases. We will try to derive those bounds in a
manner not specific to our setting but rather applicable to several other settings. Interestingly, the redundancy and regret interpretations of the above bounds are exactly justified without any additional condition in the case of lasso. The greatest advantage of our theory is that it requires almost no assumptions: neither asymptotic assumptions ($n < p$ is also allowed), boundedness assumptions, moment conditions, nor other technical conditions. In particular, it is remarkable that our risk evaluation holds for finite $n$ without requiring boundedness of the features, even though the employed loss function (the Rényi divergence) is unbounded. The behavior of the regret bound will be investigated by numerical simulations. It may be worth noting that, although we tried several other approaches to extending BC theory to supervised learning, none of them yielded a risk bound of lasso tight enough to be meaningful. We believe that our proposal is currently the only choice that can give a meaningful risk bound.
This paper is organized as follows. Section II introduces an MDL estimator in supervised learning. We briefly review BC theory and its recent progress in Section III. The extension of BC theory to supervised learning will appear in Section IV-A. We derive new risk and regret bounds of lasso in Section IV-B. All proofs of our results are given in Section V. Section VI contains numerical simulations. A conclusion will appear in Section VII.

II. MDL ESTIMATOR IN SUPERVISED LEARNING
Suppose that we have $n$ training data $(x^n, y^n) := \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y} \mid i = 1, 2, \cdots, n\}$ generated from $p^*(x^n, y^n) = q^*(x^n)p^*(y^n|x^n)$, where $\mathcal{X}$ is the domain of the feature vector $x$ and $\mathcal{Y}$ can be $\Re$ (regression) or a finite set (classification) according to the target problem. Here, the sequence $(x_1, y_1), (x_2, y_2), \cdots$ is not necessarily independently and identically distributed (i.i.d.) but can be a stochastic process in general. We write the $j$th component of the $i$th sample as $x_{ij}$. To define an MDL estimator according to the notion of two-stage codes [10], we need to describe not only the data but also the statistical model used to describe the data. Letting $L(x^n, y^n)$ be the codelength of the two-stage code describing $(x^n, y^n)$, it can be decomposed by the chain rule as $L(x^n, y^n) = L(x^n) + L(y^n|x^n)$. Since the goal of supervised learning is to estimate $p^*(y^n|x^n)$, we need not estimate $q^*(x^n)$. In view of the MDL principle, this implies that $L(x^n)$ (the description length of $x^n$) can be ignored. Therefore, we only consider the encoding of $y^n$ given $x^n$ hereafter. This corresponds to a description scheme in which the encoder and the decoder share the data $x^n$. To describe $y^n$ given $x^n$, we use a parametric model $p_\theta(y^n|x^n)$ with parameter $\theta \in \Theta$. The parameter space $\Theta$ is a certain continuous space or a union of continuous spaces. Note, however, that a continuous parameter cannot be encoded. Thus, we need to quantize the parameter space $\Theta$ as $\tilde{\Theta}(x^n)$. According to the notion of the two-stage code, we need to describe not only $y^n$ but also the model used to describe $y^n$ (or equivalently the parameter $\hat{\theta} \in \tilde{\Theta}(x^n)$) given $x^n$. Again by the chain rule, such a codelength can be decomposed as $L(y^n, \hat{\theta}|x^n) = \tilde{L}(\hat{\theta}|x^n) + L(y^n|x^n, \hat{\theta})$. Here, $L(y^n|x^n, \hat{\theta})$ is the codelength describing $y^n$ using $p_{\hat{\theta}}(y^n|x^n)$, which is, needless to say, $-\log p_{\hat{\theta}}(y^n|x^n)$ (we measure codelengths in nats). On the other hand, $\tilde{L}(\hat{\theta}|x^n)$ is the codelength describing the model $p_{\hat{\theta}}(y^n|x^n)$ itself. Note that $\tilde{L}(\hat{\theta}|x^n)$ must satisfy Kraft's inequality $\sum_{\hat{\theta} \in \tilde{\Theta}(x^n)} e^{-\tilde{L}(\hat{\theta}|x^n)} \le 1$.
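As a small check of the prefix-code constraint above, the following sketch evaluates the Kraft sum for a toy set of model codelengths (using bits for concreteness; natural-log units differ only by a constant factor). The codelengths themselves are hypothetical examples, not the paper's construction:

```python
import math

def kraft_sum(codelengths):
    """Sum of 2^(-L) over all codewords; a prefix code requires this to be <= 1."""
    return sum(2.0 ** (-L) for L in codelengths)

# A hypothetical model description length over four quantized parameter values:
# lengths 1, 2, 3, 3 bits form a complete prefix code (e.g. 0, 10, 110, 111).
lengths = [1, 2, 3, 3]
assert kraft_sum(lengths) <= 1.0 + 1e-12   # a legal model description length
assert kraft_sum([1, 1, 2]) > 1.0          # too short: no prefix code exists
```

Any model description length failing this check cannot correspond to a decodable encoding of the quantized parameter.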
The MDL estimator is defined as the minimizer of the above codelength: $\hat{\theta} := \mathop{\arg\min}_{\hat{\theta} \in \tilde{\Theta}(x^n)} \{ \tilde{L}(\hat{\theta}|x^n) + L(y^n|x^n, \hat{\theta}) \}$. Let us write the minimum description length attained by the two-stage code as $L_2(y^n|x^n) := \min_{\hat{\theta} \in \tilde{\Theta}(x^n)} \{ \tilde{L}(\hat{\theta}|x^n) + L(y^n|x^n, \hat{\theta}) \}$. Because $L_2$ also satisfies Kraft's inequality with respect to $y^n$ for each $x^n$, it can be interpreted as the codelength of a prefix two-stage code. Therefore, $p_2(y^n|x^n) := e^{-L_2(y^n|x^n)}$ is a conditional sub-probability distribution corresponding to the two-stage code.
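The two-stage minimization can be illustrated with a minimal sketch: a one-dimensional Gaussian regression model, a quantized parameter grid, and a uniform model code. The grid, the uniform code, and the unit noise variance are illustrative assumptions here, not the paper's construction:

```python
import math, random

random.seed(0)

def two_stage_mdl(xs, ys, grid, model_bits):
    """Pick the quantized parameter minimizing the total two-stage codelength
    L(theta_hat | x^n) + L(y^n | x^n, theta_hat), with a Gaussian noise model of
    unit variance. Codelengths are in bits; for continuous y, -log2 density
    stands in for the data codelength via the usual quantization argument."""
    best = None
    for theta, bits in zip(grid, model_bits):
        nll = sum(0.5 * math.log2(2 * math.pi)
                  + (y - theta * x) ** 2 / (2 * math.log(2))
                  for x, y in zip(xs, ys))
        total = bits + nll
        if best is None or total < best[1]:
            best = (theta, total)
    return best[0]

xs = [random.gauss(0, 1) for _ in range(200)]
ys = [1.5 * x + random.gauss(0, 1) for x in xs]     # true parameter 1.5
grid = [i / 10 for i in range(-30, 31)]             # quantized parameter space
bits = [math.log2(len(grid))] * len(grid)           # uniform code: satisfies Kraft
theta_hat = two_stage_mdl(xs, ys, grid, bits)       # lands near 1.5 on the grid
```

With a non-uniform model code, the same routine trades data fit against model description length, which is the mechanism the MDL estimator formalizes.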

III. BARRON AND COVER'S THEORY
We briefly review Barron and Cover's theory (BC theory) and its recent progress in view of supervised learning, though these works basically discussed unsupervised learning (or supervised learning with fixed design). In BC theory, the Rényi divergence [16] between $p(y|x)$ and $r(y|x)$ with order $\lambda \in (0, 1)$ is used as a loss function. The Rényi divergence converges to the Kullback-Leibler (KL) divergence as $\lambda \to 1$ for any $p, r$. We also note that the Rényi divergence at $\lambda = 0.5$ is equal to the Bhattacharyya divergence [20]. We drop the $n$ of each divergence, writing $d_\lambda(p, r)$, if it is defined with a single random variable. BC theory requires the model description length to satisfy a somewhat stronger form of Kraft's inequality, defined as follows.
Definition 1. Let $\beta$ be a real number in $(0, 1)$. We say that a function $h(\hat{\theta})$ satisfies the $\beta$-stronger Kraft inequality if $\sum_{\hat{\theta}} e^{-\beta h(\hat{\theta})} \le 1$, where the summation is taken over the range of $\hat{\theta}$ in its context.
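One plausible reading of the $\beta$-stronger Kraft inequality is $\sum_{\hat{\theta}} 2^{-\beta h(\hat{\theta})} \le 1$ (base-2 codelengths for concreteness); under that assumed form, the condition can be checked numerically, and scaling an ordinary prefix code's lengths by $1/\beta$ always produces lengths satisfying it:

```python
def satisfies_beta_kraft(h_values, beta):
    """Check sum over theta of 2^(-beta * h(theta)) <= 1 (an assumed base-2 form
    of the beta-stronger Kraft inequality)."""
    return sum(2.0 ** (-beta * h) for h in h_values) <= 1.0 + 1e-12

beta = 0.5
ordinary = [1, 2, 3, 3]                  # satisfies the usual Kraft inequality exactly
stronger = [L / beta for L in ordinary]  # inflated lengths 2, 4, 6, 6
assert satisfies_beta_kraft(stronger, beta)      # longer codes pass
assert not satisfies_beta_kraft(ordinary, beta)  # ordinary Kraft lengths fail
```

The inflation by $1/\beta$ is exactly why terms like $\tilde{L}/\beta$ appear in the risk bounds below: the stronger inequality buys the probabilistic argument at the price of longer descriptions.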
The following condition is indispensable for application of BC theory to supervised learning.

Condition 1 (indispensable condition).
Both the quantized space and the model description length are independent of $x^n$, i.e., $\tilde{\Theta}(x^n) = \tilde{\Theta}$ and $\tilde{L}(\hat{\theta}|x^n) = \tilde{L}(\hat{\theta})$. Under Condition 1, BC theory [15] gives the following two theorems for supervised learning. Though these theorems were shown only for the case of the Hellinger distance in the original literature [15], we state them with the Rényi divergence.
Theorem 2. Let $\beta$ be a real number in $(0, 1)$. Assume that $\tilde{L}$ satisfies the $\beta$-stronger Kraft inequality. Under Condition 1, for any $\lambda \in (0, 1 - \beta]$,

Theorem 3. Let $\beta$ be a real number in $(0, 1)$. Assume that $\tilde{L}$ satisfies the $\beta$-stronger Kraft inequality. Under Condition 1,

Since the right side of (7) is just the redundancy of the prefix two-stage code, Theorem 2 implies that we obtain the smallest upper bound of the risk by compressing the data most with the two-stage code. That is, Theorem 2 is a mathematical justification of the MDL principle. We remark that, by interchanging the infimum and the expectation in (6), the right side of (6) becomes a quantity called the "index of resolvability" [15], which is an upper bound of the redundancy. It is remarkable that BC theory requires no assumptions except Condition 1 and the $\beta$-stronger Kraft inequality. However, Condition 1 is a somewhat severe restriction. Note that, in the definitions of Section II, both the quantization and the model description length can depend on $x^n$. In view of the MDL principle, this is favorable because the total description length can then be minimized flexibly according to $x^n$. If we instead use a model description length that is uniform over $\mathcal{X}^n$, the total codelength must be longer in general. Hence, a data-dependent model description length is more desirable. Actually, this observation suggests that the bound derived in [17] may not be sufficiently tight. In addition, the restriction of Condition 1 excludes a practically important case, 'lasso with column normalization' (explained below), from the scope of application. However, it is essentially difficult to remove this restriction, as noted in Section I.
Another concern is quantization. The quantization for encoding is natural in view of the MDL principle. Our target, however, is an application to usual estimators or machine learning algorithms themselves, including lasso. A trivial example of such an application is the penalized maximum likelihood estimator (PMLE) $\bar{\theta} := \mathop{\arg\min}_{\theta \in \Theta} \{ -\log p_\theta(y^n|x^n) + \bar{L}(\theta|x^n) \}$, where $\bar{L} : \Theta \times \mathcal{X}^n \to [0, \infty)$ is a certain penalty. Similarly to the quantized case, let us define $p_2(y^n|x^n) := \exp\left(-\min_{\theta \in \Theta}\{-\log p_\theta(y^n|x^n) + \bar{L}(\theta|x^n)\}\right)$. Note, however, that this $p_2(y^n|x^n)$ is not necessarily a sub-probability distribution, in contrast to the quantized case; this will be discussed in detail in Section IV-A. The PMLE is a wide class of estimators including many useful methods like ridge regression [21], lasso, the Dantzig selector [22], and any maximum-a-posteriori estimator of Bayes estimation. If we can accept $\bar{\theta}$ as an approximation of $\hat{\theta}$ (by taking the penalty $\bar{L}$ equal to the model description length $\tilde{L}$), we have a risk bound by direct application of BC theory. However, the quantization is unnatural in view of machine learning applications. Besides, we cannot use any data-dependent $\bar{L}$. Barron et al. proposed an important notion, 'risk validity', to remove the quantization [23], [19], [24]. Definition 4 (risk validity). Let $\beta$ be a real number in $(0, 1)$ and $\lambda$ be a real number in $(0, 1 - \beta]$. For fixed $x^n$, we say that a penalty function $\bar{L}(\theta|x^n)$ is risk valid if there exist a quantized space $\tilde{\Theta}(x^n) \subset \Theta$ and a model description length $\tilde{L}(\hat{\theta}|x^n)$ satisfying the $\beta$-stronger Kraft inequality such that the defining inequality holds, where $d^n_\lambda(p, r|x^n)$ is the Rényi divergence for the fixed design ($x^n$ is fixed). Note that their original definition in [19] was presented only for the case $\lambda = 1 - \beta$. Since $x^n$ is fixed, $d^n_\lambda(p, r|x^n)$ does not depend on $q^*(x^n)$, in contrast to the Rényi divergence for the random design, $d^n_\lambda(p, r)$, defined by (1). Barron et al.
proved that $\bar{\theta}$ has bounds similar to Theorems 2 and 3 for any risk valid penalty in the fixed design case. Their approach is excellent because it requires no additional condition beyond risk validity. However, a risk evaluation only for a particular $x^n$ is unsatisfactory for supervised learning. In order to evaluate the so-called 'generalization error' of supervised learning, we need to evaluate the risk with random design, i.e., $E_{p^*(x^n, y^n)}[d^n_\lambda(p^*, p_{\bar{\theta}})]$. However, it is essentially difficult to apply their idea to random design cases as it is. Let us explain this by using lasso as an example. Readers unfamiliar with lasso can refer to the beginning of Section IV-B for its definition. By extending the definition of risk validity to the random design straightforwardly, we obtain the following definition.
Definition 5 (risk validity in random design). Let $\beta$ be a real number in $(0, 1)$ and $\lambda$ be a real number in $(0, 1 - \beta]$. We say that a penalty function $\bar{L}(\theta|x^n)$ is risk valid if there exist a quantized space $\tilde{\Theta} \subset \Theta$ and a model description length $\tilde{L}(\hat{\theta})$ satisfying the $\beta$-stronger Kraft inequality such that

In contrast to the fixed design case, (8) must hold not only for a fixed $x^n \in \mathcal{X}^n$ but for all $x^n \in \mathcal{X}^n$. In addition, $\tilde{\Theta}$ and $\tilde{L}(\hat{\theta})$ must be independent of $x^n$ due to Condition 1. The form of the Rényi divergence $d^n_\lambda(p^*, p_{\hat{\theta}})$ also differs from $d^n_\lambda(p^*, p_{\hat{\theta}}|x^n)$ of the fixed design case in general. Let us rewrite (9) equivalently as

For short, we write the inside part of the minimum on the left side of (10) as $H(\theta, \hat{\theta}, x^n, y^n)$. We need to evaluate $\min_{\hat{\theta}} \{H(\theta, \hat{\theta}, x^n, y^n)\}$ in order to derive risk valid penalties. However, this seems to be considerably difficult. To our knowledge, the technique used by Chatterjee and Barron [19] is the best way to evaluate it, so we also employ it in this paper. A key premise of their idea is that taking $\hat{\theta}$ close to $\theta$ is not a bad choice for evaluating $\min_{\hat{\theta}} H(\theta, \hat{\theta}, x^n, y^n)$.
Regardless of whether this premise is true or not, it seems natural and meaningful in the following sense. If we quantize the parameter space finely enough, the quantized estimator $\hat{\theta}$ is expected to behave almost like $\bar{\theta}$ with the same penalty and to have a similar risk bound.
If we take $\hat{\theta} = \theta$, then $H(\theta, \hat{\theta}, x^n, y^n)$ is equal to $\tilde{L}(\theta)$, which would imply that $\tilde{L}(\theta)$ is a risk valid penalty and has a risk bound similar to the quantized case. Note, however, that we cannot match $\hat{\theta}$ to $\theta$ exactly, because $\hat{\theta}$ must lie on the fixed quantized space $\tilde{\Theta}$. So, Chatterjee and Barron randomized $\hat{\theta}$ over the grid points of $\tilde{\Theta}$ around $\theta$ and evaluated the expectation with respect to it. This is clearly justified because $\min_{\hat{\theta}} H(\theta, \hat{\theta}, x^n, y^n) \le E_{\hat{\theta}}[H(\theta, \hat{\theta}, x^n, y^n)]$. By using a carefully tuned randomization, they succeeded in removing the dependency of $E_{\hat{\theta}}[H(\theta, \hat{\theta}, x^n, y^n)]$ on $y^n$. Let us write the resultant expectation as $H'(\theta, x^n)$; then $H'(\theta, x^n)$ is a risk valid penalty. By this fact, risk valid penalties should basically depend on $x^n$ in general. If not ($\bar{L}(\theta|x^n) = \bar{L}(\theta)$), $\bar{L}(\theta)$ must bound $\max_{x^n} H'(\theta, x^n)$, which makes $\bar{L}(\theta)$ much larger. This is again unfavorable in view of the MDL principle.
In particular, $H'(\theta, x^n)$ includes a term that is unbounded in $x^n$ in linear regression cases, which originates from the third term of the left side of (10). This can be seen by checking Section III of [19]. Though their setting is the fixed design, this fact is also true for the random design. Hence, as long as we use their technique, derived risk valid penalties must depend on $x^n$ in linear regression cases. However, the $\ell_1$ norm used in the usual lasso does not depend on $x^n$. Hence, risk validity seems to be useless for lasso. However, the weighted $\ell_1$ norm $\|\theta\|_{w,1}$, whose weights are built from the quantities $x_{ij}^2$, plays an important role here. The lasso with this weighted $\ell_1$ norm is equivalent to an ordinary lasso with column normalization such that each column of the design matrix has the same norm. Column normalization is theoretically and practically important. Hence, we try to find a risk valid penalty of the form $\mu_1 \|\theta\|_{w,1} + \mu_2$, where $\mu_1$ and $\mu_2$ are real coefficients. Indeed, there seems to be no other useful penalty dependent on $x^n$ for the usual lasso.
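A minimal sketch of the weighted $\ell_1$ norm, assuming weights $w_j$ equal to the root-mean-square of column $j$ (the paper's exact weights appear later via Lemma 1; this form is an assumption here). Rescaling the parameters by the weights turns the weighted norm into a plain $\ell_1$ norm, which is the column-normalization equivalence mentioned above:

```python
import math, random

random.seed(1)

def column_weights(X):
    """w_j = sqrt((1/n) * sum_i x_ij^2): root-mean-square of column j
    (an assumed form of the data-dependent weights)."""
    n, p = len(X), len(X[0])
    return [math.sqrt(sum(X[i][j] ** 2 for i in range(n)) / n) for j in range(p)]

def weighted_l1(theta, w):
    return sum(wj * abs(tj) for wj, tj in zip(w, theta))

n, p = 50, 4
X = [[random.gauss(0, 1 + j) for j in range(p)] for _ in range(n)]  # unequal column scales
w = column_weights(X)

# In the rescaled parameters theta'_j = w_j * theta_j (i.e. columns normalized to
# RMS 1), the weighted l1 norm becomes the plain l1 norm.
theta = [0.3, -1.2, 0.0, 2.0]
theta_rescaled = [wj * tj for wj, tj in zip(w, theta)]
assert abs(weighted_l1(theta, w) - sum(abs(t) for t in theta_rescaled)) < 1e-12
```

The point of the data dependence is visible here: shrinking $x^n$ shrinks every $w_j$, and with it the whole weighted penalty, which is exactly the obstruction discussed next.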
In contrast to fixed design cases, however, there are severe difficulties in deriving a meaningful risk bound with this penalty.
We explain this intuitively. The main difficulty is caused by Condition 1. As described above, our strategy is to take $\hat{\theta}$ close to $\theta$. Suppose now that this is ideally almost realizable for any choice of $x^n, y^n, \theta$. This implies that $H(\theta, \hat{\theta}, x^n, y^n)$ is almost equal to $\tilde{L}(\theta)$. On the other hand, for each fixed $\theta$, the weighted $\ell_1$ norm of $\theta$ can be made arbitrarily small by making $x^n$ small accordingly. Therefore, the penalty $\mu_1 \|\theta\|_{w,1} + \mu_2$ is almost equal to $\mu_2$ in this case. This implies that $\mu_2$ must bound $\max_\theta \tilde{L}(\theta)$, which is infinite in general. If $\tilde{L}$ depended on $x^n$, we could resolve this problem. However, $\tilde{L}$ must be independent of $x^n$. This issue does not seem to be specific to lasso. Another major issue is the Rényi divergence $d^n_\lambda(p^*, p_\theta)$. In the fixed design case, the Rényi divergence $d^n_\lambda(p^*, p_\theta|x^n)$ is a simple convex function in terms of $\theta$, which makes its analysis easy. In contrast, the Rényi divergence $d^n_\lambda(p^*, p_\theta)$ in the random design case is not convex and is more complicated than that of fixed design cases, which makes it difficult to analyze. We will describe why the non-convexity of the loss function makes the analysis difficult in Section V-G. The difficulties that we face when using the techniques of [19] in the random design case are not limited to these. We do not explain the others here because doing so requires understanding their techniques in detail. We only remark that these difficulties seem to make their techniques useless for supervised learning with random design. We propose a remedy that solves these issues at once in the next section.

IV. MAIN RESULTS
In this section, we propose a way to extend BC theory to supervised learning and derive a new risk bound of lasso.

A. Extension of BC Theory to Supervised Learning
There are several possible approaches to extending BC theory to supervised learning. A major concern is how tight the resultant risk bound is. Below, we propose a way that gives a tight risk upper bound at least for lasso. A key idea is to modify the risk validity condition by introducing a so-called typical set of $x^n$. We postulate that the probability distribution of the stochastic process $x_1, x_2, \cdots$ is a member of a certain class $\mathcal{P}_x$. Furthermore, we define $\mathcal{P}^n_x$ as the set of marginal distributions of $x_1, x_2, \cdots, x_n$ of all elements of $\mathcal{P}_x$. We assume that we can define a typical set $A^n_\epsilon$ for each $q^* \in \mathcal{P}^n_x$, i.e., $\Pr(x^n \in A^n_\epsilon) \to 1$ as $n \to \infty$. This is possible if $q^*$ is stationary and ergodic, for example. See [25] for details. For short, $\Pr(x^n \in A^n_\epsilon)$ is written as $P^n_\epsilon$ hereafter. We modify the risk validity by using the typical set.
Definition 6 (ǫ-risk validity). Let $\beta, \epsilon$ be real numbers in $(0, 1)$ and $\lambda$ be a real number in $(0, 1 - \beta]$. We say that $\bar{L}(\theta|x^n)$ is ǫ-risk valid for $(\lambda, \beta, \mathcal{P}^n_x, A^n_\epsilon)$ if for any $q^* \in \mathcal{P}^n_x$, there exist a quantized subset $\tilde{\Theta}(q^*) \subset \Theta$ and a model description length $\tilde{L}(\hat{\theta}|q^*)$ satisfying the $\beta$-stronger Kraft inequality such that

Note that both $\tilde{\Theta}$ and $\tilde{L}$ can depend on the unknown distribution $q^*(x^n)$. This is not problematic because the final penalty $\bar{L}$ does not depend on the unknown $q^*(x^n)$. The difference from (10) is the restriction of the range of $x^n$ to the typical set. From here through the next section, we will see how this small change solves the problems described in the previous section. First, we show what can be proved for ǫ-risk valid penalties.

Theorem 7 (risk bound). Define $E^n_\epsilon$ as the conditional expectation with respect to $p^*(x^n, y^n)$ given that $x^n \in A^n_\epsilon$. Let $\beta, \epsilon$ be arbitrary real numbers in $(0, 1)$. For any $\lambda \in (0, 1 - \beta]$,

Theorem 8 (regret bound). Let $\beta, \epsilon$ be arbitrary real numbers in $(0, 1)$. For any $\lambda \in (0, 1 - \beta]$,

A proof of Theorem 7 is described in Section V-A, while a proof of Theorem 8 is described in Section V-B. Note that both bounds become tightest when $\lambda = 1 - \beta$, because the Rényi divergence $d^n_\lambda(p, r)$ is monotonically increasing in $\lambda$ (see [12] for example). In this paper, we call the quantity $\log(1/p_2(y^n|x^n)) - \log(1/p^*(y^n|x^n))$ in Theorem 8 the 'regret' of the two-stage code $p_2$ on the given data $(x^n, y^n)$, though the ordinary regret is defined as the codelength difference from $\log(1/p_{\hat{\theta}_{\mathrm{mle}}}(y^n|x^n))$, where $\hat{\theta}_{\mathrm{mle}}$ denotes the maximum likelihood estimator. Compared to the usual BC theory, there is an additional term $(1/\beta)\log(1/P^n_\epsilon)$ in the risk bound (11). Due to the property of the typical set, this term decreases to zero as $n \to \infty$. Therefore, the first term is the main term, which has the form of the redundancy of a two-stage code, as in the quantized case. Hence, this theorem gives a justification of the MDL principle in supervised learning. Note, however, that $-\log p_2(y^n|x^n)$ needs to satisfy Kraft's inequality in order to interpret the main term exactly as a conditional redundancy. A sufficient condition for this was introduced by [24] and is called 'codelength validity'. Definition 9 (codelength validity). We say that $\bar{L}(\theta|x^n)$ is codelength valid if there exist a quantized subset $\tilde{\Theta}(x^n) \subset \Theta$ and a model description length $\tilde{L}(\hat{\theta}|x^n)$ satisfying Kraft's inequality such that for each $x^n$.
We note that both the quantization and the model description length on it can depend on $x^n$, in contrast to the ǫ-risk validity. This is because the fixed design setting suffices to justify the redundancy interpretation. Let us see that $-\log p_2(y^n|x^n)$ can be exactly interpreted as a codelength if $\bar{L}(\theta|x^n)$ is codelength valid. First, we assume that $\mathcal{Y}$, the range of $y$, is discrete. For each $x^n$, we have $\sum_{y^n} p_2(y^n|x^n) \le 1$. Hence, $-\log p_2(y^n|x^n)$ can be exactly interpreted as the codelength of a prefix code. Next, we consider the case where $\mathcal{Y}$ is a continuous space. The above inequality trivially holds by replacing the sum with respect to $y^n$ with an integral. Thus, $p_2(y^n|x^n)$ is guaranteed to be a sub-probability density function. Needless to say, $-\log p_2(y^n|x^n)$ cannot be interpreted as a codelength by itself in continuous cases. As is well known, however, a difference of such quantities can be exactly interpreted as a codelength difference by way of quantization. See Section III of [15] for details. This indicates that both the redundancy interpretation of the first term of (11) and the regret interpretation of the (negative) second term on the left side of the inequality in the first line of (12) are justified by the codelength validity. Note, however, that ǫ-risk validity does not imply codelength validity, nor vice versa, in general.
We now discuss the conditional expectation in the risk bound (11). This conditional expectation seems hard to replace with the usual (unconditional) expectation. The main difficulty arises from the unboundedness of the loss function. Indeed, we can immediately show a similar risk bound with the unconditional expectation for bounded loss functions. As an example, let us consider a class of divergences called the $\alpha$-divergence [26]: $D_\alpha(p\|r) := \frac{4}{1-\alpha^2}\left(1 - \int p(y)^{\frac{1-\alpha}{2}} r(y)^{\frac{1+\alpha}{2}}\, dy\right)$. The $\alpha$-divergence approaches the KL divergence as $\alpha \to \pm 1$ [27]. We also note that the $\alpha$-divergence with $\alpha = 0$ is four times the squared Hellinger distance (16), which has been studied and used in statistics for a long time. We focus here on the following two properties of the $\alpha$-divergence: (i) the $\alpha$-divergence is always bounded, $D_\alpha(p\|r) \le \frac{4}{1-\alpha^2}$, for any $p, r$ and $\alpha \in (-1, 1)$; (ii) the $\alpha$-divergence is bounded by the Rényi divergence for any $p, r$ and $\alpha \in (-1, 1)$. See [14] for its proof.
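The $\alpha = 0$ case can be verified numerically for a pair of unit-variance Gaussians, using one common parametrization of Amari's $\alpha$-divergence (the paper's normalization may differ by constants); for equal-variance Gaussians the Hellinger affinity has the closed form $\exp(-(\mu_1-\mu_2)^2/8)$:

```python
import math

def alpha_divergence(p, q, alpha, lo=-20.0, hi=20.0, steps=40000):
    """D_alpha(p||q) = 4/(1-alpha^2) * (1 - integral of p^((1-alpha)/2) q^((1+alpha)/2)),
    one common parametrization of Amari's alpha-divergence, via midpoint quadrature."""
    dx = (hi - lo) / steps
    integral = sum(p(lo + (k + 0.5) * dx) ** ((1 - alpha) / 2)
                   * q(lo + (k + 0.5) * dx) ** ((1 + alpha) / 2)
                   for k in range(steps)) * dx
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - integral)

def gaussian(mu):
    return lambda x: math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

p, q = gaussian(0.0), gaussian(1.0)
d0 = alpha_divergence(p, q, 0.0)
# At alpha = 0 this is four times the squared Hellinger distance; for unit-variance
# Gaussians the affinity is exp(-(mu1 - mu2)^2 / 8), so H^2 = 1 - exp(-1/8) here.
hellinger_sq = 1.0 - math.exp(-1.0 / 8.0)
assert abs(d0 - 4.0 * hellinger_sq) < 1e-6
assert d0 < 4.0  # boundedness: D_alpha < 4/(1 - alpha^2) = 4 at alpha = 0
```

The hard bound $4/(1-\alpha^2)$, visible in the last assertion, is exactly the boundedness property (i) that permits the unconditional risk bound.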
As a corollary of Theorem 7, we obtain the following risk bound.

Corollary 1. Define a function $\lambda(t)$ as in the statement below; then the risk bound (19) holds. In particular, taking $\beta = (\alpha + 1)/2$ yields the tightest bound.

Its proof is described in Section V-C. Though it is not obvious when the condition that $p_2(y^n|x^n)$ is a sub-probability distribution is satisfied, we remark that the codelength validity of $\bar{L}(\theta|x^n)$ is a simple sufficient condition for it. The second and third terms of the right side vanish as $n \to \infty$ due to the property of the typical set. The boundedness of the loss function is indispensable for the proof. On the other hand, it seems impossible to bound the unconditional risk for unbounded loss functions. Our remedy for this issue is the risk evaluation based on the conditional expectation on the typical set. Because $x^n$ lies outside $A^n_\epsilon$ only with small probability, the conditional expectation is likely to capture the expectation of almost all cases. Nevertheless, if one wants to avoid the unnatural conditional expectation, Theorem 8 offers a more satisfactory bound. Note that the right side of (12) also approaches zero as $n \to \infty$.
We remark on the relationship of our results to the KL divergence $D^n(p, r)$. Because of (3) or (15), it may seem possible to obtain a risk bound in terms of the KL divergence. However, it is impossible, because taking $\lambda \to 1$ in (11) or $\alpha \to \pm 1$ in (19) makes the bounds diverge to infinity. That is, we cannot derive a bound for the risk measured by the KL divergence from BC theory, though we can for the Rényi divergence and the $\alpha$-divergence. This may sound somewhat strange, because the KL divergence seems to be the most closely related to the notion of the MDL principle, having a clear information-theoretic interpretation. This issue originates from the original BC theory and has stood as an open problem for a long time.
Finally, we remark that the effectiveness of our proposal in real situations depends on whether we can show the ǫ-risk validity of the target penalty and derive sufficiently small bounds for $\log(1/P^n_\epsilon)$ and $1 - P^n_\epsilon$. Actually, much effort is required to achieve this for lasso.

B. Risk Bound of Lasso in Random Design
In this section, we apply the approach of the previous section to lasso and derive new risk and regret bounds. In the setting of lasso, the training data obey $y_i = x_i^T \theta^* + \epsilon_i$, where $\theta^*$ is a true parameter and $\epsilon_i$ is Gaussian noise with zero mean and a known variance $\sigma^2$. By introducing $Y := (y_1, \cdots, y_n)^T$, $X := (x_{ij})$, and $E := (\epsilon_1, \cdots, \epsilon_n)^T$, we have a vector/matrix expression of the regression model $Y = X\theta^* + E$. The parameter space $\Theta$ is $\Re^p$. The dimension $p$ of the parameter $\theta$ can be greater than $n$. The lasso estimator is defined by (20), where $\mu_1$ is a positive real number (a regularization coefficient). Note that the weighted $\ell_1$ norm is used in (20), though the original lasso was defined with the usual $\ell_1$ norm in [1]. As explained in Section III, this estimator corresponds to the usual lasso with 'column normalization'. When $x^n$ is Gaussian with zero mean, we can derive an ǫ-risk valid weighted $\ell_1$ penalty by choosing an appropriate typical set.
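A minimal sketch of a lasso estimator with a weighted $\ell_1$ penalty, solved by coordinate descent in pure Python. The objective scaling ($1/(2n)$ on the squared error) and the unit weights are illustrative assumptions; the paper's estimator (20) uses its own data-dependent weights and normalization:

```python
import random

random.seed(2)

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the absolute value."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, mu, w, iters=200):
    """Coordinate descent for (1/(2n))||y - X theta||^2 + mu * sum_j w_j |theta_j|
    (a standard lasso scaling; the paper's exact normalization may differ)."""
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # residual with feature j's current contribution removed
            r = [y[i] - sum(X[i][k] * theta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            theta[j] = soft_threshold(rho, mu * w[j]) / z
    return theta

n, p = 100, 5
theta_star = [2.0, 0.0, -1.0, 0.0, 0.0]   # sparse truth
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(X[i][j] * theta_star[j] for j in range(p)) + random.gauss(0, 0.5)
     for i in range(n)]
theta_hat = lasso_cd(X, y, mu=0.1, w=[1.0] * p)
```

On this toy data the estimate recovers the two active coordinates and drives most null coordinates exactly to zero, which is the sparsity behavior the risk bounds below quantify.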
Lemma 1. For any $\epsilon \in (0, 1)$, define the typical set $A^n_\epsilon$ by (21), where $N(x|\mu, \Sigma)$ denotes a Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$. Here, $\Sigma_{jj}$ denotes the $j$th diagonal element of $\Sigma$ and $x_{ij}$ denotes the $j$th element of $x_i$. Assume the linear regression setting above with $x_i \sim N(x_i|0, \Sigma)$. Let $\beta$ be a real number in $(0, 1)$ and $\lambda$ be a real number in $(0, 1 - \beta]$. Then the penalty $\mu_1\|\theta\|_{w,1} + \mu_2$ is ǫ-risk valid for $(\lambda, \beta, \mathcal{P}^n_x, A^n_\epsilon)$ if $\mu_1$ and $\mu_2$ satisfy (22).

We describe its proof in Section V-F. The derivation is much more complicated and requires more techniques than the fixed design case in [19]. This is because the Rényi divergence is a usual mean squared error (MSE) in the fixed design case, while it is not in the random design case in general. In addition, it is important for the risk bound derivation to choose an appropriate typical set, in the sense that $P^n_\epsilon$ can be shown to approach one sufficiently fast and that the ǫ-risk validity of the target penalty can be shown with the chosen typical set. In the case of lasso with normal design, the typical set $A^n_\epsilon$ defined in (21) satisfies these properties.
Let us compare the coefficients of the ǫ-risk valid weighted $\ell_1$ penalty with those of the fixed design case in [19]. They showed that the weighted $\ell_1$ norm satisfying their condition is risk valid in the fixed design case. The condition for $\mu_2$ is the same, while the condition for $\mu_1$ in (22) is stricter than that of the fixed design case. We compare them by taking $\beta = 1 - \lambda$ (the tightest choice) and $\epsilon = 0$ in (22), because ǫ can be negligibly small for sufficiently large $n$. The minimum $\mu_1$ for risk validity in the random design case is a constant factor times that of the fixed design case. Hence, the smallest regularization coefficient $\mu_1$ for which the risk bound holds in the random design is always larger than that of the fixed design case for any $\lambda \in (0, 1)$, but not by much unless $\lambda$ is extremely close to 1 (see Fig. 1). Next, we show that $P^n_\epsilon$ approaches one exponentially fast as $n$ increases.
Lemma 2 (Exponential Bound of Typical Set). Suppose that $x_i \sim N(x_i|0, \Sigma)$ independently. For any $\epsilon \in (0, 1)$,

See Section V-H for its proof. In the lasso case, it is often postulated that $p$ is much greater than $n$. Due to Lemma 2, $\log(1/P^n_\epsilon)$ decays sufficiently fast, which implies that the second term in (11) can be negligibly small even if $n \ll p$. In this sense, the exponential bound is important for lasso. Combining Lemmas 1 and 2 with Theorems 7 and 8, we obtain the following theorem.
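The concentration behind Lemma 2 can be illustrated by a small simulation. The set used below, empirical second moments within a $(1 \pm \epsilon)$ band of the true variance, is a stand-in for the typical set (21), whose exact form is given in the paper:

```python
import random

random.seed(3)

def in_typical_set(xs, var, eps):
    """Empirical second moment within (1 +/- eps) of the true variance
    (a stand-in for the typical set (21))."""
    m2 = sum(x * x for x in xs) / len(xs)
    return (1 - eps) * var <= m2 <= (1 + eps) * var

def prob_typical(n, eps=0.2, var=1.0, trials=1000):
    """Monte Carlo estimate of P^n_eps = Pr(x^n in A^n_eps) for i.i.d. N(0, var)."""
    hits = 0
    for _ in range(trials):
        xs = [random.gauss(0, var ** 0.5) for _ in range(n)]
        hits += in_typical_set(xs, var, eps)
    return hits / trials

p_small, p_large = prob_typical(20), prob_typical(400)
```

Running this shows the probability climbing rapidly toward one as $n$ grows, consistent with the exponential bound that makes the $(1/\beta)\log(1/P^n_\epsilon)$ term negligible even when $n \ll p$.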
Since $x^n$ and $y^n$ are now i.i.d., $d^n_\lambda(p, r) = n\, d_\lambda(p, r)$. Hence, we presented the risk bound as a single-sample version in (25) by dividing both sides by $n$. Finally, we remark that the following interesting fact holds in the lasso case.

Lemma 3. Assume the linear regression setting of Lemma 1. If µ1 and µ2 satisfy (22), then the weighted ℓ1 penalty is codelength valid.

That is, the weighted ℓ1 penalties derived in Lemma 1 are not only ε-risk valid but also codelength valid. Its proof will be described in Section V-I. By this fact, the redundancy and regret interpretations of the main terms in (25) and (26) are justified. It also indicates that, by Corollary 1, we can obtain the unconditional risk bound with respect to the α-divergence for those weighted ℓ1 penalties without any additional condition.

V. PROOFS OF THEOREMS, LEMMAS AND COROLLARY
We give the proofs of all theorems, lemmas, and the corollary stated in the previous section.

A. Proof of Theorem 7
Here, we prove our main theorem. The proof proceeds along the same lines as [19], though some modifications are necessary.

Proof. Define
By the ε-risk validity, we obtain the first bound. The following fact is an extension of the key technique of BC theory: the first inequality holds because Pr(A ≥ 1) ≤ E[A] for any non-negative random variable A, and the second inequality holds by the monotonically increasing property of d^n_λ(p*, p_θ) in λ. Thus, the right side of (28) is bounded as follows. Hence, we have the important inequality (29). Applying Jensen's inequality to (29) and rearranging the terms of the resulting inequality, we obtain the statement.

B. Proof of Theorem 8
It is not necessary to start from scratch; we can reuse the proof of Theorem 7.

Proof. We can start from (29). For convenience, we define the following quantities. By Markov's inequality and (29), the displayed bound follows. Hence, we obtain the statement.

C. Proof of Corollary 1
The proof is obtained immediately from Theorem 7.

Proof. Let E^n_ε again denote the conditional expectation with respect to p*(x^n, y^n) given x^n ∈ A^n_ε. Let I_A(x^n) further denote the indicator function of a set A ⊂ X^n. The unconditional risk is bounded as follows. The first and second inequalities follow from the two properties of the α-divergence in (17) and (18), respectively. The third inequality follows from Theorem 7, because λ(α) ∈ (0, 1 − β) by the assumption. The last inequality holds for the following reason. By the decomposition of the expectation, and since p2(y^n|x^n) is a sub-probability distribution by the assumption, the conditional expectation part is non-negative. Therefore, removing the indicator function I_{A^n_ε}(x^n) cannot decrease this quantity. The final part of the statement follows from the fact that taking λ = 1 − β makes the bound in (11) tightest, by the monotonically increasing property of the Rényi divergence in λ.
Again, we remark that the sub-probability condition on p2(y^n|x^n) can be replaced with the sufficient condition that L(θ|x^n) be codelength valid. In addition, the sub-probability condition can be relaxed to a weaker condition stated in terms of a supremum.

D. Rényi Divergence and Its Derivatives
In this section and the next, we prove a series of lemmas that will be used to derive risk valid penalties for lasso. First, we show in Lemma 4 that the Rényi divergence can be expressed by defining a distribution p^λ_θ(x, y). Then, the explicit forms in the lasso setting are calculated in Lemma 5.

Lemma 4. Define a probability distribution p^λ_θ(x, y) by the displayed formula, where Z^λ_θ is a normalization constant. Then, the Rényi divergence and its first and second derivatives can be written as follows, where Var_p(A) denotes the covariance matrix of A with respect to p.

Proof. The normalizing constant is rewritten as follows. Thus, the Rényi divergence is written in the stated form.
Lemma 5. If we assume that p*(y|x) = N(y|x^T θ*, σ²) (i.e., the linear regression setting), then the Hessian involves Var_{q^λ_θ}(x x^T θ) (32). If we additionally assume that q*(x) = N(x|0, Σ) with a non-singular covariance matrix Σ, the explicit forms follow.

Proof. By completing squares, we can rewrite p^λ_θ(x, y) so that p^λ_θ(y|x) is N(y|x^T θ(λ), σ²). Integrating out y, we also obtain q^λ_θ(x). Since Σ is strictly positive definite by the assumption, Σ^{-1} + (1/c) θ θ^T is non-singular. Hence, by the inverse matrix formula (Lemma 8 in the Appendix), Σ^λ_θ is obtained, and therefore q^λ_θ(x) = N(x|0, Σ^λ_θ). The score function and Hessian of log p_θ(y|x) are computed next. Using (30), the first derivative is obtained. By (31) combined with (37), the Hessian of the Rényi divergence follows, where x_j denotes the jth element of x only here. Thus, we need all the fourth moments of q^λ_θ(x). Hereafter we write Σ^λ_θ as S to reduce notational complexity. By the formula for the moments of a Gaussian distribution, the required expectations are obtained; summarizing them in matrix form, Var_{q^λ_θ}(x x^T θ) is obtained. Using (38), the first and second terms of (39) are then calculated, and combining these completes the proof.
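The fourth-moment computation can be sanity-checked numerically. For x ∼ N(0, Σ), Isserlis' theorem gives the standard identity Var(x x^T θ) = (θ^T Σ θ) Σ + Σ θ θ^T Σ; the sketch below verifies this moment formula by Monte Carlo. The particular Σ and θ are arbitrary test values, and this illustrates the Gaussian moment identity only, not the lemma's full statement.

```python
import numpy as np

# Monte Carlo sanity check of the Gaussian fourth-moment identity
#   Var(x x^T theta) = (theta^T Sigma theta) Sigma + Sigma theta theta^T Sigma
# for x ~ N(0, Sigma) (an Isserlis-type identity).  Sigma and theta are
# arbitrary test values.
rng = np.random.default_rng(0)
p = 3
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)               # a non-singular covariance matrix
theta = rng.standard_normal(p)

L = np.linalg.cholesky(Sigma)
x = rng.standard_normal((200_000, p)) @ L.T   # samples from N(0, Sigma)
y = x * (x @ theta)[:, None]                  # rows are x_i (x_i^T theta)

emp = np.cov(y, rowvar=False)                 # empirical Var(x x^T theta)
thy = (theta @ Sigma @ theta) * Sigma + np.outer(Sigma @ theta, Sigma @ theta)
print(np.max(np.abs(emp - thy)) / np.max(np.abs(thy)))  # small relative error
```

With 200,000 samples the empirical and theoretical covariance matrices agree to within a few percent.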

E. Upper Bound of Negative Hessian
Using Lemma 5 in Section V-D, we show that the negative Hessian of the Rényi divergence is bounded from above.

Lemma 6. Assume that q*(x) = N(x|0, Σ) and p*(y|x) = N(y|x^T θ*, σ²), where Σ is non-singular. For any θ and θ*, the displayed bound holds, where A ⪯ B means that B − A is positive semi-definite.
Proof. By Lemma 5, we have

For any nonzero vector, the displayed bound holds by the Cauchy–Schwarz inequality. Hence, we have the stated bound, and thus the quantity is bounded by f(t) for t ≥ 0. Checking the properties of f(t), we obtain the statement.

F. Proof of Lemma 1
We are now ready to derive ε-risk valid weighted ℓ1 penalties.
Proof. Similarly to the rewriting from (8) to (10), we can rewrite the condition for ε-risk validity as (41). We again write the inner part of the minimum in (41) as H(θ̃, θ, x^n, y^n). As described in Section III, direct minimization of H(θ̃, θ, x^n, y^n) seems to be difficult. Instead of evaluating the minimum explicitly, we borrow a nice randomization technique introduced in [19], with some modifications. The key idea is to evaluate not min_θ̃ H(θ̃, θ, x^n, y^n) directly but its expectation E_θ̃[H(θ̃, θ, x^n, y^n)] with respect to a dexterously randomized θ̃, because the expectation is at least the minimum. Let us define the quantized parameter set Θ̃(q*), where δ > 0 is a quantization width and Z is the set of all integers. Though Θ̃ depends on x^n in the fixed design case [19], we must remove this dependency to satisfy the ε-risk validity above. For each θ, θ̃ is randomized coordinate-wise, where m_j := w*_j θ_j / δ and the components of θ̃ are statistically independent of each other. Its important properties are unbiasedness, E_θ̃[θ̃] = θ, and the accompanying control of E_θ̃[|θ̃|], where |θ̃| denotes the vector whose jth component is the absolute value of θ̃_j, and similarly for |θ|. Using these, we can bound E_θ̃[H(θ̃, θ, x^n, y^n)] as follows. The loss variation part in (41) is the main concern, because it is more complicated than the squared error of the fixed design case. Consider the Taylor expansion of the loss variation, where W := diag(w_1, w_2, ..., w_p). We define a codelength function C(z) := ||z||_1 log(4p) + log 2 over Z^p; note that C(z) satisfies Kraft's inequality. We then define a codelength function on Θ̃(q*) as in (47). By this definition, L̃ satisfies the β-stronger Kraft inequality and does not depend on x^n, though it depends on q*(x) through W*. Taking the expectation with respect to θ̃ and using x^n ∈ A^n_ε, we can bound E_θ̃[H(θ̃, θ, x^n, y^n)] by the data-dependent weighted ℓ1 norm ||θ||_{w,1}. Because this holds for any δ > 0, we can minimize the upper bound with respect to δ, which completes the proof.
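The randomization can be illustrated concretely. The sketch below implements generic unbiased randomized rounding onto the grid δZ: each coordinate is rounded down or up with probabilities chosen so that E[θ̃] = θ, which is the property the proof exploits when replacing the minimum by an expectation. The weights w*_j are set to 1 for simplicity, so this is an illustrative simplification of the technique in [19], not the paper's exact scheme.

```python
import numpy as np

# Sketch of unbiased randomized quantization: theta_j is rounded to the
# grid delta*Z, down with probability 1 - frac and up with probability
# frac, where frac is the fractional part of theta_j/delta.  This makes
# E[theta_tilde] = theta.  Weights w*_j are taken to be 1 here (an
# illustrative assumption, not the paper's exact scheme).
def randomized_round(theta, delta, rng):
    m = theta / delta                  # m_j = theta_j / delta  (w*_j = 1)
    lo = np.floor(m)
    frac = m - lo                      # probability of rounding up
    up = rng.random(theta.shape) < frac
    return delta * (lo + up)

rng = np.random.default_rng(1)
theta = np.array([0.37, -1.24, 2.5])
samples = np.stack([randomized_round(theta, 0.5, rng) for _ in range(100_000)])
print(samples.mean(axis=0))   # close to theta: the quantization is unbiased
```

Averaging many randomized roundings recovers θ to within Monte Carlo error, which is exactly why E_θ̃[H(θ̃, θ, x^n, y^n)] can stand in for the intractable minimum.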

G. Some Remarks on the Proof of Lemma 1
The main difference of the proof from the fixed design case lies in the loss variation part. In the fixed design case, the Rényi divergence d_λ(p*, p_θ|x^n) is convex in θ. When the Rényi divergence is convex, the negative Hessian is negative semi-definite for all θ; hence, the loss variation part is trivially bounded above by zero. In contrast, d_λ(p*, p_θ) is not convex in θ. This can be seen intuitively by deriving the explicit form of d_λ(p*, p_θ) instead of checking the positive semi-definiteness of its Hessian. From (35), we have the displayed expression, where I_p is the identity matrix of dimension p. Prof. A. R. Barron suggested in a private discussion that Z^λ_θ can be simplified further as follows. Let Q := [q_1, q_2, ..., q_p] be an orthogonal matrix such that q_1 := θ'/||θ'||_2. Using this, the resulting Z^λ_θ is obtained, and thus we have a simple expression for the Rényi divergence. From this form, we can easily see that the Rényi divergence is not convex. When the Rényi divergence is non-convex, it is unclear in general whether and how the loss variation part can be bounded above. This is one of the main reasons why the derivation is more difficult than in the fixed design case.
We also mention an alternative proof of Lemma 1 based on (49). We provided Lemma 4 in order to calculate the Hessian of the Rényi divergence. However, the simple expression of the Rényi divergence above is somewhat easier to differentiate, while the expression based on (48) is rather hard to differentiate. Therefore, in our Gaussian setting, we can differentiate the above Rényi divergence twice directly to obtain the Hessian, instead of using Lemma 5. However, there is no guarantee that such a simplification is always possible in a general setting. In our proof, we tried to give a systematic approach that is applicable to other settings to some extent. Suppose, for example, that we aim to derive ε-risk valid ℓ1 penalties for lasso when q*(x) is non-Gaussian. By (32) in Lemma 5, it suffices to bound Var_{q^λ_θ}(x x^T θ) in the sense of positive semi-definiteness, because −E_{q^λ_θ}[x x^T] is negative semi-definite. In general, which is better, direct differentiation or using (32), seemingly depends on the situation. In our Gaussian setting, we imagine that the easiest way for most readers to calculate the Hessian is to compute the first derivative by formula (30) and then differentiate it directly, though this depends on the reader's background. For other settings, we believe that Lemmas 4 and 5 would be useful in some cases.

H. Proof of Lemma 2
Here, we show that x^n falls outside A^n_ε with probability exponentially small in n.
Proof. The typical set A^n_ε can be decomposed covariate-wise as a direct product over j, where x_j := (x_1j, x_2j, ..., x_nj)^T and Π denotes a direct product of sets. From its definition, w_j² follows a Gamma distribution Ga(n/2, 2s/n) when x_j ∼ Π^n_{i=1} N(x_j|0, (w*_j)²). We write w_j² as z and (w*_j)² as s (the index j is dropped for legibility). We rewrite the Gamma distribution g(z; s) in exponential family form: ν is a natural parameter and z is a sufficient statistic, so that the expectation parameter η(s) is E_{g(z;s)}[z]. The relationships between the variance parameter s and the natural/expectation parameters are summarized in the display. For exponential families, there is a useful Sanov-type inequality (Lemma 7 in the Appendix). Using this lemma, we can bound Pr(x_j ∉ A^n_ε(j)) as follows. For this purpose, it suffices to bound the probability of the event that w_j² deviates from s by more than a factor of ε, where D is the single-sample version of the KL-divergence defined by (2). It is easy to see that ε − log(1 + ε) ≤ −ε − log(1 − ε) for any 0 < ε < 1. By Lemma 7, we obtain the exponential bound, and hence P^n_ε can be bounded from below as displayed. The last inequality follows from (1 − t)^p ≥ 1 − pt for any t ∈ [0, 1] and p ≥ 1. To simplify the bound further, consider the maximum positive real number a such that the required inequality holds for any ε ∈ (0, 1). The maximum integer a1 such that (1 − log 2)/2 ≥ 1/a1 is 7, which gives the last inequality in the statement.
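The distributional fact at the start of the proof is easy to check numerically: if x_1, ..., x_n ∼ N(0, s) i.i.d., then z = (1/n) Σ_i x_i² follows Ga(n/2, 2s/n) (shape n/2, scale 2s/n), so E[z] = s and Var[z] = 2s²/n. The values of n and s below are arbitrary test choices.

```python
import numpy as np

# Check: for x_1,...,x_n ~ N(0, s) i.i.d., z = (1/n) sum x_i^2 follows
# Ga(n/2, 2s/n), hence E[z] = s and Var[z] = 2 s^2 / n.
# n, s, and the trial count are arbitrary test values.
rng = np.random.default_rng(2)
n, s, trials = 50, 1.7, 100_000
x = rng.normal(0.0, np.sqrt(s), size=(trials, n))
z = (x ** 2).mean(axis=1)

print(z.mean())   # ~ s
print(z.var())    # ~ 2 s^2 / n

# Compare a quantile with direct Gamma samples of the claimed shape/scale.
g = rng.gamma(shape=n / 2, scale=2 * s / n, size=trials)
print(abs(np.quantile(z, 0.9) - np.quantile(g, 0.9)))  # small
```

The empirical mean, variance, and quantiles of z agree with the Ga(n/2, 2s/n) law, which is the starting point of the Sanov-type bound above.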

I. Proof of Lemma 3
We can prove this lemma by checking the proof of Lemma 1.

Proof. Similarly to the rewriting from (9) to (10), we can restate the codelength validity condition for L1(θ|x^n) as: there exist a quantized subset Θ̃(x^n) and a model description length L̃(θ̃|x^n) satisfying the usual Kraft inequality such that the displayed condition holds. Recall that (22) is a sufficient condition for the ε-risk validity of L1; in fact, it was derived as a sufficient condition for the proposition that L1(θ|x^n) bounds (51) from above for any q* ∈ P^n_x, v^n ∈ A^n_ε, y^n ∈ Y^n, and θ ∈ Θ, where θ̃ was randomized on Θ̃(q*) and (Θ̃(q*), L̃(θ̃|q*)) were defined by (42) and (47); in particular, L̃(θ̃|q*) satisfies the β-stronger Kraft inequality. Recall also that H(θ̃, θ, x^n, y^n) is the inner part of the minimum in (41). Here, we use v^n instead of x^n to distinguish it from the fixed x^n above. To derive the sufficient condition, we obtained upper bounds on the terms (i) and (ii) of (51) respectively, and showed that L1(θ|v^n) with v^n ∈ A^n_ε is no less than the sum of both upper bounds when (22) is satisfied. A key point is that the upper bound we derived on the term (i) is a non-negative function of θ̃ (see (46)). Hence, if v^n ∈ A^n_ε and (22) hold, L1(θ|v^n) is an upper bound on the term (ii), which in turn is bounded below as required. Now, assume (22) and, given x^n, take q* ∈ P^n_x such that Σ_jj equals (1/n) Σ^n_{i=1} x²_ij for all j. Then x^n ∈ A^n_ε holds. Since q* is determined by x^n and L̃(θ̃|q*) satisfies Kraft's inequality, the codelength validity condition holds for L1.

VI. NUMERICAL SIMULATIONS
We investigate the behavior of the regret bound (26). In the regret bound, we take β = 1 − λ, with which the bound becomes tightest. Furthermore, µ1 and µ2 are taken as their smallest values satisfying (22). As described before, we cannot obtain an exact bound for the KL divergence, which gives the most familiar loss function, the mean square error (MSE), in this setting. This is because the regret bound diverges to infinity as λ → 1 unless n is accordingly large. That is, we can obtain only an approximate evaluation of the MSE, whose precision depends on the sample size n. We therefore do not employ the MSE here but another well-known loss function, the squared Hellinger distance d²_H (for a single sample). The Hellinger distance was defined in (16) in its n-sample version (i.e., d²_H = d^{2,1}_H). We can obtain a regret bound for d²_H(p*, p_θ) by (26), because twice the squared Hellinger distance, 2d²_H, is bounded by the Bhattacharyya divergence (d_{0.5}) in (4) through the relationship (18). We set n = 200, p = 1000 and Σ = I_p to mimic a typical situation of sparse learning. The lasso estimator is calculated by a proximal gradient method [28]. To make the regret bound tight, we take τ = 0.03, which is close to zero compared to the main term (the regret). For this τ, Fig. 2 shows the plot of (27) against ε. We should choose the smallest ε for which the regret bound holds with large probability. Our choice is ε = 0.5, at which the value of (27) is 0.81. We show the results of two cases in Figs. 3-5. These plots show the values of d_{0.5}, 2d²_H, and the regret bound obtained over a hundred repetitions with different signal-to-noise ratios (SNR) E_{q*}[(x^T θ*)²]/σ² (that is, different σ²). From these figures and other experiments, we observed that 2d²_H almost always equaled d_{0.5} (the curves almost overlapped). As the SNR grew larger, the regret bound became looser; for example, it was about six times larger than 2d²_H when the SNR was 10. One of the reasons is that the ε-risk validity condition is too strict to bound the loss function when the SNR is high. Hence, a possible way to improve the risk bound is to restrict the parameter space Θ used in the ε-risk validity to a range of θ that is expected to be considerably narrower than Θ due to the high SNR. In contrast, the regret bound is tight when the SNR is 0.5, as in Fig. 5. Finally, we remark that the regret bound dominated the Rényi divergence over all trials, though the regret bound is probabilistic. One of the reasons is the looseness of the lower bound (27) on the probability with which the regret bound holds. This suggests that ε could be reduced further if a tighter bound were derived.
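For concreteness, here is a minimal proximal gradient (ISTA) sketch of the solver family used in the simulations ([28] in the text); the objective convention (1/(2n))||y − Xβ||² + µ||β||_1, the step size rule, iteration count, and problem sizes are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

# Minimal proximal gradient (ISTA) sketch for lasso.  Objective (one
# common convention, assumed here):
#   (1/(2n)) ||y - X beta||_2^2 + mu * ||beta||_1.
# Step size, iteration count, and data sizes are illustrative choices.
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, mu, n_iter=2000):
    n, p = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n            # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * mu)
    return beta

# Tiny sparse demo; n = 200, p = 1000 mimics the paper's setting.
rng = np.random.default_rng(4)
n, p = 200, 1000
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:5] = 3.0
y = X @ beta_star + rng.normal(0.0, 1.0, n)
beta_hat = lasso_ista(X, y, mu=0.2)
print(np.count_nonzero(np.abs(beta_hat) > 1e-6))  # a sparse solution
```

Plain ISTA is used for simplicity; an accelerated variant (FISTA) or coordinate descent would serve the same purpose in these experiments.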

VII. CONCLUSION
We proposed a way to extend the original BC theory to supervised learning by using a typical set. Similarly to the original BC theory, our extension gives a mathematical justification of the MDL principle for supervised learning. As an application, we derived new risk and regret bounds for lasso. The derived bounds retain various advantages of the original BC theory; in particular, they require remarkably few assumptions. Our next challenges are applying our proposal to non-normal settings for lasso and to other machine learning methods.

APPENDIX

SANOV-TYPE INEQUALITY
The following lemma is a special case of a result in [29]. Below, we give a simpler proof. In the lemma, X denotes a one-dimensional random variable and x its corresponding one-dimensional value.

Lemma 7. Let x ∼ p_θ(x) := exp(θx − ψ(θ)), where x and θ are one-dimensional. Then the displayed bound holds, where η is the expectation parameter corresponding to the natural parameter θ, and similarly for η'. The symbol D denotes the single-sample version of the KL-divergence defined by (2).
Proof. Assume η′ − η ≥ 0. Because of the monotonicity between the natural and expectation parameters of an exponential family, θ′ − θ ≥ 0. By Markov's inequality, we have the stated bound. The other inequality can be proved in the same way. See, for example, Corollary 1.7.2 in [30] for its proof.
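The Markov (Chernoff) step can be written out explicitly; the following is our reconstruction under the exponential family notation above, with s := θ′ − θ ≥ 0:

```latex
\Pr(X \ge \eta')
  \;\le\; e^{-s\eta'}\,\mathbb{E}\!\left[e^{sX}\right]
  \;=\; \exp\!\bigl(-\bigl[\,s\eta' - \psi(\theta+s) + \psi(\theta)\,\bigr]\bigr),
```

since E[e^{sX}] = exp(ψ(θ+s) − ψ(θ)) for the exponential family. Substituting s = θ′ − θ makes the exponent (θ′ − θ)η′ − ψ(θ′) + ψ(θ) = D(η′ ‖ η), the exponential family expression of the KL-divergence, which gives Pr(X ≥ η′) ≤ e^{−D(η′‖η)}.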

INVERSE MATRIX FORMULA

Lemma 8. Let A be a non-singular m × m matrix. If c and d are both m × 1 vectors and A + cd^T is non-singular, then (A + cd^T)^{-1} = A^{-1} − A^{-1} c d^T A^{-1} / (1 + d^T A^{-1} c).
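Lemma 8 is the Sherman–Morrison identity; it can be checked numerically in a few lines. A, c, and d below are arbitrary test values chosen so that both A and A + cd^T are non-singular.

```python
import numpy as np

# Numerical check of the inverse matrix formula (Lemma 8, the
# Sherman-Morrison identity):
#   (A + c d^T)^{-1} = A^{-1} - A^{-1} c d^T A^{-1} / (1 + d^T A^{-1} c).
# A, c, d are arbitrary test values with A and A + c d^T non-singular.
A = np.diag([2.0, 3.0, 4.0, 5.0])
c = np.array([1.0, 0.0, 1.0, 0.0])
d = np.array([0.5, 1.0, 0.0, 0.0])

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(c, d))
rhs = Ainv - (Ainv @ np.outer(c, d) @ Ainv) / (1.0 + d @ Ainv @ c)
print(np.max(np.abs(lhs - rhs)))   # essentially zero (floating point)
```

The two sides agree to machine precision, as the identity predicts.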