Relaxed Lasso

https://doi.org/10.1016/j.csda.2006.12.019

Abstract

The Lasso is an attractive regularisation method for high-dimensional regression. It combines variable selection with an efficient computational procedure. However, the rate of convergence of the Lasso is slow for some sparse high-dimensional data, where the number of predictor variables is growing fast with the number of observations. Moreover, many noise variables are selected if the estimator is chosen by cross-validation. It is shown that the conflicting demands of an efficient computational procedure and fast convergence rates of the $\ell_2$-loss can be overcome by a two-stage procedure, termed the relaxed Lasso. For orthogonal designs, the relaxed Lasso provides a continuum of solutions that include both soft- and hard-thresholding of estimators. The relaxed Lasso solutions include all regular Lasso solutions, and computing all relaxed Lasso solutions is often no more expensive than computing all regular Lasso solutions. Theoretical and numerical results demonstrate that the relaxed Lasso produces sparser models with equal or lower prediction loss than the regular Lasso estimator for high-dimensional data.

Keywords

High dimensionality
Bridge estimation
Lasso
$\ell_q$-norm penalisation
Dimensionality reduction

1. Introduction

The current work is motivated by linear prediction for high-dimensional data, where the number of predictor variables p is very large, possibly very much larger than the number of observations n (e.g. van de Geer and van Houwelingen, 2004). Regularisation is clearly of central importance for these high-dimensional problems.

There are many criteria to consider when choosing an appropriate regularisation method. First, not all regularisation procedures are adequate for the high-dimensional case. The non-negative Garotte (Breiman, 1995) is for example a promising regularisation method. However, it is not suited for the case p>n as it requires computation of the OLS-estimator, which is unavailable in this case. An important criterion in the presence of many predictor variables is the computational complexity of the procedure. Many regularisation procedures with otherwise attractive features involve, unfortunately, minimisation of a non-convex function (e.g. Fan and Li, 2001, Tsybakov and van de Geer, 2005). For high-dimensional problems, it is in general very costly to find an (approximate) solution in this case, due to the presence of local minima in the objective function.

For Bridge estimators, which were proposed in Frank and Friedman (1993), we study in the following the tradeoff between computational complexity on the one hand and (asymptotic) properties of the estimators on the other hand. Let $X=(X_1,\ldots,X_p)$ be a p-dimensional predictor variable and Y a response variable of interest. For n independent observations $(Y_i,X_i)$, $i=1,\ldots,n$, of (Y,X), Bridge estimators are defined for $\lambda,\gamma\in[0,\infty)$ as
$$\hat\beta^{\lambda,\gamma} = \arg\min_\beta\; n^{-1}\sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \lambda\|\beta\|_\gamma, \qquad (1)$$
where $\|\beta\|_\gamma = \sum_{k\in\{1,\ldots,p\}} |\beta_k|^\gamma$ is the $\ell_\gamma$-norm of the vector of coefficients, and γ is typically in the range [0,2].

For γ=0, Bridge estimation corresponds to ordinary model selection. Ridge regression is obtained for γ=2, while γ=1 is equivalent to the Lasso proposed in Tibshirani (1996). Computation of the estimator (1) involves minimisation of a non-convex function if γ<1, while the function is convex for γ≥1. Since optimisation of a non-convex function in a high-dimensional setting is very difficult, Bridge estimation with γ≥1 is an attractive choice. However, for values of γ>1, the shrinkage of estimators towards zero increases with the magnitude of the parameter being estimated (Knight and Fu, 2000). For the Lasso (γ=1), the shrinkage is constant irrespective of the magnitude of the parameter being estimated (at least for orthogonal designs, where regularisation with the Lasso is equivalent to soft-thresholding of the estimates). It was recognised in Fan and Li (2001) that this leads to undesirable properties (in terms of prediction) of the resulting estimator. It was first suggested by Huber (1973) to examine the asymptotic properties for a growing number $p=p_n$ of predictor variables as a function of the number of observations n; see as well Fan and Peng (2004). It will be shown below that the shrinkage of the Lasso leads to a low convergence rate of the $\ell_2$-loss for high-dimensional problems where the number of parameters $p=p_n$ grows almost exponentially fast with n, so that $p_n\gg n$.

For γ<1, the shrinkage of estimates decreases with increasing magnitude of the parameter being estimated and faster convergence rates can thus in general be achieved (see e.g. Knight and Fu, 2000 and, for classification, Tsybakov and van de Geer, 2005). However, the fact remains that for γ<1 a non-convex optimisation problem has to be solved.

There is no value of γ for which an entirely satisfactory compromise is achieved between low computational complexity on the one hand and fast convergence rates on the other hand. In this paper, it is shown that a two-stage procedure, termed relaxed Lasso, can work around this problem. The method has low computational complexity (the computational burden is often identical to that of an ordinary Lasso solution) and, unlike the Lasso, convergence rates are fast, irrespective of the growth rate of the number of predictor variables. Moreover, relaxed Lasso leads to consistent variable selection under a prediction-optimal choice of the penalty parameters, which does not hold true for ordinary Lasso solutions in a high-dimensional setting.

2. Relaxed Lasso

We define relaxed Lasso estimation and illustrate the properties of the relaxed Lasso estimators for an orthogonal design. A two-stage algorithm for computing the relaxed Lasso estimator is then proposed, followed by a few remarks about extending the procedure to generalised linear models (McCullagh and Nelder, 1989).

Recall that the Lasso estimator under squared error loss is defined in Tibshirani (1996) for $\lambda\in[0,\infty)$ as
$$\hat\beta^{\lambda} = \arg\min_\beta\; n^{-1}\sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \lambda\|\beta\|_1. \qquad (2)$$
The Lasso estimator is a special case of the Bridge estimator (1), obtained by setting γ=1. The set of predictor variables selected by the Lasso estimator $\hat\beta^\lambda$ is denoted by $\mathcal M_\lambda$,
$$\mathcal M_\lambda = \{1\le k\le p \mid \hat\beta_k^\lambda \ne 0\}. \qquad (3)$$
For sufficiently large penalties λ (e.g. for $\lambda > 2\max_k |n^{-1}\sum_{i=1}^n Y_iX_{ik}|$), the selected model is the empty set, $\mathcal M_\lambda=\emptyset$, as all components of the estimator (2) are identical to zero. In the absence of an $\ell_1$-penalty, and if the number of variables p is smaller than the number of observations n, all predictor variables are in general selected, so that $\mathcal M_0=\{1,\ldots,p\}$ in this case.

The $\ell_1$-penalty for the ordinary Lasso estimator (2) has two effects, model selection and shrinkage estimation. On the one hand, a certain set of coefficients is set to zero and hence excluded from the selected model. On the other hand, for all variables in the selected model $\mathcal M_\lambda$, coefficients are shrunken towards zero compared to the least-squares solution. These two effects are clearly related and can be best understood in the context of an orthogonal design as soft-thresholding of the coefficients. Nevertheless, it is not immediately obvious whether it is indeed optimal to control these two effects, model selection on the one hand and shrinkage estimation on the other hand, by a single parameter only. As an example, it might be desirable in some situations to estimate the coefficients of all selected variables without shrinkage, corresponding to a hard-thresholding of the coefficients.

As a generalisation of both soft- and hard-thresholding, we control model selection and shrinkage estimation by two separate parameters λ and φ with the relaxed Lasso estimator.

Definition 1

The relaxed Lasso estimator is defined for $\lambda\in[0,\infty)$ and $\phi\in(0,1]$ as
$$\hat\beta^{\lambda,\phi} = \arg\min_\beta\; n^{-1}\sum_{i=1}^n \Bigl(Y_i - X_i^T\{\beta\cdot\mathbf 1_{\mathcal M_\lambda}\}\Bigr)^2 + \phi\lambda\|\beta\|_1, \qquad (4)$$
where $\mathbf 1_{\mathcal M_\lambda}$ is the indicator function on the set of variables $\mathcal M_\lambda\subseteq\{1,\ldots,p\}$, so that for all $k\in\{1,\ldots,p\}$,
$$\{\beta\cdot\mathbf 1_{\mathcal M_\lambda}\}_k = \begin{cases} 0, & k\notin\mathcal M_\lambda,\\ \beta_k, & k\in\mathcal M_\lambda.\end{cases}$$

Note that only predictor variables in the set $\mathcal M_\lambda$ are considered for the relaxed Lasso estimator. The parameter λ thus controls the variable selection part, as in ordinary Lasso estimation. The relaxation parameter φ, on the other hand, controls the shrinkage of coefficients. If φ=1, the Lasso and relaxed Lasso estimators are identical. For φ<1, the shrinkage of coefficients in the selected model is reduced compared to ordinary Lasso estimation. The case φ=0 needs special consideration, as the definition above would produce a degenerate solution. In the following, we define the relaxed Lasso estimator for φ=0 as the limit of the above definition for φ→0. In this case, all coefficients in the model $\mathcal M_\lambda$ are estimated by the OLS-solution. This estimator (for φ=0) was already proposed in Efron et al. (2004) as the Lars–OLS hybrid, “using Lars to find the model but not to estimate the coefficients” (Efron et al., 2004). The reduction of the sum of squared residuals of this hybrid method over the ordinary Lasso estimator was found to be small for the studied data set, which contained only 10 predictor variables.

We will show further below that the gains of relaxed Lasso estimation (adaptive φ) over ordinary Lasso estimation (φ=1) can be very large. Moreover, the relaxed Lasso produces in most cases better results than the Lars–OLS hybrid (φ=0), as the relaxed Lasso can adapt the amount of shrinkage to the structure of the underlying data.

An algorithm is developed to compute the exact solutions of the relaxed Lasso estimator. The parameters λ and φ can then be chosen e.g. by cross-validation. The algorithm is based on the Lars-algorithm by Efron et al. (2004). As the relaxed Lasso estimator is parameterised by two parameters, a two-dimensional manifold has to be covered to find all solutions. The computational burden of computing all relaxed Lasso estimators is in the worst case identical to that of the Lars–OLS hybrid and in the best case identical to that of the Lars-algorithm. The method is thus very well suited for high-dimensional problems.

2.1. Orthogonal design

To illustrate the properties of the relaxed Lasso estimator, it is instructive to consider an orthogonal design. The shrinkage of various regularisation methods is shown in Fig. 1 for this case. The set of solutions of the relaxed Lasso estimator is given for all $k=1,\ldots,p$ by
$$\hat\beta_k^{\lambda,\phi} = \begin{cases} \hat\beta_k^0 - \phi\lambda, & \hat\beta_k^0 > \lambda,\\ 0, & |\hat\beta_k^0| \le \lambda,\\ \hat\beta_k^0 + \phi\lambda, & \hat\beta_k^0 < -\lambda,\end{cases}$$
where $\hat\beta^0$ is the OLS-solution. For φ=0, hard-thresholding is achieved, while φ=1 results, as mentioned above, in soft-thresholding, which corresponds to the Lasso solution. The relaxed Lasso hence provides a continuum of solutions that includes soft- and hard-thresholding, much like the set of solutions provided by the Bridge estimators (1) when varying γ in the range [0,1]. It can be seen in Fig. 1 that the solutions of the Bridge estimators and the relaxed Lasso solutions are indeed very similar.
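For concreteness, the thresholding rule above can be written as a one-line function. The following is a minimal NumPy sketch; the function name and the example values are illustrative and not taken from the paper.

```python
import numpy as np

def relaxed_lasso_orthogonal(beta_ols, lam, phi):
    """Relaxed Lasso for an orthogonal design: components whose OLS estimate
    exceeds lam in absolute value are kept and shrunken by phi*lam; all
    other components are set to zero. phi=1 gives soft-, phi=0 hard-thresholding."""
    beta_ols = np.asarray(beta_ols, dtype=float)
    keep = np.abs(beta_ols) > lam
    return np.where(keep, beta_ols - phi * lam * np.sign(beta_ols), 0.0)

# Example with lam = 1 and phi = 0.5:
print(relaxed_lasso_orthogonal([2.5, 0.8, -1.7], lam=1.0, phi=0.5))  # [ 2.   0.  -1.2]
```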


Fig. 1. Comparison of shrinkage estimators as a function of the OLS-estimator $\hat\beta^0$. Shown are estimators for soft-thresholding (top left), hard-thresholding (top right), the estimator $\hat\beta^{\lambda,\gamma}$ of Eq. (1) for γ=0, 0.5, 0.9 and 1 (bottom left), and the relaxed Lasso estimators $\hat\beta^{\lambda,\phi}$ for φ=0, 1/3, 2/3 and 1 (bottom right).

2.2. Algorithm

The main advantage of the relaxed Lasso estimator over Bridge estimation is the low computational complexity. We propose in the following a naive, easy-to-implement algorithm for computing relaxed Lasso estimators as in (4). Based on some more insight, a modification is proposed further below so that, for many data sets, the computational effort of computing all relaxed Lasso solutions is identical to that of computing the ordinary Lasso solutions.

Simple Algorithm.

Step 1: Compute all ordinary Lasso solutions, e.g. with the Lars-algorithm of Efron et al. (2004) under the Lasso modification. Let $\mathcal M_1,\ldots,\mathcal M_m$ be the resulting set of m models. Let $\lambda_1 > \cdots > \lambda_m = 0$ be a sequence of penalty values so that $\mathcal M_\lambda = \mathcal M_k$ if and only if $\lambda\in\Lambda_k \equiv (\lambda_k,\lambda_{k-1}]$, where $\lambda_0 \equiv \infty$. (The models are not necessarily distinct, so it is always possible to obtain such a sequence of penalty parameters.)

Step 2: For each $k=1,\ldots,m$, compute all Lasso solutions on the set $\mathcal M_k$ of variables, varying the penalty parameter between 0 and $\lambda_k$. The obtained set of solutions is identical to the set of relaxed Lasso solutions $\hat\beta^{\lambda,\phi}$ for $\lambda\in\Lambda_k$. The relaxed Lasso solutions for all penalty parameters are given by the union of these sets.

It is obvious that this algorithm produces all relaxed Lasso solutions, for all values of the penalty parameters $\phi\in[0,1]$ and λ>0. The computational complexity of this algorithm is identical to that of the Lars–OLS hybrid, as the Lars iterations in Step 2 are about as computationally intensive as ordinary least-squares estimation (Efron et al., 2004).
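The simple algorithm can be sketched with scikit-learn's Lars/Lasso routines. The sketch below is illustrative only: the function name is ours, the penalty scaling in scikit-learn differs from the definition in (2) by constant factors, and solutions are returned on a fixed grid of φ values rather than for the whole continuum.

```python
import numpy as np
from sklearn.linear_model import Lasso, lars_path

def relaxed_lasso_solutions(X, y, phis=(0.0, 1/3, 2/3, 1.0)):
    # Step 1: all ordinary Lasso solutions (breakpoints of the Lasso path).
    lams, _, coefs = lars_path(X, y, method='lasso')
    solutions = []
    for k, lam in enumerate(lams):
        active = np.flatnonzero(coefs[:, k])       # selected model at penalty lam
        if active.size == 0 or lam == 0:
            continue
        # Step 2: Lasso on the selected variables only, penalty relaxed to phi*lam.
        for phi in phis:
            beta = np.zeros(X.shape[1])
            if phi == 0.0:                         # phi = 0: OLS on the selected set
                sol, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
                beta[active] = sol
            else:
                fit = Lasso(alpha=phi * lam, fit_intercept=False, max_iter=10000)
                beta[active] = fit.fit(X[:, active], y).coef_
            solutions.append((lam, phi, beta))
    return solutions
```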

However, this naive algorithm is not optimal in general. The computation of the ordinary Lasso solutions contains information that can be exploited in the second stage, when finding Lasso solutions for all subsets $\mathcal M_k$, $k=1,\ldots,m$, of variables. Fig. 2 serves as an illustration. The “direction” in which relaxed Lasso solutions are found is identical to the direction of the ordinary Lasso solutions. These directions do not have to be computed again. Indeed, by extrapolating the path of the ordinary Lasso solutions, all relaxed Lasso solutions can often be found. There is an important caveat: extrapolated Lasso solutions are valid relaxed Lasso solutions if and only if the extrapolations do not cross the value zero. This is e.g. fulfilled if the ordinary Lasso estimators are monotonically increasing in absolute value for a decreasing penalty parameter λ. If, however, the extrapolations do cross zero for a set $\mathcal M_k$, then the Lasso has to be computed again explicitly for this set, using e.g. again the Lars-algorithm of Efron et al. (2004).


Fig. 2. Left: path of the estimators $\hat\beta^\lambda$ for a data set with three variables. For large values of λ, all components are equal to zero. In the range $\lambda\in(0.45,0.75]$, only the first component is non-zero. Middle: the relaxed Lasso solutions if λ is in the range $(0.45,0.75]$. The direction in which the relaxed Lasso solutions are found is the same as that of the ordinary Lasso solutions. The relaxed Lasso solution for φ=0 corresponds to the OLS-solution. Right: likewise, relaxed Lasso solutions for the range $\lambda\in(0.2,0.45]$ are found by extrapolating the Lasso solutions. Again, the solution for φ=0 corresponds to the OLS-solution for the two variables selected by the Lasso estimator.

Refined Algorithm.

Step 1: Identical to Step 1 of the simple algorithm. Compute all ordinary Lasso solutions.

Step 2: For each $k=1,\ldots,m$, let $\delta^{(k)} = (\hat\beta^{\lambda_k} - \hat\beta^{\lambda_{k-1}})/(\lambda_{k-1}-\lambda_k)$. This is the direction in which the ordinary Lasso solutions move on the interval $\Lambda_k$ and it is hence known from Step 1. Let $\tilde\beta = \hat\beta^{\lambda_k} + \lambda_k\delta^{(k)}$. If there is at least one component l so that $\mathrm{sign}(\tilde\beta_l)\ne\mathrm{sign}(\hat\beta_l^{\lambda_k})$, then the relaxed Lasso solutions for $\lambda\in\Lambda_k$ have to be computed as in Step 2 of the simple algorithm. Otherwise, all relaxed Lasso solutions for $\lambda\in\Lambda_k$ and $\phi\in[0,1]$ are given by linear interpolation between $\hat\beta^{\lambda_{k-1}}$ (which corresponds to φ=1) and $\tilde\beta$ (which corresponds to φ=0).

In the worst case, the refined algorithm is no improvement over the simple algorithm. In the best case, all relaxed Lasso solutions are found at no extra cost once the ordinary Lasso solutions are computed. If the Lasso solutions are e.g. monotonically increasing in absolute value (for a decreasing value of λ), then the condition about sign-equality in Step 2 of the refined algorithm is fulfilled, and the relaxed Lasso solutions are found at no extra cost.
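A sketch of the sign check and extrapolation in Step 2 of the refined algorithm, assuming the Lasso breakpoints lam_k, lam_km1 and coefficient vectors beta_k, beta_km1 from Step 1 are given (the names are hypothetical):

```python
import numpy as np

def relaxed_segment(beta_k, beta_km1, lam_k, lam_km1):
    """Extrapolate the Lasso path from lam_k towards lambda = 0.
    Returns (ok, beta_tilde): if ok is True, the relaxed Lasso solutions for
    lambda in (lam_k, lam_km1] lie on the segment between beta_km1 (phi = 1)
    and beta_tilde (phi = 0); otherwise they must be recomputed as in Step 2
    of the simple algorithm."""
    delta = (beta_k - beta_km1) / (lam_km1 - lam_k)   # direction delta^(k)
    beta_tilde = beta_k + lam_k * delta               # extrapolation to lambda = 0
    active = beta_k != 0
    ok = bool(np.all(np.sign(beta_tilde[active]) == np.sign(beta_k[active])))
    return ok, beta_tilde
```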

The computational complexity of the ordinary Lasso is O(npmin{n,p}), as there are m=O(min{n,p}) steps, each of complexity O(np). In the worst case, the computational complexity of the relaxed Lasso is O(m2np), which is, for high-dimensional problems with p>n, identical to O(n3p), and hence slightly more expensive than the O(n2p) of the ordinary Lasso (but equally expensive as the Lars–OLS hybrid if the least-squares estimator is computed explicitly). However, the linear scaling with the number p of variables is identical. Moreover, as mentioned above, the scaling O(n3p) is really a worst case scenario. Often all relaxed Lasso solutions can be found at little or no extra cost compared to the ordinary Lasso solutions, using the refined algorithm above.

2.3. Extensions

The method can easily be generalised to other loss functions and to generalised linear models (McCullagh and Nelder, 1989). Let $\ell(\beta)$ be the negative log-likelihood under parameter β. The relaxed Lasso estimator is then defined in analogy to (4) as
$$\hat\beta^{\lambda,\phi} = \arg\min_{\beta\in\mathcal M_\lambda}\; \ell(\beta) + \phi\lambda\|\beta\|_1, \qquad (5)$$
where $\beta\in\mathcal M_\lambda$ is understood as the requirement that $\beta_k=0$ for all $k\notin\mathcal M_\lambda$. The algorithm for computing the solutions for all parameter values λ, φ has the same two-stage characteristic as for the quadratic loss function. The computational effort is again identical to that of ordinary Lasso estimation. For this case, no exact solutions for ordinary Lasso estimators are in general available, and the same is true for the relaxed Lasso estimators. However, only optimisation of convex functions is required, as long as the log-likelihood is a concave function. For the Lasso, a solution has been proposed e.g. in Zhao and Yu (2004) and could be generalised to compute all relaxed Lasso solutions.
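As an illustration of (5) for logistic regression, the following two-stage sketch uses scikit-learn's l1-penalised LogisticRegression. The inverse-penalty parameter C corresponds to 1/(φλ) only up to the library's own scaling, and φ=0 is approximated by an essentially unpenalised refit, so this is a sketch of the idea rather than the paper's algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relaxed_logistic_lasso(X, y, lam, phi):
    # Stage 1: l1-penalised fit determines the selected model M_lambda.
    stage1 = LogisticRegression(penalty='l1', C=1.0 / lam, solver='liblinear')
    active = np.flatnonzero(stage1.fit(X, y).coef_.ravel())
    beta = np.zeros(X.shape[1])
    if active.size == 0:
        return beta
    # Stage 2: refit on the selected variables with the relaxed penalty phi*lam
    # (phi = 0 approximated by a very weak penalty, i.e. a nearly unpenalised fit).
    C2 = 1e6 if phi == 0 else 1.0 / (phi * lam)
    stage2 = LogisticRegression(penalty='l1', C=C2, solver='liblinear')
    beta[active] = stage2.fit(X[:, active], y).coef_.ravel()
    return beta
```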

3. Asymptotic results

For the asymptotic results, we consider a random design. Let $X=(X_1,\ldots,X_p)$ be a $p=p_n$-dimensional random variable with a Gaussian distribution with covariance matrix Σ, so that $X\sim\mathcal N(0,\Sigma)$. The response variable Y is a linear combination of the predictor variables,
$$Y = X^T\beta + \varepsilon, \qquad (6)$$
where $\varepsilon\sim\mathcal N(0,\sigma^2)$. We compare the risk of the Lasso estimator and the relaxed Lasso estimator. The minimal achievable squared error loss is given by the variance σ² of the noise term. The random loss L(λ) of the Lasso estimator is defined by
$$L(\lambda) = E(Y - X^T\hat\beta^\lambda)^2 - \sigma^2, \qquad (7)$$
where the expectation is with respect to a sample that is independent of the sample used to determine the estimator. The loss L(λ,φ) of the relaxed Lasso estimator under parameters λ, φ is defined analogously as
$$L(\lambda,\phi) = E(Y - X^T\hat\beta^{\lambda,\phi})^2 - \sigma^2. \qquad (8)$$
It is shown in the following that convergence rates for the relaxed Lasso estimator are largely unaffected by the number of predictor variables for sparse high-dimensional data. This is in contrast to the ordinary Lasso estimator, where the convergence rate drops dramatically for large growth rates of the number $p_n$ of predictor variables.

3.1. Setting and assumptions

We make a few assumptions about sparse high-dimensional data. The number of predictor variables p=pn is assumed to be growing very fast with the number of observations.

Assumption 2

For some c>0 and 0<ξ<1,
$$\log p_n \sim c\,n^{\xi}. \qquad (9)$$

In the following, a matrix is said to be diagonally dominant at value ν if, for every row, the sum of the absolute values of its non-diagonal entries is bounded by ν times the absolute value of the corresponding diagonal element.

Assumption 3

There exists some ν<1 so that both Σ and $\Sigma^{-1}$ are diagonally dominant at value ν, for all $n\in\mathbb N$.

Note that a matrix which is diagonally dominant at any value ν<1 is positive definite. The existence of $\Sigma^{-1}$ is hence already implied by the assumption about Σ. The assumption is not of critical importance for the results, but shortens the proofs considerably.

The coefficient vector β is assumed to be sparse. For simplicity of exposition, we assume sparseness in the $\ell_0$-norm: there is a finite number q of non-zero components of β, and these are fixed for all $n\in\mathbb N$. W.l.o.g., the non-zero components come first in order.

Assumption 4

The vector $\beta\in\mathbb R^{p_n}$ of coefficients is given for all $n\in\mathbb N$ by $\beta = (\beta_1,\ldots,\beta_q,0,0,\ldots)$.

The true model is hence $M=\{1,\ldots,q\}$. The $p_n-q$ noise variables with zero coefficients are nevertheless possibly correlated with the response variable. This setting is similar to some numerical examples in Fan and Peng (2004).

As the number of non-zero coefficients is given by a finite and fixed number q, we restrict the penalty parameter λ in the following to the range Λ for which the number of selected variables is less than or equal to $d\log n$, with an arbitrarily large d>0,
$$\Lambda \equiv \{\lambda \ge 0 : \#\mathcal M_\lambda \le d\log n\}. \qquad (10)$$
This range includes all sequences $\lambda_n$ for which the Lasso or relaxed Lasso estimates are consistent for variable selection, as the number of true non-zero coefficients is finite and fixed.

3.2. Slow rates with the ordinary Lasso

It is shown that the rate of convergence of ordinary Lasso estimators is slow if the number of noise variables is growing fast.

Theorem 5

Under Assumptions 2–4 and independent predictor variables, that is Σ = 1 (the identity matrix), it holds for the risk under the ordinary Lasso estimator that, for any c>0 and $n\to\infty$,
$$P\Bigl(\inf_{\lambda\in\Lambda} L(\lambda) > cn^{-r}\Bigr) \to 1 \qquad \text{for every } r > 1-\xi.$$

A proof is given in the appendix.

It is hence shown that the rate of convergence of the risk is critically determined by the rate $n^{\xi}$ at which the logarithm $\log p_n$ of the number of predictor variables grows to infinity. It follows that it is impossible to have both consistent variable selection and optimal rates for independent predictor variables with the ordinary Lasso estimator.

Adding many noise predictor variables slows down the rate of convergence for the Lasso estimator, no matter how the penalty parameter λ is chosen. The reason for this slow convergence in the high-dimensional setting is that a large value of the penalty parameter λ is necessary to keep the estimates of coefficients of noise predictor variables at low values. The shrinkage of the non-zero components is then very large, leading to less than optimal prediction; for a further discussion of this phenomenon see as well Fan and Li (2001).

3.3. Fast rates with the relaxed Lasso

A faster rate of convergence is achieved with the relaxed Lasso estimator than with the ordinary Lasso in this sparse high-dimensional setting. Noise variables can be prevented from entering the estimator with a high value of the penalty parameter λ, while the coefficients of selected variables can be estimated at the usual $\sqrt n$-rate, using a relaxed penalty. In other words, the rate of convergence of the relaxed Lasso estimator is not influenced by the presence of many noise variables.

Theorem 6

Under Assumptions 2–4, it holds for the loss under the relaxed Lasso estimator that, for $n\to\infty$,
$$\inf_{\lambda\in\Lambda,\ \phi\in[0,1]} L(\lambda,\phi) = O_p(n^{-1}).$$

A proof is given in the appendix.

The rate of convergence of the relaxed Lasso estimator (under oracle choices of the penalty parameters) is thus shown to be uninfluenced by a fast growing number of noise variables. The results are illustrated in Fig. 3.


Fig. 3. Convergence rates for the Lasso and the relaxed Lasso. The parameter ξ determines the rate at which the number $p_n$ of variables grows for $n\to\infty$, between constant (ξ=0) and exponential (ξ=1). The loss under the relaxed Lasso is $O_p(n^{-1})$, irrespective of ξ. The loss under the ordinary Lasso estimator can be of order $O_p(n^{-r})$ only if $r<1-\xi$ (depicted by the grey area in the figure), no matter how the penalty parameter λ is chosen.

3.4. Choice of the penalty parameters by cross-validation

It was shown above that the rate of convergence of the relaxed Lasso estimate is not influenced by the presence of many noise variables under an oracle choice of the penalty parameters λ and φ (which are unknown). We show that the parameters λ,φ can be chosen by cross-validation while still retaining the fast rate.

For K-fold cross-validation, each observation belongs to one of K partitions, each consisting of $\tilde n$ observations, where $\tilde n/n\to 1/K$ for $n\to\infty$. Let $L_{S,\tilde n}(\lambda,\phi)$ be, for $S=1,\ldots,K$, the empirical loss on the observations in partition S when constructing the estimator on the set of observations not in S. Let $L_{cv}(\lambda,\phi)$ be the empirical loss under K-fold cross-validation,
$$L_{cv}(\lambda,\phi) = K^{-1}\sum_{S=1}^K L_{S,\tilde n}(\lambda,\phi).$$
The penalty parameters $\hat\lambda$ and $\hat\phi$ are chosen as minimisers of $L_{cv}(\lambda,\phi)$,
$$(\hat\lambda,\hat\phi) = \arg\min_{(\lambda,\phi)\in\Lambda\times[0,1]} L_{cv}(\lambda,\phi).$$
In practice, a value of K between 5 and 10 is recommended, even though the following result is valid for a broader range.

Theorem 7

Let $L(\hat\lambda,\hat\phi)$ be the loss of the relaxed Lasso estimate with $(\hat\lambda,\hat\phi)$ chosen by K-fold cross-validation, where $2\le K<\infty$. Under Assumptions 2–4, it holds that
$$L(\hat\lambda,\hat\phi) = O_p(n^{-1}\log^2 n).$$

The optimal rates under oracle choices of the penalty parameters are thus almost obtained if the penalty parameters are chosen by cross-validation. We conjecture that the cross-validated penalty parameters lead for the relaxed Lasso estimator to consistent variable selection; this is not the case for the Lasso, see Meinshausen and Bühlmann (2006). This conjecture is supported by the following numerical examples.
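The cross-validated choice of (λ, φ) described above can be sketched as a grid search; the fitting routine `fit` is assumed to be supplied (for instance, a two-stage relaxed Lasso fit as sketched in Section 2.2), and the grids of λ and φ values are left to the user.

```python
import numpy as np
from sklearn.model_selection import KFold

def choose_penalties_cv(X, y, fit, lams, phis, K=5):
    """fit(X, y, lam, phi) is assumed to return a coefficient vector."""
    cv_loss = np.zeros((len(lams), len(phis)))
    for train, test in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        for i, lam in enumerate(lams):
            for j, phi in enumerate(phis):
                beta = fit(X[train], y[train], lam, phi)
                cv_loss[i, j] += np.mean((y[test] - X[test] @ beta) ** 2) / K
    i, j = np.unravel_index(np.argmin(cv_loss), cv_loss.shape)
    return lams[i], phis[j]
```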

4. Numerical examples

We illustrate the asymptotic results of Section 3 with a few numerical examples. The response variable follows the linear model (6). The predictor variable X follows a normal distribution with covariance matrix Σ, where $\Sigma_{ij} = \rho^{|i-j|}$ for some value of $0\le\rho<1$. For ρ=0, this corresponds to independent predictor variables. The variance of ɛ in (6) is chosen so that the signal-to-noise ratio is η, with 0<η<1 (that is, the variance of Y due to ɛ is 1/η times the variance of Y due to $X^T\beta$).

We consider the case where there are q variables (with $q\le p$) that “carry signal”, in the sense that $\beta_k\ne 0$ for all $k\le q$ and $\beta_k=0$ for all $k>q$. All components $\beta_k$ with $k\le q$ are double-exponentially distributed.
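The simulation design can be sketched as a small data generator (a minimal sketch following the description above; the noise scaling implements Var(ε) = Var(Xᵀβ)/η, which is our reading of the signal-to-noise definition):

```python
import numpy as np

def simulate(n, p, q, rho, eta, seed=0):
    rng = np.random.default_rng(seed)
    # Covariance Sigma_ij = rho^|i-j|
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:q] = rng.laplace(size=q)                 # double-exponential coefficients
    signal = X @ beta
    eps = rng.normal(scale=np.sqrt(signal.var() / eta), size=n)
    return X, signal + eps, beta
```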

For various values of n between 50 and 200 and p between 50 and 800, the ordinary Lasso estimator (φ=1), the Lars–OLS hybrid estimator (φ=0), and the relaxed Lasso estimator (adaptive φ) are computed. The penalty parameters are chosen by 5-fold cross-validation. The signal-to-noise ratio is chosen from the set $\eta\in\{0.2, 0.8\}$. The correlation between predictor variables is chosen once as ρ=0 (independent predictor variables) and once as ρ=0.3, while the number of relevant predictor variables is chosen from the set $q\in\{5, 15, 25, 50\}$. For each of these settings, the three estimators are computed 100 times each. Let $L_{\mathrm{rel}}$ be the average loss of the relaxed Lasso over these 100 simulations, in the sense of (8), and likewise $L_{\mathrm{ols}}$ and $L_{\mathrm{lasso}}$ for the Lars–OLS hybrid and the Lasso, see (7).

For small q, the setting is resembling that of the theoretical considerations above and of the numerical examples in Fan and Peng (2004). For small q, the theorems above suggest that the relaxed Lasso and Lars–OLS hybrid outperform Lasso estimation in terms of predictive power. On the other hand, for q=p, the Lasso is the ML estimator and one expects it to do very well compared with the Lars–OLS hybrid for large values of q. (In a Bayesian setting, if the prior for β is chosen to be the actual double-exponential distribution of the components of β, the Lasso solution is the MAP estimator if q=p.)

If we knew beforehand the value of q (the number of relevant predictor variables), then we would for optimal prediction either choose Lasso (if q is large), that is φ=1, or Lars–OLS hybrid (if q is small), that is φ=0. However, we do not know the value of q. The numerical examples illustrate how well relaxed Lasso adapts to the unknown sparsity of the underlying data.

4.1. Number of selected variables

The number of selected variables is shown in Table 1 for the Lasso estimator and in Table 2 for the relaxed Lasso. As expected, the relaxed Lasso selects roughly the correct number of variables (or fewer, if the noise is high or the number of observations n is low, with the Lars–OLS hybrid selecting even fewer variables in these cases, as can be seen from Table 3). In contrast, the ordinary Lasso often selects too many noise variables (with the cross-validated choice of λ). For q=5, it selects e.g. up to 34 variables; for q=50, up to 121. Following the considerations in the proof of Theorem 5, these numbers can be expected to grow even higher if a larger number n of observations were considered.

Table 1. Average number of selected variables with the Lasso for ρ=0

           q=5, η=0.8                      q=5, η=0.2
n \ p     50   100   200   400   800      50   100   200   400   800
50        17    18    20    22    25       9     9     8     9     8
100       13    23    24    27    31      10    12    15    14    17
200       11    13    27    31    34      11    13    19    22    24

           q=15, η=0.8                     q=15, η=0.2
50        27    30    30    27    24      10     8     8     6     7
100       26    39    44    52    53      13    15    14    16    17
200       27    35    51    59    69      18    21    26    29    30

           q=50, η=0.8                     q=50, η=0.2
50        36    30    23    16    12       9     7     8     7     6
100       47    65    66    61    54      15    19    16    14    11
200       48    71    96   112   121      27    30    34    31    27

Table 2. Average number of selected variables with the relaxed Lasso for ρ=0

           q=5, η=0.8                      q=5, η=0.2
n \ p     50   100   200   400   800      50   100   200   400   800
50         7     6     7     5     5       6     5     6     6     4
100        5     6     6     5     5       6     8     7     6     9
200        5     4     6     5     5       5     6    12     8     8

           q=15, η=0.8                     q=15, η=0.2
50        19    18    18    15    12       8     6     6     6     6
100       17    19    16    17    16      10    12    10     9    12
200       15    15    16    15    13      12    14    18    18    15

           q=50, η=0.8                     q=50, η=0.2
50        34    25    19    12     9       8     6     6     6     5
100       44    57    55    45    34      13    17    13    11     9
200       46    57    64    71    66      23    26    29    24    19

Table 3. Average number of selected variables with the Lars–OLS hybrid for ρ=0

           q=5, η=0.8                      q=5, η=0.2
n \ p     50   100   200   400   800      50   100   200   400   800
50         5     5     6     5     5       3     3     2     4     2
100        4     5     5     4     5       4     4     3     3     5
200        4     4     5     4     4       3     3     7     4     4

           q=15, η=0.8                     q=15, η=0.2
50        17    16    17    13    11       3     3     3     3     3
100       14    16    16    16    15       5     6     4     4     7
200       13    13    14    14    13       7     6     8     6     4

           q=50, η=0.8                     q=50, η=0.2
50        31    23    16    12     7       3     3     2     2     3
100       42    49    48    37    30       5     5     2     5     1
200       46    47    52    58    57      16    10     9     5     4

4.2. Comparison with Lasso

Lasso and relaxed Lasso estimators produce nearly identical results (in terms of predictive power) if the number q of relevant predictor variables is large, as can be seen from Table 4, which shows the relative improvement of relaxed Lasso over ordinary Lasso,
$$100\cdot(L_{\mathrm{lasso}}/L_{\mathrm{rel}} - 1). \qquad (11)$$
There is no harm in using the relaxed Lasso on such data instead of the Lasso, but there is not much to be gained either. However, for data with a very large number of noise variables (e.g. small q), the relaxed Lasso estimator produces a much smaller MSE, as expected from the previous theoretical results. The extent to which the relaxed Lasso outperforms the Lasso in this setting depends strongly on the signal-to-noise ratio η. The improvements are larger for large η, where shrinkage of the selected components is not necessary. For small η, shrinkage of the selected components is useful, and an optimal procedure thus chooses φ close to 1 for noisy problems. Indeed, the average chosen value of φ for the relaxed Lasso is large if η is low, as can be seen from Table 6.

Table 4. The relative improvement of relaxed Lasso over ordinary Lasso for ρ=0 (upper half) and ρ=0.3 (lower half)

ρ=0:
           q=5, η=0.8                      q=5, η=0.2
n \ p     50   100   200   400   800      50   100   200   400   800
50        41    55    49    49    52      -1     0    -2    -2     1
100       95    88    89   110   146       5     6     5    11     9
200      106    88    84   169   171      24    14    11    21    22

           q=15, η=0.8                     q=15, η=0.2
50        -5    -3    -2    -3    -2      -3    -3    -4    -4    -4
100        7    17    18    18    12      -3     0    -1     3    -1
200       28    32    43    58    60      -3     2     2     3     4

           q=50, η=0.8                     q=50, η=0.2
50        -3    -4    -3    -2    -3      -4    -3    -3    -4    -3
100       -3    -2    -4    -4    -3      -3    -1    -1    -1    -1
200        0    -1     3     1    -2      -4    -1    -1    -1     0

ρ=0.3:
           q=5, η=0.8                      q=5, η=0.2
50        40    98   104    85   103      -2     2     0    -3     1
100       83    95   114   180   186       9     8     6    10    15
200      119   128    89   166   202      24    32    10    19    41

           q=15, η=0.8                     q=15, η=0.2
50        -3     3     5    -1     7      -3    -5    -3    -2    -2
100       14    31    36    33    49       0    -1    -3     1     0
200       50    48    72    77   114      -4    -1     5     7     3

           q=50, η=0.8                     q=50, η=0.2
50        -7    -2    -3    -2    -3      -4    -4    -2    -2    -4
100       -2    -1     0    -2    -3      -3    -1    -1    -3    -1
200        2     7     8    12    11      -2    -3    -1    -2    -1

In the worst case, relaxed Lasso is performing only marginally worse than ordinary Lasso and is slightly more expensive to compute. For many sparse high-dimensional problems, however, the computation of the relaxed Lasso solutions comes at no extra computational cost and leads to sparser estimators and more accurate predictions.

4.3. Comparison with Lars–OLS hybrid

The theoretical conclusions suggest that the Lars–OLS hybrid should do equally well as the relaxed Lasso for sparse high-dimensional data. However, there are two caveats. First, the argument holds only for data with sparse structure. If the data do not have sparse structure, the Lars–OLS hybrid in general performs worse than the Lasso. The relaxed Lasso can adapt to the amount of sparseness (as seen from Table 6) by varying φ between 1 (for not-so-sparse data) and 0 (for sparse data). Table 5 shows the relative improvement of relaxed Lasso over the Lars–OLS hybrid, analogously to (11). For large values of q, relaxed Lasso indeed performs better than Lars–OLS in general.

Table 5. The relative improvement of relaxed Lasso over Lars–OLS hybrid for ρ=0 (upper half) and ρ=0.3 (lower half)

ρ=0:
           q=5, η=0.8                      q=5, η=0.2
n \ p     50   100   200   400   800      50   100   200   400   800
50         0     3     2    -3    -2      38    29    21    28    20
100       -4     6     1     1     5      18    46    34    19    23
200        8     1    -4     5     0       2    -4     5     5    59

           q=15, η=0.8                     q=15, η=0.2
50         8     5     2     2     3      28    21    20     9     9
100        6     3     0     0     0      32    42    31    26    29
200        3    -2     2    -2     0      22    20    48    23    22

           q=50, η=0.8                     q=50, η=0.2
50        18     8     7     7     6      20    14     9     3     4
100        6    11     6     6     6      35    33    16    20     8
200       36    32    24    14     6       5     2     3    13     0

ρ=0.3:
           q=5, η=0.8                      q=5, η=0.2
50        -3     2    -1     3    -1      42    33    24    30    26
100       -6    -4    -7    -1     3      11    39    20    21    22
200       -3    -1    -2     2     2       1    -4     8    14    -6

           q=15, η=0.8                     q=15, η=0.2
50         0     0    -1     1     1      37    29    23    18     9
100        0     2    -2     1     1      29    39    39    33    32
200       -2     2     1    -1    -1      14    13    34     9    16

           q=50, η=0.8                     q=50, η=0.2
50        18    15     4     5     2       9    17    10     5     3
100        5     9     8     4     3      46    43    28    17    13
200        4    12    10    39    36      46     4     1     3     1

What is more striking than the dependence on the sparseness, however, is the dependence on the signal-to-noise ratio. Consider the case where only five variables carry signal (q=5). For a high signal-to-noise ratio (η=0.8), relaxed Lasso and Lars–OLS hybrid perform approximately equally well (and both much better than ordinary Lasso). For a low signal-to-noise ratio (η=0.2), however, relaxed Lasso is considerably better than Lars–OLS. The reason for this is intuitively easy to understand. For noisy problems, it pays off to shrink the coefficients of selected variables, while this is less important for less noisy data. Relaxed Lasso adapts the amount of shrinkage to the noise level.

In general, it is not optimal to apply no shrinkage at all to the selected variables (φ=0) or full shrinkage (φ=1). This is the reason why relaxed Lasso performs better than both ordinary Lasso and Lars–OLS hybrid for noisy problems, especially when just a few variables carry signal. Given that the computational cost of relaxed Lasso is not higher than that of the Lars–OLS hybrid (and sometimes equal to that of the Lasso), relaxed Lasso seems well suited for high-dimensional problems, as the sparseness and the signal-to-noise ratio are in general unknown and relaxed Lasso is adaptive to both (Table 6).

Table 6. The average value of φ^ for the relaxed Lasso, for ρ=0

            q=5, η=0.8                           q=5, η=0.2
n \ p     50    100    200    400    800       50    100    200    400    800
50      0.14   0.09   0.08   0.04   0.03     0.66   0.50   0.49   0.51   0.46
100     0.09   0.11   0.08   0.03   0.04     0.52   0.68   0.53   0.45   0.51
200     0.08   0.08   0.07   0.06   0.06     0.29   0.39   0.71   0.47   0.41

            q=15, η=0.8                          q=15, η=0.2
50      0.24   0.15   0.13   0.19   0.20     0.67   0.50   0.45   0.47   0.45
100     0.21   0.17   0.05   0.05   0.04     0.66   0.72   0.61   0.61   0.64
200     0.17   0.12   0.10   0.06   0.02     0.54   0.63   0.75   0.70   0.61

            q=50, η=0.8                          q=50, η=0.2
50      0.55   0.38   0.39   0.45   0.43     0.65   0.47   0.45   0.41   0.33
100     0.44   0.54   0.45   0.41   0.40     0.72   0.77   0.71   0.64   0.58
200     0.40   0.44   0.30   0.29   0.19     0.77   0.84   0.89   0.79   0.75

5. Conclusions

We have proposed the relaxed Lasso as a generalisation of Lasso estimation. The main motivation is very high-dimensional regression problems, where the Lasso has two shortcomings:

  • Selection of noise variables: If the penalty parameter is chosen by cross-validation, the number of selected variables is often very large. Many noise variables are potentially selected.

  • Low accuracy of predictions: The accuracy of prediction (in terms of squared error loss) was shown to be negatively affected by the presence of many noise variables, particularly for high signal-to-noise ratios.

The advantages of relaxed Lasso over ordinary Lasso in this high-dimensional setting are twofold.
  • Sparser estimates: The number of selected coefficients is in general very much smaller for relaxed Lasso, without compromising on the accuracy of predictions. The models produced by relaxed Lasso are thus more amenable to interpretation.

  • More accurate predictions: If the signal-to-noise ratio is very low, the predictive accuracy of both Lasso and relaxed Lasso is comparable. For a high signal-to-noise ratio, relaxed Lasso achieves often much more accurate predictions.

For high signal-to-noise ratios, both advantages of relaxed Lasso, sparser estimates and more accurate predictions, can alternatively be achieved by using the Lars–OLS hybrid. However, the Lars–OLS hybrid is not adaptive to the signal-to-noise ratio, as seen in the numerical examples, and performs very much worse than the ordinary Lasso for low signal-to-noise ratios. The relaxed Lasso is adaptive to the signal-to-noise ratio and achieves near-optimal performance on a wide variety of data sets.

6. Proofs

6.1. Proof of Theorem 6

It was assumed that the set of non-zero coefficients of β is given by $M=\{1,\ldots,q\}$. Denote by E the event
$$\exists\lambda:\ \mathcal M_\lambda = M. \qquad (12)$$
Let c>0 be any positive constant. Then
$$P\Bigl(\inf_{\lambda,\phi} L(\lambda,\phi) > cn^{-1}\Bigr) \le P\Bigl(\inf_{\lambda,\phi} L(\lambda,\phi) > cn^{-1}\,\Big|\, E\Bigr)P(E) + P(E^c).$$
Let λ* be the smallest value of the penalty parameter λ such that no noise variable enters the selected variables, that is $\hat\beta_k^\lambda = 0$ for all k>q,
$$\lambda^* \equiv \min_{\lambda\ge 0}\{\lambda \mid \hat\beta_k^\lambda = 0,\ \forall k>q\}. \qquad (13)$$
The loss $\inf_{\lambda,\phi} L(\lambda,\phi)$ is smaller than $L(\lambda^*,0)$. Note that, conditional on E, the loss $L(\lambda^*,0)$ is the loss of the regular OLS-estimator $\hat\beta^0$ on the set $M=\{1,\ldots,q\}$ of the q predictor variables with non-vanishing coefficients. Let L denote the loss of this OLS-estimator. It follows that
$$P\Bigl(\inf_{\lambda,\phi} L(\lambda,\phi) > cn^{-1}\Bigr) \le P(L > cn^{-1}\mid E)P(E) + P(E^c) \le P(L > cn^{-1}) + P(E^c).$$
It follows from the proofs in Meinshausen and Bühlmann (2006) that there is a value of λ such that the true model M is selected with the Lasso estimator, so that $P(E^c)\to 0$ for $n\to\infty$. By the known properties of the OLS-estimator, there exists for every ε>0 some c>0 so that $\limsup_{n\to\infty} P(L > cn^{-1}) < \varepsilon$, which completes the proof. □

6.2. Some useful lemmas

6.2.1. Eigenvalues

Let Σ(M) be the covariance matrix restricted to a subset $M\subseteq\{1,\ldots,p\}$ of variables. Let $\Sigma_n(M)$ be the corresponding empirical covariance matrix for n independent observations.

Lemma 8

Under Assumptions 2–4, there exist $0 < b_{\min} < b_{\max} < \infty$ so that the maximal and minimal eigenvalues $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ of the empirical covariance matrices $\Sigma_n(M)$ are all bounded simultaneously, for any d>0 and all M with $|M| \le m_n \le d\log n$, by $b_{\min}$ from below and $b_{\max}$ from above, with probability converging to 1 for $n\to\infty$,
$$P\bigl(b_{\min} < \lambda_{\min}(M)\ \text{and}\ \lambda_{\max}(M) < b_{\max},\ \forall M: |M|\le m_n\bigr) \to 1, \qquad n\to\infty.$$

Proof

By Gershgorin's theorem, all eigenvalues of the empirical covariance matrix $\Sigma_n(M)$ are contained in the set
$$\Gamma(M) \equiv \bigcup_{a\in M}\Bigl\{x: |x - (\Sigma_n(M))_{aa}| \le \sum_{b\in M\setminus a} |(\Sigma_n(M))_{ab}|\Bigr\}.$$
Taking the union over all subsets M of the full set $\{1,\ldots,p\}$ of predictor variables with $|M|\le m_n$,
$$\bigcup_M \Gamma(M) \subseteq \bigcup_{a\in\{1,\ldots,p\}}\Bigl\{x: |x-(\Sigma_n)_{aa}| \le \max_{\Xi\subseteq\{1,\ldots,p\},\,|\Xi|\le m_n-1}\ \sum_{b\in\Xi\setminus a} |(\Sigma_n)_{ab}|\Bigr\}.$$
Denoting the maximal difference between the covariance matrix and its empirical version by
$$\Delta = \max_{a,b}|(\Sigma_n - \Sigma)_{ab}|, \qquad (14)$$
it follows that
$$\bigcup_M \Gamma(M) \subseteq \bigcup_{a\in\{1,\ldots,p\}}\Bigl\{x: |x-\Sigma_{aa}| \le m_n\Delta + \sum_{b\ne a}|\Sigma_{ab}|\Bigr\}.$$
Using the assumption that Σ is diagonally dominant at value ν<1 and $\Sigma_{aa}=1$ for all $a\in\{1,\ldots,p\}$, it follows that
$$\bigcup_M \Gamma(M) \subseteq \bigcup_{a\in\{1,\ldots,p\}}\{x: 1-\nu-m_n\Delta < x \le 1+\nu+m_n\Delta\}.$$
As $\log p_n \sim cn^\xi$ with ξ<1 and $m_n\le d\log n$ for some d>0, it is sufficient to show that for every δ>0 there exists g>0 so that, for $n\to\infty$,
$$P(\Delta \ge \delta/m_n) = O\bigl(p_n^2\exp(-gn/m_n)\bigr). \qquad (15)$$
Using Bernstein's inequality, there exists g>0 so that, for any $1\le a,b\le p_n$ and $n\to\infty$,
$$P\Bigl(\Bigl|n^{-1}\sum_{i=1}^n X_{ia}X_{ib} - E(X_aX_b)\Bigr| > \delta/m_n\Bigr) = O\bigl(\exp(-gn/m_n)\bigr).$$
With Bonferroni's inequality, Eq. (15) follows, which completes the proof. □

6.2.2. Change in gradient

Let $\mathcal V_h$ be the set of all diagonal $h\times h$ matrices V whose diagonal elements are in $\{-1,1\}$.

Lemma 9

It holds under Assumptions 2–4 that, for every g>0, with probability converging to 1 for $n\to\infty$, simultaneously for all M with $|M|\le m_n \le d\log n$ and all $V\in\mathcal V_{|M|}$,
$$|\Sigma(M)\Sigma_n(M)^{-1}V\mathbf 1_M - V\mathbf 1_M| < g,$$
where the inequality is understood to be fulfilled if it is fulfilled componentwise.

Proof

First,
$$\Sigma(M)\Sigma_n(M)^{-1}V\mathbf 1_M = V\mathbf 1_M + (\Sigma(M)-\Sigma_n(M))\Sigma_n(M)^{-1}V\mathbf 1_M.$$
Thus, simultaneously for all M with $|M|\le m_n$, it holds componentwise that
$$|\Sigma(M)\Sigma_n(M)^{-1}V\mathbf 1_M - V\mathbf 1_M| \le m_n\Delta\, \max_{M,\,a\in M}|(\Sigma_n(M)^{-1}V\mathbf 1_M)_a|,$$
where Δ is defined as in (14). The last term $\max_{M,a\in M}|(\Sigma_n(M)^{-1}V\mathbf 1_M)_a|$ is bounded by $m_n/\lambda_{\min}$, where $\lambda_{\min}$ is the minimal eigenvalue of $\Sigma_n(M)$ over all subsets M with $|M|\le m_n$. This minimal eigenvalue is bounded from below by $b_{\min}>0$ with probability converging to 1 for $n\to\infty$, according to Lemma 8. It remains to be shown that for any δ>0, $P(\Delta \le \delta/m_n^2)\to 1$ for $n\to\infty$. This follows analogously to (15), which completes the proof. □

6.2.3. Restricted positive cone condition

The positive cone condition of Efron et al. (2004) is fulfilled if, for all subsets $M\subseteq\{1,\ldots,p_n\}$ and all $V\in\mathcal V_{|M|}$,
$$(V\Sigma_n(M)V)^{-1}\mathbf 1_M > 0,$$
where the inequality holds componentwise. The restricted positive cone condition is fulfilled if the inequality holds for all subsets M with $|M|\le m_n$.

Lemma 10

Under Assumptions 2–4, the restricted positive cone condition is fulfilled for $m_n\le d\log n$ with any d>0, with probability converging to 1 for $n\to\infty$. Moreover, for any $0<\varepsilon<1-\nu$,
$$P\Bigl(\min_{M:|M|\le m_n,\ V\in\mathcal V_{|M|}}\,(V\Sigma_n(M)V)^{-1}\mathbf 1_M > \varepsilon\Bigr)\to 1, \qquad n\to\infty.$$

Proof

First, for any M and $V\in\mathcal V_{|M|}$,
$$(V\Sigma_n(M)V)^{-1}\mathbf 1_M = (V\Sigma(M)V)^{-1}\bigl(V\Sigma(M)\Sigma_n(M)^{-1}V\mathbf 1_M\bigr).$$
By Lemma 9, the components of $V\Sigma(M)\Sigma_n(M)^{-1}V\mathbf 1_M$ are, for every δ>0, simultaneously bounded for all M with $|M|\le m_n$ and $V\in\mathcal V_{|M|}$ by $1-\delta$ from below and by $1+\delta$ from above, with probability converging to 1 for $n\to\infty$. Thus it holds for every $a\in M$ and $V\in\mathcal V_{|M|}$, with probability converging to 1 for $n\to\infty$, that
$$\bigl((V\Sigma_n(M)V)^{-1}\mathbf 1_M\bigr)_a \ge \Sigma(M)^{-1}_{aa}(1-\delta) - \sum_{b\in M\setminus a}|\Sigma(M)^{-1}_{ab}|(1+\delta) = (1-\delta)\Bigl(\Sigma(M)^{-1}_{aa} - \tfrac{1+\delta}{1-\delta}\sum_{b\in M\setminus a}|\Sigma(M)^{-1}_{ab}|\Bigr) \equiv g_a(\delta).$$
The inverse covariance matrix $\Sigma^{-1}$ is by assumption diagonally dominant at value ν<1, which is equivalent to
$$\sum_{b\ne a}|\Sigma^{-1}_{ab}| \le \nu\,\Sigma^{-1}_{aa}.$$
It is straightforward to show that in this case, for all $M\subseteq\{1,\ldots,p\}$, the inverse covariance matrices $\Sigma(M)^{-1}$ are diagonally dominant at value ν<1 as well. For δ=0, the continuous function $g_a(\delta)$ is hence, for all components $a\in M$, larger than or equal to $(1-\nu)\,\Sigma(M)^{-1}_{aa}$. Note that $\Sigma(M)^{-1}_{aa}$ is the inverse of the conditional variance $\mathrm{Var}(X_a\mid\{X_b,\ b\in M\setminus a\})$, which is smaller than the unconditional variance $\mathrm{Var}(X_a)$. Hence, as $\Sigma_{aa}=1$, it holds that $\Sigma(M)^{-1}_{aa} \ge 1$ for all $a\in M$ and thus, for all $a\in M$,
$$\lim_{\delta\to 0} g_a(\delta) \ge 1-\nu.$$
Choosing δ sufficiently small, the continuous function $g_a(\delta)$ is for all components $a\in M$ larger than ε, as $\varepsilon < 1-\nu$, which completes the proof. □

Lemma 11

Under Assumptions 2–4, for some d>0, it holds for any $\varepsilon > 1+\nu$ that
$$P\Bigl(\max_{M:|M|\le m_n,\ V\in\mathcal V_{|M|}}\,(V\Sigma_n(M)V)^{-1}\mathbf 1_M < \varepsilon\Bigr)\to 1, \qquad n\to\infty.$$

Proof

The proof of Lemma 11 follows analogously to the proof of Lemma 10. □

6.2.4. Monotonicity of Lasso-solutions

Lemma 12

Under the restricted positive cone condition, the absolute value of the Lasso estimator $\hat\beta_k^\lambda$ is, for all components $k=1,\ldots,p$, monotonically increasing for a decreasing value of λ.

Proof

For any value of λ, let δλ>0 be a small change of the penalty parameter λ. Let $\delta\hat\beta^\lambda$ be the corresponding change of the Lasso estimator, $\delta\hat\beta^\lambda \equiv \hat\beta^{\lambda-\delta\lambda} - \hat\beta^\lambda$. It has to be shown that, for any λ>0,
$$\hat\beta^\lambda\cdot\delta\hat\beta^\lambda \ge 0. \qquad (16)$$
For all components of $\hat\beta^\lambda$ equal to zero, the claim is automatically fulfilled. Let the set of non-zero components of $\hat\beta^\lambda$ be again denoted by $\mathcal M_\lambda\subseteq\{1,\ldots,p\}$. Denote the restriction of $\hat\beta^\lambda$ and $\delta\hat\beta^\lambda$ to the set M by $\hat\beta^\lambda(M)$ and $\delta\hat\beta^\lambda(M)$, respectively. It follows e.g. from Efron et al. (2004) that the infinitesimal change $\delta\beta(M)$ of the vector $\hat\beta^\lambda(M)$ is proportional to
$$(\Sigma_n(M)V)^{-1}\mathbf 1_M, \qquad (17)$$
where V is a diagonal $|M|\times|M|$ matrix with diagonal elements $V_{kk}$, $k\in M$, identical to the signs of the correlations of $X_{ik}$, $i=1,\ldots,n$, with the residuals $Y_i - \sum_{a\in\{1,\ldots,p\}}\hat\beta_a^\lambda X_{ia}$, $i=1,\ldots,n$. As $\hat\beta^\lambda$ is a Lasso solution, $V_{kk}$ is identical to the sign of $\hat\beta_k^\lambda$ for all $k\in M$. Therefore, componentwise, for all λ>0,
$$\mathrm{sign}\bigl(\delta\hat\beta^\lambda(M)\cdot\hat\beta^\lambda(M)\bigr) = \mathrm{sign}\bigl((V\Sigma_n(M)V)^{-1}\mathbf 1_M\bigr).$$
If the restricted positive cone condition is fulfilled, all components on the r.h.s. are positive and so the same is true for the l.h.s., and (16) follows. The restricted positive cone condition is fulfilled with probability converging to 1 for $n\to\infty$ according to Lemma 10, which completes the proof. □

6.2.5. When do noise variables enter?

By assumption, the correct model is given by the first q predictor variables, $M=\{1,\ldots,q\}$. A noise variable is hence a variable with index larger than q. A noise variable is part of the Lasso estimator if and only if there exists some k>q so that $k\in\mathcal M_\lambda$.

Lemma 13

Let $\lambda_n$, $n\in\mathbb N$, be a sequence with $\lambda_n = o(n^{(-1+\xi)/2})$ for $n\to\infty$. Then, under Assumptions 2–4 and independent predictor variables,
$$P(\exists k>q:\ k\in\mathcal M_{\lambda_n}) \to 1, \qquad n\to\infty.$$

Proof

Let $\tilde\beta^\lambda$ be the Lasso estimator which is constrained to be zero outside the set $M=\{1,\ldots,q\}$,
$$\tilde\beta^\lambda = \arg\min_\beta\; n^{-1}\sum_{i=1}^n\Bigl(Y_i - \sum_{k\in M}\beta_k X_{ik}\Bigr)^2 + \lambda\|\beta\|_1. \qquad (18)$$
If $\tilde\beta^\lambda$ is a valid Lasso solution to the unconstrained problem, as in Eq. (2), then, by uniqueness of the solution, there does not exist any k>q so that $k\in\mathcal M_\lambda$. It suffices hence to show that $\tilde\beta^{\lambda_n}$ cannot be the solution to (2), with probability converging to 1 for $n\to\infty$. Using results in Osborne et al. (2000), the estimator $\tilde\beta^\lambda$ is a valid Lasso solution for the whole set of $p_n$ predictor variables only if the gradient $n^{-1}|\sum_{i=1}^n R_iX_{ik}|$ is smaller than or equal to λ for all k>q, where, for $i=1,\ldots,n$,
$$R_i = Y_i - \sum_{a\in M}\tilde\beta_a^\lambda X_{ia}$$
are the residuals under the estimator $\tilde\beta^\lambda$. Thus
$$P(\exists k>q:\ k\in\mathcal M_{\lambda_n}) \ge P\Bigl(\max_{k>q}\, n^{-1}\Bigl|\sum_{i=1}^n R_iX_{ik}\Bigr| > \lambda_n\Bigr). \qquad (19)$$
Conditional on $(Y, X_1,\ldots,X_q)$, it holds for every k>q that
$$n^{-1}\sum_{i=1}^n R_iX_{ik} \sim \mathcal N\Bigl(0,\ n^{-2}\sum_{i=1}^n R_i^2\Bigr).$$
The expected value of $n^{-1}\sum_{i=1}^n R_i^2$, the averaged squared residuals, is larger than $\sigma^2(n-q)/n$ for all values of λ, and
$$P\Bigl(n^{-1}\sum_{i=1}^n R_i^2 > \sigma^2/2\Bigr)\to 1,\qquad n\to\infty.$$
If $n^{-1}\sum_{i=1}^n R_i^2 = \sigma^2/2$, then $n^{-1}\sum_{i=1}^n R_iX_{ik}\sim\mathcal N(0,\sigma^2/(2n))$. Thus, for some c,d>0,
$$P\Bigl(n^{-1}\Bigl|\sum_{i=1}^n R_iX_{ik}\Bigr| > \lambda_n\Bigr) \ge d\lambda_n^{-1}\exp(-cn\lambda_n^2),$$
which holds for every k>q, of which there are $p_n - q$ variables. The probability that the gradient $n^{-1}|\sum_{i=1}^n R_iX_{ik}|$ is smaller than $\lambda_n$ for all $p_n - q$ noise variables is hence bounded by
$$P\Bigl(\max_{k>q}\, n^{-1}\Bigl|\sum_{i=1}^n R_iX_{ik}\Bigr| \le \lambda_n\Bigr) \le \bigl(1 - d\lambda_n^{-1}\exp(-cn\lambda_n^2)\bigr)^{p_n-q} \le \exp\bigl(-(p_n-q)\,d\lambda_n^{-1}\exp(-cn\lambda_n^2)\bigr).$$
Let $\lambda_n$ be a sequence with $\lambda_n = o(n^{(-1+\xi)/2})$ for $n\to\infty$. Then $n\lambda_n^2 = o(n^\xi)$, and as $\log p_n \ge gn^\xi$ for some g>0, it follows that
$$P\Bigl(\max_{k>q}\, n^{-1}\Bigl|\sum_{i=1}^n R_iX_{ik}\Bigr| \le \lambda_n\Bigr)\to 0,\qquad n\to\infty,$$
which, using (19), completes the proof. □

6.2.6. Error of estimators

The following lemma bounds from below the difference between the estimator under $\lambda\ge\lambda^*$ and the true parameter value, with λ* as defined in (13).

Lemma 14

Assume Σ = 1 (the identity matrix) and Assumptions 2 and 3. For any δ>0, with probability converging to 1 for $n\to\infty$, it holds for all $k\le q$ and all $\lambda\ge\lambda^*$ that
$$|\hat\beta_k^\lambda - \beta_k| \ge (1-\delta)\lambda.$$

Proof

First,
$$|\hat\beta_k^\lambda - \beta_k| \ge |\hat\beta_k^\lambda - \tilde\beta_k^0| - |\tilde\beta_k^0 - \beta_k|,$$
where $\tilde\beta^0$ is defined as in (18) as the Lasso estimator (at λ=0) in which all components corresponding to noise variables, k>q, are restricted to be zero. This estimator is the regular OLS estimator on the set $M=\{1,\ldots,q\}$ of variables, and it follows by standard results that for any sequence $c_n$ with $c_n^{-1} = o(n^{1/2})$, it holds that $P(|\tilde\beta_k^0 - \beta_k| > c_n)\to 0$ for $n\to\infty$. By Lemma 13, $(\lambda^*)^{-1} = o_p(n^{1/2})$. It suffices hence to show that, for any δ>0, for all $k\le q$ and $\lambda\ge\lambda^*$,
$$P\bigl(|\hat\beta_k^\lambda - \tilde\beta_k^0| > (1-\delta)\lambda\bigr) \to 1, \qquad n\to\infty. \qquad (20)$$
Note that for $\lambda\ge\lambda^*$, $\hat\beta^\lambda = \tilde\beta^\lambda$ by definition of $\tilde\beta^\lambda$ in (18). Using (17), it holds for every λ>0 that
$$|\tilde\beta_k^\lambda - \tilde\beta_k^0| = \int_0^\lambda \bigl((V\Sigma_n(\tilde{\mathcal M}_{\lambda'})V)^{-1}\mathbf 1_{\tilde{\mathcal M}_{\lambda'}}\bigr)_k\, d\lambda',$$
where $\tilde{\mathcal M}_{\lambda'} = \{k\le q:\ \tilde\beta_k^{\lambda'}\ne 0\}\subseteq M$. By Lemma 10 and Σ=1, it holds for every δ>0, with probability converging to 1 for $n\to\infty$, that
$$\min_{M:|M|\le m_n,\ V}\,(V\Sigma_n(M)V)^{-1}\mathbf 1_M > 1-\delta. \qquad (21)$$
As $q=|M|\le m_n$, it follows that, with probability converging to 1 for $n\to\infty$,
$$|\tilde\beta_k^\lambda - \tilde\beta_k^0| \ge (1-\delta)\lambda,$$
which shows, using $\hat\beta^\lambda = \tilde\beta^\lambda$ for $\lambda\ge\lambda^*$, that (20) holds true and thus completes the proof. □

6.2.7. Errors due to finite validation set

Let $L_{\tilde n}(\lambda,\phi)$ be the empirical version of $L(\lambda,\phi)$ for $\tilde n$ observations of (Y,X) which are independent of the observations used to construct the relaxed Lasso estimator.

Lemma 15

Let $\liminf_{n\to\infty} \tilde n/n \ge 1/K$ with $K\ge 2$. Then, under Assumptions 2–4,
$$\sup_{\lambda\in\Lambda,\ \phi\in[0,1]} |L(\lambda,\phi) - L_{\tilde n}(\lambda,\phi)| = O_p(n^{-1}\log^2 n), \qquad n\to\infty.$$

Proof

The restricted positive cone condition is satisfied with probability converging to 1 for $n\to\infty$, according to Lemma 10. It hence suffices to show the claim under the assumption that the restricted positive cone condition holds. Let, as before, $\mathcal M_1,\ldots,\mathcal M_m$ be the set of all models attained with Lasso estimates and let $\lambda_k$, $k=1,\ldots,m$ (with $\lambda_1<\cdots<\lambda_m$), be the largest value of the penalty parameter λ so that $\mathcal M_k = \mathcal M_\lambda$. Using Lemma 12 and the definition of the relaxed Lasso estimates, Eq. (4), any relaxed Lasso solution is contained in one of the sets $B_1,\ldots,B_m$, where for all $k\in\{1,\ldots,m\}$,
$$B_k = \{\beta = \phi\hat\beta^{\lambda_k,0} + (1-\phi)\hat\beta^{\lambda_k,1}\mid \phi\in[0,1]\}. \qquad (22)$$
The estimates $\hat\beta^{\lambda_k,1}$ are the Lasso estimates for penalty parameter $\lambda_k$, and $\hat\beta^{\lambda_k,0}$ the corresponding OLS-estimates. The loss under a choice λ, φ of the penalty parameters is given by
$$L(\lambda,\phi) = E\Bigl(Y - \sum_{k\in\{1,\ldots,p\}}\hat\beta_k^{\lambda,\phi}X_k\Bigr)^2.$$
For any λ, set $\delta\hat\beta^\lambda = \hat\beta^{\lambda,1} - \hat\beta^{\lambda,0}$. The loss $L(\lambda,\phi)$ can then be written as
$$L(\lambda,\phi) = E(U_\lambda^2) + 2\phi E(U_\lambda V_\lambda) + \phi^2 E(V_\lambda^2), \qquad (23)$$
where $U_\lambda = Y - \sum_{k\in\{1,\ldots,p\}}\hat\beta_k^{\lambda,0}X_k$ and $V_\lambda = \sum_{k\in\{1,\ldots,p\}}\delta\hat\beta_k^\lambda X_k$. The loss L(λ,φ) is hence, for given λ, a quadratic function in φ. Both $U_\lambda$ and $V_\lambda$ are normally distributed random variables, conditional on the sample on which $\hat\beta^{\lambda,\phi}$ is estimated. There exists some h>0 so that, for all λ and φ, $P(\max_k|\hat\beta_k^{\lambda,\phi}|>h)\to 0$ for $n\to\infty$. As the number of non-zero coefficients is bounded by $m_n\le d\log n$, it thus follows by Bernstein's inequality that for every ε>0 there exists some g>0 so that
$$\limsup_{n\to\infty} P\bigl(|E(U_\lambda^2) - E_{\tilde n}(U_\lambda^2)| > g\tilde n^{-1}\log n\bigr) < \varepsilon,$$
where $E_{\tilde n}(U_\lambda^2)$ is the empirical mean of $U_\lambda^2$ in the sample of $\tilde n$ observations in the validation set. For the second and third terms in the loss (23) it follows analogously that for every ε>0 there exists g>0 so that
$$\limsup_{n\to\infty} P\bigl(|E(U_\lambda V_\lambda) - E_{\tilde n}(U_\lambda V_\lambda)| > g\tilde n^{-1}\log n\bigr) < \varepsilon, \qquad \limsup_{n\to\infty} P\bigl(|E(V_\lambda^2) - E_{\tilde n}(V_\lambda^2)| > g\tilde n^{-1}\log n\bigr) < \varepsilon.$$
Hence, using (23), for every ε>0 there exists some g>0 so that
$$\limsup_{n\to\infty} P\Bigl(\sup_{\phi\in[0,1]}|L(\lambda,\phi) - L_{\tilde n}(\lambda,\phi)| > g\tilde n^{-1}\log n\Bigr) < \varepsilon.$$
When extending the supremum over $\phi\in[0,1]$ to a supremum over λ>0 and $\phi\in[0,1]$, note that it is sufficient, due to (22), to consider values of λ in the finite set $\{\lambda_1,\ldots,\lambda_m\}$. Using Bonferroni's inequality and $m\le d\log n$, it follows that for every ε>0 there exists some g>0 so that
$$\limsup_{n\to\infty} P\Bigl(\sup_{\lambda,\phi}|L(\lambda,\phi) - L_{\tilde n}(\lambda,\phi)| > g\tilde n^{-1}\log^2 n\Bigr) < \varepsilon,$$
which completes the proof, as $\tilde n/n \ge 1/K > 0$ for $n\to\infty$. □

6.3. Proof of Theorem 5

For independent predictor variables, the loss L(λ) of the Lasso estimator under penalty parameter λ is given by
$$L(\lambda) = \sum_{k\in\{1,\ldots,p\}} (\hat\beta_k^\lambda - \beta_k)^2 = \sum_{k\le q}(\hat\beta_k^\lambda - \beta_k)^2 + \sum_{k>q}(\hat\beta_k^\lambda)^2, \qquad (24)$$
using that the variance of all components of X is identical to 1 and $\beta_k = 0$ for all k>q. Let λ* be defined as in (13). Using Lemma 14, it follows that for all ε>0, with probability converging to 1 for $n\to\infty$, for all $k\le q$ and $\lambda\ge\lambda^*$,
$$(\hat\beta_k^\lambda - \beta_k)^2 \ge (1-\varepsilon)^2\lambda^2 \ge (1-\varepsilon)^2(\lambda^*)^2.$$
Summing only over components with $k\le q$ in (24), it follows that the loss is bounded from below for $\lambda\ge\lambda^*$ by
$$\inf_{\lambda\ge\lambda^*} L(\lambda) \ge q(1-\varepsilon)^2(\lambda^*)^2. \qquad (25)$$
Now the case $\lambda<\lambda^*$ is examined. The range of λ is furthermore restricted to lie in the set Λ, defined in (10). Denote in the following the difference between the Lasso estimators $\hat\beta^{\lambda^*}$ and $\hat\beta^\lambda$ by $\delta^\lambda = \hat\beta^{\lambda^*} - \hat\beta^\lambda$, and the difference between $\hat\beta^{\lambda^*}$ and the true parameter β by $\theta = \hat\beta^{\lambda^*} - \beta$. Then
$$(\hat\beta_k^\lambda - \beta_k)^2 = \theta_k^2 - 2\theta_k\delta_k^\lambda + (\delta_k^\lambda)^2.$$
It follows by Lemma 14 that, with probability converging to 1 for $n\to\infty$, for any ε>0, $|\theta_k| > (1-\varepsilon)\lambda^*$. It holds by an analogous argument that $|\theta_k| < (1+\varepsilon)\lambda^*$. Hence, for all $k\le q$,
$$(\hat\beta_k^\lambda - \beta_k)^2 \ge (1-\varepsilon)^2(\lambda^*)^2 - 2(1+\varepsilon)\lambda^*|\delta_k^\lambda| + (\delta_k^\lambda)^2.$$
By Lemma 11, and analogously to Lemma 14, it holds furthermore with probability converging to 1 that $(1-\varepsilon)(\lambda^*-\lambda) \le |\delta_k^\lambda| \le (1+\varepsilon)(\lambda^*-\lambda)$, and hence, for all $k\le q$,
$$(\hat\beta_k^\lambda - \beta_k)^2 \ge (1-\varepsilon)^2(\lambda^*)^2 - 2(1+\varepsilon)^2\lambda^*(\lambda^*-\lambda) + (1-\varepsilon)^2(\lambda^*-\lambda)^2.$$
As λ* is the smallest value of λ for which no noise variable is selected, a noise variable (with index k>q) enters the model $\mathcal M_\lambda$ if $\lambda<\lambda^*$. Using again Lemma 10, with probability converging to 1 for $n\to\infty$, it holds for this component that, for any ε>0,
$$(\hat\beta_k^\lambda)^2 \ge (1-\varepsilon)^2(\lambda^*-\lambda)^2. \qquad (26)$$
It follows that, with probability converging to 1 for $n\to\infty$,
$$L(\lambda) \ge q(1-\varepsilon)^2(\lambda^*)^2 - 2q(1+\varepsilon)^2\lambda^*(\lambda^*-\lambda) + \bigl(q+(1-\varepsilon)^2\bigr)(\lambda^*-\lambda)^2.$$
Denote the infimum over $0\le\lambda\le\lambda^*$ of the r.h.s. by f(ε),
$$f(\varepsilon) \equiv \inf_{0\le\lambda\le\lambda^*}\Bigl(q(1-\varepsilon)^2(\lambda^*)^2 - 2q(1+\varepsilon)^2\lambda^*(\lambda^*-\lambda) + \bigl(q+(1-\varepsilon)^2\bigr)(\lambda^*-\lambda)^2\Bigr).$$
Note that f(ε) is a continuous function of ε and
$$\lim_{\varepsilon\to 0} f(\varepsilon) = \frac{q}{q+1}(\lambda^*)^2.$$
Hence, as ε can be chosen arbitrarily close to 0, it holds that, with probability converging to 1 for $n\to\infty$,
$$\inf_{0\le\lambda\le\lambda^*} L(\lambda) \ge (\lambda^*)^2/2.$$
By Lemma 13, $(\lambda^*)^{-2} = O_p(n^{1-\xi})$ and thus, using (25), for any $r > 1-\xi$,
$$P\Bigl(\inf_{\lambda\in\Lambda} L(\lambda) > cn^{-r}\Bigr) \to 1, \qquad n\to\infty,$$
which completes the proof. □

6.4. Proof of Theorem 7

It holds for the loss under $\hat\lambda$ and $\hat\phi$, for every g>0, that
$$P\bigl(L(\hat\lambda,\hat\phi) > gn^{-1}\log^2 n\bigr) \le P\Bigl(\inf_{\lambda\in\Lambda,\ \phi\in[0,1]} L(\lambda,\phi) > \tfrac{1}{2}gn^{-1}\log^2 n\Bigr) + 2P\Bigl(\sup_{\lambda\in\Lambda,\ \phi\in[0,1]} |L(\lambda,\phi) - L_{cv}(\lambda,\phi)| > \tfrac{1}{2}gn^{-1}\log^2 n\Bigr).$$
It follows by Theorem 6 that the first term on the r.h.s. vanishes for $n\to\infty$. The second term is, by Bonferroni's inequality, bounded from above by
$$K\max_{1\le S\le K} P\Bigl(\sup_{\lambda\in\Lambda,\ \phi\in[0,1]} |L(\lambda,\phi) - L_{S,\tilde n}(\lambda,\phi)| > \tfrac{1}{2}gn^{-1}\log^2 n\Bigr).$$
Using Lemma 15, there thus exists for every ε>0 some g>0 so that
$$\limsup_{n\to\infty} P\bigl(L(\hat\lambda,\hat\phi) > gn^{-1}\log^2 n\bigr) < \varepsilon,$$
which completes the proof. □

References
