Relaxed Lasso
1. Introduction
The current work is motivated by linear prediction for high-dimensional data, where the number of predictor variables p is very large, possibly very much larger than the number of observations n (e.g. van de Geer and van Houwelingen, 2004). Regularisation is clearly of central importance for these high-dimensional problems.
There are many criteria to consider when choosing an appropriate regularisation method. First, not all regularisation procedures are adequate for the high-dimensional case. The non-negative Garrote (Breiman, 1995) is, for example, a promising regularisation method. However, it is not suited for the high-dimensional case, as it requires computation of the OLS estimator, which is not available when the number of variables exceeds the number of observations. An important criterion in the presence of many predictor variables is the computational complexity of the procedure. Many regularisation procedures with otherwise attractive features involve, unfortunately, minimisation of a non-convex function (e.g. Fan and Li, 2001; Tsybakov and van de Geer, 2005). For high-dimensional problems, it is in general very costly to find an (approximate) solution in this case, due to the presence of local minima in the objective function.
For Bridge estimators, which were proposed in Frank and Friedman (1993), we study in the following the tradeoff between computational complexity on the one hand and (asymptotic) properties of the estimators on the other hand. Let $X = (X_1, \ldots, X_p)$ be a $p$-dimensional predictor variable and $Y$ a response variable of interest. For $n$ independent observations $(X_i, Y_i)$, $i = 1, \ldots, n$, of $(X, Y)$, Bridge estimators are defined for $\lambda \in [0, \infty)$ and $\gamma \ge 0$ as
$$\hat{\beta}^{\lambda,\gamma} = \operatorname*{arg\,min}_{\beta}\; n^{-1}\sum_{i=1}^{n}\big(Y_i - X_i^{T}\beta\big)^2 + \lambda \|\beta\|_{\gamma}^{\gamma}, \qquad (1)$$
where $\|\beta\|_{\gamma} = \big(\sum_{k=1}^{p} |\beta_k|^{\gamma}\big)^{1/\gamma}$ is the $\ell_\gamma$-norm of the vector of coefficients, and $\gamma$ is typically in the range $[0, 2]$.
For $\gamma = 0$, Bridge estimation corresponds to ordinary model selection. Ridge regression is obtained for $\gamma = 2$, while $\gamma = 1$ is equivalent to the Lasso proposed in Tibshirani (1996). Computation of the estimator (1) involves minimisation of a non-convex function if $\gamma < 1$, while the function is convex for $\gamma \ge 1$. Since optimisation of a non-convex function in a high-dimensional setting is very difficult, Bridge estimation with $\gamma \ge 1$ is an attractive choice. However, for values of $\gamma > 1$, the shrinkage of estimators towards zero increases with the magnitude of the parameter being estimated (Knight and Fu, 2000). For the Lasso ($\gamma = 1$), the shrinkage is constant irrespective of the magnitude of the parameter being estimated (at least for orthogonal designs, where regularisation with the Lasso is equivalent to soft-thresholding of the estimates). It was recognised in Fan and Li (2001) that this leads to undesirable properties (in terms of prediction) of the resulting estimator. It was first suggested by Huber (1973) to examine the asymptotic properties for a number of predictor variables $p_n$ that grows with the number of observations $n$; see as well Fan and Peng (2004). It will be shown below that the shrinkage of the Lasso leads to a slow convergence rate of the $\ell_2$-loss for high-dimensional problems where the number of parameters grows almost exponentially fast with $n$, so that $\log p_n \sim c\, n^{\xi}$ for some $c > 0$ and $0 < \xi < 1$.
For $\gamma < 1$, the shrinkage of estimates decreases with increasing magnitude of the parameter being estimated, and faster convergence rates can thus in general be achieved (see e.g. Knight and Fu, 2000 and, for classification, Tsybakov and van de Geer, 2005). However, the fact remains that for $\gamma < 1$ a non-convex optimisation problem has to be solved.
There is no value of $\gamma$ for which an entirely satisfactory compromise is achieved between low computational complexity on the one hand and fast convergence rates on the other hand. In this paper, it is shown that a two-stage procedure, termed the relaxed Lasso, can work around this problem. The method has low computational complexity (the computational burden is often identical to that of an ordinary Lasso solution) and, unlike the Lasso, convergence rates are fast, irrespective of the growth rate of the number of predictor variables. Moreover, the relaxed Lasso leads to consistent variable selection under a prediction-optimal choice of the penalty parameters, which does not hold true for ordinary Lasso solutions in a high-dimensional setting.
2. Relaxed Lasso
We define relaxed Lasso estimation and illustrate the properties of the relaxed Lasso estimators for an orthogonal design. A two-stage algorithm for computing the relaxed Lasso estimator is then proposed, followed by a few remarks about extending the procedure to generalised linear models (McCullagh and Nelder, 1989).
Recall that the Lasso estimator under a squared error loss is defined in Tibshirani (1996) for $\lambda \in [0, \infty)$ as
$$\hat{\beta}^{\lambda} = \operatorname*{arg\,min}_{\beta}\; n^{-1}\sum_{i=1}^{n}\big(Y_i - X_i^{T}\beta\big)^2 + \lambda\|\beta\|_1. \qquad (2)$$
The Lasso estimator is a special case of the Bridge estimator (1), obtained by setting $\gamma = 1$. The set of predictor variables selected by the Lasso estimator $\hat{\beta}^{\lambda}$ is denoted by $\mathcal{M}_{\lambda}$,
$$\mathcal{M}_{\lambda} = \{1 \le k \le p : \hat{\beta}^{\lambda}_k \ne 0\}. \qquad (3)$$
For sufficiently large penalties (e.g. for $\lambda \to \infty$), the selected model is the empty set, $\mathcal{M}_{\lambda} = \emptyset$, as all components of the estimator (2) are identical to zero. In the absence of an $\ell_1$-penalty ($\lambda = 0$) and if the number of variables $p$ is smaller than the number of observations $n$, all predictor variables are in general selected, so that $\mathcal{M}_0 = \{1, \ldots, p\}$ in this case.
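To make (2) and (3) concrete, a minimal numerical sketch (assuming scikit-learn; note that scikit-learn's `Lasso` uses a penalty scaling that differs from (2) by a constant factor, and the data below are purely synthetic):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Fit the Lasso for one penalty value and read off the selected model M_lambda, as in (3).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))                       # p > n example
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(size=100)

lam = 0.1
fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y)
M_lambda = np.flatnonzero(fit.coef_)                  # indices k with nonzero coefficient
print(M_lambda)
```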
The $\ell_1$-penalty for the ordinary Lasso estimator (2) has two effects, model selection and shrinkage estimation. On the one hand, a certain set of coefficients is set to zero and hence excluded from the selected model. On the other hand, for all variables in the selected model $\mathcal{M}_{\lambda}$, coefficients are shrunken towards zero compared to the least-squares solution. These two effects are clearly related and are best understood, in the context of an orthogonal design, as soft-thresholding of the coefficients. Nevertheless, it is not immediately obvious whether it is indeed optimal to control these two effects, model selection on the one hand and shrinkage estimation on the other hand, by a single parameter only. As an example, it might be desirable in some situations to estimate the coefficients of all selected variables without shrinkage, corresponding to a hard-thresholding of the coefficients.
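For reference, the two thresholding rules mentioned here can be written, for an OLS estimate $\hat{\beta}^0_k$ and a threshold $t > 0$ (standard definitions, not specific to this paper), as
$$\text{soft: } \hat{\beta}_k = \operatorname{sign}(\hat{\beta}^0_k)\,\big(|\hat{\beta}^0_k| - t\big)^{+}, \qquad \text{hard: } \hat{\beta}_k = \hat{\beta}^0_k \cdot 1\{|\hat{\beta}^0_k| > t\}.$$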
As a generalisation of both soft- and hard-thresholding, we control model selection and shrinkage estimation by two separate parameters $\lambda$ and $\phi$ with the relaxed Lasso estimator.

Definition 1. The relaxed Lasso estimator is defined for $\lambda \in [0, \infty)$ and $\phi \in (0, 1]$ as
$$\hat{\beta}^{\lambda,\phi} = \operatorname*{arg\,min}_{\beta}\; n^{-1}\sum_{i=1}^{n}\big(Y_i - X_i^{T}\{\beta \cdot 1_{\mathcal{M}_{\lambda}}\}\big)^2 + \phi\lambda\|\beta\|_1, \qquad (4)$$
where $1_{\mathcal{M}_{\lambda}}$ is the indicator function on the set of variables $\mathcal{M}_{\lambda}$, so that for all $k \in \{1, \ldots, p\}$,
$$\{\beta \cdot 1_{\mathcal{M}_{\lambda}}\}_k = \beta_k \cdot 1\{k \in \mathcal{M}_{\lambda}\}.$$
We will show further below that the gains of relaxed Lasso estimation (adaptive $\phi$) compared to ordinary Lasso estimation ($\phi = 1$) can be very large. Moreover, the relaxed Lasso produces in most cases better results than the Lars–OLS hybrid ($\phi = 0$), as the relaxed Lasso can adapt the amount of shrinkage to the structure of the underlying data.
An algorithm is developed to compute the exact solutions of the relaxed Lasso estimator. The parameters $\lambda$ and $\phi$ can then be chosen e.g. by cross-validation. The algorithm is based on the Lars-algorithm of Efron et al. (2004). As the relaxed Lasso estimator is parameterised by two parameters, a two-dimensional manifold has to be covered to find all solutions. The computational burden of computing all relaxed Lasso estimators is in the worst case identical to that of the Lars–OLS hybrid and in the best case identical to that of the Lars-algorithm. The method is thus very well suited for high-dimensional problems.

2.1. Orthogonal design

To illustrate the properties of the relaxed Lasso estimator, it is instructive to consider an orthogonal design. The shrinkage of various regularisation methods is shown in Fig. 1 for this case. The set of solutions of the relaxed Lasso estimator is given, for all $k = 1, \ldots, p$, by
$$\hat{\beta}^{\lambda,\phi}_k = \operatorname{sign}(\hat{\beta}^0_k)\,\big(|\hat{\beta}^0_k| - \phi\lambda/2\big)^{+}\cdot 1\{|\hat{\beta}^0_k| > \lambda/2\},$$
where $\hat{\beta}^0$ is the OLS-solution. For $\phi \to 0$, hard-thresholding is achieved, while $\phi = 1$ results, as mentioned above, in soft-thresholding, which corresponds to the Lasso solution. The relaxed Lasso hence provides a continuum of solutions that includes soft- and hard-thresholding, much like the set of solutions provided by the Bridge estimators (1) when varying $\gamma$ in the range $[0, 1]$. It can be seen in Fig. 1 that the solutions of the Bridge estimators and the relaxed Lasso solutions are indeed very similar.

Fig. 1. Comparison of shrinkage estimators as a function of the OLS-estimator $\hat{\beta}^0$. Shown are estimators for soft-thresholding (top left), hard-thresholding (top right), the Bridge estimator, Eq. (1), for values of $\gamma$ between 0 and 1 (bottom left), and the relaxed Lasso estimators for values of $\phi$ between 0 and 1 (bottom right).

2.2. Algorithm

The main advantage of the relaxed Lasso estimator over Bridge estimation is the low computational complexity. We propose in the following a naive, easy to implement algorithm for computing relaxed Lasso estimators as in (4). Based on some more insight, a modification is proposed further below so that, for many data sets, the computational effort of computing all relaxed Lasso solutions is identical to that of computing the ordinary Lasso solutions.

Simple Algorithm.
Step 1: Compute all ordinary Lasso solutions, e.g. with the Lars-algorithm of Efron et al. (2004) under the Lasso modification. Let $\{\mathcal{M}_{\lambda_1}, \ldots, \mathcal{M}_{\lambda_m}\}$ be the resulting set of $m$ models. Let $\lambda_1 > \lambda_2 > \cdots > \lambda_m$ be a sequence of penalty values so that $\mathcal{M}_{\lambda} = \mathcal{M}_{\lambda_k}$ if and only if $\lambda \in [\lambda_k, \lambda_{k-1})$, where $\lambda_0 = \infty$. (The models are not necessarily distinct, so it is always possible to obtain such a sequence of penalty parameters.)
Step 2: For each $k \in \{1, \ldots, m\}$, compute all Lasso solutions on the set $\mathcal{M}_{\lambda_k}$ of variables, varying the penalty parameter between 0 and $\lambda_{k-1}$. The obtained set of solutions is identical to the set of relaxed Lasso solutions for $\lambda \in [\lambda_k, \lambda_{k-1})$ and $\phi \in [0, 1]$. The relaxed Lasso solutions for all penalty parameters are given by the union of these sets.

It is obvious that this algorithm produces all relaxed Lasso solutions, for all values of the penalty parameters $\lambda$ and $\phi$. The computational complexity of this algorithm is identical to that of the Lars–OLS hybrid, as the Lars iterations in Step 2 are about as computationally intensive as ordinary least-squares estimation (Efron et al., 2004). However, this naive algorithm is not optimal in general. The computation of the ordinary Lasso solutions contains information that can be exploited in the second stage, when finding Lasso solutions for the subsets $\mathcal{M}_{\lambda_k}$, $k = 1, \ldots, m$, of variables. Fig. 2 serves as an illustration. The “direction” in which relaxed Lasso solutions are found is identical to the direction of the ordinary Lasso solutions. These directions do not have to be computed again.
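Before describing this refinement, the two-stage structure of the simple algorithm can be sketched as follows (a minimal sketch, assuming centred data and scikit-learn's `lars_path` and `Lasso`; the helper name `relaxed_lasso_path` and the grid of $\phi$ values are illustrative, and scikit-learn's penalty scaling differs from (2) by a constant factor):

```python
import numpy as np
from sklearn.linear_model import lars_path, Lasso

def relaxed_lasso_path(X, y, phis=(1.0, 0.5, 0.0)):
    """Simple two-stage algorithm (sketch): Lasso path, then re-fit on each
    selected model with the relaxed penalty phi * lambda."""
    n, p = X.shape
    # Step 1: all ordinary Lasso solutions along the path (Lars with Lasso modification).
    alphas, _, coefs = lars_path(X, y, method='lasso')
    solutions = {}
    for j, lam in enumerate(alphas):
        active = np.flatnonzero(coefs[:, j])          # selected model M_lambda
        for phi in phis:
            beta = np.zeros(p)
            if active.size > 0:
                if phi * lam > 0:
                    # Step 2: Lasso on the selected variables with penalty phi * lambda.
                    fit = Lasso(alpha=phi * lam, fit_intercept=False, max_iter=10000)
                    fit.fit(X[:, active], y)
                    beta[active] = fit.coef_
                else:
                    # phi = 0 (or lambda = 0): unpenalised least squares on the selected variables.
                    beta[active], *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
            solutions[(lam, phi)] = beta
    return solutions
```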
Indeed, by extrapolating the path of the ordinary Lasso solutions, all relaxed Lasso solutions can often be found. There is an important caveat, however: extrapolated Lasso solutions are valid relaxed Lasso solutions only if the extrapolations do not cross the value zero. This is e.g. fulfilled if the ordinary Lasso estimators are monotonically increasing in absolute value for a decreasing penalty parameter $\lambda$. If, however, the extrapolations do cross zero for a set $\mathcal{M}_{\lambda_k}$, then the Lasso has to be computed again explicitly for this set, using e.g. again the Lars-algorithm of Efron et al. (2004).

Fig. 2. Left: path of the Lasso estimators for a data set with three variables. For large values of $\lambda$, all components are equal to zero. In the range $[\lambda_2, \lambda_1)$, only the first component is non-zero. Middle: the relaxed Lasso solutions if $\lambda$ is in the range $[\lambda_2, \lambda_1)$. The direction in which the relaxed Lasso solutions are found is the same as the one computed for the ordinary Lasso solutions. The relaxed Lasso solution for $\phi = 0$ corresponds to the OLS-solution. Right: likewise, relaxed Lasso solutions for the range $[\lambda_3, \lambda_2)$ are found by extrapolating the Lasso solutions. Again, the solutions for $\phi = 0$ correspond to the OLS-solution for the two variables selected by the Lasso estimator.

Refined Algorithm.
Step 1: Identical to Step 1 of the simple algorithm. Compute all ordinary Lasso solutions.
Step 2: For each $k \in \{1, \ldots, m\}$, let $d_k$ be the direction in which the ordinary Lasso solutions change for $\lambda$ in the range $[\lambda_k, \lambda_{k-1})$. This is the direction in which solutions are found for the ordinary Lasso and is hence known from Step 1. Let $\tilde{\beta}^{\lambda_k}$ be the extrapolation of this path to the penalty value zero, obtained from $\hat{\beta}^{\lambda_k}$ and $d_k$ at no additional cost. If there is at least one component $l$ so that $\operatorname{sign}(\tilde{\beta}^{\lambda_k}_l) \ne \operatorname{sign}(\hat{\beta}^{\lambda_k}_l)$, then the relaxed Lasso solutions for $\lambda \in [\lambda_k, \lambda_{k-1})$ have to be computed as in Step 2 of the simple algorithm. Otherwise, all relaxed Lasso solutions for $\lambda \in [\lambda_k, \lambda_{k-1})$ and $\phi \in [0, 1]$ are given by linear interpolation between $\hat{\beta}^{\lambda}$ (which corresponds to $\phi = 1$) and $\tilde{\beta}^{\lambda_k}$ (which corresponds to $\phi = 0$).

In the worst case, the refined algorithm is no improvement over the simple algorithm. In the best case, all relaxed Lasso solutions are found at no extra cost, once the ordinary Lasso solutions are computed. If the Lasso solutions are e.g. monotonically increasing in absolute value (for a decreasing value of $\lambda$), then the condition about sign-equality in Step 2 of the refined algorithm is fulfilled, and the relaxed Lasso solutions are found at no extra cost. The computational complexity of computing all ordinary Lasso solutions with the Lars-algorithm is of order $O(npm)$, as there are $m$ steps, each of complexity $O(np)$, with $m$ of order $\min(n, p)$ for most data sets. In the worst case, the computational complexity of the relaxed Lasso is larger by a factor of order $m$, and hence slightly more expensive than the ordinary Lasso (but equally expensive as the Lars–OLS hybrid if the least-squares estimator is computed explicitly for each model along the path). However, the linear scaling with the number $p$ of variables is identical. Moreover, as mentioned above, this is really a worst case scenario. Often all relaxed Lasso solutions can be found at little or no extra cost compared to the ordinary Lasso solutions, using the refined algorithm above.
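A minimal sketch of the interpolation step in the refined algorithm (the function name, the grid of $\phi$ values, and the use of an exception to signal a zero-crossing are illustrative; in the zero-crossing case one would fall back to Step 2 of the simple algorithm):

```python
import numpy as np

def relaxed_by_interpolation(X, y, beta_lasso, phis=(0.75, 0.5, 0.25, 0.0)):
    """Refined Step 2 (sketch): if the phi = 0 endpoint keeps the signs of the
    Lasso solution, relaxed solutions are linear interpolations between the two."""
    active = np.flatnonzero(beta_lasso)
    if active.size == 0:
        return {phi: np.zeros_like(beta_lasso) for phi in phis}
    # phi = 0 endpoint: least squares on the selected variables (the extrapolated solution).
    beta0 = np.zeros_like(beta_lasso)
    beta0[active], *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
    if np.any(np.sign(beta0[active]) != np.sign(beta_lasso[active])):
        raise ValueError("extrapolation crosses zero; recompute the Lasso on this subset")
    # Same sign pattern: relaxed solutions lie on the segment between the
    # Lasso solution (phi = 1) and the least-squares solution (phi = 0).
    return {phi: phi * beta_lasso + (1.0 - phi) * beta0 for phi in phis}
```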
2.3. Extensions

The method can be easily generalised to more general loss functions and generalised linear models (McCullagh and Nelder, 1989). Let $\ell(\beta)$ be the negative log-likelihood under parameter $\beta$. The relaxed Lasso estimator is then defined in analogy to (4) as
$$\hat{\beta}^{\lambda,\phi} = \operatorname*{arg\,min}_{\beta}\; \ell(\beta \cdot 1_{\mathcal{M}_{\lambda}}) + \phi\lambda\|\beta\|_1, \qquad (5)$$
where $\beta \cdot 1_{\mathcal{M}_{\lambda}}$ is understood to be equivalent to requiring that $\beta_k = 0$, for all $k \notin \mathcal{M}_{\lambda}$. The algorithm for computing the solutions for all parameter values has the same two-stage characteristic as for the quadratic loss function. The computational effort is again identical to that of ordinary Lasso estimation. For this case, no exact solutions for ordinary Lasso estimators are in general available, and the same is true for the relaxed Lasso estimators. However, only optimisation of convex functions is required as long as the log-likelihood is a concave function. For the Lasso, a solution has been proposed e.g. in Zhao and Yu (2004) and could be generalised to compute all relaxed Lasso solutions.
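For a binomial GLM, the same two-stage idea can be sketched with an $\ell_1$-penalised logistic regression; this is only a schematic illustration (scikit-learn's `LogisticRegression` parameterises the penalty through an inverse strength `C`, so the mapping to $\lambda$ and $\phi\lambda$ below is approximate, and the path-following algorithms referenced in the text are not used here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relaxed_l1_logistic(X, y, C=0.1, phi=0.5):
    """Two-stage relaxed estimator for logistic regression (sketch): select with an
    l1-penalised fit, then re-fit on the selected variables with a weaker penalty."""
    stage1 = LogisticRegression(penalty='l1', solver='liblinear', C=C).fit(X, y)
    active = np.flatnonzero(stage1.coef_.ravel())      # selected model M_lambda
    beta = np.zeros(X.shape[1])
    if active.size > 0:
        relaxed_C = C / phi if phi > 0 else 1e6        # phi -> 0: essentially unpenalised
        stage2 = LogisticRegression(penalty='l1', solver='liblinear', C=relaxed_C)
        stage2.fit(X[:, active], y)
        beta[active] = stage2.coef_.ravel()
    return beta
```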
3. Asymptotic results
3.1. Setting and assumptions

For the asymptotic results, we consider a random design. Let $X = (X_1, \ldots, X_{p_n})$ be a $p_n$-dimensional random variable with a Gaussian distribution with covariance matrix $\Sigma$, so that $X \sim \mathcal{N}(0, \Sigma)$. The response variable $Y$ is a linear combination of the predictor variables,
$$Y = X^{T}\beta + \varepsilon, \qquad (6)$$
where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ is independent of $X$. We compare the risk of the Lasso estimator and the relaxed Lasso estimator. The minimal achievable squared error loss is given by $\sigma^2$, the variance of the noise term. The (random) loss of the Lasso estimator is defined by
$$L(\lambda) = E\big[(X^{T}\hat{\beta}^{\lambda} - X^{T}\beta)^2\big], \qquad (7)$$
where the expectation is with respect to a sample that is independent of the sample which is used to determine the estimator. The loss of the relaxed Lasso estimator under parameters $(\lambda, \phi)$ is defined analogously as
$$L(\lambda, \phi) = E\big[(X^{T}\hat{\beta}^{\lambda,\phi} - X^{T}\beta)^2\big]. \qquad (8)$$
It is shown in the following that convergence rates for the relaxed Lasso estimator are largely unaffected by the number of predictor variables for sparse high-dimensional data. This is in contrast to the ordinary Lasso estimator, where the convergence rate drops dramatically for large growth rates of the number of predictor variables.

We make a few assumptions about sparse high-dimensional data. The number of predictor variables is assumed to be growing very fast with the number of observations.

Assumption 2. For some $c > 0$ and $0 < \xi < 1$,
$$\log p_n \sim c\, n^{\xi}. \qquad (9)$$

In the following, a matrix $A$ is said to be diagonally dominant at value $\nu < 1$ if the row-wise sums of the absolute values of its non-diagonal entries are bounded by $\nu$ times the corresponding absolute value of the diagonal element, that is $\sum_{j \ne l} |A_{lj}| \le \nu\,|A_{ll}|$ for all $l$.

Assumption 3. There exists some $\nu < 1$ so that both $\Sigma$ and $\Sigma^{-1}$ are diagonally dominant at value $\nu$ for all $n$.

Note that a diagonally dominant matrix (for any value $\nu < 1$) is positive definite. The existence of $\Sigma^{-1}$ is hence already implied by the assumption about $\Sigma$. The assumption is not of critical importance for the results, but shortens the proofs considerably.

The coefficient vector is assumed to be sparse. For simplicity of exposition, we assume sparseness in the $\ell_0$-norm: there is a finite number $q$ of non-zero components of $\beta$ and these are fixed for all $n$. W.l.o.g., the non-zero components are first in order.

Assumption 4. The vector of coefficients is given for all $n$ by $\beta = (\beta_1, \ldots, \beta_q, 0, \ldots, 0)$ with $\beta_k \ne 0$ for all $k \le q$.

The true model is hence $\mathcal{M} = \{1, \ldots, q\}$. The noise variables with zero coefficients are nevertheless possibly correlated with the response variable. This setting is similar to some numerical examples in Fan and Peng (2004). As the number of non-zero coefficients is given by a finite and fixed number $q$, we restrict the penalty parameter $\lambda$ in the following to the range $\Lambda$ for which the number of selected variables is less than or equal to $d\,q$ with an arbitrarily large constant $d \ge 1$,
$$\Lambda = \{\lambda : |\mathcal{M}_{\lambda}| \le d\,q\}. \qquad (10)$$
This range includes all sequences $\lambda_n$ for which the Lasso or relaxed Lasso estimates are consistent for variable selection, as the number of true non-zero coefficients is finite and fixed.
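Since $X$ is centred Gaussian with covariance matrix $\Sigma$ and independent of the training sample, the losses (7) and (8) above reduce to quadratic forms in the estimation error; a one-line derivation (using only the definitions above):
$$L(\lambda) = E\big[\{X^{T}(\hat{\beta}^{\lambda} - \beta)\}^{2}\big] = (\hat{\beta}^{\lambda} - \beta)^{T} E[XX^{T}]\,(\hat{\beta}^{\lambda} - \beta) = (\hat{\beta}^{\lambda} - \beta)^{T}\,\Sigma\,(\hat{\beta}^{\lambda} - \beta),$$
and analogously for $L(\lambda, \phi)$ with $\hat{\beta}^{\lambda,\phi}$ in place of $\hat{\beta}^{\lambda}$.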
3.2. Slow rates with the ordinary Lasso

It is shown that the rate of convergence of ordinary Lasso estimators is slow if the number of noise variables is growing fast.

Theorem 5. Under Assumptions 2–4 and independent predictor variables, that is $\Sigma = I$, it holds for the risk under the ordinary Lasso estimator that, for any $\delta > 0$,
$$P\Big(\inf_{\lambda \in \Lambda} L(\lambda) > n^{-(1-\xi)-\delta}\Big) \to 1 \quad \text{for } n \to \infty.$$

A proof is given in Section 6. It is hence shown that the rate of convergence of the risk is critically determined by the rate with which the logarithm of the number of predictor variables is growing to infinity. It follows that it is impossible to have both consistent variable selection and optimal rates for independent predictor variables with the ordinary Lasso estimator. Adding many noise predictor variables slows down the rate of convergence for the Lasso estimator, no matter how the penalty parameter is chosen. The reason for this slow convergence in the high-dimensional setting is that a large value of the penalty parameter $\lambda$ is necessary to keep the estimates of coefficients of noise predictor variables at low values. The shrinkage of the non-zero components is then very large, leading to less than optimal prediction; for a further discussion of this phenomenon, see as well Fan and Li (2001).

3.3. Fast rates with the relaxed Lasso

A faster rate of convergence is achieved with the relaxed Lasso estimator than with the ordinary Lasso in this sparse high-dimensional setting. Noise variables can be prevented from entering the estimator with a high value of the penalty parameter $\lambda$, while the coefficients of selected variables can be estimated at the usual $n^{-1/2}$-rate, using a relaxed penalty $\phi\lambda$. It is shown, in other words, that the rate of convergence of the relaxed Lasso estimator is not influenced by the presence of many noise variables.

Theorem 6. Under Assumptions 2–4, it holds for the loss under the relaxed Lasso estimator that
$$\inf_{\lambda \in \Lambda,\ \phi \in [0,1]} L(\lambda, \phi) = O_p(n^{-1}) \quad \text{for } n \to \infty.$$

A proof is given in Section 6. The rate of convergence of the relaxed Lasso estimator (under oracle choices of the penalty parameters) is thus shown to be uninfluenced by a fast growing number of noise variables. The results are illustrated in Fig. 3.

Fig. 3. Convergence rates for the Lasso and the relaxed Lasso. The parameter $\xi$ determines the rate at which the number of variables grows with $n$, between constant ($\xi = 0$) and almost exponential ($\xi \to 1$). The loss under the relaxed Lasso is $O_p(n^{-1})$, irrespective of $\xi$. The loss under the ordinary Lasso estimator can be of order $n^{-\delta}$ only if $\delta \le 1 - \xi$ (depicted by the grey area in the figure), no matter how the penalty parameter is chosen.

3.4. Choice of the penalty parameters by cross-validation

It was shown above that the rate of convergence of the relaxed Lasso estimator is not influenced by the presence of many noise variables under an oracle choice of the penalty parameters $\lambda$ and $\phi$ (which are unknown). We show that the parameters can be chosen by cross-validation while still (almost) retaining the fast rate. For K-fold cross-validation, each observation belongs to one of $K$ partitions $S_1, \ldots, S_K$, each consisting of $n_k$ observations, where $n_k/n \to 1/K$ for $n \to \infty$. Let $L_{S_k}(\lambda, \phi)$ be, for $k = 1, \ldots, K$, the empirical loss on the observations in partition $S_k$ when constructing the estimator on the set of observations different from $S_k$. Let $L_{cv}(\lambda, \phi)$ be the empirical loss under K-fold cross-validation,
$$L_{cv}(\lambda, \phi) = K^{-1}\sum_{k=1}^{K} L_{S_k}(\lambda, \phi).$$
The penalty parameters $\hat{\lambda}$ and $\hat{\phi}$ are chosen as minimisers of $L_{cv}$,
$$(\hat{\lambda}, \hat{\phi}) = \operatorname*{arg\,min}_{\lambda \in \Lambda,\ \phi \in [0,1]} L_{cv}(\lambda, \phi).$$
In practice, a value of K between 5 and 10 is recommended, even though the following result is valid for a broader range.

Theorem 7. Let $L(\hat{\lambda}, \hat{\phi})$ be the loss of the relaxed Lasso estimator with $\hat{\lambda}$ and $\hat{\phi}$ chosen by K-fold cross-validation with a fixed $K \ge 2$. Under Assumptions 2–4, it holds for any $\delta > 0$ that
$$P\big(L(\hat{\lambda}, \hat{\phi}) > n^{-1+\delta}\big) \to 0 \quad \text{for } n \to \infty.$$

The optimal rates under oracle choices of the penalty parameters are thus almost obtained if the penalty parameters are chosen by cross-validation. We conjecture that the cross-validated penalty parameters lead, for the relaxed Lasso estimator, to consistent variable selection; this is not the case for the Lasso, see Meinshausen and Bühlmann (2006). This conjecture is supported by the following numerical examples.
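A sketch of this cross-validation scheme over a grid of $(\lambda, \phi)$ values (the grids, the helper name, and the use of scikit-learn's `Lasso`, whose penalty scaling differs from (2) by a constant factor, are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso

def cv_relaxed_lasso(X, y, lambdas, phis, K=5, seed=0):
    """Choose (lambda, phi) by K-fold cross-validation of the squared-error loss (sketch)."""
    n, p = X.shape
    cv_loss = np.zeros((len(lambdas), len(phis)))
    for train, test in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        for a, lam in enumerate(lambdas):
            # Stage 1 on the training fold: variable selection with penalty lambda.
            sel = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X[train], y[train])
            active = np.flatnonzero(sel.coef_)
            for b, phi in enumerate(phis):
                beta = np.zeros(p)
                if active.size > 0:
                    if phi * lam > 0:
                        # Stage 2: re-fit on the selected variables with relaxed penalty phi * lambda.
                        fit = Lasso(alpha=phi * lam, fit_intercept=False, max_iter=10000)
                        fit.fit(X[train][:, active], y[train])
                        beta[active] = fit.coef_
                    else:
                        beta[active], *_ = np.linalg.lstsq(X[train][:, active], y[train], rcond=None)
                cv_loss[a, b] += np.mean((y[test] - X[test] @ beta) ** 2) / K
    a_best, b_best = np.unravel_index(np.argmin(cv_loss), cv_loss.shape)
    return lambdas[a_best], phis[b_best]
```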
4. Numerical examples
We illustrate the asymptotic results of Section 3 with a few numerical examples. The response variable follows the linear model (6). The predictor variable $X$ follows a normal distribution with covariance matrix $\Sigma$, where the correlation between distinct predictor variables is determined by a parameter $\rho \ge 0$. For $\rho = 0$, this corresponds to independent predictor variables. The variance of $\varepsilon$ in (6) is chosen so that a prescribed signal-to-noise ratio is attained, defined as the ratio between the variance of $Y$ due to $X^{T}\beta$ and the variance of $Y$ due to $\varepsilon$.
We consider the case where there are $q$ variables (with $q \le p$) that “carry signal” in the sense that $\beta_k \ne 0$ for all $k \le q$ and $\beta_k = 0$ for all $k$ with $k > q$. All components $\beta_k$ with $k \le q$ are double-exponentially distributed.
For various values of $n$ between 50 and 200 and $p$ between 50 and 800, the ordinary Lasso estimator ($\phi = 1$), the Lars–OLS hybrid estimator ($\phi = 0$), and the relaxed Lasso estimator (adaptive $\phi$) are computed. The penalty parameters are chosen by 5-fold cross-validation. The signal-to-noise ratio is varied between low and high values. The correlation between predictor variables is chosen once as $\rho = 0$ (independent predictor variables) and once as a positive value of $\rho$, while the number $q$ of relevant predictor variables is varied between small values (e.g. $q = 5$) and larger ones. For each of these settings, the three mentioned estimators are computed 100 times each. Let $\bar{L}_{\mathrm{relax}}$ be the average loss of the relaxed Lasso over these 100 simulations, in the sense of (8), and likewise $\bar{L}_{\mathrm{OLS}}$ and $\bar{L}_{\mathrm{Lasso}}$ for the Lars–OLS hybrid and the Lasso.
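One replicate of such a simulation can be generated along the following lines (a sketch; the power-decay covariance $\Sigma_{jk} = \rho^{|j-k|}$, the seed and the particular parameter values are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def simulate(n=100, p=200, q=5, rho=0.0, snr=4.0, seed=0):
    """Generate one data set: X ~ N(0, Sigma) with Sigma_jk = rho**|j-k|,
    double-exponential coefficients on the first q variables, and noise variance
    chosen so that Var(X'beta) / Var(eps) equals the target signal-to-noise ratio."""
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:q] = rng.laplace(size=q)               # double-exponentially distributed coefficients
    signal_var = beta @ Sigma @ beta             # Var(X^T beta)
    noise_sd = np.sqrt(signal_var / snr)
    y = X @ beta + rng.normal(scale=noise_sd, size=n)
    return X, y, beta
```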
For small $q$, the setting resembles that of the theoretical considerations above and of the numerical examples in Fan and Peng (2004). For small $q$, the theorems above suggest that the relaxed Lasso and the Lars–OLS hybrid outperform Lasso estimation in terms of predictive power. On the other hand, for $q = p$ all coefficients are double-exponentially distributed, and one expects the Lasso to do very well compared with the Lars–OLS hybrid for large values of $q$. (In a Bayesian setting, if the prior for $\beta$ is chosen to be the actual double-exponential distribution of the components of $\beta$, the Lasso solution is the MAP estimator for a suitably chosen penalty $\lambda$.)
If we knew beforehand the value of $q$ (the number of relevant predictor variables), then for optimal prediction we would either choose the Lasso (if $q$ is large), that is $\phi = 1$, or the Lars–OLS hybrid (if $q$ is small), that is $\phi = 0$. However, we do not know the value of $q$. The numerical examples illustrate how well the relaxed Lasso adapts to the unknown sparsity of the underlying data.

4.1. Number of selected variables

The number of selected variables is shown in Table 1 for the Lasso estimator and in Table 2 for the relaxed Lasso. As expected, the relaxed Lasso selects roughly the correct number of variables (or fewer, if the noise is high or the number of observations $n$ is low, with the Lars–OLS hybrid selecting even fewer variables in these cases, as can be seen from Table 3). In contrast, the ordinary Lasso often selects too many noise variables (with the cross-validated choice of $\lambda$); in some of the settings considered it selects up to 34 variables, and in others up to 121. Using the considerations in the proof of Theorem 5, these numbers can be expected to grow even larger if a larger number $n$ of observations were considered.

Table 1. Average number of selected variables with the Lasso.
Table 2. Average number of selected variables with the relaxed Lasso.
Table 3. Average number of selected variables with the Lars–OLS hybrid.

4.2. Comparison with Lasso

Lasso and relaxed Lasso estimators produce nearly identical results (in terms of predictive power) if the number $q$ of relevant predictor variables is large, as can be seen from Table 4, which shows the relative improvement (in per cent) of the relaxed Lasso over the ordinary Lasso,
$$100\,\big(\bar{L}_{\mathrm{Lasso}} - \bar{L}_{\mathrm{relax}}\big)/\bar{L}_{\mathrm{relax}}. \qquad (11)$$
There is no harm in using the relaxed Lasso on such data instead of the Lasso, but there is not much to be gained either. However, for data with a very large number of noise variables (i.e. small $q$), the relaxed Lasso estimator produces a much smaller MSE, as expected from the previous theoretical results. The extent to which the relaxed Lasso outperforms the Lasso in this setting depends strongly on the signal-to-noise ratio. The improvements are larger for a high signal-to-noise ratio, where shrinkage of the selected components is not necessary. For a low signal-to-noise ratio, shrinkage of the selected components is useful, and an optimal procedure thus chooses $\phi$ close to 1 for noisy problems. Indeed, the average chosen value of $\phi$ for the relaxed Lasso is large if the signal-to-noise ratio is low, as can be seen from Table 6.

Table 4. The relative improvement (in per cent) of the relaxed Lasso over the ordinary Lasso.

In the worst case, the relaxed Lasso performs only marginally worse than the ordinary Lasso and is slightly more expensive to compute. For many sparse high-dimensional problems, however, the computation of the relaxed Lasso solutions comes at no extra computational cost and leads to sparser estimators and more accurate predictions.

4.3. Comparison with Lars–OLS hybrid

The theoretical conclusions suggest that the Lars–OLS hybrid should do equally well for sparse high-dimensional data as the relaxed Lasso. However, there are two caveats. First, the argument holds only for data with sparse structure. If the data do not have sparse structure, the Lars–OLS hybrid in general performs worse than the Lasso. The relaxed Lasso can adapt to the amount of sparseness (as seen from Table 6) by varying $\phi$ between 1 (for not so sparse data) and 0 (for sparse data). Table 5 shows the relative improvement of the relaxed Lasso over the Lars–OLS hybrid, analogously to (11). For large values of $q$, the relaxed Lasso indeed performs better than Lars–OLS in general.
Table 5. The relative improvement (in per cent) of the relaxed Lasso over the Lars–OLS hybrid, analogously to Table 4.

What is more striking than the dependence on the sparseness, however, is the dependence on the signal-to-noise ratio. Consider the case where only five variables carry signal ($q = 5$). For a high signal-to-noise ratio, the relaxed Lasso and the Lars–OLS hybrid perform approximately equally well (and both much better than the ordinary Lasso). For a low signal-to-noise ratio, however, the relaxed Lasso is considerably better than Lars–OLS. The reason for this is intuitively easy to understand. For noisy problems, it pays off to shrink the coefficients of selected variables, while this is less important for less noisy data. The relaxed Lasso adapts the amount of shrinkage to the noise level. In general, it is optimal neither to apply no shrinkage at all to the selected variables ($\phi = 0$) nor to apply the full Lasso shrinkage ($\phi = 1$). This is the reason why the relaxed Lasso performs better than both the ordinary Lasso and the Lars–OLS hybrid for noisy problems, especially when just a few variables carry signal. Given that the computational cost of the relaxed Lasso is not higher than that of the Lars–OLS hybrid (and sometimes equal to that of the Lasso), the relaxed Lasso seems well suited for high-dimensional problems, as the sparseness and the signal-to-noise ratio are in general unknown and the relaxed Lasso is adaptive to both (Table 6).

Table 6. The average value of $\phi$ chosen for the relaxed Lasso.
(Tables 1–6 about here.)
5. Conclusions
We have proposed the relaxed Lasso as a generalisation of Lasso estimation. The main motivation is very high-dimensional regression problems, where the Lasso has two shortcomings:
- Selection of noise variables: If the penalty parameter $\lambda$ is chosen by cross-validation, the number of selected variables is often very large, and many noise variables are potentially selected.
- Low accuracy of predictions: The accuracy of prediction (in terms of squared error loss) was shown to be negatively affected by the presence of many noise variables, particularly for high signal-to-noise ratios.

The relaxed Lasso addresses both shortcomings by controlling model selection and shrinkage estimation with two separate parameters:
- Sparser estimates: The number of selected coefficients is in general very much smaller for the relaxed Lasso, without compromising the accuracy of predictions. The models produced by the relaxed Lasso are thus more amenable to interpretation.
- More accurate predictions: If the signal-to-noise ratio is very low, the predictive accuracy of the Lasso and the relaxed Lasso is comparable. For a high signal-to-noise ratio, the relaxed Lasso often achieves much more accurate predictions.
6. Proofs
6.1. Proof of Theorem 6
It was assumed that the set of non-zero coefficients of is given by . Denote by the event(12)Let be any positive constant. Then Let be the smallest value of the penalty parameter such that no noise variable enters the selected variables, that is for all ,(13)The loss is smaller than . Note that, conditional on , the loss is the loss of the regular OLS-estimator on the set of the q predictor variables with non-vanishing coefficients. Let be the loss of this OLS-estimator. It follows that It follows from the proofs in Meinshausen and Bühlmann (2006) that there is a value of such that the true model is selected with the Lasso estimator, so that for . By the known properties of the OLS-estimator, there exists some for every , so that , which completes the proof. □
6.2. Some useful lemmas
6.2.1. Eigenvalues
Let be the covariance matrix, restricted to the subset of variables. Let be the corresponding empirical covariance matrix for n independent observations.

Lemma 8. Under Assumptions 2–4, there exist , so that the maximal and minimal eigenvalues and of the empirical covariance matrices are all bounded simultaneously for any and all with by from below and from above, with probability converging to 1 for ,

Proof. By Gershgorin's theorem, all eigenvalues of the empirical covariance matrix are in the set Let be the set of all predictor variables. Taking the union over all sets with , Denoting the maximal difference between the covariance matrix and its empirical version by (14) it follows that Using the assumption that is diagonally dominant at value and , for all , it follows that As with and for some , it is sufficient to show that there exist for every so that for , (15) Using Bernstein's inequality, there exists so that for any and for , With Bonferroni's inequality, Eq. (15) follows, which completes the proof. □
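For reference, the form of Gershgorin's theorem used in the proof above (a standard statement, independent of this paper's notation): every eigenvalue $\mu$ of a square matrix $A \in \mathbb{R}^{m \times m}$ satisfies, for at least one row index $l$,
$$|\mu - A_{ll}| \;\le\; \sum_{j \ne l} |A_{lj}|,$$
so that diagonal dominance of the empirical covariance matrix directly yields bounds on its largest and smallest eigenvalues.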
6.2.2. Change in gradient
Let be the set of all diagonal matrices V, where the diagonal elements are in . Lemma 9 It holds under Assumptions 2–4 that, for every , with probability converging to 1 for , simultaneously for all with and , where the inequality is understood to be fulfilled if it is fulfilled componentwise. Proof First, Thus, simultaneously for all with , it holds componentwise that where is defined as in (14). The last term is bounded by , where is the minimal eigenvalue of over all subsets with . This minimal eigenvalue is bounded from below by with probability converging to 1 for , according to Lemma 8. It remains to be shown that for any , for . This follows analogously to (15), which completes the proof. □
6.2.3. Restricted positive cone condition
The positive cone condition of Efron et al. (2004) is fulfilled if, for all subsets and all , where the inequality holds componentwise. The restricted positive cone condition is fulfilled if the inequality holds for all subsets so that . Lemma 10 Under Assumptions 2–4, the restricted positive cone condition is fulfilled for with any , with probability converging to 1 for . Moreover, for any , Proof First, for any and , By Lemma 9, the components of are, for every , simultaneously bounded for all with and by from below and by from above, with probability converging to 1 for . Thus it holds for every and , with probability converging to 1 for ,The inverse covariance matrix is by assumption diagonally dominant at value , which is equivalent to It is straightforward to show that in this case, for all , the inverse covariance matrices are diagonally dominant at value as well. For , the continuous function is hence, for all components , larger than or equal to . Note that is the inverse of the conditional variance , which is smaller than the unconditional variance . Hence, as , it holds that for all and thus for all , Choosing sufficiently small, the continuous function is for all components larger than , as , which completes the proof. □ Lemma 11 Proof The proof of Lemma 11 follows analogously to the proof of Lemma 10. □
6.2.4. Monotonicity of Lasso-solutions
Lemma 12. Under the restricted positive cone condition, the absolute value of the Lasso estimator is for all components monotonically increasing for a decreasing value of .

Proof. For any value of , let be a small change of the penalty parameter . Let be the corresponding change of the Lasso estimator, It has to be shown that for any , (16) For all components of equal to zero, the claim is automatically fulfilled. Let the set of non-zero components of be again denoted by . Denote the restriction of and to the set by and , respectively. It follows e.g. from Efron et al. (2004) that the infinitesimal change of the vector is proportional to (17) where V is a diagonal -matrix with diagonal elements , , identical to the signs of the correlations of , with the residuals , . As is a Lasso solution, is identical to the sign of for all . Therefore, componentwise, for all If the restricted positive cone condition is fulfilled, all components on the r.h.s. are positive and so the same is true for the l.h.s., and (16) follows. The restricted positive cone condition is fulfilled with probability converging to 1 for according to Lemma 10, which completes the proof. □
6.2.5. When do noise variables enter?
By assumption, the correct model is given by the first q predictor variables, . A noise variable is hence a variable with index larger than q. If any noise variable is part of the Lasso estimator, then, equivalently, there exists some so that .

Lemma 13. Let , , be a sequence with for . Then, under Assumptions 2–4 and independent predictor variables,

Proof. Let be the Lasso estimator, which is constrained to be zero outside the set , (18) If is a valid Lasso solution to the unconstrained problem, as in Eq. (2), then there does not exist, by uniqueness of the solution, any so that . It suffices hence to show that cannot be the solution to (2), with probability converging to 1 for . Using results in Osborne et al. (2000), the Lasso estimator is only a valid Lasso solution for the whole set of predictors if the gradient is smaller than or equal to for all , where, for , are the residuals under the estimator . Thus (19) Conditional on , it holds for every , that The expected value of , the averaged squared residuals, is larger than for all values of and If , then . Thus, for some , which holds for every , of which there are variables. The probability that the gradient is smaller than for all noise variables is hence bounded by Let be a sequence with for . Then , and as , for some , it follows that which, using (19), completes the proof. □
6.2.6. Error of estimators
The following lemma bounds from below the difference between the estimator under and the true parameter value. Lemma 14 Assume and Assumptions 2 and 3. For any , with probability converging to 1 for , it holds for all that for , Proof First, where is defined as in (18) as the Lasso estimator where all components of noise variables, for , are restricted to be zero. This estimator is the regular OLS estimator on the set of variables and it follows by standard results that for any series with , it holds that for . By Lemma 13, . It suffices hence to show that, for any , for all and ,(20)Note that for , by definition of in (18). Using (17), it holds for every that where . By Lemma 10 and , it holds for every with probability converging to 1 for that(21)As , it follows that with probability converging to 1 for , which shows, using for , that (20) holds true and thus completes the proof. □
6.2.7. Errors due to finite validation set
Let be the empirical version of for observations of , which are independent of the observations used to construct the relaxed Lasso estimator. Lemma 15 Proof The restricted positive cone condition is satisfied with probability converging to 1 for , according to Lemma 10. It hence suffices to show the claim under assumption of the restricted positive cone condition. Let, as before, be the set of all models attained with Lasso estimates and let , , (with ) be the largest value of the penalty parameter so that . Using Lemma 12 and the definition of the relaxed Lasso estimates, Eq. (4), any relaxed Lasso solution is in one of the sets , where for all ,(22)The estimates are the Lasso estimates for penalty parameter , and the corresponding OLS-estimates. The loss under a choice of as penalty parameters is given by For any , set . The loss can then be written as(23)where , and . The loss is hence, for a given , a quadratic function in . Both and are normal distributed random variables conditional on the sample on which is estimated. There exists some so that, for all and , for . As the number of non-zero coefficients is bounded by , it thus follows by Bernstein's inequality that there exists some for every so that, where is the empirical mean of in the sample of observations in the validation set. For the second and third term in the loss (23) it follows analogously that there exists for every so that Hence, using (23), there exists some for every so that When extending the supremum over to a supremum over , note that it is sufficient, due to (22), to consider values of in the finite set . Using Bonferroni's inequality and , it follows that there exists some for every so that which completes the proof as for . □
6.3. Proof of Theorem 5
For independent predictor variables, the loss of the Lasso estimator under penalty parameter is given by(24)using that the variance of all components of X is identical to 1 and for all . Let be defined as in (13). Using Lemma 14, it follows that for all , with probability converging to 1 for , for all and , Summing only over components with in (24), it follows that the loss is bounded from below for by(25)Now the case is examined. The range of is furthermore restricted to lie in the area , defined in (10). Denote in the following the difference between the Lasso estimators and by . Denote the difference between and the true parameter by . Then It follows by Lemma 14 that, with probability converging to 1 for , for any , . It holds by an analogous argument that . Hence, for all , By Lemma 11 and analogously to Lemma 14, it holds furthermore with probability converging to 1, that and hence, for all , As is the largest value of such that , a noise variable (with index ) enters the model if . Using again Lemma 10, with probability converging to 1 for , it holds for this component that for any ,(26)It follows that with probability converging to 1 for , Denote the infimum over of the r.h.s. by , Note that is a continuous function of and Hence, as can be chosen arbitrarily close to 1, it holds that, with probability converging to 1 for , By Lemma 13, and thus, using (25), for any , which completes the proof. □
6.4. Proof of Theorem 7
It holds for the loss under and for every thatIt follows by Theorem 6 that the first term on the r.h.s. vanishes for . The second term is by Bonferroni's inequality bounded from above by Using Lemma 15, there exists thus for every some so that which completes the proof. □
References
- Breiman, 1995. Better subset regression using the nonnegative garrote. Technometrics 35, 373-384.
- Efron et al., 2004. Least angle regression. Ann. Statist. 32, 407-451.
- Fan and Li, 2001. Variable selection via penalized likelihood. J. Amer. Statist. Assoc. 96, 1348-1360.
- Fan and Peng, 2004. Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32, 928-961.
- Frank and Friedman, 1993. A statistical view of some chemometrics regression tools (with discussion). Technometrics 35, 109-148.
- Huber, 1973. Robust regression: asymptotics, conjectures, and Monte Carlo. Ann. Statist. 1, 799-821.
- Knight and Fu, 2000. Asymptotics for lasso-type estimators. Ann. Statist. 28, 1356-1378.
- McCullagh and Nelder, 1989. Generalized Linear Models. Chapman & Hall, London.
- Meinshausen and Bühlmann, 2006. High dimensional graphs and variable selection with the lasso. Ann. Statist. 34, 1436-1462.
- Osborne et al., 2000. On the lasso and its dual. J. Comput. Graph. Statist. 9, 319-337.
- Tibshirani, 1996. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58, 267-288.
- Tsybakov and van de Geer, 2005. Square root penalty: adaptation to the margin in classification and in edge estimation. Ann. Statist. 33, 1203-1224.
- van de Geer and van Houwelingen, 2004. High-dimensional data: p ≫ n in mathematical statistics and bio-medical applications. Bernoulli 10, 939-943.
- Zhao and Yu, 2004. Boosted lasso. Technical Report 678, University of California, Berkeley.