The Spatial LASSO with Applications to Unmixing Hyperspectral Biomedical Images - Supplemental Material

Daniel V. Samarov, Maritoni Litorja, Jeeseong Hwang*

October 1, 2014

S1 Supplemental Material

This document provides additional details on the Spatial LASSO (SPLASSO) model discussed in Samarov et al. (2013). Section S2 details several properties of the SPLASSO, Section S3 provides a blockwise-coordinate descent approach to solving the SPLASSO, Section S4 introduces the adaptive SPLASSO, Section ?? highlights commonalities with microarray processing, and Section S5 provides proofs.

* Daniel V. Samarov is a Mathematical Statistician, Statistical Engineering Division, Information Technology Laboratory, National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899 (Email: [email protected]); Maritoni Litorja is a Research Chemist, Sensor Science Division, Physical Measurement Laboratory (PML), NIST, Gaithersburg, MD 20899 (Email: [email protected]); Jeeseong Hwang is a Research Biophysicist, Radiation and Biomolecular Physics Division, PML, NIST, Gaithersburg, MD 20899 (Email: [email protected]).

S2 SPLASSO properties

S2.1 Derivation of the orthonormal case

In Section 3.2 of Samarov et al. (2013) the SPLASSO is studied under the assumption of orthonormality of the design matrix. In solving for β_i in the orthonormal setting we start by fixing the remaining j ≠ i coefficients. This is the same first step taken in the blockwise-coordinate descent procedure we use to solve the SPLASSO problem (discussed in Section 4), where all but the current parameter being estimated are held fixed. Starting with equation (3.4), the objective can be expanded as

    \sum_{l'=1}^{d} \Big( \beta_{il'}^2 - 2 y_i^T x_{l'} \beta_{il'} + \lambda_1 |\beta_{il'}| + \lambda_2 \sum_{j \in N_k(y_i)} (\beta_{il'}^2 - 2 \beta_{il'} \beta_{jl'}) w_{ij} \Big) + y_i^T y_i + \lambda_2 \sum_{j \in N_k(y_i)} \beta_j^T \beta_j w_{ij}.    (S2.1)

Noting that the objective function in (3.4) is convex, the optimal solution in (S2.1) is characterized by its subgradient equations (the l1 penalty term being non-differentiable).
Recall from our discussion in Section 3.2 that for illustrative purposes we assume \sum_{j \in N_k(y_i)} w_{ij} = 1, and that α_{il} = \sum_{j \in N_k(y_i)} β_{jl} w_{ij} and β̂_{il}(OLS) = y_i^T x_l. With these notations and assumptions the subgradient equations for a particular β̂_{il} are written as

    \hat{\beta}_{il}(OLS) - (1 + \lambda_2) \hat{\beta}_{il} + \lambda_2 \alpha_{il} = \frac{\lambda_1}{2} v_l,    (S2.2)

where the subderivative term

    v_l \in \begin{cases} \{ \operatorname{sgn}(\hat{\beta}_{il}) \} & \text{if } \hat{\beta}_{il} \neq 0, \\ \{ v_l : |v_l| \le 1 \} & \text{if } \hat{\beta}_{il} = 0. \end{cases}

Dividing (S2.2) by 1 + λ2, letting γ = 1/(1 + λ2) and setting b̂_{il} = γ β̂_{il}(OLS) + (1 − γ) α_{il}, the expression in (S2.2) can be written as

    \hat{b}_{il} - \hat{\beta}_{il} = \frac{\lambda_1}{2} \gamma v_l.    (S2.3)

When β̂_{il} = 0, the subgradient equations are satisfied when

    |\hat{b}_{il}| \le \frac{\lambda_1}{2} \gamma,

and when β̂_{il} ≠ 0 we have

    \hat{\beta}_{il} = \operatorname{sgn}(\hat{b}_{il}) \Big( |\hat{b}_{il}| - \frac{\lambda_1}{2} \gamma \Big).

Note, due to the convexity and continuity of the objective function in (3.4), when β̂_{il} ≠ 0, sgn(b̂_{il}) = sgn(β̂_{il}). Putting this all together, for fixed j ≠ i the SPLASSO estimate in the orthonormal setting is

    \hat{\beta}_{il}(SPLASSO) = \operatorname{sgn}(\hat{b}_{il}) \Big( |\hat{b}_{il}| - \frac{\lambda_1}{2} \gamma \Big)_+.    (S2.4)

S2.2 The Orthonormal Case - Additional Discussion

Following the discussion from Section 3.2 in Samarov et al. (2013), Figure 1 compares the coefficient estimates of the OLS, LASSO and SPLASSO estimators under the assumption of orthonormality and varying parameter values. Here we set the weighted average of the surrounding coefficients, α_{i,l}, to be {0.1, 1, 2} (increasing from left to right along the columns), with the regularization parameter λ1 = 1 and λ2 taking values in {0.1, 1, 2} (increasing from top to bottom along the rows). The 45° line corresponds to the OLS estimate, the bold solid line is the LASSO estimate, the dashed bold line the SPLASSO estimate, and the solid black point is the value of α_{i,l}. We start off by looking at the effect that the α_{i,l}'s have on the SPLASSO estimate.
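The SPLASSO curve in Figure 1 is the shifted soft-thresholding rule (S2.4): the OLS estimate is first blended with the neighborhood average α_{i,l} and then thresholded. A minimal numerical sketch (the function and argument names are ours, not from the paper):

```python
import numpy as np

def splasso_orthonormal(beta_ols, alpha, lam1, lam2):
    """Orthonormal-case SPLASSO estimate, following (S2.4).

    beta_ols : OLS estimate y_i^T x_l
    alpha    : weighted average of the neighboring coefficients
    lam1     : sparsity (l1) regularization parameter
    lam2     : spatial regularization parameter
    """
    gamma = 1.0 / (1.0 + lam2)
    # Blend the OLS estimate with the neighborhood average ...
    b = gamma * beta_ols + (1.0 - gamma) * alpha
    # ... then soft-threshold the blended value at (lam1/2)*gamma.
    return np.sign(b) * np.maximum(np.abs(b) - 0.5 * lam1 * gamma, 0.0)
```

With lam2 = 0 (so gamma = 1) this reduces to the usual LASSO soft-thresholding of the OLS estimate; increasing lam2 pulls the estimate toward alpha and shrinks the thresholding region, matching the leftward shift seen in Figure 1.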
Going from left to right along the columns of plots in Figure 1, one can see that as the weighted average of the surrounding coefficients increases, the region over which thresholding occurs shifts to the left (this is particularly prominent in the second and third rows of plots). The implication is that unless the current coefficient's estimated value is very different from its neighbors, it is less likely to be set to 0 and more likely to be positive. Next, as the value of the regularization parameter λ2 increases (going from top to bottom), it places greater weight on the α_{i,l}'s. This also shifts the region where the SPLASSO estimates are set to 0 to the left.

S2.3 Feasible Set

Examining the set of feasible solutions for the SPLASSO in the general case is important as it gives us an understanding of how the regularization parameters, the values of neighboring coefficients and the spatial weights affect the coefficient estimates. Because the value of the spatial penalty term depends on the neighboring coefficients, we consider a few test cases to get a general idea of the model's behavior. Consider the following example: looking at Figure 2, suppose we are estimating the coefficient vector β ∈ R² corresponding to the white square at the center. The k = 1 neighborhood of this point is indicated by the light gray and dark gray squares; the squares within each shade share a common coefficient vector, denoted β_1 and β_2 respectively. Lastly, let w_1 and w_2 be the spatial weights for β_1 and β_2, respectively. Table 1 details the test cases we consider here. The particular parameterizations chosen are meant to reflect what might be expected in practice.
For the regularization parameters we consider the cases where λ1 = λ2, λ1 < λ2 or λ1 > λ2, with either all the surrounding coefficients close in value (β_1 ≈ β_2) or with one set greater than the other (β_1 > β_2), and having either similar spatial weights (w_1 ≈ w_2) or with the weights greater for one set of coefficients than the other (w_1 > w_2 or w_1 < w_2). The feasible sets corresponding to each of these cases are shown in Figure 3. For example, in Case I in Table 1 we have β_1 = β_2 = (1/2, 1/2), w_1 = w_2 = 1/2 and λ1 = λ2 = 1. This case captures the SPLASSO estimate's behavior when all the surrounding coefficients and weights on the sparse and spatial penalties are equal.

Figure 1: An illustration of the behavior of the OLS, LASSO and SPLASSO coefficient estimates in the orthonormal setting. Looking across the plots, the values of the surrounding neighbors α_{i,l} and the regularization parameter λ2 steadily increase as we move left to right and top to bottom.

Figure 2: An illustration of the test cases we consider for exploring the feasible set of the SPLASSO model. The white "pixel" at the center corresponds to the location whose coefficients (or abundances in the context of HSI), β, we are estimating, and the dark and light gray pixels are its neighbors. The light and dark gray regions each share common coefficient vectors β_1 and β_2, respectively.
In cases V and IX once again β_1 = β_2, but now we consider the cases where either the sparse term is less than the spatial term, λ1 (= 1) < λ2 (= 2), or vice versa, λ1 (= 2) > λ2 (= 1). The other cases shown in Table 1 consider other possible combinations of neighboring coefficient, regularization and spatial weight values. Note, we do not claim that the cases considered here are exhaustive; as stated earlier, they are meant to provide insight into general model behavior.

Figure 3 shows the feasible sets, highlighted in light blue, and Table 1 provides details on the parameters used. Looking at Figure 3 we can see that depending on whether λ1 is equal to, greater than or less than λ2 (going from left to right), the corners of the feasible set become more or less pronounced, and thresholding becomes correspondingly more or less likely to occur. Similarly, depending on whether β_1 is equal to or greater than β_2, and whether the spatial weight w_1 is equal to, greater than or less than w_2, we see that the feasible set is pulled more or less in the direction of the parameter with the greatest weight.

Case   β_1          β_2          w_1     w_2     λ1    λ2
I      (1/2, 1/2)   (1/2, 1/2)   1/2     1/2     1     1
II     (1, 1)       (1/2, 1/2)   1/2     1/2     1     1
III    (1, 1)       (1/2, 1/2)   0.75    0.25    1     1
IV     (1, 1)       (1/2, 1/2)   0.25    0.75    1     1
V      (1/2, 1/2)   (1/2, 1/2)   1/2     1/2     1     2
VI     (1, 1)       (1/2, 1/2)   1/2     1/2     1     2
VII    (1, 1)       (1/2, 1/2)   0.75    0.25    1     2
VIII   (1, 1)       (1/2, 1/2)   0.25    0.75    1     2
IX     (1/2, 1/2)   (1/2, 1/2)   1/2     1/2     2     1
X      (1, 1)       (1/2, 1/2)   1/2     1/2     2     1
XI     (1, 1)       (1/2, 1/2)   0.75    0.25    2     1
XII    (1, 1)       (1/2, 1/2)   0.25    0.75    2     1

Table 1: The values used to generate the different feasible regions shown in Figure 3. The cases labelled I, II, . . . , XII correspond to the parameters used in each of the panels shown in Figure 3. The values chosen were meant to reflect the cases one might expect to see in practice.
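The regions in Figure 3 can also be probed numerically. The sketch below assumes a constrained form of the SPLASSO penalty, λ1 |β|_1 + λ2 Σ_j w_j ||β − β_j||² ≤ t, where the budget t is our notation (the figures plot only the shape of the regions):

```python
import numpy as np

def in_feasible_set(beta, neighbors, weights, lam1, lam2, budget):
    """Check whether beta satisfies the (assumed) constraint
    lam1 * ||beta||_1 + lam2 * sum_j w_j * ||beta - beta_j||^2 <= budget."""
    beta = np.asarray(beta, dtype=float)
    penalty = lam1 * np.abs(beta).sum()
    for beta_j, w_j in zip(neighbors, weights):
        penalty += lam2 * w_j * np.sum((beta - np.asarray(beta_j)) ** 2)
    return penalty <= budget

# Case I from Table 1: beta_1 = beta_2 = (1/2, 1/2), w1 = w2 = 1/2, lam1 = lam2 = 1.
nbrs, ws = [(0.5, 0.5), (0.5, 0.5)], [0.5, 0.5]
```

Sitting exactly at the neighbors' common value makes the spatial term vanish, so only the l1 term is charged there; this is why the regions are pulled toward the (weighted) neighboring coefficients.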
S2.4 Decorrelation, convexity and relationship to the elastic net

As previously mentioned, the SPLASSO penalty shares some similarities with the elastic net; in particular it combines the "decorrelation" property of the latter with spatial smoothing across the coefficients. Before exploring this connection in more detail we introduce some notation: let d_i = \sum_{j=1}^n w_{ij}, with w_{ij} = w_{ji}, let D = diag(d_1, . . . , d_n) and 𝒟 = D ⊗ I_{m×m}, where ⊗ denotes the Kronecker product. Next define W = {w_{ij}}_{i,j=1}^n, 𝒲 = W ⊗ I_{m×m} and ℒ = 𝒟 − 𝒲. Finally, letting B = (β_1^T, . . . , β_n^T)^T, 𝒳 = I_{n×n} ⊗ X and 𝒴 = (y_1^T, . . . , y_n^T)^T, we can express the SPLASSO objective function as

    \hat{B}(naive SPLASSO) = \arg\min_B ||\mathcal{Y} - \mathcal{X} B||^2 + \lambda_1 |B|_1 + \lambda_2 B^T \mathcal{L} B.    (S2.5)

This representation of the SPLASSO loss function allows for some interesting insights. The key observation is that the matrix ℒ in (S2.5) is the well-known "graph Laplacian" used in spectral clustering (see von Luxburg (2007) for an overview of this topic). Spectral clustering is an unsupervised learning method which is used as an approximation to the graph cut problem (see Hagen & Kahng (1992) and Shi & Malik (2000)).

Figure 3: This figure shows the set of feasible solutions for the SPLASSO model and its relationship to the regularization parameters λ1 and λ2 (columns) and the neighboring coefficients and their associated weights (rows). Specific values for these parameters are shown in Table 1.

The objective of spectral clustering is to separate (i.e. cluster) a collection of observations into two or more groups, based on the assumption that there exist high and low density regions in the data. More specifically, spectral clustering treats each observation as a node on a graph, with the graph Laplacian matrix representing the edge weights. If the edge weights are properly specified, the points which are similar to one another will have larger edge weights (i.e.
will be strongly connected to one another), and those that are different will have smaller edge weights (i.e. are less connected). The final step is to compute the eigenvalues and eigenvectors of the graph Laplacian and apply a standard clustering algorithm (such as k-means) to the eigenvectors corresponding to the k − 1 smallest eigenvalues (excluding the smallest) to determine the k clusters. For a discussion of the intuition behind the latter we refer the reader to von Luxburg (2007).

Expressing the spatial penalty in the SPLASSO objective function using the graph Laplacian provides us with some interesting insights. First, the positive semi-definiteness of the graph Laplacian ensures convexity of the SPLASSO loss. Second, in the same way that ℒ connects and separates high and low density regions in the clustering setting, within the framework of our model it connects and separates (dis)similar coefficient vectors, producing smoother, less noisy estimates. The following theorem gives a more detailed look at how our coefficient estimates are affected by the graph Laplacian.

Theorem S2.1. Given the data (𝒴, 𝒳) and regularization parameters (λ1, λ2), the SPLASSO estimates B̂ are given by

    \hat{B}(SPLASSO) = \arg\min_B B^T \frac{\mathcal{X}^T \mathcal{X} + \lambda_2 \mathcal{L}}{1 + \lambda_2} B - 2 \mathcal{Y}^T \mathcal{X} B + \lambda_1 |B|_1.    (S2.6)

If λ2 = 0 in (S2.6) then the SPLASSO simply becomes a series of LASSO models. On the other hand, by setting λ2 > 0 we connect each β_i to its neighbors. This can be seen more clearly by writing out the first term in (S2.6). Letting S = X^T X we have

    \frac{\mathcal{X}^T \mathcal{X} + \lambda_2 \mathcal{L}}{1 + \lambda_2} =
    \begin{bmatrix}
      \frac{S + \lambda_2 (d_1 - w_{11}) I}{1 + \lambda_2} & -\frac{\lambda_2 w_{12}}{1 + \lambda_2} I & \cdots & -\frac{\lambda_2 w_{1n}}{1 + \lambda_2} I \\
      -\frac{\lambda_2 w_{21}}{1 + \lambda_2} I & \frac{S + \lambda_2 (d_2 - w_{22}) I}{1 + \lambda_2} & \cdots & -\frac{\lambda_2 w_{2n}}{1 + \lambda_2} I \\
      \vdots & \vdots & \ddots & \vdots \\
      -\frac{\lambda_2 w_{n1}}{1 + \lambda_2} I & -\frac{\lambda_2 w_{n2}}{1 + \lambda_2} I & \cdots & \frac{S + \lambda_2 (d_n - w_{nn}) I}{1 + \lambda_2}
    \end{bmatrix}.    (S2.7)

The (i, j)th block in (S2.7) tells us the degree of connectivity between β_i and β_j. Comparing this to the elastic net, the matrices (S + λ2(d_i − w_{ii})I)/(1 + λ2), i = 1, . . .
, n, along the diagonal are the same, up to the scalar term d_i − w_{ii}. The relationship between the SPLASSO and elastic net can be made even clearer if we consider a slight variation on the SPLASSO penalty; specifically, we change the penalty term to

    \sum_{ij} \Big|\Big| \frac{\beta_i}{\sqrt{d_i}} - \frac{\beta_j}{\sqrt{d_j}} \Big|\Big|^2 w_{ij} = B^T \mathcal{L}_{sym} B,    (S2.8)

where ℒ_sym = ℐ − 𝒟^{−1/2} 𝒲 𝒟^{−1/2}. The matrix ℒ_sym is referred to as the "normalized" graph Laplacian in the spectral clustering literature. Letting 𝒲_sym = 𝒟^{−1/2} 𝒲 𝒟^{−1/2} we have the following result.

Corollary S2.1. Given the modified penalty term in (S2.8), data (𝒴, 𝒳) and regularization parameters (λ1, λ2), the SPLASSO estimates B̂_sym(SPLASSO) are given by

    \hat{B}_{sym} = \arg\min_B B^T \Big( \frac{\mathcal{X}^T \mathcal{X} + \lambda_2 \mathcal{I}}{1 + \lambda_2} - \frac{\lambda_2 \mathcal{W}_{sym}}{1 + \lambda_2} \Big) B - 2 \mathcal{Y}^T \mathcal{X} B + \lambda_1 |B|_1.    (S2.9)

The matrix (𝒳^T 𝒳 + λ2 ℐ)/(1 + λ2) in (S2.9) now has the exact same form as in the elastic net. This representation of the SPLASSO illustrates the balance between the decorrelation of variables and the constraint of spatial smoothness. In HSI applications the decorrelation property can be particularly important, as endmembers are often strongly correlated. It was shown that when two (or more) covariates are exactly correlated the LASSO fails to find a unique solution (Zou & Hastie (2005), Lemma 2), and that it generally encounters difficulties when covariates are highly correlated. One of the key properties of the elastic net is that it is able to overcome this through the introduction of the l2 penalty term. In the following theorem we show that the SPLASSO shares a similar property.

Theorem S2.2. Given data (y_i, X), i = 1, . . . , n and parameters (λ1, λ2), let the response y_i be centered and the predictors X standardized to have unit norm. With β̂_i(λ1, λ2) defined as the naive SPLASSO estimate, suppose β̂_{i,l}(λ1, λ2) β̂_{i,s}(λ1, λ2) > 0.
Then with

    D_{\lambda_1, \lambda_2}(l, s) = \frac{1}{||y_i||} |\hat{\beta}_{i,l}(\lambda_1, \lambda_2) - \hat{\beta}_{i,s}(\lambda_1, \lambda_2)|,

we have

    D_{\lambda_1, \lambda_2}(l, s) \le \frac{1}{\lambda_2} \sqrt{2(1 - \rho)} + \frac{1}{||y_i||} \sum_{j \in N_k(y_i)} |\beta_{j,l} - \beta_{j,s}| w_{ij},    (S2.10)

where ρ = x_l^T x_s is the sample correlation.

What this theorem tells us is that under the SPLASSO model, when two variables x_l and x_s are highly correlated, the difference between their coefficient estimates becomes progressively smaller. Unlike the elastic net, however, there is an additional term incorporating the differences associated with neighboring coefficients.

S2.5 Connections with univariate soft thresholding

In this section we take a closer look at the behavior of the SPLASSO model as the weight placed on the spatial regularization term increases to infinity. Using (S2.6) and (S2.9) in Theorem S2.1 and Corollary S2.1 respectively, straightforward calculations show that the SPLASSO estimate becomes

    \arg\min_B B^T \mathcal{L} B - 2 \mathcal{Y}^T \mathcal{X} B + \lambda_1 |B|_1,  as λ2 → ∞.

Taking a closer look at the objective function above, it can be re-expressed as

    \sum_{i=1}^{n} \Big( -2 y_i^T X \beta_i + \sum_{j \in N_k(y_i)} ||\beta_i - \beta_j||^2 w_{ij} + \lambda_1 |\beta_i|_1 \Big),

with straightforward modifications for ℒ_sym. Written this way, the solution for a particular β_{i,l}, given its neighbors, is simply

    \hat{\beta}_{i,l} = \operatorname{sgn}\Big( y_i^T x_l + \sum_{j \in N_k(y_i)} \beta_{j,l} w_{ij} \Big) \Big( \Big| y_i^T x_l + \sum_{j \in N_k(y_i)} \beta_{j,l} w_{ij} \Big| - \frac{\lambda_1}{2} \Big)_+

as λ2 → ∞. Thus as λ2 → ∞ the correlation between coefficients is shrunk to zero, leaving y_i^T x_l + \sum_{j \in N_k(y_i)} \beta_{j,l} w_{ij}, the univariate regression estimator plus the weighted average of the neighboring coefficients.

S2.6 Rescaling and the SPLASSO

In Zou & Hastie (2005) the authors showed that the introduction of the l2 penalty term resulted in the elastic net's coefficient estimates receiving shrinkage from both the l1 and the l2 penalties. To address this, they proposed rescaling the "naive" solution, as they called it, by a factor of 1 + λ2. Their justification for this stemmed primarily from two points: first, that the LASSO is minimax optimal (Donoho et al.
(1995)), and second, that in the orthonormal setting the naive elastic net is simply the LASSO solution scaled by 1/(1 + λ2). Intuitively then, multiplying the naive elastic net solution by 1 + λ2 would yield a minimax optimal solution. This logic was then extended to the general non-orthonormal case.

In the context of the SPLASSO we recommend against this rescaling. To see why, we begin by re-expressing the quantity in (S2.4) as

    \frac{1}{1 + \lambda_2} \operatorname{sgn}\big( \hat{\beta}_{i,l}(OLS) + \lambda_2 \alpha_{i,l} \big) \Big( |\hat{\beta}_{i,l}(OLS) + \lambda_2 \alpha_{i,l}| - \frac{\lambda_1}{2} \Big)_+.    (S2.11)

From (S2.11) we see the same scaling factor of 1/(1 + λ2) as in the elastic net. However, this scaling factor plays an important role in balancing the influence of the surrounding coefficient estimates and the OLS estimate. Removing it would cause the overall estimate to be inflated by the term λ2 α_{i,l}.

S3 A computationally efficient solution to the SPLASSO

Even with leveraging parallelization as described in Section 4.2, for problems with sufficiently large n the process of solving the LARS-SPLASSO algorithm at each point can still become computationally expensive. In the context of some HSI applications this can be a major issue, as images may often have a spatial resolution of more than 1000 × 1000 and a spectral resolution of 1000 or more. In the following section we propose a more efficient solution to the SPLASSO problem, also based on the ideas of coordinate descent.

In recent work by Friedman et al. (2007) a coordinate descent algorithm was proposed for the LASSO and for several related methods. In that work it was shown that coordinate descent provided considerable improvements in computational speed over competitors (including LARS and other state-of-the-art optimization techniques). Additionally, in Liu et al. (2009) a similar blockwise coordinate descent algorithm was proposed for the multivariate response case. Building on these ideas we show how a similar blockwise-coordinate descent based approach as in Liu et al.
(2009) can be implemented for the SPLASSO. Our algorithm consists of simultaneously updating the coefficients for a given β_i while holding all the others fixed, then cycling through this process until convergence. Suppose the current estimates are β̂_i, i = 1, . . . , n. Then β_q is updated as

    \hat{\beta}_{q,k} = \arg\min_{\beta_{q,k}} ||r_{q,k} - x_k \beta_{q,k}||^2 + \lambda_1 |\beta_{q,k}| + \lambda_2 \sum_{j \in N(y_q)} (\beta_{q,k} - \beta_{j,k})^2 w_{qj},  for k = 1, . . . , m,    (S3.1)

where r_{q,k} = y_q − \sum_{l \neq k} x_l β_{q,l} denotes the partial residual vector. It can be shown that (S3.1) can be solved in closed form. Letting b_{q,k} = r_{q,k}^T x_k + λ2 \sum_{j \in N(y_q)} β_{j,k} w_{qj}, we have

    \hat{\beta}_{q,k} = \frac{ \operatorname{sgn}(b_{q,k}) \big( |b_{q,k}| - \lambda_1/2 \big)_+ }{ x_k^T x_k + \lambda_2 \sum_{j \in N(y_q)} w_{qj} }.    (S3.2)

What is appealing about (S3.2) is that most quantities can be pre-computed. Specifically, for r_{q,k}^T x_k = y_q^T x_k − \sum_{l \neq k} x_l^T x_k β_{q,l}, the inner products u_{q,k} = y_q^T x_k, 1 ≤ q ≤ n, 1 ≤ k ≤ m, and v_{l,k} = x_l^T x_k, 1 ≤ l, k ≤ m, as well as \sum_{j \in N(y_q)} w_{qj}, can all be calculated in advance. This provides considerable savings in computation once we start iterating through the coordinate descent algorithm. Note, the updating procedure we just described is the same as the covariance update discussed in Friedman et al. (2010).

The coordinate descent SPLASSO algorithm begins by initializing β̂_i = 0, i = 1, . . . , n. To generate the solution path, a decreasing sequence of regularization parameters, λ1 ∈ {C∆^t, t = 0, . . . , t_0}, 0 < ∆ < 1, t_0 ∈ Z_+, is selected, where C ∈ R_+. Here C is chosen so that for t = 0 effectively all the coefficient estimates will be thresholded to 0, and for t = t_0 they will be close to the OLS estimates, i.e. no thresholding. Once the coefficients have been estimated by the coordinate descent algorithm for t = 0, the coefficient estimates are recalculated for each subsequent t using the previous coefficient estimates as the starting values. This is the warm start concept discussed in Liu et al. (2009) and Friedman et al. (2010).
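The update (S3.2), together with the pre-computed inner products, can be sketched in a few lines. This is a simplified single-λ1 version (the warm-start path loop and the bookkeeping of Figure 4 are omitted; array shapes and names are our own):

```python
import numpy as np

def splasso_cd(Y, X, W, lam1, lam2, n_iter=200, tol=1e-6):
    """Blockwise coordinate descent for the SPLASSO, following (S3.1)-(S3.2).

    Y : (n, p) array, one response spectrum y_q per row
    X : (p, m) design matrix (e.g. an endmember library)
    W : (n, n) spatial weights w_qj (zero outside each neighborhood)
    """
    n, m = Y.shape[0], X.shape[1]
    B = np.zeros((n, m))
    U = Y @ X              # pre-computed u_{q,k} = y_q^T x_k
    V = X.T @ X            # pre-computed v_{l,k} = x_l^T x_k
    w_sum = W.sum(axis=1)  # pre-computed sum_j w_qj for each pixel q
    for _ in range(n_iter):
        B_old = B.copy()
        for q in range(n):
            for k in range(m):
                # r_{q,k}^T x_k via the covariance update (no explicit residual).
                r_x = U[q, k] - V[:, k] @ B[q] + V[k, k] * B[q, k]
                b = r_x + lam2 * (W[q] @ B[:, k])
                num = np.sign(b) * max(abs(b) - 0.5 * lam1, 0.0)
                B[q, k] = num / (V[k, k] + lam2 * w_sum[q])
        if np.max(np.abs(B - B_old)) < tol:
            break
    return B
```

With lam2 = 0 each pixel decouples into an independent LASSO, which gives a convenient sanity check against ordinary soft-thresholding in the orthonormal case.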
A summary of the algorithm is provided in Figure 4. Note, additional speedups are possible by indexing those β̂_{q,k} which have converged or have been set to 0 for a given λ1, and skipping them as we iterate through the algorithm.

Coordinate Descent SPLASSO:

• Compute the inner products u_{q,k}, 1 ≤ q ≤ n, 1 ≤ k ≤ m, and v_{l,k}, 1 ≤ l, k ≤ m, and the spatial weights w_{ij}.

• Select values 0 < ∆ < 1, t_0 and C.

• For each λ1 ∈ {C∆^t, t = 0, . . . , t_0}, iterate through the following until convergence:

  1. If t = 0 set the starting values β_i = 0; if t > 0 set β_i = β̂_i, where β̂_i are the previous iteration's estimates of the coefficients.

  2. For each q ∈ {1, . . . , n} and every k ∈ {1, . . . , m} calculate

         \hat{\beta}'_{q,k} = \frac{ \operatorname{sgn}(b_{q,k}) \big( |b_{q,k}| - \lambda_1/2 \big)_+ }{ x_k^T x_k + \lambda_2 \sum_{j \in N(y_q)} w_{qj} }.

  3. If |β̂'_{q,k} − β̂_{q,k}| < ε for all q and k, with ε small, stop; else repeat.

• Output the SPLASSO solution path β̂_i.

Figure 4: The Coordinate Descent SPLASSO algorithm.

S4 The Adaptive SPLASSO

One of the shortcomings of the LASSO model is that while the inclusion of the l1 penalty term comes with the benefit of encouraging sparsity, it also introduces biased coefficient estimates (as illustrated in Figure 3 of Samarov et al. (2013)). The adaptive LASSO, or ALASSO (Zou (2006)), was developed to help reduce this bias through the addition of a set of weights for each coefficient β_{i,l} on the penalty term λ (or λ1 in the case of the SPLASSO). The weights for those coefficients whose value is significantly different from 0 would be smaller (thus reducing the amount of shrinkage), and greater for those closer to 0 (increasing the amount of shrinkage). Using a similar framework to the ALASSO, we propose a variant of the SPLASSO we call the adaptive SPLASSO (or ASPLASSO). As with the ALASSO, we introduce a set of weights for each β_{i,l} in our model, say φ_{i,l}, i = 1, . . . , n, l = 1, . . . , m, which are smaller or larger according to the coefficient's relative importance.
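One concrete choice for these weights, the reciprocal of the weighted average of the neighboring coefficient estimates from a pilot fit, can be sketched as follows (the absolute value and the small eps guard against division by zero are our additions):

```python
import numpy as np

def adaptive_weights(B, W, eps=1e-8):
    """phi_{i,l} = 1 / (weighted average of neighboring estimates of beta_l).

    B : (n, m) pilot coefficient estimates, one row per pixel
    W : (n, n) spatial weights w_ij (zero outside each neighborhood)
    """
    avg = W @ B  # entry (i, l) = sum_j w_ij * beta_{j,l}
    # Large neighborhood averages -> small weights -> little extra shrinkage;
    # averages near zero -> large weights -> strong shrinkage toward zero.
    return 1.0 / np.maximum(np.abs(avg), eps)
```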
The ASPLASSO model can be written as

    \sum_{i=1}^{n} \Big( ||y_i - X \beta_i||^2 + \lambda_1 \sum_{l=1}^{m} \phi_{i,l} |\beta_{i,l}| + \lambda_2 \sum_{j \in N(y_i)} w_{ij} \sum_{l=1}^{m} (\beta_{i,l} - \beta_{j,l})^2 \Big).    (S4.1)

The weights φ_{i,l} can take a number of forms; a reasonable choice is the reciprocal of the weighted average of the neighboring coefficient estimates,

    \phi_{i,l} = \frac{1}{ \sum_{j \in N_k(y_i)} w_{ij} \beta_{j,l} }.    (S4.2)

This model has a similar interpretation to the standard SPLASSO model discussed in Section 3, with one slight modification resulting from the introduction of the weights φ_{i,l}. The effect of these weights is most easily seen in the case where we take X as orthonormal. With b̂_{i,l}, α_{i,l} and γ as defined in Section 3.2, the ASPLASSO estimate is

    \hat{\beta}_{i,l}(naive ASPLASSO) = \operatorname{sgn}(\hat{b}_{i,l}) \Big( |\hat{b}_{i,l}| - \frac{\lambda_1}{2} \gamma \phi_{i,l} \Big)_+.    (S4.3)

Setting φ_{i,l} = 1/|b̂_{i,l}| in (S4.3), we see that as |b̂_{i,l}| increases there will be less weight placed on the penalty term, resulting in a reduction in the amount of shrinkage. On the other hand, with |b̂_{i,l}| close to 0, the amount of shrinkage, and the likelihood that the estimate will be set to 0, increases.

S4.1 Solving the ASPLASSO

Both the LARS and the coordinate descent approaches used to solve the SPLASSO, described in Section 4 of Samarov et al. (2013), can also be used to solve the ASPLASSO. We begin with the LARS formulation. Let x†_{il} = x_l/φ_{i,l}, β†_{i,l} = φ_{i,l} β_{i,l}, w†_{ijl} = λ2 w_{ij}/φ²_{i,l} and β†_{j,l} = φ_{i,l} β_{j,l}, for j ∈ N(y_i). Putting these together, the problem in (S4.1) can then be expressed as

    \sum_{i=1}^{n} ||y_i - X \beta_i||^2 + \lambda_1 \sum_{l=1}^{m} \phi_{i,l} |\beta_{i,l}| + \lambda_2 \sum_{j \in N(y_i)} w_{ij} \sum_{l=1}^{m} (\beta_{i,l} - \beta_{j,l})^2
      = \sum_{i=1}^{n} \Big|\Big| y_i - \sum_{l=1}^{m} \frac{x_l}{\phi_{i,l}} \phi_{i,l} \beta_{i,l} \Big|\Big|^2 + \lambda_1 \sum_{l=1}^{m} \phi_{i,l} |\beta_{i,l}| + \lambda_2 \sum_{j \in N(y_i)} \sum_{l=1}^{m} \frac{w_{ij}}{\phi_{i,l}^2} (\beta_{i,l} - \beta_{j,l})^2 \phi_{i,l}^2
      = \sum_{i=1}^{n} \Big|\Big| y_i - \sum_{l=1}^{m} x^\dagger_{il} \beta^\dagger_{i,l} \Big|\Big|^2 + \lambda_1 \sum_{l=1}^{m} |\beta^\dagger_{i,l}| + \sum_{j \in N(y_i)} \sum_{l=1}^{m} w^\dagger_{ijl} (\beta^\dagger_{i,l} - \beta^\dagger_{j,l})^2
      = \sum_{i=1}^{n} ||y_i - X^\dagger_i \beta^\dagger_i||^2 + \lambda_1 |\beta^\dagger_i|_1 + \sum_{j \in N(y_i)} (\beta^\dagger_i - \beta^\dagger_j)^T W^\dagger_{ij} (\beta^\dagger_i - \beta^\dagger_j),    (S4.4)

where, writing N(y_i) = {j_1, . . . , j_{n_k}} and letting X† denote the matrix with columns x†_{il},

    y^\dagger_i = \begin{pmatrix} y_i \\ W^{\dagger 1/2}_{i j_1} \beta^\dagger_{j_1} \\ \vdots \\ W^{\dagger 1/2}_{i j_{n_k}} \beta^\dagger_{j_{n_k}} \end{pmatrix}, \quad
    X^\dagger_i = (1 + \lambda_2)^{-1/2} \begin{pmatrix} X^\dagger \\ W^{\dagger 1/2}_{i j_1} \\ \vdots \\ W^{\dagger 1/2}_{i j_{n_k}} \end{pmatrix}, \quad
    \beta^\dagger_i = (1 + \lambda_2)^{1/2} \begin{pmatrix} \beta^\dagger_{i,1} \\ \vdots \\ \beta^\dagger_{i,m} \end{pmatrix}, \quad
    \beta^\dagger_j = \begin{pmatrix} \beta^\dagger_{j,1} \\ \vdots \\ \beta^\dagger_{j,m} \end{pmatrix}, \quad
    W^\dagger_{ij} = \operatorname{diag}(w^\dagger_{ij1}, . . . , w^\dagger_{ijm}),  for j ∈ N(y_i).    (S4.5)

Through calculations similar to (4.1), and using the quantities in (S4.5), (S4.4) can be written as

    \sum_{i=1}^{n} ||y^\dagger_i - X^\dagger_i \beta^\dagger_i||^2 + \frac{\lambda_1}{(1 + \lambda_2)^{1/2}} |\beta^\dagger_i|_1.    (S4.6)

The same LARS-SPLASSO algorithm can then be used to solve (S4.6).

The coordinate descent approach to solving the ASPLASSO requires changing the update of β̂'_{q,k} in Step 2 of the coordinate descent SPLASSO algorithm, shown in Figure 6 of Samarov et al. (2013), to

    \hat{\beta}'_{q,k} = \frac{ \operatorname{sgn}(b_{q,k}) \big( |b_{q,k}| - \lambda_1 \phi_{q,k}/2 \big)_+ }{ x_k^T x_k + \lambda_2 \sum_{j \in N(y_q)} w_{qj} }.

Work is currently under way to implement both the LARS and coordinate descent solutions to the ASPLASSO.

S5 Proofs

In this section we provide proofs of some of the results discussed in the previous sections. We begin with Theorem S2.1.

Proof. Let B̂ be the solution of (S2.5). Calculations similar to (4.1) show that (S2.5) can be written as

    ||\mathcal{Y}^* - \mathcal{X}^* B^*||^2 + \frac{\lambda_1}{(1 + \lambda_2)^{1/2}} |B^*|_1,    (S5.1)

where

    \mathcal{Y}^* = \begin{pmatrix} \mathcal{Y} \\ 0 \end{pmatrix}, \quad \mathcal{X}^* = (1 + \lambda_2)^{-1/2} \begin{pmatrix} \mathcal{X} \\ \lambda_2^{1/2} \mathcal{L}^{1/2} \end{pmatrix}, \quad B = (1 + \lambda_2)^{1/2} B^*.    (S5.2)

Using (S5.1) we get

    \hat{B} = \arg\min_B \Big|\Big| \mathcal{Y}^* - \mathcal{X}^* \frac{B}{(1 + \lambda_2)^{1/2}} \Big|\Big|^2 + \frac{\lambda_1}{(1 + \lambda_2)^{1/2}} \Big| \frac{B}{(1 + \lambda_2)^{1/2}} \Big|_1
        = \arg\min_B B^T \frac{\mathcal{X}^{*T} \mathcal{X}^*}{1 + \lambda_2} B - 2 \frac{\mathcal{Y}^{*T} \mathcal{X}^*}{(1 + \lambda_2)^{1/2}} B + \frac{\lambda_1 |B|_1}{1 + \lambda_2} + \mathcal{Y}^{*T} \mathcal{Y}^*.    (S5.3)

Plugging the identities in (S5.2) into (S5.3) (and multiplying through by 1 + λ2, which does not change the minimizer), we have

    \hat{B} = \arg\min_B B^T \frac{\mathcal{X}^T \mathcal{X} + \lambda_2 \mathcal{L}}{1 + \lambda_2} B - 2 \mathcal{Y}^T \mathcal{X} B + \lambda_1 |B|_1 + \mathcal{Y}^T \mathcal{Y}
        = \arg\min_B B^T \frac{\mathcal{X}^T \mathcal{X} + \lambda_2 \mathcal{L}}{1 + \lambda_2} B - 2 \mathcal{Y}^T \mathcal{X} B + \lambda_1 |B|_1.

Next we provide a proof of Theorem S2.2. Note that the structure of this proof is quite similar to that of Zou & Hastie (2005).

Proof. If β̂_{i,l}(λ1, λ2) β̂_{i,s}(λ1, λ2) > 0, then both β̂_{i,l}(λ1, λ2) and β̂_{i,s}(λ1, λ2) are non-zero and have the same sign.
Taking the derivative of the SPLASSO loss (3.4) with respect to β_{i,l} and β_{i,s}, and setting it equal to zero, one gets

    -2 x_l^T (y_i - X \hat{\beta}_i(\lambda_1, \lambda_2)) + \lambda_1 \operatorname{sgn}(\hat{\beta}_{i,l}(\lambda_1, \lambda_2)) + 2 \lambda_2 \sum_{j \in N_k(y_i)} (\hat{\beta}_{i,l} - \beta_{j,l}) w_{ij} = 0,    (S5.4)

    -2 x_s^T (y_i - X \hat{\beta}_i(\lambda_1, \lambda_2)) + \lambda_1 \operatorname{sgn}(\hat{\beta}_{i,s}(\lambda_1, \lambda_2)) + 2 \lambda_2 \sum_{j \in N_k(y_i)} (\hat{\beta}_{i,s} - \beta_{j,s}) w_{ij} = 0.    (S5.5)

Next we subtract (S5.4) from (S5.5), which gives

    (x_s - x_l)^T (y_i - X \hat{\beta}_i(\lambda_1, \lambda_2)) + \lambda_2 (\hat{\beta}_{i,l}(\lambda_1, \lambda_2) - \hat{\beta}_{i,s}(\lambda_1, \lambda_2)) - \lambda_2 \sum_{j \in N_k(y_i)} (\beta_{j,l} - \beta_{j,s}) w_{ij} = 0,

which is equivalent to

    \hat{\beta}_{i,l}(\lambda_1, \lambda_2) - \hat{\beta}_{i,s}(\lambda_1, \lambda_2) = \frac{1}{\lambda_2} (x_l - x_s)^T \hat{r}_i(\lambda_1, \lambda_2) + \sum_{j \in N_k(y_i)} (\beta_{j,l} - \beta_{j,s}) w_{ij},    (S5.6)

where r̂_i(λ1, λ2) = y_i − X β̂_i(λ1, λ2) is the residual vector. Since the columns of X are standardized to have unit norm, ||x_l − x_s||² = 2(1 − ρ), where ρ = x_l^T x_s. Since our loss function is convex,

    ||\hat{r}_i(\lambda_1, \lambda_2)||^2 + \lambda_2 ||\hat{\beta}_i(\lambda_1, \lambda_2)||^2 + \lambda_1 |\hat{\beta}_i(\lambda_1, \lambda_2)|_1 \le ||y_i||^2,

so ||r̂_i(λ1, λ2)|| ≤ ||y_i||. Then (S5.6) implies that

    D_{\lambda_1, \lambda_2}(l, s) \le \frac{1}{||y_i||} \frac{|(x_l - x_s)^T \hat{r}_i|}{\lambda_2} + \frac{1}{||y_i||} \sum_{j \in N_k(y_i)} |\beta_{j,l} - \beta_{j,s}| w_{ij}
        \le \frac{1}{\lambda_2} \sqrt{2(1 - \rho)} + \frac{1}{||y_i||} \sum_{j \in N_k(y_i)} |\beta_{j,l} - \beta_{j,s}| w_{ij}.

References

Donoho, D., Johnstone, I., Kerkyacharian, G. & Picard, D. (1995). Wavelet shrinkage: Asymptopia? (with discussion). J. R. Statist. Soc. B 57, 301–369.

Friedman, J., Hastie, T., Hofling, H. & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics 1, 302–332.

Friedman, J., Hastie, T. & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1–22.

Hagen, L. & Kahng, A. (1992). New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Computer-Aided Design 11, 1074–1085.

Liu, H., Palatucci, M. & Zhang, J. (2009). Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery.

Samarov, D., Litorja, M. & Hwang, J. (2013).
The spatial lasso with applications to unmixing hyperspectral biomedical images. Technometrics (submitted).

Shi, J. & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 888–905.

von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing 17, 395–416.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.

Zou, H. & Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Statist. Soc. B 67, 301–320.
