Software & Data

Adelie: software for fitting group lasso and elastic net models

Adelie is an efficient algorithm for solving the group lasso/elastic net penalized regression problem. Adelie fits the entire regularization path by block coordinate descent. Our package adelie (implemented in Python and also in R with a package on CRAN) appears to be faster by factors of 3 to 10 than the next fastest package on CRAN for wide data. We deliver solutions for all the GLM and other families that are available in glmnet, and when all the groups are of size 1, the package matches the performance of glmnet. The core code beneath adelie was written by James Yang in C++, and he developed the Python package. The R package was developed by James Yang, Trevor Hastie and Balasubramanian Narasimhan. Adelie software is well documented in both Python and R, and vignettes on both platforms show examples of usage. The software is available via "pip install adelie" in Python, and "install.packages('adelie')" in R.
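
Block coordinate descent for the group lasso updates one whole group of coefficients at a time. The sketch below shows the simplest form of that update, group soft-thresholding, under the assumption that the columns within a group are orthonormal so the update has a closed form. It illustrates how the penalty keeps or drops a group as a unit; it is not Adelie's C++ solver, and the function name is hypothetical.

```python
import math

def group_soft_threshold(z, lam):
    """Shrink the vector z toward zero by lam in Euclidean norm.

    Returns the zero vector when ||z|| <= lam, so the whole group
    enters or leaves the model together.
    """
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam:
        return [0.0] * len(z)
    scale = 1.0 - lam / norm
    return [scale * v for v in z]
```

When every group has size 1 this reduces to the ordinary lasso soft-threshold, which is consistent with the remark above that the size-1 case corresponds to glmnet.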

Gamsel: fit regularization path for generalized additive models.

Gamsel fits a regularization path for generalized additive models with many variables. It uses an overlapped group lasso penalty to create sticking points at constant, linear and non-linear terms. Written by Alexandra Chouldechova and Trevor Hastie, and maintained by Trevor Hastie. gamsel package, on CRAN. Here is a gamsel vignette written by Trevor Hastie and Matt Wand.

Glinternet: fit a linear model with hierarchical interaction via group-lasso regularization.

Glinternet fits a regularization path to include main effects and interactions in linear and logistic regression models. It can handle both quantitative and factor predictors. Glinternet uses the overlap group-lasso to enforce strict hierarchy, which also encourages interactions among variables with strong main effects. The code is efficient, and can handle problems with many thousands of variables. Written by Michael Lim and Trevor Hastie, and maintained by Michael Lim. glinternet package, on CRAN.

Glmnet: fit the elastic-net regularization path for some generalized linear models.

Glmnet fits the entire regularization path for an elastic-net regularized GLM. See the website at glmnet.stanford.edu. The models included are Gaussian, binomial, multinomial, Poisson, and the Cox model. Glmnet solves the following problem:

$\displaystyle \min_{\beta_0, \beta} \frac{1}{N} \sum_{i=1}^N w_i l(y_i, \beta_0 + \beta^T x_i) + \lambda \left[(1-\alpha) \lVert\beta\rVert_2^2/2 + \alpha \lVert\beta\rVert_1\right],$

over a grid of values of $\lambda$ covering the entire range. Here $l(y,\eta)$ is the negative log-likelihood contribution for observation $i$; e.g. for the Gaussian case it is $\frac{1}{2}(y-\eta)^2$. The parameter $\alpha$ bridges the gap between lasso ($\alpha=1$, the default) and ridge ($\alpha=0$). The package includes methods for prediction and plotting, and functions for performing K-fold cross-validation. The code can handle sparse input-matrix formats, as well as range constraints on coefficients. Glmnet also makes use of the strong rules for efficient restriction of the active set. Glmnet has many bells and whistles, which are illustrated in the vignette below. The core of Glmnet is a set of Fortran subroutines, which make for very fast execution. The algorithms use coordinate descent with warm starts and active set iterations. Original version written by Jerome Friedman, Trevor Hastie, Rob Tibshirani and Noah Simon. Glmnet in R: Glmnet 4.1 released 2024, with many additional features, including the relaxed lasso, all GLM families, and many additional options for Cox models. This package is actively maintained by Trevor Hastie on CRAN. The R code interfaces to C++ code written by James Yang and Fortran code written by Jerome Friedman.

YouTube webinar on glmnet (the sound lags slightly behind the video).
glmnet.stanford.edu is the website for glmnet, and has links to all the vignettes.
At the top of this page we point to the new Adelie Python package that implements the models and GLM families in Glmnet for both lasso/elastic net as well as group penalized versions of these.
Older implementations of Glmnet: Glmnet in Python, ported and maintained by B.J. Balakumar, and Glmnet in Matlab, ported and maintained by Junyang Qian. The original Matlab port was by Hui Jiang (2009), and was updated and expanded by Junyang Qian in September 2013.
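
The Gaussian case of the elastic-net objective in the Glmnet section can be solved by cyclic coordinate descent, reusing each solution as the warm start for the next value of $\lambda$. The following is a minimal NumPy sketch of that idea, assuming unit weights, centered `X` and `y` (so the intercept drops out) and no all-zero columns. The function names are hypothetical, and it omits the strong rules and active-set iterations that the real Fortran/C++ code uses.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-threshold: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def enet_cd(X, y, lam, alpha=1.0, beta=None, n_sweeps=200, tol=1e-10):
    """Cyclic coordinate descent for
    (1/(2N)) * ||y - X b||^2 + lam * ((1-alpha)/2 * ||b||_2^2 + alpha * ||b||_1).
    """
    n, p = X.shape
    beta = np.zeros(p) if beta is None else beta.copy()
    resid = y - X @ beta
    col_ms = (X ** 2).mean(axis=0)          # (1/N) * sum_i x_ij^2
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # Inner product of column j with the partial residual, scaled by 1/N
            rho = X[:, j] @ resid / n + col_ms[j] * old
            new = soft_threshold(rho, lam * alpha) / (col_ms[j] + lam * (1.0 - alpha))
            if new != old:
                resid -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta

def enet_path(X, y, lambdas, alpha=1.0):
    """Fit along a decreasing sequence of lambdas, warm-starting each fit."""
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:
        beta = enet_cd(X, y, lam, alpha, beta)
        path.append(beta.copy())
    return np.array(path)
```

Calling `enet_path(X, y, [2.0, 1.0, 0.0])` returns one coefficient vector per value of $\lambda$, each started from the previous solution.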

softImpute: impute missing values for a matrix via nuclear-norm regularization

SoftImpute fits a low-rank matrix approximation to a matrix with missing values via nuclear-norm regularization. The algorithm works like EM, filling in the missing values with the current guess, and then solving the optimization problem on the complete matrix using a soft-thresholded SVD. Special sparse-matrix classes available for very large matrices.
Written by Trevor Hastie and Rahul Mazumder, and maintained by Trevor Hastie. softImpute package, on CRAN softImpute vignette (html) published (9/10/2014).
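
The EM-like iteration described above (fill in the missing entries with the current guess, then apply a soft-thresholded SVD to the completed matrix) can be sketched in a few lines of NumPy. This is a dense illustration under the assumption that missing entries are coded as `NaN`; the function name and stopping rule are hypothetical, and it does not use the sparse-matrix classes of the softImpute package.

```python
import numpy as np

def soft_impute(X, lam, n_iter=100, tol=1e-8):
    """Low-rank approximation of X (NaN = missing) via iterated
    soft-thresholded SVD with threshold lam."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    M = np.zeros_like(X)                      # current low-rank guess
    for _ in range(n_iter):
        # Keep observed entries, fill missing ones from the current guess
        filled = np.where(missing, M, X)
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        d = np.maximum(d - lam, 0.0)          # soft-threshold the singular values
        M_new = (U * d) @ Vt
        change = np.linalg.norm(M_new - M)
        M = M_new
        if change < tol * (1.0 + np.linalg.norm(M)):
            break
    return M
```

The completed matrix is then `np.where(np.isnan(X), soft_impute(X, lam), X)`. Larger values of `lam` zero out more singular values and so give a lower-rank fit.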

Sparsenet: fit a linear model regularized by the nonconvex MC+ sparsity penalty

Sparsenet uses coordinate descent on the MC+ nonconvex penalty family, and fits a surface of solutions over the two-dimensional parameter space. This penalty family is indexed by an overall strength parameter λ (like lasso), and a convexity parameter γ, with γ=∞ corresponding to the lasso, and γ=1 to best subset selection. Written by Rahul Mazumder, Jerome Friedman and Trevor Hastie, and maintained by Trevor Hastie. Sparsenet package, on CRAN.
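
To see how γ interpolates between the two extremes, it helps to look at the univariate threshold function that coordinate descent applies to each coefficient. The sketch below gives the standard MC+ threshold operator for a single standardized predictor with γ > 1; the function name is hypothetical and this is not code from the Sparsenet package.

```python
def mcplus_threshold(b, lam, gamma):
    """MC+ threshold of a univariate least-squares coefficient b.

    gamma = inf gives soft-thresholding (lasso); as gamma -> 1 from above
    the operator approaches hard-thresholding.
    """
    if gamma <= 1:
        raise ValueError("gamma must be greater than 1")
    a = abs(b)
    if a <= lam:
        return 0.0                       # small coefficients are set to zero
    if a <= lam * gamma:
        sign = 1.0 if b > 0 else -1.0    # intermediate zone: rescaled soft-threshold
        return sign * (a - lam) / (1.0 - 1.0 / gamma)
    return float(b)                      # large coefficients are left unshrunk
```

With `gamma=float("inf")` the middle branch is plain soft-thresholding. With γ close to 1 the middle zone shrinks to almost nothing, so a coefficient is either zeroed or kept as-is.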

SvmPath: fit the entire regularization path for the SVM

The software, written in the S language for R, computes the entire solution path for the two-class SVM model. The solution is calculated for every value of the cost parameter C, at essentially the same computing cost as a single SVM solution. Written by Trevor Hastie. Find R package here.

glmpath: fit the entire L1 regularization path for generalized linear models.

This algorithm uses a predictor-corrector method to compute the entire regularization path for generalized linear models with an L1 penalty. Somewhat superseded by the package glmnet above, but not entirely. Glmpath is able to estimate the knots or entry points for each variable as it enters the path. Written by Mee-Young Park and Trevor Hastie, and maintained by Mee-Young Park. glmpath package, on CRAN.

LARS: Least Angle Regression software

The software, written in the S language, computes the entire LAR, Lasso, or (epsilon) forward stagewise coefficient path in the same order of computations as a single least-squares fit. Written by Brad Efron and Trevor Hastie. R package can be found here. Visit the LARS website.

gam

R routines for fitting generalized additive models. This package corresponds to the gam models described in Chapter 7 of the "white" book Statistical Models in S, Chambers and Hastie (eds), Wadsworth (1992).
Formulas s() and lo() allow for smoothing splines and local regression smoothers. Any family is accommodated, using the same family functions as glm(). Generic functions for plotting, anova, summary, predict etc. Recent (2013) improvements to the function step.gam().
Written and maintained by Trevor Hastie.
gam package, on CRAN.

mda

R routines for Flexible Discriminant Analysis, Penalized Discriminant Analysis and Nonparametric Mixture Discriminant Analysis models. These tools are enhancements of the lda function in R, and allow linear, polynomial, and nonparametric versions of discriminant analysis and mixture models. There are easy-to-use predict methods. These methods are described in Elements of Statistical Learning (chapter 12), as well as the original references.
Written by Trevor Hastie and Rob Tibshirani, and maintained by Trevor Hastie.
mda package, on CRAN.

impute

Imputation of missing data, intended for microarray and expression arrays. Impute uses k-nearest neighbors (knn) to impute the missing values for a gene, by using the average values from the k nearest neighbors in the space of the non-missing elements. The algorithm is Fortran-based, and uses an adaptive combination of recursive 2-means clustering and nearest neighbors.
Trevor Hastie (Fortran code and algorithm), Robert Tibshirani, Balasubramanian Narasimhan (maintainer) and Gilbert Chu.
impute (on Bioconductor)
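
The nearest-neighbor step described above can be sketched as follows, assuming rows are genes, columns are samples, and missing values are coded as `NaN`. This is a brute-force NumPy illustration with a hypothetical function name. It leaves out the recursive 2-means clustering that the Fortran code uses to keep the neighbor search cheap on large arrays.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each missing entry of a gene (row) with the average of that
    column over its k nearest genes, where distance is measured on the
    columns observed in both genes."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    miss = np.isnan(X)
    for i in np.where(miss.any(axis=1))[0]:
        observed = ~miss[i]
        dists = []
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = observed & ~miss[j]
            if not shared.any():
                continue                     # no common columns to compare on
            d = np.mean((X[i, shared] - X[j, shared]) ** 2)
            dists.append((d, j))
        dists.sort()
        neighbors = [j for _, j in dists[:k]]
        for c in np.where(miss[i])[0]:
            vals = [X[j, c] for j in neighbors if not miss[j, c]]
            if vals:                         # leave NaN if no neighbor has a value
                out[i, c] = np.mean(vals)
    return out
```

Distances are averaged over the shared columns rather than summed, so genes with different numbers of missing values stay comparable.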

Older Software & Data

Gene Shaving

A method for finding small clusters of highly correlated genes with large variance across the samples. See the online version of the gene shaving paper at Genome Biology by Trevor Hastie, Rob Tibshirani and coauthors.
The code is part of the GeneClust software written by Kim Anh Do and colleagues, and is based on the original code by Trevor Hastie and Rob Tibshirani.
Go to Geneclust homepage

Smart Prediction

Routines in Splus and R for making predictions "smarter" in the context of the formula language for statistical models such as lm() and glm(). Written by Thomas Yee and Trevor Hastie. Go to Webpage

gamfit

FORTRAN program for fitting generalized additive models. Written by Trevor Hastie and Rob Tibshirani. Shell archive

principal.curve

S functions for fitting principal curves. Written by Trevor Hastie. Shell archive princurve package in R. Contains original principal curves code, ported and maintained by Andreas Weingessel.

safe.predict

Modified versions of bs() and ns() that allow safe predictions, especially in the context of the S modelling functions. New predict() methods as well.
Written by Trevor Hastie.
Shell archive

s.to.latex

Tools for converting S help files and S code to latex, and a corresponding latex .sty file.
Written by John Chambers and Trevor Hastie
Shell archive