The gradLasso package implements an efficient gradient
descent solver for LASSO-penalized regression models. It supports
several families including Gaussian, Binomial, Negative Binomial, and
Zero-Inflated Negative Binomial (ZINB). It also features built-in
stability selection and cross-validation.
This vignette demonstrates the basic usage of the package.
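The examples below assume the package has been installed and attached:

```r
library(gradLasso)
```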
We start by simulating simple Gaussian data with correlated predictors.
set.seed(42)
# Simulate 200 obs, 20 predictors, 5 active
sim <- simulate_data(n = 200, p = 20, family = "gaussian", k = 5, snr = 3.0)
df <- data.frame(y = sim$y, sim$X)
# Check the first few rows
head(df[, 1:6])
#> y Var1 Var2 Var3 Var4 Var5
#> 1 2.4088118 1.3709584 -1.6863106 0.9706798 -0.04932661 0.66502569
#> 2 -1.1866537 -0.5646982 0.2140939 -0.8088901 0.25200976 0.76083532
#> 3 -0.5444072 0.3631284 1.2202852 0.2984229 1.02738323 0.41846489
#> 4 -0.4884482 0.6328626 2.1445006 0.4769757 0.91408140 -0.01476616
#> 5 -1.2214254 0.4042683 -1.2681897 -0.8203085 -0.81123831 -1.50034498
#> 6 -1.2940614 -0.1061245 -1.1488285 -1.2083257 1.29080373 -0.04453506

We can fit the model using the standard formula interface. By
default, gradLasso performs 50 bootstraps for stability
selection.
fit <- gradLasso(y ~ ., data = df, lambda_cv = TRUE, boot = TRUE, n_boot = 50)
print(fit)
#>
#> gradLasso Fitted Object
#> Family: gaussian
#> Lambda: 0.009446
#> Deviance: 61
#> Use plot() to view diagnostics or summary() for coefficients.

We can inspect the selected coefficients using
summary(). The “Selection_Prob” column shows how often each
variable was selected across bootstrap iterations.
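These probabilities support a simple filtering rule: keep only predictors selected in at least, say, 90% of bootstrap resamples. A sketch of how this might look, assuming (hypothetically, this is not a documented accessor) that the summary table can be coerced to a data frame with Predictor and Selection_Prob columns:

```r
# Hypothetical: coerce the summary's coefficient table to a data frame
tab <- as.data.frame(summary(fit))

# Keep predictors selected in at least 90% of bootstrap resamples
stable <- tab$Predictor[tab$Selection_Prob >= 0.90]
```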
summary(fit)
#>
#> ------------------------------------------------
#> gradLasso Model Summary
#> ------------------------------------------------
#> Family: gaussian
#> Deviance: 61.20
#> AIC: 101.20
#> BIC: 167.17
#> DF: 20
#> ------------------------------------------------
#> Lambda: 0.009446 (Selected via CV)
#> Method: Stability Selection (50 bootstraps)
#> Interval: 2.5% - 97.5%
#>
#> --- Selected Coefficients ---
#> Predictor Estimate Selection_Prob Boot_Mean CI_2.5 CI_97.5
#> Var1 0.4690 1.00 0.4763 0.4054 0.5365
#> Var2 -0.4402 1.00 -0.4364 -0.5347 -0.3436
#> Var3 0.6597 1.00 0.6626 0.5642 0.7326
#> Var4 -0.3445 1.00 -0.3404 -0.4040 -0.2621
#> Var5 0.4756 1.00 0.4894 0.4069 0.5586
#> Var8 0.1467 1.00 0.1574 0.0963 0.2167
#> Var13 -0.0868 1.00 -0.0832 -0.1502 -0.0190
#> Var17 0.0540 0.98 0.0545 0.0011 0.1313
#> (Intercept) -0.0739 0.96 -0.0701 -0.1734 0.0000
#> Var14 0.0470 0.96 0.0386 -0.0083 0.0976
#> Var15 0.0367 0.96 0.0432 -0.0147 0.1094
#> Var19 0.0403 0.96 0.0407 -0.0448 0.1102
#> Var10 -0.0316 0.90 -0.0208 -0.1069 0.0160
#> Var11 -0.0142 0.90 -0.0248 -0.0871 0.0617
#> Var7 -0.0277 0.88 -0.0321 -0.1171 0.0277
#> Var18 -0.0176 0.88 -0.0172 -0.0748 0.0562
#> Var20 0.0082 0.86 0.0127 -0.0377 0.0845
#> Var6 0.0033 0.84 0.0140 -0.0582 0.0662
#> Var16 0.0236 0.84 0.0244 -0.0483 0.0945
#> Var9 -0.0349 0.82 -0.0379 -0.1221 0.0150
#> Var12 0.0000 0.78 -0.0011 -0.0491 0.0466
#> ------------------------------------------------

gradLasso specializes in complex GLMs like ZINB. We
support a pipe syntax (|) to specify different predictors
for the Count model and the Zero-Inflation model.

Simulation

We simulate data where the count model depends on different variables than the zero-inflation model.
set.seed(456)
sim_zinb <- simulate_data(n = 500, p = 20, family = "zinb",
                          k_mu = 5, k_pi = 5, theta = 2.0)
df_zinb <- data.frame(y = sim_zinb$y, sim_zinb$X)

We use the pipe syntax:
y ~ predictors_for_count | predictors_for_zero. Here we use
all variables (.) for both models.
# We use a smaller number of bootstraps for speed in this vignette
fit_zinb <- gradLasso(y ~ . | ., data = df_zinb,
                      family = grad_zinb(),
                      n_boot = 10,
                      lambda = 0.05) # Fixed lambda for demonstration
print(fit_zinb)
#>
#> gradLasso Fitted Object
#> Family: zinb
#> Lambda: 0.0295
#> Deviance: 1202
#> Use plot() to view diagnostics or summary() for coefficients.

The summary automatically splits coefficients into “Count”, “Zero-Infl”, and “Dispersion” components.
summary(fit_zinb)
#>
#> ------------------------------------------------
#> gradLasso Model Summary
#> ------------------------------------------------
#> Family: zinb
#> Deviance: 1202.12
#> AIC: 1240.12
#> BIC: 1320.20
#> DF: 19
#> ------------------------------------------------
#> Lambda: 0.0295 (User-specified)
#> Method: Stability Selection (10 bootstraps)
#> Interval: 2.5% - 97.5%
#>
#> --- Count Model Coefficients ---
#> Predictor Estimate Selection_Prob Boot_Mean CI_2.5 CI_97.5
#> (Intercept) 0.2891 1.0 0.3411 0.1989 0.4096
#> Var1 0.3159 1.0 0.2416 0.1681 0.3159
#> Var2 -0.2687 1.0 -0.3654 -0.4571 -0.2199
#> Var3 0.4674 1.0 0.5103 0.4660 0.6454
#> Var4 -0.0798 1.0 -0.0887 -0.2103 0.0115
#> Var5 0.2063 1.0 0.2291 0.1913 0.3321
#> Var6 0.0913 1.0 0.0742 0.0478 0.1382
#> Var7 -0.1512 1.0 -0.2090 -0.2888 -0.0937
#> Var16 -0.0156 0.9 -0.0798 -0.2015 0.0000
#> Var19 0.0211 0.9 0.0247 0.0047 0.0313
#> Var12 0.0000 0.8 0.0273 0.0000 0.0547
#> Var17 -0.0548 0.8 -0.0973 -0.1618 0.0000
#> Var9 0.0168 0.7 0.0367 0.0000 0.0667
#> Var13 0.0000 0.6 0.0084 0.0000 0.0651
#> Var14 0.0000 0.6 -0.0183 -0.0681 0.0000
#> Var20 0.0000 0.5 0.0170 0.0000 0.0848
#> Var8 0.0000 0.4 -0.0034 -0.0171 0.0000
#> Var10 0.0000 0.3 0.0000 0.0000 0.0000
#> Var11 0.0000 0.3 0.0229 0.0000 0.1014
#> Var18 0.0000 0.3 0.0050 0.0000 0.0258
#> Var15 0.0000 0.1 0.0000 0.0000 0.0000
#>
#> --- Zero-Inflation Model Coefficients ---
#> Predictor Estimate Selection_Prob Boot_Mean CI_2.5 CI_97.5
#> Var6 -0.0114 0.9 -0.1323 -0.2006 -0.0026
#> Var5 0.0457 0.8 0.0409 0.0000 0.1585
#> (Intercept) 0.0104 0.7 -0.0059 -0.0260 0.0000
#> Var1 -0.0413 0.7 -0.1890 -0.3615 0.0000
#> Var12 -0.0045 0.7 -0.0009 -0.0045 0.0000
#> Var4 -0.0653 0.6 -0.0499 -0.1840 0.0000
#> Var13 0.0000 0.6 0.0062 0.0000 0.0124
#> Var7 0.0000 0.2 0.0245 0.0000 0.1897
#> Var14 0.0000 0.2 0.0000 0.0000 0.0000
#> Var3 0.0000 0.1 0.0035 0.0000 0.0268
#> Var8 0.0000 0.1 0.0000 0.0000 0.0000
#> Var10 0.0000 0.1 -0.0048 -0.0371 0.0000
#> Var11 0.0000 0.1 0.0000 0.0000 0.0000
#> Var20 0.0000 0.1 0.0016 0.0000 0.0122
#>
#> --- Dispersion ---
#> Theta: 1.3594
#> ------------------------------------------------

For large datasets, gradLasso supports parallel
execution for both Cross-Validation and Bootstrapping.
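A sketch of what a parallel fit might look like; the parallel and n_cores arguments here are assumptions for illustration and may differ from the actual interface (check ?gradLasso):

```r
# Hypothetical arguments: parallel and n_cores are assumed, not confirmed
fit_par <- gradLasso(y ~ ., data = df,
                     lambda_cv = TRUE,
                     boot = TRUE, n_boot = 200,
                     parallel = TRUE,  # assumed flag enabling parallel workers
                     n_cores = 4)      # assumed: number of worker processes
```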
Conclusion
gradLasso provides a unified, tidy interface for sparse
regression across multiple GLM families. Its integrated stability
selection offers robust variable selection for high-dimensional
data.