The gradLasso package implements an efficient gradient
descent solver for LASSO-penalized regression models. It supports
several families including Gaussian, Binomial, Negative Binomial, and
Zero-Inflated Negative Binomial (ZINB). It also features built-in
stability selection and cross-validation.
This vignette demonstrates the basic usage of the package.
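The examples below assume the package has been installed and attached:

```r
library(gradLasso)
```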
We start by simulating simple Gaussian data with correlated predictors.
set.seed(42)
# Simulate 200 obs, 20 predictors, 5 active
sim <- simulate_data(n = 200, p = 20, family = "gaussian", k = 5, snr = 3.0)
df <- data.frame(y = sim$y, sim$X)
# Check the first few rows
head(df[, 1:6])
#> y Var1 Var2 Var3 Var4 Var5
#> 1 2.4088118 1.3709584 -1.6863106 0.9706798 -0.04932661 0.66502569
#> 2 -1.1866537 -0.5646982 0.2140939 -0.8088901 0.25200976 0.76083532
#> 3 -0.5444072 0.3631284 1.2202852 0.2984229 1.02738323 0.41846489
#> 4 -0.4884482 0.6328626 2.1445006 0.4769757 0.91408140 -0.01476616
#> 5 -1.2214254 0.4042683 -1.2681897 -0.8203085 -0.81123831 -1.50034498
#> 6 -1.2940614 -0.1061245 -1.1488285 -1.2083257 1.29080373 -0.04453506

We can fit the model using the standard formula interface. By
default, gradLasso performs 50 bootstraps for stability
selection.
fit <- gradLasso(y ~ ., data = df, lambda_cv = TRUE, boot = TRUE, n_boot = 50)
print(fit)
#>
#> gradLasso Fitted Object
#> Family: gaussian
#> Lambda: 0.009446
#> Deviance: 61
#> Use plot() to view diagnostics or summary() for coefficients.

We can inspect the selected coefficients using
summary(). The “Selection_Prob” column shows how often each
variable was selected across bootstrap iterations.
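These probabilities support a simple filtering rule: keep only predictors selected in at least, say, 90% of bootstrap resamples. A sketch of how this might look, assuming (hypothetically, this is not a documented accessor) that the summary table can be coerced to a data frame with Predictor and Selection_Prob columns:

```r
# Hypothetical: coerce the summary's coefficient table to a data frame
tab <- as.data.frame(summary(fit))

# Keep predictors selected in at least 90% of bootstrap resamples
stable <- tab$Predictor[tab$Selection_Prob >= 0.90]
```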
summary(fit)
#>
#> ------------------------------------------------
#> gradLasso Model Summary
#> ------------------------------------------------
#> Family: gaussian
#> Deviance: 61.20
#> AIC: 101.20
#> BIC: 167.17
#> DF: 20
#> ------------------------------------------------
#> Lambda: 0.009446 (Selected via CV)
#> Method: Stability Selection (50 bootstraps)
#> Interval: 2.5% - 97.5%
#>
#> --- Selected Coefficients ---
#> Predictor Estimate Selection_Prob Boot_Mean CI_2.5 CI_97.5
#> Var1 0.4690 1.00 0.4763 0.4054 0.5365
#> Var2 -0.4402 1.00 -0.4364 -0.5347 -0.3436
#> Var3 0.6597 1.00 0.6626 0.5642 0.7326
#> Var4 -0.3445 1.00 -0.3404 -0.4040 -0.2621
#> Var5 0.4756 1.00 0.4894 0.4069 0.5586
#> Var8 0.1467 1.00 0.1574 0.0963 0.2167
#> Var13 -0.0868 1.00 -0.0832 -0.1502 -0.0190
#> Var17 0.0540 0.98 0.0545 0.0011 0.1313
#> (Intercept) -0.0739 0.96 -0.0701 -0.1734 0.0000
#> Var14 0.0470 0.96 0.0386 -0.0083 0.0976
#> Var15 0.0367 0.96 0.0432 -0.0147 0.1094
#> Var19 0.0403 0.96 0.0407 -0.0448 0.1102
#> Var10 -0.0316 0.90 -0.0208 -0.1069 0.0160
#> Var11 -0.0142 0.90 -0.0248 -0.0871 0.0617
#> Var7 -0.0277 0.88 -0.0321 -0.1171 0.0277
#> Var18 -0.0176 0.88 -0.0172 -0.0748 0.0562
#> Var20 0.0082 0.86 0.0127 -0.0377 0.0845
#> Var6 0.0033 0.84 0.0140 -0.0582 0.0662
#> Var16 0.0236 0.84 0.0244 -0.0483 0.0945
#> Var9 -0.0349 0.82 -0.0379 -0.1221 0.0150
#> Var12 0.0000 0.78 -0.0011 -0.0491 0.0466
#> ------------------------------------------------

gradLasso specializes in complex GLMs like ZINB. We
support a pipe syntax (|) to specify different predictors
for the Count model and the Zero-Inflation model.

Simulation

We simulate data where the count model depends on different variables than the zero-inflation model.
set.seed(456)
sim_zinb <- simulate_data(n = 500, p = 20, family = "zinb",
                          k_mu = 5, k_pi = 5, theta = 2.0)
df_zinb <- data.frame(y = sim_zinb$y, sim_zinb$X)

We use the pipe syntax:
y ~ predictors_for_count | predictors_for_zero. Here we use
all variables (.) for both models.
# We use a smaller number of bootstraps for speed in this vignette
fit_zinb <- gradLasso(y ~ . | ., data = df_zinb,
                      family = grad_zinb(),
                      n_boot = 10,
                      lambda = 0.05) # Fixed lambda for demonstration
print(fit_zinb)
#>
#> gradLasso Fitted Object
#> Family: zinb
#> Lambda: 0.0295
#> Deviance: 1202
#> Use plot() to view diagnostics or summary() for coefficients.

The summary automatically splits coefficients into “Count”, “Zero-Infl”, and “Dispersion” components.
summary(fit_zinb)
#>
#> ------------------------------------------------
#> gradLasso Model Summary
#> ------------------------------------------------
#> Family: zinb
#> Deviance: 1202.12
#> AIC: 1240.12
#> BIC: 1320.20
#> DF: 19
#> ------------------------------------------------
#> Lambda: 0.0295 (User-specified)
#> Method: Stability Selection (10 bootstraps)
#> Interval: 2.5% - 97.5%
#>
#> --- Count Model Coefficients ---
#> Predictor Estimate Selection_Prob Boot_Mean CI_2.5 CI_97.5
#> (Intercept) 0.2891 1.0 0.3411 0.1989 0.4096
#> Var1 0.3159 1.0 0.2416 0.1681 0.3159
#> Var2 -0.2687 1.0 -0.3654 -0.4571 -0.2199
#> Var3 0.4674 1.0 0.5103 0.4660 0.6454
#> Var4 -0.0798 1.0 -0.0887 -0.2103 0.0115
#> Var5 0.2063 1.0 0.2291 0.1913 0.3321
#> Var6 0.0913 1.0 0.0742 0.0478 0.1382
#> Var7 -0.1512 1.0 -0.2090 -0.2888 -0.0937
#> Var16 -0.0156 0.9 -0.0798 -0.2015 0.0000
#> Var19 0.0211 0.9 0.0247 0.0047 0.0313
#> Var12 0.0000 0.8 0.0273 0.0000 0.0547
#> Var17 -0.0548 0.8 -0.0973 -0.1618 0.0000
#> Var9 0.0168 0.7 0.0367 0.0000 0.0667
#> Var13 0.0000 0.6 0.0084 0.0000 0.0651
#> Var14 0.0000 0.6 -0.0183 -0.0681 0.0000
#> Var20 0.0000 0.5 0.0170 0.0000 0.0848
#> Var8 0.0000 0.4 -0.0034 -0.0171 0.0000
#> Var10 0.0000 0.3 0.0000 0.0000 0.0000
#> Var11 0.0000 0.3 0.0229 0.0000 0.1014
#> Var18 0.0000 0.3 0.0050 0.0000 0.0258
#> Var15 0.0000 0.1 0.0000 0.0000 0.0000
#>
#> --- Zero-Inflation Model Coefficients ---
#> Predictor Estimate Selection_Prob Boot_Mean CI_2.5 CI_97.5
#> Var6 -0.0114 0.9 -0.1323 -0.2006 -0.0026
#> Var5 0.0457 0.8 0.0409 0.0000 0.1585
#> (Intercept) 0.0104 0.7 -0.0059 -0.0260 0.0000
#> Var1 -0.0413 0.7 -0.1890 -0.3615 0.0000
#> Var12 -0.0045 0.7 -0.0009 -0.0045 0.0000
#> Var4 -0.0653 0.6 -0.0499 -0.1840 0.0000
#> Var13 0.0000 0.6 0.0062 0.0000 0.0124
#> Var7 0.0000 0.2 0.0245 0.0000 0.1897
#> Var14 0.0000 0.2 0.0000 0.0000 0.0000
#> Var3 0.0000 0.1 0.0035 0.0000 0.0268
#> Var8 0.0000 0.1 0.0000 0.0000 0.0000
#> Var10 0.0000 0.1 -0.0048 -0.0371 0.0000
#> Var11 0.0000 0.1 0.0000 0.0000 0.0000
#> Var20 0.0000 0.1 0.0016 0.0000 0.0122
#>
#> --- Dispersion ---
#> Theta: 1.3594
#> ------------------------------------------------

For large datasets, gradLasso supports parallel
execution for both Cross-Validation and Bootstrapping.
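A sketch of what a parallel fit might look like; the parallel and n_cores arguments here are assumptions for illustration and may differ from the actual interface (check ?gradLasso):

```r
# Hypothetical arguments: parallel and n_cores are assumed, not confirmed
fit_par <- gradLasso(y ~ ., data = df,
                     lambda_cv = TRUE,
                     boot = TRUE, n_boot = 200,
                     parallel = TRUE,  # assumed flag enabling parallel workers
                     n_cores = 4)      # assumed: number of worker processes
```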
Conclusion
gradLasso provides a unified, tidy interface for sparse
regression across multiple GLM families. Its integrated stability
selection offers robust variable selection for high-dimensional
data.