The gkrreg package implements the Gaussian Kernel Robust Regression (GKRReg, or GKRR) method proposed by De Carvalho, Lima Neto, and Ferreira (2017). The method fits a linear regression model by iteratively re-weighting observations using the Gaussian kernel, so that outliers and leverage points are automatically down-weighted. Convergence is theoretically guaranteed (Propositions 4.1 and 4.2 of the paper).
The package provides:
gkrr() — model fitting with four width hyper-parameter
estimators ("s1", "s2", "s3",
"s4") plus automatic data-driven selection
("auto");gkrr_boot() — bootstrap inference (standard errors,
confidence intervals and p-values);summary.gkrr() — a coefficient table modelled after
summary.lm(), with sandwich inference by default and
bootstrap inference when available;plot.gkrr() — six diagnostic panels emphasising kernel
weights;GKRReg minimises
\[S(\boldsymbol{\beta}) = 2\sum_{i=1}^n \bigl[1 - G(y_i,\, \hat{\mu}_i)\bigr], \qquad G(a,b) = \exp\!\bigl(-{(a-b)^2}/{\gamma^2}\bigr),\]
where \(\hat{\mu}_i = \mathbf{x}_i^\top\boldsymbol{\beta}\) and \(\gamma^2 > 0\) is the kernel width hyper-parameter. An observation with a large residual contributes close to 2 to \(S\); a perfectly fitted observation contributes 0.
Differentiating \(S\) yields an IRWLS problem with kernel weights \(k_{ii}^{(t)} = G(y_i, \hat{\mu}_i^{(t-1)}) \in (0,1]\). The algorithm starts from OLS and alternates between WLS and weight updates until convergence (Propositions 4.1–4.2 of De Carvalho, Lima Neto, and Ferreira (2017)).
Five options are available via the sigma_method
argument:
| Method | Description | Recommended scenario |
|---|---|---|
"s1" |
Mean of the 0.1 and 0.9 quantiles of OLS squared residuals on a sub-sample (Caputo estimator) | Clean data |
"s2" |
Pairwise median of \((y_i - \hat{\mu}_j^{\text{OLS}})^2\), \(i \neq j\) | Y-space outliers \(\leq 10\%\); X-space outliers \(\leq 15\%\) |
"s3" |
Residual variance \(\sum(y_i - \hat{\mu}_i)^2 / (n - p - 1)\) | Y-space outliers \(\geq 15\%\); leverage points |
"s4" |
AICc-selected bandwidth \(h^2\) via
sm::h.select() |
Large samples (\(n \geq 200\)) |
"auto" |
Selects among S1–S4 by out-of-bag bootstrap MSE | When the outlier scenario is unknown |
When "auto" is used, a message is emitted showing which
method was selected and the comparative OOB MSEs. The selection uses
\(B = 99\) replicates by default,
configurable via auto_args = list(B = ...).
library(gkrreg)
data(belgium_calls)
fit <- gkrr(calls ~ year, data = belgium_calls, sigma_method = "s3")
print(fit)
#>
#> GKRReg -- Gaussian Kernel Robust Regression
#> --------------------------------------------
#> gamma^2 : 31.61 (method: s3)
#> Iterations: 9 [converged]
#>
#> Coefficients:
#> (Intercept) year
#> -6.5764 0.1351
#>
#> (Sandwich inference available -- use summary() for the full table)summary() uses the sandwich variance
estimator by default, producing standard errors, confidence
intervals and Wald z-test p-values with no additional computation:
summary(fit)
#>
#> Call:
#> gkrr(formula = calls ~ year, data = belgium_calls, sigma_method = "s3")
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.6165 -0.1917 -0.0271 3.5213 18.4537
#>
#> Coefficients:
#> Estimate Std. Error CI 95% lower CI 95% upper p-value
#> (Intercept) -6.5764 1.1145 -8.7609 -4.3920 3.621e-09 ***
#> year 0.1351 0.0203 0.0954 0.1748 2.529e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Sandwich SE (asymptotic Wald z-test))
#>
#> gamma^2: 31.61 | method: s3 | iterations: 9 | converged: TRUE
#> R-squared: -0.1219 | Weighted R-squared: 0.5499
#> Kernel weights -- min: 0.0000 mean: 0.7501 max: 1.0000
#> Note: sandwich inference may be less reliable here (n = 24 (small sample); 12% of observations received near-zero kernel weight (< 0.01)).
#> Consider bootstrap inference via boot = TRUE in gkrr() or
#> summary(fit, boot = gkrr_boot(fit)).The sandwich covariance matrix is accessible via
vcov():
round(vcov(fit), 6)
#> (Intercept) year
#> (Intercept) 1.242183 -0.022546
#> year -0.022546 0.000410The original paper does not provide a sampling distribution for \(\hat{\boldsymbol{\beta}}\). The WLS covariance matrix \((X^\top \hat{K} X)^{-1}\) from the final IRWLS step treats \(\hat{K}\) as fixed and therefore underestimates the true variance.
gkrreg implements an analytic sandwich variance estimator based on the theory of generalised \(M\)-estimators. At convergence the GKRReg estimating equations are:
\[\sum_{i=1}^n k_{ii}(\boldsymbol{\beta})\,\mathbf{x}_i (y_i - \mathbf{x}_i^\top\boldsymbol{\beta}) = \mathbf{0},\]
and the asymptotic sandwich covariance is \(n^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}\), where
\[\hat{A} = \frac{1}{n}\,\mathbf{X}^\top \operatorname{diag}\!\left[k_{ii}\!\left(1 - \frac{2\hat{e}_i^2}{\hat{\gamma}^2}\right)\right] \mathbf{X}, \qquad \hat{B} = \frac{1}{n}\,\mathbf{X}^\top \operatorname{diag}\!\left(k_{ii}^2\,\hat{e}_i^2\right)\mathbf{X}.\]
This corresponds to the HC0 class of heteroskedasticity-robust covariance estimators (white1982?). The correction term \((1 - 2\hat{e}_i^2/\hat{\gamma}^2)\) in \(\hat{A}\) arises directly from differentiating the Gaussian kernel with respect to \(\boldsymbol{\beta}\).
When is the sandwich reliable? The sandwich is
consistent and efficient asymptotically, but treats \(\hat{\gamma}^2\) as fixed. For small
samples (\(n < 50\)) or heavy
contamination, bootstrap inference is recommended.
summary() emits a note when these conditions are detected
automatically.
gkrr_boot() runs a pairs bootstrap, re-fitting
the complete GKRReg algorithm — including re-estimating \(\gamma^2\) — on each replicate. This
captures all sources of variability and is more reliable than the
sandwich for small samples or heavy contamination.
Three CI types are available: "percentile",
"normal", and "bca" (bias-corrected and
accelerated, Efron (1987); the default and
recommended). Bootstrap p-values use the centred-t approach:
\[p_j = \frac{2}{B}\,\#\!\left\{|\hat\beta_j^* - \hat\beta_j| \geq |\hat\beta_j|\right\}.\]
Option A — boot = TRUE inside
gkrr()
fit_b <- gkrr(calls ~ year, data = belgium_calls,
sigma_method = "s3",
boot = TRUE,
boot_args = list(B = 999, type = "bca", seed = 1))
summary(fit_b)Option B — separate gkrr_boot()
call
| Scenario | Recommended |
|---|---|
| \(n \geq 50\), mild contamination | Sandwich (fast, deterministic) |
| \(n < 50\) or heavy contamination | Bootstrap |
| Both available and SEs diverge by > 25% | Bootstrap |
When both are available, summary() automatically warns
if the relative discrepancy in standard errors exceeds 25% for any
coefficient.
plot.gkrr() provides six panels. Point size is
inversely proportional to \(k_{ii}\): outliers appear large
and red; well-fitted observations appear small and blue.
oldpar <- par(no.readonly = TRUE)
on.exit(par(oldpar))
par(mfrow = c(1, 3))
plot(fit, which = 1, ask = FALSE) # residuals vs. fitted
plot(fit, which = 3, ask = FALSE) # weight vs. residual + kernel curve
plot(fit, which = 4, ask = FALSE) # weight vs. indexPanel 3 overlays the theoretical curve \(G(e) = \exp(-e^2/\hat{\gamma}^2)\), making
the down-weighting mechanism directly visible. Observations 15–20 in
belgium_calls (a recording error) receive near-zero weights
and are effectively excluded from the fit.
which |
Description |
|---|---|
| 1 | Residuals vs. fitted values |
| 2 | Observed vs. fitted (\(y\) vs. \(\hat{\mu}\)) |
| 3 | Kernel weight vs. residual with theoretical curve |
| 4 | Kernel weight vs. observation index |
| 5 | Normal QQ-plot of residuals, coloured by weight |
| 6 | Convergence of \(S(\boldsymbol{\beta})\) |
data(kootenay)
fit_ols <- lm(newgate ~ libby, data = kootenay)
fit_gkrr <- gkrr(newgate ~ libby, data = kootenay, sigma_method = "s1")
if (requireNamespace("MASS", quietly = TRUE)) {
fit_m <- MASS::rlm(newgate ~ libby, data = kootenay, method = "M")
fit_mm <- MASS::rlm(newgate ~ libby, data = kootenay, method = "MM")
} else { fit_m <- fit_mm <- NULL }
tab <- rbind(OLS = coef(fit_ols),
GKRR = coef(fit_gkrr))
if (!is.null(fit_m))
tab <- rbind(tab, M = coef(fit_m), MM = coef(fit_mm))
print(round(tab, 4))
#> (Intercept) libby
#> OLS 23.1649 -0.0138
#> GKRR 5.4561 0.6216
#> M 23.1942 -0.0186
#> MM 5.4667 0.6209
plot(kootenay$libby, kootenay$newgate,
xlab = "Libby flow", ylab = "Newgate flow",
main = "Kootenay River -- X-space outlier (1934)",
pch = 19, col = "grey60")
points(kootenay["1934","libby"], kootenay["1934","newgate"],
col = "red", pch = 17, cex = 1.6)
abline(fit_ols, col = "black", lwd = 2, lty = 2)
abline(fit_gkrr, col = "firebrick", lwd = 2)
if (!is.null(fit_m)) {
abline(fit_m, col = "darkorange", lwd = 2, lty = 3)
abline(fit_mm, col = "purple", lwd = 2, lty = 4)
legend("topleft", c("OLS","GKRR","M","MM","1934"),
col = c("black","firebrick","darkorange","purple","red"),
lty = c(2,1,3,4,NA), pch = c(NA,NA,NA,NA,17), lwd = 2, bty = "n")
} else {
legend("topleft", c("OLS","GKRR","1934"),
col = c("black","firebrick","red"),
lty = c(2,1,NA), pch = c(NA,NA,17), lwd = 2, bty = "n")
}fit_ols <- lm(calls ~ year, data = belgium_calls)
fit_gkrr <- gkrr(calls ~ year, data = belgium_calls, sigma_method = "s3")
plot(belgium_calls$year + 1900, belgium_calls$calls,
xlab = "Year", ylab = "Calls (tens of millions)",
main = "Belgium International Calls",
pch = 19, col = "grey60")
points(belgium_calls$year[15:20] + 1900, belgium_calls$calls[15:20],
col = "red", pch = 17, cex = 1.4)
abline(fit_ols, col = "black", lwd = 2, lty = 2)
abline(fit_gkrr, col = "firebrick", lwd = 2)
legend("topleft", c("OLS","GKRR (S3)","Outliers (1964-69)"),
col = c("black","firebrick","red"),
lty = c(2,1,NA), pch = c(NA,NA,17), lwd = 2, bty = "n")oldpar <- par(no.readonly = TRUE)
on.exit(par(oldpar))
par(mfrow = c(1, 3))
plot(fit_gkrr, which = 1, ask = FALSE)
plot(fit_gkrr, which = 3, ask = FALSE)
plot(fit_gkrr, which = 4, ask = FALSE)data(delivery)
fit_ols <- lm(delivery_time ~ n_products + distance, data = delivery)
fit_gkrr <- gkrr(delivery_time ~ n_products + distance, data = delivery,
sigma_method = "s3")
round(rbind(OLS = coef(fit_ols),
GKRR = coef(fit_gkrr)), 4)
#> (Intercept) n_products distance
#> OLS 2.3412 1.6159 0.0144
#> GKRR 4.0607 1.3891 0.0145oldpar <- par(no.readonly = TRUE)
on.exit(par(oldpar))
par(mfrow = c(1, 3))
plot(fit_gkrr, which = 2, ask = FALSE)
plot(fit_gkrr, which = 4, ask = FALSE)
plot(fit_gkrr, which = 5, ask = FALSE)data(mammals)
fit_ols <- lm(log_brain ~ log_body, data = mammals)
fit_gkrr <- gkrr(log_brain ~ log_body, data = mammals, sigma_method = "s3")
plot(mammals$log_body, mammals$log_brain,
xlab = "log(body mass, kg)", ylab = "log(brain mass, g)",
main = "Brain vs. Body Mass (62 Mammal Species)",
pch = 19, col = "grey60", cex = 0.8)
elephants <- mammals$species %in% c("African elephant","Asian elephant")
points(mammals$log_body[elephants], mammals$log_brain[elephants],
col = "red", pch = 17, cex = 1.5)
abline(fit_ols, col = "black", lwd = 2, lty = 2)
abline(fit_gkrr, col = "firebrick", lwd = 2)
legend("topleft", c("OLS","GKRR (S3)","Elephants"),
col = c("black","firebrick","red"),
lty = c(2,1,NA), pch = c(NA,NA,17), lwd = 2, bty = "n")sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.6
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Fortaleza
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] gkrreg_0.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.39 R6_2.6.1 fastmap_1.2.0 xfun_0.56
#> [5] cachem_1.1.0 knitr_1.51 htmltools_0.5.9 rmarkdown_2.30
#> [9] lifecycle_1.0.5 cli_3.6.6 sass_0.4.10 jquerylib_0.1.4
#> [13] sm_2.2-6.0 compiler_4.5.2 rstudioapi_0.18.0 tools_4.5.2
#> [17] evaluate_1.0.5 bslib_0.11.0 yaml_2.3.12 otel_0.2.0
#> [21] jsonlite_2.0.0 rlang_1.2.0 MASS_7.3-65