Getting Started with Risk Differences

Your First Risk Difference Analysis

Let’s calculate the risk difference for cancer associated with betel nut chewing:

# Simple unadjusted analysis
rd_simple <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut"
)
#> Waiting for profiling to be done...
#> Warning: Unknown or uninitialised column: `on_boundary`.

print(rd_simple)
#> Risk Difference Analysis Results (v0.1.0) 
#> ========================================= 
#> 
#> Confidence level: 95% 
#> Number of comparisons: 1 
#> 
#>   Exposure Risk Difference           95% CI P-value    Model    N
#>  areca_nut          18.25% (15.89%, 20.57%)  <0.001 identity 2500

Understanding the Output

The output shows:

rd: The risk difference (e.g., 0.082 = 8.2 percentage points)
ci_lower, ci_upper: 95% confidence interval bounds
p_value: Test of whether the risk difference equals zero
model_type: Which GLM link function successfully converged
n_obs: Number of observations used in the analysis

Interpreting Risk Differences

# Let's visualize what this risk difference means
exposure_summary <- cachar_sample %>%
  group_by(areca_nut) %>%
  summarise(
    n = n(),
    cases = sum(abnormal_screen),
    risk = mean(abnormal_screen),
    se = sqrt(risk * (1 - risk) / n)
  ) %>%
  mutate(
    risk_percent = risk * 100,
    se_percent = se * 100
  )

print(exposure_summary)
#> # A tibble: 2 × 7
#>   areca_nut     n cases   risk      se risk_percent se_percent
#>   <fct>     <int> <int>  <dbl>   <dbl>        <dbl>      <dbl>
#> 1 No          778    27 0.0347 0.00656         3.47      0.656
#> 2 Yes        1722   374 0.217  0.00994        21.7       0.994

# The risk difference is simply:
rd_value <- diff(exposure_summary$risk)
cat("Risk difference:", round(rd_value * 100, 1), "percentage points\n")
#> Risk difference: 18.2 percentage points

Adjusted Analyses

Real-world associations are often confounded. Let’s adjust for age and sex:

# Age and sex adjusted analysis
rd_adjusted <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut",
  adjust_vars = c("age", "sex")
)
#> Warning: Unknown or uninitialised column: `on_boundary`.
print(rd_adjusted)
#> Risk Difference Analysis Results (v0.1.0) 
#> ========================================= 
#> 
#> Confidence level: 95% 
#> Number of comparisons: 1 
#> 
#>   Exposure Risk Difference          95% CI P-value Model    N
#>  areca_nut          17.02% (2.66%, 31.38%)       —   log 2500

# Show comparison in a readable format
cat("\n=== COMPARISON: UNADJUSTED vs ADJUSTED ===\n")
#> 
#> === COMPARISON: UNADJUSTED vs ADJUSTED ===
cat("Unadjusted Risk Difference:", sprintf("%.2f%%", rd_simple$rd * 100), 
    sprintf("(%.2f%%, %.2f%%)", rd_simple$ci_lower * 100, rd_simple$ci_upper * 100), "\n")
#> Unadjusted Risk Difference: 18.25% (15.89%, 20.57%)
cat("Adjusted Risk Difference:  ", sprintf("%.2f%%", rd_adjusted$rd * 100), 
    sprintf("(%.2f%%, %.2f%%)", rd_adjusted$ci_lower * 100, rd_adjusted$ci_upper * 100), "\n")
#> Adjusted Risk Difference:   17.02% (2.66%, 31.38%)
cat("Difference in estimates:   ", sprintf("%.2f%%", (rd_adjusted$rd - rd_simple$rd) * 100), "percentage points\n")
#> Difference in estimates:    -1.23% percentage points

Note: Adjustment often changes the risk difference estimate. This indicates that age and/or sex were confounders of the chewing-cancer association.

Stratified Analysis

Sometimes we want to know if effects differ across subgroups:

# Stratified by residence (urban vs rural)
rd_stratified <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut",
  strata = "residence"
)
#> Waiting for profiling to be done...
#> Warning in .transform_to_rd_robust(model, data, exposure, alpha, n_boot = 500,
#> : Confidence interval capped at ±50 percentage points due to extreme width.
#> Waiting for profiling to be done...
#> Warning: Unknown or uninitialised column: `on_boundary`.
#> Warning in calc_risk_diff(data = cachar_sample, outcome = "abnormal_screen", :
#> 1 of 3 estimates have very wide CIs (>50 pp). Consider larger sample sizes.

print(rd_stratified)
#> Risk Difference Analysis Results (v0.1.0) 
#> ========================================= 
#> 
#> Confidence level: 95% 
#> Number of comparisons: 3 
#> 
#>   Exposure Risk Difference            95% CI P-value    Model    N
#>  areca_nut          18.89%  (16.28%, 21.46%)  <0.001 identity 2158
#>  areca_nut          12.50% (-37.50%, 62.50%)       —      log  251
#>  areca_nut          16.96%   (3.07%, 30.85%)   0.017 identity   91

This shows separate risk differences for urban and rural areas, which might reflect:

Different exposure patterns
Access to healthcare
Environmental co-factors

Visualizing Your Results

A picture is worth a thousand p-values:


# Stratified analysis by residence
rd_stratified <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut",
  strata = "residence"
)
#> Waiting for profiling to be done...
#> Warning in .transform_to_rd_robust(model, data, exposure, alpha, n_boot = 500,
#> : Confidence interval capped at ±50 percentage points due to extreme width.
#> Waiting for profiling to be done...
#> Warning: Unknown or uninitialised column: `on_boundary`.
#> Warning in calc_risk_diff(data = cachar_sample, outcome = "abnormal_screen", :
#> 1 of 3 estimates have very wide CIs (>50 pp). Consider larger sample sizes.

print(rd_stratified)
#> Risk Difference Analysis Results (v0.1.0) 
#> ========================================= 
#> 
#> Confidence level: 95% 
#> Number of comparisons: 3 
#> 
#>   Exposure Risk Difference            95% CI P-value    Model    N
#>  areca_nut          18.89%  (16.28%, 21.46%)  <0.001 identity 2158
#>  areca_nut          12.50% (-37.50%, 62.50%)       —      log  251
#>  areca_nut          16.96%   (3.07%, 30.85%)   0.017 identity   91

# Check if estimates are stable enough for visualization
rd_summary <- rd_stratified %>%
  summarise(
    n_valid = sum(!is.na(rd)),
    max_ci_width = max((ci_upper - ci_lower) * 100, na.rm = TRUE),
    .groups = "drop"
  )

# Only create plot if we have reasonable estimates
if (rd_summary$n_valid > 0 && rd_summary$max_ci_width < 200) {
  # Use the fixed forest plot function
  plot_obj <- create_forest_plot(
    rd_stratified, 
    title = "Risk Difference for Cancer by Areca Nut Chewing",
    max_ci_width = 30
  )
  
  if (!is.null(plot_obj)) {
    print(plot_obj)
  } else {
    cat(paste0(riskdiff:::.safe_warning()), "Could not create plot due to unstable estimates.\n")
  }
} else {
  cat(paste0(riskdiff:::.safe_warning()), "Estimates too unstable for meaningful visualization.\n")
  cat("Consider:\n")
  cat(paste0(riskdiff:::.safe_bullet()), "Larger sample sizes\n") 
  cat(paste0(riskdiff:::.safe_bullet()), "Different stratification variables\n")
  cat(paste0(riskdiff:::.safe_bullet()), "Pooled analysis instead of stratification\n")
  
  # Show tabular results instead using the robust summary function
  formatted_results <- create_summary_table(
    rd_stratified, 
    caption = "Risk Differences by Residence"
  )
  
  knitr::kable(formatted_results, caption = "Risk Differences by Residence")
}

Common Pitfalls and Solutions

1. Model Convergence Issues

The identity link GLM (which directly estimates risk differences) often fails to converge. The riskdiff package handles this automatically:

# Force different link functions
rd_identity <- calc_risk_diff(
  cachar_sample, "abnormal_screen", "areca_nut", 
  link = "identity"
)
#> Waiting for profiling to be done...
#> Warning: Unknown or uninitialised column: `on_boundary`.

rd_log <- calc_risk_diff(
  cachar_sample, "abnormal_screen", "areca_nut", 
  link = "log"
)
#> Warning: Unknown or uninitialised column: `on_boundary`.

# Compare model types used
cat("Identity link model type:", rd_identity$model_type, "\n")
#> Identity link model type: identity
cat("Log link model type:", rd_log$model_type, "\n")
#> Log link model type: log

2. Very Rare Outcomes

With rare outcomes, risk differences become very small:

# Create a rare outcome (1% prevalence)
cachar_sample$rare_outcome <- rbinom(nrow(cachar_sample), 1, 0.01)

rd_rare <- calc_risk_diff(
  cachar_sample, 
  "rare_outcome", 
  "areca_nut"
)
#> Waiting for profiling to be done...
#> Warning: Unknown or uninitialised column: `on_boundary`.

print(rd_rare)
#> Risk Difference Analysis Results (v0.1.0) 
#> ========================================= 
#> 
#> Confidence level: 95% 
#> Number of comparisons: 1 
#> 
#>   Exposure Risk Difference          95% CI P-value    Model    N
#>  areca_nut           0.55% (-0.42%, 1.39%)   0.214 identity 2500

Tip: For very rare outcomes (<1%), consider whether risk ratios might be more appropriate for your research question.

3. Missing Data

The package automatically handles missing data by complete case analysis:

# Create a copy with some missing data for demonstration
set.seed(123)  # For reproducibility
cachar_with_missing <- cachar_sample %>%
  mutate(
    # Introduce more modest missing data (~3% in age, ~2% in alcohol)
    age_with_missing = ifelse(runif(n()) < 0.03, NA, age),
    alcohol_with_missing = ifelse(runif(n()) < 0.02, NA, alcohol)
  )

# Check the missing data patterns
missing_summary <- cachar_with_missing %>%
  summarise(
    total_observations = n(),
    age_missing = sum(is.na(age_with_missing)),
    alcohol_missing = sum(is.na(alcohol_with_missing)),
    total_missing_any = sum(!complete.cases(select(., age_with_missing, alcohol_with_missing, abnormal_screen, areca_nut))),
    complete_cases = sum(complete.cases(select(., age_with_missing, alcohol_with_missing, abnormal_screen, areca_nut)))
  )

print(missing_summary)
#>   total_observations age_missing alcohol_missing total_missing_any
#> 1               2500          83              57               140
#>   complete_cases
#> 1           2360

# Analysis with variables that have missing data
rd_missing <- calc_risk_diff(
  cachar_with_missing,
  "abnormal_screen",
  "areca_nut",
  adjust_vars = c("age_with_missing", "alcohol_with_missing")
)
#> Warning: Unknown or uninitialised column: `on_boundary`.

# Compare with complete case analysis
rd_complete <- calc_risk_diff(
  cachar_sample,
  "abnormal_screen", 
  "areca_nut",
  adjust_vars = c("age", "alcohol")
)
#> Warning: Unknown or uninitialised column: `on_boundary`.

cat("\n=== IMPACT OF MISSING DATA ===\n")
#> 
#> === IMPACT OF MISSING DATA ===
cat("Complete data analysis (n=", rd_complete$n_obs, "): ", sprintf("%.2f%%", rd_complete$rd * 100), 
    sprintf(" (%.2f%%, %.2f%%)", rd_complete$ci_lower * 100, rd_complete$ci_upper * 100), "\n")
#> Complete data analysis (n= 2500 ):  17.16%  (2.75%, 31.57%)

# Check if missing data analysis succeeded
if (!is.na(rd_missing$rd)) {
  cat("Missing data analysis (n=", rd_missing$n_obs, "):  ", sprintf("%.2f%%", rd_missing$rd * 100), 
      sprintf(" (%.2f%%, %.2f%%)", rd_missing$ci_lower * 100, rd_missing$ci_upper * 100), "\n")
  cat("Cases lost to missing data: ", rd_complete$n_obs - rd_missing$n_obs, "\n")
} else {
  cat("Missing data analysis: FAILED (insufficient data or convergence issues)\n")
  cat("Attempted to use n =", rd_missing$n_obs, "complete cases\n")
  cat("Cases lost to missing data: ", nrow(cachar_with_missing) - rd_missing$n_obs, "\n\n")
  
  cat("LESSON: This demonstrates why missing data can be problematic:\n")
  cat(paste0(riskdiff:::.safe_bullet()), "Listwise deletion can dramatically reduce sample size\n")
  cat(paste0(riskdiff:::.safe_bullet()), "Small samples may cause model convergence failures\n") 
  cat(paste0(riskdiff:::.safe_bullet()), "Consider multiple imputation for better missing data handling\n")
  cat(paste0(riskdiff:::.safe_bullet()), "The riskdiff package gracefully handles these failures\n")
}
#> Missing data analysis: FAILED (insufficient data or convergence issues)
#> Attempted to use n = 2500 complete cases
#> Cases lost to missing data:  0 
#> 
#> LESSON: This demonstrates why missing data can be problematic:
#> • Listwise deletion can dramatically reduce sample size
#> • Small samples may cause model convergence failures
#> • Consider multiple imputation for better missing data handling
#> • The riskdiff package gracefully handles these failures

Quick Reference

Basic Syntax

# Example usage:
result <- calc_risk_diff(
  data = cachar_sample,           # Your dataset
  outcome = "abnormal_screen",    # Binary outcome variable (0/1)
  exposure = "areca_nut",         # Exposure of interest
  adjust_vars = c("age", "sex"),  # Variables to adjust for
  strata = "residence",           # Stratification variables
  link = "auto",                  # Link function: "auto", "identity", "log", "logit"
  alpha = 0.05,                   # Significance level (0.05 = 95% CI)
  verbose = FALSE                 # Print diagnostic messages if TRUE
)

Interpretation Guide

Risk Difference	Interpretation	Public Health Meaning
0.05 (5%)	5 percentage point increase	50 extra cases per 1,000 screened
0.01 (1%)	1 percentage point increase	10 extra cases per 1,000 screened
-0.03 (-3%)	3 percentage point decrease	30 fewer cases per 1,000 screened

When to Use Risk Differences

✅ Use risk differences when:

Communicating to non-statistical audiences
Making policy decisions about interventions
The outcome is relatively common (>5%)
You need to calculate number needed to treat/screen
Comparing absolute impact across different populations

❌ Consider alternatives when:

Outcome is very rare (<1%)
You need to compare across populations with very different baseline risks
Your research question is about biological/etiological mechanisms