Getting Started with Risk Differences

library(riskdiff)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

Why Risk Differences Matter in Public Health

When communicating health risks to policymakers, patients, or the public, absolute measures like risk differences are often more meaningful than relative measures like odds ratios or risk ratios.

Consider these two statements about betel nut (areca nut) chewing and cancer risk:

  1. “Betel nut chewing increases the odds of cancer by 5.2 times”

  2. “Betel nut chewing increases cancer risk by 12 percentage points”

The second statement is immediately actionable: in a screening group of 1,000 people, you would expect to find approximately 120 additional cancer cases among betel nut chewers compared to non-chewers. This directly informs:

Key Concept: Risk differences represent the additional burden of disease attributable to an exposure. They are on the same scale as the outcome, making them intuitive for non-statistical audiences.

Understanding the Example Data

The riskdiff package includes example data from a cancer screening program in Northeast India:

data(cachar_sample)

# Basic structure
glimpse(cachar_sample)
#> Rows: 2,500
#> Columns: 12
#> $ id                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
#> $ age                <int> 53, 25, 18, 28, 51, 25, 56, 20, 58, 18, 30, 64, 54,…
#> $ sex                <fct> female, male, female, female, male, female, male, m…
#> $ residence          <fct> urban slum, rural, rural, rural, rural, rural, rura…
#> $ smoking            <fct> No, No, No, No, No, No, Yes, No, No, No, No, Yes, N…
#> $ tobacco_chewing    <fct> Yes, No, No, Yes, Yes, No, Yes, No, Yes, Yes, Yes, …
#> $ areca_nut          <fct> Yes, Yes, Yes, Yes, No, No, Yes, No, Yes, Yes, No, …
#> $ alcohol            <fct> No, No, No, No, No, No, No, Yes, No, Yes, No, No, N…
#> $ abnormal_screen    <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ head_neck_abnormal <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ age_group          <fct> 40-60, Under 40, Under 40, Under 40, 40-60, Under 4…
#> $ tobacco_areca_both <fct> Yes, No, No, Yes, No, No, Yes, No, Yes, Yes, No, Ye…

# Key variables for our analysis
cachar_sample %>%
  select(abnormal_screen, areca_nut, smoking, alcohol, age, sex, residence) %>%
  summary()
#>  abnormal_screen  areca_nut  smoking    alcohol         age       
#>  Min.   :0.0000   No : 778   No :2168   No :1962   Min.   :18.00  
#>  1st Qu.:0.0000   Yes:1722   Yes: 332   Yes: 538   1st Qu.:23.00  
#>  Median :0.0000                                    Median :32.00  
#>  Mean   :0.1604                                    Mean   :36.06  
#>  3rd Qu.:0.0000                                    3rd Qu.:49.00  
#>  Max.   :1.0000                                    Max.   :84.00  
#>      sex            residence   
#>  male  :1880   rural     :2158  
#>  female: 620   urban     : 251  
#>                urban slum:  91  
#>                                 
#>                                 
#> 

This dataset represents a cross-sectional screening study where:

Your First Risk Difference Analysis

Let’s calculate the risk difference for cancer associated with betel nut chewing:

# Simple unadjusted analysis
rd_simple <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut"
)
#> Waiting for profiling to be done...

print(rd_simple)
#> Risk Difference Analysis Results (v0.2.0+)
#> ========================================== 
#> 
#> Confidence level: 95% 
#> Number of comparisons: 1 
#> 
#>   Exposure Risk Difference           95% CI P-value    Model Boundary CI Method
#>  areca_nut          18.25% (15.89%, 20.57%)  <0.001 identity               wald

Understanding the Output

The output shows:

Interpreting Risk Differences

# Let's visualize what this risk difference means
exposure_summary <- cachar_sample %>%
  group_by(areca_nut) %>%
  summarise(
    n = n(),
    cases = sum(abnormal_screen),
    risk = mean(abnormal_screen),
    se = sqrt(risk * (1 - risk) / n)
  ) %>%
  mutate(
    risk_percent = risk * 100,
    se_percent = se * 100
  )

print(exposure_summary)
#> # A tibble: 2 × 7
#>   areca_nut     n cases   risk      se risk_percent se_percent
#>   <fct>     <int> <int>  <dbl>   <dbl>        <dbl>      <dbl>
#> 1 No          778    27 0.0347 0.00656         3.47      0.656
#> 2 Yes        1722   374 0.217  0.00994        21.7       0.994

# The risk difference is simply:
rd_value <- diff(exposure_summary$risk)
cat("Risk difference:", round(rd_value * 100, 1), "percentage points\n")
#> Risk difference: 18.2 percentage points

Adjusted Analyses

Real-world associations are often confounded. Let’s adjust for age and sex:

# Age and sex adjusted analysis
rd_adjusted <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut",
  adjust_vars = c("age", "sex")
)
#> Waiting for profiling to be done...
print(rd_adjusted)
#> Risk Difference Analysis Results (v0.2.0+)
#> ========================================== 
#> 
#> Confidence level: 95% 
#> Number of comparisons: 1 
#> 
#>   Exposure Risk Difference          95% CI P-value Model Boundary CI Method
#>  areca_nut          17.02% (9.84%, 24.20%)  <0.001   log               wald

# Show comparison in a readable format
cat("\n=== COMPARISON: UNADJUSTED vs ADJUSTED ===\n")
#> 
#> === COMPARISON: UNADJUSTED vs ADJUSTED ===
cat("Unadjusted Risk Difference:", sprintf("%.2f%%", rd_simple$rd * 100), 
    sprintf("(%.2f%%, %.2f%%)", rd_simple$ci_lower * 100, rd_simple$ci_upper * 100), "\n")
#> Unadjusted Risk Difference: 18.25% (15.89%, 20.57%)
cat("Adjusted Risk Difference:  ", sprintf("%.2f%%", rd_adjusted$rd * 100), 
    sprintf("(%.2f%%, %.2f%%)", rd_adjusted$ci_lower * 100, rd_adjusted$ci_upper * 100), "\n")
#> Adjusted Risk Difference:   17.02% (9.84%, 24.20%)
cat("Difference in estimates:   ", sprintf("%.2f%%", (rd_adjusted$rd - rd_simple$rd) * 100), "percentage points\n")
#> Difference in estimates:    -1.23% percentage points

Note: Adjustment often changes the risk difference estimate. This indicates that age and/or sex were confounders of the chewing-cancer association.

Stratified Analysis

Sometimes we want to know if effects differ across subgroups:

# Stratified by residence (urban vs rural)
rd_stratified <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut",
  strata = "residence"
)
#> Waiting for profiling to be done...
#> Waiting for profiling to be done...
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Waiting for profiling to be done...
#> Note: 1 of 3 analyses had MLE on parameter space boundary. Robust confidence intervals were used.

print(rd_stratified)
#> Risk Difference Analysis Results (v0.2.0+)
#> ========================================== 
#> 
#> Confidence level: 95% 
#> Number of comparisons: 3 
#> Boundary cases detected: 1 of 3 
#> Boundary CI method: auto 
#> 
#>   Exposure Risk Difference                  95% CI P-value    Model    Boundary
#>  areca_nut          18.89%        (16.28%, 21.46%)  <0.001 identity            
#>  areca_nut          12.50% (-26421.33%, 26446.33%)   0.987      log [Uh oh]none
#>  areca_nut          16.96%         (3.07%, 30.85%)   0.017 identity            
#>          CI Method
#>               wald
#>  wald_conservative
#>               wald
#> 
#> Boundary Case Details:
#> =====================
#> Row 2 ( areca_nut ): Boundary type: none 
#> 
#> Boundary Type Guide:
#> - upper_bound: Fitted probabilities near 1
#> - lower_bound: Fitted probabilities near 0
#> - separation: Complete/quasi-separation detected
#> - both_bounds: Probabilities near both 0 and 1
#> - [Uh oh] indicates robust confidence intervals were used
#> 
#> Note: Standard asymptotic theory may not apply for boundary cases.
#> Confidence intervals use robust methods when boundary detected.

This shows separate risk differences for urban and rural areas, which might reflect:

Visualizing Your Results

A picture is worth a thousand p-values:

# Create a forest plot of risk differences
plot_data <- rd_stratified %>%
  mutate(
    label = paste0(residence, "\n(n=", n_obs, ")"),
    rd_percent = rd * 100,
    ci_lower_percent = ci_lower * 100,
    ci_upper_percent = ci_upper * 100
  )

ggplot(plot_data, aes(x = rd_percent, y = label)) +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = ci_lower_percent, xmax = ci_upper_percent), 
                 height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
  labs(
    title = "Risk Difference for Cancer by Betel Nut Chewing",
    subtitle = "Stratified by Urban/Rural Residence",
    x = "Risk Difference (percentage points)",
    y = ""
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.y = element_text(size = 11)
  )

Common Pitfalls and Solutions

1. Model Convergence Issues

The identity link GLM (which directly estimates risk differences) often fails to converge. The riskdiff package handles this automatically:

# Force different link functions
rd_identity <- calc_risk_diff(
  cachar_sample, "abnormal_screen", "areca_nut", 
  link = "identity"
)
#> Waiting for profiling to be done...

rd_log <- calc_risk_diff(
  cachar_sample, "abnormal_screen", "areca_nut", 
  link = "log"
)
#> Waiting for profiling to be done...

# Compare model types used
cat("Identity link model type:", rd_identity$model_type, "\n")
#> Identity link model type: identity
cat("Log link model type:", rd_log$model_type, "\n")
#> Log link model type: log

2. Very Rare Outcomes

With rare outcomes, risk differences become very small:

# Create a rare outcome (1% prevalence)
cachar_sample$rare_outcome <- rbinom(nrow(cachar_sample), 1, 0.01)

rd_rare <- calc_risk_diff(
  cachar_sample, 
  "rare_outcome", 
  "areca_nut"
)
#> Waiting for profiling to be done...

print(rd_rare)
#> Risk Difference Analysis Results (v0.2.0+)
#> ========================================== 
#> 
#> Confidence level: 95% 
#> Number of comparisons: 1 
#> 
#>   Exposure Risk Difference          95% CI P-value    Model Boundary CI Method
#>  areca_nut          -0.29% (-1.28%, 0.52%)   0.520 identity               wald

Tip: For very rare outcomes (<1%), consider whether risk ratios might be more appropriate for your research question.

3. Missing Data

The package automatically handles missing data by complete case analysis:

# Create a copy with some missing data for demonstration
set.seed(123)  # For reproducibility
cachar_with_missing <- cachar_sample %>%
  mutate(
    # Introduce more modest missing data (~3% in age, ~2% in alcohol)
    age_with_missing = ifelse(runif(n()) < 0.03, NA, age),
    alcohol_with_missing = ifelse(runif(n()) < 0.02, NA, alcohol)
  )

# Check the missing data patterns
missing_summary <- cachar_with_missing %>%
  summarise(
    total_observations = n(),
    age_missing = sum(is.na(age_with_missing)),
    alcohol_missing = sum(is.na(alcohol_with_missing)),
    total_missing_any = sum(!complete.cases(select(., age_with_missing, alcohol_with_missing, abnormal_screen, areca_nut))),
    complete_cases = sum(complete.cases(select(., age_with_missing, alcohol_with_missing, abnormal_screen, areca_nut)))
  )

print(missing_summary)
#>   total_observations age_missing alcohol_missing total_missing_any
#> 1               2500          83              57               140
#>   complete_cases
#> 1           2360

# Analysis with variables that have missing data
rd_missing <- calc_risk_diff(
  cachar_with_missing,
  "abnormal_screen",
  "areca_nut",
  adjust_vars = c("age_with_missing", "alcohol_with_missing")
)
#> Waiting for profiling to be done...

# Compare with complete case analysis
rd_complete <- calc_risk_diff(
  cachar_sample,
  "abnormal_screen", 
  "areca_nut",
  adjust_vars = c("age", "alcohol")
)
#> Waiting for profiling to be done...

cat("\n=== IMPACT OF MISSING DATA ===\n")
#> 
#> === IMPACT OF MISSING DATA ===
cat("Complete data analysis (n=", rd_complete$n_obs, "): ", sprintf("%.2f%%", rd_complete$rd * 100), 
    sprintf(" (%.2f%%, %.2f%%)", rd_complete$ci_lower * 100, rd_complete$ci_upper * 100), "\n")
#> Complete data analysis (n= 2500 ):  17.16%  (9.95%, 24.36%)

# Check if missing data analysis succeeded
if (!is.na(rd_missing$rd)) {
  cat("Missing data analysis (n=", rd_missing$n_obs, "):  ", sprintf("%.2f%%", rd_missing$rd * 100), 
      sprintf(" (%.2f%%, %.2f%%)", rd_missing$ci_lower * 100, rd_missing$ci_upper * 100), "\n")
  cat("Cases lost to missing data: ", rd_complete$n_obs - rd_missing$n_obs, "\n")
} else {
  cat("Missing data analysis: FAILED (insufficient data or convergence issues)\n")
  cat("Attempted to use n =", rd_missing$n_obs, "complete cases\n")
  cat("Cases lost to missing data: ", nrow(cachar_with_missing) - rd_missing$n_obs, "\n\n")
  
  cat("📚 LESSON: This demonstrates why missing data can be problematic:\n")
  cat("   • Listwise deletion can dramatically reduce sample size\n")
  cat("   • Small samples may cause model convergence failures\n") 
  cat("   • Consider multiple imputation for better missing data handling\n")
  cat("   • The riskdiff package gracefully handles these failures\n")
}
#> Missing data analysis: FAILED (insufficient data or convergence issues)
#> Attempted to use n = 2500 complete cases
#> Cases lost to missing data:  0 
#> 
#> 📚 LESSON: This demonstrates why missing data can be problematic:
#>    • Listwise deletion can dramatically reduce sample size
#>    • Small samples may cause model convergence failures
#>    • Consider multiple imputation for better missing data handling
#>    • The riskdiff package gracefully handles these failures

Quick Reference

Basic Syntax

# Example usage:
result <- calc_risk_diff(
  data = cachar_sample,           # Your dataset
  outcome = "abnormal_screen",    # Binary outcome variable (0/1)
  exposure = "areca_nut",         # Exposure of interest
  adjust_vars = c("age", "sex"),  # Variables to adjust for
  strata = "residence",           # Stratification variables
  link = "auto",                  # Link function: "auto", "identity", "log", "logit"
  alpha = 0.05,                   # Significance level (0.05 = 95% CI)
  verbose = FALSE                 # Print diagnostic messages if TRUE
)

Interpretation Guide

Risk Difference Interpretation Public Health Meaning
0.05 (5%) 5 percentage point increase 50 extra cases per 1,000 screened
0.01 (1%) 1 percentage point increase 10 extra cases per 1,000 screened
-0.03 (-3%) 3 percentage point decrease 30 fewer cases per 1,000 screened

When to Use Risk Differences

Use risk differences when:

Consider alternatives when:

Next Steps

Now that you understand the basics:

  1. See the “Complete Example” vignette for a full analysis workflow

  2. Check the “Technical Details” vignette for statistical methodology

  3. Use ?calc_risk_diff for detailed function documentation

Getting Help


This vignette is part of the riskdiff package (v0.2.0), developed to make risk difference calculations accessible to public health researchers.