Modeling site, platform, relatedness, and spatial structure

Selçuk Korkmaz

2026-07-03

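Beyond the classic subject / batch / study / time relations, splitGraph models several further leakage axes, in two families: cluster-style relations (site, region, platform, assay), where samples that share a value form a group, and pairwise relations (genetic relatedness, spatial proximity), where thresholded links between pairs are grouped by transitive closure.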

This vignette builds and groups by each, and shows how the threshold drives the pairwise grouping.

Cluster-style relations: site, region, platform, assay

graph_from_metadata() auto-detects site_id, region_id, platform_id, and assay_id columns and builds the corresponding typed nodes and edges. Each relation then has its own constraint mode. The example below uses site, platform, and assay; region behaves identically (supply a region_id column and use mode = "region") and is omitted only to keep the output short.

meta <- data.frame(
  sample_id   = paste0("S", 1:6),
  subject_id  = c("P1", "P1", "P2", "P2", "P3", "P3"),
  site_id     = c("NYC", "NYC", "BOS", "BOS", "NYC", "BOS"),
  platform_id = c("illumina", "illumina", "nanopore", "nanopore", "illumina", "nanopore"),
  assay_id    = c("rnaseq", "rnaseq", "rnaseq", "wgs", "wgs", "wgs"),
  stringsAsFactors = FALSE
)

g <- graph_from_metadata(meta, graph_name = "structure-demo")

grouping_vector(derive_split_constraints(g, mode = "site"))
#>         S1         S2         S3         S4         S5         S6 
#> "site:NYC" "site:NYC" "site:BOS" "site:BOS" "site:NYC" "site:BOS"
grouping_vector(derive_split_constraints(g, mode = "platform"))
#>                  S1                  S2                  S3                  S4 
#> "platform:illumina" "platform:illumina" "platform:nanopore" "platform:nanopore" 
#>                  S5                  S6 
#> "platform:illumina" "platform:nanopore"
grouping_vector(derive_split_constraints(g, mode = "assay"))
#>             S1             S2             S3             S4             S5 
#> "assay:rnaseq" "assay:rnaseq" "assay:rnaseq"    "assay:wgs"    "assay:wgs" 
#>             S6 
#>    "assay:wgs"

Whatever mode is primary, every detected cluster relation is also carried into the split_spec as a blocking annotation, so a downstream consumer can block on site, platform, or assay even when the split unit is something else — here, subject:

spec <- as_split_spec(derive_split_constraints(g, mode = "subject"), graph = g)
spec$block_vars
#> [1] "site_group"     "platform_group" "assay_group"
head(spec$sample_data[, c("sample_id", "group_id",
                          "site_group", "platform_group", "assay_group")])
#>   sample_id   group_id site_group platform_group assay_group
#> 1        S1 subject:P1        NYC       illumina      rnaseq
#> 2        S2 subject:P1        NYC       illumina      rnaseq
#> 3        S3 subject:P2        BOS       nanopore      rnaseq
#> 4        S4 subject:P2        BOS       nanopore         wgs
#> 5        S5 subject:P3        NYC       illumina         wgs
#> 6        S6 subject:P3        BOS       nanopore         wgs

Any of these relations can also participate in a composite derivation, where several dependency sources are combined and each connected component becomes one group:

constraint <- derive_split_constraints(
  g, mode = "composite", strategy = "strict",
  via = c("Subject", "Site", "Platform")
)
grouping_vector(constraint)
#>            S1            S2            S3            S4            S5 
#> "component_1" "component_1" "component_1" "component_1" "component_1" 
#>            S6 
#> "component_1"
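
One way to read this result: S1, S2, and S5 share a site and platform, S3, S4, and S6 share the other site and platform, and subject P3 contributes one sample to each cluster, so the two merge. The base R sketch below reproduces that reasoning, assuming a composite group is a connected component over samples that share any value in the chosen columns. composite_groups() is a hypothetical helper written for this illustration; it is not part of splitGraph and does not model the strategy argument.

```r
meta <- data.frame(
  sample_id   = paste0("S", 1:6),
  subject_id  = c("P1", "P1", "P2", "P2", "P3", "P3"),
  site_id     = c("NYC", "NYC", "BOS", "BOS", "NYC", "BOS"),
  platform_id = c("illumina", "illumina", "nanopore", "nanopore",
                  "illumina", "nanopore"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a splitGraph function): label connected components
# where two samples are linked if they share a value in any of `cols`.
composite_groups <- function(meta, cols) {
  label <- seq_len(nrow(meta))
  repeat {
    before <- label
    for (col in cols) {
      # Pull every sample sharing a value down to the smallest label among them.
      label <- ave(label, meta[[col]], FUN = min)
    }
    if (all(label == before)) break
  }
  setNames(match(label, unique(label)), meta$sample_id)
}

with_subject    <- composite_groups(meta, c("subject_id", "site_id", "platform_id"))
without_subject <- composite_groups(meta, c("site_id", "platform_id"))

length(unique(with_subject))     # 1: subject P3 bridges the two clusters
length(unique(without_subject))  # 2: one component per site/platform cluster
```

Dropping the subject column leaves two components, which suggests that P3 is the only bridge between the NYC/illumina and BOS/nanopore samples in this toy data.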

Pairwise relation: genetic relatedness

Some leakage is pairwise and continuous rather than a clean grouping. Genetic relatedness is the canonical example: a kinship coefficient — typically from a tool such as KING or PLINK — links pairs of subjects. relatedness_edges_from_kinship() takes such a pair table, keeps pairs at or above a threshold, and emits subject_related_to edges; mode = "relatedness" then groups by transitive closure over those edges (so a chain of related individuals lands in one group).

# A kinship table over subject pairs (one sample per subject here for clarity).
# P1-P2 and P2-P3 clear the threshold and chain together; P5-P6 form a second
# related pair; P1-P4 is too weak to count.
kin <- data.frame(
  id1     = c("P1", "P2", "P1", "P5"),
  id2     = c("P2", "P3", "P4", "P6"),
  kinship = c(0.25, 0.20, 0.02, 0.30),
  stringsAsFactors = FALSE
)
rel_edges <- relatedness_edges_from_kinship(kin, threshold = 0.1)

meta_r <- data.frame(
  sample_id  = paste0("S", 1:6),
  subject_id = paste0("P", 1:6),
  stringsAsFactors = FALSE
)
samples  <- create_nodes(meta_r, "Sample", "sample_id")
subjects <- create_nodes(meta_r, "Subject", "subject_id")
belongs  <- create_edges(meta_r, "sample_id", "subject_id",
                         "Sample", "Subject", "sample_belongs_to_subject")

g_rel <- build_dependency_graph(list(samples, subjects), list(belongs, rel_edges))

rel_groups <- grouping_vector(derive_split_constraints(g_rel, mode = "relatedness"))
rel_groups
#>                        S1                        S2                        S3 
#> "relatedness:component_1" "relatedness:component_1" "relatedness:component_1" 
#>                        S4                        S5                        S6 
#> "relatedness:component_2" "relatedness:component_3" "relatedness:component_3"

The grouping is a transitive closure over the subject_related_to edges. The network below draws those edges between subjects, coloured by the relatedness group each subject (and therefore its samples) lands in: the P1–P2–P3 chain becomes one group even though P1 and P3 were never linked directly, P5–P6 form a second, and the unrelated P4 stands alone.

subject_group <- setNames(rel_groups[meta_r$sample_id], meta_r$subject_id)
kept_pairs <- kin[kin$kinship >= 0.1, c("id1", "id2")]
rel_net <- igraph::graph_from_data_frame(
  kept_pairs, directed = FALSE,
  vertices = data.frame(name = meta_r$subject_id)
)

palette_rel <- c("#4C78A8", "#F58518", "#54A24B", "#B279A2")
set.seed(1)
plot(rel_net,
     vertex.color       = palette_rel[as.integer(factor(subject_group[igraph::V(rel_net)$name]))],
     vertex.size        = 34,
     vertex.label.color = "white",
     vertex.label.font  = 2,
     edge.color         = "grey60",
     edge.width         = 2,
     main               = "Relatedness clusters (kinship >= 0.1)")

The threshold is the key knob, and it belongs to the edge-building step, not the grouping. Raising it drops weaker links: at 0.22 the P2–P3 pair (kinship 0.20) no longer qualifies, so that chain breaks and P3 splits into its own group, while the stronger P5–P6 pair is untouched:

rel_strict <- relatedness_edges_from_kinship(kin, threshold = 0.22)
g_rel_strict <- build_dependency_graph(list(samples, subjects), list(belongs, rel_strict))

grouping_vector(derive_split_constraints(g_rel_strict, mode = "relatedness"))
#>                        S1                        S2                        S3 
#> "relatedness:component_1" "relatedness:component_1" "relatedness:component_2" 
#>                        S4                        S5                        S6 
#> "relatedness:component_3" "relatedness:component_4" "relatedness:component_4"
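
To see the threshold's effect across a range rather than at two values, the sketch below recomputes the closure in base R for several cutoffs. It assumes the rule described above (keep pairs with kinship at or above the threshold, then take connected components). closure_groups() and n_groups_at() are hypothetical helpers for this illustration, not splitGraph functions.

```r
kin <- data.frame(
  id1     = c("P1", "P2", "P1", "P5"),
  id2     = c("P2", "P3", "P4", "P6"),
  kinship = c(0.25, 0.20, 0.02, 0.30),
  stringsAsFactors = FALSE
)
subjects <- paste0("P", 1:6)

# Hypothetical helper: connected components by repeated label propagation.
closure_groups <- function(nodes, from, to) {
  label <- setNames(seq_along(nodes), nodes)
  repeat {
    before <- label
    for (i in seq_along(from)) {
      ends <- c(from[i], to[i])
      label[ends] <- min(label[ends])  # both endpoints take the smaller label
    }
    if (all(label == before)) break
  }
  setNames(match(label, unique(label)), nodes)
}

# Number of relatedness groups that survive at a given kinship cutoff.
n_groups_at <- function(threshold) {
  kept <- kin[kin$kinship >= threshold, ]
  length(unique(closure_groups(subjects, kept$id1, kept$id2)))
}

thresholds <- c(0.01, 0.10, 0.22, 0.28, 0.35)
n_groups   <- vapply(thresholds, n_groups_at, integer(1))
data.frame(threshold = thresholds, n_groups = n_groups)
```

With this toy table the group count rises from 2 to 6 as the cutoff increases: each step removes the weakest remaining pair and splits one component.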

Pairwise relation: spatial proximity

Spatial proximity works the same way over sample coordinates — for example spot locations from spatial transcriptomics, positions on a tissue slide, or geographic site coordinates. spatial_edges_from_coords() connects samples within a radius (Euclidean distance over the coordinate columns), and mode = "spatial" groups the resulting connected components.

# Two spatial clusters. Cluster 1 (S1-S3) is a chain: neighbouring pairs are
# within the radius, but the endpoints are not.
coords <- data.frame(
  sample_id = paste0("S", 1:6),
  x = c(0, 1, 2,  6.0, 6.9, 6.2),
  y = c(0, 1, 0,  6.0, 6.6, 5.3),
  stringsAsFactors = FALSE
)
adj_edges <- spatial_edges_from_coords(coords, radius = 1.5)

meta_s <- data.frame(
  sample_id  = paste0("S", 1:6),
  subject_id = paste0("P", 1:6),
  stringsAsFactors = FALSE
)
samples_s  <- create_nodes(meta_s, "Sample", "sample_id")
subjects_s <- create_nodes(meta_s, "Subject", "subject_id")
belongs_s  <- create_edges(meta_s, "sample_id", "subject_id",
                           "Sample", "Subject", "sample_belongs_to_subject")

g_sp <- build_dependency_graph(list(samples_s, subjects_s), list(belongs_s, adj_edges))

sp_groups <- grouping_vector(derive_split_constraints(g_sp, mode = "spatial"))
sp_groups
#>                    S1                    S2                    S3 
#> "spatial:component_1" "spatial:component_1" "spatial:component_1" 
#>                    S4                    S5                    S6 
#> "spatial:component_2" "spatial:component_2" "spatial:component_2"

Plotting the coordinates, drawing the within-radius adjacency edges in grey, and colouring points by the derived group makes the transitive closure concrete: S1–S2 and S2–S3 are each within the 1.5 radius, so all three share a group even though S1 and S3 are 2 units apart and were never linked directly. Every sample in the second cluster is likewise reachable from the others, while the two clusters are far enough apart to stay separate:

sp_grp <- factor(sp_groups[coords$sample_id])
row_of <- setNames(seq_len(nrow(coords)), coords$sample_id)
from_i <- row_of[sub("^sample:", "", adj_edges$data$from)]
to_i   <- row_of[sub("^sample:", "", adj_edges$data$to)]
palette_sp <- c("#4C78A8", "#F58518")

plot(coords$x, coords$y, type = "n", asp = 1, xlab = "x", ylab = "y",
     main = "Spatial groups (radius = 1.5)")
segments(coords$x[from_i], coords$y[from_i],
         coords$x[to_i],   coords$y[to_i], col = "grey60", lwd = 2)
points(coords$x, coords$y, pch = 19, cex = 3.5, col = palette_sp[as.integer(sp_grp)])
text(coords$x, coords$y, labels = coords$sample_id, col = "white", cex = 0.8, font = 2)
legend("topleft", legend = levels(sp_grp), pch = 19,
       col = palette_sp[seq_along(levels(sp_grp))], title = "Spatial group", bty = "n")
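
The distances behind those adjacency edges can be checked directly in base R. The sketch below lists every pair within the radius using stats::dist(). It treats "within" as distance <= radius, which is an assumption about the boundary case; no pair in this example sits exactly on the radius, so the choice does not affect the result. It does not call spatial_edges_from_coords().

```r
coords <- data.frame(
  sample_id = paste0("S", 1:6),
  x = c(0, 1, 2,  6.0, 6.9, 6.2),
  y = c(0, 1, 0,  6.0, 6.6, 5.3),
  stringsAsFactors = FALSE
)
radius <- 1.5

# Full Euclidean distance matrix, labelled by sample.
d <- as.matrix(dist(coords[, c("x", "y")]))
dimnames(d) <- list(coords$sample_id, coords$sample_id)

# Keep each unordered pair once (upper triangle) if it is within the radius.
close_pairs <- which(d <= radius & upper.tri(d), arr.ind = TRUE)
pairs <- data.frame(
  from     = coords$sample_id[close_pairs[, "row"]],
  to       = coords$sample_id[close_pairs[, "col"]],
  distance = round(d[close_pairs], 2),
  stringsAsFactors = FALSE
)
pairs

d["S1", "S3"]  # 2: beyond the radius, so S1 and S3 are linked only through S2
```

Five pairs qualify: the two links of the S1-S2-S3 chain and all three pairs among S4, S5, and S6.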

Deriving on a subset is leakage-safe

Real splits are derived on a subset of samples — the training rows, say. For pairwise (and composite) modes this raises a subtle question: if a sample that bridges two others is left out of the subset, could those two still inherit a shared group from the full graph? They do not. When you pass samples =, grouping is recomputed within that subset, so structure that exists only through an excluded sample never leaks across the split.

The spatial chain makes this visible. S1 and S3 shared a group only because S2 bridged them; ask for S1 and S3 alone, and they correctly fall into separate groups:

grouping_vector(
  derive_split_constraints(g_sp, mode = "spatial", samples = c("S1", "S3"))
)
#>                    S1                    S3 
#> "spatial:component_1" "spatial:component_2"
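
The same behaviour can be mimicked in base R: restrict the edge list to pairs whose endpoints are both in the subset, then recompute the components. This is a sketch of the idea as described above, not splitGraph's implementation; closure_groups() and group_within() are hypothetical names.

```r
# Adjacency edges of the S1-S2-S3 chain from the spatial example.
edges <- data.frame(
  from = c("S1", "S2"),
  to   = c("S2", "S3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: connected components by repeated label propagation.
closure_groups <- function(nodes, from, to) {
  label <- setNames(seq_along(nodes), nodes)
  repeat {
    before <- label
    for (i in seq_along(from)) {
      ends <- c(from[i], to[i])
      label[ends] <- min(label[ends])
    }
    if (all(label == before)) break
  }
  setNames(match(label, unique(label)), nodes)
}

# Recompute groups using only edges that lie entirely inside the subset.
group_within <- function(keep) {
  e <- edges[edges$from %in% keep & edges$to %in% keep, ]
  closure_groups(keep, e$from, e$to)
}

group_within(c("S1", "S2", "S3"))  # one shared group: S2 bridges S1 and S3
group_within(c("S1", "S3"))        # two groups: no surviving edge joins them
```

Filtering edges before taking the closure is what prevents an excluded sample from acting as a bridge.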

Thresholds are inputs, not modeling

Because the threshold (kinship cutoff, spatial radius) is applied up front in the edge-building helpers, it is a derivation input, not a modeling choice: splitGraph forms groups over whatever edges survive and never computes folds itself. The resulting split_spec is handed to a downstream consumer for execution, exactly as with every other mode — see the adapter-cookbook and cross-language-handoff vignettes for that step.
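
For orientation only, the sketch below shows the kind of step a downstream consumer might perform with a grouping vector: assigning whole groups to folds so that no group is split. The round-robin rule and every name here are assumptions for illustration; this is not the method from the adapter-cookbook vignette, and real consumers would typically balance fold sizes and honour blocking variables.

```r
# Grouping vector as produced by the relatedness example at threshold 0.1.
groups <- c(
  S1 = "relatedness:component_1", S2 = "relatedness:component_1",
  S3 = "relatedness:component_1", S4 = "relatedness:component_2",
  S5 = "relatedness:component_3", S6 = "relatedness:component_3"
)
k <- 2

# Assign each group (not each sample) to a fold, round-robin.
group_ids     <- unique(groups)
fold_of_group <- setNames(rep_len(seq_len(k), length(group_ids)), group_ids)

# Every sample inherits the fold of its group.
fold <- setNames(unname(fold_of_group[groups]), names(groups))
fold

# Check: no group appears in more than one fold.
folds_per_group <- tapply(fold, groups, function(f) length(unique(f)))
all(folds_per_group == 1)
```

Because folds are assigned at the group level, related samples cannot end up on opposite sides of a split, whichever threshold produced the groups.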