splitGraph ends at a split_spec object. It
deliberately knows nothing about rsample,
tidymodels, or any other resampling engine. The handoff
contract is the sample_data table inside the spec plus a
few scalar fields (group_var, block_vars,
time_var, ordering_required,
recommended_resampling), together with provenance the
adapter can inspect to choose a strategy (constraint_mode,
constraint_strategy).
You do not always have to write this glue yourself. The reference
downstream consumer, bioLeak,
takes a split_spec directly —
bioLeak::as_leaksplits(spec, data, outcome) builds an
executable, leakage-audited split plan from it. This cookbook is for the
other case: when you want to feed a split_spec into a
different engine, or understand exactly what a consumer has to honor. It
shows three small, self-contained adapters that turn a
split_spec into something a downstream workflow can
use:
1. A base-R leave-one-group-out adapter that returns (train, assess)
   row-index pairs: runnable here, no extra dependencies.
2. An rsample::group_vfold_cv() adapter for grouped cross-validation
   keyed to group_id.
3. An rsample::rolling_origin() adapter for ordered evaluation keyed to
   order_rank.
Adapters 2 and 3 show idiomatic glue but are not evaluated in this
vignette, so that splitGraph does not pick up rsample as a build-time
dependency. The same pattern works for any other resampling library you
happen to use.
meta <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id = c("P1", "P1", "P2", "P2", "P3", "P3"),
  batch_id = c("B1", "B2", "B1", "B2", "B1", "B2"),
  timepoint_id = c("T0", "T1", "T0", "T1", "T0", "T1"),
  time_index = c(0, 1, 0, 1, 0, 1),
  outcome_id = c("ctrl", "case", "ctrl", "case", "case", "ctrl"),
  stringsAsFactors = FALSE
)
g <- graph_from_metadata(meta, graph_name = "cookbook")
subject_constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(subject_constraint, graph = g)
spec
#> <split_spec> subject
#> Samples: 6
#> Groups: 3
#> Recommended resampling: grouped_cv
The sample_data table is the contract:
This is the simplest meaningful adapter. It groups by whatever
split_spec$group_var says is the split unit, and returns
one held-out group per fold.
logo_folds <- function(spec, observation_data, sample_id_col = "sample_id") {
  stopifnot(inherits(spec, "split_spec"))
  if (!sample_id_col %in% names(observation_data)) {
    stop("`observation_data` must contain a `", sample_id_col, "` column.")
  }
  # Attach the split unit (group_var) to each observation by sample ID.
  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$group_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )
  # Row positions below refer to `joined`, not to `observation_data`.
  joined$.row <- seq_len(nrow(joined))
  groups <- split(joined$.row, joined[[spec$group_var]])
  # One fold per group: hold that group out, train on all the others.
  lapply(names(groups), function(g) {
    list(
      group = g,
      train = unlist(groups[setdiff(names(groups), g)], use.names = FALSE),
      assess = groups[[g]]
    )
  })
}
# Pretend we have an observation frame keyed by sample_id.
obs <- data.frame(
  sample_id = meta$sample_id,
  x = rnorm(nrow(meta)),
  y = rbinom(nrow(meta), 1, 0.5)
)
folds <- logo_folds(spec, obs)
length(folds)
#> [1] 3
folds[[1]]
#> $group
#> [1] "subject:P1"
#>
#> $train
#> [1] 3 4 5 6
#>
#> $assess
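#> [1] 1 2
That is the entire downstream contract: take spec, take
an observation frame, return train/assess index lists. Anything more
complicated is specific to a resampling library. Note that the indices
are row positions in the merged frame built inside the adapter;
merge(sort = FALSE) leaves the row order unspecified, so an adapter used
beyond a demo should also return the merged frame (or sample IDs) rather
than assume it lines up with observation_data.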
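Because the contract is just index lists, it is cheap to verify. The following is a minimal sketch of such a check, not part of splitGraph: `mock_spec` is a hand-built stand-in carrying only `group_var` and `sample_data`, and `check_folds()` is a hypothetical helper that asserts the three properties a grouped split should have (disjoint train/assess rows, no group on both sides, every row assessed exactly once).

```r
# Stand-in for a split_spec: only the fields the adapter contract names.
# (Hypothetical fixture; a real spec comes from as_split_spec().)
mock_spec <- structure(
  list(
    group_var = "group_id",
    sample_data = data.frame(
      sample_id = paste0("S", 1:6),
      group_id = rep(c("subject:P1", "subject:P2", "subject:P3"), each = 2),
      stringsAsFactors = FALSE
    )
  ),
  class = "split_spec"
)

# Hypothetical helper: check a list of folds shaped like logo_folds() output.
# `row_groups` is the group label of each row, in the order the indices use.
check_folds <- function(folds, row_groups) {
  n <- length(row_groups)
  for (f in folds) {
    # Train and assess must not share rows, and together must cover all rows.
    stopifnot(length(intersect(f$train, f$assess)) == 0L)
    stopifnot(setequal(c(f$train, f$assess), seq_len(n)))
    # No group may appear on both sides of the boundary.
    stopifnot(length(intersect(row_groups[f$train], row_groups[f$assess])) == 0L)
  }
  # Every row is assessed exactly once across folds.
  assessed <- unlist(lapply(folds, `[[`, "assess"))
  stopifnot(identical(sort(assessed), seq_len(n)))
  invisible(TRUE)
}

# Build leave-one-group-out folds by hand, mirroring the adapter's return shape.
row_groups <- mock_spec$sample_data[[mock_spec$group_var]]
rows_by_group <- split(seq_along(row_groups), row_groups)
demo_folds <- lapply(names(rows_by_group), function(grp) {
  list(
    group = grp,
    train = unlist(rows_by_group[names(rows_by_group) != grp], use.names = FALSE),
    assess = rows_by_group[[grp]]
  )
})

check_folds(demo_folds, row_groups)
```

The check stops with an error on the first violated property, which makes it suitable as a guard at the top of a modelling script.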
group_var is the primary split unit, but
split_spec also advertises coarser block_vars
— dependency axes that should ideally not straddle a fold even when they
are not the grouping unit. They are per-sample columns aligned to
sample_id, so an adapter reads them exactly like
group_var:
spec$block_vars
#> [1] "batch_group"
head(spec$sample_data[, c("sample_id", spec$group_var, spec$block_vars)])
#>   sample_id   group_id batch_group
#> 1        S1 subject:P1          B1
#> 2        S2 subject:P1          B2
#> 3        S3 subject:P2          B1
#> 4        S4 subject:P2          B2
#> 5        S5 subject:P3          B1
#> 6        S6 subject:P3          B2
A block-aware adapter can pass these to a resampler’s blocking/strata
argument, or simply audit its folds. Here we check whether any batch
straddles the train/assess boundary — a leak a subject-only split does
not prevent, and exactly what carrying batch_group on the
spec lets a consumer catch:
block <- spec$block_vars[[1]]
block_of <- setNames(spec$sample_data[[block]], spec$sample_data$sample_id)
do.call(rbind, lapply(folds, function(f) {
  data.frame(
    held_out_group = f$group,
    straddling_batches = paste(
      intersect(block_of[obs$sample_id[f$train]],
                block_of[obs$sample_id[f$assess]]),
      collapse = ", "
    )
  )
}))
#>   held_out_group straddling_batches
#> 1     subject:P1             B1, B2
#> 2     subject:P2             B1, B2
#> 3     subject:P3             B1, B2
Every batch appears on both sides, because grouping by subject does not
also block by batch. Whether that matters is a scientific decision; the
point is that the spec carries enough information for the adapter to
make it.
rsample::group_vfold_cv()
Grouped CV keyed to group_id. The downstream package
would typically ship something like this; the adapter is short enough
that you can paste it into your own analysis script.
spec_to_group_vfold <- function(spec, observation_data,
                                v = NULL,
                                sample_id_col = "sample_id") {
  stopifnot(inherits(spec, "split_spec"))
  if (!requireNamespace("rsample", quietly = TRUE)) {
    stop("Install rsample to use this adapter.")
  }
  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$group_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )
  n_groups <- length(unique(joined[[spec$group_var]]))
  if (is.null(v)) v <- n_groups
  rsample::group_vfold_cv(
    data = joined,
    group = !!spec$group_var,
    v = v
  )
}
v = NULL (the default above) gives leave-one-group-out,
which is the right default when splitGraph has already
grouped samples by their deepest leakage-relevant unit (e.g. subject).
Pick a smaller v for k-fold-style grouped CV.
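To see what a smaller v means without installing rsample, here is a base-R sketch of grouped k-fold assignment. It is an illustration of the idea (whole groups are dealt into v folds, so no group is ever split), not rsample's implementation; `grouped_kfold()` and its `seed` argument are hypothetical names introduced for this example.

```r
# Hypothetical helper: assign whole groups to v folds, then derive row indices.
grouped_kfold <- function(groups, v, seed = 1L) {
  unique_groups <- unique(groups)
  stopifnot(v >= 2L, v <= length(unique_groups))
  set.seed(seed)
  # Shuffle the groups, then deal them round-robin into v folds.
  shuffled <- sample(unique_groups)
  fold_of_group <- setNames(rep_len(seq_len(v), length(shuffled)), shuffled)
  # Look up each row's fold through its group label.
  fold_of_row <- unname(fold_of_group[as.character(groups)])
  lapply(seq_len(v), function(k) {
    list(
      fold = k,
      train = which(fold_of_row != k),
      assess = which(fold_of_row == k)
    )
  })
}

# Six subjects with two samples each, split into three folds of two subjects.
groups <- rep(paste0("subject:P", 1:6), each = 2)
kfolds <- grouped_kfold(groups, v = 3)
```

With v equal to the number of groups this reduces to leave-one-group-out, which is the v = NULL default of the adapter above.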
rsample::rolling_origin()
When spec$ordering_required is TRUE (or
spec$time_var is set), the right downstream object is an
ordered split rather than a grouped one.
spec_to_rolling_origin <- function(spec, observation_data,
                                   sample_id_col = "sample_id",
                                   initial = NULL,
                                   assess = 1L) {
  stopifnot(inherits(spec, "split_spec"))
  if (is.null(spec$time_var)) {
    stop("This split_spec has no `time_var`; ordered evaluation is not available.")
  }
  if (!requireNamespace("rsample", quietly = TRUE)) {
    stop("Install rsample to use this adapter.")
  }
  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$time_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )
  ordered <- joined[order(joined[[spec$time_var]]), , drop = FALSE]
  if (is.null(initial)) initial <- max(1L, floor(nrow(ordered) * 0.6))
  rsample::rolling_origin(ordered, initial = initial, assess = assess)
}
The key idea: splitGraph puts ordering information on
the spec; the adapter is just a thin shim that consumes it.
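For readers without rsample, the ordered split can be sketched in base R. This is a simplified expanding-window scheme in the spirit of a rolling origin, not rsample's implementation; `expanding_window_splits()` is a hypothetical helper, and the `order_rank` vector is made-up illustration data rather than output from splitGraph.

```r
# Hypothetical helper: expanding-window splits over rows sorted by `order_key`.
expanding_window_splits <- function(order_key, initial, assess = 1L) {
  n <- length(order_key)
  stopifnot(initial >= 1L, assess >= 1L, initial + assess <= n)
  # Row indices sorted by time; ties keep their original relative order.
  ord <- order(order_key)
  # Each origin i trains on the first i rows and assesses the next `assess`.
  origins <- seq(initial, n - assess)
  lapply(origins, function(i) {
    list(
      train = ord[seq_len(i)],
      assess = ord[(i + 1L):(i + assess)]
    )
  })
}

# Six samples observed at ranks 1..6, supplied out of order.
order_rank <- c(3, 1, 2, 6, 4, 5)
splits <- expanding_window_splits(order_rank, initial = 3L, assess = 1L)
```

Every split trains only on rows that precede its assessment rows in time, which is the property an ordered spec asks a consumer to preserve.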
If the downstream consumer is not in R, write the spec to JSON and
let the consumer interpret it. The on-disk format is a formal, versioned
contract: it has a JSON Schema (Draft 2020-12) shipped in
inst/schema/, each file names it via a $schema
key, and validate_split_spec_json() checks a file against
it before you consume it.
tmp <- tempfile(fileext = ".json")
write_split_spec(spec, tmp)
# The file opens with its $schema reference and schema_version.
cat(readLines(tmp, n = 5), sep = "\n")
#> {
#> "$schema": "https://raw.githubusercontent.com/selcukorkmaz/splitGraph/main/inst/schema/split_spec.schema.json",
#> "splitGraph_object": "split_spec",
#> "schema_version": "0.2.0",
#> "group_var": "group_id",
# Validate the file against the shipped JSON Schema, then read it back exactly.
validate_split_spec_json(tmp)$valid
#> [1] TRUE
spec2 <- read_split_spec(tmp)
identical(spec$sample_data$group_id, spec2$sample_data$group_id)
#> [1] TRUE
unlink(tmp)
You do not have to write a JSON parser to consume this from Python:
the package ships a pure-Python reference reader
(inst/python/splitspec) that recovers the same grouping and
ordering and drives scikit-learn GroupKFold /
TimeSeriesSplit. The cross-language-handoff
vignette walks the full R → JSON → Python → scikit-learn path:
The same read/write pair exists for dependency_graph
(write_dependency_graph() /
read_dependency_graph(), validated with
validate_graph_json()). Both formats are documented under
?write_split_spec and ?write_dependency_graph.
Because schema_version follows a documented
major-compatibility policy, a file written by an older splitGraph still
loads; migrate_split_spec_json() upgrades it to the current
version in place.
The three adapters above cover different shapes of split. You do not
have to choose between them by hand: split_spec carries
recommended_resampling, so a single dispatcher can route
each spec to the right one. This makes a pipeline that handles subject,
batch, time, and composite specs uniformly.
recommend_adapter <- function(spec) {
  switch(
    spec$recommended_resampling,
    grouped_cv = "group_vfold_cv (group = group_id)",
    blocked_cv = "group_vfold_cv (group = group_id)",
    custom_grouped_cv = "group_vfold_cv (group = group_id)",
    leave_one_group_out = "leave-one-group-out over group_id",
    ordered_split = "rolling_origin (order by order_rank)",
    "group_vfold_cv (default)"
  )
}
# The subject spec recommends grouped CV; a time-mode spec recommends ordering.
recommend_adapter(spec)
#> [1] "group_vfold_cv (group = group_id)"
time_spec <- as_split_spec(derive_split_constraints(g, mode = "time"), graph = g)
recommend_adapter(time_spec)
#> [1] "rolling_origin (order by order_rank)"
recommended_resampling is only a hint, and your adapter is free to
override it, but it lets one entry point serve every constraint mode
without inspecting the graph.
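The dispatcher above returns a label; a small step further is to return the result of calling an adapter. The sketch below assumes nothing about splitGraph beyond the `recommended_resampling` field: the `adapters` list holds stub functions standing in for the three adapters in this cookbook, and `run_adapter()` and its `override` argument are hypothetical names.

```r
# Stub adapters standing in for the three adapters in this cookbook.
# (Hypothetical placeholders so the sketch runs without rsample.)
adapters <- list(
  grouped = function(spec, data) "grouped folds",
  logo = function(spec, data) "leave-one-group-out folds",
  ordered = function(spec, data) "ordered splits"
)

# Hypothetical dispatcher: route on recommended_resampling, but let the
# caller override the hint.
run_adapter <- function(spec, data, adapters, override = NULL) {
  strategy <- if (is.null(override)) spec$recommended_resampling else override
  key <- switch(
    strategy,
    grouped_cv = ,
    blocked_cv = ,
    custom_grouped_cv = "grouped",
    leave_one_group_out = "logo",
    ordered_split = "ordered",
    "grouped" # fallback, mirroring the default branch of recommend_adapter()
  )
  adapters[[key]](spec, data)
}

# Minimal stand-ins carrying only the field the dispatcher reads.
subject_like <- list(recommended_resampling = "grouped_cv")
time_like <- list(recommended_resampling = "ordered_split")

run_adapter(subject_like, NULL, adapters)
run_adapter(time_like, NULL, adapters)
run_adapter(subject_like, NULL, adapters, override = "leave_one_group_out")
```

Keeping the override explicit makes it visible in the calling code whenever a pipeline departs from the spec's recommendation.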
The only assumptions an adapter has to honor:
- split_spec$sample_data is keyed by sample_id (character).
- split_spec$group_var is the column that holds the splitting unit.
- split_spec$block_vars are present-but-coarser blocking columns.
  Depending on the graph these include batch_group, study_group,
  site_group, region_group, platform_group, and assay_group; an adapter
  can block or stratify on any that are present.
- split_spec$time_var, when non-NULL, defines the ordering.
- split_spec$recommended_resampling is a hint, not a contract; your
  adapter is free to ignore it.
- constraint_mode / constraint_strategy are available if you want to
  branch (e.g. treat "time" specs as ordered).
That is the whole interface, and it is stable:
bioLeak::as_leaksplits() consumes exactly these fields, and
a contract test in splitGraph pins the seam so it cannot drift. As long
as those fields are honored, anything is a valid downstream
consumer.
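An adapter can check these assumptions up front before building any folds. The following is a minimal sketch, not a splitGraph function: `check_adapter_contract()` is a hypothetical helper, the two specs are hand-built lists rather than real split_spec objects, and treating "keyed by sample_id" as "sample_id values are unique" is an assumption made for this example.

```r
# Hypothetical validator for the adapter-facing fields listed above.
# Returns a character vector of problems; length zero means the spec passes.
check_adapter_contract <- function(spec) {
  samples <- spec$sample_data
  if (!is.data.frame(samples) || !"sample_id" %in% names(samples)) {
    return("sample_data must be a data frame with a sample_id column")
  }
  problems <- character()
  # Assumption: "keyed by sample_id" means a unique character key.
  if (!is.character(samples$sample_id) || anyDuplicated(samples$sample_id) > 0L) {
    problems <- c(problems, "sample_id must be a unique character key")
  }
  if (!is.character(spec$group_var) || length(spec$group_var) != 1L ||
      !spec$group_var %in% names(samples)) {
    problems <- c(problems, "group_var must name a column of sample_data")
  }
  missing_blocks <- setdiff(spec$block_vars, names(samples))
  if (length(missing_blocks) > 0L) {
    problems <- c(problems, paste("block_vars not found:",
                                  paste(missing_blocks, collapse = ", ")))
  }
  if (!is.null(spec$time_var) && !spec$time_var %in% names(samples)) {
    problems <- c(problems, "time_var must name a column of sample_data")
  }
  problems
}

# Hand-built stand-ins for a spec that passes and one that does not.
good <- list(
  sample_data = data.frame(
    sample_id = c("S1", "S2", "S3"),
    group_id = c("subject:P1", "subject:P1", "subject:P2"),
    batch_group = c("B1", "B2", "B1"),
    stringsAsFactors = FALSE
  ),
  group_var = "group_id",
  block_vars = "batch_group",
  time_var = NULL
)
bad <- good
bad$block_vars <- c("batch_group", "site_group")
bad$time_var <- "order_rank"

check_adapter_contract(good)
check_adapter_contract(bad)
```

Returning the list of problems instead of stopping on the first one lets the caller decide whether a missing optional field, such as a block column, should be fatal.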