The goal of this vignette is to explain the older resamplers, ResamplingVariableSizeTrainCV and ResamplingSameOtherCV, which output some data which are useful for visualizing the train/test splits. If you do not want to visualize the train/test splits, then it is recommended to instead use the newer resampler, ResamplingSameOtherSizesCV (see other vignette).
The goal of this section is to explain how to quantify the extent to which it is possible to train on one data subset, and predict on another data subset. This kind of problem occurs frequently in many different problem domains.
The ideas are similar to my previous blog posts about how to do this in python and R. Below we explain how to use mlr3resampling for this purpose, in simulated regression and classification problems. To use this method in real data, the important sections to read below are named “Benchmark: computing test error,” which show how to create these cross-validation experiments using mlr3 code.
We begin by generating some data which can be used with regression algorithms. Assume there is a data set with some rows from one person, some rows from another,
N <- 300
library(data.table)
set.seed(1)
abs.x <- 2
reg.dt <- data.table(
x=runif(N, -abs.x, abs.x),
person=rep(1:2, each=0.5*N))
reg.pattern.list <- list(
easy=function(x, person)x^2,
impossible=function(x, person)(x^2+person*3)*(-1)^person)
reg.task.list <- list()
for(task_id in names(reg.pattern.list)){
f <- reg.pattern.list[[task_id]]
yname <- paste0("y_",task_id)
reg.dt[, (yname) := f(x,person)+rnorm(N)][]
task.dt <- reg.dt[, c("x","person",yname), with=FALSE]
reg.task <- mlr3::TaskRegr$new(
task_id, task.dt, target=yname)
reg.task$col_roles$subset <- "person"
reg.task$col_roles$stratum <- "person"
reg.task$col_roles$feature <- "x"
reg.task.list[[task_id]] <- reg.task
}
reg.dt
#> x person y_easy y_impossible
#> <num> <int> <num> <num>
#> 1: -0.9379653 1 1.32996609 -2.918082
#> 2: -0.5115044 1 0.24307692 -3.866062
#> 3: 0.2914135 1 -0.23314657 -3.837799
#> 4: 1.6328312 1 1.73677545 -7.221749
#> 5: -1.1932723 1 -0.06356159 -5.877792
#> ---
#> 296: 0.7257701 2 -2.48130642 5.180948
#> 297: -1.6033236 2 1.20453459 9.604312
#> 298: -1.5243898 2 1.89966190 7.511988
#> 299: -1.7982414 2 3.47047566 11.035397
#> 300: 1.7170157 2 0.60541972 10.719685
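The table above shows some simulated data for two regression problems. In the easy problem, both people follow the same pattern, y = x^2 plus noise. In the impossible problem, the pattern depends on the person: -(x^2+3) plus noise for person 1, and x^2+6 plus noise for person 2. When adapting the code above to real data, the important part is the mlr3::TaskRegr line, which tells mlr3 what data set to use, what is the target column, and what is the subset/stratum column.
First we reshape the data using the code below,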
(reg.tall <- nc::capture_melt_single(
reg.dt,
task_id="easy|impossible",
value.name="y"))
#> x person task_id y
#> <num> <int> <char> <num>
#> 1: -0.9379653 1 easy 1.32996609
#> 2: -0.5115044 1 easy 0.24307692
#> 3: 0.2914135 1 easy -0.23314657
#> 4: 1.6328312 1 easy 1.73677545
#> 5: -1.1932723 1 easy -0.06356159
#> ---
#> 596: 0.7257701 2 impossible 5.18094849
#> 597: -1.6033236 2 impossible 9.60431191
#> 598: -1.5243898 2 impossible 7.51198770
#> 599: -1.7982414 2 impossible 11.03539747
#> 600: 1.7170157 2 impossible 10.71968480
The table above is a more convenient form for the visualization which we create using the code below,
if(require(animint2)){
my_theme <- theme_bw(20)
theme_set(my_theme)
ggplot()+
geom_point(aes(
x, y),
data=reg.tall)+
facet_grid(
task_id ~ person,
labeller=label_both,
space="free",
scales="free")+
scale_y_continuous(
breaks=seq(-100, 100, by=2))
}
#> Loading required package: animint2
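In the simulated data above, we can see that for the easy task the two people follow the same pattern, whereas for the impossible task the pattern for person 1 is different from the pattern for person 2.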
In the code below, we define a K-fold cross-validation experiment.
(reg_same_other <- mlr3resampling::ResamplingSameOtherCV$new())
#>
#> ── <ResamplingSameOtherCV> : Same versus Other Cross-Validation ────────────────
#> • Iterations:
#> • Instantiated: FALSE
#> • Parameters: folds=3
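To make the idea of "same versus other" concrete, here is a minimal base R sketch of how the train and test rows of one split could be chosen. It is not the mlr3resampling implementation: the `toy` table and the `get_split` helper are hypothetical, and the sketch assumes that train rows always exclude the test fold, whichever subsets they come from.

```r
# Toy data: 12 rows, 2 subsets (people), 3 folds (illustrative values only).
toy <- data.frame(
  row_id = 1:12,
  subset = rep(1:2, each = 6),
  fold = rep(1:3, times = 4))
# Hypothetical helper: return train and test row ids for one split.
# train.subsets is one of "same", "other", "all".
get_split <- function(data, test.subset, test.fold, train.subsets) {
  is.test <- data$subset == test.subset & data$fold == test.fold
  in.subset <- switch(
    train.subsets,
    same = data$subset == test.subset,
    other = data$subset != test.subset,
    all = rep(TRUE, nrow(data)))
  # Assumption: train rows never come from the test fold.
  is.train <- in.subset & data$fold != test.fold
  list(test = data$row_id[is.test], train = data$row_id[is.train])
}
split.same <- get_split(toy, test.subset = 1, test.fold = 1, "same")
split.other <- get_split(toy, test.subset = 1, test.fold = 1, "other")
split.all <- get_split(toy, test.subset = 1, test.fold = 1, "all")
```

The three splits share the same test rows and differ only in which people contribute train rows, which is what allows a fair comparison of test error between same, other, and all.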
In the code below, we define two learners to compare,
(reg.learner.list <- list(
if(requireNamespace("rpart"))mlr3::LearnerRegrRpart$new(),
mlr3::LearnerRegrFeatureless$new()))
#> [[1]]
#>
#> ── <LearnerRegrRpart> (regr.rpart): Regression Tree ────────────────────────────
#> • Model: -
#> • Parameters: xval=0
#> • Packages: mlr3 and rpart
#> • Predict Types: [response]
#> • Feature Types: logical, integer, numeric, factor, and ordered
#> • Encapsulation: none (fallback: -)
#> • Properties: importance, missings, selected_features, and weights
#> • Other settings: use_weights = 'use'
#>
#> [[2]]
#>
#> ── <LearnerRegrFeatureless> (regr.featureless): Featureless Regression Learner ─
#> • Model: -
#> • Parameters: robust=FALSE
#> • Packages: mlr3 and stats
#> • Predict Types: [response], se, and quantiles
#> • Feature Types: logical, integer, numeric, character, factor, ordered,
#> POSIXct, and Date
#> • Encapsulation: none (fallback: -)
#> • Properties: featureless, importance, missings, selected_features, and weights
#> • Other settings: use_weights = 'use'
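The featureless learner is a useful baseline because it ignores the inputs. With robust=FALSE, as shown above, it predicts the mean of the train labels. The base R sketch below illustrates that baseline on freshly simulated data (not the vignette's data); the `mse` helper and the "oracle" comparison are hypothetical additions for illustration.

```r
set.seed(1)
x <- runif(100, -2, 2)
y <- x^2 + rnorm(100)
is.train <- rep(c(TRUE, FALSE), length.out = 100)
# Featureless baseline: ignore x, always predict the mean train label.
featureless.pred <- mean(y[is.train])
# Hypothetical helper: mean squared error.
mse <- function(label, pred) mean((label - pred)^2)
featureless.mse <- mse(y[!is.train], featureless.pred)
# Oracle for comparison: uses the true pattern x^2 from the simulation.
oracle.mse <- mse(y[!is.train], x[!is.train]^2)
c(featureless = featureless.mse, oracle = oracle.mse)
```

A learner that has found a real pattern should have test error well below the featureless value, and a learner with larger error than featureless has learned something that does not transfer to the test set.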
In the code below, we define the benchmark grid, which is all combinations of tasks (easy and impossible), learners (rpart and featureless), and the one resampling method.
(reg.bench.grid <- mlr3::benchmark_grid(
reg.task.list,
reg.learner.list,
reg_same_other))
#> task learner resampling
#> <char> <char> <char>
#> 1: easy regr.rpart same_other_cv
#> 2: easy regr.featureless same_other_cv
#> 3: impossible regr.rpart same_other_cv
#> 4: impossible regr.featureless same_other_cv
In the code below, we execute the benchmark experiment (in parallel using the multisession future plan).
if(FALSE){#for CRAN.
if(require(future))plan("multisession")
}
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
#> Loading required package: lgr
#>
#> Attaching package: 'lgr'
#> The following object is masked from 'package:ggplot2':
#>
#> Layout
(reg.bench.result <- mlr3::benchmark(
reg.bench.grid, store_models = TRUE))
#>
#> ── <BenchmarkResult> of 72 rows with 4 resampling run ──────────────────────────
#> nr task_id learner_id resampling_id iters warnings errors
#> 1 easy regr.rpart same_other_cv 18 0 0
#> 2 easy regr.featureless same_other_cv 18 0 0
#> 3 impossible regr.rpart same_other_cv 18 0 0
#> 4 impossible regr.featureless same_other_cv 18 0 0
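The counts in the output above (18 iterations per resampling run, 72 rows in total) can be checked with a little arithmetic. This sketch assumes each iteration is one combination of test fold, test subset, and train subsets (same, other, all), which is consistent with the columns shown by the score table in this vignette.

```r
# One row per train/test split of a single task and learner.
split.grid <- expand.grid(
  test.fold = 1:3,
  test.subset = 1:2,
  train.subsets = c("same", "other", "all"))
iterations.per.run <- nrow(split.grid) # 3 folds * 2 subsets * 3 train sets
n.tasks <- 2     # easy, impossible
n.learners <- 2  # rpart, featureless
total.rows <- n.tasks * n.learners * iterations.per.run
c(iterations.per.run = iterations.per.run, total.rows = total.rows)
```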
The code below computes the test error for each split,
reg.bench.score <- mlr3resampling::score(reg.bench.result)
reg.bench.score[1]
#> train.subsets test.fold test.subset person iteration test
#> <char> <int> <int> <int> <int> <list>
#> 1: all 1 1 1 1 1, 3, 5, 6,12,13,...
#> train uhash nr
#> <list> <char> <int>
#> 1: 4, 7, 9,10,18,20,... 7c11294c-bd1a-46ae-98ff-c20f08a426ce 1
#> task task_id learner learner_id
#> <list> <char> <list> <char>
#> 1: <TaskRegr:easy> easy <LearnerRegrRpart:regr.rpart> regr.rpart
#> resampling resampling_id prediction_test regr.mse algorithm
#> <list> <char> <list> <num> <char>
#> 1: <ResamplingSameOtherCV> same_other_cv <PredictionRegr> 1.638015 rpart
The code below visualizes the resulting test error values (mean squared error).
if(require(animint2)){
ggplot()+
scale_x_log10()+
geom_point(aes(
regr.mse, train.subsets, color=algorithm),
shape=1,
data=reg.bench.score)+
facet_grid(
task_id ~ person,
labeller=label_both,
scales="free")
}
The plot above shows the test mean squared error of each algorithm, for each train subset, task, and test person.
The code below can be used to create an interactive data visualization which allows exploring how different functions are learned during different splits.
inst <- reg.bench.score$resampling[[1]]$instance
rect.expand <- 0.3
grid.dt <- data.table(x=seq(-abs.x, abs.x, l=101), y=0)
grid.task <- mlr3::TaskRegr$new("grid", grid.dt, target="y")
pred.dt.list <- list()
point.dt.list <- list()
for(score.i in 1:nrow(reg.bench.score)){
reg.bench.row <- reg.bench.score[score.i]
task.dt <- data.table(
reg.bench.row$task[[1]]$data(),
reg.bench.row$resampling[[1]]$instance$id.dt)
names(task.dt)[1] <- "y"
set.ids <- data.table(
set.name=c("test","train")
)[
, data.table(row_id=reg.bench.row[[set.name]][[1]])
, by=set.name]
i.points <- set.ids[
task.dt, on="row_id"
][
is.na(set.name), set.name := "unused"
]
point.dt.list[[score.i]] <- data.table(
reg.bench.row[, .(task_id, iteration)],
i.points)
i.learner <- reg.bench.row$learner[[1]]
pred.dt.list[[score.i]] <- data.table(
reg.bench.row[, .(
task_id, iteration, algorithm
)],
as.data.table(
i.learner$predict(grid.task)
)[, .(x=grid.dt$x, y=response)]
)
}
(pred.dt <- rbindlist(pred.dt.list))
#> task_id iteration algorithm x y
#> <char> <int> <char> <num> <num>
#> 1: easy 1 rpart -2.00 3.557968
#> 2: easy 1 rpart -1.96 3.557968
#> 3: easy 1 rpart -1.92 3.557968
#> 4: easy 1 rpart -1.88 3.557968
#> 5: easy 1 rpart -1.84 3.557968
#> ---
#> 7268: impossible 18 featureless 1.84 7.204232
#> 7269: impossible 18 featureless 1.88 7.204232
#> 7270: impossible 18 featureless 1.92 7.204232
#> 7271: impossible 18 featureless 1.96 7.204232
#> 7272: impossible 18 featureless 2.00 7.204232
(point.dt <- rbindlist(point.dt.list))
#> task_id iteration set.name row_id y x fold person
#> <char> <int> <char> <int> <num> <num> <int> <int>
#> 1: easy 1 test 1 1.32996609 -0.9379653 1 1
#> 2: easy 1 train 2 0.24307692 -0.5115044 3 1
#> 3: easy 1 test 3 -0.23314657 0.2914135 1 1
#> 4: easy 1 train 4 1.73677545 1.6328312 2 1
#> 5: easy 1 test 5 -0.06356159 -1.1932723 1 1
#> ---
#> 21596: impossible 18 train 296 5.18094849 0.7257701 1 2
#> 21597: impossible 18 train 297 9.60431191 -1.6033236 1 2
#> 21598: impossible 18 test 298 7.51198770 -1.5243898 3 2
#> 21599: impossible 18 train 299 11.03539747 -1.7982414 1 2
#> 21600: impossible 18 test 300 10.71968480 1.7170157 3 2
#> subset display_row
#> <int> <int>
#> 1: 1 1
#> 2: 1 101
#> 3: 1 2
#> 4: 1 51
#> 5: 1 3
#> ---
#> 21596: 2 198
#> 21597: 2 199
#> 21598: 2 299
#> 21599: 2 200
#> 21600: 2 300
set.colors <- c(
train="#1B9E77",
test="#D95F02",
unused="white")
algo.colors <- c(
featureless="blue",
rpart="red")
make_person_subset <- function(DT){
DT[, "person/subset" := person]
}
make_person_subset(point.dt)
make_person_subset(reg.bench.score)
#> Warning in `[.data.table`(DT, , `:=`("person/subset", person)): A shallow copy
#> of this data.table was taken so that := can add or remove 1 columns by
#> reference. At an earlier point, this data.table was copied by R (or was created
#> manually using structure() or similar). Avoid names<- and attr<- which in R
#> currently (and oddly) may copy the whole data.table. Use set* syntax instead to
#> avoid copying: ?set, ?setnames and ?setattr. It's also not unusual for
#> data.table-agnostic packages to produce tables affected by this issue. If this
#> message doesn't help, please report your use case to the data.table issue
#> tracker so the root cause can be fixed or this message improved.
if(require(animint2)){
viz <- animint(
title="SOAK algorithm: train/predict on subsets, regression",
video="https://vimeo.com/1053413000",
pred=ggplot()+
ggtitle("Predictions for selected train/test split")+
theme_animint(height=400)+
scale_fill_manual(values=set.colors)+
geom_point(aes(
x, y, fill=set.name),
showSelected="iteration",
size=3,
help="One dot for each train/test/unused data point.",
shape=21,
data=point.dt)+
scale_color_manual(values=algo.colors)+
geom_line(aes(
x, y, color=algorithm,
group=paste(algorithm, iteration)),
help="One line for each learned prediction function.",
showSelected="iteration",
data=pred.dt)+
facet_grid(
task_id ~ `person/subset`,
labeller=label_both,
space="free",
scales="free")+
scale_x_continuous(
"x = input/feature in regression")+
scale_y_continuous(
"y = output to predict in regression",
breaks=seq(-100, 100, by=2)),
err=ggplot()+
ggtitle("Test error for each split")+
theme_animint(height=400, width=350)+
guides(fill="none")+
scale_y_log10(
"Mean squared error on test set")+
scale_fill_manual(values=algo.colors)+
scale_x_discrete(
"People/subsets in train set")+
geom_point(aes(
train.subsets, regr.mse, fill=algorithm),
help="One dot per test set and learning algorithm.",
shape=1,
size=5,
stroke=2,
color="black",
color_off=NA,
showSelected="algorithm",
clickSelects="iteration",
data=reg.bench.score)+
facet_grid(
task_id ~ `person/subset`,
labeller=label_both,
scales="free"),
diagram=ggplot()+
ggtitle("Select train/test split")+
theme_animint(height=400, width=300)+
facet_grid(
. ~ train.subsets,
scales="free",
space="free")+
scale_size_manual(values=c(subset=3, fold=1))+
scale_color_manual(values=c(subset="orange", fold="grey50"))+
geom_rect(aes(
xmin=-Inf, xmax=Inf,
color=rows,
size=rows,
ymin=display_row, ymax=display_end),
help="One rect per chunk of data with common fold (grey) and subset (gold).",
fill=NA,
data=inst$viz.rect.dt)+
scale_fill_manual(values=set.colors)+
geom_text(aes(
x=ifelse(rows=="subset", Inf, -Inf),
y=(display_row+display_end)/2,
hjust=ifelse(rows=="subset", 1, 0),
label=paste0(rows, "=", ifelse(rows=="subset", subset, fold))),
help="Text labels indicate chunks of data with common fold (grey) and subset (gold).",
showSelected="rows",
data=data.table(train.name="same", inst$viz.rect.dt))+
geom_rect(aes(
xmin=iteration-rect.expand, ymin=display_row,
xmax=iteration+rect.expand, ymax=display_end,
fill=set.name),
help="One rect per chunk of data assigned to train/test set in cross-validation.",
alpha=0.5,
alpha_off=0.5,
color="black",
color_off=NA,
clickSelects="iteration",
data=inst$viz.set.dt)+
scale_x_continuous(
"Split number",
breaks=c(1,6, 7,12, 13,18))+
scale_y_continuous(
"Row number"),
source="https://github.com/tdhock/mlr3resampling/blob/main/vignettes/Older_resamplers.Rmd")
}
if(FALSE){
animint2pages(viz, "2023-12-13-train-predict-subsets-regression")
}
If you are viewing this in an installed package or on CRAN, then there will be no data viz on this page, but you can view it on: https://tdhock.github.io/2023-12-13-train-predict-subsets-regression/
The previous section investigated a simulated regression problem, whereas in this section we simulate a binary classification problem. Assume there is a data set with some rows from one person, some rows from another,
N <- 200
library(data.table)
(full.dt <- data.table(
label=factor(rep(c("spam","not spam"), l=N)),
person=rep(1:2, each=0.5*N)
)[, signal := ifelse(label=="not spam", 0, 3)][])
#> label person signal
#> <fctr> <int> <num>
#> 1: spam 1 3
#> 2: not spam 1 0
#> 3: spam 1 3
#> 4: not spam 1 0
#> 5: spam 1 3
#> ---
#> 196: not spam 2 0
#> 197: spam 2 3
#> 198: not spam 2 0
#> 199: spam 2 3
#> 200: not spam 2 0
Above, each row has a person ID of either 1 or 2. We can imagine a spam filtering system that has training data for multiple people (here just two). Each row in the table above represents a message which has been labeled as spam or not, by one of the two people. Can we train on one person, and accurately predict on the other person? To do that we will need some features, which we generate/simulate below:
set.seed(1)
n.people <- length(unique(full.dt$person))
for(person.i in 1:n.people){
use.signal.vec <- list(
easy=rep(if(person.i==1)TRUE else FALSE, N),
impossible=full.dt$person==person.i)
for(task_id in names(use.signal.vec)){
use.signal <- use.signal.vec[[task_id]]
full.dt[
, paste0("x",person.i,"_",task_id) := ifelse(
use.signal, signal, 0
)+rnorm(N)][]
}
}
full.dt
#> label person signal x1_easy x1_impossible x2_easy x2_impossible
#> <fctr> <int> <num> <num> <num> <num> <num>
#> 1: spam 1 3 2.3735462 3.4094018 1.0744410 -0.3410670
#> 2: not spam 1 0 0.1836433 1.6888733 1.8956548 1.5024245
#> 3: spam 1 3 2.1643714 4.5865884 -0.6029973 0.5283077
#> 4: not spam 1 0 1.5952808 -0.3309078 -0.3908678 0.5421914
#> 5: spam 1 3 3.3295078 0.7147645 -0.4162220 -0.1366734
#> ---
#> 196: not spam 2 0 -1.0479844 -0.9243128 0.7682782 -1.0293917
#> 197: spam 2 3 4.4411577 1.5929138 -0.8161606 2.9890743
#> 198: not spam 2 0 -1.0158475 0.0450106 -0.4361069 -1.2249912
#> 199: spam 2 3 3.4119747 -0.7151284 0.9047050 0.4038886
#> 200: not spam 2 0 -0.3810761 0.8652231 -0.7630863 1.1691226
In the table above, there are two sets of two features:
For the easy features, one is correlated with the label (x1_easy), and one is random noise (x2_easy), so the algorithm just needs to learn to ignore the noise feature, and concentrate on the signal feature. That should be possible given data from either person (same signal in each person).
For the impossible features, each person has signal in a different feature: x1_impossible for person 1, and x2_impossible for person 2. But if the algorithm does not have access to that person, then the best it can do is the same as featureless (predict the most frequent class label in the train data).
Below we reshape the data to a table which is more suitable for visualization:
(scatter.dt <- nc::capture_melt_multiple(
full.dt,
column="x[12]",
"_",
task_id="easy|impossible"))
#> label person signal task_id x1 x2
#> <fctr> <int> <num> <char> <num> <num>
#> 1: spam 1 3 easy 2.3735462 1.0744410
#> 2: not spam 1 0 easy 0.1836433 1.8956548
#> 3: spam 1 3 easy 2.1643714 -0.6029973
#> 4: not spam 1 0 easy 1.5952808 -0.3908678
#> 5: spam 1 3 easy 3.3295078 -0.4162220
#> ---
#> 396: not spam 2 0 impossible -0.9243128 -1.0293917
#> 397: spam 2 3 impossible 1.5929138 2.9890743
#> 398: not spam 2 0 impossible 0.0450106 -1.2249912
#> 399: spam 2 3 impossible -0.7151284 0.4038886
#> 400: not spam 2 0 impossible 0.8652231 1.1691226
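The reshape above relies on a regular expression to split column names such as x1_easy into a feature name and a task_id. As a rough illustration of what that produces, the sketch below builds the same kind of tall table by stacking the columns manually with data.table. The small `wide` table is hypothetical toy data, not the vignette's full.dt.

```r
library(data.table)
set.seed(1)
# Hypothetical toy table with the same column naming scheme as full.dt.
wide <- data.table(
  label = rep(c("spam", "not spam"), 2),
  x1_easy = rnorm(4), x2_easy = rnorm(4),
  x1_impossible = rnorm(4), x2_impossible = rnorm(4))
# Stack one block of rows per task, renaming features to x1 and x2.
tall <- rbind(
  wide[, .(label, task_id = "easy", x1 = x1_easy, x2 = x2_easy)],
  wide[, .(label, task_id = "impossible", x1 = x1_impossible, x2 = x2_impossible)])
tall
```

Manual stacking is fine for two tasks, but the regex approach scales better when there are many column groups.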
Below we visualize the pattern for each person and feature type:
if(require(animint2)){
ggplot()+
geom_point(aes(
x1, x2, color=label),
shape=1,
data=scatter.dt)+
facet_grid(
task_id ~ person,
labeller=label_both)
}
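In the plot above, it is apparent that for the easy task the two labels are separated along x1 for both people, whereas for the impossible task the labels are separated along x1 for person 1, but along x2 for person 2.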
We use the code below to create a list of classification tasks, for use in the mlr3 framework.
class.task.list <- list()
for(task_id in c("easy","impossible")){
feature.names <- grep(task_id, names(full.dt), value=TRUE)
task.col.names <- c(feature.names, "label", "person")
task.dt <- full.dt[, task.col.names, with=FALSE]
this.task <- mlr3::TaskClassif$new(
task_id, task.dt, target="label")
this.task$col_roles$subset <- "person"
this.task$col_roles$stratum <- c("person","label")
this.task$col_roles$feature <- setdiff(names(task.dt), this.task$col_roles$stratum)
class.task.list[[task_id]] <- this.task
}
class.task.list
#> $easy
#>
#> ── <TaskClassif> (200x3) ───────────────────────────────────────────────────────
#> • Target: label
#> • Target classes: not spam (positive class, 50%), spam (50%)
#> • Properties: twoclass, strata
#> • Features (2):
#> • dbl (2): x1_easy, x2_easy
#> • Strata: person and label
#>
#> $impossible
#>
#> ── <TaskClassif> (200x3) ───────────────────────────────────────────────────────
#> • Target: label
#> • Target classes: not spam (positive class, 50%), spam (50%)
#> • Properties: twoclass, strata
#> • Features (2):
#> • dbl (2): x1_impossible, x2_impossible
#> • Strata: person and label
Note in the code above that person is assigned roles subset and stratum, whereas label is assigned roles target and stratum. When adapting the code above to real data, the important part is the mlr3::TaskClassif line which tells mlr3 what data set to use, and what columns should be used for target/subset/stratum.
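The stratum role asks for folds in which each combination of person and label is spread evenly. The base R sketch below shows one way such a stratified fold assignment could be done. It is only an illustration under that assumption, the details of how mlr3 assigns folds may differ, and the `toy` table is hypothetical.

```r
set.seed(1)
# Hypothetical toy data: 2 people, 2 labels, 15 rows per combination.
toy <- data.frame(
  person = rep(1:2, each = 30),
  label = rep(c("spam", "not spam"), length.out = 60))
n.folds <- 3
stratum <- interaction(toy$person, toy$label)
# Within each stratum, shuffle a balanced sequence of fold ids.
toy$fold <- ave(
  seq_len(nrow(toy)), stratum,
  FUN = function(i) sample(rep(seq_len(n.folds), length.out = length(i))))
# Rows are strata, columns are folds: every cell has the same count.
(fold.counts <- table(stratum, toy$fold))
```

With balanced strata like these, every fold has the same number of spam and not spam messages from each person, so test error is not distorted by uneven label proportions.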
The code below is used to define a K-fold cross-validation experiment,
(class_same_other <- mlr3resampling::ResamplingSameOtherCV$new())
#>
#> ── <ResamplingSameOtherCV> : Same versus Other Cross-Validation ────────────────
#> • Iterations:
#> • Instantiated: FALSE
#> • Parameters: folds=3
The code below is used to define the learning algorithms to test,
(class.learner.list <- list(
if(requireNamespace("rpart"))mlr3::LearnerClassifRpart$new(),
mlr3::LearnerClassifFeatureless$new()))
#> [[1]]
#>
#> ── <LearnerClassifRpart> (classif.rpart): Classification Tree ──────────────────
#> • Model: -
#> • Parameters: xval=0
#> • Packages: mlr3 and rpart
#> • Predict Types: [response] and prob
#> • Feature Types: logical, integer, numeric, factor, and ordered
#> • Encapsulation: none (fallback: -)
#> • Properties: importance, missings, multiclass, selected_features, twoclass,
#> and weights
#> • Other settings: use_weights = 'use'
#>
#> [[2]]
#>
#> ── <LearnerClassifFeatureless> (classif.featureless): Featureless Classification
#> • Model: -
#> • Parameters: method=mode
#> • Packages: mlr3
#> • Predict Types: [response] and prob
#> • Feature Types: logical, integer, numeric, character, factor, ordered,
#> POSIXct, and Date
#> • Encapsulation: none (fallback: -)
#> • Properties: featureless, importance, missings, multiclass, selected_features,
#> twoclass, and weights
#> • Other settings: use_weights = 'use'
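For classification, the featureless learner with method=mode predicts the most frequent class label in the train data. The base R sketch below illustrates that baseline and the classification error on a few hypothetical toy labels. It does not handle ties between classes.

```r
# Hypothetical toy labels, for illustration only.
train.label <- c("spam", "spam", "not spam", "spam")
test.label <- c("spam", "not spam", "not spam", "spam")
# Featureless baseline: always predict the most frequent train label.
label.counts <- table(train.label)
most.frequent <- names(label.counts)[which.max(label.counts)]
featureless.pred <- rep(most.frequent, length(test.label))
# Classification error: proportion of test labels predicted incorrectly.
classif.ce <- mean(featureless.pred != test.label)
```

Here "spam" is the most frequent train label, and half of the test labels are "not spam", so the featureless classification error is 0.5. That is also the error expected from featureless in the balanced simulated data of this section.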
The code below defines the grid of tasks, learners, and resamplings.
(class.bench.grid <- mlr3::benchmark_grid(
class.task.list,
class.learner.list,
class_same_other))
#> task learner resampling
#> <char> <char> <char>
#> 1: easy classif.rpart same_other_cv
#> 2: easy classif.featureless same_other_cv
#> 3: impossible classif.rpart same_other_cv
#> 4: impossible classif.featureless same_other_cv
The code below runs the benchmark experiment grid. Note that each iteration can be parallelized by declaring a future plan.
if(FALSE){
if(require(future))plan("multisession")
}
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
(class.bench.result <- mlr3::benchmark(
class.bench.grid, store_models = TRUE))
#>
#> ── <BenchmarkResult> of 72 rows with 4 resampling run ──────────────────────────
#> nr task_id learner_id resampling_id iters warnings errors
#> 1 easy classif.rpart same_other_cv 18 0 0
#> 2 easy classif.featureless same_other_cv 18 0 0
#> 3 impossible classif.rpart same_other_cv 18 0 0
#> 4 impossible classif.featureless same_other_cv 18 0 0
Below we compute scores (test error) for each resampling iteration, and show the first row of the result.
class.bench.score <- mlr3resampling::score(class.bench.result)
class.bench.score[1]
#> train.subsets test.fold test.subset person iteration test
#> <char> <int> <int> <int> <int> <list>
#> 1: all 1 1 1 1 1, 2, 8,11,12,18,...
#> train uhash nr
#> <list> <char> <int>
#> 1: 3, 4, 5, 6, 9,10,... 618ff518-03cd-4c15-9141-a900ec906b1e 1
#> task task_id learner learner_id
#> <list> <char> <list> <char>
#> 1: <TaskClassif:easy> easy <LearnerClassifRpart:classif.rpart> classif.rpart
#> resampling resampling_id prediction_test classif.ce
#> <list> <char> <list> <num>
#> 1: <ResamplingSameOtherCV> same_other_cv <PredictionClassif> 0.08823529
#> algorithm
#> <char>
#> 1: rpart
Finally we plot the test error values below.
if(require(animint2)){
ggplot()+
geom_point(aes(
classif.ce, train.subsets, color=algorithm),
shape=1,
data=class.bench.score)+
facet_grid(
person ~ task_id,
labeller=label_both,
scales="free")
}
The plot above shows the test classification error of each algorithm, for each train subset, task, and test person.
The code below can be used to create an interactive data visualization which allows exploring how different functions are learned during different splits.
inst <- class.bench.score$resampling[[1]]$instance
rect.expand <- 0.3
grid.value.dt <- scatter.dt[
, lapply(.SD, function(x)do.call(seq, c(as.list(range(x)), l=21)))
, .SDcols=c("x1","x2")]
grid.class.dt <- data.table(
label=full.dt$label[1],
do.call(
CJ, grid.value.dt
)
)
class.pred.dt.list <- list()
class.point.dt.list <- list()
for(score.i in 1:nrow(class.bench.score)){
class.bench.row <- class.bench.score[score.i]
task.dt <- data.table(
class.bench.row$task[[1]]$data(),
class.bench.row$resampling[[1]]$instance$id.dt)
names(task.dt)[2:3] <- c("x1","x2")
set.ids <- data.table(
set.name=c("test","train")
)[
, data.table(row_id=class.bench.row[[set.name]][[1]])
, by=set.name]
i.points <- set.ids[
task.dt, on="row_id"
][
is.na(set.name), set.name := "unused"
][]
class.point.dt.list[[score.i]] <- data.table(
class.bench.row[, .(task_id, iteration)],
i.points)
if(class.bench.row$algorithm!="featureless"){
i.learner <- class.bench.row$learner[[1]]
i.learner$predict_type <- "prob"
i.task <- class.bench.row$task[[1]]
setnames(grid.class.dt, names(i.task$data()))
grid.class.task <- mlr3::TaskClassif$new(
"grid", grid.class.dt, target="label")
pred.grid <- as.data.table(
i.learner$predict(grid.class.task)
)[, data.table(grid.class.dt, prob.spam)]
names(pred.grid)[2:3] <- c("x1","x2")
pred.wide <- dcast(pred.grid, x1 ~ x2, value.var="prob.spam")
prob.mat <- as.matrix(pred.wide[,-1])
contour.list <- contourLines(
grid.value.dt$x1, grid.value.dt$x2, prob.mat, levels=0.5)
class.pred.dt.list[[score.i]] <- data.table(
class.bench.row[, .(
task_id, iteration, algorithm
)],
data.table(contour.i=seq_along(contour.list))[, {
do.call(data.table, contour.list[[contour.i]])[, .(level, x1=x, x2=y)]
}, by=contour.i]
)
}
}
(class.pred.dt <- rbindlist(class.pred.dt.list))
#> task_id iteration algorithm contour.i level x1 x2
#> <char> <int> <char> <int> <num> <num> <num>
#> 1: easy 1 rpart 1 0.5 1.856156 -3.008049
#> 2: easy 1 rpart 1 0.5 1.856156 -2.606579
#> 3: easy 1 rpart 1 0.5 1.856156 -2.205109
#> 4: easy 1 rpart 1 0.5 1.856156 -1.803639
#> 5: easy 1 rpart 1 0.5 1.856156 -1.402169
#> ---
#> 766: impossible 18 rpart 1 0.5 3.743510 1.225096
#> 767: impossible 18 rpart 1 0.5 4.158037 1.225096
#> 768: impossible 18 rpart 1 0.5 4.572564 1.225096
#> 769: impossible 18 rpart 1 0.5 4.987091 1.225096
#> 770: impossible 18 rpart 1 0.5 5.401618 1.225096
(class.point.dt <- rbindlist(class.point.dt.list))
#> task_id iteration set.name row_id label x1 x2
#> <char> <int> <char> <int> <fctr> <num> <num>
#> 1: easy 1 test 1 spam 2.3735462 1.0744410
#> 2: easy 1 test 2 not spam 0.1836433 1.8956548
#> 3: easy 1 train 3 spam 2.1643714 -0.6029973
#> 4: easy 1 train 4 not spam 1.5952808 -0.3908678
#> 5: easy 1 train 5 spam 3.3295078 -0.4162220
#> ---
#> 14396: impossible 18 train 196 not spam -0.9243128 -1.0293917
#> 14397: impossible 18 train 197 spam 1.5929138 2.9890743
#> 14398: impossible 18 train 198 not spam 0.0450106 -1.2249912
#> 14399: impossible 18 train 199 spam -0.7151284 0.4038886
#> 14400: impossible 18 train 200 not spam 0.8652231 1.1691226
#> fold person subset display_row
#> <int> <int> <int> <int>
#> 1: 1 1 1 1
#> 2: 1 1 1 2
#> 3: 2 1 1 35
#> 4: 2 1 1 36
#> 5: 2 1 1 37
#> ---
#> 14396: 2 2 2 166
#> 14397: 2 2 2 167
#> 14398: 1 2 2 133
#> 14399: 1 2 2 134
#> 14400: 2 2 2 168
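The loop above draws the decision boundary by evaluating predicted probabilities on a grid and then extracting the 0.5 level with contourLines. The self-contained sketch below shows that step in isolation, using a hypothetical probability surface (a logistic function of x1 only) instead of a fitted rpart model.

```r
x1.grid <- seq(-2, 5, length.out = 21)
x2.grid <- seq(-3, 3, length.out = 21)
# Hypothetical probability of spam: depends only on x1, equals 0.5 at x1=1.6.
prob.mat <- outer(
  x1.grid, x2.grid,
  function(x1, x2) plogis(3 * (x1 - 1.6)))
# Rows of prob.mat correspond to x1.grid, columns to x2.grid.
contour.list <- contourLines(x1.grid, x2.grid, prob.mat, levels = 0.5)
# Combine all contour pieces into one table, with one row per vertex.
boundary <- do.call(rbind, lapply(seq_along(contour.list), function(i) {
  with(contour.list[[i]], data.frame(contour.i = i, level, x1 = x, x2 = y))
}))
head(boundary)
```

Because the contour is found by linear interpolation between grid points, the extracted boundary is close to, but not exactly at, x1=1.6. A finer grid gives a more precise boundary at the cost of more predictions.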
set.colors <- c(
train="#1B9E77",
test="#D95F02",
unused="white")
algo.colors <- c(
featureless="blue",
rpart="red")
make_person_subset <- function(DT){
DT[, "person/subset" := person]
}
make_person_subset(class.point.dt)
make_person_subset(class.bench.score)
#> Warning in `[.data.table`(DT, , `:=`("person/subset", person)): A shallow copy
#> of this data.table was taken so that := can add or remove 1 columns by
#> reference. At an earlier point, this data.table was copied by R (or was created
#> manually using structure() or similar). Avoid names<- and attr<- which in R
#> currently (and oddly) may copy the whole data.table. Use set* syntax instead to
#> avoid copying: ?set, ?setnames and ?setattr. It's also not unusual for
#> data.table-agnostic packages to produce tables affected by this issue. If this
#> message doesn't help, please report your use case to the data.table issue
#> tracker so the root cause can be fixed or this message improved.
if(require(animint2)){
viz <- animint(
title="SOAK algorithm: train/predict on subsets, classification",
video="https://vimeo.com/manage/videos/1053464329",
pred=ggplot()+
ggtitle("Predictions for selected train/test split")+
theme_animint(height=350, width=350)+
scale_fill_manual(values=set.colors)+
scale_color_manual(values=c(spam="black","not spam"="white"))+
geom_point(aes(
x1, x2, color=label, fill=set.name),
showSelected="iteration",
size=3,
help="One dot for each train/test/unused data point.",
stroke=2,
shape=21,
data=class.point.dt)+
geom_path(aes(
x1, x2,
group=paste(algorithm, iteration, contour.i)),
showSelected=c("iteration","algorithm"),
help="Red path represents decision boundary of rpart decision tree learning algorithm.",
color=algo.colors[["rpart"]],
data=class.pred.dt)+
facet_grid(
task_id ~ `person/subset`,
labeller=label_both,
space="free",
scales="free")+
scale_y_continuous(
breaks=seq(-100, 100, by=2)),
err=ggplot()+
ggtitle("Test error for each split")+
theme_animint(height=350, width=350)+
theme(panel.margin=grid::unit(1, "lines"))+
scale_y_continuous(
"Classification error on test set",
breaks=seq(0, 1, by=0.25))+
scale_fill_manual(values=algo.colors)+
scale_x_discrete(
"People/subsets in train set")+
geom_hline(aes(
yintercept=yint),
help="Horizontal lines highlight baseline error rate of 50%.",
data=data.table(yint=0.5),
color="grey50")+
geom_point(aes(
train.subsets, classif.ce, fill=algorithm),
help="One dot per test set and learning algorithm.",
shape=1,
size=5,
stroke=2,
color="black",
color_off=NA,
clickSelects="iteration",
data=class.bench.score)+
facet_grid(
task_id ~ `person/subset`,
labeller=label_both),
diagram=ggplot()+
ggtitle("Select train/test split")+
theme_animint(height=350, width=300)+
facet_grid(
. ~ train.subsets,
scales="free",
space="free")+
scale_size_manual(values=c(subset=3, fold=1))+
scale_color_manual(values=c(subset="orange", fold="grey50"))+
geom_rect(aes(
xmin=-Inf, xmax=Inf,
color=rows,
size=rows,
ymin=display_row, ymax=display_end),
help="One rect per chunk of data with common fold (grey) and subset (gold).",
fill=NA,
data=inst$viz.rect.dt)+
scale_fill_manual(values=set.colors)+
geom_text(aes(
x=ifelse(rows=="subset", Inf, -Inf),
y=(display_row+display_end)/2,
hjust=ifelse(rows=="subset", 1, 0),
label=paste0(rows, "=", ifelse(rows=="subset", subset, fold))),
help="Text labels indicate chunks of data with common fold (grey) and subset (gold).",
showSelected="rows",
data=data.table(train.name="same", inst$viz.rect.dt))+
geom_rect(aes(
xmin=iteration-rect.expand, ymin=display_row,
xmax=iteration+rect.expand, ymax=display_end,
fill=set.name),
help="One rect per chunk of data assigned to train/test set in cross-validation.",
alpha=0.5,
alpha_off=0.5,
color="black",
color_off=NA,
clickSelects="iteration",
data=inst$viz.set.dt)+
scale_x_continuous(
"Split number / cross-validation iteration",
breaks=c(1,6, 7,12, 13,18))+
scale_y_continuous(
"Row number"),
source="https://github.com/tdhock/mlr3resampling/blob/main/vignettes/Older_resamplers.Rmd")
}
if(FALSE){
animint2pages(viz, "2023-12-13-train-predict-subsets-classification")
}
If you are viewing this in an installed package or on CRAN, then there will be no data viz on this page, but you can view it on: https://tdhock.github.io/2023-12-13-train-predict-subsets-classification/
In this section we have shown how to use mlr3resampling for comparing test error of models trained on same/all/other subsets.
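To make the same/other/all terminology concrete, here is a minimal base R sketch of how the three train sets could be formed for one test subset and one test fold. The vectors `subset.vec` and `fold.vec`, and the helper `train_rows()`, are hypothetical names used only for illustration; this is a sketch of the idea under those assumptions, not the internal code of mlr3resampling.

```r
# Hypothetical toy data: 12 rows, 2 subsets (e.g. persons), 3 folds.
subset.vec <- rep(1:2, each = 6)
fold.vec <- rep(1:3, times = 4)
# Row numbers to train on, for a given test subset and test fold.
# Rows in the test fold are never used for training.
train_rows <- function(train.subsets, test.subset, test.fold) {
  in.subset <- switch(
    train.subsets,
    same = subset.vec == test.subset,
    other = subset.vec != test.subset,
    all = rep(TRUE, length(subset.vec)))
  which(in.subset & fold.vec != test.fold)
}
# Test set: rows of subset 1 which are in fold 1.
test.rows <- which(subset.vec == 1 & fold.vec == 1)
train.list <- sapply(
  c("same", "other", "all"),
  train_rows, test.subset = 1, test.fold = 1,
  simplify = FALSE)
str(train.list)
```

In this sketch the test set is identical for the three train sets, so any difference in test error can be attributed to which subsets were available for training.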
The goal of this section is to explain how to use ResamplingVariableSizeTrainCV, which can be used to determine how many train data are necessary to provide accurate predictions on a given test set.
The code below creates data for simulated regression problems. First we define a vector of input values,
N <- 300
abs.x <- 10
set.seed(1)
x.vec <- runif(N, -abs.x, abs.x)
str(x.vec)
#> num [1:300] -4.69 -2.56 1.46 8.16 -5.97 ...
Below we define a list of two true regression functions (tasks in mlr3 terminology) for our simulated data,
reg.pattern.list <- list(
sin=sin,
constant=function(x)0)
The constant function represents a regression problem which can be solved by always predicting the mean value of outputs (featureless is the best possible learning algorithm). The sin function will be used to generate data with a non-linear pattern that will need to be learned. Below we use a for loop over these two functions/tasks, to simulate the data which will be used as input to the learning algorithms:
library(data.table)
reg.task.list <- list()
reg.data.list <- list()
for(task_id in names(reg.pattern.list)){
f <- reg.pattern.list[[task_id]]
task.dt <- data.table(
x=x.vec,
y = f(x.vec)+rnorm(N,sd=0.5))
reg.data.list[[task_id]] <- data.table(task_id, task.dt)
reg.task.list[[task_id]] <- mlr3::TaskRegr$new(
task_id, task.dt, target="y"
)
}
(reg.data <- rbindlist(reg.data.list))
#> task_id x y
#> <char> <num> <num>
#> 1: sin -4.689827 1.2248390
#> 2: sin -2.557522 -0.5607042
#> 3: sin 1.457067 0.8345056
#> 4: sin 8.164156 0.4875994
#> 5: sin -5.966361 -0.4321800
#> ---
#> 596: constant 3.628850 -0.6728968
#> 597: constant -8.016618 0.5168327
#> 598: constant -7.621949 -0.4058882
#> 599: constant -8.991207 0.9008627
#> 600: constant 8.585078 0.8857710
In the table above, the input is x, and the output is y. Below we visualize these data, with one task in each facet/panel:
if(require(animint2)){
ggplot()+
geom_point(aes(
x, y),
data=reg.data)+
facet_grid(task_id ~ ., labeller=label_both)
}
In the plot above we can see two different simulated data sets (constant and sin). Note that the code above used the animint2 package, which provides interactive extensions to the static graphics of the ggplot2 package (see the section Interactive data viz below).
In the code below, we define a K-fold cross-validation experiment, with K=3 folds.
reg_size_cv <- mlr3resampling::ResamplingVariableSizeTrainCV$new()
reg_size_cv$param_set$values$train_sizes <- 6
reg_size_cv
#>
#> ── <ResamplingVariableSizeTrainCV> : Cross-Validation with variable size train s
#> • Iterations:
#> • Instantiated: FALSE
#> • Parameters: folds=3, min_train_data=10, random_seeds=3, train_sizes=6
In the output above we can see the parameters of the resampling object, all of which should be integer scalars:
folds is the number of cross-validation folds.
min_train_data is the minimum number of train data to consider.
random_seeds is the number of random seeds, each of which determines a different random ordering of the train data. The random ordering determines which data are included in small train set sizes.
train_sizes is the number of train set sizes, evenly spaced on a log scale, from min_train_data to the max number of train data (determined by folds).
Below we instantiate the resampling on one of the tasks:
reg_size_cv$instantiate(reg.task.list[["sin"]])
reg_size_cv$instance
#> $iteration.dt
#> test.fold seed small_stratum_size train_size_i train_size
#> <int> <int> <int> <int> <int>
#> 1: 1 1 10 1 10
#> 2: 1 1 18 2 18
#> 3: 1 1 33 3 33
#> 4: 1 1 60 4 60
#> 5: 1 1 110 5 110
#> 6: 1 1 200 6 200
#> 7: 1 2 10 1 10
#> 8: 1 2 18 2 18
#> 9: 1 2 33 3 33
#> 10: 1 2 60 4 60
#> 11: 1 2 110 5 110
#> 12: 1 2 200 6 200
#> 13: 1 3 10 1 10
#> 14: 1 3 18 2 18
#> 15: 1 3 33 3 33
#> 16: 1 3 60 4 60
#> 17: 1 3 110 5 110
#> 18: 1 3 200 6 200
#> 19: 2 1 10 1 10
#> 20: 2 1 18 2 18
#> 21: 2 1 33 3 33
#> 22: 2 1 60 4 60
#> 23: 2 1 110 5 110
#> 24: 2 1 200 6 200
#> 25: 2 2 10 1 10
#> 26: 2 2 18 2 18
#> 27: 2 2 33 3 33
#> 28: 2 2 60 4 60
#> 29: 2 2 110 5 110
#> 30: 2 2 200 6 200
#> 31: 2 3 10 1 10
#> 32: 2 3 18 2 18
#> 33: 2 3 33 3 33
#> 34: 2 3 60 4 60
#> 35: 2 3 110 5 110
#> 36: 2 3 200 6 200
#> 37: 3 1 10 1 10
#> 38: 3 1 18 2 18
#> 39: 3 1 33 3 33
#> 40: 3 1 60 4 60
#> 41: 3 1 110 5 110
#> 42: 3 1 200 6 200
#> 43: 3 2 10 1 10
#> 44: 3 2 18 2 18
#> 45: 3 2 33 3 33
#> 46: 3 2 60 4 60
#> 47: 3 2 110 5 110
#> 48: 3 2 200 6 200
#> 49: 3 3 10 1 10
#> 50: 3 3 18 2 18
#> 51: 3 3 33 3 33
#> 52: 3 3 60 4 60
#> 53: 3 3 110 5 110
#> 54: 3 3 200 6 200
#> test.fold seed small_stratum_size train_size_i train_size
#> <int> <int> <int> <int> <int>
#> train test iteration train_min_size
#> <list> <list> <int> <int>
#> 1: 216,197, 81,171,143, 36,... 1, 7,11,13,15,19,... 1 10
#> 2: 216,197, 81,171,143, 36,... 1, 7,11,13,15,19,... 2 18
#> 3: 216,197, 81,171,143, 36,... 1, 7,11,13,15,19,... 3 33
#> 4: 216,197, 81,171,143, 36,... 1, 7,11,13,15,19,... 4 60
#> 5: 216,197, 81,171,143, 36,... 1, 7,11,13,15,19,... 5 110
#> 6: 216,197, 81,171,143, 36,... 1, 7,11,13,15,19,... 6 200
#> 7: 260,291, 16,164,109, 45,... 1, 7,11,13,15,19,... 7 10
#> 8: 260,291, 16,164,109, 45,... 1, 7,11,13,15,19,... 8 18
#> 9: 260,291, 16,164,109, 45,... 1, 7,11,13,15,19,... 9 33
#> 10: 260,291, 16,164,109, 45,... 1, 7,11,13,15,19,... 10 60
#> 11: 260,291, 16,164,109, 45,... 1, 7,11,13,15,19,... 11 110
#> 12: 260,291, 16,164,109, 45,... 1, 7,11,13,15,19,... 12 200
#> 13: 14,253,115,102,293, 18,... 1, 7,11,13,15,19,... 13 10
#> 14: 14,253,115,102,293, 18,... 1, 7,11,13,15,19,... 14 18
#> 15: 14,253,115,102,293, 18,... 1, 7,11,13,15,19,... 15 33
#> 16: 14,253,115,102,293, 18,... 1, 7,11,13,15,19,... 16 60
#> 17: 14,253,115,102,293, 18,... 1, 7,11,13,15,19,... 17 110
#> 18: 14,253,115,102,293, 18,... 1, 7,11,13,15,19,... 18 200
#> 19: 203,197, 81,171,130, 43,... 4, 6, 9,12,14,16,... 19 10
#> 20: 203,197, 81,171,130, 43,... 4, 6, 9,12,14,16,... 20 18
#> 21: 203,197, 81,171,130, 43,... 4, 6, 9,12,14,16,... 21 33
#> 22: 203,197, 81,171,130, 43,... 4, 6, 9,12,14,16,... 22 60
#> 23: 203,197, 81,171,130, 43,... 4, 6, 9,12,14,16,... 23 110
#> 24: 203,197, 81,171,130, 43,... 4, 6, 9,12,14,16,... 24 200
#> 25: 251,291, 19,164,109, 55,... 4, 6, 9,12,14,16,... 25 10
#> 26: 251,291, 19,164,109, 55,... 4, 6, 9,12,14,16,... 26 18
#> 27: 251,291, 19,164,109, 55,... 4, 6, 9,12,14,16,... 27 33
#> 28: 251,291, 19,164,109, 55,... 4, 6, 9,12,14,16,... 28 60
#> 29: 251,291, 19,164,109, 55,... 4, 6, 9,12,14,16,... 29 110
#> 30: 251,291, 19,164,109, 55,... 4, 6, 9,12,14,16,... 30 200
#> 31: 15,253,115,110,293, 18,... 4, 6, 9,12,14,16,... 31 10
#> 32: 15,253,115,110,293, 18,... 4, 6, 9,12,14,16,... 32 18
#> 33: 15,253,115,110,293, 18,... 4, 6, 9,12,14,16,... 33 33
#> 34: 15,253,115,110,293, 18,... 4, 6, 9,12,14,16,... 34 60
#> 35: 15,253,115,110,293, 18,... 4, 6, 9,12,14,16,... 35 110
#> 36: 15,253,115,110,293, 18,... 4, 6, 9,12,14,16,... 36 200
#> 37: 203,211, 82,194,130, 43,... 2, 3, 5, 8,10,17,... 37 10
#> 38: 203,211, 82,194,130, 43,... 2, 3, 5, 8,10,17,... 38 18
#> 39: 203,211, 82,194,130, 43,... 2, 3, 5, 8,10,17,... 39 33
#> 40: 203,211, 82,194,130, 43,... 2, 3, 5, 8,10,17,... 40 60
#> 41: 203,211, 82,194,130, 43,... 2, 3, 5, 8,10,17,... 41 110
#> 42: 203,211, 82,194,130, 43,... 2, 3, 5, 8,10,17,... 42 200
#> 43: 251,295, 19,189,102, 55,... 2, 3, 5, 8,10,17,... 43 10
#> 44: 251,295, 19,189,102, 55,... 2, 3, 5, 8,10,17,... 44 18
#> 45: 251,295, 19,189,102, 55,... 2, 3, 5, 8,10,17,... 45 33
#> 46: 251,295, 19,189,102, 55,... 2, 3, 5, 8,10,17,... 46 60
#> 47: 251,295, 19,189,102, 55,... 2, 3, 5, 8,10,17,... 47 110
#> 48: 251,295, 19,189,102, 55,... 2, 3, 5, 8,10,17,... 48 200
#> 49: 15,263,135,110,296, 25,... 2, 3, 5, 8,10,17,... 49 10
#> 50: 15,263,135,110,296, 25,... 2, 3, 5, 8,10,17,... 50 18
#> 51: 15,263,135,110,296, 25,... 2, 3, 5, 8,10,17,... 51 33
#> 52: 15,263,135,110,296, 25,... 2, 3, 5, 8,10,17,... 52 60
#> 53: 15,263,135,110,296, 25,... 2, 3, 5, 8,10,17,... 53 110
#> 54: 15,263,135,110,296, 25,... 2, 3, 5, 8,10,17,... 54 200
#> train test iteration train_min_size
#> <list> <list> <int> <int>
#>
#> $id.dt
#> row_id fold
#> <int> <int>
#> 1: 1 1
#> 2: 2 3
#> 3: 3 3
#> 4: 4 2
#> 5: 5 3
#> ---
#> 296: 296 2
#> 297: 297 1
#> 298: 298 1
#> 299: 299 3
#> 300: 300 2
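The train_size values in iteration.dt above (10, 18, 33, 60, 110, 200) are consistent with six sizes evenly spaced on a log scale, from min_train_data=10 up to the 200 rows that remain when one of three folds is held out from 300 rows. A minimal base R sketch of that computation follows; rounding to the nearest integer is an assumption made here, since the exact rule used by the package is not shown in this vignette.

```r
# Values taken from the output above.
min_train_data <- 10
max_train_data <- 200  # rows not in the test fold: 300 * 2/3
n_train_sizes <- 6
# Evenly spaced on the log scale, then rounded to whole numbers
# (the rounding rule is an assumption made for this sketch).
log.sizes <- seq(
  log(min_train_data), log(max_train_data),
  length.out = n_train_sizes)
train.size.vec <- as.integer(round(exp(log.sizes)))
train.size.vec
#> [1]  10  18  33  60 110 200
```

Because the spacing is on the log scale, each size is roughly 1.8 times the previous one, which gives more resolution at small train set sizes.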
Above we see the instance, which need not be examined by the user, but for informational purposes, it contains the following data:
iteration.dt has one row for each train/test split,
id.dt has one row for each data point.
In the code below, we define two learners to compare,
(reg.learner.list <- list(
if(requireNamespace("rpart"))mlr3::LearnerRegrRpart$new(),
mlr3::LearnerRegrFeatureless$new()))
#> [[1]]
#>
#> ── <LearnerRegrRpart> (regr.rpart): Regression Tree ────────────────────────────
#> • Model: -
#> • Parameters: xval=0
#> • Packages: mlr3 and rpart
#> • Predict Types: [response]
#> • Feature Types: logical, integer, numeric, factor, and ordered
#> • Encapsulation: none (fallback: -)
#> • Properties: importance, missings, selected_features, and weights
#> • Other settings: use_weights = 'use'
#>
#> [[2]]
#>
#> ── <LearnerRegrFeatureless> (regr.featureless): Featureless Regression Learner ─
#> • Model: -
#> • Parameters: robust=FALSE
#> • Packages: mlr3 and stats
#> • Predict Types: [response], se, and quantiles
#> • Feature Types: logical, integer, numeric, character, factor, ordered,
#> POSIXct, and Date
#> • Encapsulation: none (fallback: -)
#> • Properties: featureless, importance, missings, selected_features, and weights
#> • Other settings: use_weights = 'use'
The code above defines
regr.rpart: Regression Tree learning algorithm, which should be able to learn the non-linear pattern in the sin data (if there are enough data in the train set).
regr.featureless: Featureless Regression learning algorithm, which should be optimal for the constant data, and can be used as a baseline in the sin data. When the rpart learner gets smaller prediction error rates than featureless, then we know that it has learned some non-trivial relationship between inputs and outputs.
In the code below, we define the benchmark grid, which is all combinations of tasks (constant and sin), learners (rpart and featureless), and the one resampling method.
(reg.bench.grid <- mlr3::benchmark_grid(
reg.task.list,
reg.learner.list,
reg_size_cv))
#> task learner resampling
#> <char> <char> <char>
#> 1: sin regr.rpart variable_size_train_cv
#> 2: sin regr.featureless variable_size_train_cv
#> 3: constant regr.rpart variable_size_train_cv
#> 4: constant regr.featureless variable_size_train_cv
In the code below, we execute the benchmark experiment (optionally in parallel using the multisession future plan).
if(FALSE){
if(require(future))plan("multisession")
}
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
(reg.bench.result <- mlr3::benchmark(
reg.bench.grid, store_models = TRUE))
#>
#> ── <BenchmarkResult> of 216 rows with 4 resampling run ─────────────────────────
#> nr task_id learner_id resampling_id iters warnings errors
#> 1 sin regr.rpart variable_size_train_cv 54 0 0
#> 2 sin regr.featureless variable_size_train_cv 54 0 0
#> 3 constant regr.rpart variable_size_train_cv 54 0 0
#> 4 constant regr.featureless variable_size_train_cv 54 0 0
The code below computes the test error for each split, and shows the first row of the result:
reg.bench.score <- mlr3resampling::score(reg.bench.result)
reg.bench.score[1]
#> test.fold seed small_stratum_size train_size_i train_size
#> <int> <int> <int> <int> <int>
#> 1: 1 1 10 1 10
#> train test iteration train_min_size
#> <list> <list> <int> <int>
#> 1: 216,197, 81,171,143, 36,... 1, 7,11,13,15,19,... 1 10
#> uhash nr task task_id
#> <char> <int> <list> <char>
#> 1: 293d6802-5ff9-40ed-a7da-2816ae651936 1 <TaskRegr:sin> sin
#> learner learner_id resampling
#> <list> <char> <list>
#> 1: <LearnerRegrRpart:regr.rpart> regr.rpart <ResamplingVariableSizeTrainCV>
#> resampling_id prediction_test regr.mse algorithm
#> <char> <list> <num> <char>
#> 1: variable_size_train_cv <PredictionRegr> 0.8008255 rpart
The output above contains all of the results related to a particular train/test split. In particular for our purposes, the interesting columns are:
test.fold is the cross-validation fold ID.
seed is the random seed used to determine the train set order.
train_size is the number of data in the train set.
train and test are vectors of row numbers assigned to each set.
iteration is an ID for the train/test split, for a particular learning algorithm and task. It is the row number of iteration.dt (see instance above), which has one row for each unique combination of test.fold, seed, and train_size.
learner is the mlr3 learner object, which can be used to compute predictions on new data (including a grid of inputs, to show predictions in the visualization below).
regr.mse is the mean squared error on the test set.
algorithm is the name of the learning algorithm (same as learner_id but without regr. prefix).
The code below visualizes the resulting test error numbers.
train_size_vec <- unique(reg.bench.score$train_size)
if(require(animint2)){
ggplot()+
scale_x_log10(
breaks=train_size_vec)+
scale_y_log10()+
geom_line(aes(
train_size, regr.mse,
group=paste(algorithm, seed),
color=algorithm),
shape=1,
data=reg.bench.score)+
geom_point(aes(
train_size, regr.mse, color=algorithm),
shape=1,
data=reg.bench.score)+
facet_grid(
test.fold~task_id,
labeller=label_both,
scales="free")
}
Above we plot the test error for each fold and train set size. There is a different panel for each task and test fold. Each line represents a random seed (ordering of data in train set), and each dot represents a specific train set size. So the plot above shows that some variation in test error, for a given test fold, is due to the random ordering of the train data.
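The iteration.dt output shown earlier suggests where this variation comes from: for a given test fold and seed, the train column starts with the same row numbers at every train size, which is consistent with each small train set being the first rows of one random ordering. Below is a minimal base R sketch under that assumption; the pool of 200 row numbers and the helper `nested_train_sets()` are hypothetical, and the actual sampling code in the package may differ.

```r
# Hypothetical pool of row numbers available for training.
train.pool <- 1:200
train.size.vec <- c(10, 18, 33, 60, 110, 200)
nested_train_sets <- function(seed) {
  set.seed(seed)
  random.order <- sample(train.pool)
  # Each train set is a prefix of the same random ordering,
  # so smaller train sets are subsets of larger ones.
  lapply(train.size.vec, function(n) random.order[seq_len(n)])
}
seed1 <- nested_train_sets(1)
seed2 <- nested_train_sets(2)
# Same sizes for both seeds.
sapply(seed1, length)
# Rows in the smallest train set for seed 1 but not for seed 2.
setdiff(seed1[[1]], seed2[[1]])
```

At the largest size every seed uses the whole pool, so in this sketch the seeds can only disagree at the smaller train set sizes.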
Below we summarize each train set size, by taking the mean and standard deviation over each random seed.
reg.mean.dt <- dcast(
reg.bench.score,
task_id + train_size + test.fold + algorithm ~ .,
list(mean, sd),
value.var="regr.mse")
if(require(animint2)){
ggplot()+
scale_x_log10(
breaks=train_size_vec)+
scale_y_log10()+
geom_ribbon(aes(
train_size,
ymin=regr.mse_mean-regr.mse_sd,
ymax=regr.mse_mean+regr.mse_sd,
fill=algorithm),
alpha=0.5,
data=reg.mean.dt)+
geom_line(aes(
train_size, regr.mse_mean, color=algorithm),
shape=1,
data=reg.mean.dt)+
facet_grid(
test.fold~task_id,
labeller=label_both,
scales="free")
}
#> Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency
#> is not supported on this device: reported only once per page
The plot above shows a line for the mean, and a ribbon for the standard deviation, over the three random seeds. It is clear from the plot above that
The code below can be used to create an interactive data visualization which allows exploring how different functions are learned during different splits.
grid.dt <- data.table(x=seq(-abs.x, abs.x, l=101), y=0)
grid.task <- mlr3::TaskRegr$new("grid", grid.dt, target="y")
pred.dt.list <- list()
point.dt.list <- list()
for(score.i in 1:nrow(reg.bench.score)){
reg.bench.row <- reg.bench.score[score.i]
task.dt <- data.table(
reg.bench.row$task[[1]]$data(),
reg.bench.row$resampling[[1]]$instance$id.dt)
set.ids <- data.table(
set.name=c("test","train")
)[
, data.table(row_id=reg.bench.row[[set.name]][[1]])
, by=set.name]
i.points <- set.ids[
task.dt, on="row_id"
][
is.na(set.name), set.name := "unused"
]
point.dt.list[[score.i]] <- data.table(
reg.bench.row[, .(task_id, iteration)],
i.points)
i.learner <- reg.bench.row$learner[[1]]
pred.dt.list[[score.i]] <- data.table(
reg.bench.row[, .(
task_id, iteration, algorithm
)],
as.data.table(
i.learner$predict(grid.task)
)[, .(x=grid.dt$x, y=response)]
)
}
(pred.dt <- rbindlist(pred.dt.list))
#> task_id iteration algorithm x y
#> <char> <int> <char> <num> <num>
#> 1: sin 1 rpart -10.0 0.25011658
#> 2: sin 1 rpart -9.8 0.25011658
#> 3: sin 1 rpart -9.6 0.25011658
#> 4: sin 1 rpart -9.4 0.25011658
#> 5: sin 1 rpart -9.2 0.25011658
#> ---
#> 21812: constant 54 featureless 9.2 -0.03385654
#> 21813: constant 54 featureless 9.4 -0.03385654
#> 21814: constant 54 featureless 9.6 -0.03385654
#> 21815: constant 54 featureless 9.8 -0.03385654
#> 21816: constant 54 featureless 10.0 -0.03385654
(point.dt <- rbindlist(point.dt.list))
#> task_id iteration set.name row_id y x fold
#> <char> <int> <char> <int> <num> <num> <int>
#> 1: sin 1 test 1 1.2248390 -4.689827 1
#> 2: sin 1 unused 2 -0.5607042 -2.557522 3
#> 3: sin 1 unused 3 0.8345056 1.457067 3
#> 4: sin 1 unused 4 0.4875994 8.164156 2
#> 5: sin 1 unused 5 -0.4321800 -5.966361 3
#> ---
#> 64796: constant 54 train 296 -0.6728968 3.628850 2
#> 64797: constant 54 train 297 0.5168327 -8.016618 1
#> 64798: constant 54 train 298 -0.4058882 -7.621949 1
#> 64799: constant 54 test 299 0.9008627 -8.991207 3
#> 64800: constant 54 train 300 0.8857710 8.585078 2
set.colors <- c(
train="#1B9E77",
test="#D95F02",
unused="white")
algo.colors <- c(
featureless="blue",
rpart="red")
if(require(animint2)){
viz <- animint(
title="Variable size train set, regression",
pred=ggplot()+
ggtitle("Predictions for selected train/test split")+
theme_animint(height=400)+
scale_fill_manual(values=set.colors)+
geom_point(aes(
x, y, fill=set.name),
help="One dot per sample in train/test/unused set.",
showSelected="iteration",
size=3,
shape=21,
data=point.dt)+
scale_size_manual(values=c(
featureless=3,
rpart=2))+
scale_color_manual(values=algo.colors)+
geom_line(aes(
x, y,
color=algorithm,
size=algorithm,
group=paste(algorithm, iteration)),
help="One line per learned prediction function.",
showSelected="iteration",
data=pred.dt)+
facet_grid(
task_id ~ .,
labeller=label_both),
err=ggplot()+
ggtitle("Test error for each split")+
theme_animint(width=500)+
theme(
panel.margin=grid::unit(1, "lines"),
legend.position="none")+
scale_y_log10(
"Mean squared error on test set")+
scale_color_manual(values=algo.colors)+
scale_x_log10(
"Train set size",
breaks=train_size_vec)+
geom_line(aes(
train_size, regr.mse,
group=paste(algorithm, seed),
color=algorithm),
help="One line per algorithm and random seed used to order train set.",
clickSelects="seed",
alpha_off=0.2,
showSelected="algorithm",
size=4,
data=reg.bench.score)+
facet_grid(
test.fold~task_id,
labeller=label_both,
scales="free")+
geom_point(aes(
train_size, regr.mse,
color=algorithm),
help="One point per algorithm and train set size, for the selected random ordering.",
size=5,
stroke=3,
fill="black",
fill_off=NA,
showSelected=c("algorithm","seed"),
clickSelects="iteration",
data=reg.bench.score),
video="https://vimeo.com/manage/videos/1053467310",
source="https://github.com/tdhock/mlr3resampling/blob/main/vignettes/Older_resamplers.Rmd")
}
if(FALSE){
animint2pages(viz, "2023-12-26-train-sizes-regression")
}
If you are viewing this in an installed package or on CRAN, then there will be no data viz on this page, but you can view it on: https://tdhock.github.io/2023-12-26-train-sizes-regression/
The interactive data viz consists of two plots:
Whereas in the section above, we focused on regression (output is a real number), in this section we simulate a binary classification problem (output is a factor with two levels).
class.N <- 900
class.abs.x <- 1
rclass <- function(){
runif(class.N, -class.abs.x, class.abs.x)
}
library(data.table)
set.seed(1)
class.x.dt <- data.table(x1=rclass(), x2=rclass())
class.fun.list <- list(
constant=function(...)0.5,
xor=function(x1, x2)xor(x1>0, x2>0))
class.data.list <- list()
class.task.list <- list()
for(task_id in names(class.fun.list)){
class.fun <- class.fun.list[[task_id]]
y <- factor(ifelse(
class.x.dt[, class.fun(x1, x2)+rnorm(class.N, sd=0.5)]>0.5,
"spam", "not"))
task.dt <- data.table(class.x.dt, y)
this.task <- mlr3::TaskClassif$new(
task_id, task.dt, target="y")
this.task$col_roles$stratum <- "y"
class.task.list[[task_id]] <- this.task
class.data.list[[task_id]] <- data.table(task_id, task.dt)
}
(class.data <- rbindlist(class.data.list))
#> task_id x1 x2 y
#> <char> <num> <num> <fctr>
#> 1: constant -0.4689827 0.66379798 not
#> 2: constant -0.2557522 0.53368551 spam
#> 3: constant 0.1457067 -0.45443937 spam
#> 4: constant 0.8164156 -0.62367340 not
#> 5: constant -0.5966361 -0.54847633 spam
#> ---
#> 1796: xor -0.7614714 -0.01958119 not
#> 1797: xor 0.1871909 -0.96323285 not
#> 1798: xor -0.9253746 -0.64121842 not
#> 1799: xor -0.9808564 -0.40121772 spam
#> 1800: xor -0.6768077 -0.44607188 not
The simulated data table above consists of two input features (x1 and x2) along with an output/label to predict (y). Below we count the number of times each label appears in each task:
class.data[, .(count=.N), by=.(task_id, y)]
#> task_id y count
#> <char> <fctr> <int>
#> 1: constant not 462
#> 2: constant spam 438
#> 3: xor spam 462
#> 4: xor not 438
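The table above shows that the two labels are nearly balanced in both tasks: not is the majority class in the constant task (462 versus 438), whereas spam is the majority class in the xor task. The featureless baseline always predicts the most frequent label in its train set. Below we visualize the data in the feature space: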
if(require(animint2)){
ggplot()+
geom_point(aes(
x1, x2, color=y),
shape=1,
data=class.data)+
facet_grid(. ~ task_id, labeller=label_both)+
coord_equal()
}
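The plot above shows how the output y is related to the two inputs x1 and x2, for the two tasks.
For the constant task, the output does not depend on the inputs.
For the xor task, the spam label is associated with either x1 or x2 being negative (but not both).
In the mlr3 code below, we define a list of learners, our resampling method, and a benchmark grid: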
class.learner.list <- list(
if(requireNamespace("rpart"))mlr3::LearnerClassifRpart$new(),
mlr3::LearnerClassifFeatureless$new())
size_cv <- mlr3resampling::ResamplingVariableSizeTrainCV$new()
(class.bench.grid <- mlr3::benchmark_grid(
class.task.list,
class.learner.list,
size_cv))
#> task learner resampling
#> <char> <char> <char>
#> 1: constant classif.rpart variable_size_train_cv
#> 2: constant classif.featureless variable_size_train_cv
#> 3: xor classif.rpart variable_size_train_cv
#> 4: xor classif.featureless variable_size_train_cv
Below we run the learning algorithm for each of the train/test splits defined by our benchmark grid:
if(FALSE){
if(require(future))plan("multisession")
}
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
(class.bench.result <- mlr3::benchmark(
class.bench.grid, store_models = TRUE))
#>
#> ── <BenchmarkResult> of 180 rows with 4 resampling run ─────────────────────────
#> nr task_id learner_id resampling_id iters warnings errors
#> 1 constant classif.rpart variable_size_train_cv 45 0 0
#> 2 constant classif.featureless variable_size_train_cv 45 0 0
#> 3 xor classif.rpart variable_size_train_cv 45 0 0
#> 4 xor classif.featureless variable_size_train_cv 45 0 0
Below we compute scores (test error) for each resampling iteration, and show the first row of the result.
class.bench.score <- mlr3resampling::score(class.bench.result)
class.bench.score[1]
#> test.fold seed small_stratum_size train_size_i train_size
#> <int> <int> <int> <int> <int>
#> 1: 1 1 10 1 21
#> train test iteration train_min_size
#> <list> <list> <int> <int>
#> 1: 91,746,863,730,208,508,... 4,10,12,33,40,49,... 1 21
#> uhash nr task task_id
#> <char> <int> <list> <char>
#> 1: 640facf3-e501-4c22-b15c-8027e10bdddb 1 <TaskClassif:constant> constant
#> learner learner_id
#> <list> <char>
#> 1: <LearnerClassifRpart:classif.rpart> classif.rpart
#> resampling resampling_id prediction_test
#> <list> <char> <list>
#> 1: <ResamplingVariableSizeTrainCV> variable_size_train_cv <PredictionClassif>
#> classif.ce algorithm
#> <num> <char>
#> 1: 0.5266667 rpart
The output above has columns which are very similar to the regression
example in the previous section. The main difference is the
classif.ce
column, which is the classification error on the test
set.
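As a reminder of what this number means, the classification error is the proportion of test rows for which the predicted label differs from the true label. The small base R sketch below uses hypothetical labels and predictions (not taken from the benchmark) to compute it by hand, and compares it with a featureless-style baseline that always predicts the most frequent label.

```r
# Hypothetical true labels and predictions for eight test rows.
label.levels <- c("not", "spam")
truth <- factor(
  c("spam", "not", "spam", "not", "not", "not", "spam", "not"),
  label.levels)
pred <- factor(
  c("spam", "spam", "spam", "not", "not", "not", "not", "not"),
  label.levels)
# Classification error: proportion of test rows with a wrong prediction.
classif.ce <- mean(pred != truth)
classif.ce
# Featureless-style baseline: always predict the most frequent label
# (computed here from the same labels, only to keep the sketch short).
majority.label <- names(which.max(table(truth)))
baseline.ce <- mean(majority.label != truth)
baseline.ce
```

A learner is only interesting when its error is below that of the baseline, which is the comparison made in the plots of this section.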
Finally we plot the test error values below.
if(require(animint2)){
ggplot()+
geom_line(aes(
train_size, classif.ce,
group=paste(algorithm, seed),
color=algorithm),
shape=1,
data=class.bench.score)+
geom_point(aes(
train_size, classif.ce, color=algorithm),
shape=1,
data=class.bench.score)+
facet_grid(
task_id ~ test.fold,
labeller=label_both)+
scale_x_log10(
breaks=unique(class.bench.score$train_size))+
scale_y_continuous(
"Test error rate",
limits=c(0.1,0.6),
breaks=seq(0.1,0.6,by=0.1))
}
It is clear from the plot above that
Exercise for the reader: compute and plot mean and SD for these classification tasks, similar to the plot for the regression tasks in the previous section.
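One possible starting point for this exercise is sketched below on a small hypothetical score table, so that it runs without the benchmark result. The table `toy.score` and its error values are invented for illustration; only the column names classif.ce, algorithm, train_size and seed mirror the real score table.

```r
library(data.table)
# Hypothetical scores: 2 algorithms x 2 train sizes x 3 seeds.
# CJ returns rows sorted by algorithm, then train_size, then seed.
toy.score <- CJ(
  algorithm = c("featureless", "rpart"),
  train_size = c(21, 600),
  seed = 1:3)
toy.score[, classif.ce := c(
  0.50, 0.52, 0.48,  # featureless, train_size=21
  0.50, 0.50, 0.50,  # featureless, train_size=600
  0.45, 0.55, 0.50,  # rpart, train_size=21
  0.20, 0.22, 0.18)] # rpart, train_size=600
# One row per algorithm and train size, summarizing over seeds.
toy.stats <- toy.score[, .(
  ce_mean = mean(classif.ce),
  ce_sd = sd(classif.ce)
), by = .(algorithm, train_size)]
toy.stats
```

With the real data, class.bench.score would take the place of toy.score, with task_id and test.fold added to the by columns; the resulting mean and SD columns can then be drawn with geom_ribbon and geom_line as in the regression section.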
The code below can be used to create an interactive data visualization which allows exploring how different functions are learned during different splits.
class.grid.vec <- seq(-class.abs.x, class.abs.x, l=21)
class.grid.dt <- CJ(x1=class.grid.vec, x2=class.grid.vec)
class.pred.dt.list <- list()
class.point.dt.list <- list()
for(score.i in 1:nrow(class.bench.score)){
class.bench.row <- class.bench.score[score.i]
task.dt <- data.table(
class.bench.row$task[[1]]$data(),
class.bench.row$resampling[[1]]$instance$id.dt)
set.ids <- data.table(
set.name=c("test","train")
)[
, data.table(row_id=class.bench.row[[set.name]][[1]])
, by=set.name]
i.points <- set.ids[
task.dt, on="row_id"
][
is.na(set.name), set.name := "unused"
][]
class.point.dt.list[[score.i]] <- data.table(
class.bench.row[, .(task_id, iteration)],
i.points)
if(class.bench.row$algorithm!="featureless"){
i.learner <- class.bench.row$learner[[1]]
i.learner$predict_type <- "prob"
i.task <- class.bench.row$task[[1]]
grid.class.task <- mlr3::TaskClassif$new(
"grid", class.grid.dt[, label:=factor(NA,levels(task.dt$y))], target="label")
pred.grid <- as.data.table(
i.learner$predict(grid.class.task)
)[, data.table(class.grid.dt, prob.spam)]
pred.wide <- dcast(pred.grid, x1 ~ x2, value.var="prob.spam")
prob.mat <- as.matrix(pred.wide[,-1])
if(length(table(prob.mat))>1){
contour.list <- contourLines(
class.grid.vec, class.grid.vec, prob.mat, levels=0.5)
class.pred.dt.list[[score.i]] <- data.table(
class.bench.row[, .(
task_id, iteration, algorithm
)],
data.table(contour.i=seq_along(contour.list))[, {
do.call(data.table, contour.list[[contour.i]])[, .(level, x1=x, x2=y)]
}, by=contour.i]
)
}
}
}
(class.pred.dt <- rbindlist(class.pred.dt.list))
#> task_id iteration algorithm contour.i level x1 x2
#> <char> <int> <char> <int> <num> <num> <num>
#> 1: constant 1 rpart 1 0.5 -1.0000000 -0.3531915
#> 2: constant 1 rpart 1 0.5 -0.9000000 -0.3531915
#> 3: constant 1 rpart 1 0.5 -0.8000000 -0.3531915
#> 4: constant 1 rpart 1 0.5 -0.7000000 -0.3531915
#> 5: constant 1 rpart 1 0.5 -0.6000000 -0.3531915
#> ---
#> 5502: xor 45 rpart 2 0.5 0.7000000 0.0499392
#> 5503: xor 45 rpart 2 0.5 0.8000000 0.0499392
#> 5504: xor 45 rpart 2 0.5 0.8465335 0.0000000
#> 5505: xor 45 rpart 2 0.5 0.9000000 -0.0460000
#> 5506: xor 45 rpart 2 0.5 1.0000000 -0.0460000
(class.point.dt <- rbindlist(class.point.dt.list))
#> task_id iteration set.name row_id y x1 x2 fold
#> <char> <int> <char> <int> <fctr> <num> <num> <int>
#> 1: constant 1 unused 1 not -0.4689827 0.66379798 3
#> 2: constant 1 unused 2 spam -0.2557522 0.53368551 2
#> 3: constant 1 unused 3 spam 0.1457067 -0.45443937 2
#> 4: constant 1 test 4 not 0.8164156 -0.62367340 1
#> 5: constant 1 test 5 spam -0.5966361 -0.54847633 1
#> ---
#> 161996: xor 45 test 896 not -0.7614714 -0.01958119 3
#> 161997: xor 45 test 897 not 0.1871909 -0.96323285 3
#> 161998: xor 45 train 898 not -0.9253746 -0.64121842 2
#> 161999: xor 45 train 899 spam -0.9808564 -0.40121772 1
#> 162000: xor 45 train 900 not -0.6768077 -0.44607188 1
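The decision boundaries stored in class.pred.dt come from contourLines, which traces the curve where a matrix of predicted probabilities crosses 0.5. The self-contained sketch below shows that step in isolation, using a hypothetical probability surface (a logistic function of x1, not a fitted rpart model) and base R only.

```r
# Hypothetical probability surface on a 21 x 21 grid: the predicted
# probability of spam increases with x1 and crosses 0.5 at x1 = 0.25.
grid.vec <- seq(-1, 1, length.out = 21)
prob.mat <- outer(
  grid.vec, grid.vec,
  function(x1, x2) plogis(4 * (x1 - 0.25)))
# Rows of prob.mat correspond to x1 values, columns to x2 values.
contour.list <- contourLines(grid.vec, grid.vec, prob.mat, levels = 0.5)
# Each element is a list with level, x and y; stack them in a data frame.
boundary.df <- do.call(rbind, lapply(seq_along(contour.list), function(contour.i) {
  with(contour.list[[contour.i]], data.frame(contour.i, level, x1 = x, x2 = y))
}))
# The boundary is a vertical line close to x1 = 0.25.
range(boundary.df$x1)
```

When the probability matrix never crosses the requested level (for example, when it is constant), there is no boundary to draw, which is presumably why the loop above only calls contourLines if prob.mat contains more than one distinct value.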
set.colors <- c(
train="#1B9E77",
test="#D95F02",
unused="white")
algo.colors <- c(
featureless="blue",
rpart="red")
if(require(animint2)){
viz <- animint(
title="Variable size train sets, classification",
pred=ggplot()+
ggtitle("Predictions for selected train/test split")+
theme(panel.margin=grid::unit(1, "lines"))+
theme_animint(width=600)+
coord_equal()+
scale_fill_manual(values=set.colors)+
scale_color_manual(values=c(spam="black",not="white"))+
geom_point(aes(
x1, x2, color=y, fill=set.name),
showSelected="iteration",
help="One dot per data sample in the train/test/unused set.",
size=3,
stroke=2,
shape=21,
data=class.point.dt)+
geom_path(aes(
x1, x2,
group=paste(algorithm, iteration, contour.i)),
showSelected=c("iteration","algorithm"),
help="Red path represents decision boundary of rpart decision tree learning algorithm.",
color=algo.colors[["rpart"]],
data=class.pred.dt)+
facet_grid(
. ~ task_id,
labeller=label_both,
space="free",
scales="free"),
err=ggplot()+
ggtitle("Test error for each split")+
theme_animint(height=400)+
theme(panel.margin=grid::unit(1, "lines"))+
scale_y_continuous(
"Classification error on test set",
limits=c(0.1,0.6),
breaks=seq(0.1,0.6,by=0.1))+
scale_color_manual(values=algo.colors)+
scale_x_log10(
"Train set size",
breaks=unique(class.bench.score$train_size))+
geom_line(aes(
train_size, classif.ce,
group=paste(algorithm, seed),
color=algorithm),
help="One line per algorithm and random seed used to order train set.",
clickSelects="seed",
alpha_off=0.2,
showSelected="algorithm",
size=4,
data=class.bench.score)+
facet_grid(
test.fold~task_id,
labeller=label_both,
scales="free")+
geom_point(aes(
train_size, classif.ce,
color=algorithm),
size=5,
stroke=3,
fill="black",
fill_off=NA,
help="One point per algorithm and train set size, for the selected random ordering.",
showSelected=c("algorithm","seed"),
clickSelects="iteration",
data=class.bench.score),
video="https://vimeo.com/1053477025",
source="https://github.com/tdhock/mlr3resampling/blob/main/vignettes/Older_resamplers.Rmd")
}
if(FALSE){
animint2pages(viz, "2023-12-27-train-sizes-classification")
}
If you are viewing this in an installed package or on CRAN, then there will be no data viz on this page, but you can view it at https://tdhock.github.io/2023-12-27-train-sizes-classification/
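The interactive data viz consists of two plots: one shows the predictions for the selected train/test split, and the other shows the test error for each split. Clicking a line in the error plot selects a random seed, and clicking a point selects the train/test split that is shown in the prediction plot.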
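The red path in the prediction plot is the 0.5 level curve of the predicted probability of spam, computed with contourLines on a grid. The following is a minimal base R sketch of that step alone. It assumes a made-up probability surface (prob.fun is hypothetical, not the rpart learner used above) and uses a data.frame rather than a data.table to stay self-contained.

```r
# Regular grid of feature values, similar in spirit to class.grid.vec above.
grid.vec <- seq(-1, 1, by=0.1)
# Hypothetical probability surface (an assumption for illustration only):
# the predicted probability equals 0.5 along the line x1 + x2 = 0.05.
prob.fun <- function(x1, x2) plogis(5*(x1 + x2 - 0.05))
prob.mat <- outer(grid.vec, grid.vec, prob.fun)
# contourLines returns a list of paths, each with elements level, x, y.
contour.list <- contourLines(grid.vec, grid.vec, prob.mat, levels=0.5)
# Stack the paths into one table, with one row per point on a path.
boundary.df <- do.call(rbind, lapply(seq_along(contour.list), function(contour.i){
  with(contour.list[[contour.i]], data.frame(contour.i, level, x1=x, x2=y))
}))
head(boundary.df)
```

Each value of contour.i identifies one connected path, which is why the geom_path above groups by contour.i (in addition to algorithm and iteration).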
In this section we have shown how to use mlr3resampling to compare the test error of models trained on train sets of different sizes.
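The underlying idea can also be sketched without mlr3, which may help clarify what is being computed. The code below is a simplified illustration, not the method implemented by mlr3resampling: it uses a single test fold, a single ordering of the train data, no stratification, and a simulated linear pattern. All names (sim.df, train.sizes, err.df) are hypothetical.

```r
library(rpart)
set.seed(1)
N <- 600
sim.df <- data.frame(x1=runif(N, -1, 1), x2=runif(N, -1, 1))
# Hypothetical pattern: the label depends on the sign of x1 + x2, plus noise.
sim.df$y <- factor(ifelse(
  sim.df$x1 + sim.df$x2 + rnorm(N, sd=0.3) > 0, "spam", "not"))
# Assign fold IDs, then hold out fold 1 as the test set.
fold <- rep(1:3, length.out=N)
test.df <- sim.df[fold == 1, ]
train.all <- sim.df[fold != 1, ]
train.sizes <- c(10, 30, 100, nrow(train.all))
err.df <- do.call(rbind, lapply(train.sizes, function(train_size){
  # Rows are already in random order, so the first rows are a random subset.
  train.df <- train.all[seq_len(train_size), ]
  fit <- rpart(y ~ x1 + x2, train.df)
  pred.rpart <- as.character(predict(fit, test.df, type="class"))
  # Featureless baseline: always predict the most frequent train label.
  pred.featureless <- names(which.max(table(train.df$y)))
  test.label <- as.character(test.df$y)
  data.frame(
    train_size,
    algorithm=c("rpart", "featureless"),
    classif.ce=c(
      mean(pred.rpart != test.label),
      mean(pred.featureless != test.label)))
}))
err.df
```

The result has one row per train set size and algorithm, which is the same shape of data that is plotted as classif.ce versus train_size in the error plot above.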
sessionInfo()
#> R Under development (unstable) (2025-05-21 r88220)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
#> [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Paris
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] lgr_0.4.4 animint2_2025.6.4 directlabels_2025.5.20
#> [4] mlr3resampling_2025.6.23 mlr3_1.0.0.9000 future_1.58.0
#> [7] ggplot2_3.5.1 data.table_1.17.99
#>
#> loaded via a namespace (and not attached):
#> [1] generics_0.1.3 stringi_1.8.7 listenv_0.9.1
#> [4] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3
#> [7] grid_4.6.0 plyr_1.8.9 backports_1.5.0
#> [10] scales_1.3.0 mlr3tuning_1.3.0 codetools_0.2-20
#> [13] mlr3measures_1.0.0 palmerpenguins_0.1.1 cli_3.6.5
#> [16] rlang_1.1.6 crayon_1.5.3 parallelly_1.45.0
#> [19] litedown_0.6 future.apply_1.20.0 munsell_0.5.1
#> [22] commonmark_1.9.5 withr_3.0.2 nc_2025.3.24
#> [25] tools_4.6.0 parallel_4.6.0 reshape2_1.4.4
#> [28] RJSONIO_1.3-1.9 uuid_1.2-1 checkmate_2.3.2
#> [31] dplyr_1.1.4 colorspace_2.1-1 globals_0.18.0
#> [34] bbotk_1.5.0 vctrs_0.6.5 R6_2.6.1
#> [37] mime_0.13 rpart_4.1.24 lifecycle_1.0.4
#> [40] stringr_1.5.1 mlr3misc_0.18.0 pkgconfig_2.0.3
#> [43] pillar_1.10.2 gtable_0.3.6 Rcpp_1.0.14
#> [46] glue_1.8.0 paradox_1.0.1 xfun_0.51
#> [49] tibble_3.2.1 tidyselect_1.2.1 knitr_1.50
#> [52] farver_2.1.2 labeling_0.4.3 compiler_4.6.0
#> [55] quadprog_1.5-8 markdown_2.0