This vignette of package maxEff (Github, RPubs) documents three methods of selecting, from many candidate numeric predictors, the one additional predictor with the maximum effect size in a regression model. The three methods are implemented in functions add_num(), add_dummy() and add_dummy_partition().

Examples in this vignette require that the search path has, at a minimum, the following packages (inferred from the functions used in the examples):
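```r
# packages inferred from this vignette's examples
library(maxEff)
library(groupedHyperframe) # Ki67 data; aggregate_quantile()
library(survival)          # Surv(), coxph()
library(rpart)             # rpart()
```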
Users should remove the parameter mc.cores = 1L from all examples and use the default option, which engages all CPU cores on the current host on macOS. The authors are forced to use mc.cores = 1L in this vignette in order to pass CRAN's submission checks.
| Term / Abbreviation | Description | Reference |
|---|---|---|
| `\|>` | Forward pipe operator | ?base::pipeOp, introduced in R 4.1.0 |
| abs | Absolute value | base::abs |
| coxph | Cox model | survival::coxph |
| CRAN, R | The Comprehensive R Archive Network | https://cran.r-project.org |
| factor | Factor, or categorical variable | base::factor |
| function | R function | base::`function` |
| groupedHyperframe | Grouped hyper data frame | groupedHyperframe::as.groupedHyperframe |
| head | First parts of an object | utils::head; utils:::head.default |
| hypercolumns, hyperframe | (Hyper columns of) hyper data frame | spatstat.geom::hyperframe |
| labels | Labels from object | base::labels; maxEff::labels.node1 |
| levels | Levels of a factor | base::levels |
| listof | List of objects | stats::listof |
| logistic | Logistic regression model | stats::glm(., family = binomial('logit')) |
| matrix | Matrix | base::matrix |
| partition | Stratified partition | maxEff::statusPartition, caret::createDataPartition |
| PFS | Progression/recurrence-free survival | https://en.wikipedia.org/wiki/Progression-free_survival |
| predict | Model prediction | stats::predict; maxEff::predict.add_num; maxEff::predict.add_dummy |
| quantile | Quantile | stats::quantile |
| rpart | Recursive partitioning and regression trees | rpart::rpart |
| S3, generic, methods | S3 object oriented system | base::UseMethod; utils::methods; utils::getS3method; https://adv-r.hadley.nz/s3.html |
| sort_by | Sort an object by some criterion | base::sort_by; maxEff::sort_by.add_ |
| subset | Subsets of object by conditions | base::subset; maxEff::subset.add_dummy |
| Surv | Survival object | survival::Surv |
| update | Update and re-fit a model call | stats::update |
This work is supported by NCI R01CA222847 (I. Chervoneva, T. Zhan, and H. Rui) and R01CA253977 (H. Rui and I. Chervoneva).
## node1(): Dichotomization via First Node of Recursive Partitioning

Consider the following recursive partitioning and regression example,

```r
data(cu.summary, package = 'rpart')
(r = rpart(Price ~ Mileage, data = cu.summary, cp = .Machine$double.eps, maxdepth = 1L))
#> n=60 (57 observations deleted due to missingness)
#>
#> node), split, n, deviance, yval
#>       * denotes terminal node
#>
#> 1) root 60 983551500 12615.67
#>   2) Mileage>=24.5 25 206417200  9730.16 *
#>   3) Mileage< 24.5 35 420299300 14676.74 *
```

Function node1() creates a dichotomizing rule, i.e., a function, based on the first node of partitioning,
```r
(foo = r |> node1())
#>
#> Dichotomizing Rule based on Recursive Partitioning:
#>
#> function (newx)
#> {
#>     ret <- (newx >= 24.5)
#>     if (all(ret, na.rm = TRUE) || !any(ret, na.rm = TRUE))
#>         warning("Dichotomized value is all-0 or all-1")
#>     return(ret)
#> }
#> <environment: 0x125350040>
```

The dichotomizing rule foo converts a numeric object into a logical object. Developers may obtain the numeric cutoff value of foo, or a brief text describing foo, for further analysis.
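A minimal sketch, applying the rule printed above and retrieving its brief text via the labels() method listed in the table (maxEff::labels.node1):

```r
foo(c(20, 30)) # FALSE TRUE; dichotomized at Mileage >= 24.5
labels(foo)    # brief text describing the dichotomizing rule
```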
## statusPartition(): Stratified Partition by Survival Status

Function statusPartition() performs a stratified partition based on the survival status of a Surv response, to avoid the situation that a Cox model becomes degenerate when all subjects in a partition are censored. This is a deviation from the very popular function caret::createDataPartition(), which stratifies a Surv response by the quantiles of the survival time.
```r
y = (survival::capacitor) |>
  with(expr = Surv(time, status))
set.seed(15); id = y |>
  statusPartition(times = 1L, p = .5)
table(y[id[[1L]], 2L]) / table(y[,2L]) # balanced by status
#>
#>   0   1
#> 0.5 0.5
set.seed(15); id0 = y |>
  caret::createDataPartition(times = 1L, p = .5)
table(y[id0[[1L]], 2L]) / table(y[,2L]) # *not* balanced by status
#>
#>       0       1
#> 0.46875 0.56250
```

Data set Ki67 in package groupedHyperframe is a groupedHyperframe.
```r
data(Ki67, package = 'groupedHyperframe')
Ki67
#> Grouped Hyperframe: ~patientID/tissueID
#>
#> 645 tissueID nested in
#> 622 patientID
#>
#> logKi67 tissueID Tstage PFS recfreesurv_mon recurrence adj_rad adj_chemo
#> 1 (numeric) TJUe_I17 2 100+ 100 0 FALSE FALSE
#> 2 (numeric) TJUe_G17 1 22 22 1 FALSE FALSE
#> 3 (numeric) TJUe_F17 1 99+ 99 0 FALSE NA
#> 4 (numeric) TJUe_D17 1 99+ 99 0 FALSE TRUE
#> 5 (numeric) TJUe_J18 1 112 112 1 TRUE TRUE
#> 6 (numeric) TJUe_N17 4 12 12 1 TRUE FALSE
#> 7 (numeric) TJUe_J17 2 64+ 64 0 FALSE FALSE
#> 8 (numeric) TJUe_F19 2 56+ 56 0 FALSE FALSE
#> 9 (numeric) TJUe_P19 2 79+ 79 0 FALSE FALSE
#> 10 (numeric) TJUe_O19 2 26 26 1 FALSE TRUE
#> histology Her2 HR node race age patientID
#> 1 3 TRUE TRUE TRUE White 66 PT00037
#> 2 3 FALSE TRUE FALSE Black 42 PT00039
#> 3 3 FALSE TRUE FALSE White 60 PT00040
#> 4 3 FALSE TRUE TRUE White 53 PT00042
#> 5 3 FALSE TRUE TRUE White 52 PT00054
#> 6 2 TRUE TRUE TRUE Black 51 PT00059
#> 7 3 FALSE TRUE TRUE Asian 50 PT00062
#> 8 2 TRUE TRUE TRUE White 37 PT00068
#> 9 3 TRUE TRUE FALSE White 68 PT00082
#> 10 2 TRUE TRUE FALSE Black 55 PT00084
```

```r
s = Ki67 |>
  aggregate_quantile(by = ~ patientID, probs = seq.int(from = .01, to = .99, by = .01))
#> Column(s) tissueID removed; as they are not identical per aggregation-group
```

Candidates for additional predictors are stored in the numeric-hypercolumn logKi67.quantile. Users are encouraged to learn more about the groupedHyperframe class and function aggregate_quantile() from the vignette of package groupedHyperframe.
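A quick inspection sketch, assuming spatstat.geom's $ extraction of a hypercolumn as a list:

```r
# each element holds one patient's vector of 99 quantiles of logKi67
s$logKi67.quantile[[1L]] |> head()
```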
Partition into a training (80%) and test (20%) set.
```r
set.seed(234); id = sample.int(n = nrow(s), size = nrow(s)*.8) |> sort.int()
s0 = s[id, , drop = FALSE] # training set
s1 = s[-id, , drop = FALSE] # test set
```

Let's consider a starting model of endpoint PFS with predictor Tstage on the training set s0.
```r
summary(m <- coxph(PFS ~ Tstage, data = s0))
#> Warning in as.data.frame.hyperframe(data, optional = TRUE): 1 variable
#> discarded in conversion to data frame
#> Warning in as.data.frame.hyperframe(data, optional = TRUE): 1 variable
#> discarded in conversion to data frame
#> Warning in as.data.frame.hyperframe(data): 1 variable discarded in conversion
#> to data frame
#> Call:
#> coxph(formula = PFS ~ Tstage, data = s0)
#>
#> n= 497, number of events= 90
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> Tstage 0.7162 2.0466 0.1020 7.019 2.23e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> Tstage 2.047 0.4886 1.676 2.5
#>
#> Concordance= 0.679 (se = 0.028 )
#> Likelihood ratio test= 39.61 on 1 df, p=3e-10
#> Wald test = 49.27 on 1 df, p=2e-12
#> Score (logrank) test = 54.96 on 1 df, p=1e-13
```

## numeric Predictor with Maximum Effect Size

Function add_num() treats each additional predictor as a numeric variable, and updates the starting model with each additional predictor. Function add_num() returns an 'add_num' object, which is a listof objects with an internal class 'add_num_'.

The S3 generic sort_by() sorts the 'add_num' object by the absolute value of the regression coefficient (i.e., effect size) of the additional numeric predictor.

The S3 generic head() chooses the top n elements from the object returned by the previous step.
```r
set.seed(14837); m1 = m |>
  add_num(x = ~ logKi67.quantile, mc.cores = 1L) |>
  sort_by(y = abs(effsize)) |>
  head(n = 2L)
m1
#> .slice(logKi67.quantile, "0.99")
#> .slice(logKi67.quantile, "0.98")
```

The S3 generic predict() uses model m1 on the test set s1.
```r
m1[1L] |> predict(newdata = s1)
#> .slice(logKi67.quantile, "0.99") :
#> Call:
#> coxph(formula = PFS ~ Tstage + x., data = newd)
#>
#> coef exp(coef) se(coef) z p
#> Tstage 0.4068 1.5020 0.2359 1.724 0.0847
#> x. 0.1114 1.1179 0.1926 0.579 0.5629
#>
#> Likelihood ratio test=3.21 on 2 df, p=0.2008
#> n= 125, number of events= 28
```

## logical Predictor with Maximum Effect Size

### add_dummy(): Naive Practice

Function add_dummy() partitions each additional numeric predictor into a logical variable using function node1(), and updates the starting model by adding in each of the dichotomized logical predictors. Function add_dummy() returns an 'add_dummy' object, which is a listof node1 objects.

The S3 generic subset() subsets the 'add_dummy' object by the balance of the partition of the additional predictor (e.g., the proportion p1 of TRUE values).

The S3 generic sort_by() sorts the 'add_dummy' object by the absolute value of the regression coefficient (i.e., effect size) of the additional logical predictor.

The S3 generic head() chooses the top n elements from the object returned by the previous step.
```r
set.seed(14837); m2 = m |>
  add_dummy(x = ~ logKi67.quantile, mc.cores = 1L) |>
  subset(subset = p1 > .05 & p1 < .95) |>
  sort_by(y = abs(effsize)) |>
  head(n = 2L)
m2
#> .slice(logKi67.quantile, "0.7")>=6.60981019916946
#> .slice(logKi67.quantile, "0.01")>=6.02789726883965
```

The S3 generic predict() uses model m2 on the test set s1.
```r
m2[1L] |> predict(newdata = s1)
#> .slice(logKi67.quantile, "0.7")>=6.60981019916946 :
#> Call:
#> coxph(formula = PFS ~ Tstage + x., data = newd)
#>
#> coef exp(coef) se(coef) z p
#> Tstage 0.3644 1.4397 0.2402 1.517 0.1292
#> x.TRUE 0.6600 1.9348 0.3841 1.718 0.0857
#>
#> Likelihood ratio test=5.83 on 2 df, p=0.05415
#> n= 125, number of events= 28
```

### add_dummy_partition(): via Repeated Partitions

Function add_dummy_partition() partitions each additional numeric predictor into a logical variable in the following steps.

1. Generate multiple partitions of the data into training and test sets.
2. Within each partition, create a dichotomizing rule (via function node1()) on the training set. Apply this dichotomizing rule on the test set and obtain the estimated regression coefficient (i.e., effect size) of the additional logical predictor.
3. Among all partitions, select the one with the median effect size to determine the final logical predictor.

Function add_dummy_partition() also returns an 'add_dummy' object.

```r
set.seed(14837); m3 = m |>
  add_dummy_partition(~ logKi67.quantile, times = 20L, mc.cores = 1L) |>
  subset(subset = p1 > .15 & p1 < .85) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)
m3
#> .slice(logKi67.quantile, "0.14")>=6.26711450931504
#> .slice(logKi67.quantile, "0.26")>=6.38090570405786
```

```r
m3[1L] |> predict(newdata = s1)
#> .slice(logKi67.quantile, "0.14")>=6.26711450931504 :
#> Call:
#> coxph(formula = PFS ~ Tstage + x., data = newd)
#>
#> coef exp(coef) se(coef) z p
#> Tstage 0.3912 1.4788 0.2348 1.666 0.0956
#> x.TRUE 0.3783 1.4598 0.3918 0.966 0.3343
#>
#> Likelihood ratio test=3.79 on 2 df, p=0.1505
#> n= 125, number of events= 28
```