The primary aim of the dataSDA package is to gather symbolic datasets tailored to different research themes, to read, write, and convert symbolic data in diverse formats, and to compute descriptive statistics of symbolic variables. The package is currently available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=dataSDA and at https://hmwu.idv.tw/dataSDA/.
Current version (built on April 30, 2025): dataSDA_0.1.1.zip
symbolic_tbl class
To illustrate two distinct classes of interval-valued datasets in R, we use two of these datasets: Abalone and mushroom. The Abalone dataset includes 24 observations, each featuring 7 interval-valued variables. It is an object of the symbolic_tbl class, and each variable within it belongs to the symbolic_n class. Both classes are defined by the RSDA package. Storing the dataset as a symbolic_tbl object makes it more straightforward to apply the symbolic data methods provided by RSDA. The Abalone dataset is a built-in dataset in the RSDA package. A copy of it is also included in the dataSDA package under the name Abalone.int.
> library(dataSDA)
> data(Abalone.iGAP)
> head(Abalone.iGAP)
#> Length Diameter Height Whole
#> F-10-12 0.1275,0.9975 0.075, 0.815 -0.0175, 0.3125 -1.021, 3.883
#> F-13-15 0.1775,1.0275 0.125,0.825 0.025, 0.325 -0.8567, 3.6303
#> F-16-18 0.22,0.92 0.1725, 0.7425 0.0375, 0.3075 -0.5725, 3.1235
#> F-19-21 0.3725, 0.8425 0.2575, 0.6875 0.0825, 0.2525 -0.0368, 2.8443
#> F-23-24 0.275, 0.975 0.255, 0.755 0.09, 0.27 -0.303, 3.469
#> F-25-29 0.475, 0.775 0.405, 0.645 0.1625, 0.2325 0.915, 2.105
#> Shucked Viscera Shell
#> F-10-12 -0.6322, 2.1948 -0.2077, 0.7712 -0.258, 1.054
#> F-13-15 -0.4548, 1.7942 -0.1905, 0.7555 -0.269, 1.153
#> F-16-18 -0.244, 1.206 -0.1037, 0.6752 -0.3233, 1.4477
#> F-19-21 -0.16, 1.14 -0.033, 0.615 -0.1175, 1.1725
#> F-23-24 -0.2295, 1.3205 -0.13, 0.83 0.005, 0.945
#> F-25-29 0.134, 0.896 0.1467, 0.3798 0.45, 0.55
> class(Abalone.iGAP)
#> [1] "data.frame"
> data(Abalone)
> head(Abalone)
#> Length_min Length_max Diameter_min Diameter_max Height_min Height_max
#> F-10-12 0.1275 0.9975 0.075 0.815 -0.0175 0.3125
#> F-13-15 0.1775 1.0275 0.125 0.825 0.025 0.325
#> F-16-18 0.22 0.92 0.1725 0.7425 0.0375 0.3075
#> F-19-21 0.3725 0.8425 0.2575 0.6875 0.0825 0.2525
#> F-23-24 0.275 0.975 0.255 0.755 0.09 0.27
#> F-25-29 0.475 0.775 0.405 0.645 0.1625 0.2325
#> Whole_min Whole_max Shucked_min Shucked_max Viscera_min Viscera_max
#> F-10-12 -1.021 3.883 -0.6322 2.1948 -0.2077 0.7712
#> F-13-15 -0.8567 3.6303 -0.4548 1.7942 -0.1905 0.7555
#> F-16-18 -0.5725 3.1235 -0.244 1.206 -0.1037 0.6752
#> F-19-21 -0.0368 2.8443 -0.16 1.14 -0.033 0.615
#> F-23-24 -0.303 3.469 -0.2295 1.3205 -0.13 0.83
#> F-25-29 0.915 2.105 0.134 0.896 0.1467 0.3798
#> Shell_min Shell_max
#> F-10-12 -0.258 1.054
#> F-13-15 -0.269 1.153
#> F-16-18 -0.3233 1.4477
#> F-19-21 -0.1175 1.1725
#> F-23-24 0.005 0.945
#> F-25-29 0.45 0.55
> class(Abalone)
#> [1] "data.frame"
The mushroom dataset consists of 23 species described by 3 interval-valued variables: stipe length, stipe thickness, and pileus cap width. The dataset also contains two categorical variables: Species and Edibility. By default the dataset is a data.frame, and we would like to convert it into the symbolic_tbl class.
> data(mushroom)
> head(mushroom)
#> Species Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min
#> 1 arorae 3 8 4
#> 2 arvenis 6 21 4
#> 3 benesi 4 8 5
#> 4 bernardii 7 6 4
#> 5 bisporus 5 12 2
#> 6 bitorquis 5 15 4
#> Stipe.Length_max Stipe.Thickness_min Stipe.Thickness_max Edibility
#> 1 9 0.5 2.5 U
#> 2 14 1.0 3.5 Y
#> 3 11 1.0 2.0 Y
#> 4 7 3.0 4.5 Y
#> 5 5 1.5 2.5 Y
#> 6 10 2.0 4.0 Y
Within the framework of symbolic data analysis (SDA), the variables “Species” and “Edibility” are treated as set variables. We have developed the set_variable_format function to create pseudo-variables that correspond to the categories of a specified categorical variable, using one-hot encoding. The location argument denotes the position of the set variable in the data. After the dataset is restructured, the values of the “Species” and “Edibility” variables are replaced by the number of categories of each variable.
> mushroom_set <- set_variable_format(data = mushroom, location = 8,
+ var = "Species")
> head(mushroom_set, 3)
#> Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 23 1 0 0 0 0 0 0
#> 2 23 0 1 0 0 0 0 0
#> 3 23 0 0 1 0 0 0 0
#> campestris comtulus cupreo-brunneus diminutives fuseo-fibrillosus
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> fuscovelatus hondensis lilaceps micromegathus praeclaresquamosus pattersonae
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 0 0 0 0 0 0
#> perobscurus semotus silvicola subrutilescens xanthodermus
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min Stipe.Length_max
#> 1 3 8 4 9
#> 2 6 21 4 14
#> 3 4 8 5 11
#> Stipe.Thickness_min Stipe.Thickness_max Edibility U Y T
#> 1 0.5 2.5 3 1 0 0
#> 2 1.0 3.5 3 0 1 0
#> 3 1.0 2.0 3 0 1 0
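The one-hot step can be sketched in base R. This is only an illustration of the idea under simplifying assumptions, not the implementation of set_variable_format: the helper one_hot_set and the toy data frame are hypothetical, and the count column is always placed first rather than at a chosen location.

```r
# Hypothetical helper: replace a categorical column by the number of its
# categories, followed by one 0/1 indicator column per category.
one_hot_set <- function(data, var) {
  values <- as.character(data[[var]])
  categories <- unique(values)  # order of first appearance
  indicators <- lapply(categories, function(l) as.integer(values == l))
  names(indicators) <- categories
  indicators <- data.frame(indicators, check.names = FALSE)
  count <- data.frame(rep(length(categories), nrow(data)))
  names(count) <- var
  others <- data[, setdiff(names(data), var), drop = FALSE]
  cbind(count, indicators, others)
}

# Toy data (assumed for illustration only).
toy <- data.frame(Edibility = c("U", "Y", "Y", "T"),
                  Width_min = c(3, 6, 4, 7))
toy_set <- one_hot_set(toy, "Edibility")
print(toy_set)
```

As in the output above, the original column now holds the number of categories (3 here), and each row has a single 1 among the indicator columns.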
To adhere to the formatting conventions of RSDA, we have implemented the RSDA_format function. This function places a label beginning with the $ symbol before each variable to indicate the type of symbolic variable. Specifically, set variables are labeled with $S, and interval-valued variables are labeled with $I. The syntax of the RSDA_format function with its arguments is as follows.
RSDA_format(data, sym_type1, location, sym_type2, var)
- data: a conventional data frame.
- sym_type1, sym_type2: the labels; I means an interval variable and S means a set variable.
- location: the location of the sym_type in the data.
- var: the name of the symbolic variable in the data.
> mushroom_tmp <- RSDA_format(data = mushroom_set,
+ sym_type1 = c("I", "I", "I", "S"),
+ location = c(25, 27, 29, 31),
+ sym_type2 = c("S"),
+ var = c("Species"))
> head(mushroom_tmp, 3)
#> $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S 23 1 0 0 0 0 0 0
#> 2 $S 23 0 1 0 0 0 0 0
#> 3 $S 23 0 0 1 0 0 0 0
#> campestris comtulus cupreo-brunneus diminutives fuseo-fibrillosus
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> fuscovelatus hondensis lilaceps micromegathus praeclaresquamosus pattersonae
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 0 0 0 0 0 0
#> perobscurus semotus silvicola subrutilescens xanthodermus $I
#> 1 0 0 0 0 0 $I
#> 2 0 0 0 0 0 $I
#> 3 0 0 0 0 0 $I
#> Pileus.Cap.Width_min Pileus.Cap.Width_max $I Stipe.Length_min
#> 1 3 8 $I 4
#> 2 6 21 $I 4
#> 3 4 8 $I 5
#> Stipe.Length_max $I Stipe.Thickness_min Stipe.Thickness_max $S Edibility U Y
#> 1 9 $I 0.5 2.5 $S 3 1 0
#> 2 14 $I 1.0 3.5 $S 3 0 1
#> 3 11 $I 1.0 2.0 $S 3 0 1
#> T
#> 1 0
#> 2 0
#> 3 0
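The placement of the label columns can be imitated in base R. The helper add_type_labels below is hypothetical and only mirrors the layout visible in the output above, where a $I or $S column sits immediately before the columns of each variable; it is not the code of RSDA_format.

```r
# Hypothetical helper: insert a label column such as "$I" immediately
# before each position listed in `location`.
add_type_labels <- function(data, sym_type, location) {
  stopifnot(length(sym_type) == length(location))
  cols <- as.list(data)
  # Insert from right to left so that earlier positions stay valid.
  for (k in order(location, decreasing = TRUE)) {
    label <- paste0("$", sym_type[k])
    new_col <- list(rep(label, nrow(data)))
    names(new_col) <- label
    cols <- append(cols, new_col, after = location[k] - 1)
  }
  data.frame(cols, row.names = row.names(data),
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Toy min/max table (assumed for illustration only).
toy_mm <- data.frame(AD_min = c(155.00, 154.00), AD_max = c(157.00, 160.01),
                     BC_min = c(58.00, 57.00), BC_max = c(61.01, 64.00),
                     row.names = c("FRA1", "FRA2"))
toy_labeled <- add_type_labels(toy_mm, sym_type = c("I", "I"),
                               location = c(1, 3))
print(names(toy_labeled))
```

Setting check.names = FALSE matters in this sketch: otherwise R would rewrite the repeated $I names into syntactically valid, unique ones.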
The suffixes _min and _max are removed from the variable names using the clean_colnames function. The modified dataset is then written out using the write_csv_table function. This external data file is read back using the read.sym.table function, which is provided by the RSDA package. Upon import, the dataset is an object of the symbolic_tbl class.
> mushroom_clean <- clean_colnames(data = mushroom_tmp)
> head(mushroom_clean, 3)
#> $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S 23 1 0 0 0 0 0 0
#> 2 $S 23 0 1 0 0 0 0 0
#> 3 $S 23 0 0 1 0 0 0 0
#> campestris comtulus cupreo-brunneus dutives fuseo-fibrillosus fuscovelatus
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 0 0 0 0 0 0
#> hondensis lilaceps micromegathus praeclaresquamosus pattersonae perobscurus
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 0 0 0 0 0 0
#> semotus silvicola subrutilescens xanthodermus $I Pileus.Cap.Width
#> 1 0 0 0 0 $I 3
#> 2 0 0 0 0 $I 6
#> 3 0 0 0 0 $I 4
#> Pileus.Cap.Width $I Stipe.Length Stipe.Length $I Stipe.Thickness
#> 1 8 $I 4 9 $I 0.5
#> 2 21 $I 4 14 $I 1.0
#> 3 8 $I 5 11 $I 1.0
#> Stipe.Thickness $S Edibility U Y T
#> 1 2.5 $S 3 1 0 0
#> 2 3.5 $S 3 0 1 0
#> 3 2.0 $S 3 0 1 0
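Removing the bound suffixes amounts to a single regular-expression substitution on the column names. The sketch below uses a hypothetical helper, strip_bound_suffix, and toy column names; it is one possible way to do the renaming and is not taken from clean_colnames.

```r
# Hypothetical helper: drop a trailing "_min" or "_max" from column names.
strip_bound_suffix <- function(data) {
  names(data) <- sub("_(min|max)$", "", names(data))
  data
}

# Toy table (assumed for illustration only).
toy <- data.frame("$I", 3, 8, 0)
names(toy) <- c("$I", "Pileus.Cap.Width_min", "Pileus.Cap.Width_max",
                "diminutives")
toy_clean <- strip_bound_suffix(toy)
print(names(toy_clean))
```

Anchoring the pattern with the underscore and the end-of-string marker is a deliberate choice in this sketch: a name that merely contains the letters min, such as diminutives, is left unchanged.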
Write the cleaned data frame to a CSV file and read it back as a symbolic_tbl object.
> write_csv_table(data = mushroom_clean, file = "mushroom_interval.csv")
> library(RSDA)
> mushroom_int <- read.sym.table(file = 'mushroom_interval.csv',
+ header = TRUE, sep = ';', dec = '.',
+ row.names = 1)
> head(mushroom_int, 3)
#> # A tibble: 3 × 5
#> Species Pileus.Cap.Width Stipe.Length Stipe.Thickness Edibility
#> <symblc_s> <symblc_n> <symblc_n> <symblc_n> <symblc_s>
#> 1 {arorae} [3.00 : 8.00] [4.00 : 9.00] [0.50 : 2.50] {U}
#> 2 {arvenis} [6.00 : 21.00] [4.00 : 14.00] [1.00 : 3.50] {Y}
#> 3 {benesi} [4.00 : 8.00] [5.00 : 11.00] [1.00 : 2.00] {Y}
> class(mushroom_int)
#> [1] "symbolic_tbl" "tbl_df" "tbl" "data.frame"
MatH class
To demonstrate how to convert a dataset into a matrix of histogram-valued data, specifically an object of the MatH class, we use two datasets: BLOOD and Weight. This class is provided by the HistDAWass package. The BLOOD dataset is a MatH object supplied by the HistDAWass package. It comprises 14 groups of patients, each characterized by three distributional variables. Each distribution within a cell is summarized by its mean and standard deviation.
Below, we illustrate how to transform a list object into an instance of the MatH class, named Weight. We use the distributionH function from the HistDAWass package to encapsulate the histogram-valued data held in the list. The constructed Weight dataset comprises 3 observations with 1 variable. We then use the new function from the methods package to create the data object as a member of the MatH class. As a result, we can apply the analysis methods offered by the HistDAWass package to objects of the MatH class.
> library(HistDAWass)
> BLOOD[1:3, 1:2]
#> a matrix of distributions
#> 2 variables 3 rows
#> each distibution in the cell is represented by the mean and the standard deviation
#> Cholesterol Hemoglobin
#> u1: F-20 [m= 150.1 ,s= 26.336 ] [m= 13.695 ,s= 0.55031 ]
#> u2: F-30 [m= 150.71 ,s= 25.284 ] [m= 12.158 ,s= 0.52834 ]
#> u3: F-40 [m= 164.96 ,s= 25.334 ] [m= 12.134 ,s= 0.50739 ]
> A1 <- c(50, 60, 70, 80, 90, 100, 110, 120)
> B1 <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00)
> A2 <- c(50, 60, 70, 80, 90, 100, 110, 120)
> B2 <- c(0.00, 0.05, 0.12, 0.42, 0.68, 0.88, 0.94, 1.00)
> A3 <- c(50, 60, 70, 80, 90, 100, 110, 120)
> B3 <- c(0.00, 0.03, 0.24, 0.36, 0.75, 0.85, 0.98, 1.00)
> List <- list(A1, B1, A2, B2, A3, B3)
> List
#> [[1]]
#> [1] 50 60 70 80 90 100 110 120
#>
#> [[2]]
#> [1] 0.00 0.02 0.08 0.32 0.62 0.86 0.92 1.00
#>
#> [[3]]
#> [1] 50 60 70 80 90 100 110 120
#>
#> [[4]]
#> [1] 0.00 0.05 0.12 0.42 0.68 0.88 0.94 1.00
#>
#> [[5]]
#> [1] 50 60 70 80 90 100 110 120
#>
#> [[6]]
#> [1] 0.00 0.03 0.24 0.36 0.75 0.85 0.98 1.00
>
> ListOfWeight <- vector("list", 3)
> for (i in seq_along(ListOfWeight)) {
+ # List stores each histogram as a pair: breakpoints, then cumulative probabilities
+ ListOfWeight[[i]] <- distributionH(List[[2 * i - 1]], List[[2 * i]])
+ }
> Weight <- methods::new("MatH",
+ nrows = 3, ncols = 1, ListOfDist = ListOfWeight,
+ names.rows = c("20s", "30s", "40s"),
+ names.cols = c("weight"), by.row = FALSE)
> Weight
#> a matrix of distributions
#> 1 variables 3 rows
#> each distibution in the cell is represented by the mean and the standard deviation
#> weight
#> 20s [m= 86.8 ,s= 13.824 ]
#> 30s [m= 84.1 ,s= 14.44 ]
#> 40s [m= 82.9 ,s= 14.385 ]
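The mean and standard deviation printed in each cell can be recovered by hand from the breakpoints and cumulative probabilities. The sketch below assumes that the probability mass is spread uniformly within each bin; hist_moments is a hypothetical helper written for this illustration and is not part of HistDAWass.

```r
# Hypothetical helper: mean and standard deviation of a histogram given by
# bin breakpoints and cumulative probabilities at those breakpoints,
# assuming a uniform density inside each bin.
hist_moments <- function(breaks, cum_prob) {
  p <- diff(cum_prob)                               # mass of each bin
  mid <- (head(breaks, -1) + tail(breaks, -1)) / 2  # bin midpoints
  width <- diff(breaks)                             # bin widths
  m <- sum(p * mid)
  # E[X^2] of a uniform bin is mid^2 + width^2 / 12.
  second_moment <- sum(p * (mid^2 + width^2 / 12))
  c(mean = m, sd = sqrt(second_moment - m^2))
}

A1 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B1 <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00)
moments_20s <- hist_moments(A1, B1)
print(round(moments_20s, 3))  # mean 86.8, sd 13.824
```

Under this assumption the result agrees with the 20s row shown above (m = 86.8, s = 13.824).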
To convert iGAP files to MM format:
> data(Face.iGAP)
> class(Face.iGAP)
#> [1] "data.frame"
> head(Face.iGAP)
#> AD BC AH DH EH
#> FRA1 155.00,157.00 58.00,61.01 100.45,103.28 105.00,107.30 61.40,65.73
#> FRA2 154.00,160.01 57.00,64.00 101.98,105.55 104.35,107.30 60.88,63.03
#> FRA3 154.01,161.00 57.00,63.00 99.36,105.65 101.04,109.04 60.95,65.60
#> HUS1 168.86,172.84 58.55,63.39 102.83,106.53 122.38,124.52 56.73,61.07
#> HUS2 169.85,175.03 60.21,64.38 102.94,108.71 120.24,124.52 56.73,62.37
#> HUS3 168.76,175.15 61.40,63.51 104.35,107.45 120.93,125.18 57.20,61.72
#> GH
#> FRA1 64.20,67.80
#> FRA2 62.94,66.47
#> FRA3 60.42,66.40
#> HUS1 60.44,64.54
#> HUS2 60.44,66.84
#> HUS3 58.14,67.08
> Face <- iGAP_to_MM(data = Face.iGAP, location = 1:6)
> head(Face)
#> AD_min AD_max BC_min BC_max AH_min AH_max DH_min DH_max EH_min EH_max
#> FRA1 155.00 157.00 58.00 61.01 100.45 103.28 105.00 107.30 61.40 65.73
#> FRA2 154.00 160.01 57.00 64.00 101.98 105.55 104.35 107.30 60.88 63.03
#> FRA3 154.01 161.00 57.00 63.00 99.36 105.65 101.04 109.04 60.95 65.60
#> HUS1 168.86 172.84 58.55 63.39 102.83 106.53 122.38 124.52 56.73 61.07
#> HUS2 169.85 175.03 60.21 64.38 102.94 108.71 120.24 124.52 56.73 62.37
#> HUS3 168.76 175.15 61.40 63.51 104.35 107.45 120.93 125.18 57.20 61.72
#> GH_min GH_max
#> FRA1 64.20 67.80
#> FRA2 62.94 66.47
#> FRA3 60.42 66.40
#> HUS1 60.44 64.54
#> HUS2 60.44 66.84
#> HUS3 58.14 67.08
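Conceptually, the conversion splits each "min,max" string at the comma and stores the two parts as numeric columns. The base R sketch below illustrates this with a hypothetical helper, igap_to_mm_sketch, and a two-row toy table; it is not the code of iGAP_to_MM and it converts every column rather than taking a location argument.

```r
# Hypothetical helper: split "min,max" strings into numeric _min/_max columns.
igap_to_mm_sketch <- function(data) {
  out <- list()
  for (v in names(data)) {
    parts <- strsplit(as.character(data[[v]]), ",", fixed = TRUE)
    # as.numeric() tolerates surrounding blanks such as " 0.815".
    out[[paste0(v, "_min")]] <- as.numeric(vapply(parts, `[`, character(1), 1))
    out[[paste0(v, "_max")]] <- as.numeric(vapply(parts, `[`, character(1), 2))
  }
  data.frame(out, row.names = row.names(data), check.names = FALSE)
}

# Toy iGAP-style table (first two rows and columns of the output above).
toy_igap <- data.frame(AD = c("155.00,157.00", "154.00,160.01"),
                       BC = c("58.00,61.01", "57.00,64.00"),
                       row.names = c("FRA1", "FRA2"))
toy_mm <- igap_to_mm_sketch(toy_igap)
print(toy_mm)
```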
Change the format of the data to conform to the RSDA format.
> Face.tmp <- RSDA_format(data = Face,
+ sym_type1 = c("I", "I", "I", "I", "I", "I"),
+ location = c(1, 3, 5, 7, 9, 11))
> head(Face.tmp)
#> $I AD_min AD_max $I BC_min BC_max $I AH_min AH_max $I DH_min DH_max $I
#> FRA1 $I 155.00 157.00 $I 58.00 61.01 $I 100.45 103.28 $I 105.00 107.30 $I
#> FRA2 $I 154.00 160.01 $I 57.00 64.00 $I 101.98 105.55 $I 104.35 107.30 $I
#> FRA3 $I 154.01 161.00 $I 57.00 63.00 $I 99.36 105.65 $I 101.04 109.04 $I
#> HUS1 $I 168.86 172.84 $I 58.55 63.39 $I 102.83 106.53 $I 122.38 124.52 $I
#> HUS2 $I 169.85 175.03 $I 60.21 64.38 $I 102.94 108.71 $I 120.24 124.52 $I
#> HUS3 $I 168.76 175.15 $I 61.40 63.51 $I 104.35 107.45 $I 120.93 125.18 $I
#> EH_min EH_max $I GH_min GH_max
#> FRA1 61.40 65.73 $I 64.20 67.80
#> FRA2 60.88 63.03 $I 62.94 66.47
#> FRA3 60.95 65.60 $I 60.42 66.40
#> HUS1 56.73 61.07 $I 60.44 64.54
#> HUS2 56.73 62.37 $I 60.44 66.84
#> HUS3 57.20 61.72 $I 58.14 67.08
Clean up variable names to conform to the RSDA format.
> Face.clean <- clean_colnames(data = Face.tmp)
> head(Face.clean)
#> $I AD AD $I BC BC $I AH AH $I DH DH $I EH
#> FRA1 $I 155.00 157.00 $I 58.00 61.01 $I 100.45 103.28 $I 105.00 107.30 $I 61.40
#> FRA2 $I 154.00 160.01 $I 57.00 64.00 $I 101.98 105.55 $I 104.35 107.30 $I 60.88
#> FRA3 $I 154.01 161.00 $I 57.00 63.00 $I 99.36 105.65 $I 101.04 109.04 $I 60.95
#> HUS1 $I 168.86 172.84 $I 58.55 63.39 $I 102.83 106.53 $I 122.38 124.52 $I 56.73
#> HUS2 $I 169.85 175.03 $I 60.21 64.38 $I 102.94 108.71 $I 120.24 124.52 $I 56.73
#> HUS3 $I 168.76 175.15 $I 61.40 63.51 $I 104.35 107.45 $I 120.93 125.18 $I 57.20
#> EH $I GH GH
#> FRA1 65.73 $I 64.20 67.80
#> FRA2 63.03 $I 62.94 66.47
#> FRA3 65.60 $I 60.42 66.40
#> HUS1 61.07 $I 60.44 64.54
#> HUS2 62.37 $I 60.44 66.84
#> HUS3 61.72 $I 58.14 67.08
Write the symbolic data table to a CSV data file.
> write_csv_table(data = Face.clean, file = "Face_interval.csv")
Read the symbolic data table and check the format.
> Face.interval <- read.sym.table(file = 'Face_interval.csv', header = TRUE, sep = ';', dec = '.', row.names = 1)
> head(Face.interval)
#> # A tibble: 6 × 6
#> AD BC AH DH
#> <symblc_n> <symblc_n> <symblc_n> <symblc_n>
#> 1 [155.00 : 157.00] [58.00 : 61.01] [100.45 : 103.28] [105.00 : 107.30]
#> 2 [154.00 : 160.01] [57.00 : 64.00] [101.98 : 105.55] [104.35 : 107.30]
#> 3 [154.01 : 161.00] [57.00 : 63.00] [99.36 : 105.65] [101.04 : 109.04]
#> 4 [168.86 : 172.84] [58.55 : 63.39] [102.83 : 106.53] [122.38 : 124.52]
#> 5 [169.85 : 175.03] [60.21 : 64.38] [102.94 : 108.71] [120.24 : 124.52]
#> 6 [168.76 : 175.15] [61.40 : 63.51] [104.35 : 107.45] [120.93 : 125.18]
#> # ℹ 2 more variables: EH <symblc_n>, GH <symblc_n>
Convert RSDA format interval dataframe to MM format.
> Face.MM <- RSDA_to_MM(Face.interval, RSDA = TRUE)
> head(Face.MM)
#> AD_min AD_max BC_min BC_max AH_min AH_max DH_min DH_max EH_min EH_max GH_min
#> 1 155.00 157.00 58.00 61.01 100.45 103.28 105.00 107.30 61.40 65.73 64.20
#> 2 154.00 160.01 57.00 64.00 101.98 105.55 104.35 107.30 60.88 63.03 62.94
#> 3 154.01 161.00 57.00 63.00 99.36 105.65 101.04 109.04 60.95 65.60 60.42
#> 4 168.86 172.84 58.55 63.39 102.83 106.53 122.38 124.52 56.73 61.07 60.44
#> 5 169.85 175.03 60.21 64.38 102.94 108.71 120.24 124.52 56.73 62.37 60.44
#> 6 168.76 175.15 61.40 63.51 104.35 107.45 120.93 125.18 57.20 61.72 58.14
#> GH_max
#> 1 67.80
#> 2 66.47
#> 3 66.40
#> 4 64.54
#> 5 66.84
#> 6 67.08
Convert MM format interval dataframe to iGAP format.
> Face.iGAP_trans <- MM_to_iGAP(Face.MM)
> head(Face.iGAP_trans)
#> AD BC AH DH EH GH
#> 1 155,157 58,61.01 100.45,103.28 105,107.3 61.4,65.73 64.2,67.8
#> 2 154,160.01 57,64 101.98,105.55 104.35,107.3 60.88,63.03 62.94,66.47
#> 3 154.01,161 57,63 99.36,105.65 101.04,109.04 60.95,65.6 60.42,66.4
#> 4 168.86,172.84 58.55,63.39 102.83,106.53 122.38,124.52 56.73,61.07 60.44,64.54
#> 5 169.85,175.03 60.21,64.38 102.94,108.71 120.24,124.52 56.73,62.37 60.44,66.84
#> 6 168.76,175.15 61.4,63.51 104.35,107.45 120.93,125.18 57.2,61.72 58.14,67.08
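The reverse direction can be sketched by pasting each pair of bounds back into one string. The helper mm_to_igap_sketch below is hypothetical and only illustrates the idea. It also shows why trailing zeros disappear in the output above: numbers are converted to text with R's default formatting, so 155.00 becomes "155".

```r
# Hypothetical helper: join numeric _min/_max columns into "min,max" strings.
mm_to_igap_sketch <- function(data) {
  vars <- unique(sub("_(min|max)$", "", names(data)))
  out <- lapply(vars, function(v) {
    paste(data[[paste0(v, "_min")]], data[[paste0(v, "_max")]], sep = ",")
  })
  names(out) <- vars
  data.frame(out, row.names = row.names(data),
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Toy min/max table (assumed for illustration only).
toy_mm <- data.frame(AD_min = c(155.00, 154.00), AD_max = c(157.00, 160.01))
toy_igap <- mm_to_igap_sketch(toy_mm)
print(toy_igap)
```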
> data(mushroom.int)
> int_mean(mushroom.int, var_name = "Pileus.Cap.Width")
#> Pileus.Cap.Width
#> CM 7.978261
> int_mean(mushroom.int, var_name = 2:3)
#> Pileus.Cap.Width Stipe.Length
#> CM 7.978261 7.391304
>
> var_name <- c("Stipe.Length", "Stipe.Thickness")
> method <- c("CM", "FV", "EJD")
> int_mean(mushroom.int, var_name, method)
#> Stipe.Length Stipe.Thickness
#> CM 7.391304 1.823913
#> FV 10.304348 2.371739
#> EJD 7.391304 1.823913
> int_var(mushroom.int, var_name, method)
#> Stipe.Length Stipe.Thickness
#> CM 9.544466 0.9872431
#> FV 13.858573 1.1729910
#> EJD 12.651229 1.0836673
>
> var_name1 <- "Pileus.Cap.Width"
> var_name2 <- c("Stipe.Length", "Stipe.Thickness")
> method <- c("CM", "VM", "QM", "SE", "FV", "EJD", "GQ", "SPT")
> int_cov(mushroom.int, var_name1, var_name2, method)
#> $CM
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 8.417984 2.480657
#>
#> $VM
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 8.095985 2.385769
#>
#> $QM
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 14.02174 3.523277
#>
#> $SE
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 20.18647 4.714976
#>
#> $FV
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 10.37204 2.994745
#>
#> $EJD
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 8.051985 2.372802
#>
#> $GQ
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 11.46091 3.243229
#>
#> $SPT
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 11.95054 3.11936
> int_cor(mushroom.int, var_name1, var_name2, method)
#> $CM
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 0.8047063 0.7373264
#>
#> $VM
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 0.3555993 0.3984261
#>
#> $QM
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 0.857843 0.7619691
#>
#> $SE
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 0.8817636 0.7830681
#>
#> $FV
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 0.7496622 0.7440004
#>
#> $EJD
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 0.5695142 0.5734316
#>
#> $GQ
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 0.8106262 0.7837862
#>
#> $SPT
#> Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width 0.8452575 0.7538511
> data(BLOOD)
> hist_mean(BLOOD, "Cholesterol")
#> [1] 180.677
> hist_var(BLOOD, "Cholesterol")
#> [1] 1002.339
> hist_cov(BLOOD, 'Cholesterol', 'Hemoglobin', method = "B")
#> [1] -4.692686
> hist_cor(BLOOD, 'Cholesterol', 'Hemoglobin', method = "L2W")
#> [1] -0.4794806
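As a rough guide to what the interval statistics above compute, the sketch below works through one simple approach in base R: replace each interval by its midpoint and apply the classical estimators. It assumes that the label CM refers to such a centers-based method, which the text above does not state, and it uses R's default n - 1 divisor, which may differ from the package's choice. The three intervals are the first three mushroom rows shown earlier, so the numbers will not match the full-data output.

```r
# Midpoint of each interval (assumption: a centers-based method).
centers <- function(lower, upper) (lower + upper) / 2

# First three rows of the mushroom data shown earlier.
cap_mid   <- centers(c(3, 6, 4), c(8, 21, 8))   # Pileus.Cap.Width
stipe_mid <- centers(c(4, 4, 5), c(9, 14, 11))  # Stipe.Length

cm_mean <- mean(cap_mid)            # mean of the midpoints
cm_var  <- var(cap_mid)             # sample variance of the midpoints
cm_cov  <- cov(cap_mid, stipe_mid)  # covariance between midpoints
cm_cor  <- cor(cap_mid, stipe_mid)  # correlation between midpoints

print(c(mean = cm_mean, var = cm_var, cov = cm_cov, cor = cm_cor))
```

A midpoint-only summary ignores the width of each interval, which is one reason several alternative methods are offered by the functions above.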
We welcome contributions of high-quality datasets for symbolic data analysis. Submitted datasets will be made publicly available (or under specified constraints) to support research in machine learning, statistics, and related fields. You can submit the related files via email to wuhm@g.nccu.edu.tw or through the Google Form at Symbolic Dataset Submission Form. The submission requirements are as follows.
Dataset Format:
- .csv, .xlsx, or any symbolic format in plain text.
- Compressed (.zip or .gz) if multiple files are included.
Required Metadata:
Contributors must provide the following details:
Field | Description | Example |
---|---|---|
Dataset Name | A clear, descriptive title. | “face recognition data” |
Dataset Short Name | A clear, abbreviated title. | “face data” |
Authors | Full names of the donors. | “First name, Last name” |
Email | Contact email. | “abc123@gmail.com” |
Institutes | Affiliated organizations. | “-” |
Country | Origin of the dataset. | “France” |
Dataset Descriptions | A description of the data. | See ‘README’ |
Sample Size | Number of instances/rows. | 27 |
Number of Variables | Total features/columns (categorical/numeric). | 6 (interval) |
Missing Values | Indicate if missing values exist and how they’re handled. | “None” / “Yes, marked as NA” |
Variable Descriptions | Detailed description of each column (name, type, units, range). | See ‘README’ |
Source | Original data source (if applicable). | “Leroy et al. (1996)” |
References | Citations for prior work using the dataset. | “Douzal-Chouakria, Billard, and Diday (2011)” |
Applied Areas | Relevant fields (e.g., biology, finance). | “Machine Learning” |
Usage Constraints | Licensing (CC-BY, MIT) or restrictions. | “Public domain” |
Data Link | URL to download the dataset (Google Drive, GitHub, etc.). | “(https)” |
Quality Assurance:
Optional (Recommended):
README file with:
Po-Wei Chen, Chun-houh Chen, Han-Ming Wu (2025), dataSDA: datasets and basic statistics for symbolic data analysis in R. Technical report.