Introduction to dataSDA

Po-Wei Chen, Chun-houh Chen and Han-Ming Wu*

2025-04-30

1 Introduction

The primary aim of the dataSDA package is to gather symbolic datasets tailored to different research themes, to read, write, and convert symbolic data in diverse formats, and to compute descriptive statistics of symbolic variables. The package is currently available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=dataSDA and at https://hmwu.idv.tw/dataSDA/.

Current version (built on April 30, 2025): dataSDA_0.1.1.zip

2 Symbolic data formats conversion

2.1 Example: convert interval-valued datasets into the symbolic_tbl class

To illustrate two distinct classes of interval-valued datasets in R, we use two datasets: Abalone and mushroom. The Abalone dataset includes 24 observations, each featuring 7 interval-valued variables. It is an object of the symbolic_tbl class, and each variable within it belongs to the symbolic_n class. Both classes are defined by the RSDA package. Storing the dataset as a symbolic_tbl object makes it straightforward to apply the symbolic data methods provided by RSDA. The Abalone dataset is a built-in dataset in the RSDA package. A copy of it is also included in the dataSDA package under the name Abalone.int.

> library(dataSDA)
> data(Abalone.iGAP)
> head(Abalone.iGAP)
#>                 Length       Diameter          Height           Whole
#> F-10-12  0.1275,0.9975   0.075, 0.815 -0.0175, 0.3125   -1.021, 3.883
#> F-13-15  0.1775,1.0275    0.125,0.825    0.025, 0.325 -0.8567, 3.6303
#> F-16-18      0.22,0.92 0.1725, 0.7425  0.0375, 0.3075 -0.5725, 3.1235
#> F-19-21 0.3725, 0.8425 0.2575, 0.6875  0.0825, 0.2525 -0.0368, 2.8443
#> F-23-24   0.275, 0.975   0.255, 0.755      0.09, 0.27   -0.303, 3.469
#> F-25-29   0.475, 0.775   0.405, 0.645  0.1625, 0.2325    0.915, 2.105
#>                 Shucked         Viscera           Shell
#> F-10-12 -0.6322, 2.1948 -0.2077, 0.7712   -0.258, 1.054
#> F-13-15 -0.4548, 1.7942 -0.1905, 0.7555   -0.269, 1.153
#> F-16-18   -0.244, 1.206 -0.1037, 0.6752 -0.3233, 1.4477
#> F-19-21     -0.16, 1.14   -0.033, 0.615 -0.1175, 1.1725
#> F-23-24 -0.2295, 1.3205     -0.13, 0.83    0.005, 0.945
#> F-25-29    0.134, 0.896  0.1467, 0.3798      0.45, 0.55
> class(Abalone.iGAP)
#> [1] "data.frame"
> data(Abalone)
> head(Abalone)
#>         Length_min Length_max Diameter_min Diameter_max Height_min Height_max
#> F-10-12     0.1275     0.9975        0.075        0.815    -0.0175     0.3125
#> F-13-15     0.1775     1.0275        0.125        0.825      0.025      0.325
#> F-16-18       0.22       0.92       0.1725       0.7425     0.0375     0.3075
#> F-19-21     0.3725     0.8425       0.2575       0.6875     0.0825     0.2525
#> F-23-24      0.275      0.975        0.255        0.755       0.09       0.27
#> F-25-29      0.475      0.775        0.405        0.645     0.1625     0.2325
#>         Whole_min Whole_max Shucked_min Shucked_max Viscera_min Viscera_max
#> F-10-12    -1.021     3.883     -0.6322      2.1948     -0.2077      0.7712
#> F-13-15   -0.8567    3.6303     -0.4548      1.7942     -0.1905      0.7555
#> F-16-18   -0.5725    3.1235      -0.244       1.206     -0.1037      0.6752
#> F-19-21   -0.0368    2.8443       -0.16        1.14      -0.033       0.615
#> F-23-24    -0.303     3.469     -0.2295      1.3205       -0.13        0.83
#> F-25-29     0.915     2.105       0.134       0.896      0.1467      0.3798
#>         Shell_min Shell_max
#> F-10-12    -0.258     1.054
#> F-13-15    -0.269     1.153
#> F-16-18   -0.3233    1.4477
#> F-19-21   -0.1175    1.1725
#> F-23-24     0.005     0.945
#> F-25-29      0.45      0.55
> class(Abalone)
#> [1] "data.frame"

The mushroom dataset consists of 23 species described by 3 interval-valued variables: stipe length, stipe thickness and pileus cap width. The dataset also contains two categorical variables: Species and Edibility. The dataset is of the data.frame class by default, and we would like to convert it into the symbolic_tbl class.

> data(mushroom)
> head(mushroom) 
#>     Species Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min
#> 1    arorae                    3                    8                4
#> 2   arvenis                    6                   21                4
#> 3    benesi                    4                    8                5
#> 4 bernardii                    7                    6                4
#> 5  bisporus                    5                   12                2
#> 6 bitorquis                    5                   15                4
#>   Stipe.Length_max Stipe.Thickness_min Stipe.Thickness_max Edibility
#> 1                9                 0.5                 2.5         U
#> 2               14                 1.0                 3.5         Y
#> 3               11                 1.0                 2.0         Y
#> 4                7                 3.0                 4.5         Y
#> 5                5                 1.5                 2.5         Y
#> 6               10                 2.0                 4.0         Y

Within the framework of SDA, the variables “Species” and “Edibility” are treated as set variables. We have developed the set_variable_format function to create pseudo-variables that correspond to the categories of a specified categorical variable, employing the one-hot encoding method. The location argument denotes the position of the set variable in the data. Following the restructuring of the dataset, the values assigned to the “Species” and “Edibility” variables are modified to reflect the number of categories associated with each variable.

> mushroom_set <- set_variable_format(data = mushroom, location = 8, 
+                                        var = "Species")
> head(mushroom_set, 3)
#>   Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1      23      1       0      0         0        0         0           0
#> 2      23      0       1      0         0        0         0           0
#> 3      23      0       0      1         0        0         0           0
#>   campestris comtulus cupreo-brunneus diminutives fuseo-fibrillosus
#> 1          0        0               0           0                 0
#> 2          0        0               0           0                 0
#> 3          0        0               0           0                 0
#>   fuscovelatus hondensis lilaceps micromegathus praeclaresquamosus pattersonae
#> 1            0         0        0             0                  0           0
#> 2            0         0        0             0                  0           0
#> 3            0         0        0             0                  0           0
#>   perobscurus semotus silvicola subrutilescens xanthodermus
#> 1           0       0         0              0            0
#> 2           0       0         0              0            0
#> 3           0       0         0              0            0
#>   Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min Stipe.Length_max
#> 1                    3                    8                4                9
#> 2                    6                   21                4               14
#> 3                    4                    8                5               11
#>   Stipe.Thickness_min Stipe.Thickness_max Edibility U Y T
#> 1                 0.5                 2.5         3 1 0 0
#> 2                 1.0                 3.5         3 0 1 0
#> 3                 1.0                 2.0         3 0 1 0
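
The restructuring described above can be sketched in base R. This is only an illustration of one-hot encoding on a toy column, not the implementation of set_variable_format; the object names toy, levels_seen, indicators and toy_set are made up for the example.

```r
# Toy data: a categorical column to be expanded (values are illustrative only).
toy <- data.frame(Edibility = c("U", "Y", "Y", "T"),
                  stringsAsFactors = FALSE)

# Keep the categories in order of first appearance.
levels_seen <- unique(toy$Edibility)

# One indicator column per category: 1 if the row has that category, else 0.
indicators <- sapply(levels_seen,
                     function(lv) as.integer(toy$Edibility == lv))

# Replace the original values by the number of categories and append the
# indicator columns, mirroring the layout of the output printed above.
toy_set <- data.frame(Edibility = length(levels_seen), indicators,
                      check.names = FALSE)
print(toy_set)
```

Each row has exactly one indicator equal to 1, and the original column now holds the category count, which tells a reader of the file how many indicator columns follow.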

To adhere to the formatting conventions of RSDA, we have implemented the RSDA_format function. This function prefixes each variable with a $ symbol to indicate the type of symbolic variable. Specifically, set variables are prefixed with $S, and interval-valued variables are prefixed with $I. The syntax of the RSDA_format function with its arguments is as follows.

RSDA_format(data, sym_type1, location, sym_type2, var)
> mushroom_tmp <- RSDA_format(data = mushroom_set, 
+                                sym_type1 = c("I", "I", "I", "S"),
+                                location = c(25, 27, 29, 31), 
+                                sym_type2 = c("S"),
+                                var = c("Species"))
> head(mushroom_tmp, 3)
#>   $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S      23      1       0      0         0        0         0           0
#> 2 $S      23      0       1      0         0        0         0           0
#> 3 $S      23      0       0      1         0        0         0           0
#>   campestris comtulus cupreo-brunneus diminutives fuseo-fibrillosus
#> 1          0        0               0           0                 0
#> 2          0        0               0           0                 0
#> 3          0        0               0           0                 0
#>   fuscovelatus hondensis lilaceps micromegathus praeclaresquamosus pattersonae
#> 1            0         0        0             0                  0           0
#> 2            0         0        0             0                  0           0
#> 3            0         0        0             0                  0           0
#>   perobscurus semotus silvicola subrutilescens xanthodermus $I
#> 1           0       0         0              0            0 $I
#> 2           0       0         0              0            0 $I
#> 3           0       0         0              0            0 $I
#>   Pileus.Cap.Width_min Pileus.Cap.Width_max $I Stipe.Length_min
#> 1                    3                    8 $I                4
#> 2                    6                   21 $I                4
#> 3                    4                    8 $I                5
#>   Stipe.Length_max $I Stipe.Thickness_min Stipe.Thickness_max $S Edibility U Y
#> 1                9 $I                 0.5                 2.5 $S         3 1 0
#> 2               14 $I                 1.0                 3.5 $S         3 0 1
#> 3               11 $I                 1.0                 2.0 $S         3 0 1
#>   T
#> 1 0
#> 2 0
#> 3 0
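
To make the tagging step concrete, the following base R sketch inserts a type-tag column in front of each variable of a small min-max table. The helper add_type_tags is hypothetical and much simpler than RSDA_format: it assumes the tagged variables start at column 1 and cover every column.

```r
# Toy data in min-max layout (values from the first rows printed above).
mm <- data.frame(Stipe.Length_min    = c(4, 4, 5),
                 Stipe.Length_max    = c(9, 14, 11),
                 Stipe.Thickness_min = c(0.5, 1.0, 1.0),
                 Stipe.Thickness_max = c(2.5, 3.5, 2.0))

# Hypothetical helper: `starts` gives the first column of each variable,
# `tags` the matching type tag ("$I" for interval, "$S" for set).
add_type_tags <- function(df, starts, tags) {
  ends <- c(starts[-1] - 1, ncol(df))
  pieces <- vector("list", length(starts))
  for (k in seq_along(starts)) {
    tag_col <- data.frame(rep(tags[k], nrow(df)), stringsAsFactors = FALSE)
    names(tag_col) <- tags[k]
    # cbind on data frames keeps duplicated column names such as "$I".
    pieces[[k]] <- cbind(tag_col, df[, starts[k]:ends[k], drop = FALSE])
  }
  do.call(cbind, pieces)
}

tagged <- add_type_tags(mm, starts = c(1, 3), tags = c("$I", "$I"))
print(tagged)
```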

The _min and _max suffixes are removed from the variable names using the clean_colnames function. Subsequently, the modified dataset is written out with the write_csv_table function. This external data file is then read using the read.sym.table function, which is provided by the RSDA package. Upon import, the dataset is of the symbolic_tbl class.

> mushroom_clean <- clean_colnames(data = mushroom_tmp)
> head(mushroom_clean, 3)
#>   $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S      23      1       0      0         0        0         0           0
#> 2 $S      23      0       1      0         0        0         0           0
#> 3 $S      23      0       0      1         0        0         0           0
#>   campestris comtulus cupreo-brunneus dutives fuseo-fibrillosus fuscovelatus
#> 1          0        0               0       0                 0            0
#> 2          0        0               0       0                 0            0
#> 3          0        0               0       0                 0            0
#>   hondensis lilaceps micromegathus praeclaresquamosus pattersonae perobscurus
#> 1         0        0             0                  0           0           0
#> 2         0        0             0                  0           0           0
#> 3         0        0             0                  0           0           0
#>   semotus silvicola subrutilescens xanthodermus $I Pileus.Cap.Width
#> 1       0         0              0            0 $I                3
#> 2       0         0              0            0 $I                6
#> 3       0         0              0            0 $I                4
#>   Pileus.Cap.Width $I Stipe.Length Stipe.Length $I Stipe.Thickness
#> 1                8 $I            4            9 $I             0.5
#> 2               21 $I            4           14 $I             1.0
#> 3                8 $I            5           11 $I             1.0
#>   Stipe.Thickness $S Edibility U Y T
#> 1             2.5 $S         3 1 0 0
#> 2             3.5 $S         3 0 1 0
#> 3             2.0 $S         3 0 1 0
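
For readers who want to strip the suffixes by hand, a minimal base R sketch follows. It is not the code of clean_colnames; it anchors the pattern at the end of each name, so a name that merely contains "min" (such as diminutives, which is printed as dutives above) would be left unchanged.

```r
cols <- c("$I", "Pileus.Cap.Width_min", "Pileus.Cap.Width_max",
          "$I", "Stipe.Length_min", "Stipe.Length_max", "diminutives")

# Drop a trailing "_min" or "_max"; the `$` anchor leaves other text alone.
cleaned <- sub("_(min|max)$", "", cols)
print(cleaned)
```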

Write the cleaned data object to a CSV file, then read it back as a symbolic_tbl object.

> write_csv_table(data = mushroom_clean, file = "mushroom_interval.csv")
> mushroom_int <- read.sym.table(file = 'mushroom_interval.csv', 
+                                   header = T, sep = ';', dec = '.', 
+                                   row.names = 1)
> head(mushroom_int, 3)
#> # A tibble: 3 × 5
#>      Species Pileus.Cap.Width   Stipe.Length Stipe.Thickness  Edibility
#>   <symblc_s>       <symblc_n>     <symblc_n>      <symblc_n> <symblc_s>
#> 1   {arorae}    [3.00 : 8.00]  [4.00 : 9.00]   [0.50 : 2.50]        {U}
#> 2  {arvenis}   [6.00 : 21.00] [4.00 : 14.00]   [1.00 : 3.50]        {Y}
#> 3   {benesi}    [4.00 : 8.00] [5.00 : 11.00]   [1.00 : 2.00]        {Y}
> class(mushroom_int)
#> [1] "symbolic_tbl" "tbl_df"       "tbl"          "data.frame"

2.2 Example: the conversion of histogram-valued datasets into the MatH class

To demonstrate the conversion of a dataset into the MatH class (a matrix of histogram-valued data), we use two datasets: BLOOD and Weight. The MatH class is provided by the HistDAWass package.

The BLOOD dataset is a MatH object, supplied by the HistDAWass package, and it encompasses 14 groups of patients, each characterized by three distributional variables. Each distribution within a cell is depicted by its mean and standard deviation.

> library(dataSDA)
> data(BLOOD)
> BLOOD[1:3, 1:2]

Below, we illustrate the process of transforming a list object into an instance of the MatH class, named Weight. We use the distributionH function from the HistDAWass package to encapsulate the histogram-valued data in the list. The resulting Weight dataset comprises 3 observations of 1 variable. We then call the new function from the methods package to create the MatH object. As a result, the analysis methods that the HistDAWass package offers for MatH objects can be applied to it.

> library(HistDAWass)
> BLOOD[1:3, 1:2]
#> a matrix of distributions 
#>  2  variables  3  rows 
#>  each distibution in the cell is represented by the mean and the standard deviation 
#>                  Cholesterol               Hemoglobin        
#> u1: F-20  [m= 150.1  ,s= 26.336 ]  [m= 13.695  ,s= 0.55031 ]
#> u2: F-30 [m= 150.71  ,s= 25.284 ]  [m= 12.158  ,s= 0.52834 ]
#> u3: F-40 [m= 164.96  ,s= 25.334 ]  [m= 12.134  ,s= 0.50739 ]
> A1 <- c(50, 60, 70, 80, 90, 100, 110, 120)
> B1 <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00)
> A2 <- c(50, 60, 70, 80, 90, 100, 110, 120)
> B2 <- c(0.00, 0.05, 0.12, 0.42, 0.68, 0.88, 0.94, 1.00)
> A3 <- c(50, 60, 70, 80, 90, 100, 110, 120)
> B3 <- c(0.00, 0.03, 0.24, 0.36, 0.75, 0.85, 0.98, 1.00)
> List <- list(A1, B1, A2, B2, A3, B3)
> List
#> [[1]]
#> [1]  50  60  70  80  90 100 110 120
#> 
#> [[2]]
#> [1] 0.00 0.02 0.08 0.32 0.62 0.86 0.92 1.00
#> 
#> [[3]]
#> [1]  50  60  70  80  90 100 110 120
#> 
#> [[4]]
#> [1] 0.00 0.05 0.12 0.42 0.68 0.88 0.94 1.00
#> 
#> [[5]]
#> [1]  50  60  70  80  90 100 110 120
#> 
#> [[6]]
#> [1] 0.00 0.03 0.24 0.36 0.75 0.85 0.98 1.00
> 
> ListOfWeight <- vector("list", 3)
> x <- 0
> for (i in 1:length(ListOfWeight)){
+   ListOfWeight[[i]] <- distributionH(List[[i + x]], List[[i + x + 1]])
+   x <- x + 1
+ }
> Weight <- methods::new("MatH",
+                     nrows = 3, ncols = 1, ListOfDist = ListOfWeight,
+                     names.rows = c("20s", "30s", "40s"),
+                     names.cols = c("weight"), by.row = FALSE)
> Weight
#> a matrix of distributions 
#>  1  variables  3  rows 
#>  each distibution in the cell is represented by the mean and the standard deviation 
#>              weight        
#> 20s [m= 86.8  ,s= 13.824 ]
#> 30s [m= 84.1  ,s= 14.44 ] 
#> 40s [m= 82.9  ,s= 14.385 ]
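
The mean and standard deviation printed for the 20s row can be checked by hand in base R. The sketch below assumes that the mass of each bin is spread uniformly over the bin, which is the usual convention for histogram-valued data; it does not call HistDAWass, and the variable names are chosen only for this example.

```r
breaks <- c(50, 60, 70, 80, 90, 100, 110, 120)              # bin limits (A1)
cum_p  <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00) # cumulative (B1)

p   <- diff(cum_p)                               # probability of each bin
mid <- (head(breaks, -1) + tail(breaks, -1)) / 2 # bin midpoints
wid <- diff(breaks)                              # bin widths

# Mean: probability-weighted midpoints.
m <- sum(p * mid)

# Second moment: a uniform bin contributes mid^2 + width^2 / 12.
ex2 <- sum(p * (mid^2 + wid^2 / 12))
s   <- sqrt(ex2 - m^2)

round(c(mean = m, sd = s), 3)
```

The result, 86.8 and 13.824, agrees with the m and s values shown for the 20s row of Weight.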

2.3 Example: convert iGAP format to MM format and RSDA format

To convert iGAP files to MM format:

> data(Face.iGAP)
> class(Face.iGAP)
#> [1] "data.frame"
> head(Face.iGAP)
#>                 AD          BC            AH            DH          EH
#> FRA1 155.00,157.00 58.00,61.01 100.45,103.28 105.00,107.30 61.40,65.73
#> FRA2 154.00,160.01 57.00,64.00 101.98,105.55 104.35,107.30 60.88,63.03
#> FRA3 154.01,161.00 57.00,63.00  99.36,105.65 101.04,109.04 60.95,65.60
#> HUS1 168.86,172.84 58.55,63.39 102.83,106.53 122.38,124.52 56.73,61.07
#> HUS2 169.85,175.03 60.21,64.38 102.94,108.71 120.24,124.52 56.73,62.37
#> HUS3 168.76,175.15 61.40,63.51 104.35,107.45 120.93,125.18 57.20,61.72
#>               GH
#> FRA1 64.20,67.80
#> FRA2 62.94,66.47
#> FRA3 60.42,66.40
#> HUS1 60.44,64.54
#> HUS2 60.44,66.84
#> HUS3 58.14,67.08
> Face <- iGAP_to_MM(data = Face.iGAP, location = 1:6)
> head(Face)
#>      AD_min AD_max BC_min BC_max AH_min AH_max DH_min DH_max EH_min EH_max
#> FRA1 155.00 157.00  58.00  61.01 100.45 103.28 105.00 107.30  61.40  65.73
#> FRA2 154.00 160.01  57.00  64.00 101.98 105.55 104.35 107.30  60.88  63.03
#> FRA3 154.01 161.00  57.00  63.00  99.36 105.65 101.04 109.04  60.95  65.60
#> HUS1 168.86 172.84  58.55  63.39 102.83 106.53 122.38 124.52  56.73  61.07
#> HUS2 169.85 175.03  60.21  64.38 102.94 108.71 120.24 124.52  56.73  62.37
#> HUS3 168.76 175.15  61.40  63.51 104.35 107.45 120.93 125.18  57.20  61.72
#>      GH_min GH_max
#> FRA1  64.20  67.80
#> FRA2  62.94  66.47
#> FRA3  60.42  66.40
#> HUS1  60.44  64.54
#> HUS2  60.44  66.84
#> HUS3  58.14  67.08
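
The idea behind this conversion can be sketched in base R: each iGAP cell is a "min,max" string that is split into two numeric columns. The helper split_interval is hypothetical and is not the code of iGAP_to_MM.

```r
# Two variables and two rows copied from the Face.iGAP output above.
igap <- data.frame(AD = c("155.00,157.00", "154.00,160.01"),
                   BC = c("58.00,61.01", "57.00,64.00"),
                   row.names = c("FRA1", "FRA2"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: split "a,b" strings into numeric min and max columns.
split_interval <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  data.frame(min = as.numeric(sapply(parts, `[`, 1)),
             max = as.numeric(sapply(parts, `[`, 2)))
}

# Split every variable and name the new columns <variable>_min / <variable>_max.
mm_list <- lapply(names(igap), function(v) {
  out <- split_interval(igap[[v]])
  names(out) <- paste0(v, c("_min", "_max"))
  out
})
mm <- do.call(cbind, mm_list)
rownames(mm) <- rownames(igap)
print(mm)
```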

Change the format of the data to conform to the RSDA format.

> Face.tmp <- RSDA_format(data = Face, 
+                         sym_type1 = c("I", "I", "I", "I", "I", "I"), 
+                         location = c(1, 3, 5, 7, 9, 11))
> head(Face.tmp)
#>      $I AD_min AD_max $I BC_min BC_max $I AH_min AH_max $I DH_min DH_max $I
#> FRA1 $I 155.00 157.00 $I  58.00  61.01 $I 100.45 103.28 $I 105.00 107.30 $I
#> FRA2 $I 154.00 160.01 $I  57.00  64.00 $I 101.98 105.55 $I 104.35 107.30 $I
#> FRA3 $I 154.01 161.00 $I  57.00  63.00 $I  99.36 105.65 $I 101.04 109.04 $I
#> HUS1 $I 168.86 172.84 $I  58.55  63.39 $I 102.83 106.53 $I 122.38 124.52 $I
#> HUS2 $I 169.85 175.03 $I  60.21  64.38 $I 102.94 108.71 $I 120.24 124.52 $I
#> HUS3 $I 168.76 175.15 $I  61.40  63.51 $I 104.35 107.45 $I 120.93 125.18 $I
#>      EH_min EH_max $I GH_min GH_max
#> FRA1  61.40  65.73 $I  64.20  67.80
#> FRA2  60.88  63.03 $I  62.94  66.47
#> FRA3  60.95  65.60 $I  60.42  66.40
#> HUS1  56.73  61.07 $I  60.44  64.54
#> HUS2  56.73  62.37 $I  60.44  66.84
#> HUS3  57.20  61.72 $I  58.14  67.08

Clean up variable names to conform to the RSDA format.

> Face.clean <- clean_colnames(data = Face.tmp)
> head(Face.clean)
#>      $I     AD     AD $I    BC    BC $I     AH     AH $I     DH     DH $I    EH
#> FRA1 $I 155.00 157.00 $I 58.00 61.01 $I 100.45 103.28 $I 105.00 107.30 $I 61.40
#> FRA2 $I 154.00 160.01 $I 57.00 64.00 $I 101.98 105.55 $I 104.35 107.30 $I 60.88
#> FRA3 $I 154.01 161.00 $I 57.00 63.00 $I  99.36 105.65 $I 101.04 109.04 $I 60.95
#> HUS1 $I 168.86 172.84 $I 58.55 63.39 $I 102.83 106.53 $I 122.38 124.52 $I 56.73
#> HUS2 $I 169.85 175.03 $I 60.21 64.38 $I 102.94 108.71 $I 120.24 124.52 $I 56.73
#> HUS3 $I 168.76 175.15 $I 61.40 63.51 $I 104.35 107.45 $I 120.93 125.18 $I 57.20
#>         EH $I    GH    GH
#> FRA1 65.73 $I 64.20 67.80
#> FRA2 63.03 $I 62.94 66.47
#> FRA3 65.60 $I 60.42 66.40
#> HUS1 61.07 $I 60.44 64.54
#> HUS2 62.37 $I 60.44 66.84
#> HUS3 61.72 $I 58.14 67.08

Write a symbolic data table to a CSV data file.

> write_csv_table(data = Face.clean, file = 'Face_interval.csv')

Read the symbolic data table and check the format.

> Face.interval <- read.sym.table(file = 'Face_interval.csv', header = T, sep = ';', dec = '.', row.names = 1)
> head(Face.interval)
#> # A tibble: 6 × 6
#>                  AD              BC                AH                DH
#>          <symblc_n>      <symblc_n>        <symblc_n>        <symblc_n>
#> 1 [155.00 : 157.00] [58.00 : 61.01] [100.45 : 103.28] [105.00 : 107.30]
#> 2 [154.00 : 160.01] [57.00 : 64.00] [101.98 : 105.55] [104.35 : 107.30]
#> 3 [154.01 : 161.00] [57.00 : 63.00]  [99.36 : 105.65] [101.04 : 109.04]
#> 4 [168.86 : 172.84] [58.55 : 63.39] [102.83 : 106.53] [122.38 : 124.52]
#> 5 [169.85 : 175.03] [60.21 : 64.38] [102.94 : 108.71] [120.24 : 124.52]
#> 6 [168.76 : 175.15] [61.40 : 63.51] [104.35 : 107.45] [120.93 : 125.18]
#> # ℹ 2 more variables: EH <symblc_n>, GH <symblc_n>

2.4 Example: convert RSDA format to MM format and iGAP format

Convert an RSDA-format interval data frame to MM format.

> Face.MM <- RSDA_to_MM(Face.interval, RSDA = TRUE)
> head(Face.MM)
#>   AD_min AD_max BC_min BC_max AH_min AH_max DH_min DH_max EH_min EH_max GH_min
#> 1 155.00 157.00  58.00  61.01 100.45 103.28 105.00 107.30  61.40  65.73  64.20
#> 2 154.00 160.01  57.00  64.00 101.98 105.55 104.35 107.30  60.88  63.03  62.94
#> 3 154.01 161.00  57.00  63.00  99.36 105.65 101.04 109.04  60.95  65.60  60.42
#> 4 168.86 172.84  58.55  63.39 102.83 106.53 122.38 124.52  56.73  61.07  60.44
#> 5 169.85 175.03  60.21  64.38 102.94 108.71 120.24 124.52  56.73  62.37  60.44
#> 6 168.76 175.15  61.40  63.51 104.35 107.45 120.93 125.18  57.20  61.72  58.14
#>   GH_max
#> 1  67.80
#> 2  66.47
#> 3  66.40
#> 4  64.54
#> 5  66.84
#> 6  67.08

Convert an MM-format interval data frame to iGAP format.

> Face.iGAP_trans <- MM_to_iGAP(Face.MM)
> head(Face.iGAP_trans)
#>              AD          BC            AH            DH          EH          GH
#> 1       155,157    58,61.01 100.45,103.28     105,107.3  61.4,65.73   64.2,67.8
#> 2    154,160.01       57,64 101.98,105.55  104.35,107.3 60.88,63.03 62.94,66.47
#> 3    154.01,161       57,63  99.36,105.65 101.04,109.04  60.95,65.6  60.42,66.4
#> 4 168.86,172.84 58.55,63.39 102.83,106.53 122.38,124.52 56.73,61.07 60.44,64.54
#> 5 169.85,175.03 60.21,64.38 102.94,108.71 120.24,124.52 56.73,62.37 60.44,66.84
#> 6 168.76,175.15  61.4,63.51 104.35,107.45 120.93,125.18  57.2,61.72 58.14,67.08
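
The reverse direction can be sketched in base R by pasting each min/max pair back into a single "min,max" string. This is an illustration only, not the code of MM_to_iGAP. As in the output above, trailing zeros are lost when numbers are converted to text (155,157 rather than 155.00,157.00).

```r
# Two variables and two rows in MM layout, copied from the output above.
mm <- data.frame(AD_min = c(155, 154), AD_max = c(157, 160.01),
                 BC_min = c(58, 57),   BC_max = c(61.01, 64))

# Recover the variable names by dropping the _min / _max suffixes.
vars <- unique(sub("_(min|max)$", "", names(mm)))

# Paste each pair of columns into one comma-separated character column.
igap <- as.data.frame(
  lapply(setNames(vars, vars), function(v) {
    paste(mm[[paste0(v, "_min")]], mm[[paste0(v, "_max")]], sep = ",")
  }),
  stringsAsFactors = FALSE
)
print(igap)
```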

3 Descriptive statistics

3.1 For interval-valued data

> data(mushroom.int)
> int_mean(mushroom.int, var_name = "Pileus.Cap.Width")
#>    Pileus.Cap.Width
#> CM         7.978261
> int_mean(mushroom.int, var_name = 2:3)
#>    Pileus.Cap.Width Stipe.Length
#> CM         7.978261     7.391304
> 
> var_name <- c("Stipe.Length", "Stipe.Thickness")
> method <- c("CM", "FV", "EJD")
> int_mean(mushroom.int, var_name, method)
#>     Stipe.Length Stipe.Thickness
#> CM      7.391304        1.823913
#> FV     10.304348        2.371739
#> EJD     7.391304        1.823913
> int_var(mushroom.int, var_name, method)
#>     Stipe.Length Stipe.Thickness
#> CM      9.544466       0.9872431
#> FV     13.858573       1.1729910
#> EJD    12.651229       1.0836673
> 
> var_name1 <- "Pileus.Cap.Width"
> var_name2 <- c("Stipe.Length", "Stipe.Thickness")
> method <- c("CM", "VM", "QM", "SE", "FV", "EJD", "GQ", "SPT") 
> int_cov(mushroom.int, var_name1, var_name2, method)
#> $CM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     8.417984        2.480657
#> 
#> $VM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     8.095985        2.385769
#> 
#> $QM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     14.02174        3.523277
#> 
#> $SE
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     20.18647        4.714976
#> 
#> $FV
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     10.37204        2.994745
#> 
#> $EJD
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     8.051985        2.372802
#> 
#> $GQ
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     11.46091        3.243229
#> 
#> $SPT
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     11.95054         3.11936
> int_cor(mushroom.int, var_name1, var_name2, method)
#> $CM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.8047063       0.7373264
#> 
#> $VM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.3555993       0.3984261
#> 
#> $QM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     0.857843       0.7619691
#> 
#> $SE
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.8817636       0.7830681
#> 
#> $FV
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.7496622       0.7440004
#> 
#> $EJD
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.5695142       0.5734316
#> 
#> $GQ
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.8106262       0.7837862
#> 
#> $SPT
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.8452575       0.7538511
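
As a rough guide to what two of these methods compute, the base R sketch below works through a mean and a variance on the six Stipe.Length intervals printed by head(mushroom). It assumes that CM denotes the centre (midpoint) method and that EJD denotes the empirical joint density approach of Bertrand and Goupil; both readings are assumptions, and the package may use different divisors (n versus n - 1), so the numbers are not expected to match the output above, which uses all 23 species.

```r
# Stipe.Length intervals of the first six species shown by head(mushroom).
lo <- c(4, 4, 5, 4, 2, 4)
hi <- c(9, 14, 11, 7, 5, 10)
n  <- length(lo)

# Centre method: treat each interval as its midpoint.
centres     <- (lo + hi) / 2
centre_mean <- mean(centres)
centre_var  <- mean((centres - centre_mean)^2)   # divisor n

# Bertrand-Goupil variance: points are uniform inside each interval,
# so the spread within intervals is added to the spread between them.
bg_var <- sum(lo^2 + lo * hi + hi^2) / (3 * n) -
  (sum(lo + hi))^2 / (4 * n^2)

c(centre_mean = centre_mean, centre_var = centre_var, bg_var = bg_var)
```

Under these assumptions both approaches share the same mean, while the second variance is larger because it accounts for variation inside the intervals. This is consistent with the pattern in the int_mean and int_var output above, where CM and EJD give equal means and EJD gives the larger variance.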

3.2 For histogram-valued data

> data(BLOOD)
> hist_mean(BLOOD, "Cholesterol")
#> [1] 180.677
> hist_var(BLOOD, "Cholesterol")
#> [1] 1002.339
> hist_cov(BLOOD, 'Cholesterol', 'Hemoglobin', method = "B")
#> [1] -4.692686
> hist_cor(BLOOD, 'Cholesterol', 'Hemoglobin', method = "L2W") 
#> [1] -0.4794806

4 Symbolic dataset donation/submission guidelines

We welcome contributions of high-quality datasets for symbolic data analysis. Submitted datasets will be made publicly available (or under specified constraints) to support research in machine learning, statistics, and related fields. You can submit the related files via email to wuhm@g.nccu.edu.tw or through the Google Form at Symbolic Dataset Submission Form. The submission requirements are as follows.

  1. Dataset Format:

    • Preferred formats: .csv, .xlsx, or any symbolic format in plain text.
    • Compressed (.zip or .gz) if multiple files are included.
  2. Required Metadata:
    Contributors must provide the following details:

    Field | Description | Example
    Dataset Name | A clear, descriptive title. | “face recognition data”
    Dataset Short Name | A clear, abbreviated title. | “face data”
    Authors | Full names of the donors. | “First name, Last name”
    E-mail | Contact email. |
    Institutes | Affiliated organizations. | “-”
    Country | Origin of the dataset. | “France”
    Dataset Descriptions | Description of the data. | See ‘README’
    Sample Size | Number of instances/rows. | 27
    Number of Variables | Total features/columns (categorical/numeric). | 6 (interval)
    Missing Values | Indicate if missing values exist and how they’re handled. | “None” / “Yes, marked as NA”
    Variable Descriptions | Detailed description of each column (name, type, units, range). | See ‘README’
    Source | Original data source (if applicable). | “Leroy et al. (1996)”
    References | Citations for prior work using the dataset. | “Douzal-Chouakria, Billard, and Diday (2011)”
    Applied Areas | Relevant fields (e.g., biology, finance). | “Machine Learning”
    Usage Constraints | Licensing (CC-BY, MIT) or restrictions. | “Public domain”
    Data Link | URL to download the dataset (Google Drive, GitHub, etc.). | “(https)”
  3. Quality Assurance:

    • Datasets should be clean (no sensitive/private data).
  4. Optional (Recommended):

    • A companion README file with:
      • Dataset background.
      • Suggested use cases.
      • Known limitations.

5 Citation

Po-Wei Chen, Chun-houh Chen, Han-Ming Wu (2025), dataSDA: datasets and basic statistics for symbolic data analysis in R. Technical report.