2 Symbolic data formats conversion

2.1 Example: convert interval-valued datasets into the `symbolic_tbl` class

To illustrate two distinct classes of interval-valued datasets in R, we use two datasets: Abalone and mushroom. The Abalone dataset includes 24 observations, each featuring 7 interval-valued variables. It is categorized as an object of the symbolic_tbl class, and each variable within it belongs to the symbolic_n class. Both classes are defined by the RSDA package. Storing the dataset as a symbolic_tbl object makes it easier to apply the symbolic data methods provided by RSDA. The Abalone dataset is a built-in dataset in the RSDA package. A copy of it is also included in the dataSDA package, renamed Abalone.int.

> library(dataSDA)
> data(Abalone.iGAP)
> head(Abalone.iGAP)
#>                 Length       Diameter          Height           Whole
#> F-10-12  0.1275,0.9975   0.075, 0.815 -0.0175, 0.3125   -1.021, 3.883
#> F-13-15  0.1775,1.0275    0.125,0.825    0.025, 0.325 -0.8567, 3.6303
#> F-16-18      0.22,0.92 0.1725, 0.7425  0.0375, 0.3075 -0.5725, 3.1235
#> F-19-21 0.3725, 0.8425 0.2575, 0.6875  0.0825, 0.2525 -0.0368, 2.8443
#> F-23-24   0.275, 0.975   0.255, 0.755      0.09, 0.27   -0.303, 3.469
#> F-25-29   0.475, 0.775   0.405, 0.645  0.1625, 0.2325    0.915, 2.105
#>                 Shucked         Viscera           Shell
#> F-10-12 -0.6322, 2.1948 -0.2077, 0.7712   -0.258, 1.054
#> F-13-15 -0.4548, 1.7942 -0.1905, 0.7555   -0.269, 1.153
#> F-16-18   -0.244, 1.206 -0.1037, 0.6752 -0.3233, 1.4477
#> F-19-21     -0.16, 1.14   -0.033, 0.615 -0.1175, 1.1725
#> F-23-24 -0.2295, 1.3205     -0.13, 0.83    0.005, 0.945
#> F-25-29    0.134, 0.896  0.1467, 0.3798      0.45, 0.55
> class(Abalone.iGAP)
#> [1] "data.frame"
> data(Abalone)
> head(Abalone)
#>         Length_min Length_max Diameter_min Diameter_max Height_min Height_max
#> F-10-12     0.1275     0.9975        0.075        0.815    -0.0175     0.3125
#> F-13-15     0.1775     1.0275        0.125        0.825      0.025      0.325
#> F-16-18       0.22       0.92       0.1725       0.7425     0.0375     0.3075
#> F-19-21     0.3725     0.8425       0.2575       0.6875     0.0825     0.2525
#> F-23-24      0.275      0.975        0.255        0.755       0.09       0.27
#> F-25-29      0.475      0.775        0.405        0.645     0.1625     0.2325
#>         Whole_min Whole_max Shucked_min Shucked_max Viscera_min Viscera_max
#> F-10-12    -1.021     3.883     -0.6322      2.1948     -0.2077      0.7712
#> F-13-15   -0.8567    3.6303     -0.4548      1.7942     -0.1905      0.7555
#> F-16-18   -0.5725    3.1235      -0.244       1.206     -0.1037      0.6752
#> F-19-21   -0.0368    2.8443       -0.16        1.14      -0.033       0.615
#> F-23-24    -0.303     3.469     -0.2295      1.3205       -0.13        0.83
#> F-25-29     0.915     2.105       0.134       0.896      0.1467      0.3798
#>         Shell_min Shell_max
#> F-10-12    -0.258     1.054
#> F-13-15    -0.269     1.153
#> F-16-18   -0.3233    1.4477
#> F-19-21   -0.1175    1.1725
#> F-23-24     0.005     0.945
#> F-25-29      0.45      0.55
> class(Abalone)
#> [1] "data.frame"

The mushroom dataset consists of 23 species described by 3 interval-valued variables: stipe length, stipe thickness and pileus cap width. The dataset also contains two categorical variables: Species and Edibility. By default the dataset is of the data.frame class, and we would like to convert it into the symbolic_tbl class.

> data(mushroom)
> head(mushroom) 
#>     Species Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min
#> 1    arorae                    3                    8                4
#> 2   arvenis                    6                   21                4
#> 3    benesi                    4                    8                5
#> 4 bernardii                    7                    6                4
#> 5  bisporus                    5                   12                2
#> 6 bitorquis                    5                   15                4
#>   Stipe.Length_max Stipe.Thickness_min Stipe.Thickness_max Edibility
#> 1                9                 0.5                 2.5         U
#> 2               14                 1.0                 3.5         Y
#> 3               11                 1.0                 2.0         Y
#> 4                7                 3.0                 4.5         Y
#> 5                5                 1.5                 2.5         Y
#> 6               10                 2.0                 4.0         Y

Within the framework of SDA, the variables “Species” and “Edibility” are treated as set variables. We have developed the set_variable_format function to create pseudo-variables that correspond to the categories of a specified categorical variable, using one-hot encoding. The location argument denotes the position of the set variable in the data. After the dataset is restructured, the values of the “Species” and “Edibility” variables are replaced by the number of categories of each variable (23 and 3, respectively, in the output below).

> mushroom_set <- set_variable_format(data = mushroom, location = 8, 
+                                        var = "Species")
> head(mushroom_set, 3)
#>   Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1      23      1       0      0         0        0         0           0
#> 2      23      0       1      0         0        0         0           0
#> 3      23      0       0      1         0        0         0           0
#>   campestris comtulus cupreo-brunneus diminutives fuseo-fibrillosus
#> 1          0        0               0           0                 0
#> 2          0        0               0           0                 0
#> 3          0        0               0           0                 0
#>   fuscovelatus hondensis lilaceps micromegathus praeclaresquamosus pattersonae
#> 1            0         0        0             0                  0           0
#> 2            0         0        0             0                  0           0
#> 3            0         0        0             0                  0           0
#>   perobscurus semotus silvicola subrutilescens xanthodermus
#> 1           0       0         0              0            0
#> 2           0       0         0              0            0
#> 3           0       0         0              0            0
#>   Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min Stipe.Length_max
#> 1                    3                    8                4                9
#> 2                    6                   21                4               14
#> 3                    4                    8                5               11
#>   Stipe.Thickness_min Stipe.Thickness_max Edibility U Y T
#> 1                 0.5                 2.5         3 1 0 0
#> 2                 1.0                 3.5         3 0 1 0
#> 3                 1.0                 2.0         3 0 1 0
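The one-hot step can be illustrated in base R. The following is a minimal sketch under stated assumptions, not the implementation of set_variable_format: the helper name one_hot_set and the toy data frame are hypothetical and were chosen only to mirror the layout of the Edibility columns above.

```r
# Hypothetical helper: one-hot encode a categorical column in base R.
# Illustration only; not the code behind set_variable_format().
one_hot_set <- function(data, var) {
  values <- as.character(data[[var]])
  categories <- unique(values)  # keep the order of first appearance
  # One 0/1 indicator per category; vapply always returns a matrix here
  indicators <- vapply(categories,
                       function(k) as.integer(values == k),
                       integer(length(values)))
  indicators <- matrix(indicators, nrow = length(values),
                       dimnames = list(NULL, categories))
  # The original column is replaced by the number of categories
  count <- rep(length(categories), length(values))
  out <- data.frame(count, indicators, check.names = FALSE)
  names(out)[1] <- var
  out
}

toy <- data.frame(Edibility = c("U", "Y", "Y", "T"))
encoded <- one_hot_set(toy, "Edibility")
print(encoded)
```

Each row of the result carries the category count in the first column and a single 1 in the indicator column of its own category.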

To adhere to the formatting conventions of RSDA, we have implemented the RSDA_format function. This function inserts a marker column before each variable to indicate the type of symbolic variable. Specifically, set variables are marked with $S, and interval-valued variables are marked with $I. The syntax of the RSDA_format function with its arguments is as follows.

RSDA_format(data, sym_type1, location, sym_type2, var)

data: a conventional data frame.
sym_type1, sym_type2: the types of the symbolic variables; I denotes an interval variable and S denotes a set variable.
location: the column positions of the variables whose types are given in sym_type1.
var: the names of the variables whose types are given in sym_type2.

> mushroom_tmp <- RSDA_format(data = mushroom_set, 
+                                sym_type1 = c("I", "I", "I", "S"),
+                                location = c(25, 27, 29, 31), 
+                                sym_type2 = c("S"),
+                                var = c("Species"))
> head(mushroom_tmp, 3)
#>   $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S      23      1       0      0         0        0         0           0
#> 2 $S      23      0       1      0         0        0         0           0
#> 3 $S      23      0       0      1         0        0         0           0
#>   campestris comtulus cupreo-brunneus diminutives fuseo-fibrillosus
#> 1          0        0               0           0                 0
#> 2          0        0               0           0                 0
#> 3          0        0               0           0                 0
#>   fuscovelatus hondensis lilaceps micromegathus praeclaresquamosus pattersonae
#> 1            0         0        0             0                  0           0
#> 2            0         0        0             0                  0           0
#> 3            0         0        0             0                  0           0
#>   perobscurus semotus silvicola subrutilescens xanthodermus $I
#> 1           0       0         0              0            0 $I
#> 2           0       0         0              0            0 $I
#> 3           0       0         0              0            0 $I
#>   Pileus.Cap.Width_min Pileus.Cap.Width_max $I Stipe.Length_min
#> 1                    3                    8 $I                4
#> 2                    6                   21 $I                4
#> 3                    4                    8 $I                5
#>   Stipe.Length_max $I Stipe.Thickness_min Stipe.Thickness_max $S Edibility U Y
#> 1                9 $I                 0.5                 2.5 $S         3 1 0
#> 2               14 $I                 1.0                 3.5 $S         3 0 1
#> 3               11 $I                 1.0                 2.0 $S         3 0 1
#>   T
#> 1 0
#> 2 0
#> 3 0

The suffixes _min and _max are removed from the variable names using the clean_colnames function. The modified dataset is then written out with the write_csv_table function. This external data file is read using the read.sym.table function, which is provided by the RSDA package. After import, the dataset is of the symbolic_tbl class.

> mushroom_clean <- clean_colnames(data = mushroom_tmp)
> head(mushroom_clean, 3)
#>   $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S      23      1       0      0         0        0         0           0
#> 2 $S      23      0       1      0         0        0         0           0
#> 3 $S      23      0       0      1         0        0         0           0
#>   campestris comtulus cupreo-brunneus dutives fuseo-fibrillosus fuscovelatus
#> 1          0        0               0       0                 0            0
#> 2          0        0               0       0                 0            0
#> 3          0        0               0       0                 0            0
#>   hondensis lilaceps micromegathus praeclaresquamosus pattersonae perobscurus
#> 1         0        0             0                  0           0           0
#> 2         0        0             0                  0           0           0
#> 3         0        0             0                  0           0           0
#>   semotus silvicola subrutilescens xanthodermus $I Pileus.Cap.Width
#> 1       0         0              0            0 $I                3
#> 2       0         0              0            0 $I                6
#> 3       0         0              0            0 $I                4
#>   Pileus.Cap.Width $I Stipe.Length Stipe.Length $I Stipe.Thickness
#> 1                8 $I            4            9 $I             0.5
#> 2               21 $I            4           14 $I             1.0
#> 3                8 $I            5           11 $I             1.0
#>   Stipe.Thickness $S Edibility U Y T
#> 1             2.5 $S         3 1 0 0
#> 2             3.5 $S         3 0 1 0
#> 3             2.0 $S         3 0 1 0

Write the cleaned data to a CSV file and read it back as a symbolic_tbl object.

> write_csv_table(data = mushroom_clean, file = "mushroom_interval.csv")
> mushroom_int <- read.sym.table(file = 'mushroom_interval.csv', 
+                                   header = T, sep = ';', dec = '.', 
+                                   row.names = 1)
> head(mushroom_int, 3)
#> # A tibble: 3 × 5
#>      Species Pileus.Cap.Width   Stipe.Length Stipe.Thickness  Edibility
#>   <symblc_s>       <symblc_n>     <symblc_n>      <symblc_n> <symblc_s>
#> 1   {arorae}    [3.00 : 8.00]  [4.00 : 9.00]   [0.50 : 2.50]        {U}
#> 2  {arvenis}   [6.00 : 21.00] [4.00 : 14.00]   [1.00 : 3.50]        {Y}
#> 3   {benesi}    [4.00 : 8.00] [5.00 : 11.00]   [1.00 : 2.00]        {Y}
> class(mushroom_int)
#> [1] "symbolic_tbl" "tbl_df"       "tbl"          "data.frame"

2.2 Example: the conversion of histogram-valued datasets into the `MatH` class

To demonstrate the conversion of a dataset into the MatH class, a matrix of histogram-valued data, we use two datasets: BLOOD and Weight. This class is provided by the HistDAWass package.

The BLOOD dataset is a MatH object, supplied by the HistDAWass package, and it encompasses 14 groups of patients, each characterized by three distributional variables. Each distribution within a cell is depicted by its mean and standard deviation.

> library(dataSDA)
> data(BLOOD)
> BLOOD[1:3, 1:2]

Below, we illustrate how to transform a list object into an instance of the MatH class, named Weight. We use the distributionH function from the HistDAWass package to encapsulate the histogram-valued data in the dataset. The constructed Weight dataset comprises 3 observations, each with 1 variable. We then use the new function from the methods package to create the data object as a member of the MatH class. As a result, we can apply the analysis methods offered by the HistDAWass package to objects of the MatH class.

> library(HistDAWass)
> BLOOD[1:3, 1:2]
#> a matrix of distributions 
#>  2  variables  3  rows 
#>  each distibution in the cell is represented by the mean and the standard deviation 
#>                  Cholesterol               Hemoglobin        
#> u1: F-20  [m= 150.1  ,s= 26.336 ]  [m= 13.695  ,s= 0.55031 ]
#> u2: F-30 [m= 150.71  ,s= 25.284 ]  [m= 12.158  ,s= 0.52834 ]
#> u3: F-40 [m= 164.96  ,s= 25.334 ]  [m= 12.134  ,s= 0.50739 ]

> A1 <- c(50, 60, 70, 80, 90, 100, 110, 120)
> B1 <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00)
> A2 <- c(50, 60, 70, 80, 90, 100, 110, 120)
> B2 <- c(0.00, 0.05, 0.12, 0.42, 0.68, 0.88, 0.94, 1.00)
> A3 <- c(50, 60, 70, 80, 90, 100, 110, 120)
> B3 <- c(0.00, 0.03, 0.24, 0.36, 0.75, 0.85, 0.98, 1.00)
> List <- list(A1, B1, A2, B2, A3, B3)
> List
#> [[1]]
#> [1]  50  60  70  80  90 100 110 120
#> 
#> [[2]]
#> [1] 0.00 0.02 0.08 0.32 0.62 0.86 0.92 1.00
#> 
#> [[3]]
#> [1]  50  60  70  80  90 100 110 120
#> 
#> [[4]]
#> [1] 0.00 0.05 0.12 0.42 0.68 0.88 0.94 1.00
#> 
#> [[5]]
#> [1]  50  60  70  80  90 100 110 120
#> 
#> [[6]]
#> [1] 0.00 0.03 0.24 0.36 0.75 0.85 0.98 1.00
> 
> ListOfWeight <- vector("list", 3)
> for (i in seq_along(ListOfWeight)) {
+   # List[[2 * i - 1]] holds the bin bounds, List[[2 * i]] the cumulative probabilities
+   ListOfWeight[[i]] <- distributionH(List[[2 * i - 1]], List[[2 * i]])
+ }
> Weight <- methods::new("MatH",
+                     nrows = 3, ncols = 1, ListOfDist = ListOfWeight,
+                     names.rows = c("20s", "30s", "40s"),
+                     names.cols = c("weight"), by.row = FALSE)
> Weight
#> a matrix of distributions 
#>  1  variables  3  rows 
#>  each distibution in the cell is represented by the mean and the standard deviation 
#>              weight        
#> 20s [m= 86.8  ,s= 13.824 ]
#> 30s [m= 84.1  ,s= 14.44 ] 
#> 40s [m= 82.9  ,s= 14.385 ]
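The mean and standard deviation printed for each cell can be checked by hand. The sketch below is a base R illustration, assuming that the probability mass is spread uniformly within each bin; the helper name hist_moments is hypothetical and is not part of HistDAWass.

```r
# Mean and standard deviation of a histogram described by bin bounds and
# cumulative probabilities, assuming a uniform density within each bin.
# hist_moments is a hypothetical helper, not a HistDAWass function.
hist_moments <- function(bounds, cum_prob) {
  mass <- diff(cum_prob)                       # probability of each bin
  mid <- (head(bounds, -1) + tail(bounds, -1)) / 2
  width <- diff(bounds)
  m <- sum(mass * mid)
  # For a uniform bin, E[X^2] = mid^2 + width^2 / 12
  second <- sum(mass * (mid^2 + width^2 / 12))
  c(mean = m, sd = sqrt(second - m^2))
}

bounds <- c(50, 60, 70, 80, 90, 100, 110, 120)
cum_20s <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00)
moments_20s <- hist_moments(bounds, cum_20s)
print(round(moments_20s, 3))
```

For the first histogram this gives a mean of 86.8 and a standard deviation of about 13.824, in agreement with the 20s row of Weight.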

2.3 Example: convert iGAP format to MM format and RSDA format.

To convert iGAP files to MM format:

> data(Face.iGAP)
> class(Face.iGAP)
#> [1] "data.frame"
> head(Face.iGAP)
#>                 AD          BC            AH            DH          EH
#> FRA1 155.00,157.00 58.00,61.01 100.45,103.28 105.00,107.30 61.40,65.73
#> FRA2 154.00,160.01 57.00,64.00 101.98,105.55 104.35,107.30 60.88,63.03
#> FRA3 154.01,161.00 57.00,63.00  99.36,105.65 101.04,109.04 60.95,65.60
#> HUS1 168.86,172.84 58.55,63.39 102.83,106.53 122.38,124.52 56.73,61.07
#> HUS2 169.85,175.03 60.21,64.38 102.94,108.71 120.24,124.52 56.73,62.37
#> HUS3 168.76,175.15 61.40,63.51 104.35,107.45 120.93,125.18 57.20,61.72
#>               GH
#> FRA1 64.20,67.80
#> FRA2 62.94,66.47
#> FRA3 60.42,66.40
#> HUS1 60.44,64.54
#> HUS2 60.44,66.84
#> HUS3 58.14,67.08
> Face <- iGAP_to_MM(data = Face.iGAP, location = 1:6)
> head(Face)
#>      AD_min AD_max BC_min BC_max AH_min AH_max DH_min DH_max EH_min EH_max
#> FRA1 155.00 157.00  58.00  61.01 100.45 103.28 105.00 107.30  61.40  65.73
#> FRA2 154.00 160.01  57.00  64.00 101.98 105.55 104.35 107.30  60.88  63.03
#> FRA3 154.01 161.00  57.00  63.00  99.36 105.65 101.04 109.04  60.95  65.60
#> HUS1 168.86 172.84  58.55  63.39 102.83 106.53 122.38 124.52  56.73  61.07
#> HUS2 169.85 175.03  60.21  64.38 102.94 108.71 120.24 124.52  56.73  62.37
#> HUS3 168.76 175.15  61.40  63.51 104.35 107.45 120.93 125.18  57.20  61.72
#>      GH_min GH_max
#> FRA1  64.20  67.80
#> FRA2  62.94  66.47
#> FRA3  60.42  66.40
#> HUS1  60.44  64.54
#> HUS2  60.44  66.84
#> HUS3  58.14  67.08
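The idea behind this conversion can be sketched in base R: each "min,max" string is split at the comma and stored as two numeric columns. This is only an illustration under that assumption, not the implementation of iGAP_to_MM; the helper name split_intervals and the two-row toy data are hypothetical.

```r
# Hypothetical helper: split "min,max" strings into two numeric columns.
# Illustration only; not the implementation of iGAP_to_MM().
split_intervals <- function(data) {
  pieces <- lapply(names(data), function(v) {
    parts <- strsplit(as.character(data[[v]]), ",", fixed = TRUE)
    # 2 x n matrix: first row lower bounds, second row upper bounds
    bounds <- vapply(parts, function(p) as.numeric(trimws(p)), numeric(2))
    out <- data.frame(bounds[1, ], bounds[2, ])
    names(out) <- paste0(v, c("_min", "_max"))
    out
  })
  result <- do.call(cbind, pieces)
  rownames(result) <- rownames(data)
  result
}

toy_igap <- data.frame(AD = c("155.00,157.00", "154.00,160.01"),
                       BC = c("58.00,61.01", "57.00,64.00"),
                       row.names = c("FRA1", "FRA2"))
toy_mm <- split_intervals(toy_igap)
print(toy_mm)
```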

Change the format of the data to conform to the RSDA format.

> Face.tmp <- RSDA_format(data = Face, 
+                         sym_type1 = c("I", "I", "I", "I", "I", "I"), 
+                         location = c(1, 3, 5, 7, 9, 11))
> head(Face.tmp)
#>      $I AD_min AD_max $I BC_min BC_max $I AH_min AH_max $I DH_min DH_max $I
#> FRA1 $I 155.00 157.00 $I  58.00  61.01 $I 100.45 103.28 $I 105.00 107.30 $I
#> FRA2 $I 154.00 160.01 $I  57.00  64.00 $I 101.98 105.55 $I 104.35 107.30 $I
#> FRA3 $I 154.01 161.00 $I  57.00  63.00 $I  99.36 105.65 $I 101.04 109.04 $I
#> HUS1 $I 168.86 172.84 $I  58.55  63.39 $I 102.83 106.53 $I 122.38 124.52 $I
#> HUS2 $I 169.85 175.03 $I  60.21  64.38 $I 102.94 108.71 $I 120.24 124.52 $I
#> HUS3 $I 168.76 175.15 $I  61.40  63.51 $I 104.35 107.45 $I 120.93 125.18 $I
#>      EH_min EH_max $I GH_min GH_max
#> FRA1  61.40  65.73 $I  64.20  67.80
#> FRA2  60.88  63.03 $I  62.94  66.47
#> FRA3  60.95  65.60 $I  60.42  66.40
#> HUS1  56.73  61.07 $I  60.44  64.54
#> HUS2  56.73  62.37 $I  60.44  66.84
#> HUS3  57.20  61.72 $I  58.14  67.08
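The layout above, with a type-marker column placed immediately before each variable, can be reproduced in base R. The following sketch assumes that location gives the column position where each variable starts; the helper name add_type_markers is hypothetical, and the code is not the implementation of RSDA_format.

```r
# Hypothetical helper: insert a type-marker column (such as "$I") before
# the column at each given position.
# Illustration only; not the implementation of RSDA_format().
add_type_markers <- function(data, sym_type, location) {
  stopifnot(length(sym_type) == length(location))
  n <- ncol(data)
  markers <- paste0("$", sym_type)
  # One constant column per marker, appended after the original columns
  marker_cols <- as.data.frame(
    matrix(rep(markers, each = nrow(data)), nrow = nrow(data)),
    stringsAsFactors = FALSE)
  combined <- cbind(data, marker_cols)
  # A marker gets a sort key just below the position it precedes
  keys <- c(seq_len(n), location - 0.5)
  out <- combined[, order(keys), drop = FALSE]
  names(out) <- c(names(data), markers)[order(keys)]
  out
}

toy_mm <- data.frame(AD_min = c(155, 154), AD_max = c(157, 160.01),
                     BC_min = c(58, 57),   BC_max = c(61.01, 64),
                     row.names = c("FRA1", "FRA2"))
marked <- add_type_markers(toy_mm, sym_type = c("I", "I"), location = c(1, 3))
print(marked)
```

Sorting on fractional keys avoids shifting positions by hand after each insertion, so the positions in location always refer to the original column order.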

Clean up variable names to conform to the RSDA format.

> Face.clean <- clean_colnames(data = Face.tmp)
> head(Face.clean)
#>      $I     AD     AD $I    BC    BC $I     AH     AH $I     DH     DH $I    EH
#> FRA1 $I 155.00 157.00 $I 58.00 61.01 $I 100.45 103.28 $I 105.00 107.30 $I 61.40
#> FRA2 $I 154.00 160.01 $I 57.00 64.00 $I 101.98 105.55 $I 104.35 107.30 $I 60.88
#> FRA3 $I 154.01 161.00 $I 57.00 63.00 $I  99.36 105.65 $I 101.04 109.04 $I 60.95
#> HUS1 $I 168.86 172.84 $I 58.55 63.39 $I 102.83 106.53 $I 122.38 124.52 $I 56.73
#> HUS2 $I 169.85 175.03 $I 60.21 64.38 $I 102.94 108.71 $I 120.24 124.52 $I 56.73
#> HUS3 $I 168.76 175.15 $I 61.40 63.51 $I 104.35 107.45 $I 120.93 125.18 $I 57.20
#>         EH $I    GH    GH
#> FRA1 65.73 $I 64.20 67.80
#> FRA2 63.03 $I 62.94 66.47
#> FRA3 65.60 $I 60.42 66.40
#> HUS1 61.07 $I 60.44 64.54
#> HUS2 62.37 $I 60.44 66.84
#> HUS3 61.72 $I 58.14 67.08
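Removing the bound suffixes amounts to a regular-expression substitution on the column names. The sketch below is a base R illustration, not the implementation of clean_colnames; the helper name strip_bound_suffix is hypothetical.

```r
# Hypothetical helper: drop a trailing "_min" or "_max" from column names.
# Illustration only; not the implementation of clean_colnames().
strip_bound_suffix <- function(data) {
  names(data) <- sub("_(min|max)$", "", names(data))
  data
}

toy <- data.frame(1, 2, 3)
names(toy) <- c("$I", "AD_min", "AD_max")
cleaned <- strip_bound_suffix(toy)
print(names(cleaned))
```

Because the pattern includes the underscore and is anchored at the end of the name, a name that merely contains the letters min, such as diminutives, is left unchanged by this sketch.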

Write a symbolic data table to a CSV data file.

> write_csv_table(data = Face.clean, file = 'Face_interval.csv')

Read the symbolic data table and check the format.

> Face.interval <- read.sym.table(file = 'Face_interval.csv', header = T, sep = ';', dec = '.', row.names = 1)
> head(Face.interval)
#> # A tibble: 6 × 6
#>                  AD              BC                AH                DH
#>          <symblc_n>      <symblc_n>        <symblc_n>        <symblc_n>
#> 1 [155.00 : 157.00] [58.00 : 61.01] [100.45 : 103.28] [105.00 : 107.30]
#> 2 [154.00 : 160.01] [57.00 : 64.00] [101.98 : 105.55] [104.35 : 107.30]
#> 3 [154.01 : 161.00] [57.00 : 63.00]  [99.36 : 105.65] [101.04 : 109.04]
#> 4 [168.86 : 172.84] [58.55 : 63.39] [102.83 : 106.53] [122.38 : 124.52]
#> 5 [169.85 : 175.03] [60.21 : 64.38] [102.94 : 108.71] [120.24 : 124.52]
#> 6 [168.76 : 175.15] [61.40 : 63.51] [104.35 : 107.45] [120.93 : 125.18]
#> # ℹ 2 more variables: EH <symblc_n>, GH <symblc_n>

2.4 Example: convert RSDA format to MM format and iGAP format.

Convert the RSDA-format interval data frame to MM format.

> Face.MM <- RSDA_to_MM(Face.interval, RSDA = TRUE)
> head(Face.MM)
#>   AD_min AD_max BC_min BC_max AH_min AH_max DH_min DH_max EH_min EH_max GH_min
#> 1 155.00 157.00  58.00  61.01 100.45 103.28 105.00 107.30  61.40  65.73  64.20
#> 2 154.00 160.01  57.00  64.00 101.98 105.55 104.35 107.30  60.88  63.03  62.94
#> 3 154.01 161.00  57.00  63.00  99.36 105.65 101.04 109.04  60.95  65.60  60.42
#> 4 168.86 172.84  58.55  63.39 102.83 106.53 122.38 124.52  56.73  61.07  60.44
#> 5 169.85 175.03  60.21  64.38 102.94 108.71 120.24 124.52  56.73  62.37  60.44
#> 6 168.76 175.15  61.40  63.51 104.35 107.45 120.93 125.18  57.20  61.72  58.14
#>   GH_max
#> 1  67.80
#> 2  66.47
#> 3  66.40
#> 4  64.54
#> 5  66.84
#> 6  67.08

Convert the MM-format interval data frame to iGAP format.

> Face.iGAP_trans <- MM_to_iGAP(Face.MM)
> head(Face.iGAP_trans)
#>              AD          BC            AH            DH          EH          GH
#> 1       155,157    58,61.01 100.45,103.28     105,107.3  61.4,65.73   64.2,67.8
#> 2    154,160.01       57,64 101.98,105.55  104.35,107.3 60.88,63.03 62.94,66.47
#> 3    154.01,161       57,63  99.36,105.65 101.04,109.04  60.95,65.6  60.42,66.4
#> 4 168.86,172.84 58.55,63.39 102.83,106.53 122.38,124.52 56.73,61.07 60.44,64.54
#> 5 169.85,175.03 60.21,64.38 102.94,108.71 120.24,124.52 56.73,62.37 60.44,66.84
#> 6 168.76,175.15  61.4,63.51 104.35,107.45 120.93,125.18  57.2,61.72 58.14,67.08
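The reverse step can be sketched in base R by pasting each pair of bounds into a single "min,max" string. This is an illustration under that assumption, not the implementation of MM_to_iGAP; the helper name join_intervals and the toy data are hypothetical.

```r
# Hypothetical helper: paste each _min/_max pair into one "min,max" string.
# Illustration only; not the implementation of MM_to_iGAP().
join_intervals <- function(data) {
  min_cols <- grep("_min$", names(data), value = TRUE)
  vars <- sub("_min$", "", min_cols)
  out <- lapply(vars, function(v) {
    paste(data[[paste0(v, "_min")]], data[[paste0(v, "_max")]], sep = ",")
  })
  names(out) <- vars
  as.data.frame(out, stringsAsFactors = FALSE, row.names = rownames(data))
}

toy_mm <- data.frame(AD_min = c(155, 154), AD_max = c(157, 160.01),
                     BC_min = c(58, 57),   BC_max = c(61.01, 64))
toy_igap <- join_intervals(toy_mm)
print(toy_igap)
```

Since paste converts the numbers with their default character representation, trailing zeros are dropped, so 155.00 and 157.00 appear as "155,157".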


Field	Description	Example
Dataset Name	A clear, descriptive title.	“face recognition data”
Dataset Short Name	A clear, abbreviated title.	“face data”
Authors	Full names of the donors.	“First name, Last name”
E-mail	Contact email.	“abc123@gmail.com”
Institutes	Affiliated organizations.	“-”
Country	Origin of the dataset.	“France”
Dataset Descriptions	Description of the data.	See ‘README’
Sample Size	Number of instances/rows.	27
Number of Variables	Total features/columns (categorical/numeric).	6 (interval)
Missing Values	Indicate if missing values exist and how they’re handled.	“None” / “Yes, marked as NA”
Variable Descriptions	Detailed description of each column (name, type, units, range).	See ‘README’
Source	Original data source (if applicable).	“Leroy et al. (1996)”
References	Citations for prior work using the dataset.	“Douzal-Chouakria, Billard, and Diday (2011)”
Applied Areas	Relevant fields (e.g., biology, finance).	“Machine Learning”
Usage Constraints	Licensing (CC-BY, MIT) or restrictions.	“Public domain”
Data Link	URL to download the dataset (Google Drive, GitHub, etc.).	“(https)”

Introduction to dataSDA

Po-Wei Chen, Chun-houh Chen and Han-Ming Wu*

2025-06-07

1 Introduction

2 Symbolic data formats conversion

2.1 Example: convert interval-valued datasets into the `symbolic_tbl` class

2.2 Example: the conversion of histogram-valued datasets into the `MatH` class

2.3 Example: convert iGAP format to MM format and RSDA format.

2.4 Example: convert RSDA format to MM format and iGAP format.

3 Descriptive statistics

3.1 For interval-valued data

3.2 For histogram-valued data

4 Symbolic dataset donation/submission guidelines

5 Citation
