| Type: | Package |
| Title: | Efficient Computations of Standard Clustering Comparison Measures |
| Version: | 1.1.0 |
| Maintainer: | Julien Chiquet <julien.chiquet@inrae.fr> |
| Description: | Implements an efficient O(n) algorithm based on bucket-sorting for fast computation of standard clustering comparison measures. Available measures include the adjusted Rand index (ARI), normalized information distance (NID), normalized mutual information (NMI), normalized variation of information (NVI) and entropy, as described in Vinh et al. (2009) <doi:10.1145/1553374.1553511>. Also includes the adjusted mutual information (AMI) since version 0.1.2, a modified version of the ARI (MARI), as described in Sundqvist et al. <doi:10.1007/s00180-022-01230-7>, and a simple Chi-square distance since version 1.0.0. |
| License: | GPL (≥ 3) |
| URL: | https://github.com/jchiquet/aricode |
| BugReports: | https://github.com/jchiquet/aricode/issues |
| Encoding: | UTF-8 |
| Imports: | Matrix, Rcpp, lifecycle |
| Suggests: | testthat, spelling, mclust, ggplot2, pkgdown |
| LinkingTo: | Rcpp |
| Language: | en-US |
| Config/roxygen2/version: | 8.0.0 |
| NeedsCompilation: | yes |
| Packaged: | 2026-05-13 14:30:35 UTC; jchiquet |
| Author: | Julien Chiquet |
| Repository: | CRAN |
| Date/Publication: | 2026-05-13 15:20:02 UTC |
aricode: Efficient Computations of Standard Clustering Comparison Measures
Description
A package for efficient computations of standard clustering comparison measures. Most of the available measures are described in Vinh et al. (JMLR, 2010); see the references below.
Details
Traditional implementations (e.g., the function adjustedRandIndex of
package mclust) run in Omega(n + u v) time, where n is the length of the
vectors whose classifications are to be compared, and u and v are the
respective numbers of classes in each vector. In aricode, the implementation
runs in Theta(n) time, with additional speed gains from C++ code.
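To make the Omega(n + u v) bound concrete, the following base-R sketch computes the ARI from a dense u x v contingency table, which is the usual table-based approach. The helper name ari_dense is hypothetical and not part of aricode; it only illustrates where the u v term comes from and does not reproduce the package's bucket-sorting implementation.

```r
# Hypothetical helper (not part of aricode): ARI from a dense contingency table.
# Building and scanning the u x v table costs O(n + u * v).
ari_dense <- function(c1, c2) {
  tab <- table(c1, c2)                    # dense u x v table of pair counts
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))           # pairs grouped together in both
  sum_i  <- sum(choose(rowSums(tab), 2))  # pairs grouped together in c1
  sum_j  <- sum(choose(colSums(tab), 2))  # pairs grouped together in c2
  expected <- sum_i * sum_j / choose(n, 2)
  maximum  <- (sum_i + sum_j) / 2
  (sum_ij - expected) / (maximum - expected)
}

cl <- cutree(hclust(dist(iris[, -5])), 4)
ari_dense(cl, iris$Species)
```

When u and v are both close to n, most cells of the table are empty, which is the case a pair-sorting approach avoids.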
The functions included in aricode are:
* ARI: computes the adjusted Rand index
* Chi2: computes the Chi-square statistic
* MARI: computes the modified adjusted Rand index (Sundqvist et al, 2023)
* MARIraw: computes the raw version of the modified adjusted Rand index
* RI: computes the Rand index
* NVI: computes the normalized variation of information
* NID: computes the normalized information distance
* NMI: computes the normalized mutual information
* AMI: computes the adjusted mutual information
* entropy: computes the conditional and joint entropies
* clustComp: computes all clustering comparison measures at once
Author(s)
Maintainer: Julien Chiquet julien.chiquet@inrae.fr (ORCID)
Authors:
Julien Chiquet julien.chiquet@inrae.fr (ORCID)
Guillem Rigaill guillem.rigaill@inrae.fr
Martina Sundqvist martina.sundqvist@agroparistech.fr
Other contributors:
Valentin Dervieux valentin.dervieux@gmail.com [contributor]
Florent Bersani florent@bersani.org [contributor]
References
* Nguyen Xuan Vinh, Julien Epps, and James Bailey. "Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance." Journal of Machine Learning Research 11.Oct (2010): 2837-2854.
* Sundqvist, Martina, Julien Chiquet, and Guillem Rigaill. "Adjusting the adjusted Rand Index: A multinomial story." Computational Statistics 38.1 (2023): 327-347.
See Also
ARI, RI, NID, NVI, AMI, NMI, entropy, clustComp
Adjusted Mutual Information
Description
A function to compute the adjusted mutual information between two classifications
Usage
AMI(c1, c2, sorted_pairs = NULL)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
Value
a scalar with the adjusted mutual information.
See Also
ARI, RI, NID, NVI, NMI, clustComp
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
AMI(cl, iris$Species)
Adjusted Rand Index
Description
A function to compute the adjusted Rand index between two classifications
Usage
ARI(c1, c2, sorted_pairs = NULL)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
Value
a scalar with the adjusted Rand index.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
ARI(cl, iris$Species)
Chi-square statistic
Description
A function to compute the Chi-square statistic between two classifications
Usage
Chi2(c1, c2, sorted_pairs = NULL)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
Value
a scalar with the Chi-square statistic.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
Chi2(cl, iris$Species)
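As a point of reference, and assuming that the statistic is Pearson's Chi-square computed on the contingency table of the two classifications (an assumption, not stated above), the quantity can be sketched in base R. The helper chi2_dense is hypothetical and not part of aricode.

```r
# Hypothetical helper (not part of aricode): Pearson's Chi-square statistic
# of the contingency table, without continuity correction.
chi2_dense <- function(c1, c2) {
  tab <- table(c1, c2)
  n <- sum(tab)
  # Counts expected under independence of the two classifications
  expected <- outer(rowSums(tab), colSums(tab)) / n
  sum((tab - expected)^2 / expected)
}

cl <- cutree(hclust(dist(iris[, -5])), 4)
chi2_dense(cl, iris$Species)
```

Under this assumption the value should agree with chisq.test(table(cl, iris$Species), correct = FALSE)$statistic from base R.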
Frobenius norm
Description
A function to compute the Frobenius norm between two classifications, as defined in Lajugie et al. (2014) and Arlot et al. (2019)
Usage
Frobenius(c1, c2, sorted_pairs = NULL)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
Value
a scalar with the Frobenius norm.
References
- Rémi Lajugie, Francis Bach, and Sylvain Arlot. "Large-margin metric learning for constrained partitioning problems." International Conference on Machine Learning. PMLR, 2014.
- Sylvain Arlot, Alain Celisse, and Zaid Harchaoui. "A kernel multiple change-point algorithm via model selection." Journal of Machine Learning Research 20.162 (2019): 1-56.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
Frobenius(cl, iris$Species)
Modified Adjusted Rand Index
Description
A function to compute a modified adjusted Rand index between two classifications as proposed by Sundqvist et al. (2023), based on a multinomial model.
Usage
MARI(c1, c2, sorted_pairs = NULL, raw = FALSE)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
| raw | Boolean: should the raw version of the MARI be computed? Defaults to FALSE. |
Value
a scalar with the modified ARI.
References
Sundqvist, Martina, Julien Chiquet, and Guillem Rigaill. "Adjusting the adjusted Rand Index: A multinomial story." Computational Statistics 38.1 (2023): 327-347.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
MARI(cl, iris$Species)
Raw Modified Adjusted Rand Index
Description
Deprecated.
The function MARI now has an argument raw for computing the raw version of the MARI: please use MARI(c1, c2, raw = TRUE) instead.
Usage
MARIraw(c1, c2, sorted_pairs = NULL)
Details
A function to compute a modified adjusted Rand index between two classifications as proposed by Sundqvist et al. (2023), based on a multinomial model. Raw means that the index is not divided by the (maximum - expected) value.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
out <- MARIraw(cl, iris$Species)
# ->
out <- MARI(cl, iris$Species, raw = TRUE)
Normalized information distance (NID)
Description
A function to compute the NID between two classifications
Usage
NID(c1, c2, sorted_pairs = NULL)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
Value
a scalar with the normalized information distance.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
NID(cl, iris$Species)
Normalized mutual information (NMI)
Description
A function to compute the NMI between two classifications
Usage
NMI(
c1,
c2,
variant = c("max", "min", "sqrt", "sum", "joint"),
sorted_pairs = NULL
)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| variant | A string in ("max", "min", "sqrt", "sum", "joint") selecting the variant of NMI. Defaults to "max". |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
Value
a scalar with the normalized mutual information.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
NMI(cl, iris$Species)
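The variant names suggest the usual normalizations of the mutual information discussed by Vinh et al.: division by the maximum, minimum, geometric mean, or arithmetic mean of the two entropies, or by the joint entropy. The base-R sketch below spells these out under that assumption; nmi_sketch is a hypothetical helper, not part of aricode, and the package's exact conventions (for instance the logarithm base) are not confirmed here.

```r
# Hypothetical helper (not part of aricode): NMI under five normalizations.
nmi_sketch <- function(c1, c2,
                       variant = c("max", "min", "sqrt", "sum", "joint")) {
  variant <- match.arg(variant)        # defaults to the first choice, "max"
  h <- function(counts) {              # empirical entropy, natural logarithm
    p <- counts[counts > 0] / sum(counts)
    -sum(p * log(p))
  }
  tab <- table(c1, c2)
  h1  <- h(rowSums(tab))               # entropy of c1
  h2  <- h(colSums(tab))               # entropy of c2
  h12 <- h(tab)                        # joint entropy
  mi  <- h1 + h2 - h12                 # mutual information
  denom <- switch(variant,
    max   = max(h1, h2),
    min   = min(h1, h2),
    sqrt  = sqrt(h1 * h2),
    sum   = (h1 + h2) / 2,
    joint = h12
  )
  mi / denom
}

cl <- cutree(hclust(dist(iris[, -5])), 4)
sapply(c("max", "min", "sqrt", "sum", "joint"),
       function(v) nmi_sketch(cl, iris$Species, variant = v))
```

Because the joint entropy is at least as large as either marginal entropy, the "joint" variant gives the smallest value and "min" the largest.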
Normalized variation of information (NVI)
Description
A function to compute the NVI between two classifications
Usage
NVI(c1, c2, sorted_pairs = NULL)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
Value
a scalar with the normalized variation of information.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
NVI(cl, iris$Species)
Rand Index
Description
A function to compute the Rand index between two classifications
Usage
RI(c1, c2, sorted_pairs = NULL)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
Value
a scalar with the Rand index.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
RI(cl, iris$Species)
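For readers who want to see what the Rand index counts, here is a base-R sketch with two hypothetical helpers (neither is part of aricode): ri_dense uses the contingency-table formula, and ri_pairs checks it by brute force over all pairs of observations. It illustrates the definition only, not the package's implementation.

```r
# Hypothetical helper: Rand index from the contingency table.
ri_dense <- function(c1, c2) {
  tab <- table(c1, c2)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))           # pairs together in both
  sum_i  <- sum(choose(rowSums(tab), 2))  # pairs together in c1
  sum_j  <- sum(choose(colSums(tab), 2))  # pairs together in c2
  1 + (2 * sum_ij - sum_i - sum_j) / choose(n, 2)
}

# Hypothetical helper: fraction of pairs on which both classifications agree
# (together in both, or separated in both). Quadratic in n, for checking only.
ri_pairs <- function(c1, c2) {
  c1 <- as.integer(factor(c1))
  c2 <- as.integer(factor(c2))
  same1 <- outer(c1, c1, "==")
  same2 <- outer(c2, c2, "==")
  up <- upper.tri(same1)                  # each unordered pair once
  mean(same1[up] == same2[up])
}

cl <- cutree(hclust(dist(iris[, -5])), 4)
all.equal(ri_dense(cl, iris$Species), ri_pairs(cl, iris$Species))
```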
Measures of similarity between two classifications
Description
Deprecated.
A function for computing all the measures of similarity implemented in this package at once. Includes ARI, RI, MI, VI, NVI, ID, NID, NMI, Chi2, MARI, and Frobenius.
This is only a change in the function name: please use compare_clustering() instead.
Usage
clustComp(c1, c2, sorted_pairs = NULL, AMI = FALSE)
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
out <- clustComp(cl, iris$Species)
# ->
out <- compare_clustering(cl, iris$Species)
Measures of similarity between two classifications
Description
A function for computing all the measures of similarity implemented in this package at once. Includes (A)RI, (N)MI, (N)VI, (N)ID, Chi2, MARI, and Frobenius.
Usage
compare_clustering(c1, c2, sorted_pairs = NULL, AMI = FALSE)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
| AMI | Boolean: should the AMI be computed (more costly than all other measures)? Defaults to FALSE. |
Value
a list with all the measures available
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
compare_clustering(cl, iris$Species)
Entropy
Description
A function to compute the empirical entropies of two classification vectors, together with their joint entropy
Usage
entropy(c1, c2, sorted_pairs = NULL)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| sorted_pairs | Optional output of the function sort_pairs (if already computed). If NULL (the default), sort_pairs is called internally. |
Value
a list with the two conditional entropies, the joint entropy and output of sort_pairs.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
entropy(cl, iris$Species)
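The entropies involved can be written down directly from the class counts. The following base-R sketch uses a hypothetical helper, entropy_sketch, which is not part of aricode; the element names and the natural logarithm are assumptions made for illustration and need not match what entropy() returns.

```r
# Hypothetical helper (not part of aricode): empirical entropies from counts.
entropy_sketch <- function(c1, c2) {
  h <- function(counts) {              # empirical entropy, natural logarithm
    p <- counts[counts > 0] / sum(counts)
    -sum(p * log(p))
  }
  tab <- table(c1, c2)
  h1  <- h(rowSums(tab))               # entropy of c1
  h2  <- h(colSums(tab))               # entropy of c2
  h12 <- h(tab)                        # joint entropy of (c1, c2)
  list(
    H1 = h1, H2 = h2, H12 = h12,
    H1_given_2 = h12 - h2,             # conditional entropy of c1 given c2
    H2_given_1 = h12 - h1              # conditional entropy of c2 given c1
  )
}

cl <- cutree(hclust(dist(iris[, -5])), 4)
str(entropy_sketch(cl, iris$Species))
```

The conditional entropies follow from the chain rule, H(c1 | c2) = H(c1, c2) - H(c2), so the joint entropy and the two marginal entropies are enough to recover them.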
Sort Pairs
Description
Deprecated.
A function to sort pairs of integers or factors and identify the pairs between two classifications.
This is only a change in the function name: please use sort_pairs() instead.
Usage
sortPairs(c1, c2, spMat = FALSE)
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
out <- sortPairs(cl, iris$Species)
# ->
out <- sort_pairs(cl, iris$Species)
Sort Pairs
Description
A function to sort pairs of integers or factors and identify the pairs between two classifications
Usage
sort_pairs(c1, c2, spMat = FALSE)
Arguments
| c1 | A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| c2 | A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list. |
| spMat | Logical. If |
Details
Pair sorting, which is at the heart of computing all clustering comparison measures, has been carefully optimized. Hence, even basic R operations (checking for the presence of NAs, type conversion, or constructing a sparse contingency matrix as an output) have non-negligible cost compared to the pair sorting itself. For optimal performance, please provide the vectors as integers or factors without any NAs.
Value
A list containing the following elements:
- spMat: A sparsely encoded contingency matrix (only if spMat = TRUE).
- levels: A list containing the retained levels for each classification.
- nij: A vector of positive pair counts.
- ni., n.j: Vectors of class counts for c1 and c2, respectively.
- pair_c1, pair_c2: Integer vectors specifying the classes in c1 and c2 corresponding to the counts in nij. These provide the row and column indices for the contingency matrix.
Examples
data(iris)
cl <- cutree(hclust(dist(iris[, -5])), 4)
out <- sort_pairs(cl, iris$Species)
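To give a feel for the elements listed under Value, the sketch below derives analogous quantities in base R from a dense contingency table. The helper pairs_sketch is hypothetical and not part of aricode; the 1-based indices and the ordering of the pairs are choices made for this illustration and may differ from what sort_pairs returns, which relies on bucket sorting rather than a dense table.

```r
# Hypothetical helper (not part of aricode): sparse description of the
# contingency table, mirroring the element names listed above.
pairs_sketch <- function(c1, c2) {
  f1 <- factor(c1)
  f2 <- factor(c2)
  tab <- table(f1, f2)
  nz <- which(tab > 0, arr.ind = TRUE)  # (row, column) of non-empty cells
  list(
    levels  = list(c1 = levels(f1), c2 = levels(f2)),
    nij     = as.vector(tab[nz]),       # positive pair counts
    ni.     = as.vector(rowSums(tab)),  # class counts for c1
    n.j     = as.vector(colSums(tab)),  # class counts for c2
    pair_c1 = unname(nz[, 1]),          # 1-based class index in c1
    pair_c2 = unname(nz[, 2])           # 1-based class index in c2
  )
}

cl <- cutree(hclust(dist(iris[, -5])), 4)
sk <- pairs_sketch(cl, iris$Species)
sum(sk$nij) == length(cl)               # every observation falls in one pair
```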