Abstract
Cluster analysis plays vital role in pattern recognition in several fields of science. Silhouette width is a widely used measure for assessing the fit of individual objects in the classification, as well as the quality of clusters and the entire classification. This index uses two clustering criteria, compactness (average within-cluster distances) and separation (average between-cluster distances), which implies that spherical cluster shapes are preferred over others – a property that can be seen as a disadvantage in the presence 22 of clusters with high internal heterogeneity, which is common in real situations.
We suggest a generalization of the silhouette width using the generalized mean. By changing the p parameter of the generalized mean between −∞ and +∞, several specific summary statistics, including the minimum, maximum, the arithmetic, harmonic, and geometric means, can be reproduced. Implementing the generalized mean in the calculation of silhouette width allows for changing the sensitivity of the index to compactness vs. connectedness. With higher sensitivity to connectedness instead of compactness the preference of silhouette width towards spherical clusters is expected to reduce. We test the performance of the generalized silhouette width on artificial data sets and on the Iris data set. We examine how classifications with different numbers of clusters prepared by single linkage, group average, and complete linkage algorithms are evaluated, if p is set to different values.
When p was negative, well separated clusters achieved high silhouette widths despite their elongated or circular shapes. Positive values of p increased the importance of compactness, hence the preference towards spherical clusters became even more detectable. With low p, single linkage clustering was deemed the most efficient clustering method, while with higher parameter values the performance of group average and complete linkage seemed better.
The generalized silhouette width is a promising tool for assessing clustering quality. It allows for adjusting the contribution of compactness and connectedness criteria to the index value, thus avoiding underestimation of clustering efficiency in the presence of clusters with high internal heterogeneity.