Accelerating clustering algorithms with tree data structures
Abstract
Clustering is a central task in unsupervised learning, enabling the discovery of structure in data without prior labels. It supports a wide range of applications, from image recognition and anomaly detection to customer analytics and text mining. Despite decades of research, classical methods such as Hierarchical Agglomerative Clustering (HAC) and k-means remain popular due to their simplicity and interpretability, yet scaling them to large datasets is challenging. One of the most influential approaches for scalability is BIRCH, which introduced the Cluster Feature tree (CF-Tree) as a compact data representation. However, BIRCH suffers from numerical instability due to problematic variance computations, which can lead to catastrophic cancellation and unreliable results.
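The cancellation problem can be demonstrated in a few lines. The sketch below contrasts the textbook sum-of-squares variance formula (the kind of computation underlying BIRCH's cluster features) with Welford-style running statistics (the kind of stable computation BETULA adopts); the function names and toy data are illustrative, not taken from the thesis.

```python
# Demo: catastrophic cancellation in Var(X) = E[X^2] - E[X]^2
# versus Welford-style running statistics.

def naive_variance(xs):
    """Sum-of-squares variance: E[X^2] and E[X]^2 become huge, nearly
    equal numbers, and their subtraction cancels most significant digits."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

def running_variance(xs):
    """Welford's online algorithm: update the mean and the sum of
    squared deviations incrementally; numerically stable."""
    mean, ssd = 0.0, 0.0
    for i, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / i
        ssd += delta * (x - mean)  # uses the *updated* mean
    return ssd / len(xs)

# Points with a large common offset; the true variance is 22.5.
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]
```

On this data the naive formula loses essentially all significant digits in double precision (it can even return a negative "variance"), while the running computation recovers 22.5.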
This thesis presents BETULA, a refinement of BIRCH that replaces unstable variance formulas with robust running-statistics computations, preserving the efficiency of CF-Trees while ensuring numerical stability. For HAC, BETULA enables efficient and stable approximations of common linkage methods, making exploratory analysis feasible on much larger datasets. It also extends cluster features to support Gaussian Mixture Models, where (co-)variance-aware summaries allow scalable and stable optimization with high approximation quality. For k-means, we leverage variance information in cluster features to introduce new initialization strategies (tree, trunk, leaves) that approximate k-means++ and improve convergence speed over existing solutions. We show that the BETULA approximation for k-means delivers results comparable to standard k-means while being more efficient. For applications where approximation is not acceptable, we present Cover-means, which accelerates exact k-means by integrating a Cover Tree index to prune redundant distance calculations, achieving superior runtime in a range of experimental settings. Finally, we highlight the crucial role of good initialization, which directly influences clustering quality.
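To illustrate the running-statistics idea behind such cluster features, here is a minimal one-dimensional sketch that stores count, mean, and sum of squared deviations, merged via the standard pairwise-update formula of Chan et al. The class name and fields are illustrative assumptions, not the thesis's actual data structure (BETULA's cluster features are d-dimensional and organized in a CF-Tree).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterFeature:
    """Illustrative 1-d cluster feature: count, mean, and sum of
    squared deviations (SSD) instead of a raw sum of squares."""
    n: int
    mean: float
    ssd: float

    @staticmethod
    def from_point(x: float) -> "ClusterFeature":
        return ClusterFeature(1, x, 0.0)

    def merge(self, other: "ClusterFeature") -> "ClusterFeature":
        # Pairwise update (Chan et al.): stable even when the two
        # means are large and close together.
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        ssd = self.ssd + other.ssd + delta * delta * self.n * other.n / n
        return ClusterFeature(n, mean, ssd)

    @property
    def variance(self) -> float:
        return self.ssd / self.n

# Inserting points one by one recovers the exact population variance:
cf = ClusterFeature.from_point(4.0)
for x in (7.0, 13.0, 16.0):
    cf = cf.merge(ClusterFeature.from_point(x))
```

Because only means and deviations are combined, merging stays accurate even when all points share a large offset, which is exactly where a sum-of-squares representation breaks down.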
Keywords
Clustering, Unsupervised learning, BIRCH, Cluster feature tree, Numerical stability, Hierarchical agglomerative clustering, K-means, K-means++, Gaussian mixture models, Initialization, Cover tree, Scalability
Keywords (RSWK)
Cluster analysis, Unsupervised learning, k-means algorithm, Mixture distribution, Scalability
