Accelerating clustering algorithms with tree data structures
Abstract
Clustering is a central task in unsupervised learning, enabling the discovery of structure in data without prior labels. It supports a wide range of applications, from image recognition and anomaly detection to customer analytics and text mining. Despite decades of research, classical methods such as Hierarchical Agglomerative Clustering (HAC) and k-means remain popular due to their simplicity and interpretability, yet scaling them to large datasets is challenging. One of the most influential approaches for scalability is BIRCH, which introduced the Cluster Feature tree (CF-Tree) as a compact data representation. However, BIRCH suffers from numerical instability due to problematic variance computations, which can lead to catastrophic cancellation and unreliable results.
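The cancellation problem can be demonstrated in a few lines. The sketch below contrasts the textbook sum-of-squares variance formula (the kind of computation underlying BIRCH's cluster features) with Welford-style running statistics (the kind of stable computation BETULA adopts); the function names and toy data are illustrative, not taken from the thesis.

```python
# Demo: catastrophic cancellation in Var(X) = E[X^2] - E[X]^2
# versus Welford-style running statistics.

def naive_variance(xs):
    """Sum-of-squares variance: E[X^2] and E[X]^2 become huge, nearly
    equal numbers, and their subtraction cancels most significant digits."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

def running_variance(xs):
    """Welford's online algorithm: update the mean and the sum of
    squared deviations incrementally; numerically stable."""
    mean, ssd = 0.0, 0.0
    for i, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / i
        ssd += delta * (x - mean)  # uses the *updated* mean
    return ssd / len(xs)

# Points with a large common offset; the true variance is 22.5.
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]
```

On this data the naive formula loses essentially all significant digits in double precision (it can even return a negative "variance"), while the running computation recovers 22.5.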
This thesis presents BETULA, a refinement of BIRCH that replaces unstable variance formulas with robust running-statistics computations, preserving the efficiency of CF-Trees while ensuring numerical stability. For HAC, BETULA enables efficient and stable approximations of common linkage methods, making exploratory analysis feasible on much larger datasets. It also extends cluster features to support Gaussian Mixture Models, where (co-)variance-aware summaries allow scalable and stable optimization with high approximation quality. For k-means, we leverage variance information in cluster features to introduce new initialization strategies (tree, trunk, leaves) that approximate k-means++ and improve convergence speed over existing solutions. We show that the BETULA approximation for k-means delivers results comparable to standard k-means while being more efficient. For applications where approximation is not acceptable, we present Cover-means, which accelerates exact k-means by integrating a Cover Tree index to prune redundant distance calculations, achieving superior runtime in a range of experimental settings. Finally, we highlight the crucial role of good initialization, which directly influences clustering quality.
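To illustrate the running-statistics idea behind such cluster features, here is a minimal one-dimensional sketch that stores count, mean, and sum of squared deviations, merged via the standard pairwise-update formula of Chan et al. The class name and fields are illustrative assumptions, not the thesis's actual data structure (BETULA's cluster features are d-dimensional and organized in a CF-Tree).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterFeature:
    """Illustrative 1-d cluster feature: count, mean, and sum of
    squared deviations (SSD) instead of a raw sum of squares."""
    n: int
    mean: float
    ssd: float

    @staticmethod
    def from_point(x: float) -> "ClusterFeature":
        return ClusterFeature(1, x, 0.0)

    def merge(self, other: "ClusterFeature") -> "ClusterFeature":
        # Pairwise update (Chan et al.): stable even when the two
        # means are large and close together.
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        ssd = self.ssd + other.ssd + delta * delta * self.n * other.n / n
        return ClusterFeature(n, mean, ssd)

    @property
    def variance(self) -> float:
        return self.ssd / self.n

# Inserting points one by one recovers the exact population variance:
cf = ClusterFeature.from_point(4.0)
for x in (7.0, 13.0, 16.0):
    cf = cf.merge(ClusterFeature.from_point(x))
```

Because only means and deviations are combined, merging stays accurate even when all points share a large offset, which is exactly where a sum-of-squares representation breaks down.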
Keywords
Clustering, Unsupervised learning, BIRCH, Cluster feature tree, Numerical stability, Hierarchical agglomerative clustering, K-means, K-means++, Gaussian mixture models, Initialization, Cover tree, Scalability
Keywords (RSWK)
Cluster analysis, Unsupervised learning, k-means algorithm, Mixture distribution, Scalability
