Accelerating clustering algorithms with tree data structures

dc.contributor.advisorSchubert, Erich
dc.contributor.authorLang, Andreas Roland
dc.contributor.refereeZüfle, Andreas
dc.date.accepted2026-01-29
dc.date.accessioned2026-03-11T07:46:46Z
dc.date.issued2025
dc.description.abstractClustering is a central task in unsupervised learning, enabling the discovery of structure in data without prior labels. It supports a wide range of applications, from image recognition and anomaly detection to customer analytics and text mining. Despite decades of research, classical methods such as Hierarchical Agglomerative Clustering (HAC) and k-means remain popular due to their simplicity and interpretability, yet scaling them to large datasets is challenging. One of the most influential approaches for scalability is BIRCH, which introduced the Cluster Feature tree (CF-Tree) as a compact data representation. However, BIRCH suffers from numerical instability due to problematic variance computations, which can lead to catastrophic cancellation and unreliable results. This thesis presents BETULA, a refinement of BIRCH that replaces unstable variance formulas with robust running-statistics computations, preserving the efficiency of CF-Trees while ensuring numerical stability. For HAC, BETULA enables efficient and stable approximations of common linkage methods, making exploratory analysis feasible on much larger datasets. It also extends cluster features to support Gaussian Mixture Models, where (co-)variance-aware summaries allow scalable and stable optimization with high approximation quality. For k-means, we leverage variance information in cluster features to introduce new initialization strategies (tree, trunk, leaves) that approximate k-means++ and improve convergence speed over existing solutions. We show that the BETULA approximation for k-means delivers comparable results to standard k-means while being more efficient. For applications where approximation is not suitable, we present Cover-means, which accelerates exact k-means by integrating a Cover Tree index to prune redundant distance calculations, achieving superior runtime in a range of experimental settings. 
Finally, we highlight the crucial role of good initialization, which directly influences clustering quality.en
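The numerical instability the abstract attributes to BIRCH can be illustrated with a minimal sketch (not the thesis's actual BETULA implementation): the textbook variance formula E[X²] − E[X]² subtracts two nearly equal large sums and suffers catastrophic cancellation, while a running-statistics (Welford-style) update of the kind the abstract describes remains stable.

```python
def naive_variance(xs):
    # Textbook formula E[X^2] - E[X]^2: for data with a large mean and
    # small spread, the two terms are huge and nearly equal, so their
    # difference loses most significant digits (catastrophic cancellation).
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    return ss / n - (s / n) ** 2

def welford_variance(xs):
    # Welford's running update: accumulates the sum of squared deviations
    # from the current mean, avoiding the subtraction of large numbers.
    mean, m2 = 0.0, 0.0
    for i, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / i
        m2 += delta * (x - mean)
    return m2 / len(xs)

# Data with a large offset and tiny spread: values alternate between
# 1e8 and 1e8 + 1, so the true (population) variance is exactly 0.25.
xs = [1e8 + (i % 2) for i in range(1000)]
print(naive_variance(xs))    # typically far from 0.25, possibly negative
print(welford_variance(xs))  # close to 0.25
```

On well-conditioned data both functions agree; the gap only opens up when the mean dominates the spread, which is exactly the regime in which aggregated cluster-feature statistics are accumulated.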
dc.identifier.urihttp://hdl.handle.net/2003/44777
dc.identifier.urihttp://dx.doi.org/10.17877/DE290R-26541
dc.language.isoen
dc.subjectClusteringen
dc.subjectUnsupervised learningen
dc.subjectBIRCHen
dc.subjectCluster feature treeen
dc.subjectNumerical stabilityen
dc.subjectHierarchical agglomerative clusteringen
dc.subjectK-meansen
dc.subjectK-means++en
dc.subjectGaussian mixture modelsen
dc.subjectInitializationen
dc.subjectCover treeen
dc.subjectScalabilityen
dc.subject.ddc004
dc.subject.rswkCluster-Analysede
dc.subject.rswkUnüberwachtes Lernende
dc.subject.rswkk-Means-Algorithmusde
dc.subject.rswkZusammengesetzte Verteilungde
dc.subject.rswkSkalierbarkeitde
dc.titleAccelerating clustering algorithms with tree data structuresen
dc.typeText
dc.type.publicationtypePhDThesis
dcterms.accessRightsopen access
eldorado.dnb.deposittrue
eldorado.secondarypublicationfalse

Files

Original bundle

Name: Dissertation_Lang.pdf
Size: 1.75 MB
Format: Adobe Portable Document Format
Description: DNB

License bundle

Name: license.txt
Size: 4.82 KB
Format: Item-specific license agreed upon to submission
Description: