Accelerating clustering algorithms with tree data structures

dc.contributor.advisorSchubert, Erich
dc.contributor.authorLang, Andreas Roland
dc.contributor.refereeZüfle, Andreas
dc.date.accepted2026-01-29
dc.date.accessioned2026-03-11T07:46:46Z
dc.date.issued2025
dc.description.abstractClustering is a central task in unsupervised learning, enabling the discovery of structure in data without prior labels. It supports a wide range of applications, from image recognition and anomaly detection to customer analytics and text mining. Despite decades of research, classical methods such as Hierarchical Agglomerative Clustering (HAC) and k-means remain popular due to their simplicity and interpretability, yet scaling them to large datasets is challenging. One of the most influential approaches for scalability is BIRCH, which introduced the Cluster Feature tree (CF-Tree) as a compact data representation. However, BIRCH suffers from numerical instability due to problematic variance computations, which can lead to catastrophic cancellation and unreliable results. This thesis presents BETULA, a refinement of BIRCH that replaces unstable variance formulas with robust running-statistics computations, preserving the efficiency of CF-Trees while ensuring numerical stability. For HAC, BETULA enables efficient and stable approximations of common linkage methods, making exploratory analysis feasible on much larger datasets. It also extends cluster features to support Gaussian Mixture Models, where (co-)variance-aware summaries allow scalable and stable optimization with high approximation quality. For k-means, we leverage variance information in cluster features to introduce new initialization strategies (tree, trunk, leaves) that approximate k-means++ and improve convergence speed over existing solutions. We show that the BETULA approximation for k-means delivers comparable results to standard k-means while being more efficient. For applications where approximation is not suitable, we present Cover-means, which accelerates exact k-means by integrating a Cover Tree index to prune redundant distance calculations, achieving superior runtime in a range of experimental settings. 
Finally, we highlight the crucial role of good initialization, which directly influences clustering quality.en
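The numerical instability the abstract attributes to BIRCH can be illustrated with a minimal sketch (not the thesis's actual BETULA implementation): the textbook variance formula E[X²] − E[X]² subtracts two nearly equal large sums and suffers catastrophic cancellation, while a running-statistics (Welford-style) update of the kind the abstract describes remains stable.

```python
def naive_variance(xs):
    # Textbook formula E[X^2] - E[X]^2: for data with a large mean and
    # small spread, the two terms are huge and nearly equal, so their
    # difference loses most significant digits (catastrophic cancellation).
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    return ss / n - (s / n) ** 2

def welford_variance(xs):
    # Welford's running update: accumulates the sum of squared deviations
    # from the current mean, avoiding the subtraction of large numbers.
    mean, m2 = 0.0, 0.0
    for i, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / i
        m2 += delta * (x - mean)
    return m2 / len(xs)

# Data with a large offset and tiny spread: values alternate between
# 1e8 and 1e8 + 1, so the true (population) variance is exactly 0.25.
xs = [1e8 + (i % 2) for i in range(1000)]
print(naive_variance(xs))    # typically far from 0.25, possibly negative
print(welford_variance(xs))  # close to 0.25
```

On well-conditioned data both functions agree; the gap only opens up when the mean dominates the spread, which is exactly the regime in which aggregated cluster-feature statistics are accumulated.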
dc.identifier.urihttp://hdl.handle.net/2003/44777
dc.identifier.urihttp://dx.doi.org/10.17877/DE290R-26541
dc.language.isoen
dc.subjectClusteringen
dc.subjectUnsupervised learningen
dc.subjectBIRCHen
dc.subjectCluster feature treeen
dc.subjectNumerical stabilityen
dc.subjectHierarchical agglomerative clusteringen
dc.subjectK-meansen
dc.subjectK-means++en
dc.subjectGaussian mixture modelsen
dc.subjectInitializationen
dc.subjectCover treeen
dc.subjectScalabilityen
dc.subject.ddc004
dc.subject.rswkCluster-Analysede
dc.subject.rswkUnüberwachtes Lernende
dc.subject.rswkk-Means-Algorithmusde
dc.subject.rswkZusammengesetzte Verteilungde
dc.subject.rswkSkalierbarkeitde
dc.titleAccelerating clustering algorithms with tree data structuresen
dc.typeText
dc.type.publicationtypePhDThesis
dcterms.accessRightsopen access
eldorado.dnb.deposittrue
eldorado.secondarypublicationfalse

Files

Original bundle

Name: Dissertation_Lang.pdf
Size: 1.75 MB
Format: Adobe Portable Document Format
Description: DNB

License bundle

Name: license.txt
Size: 4.82 KB
Format: Item-specific license agreed upon to submission
Description: