Graph set data mining

Schäfer, Till

dc.contributor.advisor	Mutzel, Petra
dc.contributor.author	Schäfer, Till
dc.contributor.referee	Buchin, Kevin
dc.date.accepted	2023-07-05
dc.date.accessioned	2023-10-18T09:12:47Z
dc.date.available	2023-10-18T09:12:47Z
dc.date.issued	2023
dc.description.abstract	Graphs are among the most versatile abstract data types in computer science. With the variety comes great adoption in various application fields, such as chemistry, biology, social analysis, logistics, and computer science itself. With the growing capacities of digital storage, the collection of large amounts of data has become the norm in many application fields. Data mining, i.e., the automated extraction of non-trivial patterns from data, is a key step to extract knowledge from these datasets and generate value. This thesis is dedicated to concurrent scalable data mining algorithms beyond traditional notions of efficiency for large-scale datasets of small labeled graphs; more precisely, structural clustering and representative subgraph pattern mining. It is motivated by, but not limited to, the need to analyze molecular libraries of ever-increasing size in the drug discovery process. Structural clustering makes use of graph theoretical concepts, such as (common) subgraph isomorphisms and frequent subgraphs, to model cluster commonalities directly in the application domain. It is considered computationally demanding for non-restricted graph classes, and with very few exceptions, prior algorithms are only suitable for very small datasets. This thesis discusses the first truly scalable structural clustering algorithm, StruClus, with linear worst-case complexity. At the same time, StruClus embraces the inherent values of structural clustering algorithms, i.e., interpretable, consistent, and high-quality results. A novel two-fold sampling strategy with stochastic error bounds for frequent subgraph mining is presented. It enables fast extraction of cluster commonalities in the form of common subgraph representative sets. StruClus is the first structural clustering algorithm with a directed selection of structural cluster-representative patterns regarding homogeneity and separation aspects in the high-dimensional subgraph pattern space. 
Furthermore, a novel concept of cluster homogeneity balancing using dynamically-sized representatives is discussed. The second part of this thesis discusses the representative subgraph pattern mining problem in more general terms. A novel objective function maximizes the number of represented graphs for a cardinality-constrained representative set. It is shown that the problem is a special case of the maximum coverage problem and is NP-hard. Based on the greedy approximation of Nemhauser, Wolsey, and Fisher for submodular set function maximization, a novel sampling approach is presented. It mines candidate sets that contain an optimal greedy solution with a probabilistic maximum error. This leads to a constant-time algorithm to generate the candidate sets given a fixed-size sample of the dataset. In combination with a cheap single-pass streaming evaluation of the candidate sets, this enables scalability to datasets with billions of molecules on a single machine. Ultimately, the sampling approach leads to the first distributed subgraph pattern mining algorithm that distributes the pattern space and the dataset graphs at the same time.	de
dc.identifier.uri	http://hdl.handle.net/2003/42158
dc.identifier.uri	http://dx.doi.org/10.17877/DE290R-23991
dc.language.iso	en	de
dc.subject	Clustering	en
dc.subject	Data mining	en
dc.subject	Cheminformatics	en
dc.subject	Graph algorithms	en
dc.subject	Stochastic approximation algorithms	en
dc.subject	Randomized algorithms	en
dc.subject.ddc	004
dc.subject.rswk	Cluster	de
dc.subject.rswk	Data Mining	de
dc.subject.rswk	Computational chemistry	de
dc.subject.rswk	Graph	de
dc.subject.rswk	Stochastische Approximation	de
dc.subject.rswk	Randomisierter Algorithmus	de
dc.title	Graph set data mining	en
dc.title.alternative	clustering and pattern mining in the context of cheminformatics	en
dc.type	Text	de
dc.type.publicationtype	PhDThesis	de
dcterms.accessRights	open access
eldorado.dnb.deposit	true	de
eldorado.secondarypublication	false	de
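
The representative-set objective described in the abstract (choose at most k patterns that together represent as many graphs as possible) is an instance of maximum coverage. The following is a minimal sketch of the standard greedy approximation for that problem, assuming each candidate pattern is already given as the set of graph IDs it represents. The function name, the pattern names, and the toy data are hypothetical and not taken from the thesis; the sampling and streaming steps mentioned in the abstract are not shown.

```python
def greedy_max_coverage(coverage, k):
    """Greedily pick up to k patterns by marginal gain in covered graphs.

    coverage: dict mapping a (hypothetical) pattern name to the set of
              graph IDs that the pattern represents.
    Returns the chosen pattern names in selection order and the set of
    covered graph IDs.
    """
    chosen = []
    covered = set()
    remaining = dict(coverage)
    for _ in range(k):
        # Pattern that adds the most not-yet-covered graphs.
        # Iterating over sorted names breaks ties deterministically.
        best = max(
            sorted(remaining),
            key=lambda p: len(remaining[p] - covered),
            default=None,
        )
        # Stop early if nothing is left or no pattern adds new graphs.
        if best is None or not (remaining[best] - covered):
            break
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Toy data (assumption for illustration): four patterns over six graphs.
patterns = {
    "ring": {1, 2, 3, 4},
    "chain": {3, 4, 5},
    "branch": {5, 6},
    "tail": {1, 6},
}
chosen, covered = greedy_max_coverage(patterns, k=2)
print(chosen, len(covered))
```

For maximum coverage, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal coverage, which is the guarantee from Nemhauser, Wolsey, and Fisher that the abstract refers to.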

Files

Original bundle

Name: Dissertation_Schaefer.pdf
Size: 2.57 MB
Format: Adobe Portable Document Format
Description: DNB

License bundle

Name: license.txt
Size: 4.85 KB
Format: Item-specific license agreed upon to submission
Description:

Collections

LS 11