A mathematical theory of making hard decisions: model selection and robustness of matrix factorization with binary constraints

dc.contributor.advisorMorik, Katharina
dc.contributor.authorHeß, Sibylle Charlotte
dc.contributor.refereeSiebes, Arno P. J. M.
dc.date.accepted2019-06-25
dc.date.accessioned2019-10-08T06:10:24Z
dc.date.available2019-10-08T06:10:24Z
dc.date.issued2018
dc.description.abstractOne of the first and most fundamental tasks in machine learning is to group observations within a dataset. Given a notion of similarity, finding those instances which are outstandingly similar to each other has manifold applications. Recommender systems and topic analysis in text data are examples which are most intuitive to grasp. The interpretation of the groups, called clusters, is facilitated if the assignment of samples is definite. Especially in high-dimensional data, denoting a degree to which an observation belongs to a specified cluster requires a subsequent processing of the model to filter the most important information. We argue that a good summary of the data provides hard decisions on the following question: how many groups are there, and which observations belong to which clusters? In this work, we contribute to the theoretical and practical background of clustering tasks, addressing one or both aspects of this question. Our overview of state-of-the-art clustering approaches details the challenges of our ambition to provide hard decisions. Based on this overview, we develop new methodologies for two branches of clustering: the one concerns the derivation of nonconvex clusters, known as spectral clustering; the other addresses the identification of biclusters, a set of samples together with similarity defining features, via Boolean matrix factorization. One of the main challenges in both considered settings is the robustness to noise. Assuming that the issue of robustness is controllable by means of theoretical insights, we have a closer look at those aspects of established clustering methods which lack a theoretical foundation. In the scope of Boolean matrix factorization, we propose a versatile framework for the optimization of matrix factorizations subject to binary constraints. Especially Boolean factorizations have been computed by intuitive methods so far, implementing greedy heuristics which lack quality guarantees of obtained solutions. In contrast, we propose to build upon recent advances in nonconvex optimization theory. This enables us to provide convergence guarantees to local optima of a relaxed objective, requiring only approximately binary factor matrices. By means of this new optimization scheme PAL-Tiling, we propose two approaches to automatically determine the number of clusters. The one is based on information theory, employing the minimum description length principle, and the other is a novel statistical approach, controlling the false discovery rate. The flexibility of our framework PAL-Tiling enables the optimization of novel factorization schemes. In a different context, where every data point belongs to a pre-defined class, a characterization of the classes may be obtained by Boolean factorizations. However, there are cases where this traditional factorization scheme is not sufficient. Therefore, we propose the integration of another factor matrix, reflecting class-specific differences within a cluster. Our theoretical considerations are complemented by empirical evaluations, showing how our methods combine theoretical soundness with practical advantages.en
dc.identifier.urihttp://hdl.handle.net/2003/38270
dc.identifier.urihttp://dx.doi.org/10.17877/DE290R-20240
dc.language.isoende
dc.subjectMatrix factorizationen
dc.subjectBinary optimizationen
dc.subjectClusteringen
dc.subject.ddc004
dc.subject.rswkMatrizenzerlegungde
dc.subject.rswkBoolesche Optimierungde
dc.subject.rswkCluster-Analysede
dc.titleA mathematical theory of making hard decisions: model selection and robustness of matrix factorization with binary constraintsen
dc.typeTextde
dc.type.publicationtypedoctoralThesisde
dcterms.accessRightsopen access
eldorado.secondarypublicationfalsede

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
DissHess.pdf
Size:
2.9 MB
Format:
Adobe Portable Document Format
Description:
DNB
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
4.85 KB
Format:
Item-specific license agreed upon to submission
Description: