Bayesian mixtures for cluster analysis and flexible modeling of distributions

dc.contributor.advisor: Ickstadt, Katja
dc.contributor.author: Fritsch, Arno
dc.contributor.referee: Weihs, Claus
dc.date.accepted: 2010-06-11
dc.date.accessioned: 2010-07-02T13:36:58Z
dc.date.available: 2010-07-02T13:36:58Z
dc.date.issued: 2010-07-02
dc.description.abstract: Finite mixture models assume that a distribution is a combination of several parametric distributions. They offer a compromise between the interpretability of parametric models and the flexibility of nonparametric models. This thesis considers a Bayesian approach to these models, which has several advantages. For example, using only weak prior information, it can resolve the problems with unbounded likelihood functions that can occur in mixture models. The Bayesian approach also allows an elegant extension of finite to countably infinite mixture models. Depending on the application, the components of a mixture model can be viewed either merely as a means of modeling a distribution flexibly or as defining subgroups of a population with different parametric distributions. For the former case, consistency results for Bayesian mixtures are stated. An example concerning the flexible modeling of a random-effects distribution in a logistic regression is also given; the application considers the goalkeeper's effect in saving a penalty. In the latter case, mixture models can be used for clustering. Bayesian mixtures then allow the number of clusters to be estimated at the same time as the cluster-specific parameters. For cluster analysis, however, the standard approach for fitting Bayesian mixtures, Markov chain Monte Carlo (MCMC), leads to inferential difficulties: the labels associated with the clusters can change during the MCMC run, a phenomenon called label-switching. The problem becomes severe if the number of components is allowed to vary. Existing methods for dealing with label-switching and a varying number of components are reviewed, and new approaches are proposed for both situations. The first is a variant of the relabeling algorithm of Stephens (2000). The variant is more general, as it operates on the drawn clusterings rather than the drawn parameter values, and therefore does not depend on the specific form of the component distributions.
The second approach is based on pairwise posterior probabilities and improves on a commonly used loss function due to Binder (1978). Minimizing this loss is shown to be equivalent to maximizing the posterior expected Rand index with the true clustering. As the adjusted Rand index is preferable to the raw index, maximizing the posterior expected adjusted Rand index is proposed instead. The new approaches are compared to the existing methods on simulated and real data. The real data used for cluster analysis are two gene expression data sets and Fisher's iris data.
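The two quantities the abstract builds on can be illustrated concretely: pairwise posterior probabilities of co-clustering, estimated from MCMC draws of cluster labels, and the adjusted Rand index between two clusterings. The sketch below is a minimal NumPy illustration of these definitions, not the thesis's implementation (which is in the R package mcclust); the function names are my own, and the degenerate case where both clusterings are trivial is not handled.

```python
import numpy as np

def posterior_similarity(draws):
    """Pairwise posterior probabilities that items i and j share a cluster.

    draws: (M, n) array; row m holds the cluster labels of the n items in
    MCMC draw m.  Label-switching is harmless here, because only
    co-membership (labels[i] == labels[j]) enters the computation.
    """
    draws = np.asarray(draws)
    M, n = draws.shape
    psm = np.zeros((n, n))
    for labels in draws:
        # add 1 to entry (i, j) whenever items i and j share a label
        psm += labels[:, None] == labels[None, :]
    return psm / M

def adjusted_rand(a, b):
    """Adjusted Rand index between two clusterings (Hubert & Arabie, 1985)."""
    a, b = np.asarray(a), np.asarray(b)
    # contingency table of cluster co-occurrences
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    cont = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(cont, (ia, ib), 1)
    comb2 = lambda x: x * (x - 1) / 2.0  # "choose 2" on counts
    sum_ij = comb2(cont).sum()
    sum_a = comb2(cont.sum(axis=1)).sum()
    sum_b = comb2(cont.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(a.size)  # chance-level agreement
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

Correcting the Rand index for chance agreement is what makes the adjusted version preferable: identical clusterings score 1 regardless of how the labels are numbered, while unrelated clusterings score near 0 rather than some positive baseline.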
dc.identifier.uri: http://hdl.handle.net/2003/27292
dc.identifier.uri: http://dx.doi.org/10.17877/DE290R-14740
dc.identifier.urn: urn:nbn:de:hbz:290-2003/27292-3
dc.language.iso: en
dc.subject: Finite mixture
dc.subject: Dirichlet process
dc.subject: Bayesian statistics
dc.subject: Cluster analysis
dc.subject: MCMC
dc.subject: Adjusted Rand index
dc.subject: Goalkeeper's performance
dc.subject: Gene expression data
dc.subject.ddc: 310
dc.title: Bayesian mixtures for cluster analysis and flexible modeling of distributions
dc.type: Text
dc.type.publicationtype: doctoralThesis
dcterms.accessRights: open access

Files

Original bundle
Name: Diss_Arno_Fritsch.pdf
Size: 1 MB
Format: Adobe Portable Document Format
Description: DNB

Name: Abstract_Diss_Fritsch.pdf
Size: 50.94 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.85 KB
Format: Item-specific license agreed upon to submission