Finite Bayesian mixture models with applications in spatial cluster analysis and bioinformatics


Date

2015

Abstract

In many statistical applications, one encounters populations that form homogeneous subgroups with regard to one or several characteristics. Across the subgroups, however, heterogeneity may often be found. Mixture distributions are a natural means to model data from such applications. This PhD thesis is based on two projects that focus on such applications. In the first project, spatial nanoscale clusters formed by Ras proteins in the cell membrane are investigated. Such clusters play a crucial role in intracellular communication and are thus of interest in cancer research. In this case, the subgroups are clustered and non-clustered proteins. In the second project, epigenomic data obtained from sequencing experiments are integrated with another genomic or epigenomic input, aiming, e.g., to detect genes that contribute to the development of cancer. Here, the subgroups are defined by a) genes presenting congruent (epi)genomic aberrations in both considered variables, b) genes presenting incongruent aberrations, and c) genes lacking aberrations in at least one of the variables. Employing a Bayesian framework, objects are classified in both projects by fitting finite univariate mixture distributions with a small fixed number of components to the values of a score that summarizes the information relevant to the research question. Such mixture distributions have favorable characteristics in terms of interpretation and show little sensitivity to label switching in Markov chain Monte Carlo analyses. Mixtures of gamma distributions are considered for Ras proteins, while mixtures of normal and exponential or gamma distributions are the focus of the bioinformatic analysis. In the latter, classification is the primary goal, while in the Ras protein application, estimating key parameters of the spatial clustering is of more interest. The results of both projects are presented in this thesis.
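As a rough illustration of how a fitted finite mixture can classify objects by their score, the following minimal sketch computes component membership probabilities with Bayes' rule for a two-component gamma mixture. It is not the thesis implementation: the function names and all parameter values (WEIGHTS, SHAPES, RATES) are hypothetical, and the parameters are treated as fixed, whereas a Bayesian analysis would typically average these probabilities over Markov chain Monte Carlo draws.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a Gamma(shape, rate) distribution at x > 0."""
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1.0) * math.log(x) - rate * x)
    return math.exp(log_pdf)


def membership_probs(x, weights, shapes, rates):
    """Probability of each mixture component given the score x (Bayes' rule)."""
    joint = [w * gamma_pdf(x, a, b) for w, a, b in zip(weights, shapes, rates)]
    total = sum(joint)
    return [j / total for j in joint]


def classify(x, weights, shapes, rates):
    """Index of the component with the highest membership probability."""
    probs = membership_probs(x, weights, shapes, rates)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical parameter values, chosen only for illustration:
# component 0 has mean 2/4 = 0.5, component 1 has mean 9/3 = 3.
WEIGHTS = [0.4, 0.6]
SHAPES = [2.0, 9.0]
RATES = [4.0, 3.0]

if __name__ == "__main__":
    for score in (0.3, 1.5, 3.5):
        probs = membership_probs(score, WEIGHTS, SHAPES, RATES)
        print(score, classify(score, WEIGHTS, SHAPES, RATES), probs)
```

With only a few components of fixed, interpretable meaning, each object is assigned to the subgroup whose weighted density is largest at its score.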
For both applications, the methods have been implemented in software, and their performance is compared with that of competing approaches on both experimental and simulated data. To ensure a realistic simulation of Ras protein patterns, a new cluster point process model called the double Matérn cluster process is developed and described in this thesis.
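For orientation, the sketch below simulates the standard Matérn cluster process on the unit square. It does not implement the double Matérn cluster process developed in the thesis, whose definition is not given in this abstract. In the standard process, parent points follow a homogeneous Poisson process, and each parent receives a Poisson number of offspring placed uniformly in a disc around it; only the offspring form the pattern. The function names and the parameters kappa (parent intensity), mu (mean offspring per parent), and radius are assumptions made for this example.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's method (suitable for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def matern_cluster(kappa, mu, radius, seed=None):
    """Simulate a standard Matern cluster process on the unit square.

    Parents are generated on a window enlarged by `radius` so that clusters
    centred just outside the unit square can still contribute points.
    """
    rng = random.Random(seed)
    lo, hi = -radius, 1.0 + radius
    n_parents = poisson(rng, kappa * (hi - lo) ** 2)
    points = []
    for _ in range(n_parents):
        px, py = rng.uniform(lo, hi), rng.uniform(lo, hi)
        for _ in range(poisson(rng, mu)):
            # The square root makes offspring uniform over the disc area.
            r = radius * math.sqrt(rng.random())
            theta = 2.0 * math.pi * rng.random()
            x, y = px + r * math.cos(theta), py + r * math.sin(theta)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
                points.append((x, y))
    return points


if __name__ == "__main__":
    pattern = matern_cluster(kappa=20, mu=5, radius=0.05, seed=1)
    print(len(pattern), "points simulated")
```

Simulated patterns of this kind have known cluster parameters, which is what makes them useful for checking how well an estimation method recovers those parameters.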

Keywords

Bayesian statistics, Finite mixture model, Spatial cluster analysis, Matérn cluster process, Nearest neighbor distances, Gene transcription, ChIP-seq data, Integrative analysis
