Lehrstuhl Mathematische Statistik und biometrische Anwendungen
Recent Submissions
Item: Compressing data for generalized linear regression (2022). Omlor, Simon; Munteanu, Alexander; Ickstadt, Katja.

In this thesis we work on algorithmic data and dimension reduction techniques to solve scalability issues and to allow better analysis of massive data. Our algorithms use the sketch-and-solve paradigm as well as some initialization tricks. We analyze a tradeoff between accuracy, running time and storage, and we show some lower bounds on the best possible data reduction factors. While we focus mostly on generalized linear regression, specifically logistic and p-probit regression, we also deal with two-layer Rectified Linear Unit (ReLU) networks with logistic loss, which can be seen as an extension of logistic regression, i.e. logistic regression on the neural tangent kernel. We present coresets via sampling, sketches via random projections and several algorithmic techniques, and prove that our algorithms are guaranteed to work with high probability. First, we consider the problem of logistic regression, where the aim is to find the parameter beta maximizing the likelihood. We construct a sketch in a single pass over a turnstile data stream; depending on some parameters, we can tweak the size, running time and approximation guarantee of the sketch. We also show that our sketch works for other target functions as well. Second, we construct an epsilon-coreset for p-probit regression, a generalized version of probit regression. To this end, we first compute the QR decomposition of a sketched version of our dataset in a first pass. We then use the matrix R to approximate the l_p-leverage scores of our data points, which yield the sampling probabilities used to construct the coreset. Analyzing the negative log-likelihood of the p-generalized normal distribution proves that this results in an epsilon-coreset. Finally, we look at two-layer ReLU networks with logistic loss.
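The two-pass sketch-then-sample construction for the coreset can be illustrated in code. This is a minimal sketch of the general idea only, shown for the p = 2 case with a plain weighted least-squares solve standing in for the probit likelihood; the Gaussian sketch of 20*d rows, the data, and all constants are invented for illustration:

```python
import numpy as np

def leverage_score_coreset(X, y, size, rng=np.random.default_rng(0)):
    """Sample a weighted coreset using approximate l_2 leverage scores.

    Two passes, as in the construction described above (shown for p = 2):
    pass 1 QR-decomposes a random projection S @ X and keeps only R;
    pass 2 scores every row of X via R and samples proportionally.
    """
    n, d = X.shape
    S = rng.normal(size=(20 * d, n)) / np.sqrt(20 * d)        # Gaussian sketch (size invented)
    _, R = np.linalg.qr(S @ X)                                # pass 1: R of the sketched data
    scores = np.sum(np.linalg.solve(R.T, X.T) ** 2, axis=0)   # ~ ||x_i R^{-1}||^2 per row
    probs = scores / scores.sum()
    idx = rng.choice(n, size=size, replace=True, p=probs)     # pass 2: importance sampling
    weights = 1.0 / (size * probs[idx])                       # reweight for unbiasedness
    return X[idx], y[idx], weights

# Toy data: the weighted solve on the coreset recovers the full-data solution.
X = np.random.default_rng(1).normal(size=(5000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])
Xc, yc, w = leverage_score_coreset(X, y, size=400)
beta = np.linalg.lstsq(Xc * np.sqrt(w)[:, None], yc * np.sqrt(w), rcond=None)[0]
```

The reweighting by inverse sampling probability is what makes the subsample an unbiased surrogate for the full objective.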
Here we show that using a coupled initialization we can reduce the width of the networks needed for a good approximation from gamma^(-8) (Ji and Telgarsky, 2020) down to gamma^(-2), where gamma is the so-called separation margin. We further give an example proving that a width of gamma^(-1) is necessary to get less than constant error.

Item: Spatial and spatio-temporal regression modelling with conditional autoregressive random effects for epidemiological and spatially referenced data (2022). Djeudeu-Deudjui, Dany-Armand; Ickstadt, Katja; Doebler, Philipp.

Regression models are suitable for analysing the association between health outcomes and environmental exposures. However, in urban health studies where spatial and temporal changes are of importance, spatial and spatio-temporal variations are usually neglected. This thesis develops and applies regression methods that incorporate latent random-effect terms with Conditional Autoregressive (CAR) structures into classical regression models, accounting for spatial effects in cross-sectional analyses and spatio-temporal effects in longitudinal analyses. The thesis is divided into two main parts. Firstly, methods to analyse data for which all variables are given on an areal level are considered. The longitudinal Heinz Nixdorf Recall Study is used throughout this thesis for application. The association between the risk of depression and greenness at the district level is analysed. A spatial Poisson model with a latent CAR-structured random effect is applied for selected time points. A sophisticated spatio-temporal extension of the Poisson model then yields a negative association between greenness and depression. The findings also suggest strong temporal autocorrelation and weak spatial effects. Even if weak spatial effects suggest neglecting them, as in this thesis, spatial and spatio-temporal random effects should be taken into account to provide reliable inference in urban health studies.
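As a minimal numeric illustration of what a latent CAR structure looks like (not the model fitted in the thesis; the 4-district toy map and the values of rho and tau are invented):

```python
import numpy as np

# Hypothetical map of 4 districts; W is the binary neighbourhood (adjacency) matrix.
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))     # number of neighbours of each district
rho, tau = 0.9, 2.0            # spatial-dependence and precision parameters (illustrative)

# Proper-CAR precision matrix of the latent random effect phi ~ N(0, Q^{-1}).
# Conditionally, phi_i given its neighbours is normal with mean
# rho * (average of neighbouring phi_j) and variance 1 / (tau * d_i).
Q = tau * (D - rho * W)
Sigma = np.linalg.inv(Q)       # implied joint covariance of the district effects
```

With |rho| < 1 the precision matrix is positive definite, so the prior is proper; rho = 0 would make the district effects independent.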
Secondly, to avoid ecological and atomistic fallacies due to data aggregation and disaggregation, all data should be used at the finest spatial level given. Multilevel Conditional Autoregressive (CAR) models help to use all variables at their initial spatial resolution simultaneously and to explain the spatial effect in epidemiological studies. This is especially important where subjects are nested within geographical units. This second part of the thesis has two goals. Essentially, it further develops multilevel models for longitudinal data by adding random effects with CAR structures that change over time. These new models are named MLM tCARs. Comparing the MLM tCARs to the classical multilevel growth model in simulation studies, we observe a better performance of the MLM tCARs in retrieving the true regression coefficients, along with better fits. The models are comparatively applied to the analysis of the association between greenness and depressive symptoms at the individual level in the longitudinal Heinz Nixdorf Recall Study. The results again show a negative association between greenness and depression and a decreasing linear individual time trend for all models. We once more observe very weak spatial variation and moderate temporal autocorrelation. Besides, the thesis provides comprehensive decision trees for analysing data in epidemiological studies in which variables have a spatial background.

Item: River-mediated dynamic environmental factors and perinatal data analysis (2021). Rathjens, Jonathan; Ickstadt, Katja; Groll, Andreas.

Perfluorooctanoic acid (PFOA) and related per- and polyfluoroalkyl substances, a group of man-made persistent organic chemicals employed in many products, are widely distributed in the environment. Adverse health effects may occur even at low exposure levels. A large-scale PFOA contamination of drinking water resources, especially of the river Ruhr, was detected in North Rhine-Westphalia, Germany, in summer 2006.
Subsequent measurements are available from the water supply stations along the river and elsewhere. The first state-wide environmental-epidemiological study on the general population analyses these secondary data together with routinely collected perinatal registry data to estimate possible developmental-toxic effects of PFOA exposure, especially regarding birth weight (BW). Drinking water data are temporally and spatially modelled to assign estimated exposure values to the residents. A generalised linear model with an inverse link deals with the steeply decreasing temporal data pattern at the mainly affected stations. Confirmed by a river-wide joint model, the river's segments between the main junctions are the most important factor explaining the spatial structure, besides local effects. Deductions from stations to areal units are made possible via estimated supply proportions. Regression of perinatal data with BW as response usually includes gestational age (GA) as an important covariate in polynomial form. However, bivariate modelling of BW and GA is recommended to distinguish effects on each, on both, and between them. Bayesian distributional copula regression is applied, where the marginals for BW and GA as well as the copula representing their dependence structure are fitted independently and all parameters are estimated conditional on covariates. While a Gaussian distribution is suitable for BW, the skewed GA data are better modelled by the three-parameter Dagum distribution. The Clayton copula performs better than the Gumbel and the symmetric Gaussian copula, although the lower tail dependence is weak. A non-linear trend of BW on GA is detected by the standard polynomial model. Linear effects of biometric and obstetric covariates and of maternal smoking on the BW mean are similar in both models, while the distributional copula regression also reveals effects on all other parameters.
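To illustrate why the Clayton family is a natural candidate here, a minimal sampler using the standard conditional-inversion method; theta = 2 is an arbitrary illustrative value, not an estimate from the study:

```python
import numpy as np

def sample_clayton(n, theta, rng=None):
    """Draw n pairs (u, v) from a Clayton copula (theta > 0).

    The Clayton copula has lower tail dependence lambda_L = 2**(-1/theta):
    jointly small values (e.g. low birth weight together with short
    gestational age) are more likely than under a Gaussian copula.
    """
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    # Invert the conditional distribution C(v | u) in closed form.
    v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)
    return u, v

u, v = sample_clayton(5000, theta=2.0)
# Kendall's tau for the Clayton copula is theta / (theta + 2), i.e. 0.5 here.
```

Both margins stay uniform on (0, 1); all the dependence, including the asymmetric lower tail, lives in the copula, which is exactly the separation the distributional copula regression exploits.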
The local PFOA exposure is spatio-temporally assigned to the perinatal data of the most affected town of Arnsberg and thus included in the regression models. No significant effect is found, and a relatively large amount of noise remains. Looking ahead, and for larger regions, this can be addressed by exposure modelling on the areal level using dependence information, by allowing further asymmetry in the bivariate distribution of BW and GA, and by respecting geographical structures in the birth data.

Item: Spatial and temporal analyses of perfluorooctanoic acid in drinking water for external exposure assessment in the Ruhr metropolitan area, Germany (2020-12-04). Rathjens, Jonathan; Becker, Eva; Kolbe, Arthur; Ickstadt, Katja; Hölzer, Jürgen.

Perfluorooctanoic acid (PFOA) and related chemicals among the per- and polyfluoroalkyl substances are widely distributed in the environment. Adverse health effects may occur even at low exposure levels. A large-scale contamination of drinking water resources, especially the rivers Möhne and Ruhr, was detected in North Rhine-Westphalia, Germany, in summer 2006. As a result, concentration data are available from the water supply stations along these rivers and partly from the water networks of the areas supplied by them. Measurements started after the contamination's discovery. In addition, there are sparse data from stations in other regions. Further information on the supply structure (river system, station-to-area relations) and expert statements on contamination risks are available. Within the first state-wide environmental-epidemiological study on the general population, these data are temporally and spatially modelled to assign estimated exposure values to the resident population. A generalized linear model with an inverse link offers consistent temporal approaches to model each station's PFOA data along the river Ruhr and copes with a steeply decreasing temporal data pattern at the mainly affected locations.
The river's segments between the main junctions are the most important factor explaining the spatial structure, besides local effects. Deductions from supply stations to areas, and therefore to the residents' risk, are possible via estimated supply proportions. The resulting potential correlation structure of the supply areas is dominated by the common water supply from the Ruhr. Other areas are often isolated and therefore need to be modelled separately. The contamination is homogeneous within most of the areas.

Item: Streaming statistical models via Merge & Reduce (2020-06-12). Geppert, Leo N.; Ickstadt, Katja; Munteanu, Alexander; Sohler, Christian.

Merge & Reduce is a general algorithmic scheme in the theory of data structures. Its main purpose is to transform static data structures, which support only queries, into dynamic data structures, which allow insertions of new elements, with as little overhead as possible. This can be used to turn classic offline algorithms for summarizing and analyzing data into streaming algorithms. We transfer these ideas to the setting of statistical data analysis in streaming environments. Our approach is conceptually different from previous settings where Merge & Reduce has been employed: instead of summarizing the data, we combine the Merge & Reduce framework directly with statistical models. This enables performing computationally demanding data analysis tasks on massive data sets. The computations are divided into small tractable batches whose size is independent of the total number of observations n. The results are combined in a structured way at the cost of a bounded O(log n) factor in their memory requirements. It is only necessary, though nontrivial, to choose an appropriate statistical model and to design merge and reduce operations on a case-by-case basis for the specific type of model.
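For ordinary least squares, where a batch can be compressed losslessly into its sufficient statistics, the scheme can be sketched as follows. A toy illustration, not the paper's implementation; for the statistical models in the paper the merge and reduce operations are model-specific and generally approximate:

```python
import numpy as np

def reduce_batch(X, y):
    """'Reduce': compress a batch to the OLS sufficient statistics (X'X, X'y)."""
    return X.T @ X, X.T @ y

def merge(s1, s2):
    """'Merge': sufficient statistics of the union of two batches."""
    return s1[0] + s2[0], s1[1] + s2[1]

def stream_ols(batches):
    """Merge & Reduce over a stream: keep at most one summary per level,
    like a binary counter, so only O(log n) summaries are held at a time."""
    levels = {}
    for X, y in batches:
        s, lvl = reduce_batch(X, y), 0
        while lvl in levels:                 # carry: merge equal-level summaries
            s = merge(levels.pop(lvl), s)
            lvl += 1
        levels[lvl] = s
    XtX = sum(s[0] for s in levels.values())
    Xty = sum(s[1] for s in levels.values())
    return np.linalg.solve(XtX, Xty)

# Invented stream of 64 batches of 64 observations each.
rng = np.random.default_rng(2)
X = rng.normal(size=(4096, 3))
y = X @ np.array([2.0, -1.0, 0.5])
beta = stream_ols((X[i:i + 64], y[i:i + 64]) for i in range(0, 4096, 64))
```

Because (X'X, X'y) is an exact summary, this toy version recovers the full-data estimate; the paper's point is that the same tree-shaped combination scheme can be applied to fitted statistical models whose merges are only approximately lossless.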
We illustrate our Merge & Reduce schemes on simulated and real-world data employing (Bayesian) linear regression models, Gaussian mixture models and generalized linear models.

Item: Statistical modeling of protein-protein interaction networks (2018). Fermin Ruiz, Yessica Yulieth; Ickstadt, Katja; Rahnenführer, Jörg.

Understanding how proteins bind to each other in a cell is key in molecular biology to determining how experts can repair anomalies in cells. The major challenge in the prediction of protein-protein interactions is the cell-to-cell heterogeneity within a sample, due to genetic and epigenetic variabilities. Most studies of protein-protein interaction carry out their analysis without awareness of the underlying heterogeneity. This situation can lead to the identification of invalid interactions. As part of the solution to this problem, we propose in this thesis two lines of analysis: one for snapshot data, where different samples of ten proteins were taken by toponome imaging, and another for the analysis of time-correlated data, which guarantees a better approximation for the prediction of protein-protein interactions. The latter represents an advance in the analysis of data with high temporal resolution, such as that obtained through the quantification technique known as multicolor live cell imaging. The thesis presented here is divided into two parts. The first part, called "Revealing relationships among proteins involved in assembling focal adhesions", consists of the development of a methodology based on frequentist methods, such as machine learning and meta-analysis, for the prediction of protein-protein interactions on six different toponome imaging datasets. This methodology presents an advance in the analysis of highly heterogeneous snapshot data. Our aim here focused on the formulation of a single model capable of identifying the relationships among different samples by summarizing their common results while accounting for their random variation.
This methodology leads to a set of common models over the six datasets, hierarchized by their predictive power, from which the researcher can choose a model according to its prediction accuracy or according to its parsimony. This part is developed in Chapters 1-7 and was published in Harizanova et al. (2016). The second part is called "Modelling of temporal networks with a nonparametric mixture of dynamic Bayesian networks". It advances a Bayesian methodology for temporal networks that successfully identifies subpopulations in heterogeneous cell populations while at the same time reconstructing the protein interaction network associated with each subpopulation. This method extends nonparametric Bayesian networks (NPBNs) (Ickstadt et al., 2011) to the analysis of time-correlated data by using Gaussian dynamic Bayesian networks (GDBNs). We evaluate our model under variation of specific parameters such as the underlying number of subpopulations, network density and intra-subpopulation variability, among others. A comparative analysis with existing clustering methods such as NPBNs and hierarchical agglomerative clustering (Hclust) shows that including temporal correlations in the classification of multivariate time series is relevant for improving the classification. The classic Hclust method using dynamic time warping distances (T-Hclust) was found to be similar in precision to the Bayesian method proposed here. On the other hand, a comparative analysis with the GDBNs shows that GDBNs cannot adequately reconstruct temporal networks in heterogeneous cell populations through a single model, while our method, as well as the joint use of the T-Hclust classifications with the GDBNs (T-Hclust+), is highly adequate for predicting temporal networks in a mixture.
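The T-Hclust idea, computing dynamic time warping (DTW) distances between series and clustering them agglomeratively, can be illustrated with a toy distance computation (the series are invented; any standard agglomerative routine can consume the resulting distance matrix):

```python
import numpy as np

def dtw(a, b):
    """Classic O(n*m) dynamic time warping distance with absolute-value cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two invented groups of series: phase-shifted sines vs. slow linear trends.
t = np.linspace(0, 2 * np.pi, 40)
series = [np.sin(t), np.sin(t + 0.05), np.sin(t + 0.1),
          0.1 * t, 0.1 * t + 0.02, 0.1 * t - 0.02]
dist = np.array([[dtw(a, b) for b in series] for a in series])
# Within-group DTW distances are small (warping absorbs the phase shifts),
# between-group distances are large, so hierarchical clustering on `dist`
# recovers the two groups.
```

Unlike the Euclidean distance, DTW aligns series before comparing them, which is why it separates shape differences from mere time shifts in multivariate time-series classification.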
This part is developed in Chapters 8-16.

Item: Subgroup analyses and investigations of treatment effect heterogeneity in clinical dose-finding trials (2019). Thomas, Marius; Ickstadt, Katja; Rahnenführer, Jörg.

Identifying subgroups which respond differently to a treatment is an important part of drug development. Exploratory subgroup analyses, which aim to identify subgroups of patients with differential treatment effects, are thus common in many randomized clinical trials. Statistically, these analyses are known to be challenging: the number of possible subgroups is often large, which leads to multiplicity issues. Such subgroup analyses are often also performed for early-phase clinical trials, where the small sample size is an additional challenge. In recent years several statistical approaches to these problems have been proposed, employing, for example, tree-based recursive partitioning algorithms, which are well suited for handling interactions; penalized regression methods, which can prevent overfitting when explicitly modeling a large number of covariate effects; or Bayesian approaches, which allow incorporating uncertainty and can be used to make optimal decisions with regard to subgroups. The available literature, however, focuses on two-arm clinical trials, where patients are randomized to the experimental treatment or a control (e.g. current standard of care or placebo). A particular focus of this cumulative thesis is the development of statistical methodology for the identification of subgroups in dose-finding trials, in which patients are administered several doses of a new drug. Dose-finding trials play a key role in the drug development process, since they provide valuable information about the effect of the dose on efficacy and safety. For identifying subgroups in this setting we consider the treatment effect to be a function of the dose and then try to identify relevant covariate effects on this treatment effect curve.
The identified covariates can then be used to define subgroups with higher treatment effects, but also subgroups which require a different dose of the treatment. We propose two approaches for this purpose: firstly, a tree-based recursive partitioning algorithm, which detects covariate effects on the parameters of dose-response models and builds a tree of subgroups with different dose-response curves; secondly, a Bayesian hierarchical model, which makes use of shrinkage priors to prevent overfitting in the considered settings with low sample sizes and a large number of candidate covariates. In addition to approaches for subgroup identification, we also consider the problem of testing a prespecified subgroup in addition to the full population in dose-finding trials. In a dose-finding setting, contrast tests are often used to test for a significant dose-response signal while taking the underlying dose-response relationship into account. Optimal contrast tests can be derived when the underlying dose-response model is known; however, there is often uncertainty about this model. Testing procedures which allow for uncertainty with regard to the underlying model by performing multiple contrast tests are therefore popular in such settings. As part of this thesis we extend such approaches to settings with multiple populations, in particular the situation in which a prespecified subgroup is considered in addition to the full population. A last part of this cumulative thesis focuses on treatment effect estimation in identified subgroups. Naive treatment effect estimates in subgroups often suffer from selection bias, especially when the number of considered subgroups is large. Several approaches to obtain adjusted treatment effect estimates in such situations have been proposed, using resampling, model averaging or penalized regression.
We compare these approaches in an extensive simulation study covering a large range of scenarios in which such analyses are performed.

Item: Bayesian and frequentist regression approaches for very large data sets (2018). Geppert, Leo Nikolaus; Ickstadt, Katja; Groll, Andreas.

This thesis is concerned with the analysis of frequentist and Bayesian regression models for data sets with a very large number of observations. Such large data sets pose a challenge when conducting regression analysis because of the memory required (mainly for frequentist regression models) and the running time of the analysis (mainly for Bayesian regression models). I present two different approaches that can be employed in this setting. The first approach is based on random projections and reduces the number of observations to a manageable level as a first step before the regression analysis. The reduced number of observations depends on the number of variables in the data set and the desired goodness of the approximation. It is, however, independent of the number of observations in the original data set, making the approach especially useful for very large data sets. Theoretical guarantees for Bayesian linear regression are presented, which extend known guarantees for the frequentist case. The fundamental theorem covers Bayesian linear regression with arbitrary normal distributions or non-informative uniform distributions as prior distributions. I evaluate how close the posterior distributions of the original model and the reduced data set are for this theoretically covered case as well as for extensions towards hierarchical models and models using q-generalised normal distributions as priors. The second approach transfers the Merge & Reduce principle from data structures to regression models. In computer science, Merge & Reduce is employed to enable the use of static data structures in a streaming setting.
Here, I present three possibilities for employing Merge & Reduce directly on regression models. This enables sequential or parallel analysis of subsets of the data set. The partial results are then combined in a way that recovers the regression model on the full data set well. This approach is suitable for a wide range of regression models. I evaluate the performance on simulated and real-world data sets using linear and Poisson regression models. Both approaches are able to recover regression models on the original data set well. They thus offer scalable versions of frequentist or Bayesian regression analysis for linear regression as well as extensions to generalised linear models, hierarchical models, and q-generalised normal distributions as prior distributions. Application on data streams or in distributed settings is also possible. Both approaches can be combined with multiple algorithms for frequentist or Bayesian regression analysis.

Item: Statistische Analyse und Modellierung von Clusterphänomenen bei Signalproteinen in der Plasmamembran (2016). Siebert, Sabrina; Ickstadt, Katja; Rahnenführer, Jörg.

This thesis deals with clustering phenomena of signalling proteins. These proteins are localized in the plasma membrane and are responsible for the communication and the exchange of substances of the cell. The data were collected by fluorescence microscopy at the Max-Planck-Institut für molekulare Physiologie in Dortmund in the group of Dr. Peter J. Verveer. Clustering phenomena can be examined from different perspectives and with different questions in mind; this thesis carries out a temporal, a spatial and a spatio-temporal analysis of corresponding data. In the temporal analysis, protein time series were examined. A protein time series results from measuring the light intensity of a spot, i.e. of a protein cluster, over time.
The goal here is the segmentation of exactly this protein time series. A Bayesian hierarchical model was used for the segmentation. It delivered sensible results, although the number of segments always had to be regarded as fixed. To remove this restriction, a reversible jump step was incorporated into the model. With this extension, sensible results could be achieved with greater flexibility for the user. In the spatial analysis, a pixel image from a measurement of a living cell obtained by TIRF microscopy was examined. The goal was to investigate the spatial cluster structure, restricted to the fraction of proteins in clusters. To this end, different methods were first examined on a simulated region. Based on these results, an application scheme for the efficient combination of these methods could be set up, which was finally applied to an experimental data set as well as to a dual-colour simulation. It turned out that the scheme simplified the parameter choice for some methods and that sensible results could be computed. Finally, in the spatio-temporal analysis, protein tracks were examined. A protein track records the path of a protein in the cell membrane over time. This measurement was carried out simultaneously for two protein species, so again a dual-colour setting arises. The goal was to determine the association between two protein tracks of different protein species. In order to compute this association as well as possible, it was first discussed which properties represent a strong association. These properties were then combined into an association measure. With this measure, a simulated example as well as experimental data were analysed.
It turned out that dependence structures were reflected well by the measure and that, using cutoffs, a selection of corresponding protein tracks was possible. Based on this selection, further interesting regions as well as clusters could be identified.

Item: Bayesian prediction for stochastic process models in reliability (2016). Hermann, Simone; Müller, Christine; Ickstadt, Katja.

Item: Unimodal spline regression and its use in various applications with single or multiple modes (2016). Köllmann, Claudia; Ickstadt, Katja; Fried, Roland.

Research in the field of non-parametric shape-constrained regression has been extensive, and there is a need for such methods in various application areas, since shape constraints can reflect prior knowledge about the underlying relationship. This thesis develops semi-parametric spline regression approaches to unimodal regression. However, the prior knowledge in different applications is of increasing complexity, and data shapes may vary from few to plenty of modes and from piecewise unimodal to accumulations of identically or diversely shaped unimodal functions. Thus, we also go beyond unimodal regression in this thesis and propose to capture multimodality by employing piecewise unimodal regression or deconvolution models based on unimodal peak shapes. More explicitly, this thesis proposes unimodal spline regression methods that make use of Bernstein-Schoenberg splines and their shape preservation property. To achieve unimodal and smooth solutions we use penalized splines and extend the penalized spline approach towards penalizing against general parametric functions, instead of using just difference penalties. For tuning parameter selection under a unimodality constraint, a restricted maximum likelihood and an alternative Bayesian approach for unimodal regression are developed. We compare the proposed methodologies to other common approaches in a simulation study and apply them to a dose-response data set.
All results suggest that the unimodality constraint, or the combination of unimodality and a penalty, can substantially improve estimation of the functional relationship. A common feature of the approaches to multimodal regression is that the response variable is modelled using several unimodal spline regressions. This thesis examines mixture models of unimodal regressions, piecewise unimodal regression and deconvolution models with identical or diverse unimodal peak shapes. The usefulness of these extensions of unimodal regression is demonstrated by applying them to data sets from three different application areas: marine biology, astroparticle physics and breath gas analysis. The proposed methodologies are implemented in the statistical software environment R, and the implementations and their usage are explained in this thesis as well.

Item: Entmischung und Inferenz biomolekularer Netzwerke (2015). Wieczorek, Jakob Jan; Ickstadt, Katja; Rahnenführer, Jörg.

In this thesis, new statistical concepts for the detection and analysis of interaction patterns are presented. They are successfully applied to simulated data from the Erk signalling network as well as to experimental data from the mating pathway of yeast. Methodologically, the thesis can be divided into two main topics. The main focus is the method of nonparametric Bayesian networks, developed from Bayesian networks. As far as is known, it is the only network inference method able to detect subgroups within the data and to partition the observations. Furthermore, this thesis succeeds in adapting the Pitman-Yor process, in addition to the Dirichlet process, as a prior distribution of the cluster structure. Both variants of the method are examined with respect to their performance in unmixing observations.
The second focus of the thesis is the development of a method for estimating protein concentrations, the complexes estimator (Komplexeschätzer). It makes it possible to determine, from fluorescence correlation spectroscopy (FCS) measurements, not only fixed groups of proteins as before, but specifically individual proteins and arbitrary user-selected groups of proteins. This constitutes a clear improvement over the current standard and decisively increases the information gained from FCS measurements. With the help of this method, a feedback loop in the yeast mating pathway that was previously unknown in biology could be found. Within the thesis, a concept for clustering directed acyclic graphs (DAGs) is also developed. In contrast to the methods proposed in the literature, no special requirements are imposed on the data; only DAGs of a fixed time point need to be used. Concretely, a notion of distance for DAGs is developed which fulfils the properties of a semimetric. With it, a meaningful similarity matrix can be set up, which can be used for clustering.

Item: Finite Bayesian mixture models with applications in spatial cluster analysis and bioinformatics (2015). Schäfer, Martin; Ickstadt, Katja; Rahnenführer, Jörg.

In many statistical applications, one encounters populations that form homogeneous subgroups regarding one or several characteristics, while across the subgroups heterogeneity may often be found. Mixture distributions are a natural means to model data from such applications. This PhD thesis is based on two projects that focus on such applications. In the first project, spatial nanoscale clusters formed by Ras proteins in the cell membrane are investigated. Such clusters play a crucial role in intracellular communication and are thus of interest in cancer research. In this case, the subgroups are clustered and non-clustered proteins.
In the second project, epigenomic data obtained from sequencing experiments are integrated with another genomic or epigenomic input, aiming, e.g., to detect genes that contribute to the development of cancer. Here, the subgroups are defined by a) genes presenting congruent (epi)genomic aberrations in both considered variables, b) genes presenting incongruent aberrations, and c) genes lacking aberrations in at least one of the variables. Employing a Bayesian framework, objects are classified in both projects by fitting finite univariate mixture distributions with a small fixed number of components to values of a score summarizing relevant information about the research question. Such mixture distributions have favorable characteristics in terms of interpretation and show little sensitivity to label switching in Markov chain Monte Carlo analyses. Mixtures of gamma distributions are considered for the Ras proteins, while mixtures of normal and exponential or gamma distributions are the focus of the bioinformatic analysis. In the latter, classification is the primary goal, while in the Ras protein application, estimating key parameters of the spatial clustering is of more interest. The results of both projects are presented in this thesis. For both applications, the methods have been implemented in software, and their performance is compared with competing approaches on experimental as well as simulated data. To warrant an appropriate simulation of Ras protein patterns, a new cluster point process model called the double Matérn cluster process is developed and described in this thesis.

Item: Integrativer Ansatz zur Identifizierung neuer, prognostisch relevanter Metagene mittels Clusteranalyse (2014). Freis, Evgenia; Ickstadt, Katja; Rahnenführer, Jörg.

In Germany, breast cancer is the most common cancer and a leading cause of cancer deaths in women.
To gain insight into the processes related to the course of the disease, human genetic data can be used to identify associations between gene expression and prognosis. Clinical studies and numerous microarray experiments constantly generate enormous data volumes. Reducing their dimensionality from thousands of genes to a smaller number is the aim of so-called metagenes, which aggregate the expression data of groups of genes with similar expression patterns and may be used for investigating complex diseases like breast cancer. Here, a cluster-analytic approach for the identification of potentially relevant metagenes is introduced. In the first step of the approach, gene expression patterns over time of receptor tyrosine kinase ErbB2 MCF7 breast cancer cell lines were used to obtain promising sets of genes for a metagene calculation. Since overexpression of the oncogenic variant of ErbB2 is associated with worse prognosis in breast cancer, three independent batches of MCF7/NeuT cells were exposed to doxycycline for periods of 0, 6, 12 and 24 hours as well as for 3 and 14 days in independent experiments. With the cluster-analytic approaches DIB-C (difference-based clustering algorithm) and STEM (short time-series expression miner) as well as with finite and infinite mixture models, gene clusters with similar expression patterns were identified. Two non-model-based algorithms, k-means and PFP (penalized frame potential), as well as the model-based procedure DIRECT were applied for method comparison. Potentially relevant gene groups were selected by promoter and Gene Ontology (GO) analysis. The applied methods were verified on another short time-series data set.
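One common way to turn a gene cluster into a metagene is the per-patient mean over the cluster's genes. The abstract does not state the exact aggregation rule used in the thesis, so the sketch below, including its data layout, is illustrative:

```python
def metagene_scores(expression, gene_cluster):
    """Aggregate a gene cluster into one metagene value per patient.

    `expression` maps gene name -> list of per-patient expression
    values; the metagene is the per-patient mean over the cluster's
    genes (one common aggregation; the thesis's rule may differ).
    """
    genes = [g for g in gene_cluster if g in expression]
    if not genes:
        raise ValueError("no cluster gene found in the expression data")
    n_patients = len(expression[genes[0]])
    return [
        sum(expression[g][p] for g in genes) / len(genes)
        for p in range(n_patients)
    ]
```

The resulting per-patient metagene values can then enter a survival model as a single covariate in place of thousands of individual genes.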
In the second step of the approach, these gene clusters were used to calculate metagenes from the gene expression data of 766 breast cancer patients from three breast cancer studies, and Cox models were applied to determine the effect of the detected metagenes on prognosis. Using this strategy, new metagenes associated with patients' metastasis-free survival were identified.
Item Enrichment design and sensitivity preferred classification (2014-10-06) Agueusop, Inoncent; Ickstadt, Katja; Rahnenführer, Jörg; Vonk, Richardus
Item Adaption und Vergleich evolutionärer mehrkriterieller Algorithmen mit Hilfe von Variablenwichtigkeitsmaßen (2013-07-22) Casjens, Swaantje Wiarda; Ickstadt, Katja; Ligges, Uwe
When deriving a classification model, the quality of the variable selection is an important criterion alongside predictive accuracy. For predictor variables with different costs, cost-sensitive classification is desirable, in which a compromise between high predictive accuracy and low costs can be struck. When conflicting objectives, such as predictive accuracy and costs here, are optimized simultaneously, a multi-criteria optimization problem arises for which not a single solution but a set of mutually incomparable solutions exists. Evolutionary multi-objective optimization algorithms (EMOAs) are well suited to finding these incomparable solutions since, among other things, they can search for different solutions in parallel and are independent of the underlying data distribution. EMOAs are often used to solve multi-criteria classification problems in the form of wrapper approaches, in which the EMOA individuals are encoded as binary strings (bitstrings) and each bit indicates the availability of the corresponding predictor variable. Based on these variable subsets and the given data, the wrapped classification algorithm builds a classification model with the goal of optimizing predictive accuracy.
Only after the classification model has been constructed can further objectives, such as the costs of the selected variables, be evaluated. This creates a hierarchy among the objectives to be optimized that favors predictive accuracy, so that a multi-criteria wrapper approach cannot find non-hierarchical solutions. This hierarchy of objective functions is described and investigated for the first time in this thesis. As an alternative to the multi-criteria wrapper approach, this thesis develops a non-hierarchical evolutionary multi-objective optimization algorithm with tree representation (NHEMOtree) to solve multi-criteria optimization problems with equally weighted objectives. NHEMOtree is based on an EMOA with tree representation that performs variable selection without an internal classification algorithm and builds multi-criteria-optimized binary decision trees without a hierarchy among the objective functions. Furthermore, a recombination operator for NHEMOtree based on multi-criteria variable importance measures (VIMs) and an NHEMOtree version with local cutoff optimization are developed. For the first time, this thesis compares the solutions of a multi-criteria optimization obtained by a multi-criteria wrapper approach with those of an EMOA with tree representation (NHEMOtree). The solutions are evaluated both with the well-known S-metric and with the dominance quotient developed here. The quality of the VIM-based recombination operator is examined in comparison with the standard recombination operator for EMOAs with tree representation. The multi-criteria optimization approaches and operators are applied to medical and simulated data. The results show that NHEMOtree finds better solutions than the multi-criteria wrapper approach.
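An EMOA such as NHEMOtree returns a set of mutually incomparable solutions. Filtering the non-dominated trade-offs between, say, misclassification rate and variable cost can be sketched with a generic Pareto filter (illustrative only, not the NHEMOtree selection mechanism itself):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep only the mutually incomparable (non-dominated) solutions,
    e.g. (misclassification rate, feature cost) tuples."""
    return [
        s for s in solutions
        if not any(dominates(t, s) for t in solutions if t is not s)
    ]
```

Quality indicators such as the S-metric (hypervolume) then score how well such a front covers the objective space.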
In contrast to the standard operator, using the VIM-based recombination operator leads to still better solutions of the multi-criteria optimization problem and to faster convergence of NHEMOtree.
Item Discovering genetic interactions based on natural genetic variation (2012-10-05) Ackermann, Marit; Ickstadt, Katja; Rahnenführer, Jörg
Complex traits can be attributed to the effect of two or more genes and their interaction with each other as well as with the environment. Unraveling the genetic cause of these traits, especially with regard to disease etiology, is a major goal of current research in statistical genetics. Much effort has been invested in the development of methods detecting genetic loci that are linked to variation of disease traits or intermediate molecular phenotypes such as gene expression levels. A very important aspect to be considered in the modeling of genotype-phenotype associations is that genes often interact with each other in a non-additive fashion, a phenomenon called epistasis. A special case of an epistatic interaction is an allele incompatibility, which is characterized by the inviability of all individuals carrying a certain combination of alleles at two distinct loci in the genome. The relevance and distribution of allele incompatibilities has not been investigated on a genome-wide scale in mammals. In this thesis, I propose a method for inferring allele incompatibilities that is exclusively based on DNA sequence information. We make use of genome-wide SNP data of parent-child trios and inspect 3×3 contingency tables to detect pairs of alleles from different genomic positions that are under-represented in the population. Our method detected substantially more imbalanced allele pairs than expected from simulations assuming no interactions. We could validate a significant number of the interactions with external data, and we found that interacting loci are enriched for genes involved in developmental processes.
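The core idea of screening 3×3 genotype tables for under-represented allele combinations can be sketched by comparing each cell count against its expectation under independence of the two loci. The threshold and the actual test statistic used in the thesis are not given in the abstract, so both are assumptions here:

```python
def underrepresented_cells(table, threshold=0.2):
    """Flag genotype combinations whose observed count falls below
    `threshold` times the count expected under independence, given a
    3x3 genotype contingency table (rows: locus A, columns: locus B).

    Schematic version of the screening idea; the thesis's actual
    statistic and cutoff may differ.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    flagged = []
    for i in range(3):
        for j in range(3):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0 and table[i][j] < threshold * expected:
                flagged.append((i, j, table[i][j], expected))
    return flagged
```

An allele incompatibility would show up as a cell (a genotype combination) with essentially zero observed carriers despite a clearly positive expectation.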
Genes do not only interact with one another; their regulatory activity also depends on the environment or cellular context. The impact of genetic variation on gene expression will therefore also depend on cell types or on the cellular state. This aspect has long been neglected in the inference of genetic loci that are linked to gene expression variation (expression quantitative trait loci, eQTL). There is thus a need to develop methods for analyzing the variation of eQTL between different cell types and to assess the impact of genetic variation on expression dynamics rather than just static expression levels. In the second part of this thesis, I show that defining and detecting eQTL regulating expression dynamics is non-trivial. I propose to distinguish "static", "conditional" and "dynamic" eQTL and suggest new strategies for mapping these eQTL classes. By using murine mRNA expression data from four stages of hematopoiesis, we demonstrate that eQTL from the above three classes yield associations with different modes of expression regulation. Intriguingly, dynamic and conditional eQTL complement one another although they are based on integration of the same expression data. We reveal substantial effects of individual genetic variation on cell-state-specific expression regulation.
Item Einfluss von Dialysemodalitäten auf die Mortalität (2012-08-01) Schaller, Mathias; Ickstadt, Katja; Rahnenführer, Jörg
In this thesis, data from dialysis patients are used to identify treatment parameters that influence the survival of dialysis patients. To this end, a Cox proportional hazards model is built that accounts for time-varying and nonlinear effects as well as center frailty effects. Checking the model assumptions reveals eight covariates for which the assumptions are not met.
For these parameters, a proportional-hazards model is obtained by letting the risk vary piecewise over time or by updating the covariate values over time. Furthermore, for five of the continuous covariates the effect is found to be nonlinear; fractional polynomials and polynomials of degree 4 can capture this nonlinearity. The random center effects are best fitted with a log-t distribution. Finally, a variable selection is performed in which eleven covariates that do not improve model fit are eliminated. The effects remaining in the model indicate that part of the mortality risk is brought along by the patient through demography, while another part is due to parameters that can be influenced during dialysis. These describe different treatment approaches that must be considered during dialysis. Furthermore, the model is estimated both with the log-likelihood approach and with a Bayesian MCMC procedure. The resulting parameter estimates are very similar; only the variances of the Bayesian estimators are smaller than those of the likelihood approach. This is attributed to the fact that the Bayesian model specifies the baseline hazard function in addition to the estimators. Furthermore, a sequential Bayesian analysis is performed. Adding further data improves the results in the form of smaller variances of the parameter estimators. No advantage over a one-step procedure could be established.
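Fractional polynomials model such nonlinear covariate effects by choosing powers from a small fixed candidate set, with power 0 conventionally denoting log(x) and a repeated power adding a log-multiplied term (the Royston-Altman convention). A sketch of the basis construction only, not of the model-selection procedure used in the thesis:

```python
import math

# Standard fractional-polynomial candidate powers (Royston-Altman);
# power 0 denotes log(x) by convention.
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def fp2_terms(x, p1, p2):
    """Basis terms of a degree-2 fractional polynomial FP2(p1, p2)
    at x > 0. A repeated power uses the conventional log-multiplied
    second term: x^p and x^p * log(x)."""
    def fp(x, p):
        return math.log(x) if p == 0 else x ** p
    t1 = fp(x, p1)
    t2 = fp(x, p2) if p1 != p2 else fp(x, p1) * math.log(x)
    return t1, t2

def fp2_predict(x, beta0, beta1, beta2, p1, p2):
    """Linear-predictor contribution of one covariate under FP2."""
    t1, t2 = fp2_terms(x, p1, p2)
    return beta0 + beta1 * t1 + beta2 * t2
```

In a Cox model the two basis terms simply enter the linear predictor as a pair of derived covariates; selecting (p1, p2) from `FP_POWERS` is what makes the family flexible.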
This is attributed to the fact that in the sequential analysis no adaptation of the prior distributions took place between the steps.
Item Assessment of time-varying long-term effects of therapies and prognostic factors (2010-08-09) Buchholz, Anika; Ickstadt, Katja; Hauschke, Dieter; Schumacher, Martin
Item Bayesian mixtures for cluster analysis and flexible modeling of distributions (2010-07-02) Fritsch, Arno; Ickstadt, Katja; Weihs, Claus
Finite mixture models assume that a distribution is a combination of several parametric distributions. They offer a compromise between the interpretability of parametric models and the flexibility of nonparametric models. This thesis considers a Bayesian approach to these models, which has several advantages. For example, using only weak prior information, it can solve problems with unbounded likelihood functions that can occur in mixture models. The Bayesian approach also allows an elegant extension of finite to (countably) infinite mixture models. Depending on the application, the components of mixture models can either be viewed as just a means to the flexible modeling of a distribution or as defining subgroups of a population with different parametric distributions. Regarding the former case, consistency results for Bayesian mixtures are stated. An example concerning the flexible modeling of a random effects distribution in a logistic regression is also given; the application considers the goalkeeper's effect in saving a penalty. In the latter case, mixture models can be used for clustering. Bayesian mixtures then allow the estimation of the number of clusters at the same time as the cluster-specific parameters. For cluster analysis, the standard approach for fitting Bayesian mixtures, Markov Chain Monte Carlo (MCMC), unfortunately leads to inferential difficulties. The labels associated with the clusters can change during the MCMC run, a phenomenon called label switching.
The problem becomes severe if the number of clusters is allowed to vary. Existing methods to deal with label switching and a varying number of components are reviewed, and new approaches are proposed for both situations. The first is a variant of the relabeling algorithm of Stephens (2000). The variant is more general, as it applies to drawn clusterings rather than drawn parameter values and therefore does not depend on the specific form of the component distributions. The second approach is based on pairwise posterior probabilities and improves on a commonly used loss function due to Binder (1978). Minimizing this loss is shown to be equivalent to maximizing the posterior expected Rand index with the true clustering. As the adjusted Rand index is preferable to the raw index, maximizing the posterior expected adjusted Rand index is proposed. The new approaches are compared to the previous methods on simulated and real data. The real data used for cluster analysis are two gene expression data sets and Fisher's iris data.
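The adjusted Rand index that the proposed loss targets can be computed with the standard Hubert-Arabie formula; a minimal sketch:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same
    objects, following the standard Hubert-Arabie formula.
    1.0 means identical clusterings; 0.0 is the chance level."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)
    index = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both clusterings are trivial
    return (index - expected) / (max_index - expected)
```

Averaging this index between a candidate clustering and the clusterings drawn in the MCMC run approximates the posterior expected adjusted Rand that the thesis proposes to maximize.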