Lehrstuhl Mathematische Statistik und biometrische Anwendungen

Permanent URI for this collection

http://hdl.handle.net/2003/78

Browse

Now showing 1 - 20 of 33

Leveraging real-world biomarker data: Statistical methods for investigating missingness and longitudinal information for patient risk assessment
(2025) Hunsdieck, Berit; Ickstadt, Katja; Rahnenführer, Jörg
The increasing reliance on big data within the pharmaceutical industry underscores significant challenges related to noise and missingness, which can adversely impact data quality and the interpretation of patient outcomes. Noise arises from various sources across different data types, with genomic and transcriptomic data influenced by genetic and environmental factors, while proteomic and metabolomic data are affected by a wider array of variables, complicating the extraction of meaningful insights. Additionally, missing data—whether classified as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)—further complicates analyses, particularly in longitudinal studies. To address these challenges, this thesis emphasises the importance of replicating real-world conditions, enabling researchers to better understand data behaviour and develop robust statistical methodologies. The thesis comprises three papers that tackle these challenges from multiple perspectives. The first paper focusses on missingness in metabolomics data, proposing a novel clustering method that integrates missingness information, rather than relying solely on imputed values, to enhance clustering accuracy. This two-step clustering procedure aims to improve patient clustering outcomes, particularly when data are MNAR, and demonstrates superior performance compared to standard methods as confirmed by external validation measures. The second paper addresses the complexities of utilising longitudinal Electronic Health Record (EHR) data for health risk assessment by employing joint models that integrate longitudinal and survival data. This study simulates realistic longitudinal EHR data, incorporating noise, sample size, and cohort homogeneity, to analyse how various data quality characteristics impact model performance. The findings reveal conditions under which joint models outperform traditional Cox models in risk prediction. The third paper explores the modelling and prediction of blood pressure trajectories using data from wearable devices in hypertensive patients. For that, a framework is developed for simulating realistic blood pressure trajectories. This framework is used for evaluating the performance of novel statistical approaches for the prediction of treatment effects of antihypertensive drugs. Together, these studies contribute to a deeper understanding of the challenges posed by noise and missingness in pharmaceutical big data, while offering innovative methodologies to enhance data analysis and patient outcomes.
Variable selection methods for detecting interactions in large scale data
(2025) Teschke, Sven; Ickstadt, Katja; Schikowski, Tamara; Staerk, Christian
Large-scale data sets comprising millions of variables p, as is typical in the field of genetics, offer a wealth of information. However, it is a considerable challenge to extract this information from the data. From a biological perspective, it is desirable that this will lead to a better understanding of the development of diseases. Moreover, it is imperative to consider the interactions of genetic factors with each other and with the environment. Taking into account interactions further exacerbates the problem of the high dimensionality of the data. In addition to the computational challenges of processing the data at all, most statistical models are inapplicable or difficult to interpret in these scenarios. To address this research gap, a variable selection method was developed in this thesis that accounts for a multivariate structure and can be applied to arbitrarily large amounts of data. The selection of variables is executed through the utilization of cross-leverage scores (CLS). Due to their construction the CLS correspond to the variables individual leverage on the correlation with the multidimensional subspace spanned by the data with the outcome variable. Thus, they are directly linked to the importance of a variable also in the sense of an interaction effect. Further, under mild assumptions, each CLS equals its corresponding parameter in the least squares solution up to a small bounded additive error. In addition, in this thesis, methods have been developed and improved for the approximation of the CLS in large data. A notable advantage of these methods is their ability to be calculated streamwise, thereby overcoming the problem of processing on standard computers. Overall, a two-step procedure is recommended. In the first step, variables are selected using CLS. In the subsequent step, an established method is to be applied to the reduced data, which is appropriate for the research question, but limited in the number of input variables. The primary article of this dissertation introduces the methodology of these approaches and validates them by simulations as well as mathematically. In two additional articles, this method is employed to two large scale datasets, in order to answer biological questions. Once, in the framework of a two-step approach to identify SNP-environment interactions in COPD. In the second step, the recently developed logicDT model is applied to the reduced data. In the other paper, the CLS are directly incorporated into the calculation of so-called profile scores to estimate the risk of Alzheimer’s disease based on DNA methylation and metabolomics data.
"Integrative statistical methods for analyzing biomedical data: applications in health and disease”
(2025) Tug, Timur; Ickstadt, Katja; Rahnenführer, Jörg; Hüls, Anke
In a series of four complementary studies, we apply innovative integrative statistical methods to diverse biomedical datasets to address both fundamental research questions and practical challenges in health and disease. Two of these investigations focus on the in vivo alkaline comet assay - a pivotal tool for assessing DNA damage as a marker of genotoxicity. In the first comet assay study (Article 1), we examine the impact of different centrality measures on the evaluation of tail intensity data. Using both original experimental data and simulation frameworks, we demonstrate that even subtle variations in summarizing techniques - whether using medians, arithmetic means, or geometric means - can lead to markedly different statistical conclusions and dose–response interpretations. These findings emphasize the critical need for careful methodological selection in genotoxicity assessments. In a subsequent comet assay work (Article 2), we compile and analyze extensive historical control data from multiple laboratories. This investigation addresses key statistical issues, including inter-laboratory variability and the handling of zero-valued measurements, and discusses whether the findings from the first paper are similar in the centrality statistical measures and regulatory interpretations. In the third study (Article 3), we introduce a novel multi-omics approach to better understand Alzheimer’s disease (AD). By integrating genome-wide DNA methylation profiles with high-resolution metabolomics data from prefrontal cortex tissue samples, we develop innovative single-, joint- and multi-omics profile scores using Machine Learning and advanced regression techniques. This integrative analysis significantly improves the prediction of AD neuropathology, based on these profile scores. It also uncovers pivotal biological pathways, such as lipid metabolism and signal transduction, that are potentially involved in driving disease progression. These findings underscore the potential of combining multiple omics layers to elucidate complex molecular interactions underlying neurodegenerative disorders. Complementing these human-focused studies, our fourth investigation (Article 4) applies hierarchical modeling to veterinary epidemiology, specifically targeting respiratory diseases in piglet production. We thereby compare frequentist and Bayesian hierarchical regression models to assess the influence of various environmental and management factors - including floor condition, water flow rates, stocking density, and indoor climate conditions - on respiratory health outcomes in pigs. By accounting for the multi-level structure inherent in farm data (spanning individual animals, pens, compartments, and farms), we demonstrate that Bayesian approaches with informative priors can effectively overcome challenges posed by small sample sizes and high inter-cluster variability. This ultimately provides more robust estimates and practical insights for disease management in livestock production. Collectively, the four works of my cumulative thesis illustrate how tailored, integrative statistical methodologies can enhance our understanding of complex biological systems. These methodologies improve decision-making across a spectrum of applications, ranging from the regulatory evaluation of chemical safety and the elucidation of neurodegenerative disease mechanisms to the optimization of animal health in agricultural settings. The work emphasizes that the choice of statistical methods is not merely a technical detail but a pivotal factor that can substantially alter study outcomes and subsequent interpretations in both clinical and applied research environments. While the first two manuscripts are published, the third and fourth work are submitted and attached in its current version.
Bivariate analysis of birth weight and gestational age by Bayesian distributional regression with copulas
(2023-10-27) Rathjens, Jonathan; Kolbe, Arthur; Hölzer, Jürgen; Ickstadt, Katja; Klein, Nadja
We analyze perinatal data including biometric and obstetric information as well as data on maternal smoking, among others. Birth weight is the primarily interesting response variable. Gestational age is usually an important covariate and included in polynomial form. However, in opposition to this univariate regression, bivariate modeling of birth weight and gestational age is recommended to distinguish effects on each, on both, and between them. Rather than a parametric bivariate distribution, we apply conditional copula regression, where the marginal distributions of birth weight and gestational age (not necessarily of the same form) and the dependence structure are modeled conditionally on covariates. In the resulting distributional regression model, all parameters of the two marginals and the copula parameter are observation specific. While the Gaussian distribution is suitable for birth weight, the skewed gestational age data are better modeled by the three-parameter Dagum distribution. The Clayton copula performs better than the Gumbel and the symmetric Gaussian copula, indicating lower tail dependence (stronger dependence when both variables are low), although this non-linear dependence between birth weight and gestational age is surprisingly weak and only influenced by Cesarean section. A non-linear trend of birth weight on gestational age is detected by a univariate model that is polynomial with respect to the effect of gestational age. Covariate effects on the expected birth weight are similar in our copula regression model and a univariate regression model, while distributional copula regression reveals further insights, such as effects of covariates on the association between birth weight and gestational age.
Compressing data for generalized linear regression
(2022) Omlor, Simon; Munteanu, Alexander; Ickstadt, Katja
In this thesis we work on algorithmic data and dimension reduction techniques to solve scalability issues and to allow better analysis of massive data. For our algorithms we use the sketch and solve paradigm as well as some initialization tricks. We will analyze a tradeoff between accuracy, running time and storage. We also show some lower bounds on the best possible data reduction factors. While we are focusing on generalized linear regression mostly, logistic and p-probit regression to be precise, we are also dealing with two layer Rectified Linear Unit (ReLU) networks with logistic loss which can be seen as an extension of logistic regression, i.e. logistic regression on the neural tangent kernel. We present coresets via sampling, sketches via random projections and several algorithmic techniques and prove that our algorithms are guaranteed to work with high probability. First, we consider the problem of logistic regression where the aim is to find the parameter beta maximizing the likelihood. We are constructing a sketch in a single pass over a turnstile data stream. Depending on some parameters we can tweak size, running time and approximation guarantee of the sketch. We also show that our sketch works for other target functions as well. Second, we construct an epsilon-coreset for p-probit regression, which is a generalized version of probit regression. Therefore, we first compute the QR decomposition of a sketched version of our dataset in a first pass. We then use the matrix R to compute an approximation of the l_p-leverage scores of our data points which we use to compute sampling probabilities to construct the coreset. We then analyze the negative log likelihood of the p-generalized normal distribution to prove that this results in an epsilon-coreset. Finally, we look at two layer ReLU networks with logistic loss. Here we show that using a coupled initialization we can reduce the width of the networks to get a good approximation down from gamma^(-8) (Ji and Telgarsky, 2020) to gamma^(-2) where gamma is the so called separation margin. We further give an example where we prove that a width of gamma^(−1) is necessary to get less than constant error.
Spatial and spatio-temporal regression modelling with conditional autoregressive random effects for epidemiological and spatially referenced data
(2022) Djeudeu-Deudjui, Dany-Armand; Ickstadt, Katja; Doebler, Philipp
Regression models are suitable to analyse the association between health outcomes and environmental exposures. However, in urban health studies where spatial and temporal changes are of importance, spatial and spatio-temporal variations are usually neglected. This thesis develops and applies regression methods incorporating latent random effects terms with Conditional Autoregressive (CAR) structures in classical regression models to account for the spatial effects for cross-sectional analysis and spatio-temporal effects for longitudinal analysis. The thesis is divided into two main parts. Firstly, methods to analyse data for which all variables are given on an areal level are considered. The longitudinal Heinz Nixdorf Recall Study is used throughout this thesis for application. The association between the risk of depression and greenness at the district level is analysed. A spatial Poisson model with a latent CAR structured-Random effect is applied for selected time points. Then, a sophisticated spatio-temporal extension of the Poisson model results to a negative association between greenness and depression. The findings also suggest strong temporal autocorrelation and weak spatial effects. Even if the weak spatial effects are suggestive of neglecting them, as in the case of this thesis, spatial and spatio-temporal random effects should be taken into account to provide reliable inference in urban health studies. Secondly, to avoid ecological and atomic fallacies due to data aggregation and disaggregation, all data should be used at their finest spatial level given. Multilevel Conditional Autoregressive (CAR) models help to simultaneously use all variables at their initial spatial resolution and explain the spatial effect in epidemiological studies. This is especially important where subjects are nested within geographical units. This second part of the thesis has two goals. Essentially, it further develops the multilevel models for longitudinal data by adding existing random effects with CAR structures that change over time. These new models are named MLM tCARs. By comparing the MLM tCARs to the classical multilevel growth model via simulation studies, we observe a better performance of MLM tCARs in retrieving the true regression coefficients and with better fits. The models are comparatively applied on the analysis of the association between greenness and depressive symptoms at the individual level in the longitudinal Heinz Nixdorf Recall Study. The results show again negative association between greenness and depression and a decreasing linear individual time trend for all models. We observe once more very weak spatial variation and moderate temporal autocorrelation. Besides, the thesis provides comprehensive decision trees for analysing data in epidemiological studies for which variables have a spatial background.
River-mediated dynamic environmental factors and perinatal data analysis
(2021) Rathjens, Jonathan; Ickstadt, Katja; Groll, Andreas
Perfluorooctanoic acid (PFOA) and related per- and polyfluoroalkyl substances, a group of man-made persistent organic chemicals employed for many products, are widely distributed in the environment. Adverse health effects may occur even at low exposure levels. A large-scale PFOA contamination of drinking water resources, especially of the river Ruhr, was detected in North Rhine-Westphalia, Germany, in summer 2006. Subsequent measurements are available from the water supply stations along the river and elsewhere. The first state-wide environmental-epidemiological study on the general population analyses these secondary data together with routinely collected perinatal registry data, to estimate possible developmental-toxic effects of PFOA exposure, especially regarding birth weight (BW). Drinking water data are temporally and spatially modelled to assign estimated exposure values to the residents. A generalised linear model with an inverse link deals with the steeply decreasing temporal data pattern at mainly affected stations. Confirmed by a river-wide joint model, the river's segments between the main junctions are the most important factor to explain the spatial structure, besides local effects. Deductions from stations to areal units are made possible via estimated supply proportions. Regression of perinatal data with BW as response usually includes the gestational age (GA) as an important covariate in polynomial form. However, bivariate modelling of BW and GA is recommended to distinguish effects on each, on both, and between them. Bayesian distributional copula regression is applied, where the marginals for BW and GA as well as the copula representing their dependence structure are fitted independently and all parameters are estimated conditional on covariates. While a Gaussian is suitable for BW, the skewed GA data are better modelled by the three-parametric Dagum distribution. The Clayton copula performs better than the Gumbel and the symmetric Gaussian copula, although the lower tail dependence is weak. A non-linear trend of BW on GA is detected by the standard polynomial model. Linear effects of biometric and obstetric covariates and also of maternal smoking on BW mean are similar in both models, while the distributional copula regression also reveals effects on all other parameters. The local PFOA exposure is spatio-temporally assigned to the perinatal data of the most affected town of Arns\-berg and so included in the regression models. No significant effect results and a relatively high amount of noise remains. Perspectively and for larger regions, this can be dealt with by exposure modelling on area level using dependence information, by allowing further asymmetry in the bivariate distribution of BW and GA, and by respecting geographical structures in birth data.
Spatial and temporal analyses of perfluorooctanoic acid in drinking water for external exposure assessment in the Ruhr metropolitan area, Germany
(2020-12-04) Rathjens, Jonathan; Becker, Eva; Kolbe, Arthur; Ickstadt, Katja; Hölzer, Jürgen
Perfluorooctanoic acid (PFOA) and related chemicals among the per- and polyfluoroalkyl substances are widely distributed in the environment. Adverse health effects may occur even at low exposure levels. A large-scale contamination of drinking water resources, especially the rivers Möhne and Ruhr, was detected in North Rhine-Westphalia, Germany, in summer 2006. As a result, concentration data are available from the water supply stations along these rivers and partly from the water network of areas supplied by them. Measurements started after the contamination’s discovery. In addition, there are sparse data from stations in other regions. Further information on the supply structure (river system, station-to-area relations) and expert statements on contamination risks are available. Within the first state-wide environmental-epidemiological study on the general population, these data are temporally and spatially modelled to assign estimated exposure values to the resident population. A generalized linear model with an inverse link offers consistent temporal approaches to model each station’s PFOA data along the river Ruhr and copes with a steeply decreasing temporal data pattern at mainly affected locations. The river’s segments between the main junctions are the most important factor to explain the spatial structure, besides local effects. Deductions from supply stations to areas and, therefore, to the residents’ risk are possible via estimated supply proportions. The resulting potential correlation structure of the supply areas is dominated by the common water supply from the Ruhr. Other areas are often isolated and, therefore, need to be modelled separately. The contamination is homogeneous within most of the areas.
Streaming statistical models via Merge & Reduce
(2020-06-12) Geppert, Leo N.; Ickstadt, Katja; Munteanu, Alexander; Sohler, Christian
Merge & Reduce is a general algorithmic scheme in the theory of data structures. Its main purpose is to transform static data structures—that support only queries—into dynamic data structures—that allow insertions of new elements—with as little overhead as possible. This can be used to turn classic offline algorithms for summarizing and analyzing data into streaming algorithms. We transfer these ideas to the setting of statistical data analysis in streaming environments. Our approach is conceptually different from previous settings where Merge & Reduce has been employed. Instead of summarizing the data, we combine the Merge & Reduce framework directly with statistical models. This enables performing computationally demanding data analysis tasks on massive data sets. The computations are divided into small tractable batches whose size is independent of the total number of observations n. The results are combined in a structured way at the cost of a bounded O(logn) factor in their memory requirements. It is only necessary, though nontrivial, to choose an appropriate statistical model and design merge and reduce operations on a casewise basis for the specific type of model. We illustrate our Merge & Reduce schemes on simulated and real-world data employing (Bayesian) linear regression models, Gaussian mixture models and generalized linear models.
Statistical modeling of protein-protein interaction networks
(2018) Fermin Ruiz, Yessica Yulieth; Ickstadt, Katja; Rahnenführer, Jörg
Understanding how proteins bind to each other in a cell is the key in molecular biology to determine how experts can repair anomalies in cells. The major challenge in the prediction of protein-protein interactions is the cell-to-cell heterogeneity within a sample, due to genetic and epigenetic variabilities. Most studies about protein-protein interaction carry out their analysis without awareness of the underlying heterogeneity. This situation can lead to the identification of invalid interactions. As part of the solution to this problem, we proposed in this thesis two aspects of analysis, one for snapshot data, where different samples of ten proteins were taken by toponome imaging and another for the analysis of time correlated data that guarantees a better approximation to the prediction of protein-protein interactions. The latter represents an advance in the analysis of data with high temporal resolution, such as that obtained through the quantification technique known as multicolor live cell imaging. The thesis here presented is divided into two parts: The first part called "Revealing relationships among proteins involved in assembling focal adhesions" consists of the development of a methodology based on frequentist methods, such as machine learning and meta-analysis, for the prediction of protein-protein interaction on six different toponome imaging datasets. This methodology presents an advance in the analysis of highly heterogeneous snapshot data. Our aim here focused on the formulation of a single model capable of identifying the relationship among different samples by summing is common results over them concerning their random variation. This methodology leads to a set of common models over the six datasets hierarchized by their predictive power, where the researcher can choose the model according to its accuracy in the prediction or according to its parsimony. The developing of this part is in Chapters 1-7 â this part published in Harizanova et al. (2016). The second part is called "Modelling of temporal networks with a nonparametric mixture of dynamic Bayesian networks". The content of this part contemplates the advance of a Bayesian methodology regarding temporal networks that successfully enables to identify subpopulations in heterogeneous cell populations as well as at the same time reconstructing the protein interaction network associated with each subpopulation. This method extends the nonparametric Bayesian networks (NPBNs) (Ickstadt et al., 2011) for the analysis of time-correlated data by using Gaussian dynamic Bayesian Networks (GDBNs). We evaluate our model based on the variation of specific parameters such as the underlying number of subpopulations, network density, intra-subpopulation variability among others. On the other hand, a comparative analysis with existing clustering methods such as NPBNs and hierarchical agglomerative clustering (Hclust), shows that the inclusion of temporal correlations in the classification of multivariate time series is relevant for an improvement in the classification. The classic Hclust method using the dynamic time warping distances (T-Hclust) was found to be similar in precision to our Bayesian method here proposed. On the other hand, a comparative analysis with the GDBNs shows the lack of adjustment of the GDBNs to reconstruct temporal networks in heterogeneous cell populations through a single model, while our method, as well as the joint use of the T-Hclust classifications with the GDBNs (T-Hclust+), show a high adequacy in the prediction of temporal networks in a mixture. The developing of this part is in Chapters 8-16.
Subgroup analyses and investigations of treatment effect heterogeneity in clinical dose-finding trials
(2019) Thomas, Marius; Ickstadt, Katja; Rahnenführer, Jörg
Identifying subgroups, which respond differently to a treatment is an important part of drug development. Exploratory subgroup analyses, which have the aim to identify subgroups of patients with differential treatment effects are thus common in many randomized clinical trials. Statistically these analyses are known to be challenging the number of possible subgroups is often large, which leads to multiplicity issues. Often such subgroup analyses are also performed for early phase clinical trials, where an additional challenge is the small sample size. In recent years several statistical approaches to these problems have been proposed, employing for example tree-based recursive partitioning algorithms, which are well-suited for handling interactions, penalized regression methods, which can be used to prevent overfitting when explicitly modeling a large number of covariate effects or Bayesian approaches, which allow incorporating uncertainty and can be used to make optimal decisions with regard to subgroups. The available literature focuses however on two-arm clinical trials, where patients are randomized to the experimental treatment or a control (e.g. current standard of care or placebo). A particular focus of this cumulative thesis is the development of statistical methodology for identification of subgroups in dose-finding trials, in which patients are administered several doses of a new drug. Dose-finding trials play a key role in the drug development process, since they provide valuable information about the effect of the dose on efficacy and safety. For identifying subgroups in this setting we consider the treatment effect to be a function of the dose and then try to identify relevant covariate effects on this treatment effect curve. These identified covariates can then be used to define subgroups with higher treatment effects but also subgroups, which require a different dose of the treatment. We propose two different approaches for this purpose. Firstly, a tree-based recursive partitioning algorithm, which detects covariate effects on the parameters of dose-response models and builds a tree of subgroups with different dose-response curves. Secondly, a Bayesian hierarchical model, which makes use of shrinkage priors to prevent overfitting in the considered settings with low sample sizes and a large number of considered covariates. In addition to approaches for subgroup identification we also consider the problem of testing a prespecified subgroup in addition to the full population in dose-finding trials. In a dose-finding setting contrast tests are often used to test for a significant dose-response signal, while taking the underlying dose-response relationship into account. Optimal contrast tests can be derived, when the underlying dose-response model is known, however often there is uncertainty about this underlying model. Testing procedures, which allow for uncertainty with regard to the underlying model and perform multiple contrast tests are therefore popular approaches in such settings. As a part of this thesis we extend such approaches to settings with multiple populations, in particular the situation, in which a prespecified subgroup is considered in addition to the full population. A last part of this cumulative thesis focuses on treatment effect estimation in identified subgroups. Naive treatment effect estimates in subgroups will often suffer from selection bias, especially when the number of considered subgroups is large. Several approaches to obtain adjusted treatment effect estimates in such situations have been proposed, using resampling, model averaging or penalized regression. We compare these approaches in an extensive simulation study for a large range of scenarios, in which such analyses are performed.
Bayesian and frequentist regression approaches for very large data sets
(2018) Geppert, Leo Nikolaus; Ickstadt, Katja; Groll, Andreas
This thesis is concerned with the analysis of frequentist and Bayesian regression models for data sets with a very large number of observations. Such large data sets pose a challenge when conducting regression analysis, because of the memory required (mainly for frequentist regression models) and the running time of the analysis (mainly for Bayesian regression models). I present two different approaches that can be employed in this setting. The first approach is based on random projections and reduces the number of observations to manageable level as a first step before the regression analysis. The reduced number of observations depends on the number of variables in the data set and the desired goodness of the approximation. It is, however, independent of the number of observations in the original data set, making it especially useful for very large data sets. Theoretical guarantees for Bayesian linear regression are presented, which extend known guarantees for the frequentist case. The fundamental theorem covers Bayesian linear regression with arbitrary normal distributions or non-informative uniform distributions as prior distributions. I evaluate how close the posterior distributions of the original model and the reduced data set are for this theoretically covered case as well as for extensions towards hierarchical models and models using q-generalised normal distributions as prior. The second approach presents a transfer of the Merge & Reduce-principle from data structures to regression models. In Computer Science, Merge & Reduce is employed in order to enable the use of static data structures in a streaming setting. Here, I present three possibilities of employing Merge & Reduce directly on regression models. This enables sequential or parallel analysis of subsets of the data set. The partial results are then combined in a way that recovers the regression model on the full data set well. This approach is suitable for a wide range of regression models. I evaluate the performance on simulated and real world data sets using linear and Poisson regression models. Both approaches are able to recover regression models on the original data set well. They thus offer scalable versions of frequentist or Bayesian regression analysis for linear regression as well as extensions to generalised linear models, hierarchical models, and q-generalised normal distributions as prior distribution. Application on data streams or in distributed settings is also possible. Both approaches can be combined with multiple algorithms for frequentist or Bayesian regression analysis.
Statistische Analyse und Modellierung von Clusterphänomenen bei Signalproteinen in der Plasmamembran
(2016) Siebert, Sabrina; Ickstadt, Katja; Rahnenführer, Jörg
In der vorliegenden Arbeit wurde sich mit Clusterphänomenen von Signalproteinen beschäftigt. Diese Proteine sind dabei in der Plasmamembran lokalisiert und für die Kommunikation und den Stoffaustausch der Zelle zuständig. Die Daten wurden mit Hilfe von Fluoreszenzmikroskopie am Max-Planck-Institut für molekulare Physiologie in Dortmund in der Arbeitsgruppe von Dr. Peter J. Verveer erhoben. Um die Clusterphänomene zu untersuchen, können unterschiedliche Blickwinkel und Fragestellungen betrachtet werden. In dieser Arbeit wurde eine zeitliche, eine räumliche und eine zeitlich-räumliche Analyse entsprechender Daten vorgenommen. In der zeitlichen Analyse wurden Proteinzeitreihen untersucht. Die Proteinzeitreihe ergibt sich aus der Messung der Lichtintensität eines Spots, d.h. eines Proteinclusters, über die Zeit hinweg. Das Ziel ist hier die Segmentierung eben dieser Proteinzeitreihe. Hier wurde ein Bayessches hierarchisches Modell zur Segmentierung genutzt. Dieses lieferte dabei sinnvolle Ergebnisse, wobei jedoch zu beachten war, dass die Anzahl an Segmenten stets als fest angesehen wurde. Um diese Einschränkung aufzuheben, wurde ein Reversible Jump Schritt in das Modell aufgenommen. Mit dieser Erweiterung konnten nun sinnvolle Ergebnisse mit einer höheren Flexibilität für den Anwender erreicht werden. In der räumlichen Analyse wurde ein Pixelbild aus einer Messung einer lebenden Zelle mit Hilfe von TIRF-Mikroskopie untersucht. Ziel war hier die räumliche Clusterstruktur zu untersuchen, wobei sich auf den Anteil an Proteinen in Clustern beschränkt wurde. Dafür wurden zunächst unterschiedliche Methoden auf einer simulierten Region untersucht. Mit Hilfe dieser Ergebnisse konnte ein Anwendungsschema zur effizienten Kombination eben dieser Methoden aufgestellt werden. Dieses wurde abschließend auf einen experimentellen Datensatz sowie auf eine Dual Colour Simulation angewendet. Es zeigte sich, dass durch das Vorgehen des Schemas die Parameterwahl für einige Methoden vereinfacht wurde und sinnvolle Ergebnisse berechnet werden konnten. Abschließend wurden in der räumlich-zeitlichen Analyse Proteintracks untersucht. Diese Proteintracks geben den Weg eines Proteins in der Zellmembran über die Zeit hinweg an. Diese Messung wurde simultan für zwei Proteinarten durchgeführt, sodass hier erneut der Dual Colour Fall vorliegt. Ziel war die Bestimmung von Zusammenhängen zweier Proteintracks unterschiedlicher Proteinarten. Um den Zusammenhang bestmöglich berechnen zu können, wurde zunächst diskutiert, welche Eigenschaften einen hohen Zusammenhang repräsentieren. Anschließend wurden diese Eigenschaften zu einem Zusammenhangsmaß zusammen gefügt. Mit diesem Maß wurden zum einen ein simuliertes Beispiel und zum anderen experimentelle Daten analysiert. Es zeigte sich, dass Abhängigkeitsstrukturen durch das Maß gut widergespiegelt wurden und mit Hilfe von Cutoffs eine Auswahl entsprechender Proteintracks erfolgen konnte. Durch diese Auswahl konnten weiter interessante Regionen sowie Cluster identifiziert werden.
Bayesian prediction for stochastic process models in reliability
(2016) Hermann, Simone; Müller, Christine; Ickstadt, Katja
Unimodal spline regression and its use in various applications with single or multiple modes
(2016) Köllmann, Claudia; Ickstadt, Katja; Fried, Roland
Research in the field of non-parametric shape constrained regression has been extensive and there is need for such methods in various application areas, since shape constraints can reflect prior knowledge about the underlying relationship. This thesis develops semi-parametric spline regression approaches to unimodal regression. However, the prior knowledge in different applications is also of increasing complexity and data shapes may vary from few to plenty of modes and from piecewise unimodal to accumulations of identically or diversely shaped unimodal functions. Thus, we also go beyond unimodal regression in this thesis and propose to capture multimodality by employing piecewise unimodal regression or deconvolution models based on unimodal peak shapes. More explicitly, this thesis proposes unimodal spline regression methods that make use of Bernstein-Schoenberg-splines and their shape preservation property. To achieve unimodal and smooth solutions we use penalized splines, and extend the penalized spline approach towards penalizing against general parametric functions, instead of using just difference penalties. For tuning parameter selection under a unimodality constraint a restricted maximum likelihood and an alternative Bayesian approach for unimodal regression are developed. We compare the proposed methodologies to other common approaches in a simulation study and apply it to a dose-response data set. All results suggest that the unimodality constraint or the combination of unimodality and a penalty can substantially improve estimation of the functional relationship. A common feature of the approaches to multimodal regression is that the response variable is modelled using several unimodal spline regressions. This thesis examines mixture models of unimodal regressions, piecewise unimodal regression and deconvolution models with identical or diverse unimodal peak shapes. The usefulness of these extensions of unimodal regression is demonstrated by applying them to data sets from three different application areas: marine biology, astroparticle physics and breath gas analysis. The proposed methodologies are implemented in the statistical software environment R and the implementations and their usage are explained in this thesis as well.
Entmischung und Inferenz biomolekularer Netzwerke
(2015) Wieczorek, Jakob Jan; Ickstadt, Katja; Rahnenführer, Jörg
In dieser Arbeit werden neue statistische Konzepte zur Erkennung und Analyse von Interaktionsmustern vorgestellt. Diese werden sowohl an simulierten Daten aus dem Erk-Signalübertragungsnetzwerk als auch an experimentellen Daten des mating pathways der Hefe mit Erfolg zur Anwendung gebracht. Methodisch kann die Arbeit in zwei Themenschwerpunkte eingeteilt werden. Den Hauptschwerpunkt bildet das aus den Bayesschen Netzwerken entwickelte Verfahren der Nichtparametrischen Bayesschen Netzwerke. Dieses ist, so weit bekannt, als einzige Netzwerkinferenzmethode in der Lage, Subgruppen innerhalb der Daten zu erkennen und die Beobachtungen zu partitionieren. Weiter gelingt es in dieser Arbeit, neben dem Dirichlet-Prozess den Pitman-Yor-Prozess als a priori Verteilung der Clusterstruktur zu adaptieren. Beide Varianten des Verfahrens werden bezüglich ihrer Leistungsfähigkeit bei der Entmischung von Beobachtungen untersucht. Den zweiten Schwerpunkt der Arbeit bildet die Entwicklung einer Methode zur Schätzung von Proteinkonzentrationen, dem Komplexeschätzer. Mit ihm ist es möglich, aus Fluoreszenzkorrelationsspektroskopiemessungen (FCS) nicht wie bisher nur feste Gruppen von Proteinen zu quantifizieren, sondern gezielt einzelne Proteine und beliebige vom Anwender ausgewählte Gruppen von Proteinen zu bestimmen. Dies stellt eine deutliche Verbesserung zum gegenwärtigen Standard dar und erhöht den Informationsgewinn durch FCS-Messungen entscheidend. Mit Hilfe dieser Methode konnte eine in der Biologie bisher unbekannte Rückkopplung im Hefe mating pathway gefunden werden. Im Rahmen der Arbeit wird außerdem ein Konzept zum Clustern von gerichteten azyklischen Graphen (DAGs) entwickelt. Im Gegensatz zu den in der Literatur vorgeschlagenen Verfahren werden an die Daten keine speziellen Anforderungen gestellt. Es müssen lediglich DAGs eines festen Zeitpunkts verwendet werden. Konkret wird ein Distanzbegriff für DAGs entwickelt, welcher die Eigenschaften einer Semimetrik erfüllt. Mit ihm ist es möglich eine sinnvolle Ähnlichkeitsmatrix aufzustellen, welche zum Clustern benutzt werden kann.
Finite Bayesian mixture models with applications in spatial cluster analysis and bioinformatics
(2015) Schäfer, Martin; Ickstadt, Katja; Rahnenführer, Jörg
In many statistical applications, one encounters populations that form homogenous subgroups regarding one or several characteristics. Across the subgroups, however, heterogeneity may often be found. Mixture distributions are a natural means to model data from such applications. This PhD thesis is based on two projects that focus on such applications. In the first project, spatial nanoscale clusters formed by Ras proteins in the cell membrane are investigated. Such clusters play a crucial role in intracellular communication and are thus of interest in cancer research. In this case, the subgroups are clustered and non-clustered proteins. In the second project, epigenomic data obtained from sequencing experiments are integrated with another genomic or epigenomic input, aiming, e.g., to detect genes that contribute to the development of cancer. Here, the subgroups are defined by a) genes presenting congruent (epi)genomic aberrations in both considered variables, b) genes presenting incongruent aberrations, and c) genes lacking aberrations in at least one of the variables. Employing a Bayesian framework, objects are classified in both projects by fitting finite univariate mixture distributions with a small fixed number of components to values from a score summarizing relevant information about the research question. Such mixture distributions have favorable characteristics in terms of interpretation and present little sensitivity to label switching in Markov Chain Monte Carlo analyses. Mixtures of gamma distributions are considered for Ras proteins, while mixtures of normal and exponential or gamma distributions are a focus for the bioinformatic analysis. In the latter, classification is the primary goal, while in the Ras protein application, estimating key parameters of the spatial clustering is of more interest. The results of both projects are presented in this thesis. For both applications, the methods have been implemented in software and their performance is compared with competing approaches on experimental as well as on simulated data. To warrant an appropriate simulation of Ras protein patterns, a new cluster point process model called the double Matérn cluster process is developed and described in this thesis.
Integrativer Ansatz zur Identifizierung neuer, prognostisch relevanter Metagene mittels Clusteranalyse
(2014) Freis, Evgenia; Ickstadt, Katja; Rahnenführer, Jörg
In Germany, breast cancer is the most common leading cause of cancer deaths in women. To gain insight into the processes related to the course of the disease, human genetic data can be used to identify associations between gene expression and prognosis. In the course of the several clinical studies and numerous microarray experiments, the enormous data volume is constantly generated. Its dimensionality reduction of thousands of genes to a smaller number is the aim of the so-called metagenes that aggregate the expression data of groups of genes with similar expression patterns and may be used for investigating complex diseases like breast cancer. Here, a cluster analytic approach for identification of potentially relevant metagenes is introduced. In a first step of the approach, gene expression patterns over time of receptor tyrosine kinase ErbB2 breast cancer MCF7 cell lines to obtain promising sets of genes for a metagene calculation were used. Three independent batches of MCF7/NeuT cells were exposed to doxycycline for periods of 0, 6, 12 and 24 hours as well as for 3 and 14 days in independent experiments, due to association of the oncogenic variant of ErbB2 overexpression in breast cancer with worse prognosis. With cluster analytic approaches DIB-C (difference-based clustering algorithm) and STEM (short time-series expression miner) as well as with the finite and infinite mixture models gene clusters with similar expression patterns were identified. Two non-model-based algorithms – k-means and PFP (penalized frame potential) – as well as the model-based procedure DIRECT were applied for the method comparisons. Potentially relevant gene groups were selected by promoter and Gene Ontology (GO) analysis. The verification of the applied methods was carried out with another short time-series data set. In the second step of the approach, this gene clusters were used to calculate metagenes of the gene expression data of 766 breast cancer patients from three breast cancer studies and Cox models were applied to determine the effect of the detected metagenes on the prognosis. Using this strategy, new metagenes associated with metastasis-free survival patients were identified.
Enrichment design and sensitivity preferred classification
(2014-10-06) Agueusop, Inoncent; Ickstadt, Katja; Rahnenführer, Jörg; Vonk, Richardus
Adaption und Vergleich evolutionärer mehrkriterieller Algorithmen mit Hilfe von Variablenwichtigkeitsmaßen
(2013-07-22) Casjens, Swaantje Wiarda; Katja, Ickstadt; Ligges, Uwe
Bei der Herleitung eines Klassifikationsmodells ist neben der Vorhersagegüte auch die Güte der Variablenauswahl ein wichtiges Kriterium. Bei Einflussvariablen mit unterschiedlichen Kosten ist eine kostensensitive Klassifikation erstrebenswert, bei der ein Kompromiss aus hoher Vorhersagegüte und geringen Kosten getroffen werden kann. Werden konfliktäre Ziele, wie etwa hier die Vorhersagegüte und die Kosten, gleichzeitig optimiert, entsteht ein mehrkriterielles Optimierungsproblem, für das keine einzelne sondern eine Menge unvergleichbarer Lösungen existieren. Für das Auffinden der unvergleichbaren Lösungen sind evolutionäre mehrkriterielle Optimierungsalgorithmen (EMOAs) gut geeignet, da sie unter anderem nach verschiedenen Lösungen parallel suchen können und unabhängig von der zugrunde liegenden Datenverteilung sind. Häufig werden EMOAs für die Lösung mehrkriterieller Klassifikationsprobleme in Form von Wrapper-Ansätzen verwendet, wobei die EMOA-Individuen als binäre Zeichenketten (Bitstrings) codiert sind und jedes Bit die Verfügbarkeit der entsprechenden Einflussvariable beschreibt. Basierend auf diesen Variablenteilmengen und gegebenen Daten erstellt der umhüllte (wrapped) Klassifikationsalgorithmus ein Klassifikationsmodell, mit dem Ziel die Vorhersagegüte zu optimieren. Erst nach der Konstruktion des Klassifikationsmodells können weitere Zielkriterien, wie etwa die Kosten der selektierten Variablen, ausgewertet werden. Damit entsteht eine Hierarchie der zu optimierenden Zielkriterien mit Vorteil für die Vorhersagegüte, sodass durch einen mehrkriteriellen Wrapper-Ansatz keine nicht-hierarchischen Lösungen gefunden werden können. Diese Hierarchie der Zielfunktionen wird erstmals in Rahmen dieser Arbeit beschrieben und untersucht. Als Alternative zum mehrkriteriellen Wrapper-Ansatz wird in dieser Arbeit ein nicht-hierarchischer evolutionärer mehrkriterieller Optimierungsalgorithmus mit Baum-Repräsentation (NHEMOtree) entwickelt, um mehrkriterielle Optimierungsprobleme mit gleichberechtigten Optimierungszielen zu lösen. NHEMOtree basiert auf einem EMOA mit Baum-Repräsentation, der ohne internen Klassifikationsalgorithmus die Variablenselektion vollzieht und ohne Hierarchie in den Zielfunktionen mehrkriteriell optimierte binäre Entscheidungsbäume erstellt. Des Weiteren werden ein auf mehrkriteriellen Variablenwichtigkeitsmaßen (VIMs) basierter Rekombinationsoperator für NHEMOtree und eine NHEMOtree-Version mit lokaler Cutoff-Optimierung entwickelt. In dieser Arbeit werden erstmalig die Lösungen einer mehrkriteriellen Optimierung durch einen mehrkriteriellen Wrapper-Ansatz und durch einen EMOA mit Baum-Repräsentation (NHEMOtree) miteinander verglichen. Die Bewertung der Lösungen erfolgt dabei sowohl mittels der bekannten S-Metrik als auch durch den hier entwickelten Dominanzquotienten. Die Güte des VIM-basierten Rekombinationsoperators wird im Vergleich zum Standard-Rekombinationsoperator für EMOAs mit Baum-Repräsentation untersucht. Die mehrkriteriellen Optimierungsansätze und Operatoren werden auf medizinische und simulierte Daten angewendet. Die Ergebnisse zeigen, dass NHEMOtree bessere Lösungen als der mehrkriterielle Wrapper-Ansatz findet. Die Verwendung des VIM-basierten Rekombinationsoperators führt im Gegensatz zum Standard-Operator zu nochmals besseren Lösungen des mehrkriteriellen Optimierungsproblems und zu einer schnelleren Konvergenz des NHEMOtrees.

Browse

Recent Submissions