Statistische Methoden in der Genetik und Chemometrie
Recent Submissions
Item
Simulation study to evaluate when plasmode simulation is superior to parametric simulation in estimating the mean squared error of the least squares estimator in linear regression (2024-05-15)
Stolte, Marieke; Schreck, Nicholas; Slynko, Alla; Saadati, Maral; Benner, Axel; Rahnenführer, Jörg; Bommert, Andrea
Simulation is a crucial tool for the evaluation and comparison of statistical methods. How to design fair and neutral simulation studies is therefore of great interest for both researchers developing new methods and practitioners confronted with the choice of the most suitable method. The term simulation usually refers to parametric simulation, that is, computer experiments using artificial data made up of pseudo-random numbers. Plasmode simulation, that is, computer experiments that combine resampling feature data from a real-life dataset with generating the target variable from a known, user-selected outcome-generating model, is an alternative that is often claimed to produce more realistic data. We compare parametric and plasmode simulation for the example of estimating the mean squared error (MSE) of the least squares estimator (LSE) in linear regression. If the true underlying data-generating process (DGP) and the outcome-generating model (OGM) were known, parametric simulation would obviously be the best choice in terms of estimating the MSE well. However, in reality, both are usually unknown, so researchers have to make assumptions: in plasmode simulation studies for the OGM, in parametric simulation for both DGP and OGM. Most likely, these assumptions do not exactly reflect the truth. Here, we aim to find out how assumptions deviating from the true DGP and the true OGM affect the performance of parametric and plasmode simulations in the context of MSE estimation for the LSE, and in which situations which simulation type is preferable. Our results suggest that the preferable simulation method depends on many factors, including the number of features, and on how and to what extent the assumptions of a parametric simulation differ from the true DGP. Also, the resampling strategy used for plasmode simulation influences the results. In particular, subsampling with a small sampling proportion can be recommended.
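
To make the contrast between the two simulation types concrete, the following is a minimal, self-contained toy sketch in Python; it is not the study's actual design or code, and all numbers and distributional choices are assumptions made for illustration. Both branches estimate the MSE of the LSE for the same coefficient vector, but the parametric branch draws features from an assumed (possibly misspecified) distribution, while the plasmode branch resamples rows of an observed feature matrix and only generates the outcome from a chosen OGM.

import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 1.0

# "Observed" data set whose true feature distribution (heavy-tailed here)
# is unknown to the researcher designing the simulation study.
X_obs = rng.standard_t(df=3, size=(n, p))

def lse(X, y):
    # ordinary least squares estimate of the regression coefficients
    return np.linalg.lstsq(X, y, rcond=None)[0]

def squared_error(X):
    y = X @ beta_true + rng.normal(0.0, sigma, size=X.shape[0])
    return np.sum((lse(X, y) - beta_true) ** 2)

def parametric_sim(reps):
    # Parametric simulation: the researcher assumes a DGP for the features
    # (standard normal here), which may deviate from the truth.
    return np.mean([squared_error(rng.normal(size=(n, p))) for _ in range(reps)])

def plasmode_sim(reps):
    # Plasmode simulation: resample rows of the observed feature data and
    # generate only the outcome from the chosen OGM. (The paper also studies
    # other resampling strategies, e.g. subsampling a proportion of rows.)
    return np.mean([squared_error(X_obs[rng.choice(n, size=n, replace=True)])
                    for _ in range(reps)])

print("parametric MSE estimate:", round(parametric_sim(2000), 3))
print("plasmode   MSE estimate:", round(plasmode_sim(2000), 3))
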
Item
Modeling approaches for dose-response data in toxicology (2024)
Duda, Julia Christin; Rahnenführer, Jörg; Schorning, Kirsten
Dose-response modeling occurs in many application areas and has a rich research history. An extensively studied application field is clinical studies, where dose-response modeling is used in Phase II studies to identify the dose closest to a pre-defined effect. Many non-clinical, toxicological studies also aim at identifying a dose-response relationship. However, for non-clinical or toxicological studies there are fewer regulations or guidelines. This leads to a gap between current research advances in statistical modeling and the use of these methods in toxicological practice. In addition, toxicological dose-response studies differ from clinical studies in various technical aspects. For example, cells might be studied instead of human patients, and administered doses are constrained for laboratory and technical reasons rather than ethical considerations. Therefore, the transfer of clinical methodological knowledge into toxicological applications is only possible to a limited extent, and tailored methodologies are required that match the specific data structure of toxicological studies.
This cumulative thesis is based upon four works that all present approaches for modeling toxicological dose-response data. The first manuscript reveals the potential of applying the Multiple Comparison Procedure and Modeling (MCP-Mod) approach by Bretz et al. (2005), developed for Phase II clinical studies, to toxicological gene-expression dose-response data. In the second manuscript, a parametric, mechanistically motivated model for toxicological dose-time-response data is developed. The third manuscript is application-focused and explains the use of interaction effects when analyzing dose-response gene expression in a two-factor setting. Lastly, a non-parametric Bayesian dose-response modeling approach was developed that performs functional shrinkage for non-linear function spaces. While the first three manuscripts are published, the fourth work is attached in its current version.

Item
Benefit of using interaction effects for the analysis of high-dimensional time-response or dose-response data for two-group comparisons (2023-11-27)
Duda, Julia C.; Drenda, Carolin; Kästel, Hue; Rahnenführer, Jörg; Kappenberg, Franziska
High throughput RNA sequencing experiments are widely conducted and analyzed to identify differentially expressed genes (DEGs). The statistical models calculated for this task are often not clear to practitioners, and analyses may not be optimally tailored to the research hypothesis. Often, interaction effects (IEs) are the mathematical equivalent of the biological research question but are not considered for different reasons. We fill this gap by explaining and presenting the potential benefit of IEs in the search for DEGs using RNA-Seq data of mice that receive different diets for different time periods. Using an IE model leads to a smaller, but likely more biologically informative set of DEGs compared to a common approach that avoids the calculation of IEs.
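
As a small toy illustration of why the interaction term encodes the biological question, consider the following sketch (hypothetical data; an ordinary linear model on expression values of a single gene, chosen only to make the role of the interaction coefficient concrete, whereas an actual RNA-Seq DEG analysis relies on count models): the interaction coefficient measures whether the diet effect differs between the two time points.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# toy data for one gene: 2 diets x 2 time points, 5 mice per cell
df = pd.DataFrame({
    "diet": np.repeat(["control", "high_fat"], 10),
    "time": np.tile(np.repeat(["week4", "week12"], 5), 2),
})
cell_mean = {("control", "week4"): 5.0, ("control", "week12"): 5.1,
             ("high_fat", "week4"): 5.2, ("high_fat", "week12"): 7.0}
df["expr"] = [cell_mean[(d, t)] for d, t in zip(df.diet, df.time)]
df["expr"] += rng.normal(0, 0.3, len(df))

# The interaction term asks exactly the biological question:
# does the diet effect on this gene change between the two time points?
fit = smf.ols("expr ~ C(diet) * C(time)", data=df).fit()
print(fit.params)
print(fit.pvalues.filter(like=":"))   # p-value of the interaction effect
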
Item
Model selection characteristics when using MCP-Mod for dose–response gene expression data (2022-02-20)
Duda, Julia C.; Kappenberg, Franziska; Rahnenführer, Jörg
We extend the scope of application for MCP-Mod (Multiple Comparison Procedure and Modeling) to in vitro gene expression data and assess its characteristics regarding model selection for concentration gene expression curves. Precisely, we apply MCP-Mod on single genes of a high-dimensional gene expression data set, where human embryonic stem cells were exposed to eight concentration levels of the compound valproic acid (VPA). As candidate models we consider the sigmoid Emax (four-parameter log-logistic), linear, quadratic, Emax, exponential, and beta model. Through simulations we investigate the impact of omitting one or more models from the candidate model set to uncover possibly superfluous models and to evaluate the precision and recall rates of selected models. Each model is selected according to the Akaike information criterion (AIC) for a considerable number of genes. For less noisy cases the popular sigmoid Emax model is frequently selected. For more noisy data, often simpler models like the linear model are selected, but mostly without relevant performance advantage compared to the second best model. Also, the commonly used standard Emax model has an unexpectedly low performance.
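
For reference, the sigmoid Emax (four-parameter log-logistic) candidate model can be written, in one common parameterization (symbols chosen here for illustration and not necessarily those used in the paper), as
\[
f(d) = E_0 + \frac{E_{\max}\, d^{h}}{ED_{50}^{h} + d^{h}},
\]
with baseline response $E_0$, maximal effect $E_{\max}$, dose of half-maximal effect $ED_{50}$, and Hill slope $h$; the standard Emax model is the special case $h = 1$. Given a fitted candidate model with $k$ parameters and maximized likelihood $\hat{L}$, selection by AIC picks the model minimizing
\[
\mathrm{AIC} = 2k - 2\log \hat{L}.
\]
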
Item
Improving adaptive seamless designs through Bayesian optimization (2022-02-25)
Richter, Jakob; Friede, Tim; Rahnenführer, Jörg
We propose to use Bayesian optimization (BO) to improve the efficiency of the design selection process in clinical trials. BO is a method to optimize expensive black-box functions by using a regression as a surrogate to guide the search. In clinical trials, planning test procedures and sample sizes is a crucial task. A common goal is to maximize the test power, given a set of treatments, corresponding effect sizes, and a total number of samples. From a wide range of possible designs, we aim to select the best one in a short time to allow quick decisions. The standard approach of simulating the power for each single design can become too time-consuming. When the number of possible designs becomes very large, either large computational resources are required or an exhaustive exploration of all possible designs takes too long. Here, we propose to use BO to quickly find a clinical trial design with high power from a large number of candidate designs. We demonstrate the effectiveness of our approach by optimizing the power of adaptive seamless designs for different sets of treatment effect sizes. Comparing BO with an exhaustive evaluation of all candidate designs shows that BO finds competitive designs in a fraction of the time.
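
The general idea can be sketched as follows; this is a minimal, hypothetical illustration (a Gaussian-process surrogate with an expected-improvement criterion over a one-dimensional design space), where the power function is a cheap analytic stand-in for an expensive trial simulation, and none of it reflects the authors' actual implementation.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(7)

# Candidate designs, here encoded by a single tuning quantity
# (e.g. the fraction of the sample spent in stage 1 of a seamless design).
candidates = np.linspace(0.05, 0.95, 200).reshape(-1, 1)

def estimate_power(design):
    # Stand-in for a noisy, expensive power simulation of one design.
    # In reality each call would simulate the whole adaptive trial.
    true_power = 0.8 - 1.5 * (design - 0.4) ** 2
    return true_power + rng.normal(0, 0.02)

# Initial design: a few randomly evaluated candidates
idx = rng.choice(len(candidates), size=5, replace=False)
X = candidates[idx]
y = np.array([estimate_power(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-4, normalize_y=True)

for _ in range(15):
    gp.fit(X, y)
    mu, sd = gp.predict(candidates, return_std=True)
    best = y.max()
    # expected improvement (for maximization of the power)
    z = (mu - best) / np.maximum(sd, 1e-12)
    ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, estimate_power(x_next[0]))

print("best design found:", X[np.argmax(y)][0], "estimated power:", y.max())
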
Item
Statistische Methoden zur Validierung von Inhaltsanalysen (2023)
Koppers, Lars; Rahnenführer, Jörg; Ickstadt, Katja
The analysis of large text corpora has by now also become established in the humanities and social sciences, where the digital humanities have emerged as an entirely new field of research. For the first time it has become possible to evaluate large text corpora systematically instead of examining only samples drawn from them. At the Dortmund Center for data-based Media Analysis (DoCMA), journalism research is carried out on media corpora, with a particular focus on how topics develop in media publications. The central method used is Latent Dirichlet Allocation (LDA; Blei, Ng et al. 2003), a generative topic model that identifies topics in text corpora, where both the topic distribution and the word distribution defining a topic are assumed to lie latently behind the text. This thesis addresses three aspects in this area: an R package for preprocessing and analyzing text corpora, with an emphasis on graphics that put the temporal component of the corpora at the centre; more effective sampling for the validation of subcorpora; and an analysis of topic coherence for model selection. In text mining of media corpora, the same preprocessing steps, such as tokenization and the removal of stop words and umlauts, recur before an LDA can be run. Existing R packages could be used both for the LDA and for the preprocessing. The R package tosca provides wrappers that make the preprocessing clearer. In addition, tosca offers graphics functions tailored to the provided analysis pipeline that make it possible to obtain temporal trajectories of topics and words with little effort. For validation, the intruder words and intruder topics proposed by Blei were implemented for R. For content analyses, usually not the whole corpus but only parts of it are relevant. These parts can be identified via word filters or via LDA topics. Since the quality of the analysis depends on the quality of the resulting subcorpus, the subcorpus has to be validated, which is done by human coders. Often several attempts are needed before the selection criteria for the subcorpus have been optimized to the point where its quality is sufficient. This thesis presents a procedure in which texts for validation are not drawn at random from the entire corpus but, based on the knowledge gained in earlier rounds, are drawn from those intersections of the subcorpora that reduce the overall uncertainty the most. LDA has the problem that mathematically optimized models often do not deliver the best results in terms of content for users, while at the same time manual model selection is only feasible to a limited extent for capacity reasons. This thesis examines topic coherence (Mimno et al. 2011) as one of the proposed measures for model selection. While this measure does not allow comparing models with different parameters, it makes it possible to choose a model among repeated runs. Based on this, a procedure is presented for selecting an optimal model when users have already identified topics from other runs that are optimal for their research question.

Item
Implications on feature detection when using the benefit–cost ratio (2021-06-03)
Jagdhuber, Rudolf; Rahnenführer, Jörg
In many practical machine learning applications, there are two objectives: one is to maximize predictive accuracy and the other is to minimize costs of the resulting model. These costs of individual features may be financial costs, but can also refer to other aspects, for example, evaluation time. Feature selection addresses both objectives, as it reduces the number of features and can improve the generalization ability of the model. If costs differ between features, the feature selection needs to trade off the individual benefit and cost of each feature. A popular trade-off choice is the ratio of both, the benefit–cost ratio (BCR). In this paper, we analyze implications of using this measure with special focus on the ability to distinguish relevant features from noise. We perform simulation studies for different cost and data settings and obtain detection rates of relevant features and empirical distributions of the trade-off ratio. Our simulation studies exposed a clear impact of the cost setting on the detection rate. In situations with large cost differences and small effect sizes, the BCR missed relevant features and preferred cheap noise features. We conclude that a trade-off between predictive performance and costs without a controlling hyperparameter can easily overemphasize very cheap noise features. While the simple benefit–cost ratio offers an easy solution to incorporate costs, it is important to be aware of its risks. Avoiding costs close to 0, rescaling large cost differences, or using a hyperparameter trade-off are ways to counteract the adverse effects exposed in this paper.
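
A toy numerical illustration of the failure mode described above (all benefit and cost values are hypothetical): a near-noise feature with a near-zero cost attains the largest benefit-cost ratio and would be selected first.

import numpy as np

# hypothetical per-feature benefits (e.g. univariate association with the
# outcome) and acquisition costs; feature 2 is essentially noise but very cheap
benefit = np.array([0.40, 0.35, 0.02])
cost = np.array([10.0, 5.0, 0.05])

bcr = benefit / cost
print("benefit-cost ratios:", bcr)                 # [0.04 0.07 0.4 ]
print("selected first: feature", np.argmax(bcr))   # the cheap noise feature
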
Item
Statistical approaches for calculating alert concentrations from cytotoxicity and gene expression data (2021)
Kappenberg, Franziska; Rahnenführer, Jörg; Schorning, Kirsten
In this thesis, three different topics regarding the calculation of alert concentrations are considered. In toxicology, an alert concentration is the concentration where the response variable of interest attains or exceeds a pre-specified threshold. The first topic, handling deviating control values, considers cytotoxicity data. Often, response values for the lowest tested concentrations and the negative control do not coincide. This leads to the inability to properly interpret, or even calculate, the concentration where the curve attains a pre-specified percentage. Four different methods are proposed and compared in a controlled simulation study. All of these methods are based on the family of log-logistic functions. Based on the results from this simulation study, a concrete algorithm is stated that determines which method to use in which case. The second topic is called identification of alert concentrations and considers gene expression data. Four methods to calculate specific alert concentrations are compared in a controlled simulation study, two based on the discrete observations only and two based on a parametric model fit; of each pair, one method takes significance into account and one considers only absolute exceedance of the threshold. Results show that, generally, the methods based on modelling of curves overestimate the true underlying alert concentrations less drastically, while at the same time the number of alerts at too low concentrations does not exceed the significance level. The third topic aims at improving the estimation of the parameter in a 4pLL model corresponding to the half-maximal effect by sharing information across genes. Two approaches are presented: The first approach is to conduct a meta-analysis of the estimates of this parameter for all genes that are 'similar' to each other. The second method makes use of an empirical Bayes procedure to effectively calculate a weighted mean between the individually observed value and the mean of all observed parameter values for a large dataset. The meta-analysis approach performs worse than directly estimating the parameter of interest, but the Bayes method improves on the direct estimate in terms of the MSE.

Item
Handling deviating control values in concentration-response curves (2020-09-23)
Kappenberg, Franziska; Brecklinghaus, Tim; Albrecht, Wiebke; Blum, Jonathan; van der Wurp, Carola; Leist, Marcel; Hengstler, Jan G.; Rahnenführer, Jörg
In cell biology, pharmacology and toxicology, dose-response and concentration-response curves are frequently fitted to data with statistical methods. Such fits are used to derive quantitative measures (e.g. EC20 values) describing the relationship between the concentration of a compound or the strength of an intervention applied to cells and its effect on the viability or function of these cells. Often, a reference, called negative control (or solvent control), is used to normalize the data. The negative control data sometimes deviate from the values measured for low (ineffective) test compound concentrations. In such cases, normalization of the data with respect to control values leads to biased estimates of the parameters of the concentration-response curve. Low-quality estimates of effective concentrations can be the consequence. In a literature study, we found that this problem occurs in a large percentage of toxicological publications. We propose different strategies to tackle the problem, including complete omission of the controls. Data from a controlled simulation study indicate the best-suited problem solution for different data structure scenarios. This was further exemplified by a real concentration-response study. We provide the following recommendations on how to handle deviating controls: (1) The log-logistic 4pLL model is a good default option. (2) When there are at least two concentrations in the no-effect range, low variances of the replicate measurements, and deviating controls, control values should be omitted before fitting the model. (3) When data are missing in the no-effect range, the Brain-Cousens model sometimes leads to better results than the default model.
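
For concreteness, one common parameterization of the four-parameter log-logistic (4pLL) model mentioned in recommendation (1) is (symbols chosen here for illustration; naming and sign conventions differ between software packages)
\[
f(x) = c + \frac{d - c}{1 + (x/e)^{b}},
\]
where, for a decreasing curve with $b > 0$, $d$ is the response at concentration zero (control level), $c$ the lower asymptote at high concentrations, $e$ the inflection point (EC50), and $b$ the slope. The relative EC20 is then the concentration at which the response has dropped by 20% of the distance between $d$ and $c$, i.e. $f(\mathrm{EC}_{20}) = d - 0.2\,(d - c)$, which here gives $\mathrm{EC}_{20} = e\,(1/4)^{1/b}$.
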
Item
Integration of feature selection stability in model fitting (2020)
Bommert, Andrea Martina; Rahnenführer, Jörg; Weihs, Claus
In this thesis, four aspects connected to feature selection are analyzed: Firstly, a benchmark of filter methods for feature selection is conducted. Secondly, measures for the assessment of feature selection stability are compared both theoretically and empirically. Some of the stability measures are newly defined. Thirdly, a multi-criteria approach for obtaining desirable models with respect to predictive accuracy, feature selection stability, and sparsity is proposed and evaluated. Fourthly, an approach for finding desirable models for data sets with many similar features is suggested and evaluated. For the benchmark of filter methods, 20 filter methods are analyzed. First, the filter methods are compared with respect to the order in which they rank the features and with respect to their scaling behavior, identifying groups of similar filter methods. Next, the predictive accuracy of the filter methods when combined with a predictive model and the run time are analyzed, resulting in recommendations on filter methods that work well on many data sets. To identify suitable measures for stability assessment, 20 stability measures are compared based on both theoretical properties and their empirical behavior. Five of the measures are newly proposed by us. Groups of stability measures that consider the same feature sets as stable or unstable are identified, and the impact of the number of selected features on the stability values is studied. Additionally, the run times for calculating the stability measures are analyzed. Based on all analyses, recommendations are made on which stability measures should be used in future analyses. When searching for a good predictive model, the predictive accuracy is usually the only criterion considered in the model finding process. In this thesis, the benefits of additionally considering the feature selection stability and the number of selected features are investigated. To find desirable configurations with respect to all three performance criteria, the hyperparameter tuning is performed in a multi-criteria fashion. This way, it is possible to find configurations that perform a more stable selection of fewer features without losing much predictive accuracy compared to model fitting that considers only the predictive performance. Also, with multi-criteria tuning, models are obtained that over-fit the training data less than the models obtained with single-criteria tuning with respect to predictive accuracy only. For data sets with many similar features, we propose the approach of employing L0-regularized regression and tuning its hyperparameter in a multi-criteria fashion with respect to both predictive accuracy and feature selection stability. We suggest assessing the stability with an adjusted stability measure, that is, a stability measure that takes into account similarities between features. The approach is evaluated based on both simulated and real data sets. Based on simulated data, it is observed that the proposed approach achieves the same or better predictive performance compared to established approaches. In contrast to the competing approaches, the proposed approach succeeds at selecting the relevant features while avoiding irrelevant or redundant features. On real data, the proposed approach is beneficial for fitting models with fewer features without losing predictive accuracy.
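
As a concrete example of the kind of quantity being compared, here is one of the simplest stability measures, the mean pairwise Jaccard similarity of the feature sets selected on different subsamples of the data. This is a sketch for illustration only; the thesis compares 20 measures, including adjusted measures that take similarities between features into account.

from itertools import combinations

def jaccard(a, b):
    # Jaccard similarity of two selected feature sets
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def stability(selected_sets):
    """Mean pairwise Jaccard similarity of the feature sets selected
    on different resampled training sets (higher = more stable)."""
    pairs = list(combinations(selected_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# feature sets selected on three subsamples of the same data
print(stability([{"g1", "g2", "g5"}, {"g1", "g2"}, {"g1", "g3", "g5"}]))
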
Item
Modelling with feature costs under a total cost budget constraint (2020)
Jagdhuber, Rudolf; Rahnenführer, Jörg; Ligges, Uwe
In modern high-dimensional data sets, feature selection is an essential pre-processing step for many statistical modelling tasks. The field of cost-sensitive feature selection extends the concepts of feature selection by introducing so-called feature costs. These do not necessarily relate to financial costs, but can be seen as a general construct to numerically valuate any disfavored aspect of a feature, like for example the run-time of a measurement procedure, or the patient harm of a biomarker test. There are multiple ideas to define a cost-sensitive feature selection setup. The strategy applied in this thesis is to introduce an additive cost budget as an upper bound of the total costs. This extends the standard feature selection problem by an additional constraint on the sum of costs of the included features. Main areas of research in this field include adaptations of standard feature selection algorithms to account for this additional constraint. However, cost-aware selection criteria also play an important role for the overall performance of these methods and need to be discussed in detail as well. This cumulative dissertation summarizes the work of three papers in this field. Two of these introduce new methods for cost-sensitive feature selection with a fixed budget constraint. The other discusses a common trade-off criterion of performance and cost. For this criterion, an analysis of the selection outcome in different setups revealed a reduction of the ability to distinguish between information and noise. This can, for example, be counteracted by introducing a hyperparameter in the criterion. The presented research on new cost-sensitive methods comprises adaptations of Greedy Forward Selection, Genetic Algorithms, filter approaches, and a novel Random Forest based algorithm, which selects individual trees from a low-cost tree ensemble. Central concepts of each method are discussed, and thorough simulation studies to evaluate individual strengths and weaknesses are provided. Every simulation study includes artificial as well as real-world data examples to validate results in a broad context. Finally, all chapters present discussions with practical recommendations on the application of the proposed methods and conclude with an outlook on possible further research for the respective topics.

Item
Extending model-based optimization with resource-aware parallelization and for dynamic optimization problems (2020)
Richter, Jakob; Rahnenführer, Jörg; Groll, Andreas
This thesis contains two works on the topic of sequential model-based optimization (MBO). In the first part an extension of MBO towards resource-aware parallelization is presented, and in the second part MBO is adapted to optimize dynamic optimization problems. Before the newly developed methods are introduced, the reader is given a detailed introduction into various aspects of MBO and related work. This covers thoughts on the choice of the initial design, the surrogate model, the acquisition functions, and the final optimization result. As most methods in this thesis rely on Gaussian process regression, it is covered in detail as well. The chapter on “Parallel MBO” dives into the topic of making use of multiple workers that can evaluate the black-box function and especially focuses on the problem of heterogeneous runtimes. Strategies that tackle this problem can be divided into synchronous and asynchronous methods. Instead of proposing one configuration in an iterative fashion, as done by ordinary MBO, synchronous methods usually propose as many configurations as there are workers available. Previously proposed synchronous methods neglect the problem of heterogeneous runtimes, which causes idling when evaluations end at different times. This work presents current methods for parallel MBO, covering synchronous and asynchronous methods, and introduces the newly proposed Resource-Aware Model-based Optimization (RAMBO) framework. It shows that synchronous and asynchronous methods each have their advantages and disadvantages, and that RAMBO can outperform common synchronous MBO methods if the runtime is predictable while still obtaining comparable results in the worst case. The chapter on “MBO with Concept Drift” (MBO-CD) explains the adaptations that have been developed to allow optimization of black-box functions that change systematically over time. Two approaches are explained for how MBO can be taught to handle black-box functions where the relation between input and output changes over time, i.e. where a concept drift occurs. The window approach trains the surrogate only on the most recent observations. The time-as-covariate approach includes the time as an additional input variable in the surrogate, giving it the ability to learn the effect of time. For the latter, a special acquisition function, the temporal expected improvement, is proposed.
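
For context, the standard expected improvement criterion on which such acquisition functions build can be written as follows; this is the textbook form for minimizing a black-box function, not the temporal variant proposed in the thesis. With surrogate posterior mean $\hat{\mu}(x)$, posterior standard deviation $\hat{\sigma}(x)$, and best observed value $y_{\min}$,
\[
\mathrm{EI}(x) = \bigl(y_{\min} - \hat{\mu}(x)\bigr)\,\Phi(z) + \hat{\sigma}(x)\,\varphi(z),
\qquad z = \frac{y_{\min} - \hat{\mu}(x)}{\hat{\sigma}(x)},
\]
where $\Phi$ and $\varphi$ denote the standard normal distribution and density functions. The temporal expected improvement modifies this criterion to account for the time covariate; details are given in the thesis.
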
Item
Statistische Analyse von MCC-IMS-Messungen (2020)
Horsch, Salome; Rahnenführer, Jörg; Ickstadt, Katja
Analysing a person's exhaled breath for diagnostic purposes has several advantages over other methods such as blood tests. Breath is always available and collecting it is safe, since no intervention into the body is required. If multi-capillary column ion mobility spectrometry (MCC-IMS) is used to analyse the breath, the measurement is completed within a few minutes and could, in principle, be evaluated immediately. For this to become possible, however, the resulting raw measurements have to be processed automatically; at present this is still done by manual inspection of the raw measurements. To be able to replace this gold standard with automatic procedures, numerous combinations of algorithms were tested in this thesis. Since the aim in breath analysis is often to distinguish diseased from healthy persons, the methods were applied to three corresponding data sets and, in addition, different classification algorithms were tested. An automatic combination of algorithms that achieves good results for the individual analysis steps was recommended for future use. The second part of the thesis dealt with factors that influence exhaled breath in MCC-IMS measurements. The effects of sex, smoking status, the influence of a food item, and the influence of the device used were examined. In particular, the measurements of the two devices under study showed clear differences. These differences were examined in detail in the thesis, and approaches to solving the problem were presented.
Item
Cost-constrained feature selection in binary classification: adaptations for greedy forward selection and genetic algorithms (2020-01-28)
Jagdhuber, Rudolf; Lang, Michel; Stenzl, Arnulf; Neuhaus, Jochen; Rahnenführer, Jörg
Background: With modern methods in biotechnology, the search for biomarkers has advanced to a challenging statistical task exploring high-dimensional data sets. Feature selection is a widely researched preprocessing step to handle huge numbers of biomarker candidates and has special importance for the analysis of biomedical data. Such data sets often include many input features not related to the diagnostic or therapeutic target variable. A less researched, but also relevant aspect for medical applications are the costs of different biomarker candidates. These costs are often financial costs, but can also refer to other aspects, for example the decision between a painful biopsy marker and a simple urine test. In this paper, we propose extensions to two feature selection methods to control the total amount of such costs: greedy forward selection and genetic algorithms. In comprehensive simulation studies of binary classification tasks, we compare the predictive performance, the run-time and the detection rate of relevant features for the newly proposed methods and five baseline alternatives to handle budget constraints.
Results: In simulations with a predefined budget constraint, our proposed methods outperform the baseline alternatives, with just minor differences between them. Only in the scenario without an actual budget constraint did our adapted greedy forward selection approach show a clear drop in performance compared to the other methods. However, introducing a hyperparameter to adapt the benefit-cost trade-off in this method could overcome this weakness.
Conclusions: In feature cost scenarios where a total budget has to be met, common feature selection algorithms are often not suitable to identify well-performing subsets for a modelling task. Adaptations of these algorithms, such as the ones proposed in this paper, can help to tackle this problem.
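
To make the basic setup concrete, here is a minimal sketch of budget-constrained greedy forward selection; it illustrates only the general principle under assumed scoring and stopping choices and is not the adaptation proposed in the paper. In each round, only features whose cost still fits into the remaining budget are candidates, and the one giving the best cross-validated accuracy is added.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_forward_budget(X, y, costs, budget, cv=5):
    """Greedy forward selection that only considers features whose cost
    still fits into the remaining budget; stops when no affordable feature
    improves the cross-validated accuracy."""
    selected, spent, best_score = [], 0.0, -np.inf
    while True:
        affordable = [j for j in range(X.shape[1])
                      if j not in selected and spent + costs[j] <= budget]
        if not affordable:
            break
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=cv).mean()
                  for j in affordable}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break
        selected.append(j_best)
        spent += costs[j_best]
        best_score = scores[j_best]
    return selected, spent, best_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)
costs = rng.uniform(1, 10, size=X.shape[1])
print(greedy_forward_budget(X, y, costs, budget=15.0))
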
Item
The G protein-coupled bile acid receptor TGR5 (Gpbar1) modulates endothelin-1 signaling in liver (2019-11-19)
Klindt, Caroline; Reich, Maria; Hellwig, Birte; Stindt, Jan; Rahnenführer, Jörg; Hengstler, Jan G.; Köhrer, Karl; Schoonjans, Kristina; Häussinger, Dieter; Keitel, Verena
TGR5 (Gpbar1) is a G protein-coupled receptor responsive to bile acids (BAs), which is expressed in different non-parenchymal cells of the liver, including biliary epithelial cells, liver-resident macrophages, sinusoidal endothelial cells (LSECs), and activated hepatic stellate cells (HSCs). Mice with targeted deletion of TGR5 are more susceptible towards cholestatic liver injury induced by cholic acid feeding and bile duct ligation, resulting in a reduced proliferative response and increased liver injury. Conjugated lithocholic acid (LCA) represents the most potent TGR5 BA ligand, and LCA feeding has been used as a model to rapidly induce severe cholestatic liver injury in mice. Thus, TGR5 knockout (KO) mice and wildtype (WT) littermates were fed a diet supplemented with 1% LCA for 84 h. Liver injury and gene expression changes induced by the LCA diet revealed an enrichment of pathways associated with inflammation, proliferation, and matrix remodeling. Knockout of TGR5 in mice caused upregulation of endothelin-1 (ET-1) expression in the livers. Analysis of TGR5-dependent ET-1 signaling in isolated LSECs and HSCs demonstrated that TGR5 activation reduces ET-1 expression and secretion from LSECs and triggers internalization of the ET-1 receptor in HSCs, dampening ET-1 responsiveness. Thus, we identified two independent mechanisms by which TGR5 inhibits ET-1 signaling and modulates portal pressure.

Item
Survival models with selection of genomic covariates in heterogeneous cancer studies (2018)
Madjar, Katrin; Rahnenführer, Jörg; Ickstadt, Katja
Building a risk prediction model for a specific subgroup of patients based on high-dimensional molecular measurements such as gene expression data is an important current field of biostatistical research. Major objectives in modeling high-dimensional data are good prediction performance and finding a subset of covariates that are truly relevant to the outcome (here: a time-to-event endpoint). The latter requires variable selection to obtain a sparse, interpretable model solution. In this thesis, one further objective in modeling is taking into account heterogeneity in the data due to known subgroups of patients that may differ in their relationship between genomic covariates and survival outcome. We consider multiple cancer studies as subgroups; however, our approaches can be applied to any other subgroups, for example, defined by clinical covariates. We aim at providing a separate prediction model for each subgroup that allows the identification of common as well as subgroup-specific effects and has improved prediction accuracy over standard approaches. Standard subgroup analysis includes only patients of the subgroup of interest and may lead to a loss of power when the sample size is small, whereas standard combined analysis simply pools patients of all subgroups and may suffer from biased results and averaging of subgroup-specific effects. To overcome these drawbacks, we propose two different statistical models that allow sharing information between subgroups to increase power when this is supported by the data. One approach is a classical frequentist Cox proportional hazards model with a lasso penalty for variable selection and a weighted version of the Cox partial likelihood that includes patients of all subgroups but assigns them individual weights based on their subgroup affiliation. Patients who fit well to the subgroup of interest receive higher weights in the subgroup-specific model. The other approach is a novel Bayesian Cox model that uses a stochastic search variable selection prior with latent indicators of variable inclusion. We assume a sparse graphical model that links genes within subgroups and the same genes across different subgroups. This graph structure is not known a priori and is inferred simultaneously with the important variables of each subgroup. Both approaches are evaluated through extensive simulations and applied to real lung cancer studies. Simulation results demonstrate that our proposed models can achieve improved prediction and variable selection accuracy over standard subgroup models when the sample size is low. As expected, the standard combined model only identifies common effects but fails to detect subgroup-specific effects.
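
The weighted, lasso-penalized Cox approach mentioned above can be sketched in one common generic form (written here for illustration; the exact weighting and penalization scheme used in the thesis may differ in its details). For subgroup-specific weights $w_i$ reflecting how well patient $i$ fits the subgroup of interest, event indicator $\delta_i$, covariates $x_i$, and risk set $R(t_i)$ at event time $t_i$, the estimate maximizes
\[
\ell_w(\beta) \;=\; \sum_{i:\,\delta_i = 1} w_i \left[ x_i^\top \beta \;-\; \log \sum_{j \in R(t_i)} w_j \exp\!\bigl(x_j^\top \beta\bigr) \right] \;-\; \lambda \sum_{k=1}^{p} |\beta_k|,
\]
where the lasso penalty with tuning parameter $\lambda$ performs the variable selection.
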
Item
Clustermethoden für Massenspektren in proteomweiten statistischen Analysen (2018)
Rieder, Vera; Rahnenführer, Jörg; Weihs, Claus
This thesis deals with cluster methods for mass spectrometric analyses in biodiversity research. As an alternative to species identification via DNA barcoding, the analysis of the protein composition of organisms is used. The majority of protein analytics is now based on the so-called LC-MS/MS method, in which liquid chromatography (LC) as a separation method is combined with tandem mass spectrometry (MS/MS). Tandem mass spectra, which consist of detected intensities of occurring masses, are used to identify peptides and proteins via database search algorithms. Novel, unknown peptides are nowadays detected with error-prone de novo peptide sequencing algorithms. As an alternative to such annotation procedures, the direct cluster analysis of tandem mass spectra is treated here. Two aspects are investigated: the cluster analysis of so-called runs, each comprising thousands of spectra of a protein sample, and the cluster analysis of individual tandem mass spectra. A cluster analysis of runs is carried out for several real data sets using the new method DISMS2, which determines distances between MS/MS runs without annotations. It is thus an alternative to the comparison of peptide lists that are based on the identification of spectra in database searches. The parameters of DISMS2 can be chosen freely, so that the selection of the highest peaks per spectrum (topn), the bin size used for binning (bin), the restriction of spectrum comparisons to spectra that are close in time (ret) and have a similar precursor mass (prec), and the distance measure for mass spectra (dist) with a freely selectable threshold (cdis) can be varied. For parameter selection, an optimization procedure is applied that uses the coefficient of determination R2 of a non-parametric analysis-of-variance method. For the cluster analysis of individual mass spectra, a comprehensive comparison, so far missing in the literature, is compiled of algorithms that are established for tandem mass spectra (CAST, MS-Cluster, PRIDE Cluster), known for large data sets (hierarchical cluster analysis, DBSCAN, connected components of a graph), or new (neighbor clustering). The evaluation is based on real data and several quality measures.

Item
Klassifikation von Brustkrebspatientinnen anhand vorausgewählter Gene mit charakteristischer Expressionsverteilung (2018)
Hellwig, Birte; Rahnenführer, Jörg; Ligges, Uwe
The aim of this thesis is to use gene expression data to build classifiers for breast cancer patients that predict whether a patient will develop a distant metastasis within the first five years after surgery or remain metastasis-free. The requirements for the classifier are that it has high prognostic accuracy and is at the same time easy to interpret. The idea of the thesis is therefore to identify genes whose expression distributions clearly separate a group with low expression from a group with high expression, and then to use these genes to construct classifiers. Different scores are used to identify genes with such a characteristic expression distribution. There are different approaches for this, such as using clustering methods together with measures that assess the resulting grouping. Alternative approaches are defining an outlier group or the dip test for unimodality. The bimodality measures are applied to the expression data of a cohort of 200 node-negative, untreated breast cancer patients. The genes with the most striking bimodality scores are then used to construct classification trees and random forests. For both approaches, different parameter settings are examined, distinguishing in particular how the genetic variables are used (continuous or dichotomized). Finally, random forests with optimized parameters are built without any preselection of genes. Two independent cohorts of node-negative, untreated patients are used to validate the models. Established gene signatures serve as a reference for judging the classification accuracy of the newly developed classifiers. Simple classification trees, unlike random forests, lead to interpretable classifiers but are slightly inferior in terms of prognostic accuracy. Random forests with preselected genes lead to classifiers of acceptable accuracy; here it is important that the expression values of a preselected gene are dichotomized directly on the basis of their distribution. In the validation, the newly developed models show a tendency to overfit, so that established classifiers are partly superior.
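
To illustrate the notion of scoring a gene for a characteristically bimodal expression distribution, here is a toy score; it is not one of the measures examined in the thesis (which include cluster-based indices, an outlier-group definition, and the dip test) and serves only to make the idea concrete: it searches for the split of the sorted expression values into a low and a high group that maximizes the gap between the group means relative to the within-group spread.

import numpy as np

def bimodality_score(x):
    """Toy bimodality score for the expression values of one gene:
    the best separation between a low and a high group, measured as the
    gap between group means divided by the pooled within-group spread."""
    x = np.sort(np.asarray(x, dtype=float))
    best = -np.inf
    for k in range(2, len(x) - 1):          # candidate cut points
        lo, hi = x[:k], x[k:]
        pooled_sd = np.sqrt((lo.var(ddof=1) + hi.var(ddof=1)) / 2)
        best = max(best, (hi.mean() - lo.mean()) / max(pooled_sd, 1e-12))
    return best

rng = np.random.default_rng(0)
unimodal = rng.normal(0, 1, 200)
bimodal = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
print(bimodality_score(unimodal), bimodality_score(bimodal))
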
Item
No longer confidential (2012-11-15)
Briesemeister, Sebastian; Rahnenführer, Jörg; Kohlbacher, Oliver
Quantitative predictions in computational life sciences are often based on regression models. The advent of machine learning has led to highly accurate regression models that have gained widespread acceptance. While there are statistical methods available to estimate the global performance of regression models on a test or training dataset, it is often not clear how well this performance transfers to other datasets or how reliable an individual prediction is, a fact that often reduces a user's trust in a computational method. In analogy to the concept of an experimental error, we sketch how estimators for individual prediction errors can be used to provide confidence intervals for individual predictions. Two novel statistical methods, named CONFINE and CONFIVE, can estimate the reliability of an individual prediction based on the local properties of nearby training data. The methods can be applied equally to linear and non-linear regression methods with very little computational overhead. We compare our confidence estimators with other existing confidence and applicability domain estimators on two biologically relevant problems (MHC–peptide binding prediction and quantitative structure-activity relationship (QSAR)). Our results suggest that the proposed confidence estimators perform comparably to or better than previously proposed estimation methods. Given a sufficient amount of training data, the estimators exhibit error estimates of high quality. In addition, we observed that the quality of estimated confidence intervals is predictable. We discuss how confidence estimation is influenced by noise, the number of features, and the dataset size. Estimating the confidence in individual predictions in terms of error intervals represents an important step from plain, non-informative predictions towards transparent and interpretable predictions that will help to improve the acceptance of computational methods in the biological community.

Item
Going from where to why (2010-03-17)
Briesemeister, Sebastian; Rahnenführer, Jörg; Kohlbacher, Oliver
Motivation: Protein subcellular localization is pivotal in understanding a protein's function. Computational prediction of subcellular localization has become a viable alternative to experimental approaches. While current machine learning-based methods yield good prediction accuracy, most of them suffer from two key problems: lack of interpretability and dealing with multiple locations.
Results: We present YLoc, a novel method for predicting protein subcellular localization that addresses these issues. Due to its simple architecture, YLoc can identify the relevant features of a protein sequence contributing to its subcellular localization, e.g. localization signals or motifs relevant to protein sorting. We present several example applications where YLoc identifies the sequence features responsible for protein localization, and thus reveals not only to which location a protein is transported, but also why it is transported there. YLoc also provides a confidence estimate for the prediction. Thus, the user can decide what level of error is acceptable for a prediction. Due to a probabilistic approach and the use of several thousands of dual-targeted proteins, YLoc is able to predict multiple locations per protein. YLoc was benchmarked using several independent datasets for protein subcellular localization and performs on par with other state-of-the-art predictors. Disregarding low-confidence predictions, YLoc can achieve prediction accuracies of over 90%. Moreover, we show that YLoc is able to reliably predict multiple locations and outperforms the best predictors in this area.