Eldorado - Repository of the TU Dortmund

Resources for and from Research, Teaching and Studying

This is the institutional repository of the TU Dortmund. Resources for Research, Teaching and Studying are archived and made publicly available.


Recent Submissions

Item
Classic statistical and modern machine learning methods for modeling and prediction of major tennis tournaments
(2025) Buhamra, Nourah; Groll, Andreas; Pauly, Markus
The cumulative dissertation proposes a comprehensive approach to predicting outcomes in Grand Slam tennis tournaments, focusing on the probability that the first-named player will win. Our study incorporates several classical regression and machine learning models, evaluated using cross-validation and external validation through performance measures such as classification rate, predictive likelihood, and Brier score. Two specific aspects are examined in greater detail: non-linear effects and the inclusion of additional player and court-specific abilities. Moreover, we analyze the predictive potential of statistically enhanced covariates and apply procedures from the field of interpretable machine learning to make complex models more understandable. Our analyses show that in predicting Grand Slam tennis matches, while there are slight differences across various statistical and machine learning approaches, the specific forecasting strategy used plays an even more critical role. Additionally, the results confirm that enhanced variables contribute positively to model performance and provide deeper insights into predictors of match outcomes in sports analytics.
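The evaluation strategy described in this abstract can be sketched briefly. Below is a minimal, hedged illustration in Python, not the authors' code: the two covariates are synthetic placeholders for player-specific features, and a logistic regression and a random forest stand in for the classical and machine learning model families, compared via out-of-fold classification rate, predictive log-likelihood, and Brier score.

```python
# Minimal sketch (not the authors' code): cross-validated evaluation of
# win-probability models with classification rate, predictive
# log-likelihood, and Brier score. All data below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, brier_score_loss, log_loss

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))  # hypothetical covariates, e.g. ranking and serve-strength differences
p_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
y = rng.binomial(1, p_true)  # 1 = the first-named player wins

models = {
    "logistic regression": LogisticRegression(),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    # Out-of-fold win probabilities from 5-fold cross-validation.
    p_hat = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    print(name)
    print(f"  classification rate: {accuracy_score(y, (p_hat > 0.5).astype(int)):.3f}")
    print(f"  Brier score:         {brier_score_loss(y, p_hat):.3f}")
    print(f"  mean log-likelihood: {-log_loss(y, p_hat):.3f}")
```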
Item
Development of machine-learning-based material models for finite element simulations
(2024) Böhringer, Pauline; Rudolph, Günter; Wiederkehr, Petra
Finite element simulations are essential for the structural analysis of mechanical components and are used in areas such as forming processes and crash tests. The accuracy of such simulations depends strongly on the material models employed, whose construction is, however, complex. This thesis investigates whether classical material models can be replaced by data-driven models obtained through machine learning (ML). To this end, various ML models are trained on randomly generated data from classical material models and evaluated with respect to their suitability. The second part presents an approach for training ML models directly on experimental data, without a classical reference model, using physically motivated equations for the training. The focus is on adapting and evaluating different training methods for the ML material models.
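To make the first part of this approach concrete, here is a minimal sketch (an assumption-laden illustration, not code from the thesis): a simple Ludwik-type hardening law stands in for the classical material model, and a small neural network is trained as a surrogate on randomly generated data from it.

```python
# Minimal sketch (not from the thesis): an ML surrogate trained on randomly
# generated data from a classical material model. A Ludwik-type hardening
# law, sigma = sigma_0 + K * eps^n, serves as the assumed classical model.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
eps = rng.uniform(0.0, 0.2, size=(5000, 1))     # sampled plastic strains
sigma = 200.0 + 500.0 * eps[:, 0] ** 0.25       # stress from the classical law (MPa)

# Train the surrogate to reproduce the classical stress-strain relation.
surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=1)
surrogate.fit(eps, sigma)

# The fitted surrogate could then replace the classical law inside an
# FE material routine; here we only query it at 5 % plastic strain.
print(surrogate.predict([[0.05]]))
```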
Item
Variable selection methods for detecting interactions in large scale data
(2025) Teschke, Sven; Ickstadt, Katja; Schikowski, Tamara; Staerk, Christian
Large-scale data sets comprising millions of variables p, as is typical in the field of genetics, offer a wealth of information, but extracting this information from the data is a considerable challenge. From a biological perspective, doing so promises a better understanding of how diseases develop. Moreover, it is imperative to consider the interactions of genetic factors with each other and with the environment, and taking interactions into account further exacerbates the problem of the high dimensionality of the data. Beyond the computational challenge of processing the data at all, most statistical models are inapplicable or difficult to interpret in these scenarios. To address this research gap, a variable selection method was developed in this thesis that accounts for a multivariate structure and can be applied to arbitrarily large amounts of data. Variables are selected using cross-leverage scores (CLS). By construction, the CLS correspond to each variable's individual leverage on the correlation of the multidimensional subspace spanned by the data with the outcome variable. They are thus directly linked to the importance of a variable, also in the sense of an interaction effect. Further, under mild assumptions, each CLS equals its corresponding parameter in the least squares solution up to a small bounded additive error. In addition, methods for approximating the CLS in large data sets were developed and improved in this thesis. A notable advantage of these methods is that they can be computed streamwise, which makes processing feasible on standard computers. Overall, a two-step procedure is recommended: in the first step, variables are selected using CLS; in the second step, an established method that suits the research question but is limited in the number of input variables it can handle is applied to the reduced data. The primary article of this dissertation introduces the methodology of these approaches and validates it both mathematically and through simulations. In two additional articles, the method is applied to two large-scale data sets to answer biological questions: first, within a two-step approach to identify SNP-environment interactions in COPD, where the recently developed logicDT model is applied to the reduced data in the second step; second, by incorporating the CLS directly into the calculation of so-called profile scores to estimate the risk of Alzheimer’s disease based on DNA methylation and metabolomics data.
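For illustration, the sketch below walks through the recommended two-step procedure on synthetic data. The exact construction of the CLS is given in the primary article; the score computed here (inner products between each variable's row and the outcome's row in the left singular vectors of the stacked matrix of variables and outcome) is only one plausible reading of the description above, and the lasso in the second step merely stands in for an established method such as logicDT.

```python
# Rough sketch of the two-step procedure on synthetic data. The CLS
# computation below is an illustrative stand-in, not the exact definition
# from the primary article.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 5000                        # p >> n, as in genetic data
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                          # only a few truly relevant variables
y = X @ beta + rng.normal(size=n)

# Step 1: score every variable and keep the top k. Variables and outcome
# are stacked as rows; each variable's score is the inner product of its
# left-singular-vector row with that of the outcome (an assumption here).
A = np.vstack([X.T, y[None, :]])
U, _, _ = np.linalg.svd(A, full_matrices=False)
cls = U[:-1] @ U[-1]
k = 50
selected = np.argsort(np.abs(cls))[-k:]

# Step 2: apply an established, input-limited method to the reduced data.
model = Lasso(alpha=0.1).fit(X[:, selected], y)
print(sorted(selected[model.coef_ != 0]))  # variables surviving both steps
```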
Item
Optimal design theory of dose-response experiments in toxicology
(2025) Schürmeyer, Leonie; Schorning, Kirsten; Rahnenführer, Jörg; Hengstler, Jan Georg
Rapid developments in recent years have produced new research approaches in sciences such as statistics and toxicology. However, those approaches often remain within one of the two disciplines and neglect important aspects of the other. Especially in the crucial phase of planning an experiment, laboratory routine in toxicology does not draw on the statistical theory of optimal design, although much research already exists that could help to improve the results. This demonstrates a huge gap between practical applications in toxicological research and existing statistical theory. On the one hand, this gap exists because statistical methods specifically tailored to toxicological applications are missing; on the other hand, where such methods do exist, they are not reported in a manner clear to non-statisticians. The consequences for toxicological experiments are a waste of observations, or even worse animals, and non-optimal results in terms of precision. This is therefore an important issue at the interface of statistics and toxicology that needs to be addressed: optimal design approaches specifically tailored to toxicology must be developed and reported appropriately. This cumulative thesis is based on three works that all present optimal design approaches for diverse toxicological applications. The first manuscript highlights the importance of considering optimal design approaches for classical cytotoxicity experiments in a user-friendly manner. Here, different optimal design approaches are compared with designs typically used in practice, based on an extensive case study. Moreover, a guideline for cytotoxicity experiments and an R-Shiny software tool are presented, both of which facilitate the planning of upcoming cytotoxicity experiments. In the second manuscript, a new design approach for the precise estimation of effective dose sets in drug combination studies is developed; the performance of the proposed criterion is investigated in a simulation study covering various scenarios, including a case study. Finally, a new design approach for the analysis of high-dimensional gene-expression data is developed in the third manuscript. While two of the manuscripts are already published, the second one is attached in its current, unpublished form.
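To illustrate what optimal design theory can contribute here, the sketch below (assumptions throughout, not taken from the manuscripts) compares a typical equidistant dose grid with a hand-picked candidate design via the D-criterion, the log-determinant of the Fisher information, for a log-logistic concentration-response curve with three free parameters (slope, lower asymptote, ED50; upper asymptote fixed at 1), a model family commonly fitted to cytotoxicity data.

```python
# Hedged illustration (not from the manuscripts): comparing two designs for
# a dose-response experiment via the D-criterion under an assumed
# log-logistic model and assumed parameter values.
import numpy as np

def response(d, theta):
    # Log-logistic concentration-response curve; upper asymptote fixed at 1.
    b, c, e = theta  # slope, lower asymptote, ED50
    return c + (1 - c) / (1 + (d / e) ** b)

def d_criterion(doses, theta, h=1e-6):
    # log det of the Fisher information of an equal-weight design;
    # gradients w.r.t. theta obtained by central finite differences.
    grads = []
    for i in range(len(theta)):
        tp = np.array(theta, dtype=float); tp[i] += h
        tm = np.array(theta, dtype=float); tm[i] -= h
        grads.append((response(doses, tp) - response(doses, tm)) / (2 * h))
    G = np.stack(grads, axis=1)      # one gradient row per design point
    M = G.T @ G / len(doses)         # information matrix (unit error variance)
    return np.linalg.slogdet(M)[1]

theta = (2.0, 0.1, 1.0)              # assumed slope, lower limit, ED50
typical = np.linspace(0.1, 10.0, 8)  # equidistant grid, as often used in the lab
candidate = np.array([0.1, 0.1, 0.7, 0.7, 1.4, 1.4, 10.0, 10.0])  # few replicated doses
for name, doses in [("typical equidistant", typical), ("candidate", candidate)]:
    print(f"{name}: D-criterion = {d_criterion(doses, theta):.3f}")
```

A locally optimal design procedure would go one step further and maximize such a criterion over dose levels and replicate allocations instead of comparing two fixed candidates.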
Item
"Integrative statistical methods for analyzing biomedical data: applications in health and disease”
(2025) Tug, Timur; Ickstadt, Katja; Rahnenführer, Jörg; Hüls, Anke
In a series of four complementary studies, we apply innovative integrative statistical methods to diverse biomedical datasets to address both fundamental research questions and practical challenges in health and disease. Two of these investigations focus on the in vivo alkaline comet assay - a pivotal tool for assessing DNA damage as a marker of genotoxicity. In the first comet assay study (Article 1), we examine the impact of different centrality measures on the evaluation of tail intensity data. Using both original experimental data and simulation frameworks, we demonstrate that even subtle variations in summarizing techniques - whether using medians, arithmetic means, or geometric means - can lead to markedly different statistical conclusions and dose–response interpretations. These findings emphasize the critical need for careful methodological selection in genotoxicity assessments. In a subsequent comet assay work (Article 2), we compile and analyze extensive historical control data from multiple laboratories. This investigation addresses key statistical issues, including inter-laboratory variability and the handling of zero-valued measurements, and examines whether the findings of the first paper also hold with respect to the centrality measures and their regulatory interpretation. In the third study (Article 3), we introduce a novel multi-omics approach to better understand Alzheimer’s disease (AD). By integrating genome-wide DNA methylation profiles with high-resolution metabolomics data from prefrontal cortex tissue samples, we develop innovative single-, joint- and multi-omics profile scores using machine learning and advanced regression techniques. Based on these profile scores, this integrative analysis significantly improves the prediction of AD neuropathology. It also uncovers pivotal biological pathways, such as lipid metabolism and signal transduction, that are potentially involved in driving disease progression. These findings underscore the potential of combining multiple omics layers to elucidate complex molecular interactions underlying neurodegenerative disorders. Complementing these human-focused studies, our fourth investigation (Article 4) applies hierarchical modeling to veterinary epidemiology, specifically targeting respiratory diseases in piglet production. We compare frequentist and Bayesian hierarchical regression models to assess the influence of various environmental and management factors - including floor condition, water flow rates, stocking density, and indoor climate conditions - on respiratory health outcomes in pigs. By accounting for the multi-level structure inherent in farm data (spanning individual animals, pens, compartments, and farms), we demonstrate that Bayesian approaches with informative priors can effectively overcome challenges posed by small sample sizes and high inter-cluster variability. This ultimately provides more robust estimates and practical insights for disease management in livestock production. Collectively, the four works of my cumulative thesis illustrate how tailored, integrative statistical methodologies can enhance our understanding of complex biological systems. These methodologies improve decision-making across a spectrum of applications, ranging from the regulatory evaluation of chemical safety and the elucidation of neurodegenerative disease mechanisms to the optimization of animal health in agricultural settings.
The work emphasizes that the choice of statistical methods is not merely a technical detail but a pivotal factor that can substantially alter study outcomes and subsequent interpretations in both clinical and applied research environments. While the first two manuscripts are published, the third and fourth have been submitted and are attached in their current versions.
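As a small numerical companion to the methodological point of Articles 1 and 2, the sketch below (purely synthetic data) summarizes simulated tail intensities per slide using three centrality measures and shows how the resulting test statistics can diverge; the offset inside the geometric mean hints at the zero-value issue raised in Article 2.

```python
# Minimal sketch (synthetic data, not the studies' data): how the choice of
# centrality measure for skewed comet-assay tail intensities can shift the
# group comparison.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated % tail intensities: 150 cells on each of 20 slides per group.
control = rng.lognormal(mean=0.5, sigma=1.0, size=(20, 150))
treated = rng.lognormal(mean=0.8, sigma=1.3, size=(20, 150))

summaries = {
    "median": lambda cells: np.median(cells, axis=1),
    "arithmetic mean": lambda cells: cells.mean(axis=1),
    # The +1 offset avoids log(0); zero values occur in real tail-intensity data.
    "geometric mean": lambda cells: np.exp(np.log(cells + 1).mean(axis=1)) - 1,
}
for name, summarize in summaries.items():
    # Compare per-slide summaries between groups with a two-sample t-test.
    t, p = stats.ttest_ind(summarize(treated), summarize(control))
    print(f"{name}: t = {t:.2f}, p = {p:.4f}")
```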