Variable selection methods for detecting interactions in large scale data

Teschke, Sven

Files

Dissertation_Teschke.pdf (8.74 MB)

Date

2025

Authors

Teschke, Sven

Abstract

Large-scale data sets comprising millions of variables p, as is typical in genetics, offer a wealth of information. Extracting this information from the data, however, is a considerable challenge. From a biological perspective, the hope is that it will lead to a better understanding of how diseases develop. It is also essential to consider the interactions of genetic factors with each other and with the environment, and taking interactions into account further exacerbates the problem of high dimensionality. Beyond the computational challenge of processing the data at all, most statistical models are inapplicable or difficult to interpret in these scenarios. To address this research gap, this thesis develops a variable selection method that accounts for multivariate structure and can be applied to arbitrarily large amounts of data. Variables are selected using cross-leverage scores (CLS). By construction, the CLS correspond to each variable's individual leverage on the correlation of the multidimensional subspace spanned by the data with the outcome variable. They are therefore directly linked to the importance of a variable, including in the sense of an interaction effect. Furthermore, under mild assumptions, each CLS equals its corresponding parameter in the least squares solution up to a small bounded additive error. The thesis also develops and improves methods for approximating the CLS in large data. A notable advantage of these methods is that they can be computed in a streaming fashion, which overcomes the problem of processing the data on standard computers. Overall, a two-step procedure is recommended. In the first step, variables are selected using the CLS. In the second step, an established method that suits the research question but is limited in the number of input variables is applied to the reduced data.
The primary article of this dissertation introduces the methodology of these approaches and validates them both through simulations and mathematically. In two additional articles, the method is applied to two large-scale data sets in order to answer biological questions. In one, it is used within a two-step approach to identify SNP-environment interactions in COPD, where the recently developed logicDT model is applied to the reduced data in the second step. In the other, the CLS are incorporated directly into the calculation of so-called profile scores to estimate the risk of Alzheimer’s disease based on DNA methylation and metabolomics data.
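
The two-step procedure described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the implementation from the thesis: cross-leverage scores are taken in their usual sense as off-diagonal entries of a hat (projection) matrix, the outcome is assumed to be appended to the data as one extra variable, the function names `cross_leverage_scores` and `select_top_k` are hypothetical, and ordinary least squares stands in for the "established method" of the second step. The scores are computed exactly rather than with the streaming approximations the thesis develops, and the toy data contain a single planted main effect rather than an interaction.

```python
import numpy as np


def cross_leverage_scores(X, y):
    """Cross-leverage of every column of X with y (assumed construction)."""
    # Assumption: stack the outcome as an extra "variable"; rows of A are variables.
    A = np.column_stack([X, y]).T                    # shape (p + 1, n)
    # Thin SVD: the hat matrix of A is H = U @ U.T.
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U = U[:, s > 1e-10 * s[0]]                       # drop numerically null directions
    # H[j, p] = <U[j], U[p]>: cross-leverage between variable j and the outcome.
    return U[:-1] @ U[-1]


def select_top_k(scores, k):
    """Sorted indices of the k variables with the largest absolute score."""
    return np.sort(np.argsort(-np.abs(scores))[:k])


# Toy data: n = 100 samples, p = 300 variables, one planted effect on column 3.
rng = np.random.default_rng(0)
n, p, k = 100, 300, 10
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 3] + 0.1 * rng.standard_normal(n)

# Step 1: reduce the data to the k highest-scoring variables.
scores = cross_leverage_scores(X, y)
selected = select_top_k(scores, k)

# Step 2: stand-in for the established method, here ordinary least squares
# on the reduced data (a placeholder, not the model used in the thesis).
coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
```

In this sketch the first step only needs the rows of `U` that belong to the variables and the outcome, which is why the full (p + 1) x (p + 1) hat matrix is never formed. For data that do not fit in memory, the exact SVD would have to be replaced by an approximation.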

Keywords

Variable selection, Cross leverage scores, Large scale data, Gene-Environment interactions

Keywords according to RSWK

Dimensionality reduction <Data Science>

URI

http://hdl.handle.net/2003/43787
http://dx.doi.org/10.17877/DE290R-25561

Collections

Lehrstuhl Mathematische Statistik und biometrische Anwendungen
