Variable selection methods for detecting interactions in large scale data

Teschke, Sven

dc.contributor.advisor	Ickstadt, Katja
dc.contributor.author	Teschke, Sven
dc.contributor.referee	Schikowski, Tamara
dc.contributor.referee	Staerk, Christian
dc.date.accepted	2025-05-21
dc.date.accessioned	2025-07-03T06:33:05Z
dc.date.available	2025-07-03T06:33:05Z
dc.date.issued	2025
dc.description.abstract	Large-scale data sets comprising millions of variables p, as is typical in the field of genetics, offer a wealth of information. However, extracting this information from the data is a considerable challenge. From a biological perspective, the aim is a better understanding of how diseases develop. Moreover, it is imperative to consider the interactions of genetic factors with each other and with the environment. Taking interactions into account further exacerbates the problem of high dimensionality. In addition to the computational challenges of processing the data at all, most statistical models are inapplicable or difficult to interpret in these scenarios. To address this research gap, this thesis develops a variable selection method that accounts for a multivariate structure and can be applied to arbitrarily large amounts of data. Variables are selected using cross-leverage scores (CLS). By construction, the CLS correspond to each variable's individual leverage on the correlation between the multidimensional subspace spanned by the data and the outcome variable. Thus, they are directly linked to the importance of a variable, including in the sense of an interaction effect. Further, under mild assumptions, each CLS equals its corresponding parameter in the least squares solution up to a small bounded additive error. In addition, this thesis develops and improves methods for approximating the CLS in large data. A notable advantage of these methods is that they can be computed streamwise, which makes processing feasible on standard computers. Overall, a two-step procedure is recommended. In the first step, variables are selected using CLS. 
In the second step, an established method that is appropriate for the research question, but limited in the number of input variables, is applied to the reduced data. The primary article of this dissertation introduces the methodology of these approaches and validates them both by simulation and mathematically. In two additional articles, the method is applied to two large-scale data sets to answer biological questions. The first uses a two-step approach to identify SNP-environment interactions in COPD, in which the recently developed logicDT model is applied to the reduced data in the second step. The second incorporates the CLS directly into the calculation of so-called profile scores to estimate the risk of Alzheimer’s disease based on DNA methylation and metabolomics data.	en
dc.identifier.uri	http://hdl.handle.net/2003/43787
dc.identifier.uri	http://dx.doi.org/10.17877/DE290R-25561
dc.language.iso	en
dc.subject	Variable selection	en
dc.subject	Cross leverage scores	en
dc.subject	Large scale data	en
dc.subject	Gene-Environment interactions	en
dc.subject.ddc	310
dc.subject.rswk	Dimensionsreduktion <Data Science>	de
dc.title	Variable selection methods for detecting interactions in large scale data	en
dc.type	Text
dc.type.publicationtype	PhDThesis
dcterms.accessRights	open access
eldorado.dnb.deposit	true
eldorado.secondarypublication	false

Files

Original bundle

Name: Dissertation_Teschke.pdf
Size: 8.74 MB
Format: Adobe Portable Document Format
Description: DNB

License bundle

Name: license.txt
Size: 4.82 KB
Format: Item-specific license agreed upon to submission

Collections

Lehrstuhl Mathematische Statistik und biometrische Anwendungen