Variable selection methods for detecting interactions in large scale data

dc.contributor.advisor: Ickstadt, Katja
dc.contributor.author: Teschke, Sven
dc.contributor.referee: Schikowski, Tamara
dc.contributor.referee: Staerk, Christian
dc.date.accepted: 2025-05-21
dc.date.accessioned: 2025-07-03T06:33:05Z
dc.date.available: 2025-07-03T06:33:05Z
dc.date.issued: 2025
dc.description.abstract: Large-scale data sets comprising millions of variables p, as is typical in genetics, offer a wealth of information. However, extracting this information from the data is a considerable challenge. From a biological perspective, the hope is that it will lead to a better understanding of how diseases develop. Moreover, it is imperative to consider the interactions of genetic factors with each other and with the environment. Taking interactions into account further exacerbates the problem of high dimensionality. Beyond the computational challenge of processing the data at all, most statistical models are inapplicable or difficult to interpret in these scenarios. To address this research gap, this thesis develops a variable selection method that accounts for a multivariate structure and can be applied to arbitrarily large amounts of data. Variables are selected using cross-leverage scores (CLS). By construction, the CLS correspond to the variables' individual leverage on the correlation with the multidimensional subspace spanned by the data with the outcome variable. Thus, they are directly linked to the importance of a variable, including in the sense of an interaction effect. Further, under mild assumptions, each CLS equals its corresponding parameter in the least squares solution up to a small bounded additive error. In addition, methods for approximating the CLS in large data have been developed and improved in this thesis. A notable advantage of these methods is that they can be computed streamwise, which overcomes the problem of processing such data on standard computers. Overall, a two-step procedure is recommended. In the first step, variables are selected using CLS. In the second step, an established method that is appropriate for the research question, but limited in the number of input variables, is applied to the reduced data. The primary article of this dissertation introduces the methodology of these approaches and validates them both by simulation and mathematically. In two additional articles, the method is applied to two large-scale data sets to answer biological questions. In the first, it is used within a two-step approach to identify SNP-environment interactions in COPD, where the recently developed logicDT model is applied to the reduced data in the second step. In the other, the CLS are incorporated directly into the calculation of so-called profile scores to estimate the risk of Alzheimer’s disease based on DNA methylation and metabolomics data. [en]
dc.identifier.uri: http://hdl.handle.net/2003/43787
dc.identifier.uri: http://dx.doi.org/10.17877/DE290R-25561
dc.language.iso: en
dc.subject: Variable selection [en]
dc.subject: Cross leverage scores [en]
dc.subject: Large scale data [en]
dc.subject: Gene-Environment interactions [en]
dc.subject.ddc: 310
dc.subject.rswk: Dimensionsreduktion <Data Science> [de]
dc.title: Variable selection methods for detecting interactions in large scale data [en]
dc.type: Text
dc.type.publicationtype: PhDThesis
dcterms.accessRights: open access
eldorado.secondarypublication: false
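
The two-step procedure described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the construction below is an assumption, namely that the p variables and the outcome are stacked as rows of one matrix and that a variable's CLS is the off-diagonal entry of the resulting projection ("hat") matrix that pairs it with the outcome. The function names, the simulated data, and the use of ordinary least squares as the second-step method are all hypothetical choices made for illustration. The exact computation shown here is in-memory and does not cover the streamwise approximations mentioned in the abstract.

```python
import numpy as np


def cross_leverage_scores(X, y):
    """Cross-leverage score of each column of X with the outcome y.

    Assumed construction (not taken from the thesis): stack the p variables
    and y as rows of A (shape (p + 1, n)), form the projection matrix
    H = A (A^T A)^{-1} A^T, and read off the entries H[j, p].
    Requires p + 1 >= n and A of full column rank n.
    """
    A = np.vstack([X.T, y[None, :]])   # rows: p variables, then the outcome
    Q, _ = np.linalg.qr(A)             # reduced QR, Q has shape (p + 1, n)
    return Q[:-1] @ Q[-1]              # H[j, p] = <Q[j], Q[p]> for j < p


def two_step_selection(X, y, k):
    """Step 1: keep the k variables with the largest |CLS|.
    Step 2: fit ordinary least squares on the reduced data
    (a stand-in for any established method with limited input size)."""
    cls = cross_leverage_scores(X, y)
    selected = np.sort(np.argsort(-np.abs(cls))[:k])
    design = np.column_stack([np.ones(len(y)), X[:, selected]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return selected, coef, cls


rng = np.random.default_rng(0)
n, p, k = 50, 200, 10
X = rng.standard_normal((n, p))
# Hypothetical outcome driven by an interaction between variables 3 and 7
y = X[:, 3] * X[:, 7] + 0.1 * rng.standard_normal(n)

selected, coef, cls = two_step_selection(X, y, k)
```

Here `selected` holds the indices of the k retained variables and `coef` the intercept and coefficients of the second-step fit. Whether the interacting variables end up among the selected ones depends on the data and on k, so the sketch makes no claim about detection performance.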

Files

Original bundle
Name: Dissertation_Teschke.pdf
Size: 8.74 MB
Format: Adobe Portable Document Format
Description: DNB