Variable selection methods for detecting interactions in large scale data
Date
2025
Abstract
Large-scale data sets comprising millions of variables, as is typical in genetics, offer a wealth of information. Extracting this information from the data, however, is a considerable challenge. From a biological perspective, the aim is a better understanding of how diseases develop. To this end, it is imperative to consider the interactions of genetic factors with each other and with the environment. Taking interactions into account further exacerbates the problem of high dimensionality. Beyond the computational challenge of processing the data at all, most statistical models are inapplicable or difficult to interpret in these scenarios. To address this research gap, this thesis develops a variable selection method that accounts for a multivariate structure and can be applied to arbitrarily large amounts of data. Variables are selected using cross-leverage scores (CLS). By construction, each CLS corresponds to a variable's individual leverage on the correlation between the outcome variable and the multidimensional subspace spanned by the data. The CLS are thus directly linked to the importance of a variable, including its importance as part of an interaction effect. Further, under mild assumptions, each CLS equals its corresponding parameter in the least squares solution up to a small bounded additive error. In addition, this thesis develops and improves methods for approximating the CLS in large data. A notable advantage of these methods is that they can be computed streamwise, which makes it possible to process the data on standard computers. Overall, a two-step procedure is recommended. In the first step, variables are selected using CLS. In the second step, an established method that is appropriate for the research question but limited in the number of input variables is applied to the reduced data.
The primary article of this dissertation introduces the methodology of these approaches and validates them both by simulations and mathematically. In two additional articles, the method is applied to two large-scale datasets in order to answer biological questions. In the first, CLS are used as the first step of a two-step approach to identify SNP-environment interactions in COPD; in the second step, the recently developed logicDT model is applied to the reduced data. In the other paper, the CLS are directly incorporated into the calculation of so-called profile scores to estimate the risk of Alzheimer’s disease based on DNA methylation and metabolomics data.
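The first step of the two-step procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not state the formula for the CLS, so the definition used here is an assumption: the score of variable j is taken to be the off-diagonal hat-matrix entry between column j of X and the outcome y, computed for the augmented matrix A = [X, y]. The function names `cross_leverage_scores` and `select_top_k`, the `chunk` parameter, and the exact two-pass scheme are hypothetical and chosen for illustration only; the thesis itself concerns approximation methods, which are not reproduced here.

```python
import numpy as np


def cross_leverage_scores(X, y, chunk=1000):
    """Cross-leverage scores between each column of X and y.

    ASSUMED definition (not taken from the abstract): with A = [X, y],
    score_j = x_j^T (A A^T)^+ y. Columns are processed in chunks so that
    only an n x n matrix and one chunk are held in memory at a time.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Pass 1: accumulate the n x n matrix A A^T = X X^T + y y^T chunk by chunk.
    G = np.outer(y, y)
    for start in range(0, p, chunk):
        block = X[:, start:start + chunk]
        G += block @ block.T

    # Solve G w = y once (least squares, so a rank-deficient G is tolerated).
    w = np.linalg.lstsq(G, y, rcond=None)[0]

    # Pass 2: score_j = x_j^T w, again chunk by chunk.
    scores = np.empty(p)
    for start in range(0, p, chunk):
        block = X[:, start:start + chunk]
        scores[start:start + chunk] = block.T @ w
    return scores


def select_top_k(scores, k):
    """Indices of the k variables with the largest absolute score."""
    return np.argsort(-np.abs(scores))[:k]
```

Under this assumed definition, and when X X^T is invertible, the Sherman-Morrison identity implies that the scores are proportional to the coefficients of the minimum-norm least squares solution. This is in the spirit of the link to the least squares solution mentioned in the abstract, although the thesis's exact statement and assumptions may differ. The indices returned by `select_top_k` would then define the reduced data passed to the second-step method.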
Keywords
Variable selection, Cross leverage scores, Large scale data, Gene-Environment interactions
Subjects based on RSWK
Dimensionsreduktion <Data Science>