Non-linear modeling and structured variable selection in environmental and biomedical data

Date

2025

Abstract

This thesis addresses three key challenges in the analysis of biological and environmental datasets. First, potential influencing factors such as environmental or clinical variables are often modeled as having linear effects on health outcomes. In practice, however, these effects can be non-linear and may involve complex interactions between variables. Despite their relevance, non-linear modeling techniques remain underutilized in the environmental health domain. Second, variable selection is complicated because relevant predictors are often naturally grouped, for example into molecular data, environmental variables, clinical information, family history, and genetic or pathological markers. Traditional variable selection methods often fail to account for this grouping structure, which leads to the over-representation of certain groups and the neglect of others, especially low-dimensional but clinically meaningful variables. Third, this limitation is particularly problematic in time-to-event prediction, where clinical and pathological features can substantially affect patient survival outcomes.

This cumulative thesis comprises three studies with five contributed articles that aim to overcome these methodological challenges. The first study investigates the joint effects of ambient temperature and air pollution on systolic and diastolic blood pressure in elderly German women. Using generalized additive models (GAMs), the study captures non-linear exposure-response relationships and complex interactions. The second study makes a methodological contribution by introducing a novel variant of the Exclusive Lasso for high-dimensional data with grouped variables, such as multi-omics datasets. Smooth approximations are applied to address the non-differentiability of the group-wise L1-norm, enabling efficient optimization using Newton-based methods. Unlike the conventional Exclusive Lasso, the proposed method does not force selection from every group, allowing for greater sparsity and improved performance. The third study extends this regularization technique to time-to-event data by incorporating the Exclusive Lasso into the Cox proportional hazards model. This allows the integration of multiple heterogeneous data types, including gene expression and clinical variables, while preserving the grouping structure. The method is applied to a real-world cancer dataset, where it improves survival prediction and ensures that low-dimensional but important clinical variables are retained in the model.
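
As a rough illustration of the penalty discussed in the abstract, the following sketch computes the conventional Exclusive Lasso penalty (the sum over groups of the squared group-wise L1-norm) together with a smoothed counterpart. The square-root smoothing sqrt(b^2 + eps), the value of eps, and the function names are assumptions chosen for illustration only. The abstract does not state which smooth approximation the thesis uses, and the sketch does not implement the proposed variant.

```python
import numpy as np


def exclusive_lasso_penalty(beta, groups):
    """Conventional Exclusive Lasso penalty: sum over groups of (L1 norm)^2."""
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    return sum(np.abs(beta[groups == g]).sum() ** 2 for g in np.unique(groups))


def smooth_exclusive_lasso_penalty(beta, groups, eps=1e-6):
    """Same penalty with |b| replaced by sqrt(b^2 + eps).

    Assumption: this particular smoothing is illustrative, not the thesis's.
    It is differentiable at b = 0, unlike the absolute value.
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    smooth_abs = np.sqrt(beta ** 2 + eps)
    return sum(smooth_abs[groups == g].sum() ** 2 for g in np.unique(groups))


# Hypothetical toy example: four coefficients in two groups.
beta = np.array([1.0, -2.0, 0.0, 3.0])
groups = np.array([0, 0, 1, 1])

exact = exclusive_lasso_penalty(beta, groups)          # (1 + 2)^2 + (0 + 3)^2
smooth = smooth_exclusive_lasso_penalty(beta, groups)  # slightly larger than exact
```

Because the smoothed penalty is differentiable at zero, its gradient and Hessian exist everywhere, which is the property that makes Newton-type optimizers applicable.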

Keywords

Generalized additive models, Variable selection, High-dimensional data

Subjects based on RSWK

Dimensionality reduction <Data Science>, Statistical analysis
