Statistical Methods for Big Data

Recent Submissions

  • Item
    Classic statistical and modern machine learning methods for modeling and prediction of major tennis tournaments
    (2025) Buhamra, Nourah; Groll, Andreas; Pauly, Markus
    The cumulative dissertation proposes a comprehensive approach to predicting outcomes in Grand Slam tennis tournaments, focusing on the probability that the first-named player will win. Our study incorporates several classical regression and machine learning models, evaluated using cross-validation and external validation through performance measures such as classification rate, predictive likelihood, and Brier score. Two specific aspects are examined in greater detail: non-linear effects and the inclusion of additional player and court-specific abilities. Moreover, we analyze the predictive potential of statistically enhanced covariates and apply procedures from the field of interpretable machine learning to make complex models more understandable. Our analyses show that in predicting Grand Slam tennis matches, while there are slight differences across various statistical and machine learning approaches, the specific forecasting strategy used plays an even more critical role. Additionally, the results confirm that enhanced variables contribute positively to model performance and provide deeper insights into predictors of match outcomes in sports analytics.
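The three performance measures named in the abstract can be illustrated with a minimal Python sketch. The definitions used here are the common ones for binary outcomes and are an assumption, since the abstract does not spell them out; the function names and the toy probabilities are hypothetical and not taken from the dissertation.

```python
from statistics import mean

# probs[i]    : predicted probability that the first-named player wins match i
# outcomes[i] : 1 if the first-named player actually won, 0 otherwise
# (toy values for illustration only)
probs = [0.8, 0.6, 0.3, 0.9]
outcomes = [1, 0, 0, 1]


def classification_rate(probs, outcomes):
    """Share of matches whose more likely outcome (threshold 0.5) occurred."""
    return mean([1.0 if (p > 0.5) == (y == 1) else 0.0
                 for p, y in zip(probs, outcomes)])


def predictive_likelihood(probs, outcomes):
    """Average probability the model assigned to the outcome that occurred."""
    return mean([p if y == 1 else 1.0 - p for p, y in zip(probs, outcomes)])


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return mean([(p - y) ** 2 for p, y in zip(probs, outcomes)])


if __name__ == "__main__":
    print("classification rate  :", classification_rate(probs, outcomes))
    print("predictive likelihood:", predictive_likelihood(probs, outcomes))
    print("Brier score          :", brier_score(probs, outcomes))
```

Higher values are better for the classification rate and the predictive likelihood, whereas a lower Brier score indicates better calibrated probabilities.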
  • Item
    Non-linear modeling and structured variable selection in environmental and biomedical data
    (2025) Ravi, Dayasri; Groll, Andreas; Staerk, Christian; Schikowski, Tamara
    This thesis addresses three key challenges in the analysis of biological and environmental datasets. First, potential influencing factors such as environmental or clinical variables are often modeled as having linear effects on health outcomes. In practice, however, these effects can be non-linear and may involve complex interactions between variables. Despite their relevance, non-linear modeling techniques remain underutilized in the environmental health domain. Second, variable selection is complicated because relevant predictors often fall into natural groups, such as molecular data, environmental variables, clinical information, family history, and genetic or pathological markers. Traditional variable selection methods often fail to account for this grouping structure, leading to the over-representation of certain groups and the neglect of others, especially low-dimensional but clinically meaningful variables. Third, this limitation is particularly problematic in time-to-event prediction, where clinical and pathological features can substantially impact patient survival outcomes. This cumulative thesis comprises three studies with five contributed articles that aim to overcome these methodological challenges. The first study investigates the joint effects of ambient temperature and air pollution on systolic and diastolic blood pressure in elderly German women. Using generalized additive models (GAMs), the study captures non-linear exposure-response relationships and complex interactions. The second study introduces a novel variant of the Exclusive Lasso for high-dimensional data with grouped variables, such as multi-omics datasets. Smooth approximations are applied to address the non-differentiability of the group-wise L1-norm, enabling efficient optimization using Newton-based methods. Unlike the conventional Exclusive Lasso, the proposed method does not force selection from every group, allowing for greater sparsity and improved performance. The third study extends this regularization technique to time-to-event data by incorporating the Exclusive Lasso into the Cox proportional hazards model. This allows the integration of multiple heterogeneous data types, including gene expression and clinical variables, while preserving the grouping structure. The method is applied to a real-world cancer dataset, showing improved survival prediction and ensuring that low-dimensional but important clinical variables are retained in the model.
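To make the idea of smoothing the group-wise L1-norm concrete, the following Python sketch evaluates the conventional Exclusive Lasso penalty (the sum over groups of the squared L1-norm of each group's coefficients) with a smooth surrogate for the absolute value. The surrogate sqrt(x^2 + c), the constant c, and all function names are assumptions made for illustration; the abstract does not state which approximation the thesis uses, and the sketch does not implement the thesis's novel variant.

```python
import math


def smooth_abs(x, c=1e-8):
    """Differentiable surrogate for |x| (assumed form: sqrt(x^2 + c))."""
    return math.sqrt(x * x + c)


def exclusive_lasso_penalty(beta, groups, c=1e-8):
    """Sum over groups of the squared (smoothed) L1-norm of that group.

    beta   : list of coefficients
    groups : list of index lists, one per group (hypothetical layout)
    """
    return sum(sum(smooth_abs(beta[j], c) for j in g) ** 2 for g in groups)


def exclusive_lasso_gradient(beta, groups, c=1e-8):
    """Gradient of the smoothed penalty; it exists everywhere, including
    at zero, which is what makes Newton-type optimisation applicable."""
    grad = [0.0] * len(beta)
    for g in groups:
        group_norm = sum(smooth_abs(beta[j], c) for j in g)
        for j in g:
            grad[j] = 2.0 * group_norm * beta[j] / smooth_abs(beta[j], c)
    return grad


if __name__ == "__main__":
    beta = [1.0, -2.0, 0.0, 3.0]
    groups = [[0, 1], [2, 3]]
    print(exclusive_lasso_penalty(beta, groups))
    print(exclusive_lasso_gradient(beta, groups))
```

With a small c, the smoothed penalty stays close to the exact one while its gradient is well defined at zero coefficients.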
  • Item
    Extending the distributional regression framework
    (2025) Briseño Sanchez, Guillermo; Groll, Andreas; Klein, Nadja
    This thesis develops distributional regression methods tailored to the estimation of treatment effects as well as joint modelling of multivariate non-commensurable responses, all based on the Generalised Additive Models for Location, Scale and Shape (GAMLSS) approach. In addition, it proposes methods for data-driven variable selection for the aforementioned model class. These developments are introduced across four contributed articles and are implemented in the statistical programming software R. In the first article, we derive treatment effects on the entire conditional response distribution via an instrumental variable estimation approach based on GAMLSS. Our approach allows modelling all parameters of possibly complex outcome distributions as well as non-linear relationships between explanatory variables, instrument and outcome of interest. This demonstrates the potential of using distributional regression in instrumental variable regression both to account for endogeneity and to estimate treatment effects beyond the mean. The second article introduces flexible copula-based statistical models for bivariate responses composed of non-commensurable (i.e. mixed) variables, whose components are a right-censored time-to-event response and a non-time-to-event outcome. The copula approach allows for separate specification of the dependence structure between the margins and their individual distribution functions. The model for the time-to-event margin is constructed via discrete time-to-event or piecewise-exponential methods, exploiting the correspondence between the likelihoods of these approaches and those of well-known univariate distributions. The last two articles tackle the issue of data-driven variable selection for copula-based distributional regression models. In the third article, we devise a gradient boosting estimation algorithm adapted to accommodate copula models with arbitrary marginal distributions suited for bivariate binary, count and non-commensurable mixed outcomes. The last article further extends these methods to bivariate right-censored time-to-event responses. This dramatically streamlines the model-building process for a wide range of response structures. The versatility of the proposed methods is demonstrated through the analysis of various synthetic and real data structures from labour economics, transportation, genetic epidemiology, healthcare utilisation, childhood undernutrition and ovarian cancer.
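The separation of dependence structure and margins mentioned in the abstract is the content of Sklar's theorem. In generic notation (the symbols are assumptions and not taken from the thesis), a bivariate distribution function H with marginal distribution functions F_1 and F_2 can be written as:

```latex
H(y_1, y_2) = C\bigl(F_1(y_1),\, F_2(y_2);\, \theta\bigr)
```

Here C is a copula with dependence parameter \theta, so each margin and the dependence between them can be specified, and related to covariates, separately.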
  • Item
    Introducing LASSO-type penalisation to generalised joint regression modelling for count data
    (2021-11-12) van der Wurp, Hendrik; Groll, Andreas
    In this work, we propose an extension of the versatile joint regression framework for bivariate count responses of the R package GJRM by Marra and Radice (R package version 0.2-3, 2020) by incorporating an (adaptive) LASSO-type penalty. The underlying estimation algorithm is based on a quadratic approximation of the penalty. The method enables variable selection and the corresponding estimates guarantee shrinkage and sparsity. Hence, this approach is particularly useful in high-dimensional count response settings. The proposal’s empirical performance is investigated in a simulation study and an application to FIFA World Cup football data.
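The idea of a quadratic approximation of the LASSO penalty can be sketched on a deliberately simplified problem. The Python example below is not the GJRM algorithm and does not involve count responses: it minimises the scalar objective 0.5 * (z - b)^2 + lam * |b| by repeatedly replacing |b| with a quadratic surrogate around the current iterate, which turns each step into a ridge-type update. The surrogate sqrt(b^2 + c), the constant c, and the function names are assumptions made for illustration.

```python
import math


def lqa_lasso_scalar(z, lam, c=1e-8, n_iter=200):
    """Iterated quadratic approximation for min_b 0.5*(z - b)^2 + lam*|b|.

    At the current value b_old, lam*|b| is approximated (up to a constant)
    by 0.5 * w * b^2 with w = lam / sqrt(b_old^2 + c), so each step has
    the closed-form ridge-type solution b = z / (1 + w).
    """
    b = z  # start from the unpenalised solution
    for _ in range(n_iter):
        weight = lam / math.sqrt(b * b + c)
        b = z / (1.0 + weight)
    return b


def soft_threshold(z, lam):
    """Exact LASSO solution of the same scalar problem, for comparison."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


if __name__ == "__main__":
    for z in (3.0, -3.0, 0.5):
        print(z, lqa_lasso_scalar(z, 1.0), soft_threshold(z, 1.0))
```

In this toy setting, the iterates approach the soft-thresholding solution: large inputs are shrunk by lam, and small inputs are driven to values that are numerically close to zero, which mirrors the shrinkage and sparsity described in the abstract.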
  • Item
    Flexible instrumental variable distributional regression
    (2020-08-16) Briseño Sanchez, Guillermo; Hohberg, Maike; Groll, Andreas; Kneib, Thomas
    We tackle two limitations of standard instrumental variable regression in experimental and observational studies: the restriction of estimation to the conditional mean of the outcome and the assumption of a linear relationship between regressors and outcome. More flexible regression approaches that solve these limitations have already been developed but have not yet been adopted in causality analysis. The paper develops an instrumental variable estimation procedure building on the framework of generalized additive models for location, scale and shape. This enables modelling all distributional parameters of potentially complex response distributions and non-linear relationships between the explanatory variables, instrument and outcome. The approach shows good performance in simulations and is applied to a study that estimates the effect of rural electrification on the employment of females and males in the South African province of KwaZulu-Natal. We find positive marginal effects on the mean for the employment rates of females, negative effects for the employment of males and a reduced conditional standard deviation for both, indicating homogenization in employment rates due to the electrification programme. Although none of the effects are statistically significant, the application demonstrates the potential of using generalized additive models for location, scale and shape in instrumental variable regression both to account for endogeneity and to estimate treatment effects beyond the mean.