Statistical Methods for Big Data
Recent Submissions
Extending the distributional regression framework (2025)
Briseño Sanchez, Guillermo; Groll, Andreas; Klein, Nadja
This thesis develops distributional regression methods tailored to the estimation of treatment effects and to the joint modelling of multivariate non-commensurable responses, all based on the Generalised Additive Models for Location, Scale and Shape (GAMLSS) approach. In addition, it proposes methods for data-driven variable selection for this model class. These developments are introduced across four contributed articles and are implemented in the statistical programming software R. In the first article, we derive treatment effects on the entire conditional response distribution via an instrumental variable estimation approach based on GAMLSS. Our approach allows modelling all parameters of possibly complex outcome distributions as well as non-linear relationships between explanatory variables, instrument and outcome of interest. This demonstrates the potential of using distributional regression in instrumental variable regression both to account for endogeneity and to estimate treatment effects beyond the mean. The second article introduces flexible copula-based statistical models for bivariate responses composed of non-commensurate (i.e. mixed) variables, whose components are a right-censored time-to-event response and a non-time-to-event outcome. The copula approach allows the dependence structure between the margins to be specified separately from their individual distribution functions. The model for the time-to-event margin is constructed via discrete-time-to-event or piecewise-exponential methods, exploiting the correspondence between the likelihoods of these approaches and those of well-known univariate distributions. The last two articles tackle the issue of data-driven variable selection for copula-based distributional regression models.
In the third article, we devise a gradient boosting estimation algorithm adapted to accommodate copula models with arbitrary marginal distributions, suited for bivariate binary, count and non-commensurable mixed outcomes. The last article further extends these methods to bivariate right-censored time-to-event responses. This dramatically streamlines the model-building process for a wide range of response structures. The versatility of the proposed methods is demonstrated through the analysis of various synthetic and real data structures from labour economics, transportation, genetic epidemiology, healthcare utilisation, childhood undernutrition and ovarian cancer.

Paola Zuccolotto and Marica Manisera (2020): Basketball Data Science: With Applications in R, CRC Press, 243 pp., £80.50 (Hardcover), ISBN: 978-1-138-60079-9 (2022-04-10)
Groll, Andreas; Jentsch, Carsten

Introducing LASSO-type penalisation to generalised joint regression modelling for count data (2021-11-12)
van der Wurp, Hendrik; Groll, Andreas
In this work, we propose an extension of the versatile joint regression framework for bivariate count responses of the R package GJRM by Marra and Radice (R package version 0.2-3, 2020) by incorporating an (adaptive) LASSO-type penalty. The underlying estimation algorithm is based on a quadratic approximation of the penalty. The method enables variable selection, and the corresponding estimates guarantee shrinkage and sparsity. Hence, this approach is particularly useful in high-dimensional count response settings.
The proposal’s empirical performance is investigated in a simulation study and an application to FIFA World Cup football data.

Flexible instrumental variable distributional regression (2020-08-16)
Briseño Sanchez, Guillermo; Hohberg, Maike; Groll, Andreas; Kneib, Thomas
We tackle two limitations of standard instrumental variable regression in experimental and observational studies: estimation restricted to the conditional mean of the outcome, and the assumption of a linear relationship between regressors and outcome. More flexible regression approaches that overcome these limitations have already been developed but have not yet been adopted in causality analysis. The paper develops an instrumental variable estimation procedure building on the framework of generalized additive models for location, scale and shape. This enables modelling all distributional parameters of potentially complex response distributions and non-linear relationships between the explanatory variables, instrument and outcome. The approach shows good performance in simulations and is applied to a study that estimates the effect of rural electrification on the employment of females and males in the South African province of KwaZulu-Natal. We find positive marginal effects on the mean of female employment rates, negative effects on male employment rates, and a reduced conditional standard deviation for both, indicating homogenization in employment rates due to the electrification programme. Although none of the effects are statistically significant, the application demonstrates the potential of using generalized additive models for location, scale and shape in instrumental variable regression both to account for endogeneity and to estimate treatment effects beyond the mean.
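
As a rough illustration of the endogeneity problem that instrumental variable regression addresses, the sketch below simulates a confounded treatment and compares a naive least-squares fit with a two-stage control-function fit. It is a mean-only, linear analogue written in Python for illustration and is not the procedure of the article above: the abstract does not spell out the estimation steps, so the control-function construction, the simulated data, and all names (`simulate`, `naive_estimate`, `control_function_estimate`) are assumptions made here. A distributional version would replace the second-stage least-squares fit with a model for every parameter of the response distribution.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def simulate(n=20000, beta=1.5, seed=0):
    """Hypothetical data: z is the instrument, u an unobserved confounder."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)
    d = 0.8 * z + u + rng.normal(size=n)         # treatment depends on z and u
    y = beta * d + 2.0 * u + rng.normal(size=n)  # outcome depends on d and u
    return z, d, y


def naive_estimate(d, y):
    """Regress y on d directly; biased because u drives both d and y."""
    X = np.column_stack([np.ones(len(y)), d])
    return ols(X, y)[1]


def control_function_estimate(z, d, y):
    """Stage 1: regress d on z and keep the residual.
    Stage 2: regress y on d and that residual, which absorbs the confounding."""
    ones = np.ones(len(y))
    Z = np.column_stack([ones, z])
    resid = d - Z @ ols(Z, d)
    X2 = np.column_stack([ones, d, resid])
    return ols(X2, y)[1]


z, d, y = simulate()
print("naive:", naive_estimate(d, y))
print("control function:", control_function_estimate(z, d, y))
```

In this simulated setting the true effect is 1.5; the naive fit overstates it because the confounder raises both treatment and outcome, while the control-function fit should land close to the true value.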