Bayesian and frequentist regression approaches for very large data sets

Geppert, Leo Nikolaus

dc.contributor.advisor	Ickstadt, Katja
dc.contributor.author	Geppert, Leo Nikolaus
dc.contributor.referee	Groll, Andreas
dc.date.accepted	2018-11-16
dc.date.accessioned	2019-03-19T08:27:15Z
dc.date.available	2019-03-19T08:27:15Z
dc.date.issued	2018
dc.description.abstract	This thesis is concerned with the analysis of frequentist and Bayesian regression models for data sets with a very large number of observations. Such large data sets pose a challenge when conducting regression analysis because of the memory required (mainly for frequentist regression models) and the running time of the analysis (mainly for Bayesian regression models). I present two different approaches that can be employed in this setting. The first approach is based on random projections and reduces the number of observations to a manageable level as a first step before the regression analysis. The reduced number of observations depends on the number of variables in the data set and the desired goodness of the approximation. It is, however, independent of the number of observations in the original data set, making it especially useful for very large data sets. Theoretical guarantees for Bayesian linear regression are presented, which extend known guarantees for the frequentist case. The fundamental theorem covers Bayesian linear regression with arbitrary normal distributions or non-informative uniform distributions as prior distributions. I evaluate how close the posterior distributions based on the original and the reduced data set are for this theoretically covered case as well as for extensions towards hierarchical models and models using q-generalised normal distributions as prior distributions. The second approach transfers the Merge & Reduce principle from data structures to regression models. In computer science, Merge & Reduce is employed to enable the use of static data structures in a streaming setting. Here, I present three possibilities of employing Merge & Reduce directly on regression models. This enables sequential or parallel analysis of subsets of the data set. The partial results are then combined in a way that recovers the regression model on the full data set well.
This approach is suitable for a wide range of regression models. I evaluate the performance on simulated and real-world data sets using linear and Poisson regression models. Both approaches are able to recover regression models on the original data set well. They thus offer scalable versions of frequentist or Bayesian regression analysis for linear regression as well as extensions to generalised linear models, hierarchical models, and q-generalised normal distributions as prior distributions. Application to data streams or in distributed settings is also possible. Both approaches can be combined with multiple algorithms for frequentist or Bayesian regression analysis.	en
dc.identifier.uri	http://hdl.handle.net/2003/37946
dc.identifier.uri	http://dx.doi.org/10.17877/DE290R-19931
dc.language.iso	en	de
dc.subject	Regression analysis	en
dc.subject	Very large data sets	en
dc.subject	Random projections	en
dc.subject	Merge & reduce	en
dc.subject	Data reduction	en
dc.subject.ddc	310
dc.subject.rswk	Regressionsanalyse	de
dc.subject.rswk	Massendaten	de
dc.subject.rswk	Datenkompression	de
dc.subject.rswk	Dimensionsreduktion	de
dc.title	Bayesian and frequentist regression approaches for very large data sets	en
dc.type	Text	en
dc.type.publicationtype	doctoralThesis	de
dcterms.accessRights	open access
eldorado.secondarypublication	false	de
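
Illustrative sketch (not taken from the thesis): the first approach in the abstract reduces the number of observations with a random projection before the regression is fitted. The Python example below shows that general idea for ordinary least squares only. The Rademacher (random sign) projection, the target size k = 200, the simulated data, and all function names are assumptions made for illustration; the record does not state which projections or parameters the thesis uses.

```python
import random


def solve_least_squares(X, y):
    """Ordinary least squares via the normal equations X'X b = X'y,
    solved by Gaussian elimination with partial pivoting."""
    d = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(d)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(d)
    ]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for c in range(col, d + 1):
                A[r][c] -= factor * A[col][c]
    beta = [0.0] * d
    for i in reversed(range(d)):
        rest = sum(A[i][j] * beta[j] for j in range(i + 1, d))
        beta[i] = (A[i][d] - rest) / A[i][i]
    return beta


def rademacher_sketch(X, y, k, rng):
    """Compress n observations into k rows by multiplying [X, y] with a
    random k x n matrix whose entries are +-1/sqrt(k) (assumed projection).
    Rows are consumed one at a time, so the data could also be streamed."""
    d = len(X[0])
    SX = [[0.0] * d for _ in range(k)]
    Sy = [0.0] * k
    scale = 1.0 / k ** 0.5
    for row, yi in zip(X, y):
        for j in range(k):
            sign = scale if rng.random() < 0.5 else -scale
            for c in range(d):
                SX[j][c] += sign * row[c]
            Sy[j] += sign * yi
    return SX, Sy


# Simulated data (hypothetical): intercept plus two standard normal covariates.
rng = random.Random(0)
true_beta = [1.0, -2.0, 0.5]
n, k = 2000, 200
X = [[1.0, rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0, 0.5) for row in X]

beta_full = solve_least_squares(X, y)      # fit on all n observations
SX, Sy = rademacher_sketch(X, y, k, rng)   # reduce n rows to k rows
beta_sketch = solve_least_squares(SX, Sy)  # fit on the reduced data only

print("full  :", [round(b, 3) for b in beta_full])
print("sketch:", [round(b, 3) for b in beta_sketch])
```

In this sketch the reduced data set has k rows regardless of n, which mirrors the abstract's statement that the reduced size does not depend on the original number of observations. The Bayesian guarantees described in the abstract are not reproduced here.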

Files

Original bundle

Name: Dissertation Leo Geppert Belegexemplar.pdf
Size: 1.86 MB
Format: Adobe Portable Document Format
Description: DNB

Collections

Lehrstuhl Mathematische Statistik und biometrische Anwendungen