Bayesian and frequentist regression approaches for very large data sets

Geppert, Leo Nikolaus

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Ickstadt, Katja	-
dc.contributor.author	Geppert, Leo Nikolaus	-
dc.date.accessioned	2019-03-19T08:27:15Z	-
dc.date.available	2019-03-19T08:27:15Z	-
dc.date.issued	2018	-
dc.identifier.uri	http://hdl.handle.net/2003/37946	-
dc.identifier.uri	http://dx.doi.org/10.17877/DE290R-19931	-
dc.description.abstract	This thesis is concerned with the analysis of frequentist and Bayesian regression models for data sets with a very large number of observations. Such large data sets pose a challenge when conducting regression analysis, because of the memory required (mainly for frequentist regression models) and the running time of the analysis (mainly for Bayesian regression models). I present two different approaches that can be employed in this setting. The first approach is based on random projections and reduces the number of observations to a manageable level as a first step before the regression analysis. The reduced number of observations depends on the number of variables in the data set and the desired goodness of the approximation. It is, however, independent of the number of observations in the original data set, making it especially useful for very large data sets. Theoretical guarantees for Bayesian linear regression are presented, which extend known guarantees for the frequentist case. The fundamental theorem covers Bayesian linear regression with arbitrary normal distributions or non-informative uniform distributions as prior distributions. I evaluate how close the posterior distributions based on the original and the reduced data set are for this theoretically covered case as well as for extensions towards hierarchical models and models using q-generalised normal distributions as priors. The second approach presents a transfer of the Merge & Reduce principle from data structures to regression models. In Computer Science, Merge & Reduce is employed in order to enable the use of static data structures in a streaming setting. Here, I present three possibilities of employing Merge & Reduce directly on regression models. This enables sequential or parallel analysis of subsets of the data set. The partial results are then combined in a way that recovers the regression model on the full data set well. This approach is suitable for a wide range of regression models. I evaluate the performance on simulated and real-world data sets using linear and Poisson regression models. Both approaches are able to recover regression models on the original data set well. They thus offer scalable versions of frequentist or Bayesian regression analysis for linear regression as well as extensions to generalised linear models, hierarchical models, and q-generalised normal distributions as prior distributions. Application to data streams or in distributed settings is also possible. Both approaches can be combined with multiple algorithms for frequentist or Bayesian regression analysis.	en
dc.language.iso	en	de
dc.subject	Regression analysis	en
dc.subject	Very large data sets	en
dc.subject	Random projections	en
dc.subject	Merge & reduce	en
dc.subject	Data reduction	en
dc.subject.ddc	310	-
dc.title	Bayesian and frequentist regression approaches for very large data sets	en
dc.type	Text	en
dc.contributor.referee	Groll, Andreas	-
dc.date.accepted	2018-11-16	-
dc.type.publicationtype	doctoralThesis	de
dc.subject.rswk	Regressionsanalyse	de
dc.subject.rswk	Massendaten	de
dc.subject.rswk	Datenkompression	de
dc.subject.rswk	Dimensionsreduktion	de
dcterms.accessRights	open access	-
eldorado.secondarypublication	false	de
Appears in Collections:	Lehrstuhl Mathematische Statistik und biometrische Anwendungen
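
Illustration (not part of the record): the abstract's first approach reduces the number of observations with a random projection before fitting the regression. The following is a minimal sketch of that general idea for frequentist linear regression only. It assumes a dense Gaussian projection matrix, simulated data, and an arbitrary sketch size k; the function name sketch_rows and all parameter values are hypothetical, and the thesis itself may use different projection types and sizes.

```python
import numpy as np


def sketch_rows(X, y, k, rng):
    """Compress n observations to k rows with a Gaussian random projection.

    Assumption: a dense k x n matrix with N(0, 1/k) entries is used here
    purely for illustration; it is not claimed to be the thesis's choice.
    """
    n = X.shape[0]
    S = rng.normal(size=(k, n)) / np.sqrt(k)
    return S @ X, S @ y


rng = np.random.default_rng(0)
n, d, k = 5000, 5, 400  # hypothetical sizes; k is chosen without reference to n

# Simulated linear model with known coefficients 1, 2, ..., d.
X = rng.normal(size=(n, d))
beta_true = np.arange(1, d + 1, dtype=float)
y = X @ beta_true + rng.normal(size=n)

# Least squares on the full data set.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Least squares on the reduced data set.
X_s, y_s = sketch_rows(X, y, k, rng)
beta_sketch, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)

# Compare both fits by their residual norm on the ORIGINAL data.
res_full = np.linalg.norm(X @ beta_full - y)
res_sketch = np.linalg.norm(X @ beta_sketch - y)
ratio = res_sketch / res_full

print("reduced data shape:", X_s.shape)
print("residual ratio (sketch / full):", ratio)
```

The ratio cannot fall below 1, because the full least-squares fit minimises the residual norm on the original data; a value close to 1 indicates that the model fitted on the reduced data recovers the original fit well. A dense Gaussian matrix is the simplest choice to write down but costs O(nkd) time to apply, so sparse or structured projections are usually preferred for very large n.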

Files in This Item:

File	Description	Size	Format
Dissertation Leo Geppert Belegexemplar.pdf	DNB	1.91 MB	Adobe PDF

This item is protected by original copyright
