Bayesian and frequentist regression approaches for very large data sets

dc.contributor.advisor: Ickstadt, Katja
dc.contributor.author: Geppert, Leo Nikolaus
dc.contributor.referee: Groll, Andreas
dc.date.accepted: 2018-11-16
dc.date.accessioned: 2019-03-19T08:27:15Z
dc.date.available: 2019-03-19T08:27:15Z
dc.date.issued: 2018
dc.description.abstract [en]: This thesis is concerned with the analysis of frequentist and Bayesian regression models for data sets with a very large number of observations. Such large data sets pose a challenge for regression analysis because of the memory required (mainly for frequentist regression models) and the running time of the analysis (mainly for Bayesian regression models). I present two different approaches that can be employed in this setting. The first approach is based on random projections and reduces the number of observations to a manageable level as a first step before the regression analysis. The reduced number of observations depends on the number of variables in the data set and the desired goodness of the approximation. It is, however, independent of the number of observations in the original data set, which makes the approach especially useful for very large data sets. Theoretical guarantees for Bayesian linear regression are presented, extending known guarantees for the frequentist case. The fundamental theorem covers Bayesian linear regression with arbitrary normal distributions or non-informative uniform distributions as prior distributions. I evaluate how close the posterior distributions of the original model and the reduced data set are, both for this theoretically covered case and for extensions to hierarchical models and to models using q-generalised normal distributions as prior distributions.

The second approach transfers the Merge & Reduce principle from data structures to regression models. In computer science, Merge & Reduce is employed to enable the use of static data structures in a streaming setting. Here, I present three ways of employing Merge & Reduce directly on regression models. This enables sequential or parallel analysis of subsets of the data set; the partial results are then combined in a way that recovers the regression model on the full data set well. This approach is suitable for a wide range of regression models. I evaluate its performance on simulated and real-world data sets using linear and Poisson regression models.

Both approaches recover regression models on the original data set well. They thus offer scalable versions of frequentist and Bayesian regression analysis for linear regression as well as extensions to generalised linear models, hierarchical models, and q-generalised normal distributions as prior distributions. Application to data streams and in distributed settings is also possible, and both approaches can be combined with multiple algorithms for frequentist or Bayesian regression analysis.
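The first approach described in the abstract can be illustrated with a minimal sketch in Python/NumPy. Note that this is an illustrative stand-in, not the thesis's construction: a dense Gaussian sketch matrix is used here for simplicity, whereas the thesis works with subspace embeddings and covers the Bayesian setting; the dimensions `n`, `d`, and sketch size `k` are arbitrary example values. The key property shown is that the sketched problem has `k` rows, with `k` independent of the original number of observations `n`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, d variables, with n >> d.
n, d = 10_000, 5
X = rng.normal(size=(n, d))
beta_true = np.arange(1.0, d + 1.0)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Random projection: compress the n observations down to k rows.
# k is chosen based on d and the desired approximation quality,
# not based on n.
k = 200
S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
X_sketch, y_sketch = S @ X, S @ y

# Least squares on the sketched data approximates least squares
# on the full data, at a fraction of the memory cost.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch, *_ = np.linalg.lstsq(X_sketch, y_sketch, rcond=None)
```

The reduced data set `(X_sketch, y_sketch)` can then be handed to any downstream regression routine, frequentist or Bayesian, which is what makes the reduction step composable with multiple analysis algorithms.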
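The Merge & Reduce idea of the second approach — analyse blocks of the data separately, then combine the partial results — can be sketched for ordinary linear regression, where merging is exact because a block can be summarised by the sufficient statistics (XᵀX, Xᵀy) and merging two summaries is just addition. This is a simplified illustration under that assumption; the thesis's Merge & Reduce variants operate on fitted regression models (including Bayesian and generalised linear models), where the combination step is approximate. All names and sizes below are example choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10_000, 4
X = rng.normal(size=(n, d))
beta_true = np.linspace(0.5, 2.0, d)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

def summarise(Xb, yb):
    """Summarise one block by its linear-model sufficient statistics."""
    return Xb.T @ Xb, Xb.T @ yb

def merge(s1, s2):
    """Merge two block summaries; the result summarises the union."""
    return s1[0] + s2[0], s1[1] + s2[1]

# Process the data block by block, as in a streaming setting.
block_size = 1_000
summaries = [summarise(X[i:i + block_size], y[i:i + block_size])
             for i in range(0, n, block_size)]

# Reduce all partial results into one summary.
total = summaries[0]
for s in summaries[1:]:
    total = merge(total, s)

# The merged summary recovers the model on the full data set.
XtX, Xty = total
beta_merged = np.linalg.solve(XtX, Xty)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because `merge` is associative, the blocks can be processed sequentially over a data stream or in parallel across machines and combined in any order, which mirrors the streaming and distributed use cases mentioned in the abstract.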
dc.identifier.uri: http://hdl.handle.net/2003/37946
dc.identifier.uri: http://dx.doi.org/10.17877/DE290R-19931
dc.language.iso [de]: en
dc.subject [en]: Regression analysis
dc.subject [en]: Very large data sets
dc.subject [en]: Random projections
dc.subject [en]: Merge & reduce
dc.subject [en]: Data reduction
dc.subject.ddc: 310
dc.subject.rswk [de]: Regressionsanalyse (regression analysis)
dc.subject.rswk [de]: Massendaten (very large data sets)
dc.subject.rswk [de]: Datenkompression (data compression)
dc.subject.rswk [de]: Dimensionsreduktion (dimensionality reduction)
dc.title [en]: Bayesian and frequentist regression approaches for very large data sets
dc.type [en]: Text
dc.type.publicationtype [de]: doctoralThesis
dcterms.accessRights: open access
eldorado.secondarypublication [de]: false

Files

Original bundle
Name: Dissertation Leo Geppert Belegexemplar.pdf
Size: 1.86 MB
Format: Adobe Portable Document Format
Description: DNB

License bundle
Name: license.txt
Size: 4.85 KB
Description: Item-specific license agreed upon to submission