Compressing data for generalized linear regression

dc.contributor.advisor: Munteanu, Alexander
dc.contributor.author: Omlor, Simon
dc.contributor.referee: Ickstadt, Katja
dc.date.accepted: 2023-05-25
dc.date.accessioned: 2023-06-26T05:43:05Z
dc.date.available: 2023-06-26T05:43:05Z
dc.date.issued: 2022
dc.description.abstract: In this thesis, we work on algorithmic data and dimension reduction techniques to address scalability issues and to enable better analysis of massive data. Our algorithms use the sketch-and-solve paradigm as well as some initialization tricks. We analyze the tradeoff between accuracy, running time, and storage, and we also show lower bounds on the best possible data reduction factors. While we mostly focus on generalized linear regression, logistic and p-probit regression to be precise, we also deal with two-layer Rectified Linear Unit (ReLU) networks with logistic loss, which can be seen as an extension of logistic regression, i.e., logistic regression on the neural tangent kernel. We present coresets via sampling, sketches via random projections, and several algorithmic techniques, and prove that our algorithms are guaranteed to work with high probability. First, we consider the problem of logistic regression, where the aim is to find the parameter beta maximizing the likelihood. We construct a sketch in a single pass over a turnstile data stream. Depending on its parameters, we can trade off the size, running time, and approximation guarantee of the sketch. We also show that our sketch works for other target functions. Second, we construct an epsilon-coreset for p-probit regression, a generalized version of probit regression. To this end, we compute the QR decomposition of a sketched version of our dataset in a first pass. We then use the matrix R to compute an approximation of the l_p-leverage scores of our data points, which we use to compute the sampling probabilities for constructing the coreset. We then analyze the negative log-likelihood of the p-generalized normal distribution to prove that this yields an epsilon-coreset. Finally, we look at two-layer ReLU networks with logistic loss. Here we show that, using a coupled initialization, the network width needed for a good approximation can be reduced from gamma^(-8) (Ji and Telgarsky, 2020) to gamma^(-2), where gamma is the so-called separation margin. We further give an example proving that a width of gamma^(-1) is necessary to get less than constant error. (en)
dc.identifier.uri: http://hdl.handle.net/2003/41841
dc.identifier.uri: http://dx.doi.org/10.17877/DE290R-23684
dc.language.iso: en (de)
dc.subject: Generalized linear regression (en)
dc.subject: Dimension reduction (en)
dc.subject: Neural networks (en)
dc.subject: Data compression (en)
dc.subject.ddc: 310
dc.subject.rswk: Verallgemeinertes lineares Modell (de)
dc.subject.rswk: Hochdimensionale Daten (de)
dc.subject.rswk: Datenkompression (de)
dc.subject.rswk: Irrelevanzminderung (de)
dc.title: Compressing data for generalized linear regression (en)
dc.type: Text (de)
dc.type.publicationtype: PhDThesis (de)
dcterms.accessRights: open access
eldorado.secondarypublication: false (de)
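
As a rough illustration of the sketch-then-sample pipeline summarized in the abstract above (compress the data with a random projection, take a QR decomposition of the sketch, approximate the l_p-leverage scores from R, and sample a weighted coreset), the following minimal Python sketch may help. The dense Gaussian projection, the use of ||x_i R^{-1}||_p^p as a leverage-score proxy, and multinomial sampling with inverse-probability weights are simplifying assumptions made here for illustration; they are not the thesis's actual constructions, parameters, or guarantees.

import numpy as np

def lp_leverage_coreset(X, k, p=2, sketch_rows=None, rng=None):
    """Toy sketch-then-sample coreset construction (illustrative only)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    m = sketch_rows or 10 * d  # heuristic sketch size (assumption)

    # One pass over the data: compress X with a dense Gaussian projection S (m x n).
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SX = S @ X

    # QR of the sketched matrix; R approximates the column geometry of X.
    _, R = np.linalg.qr(SX)

    # Proxy for the l_p-leverage score of row i: ||x_i R^{-1}||_p^p.
    U = X @ np.linalg.inv(R)
    scores = np.linalg.norm(U, ord=p, axis=1) ** p

    # Sample k rows proportionally to the scores and reweight them so that
    # weighted sums over the coreset are unbiased estimates of full-data sums.
    probs = scores / scores.sum()
    idx = rng.choice(n, size=k, replace=True, p=probs)
    weights = 1.0 / (k * probs[idx])
    return idx, weights

# Usage: draw a weighted coreset of 500 points from a tall synthetic matrix.
X = np.random.default_rng(0).standard_normal((100_000, 10))
idx, w = lp_leverage_coreset(X, k=500, p=2)

The weighted points (X[idx], w) could then be handed to any off-the-shelf solver for logistic or probit regression in place of the full dataset.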

Files

Original bundle
Name: Dissertation_Simon_Omlor.pdf
Size: 1.69 MB
Format: Adobe Portable Document Format
Description: DNB
License bundle
Name: license.txt
Size: 4.85 KB
Format: Item-specific license agreed upon to submission