Compressing data for generalized linear regression

dc.contributor.advisor: Munteanu, Alexander
dc.contributor.author: Omlor, Simon
dc.contributor.referee: Ickstadt, Katja
dc.date.accepted: 2023-05-25
dc.date.accessioned: 2023-06-26T05:43:05Z
dc.date.available: 2023-06-26T05:43:05Z
dc.date.issued: 2022
dc.description.abstract: In this thesis, we work on algorithmic data and dimension reduction techniques to address scalability issues and to enable better analysis of massive data. Our algorithms use the sketch-and-solve paradigm as well as some initialization tricks. We analyze the tradeoff between accuracy, running time, and storage, and we also show lower bounds on the best possible data reduction factors. While we mostly focus on generalized linear regression, logistic and p-probit regression to be precise, we also deal with two-layer Rectified Linear Unit (ReLU) networks with logistic loss, which can be seen as an extension of logistic regression, i.e., logistic regression on the neural tangent kernel. We present coresets via sampling, sketches via random projections, and several algorithmic techniques, and prove that our algorithms are guaranteed to work with high probability. First, we consider the problem of logistic regression, where the aim is to find the parameter beta maximizing the likelihood. We construct a sketch in a single pass over a turnstile data stream. Depending on its parameters, we can trade off the size, running time, and approximation guarantee of the sketch. We also show that our sketch works for other target functions. Second, we construct an epsilon-coreset for p-probit regression, a generalized version of probit regression. To this end, we compute the QR decomposition of a sketched version of our dataset in a first pass. We then use the matrix R to compute an approximation of the l_p-leverage scores of our data points, which we use to compute the sampling probabilities for constructing the coreset. We then analyze the negative log-likelihood of the p-generalized normal distribution to prove that this yields an epsilon-coreset. Finally, we look at two-layer ReLU networks with logistic loss. Here we show that, using a coupled initialization, the network width needed for a good approximation can be reduced from gamma^(-8) (Ji and Telgarsky, 2020) to gamma^(-2), where gamma is the so-called separation margin. We further give an example proving that a width of gamma^(-1) is necessary to get less than constant error. (en)
dc.identifier.uri: http://hdl.handle.net/2003/41841
dc.identifier.uri: http://dx.doi.org/10.17877/DE290R-23684
dc.language.iso: en (de)
dc.subject: Generalized linear regression (en)
dc.subject: Dimension reduction (en)
dc.subject: Neural networks (en)
dc.subject: Data compression (en)
dc.subject.ddc: 310
dc.subject.rswk: Verallgemeinertes lineares Modell (de)
dc.subject.rswk: Hochdimensionale Daten (de)
dc.subject.rswk: Datenkompression (de)
dc.subject.rswk: Irrelevanzminderung (de)
dc.title: Compressing data for generalized linear regression (en)
dc.type: Text (de)
dc.type.publicationtype: PhDThesis (de)
dcterms.accessRights: open access
eldorado.secondarypublication: false (de)
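
As a rough illustration of the sketch-then-sample pipeline summarized in the abstract above (compress the data with a random projection, take a QR decomposition of the sketch, approximate the l_p-leverage scores from R, and sample a weighted coreset), the following minimal Python sketch may help. The dense Gaussian projection, the use of ||x_i R^{-1}||_p^p as a leverage-score proxy, and multinomial sampling with inverse-probability weights are simplifying assumptions made here for illustration; they are not the thesis's actual constructions, parameters, or guarantees.

import numpy as np

def lp_leverage_coreset(X, k, p=2, sketch_rows=None, rng=None):
    """Toy sketch-then-sample coreset construction (illustrative only)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    m = sketch_rows or 10 * d  # heuristic sketch size (assumption)

    # One pass over the data: compress X with a dense Gaussian projection S (m x n).
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SX = S @ X

    # QR of the sketched matrix; R approximates the column geometry of X.
    _, R = np.linalg.qr(SX)

    # Proxy for the l_p-leverage score of row i: ||x_i R^{-1}||_p^p.
    U = X @ np.linalg.inv(R)
    scores = np.linalg.norm(U, ord=p, axis=1) ** p

    # Sample k rows proportionally to the scores and reweight them so that
    # weighted sums over the coreset are unbiased estimates of full-data sums.
    probs = scores / scores.sum()
    idx = rng.choice(n, size=k, replace=True, p=probs)
    weights = 1.0 / (k * probs[idx])
    return idx, weights

# Usage: draw a weighted coreset of 500 points from a tall synthetic matrix.
X = np.random.default_rng(0).standard_normal((100_000, 10))
idx, w = lp_leverage_coreset(X, k=500, p=2)

The weighted points (X[idx], w) could then be handed to any off-the-shelf solver for logistic or probit regression in place of the full dataset.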

Files

Original bundle
Name: Dissertation_Simon_Omlor.pdf
Size: 1.69 MB
Format: Adobe Portable Document Format
Description: DNB
License bundle
Name: license.txt
Size: 4.85 KB
Format: Item-specific license agreed upon to submission