Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Munteanu, Alexander | -
dc.contributor.author | Omlor, Simon | -
dc.description.abstract | In this thesis we work on algorithmic data and dimension reduction techniques to address scalability issues and to allow better analysis of massive data. Our algorithms use the sketch-and-solve paradigm as well as initialization tricks. We analyze tradeoffs between accuracy, running time, and storage, and we show lower bounds on the best possible data reduction factors. While we focus mostly on generalized linear regression, specifically logistic and p-probit regression, we also deal with two-layer rectified linear unit (ReLU) networks with logistic loss, which can be seen as an extension of logistic regression, i.e., logistic regression on the neural tangent kernel. We present coresets via sampling, sketches via random projections, and several algorithmic techniques, and we prove that our algorithms work with high probability. First, we consider logistic regression, where the aim is to find the parameter beta maximizing the likelihood. We construct a sketch in a single pass over a turnstile data stream; depending on its parameters, we can trade off the size, running time, and approximation guarantee of the sketch. We also show that our sketch works for other target functions. Second, we construct an epsilon-coreset for p-probit regression, a generalized version of probit regression. To this end, we first compute the QR decomposition of a sketched version of our dataset in a first pass. We then use the matrix R to approximate the l_p-leverage scores of our data points, from which we derive sampling probabilities to construct the coreset. Analyzing the negative log-likelihood of the p-generalized normal distribution, we prove that this yields an epsilon-coreset. Finally, we study two-layer ReLU networks with logistic loss. Here we show that, using a coupled initialization, the network width required for a good approximation can be reduced from gamma^(-8) (Ji and Telgarsky, 2020) to gamma^(-2), where gamma is the so-called separation margin. We further give an example proving that a width of gamma^(-1) is necessary to achieve less than constant error. | en
dc.subject | Generalized linear regression | en
dc.subject | Dimension reduction | en
dc.subject | Neural networks | en
dc.subject | Data compression | en
dc.title | Compressing data for generalized linear regression | en
dc.contributor.referee | Ickstadt, Katja | -
dc.subject.rswk | Verallgemeinertes lineares Modell | de
dc.subject.rswk | Hochdimensionale Daten | de
dcterms.accessRights | open access | -
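
The coreset pipeline described in the abstract (sketch the data, take a QR decomposition, use R to approximate leverage scores, then sample) can be illustrated with a minimal NumPy sketch. This is a simplified l_2 variant on synthetic data, not the thesis's l_p construction or its streaming sketch; all names, sizes, and the dense Gaussian sketch are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in d dimensions (sizes are illustrative).
n, d = 10_000, 20
A = rng.standard_normal((n, d))

# Step 1: sketch the data with a small random projection S (a dense
# Gaussian sketch for simplicity; the thesis uses streaming-friendly sketches).
m = 200
S = rng.standard_normal((m, n)) / np.sqrt(m)
SA = S @ A

# Step 2: QR decomposition of the sketched matrix; R captures the
# geometry of A up to the sketch's distortion.
_, R = np.linalg.qr(SA)

# Step 3: approximate leverage scores as squared row norms of A @ R^{-1}
# (the standard l_2 approximation; the thesis generalizes this to l_p).
AR_inv = A @ np.linalg.inv(R)
scores = np.sum(AR_inv**2, axis=1)

# Step 4: sample rows with probability proportional to the scores and
# reweight them, yielding a weighted coreset.
k = 500
probs = scores / scores.sum()
idx = rng.choice(n, size=k, replace=True, p=probs)
coreset, weights = A[idx], 1.0 / (k * probs[idx])
```

The reweighting by 1/(k * probs) makes the sampled sum an unbiased estimator of the corresponding sum over the full dataset, which is the standard device behind sampling-based coresets.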
Appears in Collections: Lehrstuhl Mathematische Statistik und biometrische Anwendungen

Files in This Item:
File | Description | Size | Format
Dissertation_Simon_Omlor.pdf | DNB | 1.73 MB | Adobe PDF

This item is protected by original copyright